bug-gnu-utils

Re: UTF-8 BOM parse error


From: Bruno Haible
Subject: Re: UTF-8 BOM parse error
Date: Mon, 13 Sep 2004 13:41:01 +0200
User-agent: KMail/1.5

David Necas wrote:
> Gettext version: 0.14.1
>
> Problem: msgfmt (and probably other gettext tools) print an
> unhelpful error
>
>     somefile.po:1:2: parse error
>
> when a PO file starts with UTF-8 BOM (0xef 0xbb 0xbf).

This behaviour is correct. The so-called "UTF-8 BOM" is specified in
the document that defines UTF-8, namely RFC 3629
(http://www.faqs.org/rfcs/rfc3629.html):
   "It is important to understand that the character U+FEFF appearing at
    any position other than the beginning of a stream MUST be interpreted
    with the semantics for the zero-width non-breaking space, and MUST
    NOT be interpreted as a signature."

The PO file format only allows for ASCII white space characters, not for
U+FEFF.
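
Since most editors hide the signature, checking the raw bytes is the reliable way to see whether a file starts with it. A minimal sketch in Python, assuming the file content has already been read as bytes; the function name and the sample data are made up for illustration and are not part of gettext:

```python
import codecs

def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with 0xef 0xbb 0xbf."""
    # codecs.BOM_UTF8 is the three-byte UTF-8 encoding of U+FEFF.
    return data.startswith(codecs.BOM_UTF8)

# Hypothetical PO file content with a leading signature.
sample = b'\xef\xbb\xbf# Some comment\nmsgid ""\nmsgstr ""\n'
print(starts_with_utf8_bom(sample))      # file as saved by the editor
print(starts_with_utf8_bom(sample[3:]))  # same content without the signature
```

For a real file, the bytes would come from something like `open(path, "rb").read(3)`.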

The Unix Unicode FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html) also
says:
   "Linux/Unix does not use any BOMs and signatures. They would break
    far too many existing ASCII syntax conventions"

> What makes it worse is that any UTF-8-capable text editor or
> viewer does not show the BOM (or at least should not show),
> so one gazes at the file wondering what could be wrong with
> the comment on its first line...

Yes. I've also once seen the problem on an XML file.

The problem is not the file formats that don't allow U+FEFF to be ignored.
The problem is the editors that insert the "UTF-8 BOM".

> Do not use tab characters. Their effect is not predictable.

Do not use UTF-8 BOM. Its effect is predictable: it causes hassles.
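
If an editor has already written the signature, it can be removed before the file is handed to msgfmt. A rough sketch in Python, assuming the whole file fits in memory; the helper name is hypothetical and nothing here is provided by gettext itself:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # No signature at the start: leave the content untouched.
    # A U+FEFF elsewhere in the data is deliberately not removed.
    return data

# Hypothetical PO file content with a leading signature.
dirty = b'\xef\xbb\xbfmsgid ""\nmsgstr ""\n'
clean = strip_utf8_bom(dirty)
print(clean.startswith(b"msgid"))
```

To fix a file in place, one would read it in binary mode, pass the bytes through `strip_utf8_bom`, and write the result back in binary mode.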

Bruno
