Re: UTF-16 and (ice-9 rdelim)
From: Mike Gran
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 13:29:19 -0800 (PST)
> From: Neil Jerram
> Hi Mike,
> > But, if you just want to get rid of a BOM, you can cut it down to
> > a rule. If the first code point that a port reads is U+FEFF and if the
> > encoding has the string "utf" in it, ignore it. If the first code point
> > is U+FFFE and the encoding has "utf" in it, flag an error.
>
> Agreed.
>
> Out of interest, does that mean that iconv will auto-detect the
> endianness if the encoding does not explicitly say "le" or "be"?
The Unicode FAQ from unicode.org says that "the unmarked form (UTF-16, UTF-32)
uses big-endian byte serialization by default, but may include a byte order
mark at the beginning to indicate the actual byte serialization used." So,
I guess the strictly correct thing to do for UTF-16 would be to
* check for a BOM.
* if it exists
    * if it is U+FFFE, modify the port encoding to UTF-16-LE
    * if it is U+FEFF, leave the port encoding as UTF-16
    * discard the BOM
* else, leave the port encoding as UTF-16
and similarly for UTF-32.
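
As a rough illustration of the steps above, here is a minimal sketch in Python rather than Guile. The names resolve_encoding and decode_stream are hypothetical and are not part of Guile or iconv. It also assumes that "leave the port encoding as UTF-16" can be modelled with Python's explicit "utf-16-be" codec, since the unmarked form defaults to big-endian; Python's own "utf-16" codec does its own BOM handling, so it is avoided here.

```python
import codecs

# BOM byte sequences per unmarked encoding family: (big-endian, little-endian).
_BOMS = {
    "utf-16": (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE),
    "utf-32": (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE),
}

def resolve_encoding(family, head):
    """Return (encoding to decode with, number of BOM bytes to discard).

    `family` is "utf-16" or "utf-32"; `head` holds the first bytes of the stream.
    """
    be_bom, le_bom = _BOMS[family]
    if head.startswith(le_bom):
        # Read in big-endian order, these bytes look like U+FFFE:
        # the stream is really little-endian.
        return family + "-le", len(le_bom)
    if head.startswith(be_bom):
        # U+FEFF in big-endian order: keep the default byte order, drop the BOM.
        return family + "-be", len(be_bom)
    # No BOM: assume the big-endian default of the unmarked form.
    return family + "-be", 0

def decode_stream(family, data):
    """Decode `data`, discarding a leading BOM if one is present."""
    encoding, skip = resolve_encoding(family, data)
    return data[skip:].decode(encoding)
```

For example, decode_stream("utf-16", b"\xff\xfeh\x00i\x00") switches to little-endian and drops the two BOM bytes, while a stream with no BOM is decoded as big-endian unchanged.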
Thanks,
- Mike