RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
From: Jeroen Frijters
Subject: RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Thu, 18 Nov 2004 10:03:22 +0100
Archie Cobbs wrote:
> This is arguable in my opinion. Does the UTF-8 specification say that
> only currently defined Unicode characters may be encoded/decoded?
It has nothing to do with defined or undefined characters. Java strings
do not contain characters, but UTF-16 code units. When a Unicode
character above 0xFFFF is put in a Java string, the character code is
converted to two UTF-16 code units (a so-called surrogate pair). These
surrogate code units are in the range 0xD800-0xDFFF and
conveniently Unicode doesn't define any characters in this range, so if
you encounter a Java char in this range, it isn't actually a Unicode
character, but only half of one. If a string contains only half of a
surrogate pair, the string is malformed, so the UTF-8 encoder (which
encodes Unicode characters) is right to encode this as an invalid
character.
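To make the above concrete, here is a small sketch. It assumes a
current JDK where String.getBytes(Charset) substitutes the charset's
replacement byte ('?' for UTF-8) for malformed input; the class and
method names are only illustrative and are not from the patch.

```java
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Strict UTF-8 encoding as done by the JDK's built-in encoder.
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+1D11E lies above 0xFFFF, so it needs a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(codeUnits(clef));   // 2 chars (0xD834, 0xDD1E)
        System.out.println(codePoints(clef));  // 1 character
        System.out.println(utf8(clef).length); // 4 bytes

        // Half of a surrogate pair: the string is malformed, so the
        // encoder emits its replacement byte instead.
        byte[] bad = utf8("\uD834");
        System.out.println(bad.length + " " + (char) bad[0]); // 1 ?
    }
}
```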
Now, it's actually possible to have the UTF-8 encoder/decoder
encode/decode these unpaired surrogates symmetrically, so if there is a
good reason to do so, we can certainly do that.
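One way such a symmetric scheme could look, as a rough sketch only:
apply the generic UTF-8 bit layout to every code point, including a
lone surrogate, which then becomes a three-byte sequence. The output is
not valid UTF-8 in the strict sense. LenientUtf8 is a hypothetical
name, the decoder assumes input produced by encode() and does no
validation, and this is not necessarily what the Classpath
encoder/decoder would do.

```java
import java.io.ByteArrayOutputStream;

public class LenientUtf8 {
    // Encode every code point with the generic UTF-8 bit layout,
    // including unpaired surrogates (which strict UTF-8 forbids).
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); ) {
            // codePointAt returns an unpaired surrogate unchanged.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    // Inverse of encode(). Assumes the bytes came from encode();
    // malformed input is not detected.
    static String decode(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < b.length; ) {
            int b0 = b[i] & 0xFF;
            int len = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;
            // Strip the length prefix bits from the lead byte.
            int cp = len == 1 ? b0 : b0 & (0xFF >> (len + 1));
            for (int k = 1; k < len; k++) {
                cp = (cp << 6) | (b[i + k] & 0x3F);
            }
            sb.appendCodePoint(cp);
            i += len;
        }
        return sb.toString();
    }
}
```

With this sketch, decode(encode("a\uD834b")) gives back the original
three chars, whereas a strict encoder would have replaced the middle
one.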
> What about Java class files? They contain arbitrary 16 byte characters
> encoded using "UTF-8" .. by your logic, isn't that a violation? Etc.
I don't understand what you mean here.
> I guess it depends on whether UTF-8 is defined as a 16 byte value
> encoding or a Unicode character encoding.. but even if it's defined
> as the latter, in practice, it is certainly used as the
> former a lot...
UTF = Unicode Transformation Format. Do you have any examples of code
that uses strings to store arbitrary binary data *and* uses UTF-8
encoding? Since Sun's implementation doesn't support it, I think it's
unlikely that much code depends on it.
Regards,
Jeroen