classpath-patches

RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes


From: Jeroen Frijters
Subject: RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Thu, 18 Nov 2004 10:03:22 +0100

Archie Cobbs wrote:
> This is arguable in my opinion. Does the UTF-8 specification say that
> only currently defined Unicode characters may be encoded/decoded?

It has nothing to do with defined or undefined characters. Java strings
do not contain characters, but UTF-16 code units. When a Unicode
character above 0xFFFF is put in a Java string, the character code is
converted to two UTF-16 code units (a so-called surrogate pair). The
code units of a surrogate pair are in the range 0xD800-0xDFFF and
conveniently Unicode doesn't define any characters in this range, so if
you encounter a Java char in this range, it isn't actually a Unicode
character, but only half of one. If a string contains only half of a
surrogate pair, the string is malformed, so the UTF-8 encoder (which
encodes Unicode characters) is right to encode it as an invalid character.
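
To make the point above concrete, here is a minimal sketch. It is not
taken from the Classpath patch; the class and method names are made up
for illustration, and it uses java.nio.charset.StandardCharsets, which
only exists in later JDKs (Java 7 and up). It shows a character above
0xFFFF occupying two Java chars, and a lone surrogate being rejected or
replaced by the JDK's UTF-8 encoder.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of GNU Classpath.
class SurrogateDemo {
    // String.getBytes never fails: malformed input is replaced with the
    // charset's replacement bytes.
    static byte[] encodeLenient(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // A freshly created CharsetEncoder reports malformed input instead of
    // replacing it, so an unpaired surrogate ends up in the catch block.
    static boolean encodesStrictly(String s) {
        try {
            StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(s));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF.
        String pair = new String(Character.toChars(0x1D11E));
        System.out.println(pair.length());                // two UTF-16 code units
        System.out.println(encodeLenient(pair).length);   // four UTF-8 bytes

        // Half of a surrogate pair: a malformed string.
        String lone = "\uD800";
        System.out.println(encodesStrictly(lone));        // false
        System.out.println(encodeLenient(lone).length);   // one replacement byte
    }
}
```

With the lenient path the lone surrogate comes out as the replacement
byte '?', so the original char cannot be recovered when decoding.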

Now, it's actually possible to have the UTF-8 encoder/decoder
encode/decode these half surrogate pairs symmetrically, so if there is a
good reason to do so, we can certainly do that.
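
As an illustration of what a symmetric treatment can look like, the
sketch below uses an existing JDK format rather than the UTF-8 charset
itself: the "modified UTF-8" written by DataOutputStream.writeUTF and
read back by DataInputStream.readUTF encodes every Java char on its own,
so an unpaired surrogate survives a round trip. This is a different
encoding swapped in for demonstration, not a proposal for how the
Classpath encoder should behave; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of GNU Classpath.
class ModifiedUtf8RoundTrip {
    // Writes a 2-byte length prefix followed by the modified UTF-8 bytes.
    static byte[] write(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads back a string produced by write().
    static String read(byte[] bytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String lone = "\uD800";   // half of a surrogate pair
        byte[] encoded = write(lone);
        // The surrogate is written as a three-byte sequence (ED A0 80)
        // after the length prefix, and decodes to the same char again.
        System.out.println(encoded.length);
        System.out.println(read(encoded).equals(lone));
    }
}
```

The three-byte form used here is not well-formed UTF-8 in the strict
sense, which is the trade-off of making the conversion symmetric.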

> What about Java class files? They contain arbitrary 16-bit characters
> encoded using "UTF-8" .. by your logic, isn't that a violation? Etc.

I don't understand what you mean here.

> I guess it depends on whether UTF-8 is defined as a 16-bit value
> encoding or a Unicode character encoding.. but even if it's defined
> as the latter, in practice, it is certainly used as the former a lot...

UTF = Unicode Transformation Format. Do you have any examples of code
that uses strings to store arbitrary binary data *and* uses UTF-8
encoding? Since Sun's implementation doesn't support it, I think it's
unlikely that much code depends on it.

Regards,
Jeroen



