RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes

From:	Jeroen Frijters
Subject:	RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date:	Thu, 18 Nov 2004 10:03:22 +0100

Archie Cobbs wrote:
> This is arguable in my opinion. Does the UTF-8 specification say that
> only currently defined Unicode characters may be encoded/decoded?

It has nothing to do with defined or undefined characters. Java strings
do not contain characters, but UTF-16 code units. When a Unicode
character above 0xFFFF is put in a Java string, the character code is
converted to two UTF-16 code units (a so-called surrogate pair). These
surrogate code units are in the range 0xD800-0xDFFF and conveniently
Unicode doesn't define any characters in this range, so if you encounter
a Java char in this range, it isn't actually a Unicode character, but
only half of one. If a string contains only half of a surrogate pair,
the string is malformed, so the UTF-8 encoder (which encodes Unicode
characters) is right to encode this as an invalid character.
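
To make this concrete, here is a small sketch using only standard JDK
calls. The class name SurrogateDemo and the helpers utf8Hex and
canEncodeAsUtf8 are made up for illustration, and U+1D11E is just an
arbitrary character above 0xFFFF. It shows the character being stored as
two chars, and the stock UTF-8 encoder refusing half of the pair:

```java
import java.nio.charset.StandardCharsets;

class SurrogateDemo {
    // Encodes s as UTF-8 and returns the bytes as lowercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True if a fresh UTF-8 encoder accepts every char sequence in s.
    static boolean canEncodeAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // One Unicode character above 0xFFFF, stored as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                        // 2
        System.out.println(Integer.toHexString(clef.charAt(0)));  // d834
        System.out.println(Integer.toHexString(clef.charAt(1)));  // dd1e
        System.out.println(utf8Hex(clef));                        // f09d849e

        // Half of the pair is not a character, so the encoder rejects it.
        String half = clef.substring(0, 1);
        System.out.println(canEncodeAsUtf8(clef));                // true
        System.out.println(canEncodeAsUtf8(half));                // false
    }
}
```

What String.getBytes emits for the lone half is a replacement chosen by
the charset rather than an encoding of 0xD834, so the original char
cannot be recovered from the bytes.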

Now, it's actually possible to have the UTF-8 encoder/decoder
encode/decode these half surrogate pairs symmetrically so if there is a
good reason to do so, we can certainly do that.
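
For comparison, the JDK already has one encoding that treats every char
independently: the "modified UTF-8" used by DataOutputStream.writeUTF
and DataInputStream.readUTF. The sketch below is only an illustration of
what symmetric handling looks like, not a proposal for how the
Classpath UTF-8 charset should do it. The class name
ModifiedUtf8RoundTrip and the helpers write and read are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8RoundTrip {
    // Serializes s with writeUTF: a 2-byte length, then modified UTF-8.
    static byte[] write(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads back a string written by write().
    static String read(byte[] bytes) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(bytes))) {
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String half = "\uD834";  // a lone high surrogate
        byte[] bytes = write(half);
        // Length prefix 00 03, then the char encoded as three bytes.
        for (byte b : bytes) {
            System.out.printf("%02x ", b & 0xff);  // 00 03 ed a0 b4
        }
        System.out.println();
        System.out.println(read(bytes).equals(half));  // true
    }
}
```

Here the half surrogate survives the round trip because each char is
encoded on its own, which also means the output is not standard UTF-8
for characters above 0xFFFF.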

> What about Java class files? They contain arbitrary 16 byte characters
> encoded using "UTF-8" .. by your logic, isn't that a violation? Etc.

I don't understand what you mean here.

> I guess it depends on whether UTF-8 is defined as a 16 byte value
> encoding or a Unicode character encoding.. but even if it's defined
> as the latter, in practice, it is certainly used as the former a lot...

UTF = Unicode Transformation Format. Do you have any examples of code
that uses strings to store arbitrary binary data *and* uses UTF-8
encoding? Since Sun's implementation doesn't support it, I think it's
unlikely that much code depends on it.

Regards,
Jeroen
