RE: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
From: Jeroen Frijters
Date: Thu, 18 Nov 2004 17:37:10 +0100
Archie Cobbs wrote:
> I'm simply complaining that the following doesn't work:
>
> String s = "\ud8aa";
> byte[] b = s.getBytes("UTF-8");
> String t = new String(b, "UTF-8");
> System.out.println(s.equals(t)); // prints false!
>
> If you run this under the JDK, it prints "false".
The string isn't valid Unicode (it contains an unpaired surrogate), so
the UTF-8 encoder is within its rights to encode the surrogate as an
invalid character.
> In other words, there are certain String objects that Sun's
> UTF-8 encoding is not capable of encoding, because it doesn't
> handle all possible character values in the range
> 0x0000 - 0xffff.
I understand what you mean, but you have to face the fact that the range
0xD800-0xDFFF doesn't contain valid Unicode characters, and as such
those values will not be encoded by UTF-8.
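To make the point concrete, here is a minimal sketch that checks a
String for unpaired surrogates and asks a UTF-8 encoder whether it can
encode it. The class and method names are made up for illustration, and
it assumes a newer JDK than the ones discussed in this thread
(java.nio.charset.StandardCharsets and Character.isHighSurrogate are
later additions).

```java
import java.nio.charset.StandardCharsets;

public class LoneSurrogateCheck {
    // Hypothetical helper: true if s contains a surrogate code unit
    // (0xD800-0xDFFF) that is not part of a valid high/low pair.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low half of a well-formed pair
                } else {
                    return true; // high surrogate with no partner
                }
            } else if (Character.isLowSurrogate(c)) {
                return true; // low surrogate with no preceding high surrogate
            }
        }
        return false;
    }

    // Asks a fresh UTF-8 encoder whether it can encode s without
    // falling back to a replacement.
    static boolean utf8CanEncode(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String s = "\ud8aa"; // the string from the original report
        System.out.println(hasUnpairedSurrogate(s));
        System.out.println(utf8CanEncode(s));
    }
}
```

A check like this would let code that stores arbitrary 16-bit values in
a String notice up front that a UTF-8 round trip is not going to be
lossless.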
> Yes, which is how I came across this bug. There are classes
> in Classpath that store arbitrary binary data within String
> objects.
Class files don't use UTF-8 to encode strings; they use the format used
by DataOutputStream.writeUTF (what Sun calls "modified UTF-8").
So maybe all we need to do is make sure that
DataOutputStream.writeUTF/DataInputStream.readUTF can roundtrip *any*
string (even if it has invalid Unicode characters).
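
A round-trip check along these lines could serve as a regression test.
This is only a sketch: the class and method names are hypothetical, and
it assumes that writeUTF encodes each char on its own (so a lone
surrogate is written as an ordinary three-byte sequence), which is my
understanding of the documented format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtfRoundTrip {
    // Writes s with writeUTF and reads it back with readUTF.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
            return in.readUTF();
        } catch (IOException e) {
            // In-memory streams should not fail; rethrow unchecked to keep callers simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "\ud8aa"; // unpaired surrogate from the original report
        System.out.println(s.equals(roundTrip(s)));
    }
}
```

If this prints true for every char value from 0x0000 to 0xffff, then
code that needs to stash arbitrary data in a String can use
writeUTF/readUTF instead of the UTF-8 charset.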
Regards,
Jeroen