Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
From: Archie Cobbs
Subject: Re: [cp-patches] FYI: Patch: character encoder/decoder cleanup/fixes
Date: Wed, 17 Nov 2004 11:48:56 -0600 (CST)
Jeroen Frijters wrote:
> > > I committed the attached patch to remove the throwing of
> > > CharConversionException from the character encoders/decoders.
> > >
> > > For encoders, unsupported characters are now always replaced with
> > > a '?' byte and for the UTF8 decoder, invalid UTF-8 bytes are
> > > replaced by a Unicode REPLACEMENT CHARACTER (\uFFFD) in the output
> > > stream.
> >
> > Just curious.. does this implementation have the same problem as
> > described in
> > http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4628881 ?
> > I.e., is it a lossy encoding for "invalid" characters?
>
> At the moment the UTF-8 encoder/decoder is fully symmetrical for all
> "characters" (really UTF-16 code units), but this is actually a bug, IMO:
> unpaired surrogates shouldn't be decoded (as the bug parade
> comment says, the test case is bogus).
This is arguable, in my opinion. Does the UTF-8 specification say that
only currently defined Unicode characters may be encoded/decoded?
What about Java class files? They contain arbitrary 16-bit characters
encoded using "UTF-8". By your logic, isn't that a violation? Etc.
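
To make the class-file point concrete, here is a minimal sketch. It assumes
that DataOutputStream.writeUTF / DataInputStream.readUTF (the "modified
UTF-8" form, closely related to what class files use) are a fair stand-in
for the class-file encoding. The class and method names are made up for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8RoundTrip {
    /** Writes s with writeUTF and reads it back with readUTF. */
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buf)) {
                out.writeUTF(s); // each 16-bit char is encoded on its own
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray()))) {
                return in.readUTF();
            }
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String lone = "a\uD800b"; // contains an unpaired high surrogate
        // The unpaired surrogate survives the round trip unchanged.
        System.out.println(lone.equals(roundTrip(lone)));
    }
}
```

Here the encoding is treated as a plain 16-bit value encoding: the lone
surrogate is written as three bytes and read back as the same char.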
I guess it depends on whether UTF-8 is defined as a 16-bit value
encoding or a Unicode character encoding. But even if it's defined
as the latter, in practice it is certainly used as the former a lot.
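
For contrast, a small sketch of the "Unicode character encoding" reading,
with the replacement behaviour described in the quoted patch note. It uses
the String/Charset API of a current JDK (StandardCharsets needs Java 7 or
later), which is an assumption on my part and not the Classpath code under
discussion. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class LossyUtf8 {
    /** Encodes s as UTF-8 and decodes the bytes again. */
    static String roundTrip(String s) {
        // An unpaired surrogate cannot be encoded, so getBytes
        // substitutes the encoder's replacement byte '?'.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes raw bytes as UTF-8, replacing malformed input. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The lone surrogate is lost: the result is "a?b".
        System.out.println(roundTrip("a\uD800b"));
        // 0xFF is never valid in UTF-8; it decodes to U+FFFD.
        String decoded = decode(new byte[] {0x61, (byte) 0xFF, 0x62});
        System.out.println(decoded.equals("a\uFFFDb"));
    }
}
```

Under this reading the round trip is lossy for unpaired surrogates, which
is exactly the asymmetry the bug report complains about.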
-Archie
__________________________________________________________________________
Archie Cobbs * CTO, Awarix * http://www.awarix.com