Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
From: Robin Redeker
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 13:07:54 +0200
User-agent: Mutt/1.5.11+cvs20060403
On Mon, Oct 22, 2007 at 12:03:10PM +0200, Paolo Bonzini wrote:
>
[.snip.]
> > "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
> >
> >The JSON parser has no choice but to operate on Unicode characters.
> >Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
>
> Oops, my fault. And can you specify different encodings?
That can't really be specified in the JSON text itself; it's usually
an out-of-band thing, e.g. both ends agree on sending UTF-8 encoded JSON.
Of course there is a heuristic, which is even defined in the RFC:

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

But that's rather ugly IMO.
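
For reference, the null-pattern table quoted above could be sketched roughly
like this. It is written in Python only to keep it language-neutral; the
function name detect_json_encoding and the fallback to UTF-8 for inputs
shorter than four octets are my own assumptions, not something the RFC or
the Smalltalk JSON code defines. A leading BOM is not handled.

```python
def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding of a JSON text from the nulls in its first four octets."""
    head = octets[:4]
    if len(head) < 4:
        # Assumption: too short for the table to apply, use the default encoding.
        return "utf-8"
    # True where the octet is 0x00, False where it is anything else ("xx").
    nulls = tuple(b == 0 for b in head)
    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (including xx xx xx xx) is treated as UTF-8.
    return table.get(nulls, "utf-8")
```

This only works because the first two characters of a JSON text are ASCII,
as the RFC text says.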
The cleanest interface for the JSON parser/serializer would be to
receive and produce UnicodeStrings and let the programmer worry about
encoding.
(An octet parsing wrapper can then always be defined later on top of that.)
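
To illustrate the layering I mean, here is a minimal sketch in Python
(not Smalltalk). The names parse_json_text and parse_json_octets are
hypothetical, and Python's json module merely stands in for the real
parser: the core only ever sees decoded text, and the wrapper applies
whatever encoding both ends agreed on out of band.

```python
import json


def parse_json_text(text: str):
    """Core parser: accepts already-decoded Unicode text only."""
    if not isinstance(text, str):
        raise TypeError("expected decoded text, not octets")
    return json.loads(text)


def parse_json_octets(octets: bytes, encoding: str = "utf-8"):
    """Thin wrapper: decode with the agreed encoding, then parse the text."""
    return parse_json_text(octets.decode(encoding))
```

The encoding decision stays entirely in the wrapper, so the parser itself
never has to guess.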
>
> > "Any character may be escaped."
>
> Sure, you can write "\u0041". In this case the JSON reader will return
> a UnicodeString. That's why I wrote "switching to UnicodeStrings as
> soon as we find a \uXXXX is a conservative approximation".
Ah, yes, now I understand the string-parsing code you wrote.
> >>>If it returns UnicodeString no literal string access on a Dictionary
> >>>returned by the JSON parser will work as it would get only a String
> >>>object which has a different hash function than UnicodeString.
> >>Hmmm, this has to be fixed.
> >
> >Is it fixable?
>
> To some extent, it should be. For example, in the case of
> UnicodeStrings I can define the hash to be computed by translating
> (internally) to UTF-8 and hashing the result. Then, we can cross our
> fingers and hope that Strings are also UTF-8 (good bet nowadays), and
> just implement EncodedString>>hash as "^self asUnicodeString hash".
>
> As a note to myself, this would mean also skipping the first 3 bytes of
> a String to be hashed, if they are the BOM.
Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
approximation.
Which will of course horribly break if someone chooses to use e.g. German
umlauts in the source code in Latin-1 encoding, or maybe not. How is the
encoding of a literal string determined?
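
A small sketch of the hashing idea and of the Latin-1 problem, in Python
rather than Smalltalk. The name utf8_hash and the choice of SHA-256 as the
underlying hash are assumptions for illustration only; this is not how
String>>hash or UnicodeString>>hash are actually implemented.

```python
import hashlib

UTF8_BOM = b"\xef\xbb\xbf"


def utf8_hash(value) -> int:
    """Hash text by its UTF-8 octets so byte strings and Unicode strings agree."""
    if isinstance(value, str):
        octets = value.encode("utf-8")  # Unicode string: encode internally
    else:
        octets = bytes(value)           # byte string: assumed to be UTF-8 already
    if octets.startswith(UTF8_BOM):
        octets = octets[3:]             # skip the 3-byte BOM before hashing
    # Truncate the digest to 64 bits to get an integer hash value.
    return int.from_bytes(hashlib.sha256(octets).digest()[:8], "big")
```

With this, utf8_hash("ü") and utf8_hash of the UTF-8 octets of "ü" agree,
but the Latin-1 octet 0xFC hashes differently, which is exactly the
breakage I am worried about for Latin-1 encoded source literals.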
> >Of course it would already be helpful if it would work for ASCII characters
> >and Latin-1
>
> ASCII characters and UTF-8 please. :-) I'm also from a Latin-1 country,
> but I try to think as international as possible. :-)
That Smalltalk source code literals come in UTF-8 encoded form is a bold
assumption (which is increasingly right these days on Linux and other OSs) :-)