help-smalltalk
Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation


From: Robin Redeker
Subject: Re: [Help-smalltalk] Re: [bug] UnicodeString conversion truncation
Date: Mon, 22 Oct 2007 13:07:54 +0200
User-agent: Mutt/1.5.11+cvs20060403

On Mon, Oct 22, 2007 at 12:03:10PM +0200, Paolo Bonzini wrote:
> 
[.snip.]
> >   "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
> >
> >The JSON parser has no choice but to operate on Unicode characters.
> >Parsing a UTF-16 encoded JSON text byte-wise will just not work :)
> 
> Oops, my fault.  And can you specify different encodings?

That can't really be specified in the JSON text itself; it's usually
an out-of-band thing, e.g. both ends agree on sending UTF-8 encoded JSON.
Of course there is a heuristic, which is even defined in the RFC:

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8
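
For illustration, the null-pattern table above could be sketched in Python roughly like this. The function name detect_json_encoding is made up for this example, and the fallback to UTF-8 for inputs shorter than four octets is an assumption (two ASCII characters need at least four octets in UTF-16 and eight in UTF-32):

```python
def detect_json_encoding(octets: bytes) -> str:
    """Guess the encoding of a JSON text from the nulls in its first four octets."""
    if len(octets) < 4:
        # Assumption: anything this short can only be UTF-8.
        return "utf-8"
    # True where the octet is a null byte, False otherwise.
    nulls = tuple(b == 0 for b in octets[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (typically no nulls at all) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


sample = '{"a": 1}'
guessed = detect_json_encoding(sample.encode("utf-16-le"))
```

The check only works because the first two characters are guaranteed to be ASCII, which is exactly the premise of the RFC paragraph quoted above.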

But that's rather ugly IMO.
The cleanest interface for the JSON parser/serializer would be to
receive and produce UnicodeStrings and let the programmer worry about
encoding.
(An octet-parsing wrapper can then always be defined later on top of that.)
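
A minimal sketch of that layering, in Python rather than Smalltalk: Python's json.loads and json.dumps stand in for a parser/serializer that only deals with Unicode strings, and the two wrapper names are hypothetical. The encoding is passed in by the caller, i.e. it is the out-of-band agreement mentioned above:

```python
import json


def parse_json_octets(octets: bytes, encoding: str = "utf-8"):
    """Octet wrapper: decode first, then hand characters to the parser."""
    text = octets.decode(encoding)  # bytes -> Unicode string
    return json.loads(text)         # the parser itself never sees bytes


def serialize_json_octets(value, encoding: str = "utf-8") -> bytes:
    """Octet wrapper: serialize to a Unicode string, then encode."""
    text = json.dumps(value, ensure_ascii=False)  # keep non-ASCII characters as-is
    return text.encode(encoding)


wire = serialize_json_octets({"k": "\u00fc"}, "utf-16-le")
roundtrip = parse_json_octets(wire, "utf-16-le")
```

The parser and serializer stay free of any encoding logic; only the thin wrappers know about octets.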

> 
> >   "Any character may be escaped."
> 
> Sure, you can write "\u0041".  In this case the JSON reader will return 
> a UnicodeString.  That's why I wrote 'switching to UnicodeStrings as 
> soon as we find a \uXXXX is a conservative approximation".

Ah, yes, now I understand the string-parsing code you wrote.

> >>>If it returns UnicodeString no literal string access on a Dictionary 
> >>>returned by
> >>>the JSON parser will work as it would get only a String object which has 
> >>>a different
> >>>hash function than UnicodeString.
> >>Hmmm, this has to be fixed.
> >
> >Is it fixable?
> 
> To some extent, it should be.  For example, in the case of 
> UnicodeStrings I can define the hash to be computed by translating 
> (internally) to UTF-8 and hashing the result.  Then, we can cross our 
> fingers and hope that Strings are also UTF-8 (good bet nowadays), and 
> just implement EncodedString>>hash as "^self asUnicodeString hash".
> 
> As a note to myself, this would mean also skipping the first 3 bytes of 
> a String to be hashed, if they are the BOM.

Hm, I agree that hashing Strings in their UTF-8 encoded form is a good
approximation.
It will of course break horribly if someone uses e.g. German umlauts in
Latin-1 encoded source code, or maybe not. How is the encoding of a
literal string determined?
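
To make the concern concrete, here is a rough Python sketch of the proposed scheme, not the actual GNU Smalltalk implementation. Python's str plays the role of UnicodeString and bytes the role of String; hash_key and utf8_hash are hypothetical names, and CRC32 is just an arbitrary stand-in hash function:

```python
import zlib

UTF8_BOM = b"\xef\xbb\xbf"


def hash_key(s) -> bytes:
    """Return the octets that get hashed: the UTF-8 form without a leading BOM."""
    # Unicode strings are translated to UTF-8; byte strings are assumed
    # (hopefully correctly) to be UTF-8 already.
    octets = s.encode("utf-8") if isinstance(s, str) else bytes(s)
    if octets.startswith(UTF8_BOM):
        octets = octets[len(UTF8_BOM):]  # skip the 3-byte BOM
    return octets


def utf8_hash(s) -> int:
    """Hash a string of either kind through its UTF-8 octets."""
    return zlib.crc32(hash_key(s))


word = "Gr\u00fc\u00dfe"
# UTF-8 bytes and the Unicode string agree on the hash ...
same = utf8_hash(word) == utf8_hash(word.encode("utf-8"))
# ... but Latin-1 bytes of the same word are different octets.
latin1_differs = hash_key(word) != hash_key(word.encode("latin-1"))
```

For pure ASCII keys all three forms coincide, so lookups with ASCII literals would keep working whatever the source encoding is.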

> >Of course it would already be helpful if it would work for ASCII characters
> >and Latin-1
> 
> ASCII characters and UTF-8 please. :-)  I'm also from a Latin-1 country, 
> but I try to think as international as possible. :-)

That Smalltalk source code literals come in UTF-8 encoded form is a bold
assumption, though one that is increasingly right these days on Linux and
other OSs :-)
