[Help-smalltalk] Re: Unicode problem on parsing XML


From: Paolo Bonzini
Subject: [Help-smalltalk] Re: Unicode problem on parsing XML
Date: Wed, 05 May 2010 18:43:05 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100330 Fedora/3.0.4-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.4

On 05/05/2010 05:36 AM, Bèrto ëd Sèra wrote:
> Now I'm using gst 3.2 and iliad 0.8. What I get from the following code:
> content := 'taxonomy.xml' asFile.
> parser := XML.XMLParser new.
> parser validate: false.
> parser parse: content readStream.

Unrelated, but it's best to use "XML.SAXParser defaultParserClass new". Unlike XMLParser, other parsers may not construct the DOM by default, so you have to install a DOM-building SAX driver yourself and you end up with:

    PackageLoader fileInPackage: 'XML-XMLParser'.
    content := 'taxonomy.xml' asFile.
    parser := XML.SAXParser defaultParserClass new.
    parser validate: false.
    parser saxDriver: (driver := XML.DOM_SAXDriver new).
    parser parse: content readStream.
    driver document

This won't fix the bug but will make you a good citizen (see NEWS file in GST 3.2).

> the breakers are, for example:
> 1)...the æ and œ ligatures, ...
> 2) Devanāgarī script for Hindi
> 3) Japanese Rōmaji script

The breaker is _entities_, not characters.
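
A minimal sketch of a reproduction, reusing only the calls shown above. The inline sample document and its element name `t` are made up for illustration, and it assumes that a numeric character reference such as `&#257;` (ā) is the kind of entity involved, since that is the code path the patch further down touches:

```smalltalk
"Hypothetical reproduction; the sample document is invented for illustration."
PackageLoader fileInPackage: 'XML-XMLParser'.
parser := XML.SAXParser defaultParserClass new.
parser validate: false.
parser saxDriver: (driver := XML.DOM_SAXDriver new).

"The source string is pure ASCII: the non-ASCII character
 appears only as the character reference &#257;"
parser parse: '<t>&#257;</t>' readStream.
driver document printNl
```

If the diagnosis above is right, this is the case that should trip the unpatched parser, while the same document with the character written literally in the file's encoding should not.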

> I was wondering what changed... or, most probably, what kind of silly
> mistake I'm making...

Nothing, it's a bug. The easiest way to fix it is to use the XML-Expat package. You just have to replace the first line above with these two:

    PackageLoader fileInPackage: 'XML-Expat'.
    PackageLoader fileInPackage: 'XML-DOM'.

It's _thousands_ of times faster too.

But if you insist, this patch fixes it:

diff --git a/packages/xml/parser/XML.st b/packages/xml/parser/XML.st
index 309cf36..a9ebb7f 100644
--- a/packages/xml/parser/XML.st
+++ b/packages/xml/parser/XML.st
@@ -2950,7 +2950,7 @@ Instance Variables:
            ifTrue:
                [sax fatalError: (BadCharacterSignal new
messageText: 'A character with Unicode value %1 is not legal' % {n})].
-       data nextPut: (Character value: n).
+       data display: (Character codePoint: n).
        self getNextChar
     ]

diff --git a/packages/xml/parser/package.xml b/packages/xml/parser/package.xml
index 2e0bcce..fc72811 100644
--- a/packages/xml/parser/package.xml
+++ b/packages/xml/parser/package.xml
@@ -13,6 +13,7 @@

   <prereq>XML-SAXParser</prereq>
   <prereq>XML-DOM</prereq>
+  <prereq>Iconv</prereq>

   <filein>XML.st</filein>
   <file>XML.st</file>

Paolo



