Re: [Chicken-users] Basic abnf usage?

Hi, Moritz--

On Fri, Mar 27, 2015 at 11:32 AM, Moritz Heidkamp <address@hidden> wrote:

> On 27 March 2015 18:18 CET, Matt Gushee wrote:
>
>> BTW, in case you are interested, I was going to use comparse, which is a
>> bit easier to learn (don't know about performance), but found that it
>> didn't support UTF-8, which in my world is not acceptable.
>
> as the author of Comparse I wonder how you got that impression?

That's a fair question. I was working on a toy XML parser as a learning exercise, and I thought "hmm ... this should support UTF-8". So I attempted to use utf8-srfi-14 in place of regular srfi-14; then certain parsing functions didn't work as expected. I also looked at the comparse source, and saw that it imports [non-UTF8] srfi-13 and -14.

To illustrate some of the problems I'm finding:

    [FILE: jpstuff.scm]
    (use comparse)
    (use srfi-14)
    ;(use utf8-srfi-14)

    ;(use unicode-char-sets)

    ; Some Japanese text. In case it doesn't display properly in your browser,
    ; the string called 'sake' is the first three characters of 'sake-is-delicious',
    ; and the character called 'nichi' is the first character of 'sake'.

    (define sake "日本酒")
    (define sake-is-delicious "日本酒は美味しいです。")
    (define nichi #\日)

    ; From a toy XML parser I've been building
    (define ident-start (in (char-set-union char-set:alphabetic (char-set #\_))))
    (define ident-char (in (char-set-union char-set:alphabetic char-set:digit (char-set #\_ #\- #\: #\.))))
    (define ident (sequence ident-start (zero-or-more ident-char)))

Okay, let's try some parsing:

#;1> (load "jpstuff.scm")
; loading jpstuff.scm ...
; loading /usr/local/lib/chicken/7/comparse.import.so ...

.... [loading various extensions] ....

#;2> (parse (char-seq sake) sake-is-delicious)
"日本酒"
#<parser-input #\� #\� #\� #\� #\� #\#\� #\� #\� #\�>
; 2 values

OK so far ...

#;3> (parse (is nichi) sake)
#f
#<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
; 2 values

Not useful for arbitrary-language text ...

#;4> (parse item sake)
#\
#<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
; 2 values

Not so good. BTW, I also tried wrapping the text with (->parser-input ...). Didn't seem to make any difference.
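
For what it's worth, here is a minimal sketch of what I suspect is going on, outside of comparse entirely. It assumes CHICKEN 4 with the utf8 egg installed and a UTF-8 encoded source file; the `utf8:` prefix and the `nichi-codepoint` name are only there for this illustration. The idea is that the core string procedures see the raw UTF-8 bytes, while the utf8 egg's versions see code points, so a parser that walks the string with the core procedures would never encounter the character U+65E5 as a single item.

```scheme
;; Sketch only: assumes CHICKEN 4 and the utf8 egg.
;; The utf8: prefix keeps the core string procedures visible for comparison.
(use (prefix utf8 utf8:))

(define sake "日本酒")

;; Build the character from its code point (U+65E5) so that this sketch
;; does not depend on how the reader handles a multibyte #\ literal.
(define nichi-codepoint (integer->char #x65E5))

;; Core view: the string is a sequence of bytes.
(print "core length:  " (string-length sake))
(print "core first:   " (char->integer (string-ref sake 0)))
(print "core match?   " (char=? nichi-codepoint (string-ref sake 0)))

;; utf8 egg view: the string is a sequence of code points.
(print "utf8 length:  " (utf8:string-length sake))
(print "utf8 first:   " (char->integer (utf8:string-ref sake 0)))
(print "utf8 match?   " (char=? nichi-codepoint (utf8:string-ref sake 0)))
```

If that is right, it would fit the transcripts above: the remaining input for `sake` shows up as individual undecodable bytes rather than as three characters, and `(is nichi)` has nothing it could match against.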

#;5> (parse ident "h1")
(#\h (#\1))
#<parser-input-end>
; 2 values
#;6> (parse ident "paragraph")
(#\p (#\a #\r #\a #\g #\r #\a #\p #\h))
#<parser-input-end>
; 2 values

Good.

#;7> (parse ident sake)
(#\� ())
#<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
; 2 values

???

#;8> ; Now using utf8-srfi-14 & unicode-char-sets
#;8> (load "jpstuff.scm")
; loading jpstuff.scm ...
; loading /usr/local/lib/chicken/7/utf8-srfi-14.import.so ...
; loading /usr/local/lib/chicken/7/iset.import.so ...
; loading /usr/local/lib/chicken/7/srfi-4.import.so ...
; loading /usr/local/lib/chicken/7/utf8-lolevel.import.so ...
; loading /usr/local/lib/chicken/7/lolevel.import.so ...
; loading /usr/local/lib/chicken/7/utf8-srfi-14.so ...
; loading /usr/local/lib/chicken/7/utf8-lolevel.so ...
; loading /usr/local/lib/chicken/7/iset.so ...
; loading /usr/local/lib/chicken/7/unicode-char-sets.import.so ...
#;9> (parse ident "h1")
#f
#<parser-input #\h #\1>
; 2 values

Wrong. Or at least, quite unexpected.

#;10> (parse ident sake)
#f
#<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
; 2 values

Also not what we want.

So, while I can see that it is possible to use certain combinators with non-ASCII text, this does not seem like proper UTF-8 support to me. Or is there some way to set up the environment or prepare the input that would prevent these issues?

--

Matt

From:	Matt Gushee
Subject:	Re: [Chicken-users] Basic abnf usage?
Date:	Fri, 27 Mar 2015 14:19:32 -0600