chicken-users
Re: [Chicken-users] Basic abnf usage?


From: Moritz Heidkamp
Subject: Re: [Chicken-users] Basic abnf usage?
Date: Sat, 28 Mar 2015 12:33:44 +0100

Hey Matt,

On 27 March 2015 21:19 CET, Matt Gushee wrote:

> That's a fair question. I was working on a toy XML parser as a learning
> exercise, and I thought "hmm ... this should support UTF-8". So I attempted
> to use utf8-srfi-14 in place of regular srfi-14; then certain parsing
> functions didn't work as expected. I also looked at the comparse source,
> and saw that it imports [non-UTF8] srfi-13 and -14.

Ah, that's what you are referring to, I see! It's like that because I
didn't want to force a utf8 dependency on the user. The affected parser
combinators are easily constructed from more primitive ones, though. For
example, you could define a utf8 char-set aware `in' combinator like
this:

  (use comparse)
  (use utf8-srfi-14)
  
  (define (utf8-in cs)
    (satisfies (lambda (c) (char-set-contains? cs c))))
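
As a rough sketch of how such a combinator might be used on decoded
input: the value of `sake' below is only my guess from the output you
posted, and the `kanji' char-set is a hypothetical one made up for
illustration.

```scheme
(use comparse utf8 utf8-srfi-14)

;; Same definition as above, repeated so the snippet stands alone.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

;; Assumed input, guessed from the REPL output (three characters,
;; nine bytes when encoded as UTF-8).
(define sake "日本酒")

;; Hypothetical char-set built with utf8-srfi-14's `char-set'.
(define kanji (char-set #\x65e5 #\x672c #\x9152))

;; `string->list' is the utf8 egg's version here, so the parser sees
;; decoded characters rather than single bytes.
(parse (one-or-more (utf8-in kanji)) (string->list sake))
```

If those assumptions hold, the first value returned should be a list of
decoded characters instead of raw bytes.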


> #;3> (parse (is nichi) sake)
> #f
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
>
> Not useful for arbitrary-language text ...

This doesn't work because `sake' is a string and Comparse operates on
the byte level by default (in accordance with CHICKEN core string
procedures). You need to decode the input as UTF-8 to make it work as
expected, e.g. using the utf8 egg's `string->list' procedure:

#;162> (parse (is nichi) (string->list sake))
#\x65e5
#<parser-input lazy-seq #\x672c #\x9152>


> #;4> (parse item sake)
> #\
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values

Same here.


> Not so good. BTW, I also tried wrapping the text with (->parser-input ...).
> Didn't seem to make any difference.

Yeah, `parse' does that implicitly for you.


> #;7> (parse ident sake)
> (#\� ())
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
>
> ???

Same as above applies.


> #;9> (parse ident "h1")
> #f
> #<parser-input #\h #\1>
> ; 2 values
>
> Wrong. Or at least, quite unexpected.

This is because you are now passing a utf8 char-set to `in', which does
not satisfy srfi-14's `char-set?' predicate, so `in' treats it as a
possible input value to match against with `eq?' (or `memq', to be
precise). Maybe we should have two different `in's instead of a single
polymorphic one to make it a bit more strict. It seemed like a nice API
at the time ;-)
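
To illustrate what a stricter split could look like, here is a minimal
sketch. The names `in-char-set' and `in-items' are hypothetical and not
part of Comparse; this is only one way it might be done.

```scheme
(use comparse srfi-14)

;; Hypothetical: accepts only a (non-UTF-8) srfi-14 char-set and
;; signals an error right away when given anything else, instead of
;; silently falling back to identity comparison.
(define (in-char-set cs)
  (assert (char-set? cs) "in-char-set: expected an srfi-14 char-set")
  (satisfies (lambda (c) (and (char? c) (char-set-contains? cs c)))))

;; Hypothetical: matches the next item against an explicit list of
;; values using `memq'.
(define (in-items . items)
  (satisfies (lambda (c) (memq c items))))

(parse (in-char-set char-set:digit) "1")
(parse (in-items #\a #\b) "b")
```

With this split, passing a utf8 char-set to `in-char-set' would fail at
construction time rather than producing a parser that never matches.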


> #;10> (parse ident sake)
> #f
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
>
> Also not what we want.

Same cause here, of course.


> So, while I can see that it is possible to use certain combinators with
> non-ASCII text, this does not seem like proper UTF-8 support to me. Or is
> there some way to set up the environment or prepare the input that would
> prevent these issues?

We could create a comparse-utf8 egg to facilitate this. It's not
currently on my agenda but I will put it in my Comparse notes for future
reference. If you feel inclined to create one, I'm happy to provide you
with code review and feedback!

Hope that helps
Moritz
