Re: [Chicken-users] Parsing Simple Markup


From: Andy Bennett
Subject: Re: [Chicken-users] Parsing Simple Markup
Date: Sun, 21 Sep 2014 16:01:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Icedove/24.6.0

Hi,

> I am  a new user to Scheme in general and to Chicken in particular, nice
> to meet you all.

Welcome!


> A few examples of what I am trying to parse:
> 
> 1. Tags that identify structural elements of a document:
> [chapter] "Chapter Title"
> [heading1] "Heading Title"
> [list]
> ...
> [end]
> 
> [quote]
> ...
> [end]
> 
> 2. Tags that identify formatting of text:
> <bold<text>  ;single formatting command with no value
> <indent 5<text> ; formatting command with a value
> <dropcap<O>nce upon a time
> <bold, smallcap, size +2<text> ;a command group which has multiple
> formatting commands enclosed within <...<.
> 
> A command group can be singular:
> 
> <...<
> 
> or have multiple commands separated by commas:
> 
> <...,...,...,<
> 
> the closing > signalling the end of the command group.

This is not entirely dissimilar to Markdown so I'd echo Peter's advice
to check out lowdown, the CHICKEN Markdown implementation, and comparse,
the parser library lowdown is implemented in.

I'll also point you to the eMail address parsing egg:
http://api.call-cc.org/doc/email-address which is another example of a
parser written with comparse. It's interesting because, unlike lowdown,
it implements a parser for just a small number of things: eMail
addresses and lists of eMail addresses.

comparse is a parser combinator library. This means that you specify
parts of your grammar / language and a procedure which can parse that
thing is returned. You then combine these parsers to produce other
parsers that, for example, can parse "X then Y", "X or Y", "X then Y
then Y", etc. It takes a couple of hours to wrap your head around it but
it's very powerful. The email-address parser is built up starting from
sets of characters and resulting in two procedures: one that parses an
eMail address and one that parses a "sequence of eMail addresses".
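
A minimal sketch of what "combining parsers" looks like in practice,
assuming the comparse egg is installed and CHICKEN 5 import syntax. The
names x, y, x-then-y, x-or-y and parse-result are made up for
illustration; is, sequence, any-of and parse come from comparse.

```scheme
(import comparse) ; CHICKEN 5; on CHICKEN 4 write (use comparse)

;; Two primitive parsers, each matching a single character.
(define x (is #\x))
(define y (is #\y))

;; Combining them yields new parsers.
(define x-then-y (sequence x y)) ; "X then Y"
(define x-or-y   (any-of x y))   ; "X or Y"

;; parse returns the result (#f on failure) plus the remaining input.
;; This hypothetical helper keeps only the result.
(define (parse-result parser input)
  (call-with-values (lambda () (parse parser input))
                    (lambda (result . rest) result)))

(parse-result x-then-y "xyz") ; => (#\x #\y)
(parse-result x-or-y "yz")    ; => #\y
(parse-result x-then-y "yx")  ; => #f
```

Note that x-then-y and x-or-y are ordinary values, so they can be passed
to further combinators in exactly the same way as x and y.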


> The idea is to make typesetting with Groff very simple and intuitive for
> any user - not just programmers and hackers.  The markup we are working
> on is called Typesetting Markup Language (TML).  So it would convert
> html-like commands and generate a Groff document from it.

comparse lets you take your results and give them as arguments to other
procedures. In the eMail address egg I use this to populate an internal
data type that represents an eMail address. You could use an
intermediate data type like this or you could try to write a number of
different procedures which immediately output the parsed thing in the
required format.
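
A sketch of the intermediate data type approach, assuming CHICKEN 5 and
the comparse egg. The command record and the parsers letters, digits and
valued-command are hypothetical names for this example; they are not
taken from the email-address egg.

```scheme
(import comparse) ; CHICKEN 5; on CHICKEN 4 write (use comparse)

;; Hypothetical intermediate type for one formatting command.
(define-record-type command
  (make-command name value)
  command?
  (name command-name)
  (value command-value))

(define letters (as-string (one-or-more (satisfies char-alphabetic?))))
(define digits  (as-string (one-or-more (satisfies char-numeric?))))

;; Parses text such as "indent 5" and hands the pieces to make-command.
(define valued-command
  (sequence* ((n letters)
              (v (preceded-by (is #\space) digits)))
    (result (make-command n (string->number v)))))

(receive (cmd . rest) (parse valued-command "indent 5")
  (list (command-name cmd) (command-value cmd))) ; => ("indent" 5)
```

A backend for Groff, HTML or anything else can then work from the
records alone and never needs to look at the original text.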


> Right now I am trying to do a prototype which generated Groff in the
> backend, but the idea is to have a general purpose markup that could
> also be used to generate LaTex/Contex, HTML xml etc....

...it's probably best to generate an intermediate format then. The
lowdown egg generates "SXML" which can easily be rendered down to HTML.

SXML is an s-expression representation of the tree structure of XML.

See here for an illustration of SXML:
http://www.more-magic.net/posts/lispy-dsl-sxml.html
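
As a rough illustration, a TML fragment might look like this as an
SXML-style tree. The element names chapter, para and bold are invented
for this example and are not part of lowdown's output.

```scheme
;; General shape: (element-name (@ (attribute value) ...) child ...)
(define doc
  '(chapter (@ (title "Chapter Title"))
     (para (bold "Once") " upon a time")))

(car doc)   ; => chapter
(cadr doc)  ; => (@ (title "Chapter Title"))
(caddr doc) ; => (para (bold "Once") " upon a time")
```

Because the tree is plain list data, each output format can be a
separate procedure that walks the same structure.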


> In Perl I am able to do most of this with regular expressions, but I'm
> hitting my head against the wall when it comes to multiple formatting
> commands within a group <...,...,...<

In comparse something like <X,Y,Z< would be something like:

(off the top of my head, without testing anything)

; fIorz's separated-by parser :
(define (separated-by sep-parser field-parser)
  (sequence* ((head field-parser)
              (tail (zero-or-more
                      (preceded-by
                        sep-parser
                        field-parser))))

             (result (cons head tail))))



; X, Y and Z stand for parsers for the individual commands.
(define the-parser
  (sequence (is #\<)
            (separated-by
              (is #\,) ; the separator must itself be a parser
              (maybe ; support null elements
                (any-of X Y Z)))
            (is #\<)))


The email-address egg additionally has the "delimited-by" parser to
support white space around the commas. Above I've used the "maybe" parser
to show how you'd support <X,,Y,Z< as well as <X,Y,Z<.
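
A self-contained sketch of a command group parser that tolerates white
space around the commas, assuming CHICKEN 5 and the comparse egg. It does
not use the email-address egg's delimited-by; the helper padded, and the
parsers blanks, name, value, command and command-group, are hypothetical
names for this example. Null elements are left out to keep it short.

```scheme
(import comparse) ; CHICKEN 5; on CHICKEN 4 write (use comparse)

;; Zero or more blanks; callers discard the result.
(define blanks
  (zero-or-more (any-of (is #\space) (is #\tab))))

;; Hypothetical helper: run parser, ignoring blanks on either side.
(define (padded parser)
  (enclosed-by blanks parser blanks))

;; A command name such as "bold".
(define name
  (as-string (one-or-more (satisfies char-alphabetic?))))

;; A command value such as "+2": anything up to a blank, "," or "<".
(define value
  (as-string
    (one-or-more
      (satisfies
        (lambda (c) (not (memv c '(#\space #\tab #\, #\<))))))))

;; A command is a name with an optional value, e.g. "size +2".
;; The result is a (name . value) pair; value is #f when absent.
(define command
  (sequence* ((n name)
              (v (maybe (preceded-by blanks value))))
    (result (cons n v))))

(define (separated-by sep-parser field-parser)
  (sequence* ((head field-parser)
              (tail (zero-or-more
                      (preceded-by sep-parser field-parser))))
    (result (cons head tail))))

(define command-group
  (enclosed-by (is #\<)
               (separated-by (is #\,) (padded command))
               (is #\<)))

(receive (group . rest) (parse command-group "<bold, smallcap, size +2<text>")
  group) ; => (("bold" . #f) ("smallcap" . #f) ("size" . "+2"))
```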



> Also to note....I am NOT a programmer of developer - I am a hobbyist and
> doing this for fun!

It looks like you're on the right track.


> My idea was that I could read a line of text from a file at a time.  My
> understanding is that the input would be read into an "s-expression"
> (which I understand to basically be a list).  Then could "car" the first
> item of the list and match it against my "tags" or "formatting commands"
> (which would be defined as something like below)
> 
> (define chapter "[chapter]")
> (define list:digit "[list:digit]")
> (define list:alpha "[list:alpha]")
> (define end-list "[end]")
> (define close-command-group ">")
> (define command-group-begin "<")
> (define command-group-end "<")
> (define bold "bold")
> (define smallcap "smallcap")
> (define dropcap "dropcap")

Don't worry about reading the input: let comparse do that for you.

Other than that, it looks like the rules you have defined there aren't a
million miles from the way comparse would let you specify things. The
additional complexity is that comparse returns a procedure that you
apply to the string or port you want to parse whereas, above, you just
have some literals bound to names.
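
To make the difference concrete, here is a sketch in which the
"[chapter]" literal becomes a parser, assuming CHICKEN 5 and the comparse
egg. The names chapter-tag, quoted-string and chapter-line are
hypothetical; char-seq, enclosed-by, preceded-by and bind come from
comparse.

```scheme
(import comparse) ; CHICKEN 5; on CHICKEN 4 write (use comparse)

;; A parser for the literal tag, rather than the bare string.
(define chapter-tag (char-seq "[chapter]"))

;; A double-quoted string; the result excludes the quotes.
(define quoted-string
  (enclosed-by
    (is #\")
    (as-string
      (zero-or-more (satisfies (lambda (c) (not (char=? c #\"))))))
    (is #\")))

;; Tag, at least one space, then the title.
(define chapter-line
  (bind (preceded-by chapter-tag (one-or-more (is #\space)) quoted-string)
        (lambda (title) (result (list 'chapter title)))))

(receive (node . rest) (parse chapter-line "[chapter] \"Chapter Title\"")
  node) ; => (chapter "Chapter Title")
```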

Using sequence and sequence* you can apply some parsers and then build
up your tree structure from the parsed things. You can then later car /
cdr down that structure to do things with it.
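
A small sketch of walking such a structure in plain Scheme. It assumes a
command group has been parsed into a list of (name . value) pairs, with
#f for commands that take no value; that shape, and the names group,
command-names and group-ref, are assumptions made for this example.

```scheme
;; Assumed result of parsing "<bold, smallcap, size +2<".
(define group '(("bold" . #f) ("smallcap" . #f) ("size" . "+2")))

;; Names of all commands in the group.
(define (command-names group)
  (map car group))

;; Value of the named command, or #f if it is absent or has no value.
(define (group-ref group name)
  (let ((entry (assoc name group)))
    (and entry (cdr entry))))

(car group)               ; => ("bold" . #f)
(command-names group)     ; => ("bold" "smallcap" "size")
(group-ref group "size")  ; => "+2"
(group-ref group "bold")  ; => #f
```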



> This is my first attempt at functional programming so I realize I may
> not be approaching this in the best way.

You're on the right path: keep going. I'd recommend forgetting about the
regular expressions for now tho'.



Good luck! Let us know how you get on.





Regards,
@ndy

-- 
address@hidden
http://www.ashurst.eu.org/
0x7EBA75FF



