Re: [Chicken-users] Parsing Simple Markup


From: Peter Bex
Subject: Re: [Chicken-users] Parsing Simple Markup
Date: Sun, 21 Sep 2014 12:10:55 +0200
User-agent: Mutt/1.4.2.3i

On Sat, Sep 20, 2014 at 11:19:08AM -0400, Yves Cloutier wrote:
> Hello,
> 
> I am  a new user to Scheme in general and to Chicken in particular, nice to
> meet you all.

Hello Yves, and welcome to the CHICKEN community!

> I came to scheme looking for an alternative to Perl for doing a personal
> project which involves parsing an input file, identifying html-like
> commands and converting those to Groff code.

That should be pretty doable.  We already have several eggs for parsing
various markup languages; you may want to take a look at their
implementations for inspiration:

- html-parser
- htmlprag
- lowdown
- mistie
- svnwiki-sxml

> Scheme is a totally different paradigm that I'm used to, so while I wait
> for my books to arrive I will need some hand-holding...hope that's ok.

No problem!  We always help out newbies.  If you have specific questions,
you might also like to try our IRC channel.  There's usually someone
around to answer your question.

> 1) Is the Chicken Scheme manual available for purchase?  Online docs are
> great but I like to have a hardcopy so that I can read offline.

I'm afraid not.  You're the first person to ask for a hardcopy.  Of course
you can always print it...  The manual is in svnwiki syntax, which can be
translated to sxml/html or markdown.  It's also human-readable so you
could print out the sources.  There's a copy of the manual with every
tarball, which gets installed as HTML in your system's doc directory,
so it's always available when you're offline.

> 2) The best way to learn is to get your hands dirty so I was looking at
> doing everything from scratch, but then I saw input-parse (
> http://wiki.call-cc.org/eggref/4/input-parse) which seems pretty much like
> what I need.  But i can't seem to find this in the Eggs.  It says that page
> does not exist yet.

Which page are you talking about?  http://wiki.call-cc.org/eggref/4/input-parse
looks fine to me.

> In Perl I am able to do most of this with regular expressions, but I'm
> hitting my head against the wall when it comes to multiple formatting
> commands within a group <...,...,...<

There's a famous quote by Jamie Zawinski about regular expressions, which
seems like it applies in this case:  "Some people, when confronted with
a problem, think “I know, I'll use regular expressions.” Now they have
two problems".

Having said that, I think the SRE notation for regular expressions makes
them a lot more readable.  However, parsing complex languages using
regular expressions is a bad idea...

> Also to note....I am NOT a programmer of developer - I am a hobbyist and
> doing this for fun!

...since you're not a programmer, you may not be familiar with formal
language theory.  The idea there is that there are several classes of
languages (or grammars), and only so-called "regular grammars" can
strictly be parsed with regular expressions.  A regular grammar is
basically one which requires no extra information to parse it, aside
from the current rule in the parser.  It also means that no backtracking
is needed when parsing it.  Irregex (like Perl) can do backtracking,
which muddies the waters quite a bit, and I think this is one of the
reasons people get confused about the abilities of regexes.

A good rule of thumb to remember is: if your syntax allows you to "nest"
things, regular expressions alone cannot parse it.  For instance, in
HTML you can arbitrarily nest markup instructions like <b><i>..</i></b>,
but also <div><div>...</div></div>.  This is why people will tell you
that HTML/XML cannot be parsed with regular expressions.  If you try
anyway, you set yourself up for failure.  Many security issues have
historically been due to poor parsing choices.  If you're interested
in this stuff, see also http://www.langsec.org, which is a group of
people who are using a language-theoretical approach to fighting
insecurity.
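
To make the nesting rule concrete, here is a small sketch.  It is written
in Python only to keep it independent of any particular egg; the names
regex_inner and balanced_inner are made up for this illustration.  The
regex pairs the first <div> with the nearest </div>, while the second
function keeps a depth counter, which is exactly the extra bookkeeping a
plain regular expression cannot do.

```python
import re

# Hypothetical illustration: a non-greedy regex that pairs each <div>
# with the nearest following </div>.
DIV = re.compile(r"<div>(.*?)</div>")

def regex_inner(text):
    """Return what the regex believes is the content of the first <div>."""
    m = DIV.search(text)
    return m.group(1) if m else None

def balanced_inner(text):
    """Return the content of the first <div>, tracking nesting depth."""
    open_tag, close_tag = "<div>", "</div>"
    start = text.find(open_tag)
    if start < 0:
        return None
    pos = start + len(open_tag)
    depth = 1
    while depth > 0:
        next_open = text.find(open_tag, pos)
        next_close = text.find(close_tag, pos)
        if next_close < 0:
            raise ValueError("unbalanced <div>")
        if 0 <= next_open < next_close:
            depth += 1                      # one level deeper
            pos = next_open + len(open_tag)
        else:
            depth -= 1                      # one level closed
            pos = next_close + len(close_tag)
    return text[start + len(open_tag):pos - len(close_tag)]

sample = "<div>a<div>b</div>c</div>"
print(regex_inner(sample))     # stops at the inner </div>
print(balanced_inner(sample))  # finds the matching outer </div>
```

Making the regex greedy does not help: it then fails in the other
direction on input like "<div>a</div><div>b</div>", where it swallows
both elements as one.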

You may be able to do partial parsing steps of a complex grammar
using regular expressions combined with some code to "drive" it.
This is the typical PHP/Perl approach of parsing languages, with the
reference implementations of Markdown and Textile being prime examples.
However, this quickly becomes intractable, and inevitably leads to the
aforementioned security issues.  Instead, I'd advise you to use one of
the parsing eggs, or roll your own recursive descent parser.  If
performance is not much of a consideration, that's pretty easy to do in
Scheme, and you don't need any dependencies.
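
As a rough idea of what "roll your own" looks like, here is a minimal
recursive descent sketch.  It is in Python rather than Scheme, and it
assumes a made-up markup of the form <name>...</name> (your actual
command syntax will differ); Parser, ParseError and parse are
hypothetical names.  Each grammar rule becomes one method, and nesting
falls out of the methods calling each other recursively.

```python
# Assumed toy grammar (not the real input format):
#   document := node*
#   node     := text | element
#   element  := "<" name ">" node* "</" name ">"

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def parse_document(self):
        nodes = self.parse_nodes()
        if self.pos != len(self.text):
            raise ParseError(f"unexpected closing tag at offset {self.pos}")
        return nodes

    def parse_nodes(self):
        # node*: stop at end of input or when a closing tag comes up
        nodes = []
        while self.pos < len(self.text) and not self.text.startswith("</", self.pos):
            if self.text[self.pos] == "<":
                nodes.append(self.parse_element())
            else:
                nodes.append(self.parse_text())
        return nodes

    def parse_text(self):
        end = self.text.find("<", self.pos)
        if end < 0:
            end = len(self.text)
        chunk = self.text[self.pos:end]
        self.pos = end
        return chunk

    def parse_name(self):
        end = self.text.find(">", self.pos)
        if end < 0:
            raise ParseError("unterminated tag")
        name = self.text[self.pos:end]
        self.pos = end + 1
        return name

    def parse_element(self):
        self.pos += 1                      # skip "<"
        name = self.parse_name()
        children = self.parse_nodes()      # recursion handles nesting
        if not self.text.startswith("</", self.pos):
            raise ParseError(f"missing </{name}>")
        self.pos += 2                      # skip "</"
        closing = self.parse_name()
        if closing != name:
            raise ParseError(f"expected </{name}>, got </{closing}>")
        return (name, children)

def parse(text):
    return Parser(text).parse_document()

print(parse("a<b>x<i>y</i></b>z"))
```

The result is a nested tree of (name, children) pairs, which a separate
pass can then walk to emit whatever output format you need.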

> My idea was that I could read a line of text from a file at a time.  My
> understanding is that the input would be read into an "s-expression" (which
> I understand to basically be a list).

That sounds problematic, because it will limit your ability to have
modifiers that span multiple lines.  Of course, it's still possible with
additional bookkeeping, but you may find it easier to just parse from a
character stream, handling newlines as ordinary symbols in the grammar
instead of making them fundamental to the way your syntax must be parsed.
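
A small sketch of the character-stream idea, again in Python and again
assuming a made-up <name> tag syntax (tokens is a hypothetical helper):
the tokenizer reads one character at a time, so a newline inside a
marked-up span is just another character of text and needs no special
bookkeeping.

```python
import io

def tokens(stream):
    """Yield ("tag", name) and ("text", chunk) pairs from a character stream."""
    buf = []
    while True:
        ch = stream.read(1)
        if ch == "":                       # end of input
            break
        if ch == "<":
            if buf:                        # flush pending text first
                yield ("text", "".join(buf))
                buf = []
            name = []
            while True:
                ch = stream.read(1)
                if ch == "":
                    raise ValueError("unterminated tag")
                if ch == ">":
                    break
                name.append(ch)
            yield ("tag", "".join(name))
        else:
            buf.append(ch)                 # newlines land here like any other char
    if buf:
        yield ("text", "".join(buf))

# The bold span crosses a line break; a line-at-a-time reader would see
# an opening tag on one line and its closing tag on the next.
for tok in tokens(io.StringIO("<b>bold\ntext</b> done\n")):
    print(tok)
```

The same function works unchanged on an open file, since it only relies
on read(1).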

> This is my first attempt at functional programming so I realize I may not
> be approaching this in the best way.

To start with something simple, I'd advise the "comparse" egg or the
"abnf" egg.  Both are very similar; the main difference is that comparse
has a permissive license, but it is quite a bit slower and more verbose
than abnf.

If you would like more inspiration and examples, you might want to check
out the "alternatives" directory in the uri-generic egg:
https://code.call-cc.org/svn/chicken-eggs/release/4/uri-generic/trunk/alternatives
(username: "anonymous", password: "")
uri-generic is a strict RFC-compliant URI parser, and I've been
experimenting with porting it to various parsing libraries, and
wrote some notes in the README file.

> Regards, and looking forward to playing with Scheme!

I hope you'll have fun!

Cheers,
Peter
-- 
http://www.more-magic.net


