Re: [Chicken-users] Parsing HTML, best practice with Chicken
From: Peter Bex
Subject: Re: [Chicken-users] Parsing HTML, best practice with Chicken
Date: Mon, 29 Dec 2014 12:57:28 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
On Mon, Dec 29, 2014 at 03:28:15AM +0100, mfv wrote:
> So far, I have been getting the site with http-client, the raw html to sxml
> with html-parser, and trying to process the resulting list with
> matchable/srfi-13.
I would recommend avoiding that, as it can get really messy. sxpath is meant
for this sort of thing, but unfortunately it's really difficult to use IMO.
I somehow always manage to get it working with sxpath when I need to do
some web scraping, but it's somewhat painful.
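For what it's worth, here is a minimal sketch of the sxpath approach. It assumes CHICKEN 4 with the sxpath egg installed; the sample document and the name link-hrefs are made up for illustration, and the SXML literal stands in for what html->sxml would return for a real page.

```scheme
(use sxpath)

;; Hand-written SXML standing in for a parsed page (hypothetical data).
(define doc
  '(*TOP*
    (html
     (body
      (p "See "
         (a (@ (href "http://example.org/a")) "A")
         " and "
         (a (@ (href "http://example.org/b") (class "x")) "B"))))))

;; link-hrefs is a hypothetical helper name. The path reads: any
;; descendant, then "a" elements, then their attribute list, then the
;; href attribute, then its text value.
(define link-hrefs (sxpath '(// a @ href *text*)))

(for-each
 (lambda (href) (print href))
 (link-hrefs doc))
```

The path expression does the tree walking, so there is no need to pattern-match nested lists by hand, and the position of href among the other attributes does not matter.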
> I am not sure how much good it will do to use regex on those
> lists.
You can't, in general, and I wouldn't recommend it either, except perhaps
when parsing the text content (and even then it might fail due to inline
markup).
> Are there any packages like Python's Beautifulsoup in the Chicken
> arsenal?
That sort of thing is sorely lacking. There's a promising "zipper"
library written by Moritz Heidkamp, but so far it's unreleased and
undocumented. If you're feeling very adventurous you could have
a look at it: https://bitbucket.org/DerGuteMoritz/zipper
There also used to be an sxml-match egg for CHICKEN 3, but nobody's
bothered to port it to CHICKEN 4 so far. AFAIK its main advantage was
that it was exactly like "matchable", but document order-insensitive for
attribute nodes.
> ; grab a website
> (define lnk
>   "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
> (define raw (with-input-from-request lnk #f read-string))
>
> ;; convert site crawl data from html to sxml
> (define sxml (html->sxml raw))
This can be done directly, without creating an intermediate
large string, by using html->sxml on a port:
(define sxml (call-with-input-request lnk #f html->sxml))
In fact, I didn't even know you could use html->sxml on a
string. This seems to be an undocumented feature of html-parser :)
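To try the port form without hitting the network, here is a small sketch that feeds html->sxml from a string port instead of a request. It assumes CHICKEN 4 with the html-parser egg installed; the sample markup is invented.

```scheme
(use html-parser ports)

;; Made-up sample markup standing in for a fetched page.
(define sample
  "<html><body><p>Hello <b>world</b></p></body></html>")

;; call-with-input-string opens a port on the string and hands it to
;; html->sxml, mirroring how call-with-input-request hands over the
;; response port.
(define sxml (call-with-input-string sample html->sxml))

(write sxml)
(newline)
```

With a real request the parser consumes the response as it arrives, so the whole page never has to sit in memory as one string.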
Cheers,
Peter
--
http://www.more-magic.net