
Re: [Grammatica-users] Parsing data out of an html file


From: Oliver Gramberg
Subject: Re: [Grammatica-users] Parsing data out of an html file
Date: Mon, 7 Feb 2011 14:27:03 +0100


Hi, Andrew,


your token definition,

SKIP_EVERYTHING         = <<.*>> %ignore%

does exactly what its name says: it skips everything. The reason is that the tokenizer works "greedily": once it has found a valid match, it keeps consuming characters for as long as the match can be extended. This is called the "longest match principle." Because <<.*>> matches any run of characters, it is always the longest candidate and wins over your other tokens. The principle is employed for efficiency: the tokenizer doesn't have to backtrack, and therefore effectively reads each character only once.
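To make the effect concrete, here is a minimal sketch of a longest-match tokenizer in Python. It is not Grammatica's implementation; the function name, the token names MARKUP and TEXT, and the sample line (in particular the closing tags after "England") are assumptions made for illustration only.

```python
import re

def tokenize(text, token_defs):
    """Longest-match tokenizer: at each position, try every pattern and keep
    the longest match. Ties go to the pattern that was declared first."""
    compiled = [(name, re.compile(pattern)) for name, pattern in token_defs]
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_end = None, pos
        for name, regex in compiled:
            m = regex.match(text, pos)
            # Strict '>' means an earlier pattern wins when lengths are equal.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens

# Hypothetical single input line; only the opening markup and "England"
# come from the thread, the closing tags are assumed.
line = '<div class="endOfDayLeft"><a href="">England</a></div>'

# With the catch-all token, ".*" is always the longest candidate.
with_catch_all = tokenize(line, [
    ("MARKUP", r"<[^>]*>"),
    ("SKIP_EVERYTHING", r".*"),
])

# Without it, markup and text come out as separate tokens.
without_catch_all = tokenize(line, [
    ("MARKUP", r"<[^>]*>"),
    ("TEXT", r"[^<]+"),
])

print(with_catch_all)
print(without_catch_all)
```

In the first call the whole line ends up in a single SKIP_EVERYTHING token, so there is nothing left for any other token to match. In the second call "England" arrives as its own TEXT token.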

Let's assume you are actually interested in the "England" bit of the line you show to be your target. Grammatica's %ignore% is all-or-nothing, so it is not of much help here: the line is identified by the markup at its beginning, so you cannot just throw away *all* markup; likewise, you want to throw away most of the content, but not *all* of it.

Fortunately, there's another way: To ignore something can also mean *not to do anything with it*, or, in Grammatica's terms: to do nothing in the method that is called when such a token is found.

So, the easiest solution to your problem might involve
(1) declaring a token that exactly matches HTML markup before the location where you want to extract data;
(2) declaring a token that matches all HTML markup, i.e., starts with "<";
(3) declaring a token that matches all HTML non-markup, i.e., starts with any character other than "<" ("[^<]").

- Token (1) must come first in your grammar; this way, Grammatica chooses it over (2) when your identifying markup appears in the input.
- When (1) is found, set a flag (in the appropriate callback method) indicating that the next non-markup token is the data you want to extract.
- Use (3) as output only when the flag is set.
- Don't forget to reset the flag once you have used it.
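The steps above can be sketched roughly as follows. This is a plain Python simulation of the idea, not Grammatica grammar syntax or its generated callback API; the names TOKEN_DEFS, TARGET_MARKUP, ANY_MARKUP, TEXT, and extract, as well as the sample page, are hypothetical. The sketch assumes that (1) is exactly the opening div tag, so that (1) and (2) match the same characters and declaration order decides between them.

```python
import re

# Order matters: (1) is declared before (2), so it wins when both
# match the same characters.
TOKEN_DEFS = [
    ("TARGET_MARKUP", re.compile(r'<div class="endOfDayLeft">')),  # (1)
    ("ANY_MARKUP", re.compile(r"<[^>]*>")),                        # (2)
    ("TEXT", re.compile(r"[^<]+")),                                # (3)
]

def extract(html):
    """Return the text that follows each occurrence of the target markup."""
    wanted = False  # flag: the next TEXT token is the data we want
    results = []
    pos = 0
    while pos < len(html):
        # Longest match; on a tie the earlier declaration is kept.
        best_name, best_end = None, pos
        for name, regex in TOKEN_DEFS:
            m = regex.match(html, pos)
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        lexeme = html[pos:best_end]
        pos = best_end

        # The "callbacks": most tokens are ignored by doing nothing.
        if best_name == "TARGET_MARKUP":
            wanted = True
        elif best_name == "TEXT" and wanted:
            results.append(lexeme)
            wanted = False  # reset the flag
    return results

# Hypothetical page; only the target markup and "England" come from the thread.
page = (
    '<p>Scotland</p>\n'
    '<div class="endOfDayLeft"><a href="">England</a></div>\n'
    '<p>Wales</p>'
)
print(extract(page))
```

For this sample page the call prints ['England']: the flag survives the intervening <a href=""> token, because only a TEXT token consumes and resets it.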


On the other hand, with such a small number of tokens, it might be even easier to handle this with a small script:

perl -n extract.pl output.html > extract.txt

with this line as the contents of extract.pl:

print $1 if m|<div class="endOfDayLeft"><a href="">


Regards
Oliver Gramberg
