Re: [Grammatica-users] HTML grammar??
From: Rodgers, Kevin
Subject: Re: [Grammatica-users] HTML grammar??
Date: Wed, 15 Mar 2006 11:30:12 -0700

Per Cederberg writes:
> Well, I guess it would be possible to write an HTML
> grammar for Grammatica. But the question is more if
> it would really be a good fit. The thing with HTML
> is that *lots* of the real-world web pages are
> invalid (syntactically).
>
> So I think to write a good HTML-parser, one really
> needs to do it by hand. Adding special code
> everywhere to recover from common problems and
> issues.
>
> Also, HTML is a very unstrict syntax, allowing new
> unknown tags to be used, end tags to be omitted, etc,
> etc. So it is very hard to create a correct BNF
> grammar that covers all that still provides something
> more than a pure tokenizer.

HTML 4 and XHTML are very formally specified languages: SGML and XML
applications, respectively. So it should be feasible to parse valid
HTML/XHTML documents with Grammatica.

Handling the vast amount of ill-formed and invalid HTML published on the
web is a separate problem. I would try to solve it by piping each
document through Tidy (to generate valid XHTML) and Grammatica (to
process it).
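
A rough sketch of that two-stage pipeline, in Python for brevity. The
Tidy options shown (-quiet, -utf8, -asxhtml, -numeric) are my assumption
of a reasonable invocation, and the function names are made up for
illustration. The second stage is only a stand-in: it uses Python's
standard XML parser where a Grammatica-generated parser would go, since
no Grammatica grammar for XHTML exists in this thread.

```python
import subprocess
import xml.etree.ElementTree as ET

# Assumed Tidy command line: convert to XHTML, use UTF-8 for input and
# output, and emit numeric entities so that a plain XML parser does not
# trip over named entities such as &nbsp;.
TIDY_CMD = ["tidy", "-quiet", "-utf8", "-asxhtml", "-numeric"]


def clean_html(html, cmd=TIDY_CMD):
    """Stage 1: pipe raw HTML through an external cleaner (Tidy by default)."""
    result = subprocess.run(
        cmd,
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 when clean, 1 on warnings, and 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr.decode("utf-8", errors="replace"))
    return result.stdout.decode("utf-8")


def element_names(xhtml):
    """Stage 2 (stand-in for Grammatica): parse the XHTML and list the
    element names in document order, with any namespace prefix removed."""
    root = ET.fromstring(xhtml)
    return [element.tag.split("}")[-1] for element in root.iter()]
```

With Tidy installed, element_names(clean_html(raw_page)) would run both
stages on one document. The cleaner command is a parameter so that a
different repair tool can be swapped in without touching the second
stage.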
--
Kevin