grammatica-users

Re: [Grammatica-users] HTML grammar??


From: Rodgers, Kevin
Subject: Re: [Grammatica-users] HTML grammar??
Date: Wed, 15 Mar 2006 11:30:12 -0700

Per Cederberg writes:
> Well, I guess it would be possible to write an HTML
> grammar for Grammatica. But the question is more if
> it would really be a good fit. The thing with HTML
> is that *lots* of the real-world web pages are
> invalid (syntactically).
>
> So I think to write a good HTML-parser, one really
> needs to do it by hand. Adding special code
> everywhere to recover from common problems and
> issues.
>
> Also, HTML is a very unstrict syntax, allowing new
> unknown tags to be used, end tags to be omitted, etc,
> etc. So it is very hard to create a correct BNF
> grammar that covers all that still provides something
> more than a pure tokenizer.

HTML 4 and XHTML are very formally specified languages: SGML and XML
applications, respectively.  So it should be feasible to parse valid
HTML/XHTML documents with Grammatica.

Handling the vast amount of ill-formed and invalid HTML published on the
web is a separate problem.  I would try to solve it by piping each
document through Tidy (to generate valid XHTML) and Grammatica (to
process it).
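
A rough sketch of that two-stage pipeline, in Python for brevity. It
assumes the `tidy` command-line tool is installed and on the PATH. The
helper names (tidy_command, tidy_to_xhtml, element_names) are made up
for this illustration, and the second stage uses Python's standard XML
parser purely as a stand-in for a Grammatica-generated parser, which
is not shown here.

```python
import subprocess
import xml.etree.ElementTree as ET


def tidy_command(executable="tidy"):
    """Build the Tidy invocation (hypothetical helper).

    -asxhtml  convert the input to well-formed XHTML
    -numeric  emit numeric rather than named entities
    -utf8     read and write UTF-8
    -quiet    suppress non-essential output
    """
    return [executable, "-quiet", "-asxhtml", "-numeric", "-utf8"]


def tidy_to_xhtml(html, executable="tidy"):
    """Stage 1: pipe possibly invalid HTML through Tidy, return XHTML bytes."""
    result = subprocess.run(
        tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # Tidy exits with 0 on success, 1 on warnings, 2 on errors.
    if result.returncode >= 2:
        raise ValueError(result.stderr.decode("utf-8", "replace"))
    return result.stdout


def element_names(xhtml):
    """Stage 2 (stand-in): parse the XHTML and list element names in order.

    A real setup would hand the cleaned document to the generated
    parser instead of ElementTree.
    """
    root = ET.fromstring(xhtml)
    # Drop the "{namespace}" prefix that ElementTree adds to each tag.
    return [element.tag.split("}")[-1] for element in root.iter()]
```

Usage would then be something like
element_names(tidy_to_xhtml("<p>unclosed <b>tags")), so the second
stage only ever sees well-formed input.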

--
Kevin

