Re: [Lynx-dev] pse help.
From: David Woolley
Subject: Re: [Lynx-dev] pse help.
Date: Thu, 11 Jun 2009 08:15:47 +0100
User-agent: Thunderbird 2.0.0.21 (X11/20090302)
karsten harazim wrote:
I wonder whether it is possible to extract information from existing
websites into some Excel document, extracting all names, addresses,
phone numbers, emails, URLs etc. from pages like
this: http://www.muenster.de/schulen-alle-1.html
Technically, you need something like XSLT to do this, although you are
rather dependent on the author actually having written HTML in the true
spirit of HTML, which is rather rare. You may need to convert the HTML
to XML syntax before using XSLT.
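As a rough illustration of the extraction step, here is a minimal sketch using Python's standard html.parser instead of XSLT. It assumes, hypothetically, that each entry on the page is an anchor with a mailto: link; the real markup on the muenster.de page may well be structured differently, which is exactly the dependence on the author's HTML mentioned above.

```python
from html.parser import HTMLParser

class ContactExtractor(HTMLParser):
    """Collect (link text, email address) pairs from mailto: anchors."""
    def __init__(self):
        super().__init__()
        self.contacts = []   # list of (name, email) tuples found so far
        self._email = None   # mailto target of the currently open <a>, if any
        self._text = []      # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("mailto:"):
                self._email = href[len("mailto:"):]
                self._text = []

    def handle_data(self, data):
        if self._email is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._email is not None:
            self.contacts.append(("".join(self._text).strip(), self._email))
            self._email = None

# Hypothetical sample input; a real page would be downloaded separately.
sample = '<li><a href="mailto:info@example-schule.de">Beispielschule</a></li>'
parser = ContactExtractor()
parser.feed(sample)
print(parser.contacts)   # -> [('Beispielschule', 'info@example-schule.de')]
```

The collected pairs could then be written out as CSV, which Excel opens directly.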
For the actual download, you would be better off using one of the
specialist tools, such as curl or wget.
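If you would rather stay in one language for download and extraction, the standard library can stand in for curl/wget; this is only a sketch, and the URL is the one from the original question.

```python
import urllib.request

def fetch(url, timeout=30):
    """Return the decoded body of a URL, honouring the declared charset."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

if __name__ == "__main__":
    html = fetch("http://www.muenster.de/schulen-alle-1.html")
    print(len(html))
```

Unlike wget, this does no retrying or rate limiting, so for anything beyond a single page the dedicated tools remain the better choice.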
However, actually doing so is likely to be illegal. Even if the
information is a pure collection of facts, in countries like the UK it
would be covered by a database copyright. At least one reason why Lynx
gets blocked from sites is that it is often used to extract
information without the surrounding advertising/branding.
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.