Re: [Lynx-dev] pse help.
From: David Woolley
Subject: Re: [Lynx-dev] pse help.
Date: Thu, 11 Jun 2009 08:15:47 +0100
User-agent: Thunderbird 2.0.0.21 (X11/20090302)
karsten harazim wrote:
I wonder whether it is possible to extract information from existing
websites into some Excel document, extracting all names, addresses,
phone numbers, emails, URLs etc. from pages like
this: http://www.muenster.de/schulen-alle-1.html
Technically, you need something like XSLT to do this, although you are
rather dependent on the author actually having written HTML in the true
spirit of HTML, which is rather rare. You may need to convert the HTML
to XML syntax before using XSLT.
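As a rough illustration of the extraction step, here is a minimal sketch using Python's standard html.parser instead of XSLT. It assumes, hypothetically, that each entry on the page is an anchor with a mailto: link; the real markup on the muenster.de page may well be structured differently, which is exactly the dependence on the author's HTML mentioned above.

```python
from html.parser import HTMLParser

class ContactExtractor(HTMLParser):
    """Collect (link text, email address) pairs from mailto: anchors."""
    def __init__(self):
        super().__init__()
        self.contacts = []   # list of (name, email) tuples found so far
        self._email = None   # mailto target of the currently open <a>, if any
        self._text = []      # text fragments collected inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith("mailto:"):
                self._email = href[len("mailto:"):]
                self._text = []

    def handle_data(self, data):
        if self._email is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._email is not None:
            self.contacts.append(("".join(self._text).strip(), self._email))
            self._email = None

# Hypothetical sample input; a real page would be downloaded separately.
sample = '<li><a href="mailto:info@example-schule.de">Beispielschule</a></li>'
parser = ContactExtractor()
parser.feed(sample)
print(parser.contacts)   # -> [('Beispielschule', 'info@example-schule.de')]
```

The collected pairs could then be written out as CSV, which Excel opens directly.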
For the actual download, you would be better off using one of the
specialist tools, such as curl or wget.
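If you would rather stay in one language for download and extraction, the standard library can stand in for curl/wget; this is only a sketch, and the URL is the one from the original question.

```python
import urllib.request

def fetch(url, timeout=30):
    """Return the decoded body of a URL, honouring the declared charset."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")

if __name__ == "__main__":
    html = fetch("http://www.muenster.de/schulen-alle-1.html")
    print(len(html))
```

Unlike wget, this does no retrying or rate limiting, so for anything beyond a single page the dedicated tools remain the better choice.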
However, actually doing so is likely to be illegal. Even if the
information is a pure collection of facts, in countries like the UK it
would be covered by a database copyright. At least one reason why Lynx
gets blocked from sites is that it is often used to extract
information without the surrounding advertising/branding.
--
David Woolley
Emails are not formal business letters, whatever businesses may want.
RFC1855 says there should be an address here, but, in a world of spam,
that is no longer good advice, as archive address hiding may not work.