Re: [Bug-wget] WARC File Creation - Scope Issues
From: Tim Ruehsen
Subject: Re: [Bug-wget] WARC File Creation - Scope Issues
Date: Fri, 12 Apr 2013 10:32:13 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-4-amd64; KDE/4.8.4; x86_64; ; )

Hello Mark,
To capture a single document, just execute e.g.

wget --warc-file single_page 'https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide#WaybackInstallationandConfigurationGuide-URLsandWebApplications'

To save a page + requisites (everything you need to display it),
add the -p / --page-requisites option. Consult the man pages for a detailed
explanation.
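
A minimal sketch of the page-plus-requisites variant follows. It assumes the same WARC base name as above ("single_page") and drops the #fragment from the URL, since a fragment is never sent to the server. The RUN switch is only an illustration: it lets you inspect the command before anything is fetched.

```shell
#!/bin/sh
# Assumed values: the WARC base name and URL are taken from the example above.
url='https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide'
warc_name='single_page'

# Build the command once so it can be inspected before running.
#   --warc-file=NAME    record the session in NAME.warc.gz
#   --page-requisites   also fetch the images, CSS etc. needed to display the page
set -- wget --warc-file="$warc_name" --page-requisites "$url"

# Print the command; set RUN=1 to actually execute it (needs network access).
printf '%s\n' "$*"
if [ "${RUN:-0}" = 1 ]; then
    "$@"
fi
```

Without --recursive or --mirror, wget does not follow links to other pages, so the WARC should hold only this document and its requisites.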
Regards, Tim
On Thursday 11 April 2013, McFate, Mark wrote:
> This is not a 'bug' by any means, but I could find no better place to post
> this so please forgive me...
>
> I've used 'wget' for years but am just now discovering the real power it
> has. Lately I have upgraded to v1.14 so that I can take advantage of WARC
> file creation. But I need to learn a lot more. In particular, I'm having
> trouble controlling the scope of the content returned by wget when using
> the --warc-file option (or even when not). The --mirror option is nice, but
> in many circumstances it returns far too much information, and limiting
> the return using the -l option requires trial and error as I am never sure
> how deep to set it.
>
> For example, I would like to retrieve the following set of pages as a WARC,
> but don't really want anything else from this domain:
> https://webarchive.jira.com/wiki/display/wayback/Wayback+Installation+and+Configuration+Guide#WaybackInstallationandConfigurationGuide-URLsandWebApplications
> Is it even possible using wget to capture a complete WARC
> containing only this document?
>
> So, I'm looking for guidance that might be pertinent to using wget for WARC
> retrieval. Please point me to anything you think might be helpful.
> Thanks.
>
> Mark A. McFate
> Digital Library Applications Developer
> Burling Library, Grinnell College
> Grinnell, IA 50112-1690
> address@hidden