From: | Gijs van Tulder |
Subject: | [Bug-wget] Invalid Content-Length header in WARC files, on some platforms |
Date: | Mon, 12 Nov 2012 22:34:23 +0100 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121028 Thunderbird/16.0.2 |
Hi,

There's a somewhat serious issue in the WARC-generating code: on some platforms (presumably those where off_t is not a 64-bit type), the Content-Length header at the top of each WARC record has an incorrect value. On these platforms it is sometimes 0, sometimes 1, but never the correct length. This makes the whole WARC file unreadable.
The code works fine on many platforms, but it is apparently a problem on some PowerPC and ARM systems, and maybe other systems as well.
Existing WARC files with this problem can be repaired by replacing the value of the Content-Length header with the correct value, for each WARC record in the file. The content of the WARC records is intact; only the Content-Length header is wrong.
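As a rough illustration of the repair step, the sketch below computes the value that Content-Length should carry for a single record. It is not part of wget: warc_block_length is a hypothetical helper, and it assumes the caller has already isolated one complete record in memory (for example because each record was stored as its own gzip member). It relies only on the WARC record layout: header lines, an empty line, the block, then two CRLFs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not wget code.  Given one complete WARC record
   (header lines, CRLF, block, CRLF CRLF), return the block length that
   the Content-Length header should carry, or -1 if the record is
   malformed.  */
long long
warc_block_length (const char *record, size_t record_len)
{
  static const char sep[] = "\r\n\r\n";
  size_t i;

  assert (record != NULL);

  /* Need room for the header terminator plus the record terminator.  */
  if (record_len < 8)
    return -1;

  /* Every record ends with the two CRLFs that follow the block.  */
  if (memcmp (record + record_len - 4, sep, 4) != 0)
    return -1;

  /* The first CRLF CRLF ends the header; the block starts right after
     it and runs up to the trailing CRLF CRLF.  */
  for (i = 0; i + 4 <= record_len - 4; i++)
    if (memcmp (record + i, sep, 4) == 0)
      return (long long) (record_len - 4 - (i + 4));

  return -1;
}
```

A repair tool would then rewrite the Content-Length line of that record with the returned value. Finding the record boundaries when the stored length cannot be trusted is the harder part and is left out here.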
The attached patch fixes the problem in warc.c. It replaces off_t by wgint and uses the number_to_static_string function from util.c.
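The patch itself is not reproduced in this message, so the following is only a sketch of the general idea: format the length from an integer type whose width is the same everywhere, so that the result does not depend on how wide off_t happens to be. The function name format_content_length is hypothetical, and int64_t with PRId64 stands in for wget's own wgint and number_to_static_string.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not wget code.  Write a Content-Length header
   line into buf and return the number of characters snprintf would
   have written (excluding the terminating NUL).  */
int
format_content_length (char *buf, size_t bufsize, int64_t length)
{
  assert (buf != NULL && bufsize > 0);

  /* PRId64 always matches int64_t, so the format specifier and the
     argument width agree on every platform.  */
  return snprintf (buf, bufsize, "Content-Length: %" PRId64 "\r\n", length);
}
```

Called with a 64-byte buffer and a length of 5000000000, this produces the line "Content-Length: 5000000000" followed by CRLF, including on systems where long is 32 bits.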
Regards, Gijs
wget-warc-content-length.patch
Description: Text Data