Re: wget2 | Stack overflow downloading a deepy nested website (#659)


From: Andrew White (@awhite27)
Subject: Re: wget2 | Stack overflow downloading a deepy nested website (#659)
Date: Tue, 30 Apr 2024 10:39:01 +0000



Andrew White commented: 
https://gitlab.com/gnuwget/wget2/-/issues/659#note_1887101433


This is the stack trace. wget2 was built from the current master with "-g -O0". 
I've replaced any text identifying the website with `<removed>`. The function 
calls between `#5` and `#9` keep repeating in the stack trace until I got bored 
scrolling.

I originally downloaded the website with wget and was running wget2 with "-nc" 
over it to download any new files when it crashed. Doing the same with wget 
works fine.

The issue with this website is that the URLs are all CGI generated. Based on 
the values in the queries and the pages I have downloaded, I estimate that the 
nesting is at least 500 deep. Probably the best way to reproduce it is to 
write a simple CGI script that generates pages, each with a link to a URL with 
an incrementing counter, e.g.

`script.cgi?level=1` returns a page with a link to `script.cgi?level=2`. 
`script.cgi?level=2` returns a page with a link to `script.cgi?level=3` etc.
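
A minimal sketch of such a script, in Python. This is untested against wget2 and only illustrates the idea: the file name `script.cgi` comes from the example above, while `MAX_LEVEL`, `parse_level` and `render_page` are hypothetical names introduced here. The cap exists only so that a recursive download terminates; pick a value deeper than the nesting you want to test.

```python
#!/usr/bin/env python3
"""Hypothetical script.cgi: every page links to the next level."""
import os
from urllib.parse import parse_qs

# Assumed cap so a recursive crawl eventually stops; not from the real site.
MAX_LEVEL = 1000


def parse_level(query_string: str) -> int:
    """Return the 'level' query parameter, or 1 if it is missing or invalid."""
    values = parse_qs(query_string).get("level", ["1"])
    try:
        return max(1, int(values[0]))
    except ValueError:
        return 1


def render_page(level: int) -> str:
    """Build an HTML page whose only link points at level + 1."""
    if level >= MAX_LEVEL:
        body = "<p>last level</p>"
    else:
        nxt = level + 1
        body = '<a href="script.cgi?level={0}">level {0}</a>'.format(nxt)
    return (
        "<html><head><title>Level {0}</title></head>\n"
        "<body>{1}</body></html>\n"
    ).format(level, body)


if __name__ == "__main__":
    level = parse_level(os.environ.get("QUERY_STRING", ""))
    # A CGI response is a header block, a blank line, then the body.
    print("Content-Type: text/html; charset=utf-8")
    print()
    print(render_page(level), end="")
```

Serve it from any CGI-capable web server. Since the crash here happened on a second pass with "-nc" over files that already existed locally, the reproduction probably needs the same two steps: download once, then run wget2 over the result again.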


```
(gdb) r
Starting program: /usr/local/bin/wget2 -r -l inf -nc -np -p --xattr -a wget.log 
<removed>
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000555555564870 in write_out (default_fp=<error reading variable: Cannot 
access memory at address 0x7fffff7fec38>,
    data=<error reading variable: Cannot access memory at address 
0x7fffff7fec30>,
    len=<error reading variable: Cannot access memory at address 
0x7fffff7fec28>,
    with_timestamp=<error reading variable: Cannot access memory at address 
0x7fffff7fec24>,
    colorstring=<error reading variable: Cannot access memory at address 
0x7fffff7fec18>,
    color_id=<error reading variable: Cannot access memory at address 
0x7fffff7fec20>) at log.c:62
62      {
(gdb) list
57              const char *data,
58              size_t len,
59              int with_timestamp,
60              const char *colorstring,
61              wget_console_color color_id)
62      {
63              FILE *fp;
64              int fd = -1;
65
66              if (!data || (ssize_t)len <= 0)
(gdb) bt
#0  0x0000555555564870 in write_out (default_fp=<error reading variable: Cannot 
access memory at address 0x7fffff7fec38>,
    data=<error reading variable: Cannot access memory at address 
0x7fffff7fec30>,
    len=<error reading variable: Cannot access memory at address 
0x7fffff7fec28>,
    with_timestamp=<error reading variable: Cannot access memory at address 
0x7fffff7fec24>,
    colorstring=<error reading variable: Cannot access memory at address 
0x7fffff7fec18>,
    color_id=<error reading variable: Cannot access memory at address 
0x7fffff7fec20>) at log.c:62
#1  0x0000555555564c2e in write_info (fp=0x7ffff7e21760 <_IO_2_1_stdout_>,
    data=0x7fffff7ffd70 "URI content encoding = 'utf-8' (set by server 
response)\n", len=56) at log.c:153
#2  0x0000555555564d3e in write_info_stdout (data=0x7fffff7ffd70 "URI content 
encoding = 'utf-8' (set by server response)\n", len=56) at log.c:184
#3  0x00007ffff7f5f904 in logger_vprintf_func (logger=0x7ffff7fbf860 
<info_logger>, fmt=0x55555557f790 "URI content encoding = '%s' (%s)\n",
    args=0x7fffff800da8) at logger.c:47
#4  0x00007ffff7f5f554 in wget_info_printf (fmt=0x55555557f790 "URI content 
encoding = '%s' (%s)\n") at log.c:58
#5  0x000055555556e4fc in html_parse (job=0x0, level=0, fname=0x55556e3d2020 
<removed>,
    html=0x55556e3d2130 "<html> <removed>"..., html_len=34372, 
encoding=0x55555557e99c "utf-8", base=0x55556e3cd2f0)
    at wget.c:2660
#6  0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
    fname=0x55556e3d2020 <removed>, encoding=0x55555557e99c "utf-8", base=0x55556e3cd2f0)
    at wget.c:2755
#7  0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3d2020 
<removed>,
    encoding=0x55555557e99c "utf-8", mimetype=0x7fffff801410 "text/html", 
base=0x55556e3cd2f0) at wget.c:558
#8  0x0000555555568e0b in queue_url_from_remote (job=0x0, 
encoding=0x55555557e99c "utf-8",
    url=0x7fffff801650 <removed>, flags=0, download_name=0x0) at wget.c:923
#9  0x000055555556e902 in html_parse (job=0x0, level=0, fname=0x55556e3c4690 
<removed>,
    html=0x55556e3c47a0 "<html><removed>"..., html_len=39145, 
encoding=0x55555557e99c "utf-8", base=0x55556e3befb0)
    at wget.c:2725
#10 0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
    fname=0x55556e3c4690 <removed>, encoding=0x55555557e99c "utf-8", 
base=0x55556e3befb0)
    at wget.c:2755
#11 0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3c4690 
<removed>,
    encoding=0x55555557e99c "utf-8", mimetype=0x7fffff801ba0 "text/html", 
base=0x55556e3befb0) at wget.c:558
#12 0x0000555555568e0b in queue_url_from_remote (job=0x0, 
encoding=0x55555557e99c "utf-8",
    url=0x7fffff801de0 <removed>, flags=0, download_name=0x0) at wget.c:923
#13 0x000055555556e902 in html_parse (job=0x0, level=0, fname=0x55556e3b4de0 
"<removed>"..., html_len=44322, encoding=0x55555557e99c "utf-8", 
base=0x55556e3afe30)
    at wget.c:2725
#14 0x000055555556eab2 in html_parse_localfile (job=0x0, level=0,
    fname=0x55556e3b4de0 <removed>, encoding=0x55555557e99c "utf-8", 
base=0x55556e3afe30)
    at wget.c:2755
#15 0x0000555555567a13 in parse_localfile (job=0x0, fname=0x55556e3b4de0 
<removed>,
    encoding=0x55555557e99c "utf-8", mimetype=0x7fffff802330 "text/html", 
base=0x55556e3afe30) at wget.c:558
#16 0x0000555555568e0b in queue_url_from_remote (job=0x0, 
encoding=0x55555557e99c "utf-8",
    url=0x7fffff802570 <removed>, flags=0, download_name=0x0) at wget.c:923
```
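
Reading the trace, each nesting level adds the same four frames: `html_parse` calls `queue_url_from_remote`, which calls `parse_localfile`, then `html_parse_localfile`, then `html_parse` again. The stack addresses of the `url` argument in frames `#8` and `#12` differ by 0x790 (1936) bytes, so every level costs roughly that much stack in this build. The following Python sketch is not wget2 code, and all names in it are made up for illustration. It only contrasts the pattern visible in the trace (following a link from inside the parser) with a worklist that keeps pending pages on the heap.

```python
from collections import deque


def next_link(level, max_level):
    """Stand-in for parsing a page: return the linked level, or None at the end."""
    return level + 1 if level < max_level else None


def crawl_recursive(level, max_level):
    """Follow the link from inside the 'parser': one call frame per nesting level."""
    nxt = next_link(level, max_level)
    if nxt is None:
        return 1
    return 1 + crawl_recursive(nxt, max_level)


def crawl_worklist(start, max_level):
    """Queue the link instead of recursing: stack depth stays constant."""
    pending = deque([start])
    visited = 0
    while pending:
        level = pending.popleft()
        visited += 1
        nxt = next_link(level, max_level)
        if nxt is not None:
            pending.append(nxt)
    return visited
```

With a deep enough chain, `crawl_recursive` fails with Python's `RecursionError`, which is the interpreter's guarded equivalent of the SIGSEGV above, while `crawl_worklist` visits the same pages without growing the stack.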
