Re: [Bug-wget] Concurrency and wget
From: Tim Ruehsen
Subject: Re: [Bug-wget] Concurrency and wget
Date: Tue, 10 Apr 2012 17:52:20 +0200
User-agent: KMail/1.13.7 (Linux/3.2.0-2-amd64; KDE/4.7.4; x86_64; ; )
Meanwhile, I have written a simple proof of concept (parallel dummy downloads
using threads, dummy chunk downloads, etc.).
I am at the point where I want to implement Metalink HTTP headers (RFC 6249).
I just can't find any servers to test against... maybe you can help me out?
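
For reference, a Metalink-aware server's response should look roughly like
this (hosts and the digest value are made-up placeholders, following the
examples in RFC 6249):

    HTTP/1.1 200 OK
    Link: <http://mirror1.example.org/example.ext>; rel=duplicate; pri=1
    Link: <http://mirror2.example.org/example.ext>; rel=duplicate; pri=2
    Link: <http://example.org/example.ext.meta4>; rel=describedby;
          type="application/metalink4+xml"
    Digest: SHA-256=<base64 of the file's SHA-256 hash>
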
Well, since there has been no response to my previous post: is there any
interest in getting this done anyway?
Tim
On Tuesday, 03 April 2012, Tim Ruehsen wrote:
> Hi Giuseppe, hi Micah,
>
> While I couldn't sleep last night, I thought about wget and concurrency...
>
> I had the idea of using a top-down approach to outline what wget is doing,
> just to get an overview without struggling with the details of the
> implementation. As a side effect, there would be a (textual? graphical?)
> starting point for contributors to dive into the project, and a chance to
> have a clear and well-documented design.
>
> Since maintaining a flowchart is time-consuming and requires some extra
> skills and tools, plain text in the form of a "programming language" seems
> a better fit.
>
> Here is just a beginning, let's say a basis for discussion.
> If you don't mind, I would like to take part in the ongoing development.
>
> Basic wget functionality (download a given URI/IRI):
>
> main (URI) {
>     put <URI> into <queue>
>
>     while <queue> is not empty {
>         download_and_analyse(next <queue> entry)
>     }
> }
>
> download_and_analyse (URI) {
>     download URI to FILE
>     add URI to <downloaded>
>     remove URI from <queue>
>     scan FILE and add URIs to <queue> if not already in <downloaded>
> }
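>
> To make this concrete, here is a minimal C sketch of that loop; the actual
> download and link extraction are stubbed out with printf(), and all names
> are illustrative rather than taken from the wget sources:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     struct node { char *uri; struct node *next; };
>
>     static struct node *queue;                /* <queue> */
>     static struct node *downloaded;           /* <downloaded> */
>
>     static void push(struct node **list, const char *uri) {
>         struct node *n = malloc(sizeof *n);
>         n->uri = strdup(uri);
>         n->next = *list;
>         *list = n;
>     }
>
>     static int in(const struct node *list, const char *uri) {
>         for (; list; list = list->next)
>             if (strcmp(list->uri, uri) == 0)
>                 return 1;
>         return 0;
>     }
>
>     static void download_and_analyse(const char *uri) {
>         printf("downloading %s\n", uri);      /* download URI to FILE */
>         push(&downloaded, uri);               /* add URI to <downloaded> */
>         /* real code: scan FILE, push() URIs not in <downloaded> */
>     }
>
>     int main(int argc, char **argv) {
>         if (argc > 1)
>             push(&queue, argv[1]);            /* put <URI> into <queue> */
>         while (queue) {                       /* while <queue> not empty */
>             struct node *n = queue;           /* next <queue> entry */
>             queue = n->next;
>             if (!in(downloaded, n->uri))
>                 download_and_analyse(n->uri);
>             free(n->uri);
>             free(n);
>         }
>         return 0;
>     }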
>
>
> Extended for simple multitasking (threads, multiple processes, or even
> distributed).
> This is just one possible design for concurrent downloads.
> Maybe you have a more elegant idea.
>
> main (URI) {
>     create <N> downloaders
>     put <URI> into <queue>
>
>     wait for status message from downloader {
>         print status
>         if <queue> is empty {
>             stop downloaders
>             we are done
>         }
>     }
> }
>
> downloader {
>     wait for and allocate entry in <queue> {
>         download_and_analyse(entry)
>     }
> }
>
> download_and_analyse (URI) {
>     download URI to FILE
>     add URI to <downloaded>
>     remove URI from <queue>
>     scan FILE and add URIs to <queue> if not already in <downloaded>
> }
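>
> As a hedged sketch of how the downloader pool could be done with POSIX
> threads: a mutex/condvar pair guards the shared <queue>, and the real
> download_and_analyse() is again reduced to a stub. All names are invented,
> not taken from the wget sources:
>
>     #include <pthread.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>
>     #define N 4                               /* number of downloaders */
>
>     struct node { char *uri; struct node *next; };
>
>     static struct node *queue;                /* shared <queue> */
>     static int active;                        /* downloads in progress */
>     static int done;                          /* set when crawl is over */
>     static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
>     static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
>
>     static void enqueue(const char *uri) {
>         struct node *n = malloc(sizeof *n);
>         n->uri = strdup(uri);
>         pthread_mutex_lock(&lock);
>         n->next = queue;
>         queue = n;
>         pthread_cond_broadcast(&cond);        /* wake waiting downloaders */
>         pthread_mutex_unlock(&lock);
>     }
>
>     static void *downloader(void *arg) {
>         (void) arg;
>         for (;;) {
>             pthread_mutex_lock(&lock);
>             while (!queue && !done)           /* wait for entry in <queue> */
>                 pthread_cond_wait(&cond, &lock);
>             if (done && !queue) {
>                 pthread_mutex_unlock(&lock);
>                 return NULL;                  /* "stop downloaders" */
>             }
>             struct node *n = queue;           /* allocate the entry */
>             queue = n->next;
>             active++;
>             pthread_mutex_unlock(&lock);
>
>             printf("downloading %s\n", n->uri); /* download_and_analyse() */
>             /* real code: scan the file and enqueue() unseen URIs here */
>
>             pthread_mutex_lock(&lock);
>             active--;
>             if (!queue && !active) {          /* empty queue, nobody busy */
>                 done = 1;                     /* "we are done" */
>                 pthread_cond_broadcast(&cond);
>             }
>             pthread_mutex_unlock(&lock);
>             free(n->uri);
>             free(n);
>         }
>     }
>
>     int main(int argc, char **argv) {
>         pthread_t t[N];
>         if (argc > 1)
>             enqueue(argv[1]);                 /* put <URI> into <queue> */
>         else
>             done = 1;
>         for (int i = 0; i < N; i++)
>             pthread_create(&t[i], NULL, downloader, NULL);
>         for (int i = 0; i < N; i++)
>             pthread_join(t[i], NULL);
>         return 0;
>     }
>
> Note that an empty <queue> alone is not a sufficient stop condition: a
> downloader that is still busy may yet add new URIs, hence the extra
> <active> counter.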
>
>
> Extended to download a URI from several sources in parallel.
> main and downloader stay the same; only download_and_analyse() is extended.
>
> download_and_analyse (URI) {
>     /* download URI to FILE */
>     put <X> chunk entries into <chunk_queue>
>     create <X> chunk_loaders
>
>     wait for status message from chunk_loader {
>         send modified status message to main
>         if <chunk_queue> is empty {
>             stop chunk_loaders
>             end loop
>         }
>     }
>
>     add URI to <downloaded>
>     remove URI from <queue>
>     scan FILE and add URIs to <queue> if not already in <downloaded>
> }
>
> chunk_loader {
>     wait for and allocate entry in <chunk_queue> {
>         download(entry)
>         remove entry from <chunk_queue>
>     }
> }
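>
> And a small sketch of how the <X> chunk entries could be computed, assuming
> the file size is already known (e.g. from a HEAD request). Each chunk_loader
> would then fetch its entry with an HTTP "Range: bytes=first-last" request
> and write the body at offset <first> into FILE; all names are invented:
>
>     #include <stdio.h>
>
>     #define X 4                               /* number of chunk_loaders */
>
>     struct chunk { long first, last; };       /* inclusive byte range */
>
>     int main(void) {
>         long filesize = 1000000;              /* known from HEAD request */
>         long size = (filesize + X - 1) / X;   /* chunk size, rounded up */
>         struct chunk chunk_queue[X];
>
>         /* put <X> chunk entries into <chunk_queue> */
>         for (int i = 0; i < X; i++) {
>             chunk_queue[i].first = (long) i * size;
>             chunk_queue[i].last  = (long) (i + 1) * size - 1;
>             if (chunk_queue[i].last >= filesize)
>                 chunk_queue[i].last = filesize - 1;
>         }
>
>         /* each chunk_loader would send one of these request headers */
>         for (int i = 0; i < X; i++)
>             printf("Range: bytes=%ld-%ld\n",
>                    chunk_queue[i].first, chunk_queue[i].last);
>         return 0;
>     }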
>
> After some iterations we should come to a point where we can make further
> decisions:
> - how to implement concurrency (threads, processes, distributed processes,
>   maybe even cloud)
> - how to implement communication between tasks
> - is a wget rewrite reasonable?
> - which existing code should be recycled?
> - whether to create libraries from existing code (e.g. libwget) or to use
>   external libraries (e.g. for network stuff, parsing and creating
>   URIs/IRIs, etc.)
> - create a list of test code, especially for the library code
> - ... etc etc ...
>
>
> Tim