[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] split: --chunks option
From: |
Jim Meyering |
Subject: |
Re: [PATCH] split: --chunks option |
Date: |
Sat, 06 Feb 2010 08:56:36 +0100 |
Pádraig Brady wrote:
> I got a bit of time for the review last night...
>
> This was your last interface change for this:
>
> -b, --bytes=SIZE put SIZE bytes per output file\n\
> + -b, --bytes=/N generate N output files\n\
> + -b, --bytes=K/N print Kth of N chunks of file\n\
> -C, --line-bytes=SIZE put at most SIZE bytes of lines per output file\n\
> -d, --numeric-suffixes use numeric suffixes instead of alphabetic\n\
> -l, --lines=NUMBER put NUMBER lines per output file\n\
> + -l, --lines=/N generate N eol delineated output files\n\
> + -l, --lines=K/N print Kth of N eol delineated chunks\n\
> + -n, --number=N same as --bytes=/N\n\
> + -n, --number=K/N same as --bytes=K/N\n\
> + -r, --round-robin=N generate N eol delineated output files using\n\
> + round-robin style distribution.\n\
> + -r. --round-robin=K/N print Kth of N eol delineated chunk as -rN would\n\
> + have generated.\n\
> + -t, --term=CHAR specify CHAR as eol. This will also convert\n\
> + -b to its line delineated equivalent (-C if\n\
> + splitting normally, -l if splitting by\n\
> + chunks). C escape sequences are accepted.\n\
>
> Thinking more about it, I think adding 2 modes of operation to the
> already slightly complicated -bCl options is too confusing.
> Since this is a separate mode of operation; i.e. one would be
> specifying a particular number of files for a different reason
> than a particular size, it would be better as a separate option.
> So I changed -n to operate as follows. This is more general if
> we want to add new split methods in future, and also compatible with
> the existing BSD -n without needing a redundant option.
I like it.
> -n N split into N files based on size of input
> -n K/N output K of N to stdout
s/K/Kth/ removes an ambiguity (and two more, below)
> -n l/N split into N files while maintaining lines
s/while maintaining/without splitting/ (and below)
> -n l/K/N output K of N to stdout while maintaining lines
> -n r/N like `l' but use round robin distribution instead of size
> -n r/K/N likewise but only output K of N to stdout
>
> Other changes I made in the attached version are:
>
> Removed the -t option as that's separate.
> Removed erroneous 'c' from getopt() parameters.
> Use K/N in code rather than M/N to match user instructions.
> Added suffix len setter/checker based on N so that
> we fail immediately if the wrong -a is specified, or
> if -a is not specified we auto set it.
> Flagged 0/N as an error, rather than treating like /N.
> Changed r/K/N to buffer using stdio for much better performance (see below)
> Fixed up the errno on some errors()
> Normalized all "write error" messages so that all of these commands
> output a single translated error message, of the form:
> "split: write error: No space left on device"
> split -n 1/10 $(which split) >/dev/full
> stdbuf -o0 split -n 1/10 $(which split) >/dev/full
> seq 10 | split -n r/1/10 >/dev/full
> seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full
Nice.
> Re the performance of the round robin implementation;
> using stdio helps a LOT as can be seen with:
> -------------------------------------------------------
> $ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
> 10000000
>
> real 0m1.568s
> user 0m1.486s
> sys 0m0.072s
>
> $ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
> 10000000
>
> real 0m50.988s
> user 0m7.548s
> sys 0m43.250s
Indeed. We don't see a 32-x speed-up (esp. non-algorithmic) very often.
> I still need to look at the round robin implementation when
> outputting to file rather than stdout. I may default to using
> stdio, but give an option to flush each line. I'm testing
> with this currently which is performing badly just doing write()
> -------------------------------------------------------
> #create fifos
> yes | head -n4 | ../split -n r/4 fifo
> for f in x*; do rm $f && mkfifo $f; done
>
> #consumer
> (for f in x*; do md5sum $f& done) > md5sum.out
>
> #producer
> seq 100000 | split -n r/4
> -------------------------------------------------------
>
> BTW, other modes perform well with write()
> -------------------------------------------------------
> $ yes | head -n10000000 > 10m.txt
> $ time ./split -n l/1/1 <10m.txt | wc -l
> 10000000
>
> real 0m0.201s
> user 0m0.145s
> sys 0m0.043s
>
> $ time ./split -n 1/1 <10m.txt | wc -l
> 10000000
>
> real 0m0.199s
> user 0m0.154s
> sys 0m0.041s
>
> $ time ./split -n 1 <10m.txt
>
> real 0m0.088s
> user 0m0.000s
> sys 0m0.081s
> -------------------------------------------------------
>
> Here is stuff I intend TODO before checking in:
> s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
> fix info docs for reworked interface
> try to refactor duplicated code
Thanks for doing all that!