Re: [PATCH] split: --chunks option
From: Chen Guo
Subject: Re: [PATCH] split: --chunks option
Date: Sat, 28 Nov 2009 11:38:07 -0800 (PST)
Hi Padraig,
> I do think --number is more general than --chunk as it allows you to
> specify only 1 number to get the behaviour described above. Also I
> notice that FreeBSDs split recently got a '-n chunk_count' option, so
> it would be good to maintain compat with that if possible.
>
I read the FreeBSD source. It's interesting that Berkeley gave the
copyright to the UC Regents, who just skyrocketed my tuition. Anyhow...
More on topic, their --number option is actually quite trivial: they
compute size = st_size/n and proceed as if it were --bytes=size. In a
sense, this chunks option can be seen as an extension of their --number
option.
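To make the size = st_size/n idea concrete, here is a minimal sketch of
the byte arithmetic. It is not taken from the FreeBSD source or from the
patch; the function name byte_chunks is hypothetical, and letting the
last chunk absorb the remainder is an assumption made for illustration.

```python
def byte_chunks(total_size, n):
    """Return (offset, length) pairs splitting total_size bytes into n chunks."""
    if n <= 0:
        raise ValueError("n must be positive")
    size = total_size // n  # size = st_size / n, as described above
    chunks = []
    for i in range(n):
        offset = i * size
        # Assumption: the final chunk takes whatever bytes are left over,
        # so the n chunks always cover the whole input.
        length = total_size - offset if i == n - 1 else size
        chunks.append((offset, length))
    return chunks


if __name__ == "__main__":
    print(byte_chunks(10, 3))
```

Without the remainder rule, a plain --bytes=size pass over an input whose
size is not a multiple of n would leave a short extra piece at the end.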
I think what I'll end up doing is implement their --number option,
outputting the chunks to files, then extend it to support
--number=n/tot, which outputs to stdout. For delineation by newlines,
I'll call it something like --number-lines=n, which outputs all chunks
to files with split's cwrite, while what I have now becomes
--number-lines=n/tot, which extracts a single chunk to stdout.
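One possible way to define a line-delineated chunk n/tot is sketched
below. It assumes a line belongs to the chunk whose byte range contains
the line's first byte, which is an illustrative rule and not necessarily
what the patch does. The name line_chunk is hypothetical, and the sketch
works on an in-memory bytes object rather than a seekable file.

```python
def line_chunk(data, k, n):
    """Return the k-th of n line-delineated chunks of data (bytes); k is 1-based."""
    if not 1 <= k <= n:
        raise ValueError("k must be in 1..n")
    size = len(data) // n
    start = (k - 1) * size
    # The last chunk runs to the end of the input.
    end = len(data) if k == n else k * size

    def next_line_start(pos):
        # First offset >= pos at which a line begins.
        if pos == 0 or pos >= len(data):
            return min(pos, len(data))
        newline = data.find(b"\n", pos - 1)
        return len(data) if newline == -1 else newline + 1

    return data[next_line_start(start):next_line_start(end)]


if __name__ == "__main__":
    sample = b"aa\nbbbb\nc\ndddddd\n"
    for k in (1, 2, 3):
        print(k, line_chunk(sample, k, 3))
```

Under this rule every line lands in exactly one chunk, so concatenating
chunks 1 through tot reproduces the input.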
> We also need to decide how to select between text and binary modes
> for --number.
> Note reading from non seekable input complicates things.
> For binary data I don't see how one could support --number.
>
So under this scheme then it'd be up to the user whether to use --number or
--number-lines. --number of course supports binary, since it's byte
delineation rather than line delineation.
I also tested using this with sorting. As expected, it's not faster.
This was done on gcc 14; rand is a million-line ASCII file generated by
gensort. Like I said, I'll try to implement the same concept internally
within sort, so we're free of the pipe overhead, and see how that goes.
address@hidden:~/testing$ time ./sortgl --threads=8 rand > /dev/null
real 0m1.820s
user 0m5.236s
sys 0m0.168s
address@hidden:~/testing$ time sort -m \
    <(./split -c1,8 rand | sort) <(./split -c2,8 rand | sort) \
    <(./split -c3,8 rand | sort) <(./split -c4,8 rand | sort) \
    <(./split -c5,8 rand | sort) <(./split -c6,8 rand | sort) \
    <(./split -c7,8 rand | sort) <(./split -c8,8 rand | sort) \
    > /dev/null
real 0m2.198s
user 0m5.324s
sys 0m0.440s
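
The pipeline above sorts each chunk independently and merges the sorted
streams. A rough in-process sketch of the same idea follows, assuming
the lines are already in memory. The name sort_by_chunks is hypothetical,
and plain Python string comparison stands in for sort's locale-aware
ordering.

```python
import heapq


def sort_by_chunks(lines, n):
    """Sort lines by sorting n contiguous chunks and merging the results."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Ceiling division so that at most n chunks are produced.
    size = max(1, -(-len(lines) // n))
    chunks = [lines[i:i + size] for i in range(0, len(lines), size)]
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # heapq.merge lazily merges already-sorted inputs, like sort -m.
    return list(heapq.merge(*sorted_chunks))


if __name__ == "__main__":
    print(sort_by_chunks(["d", "a", "c", "b", "e"], 2))
```

The sketch only shows that the merged result matches a full sort; it
says nothing about the relative timings measured above.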
Lastly, you guys probably won't hear back from me for a couple of weeks
on anything. It's the end of the quarter at UCLA, and that means fun
projects and even more fun finals.