Re: [PATCH] split: --chunks option
From: Pádraig Brady
Subject: Re: [PATCH] split: --chunks option
Date: Thu, 26 Nov 2009 09:30:36 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Chen Guo wrote:
> Hi all,
> This is mostly a step towards multithreaded sort the unix way, but as
> Padraig mentioned, has its other uses.
Thanks again for looking at this.
> Parsing and I/O are not my strong suits, so I have a couple of questions:
>
> Are there more appropriate functions than open and pread to use here? I
> usually see wrapper functions called in place of actual functions like fopen,
> fread, etc, and it feels rather inappropriate for me to use open and pread
> here.
>
> And are there any suggestions for parsing the --chunk option in a better
> way? I feel having two separate options specifying both required values is
> redundant, so I decided to separate the values by a comma, as Jim had in an
> example he linked me. The way I wrote it, it feels like a hacked workaround,
> but I'm not sure how else to get around that comma.
That's pretty much what I was thinking from the first mail I quoted:
The `read_chunk` process above is currently awkward and
inefficient to implement with dd and split. As a first step
I think it would be very useful to add a --number option to
`split`, like:
--number=5 #split input into 5 chunks
--number=2/5 #output chunk 2 of 5 to stdout
In text mode, this should handle only splitting on a line
boundary, and adding any extra data into the last chunk.
I do think --number is more general than --chunk, as it allows you to specify
only one number to get the behaviour described above. Also, I notice that
FreeBSD's split recently got a '-n chunk_count' option, so it would be good to
maintain compat with that if possible.
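As a rough illustration of the proposed syntax only (this is not code from the
patch, and `parse_number_spec` is a hypothetical name), accepting either `N` or
`K/N` in a single option argument could be sketched in Python like this:

```python
def parse_number_spec(spec):
    """Parse 'N' or 'K/N' into (k, n).

    k is None when only the chunk count N is given. Hypothetical helper,
    assuming chunks are numbered from 1 as in the --number=2/5 example.
    """
    head, sep, tail = spec.partition("/")
    try:
        if sep:
            k, n = int(head), int(tail)   # "K/N": select chunk K of N
        else:
            k, n = None, int(head)        # "N": split into N chunks
    except ValueError:
        raise ValueError(f"invalid number of chunks: {spec!r}") from None
    if n < 1 or (k is not None and not 1 <= k <= n):
        raise ValueError(f"invalid chunk selection: {spec!r}")
    return k, n
```

With this shape, `parse_number_spec("5")` yields `(None, 5)` and
`parse_number_spec("2/5")` yields `(2, 5)`, so one option covers both uses
without a comma-separated pair.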
We also need to decide how to select between text and binary modes for --number.
Note that reading from non-seekable input complicates things.
For binary data I don't see how one could support --number.
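For the seekable text-mode case, the behaviour described earlier (split only on
line boundaries, with any extra data going into the last chunk) might look
roughly like the following Python sketch. It assumes a regular file whose size
is known up front; the function names are hypothetical and this is not how the
patch itself is written:

```python
import os

def _line_start_at_or_after(f, offset, size):
    """Return the first line-start position >= offset, or size if none."""
    if offset <= 0:
        return 0
    if offset >= size:
        return size
    # Start one byte early so a newline at offset-1 keeps the boundary at offset.
    pos = offset - 1
    f.seek(pos)
    while True:
        block = f.read(65536)
        if not block:
            return size
        i = block.find(b"\n")
        if i != -1:
            return pos + i + 1
        pos += len(block)

def read_text_chunk(path, k, n):
    """Return chunk k of n (1-based) as bytes, split only on line boundaries."""
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        chunk_size = size // n
        start = _line_start_at_or_after(f, (k - 1) * chunk_size, size)
        # The last chunk runs to end of file, so it absorbs any remainder.
        end = size if k == n else _line_start_at_or_after(f, k * chunk_size, size)
        f.seek(start)
        return f.read(end - start)
```

Because both ends of a chunk are moved forward by the same rule, adjacent
chunks never overlap and concatenating chunks 1..n reproduces the file. The
sketch relies on seeking, which is exactly what non-seekable input rules out.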
>
> Also, any opinions on how the lines should be output? As of now I just
> have it as stdout, since that's how I see sort would use it. And of course,
> anything else I missed/could've done better? Thanks a lot guys.
It makes sense to just send the single "chunk" to stdout.
cheers,
Pádraig.