Re: [PATCH] Add new program: psub
From: Bo Borgerson
Subject: Re: [PATCH] Add new program: psub
Date: Sat, 03 May 2008 12:13:38 -0400
User-agent: Thunderbird 2.0.0.12 (X11/20080227)
Bo Borgerson wrote:
> Hi,
>
> This program uses the temporary fifo management system that I built
> for zargs to provide generic process substitution for arguments to a
> sub-command.
>
> This program has some advantages over the process substitution built
> into some shells (bash, zsh, ksh, ???):
>
> 1. It doesn't rely on having a shell that supports built-in process
> substitution.
> 2. By using descriptively named temporary fifos it allows programs
> that include filenames in output or diagnostic messages to provide
> more useful information than with '/dev/fd/*' inputs.
> 3. It supports `--files0-from=F' style argument passing, as well.
>
> Also available for fetch at:
>
> $ git fetch git://repo.or.cz/coreutils/bo.git psub:psub
>
Hi,
I'd like to share another use for this tool.
As discussed previously, `sort -m' incurs a performance penalty when the
number of inputs exceeds a certain limit (NMERGE). Beyond that limit
temporary files are used, which increases both I/O and CPU cost.
One way to avoid this extra cost is to increase NMERGE. Another is to
use tributary processes that each merge a subset of the inputs and feed
the main merge. On multi-processor machines this has the added potential
advantage of spreading the workload among processors.
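The tributary idea can be sketched directly with explicit named fifos --
this is essentially what psub automates, including the fifo creation and
cleanup. A minimal illustration with four tiny sorted inputs (file names
here are just for the sketch):

```shell
# Tributary-merge sketch with explicit named fifos (what psub automates).
# Assumes POSIX sh with sort and mkfifo available.
set -e
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# Four small sorted inputs.
printf 'a\nc\n' > "$dir/in0"
printf 'b\nd\n' > "$dir/in1"
printf 'e\ng\n' > "$dir/in2"
printf 'f\nh\n' > "$dir/in3"

mkfifo "$dir/t0" "$dir/t1"
# Each tributary process merges half of the inputs into its fifo...
sort -m "$dir/in0" "$dir/in1" > "$dir/t0" &
sort -m "$dir/in2" "$dir/in3" > "$dir/t1" &
# ...and the main merge reads both fifos concurrently.
result=$(sort -m "$dir/t0" "$dir/t1")
wait
echo "$result"
```

The two tributary sorts run in parallel with each other and with the
main merge, which is where the multi-processor benefit comes from.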
In the following example I have 32 inputs (named `0'..`31'), each
containing 1048576 records. Each record is a single character, so there
are large contiguous blocks of identical records. NMERGE is 16
(the default).
----
$ time sort -m *
real 0m9.107s
user 0m6.380s
sys 0m0.300s
$ time for i in 012 3456 789; do echo $i | sed 's/.*/"<sort -mu
*\[&\]"/'; done | xargs psub sort -m
real 0m3.792s
user 0m3.744s
sys 0m0.052s
----
And just to give a sense of how that breaks down:
----
$ for i in 012 3456 789; do echo $i | sed 's/.*/"<ls *\[&\]"/'; done |
xargs psub wc -l
11 /tmp/psubsUegiv/ls *[012]
12 /tmp/psubsUegiv/ls *[3456]
9 /tmp/psubsUegiv/ls *[789]
32 total
----
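For reference, the sed substitution in these commands just turns each
character set into a quoted `"<command glob>"` argument for psub, with
`&` in the replacement standing for the matched set. Run standalone:

```shell
# Expand each character set into the quoted "<sort -mu glob>" argument
# that xargs hands to psub; `&` in the replacement is the matched line.
args=$(for i in 012 3456 789; do
  echo "$i" | sed 's/.*/"<sort -mu *\[&\]"/'
done)
echo "$args"
```

This prints one argument per set: `"<sort -mu *[012]"`,
`"<sort -mu *[3456]"` and `"<sort -mu *[789]"`.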
With longer records and no identical records within a given input, the
benefit of spreading the work across processors becomes more apparent.
The following run uses 64 files of 262144 records each; each record is
4 characters long. I have a Core 2 Duo.
----
$ time sort -m *
real 0m13.183s
user 0m12.793s
sys 0m0.376s
$ time for i in 01 23 45 67 89; do echo $i | sed 's/.*/"<sort -mu
*\[&\]"/'; done | xargs psub sort -m
real 0m6.660s
user 0m12.401s
sys 0m0.168s
$ for i in 01 23 45 67 89; do echo $i | sed 's/.*/"<ls *\[&\]"/'; done |
xargs psub wc -l
14 /tmp/psubG0UkXb/ls *[01]
14 /tmp/psubG0UkXb/ls *[23]
12 /tmp/psubG0UkXb/ls *[45]
12 /tmp/psubG0UkXb/ls *[67]
12 /tmp/psubG0UkXb/ls *[89]
64 total
----
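For anyone wanting to reproduce the shape of the second experiment,
sorted fixed-width inputs can be generated along these lines (a
hypothetical sketch with scaled-down sizes; this is not the original
test data, which was not posted):

```shell
# Hypothetical generator: a few files of sorted 4-character records.
# Sizes are scaled down from the 64 x 262144 used in the timings above.
dir=$(mktemp -d)
for i in 0 1 2 3; do
  # Random 4-digit records, then sort each file so `sort -m' applies.
  awk -v seed="$i" 'BEGIN { srand(seed)
      for (n = 0; n < 1000; n++) printf "%04d\n", int(rand() * 10000) }' \
    | sort > "$dir/$i"
done
ls "$dir"
```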
The multi-process benefit is amplified on machines with more available
processors. With the current trend toward more on-die processor cores, I
think this sort of easy technique for exploiting concurrency will become
more broadly beneficial.
Thanks,
Bo