Re: parallel sort strangeness


From: Jim Meyering
Subject: Re: parallel sort strangeness
Date: Fri, 29 Jul 2011 15:16:16 +0200

Pádraig Brady wrote:
> On 07/28/2011 01:16 PM, Jim Meyering wrote:
>> For some reason, sort is now very slow here (Fedora 15, x86_64, with
>> little free RAM), even for tiny inputs, but only some of the time,
>> and not when I set OMP_NUM_THREADS=1.
>> (I noticed because ./bootstrap was taking *forever*, traced it,
>> which showed gnulib-tool taking all the time, then traced gnulib-tool,
>> which showed slow sort -- it runs thousands of small ones.)
>>
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 3.18s system 98% cpu 3.218 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.01s user 39.90s system 99% cpu 40.077 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 5.65s system 99% cpu 5.679 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 34.16s system 13% cpu 4:12.40 total
>>
>> Then I can run the very same command 40 times in a row, with each
>> run terminating in less than .3s total.
>>
>> Then once more and it takes 5 seconds.
>
> Testing here (2 cores, 4 threads) does show a large jump in time,
> but only the first time I specified a large number of threads.
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>
> real  0m0.031s
> user  0m0.015s
> sys   0m0.012s
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort --parallel=100 | uniq -d"
>
> real  0m2.095s
> user  0m0.042s
> sys   0m2.042s
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort --parallel=100 | uniq -d"
>
> real  0m0.207s
> user  0m0.050s
> sys   0m0.157s
>
> My first thought was memory usage, which seems all the more likely
> since the multithreaded case uses more memory.  Also, when doing any
> perf testing with sort I now disable MALLOC_PERTURB_, considering
> what large buffers sort allocates; otherwise MALLOC_PERTURB_ would
> defeat any lazy allocation that could be done.  I was half
> considering unsetting MALLOC_PERTURB_ for the large buffer
> allocation internally in sort.
>
> A wild guess at your 5s delay: maybe that run caused
> something to be swapped out.  Maybe your swap is on a
> mechanical disk that needed to be spun up?

The disk I was using was an SSD (including swap).
Most delays were consistently smaller than those listed above.
More like .5 to 2 seconds per run.

> It's unlikely related, but I've had particular issues with
> kswapd0 spinning on F15, and I'm running a patched kernel
> to fix the issue.
>
> Could you repeat tests with a small -S or try the bootstrap
> with MALLOC_PERTURB_ unset?

Good call.  In a few small-input tests, I see that unsetting
MALLOC_PERTURB_ makes maxresident the same regardless of whether I use
--parallel=1 or --parallel=12.
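For reference, the kind of comparison I mean looks roughly like this
(the input file and the perturb value are only illustrative; /usr/bin/time
here is GNU time, for its -f/%M maxresident reporting):

  printf 'a\nb\n' > tiny.txt

  # maxresident (KB) with the perturbation active
  MALLOC_PERTURB_=42 /usr/bin/time -f 'maxrss=%M KB' \
      sort --parallel=12 tiny.txt > /dev/null
  MALLOC_PERTURB_=42 /usr/bin/time -f 'maxrss=%M KB' \
      sort --parallel=1 tiny.txt > /dev/null

  # and the same pair with MALLOC_PERTURB_ unset
  env -u MALLOC_PERTURB_ /usr/bin/time -f 'maxrss=%M KB' \
      sort --parallel=12 tiny.txt > /dev/null
  env -u MALLOC_PERTURB_ /usr/bin/time -f 'maxrss=%M KB' \
      sort --parallel=1 tiny.txt > /dev/null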

That makes sense, too.  Without MALLOC_PERTURB_, an over-allocation
like what we see there causes virtually no penalty, but *with*
MALLOC_PERTURB_=N, for 0 < N < 256, that buffer is both written
(to initialize) and read (to check at free-time), so it has an
impact on the virtual memory footprint, and hence potentially
on swapping.
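Something like the following should show that effect in isolation
(the -S value and the perturb byte are arbitrary, chosen only for
illustration):

  # big buffer request, empty input: with lazy allocation the
  # untouched pages cost almost nothing
  env -u MALLOC_PERTURB_ /usr/bin/time -f 'maxrss=%M KB' \
      sort -S 512M /dev/null

  # with the perturbation, the buffer is written to initialize it and
  # touched again at free time, so those pages really get committed
  MALLOC_PERTURB_=42 /usr/bin/time -f 'maxrss=%M KB' \
      sort -S 512M /dev/null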

Now I think this was at least partly self-inflicted, due to my
using MALLOC_PERTURB_=, so I expect to discard the proposed patch.
And buy more RAM ;-)

Thanks.


