bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: efficient version of 'sort | uniq -c | sort -n'?


From: Paul Eggert
Subject: Re: efficient version of 'sort | uniq -c | sort -n'?
Date: Mon, 21 May 2007 14:42:20 -0700
User-agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux)

Matthew Woehlke <address@hidden> writes:

> Is there an efficient implementation of 'sort | uniq -c | sort -n'? I
> have a 4 GB core file I want to run 'strings' on, and the above is
> really slow.

See Jon Bently, Don Knuth, Doug McIlroy, "Programming pearls: a
literate program", CACM 29, 6 (June 1986), 471-483
<http://doi.acm.org/10.1145/5948.315654>.  Source code is included.  I
think Knuth's solution will run rings around all the solutions
proposed so far, if you tune it right (see below).

> Is there a way already in coreutils to do this? If not, would there
> be any interest in adding such a method?

I dunno, it sounds pretty specialized.  Though there may be some
interest in a combination "sort | uniq -c", I wouldn't think there'd
be any interest in combining all three.

Philip Rowlands <address@hidden> writes:

> A fundamentally more efficient approach would be something like:
>
> perl -lne '$bucket{$_}++; END { foreach $key (keys %bucket) { print 
> "$bucket{$key} $key" } }' | \
>   sort -n

Hmm, well, I tried that, and the Perl approach was tons slower on my
box (Debian stable, 2.4 GHz P4, 1 GB RAM).  The problem was that perl
grabbed too much memory and started thrashing.  The naive 'sort'
pipeline was much faster since it did not thrash.

Assuming there are a lot of duplicates, if you want to speed up the
naive 'sort' implementation you might try this:

strings | sort -S 50% | uniq -c | sort -n

On my host this causes the first 'sort' to take about 150% of the CPU
time of the 'strings' invocation.  The 'uniq' and second 'sort' take
about 100% each.  So the total CPU time is about 4.5x the time it
takes to run 'strings'.  Your mileage may vary of course.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]