bug-coreutils


From: Kingsley G. Morse Jr.
Subject: bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed)
Date: Mon, 9 Jul 2018 21:29:53 -0700
User-agent: NeoMutt/20170306 (1.8.0)

Hi Assaf,

I like datamash.

And I like avoiding a pipe with sort's "-u"
option.

And I like your benchmarks.

Mine are humbly attached.

They compare sorting to hashing.

At the moment, hashing seems to me to be about 70%
faster.

And to scale better.
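
A rough sketch of that kind of sort-vs-hash comparison (the
attached uniq_demo script is the authoritative version; the input
size and file name here are only placeholders):

    $ shuf -r -i 1-1000000 -n 5000000 > input.txt   # random input with repeats
    $ time sort -u input.txt > /dev/null            # sort-based de-duplication
    $ time awk '!seen[$0]++' input.txt > /dev/null  # hash-based de-duplication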

I'd still like uniq to be faster and free from
sort.

Feel free to let me know if I screwed up.

Thanks,
Kingsley

On 07/09/2018 14:53, Assaf Gordon wrote:
> Hello,
> 
> On 08/07/18 03:03 PM, Kingsley G. Morse Jr. wrote:
> > The main reason I'm writing is to humbly suggest
> > a performance improvement to the "uniq" command.
> > 
> > It currently requires its input to be sorted.
> > 
> > Sorting generally takes about O(N*log(N)) time.
> > 
> > However, it seems to me that sorting is
> > unnecessary.
> > 
> > You can see this is so in the following example
> > 
> >      $ echo -e "b\na\nb" | awk '!seen[$0]++'
> 
> In addition to Paul's reply, here are a couple more practical issues:
> 
> 1.
> GNU sort has several non trivial (and not obvious) advantages:
> 
> It can sort a file larger than available memory (i.e. you
> can sort a 3GB file on a machine with 1GB of RAM).
> 
> It can sort using multiple threads, making sorting faster.
> 
> If you have a powerful machine (lots of ram and lots of CPUs),
> you can sort entirely in memory like so:
> 
>     sort -S 10GB --parallel=10 -T /foo/bar INPUT > OUTPUT
> 
> The "-S" sets the maximum amount of memory sort is allowed to use,
> The "--parallel" sets the number of parallel sorting threads,
> The "-T" points to a non-existing directory - ensuring that
> if sort runs out of memory (input too large) then it will fail
> instead of resorting to using disk storage (and slower I/O).
> 
> There are many other combinations (e.g. "--compress-program").
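> 
> For instance, a sketch of the opposite trade-off: letting sort spill
> to disk but compressing its temporary files (INPUT/OUTPUT and the
> choice of zstd are placeholders):
> 
>     sort -S 1G --parallel=4 -T /tmp --compress-program=zstd INPUT > OUTPUT
> 
> Here temporary runs spill to /tmp and are compressed with zstd,
> trading some CPU time for less disk I/O.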
> 
> A simple hash implementation will not be able to do so.
> 
> 
> 2.
> Sort already supports printing only unique values with the "-u" option.
> So by using the correct combination of keys (-k) and unique (-u)
> you can get unique values without even invoking "uniq"
> (if your concern is starting another process).
> 
> Note that uniq compares the entire line, while sort will "unique"
> only on the specified keys; sort also has the "--stable"
> option, which can affect the results.
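> 
> For example (the data here is made up):
> 
>   $ printf '%s\n' 'b 1' 'a 2' 'b 3' | sort -k1,1 -u
>   a 2
>   b 1
> 
> Only the first field is used as the key, so "b 3" is discarded as a
> duplicate of "b 1" even though the full lines differ.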
> 
> 
> 3.
> Sometimes you really just want to see the list of unique values,
> and that's valid.
> But often you want to later examine or do something with the list
> of unique values, and then it is common to sort it.
> 
> A hash implementation of "unique" will not print sorted items,
> so you'll likely need to pipe its output to another "sort" anyhow
> (perhaps with a much smaller number of items, but sort is still needed).
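> 
> In other words, the hash-based pipeline tends to end up looking like
> this sketch (INPUT is a placeholder):
> 
>   $ awk '!seen[$0]++' INPUT | sort
> 
> whereas "sort -u INPUT" produces sorted, unique output in one step.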
> 
> 
> 4.
> The Unix philosophy often says
>   "Write programs that do one thing and do it well."
> ( https://en.wikipedia.org/wiki/Unix_philosophy )
> 
> GNU Sort does sorting very well.
> Other programs that require sorted input can rely on it (e.g. join,
> uniq, etc.).
> 
> 
> 5.
> Using your awk example is actually a fantastic way to achieve
> what you wanted - it fits perfectly in "do one thing and do it well".
> 
> Note that if you're using a recent Debian or Ubuntu machine,
> they have switched the default awk implementation from GNU awk (gawk)
> to "mawk". "mawk" is indeed faster in some respects, but gawk seems
> to be much faster when it comes to hashing.
> 
> Observe the following:
> 
>   $ seq 1000000 | time -p gawk '!seen[$0]++' > /dev/null
>   real 0.40
>   user 0.35
>   sys 0.04
>   $ seq 1000000 | time -p mawk '!seen[$0]++' > /dev/null
>   real 10.48
>   user 10.40
>   sys 0.07
> 
> Using awk will also enable you to later extend your task to
> achieve more. E.g., the following program:
>   awk 'NR==FNR{seen[$1]++;next} seen[$1]>0' a.txt b.txt
> 
> will print only the lines from "b.txt" whose key appears in
> file "a.txt": kind of a hash-based join between two files.
> 
> 
> 6.
> Lastly, if you still want a program that uses hashing to discard
> duplicates in a file, may I humbly suggest GNU datamash:
>    https://www.gnu.org/software/datamash/
> (disclaimer: I'm datamash's developer).
> 
> It can easily remove duplicates, and it uses the same hashing code
> that other coreutils programs use. Example:
> 
>   $ printf "%s\n" a a b a b c b | datamash rmdup 1
>   a
>   b
>   c
> 
> Datamash has several additional useful features;
> for example, it can remove duplicates based on a specific column while
> still printing the entire matching line:
> 
>   $ printf "FOO %s %s\n" a 1 a 2 b 3 a 4 b 5 c 6 b 7 \
>         | datamash -W rmdup 2
>   FOO a 1
>   FOO b 3
>   FOO c 6
> 
> 
> 
> Hope this helps,
> regards,
>  - assaf
> 

-- 
Time is the fire in which we all burn.

Attachment: hash_v_sort_benchmarks.1.png
Description: PNG image

Attachment: hash_v_sort_benchmarks.2.png
Description: PNG image

Attachment: uniq_demo
Description: benchmarking script

Attachment: hash_and_sort_benchmarks.gnumeric
Description: spread sheet (gnumeric)

