bug#32099: New uniq option for speed

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#32099: New uniq option for speed

From:	Assaf Gordon
Subject:	bug#32099: New uniq option for speed
Date:	Mon, 9 Jul 2018 14:53:04 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.8.0

Ηello,

On 08/07/18 03:03 PM, Kingsley G. Morse Jr. wrote:

The main reason I'm writing is to humbly suggest
a performance improvement to the "uniq" command.

It currently requires its input to be sorted.

Sorting generally takes about O(N*log(N)) time.

However, it seems to me that sorting is
unnecessary.

You can see this is so in the following example

     $ echo -e "b\na\nb" | awk '!seen[$0]++'


In addition to Paul's reply, here are couple of more practical issues:

1.
GNU sort has several non trivial (and not obvious) advantages:

It can sort a file larger than available memory (i.e. you
can sort a 3GB file on machine with 1GB of RAM).

It can sort using multiple threads, making sorting faster.

If you have a powerful machine (lots of ram and lots of CPUs),
you can sort entirely in memory like so:

    sort -S 10GB --parallel=10 -T /foo/bar INPUT > OUTPUT

The "-S" sets the maximum amount of memory sort is allowed to use,
The "--parallel" sets the number of parallel sorting threads,
The "-T" points to a non-existing directory - ensuring that
if sort runs out of memory (input too large) then it will fail
instead of resorting to using disk storage (and slower I/O).

There are many other combinations (e.g. "--compress-program").

A simple hash implementation will not be able to do so.


2.
Sort already supports printing only unique values with the "-u" option.
So by using the correct combination of keys (-k) and unique (-u)
you can get unique values without even invoking "uniq"
(if your concert is starting another process).

Note that uniq will compare the entire line, and sort will "unique"
only the specified "keys", but sort also has last the "--stable"
option that can affect the results.


3.
Sometimes you really just want to see the list of unique values,
and that's valid.
But often times you want to later examine or do something with the list
of unique values, and then it is common to sort it.

A hash implementation of "unique" will not print sorted items,
and then you'll likely need to pipe it to another "sort" anyhow
(perhaps with much smaller number of items, but still need sort).


4.
The Unix philosophy often says
  "Write programs that do one thing and do it well."
( https://en.wikipedia.org/wiki/Unix_philosophy )

GNU Sort does sorting very well.
Other programs that require sorted input can rely on it (e.g. join,
uniq, etc.).


5.
Using your awk example is actually a fantastic way to achieve
what you wanted - it fits perfectly in "do one thing and do it well".

Note that If you're using a recent Debian or Ubuntu machine,
they have switched the default awk implementation from GNU awk (gawk)
to "mawk". "mawk" is indeed faster in some aspects, but it seems gawk is
much faster when it comes to hashing.

Observe the following:

  $ seq 1000000 | time -p gawk '!seen[$0]++' > /dev/null
  real 0.40
  user 0.35
  sys 0.04
  $ seq 1000000 | time -p mawk '!seen[$0]++' > /dev/null
  real 10.48
  user 10.40
  sys 0.07

Using awk will also enable you to later extend your task to
achieve more. Eg. the following program:
  awk 'NR==FNR{seen[$1]++;next} seen[$1]>0' a.txt b.txt

Will only print lines from "b.txt" which have a key matching from
file "a.txt". kind of a hash-based join between two files.


6.
Lastly, if you still want a program that uses hashing to discard
duplicates in a file, may I humbly suggest GNU datamash:
   https://www.gnu.org/software/datamash/
(disclaimer: I'm datamash's developer).

It can easily remove duplicates, and it uses the same hashing code
that other coreutils program use. Example:

  $ printf "%s\n" a a b a b c b | datamash rmdup 1
  a
  b
  c

Datamash has several additional useful features,

for example it can remove duplicates from a specific column but stillprint the entire matching line:


  $ printf "FOO %s %s\n" a 1 a 2 b 3 a 4 b 5 c 6 b 7 \
        | datamash -W rmdup 2
  FOO a 1
  FOO b 3
  FOO c 6



Hope this helps,
regards,
 - assaf

[Prev in Thread]

Current Thread

[Next in Thread]

bug#32099: New uniq option for speed, Kingsley G. Morse Jr., 2018/07/08
- bug#32099: New uniq option for speed, Paul Eggert, 2018/07/09
- bug#32099: New uniq option for speed, Assaf Gordon <=
  - bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed), Kingsley G. Morse Jr., 2018/07/10
    - bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed), Assaf Gordon, 2018/07/10
    - bug#32099: datamash wins! (Was: New uniq option for speed), Kingsley G. Morse Jr., 2018/07/11
    - bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed), Bernhard Voelker, 2018/07/12
    - bug#32099: Testing with other options (Was: Benchmarks: Hashing ~70% faster ), Kingsley G. Morse Jr., 2018/07/13

Prev by Date: bug#32099: New uniq option for speed
Next by Date: bug#29886: tests/rm/rm-readdir-fail is failing with SIGBUS on Gentoo sparc
Previous by thread: bug#32099: New uniq option for speed
Next by thread: bug#32099: Benchmarks: Hashing ~70% faster (Was: New uniq option for speed)
Index(es):
- Date
- Thread