Re: How to sort and count efficiently?


From: Assaf Gordon
Subject: Re: How to sort and count efficiently?
Date: Sun, 30 Jun 2019 11:53:57 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2


On 2019-06-30 11:10 a.m., Peng Yu wrote:
> The problem with this kind of awk program is that everything will be loaded to memory.

Well, those are the two main options: store everything in memory, or resort
to disk I/O. Each has its own pros and cons.

> But bare `sort` use external files to save memory.
Not exactly - the goal is not to "save" memory.
Sort resorts to external temporary files so that it can complete the
sort even when it runs out of the allotted memory (which can be
controlled with the "-S" parameter).
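
For example, a minimal sort-based counting pipeline (just a sketch;
"INPUT" and the 1G memory cap are placeholders) could be:

  $ sort -S 1G INPUT | uniq -c

sort spills to temporary files once its buffer is full, and "uniq -c"
prefixes each distinct line with its count.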

I'm not familiar with a program that implements hashing backed by file
storage, but perhaps such a program exists.

> When the hash in awk is too large, accessing it can become very slow (maybe due to potential cache miss or slow down of hash as a function of hash size).

Nothing is "free", and using a hash incurs its own costs.

If you're using the simple awk hashing program,
try AWK implementations other than GNU awk (e.g. I have had
some performance gains from switching to "mawk", the default awk in
Debian).

  $ printf "%s\n" a c b b b b b b c \
     | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
  a 1
  b 6
  c 2
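
To compare implementations on your own data, you could time both (a rough
sketch; "INPUT" is a placeholder file name):

  $ time gawk '{ a[$1]++ } END { for (i in a) print i, a[i] }' INPUT > /dev/null
  $ time mawk '{ a[$1]++ } END { for (i in a) print i, a[i] }' INPUT > /dev/null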

Or, if your input is exceedingly large, consider pre-processing it and
splitting it into smaller files - each one will contain fewer strings,
so hashing it will consume less memory.
The following example splits the input file into 27 files, based on
the first letter of the string (and an "other" file for non-letters):

     mawk '{ l = tolower(substr($0,1,1)) ;
             if (l>="a" && l<="z") {
               print $0 > l
             } else {
               print $0 > "other"
             }
           }' INPUT

This is an O(N) operation that consumes almost no memory (just lots of
disk I/O) - and the resulting files will be much smaller, so each can then be hashed with less memory.
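
For example, a second pass could then count each split file on its own
(just a sketch - "counts" is an arbitrary output name):

     for f in a b c d e f g h i j k l m n o p q r s t u v w x y z other ; do
       test -f "$f" &&
         mawk '{ a[$0]++ } END { for (i in a) print i, a[i] }' "$f"
     done > counts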

Of course this can be extended to split into finer-grained files.
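
For example, splitting on the first two characters (the "bucket_" prefix
is arbitrary; note that awk implementations limit how many files can be
open at once, so close() may be needed for very fine splits):

     mawk '{ p = tolower(substr($0,1,2)) ;
             gsub(/[^a-z0-9]/, "_", p) ;      # map unusual characters to "_"
             print $0 > ("bucket_" p)
           }' INPUT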

-assaf


