Re: How to sort and count efficiently?


From: Assaf Gordon
Subject: Re: How to sort and count efficiently?
Date: Sun, 30 Jun 2019 11:53:57 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2


On 2019-06-30 11:10 a.m., Peng Yu wrote:
> The problem with this kind of awk program is that everything will be loaded to memory.

Well, those are the two main options: store everything in memory, or resort
to disk I/O. Each has its own pros and cons.

> But bare `sort` use external files to save memory.
Not exactly - the goal is not to "save" memory.
Sort resorts to external temporary files so that it can complete the
sort even when it runs out of the allotted memory (which can be
controlled with the "-S" parameter).
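
For example, a minimal sort-based counting pipeline (just a sketch;
"INPUT" and the 1G memory cap are placeholders) could be:

  $ sort -S 1G INPUT | uniq -c

sort spills to temporary files once its buffer is full, and "uniq -c"
prefixes each distinct line with its count.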

I'm not familiar with a program that implements hashing backed by file
storage, but perhaps such a program exists.

> When the hash in awk is too large, accessing it can become very slow (maybe due to potential cache miss or slow down of hash as a function of hash size).

Nothing is "free", and using a hash incurs its own costs.

If you're using the simple awk hashing program,
try AWK implementations other than GNU awk (e.g. I have had
some performance gains from switching to "mawk", the default awk in
Debian).

  $ printf "%s\n" a c b b b b b b c \
     | mawk 'a[$1]++ {} END { for (i in a) { print i, a[i] } }'
  a 1
  b 6
  c 2
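
To compare implementations on your own data, you could time both (a rough
sketch; "INPUT" is a placeholder file name):

  $ time gawk '{ a[$1]++ } END { for (i in a) print i, a[i] }' INPUT > /dev/null
  $ time mawk '{ a[$1]++ } END { for (i in a) print i, a[i] }' INPUT > /dev/null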

Or, if your input is exceedingly large, consider pre-processing it and
splitting it into smaller files - each one will contain fewer strings,
so hashing it will consume less memory.
The following example splits the input file into 27 files, based on
the first letter of the string (and an "other" file for non-letters):

     mawk '{ l = tolower(substr($0,1,1)) ;
             if (l>="a" && l<="z") {
               print $0 > l
             } else {
               print $0 > "other"
             }
           }' INPUT

This is an O(N) operation that consumes almost no memory (just lots of
disk I/O) - and the resulting files will be much smaller, so each can then be hashed with less memory.
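
For example, a second pass could then count each split file on its own
(just a sketch - "counts" is an arbitrary output name):

     for f in a b c d e f g h i j k l m n o p q r s t u v w x y z other ; do
       test -f "$f" &&
         mawk '{ a[$0]++ } END { for (i in a) print i, a[i] }' "$f"
     done > counts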

Of course this can be extended to split into finer-grained files.
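
For example, splitting on the first two characters (the "bucket_" prefix
is arbitrary; note that awk implementations limit how many files can be
open at once, so close() may be needed for very fine splits):

     mawk '{ p = tolower(substr($0,1,2)) ;
             gsub(/[^a-z0-9]/, "_", p) ;      # map unusual characters to "_"
             print $0 > ("bucket_" p)
           }' INPUT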

-assaf


