[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Shuf reservoir sampling
From: |
Bernhard Voelker |
Subject: |
Re: Shuf reservoir sampling |
Date: |
Sat, 28 Dec 2013 13:49:53 +0100 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 |
On 12/28/2013 10:53 AM, Jyothis V wrote:
> I would like to know why shuf.c is using reservoir sampling +
> write_permuted_output_reservoir rather than just using an
> inside-out version Fisher-Yates shuffle.
Reservoir sampling has been added to shuf in the youngest coreutils
release 8.22 via this commit:
http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0
The NEWS file tells us why this is more efficient than before:
http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28
shuf outputs subsets of large inputs much more efficiently.
Reservoir sampling is used to limit memory usage based on the number of
outputs, rather than the number of inputs.
This advantage is also mentioned on Wikipedia:
http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle
[...]
Note that although the rest of the cards are shuffled, only the top k are
important in the present context. Therefore, the array a need only track
the cards in the top k positions while performing the shuffle, reducing
the amount of memory needed. Truncating a to length k, the algorithm is
modified accordingly:
[...]
Have a nice day,
Berny