coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling


From: Bernhard Voelker
Subject: Re: Shuf reservoir sampling
Date: Sat, 28 Dec 2013 13:49:53 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

On 12/28/2013 10:53 AM, Jyothis V wrote:
> I would like to know why shuf.c is using reservoir sampling +
> write_permuted_output_reservoir rather than just using an
> inside-out version Fisher-Yates shuffle.

Reservoir sampling has been added to shuf in the youngest coreutils
release 8.22 via this commit:

  http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0

The NEWS file tells us why this is more efficient than before:

  http://git.savannah.gnu.org/cgit/coreutils.git/tree/NEWS?id=20d7bce0#n28

  shuf outputs subsets of large inputs much more efficiently.
  Reservoir sampling is used to limit memory usage based on the number of
  outputs, rather than the number of inputs.

This advantage is also mentioned on Wikipedia:

  
http://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle

  [...]
  Note that although the rest of the cards are shuffled, only the top k are
  important in the present context. Therefore, the array a need only track
  the cards in the top k positions while performing the shuffle, reducing
  the amount of memory needed. Truncating a to length k, the algorithm is
  modified accordingly:
  [...]

Have a nice day,
Berny



reply via email to

[Prev in Thread] Current Thread [Next in Thread]