coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling


From: Pádraig Brady
Subject: Re: Shuf reservoir sampling
Date: Sun, 29 Dec 2013 13:09:58 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 12/28/2013 05:13 PM, Jyothis V wrote:
> Hi, thanks for the reply. I understand why something like reservoir sampling 
> is needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir 
> sampling, an array of length head_length is obtained. At this stage, the 
> array is not completely shuffled because the first head_length elements were 
> copied verbatim. 2) This array is then randomly permuted while writing. My 
> question is whether these two steps could be clubbed together, just as shown 
> in the second algorithm given in the wikipedia page you mentioned.

Yes there are more improvements possible in this area.

Also we should avoid the more CPU intensive reservoir sampling
in the common case of reading small inputs from stdin:
http://lists.gnu.org/archive/html/coreutils/2013-03/msg00057.html
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/shuf.c;h=456140#l272

Also reservoir sampling is currently not used with --repeat.
But it does seem possible to use reservoir-sampling with replacement:
https://www.siam.org/proceedings/datamining/2004/dm04_053parkb.pdf

thanks,
Pádraig.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]