[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] shuf: use reservoir-sampling when possible
From: |
Assaf Gordon |
Subject: |
Re: [PATCH] shuf: use reservoir-sampling when possible |
Date: |
Mon, 11 Mar 2013 17:10:39 -0400 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4 |
Hello,
Pádraig Brady wrote, On 03/07/2013 06:26 PM:
> On 03/07/2013 07:32 PM, Assaf Gordon wrote:
>> Pádraig Brady wrote, On 03/06/2013 08:24 PM:
>>> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>>>> Attached is a suggestion to implement reservoir-sampling in shuf:
>>>> When the expected output of lines is known, it will not load the entire
>>>> file into memory - allowing shuffling very large inputs.
>>>>
>>
>>
>>>> static size_t
>>>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline,
>>>> size_t k,
>>>> struct randint_source *s)
>> <...>
>>>> struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init
>>>> reservoir*/
>>>
>>> Since this change is mainly about efficient mem usage we should probably
>>> handle
>>> the case where we have small inputs but large k. This will allocate (and
>>> zero)
>>> memory up front. The zeroing will defeat any memory overcommit configured
>>> on the
>>> system, but it's probably better to avoid the large initial commit and
>>> realloc
>>> as required (not per line, but per 1K lines maybe).
>>>
>>
Attached is an updated version (mostly a re-write of the memory allocation
part), as per the comment above.
Also includes a very_expensive valgrind test to exercise the new code.
(and the other patch is the uniform-distribution randomness test).
-gordon
0001-shuf-add-expensive-test-for-randomness.patch
Description: Text Data
0002-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data