Re: [bug-gawk] Memory leak


From: Andrew J. Schorr
Subject: Re: [bug-gawk] Memory leak
Date: Mon, 27 Mar 2017 14:15:23 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

I don't see any memory growth at all using version 4.1.4.
Am I running this correctly?

bash-4.2$ gawk --version | head -1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.1, GNU MP 6.0.0)

bash-4.2$ wc sample4gnu.pip 10x.pip 
    345     345   91894 sample4gnu.pip
   3450    3450  918940 10x.pip
   3795    3795 1010834 total

bash-4.2$ /bin/time gawk -f test.awk sample4gnu.pip
 
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 9%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps

bash-4.2$ /bin/time gawk -f test.awk 10x.pip
 
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps

Regards,
Andy

On Mon, Mar 27, 2017 at 06:03:42PM +0000, Stephane Delsert wrote:
> Hi,
> 
> I've attached a small sample and a short script in case you want to create a 
> bigger file. The script does not change the initial order. My user-defined 
> sort function uses two internal tables, which could be one avenue to 
> investigate; I tried initializing those tables in the BEGIN block, but 
> without success. 
> Normally I use gawk as a filter for simple processing. The number of input 
> and output lines is huge, but the processing remains simple. The tool is 
> already very powerful, and I have processed several billion lines with 
> high performance; nevertheless, I will study all the opportunities that this 
> extension can offer.
> 
> Many thanks,
> Regards,
> 
> Stéphane.
> 
> 
> 
> -----Original Message-----
> From: Andrew J. Schorr [mailto:address@hidden 
> Sent: Monday, 27 March 2017 17:20
> To: Stephane Delsert <address@hidden>
> Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan 
> Sharma (LiveRamp) <address@hidden>
> Subject: Re: [bug-gawk] Memory leak
> 
> Hi,
> 
> Thanks for the bug report. Is it possible for you to supply a small sample 
> dataset that can be used with this script?
> 
> Also, gawk's array implementation currently incurs a lot of overhead for each 
> array entry saved. I think the last time I measured this, it was around 253 
> bytes per array element when the index and the value were both strings. Since 
> you are using numeric indices, the overhead should be less, but it still can 
> consume a tremendous amount of memory. If you load 320 million records, that 
> might come to tens of GB of overhead. Are you certain that the 
> PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a 
> problem with gawk array overhead.
> 
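To put rough numbers on the overhead estimate quoted above, here is a back-of-the-envelope sketch in Python. It assumes the 253 bytes per element figure applies uniformly, which overstates the cost for numeric indices, and the function name is only illustrative. The first case assumes all 320 million records sit in one array at once; the second uses the reported peak of 121 elements.

```python
# Rough array-overhead estimate. 253 bytes/element is the figure quoted
# above for string index + string value; treat it as an upper bound here.
BYTES_PER_ELEMENT = 253


def array_overhead_bytes(n_elements, per_element=BYTES_PER_ELEMENT):
    """Estimated overhead in bytes for n_elements array entries."""
    return n_elements * per_element


all_records = array_overhead_bytes(320_000_000)  # every record kept in memory
reported_peak = array_overhead_bytes(121)        # largest table size reported

print(f"320 million elements: {all_records / 1e9:.1f} GB")   # 81.0 GB
print(f"121 elements:         {reported_peak / 1e3:.1f} kB")  # 30.6 kB
```

If the tables really never exceed 121 live elements, per-element overhead alone would account for only about 30 kB, so it would not explain 20 GB of growth by itself.
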
> For working with massive datasets, you might consider trying the gawkextlib 
> lmdb extension. It is very fast and handles large key-value stores. You can 
> download it here:
>    https://sourceforge.net/projects/gawkextlib/files/
> 
> Regards,
> Andy
> 
> On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> > Hi,
> > 
> > We hit a memory leak with gawk when running the attached script. The 
> > script sorts a file that is already sorted on its primary keys by 
> > additional keys. To achieve this I used a user-defined function and set it 
> > as follows:
> > PROCINFO["sorted_in"]="__sort_subsort"
> > We noticed that the memory required by gawk grows with the number of 
> > records processed. Gawk terminated after more than 320 million records, 
> > with a memory size of over 20 GB. A post-mortem analysis showed that the 
> > maximum size of the script's tables was 121 elements.
> > I ran several tests, and the issue does not appear when I don't use the 
> > PROCINFO mechanism. For small files, the script works correctly.
> > 
> > I didn't see this kind of bug in the bug reports. I tested with versions 
> > 4.1.3 and 4.1.4, without success.
> > 
> > Thank you for your help.
> > 
> > Best regards,
> > 
> > Stéphane Delsert.
> > 
> 
> >     BEGIN {
> >             FS="|"
> >             OFS="|"
> >             
> >                     sort_old_key_1=""
> >                     sort_old_key_2=""
> >                     sort_old_key_3=""
> >                     sort_old_key_4=""
> >                     sort_old_key_5=""
> >                     sort_old_key_6=""
> >                     sort_old_key_7=""
> >                     sort_old_key_8=""
> >                     sort_old_key_9=""       
> >             split("", tab_store);
> >             split("", subsort_tab1);
> >             split("", subsort_tab2);
> >             nb_tab_store=0;
> >             PROCINFO["sorted_in"]="__sort_subsort"
> >     }
> >     {
> >             FIELD0=$1
> >             FIELD1=$2
> >             FIELD2=$3
> >             FIELD3=$4
> >             FIELD4=$5
> >             FIELD5=$6
> >             FIELD6=$7
> >             FIELD7=$8
> >             FIELD8=$9
> >             FIELD9=$10
> >             FIELD10=$11
> >             FIELD11=$12
> > 
> >             sort_key_1=" " FIELD2
> >             sort_key_2=" " FIELD3
> >             sort_key_3=" " FIELD4
> >             sort_key_4=" " FIELD5
> >             sort_key_5=" " FIELD6
> >             sort_key_6=" " FIELD7
> >             sort_key_7=" " FIELD8
> >             sort_key_8=" " FIELD1
> >             sort_key_9=" " FIELD9
> >             sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 : ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 : ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 : ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 : ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 : ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 : ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 : ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 : ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >             
> >     if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
> >             print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
> >             exit 9
> >     }
> >             
> >             sort_sec_key_1=" " FIELD11
> >             if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
> >                     if ( nb_tab_store > 1 ) {
> >                             for ( sort_tmp_line in tab_store )  {
> >                                     print tab_store[sort_tmp_line] ; 
> >                             }
> >                     }
> >                     else  {
> >                             if ( nb_tab_store > 0 )  {
> >                                     print tab_store[0] ; 
> >                             }
> >                     }
> >                     
> >                             sort_old_key_1= sort_key_1 
> >                             sort_old_key_2= sort_key_2 
> >                             sort_old_key_3= sort_key_3 
> >                             sort_old_key_4= sort_key_4 
> >                             sort_old_key_5= sort_key_5 
> >                             sort_old_key_6= sort_key_6 
> >                             sort_old_key_7= sort_key_7 
> >                             sort_old_key_8= sort_key_8 
> >                             sort_old_key_9= sort_key_9 
> >                     split("", tab_store);
> >                     nb_tab_store=0;
> >             }
> >             $1=$1
> >             tab_store[nb_tab_store] = sort_sec_key_1  OFS $0
> >             nb_tab_store += 1;
> >     }
> > 
> > 
> >     END {
> >             for ( sort_tmp_line in tab_store  ) {
> >                     print tab_store[sort_tmp_line] ; 
> >             }
> >     }
> >     function __sort_subsort(i1,v1,i2,v2) 
> >     {
> >             nb_subsort_tab1 = split(v1, subsort_tab1 );
> >             nb_subsort_tab2 = split(v2, subsort_tab2 );
> > 
> >             sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
> >             
> >             return(sort_sec_compare)        
> >     }
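
For anyone following the quoted script: as far as I can tell from the code, it buffers consecutive records that share the same nine primary keys in tab_store, then emits each group ordered by the secondary key (FIELD11), and stops if the primary keys ever go backwards. Below is a minimal Python sketch of that idea. It is not a translation of the script: the names subsort, PRIMARY and SECONDARY are hypothetical, fields are compared as plain strings, and the output is the bare record, whereas the awk script prints the stored value with the secondary key prepended.

```python
from itertools import groupby

# 0-based field positions, mirroring sort_key_1..sort_key_9 in the script
# (FIELD2..FIELD8, then FIELD1, then FIELD9) and the secondary key FIELD11.
PRIMARY = (2, 3, 4, 5, 6, 7, 8, 1, 9)
SECONDARY = 11


def subsort(lines, sep="|"):
    """Yield records sorted by SECONDARY within each run of equal PRIMARY keys.

    Only one group is held in memory at a time, as in the awk script.
    """
    rows = (line.rstrip("\n").split(sep) for line in lines)
    prev_key = None
    for key, group in groupby(rows, key=lambda r: tuple(r[i] for i in PRIMARY)):
        if prev_key is not None and key < prev_key:
            raise ValueError("input is not sorted on the primary keys")
        prev_key = key
        # sorted() is stable, so ties keep their input order.
        for row in sorted(group, key=lambda r: r[SECONDARY]):
            yield sep.join(row)


# Two records with identical primary keys and secondary keys "b" and "a".
demo = [
    "r1|k|k|k|k|k|k|k|k|k|x|b",
    "r2|k|k|k|k|k|k|k|k|k|x|a",
]
for line in subsort(demo):
    print(line)
```

The demo prints the r2 record before the r1 record, since "a" sorts before "b". Memory use in this sketch is bounded by the largest group, which is the behaviour one would expect from the awk script if nothing leaks.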

> Archive:  /var/tmp/samplegnu.zip
> Zip file size: 9071 bytes, number of entries: 3
> -rw-rw-r--  3.0 unx    91894 tx defN 17-Mar-27 13:30 sample4gnu.pip
> -rw-rw-r--  3.0 unx       56 tx defN 17-Mar-27 13:36 README.txt
> -rw-rw-r--  3.0 unx     3427 tx defN 17-Mar-27 10:15 test.awk
> 3 files, 95377 bytes uncompressed, 8601 bytes compressed:  91.0%


-- 
Andrew Schorr                      e-mail: address@hidden
Telemetry Investments, L.L.C.      phone:  917-305-1748
545 Fifth Ave, Suite 1108          fax:    212-425-5550
New York, NY 10017-3630
