Re: [bug-gawk] Memory leak


From: Stephane Delsert
Subject: Re: [bug-gawk] Memory leak
Date: Tue, 28 Mar 2017 08:00:38 +0000

Hi,

To see the problem you have to duplicate the sample file until it reaches at
least a couple of million records.
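
One way to scale the sample up could look like the sketch here. It assumes the sample file starts with a single header line and is already sorted on its primary keys; `scale_sample` and its arguments are hypothetical names, not part of the attached files. Repeating each data line in place, rather than concatenating whole copies, keeps the primary keys in non-decreasing order, which the sort check in test.awk requires.

```sh
# Hypothetical helper (not from the attached files): scale up a sorted sample.
# Usage: scale_sample FILE COPIES
# Assumption: line 1 of FILE is a header; it is dropped, much like
# "tail -n +2" does in the original pipeline.
scale_sample() {
    # Repeat every data line COPIES times in place so key order is preserved.
    awk -v copies="$2" 'NR > 1 { for (i = 0; i < copies; i++) print }' "$1"
}
```

For example, `scale_sample sample4gnu.pip 6000 | gawk -F '|' -f test.awk > /dev/null` would feed about 2 million records if the sample holds 344 data lines. Note that this approach also multiplies the size of every primary-key group by the same factor.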

I'm running the script on a virtual Linux server:
Linux 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
from a Red Hat 6.8 distribution.

I launched the process and stopped it at different points.

The command:
zcat sdelse1.pip.gz.5t | tail -n +2 | gawk -F '|' -f test.awk | gzip > out.sdelse1.pip.gz.5t
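
Readings like these could also be taken without stopping the process by hand. The helper here is only a sketch, assuming a Linux /proc filesystem; `rss_kb` is a hypothetical name.

```sh
# Hypothetical helper (not part of the thread): print the resident set size
# of a process in kB, read from the VmRSS line of /proc/PID/status (Linux only).
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Example: report the memory of the current shell.
echo "RSS of PID $$: $(rss_kb "$$") kB"
```

Calling `rss_kb` on the gawk PID at regular intervals, for instance from a `while sleep 60` loop, gives a series comparable to the RES column of top.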
 
This is what I see after the script has processed over 16 million records:
  PID USER    PR  NI   VIRT   RES   SHR S  %CPU     TIME+  %MEM COMMAND
14772 sdelse  20   0   213m  109m  1016 T   0.0   1:34.33   0.3 gawk

After 24 million records:
  PID USER    PR  NI   VIRT   RES   SHR S  %CPU     TIME+  %MEM COMMAND
14772 sdelse  20   0   287m  182m  1016 T   0.0   2:24.95   0.6 gawk

After 36 million records:
  PID USER    PR  NI   VIRT   RES   SHR S  %CPU     TIME+  %MEM COMMAND
14772 sdelse  20   0   548m  443m  1016 T   0.0   3:52.30   1.4 gawk

After 54 million records:
  PID USER    PR  NI   VIRT   RES   SHR S  %CPU     TIME+  %MEM COMMAND
14772 sdelse  20   0   950m  845m  1016 T   0.0   6:08.44   2.6 gawk

And after 110 million records:
  PID USER    PR  NI   VIRT   RES   SHR S  %CPU     TIME+  %MEM COMMAND
14772 sdelse  20   0  2164m  2.0g  1016 T   0.0  13:05.17   6.4 gawk

The memory leak appears to grow linearly with the number of records, and it
seems to be linked to memory allocated during calls to the user-defined sort
function.
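
As a rough check of the linearity claim, the growth between successive snapshots can be computed from the RES figures above. This is a sketch: `growth_rates` is a hypothetical helper, and the last RES value, shown by top as 2.0g, is assumed to be about 2048 MB.

```sh
# Hypothetical helper: read "records_in_millions RES_in_MB" pairs on stdin and
# print the memory growth per million records between consecutive snapshots.
growth_rates() {
    awk '
        NR > 1 {
            printf "%d-%d million records: %.0f MB per million\n",
                   prev_n, $1, ($2 - prev_res) / ($1 - prev_n)
        }
        { prev_n = $1; prev_res = $2 }
    '
}

# Figures copied from the top output; 2048 stands in for "2.0g" (assumption).
printf '%s\n' '16 109' '24 182' '36 443' '54 845' '110 2048' | growth_rates
```

With these inputs the rate is about 9 MB per million records between 16 and 24 million, then settles at roughly 21 to 22 MB per million, which is on the order of 20 bytes retained per input record.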

Many thanks.

Best regards,

Stéphane.




-----Original Message-----
From: Andrew J. Schorr [mailto:address@hidden]
Sent: Monday, 27 March 2017 20:15
To: Stephane Delsert <address@hidden>
Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma (LiveRamp) <address@hidden>
Subject: Re: [bug-gawk] Memory leak

Hi,

I don't see any memory growth at all using version 4.1.4.
Am I running this correctly?

bash-4.2$ gawk --version | head -1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.1, GNU MP 6.0.0)

bash-4.2$ wc sample4gnu.pip 10x.pip 
    345     345   91894 sample4gnu.pip
   3450    3450  918940 10x.pip
   3795    3795 1010834 total

bash-4.2$ /bin/time gawk -f test.awk sample4gnu.pip
 
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 9%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps

bash-4.2$ /bin/time gawk -f test.awk 10x.pip
 
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps

Regards,
Andy

On Mon, Mar 27, 2017 at 06:03:42PM +0000, Stephane Delsert wrote:
> Hi,
> 
> I've attached a small sample and a short script in case you want to create a
> bigger file. This script doesn't change the initial order. My user-defined
> sort function uses 2 internal tables, which could be a lead worth
> investigating; I tried initializing those tables in the BEGIN block, but it
> made no difference.
> Normally I use gawk as a filter for simple processing. The number of input
> and output lines is huge, but the processing remains simple. The tool is
> already very powerful, and I have processed several billion lines with high
> performance; nevertheless, I will study all the opportunities that this
> extension can offer.
> 
> Many thanks,
> Regards,
> 
> Stéphane.
> 
> 
> 
> -----Original Message-----
> From: Andrew J. Schorr [mailto:address@hidden]
> Sent: Monday, 27 March 2017 17:20
> To: Stephane Delsert <address@hidden>
> Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma (LiveRamp) <address@hidden>
> Subject: Re: [bug-gawk] Memory leak
> 
> Hi,
> 
> Thanks for the bug report. Is it possible for you to supply a small sample 
> dataset that can be used with this script?
> 
> Also, gawk's array implementation currently incurs a lot of overhead for each 
> array entry saved. I think the last time I measured this, it was around 253 
> bytes per array element when the index and the value were both strings. Since 
> you are using numeric indices, the overhead should be less, but it still can 
> consume a tremendous amount of memory. If you load 320 million records, that 
> might come to tens of GB of overhead. Are you certain that the 
> PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a 
> problem with gawk array overhead.
> 
> For working with massive datasets, you might consider trying the gawkextlib 
> lmdb extension. It is very fast and handles large key-value stores. You can 
> download it here:
>    https://sourceforge.net/projects/gawkextlib/files/
> 
> Regards,
> Andy
> 
> On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> > Hi,
> > 
> > We hit a memory leak in gawk with the attached script. This script sorts a 
> > file that is already sorted on its primary keys by additional keys. To 
> > achieve this I wrote a user-defined function and selected it as follows:
> > PROCINFO["sorted_in"]="__sort_subsort"
> > We noticed that the memory required by gawk grows with the number of 
> > records processed. Gawk terminated after more than 320 million records, 
> > when its memory size was over 20 GB. A post-mortem analysis showed that 
> > the tables in the script never held more than 121 elements.
> > I ran different tests, and the issue does not occur when I don't use the 
> > PROCINFO mechanism. For small files, this script works correctly.
> > 
> > I didn't see this kind of bug in the bug reports. I tested versions 4.1.3 
> > and 4.1.4, and both show the problem.
> > 
> > Thank you for your help.
> > 
> > Best regards,
> > 
> > Stéphane Delsert.
> > 
> 
> > BEGIN {
> >     FS="|"
> >     OFS="|"
> >
> >     sort_old_key_1=""
> >     sort_old_key_2=""
> >     sort_old_key_3=""
> >     sort_old_key_4=""
> >     sort_old_key_5=""
> >     sort_old_key_6=""
> >     sort_old_key_7=""
> >     sort_old_key_8=""
> >     sort_old_key_9=""
> >     split("", tab_store);
> >     split("", subsort_tab1);
> >     split("", subsort_tab2);
> >     nb_tab_store=0;
> >     PROCINFO["sorted_in"]="__sort_subsort"
> > }
> > {
> >     FIELD0=$1
> >     FIELD1=$2
> >     FIELD2=$3
> >     FIELD3=$4
> >     FIELD4=$5
> >     FIELD5=$6
> >     FIELD6=$7
> >     FIELD7=$8
> >     FIELD8=$9
> >     FIELD9=$10
> >     FIELD10=$11
> >     FIELD11=$12
> >
> >     sort_key_1=" " FIELD2
> >     sort_key_2=" " FIELD3
> >     sort_key_3=" " FIELD4
> >     sort_key_4=" " FIELD5
> >     sort_key_5=" " FIELD6
> >     sort_key_6=" " FIELD7
> >     sort_key_7=" " FIELD8
> >     sort_key_8=" " FIELD1
> >     sort_key_9=" " FIELD9
> >     sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 :  ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 :  ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 :  ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 :  ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 :  ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 :  ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 :  ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 :  ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >
> >     if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
> >         print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
> >         exit 9
> >     }
> >
> >     sort_sec_key_1=" " FIELD11
> >     if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
> >         if ( nb_tab_store > 1 ) {
> >             for ( sort_tmp_line in tab_store )  {
> >                 print tab_store[sort_tmp_line] ;
> >             }
> >         }
> >         else  {
> >             if ( nb_tab_store > 0 )  {
> >                 print tab_store[0] ;
> >             }
> >         }
> >
> >         sort_old_key_1= sort_key_1
> >         sort_old_key_2= sort_key_2
> >         sort_old_key_3= sort_key_3
> >         sort_old_key_4= sort_key_4
> >         sort_old_key_5= sort_key_5
> >         sort_old_key_6= sort_key_6
> >         sort_old_key_7= sort_key_7
> >         sort_old_key_8= sort_key_8
> >         sort_old_key_9= sort_key_9
> >         split("", tab_store);
> >         nb_tab_store=0;
> >     }
> >     $1=$1
> >     tab_store[nb_tab_store] = sort_sec_key_1  OFS $0
> >     nb_tab_store += 1;
> > }
> >
> >
> > END {
> >     for ( sort_tmp_line in tab_store  ) {
> >         print tab_store[sort_tmp_line] ;
> >     }
> > }
> > function __sort_subsort(i1,v1,i2,v2)
> > {
> >     nb_subsort_tab1 = split(v1, subsort_tab1 );
> >     nb_subsort_tab2 = split(v2, subsort_tab2 );
> >
> >     sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
> >
> >     return(sort_sec_compare)
> > }

> Archive:  /var/tmp/samplegnu.zip
> Zip file size: 9071 bytes, number of entries: 3
> -rw-rw-r--  3.0 unx    91894 tx defN 17-Mar-27 13:30 sample4gnu.pip
> -rw-rw-r--  3.0 unx       56 tx defN 17-Mar-27 13:36 README.txt
> -rw-rw-r--  3.0 unx     3427 tx defN 17-Mar-27 10:15 test.awk
> 3 files, 95377 bytes uncompressed, 8601 bytes compressed:  91.0%


-- 
Andrew Schorr                      e-mail: address@hidden
Telemetry Investments, L.L.C.      phone:  917-305-1748
545 Fifth Ave, Suite 1108          fax:    212-425-5550
New York, NY 10017-3630


