Re: [bug-gawk] Memory leak


From: Andrew J. Schorr
Subject: Re: [bug-gawk] Memory leak
Date: Mon, 27 Mar 2017 11:19:32 -0400
User-agent: Mutt/1.5.21 (2010-09-15)

Hi,

Thanks for the bug report. Is it possible for you to supply a small sample dataset
that can be used with this script?
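
If real data cannot be shared, a synthetic file may be enough. The following is only a sketch, not something taken from the report: `gen_sample` and its two arguments are hypothetical, and it assumes 12 pipe-delimited fields with the third field acting as the leading primary key and the twelfth as the secondary key, which is how the script appears to use them.

```shell
# Hypothetical generator: emits groups of records that are already sorted
# on the primary key ($3) but carry an unsorted secondary key ($12).
gen_sample() {
    # $1 = number of primary-key groups, $2 = records per group
    awk -v groups="$1" -v per="$2" 'BEGIN {
        OFS = "|"
        srand(1)    # fixed seed so repeated runs give the same file
        for (g = 1; g <= groups; g++) {
            # zero-padded so that string order matches numeric order
            key = sprintf("%09d", g)
            for (r = 1; r <= per; r++) {
                sec = sprintf("%05d", int(rand() * 100000))
                print "id" g "_" r, "a", key, "b", "c", "d", "e", "f", "g", "h", "x", sec
            }
        }
    }'
}

# Example: 1000 groups of 100 records each
# gen_sample 1000 100 > sample.txt
```

Scaling the two arguments up should show whether memory use follows the number of records or the group size.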

Also, gawk's array implementation currently incurs a lot of overhead for each
array entry saved. I think the last time I measured this, it was around 253
bytes per array element when the index and the value were both strings. Since
you are using numeric indices, the overhead should be less, but it still can
consume a tremendous amount of memory. If you load 320 million records, that
might come to tens of GB of overhead. Are you certain that the
PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a
problem with gawk array overhead.
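
As a rough back-of-the-envelope check of that estimate (a sketch only: it assumes every record stays in an array at the same time, and it uses the 253-byte string/string figure as an upper bound, since numeric indices should cost less). The helper name `estimate_overhead_gb` is made up for illustration.

```shell
# Hypothetical helper: estimated array overhead in GB (10^9 bytes)
# $1 = number of array elements, $2 = overhead in bytes per element
estimate_overhead_gb() {
    awk -v n="$1" -v bytes="$2" 'BEGIN { printf "%.1f\n", n * bytes / 1e9 }'
}

estimate_overhead_gb 320000000 253    # prints 81.0
```

Even if the real per-element cost with numeric indices were a fraction of that, the total would still be in the tens of GB.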

For working with massive datasets, you might consider trying the gawkextlib lmdb
extension. It is very fast and handles large key-value stores. You can download
it here:
   https://sourceforge.net/projects/gawkextlib/files/

Regards,
Andy

On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> Hi,
> 
> We hit a memory leak with gawk when running the attached script. This script
> takes a file that is already sorted on its primary keys and sorts it on
> additional keys. To achieve this, I wrote a user-defined function and set it
> as follows:
> PROCINFO["sorted_in"]="__sort_subsort"
> We noticed that the memory required by gawk grows with the number of records
> processed. Gawk terminated after more than 320 million records, when its
> memory size was over 20 GB. A post-mortem analysis showed that the maximum
> size of the arrays in the script was 121 elements.
> I ran several tests, and the issue does not occur when I do not use the
> PROCINFO mechanism. For small files, the script works correctly.
> 
> I did not find this kind of bug in the existing bug reports. I tested with
> versions 4.1.3 and 4.1.4, and the problem occurs in both.
> 
> Thank you for your help.
> 
> Best regards,
> 
> Stéphane Delsert.
> 

>       BEGIN {
>               FS="|"
>               OFS="|"
>               
>                       sort_old_key_1=""
>                       sort_old_key_2=""
>                       sort_old_key_3=""
>                       sort_old_key_4=""
>                       sort_old_key_5=""
>                       sort_old_key_6=""
>                       sort_old_key_7=""
>                       sort_old_key_8=""
>                       sort_old_key_9=""       
>               split("", tab_store);
>               split("", subsort_tab1);
>               split("", subsort_tab2);
>               nb_tab_store=0;
>               PROCINFO["sorted_in"]="__sort_subsort"
>       }
>       {
>               FIELD0=$1
>               FIELD1=$2
>               FIELD2=$3
>               FIELD3=$4
>               FIELD4=$5
>               FIELD5=$6
>               FIELD6=$7
>               FIELD7=$8
>               FIELD8=$9
>               FIELD9=$10
>               FIELD10=$11
>               FIELD11=$12
>
>               sort_key_1=" " FIELD2
>               sort_key_2=" " FIELD3
>               sort_key_3=" " FIELD4
>               sort_key_4=" " FIELD5
>               sort_key_5=" " FIELD6
>               sort_key_6=" " FIELD7
>               sort_key_7=" " FIELD8
>               sort_key_8=" " FIELD1
>               sort_key_9=" " FIELD9
>               sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 : ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 : ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 : ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 : ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 : ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 : ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 : ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 : ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
>               
>       if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
>               print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
>               exit 9
>       }
>               
>               sort_sec_key_1=" " FIELD11
>               if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
>                       if ( nb_tab_store > 1 ) {
>                               for ( sort_tmp_line in tab_store )  {
>                                       print tab_store[sort_tmp_line] ; 
>                               }
>                       }
>                       else  {
>                               if ( nb_tab_store > 0 )  {
>                                       print tab_store[0] ; 
>                               }
>                       }
>                       
>                               sort_old_key_1= sort_key_1 
>                               sort_old_key_2= sort_key_2 
>                               sort_old_key_3= sort_key_3 
>                               sort_old_key_4= sort_key_4 
>                               sort_old_key_5= sort_key_5 
>                               sort_old_key_6= sort_key_6 
>                               sort_old_key_7= sort_key_7 
>                               sort_old_key_8= sort_key_8 
>                               sort_old_key_9= sort_key_9 
>                       split("", tab_store);
>                       nb_tab_store=0;
>               }
>               $1=$1
>               tab_store[nb_tab_store] = sort_sec_key_1  OFS $0
>               nb_tab_store += 1;
>       }
> 
> 
>       END {
>               for ( sort_tmp_line in tab_store  ) {
>                       print tab_store[sort_tmp_line] ; 
>               }
>       }
>       function __sort_subsort(i1,v1,i2,v2) 
>       {
>               nb_subsort_tab1 = split(v1, subsort_tab1 );
>               nb_subsort_tab2 = split(v2, subsort_tab2 );
> 
>               sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
>               
>               return(sort_sec_compare)        
>       }
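
One way to check whether PROCINFO["sorted_in"] itself drives the memory growth is to isolate the traversal pattern from the input data. The sketch below is not the reporter's script: `sorted_in_stress`, `cmp_first_field`, and the 100-element array are hypothetical stand-ins that only mimic the fill, sorted traversal, and clear cycle the script performs for each group. It requires gawk.

```shell
# Hypothetical stress test: repeat the fill / traverse / clear cycle
# with or without a user-defined comparison function.
sorted_in_stress() {
    # $1 = number of cycles, $2 = 1 to use the custom comparator, 0 to skip it
    gawk -v iters="$1" -v use_sort="$2" '
    # Compare two values on their first "|"-separated field, as strings
    function cmp_first_field(i1, v1, i2, v2,    a, b) {
        split(v1, a, "|")
        split(v2, b, "|")
        return (a[1] < b[1]) ? -1 : ((a[1] == b[1]) ? 0 : 1)
    }
    BEGIN {
        if (use_sort)
            PROCINFO["sorted_in"] = "cmp_first_field"
        for (n = 0; n < iters; n++) {
            delete tab
            for (k = 0; k < 100; k++)
                tab[k] = " " (100 - k) "|payload"
            for (idx in tab)        # sorted traversal when use_sort is set
                count++
        }
        print count                 # iters * 100 elements visited
    }'
}
```

Running `sorted_in_stress 1000000 1` and `sorted_in_stress 1000000 0` under a tool that reports peak memory (for example GNU time with `-v`) and comparing the two figures should show whether the growth depends on the comparator or on the array contents.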


