Re: [bug-gawk] Memory leak
From: Stephane Delsert
Subject: Re: [bug-gawk] Memory leak
Date: Tue, 28 Mar 2017 08:00:38 +0000
Hi,
To see the problem, you have to duplicate the sample file until it reaches at
least a couple of million records.
I'm running the script on a virtual Linux server:
Linux 2.6.32-642.el6.x86_64 #1 SMP Wed Apr 13 00:51:26 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
from a Red Hat 6.8 distribution.
I launched the process and stopped it at different points.
The command:
zcat sdelse1.pip.gz.5t | tail -n +2 | gawk -F '|' -f test.awk | gzip > out.sdelse1.pip.gz.5t
This is what I have after the script has processed over 16 million records:
  PID USER   PR NI  VIRT  RES  SHR S %CPU    TIME+ %MEM COMMAND
14772 sdelse 20  0  213m 109m 1016 T  0.0  1:34.33  0.3 gawk
After 24 million records:
  PID USER   PR NI  VIRT  RES  SHR S %CPU    TIME+ %MEM COMMAND
14772 sdelse 20  0  287m 182m 1016 T  0.0  2:24.95  0.6 gawk
After 36 million records:
  PID USER   PR NI  VIRT  RES  SHR S %CPU    TIME+ %MEM COMMAND
14772 sdelse 20  0  548m 443m 1016 T  0.0  3:52.30  1.4 gawk
After 54 million records:
  PID USER   PR NI  VIRT  RES  SHR S %CPU    TIME+ %MEM COMMAND
14772 sdelse 20  0  950m 845m 1016 T  0.0  6:08.44  2.6 gawk
And after 110 million records:
  PID USER   PR NI  VIRT  RES  SHR S %CPU    TIME+ %MEM COMMAND
14772 sdelse 20  0 2164m 2.0g 1016 T  0.0 13:05.17  6.4 gawk
The growth of the leak appears to be linear, and it seems to be linked to a
memory allocation made during each call to the sort function.
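As a rough check of the "linear" observation, the figures from the top output above can be turned into bytes of resident memory gained per record between successive samples. This is only a sketch: it assumes top's "m" and "g" suffixes mean MiB and GiB, and the last RES value (2.0g) is rounded, so 2048 is approximate. The function name growth_per_record is made up for illustration.

```shell
# Records processed (millions) and RES (MiB), read off the top output above.
# Assumption: 2.0g is taken as 2048 MiB, which is only approximate.
growth_per_record() {
    awk 'BEGIN {
        n = split("16 24 36 54 110", recs, " ")
        split("109 182 443 845 2048", res, " ")
        for (i = 2; i <= n; i++) {
            d_bytes = (res[i] - res[i-1]) * 1024 * 1024
            d_recs  = (recs[i] - recs[i-1]) * 1000000
            printf "%3d -> %3d million records: %.1f bytes per record\n",
                   recs[i-1], recs[i], d_bytes / d_recs
        }
    }'
}
growth_per_record
```

Under these assumptions the first interval comes out at about 9.6 bytes per record and the three later intervals at about 22 to 23 bytes per record, which is consistent with roughly linear growth from 24 million records onward.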
Many thanks.
Best regards,
Stéphane.
-----Original Message-----
From: Andrew J. Schorr [mailto:address@hidden]
Sent: Monday, 27 March 2017 20:15
To: Stephane Delsert <address@hidden>
Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma (LiveRamp) <address@hidden>
Subject: Re: [bug-gawk] Memory leak
Hi,
I don't see any memory growth at all using version 4.1.4.
Am I running this correctly?
bash-4.2$ gawk --version | head -1
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.1, GNU MP 6.0.0)
bash-4.2$ wc sample4gnu.pip 10x.pip
345 345 91894 sample4gnu.pip
3450 3450 918940 10x.pip
3795 3795 1010834 total
bash-4.2$ /bin/time gawk -f test.awk sample4gnu.pip
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 9%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps
bash-4.2$ /bin/time gawk -f test.awk 10x.pip
FIELD11|FIELD0|FIELD1|FIELD2|FIELD3|FIELD4|FIELD5|FIELD6|FIELD7|FIELD8|FIELD9|FIELD10|FIELD11
0.00user 0.00system 0:00.01elapsed 0%CPU (0avgtext+0avgdata 1748maxresident)k
0inputs+8outputs (0major+518minor)pagefaults 0swaps
Regards,
Andy
On Mon, Mar 27, 2017 at 06:03:42PM +0000, Stephane Delsert wrote:
> Hi,
>
> I've attached a small sample and a small script in case you want to create a
> bigger file. This script doesn't change the initial order. My user-defined
> sort function uses 2 internal tables, which could be a lead to investigate; I
> tried initializing those tables in the BEGIN block, but without success.
> Normally I use gawk as a filter for simple processing. The number of input
> and output lines is huge, but the processing remains simple. The tool is
> already very powerful, and I have processed several billion lines with high
> performance; nevertheless, I will study all the opportunities that this
> extension can offer.
>
> Many thanks,
> Regards,
>
> Stéphane.
>
>
>
> -----Original Message-----
> From: Andrew J. Schorr [mailto:address@hidden]
> Sent: Monday, 27 March 2017 17:20
> To: Stephane Delsert <address@hidden>
> Cc: address@hidden; Fatima Aliane <address@hidden>; Vihan_Sharma - Vihan Sharma (LiveRamp) <address@hidden>
> Subject: Re: [bug-gawk] Memory leak
>
> Hi,
>
> Thanks for the bug report. Is it possible for you to supply a small sample
> dataset that can be used with this script?
>
> Also, gawk's array implementation currently incurs a lot of overhead for each
> array entry saved. I think the last time I measured this, it was around 253
> bytes per array element when the index and the value were both strings. Since
> you are using numeric indices, the overhead should be less, but it still can
> consume a tremendous amount of memory. If you load 320 million records, that
> might come to tens of GB of overhead. Are you certain that the
> PROCINFO["sorted_in"] setting really matters? I wonder if this is simply a
> problem with gawk array overhead.
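The overhead estimate in the paragraph above can be worked through numerically. The sketch below uses only figures that appear in this thread: 320 million records, about 253 bytes per element (the value quoted for string index and string value, so an upper bound for numeric indices), and the 121-element maximum table size given in the original report. The function name array_overhead_estimate is made up for illustration.

```shell
# Back-of-the-envelope array overhead, using the per-element figure quoted above.
array_overhead_estimate() {
    awk 'BEGIN {
        per_elem = 253                 # bytes per element (string index and value)
        all_recs = 320 * 1000000       # record count from the original report
        total = all_recs * per_elem
        printf "all 320 million records resident: %.1f GB (%.1f GiB)\n",
               total / 1e9, total / (1024 * 1024 * 1024)
        # The report says the tables never held more than 121 elements.
        printf "121 elements resident: %d bytes\n", 121 * per_elem
    }'
}
array_overhead_estimate
```

On these numbers, array overhead would reach tens of GB only if every record stayed in the array; with at most 121 elements held at a time, the same per-element figure gives about 30 KB.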
>
> For working with massive datasets, you might consider trying the gawkextlib
> lmdb extension. It is very fast and handles large key-value stores. You can
> download it here:
> https://sourceforge.net/projects/gawkextlib/files/
>
> Regards,
> Andy
>
> On Mon, Mar 27, 2017 at 02:42:28PM +0000, Stephane Delsert wrote:
> > Hi,
> >
> > We hit a memory leak in gawk with the attached script. The script takes a
> > file that is already sorted on its primary keys and sorts it on additional
> > keys. To achieve this I wrote a user-defined function and registered it as
> > follows:
> > PROCINFO["sorted_in"]="__sort_subsort"
> > We noticed that the memory required by gawk grows with the number of
> > records processed. Gawk terminated after more than 320 million records, at
> > which point its memory size was over 20 GB. A post-mortem analysis showed
> > that the largest table in the script held at most 121 elements.
> > I ran different tests, and the issue does not occur when I don't use the
> > PROCINFO mechanism. For small files, the script works correctly.
> >
> > I didn't find this kind of bug in the existing bug reports. I tested
> > versions 4.1.3 and 4.1.4, and neither solved the problem.
> >
> > Thank you for your help.
> >
> > Best regards,
> >
> > Stéphane Delsert.
> >
>
> > BEGIN {
> >     FS="|"
> >     OFS="|"
> >
> >     sort_old_key_1=""
> >     sort_old_key_2=""
> >     sort_old_key_3=""
> >     sort_old_key_4=""
> >     sort_old_key_5=""
> >     sort_old_key_6=""
> >     sort_old_key_7=""
> >     sort_old_key_8=""
> >     sort_old_key_9=""
> >     split("", tab_store);
> >     split("", subsort_tab1);
> >     split("", subsort_tab2);
> >     nb_tab_store=0;
> >     PROCINFO["sorted_in"]="__sort_subsort"
> > }
> > {
> >     FIELD0=$1
> >     FIELD1=$2
> >     FIELD2=$3
> >     FIELD3=$4
> >     FIELD4=$5
> >     FIELD5=$6
> >     FIELD6=$7
> >     FIELD7=$8
> >     FIELD8=$9
> >     FIELD9=$10
> >     FIELD10=$11
> >     FIELD11=$12
> >
> >     sort_key_1=" " FIELD2
> >     sort_key_2=" " FIELD3
> >     sort_key_3=" " FIELD4
> >     sort_key_4=" " FIELD5
> >     sort_key_5=" " FIELD6
> >     sort_key_6=" " FIELD7
> >     sort_key_7=" " FIELD8
> >     sort_key_8=" " FIELD1
> >     sort_key_9=" " FIELD9
> >     sort_prim_compare = ( ( sort_old_key_1 < sort_key_1 ) ? -1 : ( ( sort_old_key_1 == sort_key_1 ) ? 0 : 1 ) );
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_2 < sort_key_2 ) ? -1 : ( ( ( sort_old_key_2 == sort_key_2 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_3 < sort_key_3 ) ? -1 : ( ( ( sort_old_key_3 == sort_key_3 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_4 < sort_key_4 ) ? -1 : ( ( ( sort_old_key_4 == sort_key_4 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_5 < sort_key_5 ) ? -1 : ( ( ( sort_old_key_5 == sort_key_5 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_6 < sort_key_6 ) ? -1 : ( ( ( sort_old_key_6 == sort_key_6 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_7 < sort_key_7 ) ? -1 : ( ( ( sort_old_key_7 == sort_key_7 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_8 < sort_key_8 ) ? -1 : ( ( ( sort_old_key_8 == sort_key_8 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >     sort_prim_compare = ( sort_prim_compare == 0 ) ? ( ( sort_old_key_9 < sort_key_9 ) ? -1 : ( ( ( sort_old_key_9 == sort_key_9 ) ? 0 : 1 ) ) ) : sort_prim_compare ;
> >
> >     if ( ( sort_prim_compare > 0 ) && ( NR > 1 ) ) {
> >         print "file not correctly sorted at " NR " line " > ".sortcsv.sh_14831_S.acx_error_message.9d"
> >         exit 9
> >     }
> >
> >     sort_sec_key_1=" " FIELD11
> >     if ( ( sort_prim_compare != 0 ) || ( NR == 1 ) ) {
> >         if ( nb_tab_store > 1 ) {
> >             for ( sort_tmp_line in tab_store ) {
> >                 print tab_store[sort_tmp_line] ;
> >             }
> >         }
> >         else {
> >             if ( nb_tab_store > 0 ) {
> >                 print tab_store[0] ;
> >             }
> >         }
> >
> >         sort_old_key_1= sort_key_1
> >         sort_old_key_2= sort_key_2
> >         sort_old_key_3= sort_key_3
> >         sort_old_key_4= sort_key_4
> >         sort_old_key_5= sort_key_5
> >         sort_old_key_6= sort_key_6
> >         sort_old_key_7= sort_key_7
> >         sort_old_key_8= sort_key_8
> >         sort_old_key_9= sort_key_9
> >         split("", tab_store);
> >         nb_tab_store=0;
> >     }
> >     $1=$1
> >     tab_store[nb_tab_store] = sort_sec_key_1 OFS $0
> >     nb_tab_store += 1;
> > }
> >
> >
> > END {
> >     for ( sort_tmp_line in tab_store ) {
> >         print tab_store[sort_tmp_line] ;
> >     }
> > }
> > function __sort_subsort(i1,v1,i2,v2)
> > {
> >     nb_subsort_tab1 = split(v1, subsort_tab1 );
> >     nb_subsort_tab2 = split(v2, subsort_tab2 );
> >
> >     sort_sec_compare = ( ( subsort_tab1[1] < subsort_tab2[1] ) ? -1 : ( ( subsort_tab1[1] == subsort_tab2[1] ) ? 0 : 1 ) );
> >
> >     return(sort_sec_compare)
> > }
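For readers unfamiliar with the mechanism the script relies on, here is a minimal sketch of a user-defined comparison function registered through PROCINFO["sorted_in"]. It is not the reporter's code: the function name by_first_field and the sample rows are made up for illustration, and it requires gawk, since PROCINFO["sorted_in"] is a gawk extension. Like __sort_subsort above, it compares the first "|"-separated field of the two values, but it uses local arrays instead of globals. That is purely a stylistic choice and is not claimed to have any effect on the leak.

```shell
# Minimal demo of a user-defined traversal order in gawk (hypothetical data).
sorted_demo() {
    gawk '
    # gawk calls this with the index and value of two elements; it must return
    # a negative number, zero, or a positive number, like a qsort comparator.
    function by_first_field(i1, v1, i2, v2,    a, b) {
        split(v1, a, "|")
        split(v2, b, "|")
        return (a[1] < b[1]) ? -1 : (a[1] > b[1])
    }
    BEGIN {
        PROCINFO["sorted_in"] = "by_first_field"
        rows[0] = "pear|3"
        rows[1] = "apple|1"
        rows[2] = "fig|2"
        # The for-in loop now visits elements in comparator order.
        for (k in rows)
            print rows[k]
    }'
}
sorted_demo
```

With this sample data the loop prints apple|1, fig|2, pear|3, one per line, regardless of the order in which the elements were stored.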
> Archive: /var/tmp/samplegnu.zip
> Zip file size: 9071 bytes, number of entries: 3
> -rw-rw-r-- 3.0 unx 91894 tx defN 17-Mar-27 13:30 sample4gnu.pip
> -rw-rw-r-- 3.0 unx 56 tx defN 17-Mar-27 13:36 README.txt
> -rw-rw-r-- 3.0 unx 3427 tx defN 17-Mar-27 10:15 test.awk
> 3 files, 95377 bytes uncompressed, 8601 bytes compressed: 91.0%
--
Andrew Schorr e-mail: address@hidden
Telemetry Investments, L.L.C. phone: 917-305-1748
545 Fifth Ave, Suite 1108 fax: 212-425-5550
New York, NY 10017-3630