bug-gnu-utils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Awk memory allocation: not a bug, but a limitation?


From: Aharon Robbins
Subject: Re: Awk memory allocation: not a bug, but a limitation?
Date: Tue, 19 Nov 2002 15:29:31 +0200

Greetings. Re this:

> Subject: Awk memory allocation: not a bug, but a limitation?
> To: address@hidden
> Cc: "Pedro Fernandes" <address@hidden>
> From: "Romeu Gaspar" <address@hidden>
> Date: Tue, 19 Nov 2002 13:11:59 +0000
>
> Dear Gawk developers,
>
> I've been using Gawk for almost a year now and I'm amazed with it's
> flexibility and performance. However I'm having problems with a Gawk script
> that looks something like this:
>
>   -----------------------------------------
>   while ( (getline  < input_file) > 0 ) {
>
>       # skip file header
>       n_nr++
>       if ( n_nr < 3 ) continue
>
>       summary_detail_array=$2"|"$3
>
>       summary_detail[summary_detail_array]+=$1
>
>   }
>
>   for ( item in summary_detail ) print summary_detail[item],item >
>   summary_output_file
>   -----------------------------------------
>
>
>   The input file has over 20 million lines.
>   When I run the script, it aborts due to insufficient memory ("fatal:
>   newnode: nextfree: can't allocate memory (Not enough space)"). Although
>   the array has only 5000 elements (10 million input linres read), Gawk is
>   using more than 2GB of RAM, at the time of the abort.
>   I've tried running the same script in Awk and the same thing happens (Awk
>   however seems to use less memory).

Gawk has a lot of overhead per array element.

>   My only explanation for this is that Gawk always adds a new element to
>   the array, even when you are just updating an existing element.
>   The old   element is deleted, but the memory is not deallocated.
>   Does this make sense? If so, is there any workaround?

No; it changes the value, but that doesn't use more memory. It's just that 
there's
a lot of overhead.

The following patch is already in my development version. It cuts the
per-element overhead by a third, and should help.

You can figure out how much memory you need as 80 bytes per element plus
length in characters of each index.  (It would be more if the value were
a string, but here it's clearly a number.)  If that exceeds the total
virtual address space, then there's not much you can do except buy more
RAM and/or recompile your kernel.

I hope this helps.

Arnold
------------------------------------------------------------------
Mon Oct 14 12:02:39 2002  Arnold D. Robbins  <address@hidden>

        Major space reduction in array management.  Overhead reduced
        to two NODE's per element from three.

        * awk.h (ahash): Union is gone.
        (hash.ref): new union member.
        (ahnext): new definition into hash union.
        (ahvalue): new definition into hash union.
        (ahname_str): new member, points into hash union.
        (ahname_len): new member, points into hash union.
        (ahname_ref): new member, points into hash union.
        * array.c: Replaces uses of ahname member with string and
        length.  Set the reference count correctly to 1 on new nodes.
        * eval.c (interpret): Case for Node_K_arrayfor.  dupnode() the
        array indices, and set loop variable to new value made via
        make_string().
        * node.c (unref, dupnode): Node_ahash nodes are now also
        reference counted, a la strings.  Similar code is used to
        increment/decrement the counts, and/or copy nodes as
        needed.

*** awk.h.save  Sun Sep 22 22:23:43 2002
--- awk.h       Sun Oct 13 18:13:01 2002
***************
*** 373,378 ****
--- 373,379 ----
        Node_hashnode,          /* an identifier in the symbol table */
        Node_ahash,             /* an array element */
        Node_array_ref,         /* array passed by ref as parameter */
+ 
        Node_BINMODE,           /* variables recognized in the grammar */
        Node_CONVFMT,
        Node_FIELDWIDTHS,
***************
*** 436,454 ****
                        char *name;
                        size_t length;
                        struct exp_node *value;
                } hash;
  #define       hnext   sub.hash.next
  #define       hname   sub.hash.name
  #define       hlength sub.hash.length
  #define       hvalue  sub.hash.value
!               struct {
!                       struct exp_node *next;
!                       struct exp_node *name;
!                       struct exp_node *value;
!               } ahash;
! #define       ahnext  sub.ahash.next
! #define       ahname  sub.ahash.name
! #define       ahvalue sub.ahash.value
        } sub;
        NODETYPE type;
        unsigned short flags;
--- 437,454 ----
                        char *name;
                        size_t length;
                        struct exp_node *value;
+                       long ref;
                } hash;
  #define       hnext   sub.hash.next
  #define       hname   sub.hash.name
  #define       hlength sub.hash.length
  #define       hvalue  sub.hash.value
! 
! #define       ahnext          sub.hash.next
! #define       ahname_str      sub.hash.name
! #define ahname_len    sub.hash.length
! #define       ahvalue sub.hash.value
! #define ahname_ref    sub.hash.ref
        } sub;
        NODETYPE type;
        unsigned short flags;
*** array.c.save        Tue Sep 10 17:09:24 2002
--- array.c     Mon Oct 14 11:57:46 2002
***************
*** 101,109 ****
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL; bucket = 
next) {
                        next = bucket->ahnext;
-                       unref(bucket->ahname);
                        unref(bucket->ahvalue);
!                       freenode(bucket);
                }
                symbol->var_array[i] = NULL;
        }
--- 101,108 ----
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL; bucket = 
next) {
                        next = bucket->ahnext;
                        unref(bucket->ahvalue);
!                       unref(bucket);  /* unref() will free the ahname_str */
                }
                symbol->var_array[i] = NULL;
        }
***************
*** 206,212 ****
  assoc_find(NODE *symbol, register NODE *subs, int hash1)
  {
        register NODE *bucket;
!       NODE *s1, *s2;
  
        for (bucket = symbol->var_array[hash1]; bucket != NULL;
                        bucket = bucket->ahnext) {
--- 205,213 ----
  assoc_find(NODE *symbol, register NODE *subs, int hash1)
  {
        register NODE *bucket;
!       char *s1_str;
!       size_t s1_len;
!       NODE *s2;
  
        for (bucket = symbol->var_array[hash1]; bucket != NULL;
                        bucket = bucket->ahnext) {
***************
*** 214,226 ****
                 * This used to use cmp_nodes() here.  That's wrong.
                 * Array indexes are strings; compare as such, always!
                 */
!               s1 = bucket->ahname;
!               s1 = force_string(s1);
                s2 = subs;
  
!               if (s1->stlen == s2->stlen) {
!                       if (s1->stlen == 0      /* "" is a valid index */
!                           || STREQN(s1->stptr, s2->stptr, s1->stlen))
                                return bucket;
                }
        }
--- 215,227 ----
                 * This used to use cmp_nodes() here.  That's wrong.
                 * Array indexes are strings; compare as such, always!
                 */
!               s1_str = bucket->ahname_str;
!               s1_len = bucket->ahname_len;
                s2 = subs;
  
!               if (s1_len == s2->stlen) {
!                       if (s1_len == 0         /* "" is a valid index */
!                           || STREQN(s1_str, s2->stptr, s1_len))
                                return bucket;
                }
        }
***************
*** 336,353 ****
         * One day: Use an atom table to track array indices,
         * and avoid the extra memory overhead.
         */
!       if (subs->flags & TEMP)
!               bucket->ahname = dupnode(subs);
!       else
!               bucket->ahname = copynode(subs);
  
!       free_temp(subs);
  
!       /* array subscripts are strings */
!       bucket->ahname->flags &= ~(NUMBER|NUM);
!       bucket->ahname->flags |= (STRING|STR);
!       /* ensure that this string value never changes */
!       bucket->ahname->stfmt = -1;
  
        bucket->ahvalue = Nnull_string;
        bucket->ahnext = symbol->var_array[hash1];
--- 337,351 ----
         * One day: Use an atom table to track array indices,
         * and avoid the extra memory overhead.
         */
!       bucket->flags |= MALLOC;
!       bucket->ahname_ref = 1;
!       emalloc(bucket->ahname_str, char *, subs->stlen + 2, "assoc_lookup");
!       bucket->ahname_len = subs->stlen;
  
!       memcpy(bucket->ahname_str, subs->stptr, subs->stlen);
!       bucket->ahname_str[bucket->ahname_len] = '\0';
  
!       free_temp(subs);
  
        bucket->ahvalue = Nnull_string;
        bucket->ahnext = symbol->var_array[hash1];
***************
*** 420,434 ****
                 * This used to use cmp_nodes() here.  That's wrong.
                 * Array indexes are strings; compare as such, always!
                 */
!               NODE *s1, *s2;
  
!               s1 = bucket->ahname;
!               s1 = force_string(s1);
                s2 = subs;
  
!               if (s1->stlen == s2->stlen) {
!                       if (s1->stlen == 0      /* "" is a valid index */
!                           || STREQN(s1->stptr, s2->stptr, s1->stlen))
                                break;
                }
        }
--- 418,434 ----
                 * This used to use cmp_nodes() here.  That's wrong.
                 * Array indexes are strings; compare as such, always!
                 */
!               char *s1_str;
!               size_t s1_len;
!               NODE *s2;
  
!               s1_str = bucket->ahname_str;
!               s1_len = bucket->ahname_len;
                s2 = subs;
  
!               if (s1_len == s2->stlen) {
!                       if (s1_len == 0         /* "" is a valid index */
!                           || STREQN(s1_str, s2->stptr, s1_len))
                                break;
                }
        }
***************
*** 445,453 ****
                last->ahnext = bucket->ahnext;
        else
                symbol->var_array[hash1] = bucket->ahnext;
-       unref(bucket->ahname);
        unref(bucket->ahvalue);
!       freenode(bucket);
        symbol->table_size--;
        if (symbol->table_size <= 0) {
                memset(symbol->var_array, '\0',
--- 445,452 ----
                last->ahnext = bucket->ahnext;
        else
                symbol->var_array[hash1] = bucket->ahnext;
        unref(bucket->ahvalue);
!       unref(bucket);  /* unref() will free the ahname_str */
        symbol->table_size--;
        if (symbol->table_size <= 0) {
                memset(symbol->var_array, '\0',
***************
*** 493,499 ****
                if (symbol->var_array[i] != NULL) {
                        lhs = get_lhs(tree->lnode, & after_assign, FALSE);
                        unref(*lhs);
!                       *lhs = dupnode(symbol->var_array[i]->ahname);
                        break;
                }
        }
--- 492,499 ----
                if (symbol->var_array[i] != NULL) {
                        lhs = get_lhs(tree->lnode, & after_assign, FALSE);
                        unref(*lhs);
!                       *lhs = make_string(symbol->var_array[i]->ahname_str,
!                                       symbol->var_array[i]->ahname_len);
                        break;
                }
        }
***************
*** 561,568 ****
  
                for (chain = old[i]; chain != NULL; chain = next) {
                        next = chain->ahnext;
!                       hash1 = hash(chain->ahname->stptr,
!                                       chain->ahname->stlen, newsize);
  
                        /* remove from old list, add to new */
                        chain->ahnext = new[hash1];
--- 561,568 ----
  
                for (chain = old[i]; chain != NULL; chain = next) {
                        next = chain->ahnext;
!                       hash1 = hash(chain->ahname_str,
!                                       chain->ahname_len, newsize);
  
                        /* remove from old list, add to new */
                        chain->ahnext = new[hash1];
***************
*** 615,628 ****
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL;
                                bucket = bucket->ahnext) {
!                       printf("%s: I: [(%p, %ld, %s) len %d <%.*s>] V: [",
                                symbol->vname,
!                               bucket->ahname,
!                               bucket->ahname->stref,
!                               flags2str(bucket->ahname->flags),
!                               (int) bucket->ahname->stlen,
!                               (int) bucket->ahname->stlen,
!                               bucket->ahname->stptr);
                        pr_node(bucket->ahvalue);
                        printf("]\n");
                }
--- 615,625 ----
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL;
                                bucket = bucket->ahnext) {
!                       printf("%s: I: [len %d <%.*s>] V: [",
                                symbol->vname,
!                               (int) bucket->ahname_len,
!                               (int) bucket->ahname_len,
!                               bucket->ahname_str);
                        pr_node(bucket->ahvalue);
                        printf("]\n");
                }
***************
*** 695,706 ****
                                        /* get a node for the linked list */
                                        getnode(bucket);
                                        bucket->type = Node_ahash;
  
                                        /*
                                         * copy the corresponding name and
                                         * value from the original input list
                                         */
!                                       bucket->ahname = dupnode(chain->ahname);
                                        bucket->ahvalue = 
dupnode(chain->ahvalue);
  
                                        /*
--- 692,711 ----
                                        /* get a node for the linked list */
                                        getnode(bucket);
                                        bucket->type = Node_ahash;
+                                       bucket->flags |= MALLOC;
+                                       bucket->ahname_ref = 1;
  
                                        /*
                                         * copy the corresponding name and
                                         * value from the original input list
                                         */
!                                       emalloc(bucket->ahname_str, char *, 
chain->ahname_len + 2, "dup_table");
!                                       bucket->ahname_len = chain->ahname_len;
! 
!                                       memcpy(bucket->ahname_str, 
chain->ahname_str, chain->ahname_len);
!                                       bucket->ahname_str[bucket->ahname_len] 
= '\0';
! 
! 
                                        bucket->ahvalue = 
dupnode(chain->ahvalue);
  
                                        /*
***************
*** 792,808 ****
        NODE *next;
        int i = 0;
        register int hash1;
  
        for (; list != NULL; list = next) {
                next = list->ahnext;
  
                /* make an int out of i++ */
                i++;
!               list->ahname = make_number((AWKNUM) i);
!               (void) force_string(list->ahname);
  
                /* find the bucket where it belongs */
!               hash1 = hash(list->ahname->stptr, list->ahname->stlen,
                                symbol->array_size);
  
                /* link the node into the chain at that bucket */
--- 797,818 ----
        NODE *next;
        int i = 0;
        register int hash1;
+       char buf[100];
  
        for (; list != NULL; list = next) {
                next = list->ahnext;
  
                /* make an int out of i++ */
                i++;
!               sprintf(buf, "%ld", i);
!               assert(list->ahname_str == NULL);
!               assert(list->ahname_ref == 1);
!               emalloc(list->ahname_str, char *, strlen(buf) + 2, 
"assoc_from_list");
!               list->ahname_len = strlen(buf);
!               strcpy(list->ahname_str, buf);
  
                /* find the bucket where it belongs */
!               hash1 = hash(list->ahname_str, list->ahname_len,
                                symbol->array_size);
  
                /* link the node into the chain at that bucket */
***************
*** 833,839 ****
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL; bucket = 
next) {
                        next = bucket->ahnext;
!                       unref(bucket->ahname);
                        bucket->ahnext = list;
                        list = bucket;
                        num++;
--- 843,863 ----
        for (i = 0; i < symbol->array_size; i++) {
                for (bucket = symbol->var_array[i]; bucket != NULL; bucket = 
next) {
                        next = bucket->ahnext;
!                       if (bucket->ahname_ref == 1) {
!                               free(bucket->ahname_str);
!                               bucket->ahname_str = NULL;
!                       } else {
!                               NODE *r;
! 
!                               getnode(r);
!                               *r = *bucket;
!                               unref(bucket);
!                               bucket = r;
!                               bucket->flags |= MALLOC;
!                               bucket->ahname_ref = 1;
!                               bucket->ahname_str = NULL;
!                               bucket->ahname_len = 0;
!                       }
                        bucket->ahnext = list;
                        list = bucket;
                        num++;
*** eval.c.save Tue Sep 10 17:10:08 2002
--- eval.c      Sun Oct 13 18:18:23 2002
***************
*** 512,518 ****
                                continue;
  
                        for (; t != NULL; t = t->ahnext) {
!                               list[j++] = dupnode(t->ahname);
                        }
                }
  
--- 512,519 ----
                                continue;
  
                        for (; t != NULL; t = t->ahnext) {
!                               list[j++] = dupnode(t);
!                               assert(list[j-1] == t);
                        }
                }
  
***************
*** 529,535 ****
                for (i = 0; i < num_elems; i++) {
                        INCREMENT(stable_tree->exec_count);
                        unref(*((NODE **) lhs));
!                       *lhs = dupnode(list[i]);
                        if (after_assign)
                                (*after_assign)();
                        switch (setjmp(loop_tag)) {
--- 530,536 ----
                for (i = 0; i < num_elems; i++) {
                        INCREMENT(stable_tree->exec_count);
                        unref(*((NODE **) lhs));
!                       *lhs = make_string(list[i]->ahname_str, 
list[i]->ahname_len);
                        if (after_assign)
                                (*after_assign)();
                        switch (setjmp(loop_tag)) {
***************
*** 2244,2250 ****
  
        /* Array indexes are strings, compare as such, always! */
        if (len1 == len2 || len1 < len2)
!               return strncmp(str1, str2, len1);
        else
!               return strncmp(str1, str2, len2);
  }
--- 2245,2251 ----
  
        /* Array indexes are strings, compare as such, always! */
        if (len1 == len2 || len1 < len2)
!               return memcmp(str1, str2, len1);
        else
!               return memcmp(str1, str2, len2);
  }
*** node.c.save Thu Oct  4 18:48:19 2001
--- node.c      Mon Oct 14 12:01:28 2002
***************
*** 259,264 ****
--- 259,268 ----
                if (n->stref < LONG_MAX)
                        n->stref++;
                return n;
+       } else if ((n->flags & MALLOC) != 0 && n->type == Node_ahash) {
+               if (n->ahname_ref < LONG_MAX)
+                       n->ahname_ref++;
+               return n;
        }
        getnode(r);
        *r = *n;
***************
*** 269,274 ****
--- 273,283 ----
                emalloc(r->stptr, char *, r->stlen + 2, "dupnode");
                memcpy(r->stptr, n->stptr, r->stlen);
                r->stptr[r->stlen] = '\0';
+       } else if (n->type == Node_ahash && (n->flags & MALLOC) != 0) {
+               r->ahname_ref = 1;
+               emalloc(r->ahname_str, char *, r->ahname_len + 2, "dupnode");
+               memcpy(r->ahname_str, n->ahname_str, r->ahname_len);
+               r->stptr[r->ahname_len] = '\0';
        }
        return r;
  }
***************
*** 432,438 ****
                return;
        tmp->flags &= ~TEMP;
        if ((tmp->flags & MALLOC) != 0) {
!               if ((tmp->flags & STR) != 0) {
                        if (tmp->stref > 1) {
                                if (tmp->stref != LONG_MAX)
                                        tmp->stref--;
--- 441,454 ----
                return;
        tmp->flags &= ~TEMP;
        if ((tmp->flags & MALLOC) != 0) {
!               if (tmp->type == Node_ahash) {
!                       if (tmp->ahname_ref > 1) {
!                               if (tmp->ahname_ref != LONG_MAX)
!                                       tmp->ahname_ref--;
!                               return;
!                       }
!                       free(tmp->ahname_str);
!               } else if ((tmp->flags & STR) != 0) {
                        if (tmp->stref > 1) {
                                if (tmp->stref != LONG_MAX)
                                        tmp->stref--;




reply via email to

[Prev in Thread] Current Thread [Next in Thread]