[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
hash resizing bug (was: [PATCH] hash: declare some functions with the wa
From: |
Eric Blake |
Subject: |
hash resizing bug (was: [PATCH] hash: declare some functions with the warn_unused_result attribute) |
Date: |
Wed, 17 Jun 2009 18:21:45 +0000 (UTC) |
User-agent: |
Loom/3.14 (http://gmane.org/) |
Jim Meyering <jim <at> meyering.net> writes:
> > Additionally, it looks like hash_rehash has a memory leak - if new_table
> > is allocated, but we later fail to allocate new_entry, the function
> > returns immediately without reclaiming new_table. Which means that even
> > if an application is prepared to deal with a false return by trying other
> > means to reduce memory usage rather than just giving up with xalloc_die,
> > the leaked memory from the previous attempt will interfere.
>
> Good catch.
> That is definitely a bug.
I found another bug in hash.c. hash_rehash can get the table into a state
where it will NEVER rehash again, regardless of how much else you call
hash_insert. I triggered this bug in m4 by hashing the
strings "m1", "m2", ... "m100000"; m4 uses the default hash_tuning, and an
initial capacity request of 511. It also uses the following hash function (the
hash is over the string plus its length, to allow hashing embedded NUL bytes):
static size_t
symtab_hasher (const void *entry, size_t buckets)
{
const symbol *sym = (const symbol *) entry;
return hash (SYMBOL_NAME (sym), SYMBOL_NAME_LEN (sym)) % buckets;
}
static size_t
hash (const char *s, size_t len)
{
size_t val = len;
/* This algorithm was originally borrowed from GNU Emacs, but has
been modified to allow embedded NUL. */
while (len--)
val = (val << 7) + (val >> (sizeof val * CHAR_BIT - 7)) + to_uchar (*s++);
return val;
}
Note what happens in the call to hash_rehash, just after everything has been
copied to the new table, but before the old table is freed:
(gdb) p *table
$1 = {bucket = 0x716318, bucket_limit = 0x717720, n_buckets = 641,
n_buckets_used = 513, n_entries = 3289, tuning = 0x443cd8,
hasher = 0x417ab5 <symtab_hasher>,
comparator = 0x417af1 <symtab_comparator>,
data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x0}
(gdb) p *new_table
$2 = {bucket = 0x798970, bucket_limit = 0x79a5c8, n_buckets = 907,
n_buckets_used = 907, n_entries = 0, tuning = 0x443cd8,
hasher = 0x417ab5 <symtab_hasher>,
comparator = 0x417af1 <symtab_comparator>,
data_freer = 0x417b45 <symtab_free_entry>, free_entry_list = 0x75b100}
Before the rehash, about 20% of the buckets were empty. But after changing the
prime factor, the hasher function scatters better, and _every single bucket_ is
occupied. Which means ALL subsequent calls to hash_insert will trigger an
overflow, rather than inserting into a bucket head. And since the resize
checking logic in hash_rehash is only triggered after insertion into a bucket
head, the hash table has effectively been locked into a fixed size that will
never grow again, no matter how many hash_insert calls are made, even though
the table is certainly exceeding the requested threshold of 20% empty buckets.
Ouch.
In other words, this statement in hash_rehash:
/* If the growth threshold of the buckets in use has been reached, increase
the table size and rehash. There's no point in checking the number of
entries: if the hashing function is ill-conditioned, rehashing is not
likely to improve it. */
is not quite right. A hashing function may be somewhat ill-conditioned for one
bucket size, but much better conditioned for another size. At any rate, it
looks like the correct fix is to check for rehashing thresholds based on number
of entries, not number of buckets, and that this check needs to happen on every
insertion, not just insertions that land in an empty bucket. Or, at the very
least, check for threshold PRIOR to insertion, rather than after, such that in
the above case with m4, the very next call to hash_insert would notice the
table was at 100%, and force another resize above 907 buckets.
Thoughts, before I implement something along these lines?
--
Eric Blake
- Re: [PATCH] hash: declare some functions with the warn_unused_result attribute, (continued)
- Re: [PATCH] hash: declare some functions with the warn_unused_result attribute, Eric Blake, 2009/06/18
- Re: [PATCH] hash: declare some functions with the warn_unused_result attribute, Jim Meyering, 2009/06/17
- hash_rehash allocatino failure (was: [PATCH] hash: declare some functions with the warn_unused_result attribute), Eric Blake, 2009/06/17
- Re: hash_rehash allocatino failure, Jim Meyering, 2009/06/17
- Re: hash_rehash allocatino failure, Eric Blake, 2009/06/18
- Re: hash_rehash allocation failure, Eric Blake, 2009/06/19
- hash resizing bug (was: [PATCH] hash: declare some functions with the warn_unused_result attribute),
Eric Blake <=
- Re: hash resizing bug, Eric Blake, 2009/06/17
- Re: hash resizing bug, Eric Blake, 2009/06/17
- Re: hash resizing bug, Jim Meyering, 2009/06/17
- Re: hash resizing bug, Eric Blake, 2009/06/17
- Re: hash resizing bug, Jim Meyering, 2009/06/17
- Re: hash resizing bug, Eric Blake, 2009/06/17
- Re: hash resizing bug, Jim Meyering, 2009/06/18
- Re: hash resizing bug, Jim Meyering, 2009/06/17
- Re: hash resizing bug, Eric Blake, 2009/06/18
- Re: hash resizing bug, Jim Meyering, 2009/06/18