bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

gnulib's Knuth-Morris-Pratt implementation (was: memmem issues)


From: James Youngman
Subject: gnulib's Knuth-Morris-Pratt implementation (was: memmem issues)
Date: Thu, 20 Dec 2007 10:59:35 +0000

On Dec 19, 2007 11:15 PM, Eric Blake <address@hidden> wrote:

> - The implementation was naively quadratic in the worst-case complexity.  I
> lifted Bruno's KMP ideas in mbsstr to make it linear+delayed allocation.  
> glibc
> still uses the naive implementation - should we file a bug with them to fix
> it?

I've been fiddling around with Bruno's KMP code recently.  I like it a
lot.   I wish though there was some way for the caller to indicate
that they are willing to amortise the cost of the KMP initialisation
over many calls (that is, that we will look for this needle in many
haystacks).

Just doing the naive thing, using the KMP code every time for all
cases, is a big win for the case-insensitive uses (because
needle_mbchars is greatly beneficial) but it's a slight loss for the
case-sensitive and unibyte cases.   At least, this is so for the cases
I tried.  You can win back some of the difference by special-casing
the search for needle[0] in the j=0 case, but essentially this change
is simply reinstating a simplified version of Bruno's outer loop.
Because using just the KMP is not a clear win, I haven't suggested
extracting the KMP implementation out into a separate module.

For findutils though, I can also make use of the fact that the huge
majority of my haystacks have a large prefix in common with the
preceding haystack.   However, this is not so useful in the multibyte
case, as you can't iterate backward through a multibyte string without
using an ancilliary data structure.    I'm mulling the idea of
building such a thing.

James.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]