bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[PATCH] strstr: revert patches that introduced bug and pessimization


From: Eric Blake
Subject: [PATCH] strstr: revert patches that introduced bug and pessimization
Date: Fri, 25 Feb 2011 09:18:05 -0700

Jim's one-liner solved the bug by pessimizing speed, making the
algorithm shift less per iteration and thus perform more repeated
comparisons.  The real reason for the bug is that my supposed
"optimizations" actually resulted in cases on certain periodic needles
where critical_factorization returned a factorization that was equal
to, rather than less than the period of the needle.  This makes the
CMP_FUNC choose the wrong branch, since a periodic needle must be
handled differently than one where the left half of the needle does
not overlap the right half.

Thankfully, the flawed "optimization" was only present in gnulib, and
was never ported to glibc or cygwin (the only two known
implementations that use the two-way algorithm), so no additional m4
check is needed to detect the bug in the wild.

* lib/str-two-way.h: Add another reference.
(two_way_short_needle, two_way_long_needle): Revert changes from
2011-02-24; they pessimize search speed.
(critical_factorization): Partially revert changes from
2010-06-22; they violate the requirement that the left half of the
needle be smaller than the period of the needle.

Signed-off-by: Eric Blake <address@hidden>
---

I'm working on a followup patch that enhances memmem and strcasestr
tests to also test for the same bug, to ensure we don't regress again.

 ChangeLog         |   10 ++++++++++
 lib/str-two-way.h |   37 ++++++++++++++++++++-----------------
 2 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e40fe97..bdb7ac6 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,13 @@
+2011-02-25  Eric Blake  <address@hidden>
+
+       strstr: revert patches that introduced bug and pessimization
+       * lib/str-two-way.h: Add another reference.
+       (two_way_short_needle, two_way_long_needle): Revert changes from
+       2011-02-24; they pessimize search speed.
+       (critical_factorization): Partially revert changes from
+       2010-06-22; they violate the requirement that the left half of the
+       needle be smaller than the period of the needle.
+
 2011-02-24  Paul Eggert  <address@hidden>

        filenamecat: remove unnecessary dependency on dirname-lgpl
diff --git a/lib/str-two-way.h b/lib/str-two-way.h
index 317612c..7dcb387 100644
--- a/lib/str-two-way.h
+++ b/lib/str-two-way.h
@@ -44,14 +44,15 @@
 #include <limits.h>
 #include <stdint.h>

-/* We use the Two-Way string matching algorithm, which guarantees
-   linear complexity with constant space.  Additionally, for long
-   needles, we also use a bad character shift table similar to the
-   Boyer-Moore algorithm to achieve improved (potentially sub-linear)
-   performance.
-
-   See http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
-   and http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm
+/* We use the Two-Way string matching algorithm (also known as
+   Chrochemore-Perrin), which guarantees linear complexity with
+   constant space.  Additionally, for long needles, we also use a bad
+   character shift table similar to the Boyer-Moore algorithm to
+   achieve improved (potentially sub-linear) performance.
+
+   See http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260,
+   http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm,
+   
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.6641&rep=rep1&type=pdf
 */

 /* Point at which computing a bad-byte shift table is likely to be
@@ -108,7 +109,7 @@ static size_t
 critical_factorization (const unsigned char *needle, size_t needle_len,
                         size_t *period)
 {
-  /* Index of last byte of left half.  */
+  /* Index of last byte of left half, or SIZE_MAX.  */
   size_t max_suffix, max_suffix_rev;
   size_t j; /* Index into NEEDLE for current candidate suffix.  */
   size_t k; /* Offset into current period.  */
@@ -124,8 +125,8 @@ critical_factorization (const unsigned char *needle, size_t 
needle_len,
     }

   /* Invariants:
-     1 <= j < NEEDLE_LEN - 1
-     0 <= max_suffix{,_rev} < j
+     0 <= j < NEEDLE_LEN - 1
+     -1 <= max_suffix{,_rev} < j (treating SIZE_MAX as if it were signed)
      min(max_suffix, max_suffix_rev) < global period of NEEDLE
      1 <= p <= global period of NEEDLE
      p == global period of the substring NEEDLE[max_suffix{,_rev}+1...j]
@@ -133,8 +134,9 @@ critical_factorization (const unsigned char *needle, size_t 
needle_len,
   */

   /* Perform lexicographic search.  */
-  max_suffix = 0;
-  j = k = p = 1;
+  max_suffix = SIZE_MAX;
+  j = 0;
+  k = p = 1;
   while (j + k < needle_len)
     {
       a = CANON_ELEMENT (needle[j + k]);
@@ -167,8 +169,9 @@ critical_factorization (const unsigned char *needle, size_t 
needle_len,
   *period = p;

   /* Perform reverse lexicographic search.  */
-  max_suffix_rev = 0;
-  j = k = p = 1;
+  max_suffix_rev = SIZE_MAX;
+  j = 0;
+  k = p = 1;
   while (j + k < needle_len)
     {
       a = CANON_ELEMENT (needle[j + k]);
@@ -284,7 +287,7 @@ two_way_short_needle (const unsigned char *haystack, size_t 
haystack_len,
     {
       /* The two halves of needle are distinct; no extra memory is
          required, and any mismatch results in a maximal shift.  */
-      period = MAX (suffix, needle_len - suffix);
+      period = MAX (suffix, needle_len - suffix) + 1;
       j = 0;
       while (AVAILABLE (haystack, haystack_len, j, needle_len))
         {
@@ -407,7 +410,7 @@ two_way_long_needle (const unsigned char *haystack, size_t 
haystack_len,
       /* The two halves of needle are distinct; no extra memory is
          required, and any mismatch results in a maximal shift.  */
       size_t shift;
-      period = MAX (suffix, needle_len - suffix);
+      period = MAX (suffix, needle_len - suffix) + 1;
       j = 0;
       while (AVAILABLE (haystack, haystack_len, j, needle_len))
         {
-- 
1.7.4




reply via email to

[Prev in Thread] Current Thread [Next in Thread]