bug-gnulib
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 0/5] unistr/u*-chr, unistr/u*-strchr: improve test coverage,


From: Bruno Haible
Subject: Re: [PATCH 0/5] unistr/u*-chr, unistr/u*-strchr: improve test coverage, optimize
Date: Wed, 21 Jul 2010 01:07:59 +0200
User-agent: KMail/1.9.9

Hello Paolo,

> this series introduces more optimizations (basically a Boyer-Moore
> algorithm without the table) in the u8_strchr and u8_chr functions.
> ...
> While doing this I also improved the coverage of the functions.
> 
> Here are the timings:
> 
>                 test-u8-chr         test-u8-strchr
>    before       1:00                1:07
>    after        0:46                0:42

Great! Thank you for this.

> The algorithm introduces a heavy slow-down on UTF-16.  This is probably
> due to the nature of the test, but I decided it is not worth optimizing
> that case a lot.

Yes, the majority of the users use the u8_* functions.

Comments about the individual patches:


About [PATCH 1/5] unistr/u*-strchr: add tests:

> +#define U_CHR u16_strchr
> +  ASSERT (U_CHR (input, 'b') == input + 1);

The convention throughout libunistring is that U_XXX is the abbreviation
for u{8,16,32}_xxx. So tests/unistr/test-strchr.h looks like it is calling
u16_chr with two arguments. But u16_chr takes three arguments. Can you
please use U_STRCHR as macro name here, instead of U_CHR?


About [PATCH 2/5] unistr/u*-chr, unistr/u*-strchr: prepare for multibyte tests:

> * modules/unistr/u16-chr-test,  modules/unistr/u16-strchr-test:
> Depend on u32-to-u16.
> * modules/unistr/u8-chr-test,  modules/unistr/u8-strchr-test:
> Depend on u32-to-u8.

The file names actually end in '-tests'.

> version as UCS4 then convert.

The correct name of this encoding is UCS-4 with a hyphen/minus. (See RFC 2044.)


About [PATCH 4/5] unistr/u*-chr, unistr/u*-strchr: test multibyte sequences 
more:

> +  UNIT *exp;
> +  UNIT *prev;

The number of local variables declared at the top of this function is getting
large. For readability, I would create a local block, to minimize their scope:

  {
    UNIT *exp;
    UNIT *prev;

    exp = input + 1026;
    ...
  }

> +      size_t n;

A local variable of the same name already exists in an outer scope.
It leads to a warning with "gcc -Wshadow". Can you rename it?

> +      n = U_UCTOMB(c, uc, 6);

Needs a space before the opening parenthesis.


About [PATCH 5/5] unistr/u8-chr: use Boyer-Moore like algorithm:

In modules/unistr/u8-chr, a dependency to 'memchr' should be added.

> +      return memchr (s, c0, n);

Please write this as
         return (uint8_t *) memchr ((const char *) s, c0, n);
It's better to make the casts explicit in these cases, IMO,
for clarity, for maintenance.

>        case 2:
> -        if (n > 1)

>        case 3:
> -        if (n > 2)

>        case 4:
> -        if (n > 3)

Removing this test of n is wrong IMO. If, for example, uc is a
2-units character and n == 0, in the statement
  const uint8_t *end = s + n - 1;
you create a pointer to a memory area where you are not allowed to have a
pointer to, according to ISO C.

Please also add comments that prove the validity of the code. Like this:

For 2:

+          uint8_t c0 = c[0];
+          uint8_t c1 = c[1];
           /* Note: 0x80 <= c1 < 0xC0 < c0.  */

+          while (s < end)
+            {
+              uint8_t s1 = s[1];
+              if (s1 == c1)
+                {
+                  if (*s == c0)
+                    return (uint8_t *) s;
+                  else
                     /* We can skip the search at s + 1, because s[1] == c1 < 
c0.  */
+                    s += 2;
+                }
+              else
+                {
+                  if (s1 == c0)
+                    s++;
+                  else
                     /* We can skip the search at s + 1, because s[1] != c0.  */
+                    s += 2;
+                }
+            }

For 3:

+          uint8_t c0 = c[0];
+          uint8_t c1 = c[1];
+          uint8_t c2 = c[2];
           /* Note: c1 and c2 are < 0xC0 < c0.  */

+          while (s < end)
+            {
+              uint8_t s2 = s[2];
+              if (s2 == c2)
+                {
+                  if (s[1] == c1 && *s == c0)
+                    return (uint8_t *) s;
+                  else if (s2 == c1)
+                    s++;
+                  else
                     /* We can skip the search at s + 1, because s[2] != c1.
                        We can skip the search at s + 2, because s[2] == c2 < 
c0.  */
+                    s += 3;
+                }
+              else
+                {
+                  if (s2 == c1)
+                    s++;
+                  else if (s2 == c0)
                     /* We can skip the search at s + 1, because s[2] != c1.  */
+                    s += 2;
+                  else
                     /* We can skip the search at s + 1, because s[2] != c1.
                        We can skip the search at s + 2, because s[2] != c0.  */
+                    s += 3;
+                }
+            }

Likewise for 4.


Otherwise nicely done! Thanks!

Bruno



reply via email to

[Prev in Thread] Current Thread [Next in Thread]