diffutils-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: mbcel module for Gnulib?, incomplete multibyte sequences


From: Bruno Haible
Subject: Re: mbcel module for Gnulib?, incomplete multibyte sequences
Date: Tue, 25 Jul 2023 02:01:43 +0200

Paul Eggert wrote:
> Even mbcel strains GCC's capabilities to optimize. When I look at the 
> x86-64 code it generates I see clear opportunities for tighter code. If 
> I have time I'll fire off a suggestion or two to the GCC maintainers.

Yes, this would be good. The benchmarks showed that clang's code is
slightly better than GCC's code:
  https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00093.html
  https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00095.html

> I for mbuiterf, J for mbuiterf with NDEBUG

Indeed, doing the benchmarks on Solaris and NetBSD, I notice that NDEBUG
gives a significant speedup.

> The mbu?iterf? overhead would be OK if it bought something for diff. But 
> it doesn't.
> ...
> multi-byte source code that is significantly 
> simpler than the mbuiterf code. Here's the mbcel version:
> ...
> and the mbuiterf equivalent:

Timings of a testdir of "mbiterf-bench-tests mbrtoc32-regular", together
with your mbcel benchmark patch from
https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00064.html
modified to use mbszero() (since some of the assumptions regarding
mbstate_t turned out to be unsafe), compiled with CPPFLAGS=-DNDEBUG:

    glibc system     Solaris 11.4     NetBSD 9.3
     gcc 13.1          gcc 7.3          gcc 7.5
   mbiterf  mbcel   mbiterf  mbcel   mbiterf  mbcel
a   0.231   0.229    0.191   0.220    0.183   0.200
b   0.223   0.213    0.176   0.220    0.178   0.223
c   1.038   0.655    0.609   0.653    0.513   0.611
d   0.797   0.702    0.735   0.769    ---     ---
e   0.834   0.722    0.653   0.712    0.665   0.845
f  10.849   5.676    5.544   5.451    3.474   4.455
g   7.423   6.483    ---     ---      ---     ---
h   7.533   6.519    6.077   6.188    6.190   7.742
i   3.327   2.817    2.437   2.459    2.702   3.460
j   3.521   3.148    2.753   2.783    2.545   2.920

What you can see here is that on Solaris and NetBSD (where clearing an
mbstate_t is not close to a no-op) in the test cases c..j (where
mbrtoc32 calls are frequent) mbcel comes out slower than mbiterf.
That is because of the overhead of clearing the state at each loop round.

Namely, the visual complexity of mbiterf and mbuiterf is that all
macro calls take a state argument. This state, initialized outside of
the inner loop, clutters up the code, but its effect is a speedup on
platforms like Solaris and NetBSD.

> here is the list of functions 
> that the two implementations call, along with the static number of calls 
> (C for mbcel, I for mbuiterf, J for mbuiterf with NDEBUG):
> 
>     C I J
>       3   __assert_fail
>     1 3 3 __ctype_get_mb_cur_max
>     1 1 1 __ctype_to_lower_loc
>     2 3 3 mbrtoc32 / rpl_mbrtoc32
>       4 2 mbsinit
>       3 3 memcmp
>       2 2 strlen
>       3 3 strnlen1
>     2 2 2 towlower

The fact that here mbsinit shows up despite NDEBUG, indicates that you
haven't had mbrtoc32-regular included your testdir. You are thus comparing
apples with peaches, since
  - mbcel assumes a regular mbrtoc32,
  - mbu?iterf? can with with a regular as well as with an irregular mbrtoc32,
    but is slower when it cannot assume a regular one.
To work with the same assumptions on both sides, you need to include
'mbrtoc32-regular.

> It has a function 
> mbcel_strcasecmp which has the same behavior as Gnulib's mbuiterf-using 
> mbscasecmp

If it behaves the same, without making unportable and undocumented
assumptions (i.e. I don't want to optimize mbszero() beyond the current
state), and is faster, I am all for including this variant in gnulib.

Regarding the "is faster", I guess that Solaris and *BSD platforms will
comes out faster with mbu?iterf, see above. Even then, I'd have no
objections against a compile-time #if:

  mbscasecmp (...)
  {
    #if _GL_MBSTATE_ZERO_SIZE >= 12 /* BSD, Solaris */
    ... mbiterf based implementation ...
    #else
    ... mbcel based implementation ...
    #endif
  }

Now, regarding the claim "the two versions have the same behavior":

Can you prove it? I mean, mathematically prove it?

To prove it, I can only think of the two lemmas:
  Lemma 1: Anywhere where mbiter returns an incomplete character with
           b bytes, mbcel will return b error-returns.
  Lemma 2: The application code handles an incomplete character with b bytes
           in the same way as b incomplete or erroneous returns with
           1 byte each.

Whether Lemma 2 holds or not is, in general, dependent on the code
that consumes the result of mbiter_next vs. mbcel_scan. In the case of
mbscasecmp (mbcel_strcasecmp), I believe it holds, since no case conversion
is being done if g1.err || g2.err.

> No, mbcel interprets it as a malformed byte sequence. It's not a 
> character, nor is it multiple characters. It is a sequence of encoding 
> error bytes, and it's up to the caller to deal with that sequence as it 
> chooses.
> 
> diff uses mbcel to output the malformed sequence as-is.

This is important. We are getting closer to the answer whether lemma 2
holds or not. Essentially you are saying: If the application outputs
a malformed sequence as-is, then it doesn't depend whether it was
as one error return with b bytes or as b error returns with 1 byte each.

This should become part of mbcel's documentation at some point,
because it's one thing a user of mbcel has to think about, whereas for
mbu?iterf there is no such worry.

> That section is about converters. mbcel and mbiter are not converters: 
> they're scanners. You can use mbcel (or mbiter) to build a converter 
> that conforms to the standard in any way you like. You can also use 
> either of them in code that does not convert (e.g., mbscasecmp). So this 
> disagreement is not about mbcel itself, but how mbcel is used as part of 
> a larger application.

I see what you mean, and agree.

Does Lemma 1 hold? 

The answer is: It holds for encodings with MB_CUR_MAX <= 2, as well as for
UTF-8 and EUC-JP. It does *not* hold for the encodings GB18030 and EUC-TW.

Proof:

Since incomplete character returns have at most MB_CUR_MAX - 1 bytes, we
need to consider only encoding with MB_CUR_MAX > 2. These are:
   EUC-JP  3
   EUC-TW  4
   GB18030 4
   UTF-8   4
For the detailed considerations, we need to look at the code structure
of these four encoding. I have attached tables of all valid multibyte
sequences of these encodings. From these, we can derive these properties
(I derived them through a couple of 'sed' and 'grep' invocations):

   EUC-JP
     - Valid 1st bytes: 00..7F, 80..9F, A1..A8, B0..F4
     - Valid 2nd bytes of 3-byte sequences: A2, A6..A7, A9..AB, B0..ED
     - Problematic bytes: A2, A6..A7, A9..AB, B0..ED
   EUC-TW
     - Valid 1st bytes: 00..7F, 8E, A1..A7, C2, C4..FD
     - Valid 2nd bytes of 3- or 4-byte sequences: A1..A7, AF
     - Valid 3rd bytes of 4-byte sequences: A1..FD
     - Problematic bytes: A1..A7, C2, C4..FD
   GB18030
     - Valid 1st bytes: 00..7F, 81..FE
     - Valid 2nd bytes of 3- or 4-byte sequences: 30..39
     - Valid 3rd bytes of 4-byte sequences: 81..FE
     - Problematic bytes: 30..39, 81..FE
   UTF-8
     - Valid 1st bytes: 00..7F, C2..F4
     - Valid 2nd bytes of 3- or 4-byte sequences: 80..BF
     - Valid 3rd bytes of 4-byte sequences: 80..BF
     - Problematic bytes: none

What I call "problematic bytes" here are those that can be a valid
2nd or 3rd byte in an incomplete multibyte sequence as well as a
valid 1st byte.

For UTF-8, when mbrtoc32() returns (size_t)(-2) for 2 or more bytes,
the incomplete multibyte sequence consists of
  - 1st byte: C2..F4
  - 2nd and possibly 3rd byte: 80..BF
Since 80..BF is not a valid 1st byte, mbcel will see 2 or 3 error
returns. Proof done.

For EUC-JP, when mbrtoc32() returns (size_t)(-2) for 2 or more bytes,
the incomplete multibyte sequence consists of
  - 1st byte: 8F
  - 2nd byte: A2, A6..A7, A9..AB, B0..ED
mbcel will produce two error returns: one for 8E, and another one for
the 2nd byte, since we are at the end of the string and that 2nd byte
is not in the ASCII range. Proof done.

For EUC-TW, when mbrtoc32() returns (size_t)(-2) for 2 or more bytes,
we have two cases:
  * b = 2. The incomplete multibyte sequence consisted of
    - 1st byte: 8E, A1..A7, C2, C4..FD
    - 2nd byte: A1..A7, AF
    Like for EUC-JP, since we are at the end of the string and the
    2nd byte is not in the ASCII range, mbcel will produce two error returns.
  * b = 3. Here, the 2nd and 3rd byte can well be a valid multibyte character.
    See:
    $ LC_ALL=C join EUC-TW <(grep '^0x........' EUC-TW | sed -e 's/..$//' -e 
's/^0x8E/0x/' | LC_ALL=C sort -u) | wc -l
    381
    This means, there are 381 different pairs of (2nd byte, 3rd byte), such
    that the multibyte sequence (0x8E, 2nd byte, 3rd byte) will be
    interpreted as a 3-bytes incomplete sequence by mbrtoc32() but as
    1 error return and 1 non-error character by mbcel. Proof of lemma 1 failed.

For GB18030, we have two cases again:
  * b = 2. The incomplete multibyte sequence consisted of
    - 1st byte: 81..FE
    - 2nd byte: 30..39
    mbrtoc32() produces a 2-bytes incomplete sequence, whereas mbcel
    returns 1 error return and 1 non-error character. Proof of lemma 1 failed.
  * b = 3. Don't even need to consider this case any more.

Thus the proof of lemma 1 failed.

Do the two versions of mbscasecmp still produce the same behaviour?
I don't think so, when looking at the return value of mbcel_cmp() for
the error cases, versus mb_casecmp(). The only thing that could convince
me at this point is an extensive test suite for each of the four encodings
listed above. And, frankly, I won't write this test suite, because it
would be at least one day of work, for merely the benefit of optimizing
one single function by 10%-30%.

Bruno

Attachment: multibyte-sequences.tar.xz
Description: application/xz-compressed-tar


reply via email to

[Prev in Thread] Current Thread [Next in Thread]