bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’

bug-gnu-emacs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’

From:	Michal Nazarewicz
Subject:	bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date:	Mon, 18 Jul 2016 20:07:18 +0200
User-agent:	Notmuch/0.19+53~g2e63a09 (http://notmuchmail.org) Emacs/25.1.50.2 (x86_64-unknown-linux-gnu)

On Mon, Jul 18 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86@mina86.com>
>> Date: Mon, 18 Jul 2016 16:04:44 +0200
>> 
>> mutually_exclusive_p did not check for the claass bits of an charset
>> opcode when comparing it with an exactn which resulted in situation
>> where it thought a multibyte character could not match the character
>> class.
>> 
>> This assumption caused incorrect optimisation of the regular expression
>> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
>> 
>> The issue affects all multibyte word characters as well as other
>> character classes which may match multibyte characters.
>
> Thanks.
>
> Unfortunately, the above description is too terse for me to understand
> the issue and the way you propose to fix it.  Could you please provide
> more details,

‘[[:word:]]*\u2620’ ends up being as:

0:      /on_failure_keep_string_jump to 28
3:      /charset [ !extends past end of pattern! $-%0-9A-Za-z]has-range-table
25:     /jump to 3
28:     /exactn/3/â// 
33:     /succeed
34:     end of pattern.

while ‘\sw*\u2620’ as:

0:      /on_failure_jump to 8
3:      /syntaxspec/2
5:      /jump to 0
8:      /exactn/3/â// 
13:     /succeed
14:     end of pattern.

Apart from a different opcode to match the word class the crux of the
difference is the first opcode: on_failure_keep_string_jump
vs. on_failure_jump.

As a matter of fact, regex_compile puts a on_failure_jump_smart opcode
at the beginning which is optimised by re_match_2_internal (debug code
removed for brevity):

        /* This operation is used for greedy *.
           Compare the beginning of the repeat with what in the
           pattern follows its end. If we can establish that there
           is nothing that they would both match, i.e., that we
           would have to backtrack because of (as in, e.g., `a*a')
           then we can use a non-backtracking loop based on
           on_failure_keep_string_jump instead of on_failure_jump.  */
        case on_failure_jump_smart:
          EXTRACT_NUMBER_AND_INCR (mcnt, p);
          {
            re_char *p1 = p; /* Next operation.  */
            /* Here, we discard `const', making re_match non-reentrant.  */
            unsigned char *p2 = (unsigned char*) p + mcnt; /* Jump dest.  */
            unsigned char *p3 = (unsigned char*) p - 3; /* opcode location.  */

            p -= 3;             /* Reset so that we will re-execute the
                                   instruction once it's been changed. */

            EXTRACT_NUMBER (mcnt, p2 - 2);

            /* Ensure this is a indeed the trivial kind of loop
               we are expecting.  */
            if (mutually_exclusive_p (bufp, p1, p2))
              {
                /* Use a fast `on_failure_keep_string_jump' loop.  */
                *p3 = (unsigned char) on_failure_keep_string_jump;
                STORE_NUMBER (p2 - 2, mcnt + 3);
              }
            else
              {
                /* Default to a safe `on_failure_jump' loop.  */
                *p3 = (unsigned char) on_failure_jump;
              }
          }

In other words, in our example, the code checks whether ‘[[:word:]]’ can
match ‘💀’.  If it cannot than we can be greedy about matching
‘[[:word:]]*’ and never backtrace looking for a shorter match; if it
can, we may need to backtrace if the overall matching fails.

mutually_exclusive_p concludes that ‘[[:word:]]’ cannot match ‘💀’ (or
any non-ASCII characters really) but as a matter of fact, word class
does match skull character.

So when ‘[[:word:]]*💀’ matches ‘foo💀’ the following happens:
1. ‘[[:word:]]*’ matches the whole string.
2. String is now empty so ‘💀’ doesn’t match.
3. Because of incorrect assumptions, the engine does not shorten the
   initial ‘[[:word:]]*’ match.

(I may be butchering the exact terms and algorithm that is being applied
but the general idea is, I hope, shown).

> Note that some of the classes deliberately don't work on multibyte
> characters, and are documented as such.

This is irrelevant.  ‘[[:word:]]*’ matches ‘foo’ thus ‘[[:word:]]*b’
must match ‘foob’ (which it does) and ‘[[:word:]]*☠’ must match ‘foo☠’
(which it doesn’t).

> including what problems you saw in classes other than [:word:]?

The problem happens for any class which matches multibyte characters.
For example:

    (string-match-p "[[:alpha:]]*" "żółć")
    => 0
    (string-match-p "[[:alpha:]]*w" "żółw")
    => 0
    (string-match-p "[[:alpha:]]*ć" "żółć")
    => nil  (should be 0)
    ;; or even more simply:
    (string-match-p "[[:alpha:]]*ć" "ż")
    => nil  (should be 0)

In general, for a class FOO, if a multibyte character C matches that
class regex "[[:FOO]]*C") should match the character itself but it
doesn’t.

> So if we are changing that, there should be documentation changes and
> an entry in NEWS as well (but I suggest not to make such changes too
> easily, not without measuring the impact on performance, if any).
>
>> * src/regex.c (executing_charset): A new function for executing the
>> charset and charset_not opcodes.  It performs check on the character
>> taking into consideration existing bitmap, rang table and class bits.
>                                              ^^^^
> A typo.
>
>> +#ifdef emacs
>> +  else if (rtp)
>> +    {
>> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
>> +      re_wchar_t range_start, range_end;
>> +
>> +  /* Sort tests by the most commonly used classes with some adjustment to 
>> which
>> +     tests are easiest to perform.  Frequencies of character class names as 
>> of
>> +     2016-07-15:
>
> Not sure what files you used for this.  Are those Emacs source files?
>
>> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
>> new file mode 100644
>> index 0000000..a2dd4f0
>> --- /dev/null
>> +++ b/test/src/regex-tests.el
>> @@ -0,0 +1,75 @@
>> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t 
>> -*-
>        ^^^^^^^^^^^^^^^
>
> Copy-paste error.
>
>> +;;; buffer-tests.el ends here
>
> And another one.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

[Prev in Thread]

Current Thread

[Next in Thread]

bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’, Michal Nazarewicz, 2016/07/18
- bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’, Eli Zaretskii, 2016/07/18
  - bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’, Michal Nazarewicz <=
  - bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020), Michal Nazarewicz, 2016/07/18
    - bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020), Andreas Schwab, 2016/07/19
    - bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020), Michal Nazarewicz, 2016/07/20
    - bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020), Michal Nazarewicz, 2016/07/25
- bug#24020: [PATCH] Fix ‘is multibyte’ test regex.c’s mutually_exclusive_p (bug#24020), Michal Nazarewicz, 2016/07/27

Prev by Date: bug#24014: 24.5; doc string of `cl-reduce'
Next by Date: bug#23917: Please consider making Bug #23917 a blocker for 25.1 (was Re: org-capture: Capture template ‘g’: Match data clobbered by buffer modification hooks)
Previous by thread: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Next by thread: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
Index(es):
- Date
- Thread