bug-gnu-utils

Re: UTF-8 locale and \n in regexps


From: Aharon Robbins
Subject: Re: UTF-8 locale and \n in regexps
Date: Tue, 24 Apr 2007 22:02:18 +0300

Greetings. Re this:

> Date: Thu, 19 Apr 2007 17:09:02 +0300
> From: Pekka Pessi <address@hidden>
> Subject: UTF-8 locale and \n in regexps
> To: address@hidden
> Cc: address@hidden
>
> Hello,
>
> It looks like a regexp with \n in [^] behaves badly if the locale has
> a UTF-8 ctype.
>
> It looks like if there is a \n and a range without \n, like /\n[^x\n]foo/,
> and the first \n ends an even-numbered line within the string, the regexp
> does not match.
>
> Please see the attached script for a demonstration.
>
> --Pekka Pessi
>
> [ test case removed ]

As I mentioned in my earlier mail, the match function should be using the
full matcher. Gawk was relying on the dfa matcher to say whether there really
is a match, and the dfa matcher is (unfortunately) lying. With
the following workaround, gawk behaves correctly.

This will make its way to the CVS archive soon.

I will be adding your program to the test suite, if you don't mind.

Thanks,

Arnold
------------------------------------------------------
Tue Apr 24 21:55:36 2007  Arnold D. Robbins  <address@hidden>

        * re.c (research): In the multibyte case, fall back to the full
        matcher if need_start, since there are bugs in the dfa matcher
        in some obscure cases.  Sigh.

===================================================================
RCS file: /d/mongo/cvsrep/gawk-stable/re.c,v
retrieving revision 1.2
diff -u -r1.2 re.c
--- re.c        6 Apr 2007 12:49:08 -0000       1.2
+++ re.c        24 Apr 2007 18:55:21 -0000
@@ -225,8 +225,15 @@
         *
         * The dfa matcher doesn't have a no_bol flag, so don't bother
         * trying it in that case.
+        *
+        * 4/2007: Grrrr.  The dfa matcher has bugs in certain multibyte
+        * cases that are just too deeply buried to ferret out. Don't
+        * let this kill us if we need_start.  (This may be too narrowly
+        * focused, perhaps we should relegate the DFA matcher to the
+        * single byte case all the time. OTOH, the speed difference
+        * between the matchers is non-trivial... Sigh.)
         */
-       if (rp->dfa && ! no_bol) {
+       if (rp->dfa && ! no_bol && (gawk_mb_cur_max == 1 || ! need_start)) {
                char save;
                int count = 0;
                /*
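
For readers who want the new condition in isolation, here is a minimal
sketch of the guard as a stand-alone predicate. The function name
use_dfa_first and its parameters are hypothetical and do not exist in gawk;
they only mirror the variables the patch tests (rp->dfa, no_bol,
gawk_mb_cur_max, need_start). It assumes nothing about the rest of
research().

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of gawk: returns true when the dfa
 * matcher may be tried before the full regex matcher.
 *
 *   have_dfa    - a dfa was compiled for this regexp (rp->dfa in the patch)
 *   no_bol      - caller says the buffer does not start at beginning of line
 *   mb_cur_max  - maximum bytes per character in the current locale
 *   need_start  - caller needs the match position, not just yes/no
 */
bool
use_dfa_first(bool have_dfa, bool no_bol, int mb_cur_max, bool need_start)
{
	/*
	 * Same shape as the patched test: in a multibyte locale the dfa
	 * result is not trusted when the start of the match is needed, so
	 * the caller should go straight to the full matcher.
	 */
	return have_dfa && ! no_bol && (mb_cur_max == 1 || ! need_start);
}
```

In a single-byte locale (mb_cur_max == 1) the predicate behaves exactly
like the old test, so the speed of the dfa matcher is kept there. Only the
multibyte case with need_start set skips the dfa matcher.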



