bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Norihiro Tanaka
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 19 Dec 2014 23:00:38 +0900

On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <address@hidden> wrote:
> >
> > 0xE0 0xC2 0xFF
> > 0xED 0xA0 0xFF
> > 0xF0 0xBF 0xFF 0xFF
>
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).
I got them from pcre_valid_utf8(), but I made some mistakes. The correct
sequences are as follows:
0xE0 0xAF 0xBF
0xED 0xA0 0xBF
0xF0 0x8F 0xBF 0xBF
By the way, they correspond to the following code in pcre_valid_utf8():
    if (c == 0xe0 && (d & 0x20) == 0)
      {
        *erroroffset = (int)(p - string) - 2;
        return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
        *erroroffset = (int)(p - string) - 2;
        return PCRE_UTF8_ERR14;
      }
    ........
    if (c == 0xf0 && (d & 0x30) == 0)
      {
        *erroroffset = (int)(p - string) - 3;
        return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
        *erroroffset = (int)(p - string) - 3;
        return PCRE_UTF8_ERR13;
      }
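
To make the four conditions above easier to try in isolation, here is a minimal sketch that applies them to a lead byte c and its second byte d. The enum and the function name classify_pair() are hypothetical, not part of PCRE. The sketch assumes that d has already been checked to be a continuation byte (10xxxxxx) and it ignores the remaining bytes of the sequence.

```c
/* Hypothetical result codes; these are not PCRE's own error numbers. */
enum pair_status
{
  PAIR_OK,          /* none of the four checks fired */
  PAIR_OVERLONG_3,  /* same condition as the PCRE_UTF8_ERR16 branch */
  PAIR_SURROGATE,   /* same condition as the PCRE_UTF8_ERR14 branch */
  PAIR_OVERLONG_4,  /* same condition as the PCRE_UTF8_ERR17 branch */
  PAIR_TOO_LARGE    /* same condition as the PCRE_UTF8_ERR13 branch */
};

/* Apply the four quoted conditions to lead byte c and second byte d.
   Assumption: the caller has already verified (d & 0xc0) == 0x80. */
enum pair_status
classify_pair (unsigned char c, unsigned char d)
{
  if (c == 0xe0 && (d & 0x20) == 0)
    return PAIR_OVERLONG_3;   /* 3-byte form of a code point below U+0800 */
  if (c == 0xed && d >= 0xa0)
    return PAIR_SURROGATE;    /* U+D800..U+DFFF */
  if (c == 0xf0 && (d & 0x30) == 0)
    return PAIR_OVERLONG_4;   /* 4-byte form of a code point below U+10000 */
  if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
    return PAIR_TOO_LARGE;    /* above U+10FFFF */
  return PAIR_OK;
}
```

With this sketch, classify_pair (0xed, 0xa0) gives PAIR_SURROGATE, classify_pair (0xf0, 0x8f) gives PAIR_OVERLONG_4, and classify_pair (0xf4, 0xbf) gives PAIR_TOO_LARGE.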
> BTW,
>
> printf "\xF4\xBF\xBF\xBF\n" | grep .
>
> finds a match, and this appears to be a bug (grep should follow
> the current standard).
I agree that it is a bug. mbrlen() in glibc returns (size_t) -1
for that sequence.
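
For anyone who wants to check the mbrlen() result themselves, here is a small sketch. The helper name first_char_len() is hypothetical. The caller is assumed to have selected a UTF-8 locale with setlocale() beforehand, and locale names such as "en_US.UTF-8" vary between systems.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Return what mbrlen() reports for the first multibyte character in the
   n bytes at s, starting from a fresh conversion state.  The result
   depends on the current LC_CTYPE locale, which the caller must set. */
size_t
first_char_len (const char *s, size_t n)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);   /* initial shift state */
  return mbrlen (s, n, &st);
}
```

After a successful setlocale (LC_ALL, "en_US.UTF-8"), the call first_char_len ("\xF4\xBF\xBF\xBF", 4) can be compared against (size_t) -1 to reproduce the observation above.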