bug-grep

bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales


From: Zoltán Herczeg
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST)

Hi Paul,

thank you for the feedback.

>I doubt whether users would care all that much, so long as the default 
>is reasonable.  We don't get complaints about it with 'grep', anyway. 
>But if it's a real problem in the PCRE world, you could provide 
>compile-time or run-time options to satisfy the different opinions.

The situation is worse :( "Reasonable" means something different to everybody.

Just consider these two examples, where \x9c is an incorrectly encoded Unicode 
code point:

/(?<=\x9c)#/

Does it match \xd5\x9c# when the match attempt starts at the #? Noticing 
encoding errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What exactly does this range define? Which valid and invalid UTF-8 byte 
sequences fall inside (and outside) its bounds?

Caseless matching raises yet another question: does /\xe9/ match \xc3\x89 (the 
UTF-8 encoding of É), or the invalid byte sequence \xc9? More generally, 
Unicode defines several character properties. What properties does an invalid 
code point have?

Believe me, depending on their needs, everybody answers these questions 
differently. We don't want to force one group's view onto the others, since 
PCRE is a library. You have much more freedom to define whatever behaviour you 
like, since grep is an end-user program.

>I don't see why.  libpcre can continue with its current implementation, 
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
>that's not a problem.  The problem is the case where users pass 
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
>a slow implementation for this case, and this slow implementation's 
>performance should be improvable without affecting the performance for 
>the PCRE_NO_UTF8_CHECK case.

Regarding performance, this comes from the interpreter:

/* GETUTF8 picks up the remaining bytes of a multi-byte UTF-8 character whose
   first byte is already in c; continuation bytes are read from eptr[1..] with
   no buffer-end or validity checks. */
#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0) \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0) \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0) \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
      ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0) \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine if you had to add buffer-end and other bit checks here. Furthermore, 
Unicode requires every character to be encoded with the fewest possible bytes, 
which means even more checks. You also need to check the current mode. Of 
course we have several macros like this (for performance reasons), and there 
are code paths that assume valid UTF strings. This would increase complexity a 
lot: we would need many extra regression tests, a correct JIT implementation, 
and so on. It would also kill optimizations. For example, if you define a 
character range where every character is two bytes long, the JIT cleverly 
detects this and uses a fast path to discard any non-two-byte UTF code points.
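
Just to illustrate the amount of extra work, here is a rough sketch (my own
example, not PCRE code; the function name and error convention are made up) of
what the 2- and 3-byte branches look like once buffer-end, continuation-byte
and overlong-encoding checks are added:

#include <stddef.h>

/* Hypothetical checked decoder: returns the code point and advances *pp,
   or returns -1 on any malformed sequence.  Only the 2- and 3-byte forms
   are shown; a full decoder also needs the longer forms, surrogate
   rejection, and so on. */
static long
utf8_decode_checked(const unsigned char **pp, const unsigned char *end)
{
  const unsigned char *p = *pp;
  unsigned int c = *p;

  if (c < 0x80)                          /* 1-byte (ASCII) */
    {
      *pp = p + 1;
      return c;
    }

  if ((c & 0xe0) == 0xc0)                /* 2-byte: 110xxxxx 10xxxxxx */
    {
      if (end - p < 2) return -1;                  /* buffer-end check */
      if ((p[1] & 0xc0) != 0x80) return -1;        /* continuation check */
      c = ((c & 0x1f) << 6) | (p[1] & 0x3f);
      if (c < 0x80) return -1;                     /* overlong encoding */
      *pp = p + 2;
      return c;
    }

  if ((c & 0xf0) == 0xe0)                /* 3-byte: 1110xxxx 10xxxxxx 10xxxxxx */
    {
      if (end - p < 3) return -1;
      if ((p[1] & 0xc0) != 0x80 || (p[2] & 0xc0) != 0x80) return -1;
      c = ((c & 0x0f) << 12) | ((p[1] & 0x3f) << 6) | (p[2] & 0x3f);
      if (c < 0x800) return -1;                    /* overlong encoding */
      *pp = p + 3;
      return c;
    }

  return -1;   /* longer forms omitted in this sketch */
}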

The question is, who would be willing to do this work.

>That would chew up CPU resources unnecessarily, by requiring two passes 
>over the input, one for checking UTF-8, the other for doing the actual 
>match.  Granted, it might be faster in real-time than what we have now, 
>but overall it'd probably be more expensive (e.g., more energy 
>consumption) than what we have now, and this doesn't sound promising.

Yeah, but you could add a flag to enable this :) I feel it would be much less 
work than the former approach.
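
From grep's side the two-pass arrangement could look roughly like this (classic
8-bit PCRE API; buffer_is_valid_utf8() is a stand-in for whatever validity scan
the caller provides, not a PCRE function):

#include <pcre.h>

/* Hypothetical helper supplied by the caller (e.g. grep), not by PCRE. */
extern int buffer_is_valid_utf8(const char *buf, int len);

int
match_line(pcre *re, const char *line, int len)
{
  int ovector[30];
  int options = 0;

  /* First pass: check the whole line once.  If it is valid UTF-8, tell
     pcre_exec to skip its own per-call validity check. */
  if (buffer_is_valid_utf8(line, len))
    options |= PCRE_NO_UTF8_CHECK;
  else
    return 0;   /* policy decision: treat invalid lines as non-matching */

  /* Second pass: the actual match, on data already known to be valid. */
  return pcre_exec(re, NULL, line, len, 0, options, ovector, 30) >= 0;
}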

>That doesn't sound like a win, I'm afraid.  The use case that prompted 
>this bug report is someone using 'grep -r' to search for strings like 
>'foobar' in binary data, and this use case would not work with this 
>suggested solution.

In this case I would simply disable UTF-8 decoding. You could still search for 
UTF characters by spelling out their encoded byte sequences (i.e. searching 
for \xe9 as \xc3\xa9).
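
To make that concrete, here is a minimal sketch of the byte-oriented approach
with the classic PCRE API: the caller rewrites the non-ASCII characters of the
pattern into their UTF-8 byte sequences and compiles without PCRE_UTF8, so
nothing is ever decoded or validated (the subject text is just illustrative):

#include <pcre.h>
#include <stdio.h>

int
main(void)
{
  const char *err;
  int erroffset;
  int ovector[30];

  /* Compile in plain byte mode (no PCRE_UTF8): the pattern spells out the
     UTF-8 encoding of U+00E9 as the two literal bytes \xc3 \xa9. */
  pcre *re = pcre_compile("\xc3\xa9", 0, &err, &erroffset, NULL);
  if (re == NULL)
    {
      fprintf(stderr, "compile failed at %d: %s\n", erroffset, err);
      return 1;
    }

  /* The subject may contain arbitrary bytes (even invalid UTF-8); no
     validity check is performed because nothing is decoded. */
  const char subject[] = "caf\xc3\xa9 \x9c binary";
  int rc = pcre_exec(re, NULL, subject, (int) sizeof subject - 1,
                     0, 0, ovector, 30);
  printf("%s\n", rc >= 0 ? "match" : "no match");
  return 0;
}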

Regards,
Zoltan
