bug-grep

bug#18777: [PATCH] dfa: improvement for checking of multibyte character


From: Eric Blake
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Mon, 20 Oct 2014 10:07:20 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input strings that do not match
> even the first part of a pattern.  Although it has little effect on
> grep, which uses a superset of the DFA, gawk speeds up by about 40%.
> 

> 
> When a newline is found, we can skip the multibyte character boundary
> check before it, since we assume newline is a single-byte character.

POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint.  (The
same is not true for almost any other character; even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those characters from also being bytes embedded within multibyte
characters.)  Is it worth extending your optimization to all five of the
POSIX-guaranteed single-byte characters?
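
To make the suggestion concrete, here is a minimal sketch in C of what
treating all five bytes as safe boundaries might look like.  It is not
taken from dfa.c: the names is_sync_byte and last_sync_point are
hypothetical, and the sketch assumes a locale that satisfies the POSIX
guarantee described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from dfa.c): true for the five bytes that,
   per the POSIX guarantee described above, are always single-byte
   characters and never occur inside a multibyte character.  */
static bool
is_sync_byte (unsigned char c)
{
  return c == '\0' || c == '/' || c == '.' || c == '\n' || c == '\r';
}

/* Hypothetical helper: return the offset just past the last sync byte
   found before POS in BUF[0..LEN), or 0 if there is none.  The offset
   returned is always a character boundary, so a scanner could restart
   its multibyte boundary check there rather than at the buffer start.  */
static size_t
last_sync_point (const char *buf, size_t len, size_t pos)
{
  assert (pos <= len);
  for (size_t i = pos; i > 0; i--)
    if (is_sync_byte ((unsigned char) buf[i - 1]))
      return i;
  return 0;
}
```

With the buffer "ab/cd" and pos 5, last_sync_point returns 3, the offset
just past the slash.  Whether the extra comparisons pay off against
checking newline alone would need measuring.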

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


