bug#18777: [PATCH] dfa: improvement for checking of multibyte character
From: Eric Blake
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Mon, 20 Oct 2014 10:07:20 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0

On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input strings that do not match
> even the first part of a pattern. Although the effect on grep is small,
> since it uses a superset of the DFA, gawk speeds up by about 40%.
>
>
> When a newline is found, we can skip the multibyte character boundary
> check before that character, since we assume newline is a single-byte
> character.
> by that.
POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint. (The
same is not true for almost any other character; even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those byte values from also appearing inside multibyte
characters.) Is it worth extending your optimization to all five of the
POSIX-guaranteed single-byte characters?
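
To make the suggestion concrete, here is a rough sketch of what treating
all five bytes as synchronization points might look like. It is not taken
from dfa.c or from the patch; the function names are hypothetical, and it
assumes the POSIX guarantee described above holds for the current locale.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of dfa.c: true for the five bytes that,
   under the POSIX rule discussed above, are always single-byte characters
   and never occur inside a multibyte character.  */
static bool
is_always_single_byte (unsigned char c)
{
  return c == '\0' || c == '/' || c == '.' || c == '\n' || c == '\r';
}

/* Hypothetical helper: return the largest offset <= POS in BUF that is
   known to be a character boundary without decoding anything.  That is
   either offset 0 or the offset just past one of the five bytes.  A
   caller would only need to decode multibyte characters forward from
   the returned offset up to POS.  */
static size_t
known_boundary_before (const char *buf, size_t pos)
{
  while (pos > 0 && !is_always_single_byte ((unsigned char) buf[pos - 1]))
    pos--;
  return pos;
}
```

With only newline as a synchronization point, the backward scan stops at
line starts; with all five bytes it can stop earlier on inputs such as
file names or sentences, at the cost of a slightly wider comparison.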
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org