bug#18777: [PATCH] dfa: improvement for checking of multibyte character

bug-grep

From:	arnold
Subject:	bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date:	Tue, 21 Oct 2014 08:43:47 -0600
User-agent:	Heirloom mailx 12.4 7/29/08

Hi.

Norihiro Tanaka <address@hidden> wrote:

> address@hidden wrote:
> > I would think adding a check for '\r' would be safe and would help
> > too; given that on Windows systems '\r' generally occurs just as
> > frequently as '\n', it should give a nice speedup for gawk on those
> > systems.
>
> As I understand it, DFA and regex don't support multi-byte eolbytes
> such as CR-LF, so I can't see where we could use the change.  Grep
> converts Windows text to Unix text by removing CR in advance.

Gawk does not remove CR in advance, unless someone specifically
set RS = "\r\n", in which case the full regex matcher is used
to first find \r\n in the raw input buffer.

So for gawk, adding a check for (c == eolbyte || c == '\r')
should produce more speedup on Windows.

(Hmm, on Windows the default is probably text mode, which causes
the library/OS to hide the \r anyway. Harumph.  But if binary mode
was requested then it could still make a difference.)
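
As a rough illustration of the (c == eolbyte || c == '\r') check
discussed above, here is a minimal sketch in C.  It is not taken from
dfa.c or from the patch: the helper names is_sync_byte and
last_sync_point are hypothetical, and the sketch assumes that neither
eolbyte nor '\r' can occur inside a multibyte character in the current
encoding, so either byte can serve as a known character boundary when
scanning backward through a buffer.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from dfa.c): true if byte C may be treated
   as a known character boundary.  Assumes that neither EOLBYTE nor
   '\r' ever appears inside a multibyte character.  */
static bool
is_sync_byte (unsigned char c, unsigned char eolbyte)
{
  return c == eolbyte || c == '\r';
}

/* Hypothetical helper: scan backward from POS (exclusive) and return
   the index just past the nearest sync byte, or 0 if the scan reaches
   the start of BUF without finding one.  */
static size_t
last_sync_point (const unsigned char *buf, size_t pos,
                 unsigned char eolbyte)
{
  while (pos > 0)
    {
      if (is_sync_byte (buf[pos - 1], eolbyte))
        return pos;
      pos--;
    }
  return 0;
}
```

With input that contains bare \r bytes (for example, a file read in
binary mode on Windows), the backward scan in this sketch can stop at a
\r instead of continuing to the previous eolbyte, which is where any
speedup would come from.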

> BTW, although I say `newline', correctly notice that it's `eolbyte'
> which mayn't be either LF or NUL.

Understood and agreed.

Adding a check for \r isn't a big deal in any case, but of the 5
characters Eric mentioned originally, that is the only one where I
see a potential for a check to really make a difference.

Thanks!

Arnold
