From: arnold
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Tue, 21 Oct 2014 00:23:07 -0600
User-agent: Heirloom mailx 12.4 7/29/08
Norihiro Tanaka <address@hidden> wrote:
> Eric Blake <address@hidden> wrote:
> > Is it worth extending your optimization to all five of the
> > POSIX-guaranteed single byte characters?
>
> Thanks, but I don't want to do that immediately. DFA already
> treats newline as a single-byte character, but not the others yet. So
> we may need to make many changes to handle invalid locales and sequences
> that do not conform to the rule. If we omitted that, limits might be
> placed on the locales that DFA can be applied to. Therefore, it should
> be done carefully.
I would think adding a check for '\r' would be safe and would help
too; given that on Windows systems '\r' generally occurs just as
frequently as '\n', it should give a nice speedup for gawk on those
systems.
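A minimal sketch of such a check, in C. The helper names are hypothetical and
not taken from dfa.c; the sketch only assumes that '\n' and '\r' never occur
inside a multibyte character, so either one marks a safe place to restart
scanning.

```c
#include <stddef.h>

/* Hypothetical helper (not from dfa.c): nonzero if C is a byte that is
   assumed never to be part of a multibyte character.  */
static int
is_safe_boundary_byte (unsigned char c)
{
  return c == '\n' || c == '\r';
}

/* Scan backward from POS and return the offset just past the nearest
   safe boundary byte, or 0 if there is none before POS.  */
static size_t
last_safe_boundary (const char *buf, size_t pos)
{
  while (pos > 0)
    {
      if (is_safe_boundary_byte ((unsigned char) buf[pos - 1]))
        return pos;
      pos--;
    }
  return 0;
}
```

With CRLF line endings, the backward scan stops at the same offset as a
'\n'-only check would; the extra test only pays off for input where '\r'
appears without a following '\n'.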
The other characters that Eric cited seem like less of an issue to me.
Thanks,
Arnold