[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH 3/3] dfa: optimize UTF-8 period
From: |
Jim Meyering |
Subject: |
Re: [PATCH 3/3] dfa: optimize UTF-8 period |
Date: |
Mon, 19 Apr 2010 21:44:31 +0200 |
Paolo Bonzini wrote:
> On 04/19/2010 10:07 AM, Jim Meyering wrote:
>> Paolo Bonzini wrote:
>>> On 04/17/2010 09:27 AM, Jim Meyering wrote:
>>>> Paolo Bonzini wrote:
>>>>> * NEWS: Document improvement.
>>>>> * src/dfa.c (struct dfa): Add utf8_anychar_classes.
>>>>> (add_utf8_anychar): New.
>>>>> (atom): Simplify if/else nesting. Call add_utf8_anychar for ANYCHAR
>>>>> in UTF-8 locales.
>>>>> (dfaoptimize): Abort on ANYCHAR.
>>>>> ---
>>>>> NEWS | 6 ++++++
>>>>> src/dfa.c | 46 +++++++++++++++++++++++++++++++++++++++++++---
>>>>> 2 files changed, 49 insertions(+), 3 deletions(-)
>>>>
>>>> Only quick superficial feedback for now:
>>>
>>> I pushed all patches but this.
>>
>> Thanks!
>>
>> Would you please add comments describing this one in more detail?
>> I ran out of time trying to understand how it works.
>
> Something like this?
>
> /* For UTF-8 expand the period to a series of CSETs that define a valid
> UTF-8 character. This avoids using the slow multibyte path. I'm
> pretty sure it would be both profitable and correct to do it for
> any encoding; however, the optimization must be done manually as
> it is done above in add_utf8_anychar. So, let's start with
> UTF-8: it is the most used, and the structure of the encoding
> makes the correctness more obvious. */
Thanks. That helps.
The key for me was to realize where this data came from:
+ static charclass utf8_classes[5] = {
+ { 0, 0, 0, 0, ~0, ~0, 0, 0 },
+ { ~0, ~0, ~0, ~0, 0, 0, 0, 0 },
+ { 0, 0, 0, 0, 0, 0, 0xfffffffcU, 0 },
+ { 0, 0, 0, 0, 0, 0, 0, 0xffff },
+ { 0, 0, 0, 0, 0, 0, 0, 0xff0000 }
+ };
+ const int n = sizeof (utf8_classes) / sizeof (utf8_classes[0]);
+ int i;
...
Then, this comment made a lot more sense:
+ /* (B|CA|DAA|EAAA) = (B|(C|(D|EA)A)A) =
+ B (C (D (E A CAT) OR A CAT) OR A CAT) OR */
A tiny comment saying that this describes the grammar of a UTF-8
character would go far in helping the uninitiated understand.
One final nit:
Please use "unsigned int" (not "int" and not "unsigned")
for both "i" and "n" above, since both are always non-negative.
That helps the reader, since then we don't have to wonder what
happens if they go negative, or if that can happen.
Thanks!
Re: [PATCH 1/3] dfa: simplify dfainit, Jim Meyering, 2010/04/17