bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
From: Norihiro Tanaka
Subject: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date: Sat, 06 Aug 2016 16:13:27 +0900
On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <address@hidden> wrote:
> Hello Norihiro,
>
> Thank you for the patch.
>
> By using a cache table, isn't this code ignoring mbstate ?
> For example, in shift-jis encoding, the character '[' can either be standalone,
> or a second character in a sequence such as '\x83\x5b' ?
> Wouldn't it also prevent detection of invalid sequences ?
>
> As a side-note, gnu sed's current implementation has special code path for multibyte-non-utf8 input,
> so this change will not likely affect utf8 or C locales.
>
> regards,
> - assaf
Hi Assaf,
Thanks for the review.
Shift-JIS is a stateless encoding, so when MBRTOWC() or MBRLEN() is
called in a Shift-JIS locale, mbstate is always in the initial state,
or in a state equivalent to it, unless an invalid or incomplete
sequence has been found.
Even when such a sequence is found, mbstate has to be reset to the
initial state manually before the following characters in the string
can be checked. So I think that we can ignore mbstate in a stateless
encoding.
However, that assumption is wrong for stateful encodings such as
ISO-2022 and UTF-7. Does sed support stateful encodings that have shift
sequences? At least, it seems that regex does not support stateful
encodings.
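To make the argument concrete, here is a rough sketch of the idea. It is not
the code from the patch: the names first_byte_result, init_cache and
cached_char_len are made up for illustration, and treating an invalid or
incomplete sequence as a single byte is an assumption of this sketch. The
table is filled from the initial state only, and it is skipped entirely when
mblen (NULL, 0) reports a state-dependent encoding.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical table: the mbrtowc() result for each byte value taken
   alone, starting from the initial conversion state.  */
static size_t first_byte_result[UCHAR_MAX + 1];

/* Fill the table for the current locale.  Return false if the encoding
   is state-dependent, in which case the table must not be used.  */
static bool
init_cache (void)
{
  /* With a null pointer, mblen() returns nonzero when the encoding
     has shift states.  */
  if (mblen (NULL, 0) != 0)
    return false;

  for (int i = 0; i <= UCHAR_MAX; i++)
    {
      char c = (char) i;
      wchar_t wc;
      mbstate_t st;
      memset (&st, 0, sizeof st);   /* all-zero is the initial state */
      first_byte_result[i] = mbrtowc (&wc, &c, 1, &st);
    }
  return true;
}

/* Length in bytes of the character at S, with N bytes available.
   Must only be called at a character boundary, so a trail byte such
   as the 0x5b in "\x83\x5b" is never looked up on its own.  */
static size_t
cached_char_len (const char *s, size_t n, mbstate_t *st)
{
  assert (n > 0);

  /* Fast path: the byte is a complete character by itself.  */
  if (first_byte_result[(unsigned char) *s] == 1)
    return 1;

  size_t len = mbrlen (s, n, st);
  if (len == (size_t) -1 || len == (size_t) -2)
    {
      /* Invalid or incomplete sequence: reset the state manually and
         step over one byte (an assumption of this sketch).  */
      memset (st, 0, sizeof *st);
      return 1;
    }
  return len == 0 ? 1 : len;    /* a null character occupies one byte */
}
```

A caller would run init_cache () once after setlocale () and fall back to
plain mbrlen () when it returns false. Because the scan always advances by
the returned length, the lookup only ever sees the first byte of a character.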
Thanks,
Norihiro