bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed

bug-sed

From:	Norihiro Tanaka
Subject:	bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date:	Sat, 06 Aug 2016 16:13:27 +0900

On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <address@hidden> wrote:

> Hello Norihiro,
> 
> Thank you for the patch.
> 
> By using a cache table, isn't this code ignoring mbstate ?
> For example, in shift-jis encoding, the character '[' can either be standalone,
> or a second character in a sequence such as '\x83\x5b' ?
> Wouldn't it also prevent detection of invalid sequences ?
> 
> As a side-note, gnu sed's current implementation has special code path for
> multibyte-non-utf8 input, so this change will not likely affect utf8 or C locales.
> 
> regards,
>   - assaf

Hi Assaf,

Thanks for the review.

When MBRTOWC() or MBRLEN() is called in a shift-jis locale, mbstate is
always in the initial state, or in a state equivalent to it, unless an
invalid or incomplete sequence has been found, because shift-jis is a
stateless encoding.

Even when such a sequence is found, mbstate has to be reset to the
initial state manually in order to check the following characters in the
string.  So I think that we can ignore mbstate in stateless encodings.
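
To illustrate the argument above, here is a minimal sketch in C.  It is
not the code from the patch, and the names cached_len, cached_wc,
init_cache, cached_mbrtowc and count_chars are hypothetical.  It assumes
a stateless encoding, so every conversion may start from the initial
state.  The table is consulted only for the byte at a character boundary,
so a trail byte such as 0x5b in '\x83\x5b' is never looked up on its own,
and invalid or incomplete sequences still fall through to mbrtowc() and
are reported.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical per-byte cache, filled from the initial state:
   1 = complete single-byte character, -1 = invalid byte,
   -2 = first byte of a longer sequence.  */
static signed char cached_len[UCHAR_MAX + 1];
static wchar_t cached_wc[UCHAR_MAX + 1];

/* Fill the cache for the current locale (call after setlocale).  */
static void
init_cache (void)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    {
      unsigned char b = (unsigned char) i;
      mbstate_t s;
      wchar_t wc;
      memset (&s, 0, sizeof s);
      size_t len = mbrtowc (&wc, (const char *) &b, 1, &s);
      /* mbrtowc returns 0 for the null byte; treat it as length 1.  */
      cached_len[i] = len <= 1 ? 1 : (len == (size_t) -2 ? -2 : -1);
      cached_wc[i] = len <= 1 ? wc : L'\0';
    }
}

/* Convert one character at P (N bytes available, N > 0).  Returns its
   length, or (size_t) -1 for an invalid or incomplete sequence.  */
static size_t
cached_mbrtowc (wchar_t *pwc, const char *p, size_t n)
{
  assert (n > 0);
  unsigned char b = (unsigned char) *p;
  if (cached_len[b] == 1)
    {
      *pwc = cached_wc[b];
      return 1;
    }
  /* Slow path.  A fresh initial state per call plays the role of the
     manual reset after an invalid or incomplete sequence.  */
  mbstate_t s;
  memset (&s, 0, sizeof s);
  size_t len = mbrtowc (pwc, p, n, &s);
  if (len == (size_t) -1 || len == (size_t) -2)
    return (size_t) -1;
  return len ? len : 1;
}

/* Count the characters in BUF; bytes that do not start a valid
   character are counted in *INVALID and skipped one at a time.  */
static size_t
count_chars (const char *buf, size_t n, size_t *invalid)
{
  size_t chars = 0;
  *invalid = 0;
  for (size_t i = 0; i < n; )
    {
      wchar_t wc;
      size_t len = cached_mbrtowc (&wc, buf + i, n - i);
      if (len == (size_t) -1)
        {
          ++*invalid;
          len = 1;
        }
      else
        chars++;
      i += len;
    }
  return chars;
}
```

A caller would run setlocale (LC_ALL, "") and then init_cache () once,
before the first call to cached_mbrtowc () or count_chars ().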

However, that assumption is wrong for stateful encodings such as ISO-2022
and UTF-7.  Does sed support stateful encodings that have shift sequences?
At least, it seems that regex does not support stateful encodings.
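
As a side illustration, ISO C offers a way to ask whether the current
locale's encoding is state-dependent: mblen (NULL, 0) returns nonzero in
that case.  The sketch below wraps it in a hypothetical helper,
encoding_is_stateful (), that a per-byte cache could check before being
enabled.  Whether sed or the patch performs such a check is not
something this message states.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: true if the current locale's multibyte encoding
   has state-dependent (shift-sequence) encodings.  Per ISO C, calling
   mblen with a null pointer returns nonzero in that case.  */
static bool
encoding_is_stateful (void)
{
  return mblen (NULL, 0) != 0;
}
```

In the default "C" locale this is expected to return false.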

Thanks,
Norihiro
