bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed

bug-sed

From:	Assaf Gordon
Subject:	bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
Date:	Fri, 5 Aug 2016 10:45:59 -0400
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0

Hello Norihiro,

Thank you for the patch.

On 08/05/2016 09:51 AM, Norihiro Tanaka wrote:

We can speed up sed by caching the result of mbrtowc() for single-byte
characters.  This is especially effective in non-UTF-8 multibyte
locales, where the conversion is expensive.


Regarding this:
====
 #define MBRTOWC(pwc, s, n, ps) \
-  (mb_cur_max == 1 ? \
-   (*(pwc) = btowc (*(unsigned char *) (s)), 1) : \
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   (*(pwc) = mbrtowc_cache[*(unsigned char *) (s)], 1) : \
    mbrtowc ((pwc), (s), (n), (ps)))

 #define MBRLEN(s, n, ps) \
-  (mb_cur_max == 1 ? 1 : mbrtowc (NULL, s, n, ps))
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   1 : mbrtowc (NULL, s, n, ps))
====
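
For readers following along, here is a minimal sketch of how such tables could be filled. The table names come from the hunk above; everything else is my assumption, not taken from the patch: the element types, the hypothetical helper name fill_mbrtowc_cache(), and the idea that each byte value is probed once with mbrtowc() starting from an initial conversion state.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Table names follow the patch; the element types are assumed. */
static signed char mbrlen_cache[256];
static wchar_t mbrtowc_cache[256];

/* Hypothetical helper: probe every byte value once in the current locale.
   It would have to be called again after any locale change. */
static void
fill_mbrtowc_cache (void)
{
  for (int b = 0; b < 256; b++)
    {
      unsigned char c = (unsigned char) b;
      wchar_t wc = L'\0';
      mbstate_t st;
      size_t len;

      memset (&st, 0, sizeof st);       /* start from the initial state */
      len = mbrtowc (&wc, (const char *) &c, 1, &st);

      /* With n == 1 the result is 0, 1, (size_t) -1 or (size_t) -2. */
      assert (len <= 1 || len >= (size_t) -2);

      if (len == (size_t) -1)
        mbrlen_cache[b] = -1;           /* invalid as a first byte */
      else if (len == (size_t) -2)
        mbrlen_cache[b] = -2;           /* incomplete: needs more bytes */
      else
        mbrlen_cache[b] = (signed char) len;

      mbrtowc_cache[b] = wc;            /* meaningful only when len == 1 */
    }
}
```

Note that each entry records what mbrtowc() reports for that byte in isolation, from an initial state, which is what the questions below are about.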

By using a cache table, isn't this code ignoring mbstate?
For example, in the Shift-JIS encoding, the byte 0x5b ('[') can either be a
standalone character or the second byte of a sequence such as '\x83\x5b'.
Wouldn't it also prevent detection of invalid sequences?
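
To make the question concrete, here is a small locale-independent sketch. It does not call mbrtowc(); it hard-codes the commonly documented Shift-JIS lead-byte ranges (0x81-0x9F and 0xE0-0xFC) as an assumption, and all function names are hypothetical. It only shows that a lookup keyed on a single byte gives the same answer whatever preceded that byte.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed Shift-JIS lead-byte ranges (illustration only). */
static int
sjis_is_lead (unsigned char b)
{
  return (0x81 <= b && b <= 0x9F) || (0xE0 <= b && b <= 0xFC);
}

/* What a per-byte table can answer: a length judged from one byte,
   with no knowledge of the bytes before it. */
static size_t
len_from_byte_alone (unsigned char b)
{
  return sjis_is_lead (b) ? 2 : 1;
}

/* Count characters by walking from the start of the buffer, so that
   every lookup happens at a real character boundary. */
static size_t
count_chars_from_start (const unsigned char *s, size_t n)
{
  size_t count = 0;

  assert (s != NULL);
  for (size_t i = 0; i < n; count++)
    {
      size_t len = len_from_byte_alone (s[i]);
      size_t left = n - i;
      i += len <= left ? len : left;    /* truncated final character */
    }
  return count;
}
```

For the two bytes 0x83 0x5b, a walk from the start finds one two-byte character, while a lookup that begins at the second byte reports a one-byte '['. The cached answer is therefore only right if the macro is always entered at a character boundary in the initial state, and I am not sure that is guaranteed.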

As a side note, GNU sed's current implementation has a special code path for
multibyte non-UTF-8 input, so this change is unlikely to affect UTF-8 or C
locales.

regards,
 - assaf
