From: Paul Eggert
Subject: mbcel module for Gnulib?
Date: Sun, 9 Jul 2023 02:21:19 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0
Some of the performance penalty is due to Gnulib's mbrtoc32 module replacing mbrtoc32 on glibc. As I understand it, the replacement is there because glibc mishandles the C locale (it treats non-ASCII bytes as encoding errors). That bug should not affect diffutils, as diffutils uses mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the patch I just installed into diffutils on Savannah, this is done via a conditional "#undef mbrtoc32" (see attached), but this is a hack and there should be a better way.
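To make the hack concrete, here is a minimal sketch of what such a conditional "#undef" could look like. It is not the installed patch: the __GLIBC__ test and the helper decode_one are assumptions made for illustration, and it presumes the replacement is installed as a macro (something like "#define mbrtoc32 rpl_mbrtoc32").

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Assumption: any replacement of mbrtoc32 is done with a macro.
   On glibc, drop that macro so calls go to the native function.
   The exact condition used in diffutils may differ.  */
#if defined __GLIBC__ && defined mbrtoc32
# undef mbrtoc32
#endif

/* Hypothetical helper: decode one character from the N bytes at S,
   starting from the initial shift state.  Returns what mbrtoc32
   returns.  Meant to be called only in multi-byte locales.  */
static size_t
decode_one (char32_t *out, char const *s, size_t n)
{
  assert (n > 0);
  mbstate_t st = {0};
  return mbrtoc32 (out, s, n, &st);
}
```

The drawback is the one mentioned above: the caller has to know how the replacement is wired up, which is why a supported way to opt out would be better.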
More of the performance penalty appears to come from the mbiter module's support for arbitrary character encodings that don't happen in practice, or that at least are so rare that diffutils need not worry about them. To work around this problem I wrote a simple, fast iterator "mbcel" that I hope works on all the platforms Gnulib normally targets. mbcel uses a functional style (that is, its only function mbcel_scan is pure in the GCC sense, with no side effects), and this should help make calling code clearer (and, I hope, more efficient).
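As a rough illustration of the functional style, here is a sketch of a scanner that returns its result by value instead of updating an iterator object. It is not the attached lib/mbcel.h: the names cel_t, cel_scan and cel_count are hypothetical, the single-byte fast path assumes that bytes below 0x80 always stand for themselves, and since mbrtoc32 may set errno the sketch is side-effect-free only in spirit.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Result of scanning one character (hypothetical layout).  */
typedef struct
{
  char32_t ch;        /* decoded character; 0 on encoding error */
  unsigned char err;  /* 0 if valid, else the offending byte */
  unsigned char len;  /* bytes consumed; always at least 1 */
} cel_t;

/* Scan the character starting at P; LIM points just past the buffer.
   The caller must ensure P < LIM.  */
static cel_t
cel_scan (char const *p, char const *lim)
{
  assert (p < lim);

  /* Fast path.  Assumption: bytes below 0x80 are single-byte
     characters in every locale of interest.  */
  unsigned char c = (unsigned char) *p;
  if (c < 0x80)
    return (cel_t) { .ch = c, .len = 1 };

  /* Always start from the initial state, so no state is carried
     from one call to the next.  */
  mbstate_t st = {0};
  char32_t ch;
  size_t n = mbrtoc32 (&ch, p, (size_t) (lim - p), &st);

  /* (size_t) -1, -2 and -3 all exceed SIZE_MAX / 2.  Treat them,
     and an unexpected 0, as a one-byte encoding error.  */
  if (n == 0 || n > (size_t) -1 / 2)
    return (cel_t) { .err = c, .len = 1 };

  return (cel_t) { .ch = ch, .len = (unsigned char) n };
}

/* Count the characters (including encoding errors) in BUF.  */
static size_t
cel_count (char const *buf, size_t size)
{
  char const *lim = buf + size;
  size_t count = 0;
  for (char const *p = buf; p < lim; p += cel_scan (p, lim).len)
    count++;
  return count;
}
```

Because every call consumes at least one byte and returns everything the caller needs, a loop like the one in cel_count needs no iterator object and always terminates, even on invalid input.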
I timed mbcel on the Emacs source code and it scanned the input significantly faster than mbiter did. So I installed it into diffutils on Savannah, as part of diffutils's new support for multi-byte locales.
I'm thinking that mbcel would be useful in Gnulib and in other GNU programs, and that we should create an mbcel module for it in Gnulib. A copy of its only file lib/mbcel.h is attached. The idea is to have an option that is simple and fast, albeit not portable to theoretical platforms.
Attachment: mbcel.h
Description: Text Data