bug-gnulib

mbcel module for Gnulib?


From: Paul Eggert
Subject: mbcel module for Gnulib?
Date: Sun, 9 Jul 2023 02:21:19 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0

GNU diffutils has long lacked support for multi-byte locales for options like --ignore-case (-i) and --ignore-space-change (-b), and the recent char32_t changes to diffutils/src/side.c inspired me to fix this. As it was a pain for diffutils to use mbrtoc32 directly, I looked into using Gnulib's mbiter module to iterate through diffutils input. However, mbiter's generality had a performance penalty.
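To illustrate why calling mbrtoc32 directly is awkward, here is a minimal sketch of a character-counting loop. It is not the diffutils code; the helper name count_chars and its policy of treating each undecodable byte as one character are assumptions made for illustration only.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the N bytes at S,
   calling mbrtoc32 directly.  Each byte that cannot be decoded
   counts as one character (an assumed policy, not diffutils's).  */
size_t
count_chars (char const *s, size_t n)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  size_t count = 0;
  char const *p = s;
  char const *lim = s + n;

  while (p < lim)
    {
      char32_t ch;
      size_t len = mbrtoc32 (&ch, p, lim - p, &state);

      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Encoding error or truncated sequence: the state is now
             unspecified or mid-character, so reset it and skip one byte.  */
          memset (&state, 0, sizeof state);
          len = 1;
        }
      else if (len == 0)
        len = 1;   /* A null character still occupies one byte.  */
      else if (len == (size_t) -3)
        len = 0;   /* No input consumed; not expected with char32_t.  */

      p += len;
      count++;
    }
  return count;
}
```

Every caller has to carry the mbstate_t around and repeat the same handling of the special return values, which is the kind of bookkeeping an iterator is meant to hide.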

Some of the performance penalty is due to Gnulib's mbrtoc32 module replacing mbrtoc32 on glibc. As I understand it, this is due to glibc's mishandling of the C locale (it treats non-ASCII bytes as encoding errors). Such a bug should not affect diffutils, as diffutils uses mbrtoc32 only in multi-byte locales. So I'd like a way for diffutils to use the mbrtoc32 module without replacing mbrtoc32 on glibc. In the patch I just installed into diffutils on Savannah, this is done via a conditional "#undef mbrtoc32" (see attached) but this is a hack and there should be a better way.

More of the performance penalty appears to come from the mbiter module's support for arbitrary character encodings that do not occur in practice - or at least, if they do occur, they are so rare that diffutils need not worry about them. To work around this problem I wrote a simple, fast iterator "mbcel" that I hope works on all the platforms Gnulib normally targets. mbcel uses a functional style (that is, its only function mbcel_scan is pure in the GCC sense, with no side effects), and this should help make calling code clearer (and, I hope, more efficient).
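The functional style described above might look roughly like the following sketch. It is not the contents of the attached mbcel.h: the names cel_t, cel_scan and cel_count, the field layout, and the ASCII fast path are all hypothetical, and the sketch assumes an ASCII-compatible encoding in which every character can be decoded starting from the initial shift state.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical result type: one decoded character,
   or one byte that could not be decoded.  */
typedef struct
{
  char32_t ch;        /* decoded character, or 0 if ERR is nonzero */
  unsigned char err;  /* the offending byte on failure, else 0 */
  unsigned char len;  /* bytes consumed; always at least 1 */
} cel_t;

/* Scan one character from the bytes starting at P and ending just
   before LIM, where P < LIM.  The result is returned by value and no
   conversion state survives the call, so the function has no side
   effects visible to the caller.  */
cel_t
cel_scan (char const *p, char const *lim)
{
  unsigned char byte = *p;

  /* Fast path.  Assumption: bytes below 0x80 are single-byte
     characters equal to their ASCII code points.  */
  if (byte < 0x80)
    return (cel_t) { .ch = byte, .len = 1 };

  mbstate_t state;
  memset (&state, 0, sizeof state);
  char32_t ch;
  size_t len = mbrtoc32 (&ch, p, lim - p, &state);

  /* The special values (size_t) -1, -2 and -3 all exceed the number
     of available bytes, so one comparison catches every failure.  */
  if (len > (size_t) (lim - p))
    return (cel_t) { .err = byte, .len = 1 };

  return (cel_t) { .ch = ch, .len = len ? len : 1 };
}

/* Example caller: count the characters in the N bytes at BUF.  */
size_t
cel_count (char const *buf, size_t n)
{
  size_t count = 0;
  char const *lim = buf + n;
  for (char const *p = buf; p < lim; p += cel_scan (p, lim).len)
    count++;
  return count;
}
```

Because the scanner always consumes at least one byte and keeps no state between calls, the calling loop reduces to a single pointer advance with no error branches of its own.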

I timed mbcel on the Emacs source code and it scanned the input significantly faster than mbiter did. So I installed it into diffutils on Savannah, as part of diffutils's new support for multi-byte locales.

I'm thinking that mbcel would be useful in Gnulib and in other GNU programs, and that we should create a mbcel module for it in Gnulib. A copy of its only file lib/mbcel.h is attached. The idea is to have an option that is simple and fast, albeit not portable to theoretical platforms.

Attachment: mbcel.h
Description: Text Data

