GNU regex: gnulib, glibc, stability and the future

bug-gnulib

From:	Reuben Thomas
Subject:	GNU regex: gnulib, glibc, stability and the future
Date:	Fri, 17 Sep 2010 19:42:03 +0100

Recent discussions have highlighted some interesting issues.

There's an aim to sync glibc with gnulib's copy of GNU regex. (Great!
Any help needed? Or is it just slow because syncing with glibc is a
slow process?) As a result, there seems to be a desire not to change
the API or ABI further. As far as syncing with glibc goes, this is
obviously a good thing. However, there are deficiencies in the GNU
API. For example:

1. Lack of thread-safe API to set syntax and various other options.

2. Translation tables are 8-bit only (and don't make much sense in a
wide-character context, at least in their present form, as you can't
sensibly have 4GB translation tables; the most common use is taken
care of by RE_ICASE).

3. Missing functionality (no way to do plain text searching, e.g. via
the RE_PLAIN flag that I proposed, or Bruno's regexp-quote).

4. Inefficiencies (e.g. for certain cases matching wide-character
strings, especially UTF-8 encoding, as discussed on bug-grep, and
represented in the code by e.g. the patch that converts UTF-8 back to
ASCII where possible). I mention this, although it is not an API/ABI
issue, because it can nonetheless involve large code changes which are
hard to get into glibc.

5. Special-case APIs that do not obviously have a place in a
general-purpose library (I'm thinking of the split-buffer functions
that are only really useful to interactive editors, and editors which
are implemented in a particular way at that). Emacs uses them; does
anything else?

6. One library, two APIs (GNU and POSIX); some functionality is only
available via one API, some via the other. The majority of application
code uses the POSIX API; the majority of extra functionality is in the
GNU API. This should be rectified. I'd like to see a single API, which
is a backwards-compatible (API & ABI) extension of the POSIX API,
thereby providing facilities like backwards searching to applications
that use the POSIX API, and plain-text searching to all. The GNU API
as a whole could be deprecated, and maintained only for old code (the
fact that it is not currently even documented in glibc suggests that
the glibc maintainers would go along with this).
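
To make points 1 and 6 concrete, here is a minimal sketch in C. It
assumes a glibc system, where <regex.h> exposes the GNU API once
_GNU_SOURCE is defined. The helper names posix_search_forwards and
gnu_search_backwards are hypothetical, chosen only for illustration.
The POSIX half passes its options per call to regcomp and can only
report the leftmost match; the GNU half has to set the syntax through
the process-wide re_set_syntax before compiling (the step that is not
thread-safe), but can search backwards by giving re_search a negative
range.

```c
#define _GNU_SOURCE   /* assumption: glibc, which exposes the GNU API this way */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* POSIX API: options are per-call flags to regcomp; search is forwards only.
   Returns the start offset of the leftmost match, or -1 if there is none. */
static int posix_search_forwards(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    regmatch_t m;
    int pos = (regexec(&re, text, 1, &m, 0) == 0) ? (int) m.rm_so : -1;
    regfree(&re);
    return pos;
}

/* GNU API: the syntax is global state, but backward searching is available.
   Returns the start offset of the match found first when scanning from the
   end of the text towards the beginning, or -1 if there is none. */
static int gnu_search_backwards(const char *pattern, const char *text)
{
    assert(pattern != NULL && text != NULL);
    struct re_pattern_buffer buf;
    memset(&buf, 0, sizeof buf);   /* no buffer, fastmap or translate table yet */

    /* Process-wide setting: another thread calling re_set_syntax between
       the next two statements would change how this pattern is compiled. */
    reg_syntax_t saved = re_set_syntax(RE_SYNTAX_POSIX_EXTENDED);
    const char *err = re_compile_pattern(pattern, strlen(pattern), &buf);
    re_set_syntax(saved);
    if (err != NULL)
        return -1;

    /* Start at the end of the text; a negative range asks re_search to try
       successively earlier start positions. */
    regoff_t len = (regoff_t) strlen(text);
    regoff_t pos = re_search(&buf, text, len, len, -len, NULL);
    regfree(&buf);
    return pos < 0 ? -1 : (int) pos;
}
```

With the pattern "b+" and the text "abbcbb", the forward search should
report offset 1 and the backward search offset 5. Nothing in the POSIX
half can produce the second answer, which is the kind of gap an
extended POSIX API would close.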

Hence, it seems to me there's a case for splitting the two efforts: on
the one hand, to get recent improvements into glibc (and indeed, this
should be an ongoing process), and on the other, to make further
improvements.

One way to do this would be a friendly fork of GNU regex, bearing the
same relation to the gnulib version as eglibc does to glibc. But I
think that these efforts can be reconciled, and that the right place
to do so is in gnulib (which is already a home for general-purpose
non-system code that is not in glibc, e.g. the various hash and list
APIs). There are two complications: first, the desire to sync glibc
with a stable version of regex (this could be overcome e.g. by marking
code, e.g. by #ifdef, that is not currently synced), and secondly, the
desire to retain a stable version of regex for non-GNU systems, and
those with an out-of-date glibc (this can be overcome by keeping all
changes API and ABI backwards-compatible).
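
As a rough sketch of the #ifdef marking suggested above: the macro name
REGEX_NOT_YET_IN_GLIBC and the function regex_feature_set are
hypothetical, invented for this example, and nothing here is taken from
the actual regex sources. The idea is only that code which has not yet
been synced is fenced off, so that a build without the macro sees
exactly the synced code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical marker. A gnulib build would define it; a build of the
   code as synced into glibc would leave it undefined. */
/* #define REGEX_NOT_YET_IN_GLIBC 1 */

/* Reports which variant of the code was compiled in. */
static const char *regex_feature_set(void)
{
#ifdef REGEX_NOT_YET_IN_GLIBC
    /* Unsynced additions live only inside blocks like this one. */
    return "synced code plus gnulib-only additions";
#else
    return "synced code only";
#endif
}
```

Compiled as written, the function returns "synced code only"; compiled
with -DREGEX_NOT_YET_IN_GLIBC it returns the other string. Searching
the sources for the marker then gives a list of what remains to be
synced.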

Three different sets of programs would benefit from this arrangement:

a. GNU programs that currently use the GNU API will have a natural
incentive to move to the extended POSIX API, thereby making them more
portable (as they will then be closer to compiling against a "pure"
POSIX implementation), or to use other regex libraries that are
POSIX-compatible, such as PCRE.

b. GNU and non-GNU programs that already use the POSIX API can now
have their capabilities extended easily.

c. Some things, like plain text searching, that are not currently
possible with either API, will become available, thereby simplifying
application code.
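
To illustrate the application code that item (c) would simplify, here
is a sketch of the usual workaround with the POSIX API as it stands:
quote every character that is special in a basic regular expression,
then compile the result as an ordinary pattern. The helpers
quote_for_bre and plain_text_search are hypothetical names; this is
neither the proposed RE_PLAIN flag (which does not exist in the current
API) nor Bruno's regexp-quote, only an assumption about what such
application code tends to look like.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Returns a malloc'd copy of text in which every character that is
   special in a POSIX basic regular expression is backslash-escaped.
   The caller frees the result. Returns NULL if allocation fails. */
static char *quote_for_bre(const char *text)
{
    assert(text != NULL);
    static const char special[] = ".[\\*^$";
    char *out = malloc(2 * strlen(text) + 1);   /* worst case: all escaped */
    if (out == NULL)
        return NULL;
    char *p = out;
    for (; *text != '\0'; text++) {
        if (strchr(special, *text) != NULL)
            *p++ = '\\';
        *p++ = *text;
    }
    *p = '\0';
    return out;
}

/* Finds needle in haystack as plain text by way of the regex engine.
   Returns the offset of the first occurrence, or -1 if there is none
   or if an error occurs. */
static long plain_text_search(const char *needle, const char *haystack)
{
    char *quoted = quote_for_bre(needle);
    if (quoted == NULL)
        return -1;
    regex_t re;
    int rc = regcomp(&re, quoted, 0);   /* cflags 0 selects basic syntax */
    free(quoted);
    if (rc != 0)
        return -1;
    regmatch_t m;
    long pos = (regexec(&re, haystack, 1, &m, 0) == 0) ? (long) m.rm_so : -1;
    regfree(&re);
    return pos;
}
```

For example, plain_text_search("a.b*c", "xxa.b*cyy") finds the literal
text at offset 2, whereas the unquoted pattern would also have matched
a string such as "aXbbc". A plain-text option in the library itself
would make the quoting step, and the extra allocation, unnecessary.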

Further, GNU maintainers in particular benefit from a focus on a
single API, which is being improved, rather than two which are
stagnant.

In short, building and running old application code remains possible;
maintaining current application code and writing new code becomes
simpler, and all without having to use a different version of regex
from gnulib's.

Finally, it is not obvious to me that it is desirable to add further
features to glibc. glibc is a system library: it implements system
APIs, and on a GNU system that means the POSIX and GNU regex APIs. The
place for extensions is in a non-system library. Nonetheless, provided
development stays fully backwards-compatible, it should be possible to
feed bug fixes and algorithmic improvements into glibc, without
preventing further improvements to the API.

That was rather a long email, but the list of actions that results is
small and simple:

1. Agree a method of managing syncing with glibc that allows for
further development; implement an agreed way of keeping track of
syncing.

2. Agree a policy on changes to gnulib's GNU regex (my conservative
suggestion was "backwards ABI and API-compatible changes only").

3. In the light of this, consider various proposals for improving GNU
regex (some of which I've made above).

-- 
http://rrt.sc3d.org
