bug-gnulib

Re: Dealing with character ranges in grep


From: Paolo Bonzini
Subject: Re: Dealing with character ranges in grep
Date: Thu, 09 Jun 2011 12:41:16 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.17) Gecko/20110428 Fedora/3.1.10-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.3 Thunderbird/3.1.10

On 06/09/2011 11:58 AM, Bruno Haible wrote:
> Paolo,
>
> [=e=] to match "e" as well as accented versions like é, è and ê).
> That is the one feature that you get with glibc, and that you would
> sacrifice when building --with-included-regex.

I agree.  It's up to distros to choose, of course.
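For readers unfamiliar with equivalence classes, here is a rough sketch of what [=e=] is meant to accept. It is only an approximation: the real class is defined by the locale's collation data, whereas this stand-in uses Unicode canonical decomposition. The helper names base_letter and in_equivalence_class are made up for the illustration.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return the first code point of the canonical decomposition of ch.

    Assumption: stripping combining marks approximates "same primary
    collation weight"; a real locale may define the class differently.
    """
    return unicodedata.normalize("NFD", ch)[0]


def in_equivalence_class(ch: str, base: str) -> bool:
    """Approximate test for whether ch would be accepted by [=base=]."""
    return base_letter(ch) == base


if __name__ == "__main__":
    for candidate in "eéèêf":
        print(candidate, in_equivalence_class(candidate, "e"))
```

A plain range or an explicit list such as [eéèê] can imitate this for one letter, but it has to be maintained by hand for every locale, which is why the feature is hard to give up.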

> If you are on the point of sacrificing a glibc feature in many programs,
> then IMO you should first talk with the glibc people to see what alternative
> they can offer.

No, I'm not! It's not any different from now. Right now, some distros/people use --with-included-regex and get broken semantics + no equivalence classes; others use --without-included-regex and get another kind of broken semantics.

With my proposal, distros/people that use --with-included-regex would get understandable semantics + no equivalence classes; others would see no change.

I don't plan to change the default between the two.

> It is probably futile to ask Ulrich Drepper to change how [a-z] is interpreted
> by default.

I think it would be possible to discuss it civilly with Uli (not on Bugzilla though). Unfortunately, more glibc development now seems to be done by someone I shall not name who sports twice the arrogance and half the knowledge/talent.

> But what would gnulib need so as to implement our "desired"
> behaviour? As far as I understand, you want to keep the interpretation of
> [=e=] in the POSIX + glibc way, but change the interpretation of [a-z]?

That's a different story. If we could implement [=e=] in gnulib code using glibc extensions, I would be all for that. But even right now, using gnulib's regex means sacrificing [=e=]. So that's a separate topic.

The only possible side effect is that, with this change, more distros may start using --with-included-regex. That's their choice, not ours.

> Then, what do we need from glibc?
>    - Do we need a RE_RANGES_IGNORE_LOCALES flag, like Arnold proposed?

No, that would be really really bad to have, for the reasons I mentioned in my original email.

>    - Do we need an API that allows us to access the collation elements?
>      (Or is strcoll and wcscoll sufficient?)

No, they're not. I thought about designing such an API last year, but in the end decided that the locale behavior of regex is irremediably broken. For example, when you have a multi-character collation element, you can match it using ranges (e.g. [d-i] matches "ch" in Czech, because "ch" collates after "h"), and even apply negation (e.g. [^c-h] matches "ch" too). However, there is no way to anchor your match to the beginning of the collation element. So "chci" matches both /[c-h]+ci/ and /[^c-h]+ci/. It is beyond repair, and [=e=] is the only part that can be salvaged.
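The "chci" example can be reproduced with a toy model. The sketch below is not glibc's or gnulib's matcher: it assumes a hand-written, truncated Czech-like collation order in which "ch" is one collating element placed right after "h", and it supports only a pattern of the shape bracket+ followed by a literal. ORDER, bracket_lengths and matches are hypothetical names used for illustration.

```python
# Toy collation order (assumption): "ch" is a single element after "h".
ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j"]
RANK = {elem: i for i, elem in enumerate(ORDER)}


def bracket_lengths(text, pos, lo, hi, negate):
    """Lengths that [lo-hi] (or [^lo-hi]) can consume at text[pos:].

    Every collating element that starts at pos is a candidate, so both
    "c" and "ch" are tried when the text continues with "ch".
    """
    lengths = []
    for elem in ORDER:
        if text.startswith(elem, pos):
            in_range = RANK[lo] <= RANK[elem] <= RANK[hi]
            if in_range != negate:
                lengths.append(len(elem))
    return lengths


def matches(text, lo, hi, negate, literal):
    """Whole-string match of  bracket+ literal  with simple backtracking."""
    def go(pos, count):
        if count >= 1 and text[pos:] == literal:
            return True
        return any(go(pos + n, count + 1)
                   for n in bracket_lengths(text, pos, lo, hi, negate))
    return go(0, 0)


if __name__ == "__main__":
    print(matches("chci", "c", "h", False, "ci"))  # /[c-h]+ci/
    print(matches("chci", "c", "h", True, "ci"))   # /[^c-h]+ci/
```

In this model /[c-h]+ci/ succeeds by consuming "c" and "h" as two separate elements, while /[^c-h]+ci/ succeeds by consuming "ch" as one element that lies outside the range. Both readings of the same two bytes are allowed, which is the anchoring problem described above.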

Paolo


