Re: regex and case-fold-search problem


From: Richard Stallman
Subject: Re: regex and case-fold-search problem
Date: Fri, 30 Aug 2002 15:19:14 -0400

    So, I agree with Stephen that his method is good enough.

It is wrong even for ASCII--we definitely must do something better, at
least for ASCII.  The only question is, how much more than ASCII?

    I think we all know that is the right behaviour, and at
    least for ASCII, the latest code works that way.  Perhaps
    we should make Emacs work correctly also for Latin-1 chars,
    because in emacs-unicode, too, they have the same code
    order.

What about Latin-2 characters?  Will those regexp ranges
change their meaning in emacs-unicode?

If so, perhaps we only need to make an effort to get ranges
really right for codes 0-256.

    > A faster way, in the usual cases, would be to look for the case where
    > several consecutive characters that have just one case-sibling each,
    > and the siblings are consecutive too.  Each subrange of this kind can
    > be turned into two subranges, the original and the case-converted.
    > Also identify subranges of characters that have no case-siblings; each
    > subrange of this kind just remains as it is.  Finally, any unusual
    > characters that are encountered can be replaced with a list of all the
    > case-siblings.

    > This too requires use of the whole case table.

    Implementing that for any range of characters consumes our
    man-power and makes the running code slower.

It is not a very hard program to write, I think.  I'd guess around 30
lines.  However, you're right about the slowness for large ranges.  If
we only do this for codes 0-256 (or, currently, for ASCII and
Latin-1), then it won't be too slow.
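
For concreteness, here is a minimal sketch of that subrange scan, in
C.  It is only a sketch: case_sibling below is a hypothetical
stand-in for a case-table lookup and knows only ASCII, the "unusual
character" case (several or non-consecutive siblings) is left out,
and the subranges are just printed instead of being stored in the
compiled pattern.

#include <stdio.h>
#include <ctype.h>

/* Hypothetical stand-in for a case-table lookup: return the single
   case-sibling of C, or -1 if C has none.  Real code would consult
   the buffer's case table; this one knows only ASCII.  */
static int
case_sibling (int c)
{
  if (isupper (c))
    return tolower (c);
  if (islower (c))
    return toupper (c);
  return -1;
}

/* Scan FROM..TO.  A maximal run whose characters each have one
   case-sibling, with the siblings consecutive too, becomes two
   subranges: the run itself and its case-converted twin.  A run of
   characters with no case-sibling stays as it is.  */
void
expand_range (int from, int to)
{
  int c = from;
  while (c <= to)
    {
      int start = c;
      int sib_start = case_sibling (c);

      if (sib_start < 0)
        {
          while (c <= to && case_sibling (c) < 0)
            c++;
          printf ("keep [%d-%d]\n", start, c - 1);
        }
      else
        {
          int sib = sib_start;
          while (c <= to && case_sibling (c) == sib)
            c++, sib++;
          printf ("keep [%d-%d]\n", start, c - 1);
          printf ("add  [%d-%d]\n", sib_start, sib - 1);
        }
    }
}

int
main (void)
{
  /* [,-b]: the punctuation and digits stay as they are, A-Z gains
     a-z, and a-b gains A-B.  */
  expand_range (',', 'b');
  return 0;
}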

    Consider the situation where one writes the regexp
            "[\000-\xffff]"
    to search only for Unicode BMP chars in emacs-unicode.

Do you think that is a reasonable kind of range that we
should try to support?  If so, there goes my idea that
we only need to support ranges in 0-256 very well.

On the other hand, if we handle \000-\xffff by doing case conversion
carefully only for ASCII and Latin-1, and treat the rest of the range
in a less smart way, we would get the same results in this case.
Is that a good solution?
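
If it helps, the hybrid treatment might look roughly like this,
reusing the hypothetical expand_range from the sketch above and
assuming 256 as the cutoff:

#include <stdio.h>

void expand_range (int from, int to);   /* the sketch above */

/* Hybrid sketch: careful case expansion for codes below 256, and
   one raw pass-through subrange for everything above.  */
void
expand_range_hybrid (int from, int to)
{
  if (from < 256)
    expand_range (from, to < 256 ? to : 255);
  if (to >= 256)
    printf ("keep [%d-%d]\n", from >= 256 ? from : 256, to);
}

For "[\000-\xffff]" this would expand case pairs only below 256 and
keep 256-65535 as a single raw subrange, which is the outcome
described above.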



