emacs-devel

Re: regex and case-fold-search problem


From: Kenichi Handa
Subject: Re: regex and case-fold-search problem
Date: Thu, 29 Aug 2002 17:53:53 +0900 (JST)
User-agent: SEMI/1.14.3 (Ushinoya) FLIM/1.14.2 (Yagi-Nishiguchi) APEL/10.2 Emacs/21.1.30 (sparc-sun-solaris2.6) MULE/5.0 (SAKAKI)

In article <address@hidden>, Richard Stallman <address@hidden> writes:
> The fact is, people know the character codes and take advantage of
> their knowledge.  I don't think this is unreasonable.  But that
> question is academic, since the feature is used and we need to make it
> work.

People know the character codes of the charset they are
familiar with.  So they can take advantage of that knowledge
only when Emacs internally uses a character representation
whose code order is the same as that charset's.
For instance, those who are familiar with the iso-8859-2 charset
can take advantage of their knowledge in Emacs 21.  But if
they write such a regular expression, they'll find that it
matches different characters in emacs-unicode.

>> Maybe we can simply use the smallest contiguous
>> range of chars that includes all the chars we should match,

> That isn't right.  The range should be equal to the disjunction of all
> characters in it; A-_ should be equivalent to []A.....Z[\^_].  With
> case folding, that should match A-Z, a-z, and [\]^_.  In other words,
> The correct behavior is that all character codes that are equivalent
> (when you ignore case) to any character in the originally specified
> range should match.

I think we all know that is the right behaviour, and at
least for ASCII, the latest code works that way.  Perhaps we
should make Emacs work correctly for Latin-1 chars as well,
because they have the same code order in emacs-unicode
too.

But...

> Given the whole case table, you can compute this by looping over the
> original (non-case-folded) range and finding, for each character, all
> the characters that are equivalent to it.  Then those could be
> assembled into the smallest possible number of ranges.

> A faster way, in the usual cases, would be to look for the case where
> several consecutive characters that have just one case-sibling each,
> and the siblings are consecutive too.  Each subrange of this kind can
> be turned into two subranges, the original and the case-converted.
> Also identify subranges of characters that have no case-siblings; each
> subrange of this kind just remains as it is.  Finally, any unusual
> characters that are encountered can be replaced with a list of all the
> case-siblings.

> This too requires use of the whole case table.

Implementing that for arbitrary ranges of characters costs us
manpower and makes the running code slower.
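
To make the cost concrete, here is a minimal sketch of the first
method quoted above (loop over the original range, collect every
case-equivalent, then assemble the result into as few ranges as
possible).  It is written in Python only for illustration and is not
the regex.c code.  The names fold_range and ascii_case_siblings are
hypothetical, and the "case table" is an ASCII-only stand-in.

```python
import string


def ascii_case_siblings(code):
    """Hypothetical stand-in for a case table; knows ASCII letters only."""
    ch = chr(code)
    if ch in string.ascii_letters:
        return {ord(ch.lower()), ord(ch.upper())}
    return {code}


def fold_range(lo, hi, siblings=ascii_case_siblings):
    """Return the merged ranges matching lo..hi when case is ignored."""
    matched = set()
    # One case-table lookup per character in the original range.
    for code in range(lo, hi + 1):
        matched |= siblings(code)

    # Assemble the matched codes into the smallest number of ranges.
    ranges = []
    for code in sorted(matched):
        if ranges and code == ranges[-1][1] + 1:
            ranges[-1][1] = code
        else:
            ranges.append([code, code])
    return [(start, end) for start, end in ranges]


# The range A-_ from the quoted example.
print(fold_range(ord("A"), ord("_")))  # [(65, 95), (97, 122)]
```

For A-_ this gives A-_ plus a-z, which is the behaviour described in
the quoted text.  Note that the loop does one lookup per character of
the original range, so the work grows with the width of the range.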

Consider the situation where one writes this regexp
        "[\000-\xffff]"
to search only for Unicode BMP chars in emacs-unicode.  I
suspect that, if we implement the above method, compiling this
regexp when case-fold-search is non-nil will take longer
than people usually expect.

So, I agree with Stephen that his method is good enough.
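
For comparison, a rough sketch of what I take "the smallest
contiguous range of chars that includes all the chars we should
match" to mean.  This is my reading of the quoted fragment, not the
actual patch.  The names covering_range and ascii_case_siblings are
hypothetical, the case table is an ASCII-only stand-in, and the
sketch only shows what the result looks like, not a cheap way to
compute it.

```python
import string


def ascii_case_siblings(code):
    """Hypothetical stand-in for a case table; knows ASCII letters only."""
    ch = chr(code)
    if ch in string.ascii_letters:
        return {ord(ch.lower()), ord(ch.upper())}
    return {code}


def covering_range(lo, hi, siblings=ascii_case_siblings):
    """Return (start, end, extras) for one range covering all matches.

    extras lists the characters inside start..end that are not
    case-equivalent to anything in the original range lo..hi.
    """
    matched = set()
    for code in range(lo, hi + 1):
        matched |= siblings(code)
    start, end = min(matched), max(matched)
    extras = [chr(c) for c in range(start, end + 1) if c not in matched]
    return start, end, extras


# The range A-_ becomes the single range A-z.
print(covering_range(ord("A"), ord("_")))  # (65, 122, ['`'])
```

With this reading, A-_ turns into the single range A-z, which also
matches the backquote.  That is the trade-off: a simpler result that
may match a few characters the exact method would not.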

---
Ken'ichi HANDA
address@hidden



