emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Case mapping of sharp s


From: Stephen J. Turnbull
Subject: Re: Case mapping of sharp s
Date: Wed, 18 Nov 2009 14:33:57 +0900

Eli Zaretskii writes:

 > My idea was to use the same technique to fold the text as you search
 > through it.  You could, for example, fold one character at a time, so
 > that instead of comparing against X you compare against fold(X).

That's precisely how case-insensitive Boyer-Moore is done.  But it's
not that easy for Unicode, because you will need to check
*characters*, not octets, while Boyer-Moore is very much oriented
toward fixed-width charsets.

I think the problem here is that because of the nature of Unicode,
this folding often crosses character blocks (here meaning "sets of
characters with the same bits except for the lowest 6").  That means
that you will need to keep track not only of the folding for the
current octet, but you will also need to be able to backtrack to check
the higher bits.  By contrast, with the Mule internal encoding the
case pairs are always in the same block of 96 characters.

I don't think that it's *impossible* to implement Boyer-Moore with
translation for UTF-8, but it's nowhere near as easy as for the
unibyte charset case, which Mule internal can in practice be reduced
to.  It's possible that all this fiddling could result in a
performance hit of the same order of magnitude as the performance hit
of brute-forcing the comparison.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]