Re: Case mapping of sharp s
From: Ulrich Mueller
Subject: Re: Case mapping of sharp s
Date: Fri, 20 Nov 2009 09:10:29 +0100
>>>>> On Thu, 19 Nov 2009, David Kastrup wrote:
>> I can guess why it's much slower going backward: the simple search
>> operates on chars rather than bytes. The internal encoding we use
>> (currently based on utf-8) is designed to be easy to parse going
>> forward but not so easy going backward (IIRC our encoding is
>> actually even a bit more painful in this case than pure utf-8).
> I don't think so. The utf-8 _scheme_ can be used to encode 21bits in
> 4 characters.
The original UTF-8 (specified in RFC 2279) could encode the full range
of 2^31 characters in up to 6 bytes. The limitation to 2^20.1 (code
points up to U+10FFFF, as in RFC 3629) came later and is artificial.
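The arithmetic behind those figures can be checked with a short sketch. It assumes the generic UTF-8 bit layout (a lead byte with n high one-bits followed by n-1 continuation bytes carrying 6 bits each); the function name `payload_bits` is made up for illustration.

```python
import math

def payload_bits(n_bytes: int) -> int:
    """Payload bits of an n-byte sequence in the original (RFC 2279) scheme."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("RFC 2279 sequences are 1 to 6 bytes long")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # The lead byte keeps (7 - n) payload bits, each continuation byte keeps 6.
    return (7 - n_bytes) + 6 * (n_bytes - 1)

# Payload width for every sequence length from 1 to 6 bytes.
bits_by_length = {n: payload_bits(n) for n in range(1, 7)}

# Size of today's Unicode code space (0..10FFFF), expressed as a power of two.
unicode_range_log2 = math.log2(0x10FFFF + 1)
```

With this layout a 4-byte sequence carries 21 bits and a 6-byte sequence carries 31, while the current Unicode range needs only slightly more than 20 bits.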
> We stay within that range, in the utf-8 4 character scheme, but
> outside of the Unicode range 2^20+2^16.
character.h says it's up to 22 bits encoded in up to 5 bytes:
,----
| character code   1st byte   byte sequence
| --------------   --------   -------------
|      0-7F        00..7F     0xxxxxxx
|     80-7FF       C2..DF     110xxxxx 10xxxxxx
|    800-FFFF      E0..EF     1110xxxx 10xxxxxx 10xxxxxx
|  10000-1FFFFF    F0..F7     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
| 200000-3FFF7F    F8         11111000 1000xxxx 10xxxxxx 10xxxxxx 10xxxxxx
| 3FFF80-3FFFFF    C0..C1     1100000x 10xxxxxx (for eight-bit-char)
| 400000-...       invalid
`----
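Read as a lookup, the table gives the encoded length for any character code. The following is only a sketch derived from the quoted comment, not the code Emacs itself uses; `encoded_length` is a hypothetical name.

```python
def encoded_length(code: int) -> int:
    """Number of bytes for `code`, following the table quoted from character.h."""
    if code < 0:
        raise ValueError("negative character code")
    if code <= 0x7F:
        return 1
    if code <= 0x7FF:
        return 2
    if code <= 0xFFFF:
        return 3
    if code <= 0x1FFFFF:
        return 4
    if code <= 0x3FFF7F:
        return 5  # lead byte F8
    if code <= 0x3FFFFF:
        return 2  # eight-bit-char range, lead byte C0..C1
    raise ValueError("character code out of range (400000 and above is invalid)")

# The largest valid code, 3FFFFF, needs 22 bits.
max_code_bits = (0x3FFFFF).bit_length()
```

For example, sharp s (code DF) takes 2 bytes, and the highest Unicode code point 10FFFF takes 4.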
>> BM on the other hand works on bytes, so there's no such slowdown.
> With utf-8, I think that apart from character ranges, search forward and
> backward should work perfectly like on 8-bit characters. Exception is
> incomplete character matches, but since the utf-8 scheme can immediately
> tell "is a 7-bit character" "is the first character of a multibyte
> sequence of length n" "is last or intermediate character of multibyte
> sequence" this is not a serious problem.
When the search is for equivalence classes of characters (e.g. case
folding), then I think it must operate on whole characters and
therefore has to find the start of each multibyte sequence.
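A minimal sketch of what "finding the start of each multibyte sequence" means when walking backward. It assumes plain UTF-8 input rather than the Emacs internal encoding, uses Python's `str.casefold` as a stand-in for a case-folding table, and the helper names are invented for illustration.

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (not the first byte of a character)."""
    return (byte & 0xC0) == 0x80

def char_start(buf: bytes, pos: int) -> int:
    """Index of the first byte of the character that contains buf[pos]."""
    while pos > 0 and is_continuation(buf[pos]):
        pos -= 1
    return pos

def chars_backward(buf: bytes):
    """Yield the characters of a UTF-8 buffer from last to first."""
    end = len(buf)
    while end > 0:
        start = char_start(buf, end - 1)
        yield buf[start:end].decode("utf-8")
        end = start

buf = "Maße".encode("utf-8")  # b'Ma\xc3\x9fe'
# Folding is applied per character, so each sequence must be reassembled first.
folded_backward = [c.casefold() for c in chars_backward(buf)]
```

Here the byte at index 3 is a continuation byte, so the scan has to step back to index 2 before the character can be decoded and folded.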
Ulrich