Re: Wide strings
From: Mike Gran
Subject: Re: Wide strings
Date: Sun, 25 Jan 2009 16:16:12 -0800 (PST)
> From: Ludovic Courtès address@hidden
I believe that we should aim for R6RS strings.
I think the most important thing is to have humility in the face of an
impossible problem: how to encode all textual information. It is
important to "stand on the shoulders of giants" here. It becomes a
matter of deciding which actively developed library of wide character
functions is to be used and how to integrate it.
There are 3 good, actively developed solutions of which I am aware.
1. Use GNU libc functionality. Encode wide strings as wchar_t.
2. Use GLib functionality. Encode wide strings as UTF-8. Possibly
give up on O(1) indexing, or add indexing information to each string
to restore O(1), which might negate the space advantage of UTF-8.
3. Use IBM's ICU4C. Encode wide strings as UTF-16. This adds an
obscure dependency.
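To make the O(1) concern in option 2 concrete, here is a minimal sketch of
character indexing over UTF-8. The name utf8_ref is hypothetical and is not a
GLib or Guile function; the sketch assumes well-formed, NUL-terminated UTF-8
and does no validation. Reaching character idx means walking every earlier
byte, which is where the O(n) cost comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the code point at character index idx in a
   well-formed, NUL-terminated UTF-8 string, or UINT32_MAX if idx is past
   the end.  No validation is performed. */
uint32_t utf8_ref(const char *s, size_t idx)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int extra;

    /* Skip idx characters.  This linear walk is what makes indexing O(n). */
    while (*p != 0 && idx > 0) {
        p++;
        while ((*p & 0xC0) == 0x80)   /* continuation bytes look like 10xxxxxx */
            p++;
        idx--;
    }
    if (*p == 0)
        return UINT32_MAX;

    /* Decode the lead byte, then fold in the continuation bytes. */
    if (*p < 0x80)      { cp = *p;        extra = 0; }
    else if (*p < 0xE0) { cp = *p & 0x1F; extra = 1; }
    else if (*p < 0xF0) { cp = *p & 0x0F; extra = 2; }
    else                { cp = *p & 0x07; extra = 3; }
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(*++p & 0x3F);
    return cp;
}
```

With a fixed-width encoding the same lookup is a single array access, which
is the trade-off between options 1 and 2.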
Option 3 is likely a non-starter, because it seems that Guile has
tried to avoid adding new non-GNU dependencies. It is technologically
a great solution, IMHO.
Option 1 is probably the way to go, because it keeps Guile close to
the metal and keeps dependencies out of it. Unfortunately, UTF-8
strings would require conversion to and from wchar_t.
> 1. IMO it'd be nice to have ASCII strings special-cased so that they
> are always encoded in ASCII. This would allow for memory savings
> since, e.g., most symbols are expected to contain only ASCII
> characters. It might also simplify interaction with C in certain
> cases; for instance, it would make it easy to have statically
> initialized ASCII Scheme strings.
Why not? It does solve the initialization problem of dealing with strings
before setlocale has been called.
Let's say that a string is a union holding either an ASCII char vector
or a wchar_t vector. A "character" then is just a Unicode codepoint.
String-ref returns a wchar_t. This is all in line with R6RS as I
understand it.
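A minimal sketch of that representation might look like the following. The
type and function names (sketch_string, sketch_string_ref) are hypothetical
and are not Guile's actual internals; the caller is assumed to pass an index
that is in range.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical string object: either a narrow ASCII vector or a wide one. */
typedef struct {
    int is_wide;                 /* 0: use u.narrow, 1: use u.wide */
    size_t len;                  /* length in characters, not bytes */
    union {
        const char *narrow;      /* ASCII only, one byte per character */
        const wchar_t *wide;     /* one wchar_t per character */
    } u;
} sketch_string;

/* O(1) string-ref for both representations.
   Assumption: the caller guarantees i < s->len. */
wchar_t sketch_string_ref(const sketch_string *s, size_t i)
{
    if (s->is_wide)
        return s->u.wide[i];
    return (wchar_t)(unsigned char)s->u.narrow[i];
}
```

The narrow branch can point straight at a C string literal, which is what
makes statically initialized ASCII strings cheap in this layout.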
There could then be a separate iterator and function set that does
(likely O(n)) operations on the grapheme clusters of strings. A
grapheme cluster is a single written symbol which may be made up of
several codepoints. Unicode Standard Annex #29 describes how to
partition a string into a sequence of grapheme clusters.[1]
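To illustrate why such an iterator is O(n), here is a deliberately crude
sketch. It is not an implementation of UAX #29: it assumes that only the
Combining Diacritical Marks block (U+0300 to U+036F) extends a cluster, and
it ignores every other rule in the annex. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Gross simplification, NOT UAX #29: only code points in U+0300..U+036F
   (Combining Diacritical Marks) are treated as extending a cluster. */
static int extends_cluster(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count approximate grapheme clusters in an array of code points.
   Every code point has to be inspected, so this is O(n). */
size_t approx_cluster_count(const uint32_t *cps, size_t len)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        /* A new cluster starts at index 0 and at every non-combining
           code point. */
        if (i == 0 || !extends_cluster(cps[i]))
            count++;
    }
    return count;
}
```

For example, the three codepoints 'e', U+0301, 'x' form two clusters under
this approximation, while string-length would still report three.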
There is the problem of systems where wchar_t is 2 bytes instead of 4
bytes, like Cygwin. For those systems, I'd recommend
restricting functionality to 16-bit characters instead of trying to
add an extra UTF-16 encoding/decoding step. I think there should
always be a complete codepoint in each wchar_t.
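One way to enforce that restriction is a range check against WCHAR_MAX when
characters are created, sketched below. The function name fits_in_wchar is
hypothetical. Rejecting surrogate values is an extra assumption of this
sketch, based on R6RS characters being Unicode scalar values.

```c
#include <stdint.h>
#include <wchar.h>

/* Hypothetical check: return 1 if code point cp can be stored as one
   complete wchar_t on this platform, 0 otherwise. */
int fits_in_wchar(uint32_t cp)
{
    if (cp > 0x10FFFF)                  /* beyond the Unicode range */
        return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* assumption: reject surrogates */
        return 0;
    /* With a 2-byte wchar_t, WCHAR_MAX is at most 0xFFFF, so anything
       above 16 bits is refused instead of being split into UTF-16. */
    return (uintmax_t)cp <= (uintmax_t)WCHAR_MAX;
}
```

On a platform with a 4-byte wchar_t every valid scalar value passes; on a
2-byte platform only 16-bit characters do, with no surrogate-pair handling
needed anywhere else.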
--
Mike Gran
[1] http://www.unicode.org/reports/tr29/