Re: i18n proposal
From: Ben Pfaff
Subject: Re: i18n proposal
Date: Sun, 18 Jun 2006 19:09:15 -0700
User-agent: Gnus/5.110004 (No Gnus v0.4) Emacs/21.4 (gnu/linux)
John Darrington <address@hidden> writes:
> On Sun, Jun 18, 2006 at 02:50:37PM -0700, Ben Pfaff wrote:
> * String data that occurs in cases is primarily treated as opaque
> octets. Even procedures like SORT CASES that could easily do
> better (by using language-specific collation rules via, e.g.,
> wcscoll()) are documented to use bytewise comparison.
>
> It's probably documented that way, because it's easier to implement.
> It makes sense to me, that SORT CASES should use the collation of the
> "data locale". Let's at least look into the implications of doing so,
> and perhaps offer it under "enhanced" mode.
> A German wanting to say, select all cities from 'N' to 'Z' might be
> very annoyed to find that pspp omitted 'Öhringen' (where they had the
> world cup match last week).
I have thought a little about that. I have a few ideas.
First, I don't think changing the default behavior is a good
idea, because it seems like it could be a surprising change. But
I can think of a few other options:
    * Add a COLLATE keyword to SORT CASES that tells it to
      use proper locale-specific collation rules.

    * Add a COLLATE('a','b') function to the expression
      syntax and extend SORT CASES to allow an arbitrary
      expression to be used.

    * Add an XFRM('string') function to the expression
      syntax, then document that you can sort based on
      locale-specific rules using

          COMPUTE collate=XFRM(string).
          SORT CASES BY collate.

      (XFRM would be implemented via strxfrm().)
The last of those is kind of nice since you don't actually have
to change the sort algorithm at all.
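To illustrate why the XFRM approach leaves the sort algorithm alone, here is a minimal sketch in C. It is not PSPP code, and the helper name make_sort_key is hypothetical. It relies only on the standard guarantee that comparing two strxfrm() results with strcmp() gives the same ordering as strcoll() on the original strings, under the current LC_COLLATE locale.

```c
#include <assert.h>
#include <locale.h>   /* callers select the collation locale with setlocale() */
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a malloc'd collation key for S under the
   current LC_COLLATE locale, or NULL if memory allocation fails.
   Keys can then be compared bytewise with strcmp(). */
char *make_sort_key(const char *s)
{
    assert(s != NULL);

    /* With a size of 0 the destination may be NULL; the return value
       is the key length, not counting the terminating null byte. */
    size_t n = strxfrm(NULL, s, 0);

    char *key = malloc(n + 1);
    if (key != NULL)
        strxfrm(key, s, n + 1);
    return key;
}
```

A bytewise sort on the keys then orders the cases by locale rules, which is what SORT CASES BY collate would do with the computed variable. The caller frees each key with free().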
> * The interface to the output subsystem (that is, primarily the
> functions in output.h and tab.h) should use multibyte strings,
> for these reasons. First, strings passed to the tab_*()
> functions are often fed through gettext() along the way, so
> wide strings would be inconvenient. Second, tables can get
> very large, so wide strings would be wasteful.
>
> (The ASCII driver might want to change its representation of
> the page to wide strings, though, because this would be an easy
> way for it to support Asian character sets.)
>
> Reading from the unicode website, there are texts which suggest that
> this would not be the case. Apparently, even in "monospace fonts"
> in the general case, the number of characters is not necessarily
> proportional to the width required to render them. The advice there
> is to use multi-byte representation for all input/output operations.
Are you talking about Unicode Standard Annex #11 (East Asian
Width)? I'm aware of the need to deal with single- and
double-width characters. It would not be too hard to do, seeing
as the wcwidth() function will tell you the width of a character.
I don't think that multi-byte representation would work well for
the ASCII driver's internal representation, because it's
difficult to index a multibyte string based on the number of
(single-)character widths from the left margin, which the ASCII
driver does all the time.
Of course, the output format of the ASCII output driver should be
multibyte characters.
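As a rough sketch of the wcwidth() point, assuming a row of the page is held as a wide string: the functions below (hypothetical names, not taken from the ASCII driver) compute the display width of a row and find which character covers a given column. Each step looks at one wchar_t, whereas a multibyte string would need decoding at every step.

```c
#define _XOPEN_SOURCE 700   /* for the wcwidth() declaration */
#include <stddef.h>
#include <wchar.h>

/* Number of display columns needed for S in the current locale,
   or -1 if S contains a non-printable character. */
int display_width(const wchar_t *s)
{
    int total = 0;
    for (; *s != L'\0'; s++) {
        int w = wcwidth(*s);
        if (w < 0)
            return -1;
        total += w;
    }
    return total;
}

/* Index of the character in S that covers display column COLUMN
   (0-based from the left margin).  Returns the length of S if COLUMN
   lies beyond the end.  Non-printable characters count as width 0. */
size_t index_at_column(const wchar_t *s, int column)
{
    int col = 0;   /* display column where s[i] starts */
    size_t i;
    for (i = 0; s[i] != L'\0'; i++) {
        int w = wcwidth(s[i]);
        if (w < 0)
            w = 0;
        if (column < col + w)
            break;
        col += w;
    }
    return i;
}
```

The widths reported depend on the LC_CTYPE locale in effect, so double-width results for East Asian characters require a suitable locale to have been selected first.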
> Incidentally, if the ASCII driver is going to support other character
> sets, then it might want to be changed to a more appropriate name.
Yes, "text" or "plain text" is what I have in mind.
> * Each "struct variable" is split between multibyte and wide
> strings. Variable names are used as part of syntax processing,
> so we will probably want to change "name" to a wide string.
>
> But the short_name has to remain as it is I think.
Yes.
> * Finally, what should we pass to setlocale()? I think that we
> should select, with LC_ALL, the "output locale".
>
> Like you say, there's going to be a lot of locale switching going on,
> and with that comes potential for mistakes; mistakes that might
> easily go unnoticed. I suggest that we avoid direct calls to
> setlocale, and implement some wrappers.
Yes, but I want to keep locale switching to a minimum. I suspect
that on some systems it actually causes libc to go out and read a
locale file.
On systems that have newlocale()/uselocale()/freelocale(), we
should use those.
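A wrapper along the lines John suggests might look like the sketch below on systems with the POSIX.1-2008 per-thread locale functions. The names call_in_locale and decimal_point_char are hypothetical, and this is only one possible shape for such a wrapper. It switches the calling thread's locale around a callback and restores the previous one afterward, so the global locale set by setlocale() is never touched.

```c
#define _XOPEN_SOURCE 700   /* for newlocale(), uselocale(), freelocale() */
#include <locale.h>
#include <stddef.h>

/* Hypothetical wrapper: runs FN(AUX) with the calling thread's locale
   temporarily set to the locale called NAME, then restores the previous
   locale.  Returns FN's result, or -1 if the locale cannot be created. */
int call_in_locale(const char *name, int (*fn)(void *aux), void *aux)
{
    locale_t loc = newlocale(LC_ALL_MASK, name, (locale_t) 0);
    if (loc == (locale_t) 0)
        return -1;

    locale_t old = uselocale(loc);   /* affects this thread only */
    int result = fn(aux);
    uselocale(old);                  /* restore before freeing LOC */
    freelocale(loc);
    return result;
}

/* Example callback: the first byte of the decimal point string
   in whatever locale is current when it is called. */
int decimal_point_char(void *aux)
{
    (void) aux;
    return localeconv()->decimal_point[0];
}
```

To keep switching cheap, a real implementation would presumably create each locale_t once and reuse it rather than calling newlocale() on every switch, since creating a locale is where libc may need to read locale data.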
> I've been wondering why pspp currently sets the LC_MONETARY category.
[...]
I don't recall. Probably, it seemed harmless, so I chose to set it.
> Another option would be to preset the CCA format based upon the lconv
> struct, and leave the DOLLAR format as is. But this would mean that
> DOLLAR is an unmitigated nuisance in countries with a non-dollar
> currency. I wonder what spss in a European locale does?
Good questions.
--
"In the PARTIES partition there is a small section called the BEER.
Prior to turning control over to the PARTIES partition,
the BIOS must measure the BEER area into PCR[5]."
--TCPA PC Specific Implementation Specification