[Pika-dev] Re: tla1.2 on cygwin
From: Tom Lord
Subject: [Pika-dev] Re: tla1.2 on cygwin
Date: Sun, 14 Mar 2004 09:02:30 -0800 (PST)
> From: Matthew Dempsky <address@hidden>
[moving a gnu-arch-users thread here.]
>> In other news -- I really shudder to think what happens when case
>> insensitive file systems meet Unicode. Suddenly you're talking about
>> embedding a fairly complex case canonicalization algorithm based on
>> relatively huge data tables in every kernel and application that might
>> ever need to compare a filename. And the end result is going to be
>> either locale dependent (uh... yeah, right) or linguistically
>> imperfect
> Isn't this an issue you're sort of tackling with srfi-52?
> Maybe it's a little different because the majority of the
> case-insensitivity identifiers can be handled just internally
> (variable references and stuff can be optimized to just frame
> references, etc.), but if you're promoting Scheme as a system language
> then it seems like somewhere it would have to be exposed.
> I could believe the argument might be related to file systems being a
> (usually) kernel-space issue while even systems programming with
> Scheme is still user-space (unless we bring back Lisp machines), but
> if microkernels were more popular then the file systems could be just
> user-space programs.
> Personally, I don't have a very strong opinion either way, but I'm
> curious of your (apparently inconsistent) thoughts.
It's related to SRFI-52 but only in the sense that both are related to
case insensitive string and identifier comparison in Unicode
generally.
There are two levels of problem for Scheme:
1) How can portable programs decide if two identifier names are
equivalent, disregarding case, without knowing what character set
and canonicalization rules a given implementation uses?
SRFI-52 answers that (and some other things).
2) If there is to be a standard for exchanging Scheme programs
containing identifiers which contain Unicode characters,
then what should the standard say about canonicalization and
comparison rules for identifiers?
One of the draft SRFIs puts forward a straw-man for this; SRFI-52
does not try to answer it.
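To make problem (1) concrete, here is a minimal sketch of a
locale-independent, case-insensitive identifier comparison. It is
written in Python rather than Scheme, it is not what SRFI-52 specifies,
and the names `fold_identifier` and `same_identifier` are made up for
illustration. It assumes NFC normalization plus Unicode default case
folding is an acceptable rule, which is exactly the kind of choice a
standard would have to pin down.

```python
import unicodedata


def fold_identifier(name: str) -> str:
    """Return a comparison key for an identifier (illustrative rule only).

    Assumption: normalize to NFC, apply Unicode default case folding,
    then normalize again because folding can de-normalize a string.
    """
    folded = unicodedata.normalize("NFC", name).casefold()
    return unicodedata.normalize("NFC", folded)


def same_identifier(a: str, b: str) -> bool:
    """True if two identifier names are equal under the rule above."""
    return fold_identifier(a) == fold_identifier(b)
```

Under this rule `same_identifier("Straße", "STRASSE")` holds, but
`same_identifier("I", "\u0131")` (Turkish dotless i) does not: a
locale-independent rule is simple to state and still linguistically
imperfect.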
The problem for arch is perhaps similar, but not the same, and it is
far harder to solve.
a) Should the global namespace of archives and revisions be extended
to permit non-ASCII characters in names, and if so, how?
I tentatively think "yes".
b) Should case insensitivity of any kind be introduced?
No.
c) What are the canonicalization rules for global names?
This is a hard part. While names are purely ASCII, they are
bitwise identical when used as filenames on just about any
platform we'd care to support.
If names can be Unicode, then file systems will impose different
requirements on how they are stored as filenames. Some will want
UTF-8, others UTF-16. Some systems will insist on one
canonicalization form, others on another form, and others will not
perform canonicalization at all.
It is not yet known, but seems likely, that the transports we use
will muddy the waters further -- what will SFTP do with a UTF-16
encoded name? Will WebDAV implementations perform canonicalization?
d) And what of filenames within project trees?
The same issues (encoding and canonicalization) apply. They apply
to global arch names within project trees (such as patch logs) but
also to the user's source files as well. Only in this area it's
even worse because source files interact with compilers and `make'
and so forth -- not just the filesystem and arch. Our liberty to
rewrite filenames as they ship across platforms during distributed
revision control operations may be further restricted.
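The encoding and canonicalization point can be seen in a few lines. The
sketch below uses Python, a hypothetical helper `filename_bytes`, and a
made-up name "café"; it only shows what bytes a name turns into under a
given normalization form and encoding, not what any particular file
system or transport actually does.

```python
import unicodedata


def filename_bytes(name: str, form: str, encoding: str) -> bytes:
    """Bytes a name becomes under one normalization form and encoding.

    `form` is one of "NFC", "NFD", "NFKC", "NFKD"; `encoding` is any
    Python codec name such as "utf-8" or "utf-16-le".
    """
    return unicodedata.normalize(form, name).encode(encoding)


# A pure-ASCII name is byte-identical under every form in UTF-8.
ascii_nfc = filename_bytes("hello", "NFC", "utf-8")
ascii_nfd = filename_bytes("hello", "NFD", "utf-8")

# A non-ASCII name (hypothetical example) is not.
cafe_nfc_utf8 = filename_bytes("caf\u00e9", "NFC", "utf-8")
cafe_nfd_utf8 = filename_bytes("caf\u00e9", "NFD", "utf-8")
cafe_nfc_utf16 = filename_bytes("caf\u00e9", "NFC", "utf-16-le")
```

Here `ascii_nfc == ascii_nfd`, while the three "café" values are three
different byte strings for what a user would call one name.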
To give a sense of how things can go wrong: what happens if I have the
"same" category name in two archives, but stored as filenames in
different canonicalization forms on the two archive hosts? If I'm
merging from each of these into my project tree, I want the patch
logs to go in the same category directory in my tree.
Where does that canonicalization divergence get resolved? In `tar'?
In `arch'? In the kernel on my system? What about the arguments
passed to `patch' and `diff'? There's no right answer yet to
questions like these.
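A toy model of that merge, in Python, with hypothetical names
(`group_patch_logs`, `to_nfc`, the category "café"); it is not how arch
stores patch logs. It assumes, purely for illustration, that the
divergence is resolved in the revision control tool by mapping every
category name to NFC before choosing a directory.

```python
import unicodedata
from collections import defaultdict


def to_nfc(name: str) -> str:
    """One possible canonicalization choice (an assumption, not a rule)."""
    return unicodedata.normalize("NFC", name)


def group_patch_logs(entries, canonicalize=None):
    """Group (category, patch) pairs into per-category 'directories'.

    With canonicalize=None, names are compared codepoint for codepoint,
    as a naive tool would.
    """
    dirs = defaultdict(list)
    for category, patch in entries:
        key = canonicalize(category) if canonicalize else category
        dirs[key].append(patch)
    return dict(dirs)


# The "same" category as delivered by two archive hosts:
entries = [
    ("caf\u00e9", "patch-1"),   # host A stores the precomposed form
    ("cafe\u0301", "patch-2"),  # host B stores the decomposed form
]

naive = group_patch_logs(entries)           # two category directories
merged = group_patch_logs(entries, to_nfc)  # one category directory
```

The naive grouping yields two directories that look identical on
screen; the canonicalizing one yields a single directory, but only
because this sketch picked a layer and a form to do the work.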
It's going to be a while before "everything works swimmingly" in
Unicode-world.
-t