[Pika-dev] Re: tla1.2 on cygwin


From:	Tom Lord
Subject:	[Pika-dev] Re: tla1.2 on cygwin
Date:	Sun, 14 Mar 2004 09:02:30 -0800 (PST)

    > From: Matthew Dempsky <address@hidden>

    [moving a gnu-arch-users thread here.]

    >> In other news -- I really shudder to think what happens when case
    >> insensitive file systems meet Unicode.  Suddenly you're talking about
    >> embedding a fairly complex case canonicalization algorithm based on
    >> relatively huge data tables in every kernel and application that might
    >> ever need to compare a filename.  And the end result is going to be
    >> either locale dependent (uh... yeah, right) or linguistically
    >> imperfect

    > Isn't this an issue you're sort of tackling with srfi-52?

    > Maybe it's a little different because the majority of the
    > case-insensitivity identifiers can be handled just internally
    > (variable references and stuff can be optimized to just frame
    > references, etc.), but if you're promoting Scheme as a system language
    > then it seems like somewhere it would have to be exposed.

    > I could believe the argument might be related to file systems being a
    > (usually) kernel-space issue while even systems programming with
    > Scheme is still user-space (unless we bring back Lisp machines), but
    > if microkernels were more popular then the file systems could be just
    > user-space programs.

    > Personally, I don't have a very strong opinion either way, but I'm
    > curious about your (apparently inconsistent) thoughts.

It's related to SRFI-52 but only in the sense that both are related to
case insensitive string and identifier comparison in Unicode
generally.

There are two levels of problem for Scheme:

1) How can portable programs decide if two identifier names are
   equivalent, disregarding case, without knowing what character set
   and canonicalization rules a given implementation uses?

  SRFI-52 answers that (and some other things).


2) If there is to be a standard for exchanging Scheme programs 
   containing identifiers which contain Unicode characters,
   then what should the standard say about canonicalization and 
   comparison rules for identifiers?

  One of the draft SRFIs puts down a straw-man for this -- SRFI-52
  does not try to answer this.
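
To make problem (1) concrete, here is a minimal Python sketch of one
way to compare identifier names while disregarding case.  It is not
taken from SRFI-52; the helper names `caseless_key' and
`same_identifier' are hypothetical.  It assumes Unicode default case
folding with canonical decomposition before and after, which is
locale-independent and therefore linguistically imperfect for some
languages.

```python
import unicodedata


def caseless_key(name: str) -> str:
    """Hypothetical helper: reduce a name to a caseless comparison key."""
    # Decompose first so precomposed and combining spellings agree.
    decomposed = unicodedata.normalize("NFD", name)
    # Default (locale-independent) Unicode case folding.
    folded = decomposed.casefold()
    # Folding can yield a non-normalized string, so decompose again.
    return unicodedata.normalize("NFD", folded)


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: are two identifier names equal, ignoring case?"""
    return caseless_key(a) == caseless_key(b)


print(same_identifier("Stra\u00dfe", "STRASSE"))     # sharp s folds to "ss"
print(same_identifier("\u00c9cole", "e\u0301cole"))  # precomposed vs. combining accent
print(same_identifier("I", "\u0131"))                # Latin I vs. Turkish dotless i
```

The last comparison is the locale problem in miniature: a Turkish
reader treats "I" and dotless "i" as a case pair, but default folding
maps "I" to the dotted "i", so the two names compare as different.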


The problems for arch are perhaps similar but not the same.  They are
far harder to solve.

a) Should the global namespace of archives and revisions be extended
   to permit non-ASCII characters in names and, if so, how?

   I tentatively think "yes".


b) Should case insensitivity of any kind be introduced?

  No.

c) What are the canonicalization rules for global names?

  This is a hard part.   While names are purely ASCII, they are
  bitwise identical when used as filenames on just about any
  platform we'd care to support.

  If names can be Unicode, then file systems will impose different
  requirements on how they are stored as filenames.    Some will want
  UTF-8, others UTF-16.   Some systems will insist on one
  canonicalization form, others on another form, and others will not
  perform canonicalization at all.

  The transports we use will likely muddy the waters further, though
  that remains to be seen -- what will SFTP do with a UTF-16 encoded
  name?  Will WebDAV implementations perform canonicalization?


d) And what of filenames within project trees?

  The same issues (encoding and canonicalization) apply.   They apply
  not only to global arch names within project trees (such as patch
  logs) but also to the user's source files.    Here it's even worse
  because source files interact with compilers and `make' and so
  forth -- not just the filesystem and arch.   Our liberty to
  rewrite filenames as they ship across platforms during distributed
  revision control operations may be further restricted.
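
As a rough illustration of the encoding and canonicalization issue,
the following Python sketch lists the byte strings one name could
turn into, depending on which normalization form and which encoding
a file system insists on.  The helper `stored_forms' and the sample
names are hypothetical; real file systems may add rules of their own.

```python
import unicodedata


def stored_forms(name: str) -> dict:
    """Hypothetical helper: bytes a name might be stored as on disk."""
    forms = {}
    for norm in ("NFC", "NFD"):
        text = unicodedata.normalize(norm, name)
        for codec in ("utf-8", "utf-16-le"):
            forms[(norm, codec)] = text.encode(codec)
    return forms


# A non-ASCII name: every (normalization, encoding) pair gives different bytes.
for key, raw in stored_forms("caf\u00e9").items():
    print(key, raw.hex())

# A purely ASCII name: normalization changes nothing, and the UTF-8
# bytes are the ASCII bytes.
ascii_forms = stored_forms("hello-world")
print(ascii_forms[("NFC", "utf-8")] == ascii_forms[("NFD", "utf-8")])
```

Any tool that compares such names byte for byte sees four different
names where the user sees one.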


To give a sense of how things can go wrong: what happens if I have the
"same" category name in two archives, but stored as filenames in
different canonicalization forms on the two archive hosts?  If I'm
merging from each of these into my project tree, I want the patch logs
to go in the same category directory in my tree.

Where does that canonicalization divergence get resolved?   In `tar'?  
In `arch'?  In the kernel on my system?   What about the arguments
passed to `patch' and `diff'?  There's no right answer yet to
questions like these.
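
The merge scenario can be sketched in a few lines of Python.  This is
not how arch works; `group_patch_logs' and the sample data are
hypothetical, and picking NFC inside the tool is only one of the
possible places to resolve the divergence, shown here as an assumption
rather than a recommendation.

```python
import unicodedata
from collections import defaultdict


def group_patch_logs(logs, canon=None):
    """Hypothetical helper: group patch names by category directory."""
    dirs = defaultdict(list)
    for category, patch in logs:
        # Without a canonicalization step the raw name is the directory key.
        key = canon(category) if canon else category
        dirs[key].append(patch)
    return dict(dirs)


logs = [
    ("na\u00efve", "patch-1"),   # host A stores the precomposed form
    ("nai\u0308ve", "patch-2"),  # host B stores the decomposed form
]

raw = group_patch_logs(logs)
nfc = group_patch_logs(logs, lambda s: unicodedata.normalize("NFC", s))

print(len(raw))  # number of category directories with no canonicalization
print(len(nfc))  # number of category directories after normalizing to NFC
```

With no canonicalization the two logs land in two directories that
look identical on screen; normalizing in the tool merges them, but
only if every other program touching those paths agrees on the same
form.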

It's going to be a while before "everything works swimmingly" in
Unicode-world.

-t
