Re: Multibyte and unibyte file names

emacs-devel

From:	Stephen J. Turnbull
Subject:	Re: Multibyte and unibyte file names
Date:	Sun, 27 Jan 2013 02:10:54 +0900

I have to say I'm depressed: it is indeed sounding like a fair amount
of work, even without trying to get rid of the root cause.

Eli Zaretskii writes:

 > > My preferred flavor of Emacs never had unibyte.  It's got its problems
 > > in this area, but they're just lazy or over-ambitious programmer bugs,
 > > not a design flaw.
 > 
 > I can't reason about something I know nothing about.  So this is not a
 > useful argument.

Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
the regular howls from people who want to support XEmacs when Emacs
does something to break compatibility are proof of that.  Nevertheless,
we've never needed unibyte, and our *-as-unibyte functions are no-ops,
and nobody has ever complained about that (a fact that remains
somewhat surprising to me).

 > > Of course.  In fact, pretty much all interaction with the outside
 > > world involves byte streams.  The problem Emacs is experiencing here
 > > is that Lisp can see bytes when it is designed only to work with
 > > characters.
 > 
 > In GNU Emacs, Lisp can work with bytes as well.

Not very well, historically (\207 bug, the expand-file-name bug Stefan
mentioned).  Nothing to be ashamed of as far as bug counts go:
dealing with the bytes/unicode split has cost Python a huge amount of
effort, and many bugs.  But it was unnecessary in the first place in
Emacs.

 > That's OK.  Emacs cannot solve these situations, and I didn't try to
 > target them.  I will be happy enough to correctly support file names
 > consistently encoded in a single encoding that is the value of
 > file-name-coding-system.  I hope you will agree that having _that_
 > broken is not good.

It's horrible.  I'm just saying that it might very well be worth
biting the bullet and eliminating unibyte instead of trying to patch
up a fundamentally poor design.  Or at least bypass unibyte for these
functions.

 > If you look back at this thread, you will see that this is what I
 > tried to say, but was consistently told that Posix systems have no
 > such problems "in practice".

Your informants evidently don't live in Japan.  In practice it's only
a problem if you need to deal with Shift JIS (cp932), such as on a
thumb drive or SMB mount (ISTR for CIFS Samba uses Unicode somehow
nowadays).  Nobody even thinks about using 7-bit JIS etc; POSIX
systems use either UTF-8 or EUC-JP (which you may recall is
ASCII-compatible, and uses only high-bit-set bytes for Japanese).  I
imagine there are similar issues for some subset of Chinese due to
Big5.
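
A minimal sketch of the difference, assuming Python's standard euc_jp
and cp932 codecs; the sample character and the helper name
ascii_range_bytes are chosen here purely for illustration and appear
nowhere in Emacs.  It checks which bytes of an encoded kanji fall in
the 7-bit ASCII range.

```python
def ascii_range_bytes(text, encoding):
    """Return the bytes of the encoded text that fall in the 7-bit ASCII range."""
    return [b for b in text.encode(encoding) if b < 0x80]


sample = "表"  # illustrative kanji; cp932 encodes it as 0x95 0x5C

euc = sample.encode("euc_jp")
sjis = sample.encode("cp932")

# EUC-JP: every byte of a Japanese character has the high bit set,
# so a byte-wise scan for an ASCII delimiter cannot hit it.
euc_ascii = ascii_range_bytes(sample, "euc_jp")

# cp932: the trail byte 0x5C is the same byte value as ASCII backslash.
sjis_ascii = ascii_range_bytes(sample, "cp932")

# A naive byte-wise search for backslash reports a match inside the kanji.
naive_hit = sjis.find(b"\\")
```

Code that walks an undecoded cp932 file name byte by byte can
therefore mistake half of a character for a delimiter, which cannot
happen with EUC-JP or UTF-8.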

It *is* true that such issues are becoming rarer (but Shift JIS
incompatibility is a monthly annoyance for me because of a broken FTP
server I have to deal with).

 > Decoding is not a problem, but it hampers efficiency.

I'm sorry, but that's, uh, "premature optimization".  If Emacs were a
p-language, you'd have a wooden leg to stand on.[1]  But it's not.
People do not write byte-shoveling applications in Emacs Lisp.  They
do write text-shoveling applications, but to be correct those require
atomic characters, so you need to convert anyway.

 > There's also an associated problem that decoding a file can GC,
 > which is not good for functions that get 'char *' pointers as
 > arguments.

So never give them a char* into a Lisp_String, or inhibit GC when you
do.  But strncpy is plenty fast for this application[2], one hell of a
lot faster than the system calls you make to access a filesystem.
Even strndup is fast enough in our experience.

 > > In fact AFAIK the set of programs that use the unibyte feature at
 > > all is pretty small, and most of those (like Tramp) do so only in
 > > self-defense.
 > 
 > You are thinking on the wrong level.  The problem rears its ugly head
 > on the C level, not on the Lisp level.  Functions in dired.c and
 > fileio.c manipulate file names, assuming it is safe to address
 > individual bytes even if the file name is in some DBCS encoding.

And that's not mediated by Lisp?  I would be surprised if you find any
code paths involving dired that grab a filename from the system, pass
it to a manipulation function, and then try to access the file without
ever storing it in a Lisp object.[3][4]

Footnotes:
[1]  There's plenty of evidence that converting unibyte strings to
Unicode (widechar) in Python 3 doesn't hurt anything but the feelings
of people who assume it's costly but don't benchmark.

[2]  You know that the buffer size is at most PATH_MAX + 1.

[3]  Except for very early in initialization of the interpreter, when
Emacs is still finding pieces of itself.

[4]  Indeed those were among the earliest files to be fully Mule-ized
in XEmacs, which in XEmacs means that textual data received from
outside of XEmacs is immediately converted to internal representation,
and only converted back to external representation immediately before
the system library call or kernel call that consumes it.
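
The convert-at-the-boundary discipline described in [4] can be
sketched roughly as follows.  This is Python rather than XEmacs C, and
FILE_NAME_CODING, from_external, to_external and list_directory are
hypothetical names invented for the sketch, with FILE_NAME_CODING
standing in for the value of file-name-coding-system.

```python
import os

# Assumption: one consistent encoding for all file names.
FILE_NAME_CODING = "utf-8"


def from_external(raw):
    """Decode bytes received from the OS into internal text, immediately."""
    return raw.decode(FILE_NAME_CODING)


def to_external(name):
    """Encode internal text to bytes, immediately before the system call."""
    return name.encode(FILE_NAME_CODING)


def list_directory(directory):
    """Return directory entries as text; bytes never escape this function."""
    # Passing bytes to os.listdir makes it return raw bytes names.
    raw_names = os.listdir(to_external(directory))
    return sorted(from_external(raw) for raw in raw_names)
```

In this sketch a name that is not valid in FILE_NAME_CODING raises
UnicodeDecodeError at the boundary instead of leaking raw bytes to the
callers; everything above the boundary manipulates characters only.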
