emacs-devel

Re: Multibyte and unibyte file names


From: Eli Zaretskii
Subject: Re: Multibyte and unibyte file names
Date: Sat, 26 Jan 2013 19:33:55 +0200

> From: "Stephen J. Turnbull" <address@hidden>
> Cc: address@hidden
> Date: Sun, 27 Jan 2013 02:10:54 +0900
> 
>  > > My preferred flavor of Emacs never had unibyte.  It's got its problems
>  > > in this area, but they're just lazy or over-ambitious programmer bugs,
>  > > not a design flaw.
>  > 
>  > I can't reason about something I know nothing about.  So this is not a
>  > useful argument.
> 
> Sure it is.  XEmacs is a pretty good facsimile of Emacs-compatibility;
> the regular howls from people who want to support XEmacs when Emacs
> does something to break compatibility are proof of that.  Nevertheless,
> we've never needed unibyte, and our *-as-unibyte functions are no-ops,
> and nobody has ever complained about that (a fact that remains
> somewhat surprising to me).

Every solution to a problem has its downsides and its upsides.  I'm
saying that I cannot consider them in this case and therefore cannot
tell you whether on balance it is better than what Emacs does now.

>  > > Of course.  In fact, pretty much all interaction with the outside
>  > > world involves byte streams.  The problem Emacs is experiencing here
>  > > is that Lisp can see bytes when it is designed only to work with
>  > > characters.
>  > 
>  > In GNU Emacs, Lisp can work with bytes as well.
> 
> Not very well, historically (\207 bug, the expand-file-name bug Stefan
> mentioned).  Nothing to be ashamed of at the counting bugs level:
> dealing with the bytes/unicode split has cost Python a huge amount of
> effort, and many bugs.  But it was unnecessary in the first place in
> Emacs.

It _is_ necessary because file names passed to system APIs _must_ be
encoded.  That's where the bugs mentioned here (already fixed, btw)
happen: in the implementation of 'stat' we have in Emacs that does a
better job than the MS runtime, and in other similar cases.

> 
>  > That's OK.  Emacs cannot solve these situations, and I didn't try to
>  > target them.  I will be happy enough to correctly support file names
>  > consistently encoded in a single encoding that is the value of
>  > Decoding is not a problem, but it hampers efficiency.
> 
> I'm sorry, but that's, uh, "premature optimization".

It's not premature.  directory-files-and-attributes, used on Windows
to emulate 'ls', must be fast enough even in large directories,
because otherwise Dired will be painfully slow to start.  As it is,
things are already too slow, especially with remote filesystems;
there were bug reports about this last year.  IOW, the current
implementation is already borderline performance-wise.

>  > There's also an associated problem that decoding a file can GC,
>  > which is not good for functions that get 'char *' pointers as
>  > arguments.
> 
> So never give them a char* into a Lisp_String, or inhibit GC when you
> do.  But strncpy is plenty fast for this application[2], one hell of a
> lot faster than the system calls you make to access a filesystem.
> Even strndup is fast enough in our experience.

It's not rocket science, true.  I'm just saying that if it isn't
required, it's best avoided.
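
To make the copying approach discussed above concrete, here is a minimal
sketch in C.  The helper name copy_file_name is hypothetical and is not
an Emacs function; it only illustrates taking a private copy of the
encoded bytes before calling anything that might trigger GC and relocate
the original string data.  It uses memcpy with an explicit length rather
than strncpy, because strncpy does not NUL-terminate the destination
when the source fills the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Emacs: copy the SRCLEN bytes of the
   encoded file name SRC into DST (capacity DSTSIZE) and NUL-terminate.
   Returns the number of bytes copied, or (size_t) -1 if the name plus
   its terminator does not fit.  After the copy, the caller can keep
   using DST even if the original string data is relocated.  */
size_t
copy_file_name (char *dst, size_t dstsize, const char *src, size_t srclen)
{
  assert (dst != NULL && src != NULL);
  if (srclen >= dstsize)
    return (size_t) -1;         /* no room for the terminating NUL */
  memcpy (dst, src, srclen);
  dst[srclen] = '\0';
  return srclen;
}
```

A caller would copy into a local buffer first and pass only that buffer
to the function taking 'char *'.  The cost is one extra copy per file
name, which is the overhead being weighed in the exchange above.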

>  > > In fact AFAIK the set of programs that use the unibyte feature at
>  > > all is pretty small, and most of those (like Tramp) do so only in
>  > > self-defense.
>  > 
>  > You are thinking on the wrong level.  The problem rears its ugly head
>  > on the C level, not on the Lisp level.  Functions in dired.c and
>  > fileio.c manipulate file names, assuming it is safe to address
>  > individual bytes even if the file name is in some DBCS encoding.
> 
> And that's not mediated by Lisp?  I would be surprised if you find any
> code paths involving dired that grab a filename from the system, pass
> it to a manipulation function, and then try to access the file without
> ever storing it in a Lisp object.[3][4]

I gave examples in this thread that should surprise you.

In any case, as long as file-name primitives support unibyte (encoded)
file names, there's nothing to prevent such examples from popping up.
Programmers are not disciplined enough to trust them on this.
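
As an illustration of why addressing individual bytes of a DBCS-encoded
file name is unsafe, here is a small sketch in C.  It assumes the
encoding is cp932 (Shift-JIS), where a two-byte character may have 0x5C
(the backslash) as its second byte.  The function names are hypothetical
and are not taken from dired.c or fileio.c.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: cp932 lead bytes are 0x81..0x9F and 0xE0..0xFC.  */
static int
is_cp932_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive scan: treats every 0x5C byte as a directory separator, even
   when it is the trail byte of a two-byte character.  */
const char *
last_backslash_bytewise (const char *name)
{
  const char *last = NULL;
  assert (name != NULL);
  for (; *name; name++)
    if (*name == '\\')
      last = name;
  return last;
}

/* DBCS-aware scan: skips both bytes of a two-byte character, so a
   trail byte of 0x5C is never mistaken for a separator.  */
const char *
last_backslash_dbcs (const char *name)
{
  const char *last = NULL;
  assert (name != NULL);
  while (*name)
    {
      if (is_cp932_lead_byte ((unsigned char) *name) && name[1] != '\0')
        name += 2;
      else
        {
          if (*name == '\\')
            last = name;
          name++;
        }
    }
  return last;
}
```

For the byte sequence 'C', ':', '\', 0x95, 0x5C, ".txt", the naive scan
reports the separator at offset 4, inside the two-byte character, while
the DBCS-aware scan reports offset 2.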

> [4]  Indeed those were among the earliest files to be fully Mule-ized
> in XEmacs, which in XEmacs means that textual data received from
> outside of XEmacs is immediately converted to internal representation,
> and only converted back to external representation immediately before
> the system library call or kernel call that consumes it.

No such coding standards in Emacs, and the C code does manipulate
unibyte strings as long as they don't need to be passed to Lisp.
Earlier in this thread I suggested converting to internal
representation at entry to all primitives, but it looks like Stefan
disagrees, or at least does not completely agree.
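
The "decode at entry, encode just before the system call" convention
described in the quoted footnote can be sketched in C as follows.  This
is only a toy under stated assumptions: Latin-1 stands in for the
external file-name encoding and plain UTF-8 stands in for the internal
representation.  Both function names are hypothetical; neither Emacs
nor XEmacs implements its conversions this way.

```c
#include <assert.h>
#include <stddef.h>

/* Decode SRCLEN Latin-1 bytes into UTF-8.  DST must have room for
   2 * SRCLEN + 1 bytes.  Returns the byte count written, not counting
   the terminating NUL.  */
size_t
decode_latin1 (char *dst, const unsigned char *src, size_t srclen)
{
  size_t n = 0;
  assert (dst != NULL && src != NULL);
  for (size_t i = 0; i < srclen; i++)
    {
      if (src[i] < 0x80)
        dst[n++] = (char) src[i];
      else
        {
          /* U+0080..U+00FF become two UTF-8 bytes.  */
          dst[n++] = (char) (0xC0 | (src[i] >> 6));
          dst[n++] = (char) (0x80 | (src[i] & 0x3F));
        }
    }
  dst[n] = '\0';
  return n;
}

/* Encode SRCLEN UTF-8 bytes back to Latin-1.  DST must have room for
   SRCLEN + 1 bytes.  Returns the byte count written, or (size_t) -1
   if the input is malformed or not representable in Latin-1.  */
size_t
encode_latin1 (unsigned char *dst, const char *src, size_t srclen)
{
  size_t n = 0;
  assert (dst != NULL && src != NULL);
  for (size_t i = 0; i < srclen; i++)
    {
      unsigned char c = (unsigned char) src[i];
      if (c < 0x80)
        dst[n++] = c;
      else if ((c == 0xC2 || c == 0xC3) && i + 1 < srclen
               && ((unsigned char) src[i + 1] & 0xC0) == 0x80)
        {
          dst[n++] = (unsigned char) (((c & 0x03) << 6)
                                      | ((unsigned char) src[i + 1] & 0x3F));
          i++;                  /* consumed the continuation byte */
        }
      else
        return (size_t) -1;
    }
  dst[n] = '\0';
  return n;
}
```

Under such a convention, everything between the two conversions works on
characters in the internal representation, and encoded bytes exist only
immediately around the call that consumes them.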


