emacs-devel

Multibyte and unibyte file names


From: Eli Zaretskii
Subject: Multibyte and unibyte file names
Date: Wed, 23 Jan 2013 19:45:35 +0200

For some initial context, see

  http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13515#14

and my response there.  However, the issue at hand is IMO much
broader.

Let me start with a question: do file primitives need to support
unibyte file names, as well as multibyte ones?  To avoid ambiguity,
let me say right away that by "unibyte" I mean here file names encoded
in some file-name-coding-system, possibly with non-ASCII characters.
I do NOT mean pure-ASCII file names (which in Emacs are normally
represented as unibyte strings).

Looking at the code, it looks as if the answer to the above is YES.
For example, expand-file-name clearly tries to be careful to support
both, as seen, for example, from this snippet:

  multibyte = STRING_MULTIBYTE (name);
  if (multibyte != STRING_MULTIBYTE (default_directory))
    {
      if (multibyte)
        default_directory = string_to_multibyte (default_directory);
      else
        {
          name = string_to_multibyte (name);
          multibyte = 1;
        }
    }

Moreover, some other primitives clearly expect other primitives to
work on encoded file names.  Here's a fragment from
file_name_completion:

  encoded_dir = ENCODE_FILE (dirname);

  block_input ();
  d = opendir (SSDATA (Fdirectory_file_name (encoded_dir)));

Assuming that encoded file names _should_ be supported, I think this
snippet, from directory_file_name, is a bug:

  if (srclen > 1
      && IS_DIRECTORY_SEP (dst[srclen - 1]))
    {
      dst[srclen - 1] = 0;
      srclen--;
    }

If dst[] is an encoded string that uses a multibyte encoding, it is
wrong to look at just the last byte of the string, because it could be
a trailing byte of some multibyte sequence, right?  There are a lot of
similar fragments in fileio.c, so much so that it seems as if there's
a hidden assumption that these strings cannot be encoded.  Which seems
to contradict the two fragments above, from expand-file-name and from
file_name_completion.  Am I missing something?
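
To make the last-byte concern concrete, here is a minimal sketch in C.
It assumes cp932 (Shift-JIS) as the file-name encoding, where the
character 表 is the byte pair 0x95 0x5C, so its trailing byte equals
'\\'.  The names is_directory_sep, cp932_lead_byte and the two
ends_with_sep_* helpers are hypothetical and exist only for this
illustration; they are not Emacs functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for IS_DIRECTORY_SEP on MS-Windows.  */
bool is_directory_sep (unsigned char c)
{
  return c == '/' || c == '\\';
}

/* cp932 lead bytes are 0x81..0x9F and 0xE0..0xFC.  */
bool cp932_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Byte-wise test, in the style of the directory_file_name fragment:
   it looks only at the last byte of the string.  */
bool ends_with_sep_bytewise (const char *s)
{
  assert (s != NULL);
  size_t len = strlen (s);
  return len > 1 && is_directory_sep ((unsigned char) s[len - 1]);
}

/* Character-wise test: walk from the start, skipping two bytes for
   every double-byte character, and remember whether the last
   character seen was a separator.  */
bool ends_with_sep_cp932 (const char *s)
{
  assert (s != NULL);
  size_t len = strlen (s);
  size_t i = 0;
  bool last_is_sep = false;

  while (i < len)
    {
      unsigned char c = (unsigned char) s[i];
      if (cp932_lead_byte (c) && i + 1 < len)
        {
          last_is_sep = false;  /* double-byte character */
          i += 2;
        }
      else
        {
          last_is_sep = is_directory_sep (c);
          i += 1;
        }
    }
  return len > 1 && last_is_sep;
}
```

For the encoded name "dir\x95\x5c" (that is, "dir表"), the byte-wise
test reports a trailing separator and would chop the character in
half, while the character-wise test does not.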

Why is this important?  For two main reasons:

 1) Many file primitives call dostounix_filename on MS-Windows.  That
    function converts backslashes to forward slashes and optionally
    down-cases the file name.  It is currently written to accept an
    encoded file name, and as long as file primitives need to support
    unibyte file names, dostounix_filename must DTRT with them.
    Working on encoded file names means that, in some situations,
    file names that cannot be encoded in file-name-coding-system come
    out of dostounix_filename butchered, even though some primitives
    are supposed to work on file names at the syntactic level only,
    independently of whether the file name can be passed to the
    underlying filesystem.  This also means that only cpNNNN encodings
    are fully supported on MS-Windows, because for other encodings
    Windows APIs don't have information which allows, e.g., advancing
    by characters in an encoded file name, looking for slashes and
    backslashes, and down-casing characters.

 2) This gets worse with remote file names.  For these, the handlers
    are always called first, and the result is never run through
    dostounix_filename.  However, Tramp sometimes turns around and
    calls the "real" handler on parts of the remote file name,
    evidently expecting that "real" handler not to do any harm.  But
    due to the above, it does do harm.  While it might be justified to
    limit native file name support to file names encodable with the
    current file-name-coding-system, it _cannot_ be justified for
    remote file names.  An example of this is file-name-directory:

     (defun tramp-handle-file-name-directory (file)
       "Like `file-name-directory' but aware of Tramp files."
       ;; Everything except the last filename thing is the directory.  We
       ;; cannot apply `with-parsed-tramp-file-name', because this expands
       ;; the remote file name parts.  This is a problem when we are in
       ;; file name completion.
       (let ((v (tramp-dissect-file-name file t)))
         ;; Run the command on the localname portion only.
         (tramp-make-tramp-file-name
          (tramp-file-name-method v)
          (tramp-file-name-user v)
          (tramp-file-name-host v)
          (tramp-run-real-handler
           'file-name-directory (list (or (tramp-file-name-localname v) ""))))))

    which on Windows means that, e.g.

      (let ((file-name-coding-system 'cp1252))
        (file-name-directory "/address@hidden:漢字/"))

       => "/address@hidden:  /"

    And there are other similar handlers in Tramp (e.g., the
    file-name-nondirectory handler) which do the same.  IOW, they seem
    to _assume_ that the corresponding "real" handler never needs to
    encode the file name.  A false assumption.

I don't know what to do with this mess.  If file primitives are not
supposed to handle encoded file names, dostounix_filename could be
rewritten to work on multibyte strings in Emacs's internal
representation, and then it wouldn't need to rely on Windows APIs that
require the encoding to be known to Windows and the characters in the
file name to be encodable in that encoding.  But that would need
non-trivial changes elsewhere, and we need to decide what to do if an
encoded string does get passed to these primitives (signal an error?).
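
A rough sketch of why the internal representation makes this easier,
in C.  It assumes only that the internal multibyte representation
behaves like UTF-8 in one respect: bytes below 0x80 never occur inside
a multibyte sequence.  Under that assumption a plain byte loop can
replace backslashes without knowing anything about the encoding.  The
function name backslashes_to_slashes is hypothetical, and the sketch
covers only the slash conversion, not the optional down-casing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every '\\' byte with '/', in place.  This is safe only if
   the byte 0x5C can never be part of a multibyte sequence, which
   holds for UTF-8 but not for encodings such as cp932.  */
void backslashes_to_slashes (char *s)
{
  assert (s != NULL);
  for (; *s; s++)
    if (*s == '\\')
      *s = '/';
}
```

With the UTF-8 bytes of 表 (0xE8 0xA1 0xA8) in the name, the loop
touches only the real backslashes.  Run on the cp932 bytes of the same
character (0x95 0x5C), the same loop rewrites the trailing byte and
corrupts the name, which is why a byte-wise version cannot be used on
encoded strings.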

Note that, as long as encoded multibyte strings can get into these
primitives, code that advances by bytes and examines individual bytes
for equality to certain values like '/' is buggy on Unix as well,
unless I'm missing something.
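
As a hedged illustration of the Unix case, the C sketch below uses
ISO-2022-JP, a stateful 7-bit encoding in which both bytes of a
double-byte character lie in the range 0x21..0x7E, so the byte 0x2F
('/') can appear inside a character.  Whether such an encoding is ever
used as file-name-coding-system in practice is a separate question;
with UTF-8 or the EUC family the byte '/' does not occur inside
multibyte sequences.  The function names are hypothetical, and only
the common escape sequences ESC $ B, ESC $ @, ESC ( B and ESC ( J are
recognized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count '/' by looking at every byte, ignoring the encoding.  */
size_t count_slashes_bytewise (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  size_t n = 0;
  for (size_t i = 0; i < len; i++)
    if (s[i] == '/')
      n++;
  return n;
}

/* Count '/' while tracking the ISO-2022-JP shift state, so that bytes
   inside double-byte characters are not mistaken for separators.  */
size_t count_slashes_iso2022jp (const unsigned char *s, size_t len)
{
  assert (s != NULL);
  size_t n = 0, i = 0;
  bool two_byte = false;

  while (i < len)
    {
      if (s[i] == 0x1B && i + 2 < len)
        {
          /* ESC $ B or ESC $ @ : switch to a double-byte set.  */
          if (s[i + 1] == '$' && (s[i + 2] == 'B' || s[i + 2] == '@'))
            {
              two_byte = true;
              i += 3;
              continue;
            }
          /* ESC ( B or ESC ( J : switch back to a single-byte set.  */
          if (s[i + 1] == '(' && (s[i + 2] == 'B' || s[i + 2] == 'J'))
            {
              two_byte = false;
              i += 3;
              continue;
            }
        }
      if (two_byte && i + 1 < len)
        i += 2;                 /* skip one double-byte character */
      else
        {
          if (s[i] == '/')
            n++;
          i++;
        }
    }
  return n;
}
```

For the byte sequence 'a' '/' ESC '$' 'B' 0x30 0x2F ESC '(' 'B' '/'
'b', the byte-wise count is 3, while the state-aware count is 2,
because the 0x2F after 0x30 is the second byte of a double-byte
character.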

Comments are welcome, as well as pointers to what I missed.

TIA



