Re: Filename Encoding
From: Ben Pfaff
Subject: Re: Filename Encoding
Date: Sun, 22 Dec 2013 09:23:29 -0800
User-agent: Mutt/1.5.21 (2010-09-15)
On Sun, Dec 22, 2013 at 05:51:47PM +0100, John Darrington wrote:
> On Tue, Dec 10, 2013 at 12:38:04PM -0800, Ben Pfaff wrote:
>
> in syntax and the output engine, we tend to convert everything we
> receive externally into UTF-8 for internal processing, and then convert
> back to other encodings as necessary.
>
> I'm not sure this was the best decision. How do we know which
> encoding we should convert back to?
I guess you mean in the context of filenames.

We have to know the system filename encoding. There can only
reasonably be one such encoding per system, because programs as simple
as "ls" and as complicated as GUI file-selection dialogs have to
be able to display and manipulate filenames, and there's no way to
attach an encoding to individual files, or even to directories or file
systems.
I only know of two ways to determine the system filename encoding.
Programs like "ls" use the locale encoding. This has the (arguable)
advantage of giving the user easy control over the encoding, but it also
allows for inconsistencies if the user switches between locales that
have different encodings. Ultimately, it depends on the user to set a
reasonable locale.
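A minimal sketch of the locale-based approach, in Python purely for illustration. The function names are hypothetical, and this only approximates what a tool like "ls" does: it takes the encoding from the user's locale and uses it to turn raw filename bytes into displayable text.

```python
import locale


def locale_filename_encoding() -> str:
    """Return the character encoding of the user's current locale."""
    try:
        # Adopt the locale named by the environment (LC_ALL, LC_CTYPE, LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale this system does not provide;
        # keep whatever locale the process already has.
        pass
    # False: report the encoding of the locale now in effect without
    # letting this call change the locale again.
    return locale.getpreferredencoding(False)


def display_name(raw_name: bytes) -> str:
    """Decode a raw filename for display, marking undecodable bytes."""
    return raw_name.decode(locale_filename_encoding(), errors="replace")
```

The result follows whatever locale the user has set, which is exactly the source of the inconsistency described above: the same bytes on disk decode differently after a switch between locales with different encodings.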
The other way I know of to determine the system filename encoding is the
method that GLib uses: always assume that the filename encoding is UTF-8
(unless overridden by the G_FILENAME_ENCODING environment variable). In
my opinion this makes more sense in modern environments, where one can
assume that the file system was created relatively recently.
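The GLib-style rule can be sketched roughly as follows, again in Python for illustration only. This is a simplified approximation, not GLib's code: the function name is hypothetical, and it handles only the first entry of G_FILENAME_ENCODING (which GLib documents as a comma-separated list) plus the special "@locale" token.

```python
import locale
import os


def glib_style_filename_encoding(environ=None) -> str:
    """Approximate the rule: UTF-8 unless G_FILENAME_ENCODING overrides it."""
    env = os.environ if environ is None else environ
    value = env.get("G_FILENAME_ENCODING", "")
    if value:
        # Only the first character set in the list is used here.
        first = value.split(",")[0].strip()
        if first == "@locale":
            # Special token: fall back to the locale's encoding.
            return locale.getpreferredencoding(False)
        if first:
            return first
    # Default assumption: filenames are UTF-8.
    return "UTF-8"
```

Passing a dictionary instead of the real environment makes the rule easy to try out, for example `glib_style_filename_encoding({"G_FILENAME_ENCODING": "ISO-8859-1"})`.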
> On a GNU/Linux system (where the filesystem is encoding agnostic) there
> exist two files, which I shall call fileA and fileB.
The file system is encoding agnostic, but everything above the OS itself
has to know the encoding so that it can manipulate file names.
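As a small illustration of why the layers above the OS need that knowledge, here is a hedged Python sketch of the round trip between internal UTF-8 text and the byte strings the kernel stores. The helper names and the example filename are made up for this sketch.

```python
def to_system_filename(name: str, filename_encoding: str) -> bytes:
    """Convert an internal (Unicode) filename to the bytes passed to the OS."""
    return name.encode(filename_encoding)


def from_system_filename(raw: bytes, filename_encoding: str) -> str:
    """Convert bytes returned by the OS to internal text for display."""
    # Undecodable bytes become U+FFFD rather than raising an error.
    return raw.decode(filename_encoding, errors="replace")


# The same visible name maps to different byte strings on disk.
latin1_bytes = to_system_filename("caf\u00e9.sav", "iso-8859-1")
utf8_bytes = to_system_filename("caf\u00e9.sav", "utf-8")

# Guessing the wrong encoding garbles the name on the way back in.
garbled = from_system_filename(latin1_bytes, "utf-8")
```

Here `latin1_bytes` and `utf8_bytes` differ, and `garbled` contains a replacement character where the accented letter should be, so a program that guesses wrong can neither display the name correctly nor reconstruct the original bytes from what it shows.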
> Let us assume that the bytes which comprise the name of fileA
> happen to be valid UTF-8. Let us also assume that the bytes which
> comprise the name of fileB happen to be valid ISO-8859-1. Further,
> let us also assume that when the name of fileB is converted from
> ISO-8859-1 to UTF-8 the result happens to be identical to the name of
> fileA.
I think this is a problem with the file system, caused by a bug in
software or a configuration error. I don't think normal user software
should have to worry about this.
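For concreteness, the fileA/fileB scenario can be reproduced with plain byte strings, without touching a real file system. This is only an illustrative Python sketch; the example name and the `to_internal` helper are invented here.

```python
# Two distinct on-disk names, as the kernel would see them.
file_a = "na\u00efve.sav".encode("utf-8")        # valid UTF-8 bytes
file_b = "na\u00efve.sav".encode("iso-8859-1")   # valid ISO-8859-1 bytes


def to_internal(raw: bytes, assumed_encoding: str) -> str:
    """Convert a raw filename to internal text under an assumed encoding."""
    return raw.decode(assumed_encoding)


internal_a = to_internal(file_a, "utf-8")
internal_b = to_internal(file_b, "iso-8859-1")

# The byte strings differ, yet both collapse to one internal name.
names_differ_on_disk = file_a != file_b
names_collide_internally = internal_a == internal_b

# Converting back with a single system encoding can reach only one file.
reaches_a = internal_a.encode("utf-8") == file_a
reaches_b = internal_a.encode("utf-8") == file_b
```

With UTF-8 as the assumed system encoding, `reaches_a` is true and `reaches_b` is false: fileB can only be addressed by a program that is told, out of band, to use ISO-8859-1 for that one name.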
> In "normal" applications the question is not relevant. Filenames are
> simply byte strings. However, because we convert everything to UTF-8
> in syntax (for example: GET FILE="?pfelfa?.sav".), we no longer know
> the encoding of that filename.
This is just as true for syntax that the user supplies in an ordinary
syntax file. To handle it, one would have to add support for syntax
files that contain a mix of encodings, or add a filename encoding option
to all commands that specify filenames, or otherwise do something
undesirable.
I don't think it's worth catering to systems with broken file systems,
and that's the only situation I know where this matters.