Re: Filename Encoding
From: Ben Pfaff
Subject: Re: Filename Encoding
Date: Wed, 11 Dec 2013 07:38:46 -0800
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Dec 11, 2013 at 09:05:16AM +0100, John Darrington wrote:
> On Tue, Dec 10, 2013 at 12:38:04PM -0800, Ben Pfaff wrote:
>
> I understand now. However, in other places in PSPP, and in particular
> in syntax and the output engine, we tend to convert everything we
> receive externally into UTF-8 for internal processing, and then convert
> back to other encodings as necessary. It would be convenient for some
> purposes to do this for filenames also (e.g. to include file names in
> output), and it would avoid needing to keep around two pieces of
> information (file name plus encoding) when one (UTF-8 file name) would
> do.
>
> Do you think that storing file name plus encoding is superior?
>
> Both solutions have advantages and disadvantages.
>
> The converting-all-filenames-to-utf8 solution has two disadvantages that I
> can see:
>
> *. Unnecessary recoding - often it will be necessary to convert from
> "filename encoding" to utf8 and then, back to "filename encoding".

Is the concern here about performance, or something else? I doubt that
there is a real performance problem with doing one or two conversions of
a file name, once per file open. Also, on GNU/Linux the filename
encoding is usually UTF-8 anyway, so in that case there is no actual
conversion.
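
To make the cost concrete, here is a rough sketch of the per-open
conversion, assuming POSIX iconv() is available. The helper name
recode_file_name() is made up for illustration and is not an existing
PSPP function; the 4-bytes-per-input-byte output bound is also an
assumption that suits byte-oriented file name encodings.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of PSPP): converts the null-terminated
   string S from encoding FROM to encoding TO.  Returns a malloc()'d
   string that the caller must free, or NULL on failure. */
char *
recode_file_name (const char *to, const char *from, const char *s)
{
  size_t in_left = strlen (s);

  /* Same encoding on both sides: a plain copy, no conversion at all. */
  if (strcmp (to, from) == 0)
    {
      char *copy = malloc (in_left + 1);
      if (copy != NULL)
        memcpy (copy, s, in_left + 1);
      return copy;
    }

  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Assumed bound: at most 4 output bytes per input byte. */
  size_t out_left = in_left * 4;
  char *out = malloc (out_left + 1);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *in_p = (char *) s;      /* iconv() does not modify the input. */
  char *out_p = out;

  /* Convert, then flush any pending shift state. */
  int ok = (iconv (cd, &in_p, &in_left, &out_p, &out_left) != (size_t) -1
            && iconv (cd, NULL, NULL, &out_p, &out_left) != (size_t) -1);
  iconv_close (cd);
  if (!ok)
    {
      free (out);
      return NULL;
    }

  *out_p = '\0';
  return out;
}
```

Calling this once or twice per file open is a handful of bytes through
iconv(), and when both encodings are UTF-8 it takes the early-return
path and only copies the string.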

> *. The bigger disadvantage, is that it will be very easy simply to forget
> to do the necessary conversion. If the programmer forgets - the compiler
> won't complain - it is just a char * - Passing a struct file_handle * one
> cannot forget - there'll be a compiler error.

That's true. In data, we use uint8_t instead of char to remind
ourselves that the data is in the dictionary encoding. We could use
int8_t for UTF-8 data, but that doesn't match either libunistring or
glib practice so it would probably cause a lot of friction at
interfaces.
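
One way to get compiler checking without changing the character type
would be to wrap the string in a distinct struct per encoding. This is
only a sketch of the idea: every type and function name below is
hypothetical, and utf8_to_fs() assumes the file system encoding is
UTF-8, so it just copies where a real version would recode.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper types, not part of PSPP.  Both hold a plain
   char *, but the compiler treats them as incompatible types. */
struct utf8_file_name
  {
    char *s;                    /* File name in UTF-8. */
  };

struct fs_file_name
  {
    char *s;                    /* File name in the file system encoding. */
  };

/* Returns a malloc()'d copy of S, or NULL if S is NULL or allocation
   fails. */
char *
copy_string (const char *s)
{
  if (s == NULL)
    return NULL;

  size_t n = strlen (s) + 1;
  char *copy = malloc (n);
  if (copy != NULL)
    memcpy (copy, s, n);
  return copy;
}

/* Wraps a copy of the UTF-8 string UTF8. */
struct utf8_file_name
utf8_file_name_create (const char *utf8)
{
  struct utf8_file_name name = { copy_string (utf8) };
  return name;
}

/* The only way to obtain a struct fs_file_name from a UTF-8 name.
   Assumption for this sketch: the file system encoding is UTF-8, so
   this only copies; a real version would recode here. */
struct fs_file_name
utf8_to_fs (struct utf8_file_name utf8)
{
  struct fs_file_name name = { copy_string (utf8.s) };
  return name;
}

/* Returns the raw string, suitable for passing to fopen() and friends. */
const char *
fs_file_name_c_str (const struct fs_file_name *name)
{
  return name->s;
}
```

With these types, a function declared to take a struct fs_file_name
cannot be handed a struct utf8_file_name by mistake: the call fails to
compile instead of silently passing a string in the wrong encoding.
The cost is the same kind of friction at libunistring and glib
interfaces, since every call there has to unwrap the string.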