
Re: [Qemu-devel] [PATCH v10] Support vhd type VHD_DIFFERENCING


From: Philipp Hahn
Subject: Re: [Qemu-devel] [PATCH v10] Support vhd type VHD_DIFFERENCING
Date: Wed, 11 Mar 2015 09:55:28 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Icedove/31.5.0

Hello,

On 11.03.2015 07:22, Xiaodong Gong wrote:
>> Hope that clarified things.
...
> First, your patch is very clear, a good sample.
> 
> What I said before about storing ASCII code in the kernel was a
> mistake; I meant that glibc needs the arguments of functions such as
> fopen(path) to be ASCII code.

No:
 ASCII = *7* bit, see "man 7 ascii".

 Kernel = *8* bit, that is, the kernel doesn't care whether you use
ISO-8859-1, UTF-8, BIG5, GB2312, or any other encoding you can get by
running "iconv --list".
For the kernel only '\0'=0x00 and '/'=0x2f are special; it does not care
about any other byte and passes them in and out unmodified.

Most character sets keep the ASCII alphabet in the range 0x00-0x7f,
which solves the '\0' and '/' issue nicely.

So again:
- If you use opendir() and readdir(), the kernel returns an 8-bit byte
sequence to you.
- To convert that into a character sequence, you must know which
encoding was used when the file was created. This information is not
stored explicitly in the file-system, file-name, or anywhere else.

- The only hint you get is LC_CTYPE, which is set by the user to tell
you which encoding should be used to convert a byte-stream into a
character-string.

- If I create a new file on the Linux text console, where I did a
"unicode_start", I get a UTF-8 byte sequence from the input layer,
which is passed unmodified through the getty and the shell to the
creat() call. You don't need to know the encoding, you just pass the
data in and out unmodified.

- When typing "ls", the kernel again returns that byte sequence, which
gets passed through the shell to the Linux frame buffer, which
translates that UTF-8 sequence to a character and picks the right glyph
to display on the screen.

- If I don't switch the Linux console to Unicode mode, I get a different
byte sequence from the input layer. That different byte sequence would
be stored on the disk when creating a file. (This translation is
installed by running the "loadkeys" command.)

- If I do the same in X11, the translation from key-codes to characters
and back is done by the terminal (or application). See "man 1 xterm" for
"-lc" and "-u8".

- BUT when you want to produce a specific character encoding, you MUST
know from which encoding you start. Assuming ASCII or UTF-8 is wrong;
you MUST check LC_ALL/LC_CTYPE/LANG by querying nl_langinfo(CODESET),
as sketched below.
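
For illustration, a minimal C sketch of that query (my own example, not
from the patch; it only shows the setlocale()/nl_langinfo() pair on a
glibc system):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Without this call the process stays in the "C" locale and
     * nl_langinfo(CODESET) reports ASCII ("ANSI_X3.4-1968"), no matter
     * what LC_ALL/LC_CTYPE/LANG say in the environment. */
    setlocale(LC_CTYPE, "");
    printf("codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}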

So if I gave you a disk containing a file with the name "\xa1", you
would see a different glyph depending on the locale:
$ for ((c=1;c<=15;c++)); do printf '\xa1' |
    recode ISO-8859-$c..dump-with-names 2>/dev/null | tail -n 1; done
00A1   !I    inverted exclamation mark
0104   A;    latin capital letter a with ogonek
0126   H/    latin capital letter h with stroke
0104   A;    latin capital letter a with ogonek
0401   IO    cyrillic capital letter io
201B   9'    single high-reversed-9 quotation mark
00A1   !I    inverted exclamation mark
0104   A;    latin capital letter a with ogonek
00A1   !I    inverted exclamation mark
00A1   !I    inverted exclamation mark
1E02   B.    latin capital letter b with dot above
00A1   !I    inverted exclamation mark

In a UTF-8 environment you would get an error instead, as "\xa1" is not
a valid UTF-8 byte sequence.
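
The same effect can be shown from C with mbrtowc(3); a small, untested
demo of mine (run it with different LC_ALL settings, assuming those
locales are generated on your system):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");        /* honour LC_ALL/LC_CTYPE/LANG */
    mbstate_t st = {0};
    wchar_t wc;
    size_t r = mbrtowc(&wc, "\xa1", 1, &st);
    if (r == (size_t)-1 || r == (size_t)-2)
        printf("0xa1 is not a complete character in this locale\n");
    else
        printf("0xa1 decodes to U+%04lX\n", (unsigned long)wc);
    return 0;
}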

Read "man 7 unicode", especially the section "Unicode Under Linux" or
"man 7 charsets".

> I think:

Any program basically has two options:

1. The program does not care about different character sets and just
passes file-names and data in and out as byte streams. That is
perfectly okay, and most UNIX shell commands work just fine that way.

2. The program is encoding aware, for example because it works on
characters instead of bytes (like "wc --bytes" vs. "wc --chars") or
needs to perform a conversion between encodings. Then the sanest thing
is to
- query the encoding of the environment once (or per network connection),
- convert any input data from that encoding into a (fixed) internal
format like wchar/utf-16/utf-32, including file-names, file-content, etc.,
- convert the internal data back into the right format on output, which
also includes calling APIs like open().
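
An untested sketch of that second option using iconv(3) (the buffer
size and sample name are just for illustration):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    /* a file-name as readdir() would hand it to us: an opaque byte string */
    char in[] = "some-file-name";
    char out[256];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof(out) - 1;

    /* to-encoding first, from-encoding second: locale codeset -> UTF-8 */
    iconv_t cd = iconv_open("UTF-8", nl_langinfo(CODESET));
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    *outp = '\0';
    printf("internal representation (UTF-8): %s\n", out);
    iconv_close(cd);
    return 0;
}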

Otherwise you always have to remember whether your char[] buffer contains
"some byte stream, which needs to be decoded before being used" or an
"already decoded character string". That is why most libraries and
frameworks provide wrappers for file, input and output handling: they
internally use one data type for characters consistently and hide all
the explicit conversions from you behind those wrappers.
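
As a toy illustration of keeping one internal character type (wchar_t
here; the byte string is assumed to be UTF-8 input):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    /* decode once at the boundary: locale bytes -> wchar_t */
    const char *bytes = "gr\xc3\xbc\xc3\x9f";   /* "grüß" encoded as UTF-8 */
    wchar_t wbuf[64];
    if (mbstowcs(wbuf, bytes, 64) == (size_t)-1) {
        perror("mbstowcs");
        return 1;
    }

    /* ... work on characters internally ... */
    wprintf(L"%zu characters\n", wcslen(wbuf));
    return 0;
}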

> iconv_open(utf16le, ascii) in encode
> iconv_open(ascii, utf16le) in decode
> iconv_open(codeset, ascii) in show

That would be correct ONLY if you store the file-name internally as
ASCII, which would prevent you from handling file-names containing any
character outside the ASCII codeset.
You should use "UTF-8" instead of "ascii", as that allows you to handle
file-names containing any valid character.
It would also make the conversion in show() trivial when codeset="UTF-8",
as iconv() would then have nothing to do; see the sketch below.
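
In other words (an untested sketch; the UTF-16LE byte order is taken
from your quoted proposal, error handling omitted):

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>

int main(void)
{
    setlocale(LC_CTYPE, "");

    /* encode: internal UTF-8 name -> UTF-16LE for the on-disk header */
    iconv_t enc  = iconv_open("UTF-16LE", "UTF-8");
    /* decode: UTF-16LE from the header -> internal UTF-8 name */
    iconv_t dec  = iconv_open("UTF-8", "UTF-16LE");
    /* show: internal UTF-8 -> whatever the user's locale expects;
     * effectively a pass-through when that codeset is already UTF-8 */
    iconv_t show = iconv_open(nl_langinfo(CODESET), "UTF-8");

    iconv_close(enc);
    iconv_close(dec);
    iconv_close(show);
    return 0;
}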

Philipp


