Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorre

tinycc-devel

From:	张博洋
Subject:	Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Date:	Fri, 1 Sep 2017 17:00:55 +0800
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1

Hello,

Thanks for your reply.

My assumptions apply only to wide string literals. For plain string literals, both the original tcc and my patched tcc behave the same way: they copy the bytes of the string as-is. For wide string literals, the original tcc reads each char and casts it to wchar_t, while my patched tcc decodes the chars as UTF-8 sequences.

After some consideration, I realized that the assumption I made was "wide string literals are written in UTF-8, and wchar_t is always UTF-32". That leads to two problems. First, the encoding of a wide string literal in a source file is necessarily the same as the encoding of the source file itself, which might not be UTF-8. This will cause the problems you mentioned. Second, wchar_t is not always UTF-32. It is UTF-16 on Microsoft Windows, so some chars, such as emojis, will be corrupted by value truncation.

Although there are problems, if the second problem is fixed (which is easy), my patched tcc will always perform better than the original tcc. If something breaks with the patch, it also breaks on the original tcc. I have attached a table describing every situation and the corresponding behavior.

The ideal solution is to provide charset options, as you mentioned. After some searching on the Internet, I found that there are 3 command-line options (in GCC) that control character encoding:

-fexec-charset=charset
-fwide-exec-charset=charset
-finput-charset=charset

In order for these features to work correctly, tcc must do two conversions:
(1) convert all plain string literals from input-charset to exec-charset
(2) convert all wide string literals from input-charset to wide-exec-charset

However, providing these features requires an external library such as iconv, and doing so might make the Tiny C Compiler not so tiny.


My questions are:
(1) Is wchar_t either UTF-32 or UTF-16 on all platforms?
(2) Should we provide full charset support using external libraries?


Thanks
Zhang Boyang



On 2017-09-01 11:54, Christian Jullien wrote:

Hello,

I'm not sure you can assume that a character with a code >= 0x80 is part of a UTF-8
sequence. Beyond what is called the "basic character set", which is roughly 7-bit ASCII,
there is the "extended character set", which is implementation-defined.

For example, the euro sign € is part of ISO 8859-15, where it is encoded in
8 bits as 0xA4; see https://en.wikipedia.org/wiki/ISO/IEC_8859-15

Microsoft VC++ has the following flags:

/utf-8 set source and execution character set to UTF-8
/validate-charset[-] validate UTF-8 files for only legal characters

These control how source code encoding is handled.

gcc (more specifically cpp, the C preprocessor) processes source files as
UTF-8 by default but, like VC++, has a flag to control the input charset:

        -finput-charset=charset
            Set the input character set, used for translation from the
            character set of the input file to the source character set used by
            GCC.  If the locale does not specify, or GCC cannot get this
            information from the locale, the default is UTF-8.  This can be
            overridden by either the locale or this command-line option.
            Currently the command-line option takes precedence if there's a
            conflict.  charset can be any encoding supported by the system's
            "iconv" library routine.

Now, tcc should be compatible with both. I mean:

- The native Windows tcc port should NOT assume characters are UTF-8 encoded, and
the -utf-8 flag should change this behavior (+ -finput-charset=xxx for gcc
compatibility)
- Other ports (I mean Linux & alt.) should assume characters are UTF-8 encoded,
and the -finput-charset=xxx flag should change this behavior (+ -utf-8 for VC++
compatibility)

To summarize, we should add both -utf-8 and -finput-charset=xxx support and
set the default behavior based on the native port.

Wdyt?

Christian


-----Original Message-----
From: Tinycc-devel [mailto:address@hidden] On Behalf Of 张博洋
Sent: Wednesday, 30 August 2017 09:31
To: address@hidden
Subject: [Tinycc-devel] BUG: wide char in wide string literal handled 
incorrectly

Hello,

    I found that when TCC processes a wide string literal, it behaves as if it
directly casts each char in the source file to wchar_t and stores the results in
the wide string. This works for ASCII chars. However, it does not work for real
wide chars. For example:
    The Euro sign (€, U+20AC) is stored in UTF-8 as "E2 82 AC". In GCC, this char is
stored in a wide string as "000020AC". However, in TCC, it is stored as 3 wide chars,
"000000E2 00000082 000000AC".
    I have attached a patch, a test program and two screenshots that describe this
problem. I solve the problem by assuming that the input charset is UTF-8. Although
it's not a perfect solution, it's still better than "directly casting char to
wchar_t". I'm not sure whether that is appropriate, so please review the
code carefully.

Thanks
Zhang Boyang


_______________________________________________
Tinycc-devel mailing list
address@hidden
https://lists.nongnu.org/mailman/listinfo/tinycc-devel

behavior-table.png
Description: PNG image
