tinycc-devel

From: Christian Jullien
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled incorrectly
Date: Sun, 3 Sep 2017 07:50:45 +0200

Managing UTF-8 (and Unicode) correctly on all platforms is a nightmare. I did 
it only partially for my Lisp.
It's hard to say whether your code is correct or not, but I have the impression 
it is not, since you use neither MB_LEN_MAX nor MB_CUR_MAX. Hence you don't 
handle all possible multibyte character lengths.
There is a system-dependent constant named MB_LEN_MAX that tells you the 
maximum number of bytes in a multibyte character (see for example 
http://man7.org/linux/man-pages/man3/MB_LEN_MAX.3.html).
As you can read there, it must be used together with MB_CUR_MAX, a 
locale-dependent value. With the "most common" locales you can live with 5 to 
6 bytes, but I'm discovering that MB_LEN_MAX is now 16 on Linux!

From Linux <limits.h>:

/* Maximum length of any multibyte character in any locale.
   We define this value here since the gcc header does not define
   the correct value.  */
#define MB_LEN_MAX      16

From VC++ 14:

#define MB_LEN_MAX    5             // max. # bytes in multibyte char

The ISO C standard defines two macros that provide this information. 
Macro: int MB_LEN_MAX
MB_LEN_MAX specifies the maximum number of bytes in the multibyte sequence for 
a single character in any of the supported locales. It is a compile-time 
constant and is defined in limits.h.  
Macro: int MB_CUR_MAX
MB_CUR_MAX expands into a positive integer expression that is the maximum 
number of bytes in a multibyte character in the current locale. The value is 
never greater than MB_LEN_MAX. Unlike MB_LEN_MAX this macro need not be a 
compile-time constant, and in the GNU C Library it is not. 
 
MB_CUR_MAX is defined in stdlib.h.
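
For illustration, here is a minimal sketch (mine, not from the patch; the
string literal and the messages are only examples) showing how MB_CUR_MAX is
typically used together with the standard mbrtowc() routine to decode one
multibyte character in the current locale:

#include <limits.h>     /* MB_LEN_MAX */
#include <locale.h>     /* setlocale */
#include <stdio.h>
#include <stdlib.h>     /* MB_CUR_MAX */
#include <string.h>     /* memset */
#include <wchar.h>      /* mbrtowc, mbstate_t */

int main(void)
{
        const char *s = "\xE2\x82\xAC";  /* U+20AC (euro sign) in UTF-8 */
        wchar_t wc;
        mbstate_t st;
        size_t n;

        /* MB_CUR_MAX depends on the locale, so take it from the environment. */
        setlocale(LC_ALL, "");
        memset(&st, 0, sizeof(st));

        /* Never read more than MB_CUR_MAX bytes for a single character. */
        n = mbrtowc(&wc, s, MB_CUR_MAX, &st);

        if (n == (size_t)-1 || n == (size_t)-2)
                printf("invalid or incomplete multibyte sequence\n");
        else
                printf("consumed %zu bytes -> wchar_t 0x%04lX (MB_LEN_MAX = %d)\n",
                       n, (unsigned long)wc, MB_LEN_MAX);

        return 0;
}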



If it helps, you can adapt and use the following:

/*
 * Returns the number of bytes needed to store the multibyte (MB)
 * character whose lead byte is c.
 */

#define OLMBCLEN_USES_TABLE

#if     defined( OLMBCLEN_USES_TABLE )
static const unsigned char olbytesForUTF8[256] = {
        /* ASCII 7bit char         -> 0xxxxxxx */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 00 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 10 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 20 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 30 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 40 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 50 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 60 */
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, /* 70 */
        /* invalid UTF-8 char      -> 10xxxxxx */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 80 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 90 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* A0 */
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* B0 */
        /* (c & 0xE0) == 0xC0      -> 110xxxxx */
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* C0 */
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, /* D0 */
        /* (c & 0xF0) == 0xE0      -> 1110xxxx */
        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, /* E0 */
        /* (c & 0xF8) == 0xF0      -> 11110xxx */
#if     (OLMB_LEN_MAX == 4)
        4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0  /* F0 */
#else
        4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0  /* F0 */
#endif
};

size_t
olmbclen( int c )
{
        return( (size_t)olbytesForUTF8[ c & 0xFF ] );
}

#else
size_t
olmbclen( int c )
{
        if ((c & 0x80) == 0x00) {
                return( 1 );
        } else  if( (c & 0xE0) == 0xC0) {
                return( 2 );
        } else  if( (c & 0xF0) == 0xE0 ) {
                return( 3 );
        } else  if( (c & 0xF8) == 0xF0) {
                return( 4 );
#if     (OLMB_LEN_MAX > 4)
        } else  if ((c & 0xFC) == 0xF8) {
                return( 5 );
#endif
#if     (OLMB_LEN_MAX > 5)
        } else  if ((c & 0xFE) == 0xFC) {
                return( 6 );
#endif
        }

        return( 0 );
}
#endif
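
For completeness, a tiny, hypothetical usage sketch (it is not part of my
Lisp sources): walking a UTF-8 string with olmbclen(). It assumes complete
sequences and counts an invalid lead byte as a single bad byte so that the
loop always terminates:

/*
 * Hypothetical caller: counts the characters of a UTF-8 string by
 * skipping olmbclen() bytes per lead byte; an invalid lead byte
 * (olmbclen() == 0) is counted as one bad byte.
 */
static size_t
olmbstrlen( const char *s )
{
        size_t chars = 0;

        while (*s != '\0') {
                size_t len = olmbclen( (unsigned char)*s );

                s += ((len != 0) ? len : 1);
                ++chars;
        }

        return( chars );
}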

-----Original Message-----
From: Tinycc-devel [mailto:address@hidden] On Behalf Of Zhang Boyang
Sent: Saturday, September 2, 2017 19:12
To: address@hidden
Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled 
incorrectly

Hello,

Here is the new patch, which fixes the UTF-16 truncation problem on Windows.

Zhang Boyang



On 2017-09-01 19:50, Christian JULLIEN wrote:
> Given the platforms tcc supports, I think you can assume wchar_t uses 2 bytes 
> on Windows and 4 bytes on all other platforms (I'm not totally sure, but I 
> think you can force wchar_t to be 2 bytes on macOS).
> I've never heard of any other implementation of wchar_t (I don't recall how 
> z/OS encodes wchar_t, but I doubt someone will port tcc to this system, which 
> still uses EBCDIC natively).
> 
> 
>   Date: September 1, 2017 at 11:02 (GMT +02:00)
>   From: "张博洋" <address@hidden>
>   To: "address@hidden" <address@hidden>
>   Subject: Re: [Tinycc-devel] BUG: wide char in wide string literal handled
>   incorrectly
> 
> 
> Hello,
> 
> Thanks for your reply.
> 
> My assumptions are only applicable to wide string literals. The behavior 
> for plain string literals in both the original tcc and my patched tcc is 
> "copy the bytes of the plain string as is". For wide strings, the original 
> tcc "reads each char and casts it to wchar_t", while my patched tcc "decodes 
> them as UTF-8 sequences".
> 
> After some consideration, I found the assumption I made was "wide string 
> literals are written in UTF-8, and wchar_t is always UTF-32". That leads to 
> two problems. First, the encoding of a wide string in a source file is by 
> definition the same as the encoding of that source file, which might not be 
> UTF-8. This will cause problems, as you mentioned. Second, wchar_t is not 
> always UTF-32. It's UTF-16 on Microsoft Windows. Some chars, like emojis, 
> will get corrupted because of value truncation.
> 
> Although there are problems, if the second problem gets fixed (which is 
> easy), my patched tcc will always perform better than the original tcc. If 
> something breaks, it will also break on the original tcc. I provided a table 
> in the attachments describing every situation and the corresponding behavior.
> 
> The ideal solution is to provide charset options, as you mentioned. After 
> doing some searching on the internet, I found that there are 3 command-line 
> options that control char encoding:
> -fexec-charset=charset
> -fwide-exec-charset=charset
> -finput-charset=charset
> In order to make these features work correctly, tcc must do two conversions:
> (1) convert all plain string literals from input-charset to exec-charset
> (2) convert all wide string literals from input-charset to wide-exec-charset
> However, providing these features requires external libraries like iconv, 
> and doing this might make the Tiny C Compiler not tiny.
> 
> My questions are:
> (1) Is wchar_t either UTF-32 or UTF-16 on all platforms?
> (2) Should we provide full charset support using external libraries?
> 
> 
> Thanks
> Zhang Boyang
> 
> 
> 
> On 2017-09-01 11:54, Christian Jullien wrote:
> > Hello,
> >
> > I'm not sure you can assume that a character having a code >= 0x80 is 
> > part of UTF-8. Beyond what is called the "basic character set", which is 
> > globally the 7-bit ASCII set, there is the "extended character set", 
> > which is implementation defined.
> >
> > For example, the euro sign (EUR) may be part of ISO 8859-15 and be 
> > perfectly well encoded on 8 bits as 0xA4, see 
> > https://en.wikipedia.org/wiki/ISO/IEC_8859-15
> >
> > Microsoft VC++ has the following flags:
> >
> > /utf-8                set source and execution character set to UTF-8
> > /validate-charset[-]  validate UTF-8 files for only legal characters
> >
> > That controls how source code is encoded.
> >
> > gcc (more specifically cpp, the C preprocessor) processes source files 
> > using UTF-8 but, like VC++, has a flag to control the input charset:
> >
> >         -finput-charset=charset
> >             Set the input character set, used for translation from the
> >             character set of the input file to the source character set
> >             used by GCC.  If the locale does not specify, or GCC cannot
> >             get this information from the locale, the default is UTF-8.
> >             This can be overridden by either the locale or this
> >             command-line option.  Currently the command-line option takes
> >             precedence if there's a conflict.  charset can be any encoding
> >             supported by the system's "iconv" library routine.
> >
> > Now, tcc should be compatible with both. I mean:
> >
> > - The native Windows tcc port should NOT assume characters are UTF-8 
> >   encoded, and a -utf-8 flag should change this behavior (+ 
> >   -finput-charset=xxx for gcc compatibility)
> > - Other ports (I mean Linux & alt.) should assume characters are UTF-8 
> >   encoded, and a -finput-charset=xxx flag should change this behavior 
> >   (+ -utf-8 for VC++ compatibility)
> >
> > To summarize, we should add both -utf-8 and -finput-charset=xxx support 
> > and set the default behavior based on the native port.
> >
> > Wdyt?
> >
> > Christian
> >
> >
> > -----Original Message-----
> > From: Tinycc-devel [mailto:address@hidden] On Behalf Of Zhang Boyang
> > Sent: Wednesday, August 30, 2017 09:31
> > To: address@hidden
> > Subject: [Tinycc-devel] BUG: wide char in wide string literal handled 
> > incorrectly
> >
> > Hello,
> >
> >     I found that when TCC processes a wide string literal, it behaves 
> > like directly casting each char in the original file to wchar_t and storing 
> > them in the wide string. This will work for ASCII chars. However, it might 
> > not work for real wide chars. For example:
> >     The Euro sign (EUR, U+20AC) stored in UTF-8 is "E2 82 AC". In GCC, this 
> > char stored in a wide string will be "000020AC". However, in TCC, this char 
> > is stored as 3 wide chars "000000E2 00000082 000000AC".
> >     I provided a patch, a test program and two screenshots that describe 
> > this problem; they are in the attachments. I solve this problem by assuming 
> > that the input charset is UTF-8. Although it's not a perfect solution, it's 
> > still better than "directly casting char to wchar_t". I'm wondering if that 
> > is appropriate, so please review the code carefully.
> >
> > Thanks
> > Zhang Boyang
> >
> >
> > _______________________________________________
> > Tinycc-devel mailing list
> > address@hidden
> > https://lists.nongnu.org/mailman/listinfo/tinycc-devel
> >
> 
> _______________________________________________
> Tinycc-devel mailing list
> address@hidden
> https://lists.nongnu.org/mailman/listinfo/tinycc-devel
> 
> 
> 
> 

--
张博洋 (Zhang Boyang) - Fudan University, Computer Science and Technology, Class of 2014
My mobile: 18600020982
My personal website: http://www.zbyzbyzby.com



