dotgnu-general

Re: [DotGNU]Problems with UTF-8


From: Miroslaw Dobrzanski-Neumann
Subject: Re: [DotGNU]Problems with UTF-8
Date: Mon, 24 Nov 2003 16:16:27 +0100
User-agent: Mutt/1.4i

On Mon, Nov 24, 2003 at 12:43:46PM -0200, brunoacf wrote:
> Hi,
> 
> I am having problems compiling code
> that contains UTF-8 string sequences.
> 
> Here is an example:
> 
> class utftest
> {
>     public static void Main ( ) {
>         System.Console.WriteLine ("Protégé");
>     }
> }
> 
> $ cscc -o utftest utftest.cs
> utftest.cs:4: warning: invalid UTF-8 sequence
> utftest.cs:4: warning: invalid UTF-8 sequence
> 
> $ ilrun utftest
> Protg
> 
> The code compiles, but the method
> WriteLine does not show the accented
> characters.
> 
> I'm using pnet 0.6.0 and pnetlib
> 0.6.0. My system is Slackware 9.1.
> 
> I think this may be a bug, so I'm
> posting to the developers mailing list.

The original mail uses the latin1 charset. I got the same warning messages when
utftest.cs was encoded with latin1 instead of utf-8.
Are you sure you used utf-8 encoding?

Try the following:
$ od -c utftest.cs

If the file is utf-8 encoded, you get:
0000000   c   l   a   s   s       u   t   f   t   e   s   t  \n   {  \n
0000020  \t   p   u   b   l   i   c       s   t   a   t   i   c       v
0000040   o   i   d       M   a   i   n       (       )       {  \n  \t
0000060  \t   S   y   s   t   e   m   .   C   o   n   s   o   l   e   .
0000100   W   r   i   t   e   L   i   n   e       (   "   P   r   o   t
0000120 303 251   g 303 251   "   )   ;  \n  \t   }  \n   }  \n

If the file is latin1 encoded, you get:
0000000   c   l   a   s   s       u   t   f   t   e   s   t  \n   {  \n
0000020  \t   p   u   b   l   i   c       s   t   a   t   i   c       v
0000040   o   i   d       M   a   i   n       (       )       {  \n  \t
0000060  \t   S   y   s   t   e   m   .   C   o   n   s   o   l   e   .
0000100   W   r   i   t   e   L   i   n   e       (   "   P   r   o   t
0000120 351   g 351   "   )   ;  \n  \t   }  \n   }  \n
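
The difference in the last line of the two dumps can be reproduced without
the file. This is only a small sketch in Python (not part of pnet; the helper
name octal_dump is made up here) that encodes the test string both ways and
prints non-ASCII bytes in octal, the way od -c does:

```python
# The test string from utftest.cs, written with escapes so that this
# snippet does not depend on its own file encoding.
text = "Prot\u00e9g\u00e9"

def octal_dump(data: bytes) -> str:
    # Hypothetical helper: ASCII bytes are shown as characters,
    # all other bytes as three-digit octal numbers.
    return " ".join(chr(b) if b < 0x80 else f"{b:03o}" for b in data)

print(octal_dump(text.encode("utf-8")))    # P r o t 303 251 g 303 251
print(octal_dump(text.encode("latin-1")))  # P r o t 351 g 351
```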

All numbers are octal.
Latin1: 351 = 11101001 matches the rule 1110xxxx, so it
        must be followed by two bytes 10xxxxxx, and it is not
        (the next byte is g = 147 = 01100111).
Utf-8:  303 = 11000011 matches the rule 110xxxxx, so it
        must be followed by one byte 10xxxxxx.
        The following 251 = 10101001 matches the rule 10xxxxxx.
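
The lead byte / continuation byte rule can be checked mechanically. The
following is a rough sketch in Python, not the check cscc actually performs
(I have not looked at that code). The function name first_invalid_offset is
invented for this example, and it tests only the bit patterns, not overlong
forms or surrogates:

```python
def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first byte that breaks the lead/continuation
    rule, or -1 if every sequence has the right shape."""
    i = 0
    while i < len(data):
        b = data[i]
        if (b & 0b10000000) == 0:             # 0xxxxxxx: plain ASCII
            need = 0
        elif (b & 0b11100000) == 0b11000000:  # 110xxxxx: one continuation byte
            need = 1
        elif (b & 0b11110000) == 0b11100000:  # 1110xxxx: two continuation bytes
            need = 2
        elif (b & 0b11111000) == 0b11110000:  # 11110xxx: three continuation bytes
            need = 3
        else:
            return i                          # stray 10xxxxxx or invalid lead byte
        tail = data[i + 1 : i + 1 + need]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) < need or any((t & 0b11000000) != 0b10000000 for t in tail):
            return i
        i += 1 + need
    return -1

# The bytes after "Prot" from the two dumps, written in octal as od shows them.
latin1 = bytes([0o351, ord("g"), 0o351])
utf8 = bytes([0o303, 0o251, ord("g"), 0o303, 0o251])
print(first_invalid_offset(latin1))  # 0
print(first_invalid_offset(utf8))    # -1
```

For the latin1 bytes the check fails at the very first byte, because 351
announces two continuation bytes and is followed by a plain g.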
-- 
Mirosław Dobrzański-Neumann
E-mail: address@hidden

This message is utf-8 encoded

