bug-gawk

Re: [bug-gawk] GAWK for Windows does not work properly with UTF-8


From: Eli Zaretskii
Subject: Re: [bug-gawk] GAWK for Windows does not work properly with UTF-8
Date: Wed, 17 Feb 2016 21:52:46 +0200

> Date: Tue, 16 Feb 2016 23:10:31 +0100
> From: Marc de Bourget <address@hidden>
> 
> Hello Eli,

(Let's not make this a private conversation; please keep the list on
the CC.)

> I have another little question please: Issue with UTF-8 encoded source file:
> BEGIN {
> test = "Céline"
> if (test ~ /[àá]/)
> print "found"
> else 
> print "not found"
> # => Wrong result: found
> }
> 
> It seems multibyte characters can't be used in character classes correctly?
> I am trying to understand, why it doesn't work.

It doesn't work because the Windows library functions that Gawk uses
to support non-ASCII characters interpret the bytes in your program
as if they were encoded in the current locale's codeset, which I'm
guessing is some Windows codepage.  These functions don't know that
you are actually feeding them UTF-8 multibyte sequences, so they match
the individual bytes as characters of that codepage.  The UTF-8
encodings of all 3 letters you used (é, à, and á) begin with the byte
0xC3, so the bracket expression matches that byte in "Céline" and Gawk
produces a false match.
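
To make the byte-level picture concrete, here is a small sketch in Python (not Gawk, and not the Windows library code itself).  It assumes a single-byte codepage, which it imitates by compiling the bracket expression as a bytes pattern, so that every byte counts as one character.  The variable names are made up for this illustration.

```python
import re

# UTF-8 bytes of the test string from the awk program.
subject = "Céline".encode("utf-8")

# Show the encoding of each accented letter; all three start with c3.
for ch in "éàá":
    print(ch, ch.encode("utf-8").hex())

# A bytes pattern makes Python's re treat every byte as one character,
# which imitates reading the UTF-8 source as a single-byte codepage.
# The class [àá] then contains the bytes c3, a0, c3 and a1.
bytewise_class = re.compile("[àá]".encode("utf-8"))
false_match = bytewise_class.search(subject)
print("byte-wise:", "found" if false_match else "not found")

# The same class applied to decoded characters does not match.
charwise_class = re.compile("[àá]")
true_result = charwise_class.search("Céline")
print("char-wise:", "found" if true_result else "not found")
```

The byte-wise search reports a match on the lone 0xC3 byte, which is the first half of "é", while the character-wise search finds nothing.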

> Can I solve this otherwise?
> I have to use a lot character classes with accents.

The only way to support UTF-8 encoded characters is to write your own
regexp matching code.
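
As a rough idea of what such hand-written matching could look like, here is a minimal sketch, again in Python rather than awk.  The helper names `utf8_chars` and `contains_any` are hypothetical, and the code assumes the input is valid UTF-8.  It splits the bytes into whole characters by looking at each lead byte, then compares whole characters instead of single bytes.

```python
def utf8_chars(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character.

    Assumes valid UTF-8; only the lead byte is inspected.
    """
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length = 1          # ASCII
        elif lead >> 5 == 0b110:
            length = 2          # lead byte 110xxxxx
        elif lead >> 4 == 0b1110:
            length = 3          # lead byte 1110xxxx
        elif lead >> 3 == 0b11110:
            length = 4          # lead byte 11110xxx
        else:
            raise ValueError("unexpected byte 0x%02x at offset %d" % (lead, i))
        chars.append(data[i:i + length])
        i += length
    return chars


def contains_any(data: bytes, wanted: bytes) -> bool:
    """Return True if any whole character of `wanted` occurs in `data`."""
    wanted_set = set(utf8_chars(wanted))
    return any(ch in wanted_set for ch in utf8_chars(data))


accents = "àá".encode("utf-8")
print(contains_any("Céline".encode("utf-8"), accents))
print(contains_any("voilà".encode("utf-8"), accents))
```

Because "é" is compared as the two-byte unit C3 A9, it no longer matches "à" (C3 A0) or "á" (C3 A1) just because they share the first byte.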