bug-gnu-utils

Re: Bug report, incorrect handling of regular expression with range


From: Bob Proulx
Subject: Re: Bug report, incorrect handling of regular expression with range
Date: Thu, 18 Jun 2009 17:43:27 -0600
User-agent: Mutt/1.5.18 (2008-05-17)

Tomasz Żok wrote:
> I wanted to achieve a simple thing - count and print how many lowercase
> letters there are in each line. My first approach was this:
>  {
>   print gsub(/[a-z]/, "x")
> }
> But unfortunately it does not work.

Unfortunately [a-z] is a locale-dependent range.  In the C/POSIX
locale it does mean exactly the letters a through z.  But in most of
the locales that vendors automatically set up for users
(e.g. en_US.UTF-8) the range follows the locale-specific collating
sequence.  That collating sequence has been modified to be
aAbBcC...zZ, which means that [a-z] matches every lower case letter
and every upper case letter except Z.
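
A minimal sketch of the difference, assuming a POSIX shell and any
awk on the PATH.  The sample text and the variable names are only
illustrative.  The first count pins the C locale for that one
command; the second uses whatever locale is in your environment, so
its value depends on your system.

```shell
# Count matches of [a-z] with the range pinned to the C locale.
count_c=$(printf 'Hello World\n' | LC_ALL=C awk '{ print gsub(/[a-z]/, "x") }')

# Same program under the caller's locale; the result may differ.
count_env=$(printf 'Hello World\n' | awk '{ print gsub(/[a-z]/, "x") }')

echo "C locale: $count_c"
echo "environment locale: $count_env"
```

In the C locale the count is 8, since only the lower case letters
are replaced.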

> This AWK script prints both lowercase
> and uppercase letters' count. If I use:
>  {
>   print gsub(/[[:lower:]]/, "x")
> }

That is the correct way to specify lower case letters in a locale
independent way.  Alternatively you can force a locale for collation
using LC_COLLATE or LC_ALL.  In scripts it is now rather typical to
see LC_ALL=C being set to force standard behavior regardless of a
user's locale.
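
As a sketch of that scripting habit, assuming a POSIX shell and an
awk that understands POSIX character classes (gawk does), a script
can export LC_ALL=C once at the top.  The variable names here are
made up for the example.

```shell
#!/bin/sh
# Force C/POSIX behavior for every command this script runs.
LC_ALL=C
export LC_ALL

# With the C locale in force the class and the range agree.
lower_count=$(printf 'Hello World\n' | awk '{ print gsub(/[[:lower:]]/, "x") }')
range_count=$(printf 'Hello World\n' | awk '{ print gsub(/[a-z]/, "x") }')

echo "class: $lower_count range: $range_count"
```

The character class is still the better habit, since it does not
rely on the script having set the locale first.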

> So my guess is that an error is somewhere inside the range modifier of a
> regular expression. Because the interval [a-z] is contiguous in terms of
> ASCII codes, there's no way the uppercase letters "incidentally" got treated
> as part of [a-z]

Unfortunately this is an intentional change by the powers that be.  It
is a change to the libc collation sequence and affects the whole
system, not just awk.  Awk, grep, sed, and the shell (e.g. doing file
glob expansion 'echo *') are all affected, because the behavior comes
from libc and not from any of the utilities that use it.
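
The same range can be tried in grep and sed.  This is only a sketch,
assuming a POSIX shell, with the C locale forced per command so the
result is predictable.  The input strings and variable names are
invented for the illustration.

```shell
# grep -c counts the lines that contain a match for [a-z].
matching_lines=$(printf 'a\nB\nc\n' | LC_ALL=C grep -c '[a-z]')

# sed replaces each character that falls inside the range.
replaced=$(printf 'aBc\n' | LC_ALL=C sed 's/[a-z]/x/g')

echo "grep: $matching_lines sed: $replaced"
```

With LC_ALL=C the upper case B is left alone in both cases.  Drop
the LC_ALL=C prefix to see what your own locale does.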

> - /[a-z]/ matches incorrectly
> - /[[:lower:]]/ or /[qwertyuiopasdfghjklzxcvbnm]/ matches correctly
> - test instance:

Correct.

I personally set LANG=en_US.UTF-8 in order to get Unicode support
(such as to handle the Ż in your name) but then set LC_COLLATE=C in
order to get a standard sort ordering.  However that isn't a
completely general solution.  I have no idea how that would work
with a Chinese charset, for instance.  YMMV.
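
For reference, a sketch of that setup as it might appear in a shell
startup file, assuming a POSIX shell.  LC_ALL is unset first because
it would otherwise override LC_COLLATE.  The sort call and the
variable name are only there to show the effect.

```shell
# LC_ALL overrides every other LC_* variable, so clear it first.
unset LC_ALL

# UTF-8 for character handling, plain C ordering for collation.
LANG=en_US.UTF-8
LC_COLLATE=C
export LANG LC_COLLATE

# With C collation, upper case sorts ahead of lower case.
sorted=$(printf 'b\nA\na\nB\n' | sort | paste -s -d ' ' -)
echo "$sorted"
```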

Bob



