Re: Bug report, incorrect handling of regular expression with range
From: Bob Proulx
Subject: Re: Bug report, incorrect handling of regular expression with range
Date: Thu, 18 Jun 2009 17:43:27 -0600
User-agent: Mutt/1.5.18 (2008-05-17)
Tomasz Żok wrote:
> I wanted to achieve a simple thing - count and print how many lowercase
> letters there are in each line. My first approach was this:
> {
> print gsub(/[a-z]/, "x")
> }
> But unfortunately it does not work.
Unfortunately [a-z] is a locale-dependent range. In the C/POSIX
locale it does mean all of the letters a-z. But in most of the
locales that vendors automatically set up for users (e.g. en_US.UTF-8)
the range is affected by the locale-specific collating sequence. There
the collating sequence is aAbBcC...zZ, which means that [a-z] matches
every lower case letter and every upper case letter except Z.
> This AWK script prints both lowercase
> and uppercase letters' count. If I use:
> {
> print gsub(/[[:lower:]]/, "x")
> }
That is the correct, locale-independent way to specify lower case
letters. Alternatively you can force a locale for collation using
LC_COLLATE or LC_ALL. In scripts it is now rather typical to see
LC_ALL=C being set to force standard behavior regardless of a
user's locale.
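
A minimal sketch of both options, assuming a POSIX shell. The function
names count_lower_class and count_lower_c are made up for illustration,
and some older awk implementations may not understand character classes
such as [[:lower:]]:

```shell
# Hypothetical helper: count lowercase letters per line using the POSIX
# character class, which follows the current locale's notion of "lowercase".
count_lower_class() {
    awk '{ print gsub(/[[:lower:]]/, "x") }'
}

# Hypothetical helper: same count, but with the locale forced to C for this
# one awk invocation, so [a-z] means exactly the 26 ASCII lowercase letters.
count_lower_c() {
    LC_ALL=C awk '{ print gsub(/[a-z]/, "x") }'
}

printf 'Hello World\n' | count_lower_class
printf 'Hello World\n' | count_lower_c
```

With LC_ALL=C the second call prints 8 for "Hello World". Setting the
variable on the command line like this affects only that one awk process
and leaves the rest of the script in the user's locale.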
> So my guess is that an error is somewhere inside the range modifier of a
> regular expression. Because the interval [a-z] is consistent in terms of
> ASCII codes, there's no way the uppercase letters "incidentally" got treated
> as part of [a-z]
Unfortunately this is an intentional change by the powers that be. It
is a change to the libc collation sequence and affects the whole
system, not just awk. awk, grep, sed, and the shell (e.g. file glob
expansion in 'echo *') are all affected. The behavior is part of libc
and not part of any of the utilities that use it.
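
Since the same range semantics apply to the other tools, the same
workaround should apply too. A small sketch, assuming a POSIX shell with
grep and sed; the helper names lower_start and mask_lower are
hypothetical:

```shell
# Hypothetical helper: keep only lines that begin with an ASCII lowercase
# letter. LC_ALL=C pins [a-z] to the ASCII range for this grep only.
lower_start() {
    LC_ALL=C grep '^[a-z]'
}

# Hypothetical helper: replace every ASCII lowercase letter with "x".
mask_lower() {
    LC_ALL=C sed 's/[a-z]/x/g'
}

printf 'apple\nBanana\ncherry\n' | lower_start
printf 'Hello World\n' | mask_lower
```

In the C locale the first command prints "apple" and "cherry", and the
second prints "Hxxxx Wxxxx".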
> - /[a-z]/ matches incorrectly
> - /[[:lower:]]/ or /[qwertyuiopasdfghjklzxcvbnm]/ matches correctly
> - test instance:
Correct.
I personally set LANG=en_US.UTF-8 in order to get Unicode support
(such as to handle the Ż in your name) but then set LC_COLLATE=C in
order to get a standard sort ordering. However that isn't a
completely general solution. I have no idea how that would work
within a Chinese charset for instance. YMMV.
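
As a rough illustration of that setup, assuming a POSIX shell and that
the en_US.UTF-8 locale is installed. The function name sort_c_collate is
made up, and LC_ALL is unset first because it would otherwise override
LC_COLLATE:

```shell
# Hypothetical helper: sort stdin with a UTF-8 locale for everything
# except collation, which is pinned to C (plain byte order).
# The body runs in a subshell so the caller's environment is untouched.
sort_c_collate() (
    unset LC_ALL          # LC_ALL takes precedence over LC_COLLATE
    LANG=en_US.UTF-8      # assumption: this locale exists on the system
    LC_COLLATE=C
    export LANG LC_COLLATE
    sort
)

printf 'b\nA\na\nB\n' | sort_c_collate
```

With collation in C the output order is A, B, a, b, i.e. all upper case
letters sort before all lower case ones.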
Bob