Patch to fix /[A-Z]/ and internationalization bug
From: Sam Trenholme
Subject: Patch to fix /[A-Z]/ and internationalization bug
Date: Thu, 2 Nov 2006 18:46:59 -0600 (CST)
Arnold (or whoever),
I am willing to sign over this patch that I feel is
the best solution to the problems with /[A-Z]/ and
case sensitivity in non-C locales.
I am also willing to make changes to this patch so we
can make it a part of Awk.
- Sam
This patch makes regular expression ranges, such as /[A-Z]/,
keep working when internationalization is enabled. The problem is
this: Many international locales interleave uppercase and lowercase
letters in their collating order. While this results in a more
sensible "ls" output, it also breaks scripts that assume /[A-Z]/
matches only upper case and /[a-z]/ matches only lower case.
The way the patch works around this issue is to use traditional
ASCII ordering of characters if both characters in a range are
ASCII characters. If either of the characters in a range is
not ASCII, such as /[Á-Z]/ (the first letter in this range
is an A with an acute accent), this code will use the wcscoll()
routine to determine the range.
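The rule above can be sketched as a small standalone function. This is only an illustration of the idea, not the code in dfa.c: the name in_range() and its signature are made up for this example, and the real patch works on the wcbuf[] array directly, as shown in the diff below.

```c
#include <wchar.h>

/* Hypothetical helper (not part of dfa.c): returns nonzero if ch lies
 * in the range [lo, hi] under the rule described above. */
int
in_range (wchar_t ch, wchar_t lo, wchar_t hi)
{
  /* wcscoll() compares strings, so each character is wrapped in a
   * NUL-terminated one-character string. */
  wchar_t c[2] = { ch, L'\0' };
  wchar_t l[2] = { lo, L'\0' };
  wchar_t h[2] = { hi, L'\0' };

  /* Both endpoints are ASCII: compare code points, ignoring the
   * locale's collating order. */
  if (lo < 128 && hi < 128)
    return ch >= lo && ch <= hi;

  /* At least one endpoint is not ASCII: fall back to the locale's
   * collating order, as the unpatched code always did. */
  return wcscoll (c, l) >= 0 && wcscoll (h, c) >= 0;
}
```

With this sketch, in_range(L'm', L'A', L'Z') is 0 no matter which locale is active, because both endpoints are ASCII and the locale is never consulted.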
Some issues before this can become a part of Gawk:
1) I have to sign the paperwork assigning copyright to the FSF.
For legal reasons, I have to physically sign a paper and give it
to them.
2) This may break on non-ASCII systems (as I recall, Gawk still has
support for non-ASCII systems).
3) Maybe have an environment variable which re-enables the old Gawk
behavior. I'll have to use a static variable so we don't do an
expensive getenv() call every time we look at a character.
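For item 3, the static-variable idea might look roughly like the following. This is a sketch under assumptions: the variable name GAWK_LOCALE_RANGES and the function use_locale_ranges() are invented for this example, and nothing like them exists in the patch below.

```c
#include <stdlib.h>

/* Hypothetical: GAWK_LOCALE_RANGES is a made-up variable name.
 * Returns 1 if the old locale-based range behavior was requested,
 * 0 otherwise. */
int
use_locale_ranges (void)
{
  static int cached = -1;       /* -1 means "not looked up yet" */

  /* getenv() runs only on the first call; every later call just
   * returns the remembered answer. */
  if (cached < 0)
    cached = (getenv ("GAWK_LOCALE_RANGES") != NULL);

  return cached;
}
```

The range check could then test use_locale_ranges() before taking the ASCII shortcut. A consequence of caching is that changing the variable after the first lookup has no effect for the rest of the run.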
- Sam
*** gawk-3.1.5/dfa.c.orig 2005-07-26 13:07:43.000000000 -0500
--- gawk-3.1.5/dfa.c 2006-11-02 15:32:41.000000000 -0600
***************
*** 2638,2646 ****
wcbuf[2] = work_mbc->range_sts[i];
wcbuf[4] = work_mbc->range_ends[i];
! if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
! wcscoll(wcbuf+4, wcbuf) >= 0)
! goto charset_matched;
}
/* match with a character? */
--- 2638,2663 ----
wcbuf[2] = work_mbc->range_sts[i];
wcbuf[4] = work_mbc->range_ends[i];
! /* If both characters are ASCII characters, we use the ASCII
! * ordering of the characters to determine the range. This way,
! * i18n doesn't break regexes like /[A-Z]/ (which is supposed to
! * mean "upper case only", and should never match lower-case) */
! if (wcbuf[2] < 128 && wcbuf[4] < 128)
! {
! if (wcbuf[0] >= wcbuf[2] &&
! wcbuf[4] >= wcbuf[0])
! {
! goto charset_matched;
! }
! }
! else
! {
! if (wcscoll(wcbuf, wcbuf+2) >= 0 &&
! wcscoll(wcbuf+4, wcbuf) >= 0)
! {
! goto charset_matched;
! }
! }
}
/* match with a character? */
*** gawk-3.1.5/doc/gawk.texi.orig 2006-11-02 15:40:43.000000000 -0600
--- gawk-3.1.5/doc/gawk.texi 2006-11-02 16:26:02.000000000 -0600
***************
*** 3830,3876 ****
@section Where You Are Makes A Difference
Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language. The current
! locale setting can affect the way regexp matching works, often
! in surprising ways. In particular, many locales do case-insensitive
! matching, even when you may have specified characters of only
! one particular case.
!
! The following example uses the @code{sub} function, which
! does text replacement
! (@pxref{String Functions}).
! Here, the intent is to remove trailing uppercase characters:
!
! @example
! $ echo something1234abc | gawk '@{ sub("[A-Z]*$", ""); print @}'
! @print{} something1234
! @end example
!
! @noindent
! This output is unexpected, since the @samp{abc} at the end of @samp{something1234abc}
! should not normally match @samp{[A-Z]*}. This result is due to the
! locale setting (and thus you may not see it on your system).
! There are two fixes. The first is to use the POSIX character
! class @samp{[[:upper:]]}, instead of @samp{[A-Z]}.
! The second is to change the locale setting in the environment,
! before running @command{gawk},
! by using the shell statements:
!
! @example
! LANG=C LC_ALL=C
! export LANG LC_ALL
! @end example
!
! The setting @samp{C} forces @command{gawk} to behave in the traditional
! Unix manner, where case distinctions do matter.
! You may wish to put these statements into your shell startup file,
! e.g., @file{$HOME/.profile}.
!
! Similar considerations apply to other ranges. For example,
! @samp{["-/]} is perfectly valid in ASCII, but is not valid in many
! Unicode locales, such as @samp{en_US.UTF-8}. (In general, such
! ranges should be avoided; either list the characters individually,
! or use a POSIX character class such as @samp{[[:punct:]]}.)
For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
For other single byte record separators, using @samp{LC_ALL=C} will give you
--- 3830,3858 ----
@section Where You Are Makes A Difference
Modern systems support the notion of @dfn{locales}: a way to tell
! the system about the local character set and language. In particular,
! many locales do case-insensitive matching, even when you may have
! specified characters of only one particular case.
!
! In order to be compatible with traditional AWK scripts that
! assume an ASCII ordering of letters, if both characters in a
! regular expression range, such as @samp{[A-Z]}, are ASCII, Gawk will
! use ASCII ordering to determine the characters in the range. This, in
! particular, preserves the case sensitivity that
! traditional AWK scripts have utilized.
!
! This behavior is different from the behavior in earlier versions of
! Gawk. In earlier versions of Gawk, the current locale always
! determined what characters to put in a regular expression
! range. This behavior gave surprising results: Previously case-sensitive
! character ranges became case-insensitive, breaking AWK scripts.
!
! One consequence of this change is that @samp{[A-Za-z]} no longer
! matches accented letters in non-English locales. If this behavior
! is needed, use the POSIX character class @samp{[[:alpha:]]}, which
! matches all alphabetic characters. Another option is to use an accented
! character in the regular expression range, which will reinstate
! Gawk's older behavior.
For the normal case of @samp{RS = "\n"}, the locale is largely irrelevant.
For other single byte record separators, using @samp{LC_ALL=C} will give you