From: R. Bijlsma
Subject: i18n: special letter(s?) cause regular expression error in match() and wrong length()
Date: Sat, 03 Jan 2009 18:44:27 +0100
User-agent: Icedove 1.5.0.14eol (X11/20080724)

Dear gawk writers,
The attached script bug_special_letters.awk demonstrates and analyzes a bug
in gawk 3.1.6, namely that the special letter é (='e) is treated badly
by match() and by length().
System: The precompiled package of Ubuntu 8.10, Intrepid.
The script is its own test input and also contains all explanations.
Here is a summary:
In match(), the /./ expression does not match the special letter é (='e),
while it is matched by the ~ operator.
Furthermore, match() may match a line containing an é, but set RLENGTH
to -2.
length() counts up to the first occurrence of é and ignores the rest of
the line.
It seems that the problem is related to the chosen language and
character encoding.
I have LANG=en_US.UTF-8.
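A minimal way to probe the locale dependence, independent of the attached script, might be the following sketch. It assumes a POSIX shell and that é is encoded as the two UTF-8 bytes 0xC3 0xA9. The AWK variable is a hypothetical override introduced only for this example; it is not part of the attached script.

```shell
#!/bin/sh
# Sketch only. AWK is a hypothetical override variable for this example;
# gawk is preferred when it is installed, otherwise plain awk is used.
if [ -z "$AWK" ]; then
    if command -v gawk >/dev/null 2>&1; then AWK=gawk; else AWK=awk; fi
fi

# The two UTF-8 bytes of é are 0xC3 0xA9 (octal 303 251).
# In the C locale every byte counts as one character, so length() reports 2.
c_len=$(printf '\303\251\n' | LC_ALL=C "$AWK" '{ print length($0) }')
echo "LC_ALL=C length: $c_len"

# The same input under the locale the shell currently uses (for example
# en_US.UTF-8). The result depends on the locale and on the awk version.
cur_len=$(printf '\303\251\n' | "$AWK" '{ print length($0) }')
echo "current  length: $cur_len"

# Compare the ~ operator with match() on the same input in the C locale.
# Both should agree, and RLENGTH should equal the byte count.
c_match=$(printf '\303\251\n' | LC_ALL=C "$AWK" '{
    op = ($0 ~ /.+/) ? "~" : "!~"
    fn = match($0, /.+/) ? "~" : "!~"
    print op, fn, RSTART, RLENGTH
}')
echo "LC_ALL=C match:  $c_match"
```

If the "current" length is not 1 while LANG names a UTF-8 locale, or if ~ and match() disagree there, that would be consistent with the behaviour summarized above.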
Best regards,
Rita Bylsma
#
# NAME
# bug_special_letters.awk :: Demonstrates and analyzes a bug in gawk 3.1.6
# (the precompiled package of Ubuntu 8.10, Intrepid), namely that the special
# letter é (='e) is treated badly by match() and by length().
#
# SYNOPSIS
# grep '^# ' bug_special_letters.awk
#     To read the proof as found on my system.
# gawk -f bug_special_letters.awk bug_special_letters.awk
#     To test on your own system.
#
#
# DESCRIPTION
#
# In match(), the /./ expression does not match the special letter é (='e),
# while it is matched by the ~ operator.
# Furthermore, match() may match a line containing an é, but set RLENGTH to -2.
# length() counts up to the first occurrence of é and ignores the rest of the
# line.
#
# Other special letters were not tested.
#
# It seems that the problem is related to the chosen language and character
# encoding.
# I have LANG=en_US.UTF-8.
#
# Below is the output of my test, as found on my system. It contains color
# codes so that the errors are immediately visible when it is read on the
# console, for example with grep.
#
# Note that some errors can be detected generically by this script, but in
# some cases gawk has no way of detecting that it is doing something wrong.
# For an example of such cases the color codes are hardcoded.
#
#
# ________________________________________________________________________
# 'e'
# Length: 1
# !~ /(.+de)(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# ~ /.+/ with both ~ and match(), RSTART=1, RLENGTH=1
# ________________________________________________________________________
# 'é'
# [1;31mLength: 0 (should be: 1)[0m
# !~ /(.+de)(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# [1;31m~ /.+/ with ~
# !~ /.+/ with match(), RSTART=0, RLENGTH=-1
# [0m________________________________________________________________________
# 'Na deze regel één lege regel. After this line one empty line.'
# [1;31mLength: 14 (should be: 61)[0m
# ~ /(.+de)(.+)/ with both ~ and match(), RSTART=1, [1;31mRLENGTH=14[0m
# 1 (.+de): 'Na de', length = Length: 5
# 2 (.+): '[1;31mze regel [0m', length = Length: 9
# ~ /(de[^\"]+)\.(.+)/ with both ~ and match(), RSTART=4, [1;31mRLENGTH=-2[0m
# 1 (de[^\"]+): 'deze regel één lege regel', length = [1;31mLength: 11 (should be: 25)[0m
# 2 (.+): ' After this line one empty line.', length = Length: 32
# ~ /.+/ with both ~ and match(), RSTART=1, [1;31mRLENGTH=14[0m
# ________________________________________________________________________
# ''
# Length: 0
# !~ /(.+de)(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /(de[^\"]+)\.(.+)/ with both ~ and match(), RSTART=0, RLENGTH=-1
# !~ /.+/ with both ~ and match(), RSTART=0, RLENGTH=-1
# ________________________________________________________________________
# 'Voor deze regel een lege regel. Before this line an empty line.'
# Length: 63
# ~ /(.+de)(.+)/ with both ~ and match(), RSTART=1, RLENGTH=63
# 1 (.+de): 'Voor de', length = Length: 7
# 2 (.+): 'ze regel een lege regel. Before this line an empty line.', length = Length: 56
# ~ /(de[^\"]+)\.(.+)/ with both ~ and match(), RSTART=6, RLENGTH=58
# 1 (de[^\"]+): 'deze regel een lege regel', length = Length: 25
# 2 (.+): ' Before this line an empty line.', length = Length: 32
# ~ /.+/ with both ~ and match(), RSTART=1, RLENGTH=63
#
#
#
## Test input for the program. All test lines start with '#: ' (note the space).
#: e
#: é
#: Na deze regel één lege regel. After this line one empty line.
#:
#: Voor deze regel een lege regel. Before this line an empty line.
BEGIN {
    #regex_ar[ "[^\\\"]" ]
    #regex_ar[ "" ]
    regex_ar[ ".+" ]
    regex_ar[ "(.+de)(.+)" ]
    regex_ar[ "(de[^\\\"]+)\\.(.+)" ]
    afkap_example="Na deze regel één lege regel. After this line one empty line."
    afkap_regex="(.+de)(.+)"
    full_regex=".+"
}
function op_error( if_error, string, explanation ) {
    if ( if_error )
        return ( "\033[1;31m" string explanation "\033[0m" )
    else
        return string
}
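## Report length(string). As a cross-check, every character except '"' in a
## copy is replaced by "x" and the length of that copy is compared with it.
function getlength( string,    len, after, replacement, lenrepl ) {
    len=length(string)
    replacement=string
    gsub(/[^\"]/, "x", replacement)
    lenrepl=length(replacement)
    return op_error( lenrepl != len, "Length: " len, " (should be: " lenrepl ")")
}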
/^\#: / {
    example=$0
    sub(/^\#: /, "", example)
    printf "________________________________________________________________________\n"
    printf "'%s'\n", example
    print getlength(example)
    for ( regex in regex_ar ) {
        # Result of the ~ operator.
        if ( example ~ regex )
            opmatchstr="~"
        else
            opmatchstr="!~"
        # Result of match(); the third argument is a gawk extension.
        if ( match( example, regex, matchAr ) )
            funcmatchstr="~"
        else
            funcmatchstr="!~"
        # RLENGTH is wrong if it is negative after a successful match, or in
        # the hardcoded cases that the script cannot detect by itself.
        if ( funcmatchstr == "~" && RLENGTH < 0 ||
             ( ( regex == afkap_regex || regex == full_regex ) && example == afkap_example ) )
            rlength_error=1
        else
            rlength_error=0
        if ( opmatchstr == funcmatchstr )
            printf "%-3s%-20s with both ~ and match(), RSTART=%s, %s\n",
                funcmatchstr, ("/" regex "/"), RSTART,
                op_error( rlength_error, "RLENGTH=" RLENGTH )
        else {
            printf "\033[1;31m"
            printf "%-3s%-20s with ~\n", opmatchstr, ("/" regex "/")
            printf "%-3s%-20s with match(), RSTART=%s, %s\n", funcmatchstr,
                ("/" regex "/"), RSTART, op_error( rlength_error, "RLENGTH=" RLENGTH )
            printf "\033[0m"
        }
        if ( funcmatchstr == "~" ) {
            # Split the regex on "(" to recover the text of each group.
            subar["nr"]=split(regex, subar, "(")
            for (i=2; i<=subar["nr"]; i++) {
                sub(/\).*/, "", subar[i])
                printf "%25d %s: '%s', length = %s\n", i-1, "(" subar[i] ")",
                    op_error( regex == afkap_regex && example == afkap_example && i==3, matchAr[i-1] ),
                    getlength(matchAr[i-1])
            }
        }
    }
}