[Fwd: [^\]] in basic regexes]
From: Wacek Kusnierczyk
Subject: [Fwd: [^\]] in basic regexes]
Date: Sat, 14 Feb 2009 09:56:58 +0100
User-agent: Thunderbird 2.0.0.19 (X11/20090105)
yesterday i posted the below, but it seems not to have gone through the
system. i have just registered, and maybe the post was rejected -- just
in case, i resend it here, with further examples.
-------- Original Message --------
Subject: [^\]] in basic regexes
Date: Fri, 13 Feb 2009 16:10:47 +0100
From: Wacek Kusnierczyk <address@hidden>
To: address@hidden
hello,
i observe a behaviour of grep that i am not sure is correct, possibly
due to my misunderstanding.
i've recently reviewed code written in some language where the intent was
to match a sequence of any number of non-']' characters. the matching
was done with an underlying regex library, and i have tried the pattern
directly with grep.
with grep, the pattern '[^]]' matches one non-] character:
grep '[^]]' <<< '[\]'
# match
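to make the intent concrete, here is a small sketch of the unescaped form with -o, which prints each match on its own line. the variable name line is just for illustration, and printf is used instead of <<< only so that the snippet also runs in a plain sh; the outputs in the comments are what i would expect from GNU grep.

```shell
# input line used here: the three characters [ \ ]
line='[\]'

# one match per non-] character; -o prints each match on its own line
printf '%s\n' "$line" | grep -o '[^]]'
# [
# \

# the starred form takes the whole run of non-] characters in one match
printf '%s\n' "$line" | grep -o '[^]]*'
# [\
```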
however, in that code the pattern was '[^\]]*' (with the idea that the
character ']' is a metacharacter and therefore must be escaped).
according to the docs i know, it is not necessary to escape ']' within a
character class when it's the first character there (as in '[]]'), since
it then is not considered meta; but it shouldn't be harmful. it
happens that this pattern won't do:
grep '[^\]]' <<< '[\]'
# no match
this seems strange; i'd read the pattern as 'one character that is not
]'. clearly, the data has two such characters. alternatively, the
pattern could be read as 'one character that is neither \ nor ]', but
this would require the backslash to be treated as a regular character
(not a meta):
grep '[\]' <<< '[\]'
# match
grep '[^\]' <<< '[\]'
# match
grep '[^\[]' <<< '[\]'
# match
in fact, the third above has one possible match, so the pattern is read
as 'one non-\ non-[' rather than as 'one non-[':
grep -o '[^\[]' <<< '[\]'
# ]
so the 'one non-\ non-]' reading of '[^\]]' is not implausible; then,
there would be one match, but there is none.
it actually appears that the pattern is read as 'one non-\ followed by
one ]':
grep -o '[^\]]' <<< '[]'
# []
that is, the first ] is not escaped (coherently with the case of
'[^\[]') but rather closes the character class, and the second
(unescaped!) ] does not close any class, but is taken literally!
(should this not be an invalid regex, with an unmatched class-closing
bracket?)
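a quick way to probe this reading, as a sketch only: if '[^\]]' really means 'one non-\ followed by a literal ]', then a line matches exactly when it contains a ] whose preceding character is not a backslash. probe is a made-up helper name, and printf replaces <<< only for portability to plain sh; the expected results in the comments assume that reading.

```shell
# hypothetical helper: report whether a line matches the pattern under test
probe() {
    if printf '%s\n' "$1" | grep -q '[^\]]'; then
        printf 'match: %s\n' "$1"
    else
        printf 'no match: %s\n' "$1"
    fi
}

probe 'a]'    # match: a]      (a non-backslash followed by ])
probe '[]'    # match: []      ([ is not a backslash either)
probe '\]'    # no match: \]   (the character before ] is a backslash)
probe 'a'     # no match: a    (a non-] character, but no ] after it)
```

the last case is the telling one: under the 'one character that is not ]' reading, a lone 'a' would have to match.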
i haven't looked at the sources of grep, so these are plain guesses, but
is the behaviour of grep with '[^\]]' correct and intended, or is it a bug?
grep -V
# GNU grep 2.5.3
regards,
wacek
ps. here are some further experiments, which seem to indicate that grep gets
confused with some combinations of [, ], ^, and \.
# [[] should match one opening bracket
grep -o '[[]' <<< '[^\]'
# [
# OK
# []] should match one closing bracket
grep -o '[]]' <<< '[^\]'
# ]
# OK
# [][] should match one bracket
grep -o '[][]' <<< '[^\]'
# [
# ]
# OK
# [[]] should match one bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[]]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
# [\] should match one backslash
grep -o '[\]' <<< '[^\]'
# \
# OK
# [\[] should match one backslash or opening bracket
grep -o '[\[]' <<< '[^\]'
# [
# \
# OK
# [\]] should match one backslash or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters
# [[^] should match one opening bracket or caret
grep -o '[[^]' <<< '[^\]'
# [
# ^
# OK
# [[^\] should match one opening bracket, caret, or backslash
grep -o '[[^\]' <<< '[^\]'
# [
# ^
# \
# OK
# [[^\]] should match one opening bracket, caret, backslash, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[[^\]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters
# [\ ]] should match one backslash, space, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ]]' <<< '[^\]'
# \]
# WRONG (?) -- matches *two* characters
# [\ ] ] should match one backslash, space, or closing bracket
# alternatively (preferred?), report invalid regex (unmatched second ])
grep -o '[\ ] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\ ] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters
# [\] ] should match one backslash, closing bracket, or space
# alternatively (preferred?), report invalid expression (unmatched second ])
grep -o '[\] ]' <<< '[^\]'
# WRONG (?) -- neither a match nor an error
grep -o '[\] ]' <<< '[^\ ]'
# \ ]
# WRONG (?) -- matches *three* characters
# [^] should report invalid regex (void ^ or unmatched [)
grep -o '[^]' <<< '[^\]'
# grep: Unmatched [ or [^
# OK
# [^]\] should match one character that is neither a closing bracket nor a backslash
# alternatively, report invalid regex (void ^)
grep -o '[^]\]' <<< '[^\]'
# [
# ^
# WRONG (?) -- matches *two* characters, seemingly inappropriately
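
for comparison, a sketch of the ]-first forms, on the assumption (consistent with the experiments above) that a backslash inside a class is just an ordinary member and that ] is literal only when it comes first, or right after the ^. if each line printed by -o is read as a separate one-character match, the outputs would be as in the comments; line is an illustrative variable name, and printf stands in for <<< for portability.

```shell
# input line: the four characters [ ^ \ ]
line='[^\]'

# ] placed first is an ordinary member; so is the backslash after it
printf '%s\n' "$line" | grep -o '[]\]'
# \
# ]

# negated form: the literal ] goes right after the ^
printf '%s\n' "$line" | grep -o '[^]\]'
# [
# ^
```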
--
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD
Email: address@hidden
Phone: +47 73591875, +47 72574609
Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303
Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060
-------------------------------------------------------------------------------