behavior of regexp ( ) function
From: John W. Eaton
Subject: behavior of regexp ( ) function
Date: Tue, 27 Jan 2009 23:42:58 -0500

On 1-Jan-2009, Daniel J Sebald wrote:
| Below are some results from regexp() that seem questionable given what the
| documentation says (or I'm misunderstanding). Say I want to pull the
| substrings from a tab separated data file. Let
|
| octave:6> a = sprintf('20\t50\tcelcius\t80')
| a = 20 50 celcius 80
| octave:7> b = sprintf('20\t50\t\t80')
| b = 20 50 80
|
| be some sample lines that might come from a datafile. String a has at least
| one character between tabs; b has a case where there are zero characters
| between tabs. For regexp, the metacharacters [^\t] mean any ASCII value other
| than a tab. The metacharacter + means match one or more times. Here are the
| results for a and b processed with these metacharacters:
|
| octave:8> regexp(a, '[^\t]+', 'match')
| ans =
|
| {
| [1,1] = 20
| [1,2] = 50
| [1,3] = celcius
| [1,4] = 80
| }
|
| Looks good.
|
| octave:9> regexp(b, '[^\t]+', 'match')
| ans =
|
| {
| [1,1] = 20
| [1,2] = 50
| [1,3] = 80
| }
|
| I'll go along with that result too. There are zero characters between the
| second and third tab and + requires at least one match.
|
| Now, according to the documentation, * is similar to + in concept, but there
| must be a match of _zero_ or more. Here are the results for a and b processed
| with those metacharacters:
|
| octave:10> regexp(a, '[^\t]*', 'match')
| ans =
|
| {
| [1,1] = 20
| }
|
| Doesn't look correct. I'm thinking this should be pretty much the same
| result as with metacharacter +, i.e.,
|
| [1,1] = 20
| [1,2] = 50
| [1,3] = celcius
| [1,4] = 80
|
| because + was one or more matches, and "one or more" is a subset of "zero or
| more". Next result:
|
| octave:11> regexp(b, '[^\t]*', 'match')
| ans =
|
| {
| [1,1] = 20
| }
|
| Same as previous, but the way I see it, this case should result in
|
| [1,1] = 20
| [1,2] = 50
| [1,3] = []
| [1,4] = 80
|
| where the third empty string comes from the fact there are zero characters
| between two tabs, i.e., "zero or more".
|
| Am I correctly understanding what "zero or more" means?

I'm not sure whether this is a bug. But it is apparently incompatible
with Matlab's behavior. I don't know what the fix is, but I looked at the
octregexp_list function: it correctly matches the first "20" and moves
idx forward to 2 (the position of the next character in the string).
The next call to pcre_exec then matches zero or more characters that
are not TAB and returns a zero-length substring starting and ending at
idx == 2. At that point ovector[1] <= ovector[0], and execution breaks
out of the loop.
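
The loop behavior described above can be illustrated with a small Python sketch. This is not the Octave source: Python's re module stands in for pcre_exec, and the function names scan_stop_on_empty and scan_step_past_empty are hypothetical. The second function shows just one possible policy for zero-length matches; it is not a confirmed description of what Matlab does, nor a proposed fix for octregexp_list.

```python
import re


def scan_stop_on_empty(pattern, text):
    """Collect matches, stopping at the first zero-length match.

    Mirrors the exit condition described above: when the match end is
    not past the match start, the loop is abandoned.
    """
    regex = re.compile(pattern)
    matches = []
    idx = 0
    while idx < len(text):
        m = regex.search(text, idx)
        if m is None or m.end() <= m.start():
            break  # zero-length match: give up, as in the described loop
        matches.append(m.group())
        idx = m.end()
    return matches


def scan_step_past_empty(pattern, text):
    """Collect non-empty matches, stepping one character past empty ones.

    Hypothetical alternative policy, shown only for comparison.
    """
    regex = re.compile(pattern)
    matches = []
    idx = 0
    while idx < len(text):
        m = regex.search(text, idx)
        if m is None:
            break
        if m.end() > m.start():
            matches.append(m.group())
            idx = m.end()
        else:
            idx = m.start() + 1  # skip the character that blocked the match
    return matches


b = "20\t50\t\t80"
print(scan_stop_on_empty(r"[^\t]*", b))    # ['20']
print(scan_step_past_empty(r"[^\t]*", b))  # ['20', '50', '80']
```

With the first policy, the scan of b ends after "20", which is the result reported in the quoted session. The second policy happens to produce the same list as the '[^\t]+' pattern; keeping an empty entry for the field between the two adjacent tabs, as the quoted message expects, would need an additional rule for which zero-length matches to report.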

David, would you say this is a bug in Octave, or Matlab? How would
you interpret the '[^\t]*' regexp in this case? If it is a bug in
Octave, do you see a fix?

Thanks,

jwe