octave-maintainers

behavior of regexp ( ) function


From: Daniel J Sebald
Subject: behavior of regexp ( ) function
Date: Thu, 01 Jan 2009 00:34:17 -0600
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.3) Gecko/20041020

Below are some results from regexp() that seem questionable given what the 
documentation says (or I'm misunderstanding).  Say I want to pull the 
substrings from a tab separated data file.  Let

octave:6> a = sprintf('20\t50\tcelcius\t80')
a = 20  50      celcius 80
octave:7> b = sprintf('20\t50\t\t80')
b = 20  50              80

be some sample lines that might come from a data file.  String a has at least 
one character between tabs; b has a case where there are zero characters 
between tabs.  For regexp, the character class [^\t] matches any single 
character other than a tab.  The quantifier + means match one or more times.  
Here are the results for a and b processed with this pattern:

octave:8> regexp(a, '[^\t]+', 'match')
ans =

{
 [1,1] = 20
 [1,2] = 50
 [1,3] = celcius
 [1,4] = 80
}

Looks good.

octave:9> regexp(b, '[^\t]+', 'match')
ans =

{
 [1,1] = 20
 [1,2] = 50
 [1,3] = 80
}

I'll go along with that result too.  There are zero characters between the 
second and third tab and + requires at least one match.

Now, according to the documentation, * is similar to + in concept, but it 
matches _zero_ or more times.  Here are the results for a and b processed 
with that quantifier:

octave:10> regexp(a, '[^\t]*', 'match')
ans =

{
 [1,1] = 20
}

Doesn't look correct.  I'm thinking this should be pretty much the same result 
as with metacharacter +, i.e.,

[1,1] = 20
[1,2] = 50
[1,3] = celcius
[1,4] = 80

because + was one or more matches, and "one or more" is a subset of "zero or 
more".  Next result:

octave:11> regexp(b, '[^\t]*', 'match')
ans =

{
 [1,1] = 20
}

Same as previous, but the way I see it, this case should result in

[1,1] = 20
[1,2] = 50
[1,3] = []
[1,4] = 80

where the empty third string comes from the fact that there are zero characters 
between the two adjacent tabs, i.e., "zero or more".
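
For comparison, one way to get the four fields listed above, empty one included, is to split on the tab delimiter instead of matching the text between tabs. This is only a sketch: it assumes an Octave version whose regexp accepts the 'split' option, which may not be the case for the version used in this session, and the variable names fields and n are made up for the illustration.

```octave
% Sketch: split on the tab delimiter rather than matching the fields.
% Assumes regexp supports the 'split' option in this Octave version.
b = sprintf('20\t50\t\t80');

% 'split' returns the text between successive matches of the pattern,
% so two adjacent tabs give an empty string between them.
fields = regexp(b, '\t', 'split');

n = numel(fields);                 % number of fields, empty ones included
third_is_empty = isempty(fields{3});
```

Splitting avoids the question of how * should treat zero-length matches, because the pattern only ever matches the one-character delimiter.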

Am I correctly understanding what "zero or more" means?

Dan

