behavior of regexp ( ) function

octave-maintainers

From:	John W. Eaton
Subject:	behavior of regexp ( ) function
Date:	Tue, 27 Jan 2009 23:42:58 -0500

On  1-Jan-2009, Daniel J Sebald wrote:

| Below are some results from regexp() that seem questionable given what the
| documentation says (or I'm misunderstanding).  Say I want to pull the
| substrings from a tab-separated data file.  Let
| 
| octave:6> a = sprintf('20\t50\tcelcius\t80')
| a = 20  50      celcius 80
| octave:7> b = sprintf('20\t50\t\t80')
| b = 20  50              80
| 
| be some sample lines that might come from a data file.  String a has at least
| one character between tabs; b has a case where there are zero characters
| between tabs.  For regexp, the character class [^\t] matches any character
| other than a tab.  The metacharacter + means match one or more times.  Here
| are the results for a and b processed with this pattern:
| 
| octave:8> regexp(a, '[^\t]+', 'match')
| ans =
| 
| {
|   [1,1] = 20
|   [1,2] = 50
|   [1,3] = celcius
|   [1,4] = 80
| }
| 
| Looks good.
| 
| octave:9> regexp(b, '[^\t]+', 'match')
| ans =
| 
| {
|   [1,1] = 20
|   [1,2] = 50
|   [1,3] = 80
| }
| 
| I'll go along with that result too.  There are zero characters between the
| second and third tab and + requires at least one match.
| 
| Now, according to the documentation, * is similar to + in concept, but there
| must be a match of _zero_ or more.  Here are the results for a and b processed
| with that metacharacter:
| 
| octave:10> regexp(a, '[^\t]*', 'match')
| ans =
| 
| {
|   [1,1] = 20
| }
| 
| Doesn't look correct.  I'm thinking this should be pretty much the same
| result as with metacharacter +, i.e.,
| 
| [1,1] = 20
| [1,2] = 50
| [1,3] = celcius
| [1,4] = 80
| 
| because + was one or more matches, and "one or more" is a subset of "zero or
| more".  Next result:
| 
| octave:11> regexp(b, '[^\t]*', 'match')
| ans =
| 
| {
|   [1,1] = 20
| }
| 
| Same as previous, but the way I see it, this case should result in
| 
| [1,1] = 20
| [1,2] = 50
| [1,3] = []
| [1,4] = 80
| 
| where the third empty string comes from the fact there are zero characters
| between two tabs, i.e., "zero or more".
| 
| Am I correctly understanding what "zero or more" means?

I'm not sure whether this is a bug, but it is apparently incompatible
behavior.  I don't know what the fix is, but I looked at the
octregexp_list function.  It correctly matches the first "20" and moves
idx forward to 2 (the position of the next character in the string).
The next call to pcre_exec then matches zero or more of anything not
TAB and returns a zero-length substring starting and ending at
idx == 2.  At that point ovector[1] <= ovector[0], so execution breaks
out of the loop.
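
The loop described above can be modeled in a few lines.  This is only a
sketch: it uses Python's re module as a stand-in for pcre_exec, the
function name match_all_break_on_empty is hypothetical, and the
(start, end) pair from m.span() plays the role of ovector[0] and
ovector[1].  It is not the octregexp_list source.

```python
import re


def match_all_break_on_empty(pattern, text):
    """Collect matches left to right, stopping at the first zero-length match.

    Hypothetical model of the loop described in the message, not Octave code.
    """
    rx = re.compile(pattern)
    matches = []
    idx = 0
    while idx <= len(text):
        m = rx.search(text, idx)      # stand-in for one pcre_exec call
        if m is None:
            break
        start, end = m.span()         # stand-in for ovector[0], ovector[1]
        if end <= start:
            break                     # zero-length match ends the loop
        matches.append(m.group(0))
        idx = end                     # resume right after the match
    return matches


b = "20\t50\t\t80"
print(match_all_break_on_empty(r"[^\t]+", b))   # ['20', '50', '80']
print(match_all_break_on_empty(r"[^\t]*", b))   # ['20']
```

With +, no zero-length match is possible, so the loop runs to the end of
the string.  With *, the search that starts on the first TAB succeeds
with an empty match, and the loop stops after "20", which is the result
reported above.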

David, would you say this is a bug in Octave, or Matlab?  How would
you interpret the '[^\t]*' regexp in this case?  If it is a bug in
Octave, do you see a fix?
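
For discussion, here is one possible policy, again sketched with
Python's re module standing in for pcre_exec: when a match is
zero-length, step idx forward by one character and try again instead of
leaving the loop.  The function name match_all_skip_empty is
hypothetical, and this is an assumption about what a fix might look
like, not a statement of what Octave or Matlab does.

```python
import re


def match_all_skip_empty(pattern, text):
    """Collect non-empty matches, stepping past zero-length ones.

    Hypothetical sketch of one policy for empty matches, not Octave code.
    """
    rx = re.compile(pattern)
    matches = []
    idx = 0
    while idx <= len(text):
        m = rx.search(text, idx)      # stand-in for one pcre_exec call
        if m is None:
            break
        start, end = m.span()
        if end == start:
            idx = start + 1           # zero-length: move on one character
            continue
        matches.append(m.group(0))
        idx = end
    return matches


a = "20\t50\tcelcius\t80"
b = "20\t50\t\t80"
print(match_all_skip_empty(r"[^\t]*", a))   # ['20', '50', 'celcius', '80']
print(match_all_skip_empty(r"[^\t]*", b))   # ['20', '50', '80']
```

Under this policy * gives the same tokens as + for both strings.  It
does not produce the empty third element suggested in the quoted
message for b; recording zero-length matches would be a different
choice.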

Thanks,

jwe

Current Thread

behavior of regexp ( ) function, Daniel J Sebald, 2009/01/01
- behavior of regexp ( ) function, John W. Eaton <=
  - Re: behavior of regexp ( ) function, David Bateman, 2009/01/28
    - Re: behavior of regexp ( ) function, Søren Hauberg, 2009/01/28
    - Re: behavior of regexp ( ) function, David Bateman, 2009/01/28
    - Re: behavior of regexp ( ) function, David Bateman, 2009/01/28
    - RE: behavior of regexp ( ) function, HALL, BENJAMIN PW, 2009/01/28
