octave-maintainers

Re: Regexp Cleanup


From: Rik
Subject: Re: Regexp Cleanup
Date: Thu, 04 Jul 2013 08:32:59 -0700

On 07/04/2013 08:01 AM, address@hidden wrote:
> Message: 5
> Date: Thu, 04 Jul 2013 09:40:42 -0400
> From: "John W. Eaton" <address@hidden>
> To: Laurent Hoeltgen <address@hidden>
> Cc: address@hidden
> Subject: Re: Regexp cleanup
> Message-ID: <address@hidden>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> On 07/04/2013 03:37 AM, Laurent Hoeltgen wrote:
>> > On 07/03/2013 09:57 PM, PhilipNienhuis wrote:
>>> >> Rik-4 wrote
>>>> >>> The re-write also solves the following existing bugs (I said it
>>>> >>> was
>>>> >>> creaky).
>>>> >>>
>>>> >>> 38778: wrong return value for regexp
>>>> >>> 38616: memory leak
>>>> >>> 38149: wrong tokens returned
> Thanks!
>
>>>> >>> So, depending on what Matlab does, would it be okay to drop support for
>>>> >>> this esoterica? I'm pretty tired of trying to work it out at this 
>>>> >>> point.
>>>> >>>
>>>> >>> --Rik
>>> >> Matlab r2013b prerelease does (after changing double quote to single
>>> >> quote,
>>> >> and removing empty lines):
>>> >>
>>>>> >>>> [S, E, TE, M, T, NM, SP] = regexp ('John Davis\nRogers, James',
>>>>> >>>> '(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)')
>>> >> S =
>>> >> 1 12
>>> >> E =
>>> >> 10 25
>>> >> TE =
>>> >> [2x2 double] [2x2 double]
>>> >> M =
>>> >> 'John Davis' 'nRogers, James'
>>> >> T =
>>> >> {1x2 cell} {1x2 cell}
>>> >> NM =
>>> >> 1x2 struct array with fields:
>>> >> first
>>> >> last
>>> >> SP =
>>> >> '' '\' ''
>>> >> ...so it seems Matlab thinks this is valid.
>> > Matlab R2012a returns the same result as above.
> Rik, I'm pretty sure this complexity was added to Octave for Matlab 
> compatibility reasons.
>
> jwe

Yes, it seems from at least two responses that Matlab accepts this syntax. 
I've been perusing the PCRE API, and it seems that the underlying issue with
duplicate capture names can be resolved by setting an option when the
pattern is compiled (PCRE_DUPNAMES).  Of course, if we turn on that
feature, then we need to change our code significantly, because we currently
rename all of the named expressions to unique, numbered names and maintain
a translation table to map them back after the pattern matching.
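
To make the rename-and-translate idea concrete, here is a minimal sketch in
Python, whose re module also rejects duplicate group names.  It is not the
Octave code; the helper names (rename_groups, named_tokens) and the g1, g2,
... aliases are made up for illustration, and the renaming is deliberately
naive (it does not handle an escaped "\(" or a "(" inside a character class).

```python
import re

# Opening of a named group written as (?<name> ... ).  Lookbehinds such as
# (?<= and (?<! are skipped because a name must start with a letter or "_".
NAMED_GROUP = re.compile(r"\(\?<([A-Za-z_]\w*)>")


def rename_groups(pattern):
    """Give every named group a unique alias; return (new_pattern, table)."""
    table = {}  # alias -> original name (the translation table)

    def unique(match):
        alias = "g%d" % (len(table) + 1)
        table[alias] = match.group(1)
        return "(?P<%s>" % alias  # Python spells named groups (?P<name>...)

    return NAMED_GROUP.sub(unique, pattern), table


def named_tokens(pattern, text):
    """Return one dict per match, keyed by the ORIGINAL group names."""
    new_pattern, table = rename_groups(pattern)
    results = []
    for m in re.finditer(new_pattern, text):
        tokens = {}
        for alias, value in m.groupdict().items():
            if value is not None:  # only groups that took part in this match
                tokens[table[alias]] = value
        results.append(tokens)
    return results


pattern = r"(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)"
# Raw string: the \n stays a literal backslash followed by "n", as in the
# single-quoted Matlab call quoted above.
names = named_tokens(pattern, r"John Davis\nRogers, James")
print(names)
```

With PCRE_DUPNAMES the library would presumably do this bookkeeping itself,
which is the part of the shim layer that could go away.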

I like moving in this direction, even with the extra work of re-drafting
the regexp code, because it makes the shim layer between Octave and PCRE
smaller and easier to understand.  We put more of the weight for code
development and maintenance on PCRE and less on Octave.  In addition, it
would allow the use of named capture buffers within the matching expression
using the \k construct (right now that is hard to do).  PCRE_DUPNAMES was
added in 2008 so it is hardly new, but that may still be newer than
whatever the Red Hat Long Term Release is using.  What do you think about
bumping the minimum required PCRE up to a 2008 release?

--Rik

