help-octave
From: Philip Nienhuis
Subject: Re: regexp: how to split a cellstr array into substring arrays, each matching regular expressions
Date: Mon, 21 May 2012 21:47:27 +0200

Answering myself (was easier than I thought):

Philip Nienhuis wrote:
Having a cellstr array like this:

octave:178> ar = {'abcdefguvwxAny' ; 'acegxyzTrailing'; 'vxzJunk'}
ar =
{
[1,1] = abcdefguvwxAny
[2,1] = acegxyzTrailing
[3,1] = vxzJunk
}

how can I efficiently split it into two columns using regular
expressions like
'[abcdefg]' and '[uvwxyz]'

to obtain

{ 'abcdefg', 'uvwxAny'; 'aceg', 'xyzTrailing'; '', 'vxzJunk'} ?

IOW, I'd like to split the cellstr array at the location where
'[uvwxyz]' matches (even if not present, see far below).


The closest I get is:

## Invert pattern and use 'split' keyword
octave:179> ss = regexp (ar, '[^abcdefg]', 'split')
ss =
{
[1,1] =
{
[1,1] = abcdefg
[1,2] =
[1,3] =
[1,4] =
[1,5] =
[1,6] =
[1,7] =
[1,8] =
}
[2,1] =
{
[1,1] = aceg
[1,2] =
[1,3] =
[1,4] =
[1,5] =
[1,6] = a
[1,7] =
[1,8] =
[1,9] =
[1,10] = g
}
[3,1] =
{
[1,1] =
[1,2] =
[1,3] =
[1,4] =
[1,5] =
[1,6] =
[1,7] =
[1,8] =
}
}
octave:180> col1 = cellfun (@(x) x{1}, {ss{:}}, 'uni', false)
col1 =
{
[1,1] = abcdefg
[1,2] = aceg
[1,3] =
}
octave:181> col2 = regexp (ar, '[uvwxyz].*', 'match', 'once')
col2 =
{
[1,1] = uvwxAny
[2,1] = xyzTrailing
[3,1] = vxzJunk
}

## ...or the latter statement, perhaps more robust, as:
octave:182> tt = regexp (ar, '[uvwxyz].*', 'match')
tt =
{
[1,1] =
{
[1,1] = uvwxAny
}
[2,1] =
{
[1,1] = xyzTrailing
}
[3,1] =
{
[1,1] = vxzJunk
}
}
octave:183> col2 = cellfun (@(x) [x{:}], {tt{:}}, 'Uni', false)
col2 =
{
[1,1] = uvwxAny
[1,2] = xyzTrailing
[1,3] = vxzJunk
}
octave:184>


( cellfun() was invoked to be able to use repeated indexing; I couldn't
find another way to extract the first/last entries of ss and tt. )
I think my method isn't very robust.
So I hope there's a less convoluted and more reliable way.


BTW,
octave:184> ar = {'abcdefguvwxAny' ; 'acegxyzTrailing'; 'Junk'}
ar =
{
[1,1] = abcdefguvwxAny
[2,1] = acegxyzTrailing
[3,1] = Junk
}
octave:186> tt = regexp (ar, '[uvwxyz].*', 'match', 'once')
tt =
{
[1,1] = uvwxAny
[2,1] = xyzTrailing
[3,1] = unk
}

=> is this a bug? (swallowing the "J" from the last entry)

Thanks,

Philip

octave-3.6.1.exe:16> ar = {'abcdefguvwxAny' ; 'acegxyzTrailing'; 'Junk'}
ar =
{
  [1,1] = abcdefguvwxAny
  [2,1] = acegxyzTrailing
  [3,1] = Junk
}
octave-3.6.1.exe:17> col_1 = regexp (ar, '^[abcdefg]*', 'once', 'match')
col_1 =
{
  [1,1] = abcdefg
  [2,1] = aceg
  [3,1] =
}
octave-3.6.1.exe:18> col_2 = regexprep (ar, '^[abcdefg]*', '')
col_2 =
{
  [1,1] = uvwxAny
  [2,1] = xyzTrailing
  [3,1] = Junk
}
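
The two statements above can be combined into the N-by-2 layout asked for in the original question. This is only a minimal sketch: it assumes (as the 3.6.1 output above suggests) that regexp with 'once' yields an empty entry for rows without a match, and the name res is made up for illustration.

```octave
ar = {'abcdefguvwxAny'; 'acegxyzTrailing'; 'Junk'};

## Column 1: leading run of characters from [abcdefg] (empty if none)
col_1 = regexp (ar, '^[abcdefg]*', 'match', 'once');

## Column 2: whatever remains after stripping that leading run
col_2 = regexprep (ar, '^[abcdefg]*', '');

## Put the two N-by-1 cell arrays side by side as an N-by-2 cell array
res = [col_1, col_2];
```

Note that this splits after the leading [abcdefg] run rather than at the first [uvwxyz] match. For the data above both give the same result, but for a string like 'abcQuvw' this approach would put 'Quvw' in the second column.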

I still have to check with Matlab (ML) on the last question (swallowing characters).
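
As far as I can tell the "swallowed" J is probably not a bug: the pattern '[uvwxyz].*' is not anchored, so the match starts at the first character that belongs to the class, which in 'Junk' is the "u" at position 2. A small sketch to check this, using the 'start' option of regexp (the variable names s, m and rest are just for illustration):

```octave
## 'start' reports the 1-based index at which the match begins
[s, m] = regexp ('Junk', '[uvwxyz].*', 'start', 'match', 'once');
## Here s is 2 and m is 'unk': "J" precedes the match, so it is not returned

## Stripping the anchored leading run instead keeps everything after it,
## including characters that are in neither class
rest = regexprep ('Junk', '^[abcdefg]*', '');
```

So the regexprep variant keeps the full remainder, while the unanchored 'match' variant drops whatever sits between the first column and the first [uvwxyz] character.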

P.

