octave-maintainers
Re: New strsplit function


From: Ben Abbott
Subject: Re: New strsplit function
Date: Thu, 16 May 2013 16:34:10 +0800

On May 16, 2013, at 2:52 PM, Ben Abbott wrote:

> On May 16, 2013, at 2:39 PM, John W. Eaton wrote:
> 
>> On 05/16/2013 02:19 AM, Ben Abbott wrote:
>> 
>>> hmmm ... I took a look at Matlab 2013a.  It's not clear to me that we'd 
>>> want to copy this.
>>> 
>> 
>> Well, Matlab users apparently want compatibility here.  That's why I
>> received the report.
>> 
>>> matlab>  strsplit('', 'a')
>>> 
>>> ans =
>>> 
>>>    {''}
>>> 
>>> matlab>  strsplit('a', 'a')
>>> 
>>> ans =
>>> 
>>>    ''    ''
>>> 
>>> matlab>  strsplit('aa', 'a')
>>> 
>>> ans =
>>> 
>>>    ''    ''
>>> 
>>> matlab>  strsplit('aaa', 'a')
>>> 
>>> ans =
>>> 
>>>    ''    ''
>>> 
>>> matlab>  strsplit('aaaa', 'a')
>>> 
>>> ans =
>>> 
>>>    ''    ''
>>>
>>> matlab>  strsplit ('abc', {'a','b','c'})
>>> 
>>> ans =
>>> 
>>>    ''    ''
>>>
>>> In case it isn't clear, the output is a cellstring containing two empty 
>>> strings.
>> 
>> Oh, so collapsedelimiters means that if multiple consecutive delimiters
>> appear in the string that is being split, they should be treated as
>> one?
> 
> That is my understanding.  A moment ago, it occurred to me I should check to 
> see how regexp () works.
> 
> octave> regexp ('aaaaa', '(a)+', 'split')
> ans = 
> {
>  [1,1] = 
>  [1,2] = 
> }
> octave> strsplit ('aaaaa', 'a', 'delimitertype', 'regularexpression')
> ans = 
> {
>  [1,1] = 
>  [1,2] = 
> }
> 
> So, it looks unlikely that there is a Matlab bug, but instead it is a 
> misunderstanding on my part.
> 
>> Then I think my guess about what was happening was wrong, and the
>> behavior above is correct.  If the string is 'aa' and the delimiter is
>> 'a', then it is the same as strsplit ('a', 'a') and the result should
>> be two empty strings (one for before and one for after the
>> delimiter).  That's the result we used to get for the simpler case of
>> strsplit ('a', 'a').  Now we get an empty cell array, which looks
>> wrong to me.
> 
> ahhh ... ok, that makes sense to me!
> 
>> So in this code
>> 
>>   ## Get substring lengths.
>>   if (isempty (idx))
>>     strlens = length (str);
>>   else
>>     strlens = [idx(1)-1, diff(idx)-1, numel(str)-idx(end)];
>>   endif
>>   if (nargout > 1)
>>     ## Grab the separators
>>     matches = num2cell (str(idx)(:)).';
>>     if (args.collapsedelimiters)
>>       ## Collapse the consecutive delimiters
>>       ## TODO - is there a vectorized way?
>>       for m = numel(matches):-1:2
>>         if (strlens(m) == 0)
>>           matches{m-1} = [matches{m-1:m}];
>>           matches(m) = [];
>>         endif
>>       endfor
>>     endif
>>   endif
>>   ## Remove separators.
>>   str(idx) = [];
>>   if (args.collapsedelimiters)
>>     ## Omit zero lengths.
>>     strlens = strlens(strlens != 0);
>>   endif
>> 
>>   ## Convert!
>>   result = mat2cell (str, 1, strlens);
>> 
>> it seems like we should be performing the "omit zero lengths" part on
>> the output of diff, then tacking on the beginning and ending strings.
>> But I don't understand what the "if (nargout > 1)" part in between is
>> doing.
> 
> The (nargout > 1) part was there to allow the block to be skipped if "matches" 
> isn't requested (the 2nd output).  I'll take a look at your suggested change.
> 
> Ben

I've untangled the "legacy" option from the rest, and added some tests.  This 
"should" be compatible with Matlab.  If it's OK with you, I'll push this and 
follow up with another change to introduce cstrsplit.m and remove the "legacy" 
code from strsplit.m.
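
For anyone following along, the idea John suggested (drop the zero lengths 
from the output of diff only, then tack on the beginning and ending strings) 
might look roughly like the sketch below.  This is not the attached 
changeset: the name split_collapsed is hypothetical, and it assumes a single 
character delimiter and a row-vector string, and it ignores the "matches" 
output.

```octave
1;  # mark this file as a script so the function can be defined inline

## Hypothetical helper illustrating the suggestion; NOT the code in strsplit.m.
## Assumes DELIM is a single character and STR is a row vector.
function result = split_collapsed (str, delim)
  ## An empty input yields a single empty string.
  if (isempty (str))
    result = {""};
    return;
  endif
  ## Positions of the delimiter in str.
  idx = find (str == delim);
  if (isempty (idx))
    ## No delimiter found: the whole string is one piece.
    strlens = numel (str);
  else
    ## Lengths of the pieces between consecutive delimiters.
    inner = diff (idx) - 1;
    ## Collapse runs of delimiters: omit zero lengths between delimiters only.
    inner = inner(inner != 0);
    ## Always keep the leading and trailing pieces, even when they are empty.
    strlens = [idx(1)-1, inner, numel(str)-idx(end)];
  endif
  ## Remove the delimiters, then cut the remainder into pieces.
  str(idx) = [];
  result = mat2cell (str, 1, strlens);
endfunction

## With this ordering 'a' and 'aa' both give a cell with two empty strings.
parts = split_collapsed ("aa", "a");
printf ("%d pieces\n", numel (parts));
```

The difference from the quoted code is only where the zero lengths are 
removed: filtering before the first and last lengths are attached means the 
empty strings before and after the delimiter run survive.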

Ben

Attachment: changeset.patch
Description: Binary data



