[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new strsplit function
From: |
Ben Abbott |
Subject: |
Re: new strsplit function |
Date: |
Sat, 20 Apr 2013 16:49:04 -0400 |
On Apr 20, 2013, at 3:51 PM, Philip Nienhuis wrote:
> Ben Abbott wrote:
>> On Apr 3, 2013, at 8:44 AM, Ben Abbott wrote:
>>
>>> On Apr 2, 2013, at 11:42 PM, John W. Eaton wrote:
>>>
>>>> On 04/02/2013 07:40 PM, Ben Abbott wrote:
>>>>
>>>>> On Apr 2, 2013, at 6:35 PM, Carnë Draug wrote:
>>>>>
>>>>>> On 2 April 2013 18:04, Ben Abbott<address@hidden> wrote:
>>>>>>
>>>>>>> On Apr 2, 2013, at 12:00 PM, Carnë Draug wrote:
>>>>>>>
>>>>>>>> On 2 April 2013 13:02, Ben Abbott<address@hidden> wrote:
>>>>>>>>> I've added a "conventional" to the possible delimitertypes, and
>>>>>>>>> selected this as the default when all delimiters are scalar
>>>>>>>>> characters. I've also enabled 2D character array inputs. The
>>>>>>>>> following replacements may be made into OF, and Octave core, to use
>>>>>>>>> Jaroslav's implementation. A changeset to core to use these
>>>>>>>>> substitutions will also be needed.
>>>>>>>>>
>>>>>>>>> - strsplit (str, del)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", false, "delimitertype",
>>>>>>>>> "conventional")
>>>>>>>>>
>>>>>>>>> - strsplit (str, del, false)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", false, "delimitertype",
>>>>>>>>> "conventional")
>>>>>>>>>
>>>>>>>>> - strsplit (str, del, true)
>>>>>>>>> + strsplit (str, del, "collapsedelimiters", true, "delimitertype",
>>>>>>>>> "conventional")
>>>>>>>>
>>>>>>>> That means that OF packages with this change, will only work after 3.8
>>>>>>>> being released, and the ones without the change will stop working once
>>>>>>>> 3.8 is released, so the releases would have to be timed.
>>>>>>>>
>>>>>>>> Instead, I propose we add the old strsplit in the private directory of
>>>>>>>> each package. Then, once 3.8 gets released, we can remove them and
>>>>>>>> make the change as each package gets re-released.
>>>>>>>
>>>>>>> That's a good idea!
>>>>>>
>>>>>> I have just made the change on revision 11779. I have also created a
>>>>>> new task on the trackers https://savannah.gnu.org/task/index.php?12561
>>>>>> so we don't forget what packages got affected.
>>>>>>
>>>>>> Finally, geometry and mechanics package need to be fixed some other
>>>>>> way since they make use of it on PKG_DEL and PKD_ADD scripts.
>>>>>>
>>>>>> Carnë
>>>>>
>>>>> I pushed a slightly modified changeset.
>>>>>
>>>>> http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b
>>>>
>>>> In a recent changeset, you made the following change:
>>>>
>>>> - list = strsplit (list, " \n\t", true);
>>>> + list = strsplit (list, " \n\t", true, "delimitertype", "legacy");
>>>>
>>>> If people will have to make a change in their code to recover previous
>>>> behavior does is really help to introduce a new option like this? Isn't
>>>> this equivalent to
>>>>
>>>> list = strsplit (list, {" ", "\n", "\t"})
>>>>
>>>> ? Why not just tell people to make this change instead of introducing the
>>>> "legacy" delimiter type?
>>>
>>> The "legacy" delimiter type was introduced because Jaroslav's
>>> implementation was roughly 30x faster than the regexp approach. In many
>>> instances the slow down probably isn't a big deal, but for strread /
>>> textscan the slow down would be a heavy burden.
>>>
>>>> Also, the new strsplit does not seem to be working correctly for
>>>> single-quoted strings that contain escape sequences. I think that
>>>>
>>>> strsplit ("foo\tbar", '\t')
>>>>
>>>> should split on the TAB, but it is currently returning the original string
>>>> for me.
>>>
>>> Opps. I hadn't considered single quoted strings. Your example does work
>>> in Matlab. How is this sort of thing handled in other places? It is
>>> sufficient to just ...
>>>
>>> delimiter = sprintf (delimiter) ?
>>>
>>>> This is starting to look a bit messy. Maybe we should revert this change
>>>> until we have a better plan for the transition. Maybe it would be good to
>>>> give people some advance warning before making this change all at once.
>>>
>>>
>>> Sorry, I hadn't considered that. I say Jordi's request for someone to take
>>> this on, and it looked like a short project I could find the time for.
>>>
>>>
>>> https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2013-March/032734.html
>>>
>>> To revert, the following three changesets need to be included.
>>>
>>> http://hg.savannah.gnu.org/hgweb/octave/rev/1de4ec2a856d
>>>
>>> http://hg.savannah.gnu.org/hgweb/octave/rev/5be43435bd5b
>>>
>>> http://hg.savannah.gnu.org/hgweb/octave/rev/61989cde13ae
>>>
>>> Ben
>>
>> All,
>>
>> The attached changeset adds sq-string support for "simple" delimiters.
>> Unless, I've missed something else, that should make strsplit() fully Matlab
>> compatible.
>>
>> Regarding the "legacy" delimiter type, jwe made a good point. I also
>> sympathised with Rik/Philip suggestion to preserve the original algorithm
>> for its speed advantage. Personally, I'm inclined to leave "legacy" in
>> place, but can be easily swayed to revert that.
>>
>> Comments?
>
> Obviously the point is not to have the "legacy" option per sé, but to
> preserve the 'legacy speed' as much as possible for simple cases.
> Could that be done by parsing the input and -depending on the results- choose
> a flexible but slow path, or a fast but legacy/limited path? (note I haven't
> delved deep in your patches yet)
>
> I'm not opposed to requiring a cellstr array for multiple delimiters. If all
> delimiters are single characters we could do s/th like:
> delimiter = [delimiter{:}]
> and choose the 'legacy' path, if there's a multi-char delimiter we'd need
> regexp() and, consequently, a slower path.
This is already being done for "delimitertype" == "simple" !
> BTW I have a multi-char delimiter strread.m lying around somewhere that
> worked with strrep(), one call for each multi-char delimiter on the entire
> input character array. Worked slow as well as strrep() is slow (does it
> invoke regexp() behind the scenes?) but there are several other calls to
> strrep() in strread.m anyway, no big deal.
> With the new strsplit, multi-char delimiters can more formally be integrated
> in strread.m, we'd have some advantage (and a useful one) over ML then.
>
> And then there is the bug John mentioned with escape characters.
> On a stable MinGW 3.6.4 build it doesn't work either:
>
> octave:1> strsplit ("foo\tbar", '\t')
> ans =
> {
> [1,1] = foo bar
> }
Is this with the changeset I attached applied?
Ben
- Re: new strsplit function, (continued)
- Re: new strsplit function, Philip Nienhuis, 2013/04/01
- Re: new strsplit function, Ben Abbott, 2013/04/02
- Re: new strsplit function, Carnë Draug, 2013/04/02
- Re: new strsplit function, Ben Abbott, 2013/04/02
- Re: new strsplit function, Carnë Draug, 2013/04/02
- Re: new strsplit function, Ben Abbott, 2013/04/02
- Re: new strsplit function, John W. Eaton, 2013/04/02
- Re: new strsplit function, Ben Abbott, 2013/04/03
- Re: new strsplit function, Ben Abbott, 2013/04/20
- Re: new strsplit function, Philip Nienhuis, 2013/04/20
- Re: new strsplit function,
Ben Abbott <=
- Re: new strsplit function, Philip Nienhuis, 2013/04/20
- Re: new strsplit function, Ben Abbott, 2013/04/20
- Re: new strsplit function, Ben Abbott, 2013/04/23
- Re: new strsplit function, Ben Abbott, 2013/04/23
- Re: new strsplit function, Ben Abbott, 2013/04/02