Re: new strsplit function

octave-maintainers

From:	Philip Nienhuis
Subject:	Re: new strsplit function
Date:	Mon, 1 Apr 2013 12:02:55 -0700 (PDT)

bpabbott wrote
> On Apr 1, 2013, at 2:06 AM, Rik wrote:
> 
>> On 03/31/2013 05:38 PM, Ben Abbott wrote:
>>> Rik,
>>> 
>>> I've pushed a changeset that includes a note in the NEWS file.
>>> 
>>>     http://hg.savannah.gnu.org/hgweb/octave/rev/1de4ec2a856d
>>> 
>>> I have not run any benchmarks.  If it is any help, the new version is
>>> based on regexp().
>> Ben,
>> 
>> Unfortunately regexp is slow compared to operations on native char types.
>> 
>> I did a quick benchmark and I do think this is likely to be an issue.  The
>> strread function is now 30X slower.
>> 
>> Benchmark Code:
>> cd scripts/string
>> tic; A = textread ("strtok.m", "%s"); toc
>> 
>> New results:
>> 1.502 +/- .004
>> 
>> Old results:
>> 0.0455 +/- .0006
>> 
>> Slowdown = 1.502 / .0455 = 33
>> 
>> strtok.m is a small file, 7.2KB, so a largish real data file is going to
>> parse very slowly.
>> 
>>> 
>>> If we are to add another script, perhaps cstrsplit() is a good name (we
>>> did that for strcat some time ago), where "c" is for (c)onventional.
>> That would be a good idea.
>> 
>> Also, shouldn't the replacement of existing instances have been
>> 
>> strsplit (str, del, "collapsedelimiters", false)
>> 
>> rather than
>> 
>> strsplit (str, del, false)
>> 
>> ???
>> 
>> I don't see that the Matlab function accepts a third argument--only
>> PROP/VALUE pairs.
>> 
>> Cheers,
>> Rik
> 
> Both
> 
>       strsplit (str, del, "collapsedelimiters", false)
> 
> and
> 
>       strsplit (str, del, false)
> 
> should give the same result.  Treating the third argument in this way was
> part of our original implementation, and I kept it for backward
> compatibility.
> 
> Regarding the slow down, another option is to add a new delimiter type,
> say "conventional"? (I'm embarrassed that I hadn't thought of that
> approach before)

If Matlab compatibility comes with a significant performance hit, I'd
hesitate to blindly follow that aim.
However, strread.m is an extreme example for core Octave; it should have been
replaced by a binary version anyway. But in Octave-Forge there are several
calls to strsplit in various packages.
Nitzan's recent 3.6.4 MinGW binary ships 81 packages; grepping them
(excluding the Java package) yields:

.../share/octave $ grep -i -r "strsplit" *

packages/cgi-0.1.0/@cgi/cgi.m:p = strsplit(self.query_string,'&');
packages/cgi-0.1.0/@cgi/cgi.m:  pp = strsplit(p{i},'=');
packages/communications-1.1.1/comms.m:    if (!isempty(char(strsplit (path, ":"))))
packages/communications-1.1.1/comms.m:      infopaths =[infopaths; char(strsplit (path, ":"))];
packages/communications-1.1.1/comms.m:    if (!isempty(char(strsplit (DEFAULT_LOADPATH, ":"))))
packages/communications-1.1.1/comms.m:      infopaths =[infopaths; char(strsplit  (DEFAULT_LOADPATH, ":"))];
packages/dataframe-0.9.1/@dataframe/dataframe.m:          dummy = cellfun ('size', cellfun (@(x) strsplit (x, ":=("), df._name{2}, \
packages/dataframe-0.9.1/@dataframe/dataframe.m:          content = cellfun (@(x) strsplit (x, sep), lines, \
packages/dataframe-0.9.1/@dataframe/dataframe.m:    x = strsplit (base, "=");
packages/dataframe-0.9.1/@dataframe/private/df_name2idx.m:        dummy = strsplit (subs{indi}, ':');
packages/fuzzy-logic-toolkit-0.4.2/readfis.m:  line_vec = discard_empty_strings (strsplit(line, "=':,[] \t", true));
packages/fuzzy-logic-toolkit-0.4.2/readfis.m:  line_vec = strsplit (line, ",():", true);
packages/geometry-1.6.0/io/@svg/loadpaths.m:  strpath = strsplit (str(1:end-1), '$', true);
packages/geometry-1.6.0/io/@svg/loadsvgdata.m:  strdata = strsplit (str(1:end-1), '$', true);
packages/geometry-1.6.0/PKG_ADD:pp = strsplit (dirname,filesep (), true);
packages/geometry-1.6.0/PKG_ADD:    pkg_folder = strsplit (pkg_folder,filesep (), true);
packages/geometry-1.6.0/PKG_DEL:pp = strsplit (dirname,filesep (), true);
packages/geometry-1.6.0/PKG_DEL:    pkg_folder = strsplit (pkg_folder,filesep (), true);
packages/io-1.2.1/chk_spreadsheet_support.m:    oct_vsn = str2double (strsplit (OCTAVE_VERSION, '.'){1}) + ...
packages/io-1.2.1/chk_spreadsheet_support.m:              0.1 * str2double (strsplit (OCTAVE_VERSION, '.'){2});
packages/io-1.2.1/chk_spreadsheet_support.m:    if (ispc)str2double (strsplit (OCTAVE_VERSION, '.'){2});
packages/io-1.2.1/chk_spreadsheet_support.m:    cjver = strsplit (jver, '.');
packages/io-1.2.1/chk_spreadsheet_support.m:    if (isunix && ~iscell (jcp)); jcp = strsplit (char (jcp), ':'); end %if
packages/io-1.2.1/private/chk_jar_entries.m:      jentry = strsplit (lower (jcp{ii}), filesep){end};
packages/io-1.2.1/private/getodsinterfaces.m:        jcp = strsplit (char (jcp), pathsep ());
packages/io-1.2.1/private/getxlsinterfaces.m:        jcp = strsplit (char (jcp), pathsep ());
packages/io-1.2.1/private/__chk_java_sprt__.m:    cjver = strsplit (jver, ".");
packages/io-1.2.1/private/__chk_java_sprt__.m:      jcp = strsplit (char (jcp), pathsep ());
packages/io-1.2.1/private/__JOD_spsh2oct__.m:              tmp = strsplit (char (scell.getValue ()), " ");
packages/io-1.2.1/private/__JOD_spsh2oct__.m:              tmp = strsplit (char (scell.getValue ().getTime ()), " ");
packages/io-1.2.1/private/__JXL_spsh2oct__.m:##     ''     order in strsplit, wrong isTime condition
packages/io-1.2.1/private/__JXL_spsh2oct__.m:            tmp = strsplit (char (scell.getDate ()), " ");
packages/io-1.2.1/private/__JXL_spsh2oct__.m:                tmp = strsplit (char (scell.getDate ()), " ");
packages/io-1.2.1/private/__UNO_getusedrange__.m:  adrblks = strsplit (addrs, ",");
packages/io-1.2.1/private/__UNO_getusedrange__.m:    adrblks = strsplit (addrs, ";");
packages/io-1.2.1/private/__UNO_getusedrange__.m:      ## Same, but tru strsplit()
packages/io-1.2.1/private/__UNO_getusedrange__.m:      range = strsplit (adrblks{ii}, "."){2};
packages/io-1.2.1/private/__UNO_spsh2oct__.m:  adrblks = strsplit (addrs, ",");
packages/io-1.2.1/private/__UNO_spsh_close__.m:          fname = canonicalize_file_name (strsplit (nfilename, filesep){end});
packages/io-1.2.1/private/__UNO_spsh_close__.m:          fname = make_absolute_filename (strsplit (nfilename, filesep){end});
packages/io-1.2.1/private/__UNO_spsh_close__.m:          tmp = strsplit (fname, filesep);
packages/io-1.2.1/private/__UNO_spsh_open__.m:        fname = canonicalize_file_name (strsplit (filename, filesep){end});
packages/io-1.2.1/private/__UNO_spsh_open__.m:        fname = make_absolute_filename (strsplit (filename, filesep){end});
packages/io-1.2.1/private/__UNO_spsh_open__.m:        tmp = strsplit (fname, filesep);
packages/io-1.2.1/xmlwrite.m:    sn = char (strsplit (filename, "."));
packages/mechanics-1.3.1/core/@rigidbody/display.m:              str = strsplit (str, "\n",true);
packages/mechanics-1.3.1/PKG_ADD:pp = strsplit (dirname,filesep (), true);
packages/mechanics-1.3.1/PKG_ADD:    pkg_folder = strsplit (pkg_folder,filesep (), true);
packages/mechanics-1.3.1/PKG_ADD:pp = strsplit (dirname,filesep (), true);
packages/mechanics-1.3.1/PKG_DEL:pp = strsplit (dirname,filesep (), true);
packages/mechanics-1.3.1/PKG_DEL:    pkg_folder = strsplit (pkg_folder,filesep (), true);
packages/mechanics-1.3.1/PKG_DEL:pp = strsplit (dirname,filesep (), true);
packages/miscellaneous-1.2.0/read_options.m:    lextra = lgrep (cellstr (strsplit (extra," ")));
packages/ncarray-1.0.0/nccoord.m:    tmp = strsplit(vinfo.Attributes(index).Value,' ');

...so that's 9 packages that are affected. 
A large part of it is in the io package, which has been overhauled already
(not yet in svn) to cope with new core Java compatibility. The strsplit
change would be no big deal, except for performance.

An asset of the new strsplit is that it allows for multi-char delimiters,
which is very useful in some situations.
I have an old version of strread.m where a similar option was implemented
(using regexprep to first replace those multi-char delimiter sequences with
char(255) and then splitting the input string on char(255)).
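
For illustration, a minimal sketch of that sentinel trick (not the actual code
from my old strread.m). The name split_multichar is made up, it assumes
char(255) never occurs in the input, and strrep stands in for regexprep
because the delimiters here are literal strings:

```octave
1;  # script-file marker so the function below can be defined inline

function parts = split_multichar (str, delims)
  ## split_multichar is a hypothetical name, not part of Octave.
  ## Assumption: char(255) never occurs in STR.
  sentinel = char (255);
  for k = 1:numel (delims)
    ## strrep stands in for regexprep because the delimiters are literal
    str = strrep (str, delims{k}, sentinel);
  endfor
  ## Split on the single-char sentinel using plain char comparison
  idx = [0, find(str == sentinel), numel(str) + 1];
  parts = cell (1, numel (idx) - 1);
  for k = 1:numel (parts)
    parts{k} = str(idx(k)+1 : idx(k+1)-1);
  endfor
endfunction

## Example call with two multi-char delimiters
parts = split_multichar ("a::b--c", {"::", "--"});
```

The point is that only the replacement step has to understand multi-char
delimiters; the split itself stays on native char operations.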

Is there a way to first parse/analyze the input args and, based on that,
decide to either invoke regexp() or fall back on the old strsplit()? If so,
conventional Octave users could still enjoy a fast strsplit and <evil grin>
those dying for ML compatibility would get their valued slowdown </grin>
;-)

Perhaps we could sacrifice some ML compatibility by checking for a plain
string as arg #2 (and an optional logical arg #3) and then invoking the "old"
strsplit, while requiring cellstr arrays as arg #2 for a strict ML-compatible
call to strsplit/regexprep.
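
Roughly what I have in mind, as a sketch only. The name strsplit_dispatch is
hypothetical, the optional logical arg #3 is left out for brevity, and I am
assuming that a plain string means "a set of single-char delimiters", as in
the old strsplit:

```octave
1;  # script-file marker so the function below can be defined inline

function parts = strsplit_dispatch (str, del)
  ## strsplit_dispatch is a hypothetical name, not part of Octave.
  if (ischar (del))
    ## Fast path: each char in DEL is a delimiter; no regexp involved
    idx = [0, find(ismember (str, del)), numel(str) + 1];
    parts = cell (1, numel (idx) - 1);
    for k = 1:numel (parts)
      parts{k} = str(idx(k)+1 : idx(k+1)-1);
    endfor
  elseif (iscellstr (del))
    ## Slow path: escape each delimiter and split on their alternation
    esc = cellfun (@(d) regexptranslate ("escape", d), del, ...
                   "UniformOutput", false);
    parts = regexp (str, strjoin (esc, "|"), "split");
  else
    error ("strsplit_dispatch: DEL must be a string or a cellstr");
  endif
endfunction

fast = strsplit_dispatch ("a:b=c", ":=");             # plain string
slow = strsplit_dispatch ("a::b--c", {"::", "--"});   # cellstr
```

Callers passing a plain string would then never touch regexp, and only
cellstr delimiters would pay for it.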

Philip




--
View this message in context: 
http://octave.1599824.n4.nabble.com/Re-new-strsplit-function-tp4651374p4651401.html
Sent from the Octave - Maintainers mailing list archive at Nabble.com.
