[Octave-bug-tracker] [bug #52550] textscan drops delimiter character for

From:	Dan Sebald
Subject:	[Octave-bug-tracker] [bug #52550] textscan drops delimiter character for multi-character, cell-specified delimiter option
Date:	Wed, 29 Nov 2017 00:21:34 -0500 (EST)
User-agent:	Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0

Follow-up Comment #1, bug #52550 (project octave):

I'm going to make some general (and perhaps rambling) comments in this post,
and then in follow-up posts address what should be a simple fix for the
original problem.

----------

In the current code there is a comment about delim_table and delim_list:


    // Three cases for delim_table and delim_list
    // 1. delim_table empty, delim_list empty:  whitespace delimiters
    // 2. delim_table = look-up table of delim chars, delim_list empty.
    // 3. delim_table non-empty, delim_list = Cell array of delim strings


The above combinations ostensibly represent two conditions, i.e.:


    if (delim_list.numel () == 0)         // single character delimiter
      {
      }
    else                                  // multi-character delimiter
      {
      }


That is, in order to make this distinction of single/multi-character, the
delim_list (or some other means of representing multiple conditions) needs to
exist.  There are a number of questions I have regarding this.

First is whether this multi-character delimiter is proper and/or useful.  Is
this behavior supposed to be a superset of Matlab?  If the single character
string and character strings within a cell were, instead, to behave in a
similar way, we could just put the characters for all scenarios in the same
delim_table ('delims' string at time of parsing) and drop the delim_list usage
in the code.  Any user who chooses a multi-character delimiter should be able
to easily convert their data file to a single-character delimiter by use of
sed.
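
To make the first question concrete, here is a minimal sketch of what "put
the characters for all scenarios in the same delim_table" could look like.
It assumes every character of every delimiter string, whether given directly
or inside a cell, is treated as an individual delimiter.  The names
delim_table_t, make_delim_table and is_delim are hypothetical and are not
taken from the Octave sources.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a delimiter look-up table: entry c is true
// when character c is a delimiter.
using delim_table_t = std::array<bool, 256>;

// Build the table from a list of delimiter strings.  Assumption: each
// character of each string counts as a single-character delimiter, so no
// separate list of multi-character delimiters is needed.
delim_table_t
make_delim_table (const std::vector<std::string>& delims)
{
  delim_table_t table {};   // value-initialized, so every entry is false
  for (const auto& s : delims)
    for (unsigned char c : s)
      table[c] = true;
  return table;
}

// Constant-time test of one input character, with no look-ahead.
bool
is_delim (const delim_table_t& table, char c)
{
  return table[static_cast<unsigned char> (c)];
}
```

With a table like this the parser only ever has to examine the current
character, which is the simplification being asked about.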

Second, if this multi-character delimiter is the desired behavior for
Octave, then does the coding need to be as complex as the current code?  The
processing for multi-character is sort of non-causal in the sense that it
needs to look N-1 symbols ahead, where N is the length of the delimiter.  For
example, say the delimiter one chooses is "123".  Then upon being presented
with data that starts with 1, there are two scenarios.  In the next two
examples


example: 987123654123456...  >>  987 654 456 ...
example: 987120654123456...  >>  987120654 456 ...
            ^


imagine being presented with the '1'.  It's not until seeing the '3' or the
'0' that one knows how to treat the data.  As a result, the code has this
"lookahead" state memory by which it grabs some data from the stream (say
three characters) and then puts the file pointer back to the location prior to
the reading of the data.  This is somewhat clumsy programming.  Yes, there are
processing tasks that require more-sophisticated programming, and the current
code is neither overly complex nor an uncommon non-causal technique, but it is
always a matter of matching the complexity of the programming to the
complexity of the problem.  We just explored the speed of loading ASCII data
(see http://savannah.gnu.org/bugs/?51871), in which "getline"-like techniques
were used to bring a whole line of data into memory.  If a similar thing were
done here, i.e., first get a whole line of data then search that string for
multiple-character delimiters, it would obviate the need for grabbing
look-ahead data and tellg, seekg, etc.
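
As a rough illustration of the getline-then-search idea, the following is a
minimal sketch using only the standard library.  It is not the textscan
implementation; split_line and split_stream are hypothetical helper names,
and things like whitespace handling, format conversion and end-of-line
options are ignored.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split one line already held in memory on a multi-character delimiter.
// std::string::find does the searching, so no stream look-ahead or
// tellg/seekg repositioning is involved.
std::vector<std::string>
split_line (const std::string& line, const std::string& delim)
{
  std::vector<std::string> fields;
  if (delim.empty ())
    {
      fields.push_back (line);   // nothing to split on
      return fields;
    }

  std::string::size_type start = 0;
  std::string::size_type pos;
  while ((pos = line.find (delim, start)) != std::string::npos)
    {
      fields.push_back (line.substr (start, pos - start));
      start = pos + delim.size ();   // skip over the whole delimiter
    }
  fields.push_back (line.substr (start));   // trailing field
  return fields;
}

// Bring each line into memory with std::getline, then split it.
std::vector<std::vector<std::string>>
split_stream (std::istream& is, const std::string& delim)
{
  std::vector<std::vector<std::string>> rows;
  std::string line;
  while (std::getline (is, line))
    rows.push_back (split_line (line, delim));
  return rows;
}
```

Calling split_line on the two example strings above with the delimiter "123"
gives the fields 987, 654, 456 and 987120654, 456 respectively.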

Again, even a more elegant use of getline/parse for multiple-character
delimiters should have one wondering if the necessitated complexity of such a
requirement is appropriate for computer programming and data file format.  Is
this multi-character delimiter feature something that ever gets used?



    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?52550>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/