Re: Improving strread / textread / textscan

octave-maintainers

From:	PhilipNienhuis
Subject:	Re: Improving strread / textread / textscan
Date:	Mon, 31 Oct 2011 13:54:34 -0700 (PDT)

bpabbott wrote:
> 
> On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:
> :
> <snip>
> :
>>> a. "Words" or fields (to be interpreted later) are separated by
>>> white-space or delimiters.
> <snip>
>>> g. If the delimiter property is specified, then white-space is *not*
>>> used to delimit character fields. However, white-space is always used to
>>> delimit numeric fields.
> :
> <snip>
> :
>> Strict compliance with rule g. might render patching of strread.m much
>> more complicated, as for each individual format specifier we'd have to
>> check the whitespace/delimiters around the field in question, depending
>> on the format specifier's nature.
>> This is more easily done in a compiled version that linearly ploughs
>> through the text string, than in current strread.m that works by parsing
>> complete columns one by one.
>> I can try to implement rule g. in a quick-and-dirty fashion, perhaps this
>> will solve the actual bug that provoked my renewed interest.
> 

There might be a way to do this cleanly (but not easily) along the lines
I've used to parse literals, as that is the place where the input text
string is matched to the format string to assess the proper number of data
columns.
But I'm afraid this can become a bit complicated.  I'll try to have a go at
it in the coming weeks; hopefully this can be finished before the "code
freeze" for Octave 3.6. Or perhaps someone might beat me to it (who
knows...)



> 02) When reading character fields, if no "delimiter" property is defined,
>     then the characters contained by the "whitespace" property are used to
>     delimit fields. When the "delimiter" property is defined, the defined
>     "whitespace" property is ignored for the purpose of delimiting strings.
>     Also, when the "delimiter" property is defined all leading and trailing
>     characters contained in the "whitespace" property are trimmed from the
>     strings read.
> 

The following Matlab (r2007a) results may make this clearer (note the %s and
%d in the format strings). FYI, textscan gives identical results:

<Example 1>
>> [a, b] = strread ('1 2 3, 4 5, , 6', '%d%s', 'delimiter', ',')
a =
     1
     4
     0

b = 
    '2 3'
    '5'
    '6'

<Example 2>
>> [a, b] = strread ('1 2 3, 4 5, , 6', '%d%s', 'delimiter', ',', 'whitespace', '')
a =
     1
     4
     0

b = 
    ' 2 3'
    ' 5'
    ' 6'

<Example 3>
>> [a, b, c] = strread ('1 2 3, 4 5, , 6', '%d%s%d', 'delimiter', ',')
a =
     1
     5

b = 
    '2 3'
    ''

c =
     4
     6

<Example 4>
>> [a, b, c] = strread ('1 2 3, 4 5, , 6', '%d%s%d', 'delimiter', ',', 'whitespace', ' ')
a =
     1
     5

b = 
    '2 3'
    ''

c =
     4
     6

<Example 5>
>> [a, b, c] = strread ('1 2 3, 4 5, , 6', '%d%s%d', 'delimiter', ',', 'whitespace', '')
a =
     1
     5

b = 
    ' 2 3'
    ' '

c =
     4
     6

<Example 6>
>> [a, b, c] = strread ('1 2 3 , 4 5, , 6', '%d%s%d', 'delimiter', ',')  % note space after '3'
a =
     1
     5

b = 
    '2 3 '
    ''

c =
     4
     6

<Example 7>
>> str = sprintf ('aaaaa\nbbbbb');
>> c = textscan (str, '%s', 'endofline', '');
>> c{:}
ans = 
    'aaaaa
bbbbb'

<Example 8>
>> c = textscan (str, '%s', 'endofline', '\n');
>> c{:}
ans = 
    'aaaa'
    'bbb'

(...Hey... where did that one 'b' go...? This happens consistently in
r2007a.)

Perhaps your ML behavior rule 02) would be better worded as:

02) Strings are delimited by the characters included in the "delimiter" and
"endofline" values. Numeric fields are also delimited by characters in the
"whitespace" values. Whitespace used to delimit numeric fields is not
included in adjacent string fields.

You wrote:

> 02) ....
> <snip>
>     Also, when the "delimiter" property is defined all leading and trailing
>     characters contained in the "whitespace" property are trimmed from the
>     strings read.

As Example 6 shows, this is not what happens: at least the trailing space
is preserved.

Anyway, ML seems to have a consistent (but obscurely documented) rule about
parsing numeric versus string fields depending on delimiter/whitespace
values. That might render implementing it in Octave a bit easier.
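
To make the reworded rule 02) concrete, here is a minimal sketch of a linear
scanner that reproduces the a/b/c columns of Examples 1, 2, 3, 5 and 6 above.
It is written in Python purely as a toy model; it is not strread.m and not
Matlab's algorithm. The function name scan_fields and its arguments are
hypothetical. Two points are my reading of the examples rather than part of
the rule's wording: numeric conversion always skips leading blanks (Examples
2 and 5), and characters in "whitespace" are skipped before a string field
but kept at its end (Examples 1 and 6). Only %d and %s are handled, and
"endofline" is ignored.

```python
def scan_fields(text, fmt, delimiter=",", whitespace=" \b\t"):
    """Toy linear scan of `text` following the format list `fmt`.

    `fmt` is a list such as ["%d", "%s"]; one output column per specifier.
    """
    columns = [[] for _ in fmt]
    pos, n = 0, len(text)
    while pos < n:
        cycle_start = pos
        for col, spec in enumerate(fmt):
            if pos >= n:
                break  # input exhausted in the middle of a format cycle
            if spec == "%d":
                # Assumption: blanks before a number are always skipped,
                # even when `whitespace` is empty.
                while pos < n and (text[pos] in whitespace or text[pos] == " "):
                    pos += 1
                start = pos
                while pos < n and text[pos].isdigit():
                    pos += 1
                # An empty numeric field yields 0, as in Example 1.
                columns[col].append(int(text[start:pos]) if pos > start else 0)
                # Whitespace that ends a numeric field is consumed, so it
                # never shows up in the following string field.
                while pos < n and text[pos] in whitespace:
                    pos += 1
            else:  # "%s"
                # Assumption: leading whitespace is skipped, trailing is kept.
                while pos < n and text[pos] in whitespace:
                    pos += 1
                start = pos
                # Strings end only at a delimiter (or at end of input).
                while pos < n and text[pos] not in delimiter:
                    pos += 1
                columns[col].append(text[start:pos])
            # Consume one delimiter directly after the field, if present.
            if pos < n and text[pos] in delimiter:
                pos += 1
        if pos == cycle_start:
            raise ValueError("cannot parse input at position %d" % pos)
    return columns


a, b = scan_fields("1 2 3, 4 5, , 6", ["%d", "%s"])
print(a, b)
```

Calling it with whitespace="" gives the untrimmed strings of Examples 2 and
5, and the input of Example 6 keeps the trailing space in '2 3 '. Whether the
same scan also matches Matlab on inputs not shown above is untested.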

Philip

--
View this message in context: 
http://octave.1599824.n4.nabble.com/Improving-strread-textread-textscan-tp3931190p3961592.html
Sent from the Octave - Maintainers mailing list archive at Nabble.com.
