Re: Improving strread / textread / textscan

octave-maintainers

From:	Ben Abbott
Subject:	Re: Improving strread / textread / textscan
Date:	Tue, 25 Oct 2011 17:56:44 -0400
On Oct 25, 2011, at 4:43 PM, Philip Nienhuis wrote:

> Ben Abbott wrote:
> 
>> On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:

<snip>

>>> Not all of it.
>>> An EOL can also be a field delimiter. Obvious, because an EOL naturally 
>>> cuts off fields if there's no other delimiter first.
>>> The rest of i. looks correct to me.
>> 
>> Maybe we're defining "delimiter" differently? ... or maybe I'm being overly 
>> pedantic?
> 
> I think you're simply at a more abstract level than me, while I (the guy who 
> patched this part for Octave) tend to think more at a practical level (how do 
> I manage to code it).
> 
>> I'm using the term to indicate a character that separates lines. Which an 
>> EOL does. Or a character that separates fields. Which EOL does not do.
> 
> ....unless the EOL chars are part of whitespace. Now ML's default whitespace 
> for strread = ' \b\r\n\t'.
> AFAIU ML only allows '\n', '\r\n', or '\r' as EOL (default = determined from 
> file), all of which are in strread's default whitespace, and as whitespace is 
> the default delimiter, EOL's implicitly can delimit fields.
> Perhaps this is where my confusion stems from. See a few lines below...:

Ok. I hadn't noticed that strread and textscan used different defaults for 
whitespace (textscan uses " \b\t").
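
As a quick check of how the two defaults differ, here is a minimal sketch. The two strings are the defaults quoted in this thread; the variable names are only for illustration.

```octave
% Default whitespace sets as quoted in this thread.
ws_strread  = sprintf (' \b\r\n\t');   % strread default
ws_textscan = sprintf (' \b\t');       % textscan default

% Characters that are whitespace for strread but not for textscan.
extra = setdiff (ws_strread, ws_textscan);
disp (double (extra))                  % character codes of the difference
```

The difference is exactly the two EOL characters (codes 10 and 13, i.e. \n and \r), which is why an EOL can end up delimiting fields in strread by default.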

>> Thus, EOLs are delimiters for lines but not for fields within a line.
>> 
>> The MW docs do a reasonable job of describing this. See "Field and Row 
>> Delimiters" at the link below.
>> 
>>      http://www.mathworks.com/help/techdoc/ref/textscan.html
> 
> ... we should be careful to not mix up strread and textscan.
> I suppose you think more "the textscan way", while I (knowing that currently 
> strread does the actual work for textscan) tend to perceive stuff more 
> against strread.m background.

Yeah. That explains a lot! :-)

<snip>

>> 
>> ... imply to me that when reading character data, when "delimiter" is 
>> specified, white-space is not used to delimit, and the characters read are 
>> trimmed of leading and trailing white-space.
> 
> That's my impression as well.
> 
> From textscan docs:
> <QUOTE>
> textscan adds a space character, char(32), to any specified Whitespace unless 
> Whitespace is empty ('') and the format includes any string conversion 
> specifier.
> </QUOTE>
> I suppose strread does the same. Perhaps this is where we need to search for 
> analysis of ML behavior.

Your inference is correct.

a = strread ('1 2 3', '%n', 'whitespace', sprintf('\t'))

a =

     1
     2
     3

>>> Strict compliance with rule g. might render patching of strread.m much more 
>>> complicated, as for each individual format specifier we'd have to check the 
>>> whitespace/delimiters around the field in question, depending on the format 
>>> specifier's nature.
>>> This is more easily done in a compiled version that linearly ploughs 
>>> through the text string, than in current strread.m that works by parsing 
>>> complete columns one by one.
>>> I can try to implement rule g. in a quick-and-dirty fashion, perhaps this 
>>> will solve the actual bug that provoked my renewed interest.
>>> 
>>> How much further should we go in fixing current strread (the work horse for 
>>> textscan and textread), given the end-of-life for strread in ML plus jwe's 
>>> upcoming compiled textscan version? (if he -or someone else- ever gets time 
>>> to finish it, of course)
>>> I'm not in favor of blindly imitating as much as we can of the more 
>>> obscure, or undocumented, or inconsistent, or corner case behavior of ML.
>>> I'd prefer clarity and consistency over strict ML compatibility.
>>> Your suggestion of documenting the Octave behavior that ML didn't document 
>>> for its own functions is to be applauded.
>> 
>> For the moment, I'm mostly concerned about documenting how textscan should 
>> work. If you've been able to improve Octave's compatibility, then I 
>> recommend you put together a changeset. John or someone else may make it 
>> obsolete at some point, but that is part of the nature of code development 
>> ... after all you're about to do the same to one of my contributions ;-)
> 
> Happened to me too, several times. Yes that's our fate...
> But you are quick in turning ideas into changesets. I'm more reluctant and 
> rather wait until I'm fairly sure.
> 
> I'll try to prepare a changeset for strread.m in the coming days (I have only 
> little time each day due to medical issues).

OK. No rush, I'll finish writing a test script for textscan. I'm nearly done, 
but still need to write some tests that use files.

>> In any event, my latest attempt to document how textscan parses fields is 
>> below.
>> 
> 
>> 01) Lines of input are delimited by EOL chars. The EOL character may be
>>     specified by the parameter "endofline". The default is determined from
>>     the file ("\n", "\r", or "\r\n").
> 
> ... 01) only applies if textscan reads from file. Correct?

I think it also applies to strread.

a = strread (sprintf ('1\n2\n3'), '%n')

a =

     1
     2
     3

[a, b] = strread (sprintf ('1\n2\n3'), '%n %n')

a =

     1
     2
     3


b =

     0
     0

[a, b] = strread (sprintf ('1 1\n2\n3'), '%n %n')

a =

     1
     2
     3


b =

     1
     0

Maybe I'm missing something, but it looks to me as if Matlab's textscan and 
strread treat EOLs and whitespace in the same way.

>> 02) When reading character fields, if no "delimiter" property is defined,
>>     then the characters contained by the "whitespace" property are used to
>>     delimit fields. When the "delimiter" property is defined, the defined
>>     "whitespace" property is ignored for the purpose of delimiting strings.
>>     Also, when the "delimiter" property is defined all leading and trailing
>>     characters contained in the "whitespace" property are trimmed from the
>>     strings read.
>> 03) Fields read beyond an EOL are treated as being empty. For
>>     numeric data empty values are replaced by the property "emptyvalue".
>> 04) Values for numeric fields are separated by characters contained by the
>>     "whitespace", or "delimiter", properties.
> 
> ... or their union (?) (which is what I think); but see below 09)

Yes. That would be a better description.
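
To make the "union" reading concrete for numeric fields, here is a rough model using regexp. It is only a sketch of the rule as described in this thread, not how strread.m or textscan is implemented; the sample string and the variable names are made up for illustration.

```octave
% Hypothetical model of how numeric fields are separated; NOT the actual
% strread.m or textscan code.
str   = '1, 2  3,,4';
ws    = ' \t';      % whitespace chars, written for use inside a regexp class
delim = ',';        % a single delimiter character

% A separator is either one delimiter with optional surrounding
% whitespace, or a run of whitespace on its own.
pat    = ['[' ws ']*[' delim '][' ws ']*|[' ws ']+'];
fields = regexp (str, pat, 'split');

vals = str2double (fields);   % the empty field between ",," becomes NaN
disp (vals)
```

In this model both whitespace and the delimiter separate the numbers, runs of whitespace are folded, and two adjacent delimiters leave an empty field between them.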

>> 05) The white-space char set can be adapted by the user with the "whitespace"
>>     property. It can even be set to empty.
> 
> ... I'm not sure, but I think ML only allows certain characters to be part of 
> whitespace. At least I read the strread docs this way. I don't know if this 
> also holds for textscan.

For strread you are correct. 

        http://www.mathworks.com/help/techdoc/ref/strread.html

I don't think there is any such restriction for textscan.

>> 06) A repetition of white-space chars is folded into one char.
>> 07) Delimiters are also characters that separate fields.  Multiple
>>     delimiters are not folded into a single instance.
>> 09) For numeric fields, vectors of white-space, and one delimiter, are folded
>>     into one _delimiter_ that separates the fields

> __VV__count goes wrong...

What are you referring to?

>> 09) A pair of delimiters separated by white-space (or nothing) implies an
>>     empty value.
>> 10) If the delimiter property is specified, then white-space is *not* used to
>>     delimit character fields. However, white-space is always used to delimit
>>     numeric fields.
>> 11) For numeric data, the default "emptyvalue" is NaN. If the numeric
>>     type doesn't support NaN, then zero is used (int32 for example). For
>>     character fields, an empty value is just an empty string.
>> 12) Multiple consecutive delimiters can be folded into one delimiter by
>>     setting the "MultipleDelimsAsOne" parameter to true.
>> 
>> Once this part is settled, then I hope to write tests for all of this. Later 
>> I'll add tests for all data types, patterns, field-multiplicity, and 
>> skipping fields / literals.
> 
> For which textscan version?

For Matlab's textscan. I had agreed to do that for jwe some time ago.
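
Coming back to rules 07 and 12 above, the difference between keeping and folding consecutive delimiters can be illustrated with Octave's strsplit. This is only an analogy: strsplit's "CollapseDelimiters" option is not textscan's "MultipleDelimsAsOne", and I'm assuming here that the two behave alike for this simple input.

```octave
% Analogy only: strsplit stands in for the field splitting, it is not
% what textscan does internally.
txt = 'a,,b';

% Rule 07: consecutive delimiters are kept, giving an empty field.
kept   = strsplit (txt, ',', 'CollapseDelimiters', false);

% Rule 12: consecutive delimiters are folded into one.
folded = strsplit (txt, ',', 'CollapseDelimiters', true);

printf ('%d fields without folding, %d with folding\n', ...
        numel (kept), numel (folded));
```

Without folding there are three fields, the middle one empty; with folding there are two.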

Ben