octave-maintainers

Re: Improving strread / textread / textscan


From: Ben Abbott
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 08:09:11 -0400

On Oct 24, 2011, at 4:00 AM, Philip Nienhuis wrote:

> Ben Abbott wrote:
>> 
>> On Oct 23, 2011, at 8:37 PM, Ben Abbott wrote:
>> 
>>> I'll prepare a changeset.
>>> 
>>> Ben
>> 
>> A changeset is attached (I won't push until I get feedback).
>> 
>> It appears to me that whitespace is treated differently from delimiters. 
>> Specifically, repeated whitespace is always treated as a single delimiter 
>> (I'll need to check what happens when the "delimiter" parameter includes 
>> white-space characters), and repeated delimiters imply an empty field which 
>> (for numeric data) is set equal to "emptyvalue".
> 
> Just as a., b., c. and d. in my original posting, isn't it?
> 
>> Thus, "emptyvalue" is only substituted for missing numeric data between 
>> non-white-space delimiters (those characters explicitly specified as 
>> "delimiters").
> 
> Again, just as d.
> 
>> The default value for "emptyvalue" is NaN, unless NaN cannot be represented 
>> by the data type (int32 for example). When NaN isn't valid, zero is used.
> 
> Yes, cf. ML docs. This looks consistent to me.
> 
>> I've added some tests and xtests. Some of the xtests conflict with tests. 
>> I've added comments to the tests in the hope of avoiding confusion.
>> 
>> I still plan to add more tests for other functionality which is present in 
>> ML but not in Octave. I'll add those as xtests as well and push them as I 
>> don't expect any discussion is needed.
> 
> Well, perhaps we'd better wait until this is all sorted out.
> 
> I've already patched strread / textscan / textread on my dev machine (esp. 
> strread as that does the work for textscan) for "proper" whitespace/delimiter 
> handling, + better code (Rik rightly complained about too "complicated" 
> parts).
> Some of those changes overlap with your patches.
> 
> BTW, there IS an emptyvalue texinfo string, in strread. Octave's textscan.m 
> refers to strread.m for all parameters not exclusive to textscan.m (same goes 
> for textread).
> (What's lacking is mentioning the default value of NaN, except for int32.)
> So I think the texinfo string in strread.m should be amended, rather than 
> adding one in textscan.m
> 
> Thank you for your efforts here! I have only patchy access to ML currently, 
> so your sorting out is a big help.
> 
> Philip


I've made some modifications to your original notes, and added a few more below.

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space may occur as a run (vector) of several white-space chars; during 
reading such a run is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, zero is used (int32 for example). For character 
fields, an empty value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one 
delimiter if the "MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.
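
To make the notes above concrete, here is a rough sketch of rules a. through i. 
as a small Python model. It is not the strread.m code and it has not been 
checked against ML; it is only a literal reading of the notes. The names 
split_fields and convert_numeric are made up for illustration. The sketch 
assumes that each char in "delimiter" is a delimiter in its own right, that the 
delimiter set is not empty, and that leading and trailing white-space is 
dropped.

```python
import math
import re


def split_fields(text, delimiter=",", whitespace=" \t",
                 multiple_delims_as_one=False):
    """Split text into raw field strings per rules a-f, h, i (a model only)."""
    # Rule i: EOL sequences (\r\n, \r, \n) all act as one kind of delimiter.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption of this sketch: white-space at either end never forms a field.
    text = text.strip(whitespace)

    # Rule b: the white-space set is user-defined and may be empty.
    ws = "[" + re.escape(whitespace) + "]" if whitespace else ""
    ws_star = ws + "*" if ws else ""
    delim = "[" + re.escape(delimiter) + "]"

    if multiple_delims_as_one:
        # Rule h: runs of delimiters fold into one; rule i: EOLs never fold.
        sep = ws_star + "(?:(?:" + delim + ws_star + ")+|\n" + ws_star + ")"
    else:
        # Rules d, e: one delimiter (or EOL) absorbs the white-space around it,
        # so two delimiters in a row leave an empty field between them (rule f).
        sep = ws_star + "(?:" + delim + "|\n)" + ws_star
    if ws:
        # Rule c: a run of white-space on its own is a single separator.
        sep += "|" + ws + "+"
    return re.split(sep, text)


def convert_numeric(fields, dtype=float, empty_value=None):
    """Rule g: empty fields become NaN, or zero if the type has no NaN."""
    if empty_value is None:
        empty_value = math.nan if dtype is float else 0
    return [dtype(f) if f else empty_value for f in fields]


print(split_fields("1 , 2,,3"))
print(split_fields("1,,,3", multiple_delims_as_one=True))
print(convert_numeric(split_fields("1, ,3"), dtype=int))
```

In this model "1, ,3" and "1,,3" both give three fields with an empty one in 
the middle, while "1   2" gives two. A delimiter directly followed by an EOL 
also gives an empty field here, which is what rules f. and i. say when read 
literally; whether ML really does that is one of the things a test script would 
have to confirm.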

I think these are consistent with your understanding, correct?

Instead of patching textscan.m, I'll start working on an independent test 
script.

Ben


