octave-maintainers

Re: Improving strread / textread / textscan


From: Ben Abbott
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 19:54:51 -0400

On Oct 24, 2011, at 5:47 PM, Philip Nienhuis wrote:

> Ben Abbott wrote:
>> On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:
>> 
>>> Answers to three emails in one:
>>> 
>>> Ben Abbott wrote:
> <snip>
>>>> Test #11: Passed.
>>> 
>>> Hmmm... on ML2007a, I get:
>>> Test #11: Failed.
>>> OBSERVED:
>>>   49   10   76   50
>>> 
>>> EXPECTED:
>>>   76   49   10   76   50
>>> 
>>> So ML is inconsistent...
>>> 
>>> ( Note I fixed some typos in your script :-)  )
>> 
>> I'm confused. Did you run a modified test #11? If so, how did the unmodified 
>> script behave, and can you show us what you changed?
> 
> I copied/pasted your code into the ML editor, and only adapted the typos 
> (OBSEVED -> OBSERVED, and "no enough"-> "not enough" in oct_assert.m).

The 11th test is ...

c = textscan (sprintf ('L1\nL2'), '%s', 'endofline', '');
oct_assert (int8(c{:}{:}), int8([ 76,  49,  10,  76,  50 ]));

Looks to me as if R2007a had a bug in it. Is that a reasonable conclusion?

> 
>>>> Test #12: Failed.
>>>> OBSEVED:
>>>>            2
>>>> 
>>>> EXPECTED:
>>>>            2
>>>>            4
>>>>            0
>>>> 
>>>> Test #13: Passed.
>>>> Test #14: Passed.
>>>> 
>>>> The script with the  tests and the oct_assert function are attached.
>>> 
>>> Apparently ML doesn't recognize empty fields squeezed between two literals.
>> 
>> For reference, test 12 is ...
>> 
>> str = sprintf ('Text1Text2Text\nTextText4Text\nText57Text');
>> c = textscan (str, 'Text%*dText%dText');
>> fprintf ('Test #12:')
>> oct_assert (c{1}, int32 ([2; 4; 0]));
>> 
>> Looking at the table under "User Configurable Options" (link below), MW 
>> indicates that "EmptyValue" is the "Value to return for empty numeric fields 
>> in delimited files." I read this to mean that empties only occur between the 
>> characters defined as "delimiters".
>> 
>>      http://www.mathworks.com/help/techdoc/ref/textscan.html
>> 
>> Replace the literals (i.e. "Text") with delimiters ...
>> 
>> c = textscan (sprintf ('1,2\n,4\n57,'), '%*d%d', 'delimiter', ',');
>> c{:}
>> 
>> ans =
>> 
>>            2
>>            4
>> 
>> Notice the last value isn't between two delimiters, but is preceded by a 
>> delimiter and followed by white-space. If a second delimiter is added, then 
>> ...
>> 
>> c = textscan (sprintf ('1,2\n,4\n57,,'), '%*d%d', 'delimiter', ',');
>> c{:}
>> 
>> ans =
>> 
>>            2
>>            4
>>            0
>> 
>> I haven't studied the docs very deeply, and have only looked at the docs for 
>> R2011b, but it looks to me that ML is behaving in a manner that is 
>> consistent with its documentation (admittedly the documentation is rather 
>> esoteric).
> 
> I'd say Octave more strictly complies to the rules. But admittedly this is an 
> extreme example.
> Note that processing literals differs from processing of delimiters.
> 
> <snip>
>>> =========================
>>> Ben Abbott wrote:
>>>> 
>>>> I've made some modifications to your original notes, and added a few more 
>>>> below.
>>>> 
>>>> a. "Words" or fields (to be interpreted later) are separated by 
>>>> white-space or delimiters.
>>>> b. The white-space char set can be adapted by the user with the 
>>>> "whitespace" keyword. It can even be set to empty.
>>>> c. White-space is understood to possibly be a vector of white-space chars 
>>>> that during reading is folded into one char that separates two fields.
>>>> d. Delimiters are also characters that separate words / fields.  Multiple 
>>>> delimiters are not folded into a single instance.
>>>> e. Vectors of white-space and one delimiter are folded into one 
>>>> _delimiter_ that separates fields.
>>>> f. A pair of delimiters separated by white-space (or nothing) imply an 
>>>> empty value.
>>>> g. By default "emptyvalue" is NaN for numeric data types. If the numeric 
>>>> type doesn't support NaN, then zero is used (int32 for example). For 
>>>> character fields, an empty value is just an empty string.
>>>> h. If so desired, multiple consecutive delimiters can be folded into one 
>>>> delimiter if "MultipleDelimsAsOne" parameter is set to 1.
>>>> i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
>>>> affected by the MultipleDelimsAsOne parameter.
>>> 
>>> As to strread, there's another ML subrule:
>>> <QUOTE>
>>> If your data uses a character other than a space as a delimiter, you must 
>>> use the strread parameter 'delimiter' to specify the delimiter
>>> </QUOTE>
>>> What is it, space or whitespace?
>> 
>> Are you referring to the different EOLs? I'm not entirely sure what you are 
>> asking, but I'll make a guess.
> 
> Sorry for not being clear enough.
> At one place in the docs, ML says "fields are separated by whitespace", while 
> a bit further down is the quote I gave above which only mentions genuine 
> spaces.

I also find the ML docs difficult to follow. It reminds me of the US tax code 
;-)

>> Textread operates on one line at a time. If an attempt is made to read past 
>> the end of a line with a single format statement, empties will be inserted 
>> for those fields read past the EOL.
>> 
>> c = textscan (sprintf ('1\n2\n\n4\n57\n\n'), '%*d%d', 'delimiter', ',');
>>>> c{:}
>> 
>> ans =
>> 
>>            0
>>            0
>>            0
>>            0
>> 
>> Unfortunately, I missed catching the problems with "i" before. I think it 
>> should read ...
>> 
>> i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
>> delimit fields / words and are unaffected by the MultipleDelimsAsOne 
>> parameter. Any fields read beyond an EOL are treated as being empty.
>> 
>> Does that make sense?
> 
> Not all of it.
> An EOL can also be a field delimiter. Obvious, because an EOL naturally cuts 
> off fields if there's no other delimiter first.
> The rest of i. looks correct to me.

Maybe we're defining "delimiter" differently? ... or maybe I'm being overly 
pedantic?

I'm using the term to indicate either a character that separates lines, which 
an EOL does, or a character that separates fields, which an EOL does not do.

Thus, EOLs are delimiters for lines but not for fields within a line.

The MW docs do a reasonable job of describing this. See "Field and Row 
Delimiters" at the link below.

        http://www.mathworks.com/help/techdoc/ref/textscan.html
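
To make the line/field distinction concrete, here is a rough sketch in Python (not Octave, and not how strread.m or textscan is implemented). The function name read_rows and its arguments are made up for illustration. It assumes integer fields with an empty value of zero, and it assumes blank lines are skipped, which is only inferred from the four-row output quoted above.

```python
def read_rows(text, n_fields, delimiter=",", empty_value=0):
    """Read one row per line; fields missing before the EOL become empty."""
    rows = []
    # splitlines() treats "\n", "\r\n" and "\r" as line terminators.
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines produce no row
        fields = [f.strip() for f in line.split(delimiter)]
        # The EOL ends the row; it is never consumed as a field separator,
        # so any fields not found on this line are padded as empty.
        fields += [""] * (n_fields - len(fields))
        rows.append([int(f) if f else empty_value for f in fields[:n_fields]])
    return rows


rows = read_rows("1\n2\n\n4\n57\n\n", 2)
print([row[1] for row in rows])  # [0, 0, 0, 0]
```

The second column comes out as four zeros, like the '%*d%d' example, because no line ever supplies a second field.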

>>> Anyway, if your & my collection of inferred rules apply, I do not 
>>> understand this (7th test of Octave strread.m):
>>> 
>>> octave:23>  a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
>>> a =
>>> {
>>>  [1,1] = a b c
>>>  [2,1] = d e
>>>  [3,1] =
>>>  [4,1] = f
>>> }
>>> (Same goes for ML)
>> 
>> I hadn't considered this before.  I'll have to study the docs again to see 
>> if there is a reference to this. I did try dropping the "delimiter" to see 
>> what happens.
>> 
>> a = textscan ('a b c, d e, , f', '%s');
>> 
>> a{:}
>> 
>> ans =
>> 
>>     'a'
>>     'b'
>>     'c,'
>>     'd'
>>     'e,'
>>     ','
>>     'f'
>> 
>>> because in this example there are spaces ("whitespace") separating e.g., 
>>> 'a' and 'b'.
>>> 
>>> But (ML):
>>>>> a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
>>> a =
>>>     1
>>>     2
>>>     3
>>>     4
>>>     5
>>>     0
>>>     6
>>> 
>>> In the above cases, I get the same results for textscan.
>>> 
>>> So it seems that interpretation & processing of default whitespace depends 
>>> on the field format specifier as well?
>> 
>> It appears that ML doesn't use the white-space property, as delimiters for 
>> strings, when the "delimiter" property has been specified. I've added 
>> another line to the list (specifically "g" and "j").
>> 
>> a. "Words" or fields (to be interpreted later) are separated by white-space 
>> or delimiters.
>> b. The white-space char set can be adapted by the user with the "whitespace" 
>> keyword. It can even be set to empty.
>> c. White-space is understood to possibly be a vector of white-space chars 
>> that during reading is folded into one char that separates two fields.
>> d. Delimiters are also characters that separate words / fields.  Multiple 
>> delimiters are not folded into a single instance.
>> e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
>> that separates fields.
>> f. A pair of delimiters separated by white-space (or nothing) implies an 
>> empty value.
>> g. If the delimiter property is specified, then white-space is *not* used to 
>> delimit character fields. However, white-space is always used to delimit 
>> numeric fields.
>> h. By default "emptyvalue" is NaN for numeric data types. If the numeric 
>> type doesn't support NaN, then zero is used (int32 for example). For 
>> character fields, an empty value is just an empty string.
>> i. If so desired, multiple consecutive delimiters can be folded into one 
>> delimiter if "MultipleDelimsAsOne" parameter is set to 1.
>> j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
>> delimit fields / words and are unaffected by the MultipleDelimsAsOne 
>> parameter. Any fields read beyond an EOL are treated as being empty.
>> 
>> Does this look correct to you?
> 
> Overall, yes, save for i. as mentioned above.
> But as to g., ML seems inconsistent. Spaces in character strings would only 
> be preserved if whitespace is set to "" (empty), according to the ML docs 
> (they even got an example about this).

hmmm ... I think I managed to confuse myself a bit earlier. I tried a simple 
test to confirm my understanding, but just proved my understanding was 
incomplete.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',', 'whitespace', '');

a{:}

ans = 

    '1'


ans = 

    ' 2'


ans = 

    ' 3'

Notice a{2:3} have leading spaces. If "whitespace" is not set to empty, then 
there is no white-space in a{:}.

a = textscan ('1, 2, 3', '%s %s %s', 'delimiter', ',');

a{:}

ans = 

    '1'


ans = 

    '2'


ans = 

    '3'

These two examples and the one below (we've used before) ...

a = textscan ('a b c, d e, , f', '%s', 'delimiter', ',');
>> a{:}

ans = 

    'a b c'
    'd e'
    ''
    'f'

... imply to me that when reading character data with "delimiter" specified, 
white-space is not used to delimit, and the characters read are trimmed of 
leading and trailing white-space.
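
As a rough illustration of that reading, here is a small Python sketch (not Octave code, and not what strread.m does internally). The helper name split_char_fields is hypothetical, and the default white-space set of space, backspace and tab is an assumption. Fields are split on the delimiter only and then trimmed, unless the white-space set is empty.

```python
def split_char_fields(text, delimiter=",", whitespace=" \b\t"):
    """Split character fields on the delimiter only, then trim white-space."""
    fields = text.split(delimiter)
    if not whitespace:
        # An empty white-space set means nothing is trimmed.
        return fields
    return [field.strip(whitespace) for field in fields]


print(split_char_fields("a b c, d e, , f"))         # ['a b c', 'd e', '', 'f']
print(split_char_fields("1, 2, 3"))                 # ['1', '2', '3']
print(split_char_fields("1, 2, 3", whitespace=""))  # ['1', ' 2', ' 3']
```

The three calls mirror the three textscan examples above, including the leading spaces that survive when "whitespace" is empty.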

> Strict compliance with rule g. might render patching of strread.m much more 
> complicated, as for each individual format specifier we'd have to check the 
> whitespace/delimiters around the field in question, depending on the format 
> specifier's nature.
> This is more easily done in a compiled version that linearly ploughs through 
> the text string, than in current strread.m that works by parsing complete 
> columns one by one.
> I can try to implement rule g. in a quick-and-dirty fashion, perhaps this 
> will solve the actual bug that provoked my renewed interest.
> 
> How much further should we go in fixing current strread (the work horse for 
> textscan and textread), given the end-of-life for strread in ML plus jwe's 
> upcoming compiled textscan version? (if he -or someone else- ever gets time 
> to finish it, of course)
> I'm not in favor of blindly imitating as much as we can of the more obscure, 
> or undocumented, or inconsistent, or corner case behavior of ML.
> I'd prefer clarity and consistency over strict ML compatibility.
> Your suggestion of documenting the Octave behavior that ML didn't document 
> for its own functions is to be applauded.

For the moment, I'm mostly concerned about documenting how textscan should 
work. If you've been able to improve Octave's compatibility, then I recommend 
you put together a changeset. John or someone else may make it obsolete at some 
point, but that is part of the nature of code development ... after all you're 
about to do the same to one of my contributions ;-)

In any event, my latest attempt to document how textscan parses fields is 
below.

01) Lines of input are delimited by EOL chars. The EOL character may be
    specified by the parameter "endofline". The default is determined from
    the file ("\n", "\r", or "\r\n").
02) When reading character fields, if no "delimiter" property is defined, then
    the characters contained by the "whitespace" property are used to delimit
    fields. When the "delimiter" property is defined, the defined "whitespace"
    property is ignored for the purpose of delimiting strings. Also, when the
    "delimiter" property is defined all leading and trailing characters
    contained in the "whitespace" property are trimmed from the strings read.
03) Any fields read beyond an EOL are treated as being empty. For
    numeric data empty values are replaced by the property "emptyvalue".
04) Values for numeric fields are separated by characters contained by the
    "whitespace", or "delimiter", properties.
05) The white-space char set can be adapted by the user with the "whitespace"
    property. It can even be set to empty.
06) A repetition of white-space chars is folded into one char.
07) Delimiters are also characters that separate fields.  Multiple
    delimiters are not folded into a single instance.
08) For numeric fields, a run of white-space plus one delimiter is folded
    into one _delimiter_ that separates the fields.
09) A pair of delimiters separated by white-space (or nothing) implies an
    empty value.
10) If the delimiter property is specified, then white-space is *not* used to
    delimit character fields. However, white-space is always used to delimit
    numeric fields.
11) For numeric data, the default "emptyvalue" is NaN. If the numeric
    type doesn't support NaN, then zero is used (int32 for example). For
    character fields, an empty value is just an empty string.
12) Multiple consecutive delimiters can be folded into one delimiter by
    setting the "MultipleDelimsAsOne" parameter to true.
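
The numeric-field rules in this list can be sketched for a single line as follows. This is Python rather than Octave and only a sketch of the rules as written here, not of any existing implementation. The helper name split_numeric_fields, its argument names, and the default white-space set of space and tab are assumptions. It does not handle EOLs or format specifiers.

```python
import math
import re


def split_numeric_fields(line, delimiter=",", whitespace=" \t",
                         empty_value=math.nan, multiple_delims_as_one=False):
    """Split one line into numeric fields (hypothetical helper)."""
    d = re.escape(delimiter)
    if whitespace:
        ws = "[" + re.escape(whitespace) + "]"
        # White-space around one delimiter is folded into that delimiter.
        sep = ws + "*" + d + ws + "*"
        line = line.strip(whitespace)
    else:
        ws = None
        sep = d
    if multiple_delims_as_one:
        # Consecutive delimiters collapse into a single separator.
        sep = "(?:" + sep + ")+"
    # A run of white-space on its own also separates two fields.
    pattern = sep if ws is None else sep + "|" + ws + "+"
    fields = re.split(pattern, line)
    # Two delimiters with nothing between them leave an empty field.
    return [float(f) if f else empty_value for f in fields]


print(split_numeric_fields("1 2 3, 4 5, , 6", empty_value=0))
# [1.0, 2.0, 3.0, 4.0, 5.0, 0, 6.0]
```

With empty_value=0 this gives the same seven values as the strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',') example quoted earlier. Setting multiple_delims_as_one=True drops the empty field.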

Once this part is settled, then I hope to write tests for all of this. Later 
I'll add tests for all data types, patterns, field-multiplicity, and skipping 
fields / literals.

Integer, signed: %d, %d8, %d16, %d32, %d64
Integer, unsigned: %u, %u8, %u16, %u32, %u64
Floating-point: %f, %f32, %f64, %n
Character strings: %s, %q, %c

Pattern-matching: %[...], %[^...]

Multiple fields: %Nc, %Ns, %Nq, %N[...], %N[^...], %Nn, %Nd, %Nu, %Nf, %N.Dn, 
%N.Df

Skipping fields: %*, %*n, and literals

Ben


