octave-maintainers

Re: Improving strread / textread / textscan


From: Ben Abbott
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 15:52:11 -0400

On Oct 24, 2011, at 2:49 PM, Philip Nienhuis wrote:

> Answers to three emails in one:
> 
> Ben Abbott wrote:
>> On Oct 23, 2011, at 6:42 PM, Ben Abbott wrote:
>> 
>>> Ok. Let's start with writing tests for ML. I'll start by extracting Octave's 
>>> tests and confirm they work on ML.
>>> 
>>> Ben
>> 
>> I've copied the tests from textscan and modified them to run on ML. To do 
>> that I wrote a simple oct_assert function to handle the asserts. Of the 
>> total 14 asserts, two of them have failed.
>> 
>> Test #1: Passed.
>> Test #2: Passed.
>> Test #3: Passed.
>> Test #4: Passed.
>> Test #5: Passed.
>> Test #6: Passed.
>> Test #7: Passed.
>> Test #8: Passed.
>> Test #9: Failed.
>> OBSEVED:
>>           16         241           3
>> 
>> EXPECTED:
>>           16         241           3           0
>> 
>> Test #10: Passed.
>> Test #11: Passed.
> 
> Hmmm... on ML2007a, I get:
> Test #11: Failed.
> OBSERVED:
>   49   10   76   50
> 
> EXPECTED:
>   76   49   10   76   50
> 
> So ML is inconsistent...
> 
> ( Note I fixed some typos in your script :-)  )

I'm confused. Did you run a modified test #11? If so, how did the unmodified 
script behave, and can you show us what you changed?

>> Test #12: Failed.
>> OBSEVED:
>>            2
>> 
>> EXPECTED:
>>            2
>>            4
>>            0
>> 
>> Test #13: Passed.
>> Test #14: Passed.
>> 
>> The script with the  tests and the oct_assert function are attached.
> 
> Apparently ML doesn't recognize empty fields squeezed between two literals.

For reference, test 12 is ...

str = sprintf ('Text1Text2Text\nTextText4Text\nText57Text');
c = textscan (str, 'Text%*dText%dText');
fprintf ('Test #12:')
oct_assert (c{1}, int32 ([2; 4; 0]));
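
The oct_assert helper itself isn't reproduced in this message. As a rough sketch only (the attached version may differ), a function along these lines would print the Passed/Failed lines shown in the results above. The class check is an assumption on my part, added because Octave's assert compares classes as well as values.

```octave
% oct_assert.m: hypothetical stand-in for the attached helper, not the original.
function oct_assert (observed, expected)
  % Compare class as well as value, since isequal alone ignores the class.
  same_class = strcmp (class (observed), class (expected));
  same_value = isequal (observed, expected);
  if same_class && same_value
    fprintf (' Passed.\n');
  else
    fprintf (' Failed.\nOBSERVED:\n');
    disp (observed);
    fprintf ('\nEXPECTED:\n');
    disp (expected);
    fprintf ('\n');
  end
end
```

Each test prints its own label first (the fprintf ('Test #12:') call above), so the helper only has to finish the line.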

Looking at the table under "User Configurable Options" (link below), MW 
indicates that "EmptyValue" is the "Value to return for empty numeric fields in 
delimited files." I read this to mean that empties only occur between the 
characters defined as "delimiters".

        http://www.mathworks.com/help/techdoc/ref/textscan.html

Replace the literals (i.e. "Text") with delimiters ...

c = textscan (sprintf ('1,2\n,4\n57,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

           2
           4

Notice the last value isn't between two delimiters, but is preceded by a 
delimiter and followed by white-space. If a second delimiter is added, then ...

c = textscan (sprintf ('1,2\n,4\n57,,'), '%*d%d', 'delimiter', ',');
c{:}

ans =

           2
           4
           0

I haven't studied the docs very deeply, and have only looked at the docs for 
R2011b, but it looks to me that ML is behaving in a manner that is consistent 
with its documentation (admittedly the documentation is rather esoteric).

> ====================
> On Oct 23, 2011, at 8:37 PM, Ben Abbott wrote:
>> 
>>> > a3 = cell2mat (textscan (sprintf 
>>> > ('Text1Text2Text\nText3TextText\nText57Text63Text'), 'Text%dText%dText'))
>>> >
>>> > Matlab returns ...
>>> >
>>> > a1 =
>>> >           2
>>> >           4
>>> > a2 =
>>> >           2
>>> > Error using cat
>>> > CAT arguments dimensions are not consistent.
>> I got this wrong. Removing the "cell2mat" ...
>> 
>> a3 = textscan (sprintf ('Text1Text2Text\nText3TextText\nText57Text63Text'), 
>> 'Text%dText%dText')
>> a3 =
>>    [2x1 int32]    [2]
>> a3{1}
>> ans =
>>           1
>>           3
>> 
>> However, I'm still having trouble understanding ML's behavior.
> 
> This might just be a ML bug.
> I think Octave does the right (= expected) thing.

I think this was my mistake. When I remove the cell2mat from my test script, 
Matlab doesn't give the "CAT" error. The error came from the test relying on 
Octave's implementation, which inserts zeros for what it interprets as empties 
between literals ("Text"). However, it looks to me as if ML is behaving as documented.

> =========================
> Ben Abbott wrote:
>> 
>> I've made some modifications to your original notes, and added a few more 
>> below.
>> 
>> a. "Words" or fields (to be interpreted later) are separated by white-space 
>> or delimiters.
>> b. The white-space char set can be adapted by the user with the "whitespace" 
>> keyword. It can even be set to empty.
>> c. White-space is understood to possibly be a vector of white-space chars 
>> that during reading is folded into one char that separates two fields.
>> d. Delimiters are also characters that separate words / fields.  Multiple 
>> delimiters are not folded into a single instance.
>> e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
>> that separates fields.
>> f. A pair of delimiters separated by white-space (or nothing) imply an empty 
>> value.
>> g. By default "emptyvalue" is NaN for numeric data types. If the numeric 
>> type doesn't support NaN, then zero is used (int32 for example). For 
>> character fields, an empty value is just an empty string.
>> h. If so desired, multiple consecutive delimiters can be folded into one 
>> delimiter if "MultipleDelimsAsOne" parameter is set to 1.
>> i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
>> affected by the MultipleDelimsAsOne parameter.
> 
> As to strread, there's another ML subrule:
> <QUOTE>
> If your data uses a character other than a space as a delimiter, you must use 
> the strread parameter 'delimiter' to specify the delimiter
> </QUOTE>
> What is it, space or whitespace?

Are you referring to the different EOLs? I'm not entirely sure what you are 
asking, but I'll make a guess.

Textread operates on one line at a time. If an attempt is made to read past the 
end of a line with a single format statement, empties will be inserted for 
those fields read past the EOL.

c = textscan (sprintf ('1\n2\n\n4\n57\n\n'), '%*d%d', 'delimiter', ',');
c{:}

ans =

           0
           0
           0
           0

Unfortunately, I missed catching the problems with "i" before. I think it 
should read ...

i. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does that make sense?

> Anyway, if your & my collection of inferred rules applies, I do not 
> understand this (7th test of Octave strread.m):
> 
> octave:23> a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
> a =
> {
>  [1,1] = a b c
>  [2,1] = d e
>  [3,1] =
>  [4,1] = f
> }
> (Same goes for ML)

I hadn't considered this before.  I'll have to study the docs again to see if 
there is a reference to this. I did try dropping the "delimiter" to see what 
happens.

a = textscan ('a b c, d e, , f', '%s');

a{:}

ans = 

    'a'
    'b'
    'c,'
    'd'
    'e,'
    ','
    'f'

> because in this example there are spaces ("whitespace") separating e.g., 'a' 
> and 'b'.
> 
> But (ML):
> >> a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
> a =
>     1
>     2
>     3
>     4
>     5
>     0
>     6
> 
> In the above cases, I get the same results for textscan.
> 
> So it seems that interpretation & processing of default whitespace depends on 
> the field format specifier as well?

It appears that ML doesn't treat white-space as a delimiter for string fields 
when the "delimiter" property has been specified. I've updated the list 
accordingly: "g" is new, and "j" is the revised EOL rule.

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. White-space is understood to possibly be a vector of white-space chars that 
during reading is folded into one char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) implies an empty 
value.
g. If the delimiter property is specified, then white-space is *not* used to 
delimit character fields. However, white-space is always used to delimit 
numeric fields.
h. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN, then zero is used (int32 for example). For character 
fields, an empty value is just an empty string.
i. If so desired, multiple consecutive delimiters can be folded into one 
delimiter if "MultipleDelimsAsOne" parameter is set to 1.
j. EOL char sequences (\n, \r\n, or \r) delimit lines of input. They do not 
delimit fields / words and are unaffected by the MultipleDelimsAsOne parameter. 
Any fields read beyond an EOL are treated as being empty.

Does this look correct to you?
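
To make rules c through h concrete, here is a toy model of how a single line of numeric fields would be split under them. It is only a sketch of the rules as listed, not of how textscan or strread is actually implemented. The name split_numeric_line and its arguments are made up for illustration, and it ignores format specifiers, literals, trailing delimiters, and the EOL handling in rule j.

```octave
% split_numeric_line.m: toy model of rules c-h in the list; hypothetical helper.
function vals = split_numeric_line (line, delim, emptyvalue)
  % Rule d: split on every delimiter; consecutive delimiters are not folded.
  chunks = regexp (line, regexptranslate ('escape', delim), 'split');
  vals = [];
  for k = 1:numel (chunks)
    % Rule e: white-space next to a delimiter is absorbed by it.
    chunk = strtrim (chunks{k});
    if isempty (chunk)
      % Rules f and h: nothing between two delimiters yields the empty value.
      vals(end+1) = emptyvalue;
    else
      % Rules c and g: runs of white-space still separate numeric fields.
      words = regexp (chunk, '\s+', 'split');
      nums = str2double (words);
      vals = [vals, nums];
    end
  end
end
```

Calling split_numeric_line ('1 2 3, 4 5, , 6', ',', 0) returns the row 1 2 3 4 5 0 6, the same values as in the strread example quoted above.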

Ben


