Re: Improving strread / textread / textscan


From: Philip Nienhuis
Subject: Re: Improving strread / textread / textscan
Date: Mon, 24 Oct 2011 20:49:17 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Answers to three emails in one:

Ben Abbott wrote:
On Oct 23, 2011, at 6:42 PM, Ben Abbott wrote:

Ok. Let's start with writing tests for ML. I'll start by extracting Octave's 
tests and confirm they work on ML.

Ben

I've copied the tests from textscan and modified them to run on ML. To do that 
I wrote a simple oct_assert function to handle the asserts. Of the 14 asserts 
in total, two failed.

Test #1: Passed.
Test #2: Passed.
Test #3: Passed.
Test #4: Passed.
Test #5: Passed.
Test #6: Passed.
Test #7: Passed.
Test #8: Passed.
Test #9: Failed.
OBSERVED:
           16         241           3

EXPECTED:
           16         241           3           0

Test #10: Passed.
Test #11: Passed.

Hmmm... on ML2007a, I get:
Test #11: Failed.
OBSERVED:
   49   10   76   50

EXPECTED:
   76   49   10   76   50

So ML is inconsistent...

( Note I fixed some typos in your script :-)  )

Test #12: Failed.
OBSERVED:
            2

EXPECTED:
            2
            4
            0

Test #13: Passed.
Test #14: Passed.

The script with the  tests and the oct_assert function are attached.

Apparently ML doesn't recognize empty fields squeezed between two literals.

====================
On Oct 23, 2011, at 8:37 PM, Ben Abbott wrote:

> a3 = cell2mat (textscan (sprintf ('Text1Text2Text\nText3TextText\nText57Text63Text'), 'Text%dText%dText'))
>
> Matlab returns ...
>
> a1 =
>           2
>           4
> a2 =
>           2
> Error using cat
> CAT arguments dimensions are not consistent.
I got this wrong. Removing the "cell2mat" ...

a3 = textscan (sprintf ('Text1Text2Text\nText3TextText\nText57Text63Text'), 'Text%dText%dText')
a3 =
    [2x1 int32]    [2]
a3{1}
ans =
           1
           3

However, I'm still having trouble understanding ML's behavior.

This might just be a ML bug.
I think Octave does the right (= expected) thing.

=========================
Ben Abbott wrote:

I've made some modifications to your original notes, and added a few more below.

a. "Words" or fields (to be interpreted later) are separated by white-space or 
delimiters.
b. The white-space char set can be adapted by the user with the "whitespace" 
keyword. It can even be set to empty.
c. A run of consecutive white-space chars is folded during reading into a 
single char that separates two fields.
d. Delimiters are also characters that separate words / fields.  Multiple 
delimiters are not folded into a single instance.
e. Vectors of white-space and one delimiter are folded into one _delimiter_ 
that separates fields.
f. A pair of delimiters separated by white-space (or nothing) imply an empty 
value.
g. By default "emptyvalue" is NaN for numeric data types. If the numeric type 
doesn't support NaN (int32, for example), zero is used. For character fields, an empty 
value is just an empty string.
h. If so desired, multiple consecutive delimiters can be folded into one delimiter if 
the "MultipleDelimsAsOne" parameter is set to 1.
i. EOL char sequences (\n, \r\n, or \r) are also delimiters, but are not 
affected by the MultipleDelimsAsOne parameter.
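
To make rules a. through h. concrete, here is a minimal sketch in Python (not Octave code, and not how strread/textscan are actually implemented). The names split_fields and fill_empty and their parameters are made up for illustration; the sketch handles a single line only, so rule i. (EOL handling) is left out.

```python
import re


def split_fields(line, delimiters=",", whitespace=" \t",
                 multiple_delims_as_one=False):
    """Split one line into raw field strings per inferred rules a. to h."""

    def char_class(chars):
        # Regex character class matching any one of the given chars.
        return "[" + "".join(re.escape(c) for c in chars) + "]"

    ws = char_class(whitespace) if whitespace else None
    alternatives = []
    if delimiters:
        pad = ws + "*" if ws else ""
        # Rule e: white-space around one delimiter folds into that delimiter.
        one = char_class(delimiters) + pad
        # Rule h: optionally fold consecutive delimiters into one.
        body = "(?:" + one + ")+" if multiple_delims_as_one else one
        alternatives.append(pad + body)
    if ws:
        # Rule c: a run of white-space alone is a single separator.
        alternatives.append(ws + "+")
    if not alternatives:
        return [line]
    # Rules d and f: unfolded delimiters leave empty strings between them.
    return re.split("|".join(alternatives), line.strip(whitespace))


def fill_empty(field, kind):
    """Rule g: default value for an empty field, by (hypothetical) type name."""
    if kind == "str":
        return field
    if kind == "int32":
        return int(field) if field else 0
    return float(field) if field else float("nan")


fields = split_fields("1 2 3, 4 5, , 6")
print(fields)
print([fill_empty(f, "int32") for f in fields])
```

Under these assumptions the input '1 2 3, 4 5, , 6' splits into seven fields, the sixth being empty and converted to 0 for an integer type. Setting whitespace to "" leaves only the delimiters as separators (rule b.).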

As to strread, there's another ML subrule:
<QUOTE>
If your data uses a character other than a space as a delimiter, you must use the strread parameter 'delimiter' to specify the delimiter
</QUOTE>
What is it, space or whitespace?


Anyway, if your and my collections of inferred rules apply, I do not understand this (7th test of Octave's strread.m):

octave:23> a = strread ("a b c, d e, , f", "%s", "delimiter", ",")
a =
{
  [1,1] = a b c
  [2,1] = d e
  [3,1] =
  [4,1] = f
}
(The same goes for ML.) Yet if the rules apply, especially a. & e., I'd expect ML to yield:
a =
{
  [1,1] = a
  [2,1] = b
  [3,1] = c
  [4,1] = d
  [5,1] = e
  [6,1] = []
  [7,1] = f
}

because in this example there are spaces ("whitespace") separating e.g., 'a' and 'b'.

But (ML):
>> a = strread ('1 2 3, 4 5, , 6', '%d', 'delimiter', ',')
a =
     1
     2
     3
     4
     5
     0
     6

In the above cases, I get the same results for textscan.

So it seems that the interpretation & processing of default whitespace depend on the field format specifier as well?
Weird.
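
To pin down what "depends on the format specifier" would mean, here is a small Python sketch of that hypothesis (a model only, not Octave or ML code; the function name read_fields is made up). It assumes that with an explicit delimiter, %s fields are merely trimmed of surrounding whitespace, while %d fields are additionally split on whitespace and an empty integer field becomes 0.

```python
def read_fields(text, fmt, delimiter=","):
    """Hypothetical model: white-space handling depends on the format."""
    parts = text.split(delimiter)
    if fmt == "%s":
        # Strings: white-space inside a field is kept, only the ends are trimmed.
        return [part.strip() for part in parts]
    if fmt == "%d":
        values = []
        for part in parts:
            tokens = part.split()  # numbers: white-space also separates fields
            if tokens:
                values.extend(int(tok) for tok in tokens)
            else:
                values.append(0)   # empty integer field -> 0 (no NaN for ints)
        return values
    raise ValueError("only %s and %d are modelled in this sketch")


print(read_fields("a b c, d e, , f", "%s"))
print(read_fields("1 2 3, 4 5, , 6", "%d"))
```

This reproduces both results shown above (four string fields, seven integer fields), which is all it is meant to do; whether ML really works this way internally is a guess.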


Philip

