octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: xlsread in Octave 3.6.4


From: Philip Nienhuis
Subject: Re: xlsread in Octave 3.6.4
Date: Mon, 30 Sep 2013 18:42:41 +0200
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Hey Philip, hope everything is going well so far.

I've take a look at io source on svn. Are you finished already?
Referring to __OCT_spsh_open__.m  line 51
"## Read file contents. For some reason fgetl needs to be called twice"
In OOXML you have to do this twice because the file hast just two lines.
The first line includes just this: "<?xml version="1.0" encoding="UTF-8"
standalone="yes"?>"
All relevant datas are in the second line. Because of this I triggered
fgetl two times.

Yeah I found out this is valid for most XML files.

Do you just figured out in __OCT_spsh_open__.m whitch sheet should be
opened (given by name or index)?
So you need after  __OCT_spsh_open__.m
lastintf = "OCT";
xls.filename = filename;
? If yes, I can split-out this part easily from xlsxread.
Furthermore, what's up with this?
"## FIXME find a class that doesn't display as one looooong string"

That's only for ODS.
ODS, once unzipped, has all of its sheets & formatting in one big file called "content.xml". To avoid time-consuming re-reading and re-parsing these xml contents again and again for each sheet, I decided to just put the entire text string + pointers to the sheets in the ods file pointer struct.

In unzipped OOXML however each sheet is in a separate file; so for .xlsx it makes sense to just store the location of the tmp dir containing the unzipped files.

Back to .ods: later I'll try if just storing the file position pointers and using fseeks etc. isn't too much of a performance penalty. That'll save memory, although reading really big spreadsheet files probably isn't daily routine (even at my work where we spend lots of time digging around in large databases which are still mostly in Excel and .mat files).

<snip>
I've found a second mark for xlsx in __OCT_getusedrange__.m: line 43 to
50.

Yeah that's meant "for you".

I guess just extracting the used range from xlsxread is not possible as
you do for ods.

You mean: the really occupied cell range, not the formatted range.
Yeah, maybe for later. Although the latter would do for the moment. As we saw earlier it is kept somewhere in the top section of one of the unzipped files. Note that COM/ActiveX (i.e., Excel behind-the-scenes), and therefore Matlab as well, has the very same issue. So it isn't all that bad.

__readnum is reading the value and the position of the value. The used
range results from that. So reading just the position in
__OCT_getusedrange__.m is possible, but this information must not be
lost.
But maybe you do this in __ods_get_sheet_dims__.m?

__ods_get_sheet_dims__ is only meant for ODS.
It was written a few years ago for older versions of jOpenDocument when those still had no methods for determining the used range. As jOpenDOcument reads ods files as a long text string into the Java heap memory, it was easy to copy it to the Octave world and apply Octave's string functions. Already then I saw it as a first step to reading .ods w/o Java. The unzip part has long been in the way until you sorted that out.

A shedule would be nice. Which file/function comes first, which
information goes in this file/function and which information have to go
out to the next...

See below.

Furthermore I've take a (short) look on xmlread.
No, it's not just "bad documented", it is simple not documented.

Let's say "not" is a subset of "bad" (or the other way round).

There *is* a help string (just try "help xmlread") so I don't agree it isn't documented at all :-)

Last night it occurred to me that the various Octave/Octave-Forge mailing lists from around the time Laurent committed these files might contain a clue.

Trying to use it with OOXML just give an error. "bad declaration" (It
stumbles over the first line (<?xml version="1.0" encoding="UTF-8"
standalone="yes"?>).
By the way, Matlab isn't able to read it too with xmlread :)

It may be (but I know really little about XML) that a .dtd file (Document Type Description) is needed. The first line should be skipped in all cases. That is easily fixed in the source of xmlread.


As to what I think is needed for xlsx in the io package (1.2.4):
-----------------------------
xlsread.m:
Nothing (apart from the help text header, I'll do that)

xlsopen.m, xls2oct.m, xlsclose.m:
Extra stanzas for OCT interface, plus header text. A'la similar parts for odsopen/ods2oct/odsclose
I could do these too, it's not much work.

private/ functions:
-------------------
__OCT_spsh_open__.m
Unzipping the xlsx file, storing the unzipped dir pointer in the file pointer struct, perhaps also the sheet file names. Most is already in place (at the top), there's a section at the bottom for xlsx.

__OCT_xlsx2oct__.m
New file. Contains most of your xlsxread.

__OCT_getusedrange__.m
Needs a little stanza where the stored used range is to be assessed. Easy, use the <dimension ref="<range>" /> from the top of the worksheet files; you could use getxmlnode.m/getxmlattv.m to get the value, parse_sp_range to translate it to row/column numbers.

__OCT_spsh_close__.m
A stanza for xlsx. Just wiping the tmpdir and returning an empty struct is all that is needed.

__OCT_spsh_info__.m
To be assessed later. AFAICS it could be used w/o change if you set up the file pointer struct (output of __OCT_spsh_open__.m) similar to what is done for .ods


FYI I have plans to include reading gnumeric files and add them to the OCT interface as well. gnumeric also has XML files.


On a related note:
Markus wrote:
> Am 2013-09-18 22:34, schrieb Philip Nienhuis:
<snip>
>> * do you want to add more options, like
>> > specifying a range to read
>
> Range-read would be nice but it has the lowest priority for me.

It doesn't look too hard to get a long way here. After all you can easily strip a subsection of all rows in a worksheet and apply 'xlsxread' to it. One might read to many columns but that's a matter of stripping those away.


Philip



reply via email to

[Prev in Thread] Current Thread [Next in Thread]