
importdata different approach


From: Daniel J Sebald
Subject: importdata different approach
Date: Tue, 30 Jul 2013 23:40:36 -0500

Erik,

I used the importdata function last night, and although it works fine (thank you), it seems to be quite slow for CSV files, even relatively small ones. I profiled the routine a bit; here are the CPU times for various parts of the routine (the size of the data is 7383 x 5):

ans =  0.0099990
ans =  0.089986
ans = 0
ans = 0
ans = 0
ans =  0.097985
ans =  0.49592
ans =  3.6494
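
A sketch of how such per-stage timings can be collected (this is illustrative, not necessarily the exact instrumentation used):

  t = cputime ();
  ## ... one stage of the routine ...
  cputime () - t    # unassigned expression, so it prints as "ans = ..."
  t = cputime ();
  ## ... next stage ...
  cputime () - t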

The main thing to note is that the first stages, which involve the regexp routine, are rather efficient, while the last stages, which involve double looping, are quite the opposite. The main issue is that tests and conversions like the following consume the time:

    if (any (file_content_rows{i} != " "))

and

          data_numeric = str2double (row_data{j});

and

  for i=(header_rows+1):length(file_content_rows)
    data_columns = max (data_columns,
                        length (regexp (file_content_rows{i},
                                        delimiter_pattern, "split")));
  endfor

and

      row_data = regexp (file_content_rows{i}, delimiter_pattern, "split");

Note in particular that the last two operations duplicate the work of splitting the data at the delimiter.

The reason that, say, "str2double (row_data{j})" is slow is that the argument to str2double is a single element. Even though the core of str2double() is pretty fast, there is precursor type checking on the arguments: whether they are strings, whether they are cells, and so on. Called this way, str2double spends an inordinate number of CPU cycles not doing the actual conversion but checking data types. It is better to call str2double() on large cells or string matrices.
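
As a rough illustration (not the actual importdata code), compare converting one element at a time with handing the whole cell to str2double at once:

  ## rows is a cell array of strings, e.g. the result of
  ## regexp (line, delimiter_pattern, "split")
  rows = {"3.1", "-7.2", "0", "0.012", "6.5", "128"};

  ## slow: the argument type checking is repeated for every element
  x = zeros (1, numel (rows));
  for j = 1:numel (rows)
    x(j) = str2double (rows{j});
  endfor

  ## fast: one call, one round of type checking, same numeric result
  x = str2double (rows);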

I tried reworking things by removing the white space from the character stream before breaking the data into rows, and then applying str2double() to all the cells at once. That cut the CPU consumption to about a quarter of the current version. But I then wondered whether there wasn't something else we could use, because a big portion of the time was spent creating the cells via regexp(...,"split"). In fact, there is already dlmread(), which I think has enough flexibility in its arguments to handle importdata's CSV/ASCII case. It is so efficient that I think a better approach is to

1) Just fscanf the first header lines of the file (as opposed to reading in the whole data file).

2) Use dlmread() to do all the work; it places NaN wherever a conversion fails.

3) Look at the data matrix for any NaN, and only then go back to the data file to work out which lines the NaNs came from. I think I've done this efficiently, so that not every entry of the file need be extracted, just the lines where a NaN occurred (see the sketch after this list).
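
Here is a minimal sketch of those three steps (the variable names and the NaN handling are simplified relative to the actual patch):

  ## 1) read only the header lines, not the whole file
  fid = fopen (fname, "r");
  header = cell (header_rows, 1);
  for i = 1:header_rows
    header{i} = fgetl (fid);
  endfor
  fclose (fid);

  ## 2) dlmread does all the conversion work; fields that fail to
  ##    convert come back as NaN
  data = dlmread (fname, delimiter, header_rows, 0);

  ## 3) only if NaNs are present, go back to the file and split just
  ##    the offending lines to recover the text entries
  [r, c] = find (isnan (data));
  if (! isempty (r))
    ## ... re-read and regexp-split only the lines in unique (r) ...
  endif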

The last step slows things down, but it is still pretty efficient. Here is the CPU consumption for stages of the revamped importdata:

octave:460> aa = importdata_new ('foo.csv');
ans =    1.0000e-03
ans =  0.029996
ans = 0

Whoo-hoo! A factor-of-125 speed-up. Here are the results when I place a couple of text strings amongst the data columns:

octave:461> aa = importdata_new ('foo_b.csv');
ans = 0
ans =  0.033995
ans =  0.18297

You can see that having to pull the data back in and apply regexp adds some time, but compared to the current importdata.m it is still rather minuscule.

Perhaps Rik can take a look at the last part, figuring out the output.textdata cell contents when the data has text strings in it. I don't think it can be made much more efficient than what I did, but if it is possible, Rik will find it.

The patch is here:

https://savannah.gnu.org/patch/index.php?8140


There are three tests that fail after applying the patch. We can discuss those; basically, I don't agree with some of the expected results:


%!test
%! # Header
%! A.data = [3.1 -7.2 0; 0.012 6.5 128];
%! A.textdata = {"This is a header row."; \
%! "this row does not contain any data, but the next one does."};

I think that allowing header text with spaces rules out both using the space character as a delimiter and automatically recognizing column names. For example, if the first lines of my data file were

TIME VOLTAGE DISPLACEMENT
0 3.3 0.137
0.25 3.4 0.148
0.5 3.6 0.150

how can we tell whether the first line holds data column titles or is just some textdata?


%!test
%! # Missing values
%! A = [3.1 NaN 0; 0.012 6.5 128];

The above test produces the correct output.data, but whereas the test expects only the data, the new routine also creates output.textdata for the NaN entry, which happens to be an empty string. Isn't that the proper result?


%!test
%! # CR for line breaks
%! A = [3.1 -7.2 0; 0.012 6.5 128];
%! fn  = tmpnam ();
%! fid = fopen (fn, "w");
%! fputs (fid, "3.1\t-7.2\t0\r0.012\t6.5\t128");

The new version of importdata fails the above test. It would be easy to correct as a first step by searching for and replacing any \r with \n. However, I wonder whether the proper fix would be a simple addition to dlmread(), so let's hold off on this test until we are certain where it should be fixed.
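
For what it's worth, that first-step workaround amounts to something like the following on the raw file contents (assuming file_content holds the text of the whole file), before any row splitting:

  ## normalize CRLF and lone CR line endings to LF
  file_content = strrep (file_content, "\r\n", "\n");
  file_content = strrep (file_content, "\r", "\n");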

Dan

