|
From: | Juergen Sauermann |
Subject: | Re: [Bug-apl] Performance problems when constructing large(ish) arrays |
Date: | Tue, 17 Jan 2017 21:10:57 +0100 |
User-agent: | Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 |
Hi Elias, Nick, I would believe that most of the time in Elias' function is spent in Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line →nextmore precisly in 'Z⍪' . This make the function run in quadratic time because Z gets bigger and bigger with every new line. The formatting part, i.e. 'pattern convert_entry¨ (line≠separator) ⊂ line ' is usually fast enough to not care for it. The usual solution is to: Z←N⍴0 ⍝ N is the number of lines Z[j]←xxx for every line of the file This avoids unnecessary copying of Z as it is done in Z⍪ The only problem remaining is to figure the number of lines N. I suppose when the lines come from a database query then the database should have a means of determining the number of result lines beforehand, so you could continue with fgets() from ⎕FIO for every line of the result. Another possibility is to read the entire file in one go and then N←+/(⎕UCS 10)=characters_in_file I could also create a new ⎕FIO number that creates a nested APL vector Z with ⍴Z = number of lines and Z[j] being the Zth line with CR and LF removed. The rest (processing the lines) will then be lightweight even in APL. What I do not like about ⎕CSV (actually I am only guessing here because I dont know what it reallly does, but I assume it is specifically for comma separated lists) is that it is supposedly only works for comma separated lists. If we have something more general which solves the performance problem of Z⍪ without only working for specific formats like CSV then I would prefer that. /// Jürgen On 01/17/2017 07:33 PM, Nick
Lobachevsky wrote:
Elias, In just about any language, iterative string catenation is a real performance killer owing to the repeated growing memory allocation which just gets bigger with every iteration. I would restructure the algorithm to work more like this: (sorry, on a Windoze box and no APL, so metacode only) Open the file, read the contents of the entire file, close the file Change NL to separator, then partition the whole thing (Z) Check to see that the length of the partitioned string divides evenly by pattern length Create a temporary of only the numeric items. As they should all be scalar, you should be able to do it in a single execute and get a numeric vector of the correct length. Assign the numeric vector back into Z. You may want to do a ravel each if you would rather have one element vectors Reshape Z to have the same number columns as in pattern Dyalog and some other APLs (APLX?) have a []CSV function. The problem gets a bit harder when you have optional quotes around strings, embedded separators, and have to figure out yourself whether something is numeric or char. On 1/17/17, Elias Mårtenson <address@hidden> wrote:I wanted to use GNU APL to work on a dataset of star data. The file consists of 34030 lines of the following form: 892376 3813 4.47 0.4699 1.532 0.007 7306.69 0.823 0.4503 0 --- 1026146 4261 4.57 0.6472 14.891 0.12 11742.56 1.405 0.7229 0 --- 1026474 4122 4.56 0.5914 1.569 0.006 30471.8 1.204 0.6061 0 --- 1162635 3760 4.77 0.4497 15.678 0.019 10207.47 0.978 0.5445 1 --- I wrote a generic CSV loader to handle this (source code at the end of this email), and loaded the data like so: * z ← 'nnnnnnnnnns' read_csv 'apjs492452t1_mrt.txt'* This took many minutes to load, which in my opinion shouldn't happen. Now, I have a few questions: 1. Is there a way to speed up this code? 2. Is there something that could be done on the GNU APL implementation side to make this faster? 3. Shouldn't we have a generic ⎕CSV function or something like that which would be able to load CSV files in milliseconds regardless of size? This should be trivial to do in C++. Here's the code in question: ∇Z ← type convert_entry value →('n'≡type)/numeric →('s'≡type)/string ⎕ES 'Illegal conversion type' numeric: Z←⍎value →end string: Z←value end: ∇ ∇Z ← pattern read_csv filename ;fd;line;separator separator ← ' ' Z ← 0 (↑⍴pattern) ⍴ ⍬ fd ← 'r' FIO∆fopen filename next: line ← FIO∆fgets fd ⍝ Read one line from the file →(⍬≡line)/end →(10≠line[⍴line])/skip_nl ⍝ If the line ends in a newline line ← line[⍳¯1+⍴line] ⍝ Remove the newline skip_nl: line ← ⎕UCS line Z ← Z⍪ pattern convert_entry¨ (line≠separator) ⊂ line →next end: FIO∆fclose fd ∇ |
[Prev in Thread] | Current Thread | [Next in Thread] |