Re: [Bug-apl] Performance problems when constructing large(ish) arrays

You've all made good points. I changed the code slightly to provide the initial array size, which avoids recreating the array on each iteration. This brought the loading time down to a much more bearable 14 seconds. I rewrote the Lisp code to be compatible with the APL code, and it ran in 1.46 seconds. This suggests that GNU APL is consistently about 10 times slower than non-optimised Lisp code. To me, this is not unexpected, given that GNU APL isn't designed to be high-performance.
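The change described above, allocating the result once at its final size instead of rebuilding it for every row, can be illustrated outside APL. The following is a minimal Python sketch, not the code that was measured. The function names `load_growing` and `load_preallocated` and the sample rows are made up for illustration.

```python
def load_growing(rows):
    """Build the result by creating a new, longer list for every row."""
    result = []
    for row in rows:
        # Creates a fresh list each time, copying everything accumulated so far.
        result = result + [row]
    return result


def load_preallocated(rows, n):
    """Allocate n slots up front and fill them in place."""
    result = [None] * n
    for i, row in enumerate(rows):
        result[i] = row
    return result


# Hypothetical sample data: five rows of two numbers each.
sample = [[i, i * 2] for i in range(5)]
assert load_growing(sample) == load_preallocated(sample, len(sample))
```

Both functions return the same rows. The first copies the accumulated result on every iteration, so the total work grows roughly with the square of the row count, while the second touches each slot once.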

However, while 14 seconds for 30k rows is manageable, I have needed to work with arrays of over a million rows. Extrapolating linearly suggests that it would take almost 8 minutes to load such a file. Thus, unless GNU APL can magically improve overall performance by at least 10 times, I still think we need a native CSV loading function.
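The estimate above can be checked with a few lines of arithmetic. This is a sketch that assumes the load time scales linearly with the number of rows and uses the rounded 30k figure. The helper name `extrapolate_seconds` is made up for illustration.

```python
def extrapolate_seconds(measured_seconds, measured_rows, target_rows):
    """Scale a measured load time linearly to a different row count."""
    return measured_seconds * target_rows / measured_rows


# 14 seconds for roughly 30,000 rows, scaled to one million rows.
estimate = extrapolate_seconds(14.0, 30_000, 1_000_000)
print(f"{estimate:.0f} s, about {estimate / 60:.1f} minutes")
# prints "467 s, about 7.8 minutes"
```

With the exact row count used in the Lisp test (34030) instead of the rounded 30k, the same calculation gives a somewhat lower figure of just under 7 minutes.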

Regards,

Elias

For reference, here is the APL code:

∇Z ← type convert_entry value
→('n'≡type)/numeric
→('s'≡type)/string
⎕ES 'Illegal conversion type'
numeric:
Z←⍎value
→end
string:
Z←value
end:
∇

∇Z ← pattern read_csv_n[n] filename ;fd;line;separator;i
separator ← ' '
Z ← n (↑⍴pattern) ⍴ 0
fd ← 'r' FIO∆fopen filename
i ← ⎕IO

next:
line ← FIO∆fgets fd           ⍝ Read one line from the file
→(⍬≡line)/end
→(10≠line[⍴line])/skip_nl     ⍝ Skip the next step unless the line ends in a newline
line ← line[⍳¯1+⍴line]        ⍝ Remove the newline
skip_nl:
line ← ⎕UCS line
Z[i;] ← pattern convert_entry¨ (line≠separator) ⊂ line
i ← i+1
→next
end:

FIO∆fclose fd
∇
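For readers who do not read APL, the logic of the two functions above can be sketched in Python. This is an approximation, not a translation that was benchmarked: `float` stands in for `⍎` (so APL-specific number syntax such as `¯1` is not handled), `zip` silently ignores extra fields where the APL assignment would signal an error, and the name `read_rows` is made up for illustration.

```python
import io


def convert_entry(kind, value):
    """Convert one field according to its pattern character."""
    if kind == "n":
        return float(value)   # stand-in for APL's execute on a numeric string
    if kind == "s":
        return value
    raise ValueError("Illegal conversion type")


def read_rows(pattern, n, stream, separator=" "):
    """Read up to n lines into a preallocated n-by-len(pattern) table."""
    result = [[0] * len(pattern) for _ in range(n)]
    for i, line in enumerate(stream):
        if i >= n:
            break
        # Split on the separator and drop empty fields, as the partition does.
        fields = [f for f in line.rstrip("\n").split(separator) if f]
        result[i] = [convert_entry(k, f) for k, f in zip(pattern, fields)]
    return result


# Hypothetical input: two numeric columns followed by one string column.
rows = read_rows("nns", 2, io.StringIO("1 2.5 abc\n3  4 def\n"))
assert rows == [[1.0, 2.5, "abc"], [3.0, 4.0, "def"]]
```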

And here is the Lisp code (the test case was run on SBCL). It requires the Quicklisp packages SPLIT-SEQUENCE and PARSE-NUMBER:

(defparameter *result*
           (time
            (with-open-file (s "apjs492452t1_mrt.txt")
              (let ((res (make-array '(34030 11))))
                (dotimes (i (array-dimension res 0))
                  (let* ((line (read-line s))
                         (parts (split-sequence:split-sequence #\Space line :remove-empty-subseqs t)))
                    (loop
                      for ii from 0 below 10
                      for p in parts
                      do (setf (aref res i ii) (parse-number:parse-number p)))
                    (setf (aref res i 10) (nth 10 parts))))
                res))))

On 18 January 2017 at 09:57, Blake McBride <address@hidden> wrote:

On Tue, Jan 17, 2017 at 7:39 PM, Xiao-Yong Jin <address@hidden> wrote:
I always feel GNU APL is kind of slow compared to Dyalog, but I never really compared the two on a large dataset.
I'm mostly using J now for large datasets.
If Elias has the optimized code for GNU APL and a reproducible way to measure timing, I'd like to compare it with Dyalog and J.

I think that's actually a good idea. It would be a good comparison. It would really make it clear if there is a glaring problem. But first the APL code should be optimized a bit (but nothing crazy like reading it all into memory right now).

--blake

From:	Elias Mårtenson
Subject:	Re: [Bug-apl] Performance problems when constructing large(ish) arrays
Date:	Wed, 18 Jan 2017 18:17:16 +0800