On Wed, Jan 14, 2009 at 5:24 PM, Nicholas Tung <address@hidden> wrote:
> On Wed, Jan 14, 2009 at 01:36, Jaroslav Hajek <address@hidden> wrote:
>>
>> On Wed, Jan 14, 2009 at 5:00 AM, Nicholas Tung <address@hidden> wrote:
>> > On Tue, Jan 13, 2009 at 16:46, Francesco Potortì <address@hidden> wrote:
>> >>
>> >> > I've noticed that indexing expressions on large arrays can be quite slow,
>> >>
>> >> Have a look at this, which I wrote in November but for some reason does
>> >> not appear in the list archives. In an application of mine, I wrote an
>> >> ad hoc function to access a 5-dim matrix using the last method below,
>> >> getting a significant speedup. I usually run 3.0.1, and I did not make
>> >> any check to compare its speed with the more recent Octave versions.
>> >>
>> >> |Date: 11 Nov 2008 10:44:45 +0100
>> >> |From: Francesco Potortì <address@hidden>
>> >> |To: Jaroslav Hajek <address@hidden>
>> >> |CC: Octave help list <address@hidden>
>> >> |Subject: Re: slow access to consecutive matrix elements
>> >> |
>> >> |>On Sat, Oct 11, 2008 at 10:22 AM, Francesco Potortì <address@hidden> wrote:
>> >> |>> This is a real-life example that demonstrates how access to matrix
>> >> |>> elements is not optimised for the case of complete ranges in the
>> >> |>> first dimensions. I am sending it here and not to the bug list
>> >> |>> because this example may be useful to others, at least until this
>> >> |>> case is optimised in the sources.
>> >> |>>
>> >> |>> # This is a big 5-dim matrix, about 100MB size
>> >> |>> octave> kk=rand(156,222,1,44,8);
>> >> |>>
>> >> |>> # I access a slice of it, which is sequential in memory
>> >> |>> octave> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(:,:,:,ii,jj); endfor, endfor, cputime-t
>> >> |>> ans = 5.9124
>> >> |>>
>> >> |>> # Using a linear index is much faster
>> >> |>> octave> span=(1:156*222);
>> >> |>> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(sub2ind(size(kk),1,1,1,ii,jj)+span-1); endfor, endfor, cputime-t
>> >> |>> ans = 2.6642
>> >> |>>
>> >> |>> # Removing the call to sub2ind reaches top speed
>> >> |>> octave> cp=[1,cumprod(size(kk)(1:end-1))]; span=(1:156*222);
>> >> |>> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(sum(([1,1,1,ii,jj]-1).*cp)+span); endfor, endfor, cputime-t
>> >> |>> ans = 1.4001
>> >> |>>
>> >> |>> Still, I wish access were faster yet. Is there a reason why
>> >> |>> copying consecutive memory is so slow? I wish I could help with
>> >> |>> optimising this, even if I am certainly not the most qualified
>> >> |>> person to do it.
>> >> |>>
>> >> |>
>> >> |>I guess the indexing routines do deserve some attention w.r.t.
>> >> |>performance. Reducing code duplication would also be nice. I have
>> >> |>this on my TODO list, but I don't think it's a good idea to do it
>> >> |>before 3.2.x is forked, as such changes are, IMO, likely to
>> >> |>introduce bugs.
>> >> |
>> >> |Following up on this, I realised that there is room for further
>> >> |significant speedup:
>> >> |
>> >> |# Be careful not to convert ranges into matrices
>> >> |octave> cp=[1,cumprod(size(kk)(1:end-1))]; len=156*222;
>> >> |t=cputime; for ii=1:44, for jj=1:8
>> >> |base=sum(([1,1,1,ii,jj]-1).*cp); mm=kk(base+1:base+len); endfor, endfor, cputime-t
>> >> |ans = 0.26402
>> >> |
>> >> |The fact is, I was assuming that a range remains a range even after
>> >> |a linear transformation, while this does not appear to be the
>> >> |case:
>> >> |
>> >> |octave3.1> typeinfo(1:4)
>> >> |ans = range
>> >> |octave3.1> typeinfo(4+(1:4))
>> >> |ans = matrix
>> >> |octave3.1> typeinfo(4*(1:4))
>> >> |ans = matrix
>> >> |
>> >> |From the depth of my ignorance about Octave's internals, I dare say
>> >> |that it should not be too difficult to keep ranges as ranges even
>> >> |after a sum or product with a scalar. Maybe even after a sum with a
>> >> |range with the same number of elements. Am I wrong?
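The range-preserving trick above can be packaged as a small helper. This is an illustrative sketch only (the function name slice_rows and its interface are made up here, not from the thread); it assumes the leading dimensions are indexed in full, and it reproduces the stride arithmetic from the message so the index stays a range:

```octave
% Sketch: fetch the full leading block of an N-D array at the given
% trailing subscripts, keeping the index a range so that no index
% matrix is ever materialised (helper name is hypothetical).
function mm = slice_rows (kk, trail)
  sz = size (kk);
  n = numel (trail);                 % number of trailing subscripts
  len = prod (sz(1:end-n));          % elements in one leading block
  cp = [1, cumprod(sz(1:end-1))];    % column-major strides
  base = sum (([ones(1, numel(sz)-n), trail] - 1) .* cp);
  mm = kk(base+1 : base+len);        % base+1:base+len stays a range
endfunction
```

For the 5-D example in the thread, `slice_rows(kk, [ii, jj])` would fetch the same elements as `kk(:,:,:,ii,jj)`, reshaped to a row vector.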
>> >
>> > Neat, I didn't know that 1-d vector indexing worked on matrices.
>> > However, I think it may be faster for you because of the
>> > multi-dimensional aspect, whereas I am working with more values
>> > (potentially large image samples). On Octave 3.1.51+ (built today)
>> > the following code runs in about the same time. If you're doing
>> > something more complicated, please tell me. I know the two snippets
>> > below don't compute exactly the same value...
>> >
>> > tic, x1 = 1:299; x2 = 2:300; for i = 1:100, a(:, x1) += a(:, x2); endfor, toc
>> > x1 = 1:89700; x2 = 301:90000; tic, for i = 1:100, a(x1) += a(x2); endfor, toc
>> >
>> > As another note, it doesn't seem to matter much which way the matrix
>> > is indexed in the first snippet, i.e. a(x1, :) vs a(:, x1).
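Octave stores arrays in column-major order, which is why a contiguous column slice can be served by a plain memory copy. A quick sketch to confirm that a block of whole columns coincides with a linear range (the 3x4 example is made up for illustration):

```octave
% Column-major layout: whole columns are contiguous in memory, so
% columns 2:3 of a 3x4 matrix occupy linear positions 4:9.
a = reshape (1:12, 3, 4);
isequal (a(:, 2:3)(:), a(4:9)(:))   % true: same elements, same order
```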
>> >
>> > Jaroslav - I updated to hg head and it was faster (it cut the time in
>> > half for the test example), but on real code not quite as much as I'd
>> > hoped.
>> >
>> > Nicholas
>> >
>>
>> Hi Nicholas,
>>
>> If you can identify bottlenecks and post them as reasonably small,
>> self-contained examples, we can surely discuss them and possible
>> improvements.
>>
>> regards
>>
>> --
>> RNDr. Jaroslav Hajek
>> computing expert
>> Aeronautical Research and Test Institute (VZLU)
>> Prague, Czech Republic
>> url: www.highegg.matfyz.cz
>
> Thanks so much. I uploaded a real example to
> http://gatoatigrado.arvixe.com/files/temp/octave_indexing_perf.tgz
>
> The script is 60 lines long; it performs a two-step Haar wavelet
> decomposition via the lifting method (actually only for the wavelet
> coefficients). Run it with the following:
> octave encode_image_standalone.m attachment.jpg
> I looped it 450 times to simulate the amount of processing that I need.
> On my machine, the base time is 26.6 seconds; if you remove the matrix
> scalar indexing (original file lines 36, 37 and 43, 44), it still takes
> 17 seconds just to run through the interpreter. Replacing both of the
> lines above with
> gamma_ += alpha_arr(shift_idx) * gamma_;
> lambda_ += alpha_arr(shift_idx) * lambda_;
> respectively, it takes about 20 seconds, which is much closer to the
> interpreter overhead (which I might be able to cut down by reducing the
> number of lines; I'm not sure).
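For readers without the tarball, one Haar lifting step on a 1-D signal looks roughly like this. This is a generic sketch, not the actual encode_image_standalone.m code; only the variable names lambda_ and gamma_ are borrowed from the lines quoted above:

```octave
% Generic Haar lifting step (illustrative sketch, not the referenced
% script): split into even/odd samples, then predict and update.
x = rand (1, 300);           % example signal
lambda_ = x(1:2:end);        % even samples (approximation branch)
gamma_  = x(2:2:end);        % odd samples  (detail branch)
gamma_  -= lambda_;          % predict: detail = odd - even
lambda_ += gamma_ / 2;       % update: preserves the signal mean
```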
>