On Wed, Jan 14, 2009 at 5:24 PM, Nicholas Tung <address@hidden> wrote:
> On Wed, Jan 14, 2009 at 01:36, Jaroslav Hajek <address@hidden> wrote:
>>
>> On Wed, Jan 14, 2009 at 5:00 AM, Nicholas Tung <address@hidden> wrote:
>> > On Tue, Jan 13, 2009 at 16:46, Francesco Potortì <address@hidden> wrote:
>> >>
>> >> > I've noticed that indexing expressions on large arrays can be quite slow,
>> >>
>> >> Have a look at this, which I wrote in November but for some reason does
>> >> not appear in the list archives. In an application of mine, I wrote an
>> >> ad hoc function to access a 5-dim matrix using the last method below,
>> >> getting a significant speedup. I usually run 3.0.1, and I did not make
>> >> any check to compare its speed with the more recent Octave versions.
>> >>
>> >> |Date: 11 Nov 2008 10:44:45 +0100
>> >> |From: Francesco Potortì <address@hidden>
>> >> |To: Jaroslav Hajek <address@hidden>
>> >> |CC: Octave help list <address@hidden>
>> >> |Subject: Re: slow access to consecutive matrix elements
>> >> |
>> >> |>On Sat, Oct 11, 2008 at 10:22 AM, Francesco Potortì <address@hidden> wrote:
>> >> |>> This is a real-life example that demonstrates how access to matrix
>> >> |>> elements is not optimised for the case of complete ranges in the
>> >> |>> first dimensions. I am sending it here and not to the bug list
>> >> |>> because this example may be useful to others, at least until this
>> >> |>> case is optimised in the sources.
>> >> |>>
>> >> |>> # This is a big 5-dim matrix, about 100MB size
>> >> |>> octave> kk=rand(156,222,1,44,8);
>> >> |>>
>> >> |>> # I access a slice of it, which is sequential in memory
>> >> |>> octave> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(:,:,:,ii,jj); endfor, endfor, cputime-t
>> >> |>> ans = 5.9124
>> >> |>>
>> >> |>> # Using a linear index is much faster
>> >> |>> octave> span=(1:156*222);
>> >> |>> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(sub2ind(size(kk),1,1,1,ii,jj)+span-1); endfor, endfor, cputime-t
>> >> |>> ans = 2.6642
>> >> |>>
>> >> |>> # Removing the call to sub2ind reaches top speed
>> >> |>> octave> cp=[1,cumprod(size(kk)(1:end-1))]; span=(1:156*222);
>> >> |>> t=cputime; for ii=1:44, for jj=1:8
>> >> |>> mm=kk(sum(([1,1,1,ii,jj]-1).*cp)+span); endfor, endfor, cputime-t
>> >> |>> ans = 1.4001
>> >> |>>
>> >> |>> Still, I wish access were faster yet. Is there a reason why
>> >> |>> copying consecutive memory is so slow? I wish I could help with
>> >> |>> optimising this, even if I am certainly not the most qualified
>> >> |>> person to do it.
>> >> |>>
>> >> |>
>> >> |>I guess the indexing routines do deserve some attention w.r.t.
>> >> |>performance. Reducing code duplication would also be nice. I have
>> >> |>this on my TODO list, but I don't think it's a good idea to do it
>> >> |>before 3.2.x is forked, as such changes are, IMO, likely to
>> >> |>introduce bugs.
>> >> |
>> >> |Following up on this, I realised that there is room for further
>> >> |significant speedup:
>> >> |
>> >> |# Be careful not to convert ranges into matrices
>> >> |octave> cp=[1,cumprod(size(kk)(1:end-1))]; len=156*222;
>> >> |t=cputime; for ii=1:44, for jj=1:8
>> >> |base=sum(([1,1,1,ii,jj]-1).*cp); mm=kk(base+1:base+len); endfor, endfor, cputime-t
>> >> |ans = 0.26402
>> >> |
>> >> |The fact is, I was assuming that a range remains a range even after
>> >> |a linear transformation, while this does not appear to be the
>> >> |case:
>> >> |
>> >> |octave3.1> typeinfo(1:4)
>> >> |ans = range
>> >> |octave3.1> typeinfo(4+(1:4))
>> >> |ans = matrix
>> >> |octave3.1> typeinfo(4*(1:4))
>> >> |ans = matrix
>> >> |
>> >> |From the depth of my ignorance about Octave's internals, I dare say
>> >> |that it should not be too difficult to keep ranges as ranges even
>> >> |after a sum or product with a scalar. Maybe even after a sum with a
>> >> |range with the same number of elements. Am I wrong?
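The range-preserving trick above can be packaged as a small helper. This is an illustrative sketch only (the function name slice_rows and its interface are made up here, not from the thread); it assumes the leading dimensions are indexed in full, and it reproduces the stride arithmetic from the message so the index stays a range:

```octave
% Sketch: fetch the full leading block of an N-D array at the given
% trailing subscripts, keeping the index a range so that no index
% matrix is ever materialised (helper name is hypothetical).
function mm = slice_rows (kk, trail)
  sz = size (kk);
  n = numel (trail);                 % number of trailing subscripts
  len = prod (sz(1:end-n));          % elements in one leading block
  cp = [1, cumprod(sz(1:end-1))];    % column-major strides
  base = sum (([ones(1, numel(sz)-n), trail] - 1) .* cp);
  mm = kk(base+1 : base+len);        % base+1:base+len stays a range
endfunction
```

For the 5-D example in the thread, `slice_rows(kk, [ii, jj])` would fetch the same elements as `kk(:,:,:,ii,jj)`, reshaped to a row vector.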
>> >
>> > Neat, I didn't know that 1-d vector indexing worked on matrices.
>> > However, I think it may be faster for you because of the
>> > multi-dimensional aspect, whereas I am working with more values
>> > (potentially large image samples). On Octave 3.1.51+ (built today)
>> > the following code runs in about the same time. If you're doing
>> > something more complicated, please tell me. I know the two snippets
>> > below don't compute exactly the same value...
>> >
>> > tic, x1 = 1:299; x2 = 2:300; for i = 1:100, a(:, x1) += a(:, x2); endfor, toc
>> > x1 = 1:89700; x2 = 301:90000; tic, for i = 1:100, a(x1) += a(x2); endfor, toc
>> >
>> > As another note, it doesn't seem to matter much which way the matrix
>> > is indexed in the first snippet, i.e. a(x1, :) vs a(:, x1).
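Octave stores arrays in column-major order, which is why a contiguous column slice can be served by a plain memory copy. A quick sketch to confirm that a block of whole columns coincides with a linear range (the 3x4 example is made up for illustration):

```octave
% Column-major layout: whole columns are contiguous in memory, so
% columns 2:3 of a 3x4 matrix occupy linear positions 4:9.
a = reshape (1:12, 3, 4);
isequal (a(:, 2:3)(:), a(4:9)(:))   % true: same elements, same order
```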
>> >
>> > Jaroslav - I updated to hg head and it was faster (it cut the time in
>> > half for the test example), but on real code not quite as much as I'd
>> > hoped.
>> >
>> > Nicholas
>> >
>>
>> Hi Nicholas,
>>
>> If you can identify bottlenecks and post them as reasonably small,
>> self-contained examples, we can surely discuss them and possible
>> improvements.
>>
>> regards
>>
>> --
>> RNDr. Jaroslav Hajek
>> computing expert
>> Aeronautical Research and Test Institute (VZLU)
>> Prague, Czech Republic
>> url: www.highegg.matfyz.cz
>
> Thanks so much. I uploaded a real example to
> http://gatoatigrado.arvixe.com/files/temp/octave_indexing_perf.tgz
>
> The script is 60 lines long; it performs a two-step Haar wavelet
> decomposition via the lifting method (actually only for the wavelet
> coefficients). Run it with the following:
> octave encode_image_standalone.m attachment.jpg
> I looped it 450 times to simulate the amount of processing that I need.
> On my machine, the base time is 26.6 seconds; if you remove the matrix
> scalar indexing (original file lines 36, 37 and 43, 44), it still takes
> 17 seconds just to run through the interpreter. Replacing both of the
> lines above with
> gamma_ += alpha_arr(shift_idx) * gamma_;
> lambda_ += alpha_arr(shift_idx) * lambda_;
> respectively, it takes about 20 seconds, which is much closer to the
> interpreter overhead (which I might be able to cut down by reducing the
> number of lines; I'm not sure).
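For readers without the tarball, one Haar lifting step on a 1-D signal looks roughly like this. This is a generic sketch, not the actual encode_image_standalone.m code; only the variable names lambda_ and gamma_ are borrowed from the lines quoted above:

```octave
% Generic Haar lifting step (illustrative sketch, not the referenced
% script): split into even/odd samples, then predict and update.
x = rand (1, 300);           % example signal
lambda_ = x(1:2:end);        % even samples (approximation branch)
gamma_  = x(2:2:end);        % odd samples  (detail branch)
gamma_  -= lambda_;          % predict: detail = odd - even
lambda_ += gamma_ / 2;       % update: preserves the signal mean
```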
>