octave-maintainers
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

more permute optimizations


From: Jaroslav Hajek
Subject: more permute optimizations
Date: Wed, 25 Mar 2009 11:54:23 +0100

Hi,

I managed to implement a more smart algorithm for permute (Array<T>::permute):
http://hg.savannah.gnu.org/hgweb/octave/rev/9f5e095555fc

Summary:
The new algorithm will now apply dimension reductions where sequences
of dimensions are moved.
For instance,
permute (a, [2,3,1,4,5])

will be equivalent to
s = size (a);
b = reshape (a, [s(1), s(2)*s(3), s(4)*s(5)]);
b = permute (a, [2,1,3]);
b = reshape (a, s([2,3,1,4,5]));

thus reducing the dimensionality of the problem (and hence the amount
of work for the recursive copying).
If there are leading dimensions that map to themselves, the operation
will be done as a sequence of std::copy (this was actually true for
some time). If the operation can be reduced to a sequence of
transposes, these will be done using a specialized blocked transpose
code - a block is prefetched to temporary array and then redistributed
in a column-oriented manner, reducing the number of cache misses
approximately by a factor equal to the block size. If neither of these
conditions holds, the reduced operation is performed using plain
loops.

A short benchmark on permuting a 5d array follows (be careful with n
as the size is n^5):
n = 30;
rand("state", 1);
a = rand(n,n,n,n,n);
tic; b = permute (a, [1,3,4,2,5]); toc
tic; b = permute (a, [2,3,1,4,5]); toc
tic; b = permute (a, [3,4,1,2,5]); toc
tic; b = permute (a, [4,5,1,2,3]); toc

a = rand(3e3);
tic; a.'; toc
tic; permute (a, [2 1]); toc

with recent tip, I get:

Elapsed time is 0.125344 seconds.
Elapsed time is 0.180573 seconds.
Elapsed time is 0.243261 seconds.
Elapsed time is 0.335334 seconds.
Elapsed time is 0.0856569 seconds.
Elapsed time is 0.112666 seconds.

with the new patch, I get:

Elapsed time is 0.124983 seconds.
Elapsed time is 0.173172 seconds.
Elapsed time is 0.149354 seconds.
Elapsed time is 0.156777 seconds.
Elapsed time is 0.0828929 seconds.
Elapsed time is 0.0602632 seconds.

for certain operations, speed-up by as much as 100% is visible. For
comparison, here's 3.0.x:

Elapsed time is 0.280931 seconds.
Elapsed time is 0.379384 seconds.
Elapsed time is 0.456503 seconds.
Elapsed time is 0.464732 seconds.
Elapsed time is 0.123731 seconds.
Elapsed time is 0.157451 seconds.

The last two numbers for the new patch show that permute is now faster
than transpose doing the same job.
Which is a little weird, given that the blocked transpose routine is
essentially the same as David Bateman's transpose() implementation,
even with the same block size, just using raw pointers rather than
arrays directly.
Maybe it's just the optimizer having less trouble with CSE.

Unless David (or anyone else) objects, I'll replace the transpose
implementation by a call to the routine using pointers. I also
tinkered with the block size a bit, but it seems to me that 8 was a
good guess from David.
The best block size seems to depend on the problem dimensions, and I
bet that dependence is nontrivial.

I can conclude that permute is now a good deal smarter than its 3.0.x
counterpart.

enjoy

PS. One important motivation is that we're talking about implementing
basic statistical summary reductions using more robust and accurate
algorithms, which often want (conditionally) multiple passes through
the data. A cache-aligned implementation for non-contiguous reductions
(i.e. dim > 1) - see oct-norm.cc for example seems, however, really
painful. So I'm thinking about just implementing the reductions for
the contiguous case and simply rotating the relevant dimension to the
front first in other case. Comments?

-- 
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz


reply via email to

[Prev in Thread] Current Thread [Next in Thread]