FYI: optimizing certain matrix arithmetic

octave-maintainers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

FYI: optimizing certain matrix arithmetic

From:	Jaroslav Hajek
Subject:	FYI: optimizing certain matrix arithmetic
Date:	Wed, 23 Sep 2009 12:50:48 +0200

hi all,

the following three patches:
http://hg.savannah.gnu.org/hgweb/octave/rev/afcf852256d2
http://hg.savannah.gnu.org/hgweb/octave/rev/0d3b248f4ab6
http://hg.savannah.gnu.org/hgweb/octave/rev/7e5b4de5fbfe

add new optimizations for matrix multiplication and division. Quote for NEWS:

 ** More efficient matrix division handling. Octave is now able to
handle the expressions

       M' \ v
       M.' \ v
       v / M

    (M is a matrix and v is a vector) more efficiently in certain cases.
    In particular, if M is triangular, all three expressions will be
handled by a single call to
    xTRTRS, with appropriate flags. Previously, all three expressions
required a physical transpose
    of M.

 ** More efficient handling of certain mixed real-complex matrix operations.
    For instance, if RM is a real matrix and CM a complex matrix,

      RM * CM

    can now be evaluated either as

      complex (RM * real (CM), RM * imag (CM))

    or as

      complex (RM) * CM,

    depending on the dimensions. The 1st form requires more
temporaries and copying,
    but halves the count of FLOPs, which normally brings better performance if
    RM has enough rows. Previously, the 2nd form was always used.

    Matrix division is similarly affected.

The triangular optimization is well demonstrated by this benchmark:

n = 500;
R = triu (rand (n));
u = rand (n, 1);

tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; u' / R; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc

R = tril (rand (n));
u = rand (n, 1);

tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; u' / R; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc

u = u + I*rand (n, 1);
tic; for i = 1:1000; R \ u; endfor; toc
tic; for i = 1:1000; R' \ u; endfor; toc


with a recent tip, I got (Core 2 Duo @ 2.83 GHz, g++ -O3
-march=native, self-built GotoBLAS + LAPACK):

Elapsed time is 0.225974 seconds.
Elapsed time is 0.78818 seconds.
Elapsed time is 1.81756 seconds.
Elapsed time is 0.219842 seconds.
Elapsed time is 0.79384 seconds.
Elapsed time is 1.83497 seconds.
Elapsed time is 4.32738 seconds.
Elapsed time is 7.73774 seconds.

with the new patches, I get

Elapsed time is 0.221489 seconds.
Elapsed time is 0.204968 seconds.
Elapsed time is 0.204267 seconds.
Elapsed time is 0.219243 seconds.
Elapsed time is 0.20444 seconds.
Elapsed time is 0.202193 seconds.
Elapsed time is 0.227214 seconds.
Elapsed time is 0.209824 seconds.

This is particularly cool when working intensively with Cholesky, LU
or QR factorizations - you can do both A \ v and A' \ v almost as
efficiently as with raw BLAS.

Also, there's some speed-up with real x complex multiplication and
division, which is not a very common operation, but sometimes needed.
The point is that for bigger matrices, complex (RM * real (CM), RM *
imag (CM)) is typically up to 2x faster than complex (RM) * CM.
Surprised? Just count the FLOPs. On the other hand, the former
expression is less friendly for Octave's complex storage.

The benchmark is:

n = 800;
a = rand (n);
b = rand (n) + i*rand (n);
tic; a * b; toc
tic; b * a; toc
tic; a' * b; toc
tic; b * a'; toc
tic; a \ b; toc
tic; b / a; toc

with a recent tip:

Elapsed time is 0.431021 seconds.
Elapsed time is 0.410468 seconds.
Elapsed time is 0.411169 seconds.
Elapsed time is 0.412252 seconds.
Elapsed time is 0.641075 seconds.
Elapsed time is 0.66255 seconds.

with the new patches:

Elapsed time is 0.221522 seconds.
Elapsed time is 0.2183 seconds.
Elapsed time is 0.22033 seconds.
Elapsed time is 0.22058 seconds.
Elapsed time is 0.297738 seconds.
Elapsed time is 0.325542 seconds.

The conclusion is that Octave 3.4 will again be able to map linear
algebra code more efficiently onto BLAS and LAPACK than the previous
version.

-- 
RNDr. Jaroslav Hajek
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

[Prev in Thread]

Current Thread

[Next in Thread]

FYI: optimizing certain matrix arithmetic, Jaroslav Hajek <=
- Re: FYI: optimizing certain matrix arithmetic, Michael D Godfrey, 2009/09/23
  - Re: FYI: optimizing certain matrix arithmetic, John W. Eaton, 2009/09/23
    - Re: FYI: optimizing certain matrix arithmetic, Michael D Godfrey, 2009/09/23
    - Re: FYI: optimizing certain matrix arithmetic, Jaroslav Hajek, 2009/09/24
    - Re: FYI: optimizing certain matrix arithmetic, Michael D Godfrey, 2009/09/24
    - Re: FYI: optimizing certain matrix arithmetic, Jaroslav Hajek, 2009/09/24
    - Re: FYI: optimizing certain matrix arithmetic, Michael Godfrey, 2009/09/24
    - Re: FYI: optimizing certain matrix arithmetic, Jaroslav Hajek, 2009/09/25
    - Re: FYI: optimizing certain matrix arithmetic, Michael D Godfrey, 2009/09/25
    - Re: FYI: optimizing certain matrix arithmetic, Shai Ayal, 2009/09/25

Prev by Date: octave-3.3.50 built by mingw-gcc-4.4.0 cannot be executed.
Next by Date: divergence function
Previous by thread: octave-3.3.50 built by mingw-gcc-4.4.0 cannot be executed.
Next by thread: Re: FYI: optimizing certain matrix arithmetic
Index(es):
- Date
- Thread