Re: Using OpenMP in Octave
From: David Bateman
Subject: Re: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 21:39:13 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)
Jaroslav Hajek wrote:
> Unfortunately, it confirms what I anticipated: the elementary
> operations scale poorly. Memory bandwidth is probably the real limit
> here. The mappers involve more work per cycle and hence scale much
> better.
I was hoping the multi-level cache architecture of modern processors,
with an L1 cache dedicated to each core, would make even the elementary
operations faster. However, as the times are identical in all cases for
the elementary operations, it seems, as you say, that copying to and
from memory takes more time than the floating point operations
themselves.
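The bandwidth ceiling described here can be seen in a minimal sketch (illustrative code, not anything from Octave's sources): an elementwise add does roughly one arithmetic operation per two loads and one store, so extra threads mostly contend for the same memory bus.

```cpp
#include <cstddef>

// Elementwise add: about one flop per two loads and one store, so the
// memory bus, not the cores, is the bottleneck.  The OpenMP pragma is
// simply ignored when compiled without -fopenmp.
void
vadd (const double *a, const double *b, double *c, std::size_t n)
{
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    c[i] = a[i] + b[i];
}
```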
> This is why I think we should not hurry with multithreading the
> elementary operations, and reductions like sum(). I know Matlab does
> it, but I think it's just fancy stuff, to convince customers that new
> versions add significant value.
> Elementary operations are seldom a bottleneck; add Amdahl's law to
> their poor scaling and the result is going to be very little music for
> lots of money.
Ok, it seems that these aren't profitable.
> When I read about Matlab getting parallelized stuff like sum(), I was
> a little surprised. 50 million numbers get summed in 0.07 seconds on
> my computer; generating them in some non-trivial way typically takes
> at least 50 times that long, often much more. In that case,
> multithreaded sum is absolutely marginal, even if it scaled perfectly.
> One area where multithreading really helps is the complicated mappers,
> as shown by the second part of the benchmark.
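The contrast with the elementary operations can be sketched in the same style (again illustrative, not Octave's code): a mapper like erf does enough arithmetic per element that the threads are compute-bound rather than bandwidth-bound, so the loop scales with the core count.

```cpp
#include <cmath>
#include <cstddef>

// Expensive mapper: many flops per element loaded, so the work per
// memory access is high and extra cores actually help.
void
map_erf (const double *x, double *y, std::size_t n)
{
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    y[i] = std::erf (x[i]);
}
```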
Though I imagine airy scales even better than the sine function does.
> Still, I think we should carefully consider how best to provide parallelism.
> For instance, I would be happy with explicit parallelism, something
> like pararrayfun from the OctaveForge package, so that I could write:
>
>   pararrayfun (3, @erf, x, "ChunksPerProc", 100); # parallelize on 3 threads, splitting the array into 300 chunks
>
> Note that if I were about to parallelize a larger section of code that
> uses erf, I could do
>
>   erf = @(x) pararrayfun (3, @erf, x, "ChunksPerProc", 100); # use parallel erf for the rest of the code
Yes, I agree that this could be accelerated with OpenMP rather than with
fork/pipe, as the control over the threads, and over which cores they
run on, is more explicit.
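One reason the control is more explicit: with OpenMP the worker count is chosen in-process, whereas with fork/pipe the placement of each child is left to the OS scheduler. A small sketch (hypothetical helper name; the `_OPENMP` guard also keeps it valid when built without OpenMP):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// With OpenMP the worker count is set in-process, via
// omp_set_num_threads() or the OMP_NUM_THREADS environment variable.
int
worker_threads ()
{
#ifdef _OPENMP
  return omp_get_max_threads ();
#else
  return 1;  // built without OpenMP: serial fallback
#endif
}
```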
> If we really insisted that the builtin functions must support
> parallelism, I say it must fulfill at least the following:
> 1. an easy way of temporarily disabling it must exist (for high-level
> parallel constructs like parcellfun, it should be done automatically)
> 2. the tuning constants should be customizable.
Why make it tunable if we've done sufficient testing that the defaults
result in faster code in every case, or at least the majority of cases,
and the slowdowns are minor?
> for instance, I can imagine something like
>
>   mt_size_limit ("sin", 1000); # parallelize sin for arrays with > 1000 elements
>   mt_size_limit ("erfinv", 500); # parallelize erfinv for arrays with > 500 elements
But this means we'd maintain a map of every parallelized mapper function
and the number of elements above which we apply a multi-threaded
approach. That comes with its own overhead. Though, given that some
functions take much longer per element than others, the optimal point at
which to change from a serial function to a parallel one will probably
be very different, so if we don't maintain a table of sorts we'll
certainly forgo some potential speed-ups. The functions arrayfun and
cellfun will be particularly nasty in this respect, as the user can pass
anything to them and Octave has no idea a priori of the optimal
serial-to-parallel switching point. Though I think I'd prefer having an
additional option to arrayfun and cellfun so the user can define this
value directly.
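The table in question could look something like the following sketch (names, defaults, and the fallback policy are all illustrative, not Octave's actual implementation):

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical per-mapper threshold table: below the limit the serial
// loop wins; above it the multi-threaded path is taken.
static std::map<std::string, std::size_t> mt_size_limits = {
  { "sin",    1000 },
  { "erfinv",  500 },
};

bool
use_parallel (const std::string& fn, std::size_t n)
{
  auto it = mt_size_limits.find (fn);
  // Unknown mappers (e.g. user functions passed to arrayfun) fall back
  // to a generic default, which is exactly the problem noted above.
  std::size_t limit = (it != mt_size_limits.end ()) ? it->second : 1000;
  return n > limit;
}
```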
> We have no chance to determine the best constant for all machines, so
> I think users should be allowed to find out their own.
Bus speeds aren't that different across most processors, so generic
values will probably be fine. If the optimal changeover point from one
algorithm to another for a mapper function moves from 800 to 1000, do
we really care?
David