Re: Using OpenMP in Octave


From: Jaroslav Hajek
Subject: Re: Using OpenMP in Octave
Date: Tue, 30 Mar 2010 11:21:29 +0200

On Mon, Mar 29, 2010 at 9:39 PM, David Bateman <address@hidden> wrote:
> Jaroslav Hajek wrote:
>>
>> Unfortunately, it confirms what I anticipated: the elementary
>> operations scale poorly. Memory bandwidth is probably the real limit
>> here. The mappers involve more work per cycle and hence scale much
>> better.
>>
>
> I was hoping the multi-level cache architecture of modern processors,
> with the L1 cache dedicated to each core, would make even the elementary
> operations faster. Though since the times are identical in all cases for
> the elementary operations, it seems, as you say, that copying to and
> from memory takes more time than the floating-point operations
> themselves.
>

It seems quite so. I think I have read somewhere that the costs of fp
addition/subtraction and multiplication are essentially equal on
modern hardware. Adding two long vectors repeatedly fetches two memory
blocks to cache, adds them, and then stores the result back. With
multithreading, you can reduce the time needed for the additions, and
you can also keep the memory bus more occupied (i.e. while one CPU is
doing additions, you can serve memory to another), but you can't
reduce the amount of memory transfers. So you can only expect nearly
perfect scaling with operations where memory transfers play a minor
role.
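
To put it in code (a minimal standalone C++/OpenMP sketch, nothing
Octave-specific): per element there are two loads and one store for a
single add, so the threads end up waiting on memory:

#include <cstddef>

// Element-wise addition of two long vectors, parallelized with OpenMP.
// Per element there are two loads, one add and one store; the add itself
// is nearly free, so the loop is limited by memory bandwidth, not the FPUs.
void vector_add (const double *a, const double *b, double *c,
                 std::ptrdiff_t n)
{
#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; i++)
    c[i] = a[i] + b[i];
}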

>
>> This is why I think we should not hurry with multithreading the
>> elementary operations, and reductions like sum(). I know Matlab does
>> it, but I think it's just fancy stuff, to convince customers that new
>> versions add significant value.
>> Elementary operations are seldom a bottleneck; add Amdahl's law to
>> their poor scaling and the result is going to be very little music for
>> lots of money.
>>
>
> Ok, it seems that these aren't profitable.
>>
>> When I read about Matlab getting parallelized stuff like sum(), I was
>> a little surprised. 50 million numbers get summed in 0.07 seconds on
>> my computer; generating them in some non-trivial way typically takes
>> at least 50 times that long, often much more. In that case,
>> multithreaded sum is absolutely marginal, even if it scaled perfectly.
>>
>> One area where multithreading really helps is the complicated mappers,
>> as shown by the second part of the benchmark.
>>
>
> Though I imagine airy scales more than the sine function..

Apparently, even exp scales well. Though personally I don't have
programs making heavy use of non-trivial elemental operations, I
daresay this is one area which is both easy enough and profitable
enough to start with.

>>
>> Still, I think we should carefully consider how to best provide
>> parallelism.
>> For instance, I would be happy with explicit parallelism, something
>> like pararrayfun from the OctaveForge package, so that I could write:
>>
>> pararrayfun (3, @erf, x, "ChunksPerProc", 100); # parallelize on 3
>> threads, splitting the array to 300 chunks.
>>
>> Note that if I was about to parallelize a larger section of code that
>> uses erf, I could do
>>
>> erf = @(x) pararrayfun (3, @erf, x, "ChunksPerProc", 100); # use
>> parallel erf for the rest of the code
>>
>
> Yes, I agree that this could be accelerated with OpenMP rather than with
> fork/pipe, as the control over the threads, and over which cores they
> run on, is more explicit.
>

I was thinking about providing a special optimization to pararrayfun
that would recognize certain builtins and use OpenMP-ized code rather
than the default fork-pipe approach. Of course, if those get into core
Octave, this becomes moot.


>> If we really insisted that the builtin functions must support
>> parallelism, I say it must fulfill at least the following:
>>
>> 1. an easy way of temporarily disabling it must exist (for high-level
>> parallel constructs like parcellfun, it should be done automatically)
>> 2. the tuning constants should be customizable.
>>
>
> Why make it tunable if we've done sufficient testing that the defaults
> result in faster code in every, or at least the majority of, cases and
> the slowdowns are minor?
>

How do you imagine "sufficient testing"? I can only test on 2 or 3
machines, with their specific configuration. Is that sufficient? I was
thinking along the lines that if we make everything customizable,
users will be more willing to tune the constants for their machine. We
are not MathWorks; we don't have a team dedicated to this task. Had
everyone agreed with your original patch, I'd bet anything that the
constants would stay as they were unless you tuned them yourself.

>> for instance, I can imagine something like
>>
>> mt_size_limit ("sin", 1000);    # parallelize sin for arrays with > 1000 elements
>> mt_size_limit ("erfinv", 500);  # parallelize erfinv for arrays with > 500 elements
>>
>
> But this means we maintain a map of every parallelized mapper function
> and the number of elements above which we apply a multi-threaded
> approach. That comes with its own overhead.

Yes. Perhaps not that big, though. I can imagine a fairly efficient
implementation, using the octave_base_value::unary_mapper_t
enumeration, which would essentially need just an extra array
reference.
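
Roughly along these lines (only a sketch; the headers and the
num_unary_mappers enumerator are assumptions about the actual
octave_base_value declaration, and set_mt_size_limit/use_mt are made-up
names):

#include <octave/oct.h>
#include <octave/ov-base.h>

// One threshold per unary mapper; zero-initialized, so every mapper stays
// serial until a limit is set (e.g. by an mt_size_limit built-in).
static octave_idx_type mt_size_limits[octave_base_value::num_unary_mappers];

void set_mt_size_limit (octave_base_value::unary_mapper_t umap,
                        octave_idx_type limit)
{
  mt_size_limits[umap] = limit;
}

// In the mapper dispatch this is the only extra cost: one array reference.
bool use_mt (octave_base_value::unary_mapper_t umap, octave_idx_type numel)
{
  return mt_size_limits[umap] > 0 && numel > mt_size_limits[umap];
}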

> Though given that some functions will take much longer per element than
> others, the optimal point to change from a serial function to a parallel
> one will probably be very different, so if we don't maintain a table of
> sorts we'll certainly forgo some potential speed-ups. The functions
> arrayfun and cellfun will be particularly nasty in this respect, as the
> user could pass anything to them and Octave can have no idea a priori of
> the optimal serial-to-parallel switching point. Though I think I'd prefer
> having an additional option to arrayfun and cellfun so the user can
> define this value directly.
>

I implemented some configuration options for this in pararrayfun and
parcellfun recently, in collaboration with Jean-Benoist Leger.
In the end, pararrayfun applied to beta was even able to outperform the
serial invocation :)

I agree there is a gap between a scalable application of
parcellfun/pararrayfun and serial invocation that would be nice to
fill, and that multithreading is probably best capable of it. Another
good idea

>> We have no chance to determine the best constant for all machines, so
>> I think users should be allowed to find out their own.
>>
>
> The bus speeds aren't that different in most processors, so generic
> values will probably be fine. If the optimal change-over from one
> algorithm to another for a mapper function changes from 800 to 1000, do
> we really care?
>

Depends. 800 to 1000, maybe not. But what if it is 200 to 1000? The
optimal switch point also depends on the number of cores, so a fixed
number is sub-optimal in principle.


By the way, I just realized there are further problems to be dealt
with. Watch this demo (with my patch from yesterday):

octave:1> a = rand (1e6, 1);
octave:2> b = rand (1e6, 1);
octave:3> a(end/2) = NaN;
octave:4> c = a & b;
error: invalid conversion of NaN to logical
terminate called after throwing an instance of 'octave_execution_exception'
panic: Aborted -- stopping myself...
Aborted

Bang. What happened? Yes, the answer is: exceptions. The invalid NaN to
logical conversion is currently handled via octave_execution_exception,
and exceptions don't play nicely with OpenMP. Current OpenMP mandates
that exceptions be caught in the same thread that raised them; for
loops, that essentially means within the same iteration.
This means enclosing each loop body in a try-catch block, paying
attention to both execution exceptions and interrupt exceptions, storing
them somehow when caught, and rethrowing them once back on the calling
thread.
This particular case is both easy to trigger and easy to fix. But what
about others? I think some of the mappers implemented in libcruft can
raise exceptions (through the Fortran XSTOPX).
Lots of stuff can raise the interrupt exception (for good reasons).
What should we do? Should Ctrl-C be disabled while multithreading is in
effect? Or should we simply make sure manually that there are no
octave_quit or F77_XFCN calls in the parallel regions?
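
In other words, every parallelized loop would need a pattern like this
(a sketch with a stand-in exception type, not the real
octave_execution_exception machinery): catch inside the loop body,
remember the failure, and rethrow after the threads join:

#include <cstddef>

// Stand-in exception type, just for illustration.
struct execution_exception { };

// Apply f element-wise; the try-catch must live inside the parallel loop
// because OpenMP does not allow an exception to escape the parallel region.
void parallel_map (const double *x, double *y, std::ptrdiff_t n,
                   double (*f) (double))
{
  bool failed = false;

#pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < n; i++)
    {
      try
        {
          y[i] = f (x[i]);
        }
      catch (const execution_exception&)
        {
          // Record the failure here ...
#pragma omp critical (record_failure)
          failed = true;
        }
    }

  // ... and rethrow once all threads have joined, back on the calling thread.
  if (failed)
    throw execution_exception ();
}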

There are a lot of open questions here. I would prefer if we first
focused on multithreading the most computationally intensive stuff.
BLAS and FFTW can already do it. BLAS is usually parallelized through
OpenMP, so it would be nice if we wrapped omp_set_num_threads (), so
that users could influence the number of threads from Octave. IIRC FFTW
uses a different multi-threading method, so maybe we could address both
in a unified way?
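
Something as small as this oct-file would do for the OpenMP side (a
sketch; the name openmp_threads is made up and a real version would want
input validation):

#include <octave/oct.h>
#include <omp.h>

// Hypothetical built-in: openmp_threads (N) sets the number of OpenMP
// threads (which also affects an OpenMP-parallelized BLAS linked into the
// same process); called with no argument it just reports the current maximum.
DEFUN_DLD (openmp_threads, args, ,
  "Query or set the number of OpenMP threads.")
{
  if (args.length () > 0)
    omp_set_num_threads (args(0).int_value ());

  return octave_value (omp_get_max_threads ());
}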

I think stuff like this has a far bigger impact than parallelizing plus
and minus. Besides, Octave has no JIT, so even if you parallelize plus,
a+b+c will still be slower than it could be if it were done by a single
serial loop.
Not that it really matters; as I already said, these operations are
seldom a bottleneck, and if they are, rewriting in C++ is probably the
way to go.
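
For example (a plain C++ sketch with raw arrays standing in for Octave
arrays): a+b+c today means two passes over memory plus a temporary,
while a fused loop needs a single pass and no temporary:

#include <cstddef>

// What the interpreter effectively does now for a+b+c: two loops and one
// temporary array, i.e. twice the memory traffic of the fused version.
void add3_two_pass (const double *a, const double *b, const double *c,
                    double *tmp, double *r, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++) tmp[i] = a[i] + b[i];
  for (std::size_t i = 0; i < n; i++) r[i] = tmp[i] + c[i];
}

// What a JIT (or hand-written C++) could emit instead: one pass, no temporary.
void add3_fused (const double *a, const double *b, const double *c,
                 double *r, std::size_t n)
{
  for (std::size_t i = 0; i < n; i++)
    r[i] = a[i] + b[i] + c[i];
}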

Perhaps we had best identify the most computationally heavy functions
and target them first.
I still think it's not a good idea to put multithreading stuff into
3.4.0. There have been a lot of performance improvements anyway.
I think multithreading is one area where we should not blindly copy
Matlab, and it deserves a careful approach.

regards

-- 
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz


