Re: Using OpenMP in Octave
From: David Bateman
Subject: Re: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 21:39:13 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)
Jaroslav Hajek wrote:
> Unfortunately, it confirms what I anticipated: the elementary
> operations scale poorly. Memory bandwidth is probably the real limit
> here. The mappers involve more work per cycle and hence scale much
> better.
I was hoping the multi-level cache architecture of modern processors,
with an L1 cache dedicated to each core, would make even the elementary
operations faster. However, as the times are identical in all cases for
the elementary operations, it seems, as you say, that copying to and
from memory takes more time than the floating point operations
themselves.
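The bandwidth ceiling described here can be seen in a minimal sketch (illustrative code, not anything from Octave's sources): an elementwise add does roughly one arithmetic operation per two loads and one store, so extra threads mostly contend for the same memory bus.

```cpp
#include <cstddef>

// Elementwise add: about one flop per two loads and one store, so the
// memory bus, not the cores, is the bottleneck.  The OpenMP pragma is
// simply ignored when compiled without -fopenmp.
void
vadd (const double *a, const double *b, double *c, std::size_t n)
{
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    c[i] = a[i] + b[i];
}
```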
> This is why I think we should not hurry with multithreading the
> elementary operations, and reductions like sum(). I know Matlab does
> it, but I think it's just fancy stuff, to convince customers that new
> versions add significant value.
> Elementary operations are seldom a bottleneck; add Amdahl's law to
> their poor scaling and the result is going to be very little music for
> lots of money.
Ok, it seems that these aren't profitable.
> When I read about Matlab getting parallelized stuff like sum(), I was
> a little surprised. 50 million numbers get summed in 0.07 seconds on
> my computer; generating them in some non-trivial way typically takes
> at least 50 times that long, often much more. In that case,
> multithreaded sum is absolutely marginal, even if it scaled perfectly.
> One area where multithreading really helps is the complicated mappers,
> as shown by the second part of the benchmark.
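The contrast with the elementary operations can be sketched in the same style (again illustrative, not Octave's code): a mapper like erf does enough arithmetic per element that the threads are compute-bound rather than bandwidth-bound, so the loop scales with the core count.

```cpp
#include <cmath>
#include <cstddef>

// Expensive mapper: many flops per element loaded, so the work per
// memory access is high and extra cores actually help.
void
map_erf (const double *x, double *y, std::size_t n)
{
  #pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    y[i] = std::erf (x[i]);
}
```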
Though I imagine airy scales even better than the sine function does.
> Still, I think we should carefully consider how best to provide parallelism.
> For instance, I would be happy with explicit parallelism, something
> like pararrayfun from the OctaveForge package, so that I could write:
>
>   pararrayfun (3, @erf, x, "ChunksPerProc", 100); # parallelize on 3 threads, splitting the array into 300 chunks
>
> Note that if I were about to parallelize a larger section of code that
> uses erf, I could do
>
>   erf = @(x) pararrayfun (3, @erf, x, "ChunksPerProc", 100); # use parallel erf for the rest of the code
Yes, I agree that this could be accelerated with OpenMP rather than with
fork/pipe, as the control over the threads, and over which cores they
run on, is more explicit.
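One reason the control is more explicit: with OpenMP the worker count is chosen in-process, whereas with fork/pipe the placement of each child is left to the OS scheduler. A small sketch (hypothetical helper name; the `_OPENMP` guard also keeps it valid when built without OpenMP):

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// With OpenMP the worker count is set in-process, via
// omp_set_num_threads() or the OMP_NUM_THREADS environment variable.
int
worker_threads ()
{
#ifdef _OPENMP
  return omp_get_max_threads ();
#else
  return 1;  // built without OpenMP: serial fallback
#endif
}
```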
> If we really insisted that the builtin functions must support
> parallelism, I say it must fulfill at least the following:
> 1. an easy way of temporarily disabling it must exist (for high-level
> parallel constructs like parcellfun, it should be done automatically)
> 2. the tuning constants should be customizable.
Why make it tunable if we've done sufficient testing that the defaults
result in faster code in every case, or at least the majority of cases,
and the slowdowns are minor?
> for instance, I can imagine something like
>
>   mt_size_limit ("sin", 1000); # parallelize sin for arrays with > 1000 elements
>   mt_size_limit ("erfinv", 500); # parallelize erfinv for arrays with > 500 elements
But this means we'd maintain a map of every parallelized mapper function
and the number of elements above which we apply a multi-threaded
approach. That comes with its own overhead. Though, given that some
functions take much longer per element than others, the optimal point at
which to change from a serial function to a parallel one will probably
be very different, so if we don't maintain a table of sorts we'll
certainly forgo some potential speed-ups. The functions arrayfun and
cellfun will be particularly nasty in this respect, as the user can pass
anything to them and Octave has no idea a priori of the optimal
serial-to-parallel switching point. Though I think I'd prefer having an
additional option to arrayfun and cellfun so the user can define this
value directly.
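The table in question could look something like the following sketch (names, defaults, and the fallback policy are all illustrative, not Octave's actual implementation):

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical per-mapper threshold table: below the limit the serial
// loop wins; above it the multi-threaded path is taken.
static std::map<std::string, std::size_t> mt_size_limits = {
  { "sin",    1000 },
  { "erfinv",  500 },
};

bool
use_parallel (const std::string& fn, std::size_t n)
{
  auto it = mt_size_limits.find (fn);
  // Unknown mappers (e.g. user functions passed to arrayfun) fall back
  // to a generic default, which is exactly the problem noted above.
  std::size_t limit = (it != mt_size_limits.end ()) ? it->second : 1000;
  return n > limit;
}
```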
> We have no chance to determine the best constant for all machines, so
> I think users should be allowed to find out their own.
Bus speeds aren't that different across most processors, so generic
values will probably be fine. If the optimal changeover point from one
algorithm to another for a mapper function moves from 800 to 1000, do
we really care?
David