
Re: Using OpenMP in Octave


From: Jaroslav Hajek
Subject: Re: Using OpenMP in Octave
Date: Wed, 31 Mar 2010 07:49:32 +0200

On Tue, Mar 30, 2010 at 10:29 PM, David Bateman <address@hidden> wrote:
> Jaroslav Hajek wrote:
>>
>> Apparently, even exp scales well. Though personally I don't have
>> programs making heavy use of non-trivial elemental operations, I
>> daresay this is one area which is both easy enough and profitable
>> enough to start with.
>>
>
> Then let's start there.
>
>>> Why make it tunable if we've done sufficient testing that the defaults
>>> result in faster code in every case, or at least the majority of cases,
>>> and the slowdowns are minor?
>>>
>>>
>>
>> How do you imagine "sufficient testing"? I can only test on 2 or 3
>> machines, with their specific configuration. Is that sufficient? I was
>> thinking along the lines that if we make everything customizable,
>> users will be more willing to tune the constants for their machine. We
>> are not MathWorks, we don't have a team dedicated to this task. Had
>> everyone agreed with your original patch, I'd bet anything that the
>> constants would stay as they were unless you tuned them yourself.
>>
>
> I would have tuned them myself, in any case.
>

I thought you would :) But what if I don't like your constants, having
figured out better-working ones for my machines? A commit war?

>>>> We have no chance to determine the best constant for all machines, so
>>>> I think users should be allowed to find out their own.
>>>>
>>>>
>>>
>>> The bus speeds aren't that different in most processors, so generic
>>> values will probably be fine. If the optimal changeover from one
>>> algorithm to another for a mapper function changes from 800 to 1000,
>>> do we really care?
>>>
>>>
>>
>> Depends. 800 to 1000, maybe not. But what if it is 200 to 1000? The
>> optimal switch point also depends on the number of cores, so a fixed
>> number is sub-optimal in principle.
>>
>
> My point was that the serial part of the code will include the copying to
> the memory shared by the cores, so Amdahl's law includes that, and the
> difference in wall time between a switching point of 200 or 1000, whether
> for a serial or parallel version, might not be that large. The switch point
> is probably tolerant to variations in performance. Though maybe it's just
> better to parallelize everything, as the small cases aren't limiting in
> their computational time.

I need to be able to turn off the multithreading at runtime and I
absolutely insist on that, so there needs to be a branch anyway.
The solution I offered earlier was to encapsulate the decision into a
routine that accepts the loop dimensions (the number of cycles and the
"size" of one cycle) and decides whether it's worth parallelizing.
A universal switch point may be a good start. But if you expect
people to play with the parallel code, I don't see why you don't want
a tunable parameter for every operation and every mapper, if it can be
done efficiently.
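
Something along these lines, say (just a sketch: the name mt_compute_chunk
matches the routine in your code below, but the enable flag and the
threshold are placeholders for whatever tunables we end up exposing):

  #include <algorithm>
  #include <cstddef>
  #include <omp.h>

  typedef ptrdiff_t octave_idx_type;         // stand-in for Octave's typedef

  static bool mt_enabled = true;             // hypothetical runtime on/off switch
  static std::size_t mt_switch_point = 1000; // hypothetical tunable threshold (bytes)

  // Given the "size" of one cycle in bytes and the number of cycles, return a
  // chunk size if parallelizing looks worthwhile, or 0 to run the loop serially.
  static octave_idx_type
  mt_compute_chunk (std::size_t cycle_size, octave_idx_type n)
  {
    if (! mt_enabled
        || static_cast<std::size_t> (n) * cycle_size < mt_switch_point)
      return 0;

    // Split the iterations roughly evenly across the available threads.
    return std::max (static_cast<octave_idx_type> (1),
                     n / static_cast<octave_idx_type> (omp_get_max_threads ()));
  }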

>
>> By the way, I just realized there are further problems to be dealt
>> with. Watch this demo (with my yesterday's patch):
>>
>> octave:1> a = rand (1e6, 1);
>> octave:2> b = rand (1e6, 1);
>> octave:3> a(end/2) = NaN;
>> octave:4> c = a & b;
>> error: invalid conversion of NaN to logical
>> terminate called after throwing an instance of
>> 'octave_execution_exception'
>> panic: Aborted -- stopping myself...
>> Aborted
>>
>> Bang. What happened? Yes, the answer is - exceptions. The invalid NaN
>> to logical conversion is currently handled via
>> octave_execution_exception. And exceptions don't play nicely with
>> OpenMP. Current OpenMP mandates exceptions to be caught in the same
>> thread that caused them. For loops, it essentially means in the same
>> cycle.
>> This means enclosing each loop body in a try-catch block, paying attention
>> to execution exceptions and interrupt exceptions, storing them somehow when
>> caught, and rethrowing them when back.
>> This particular case is both easy to trigger and easy to fix. But what
>> about others? I think some of the mappers implemented in libcruft can
>> raise exceptions (through the fortran XSTOPX).
>> Lots of stuff can raise the interrupt exception (for good reasons).
>> What should we do? Should Ctrl-C be disabled when multithreading is in
>> effect? Or shall we simply manually make sure that there are no
>> octave_quit or F77_XFCN calls in the parallel areas?
>>
>
> OK, we have to deal with exceptions in a generic manner in the applier
> functions.  I suppose we need to do something like
>
> template <class R, class X, class Y>
> inline Array<R>
> do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
>                  void (*op) (size_t, R *, const X *, const Y *),
>                  const char *opname)
> {
>   dim_vector dx = x.dims (), dy = y.dims ();
>   if (dx == dy)
>     {
>       Array<R> r (dx);
>       octave_idx_type n = r.numel ();
> #ifdef HAVE_OPENMP
>       octave_idx_type chunk = mt_compute_chunk (sizeof (R), n);
>       if (chunk > 0)
>         {
>           bool exception_thrown = false;
>           sig_atomic_t saved_octave_exception_state = 0;
>           sig_atomic_t saved_octave_interrupt_state = 0;
>
>           R *rvec = r.fortran_vec ();
>           const X *xvec = x.data ();
>           const Y *yvec = y.data ();
>           OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec,
>             yvec, saved_octave_exception_state, saved_octave_interrupt_state,
>             exception_thrown) schedule (static, 1))
>           for (octave_idx_type i = 0; i < n; i += chunk)
>             {
>               // Catch the longjmp from an interrupt or exception in the
>               // same thread that raised it, as OpenMP requires.
>               octave_jmp_buf saved_context;
>               octave_save_current_context (saved_context);
>               if (octave_set_current_context)
>                 {
>                   octave_restore_current_context (saved_context);
>                   exception_thrown = true;
>                   saved_octave_exception_state = octave_exception_state;
>                   saved_octave_interrupt_state = octave_interrupt_state;
>                 }
>               else
>                 {
>                   op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
>                   octave_restore_current_context (saved_context);
>                 }
>             }
>           if (exception_thrown)
>             {
>               octave_exception_state = saved_octave_exception_state;
>               octave_interrupt_state = saved_octave_interrupt_state;
>               octave_rethrow_exception ();
>             }
>         }
>       else
> #endif
>         op (n, r.fortran_vec (), x.data (), y.data ());
>       return r;
>     }
>   else
>     {
>       gripe_nonconformant (opname, dx, dy);
>       return Array<R> ();
>     }
> }
>
> will do what we want, but it's no longer a simple matter of including a
> pragma wrapped in a macro.

Yes. Besides, I don't think this works.
I think you mixed together the interrupts raised from Fortran, which are
handled by setjmp/longjmp, and regular C++ exceptions (which is the case
here). Here, one needs to employ try/catch. But see below; I don't think
this is necessary.
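
For completeness, the try/catch variant would look something like this (a
sketch only; parallel_apply and the stand-in exception type are illustrative,
not actual Octave code):

  #include <algorithm>
  #include <cstddef>

  typedef ptrdiff_t octave_idx_type;      // stand-in for Octave's typedef
  struct octave_execution_exception { };  // stand-in for Octave's exception type

  // Apply op over [0, n) in parallel chunks; catch any exception in the
  // thread that raised it (as OpenMP requires) and re-raise it serially.
  template <class R, class X, class Y>
  void
  parallel_apply (void (*op) (std::size_t, R *, const X *, const Y *),
                  octave_idx_type n, octave_idx_type chunk,
                  R *rvec, const X *xvec, const Y *yvec)
  {
    bool exception_thrown = false;

  #pragma omp parallel for shared(exception_thrown) schedule(static, 1)
    for (octave_idx_type i = 0; i < n; i += chunk)
      {
        try
          {
            op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
          }
        catch (...)
          {
  #pragma omp critical (record_exception)
            exception_thrown = true;  // must not escape the loop body
          }
      }

    if (exception_thrown)
      throw octave_execution_exception ();  // re-raise in the calling thread
  }

Note that this loses the original exception object; that is one reason I
would rather avoid throwing inside the loops altogether.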

> This question of exceptions will occur even more as we try to parallelize
> at a coarser grain, and you probably have similar issues in your parcellfun
> implementation already.
>

Not really. Remember that parcellfun does not use multithreading.

>> There's a lot of open questions here. I would prefer if we first
>> focused on multithreading of the most computationally intensive stuff.
>> BLAS and FFTW already can do it. BLAS is usually parallelized through
>> OpenMP, so it would be nice if we wrapped the omp_set_num_threads (),
>> so that users could influence the number of threads from Octave. IIRC
>> FFTW uses a different multi-threading method, so maybe we could
>> address both in a unified way?
>>
>
> Are you sure that MKL and ACML both use OpenMP for multithreading? What
> about other vendors? In any case, wrapping all of this in a function to set
> the number of threads in Octave should be easier. Note that Matlab
> deprecated the maxNumCompThreads function
>
> http://www.mathworks.com/access/helpdesk/help/techdoc/ref/maxnumcompthreads.html
> http://www.mathworks.com/access/helpdesk/help/techdoc/rn/bry1ecg-1.html#br1ask3-1
>
> in recent versions of Matlab exactly because libraries like BLAS and FFTW
> are multithreaded. So we'd be going against the decision taken by Matlab,
> because they consider they can't control the number of threads of these
> libraries.
>

Unlike Matlab, Octave is open, so users can use a BLAS of their choice.
No, they're not all multithreaded via OpenMP. For instance, GotoBLAS
is not. So what? The wrapper may be useful anyway; we can simply say "this
function is equivalent to omp_set_num_threads, so it will only
influence libraries based on OpenMP".
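
Something like this, as a loadable function (a sketch; the name num_threads
is illustrative, not an existing built-in):

  #include <octave/oct.h>
  #include <omp.h>

  DEFUN_DLD (num_threads, args, ,
             "num_threads (N): set the number of OpenMP threads")
  {
    octave_value retval;

    if (args.length () == 1)
      {
        int n = args(0).int_value ();

        if (! error_state && n > 0)
          omp_set_num_threads (n);  // forwarded directly to the OpenMP runtime
        else
          error ("num_threads: N must be a positive integer");
      }
    else
      print_usage ();

    return retval;
  }

Users could then call, say, num_threads (4) from the prompt before a heavy
computation, and it would affect OpenMP-based libraries only.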


>> I think stuff like this has a far bigger impact than parallelizing
>> plus and minus. Besides, Octave's got no JIT, so even if you
>> parallelize plus, a+b+c will still be slower than it could be if it
>> was done by a single serial loop.
>> Not that it really matters; as I already said these operations are
>> seldom a bottleneck, and if they are, rewriting to C++ is probably the
>> way to go.
>>
>
> Yes, but as you said yourself it's a matter of communication. If someone
> benchmarks a+b, stupid as that may be, I'd like Octave to be just as fast
> as Matlab. Kai's results showed that his quad-core AMD improved the
> performance of the binary operators with multithreading. JIT is another
> issue; yes, the cases where it makes the most difference aren't the ones
> that have the largest impact on well-written code. But there are a lot of
> idiots out there writing bad code, benchmarking Octave against Matlab, and
> then badmouthing Octave. So yes, JIT would be good to have as well :-)
>

Maybe you place too much weight on idiots. Chasing Matlab in functionality
is useless; we can never win. There will always be idiots like that. Who
cares?

> Yes; however, my first main target was the mapper functions, though I
> started with the unary and binary op appliers as a matter of starting
> somewhere.
>>
>> Perhaps we should best identify the most computationally heavy
>> functions, and target them first.
>> I still think it's not a good idea to put multithreading stuff into
>> 3.4.0. There's been a lot of performance improvements anyway.
>> I think multithreading is one area where we should not blindly copy
>> Matlab, and it deserves a careful approach.
>>
>
> If it's isolated and disabled by default, I don't see why it shouldn't go
> into 3.4.0 to allow people to start playing with this code. There seem to
> be quite a few people regularly building and testing Octave now, and
> letting them have this functionality in an experimental form can only be a
> good thing, as long as, of course, it doesn't put the 3.4.0 release at
> risk.
>
> D.
>

OK, I see you're determined :)
So, let's merge the elemental operations patch first. Even though they
scale poorly, it appears you can still get them, say, 2x faster with 4
cores. If you have 4 free cores, why not... besides, certain cases
scale well. For instance, integer division is apparently
computationally intensive enough to scale as well as the mappers.

We do not need to do the ugly magic proposed above; the logical
operations on reals are actually the only loops that can throw, and I
never liked that solution anyway. I think this can be solved by other
means, simply by checking for NaNs prior to the operation. We shall then
state in mx-inlines.cc that the loops must not throw exceptions, and
declare them as such.
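
For instance, something along these lines (a sketch; check_for_nan is a
hypothetical helper, and the stand-in types are illustrative):

  #include <cstddef>

  typedef ptrdiff_t octave_idx_type;      // stand-in for Octave's typedef
  struct octave_execution_exception { };  // stand-in for Octave's exception type

  // Do the NaN check serially, before the parallel loop, so the loop body
  // itself never throws.
  static void
  check_for_nan (const double *x, octave_idx_type n)
  {
    for (octave_idx_type i = 0; i < n; i++)
      if (x[i] != x[i])  // only NaN compares unequal to itself
        throw octave_execution_exception ();  // "invalid conversion of NaN to logical"
  }

  // The element-wise AND loop is then exception-free and safe to parallelize:
  //
  //   check_for_nan (xvec, n);
  //   check_for_nan (yvec, n);
  //   #pragma omp parallel for
  //   for (octave_idx_type i = 0; i < n; i++)
  //     rvec[i] = (xvec[i] != 0.0) && (yvec[i] != 0.0);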

Then we can address the mappers. This is going to be more difficult as
I don't think we can currently universally rely on the mappers to
never throw. Further, unlike the fast loops, the mappers can take a
long time and so are supposed to be breakable.

I think this is one of the biggest drawbacks of the OpenMP attempts.
Previously, making computations breakable required just putting
octave_quit() somewhere. Now it must be done carefully, and inside
parallel sections we need to resort to something more complicated.
Maybe there's a better, exception-safe method of C++ parallelism out
there? What about boost::threads? (I don't know, I'd have to check it
out).
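
(For reference, the upcoming C++0x standard addresses exactly this with
std::exception_ptr, and Boost.Exception offers a similar boost::exception_ptr;
a minimal sketch of the idiom, with an illustrative worker:)

  #include <exception>
  #include <stdexcept>
  #include <thread>

  int
  main ()
  {
    std::exception_ptr eptr;

    std::thread worker ([&eptr] ()
      {
        try
          {
            throw std::runtime_error ("failure inside the worker");
          }
        catch (...)
          {
            eptr = std::current_exception ();  // capture in the throwing thread
          }
      });

    worker.join ();

    if (eptr)
      std::rethrow_exception (eptr);  // resurfaces in the main thread

    return 0;
  }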


-- 
RNDr. Jaroslav Hajek, PhD
computing expert & GNU Octave developer
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz


