From: David Bateman
Subject: Re: Using OpenMP in Octave
Date: Tue, 30 Mar 2010 22:29:45 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)

Jaroslav Hajek wrote:
Apparently, even exp scales well. Though personally I don't have
programs making heavy use of non-trivial elemental operations, I
daresay this is one area which is both easy enough and profitable
enough to start with.
Then let's start there.
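
For concreteness, the shape of what we're talking about for a mapper like exp is roughly this (a minimal sketch, not the actual applier code):

#include <cmath>
#include <cstddef>

// Sketch of an OpenMP-parallel elemental mapper; the real appliers in
// mx-inlines.cc are templated, but the idea is the same.
void
map_exp (std::size_t n, double *r, const double *x)
{
#pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    r[i] = std::exp (x[i]);
}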

Why make it tunable if we've done sufficient testing that the defaults
result in faster code in every case, or at least the majority of cases, and
the slowdowns are minor?


How do you imagine "sufficient testing"? I can only test on 2 or 3
machines, with their specific configuration. Is that sufficient? I was
thinking along the lines that if we make everything customizable,
users will be more willing to tune the constants for their machine. We
are not MathWorks, we don't have a team dedicated to this task. Had
everyone agreed with your original patch, I'd bet anything that the
constants would stay as they were unless you tuned them yourself.
I would have tuned them myself, in any case.

We have no chance to determine the best constant for all machines, so
I think users should be allowed to find their own.
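
A minimal sketch of what such a user-visible knob could look like as an oct-file (the name mt_switch_point and its default are hypothetical, just to illustrate the idea):

#include <octave/oct.h>

// Hypothetical tunable: element count above which an elemental
// operation switches to the parallel code path.
static octave_idx_type Vmt_switch_point = 1000;

DEFUN_DLD (mt_switch_point, args, ,
  "Query or set the parallel switch point (sketch only).")
{
  octave_value retval (double (Vmt_switch_point));

  if (args.length () == 1)
    Vmt_switch_point = args(0).idx_type_value ();

  return retval;
}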

The bus speeds aren't that different in most processors, so generic values
will probably be fine. If the optimal changeover from one algorithm to
another for a mapper function moves from 800 to 1000, do we really care?


Depends. 800 to 1000, maybe not. But what if it is 200 to 1000? The
optimal switch point also depends on the number of cores, so a fixed
number is sub-optimal in principle.
My point was that the serial part of the code includes the copying into the memory shared by the cores, so Amdahl's law already accounts for that, and the difference in wall time between a switch point of 200 and one of 1000, whether for the serial or the parallel version, might not be that large. The switch point is probably tolerant to variations in performance. Though maybe it's just better to parallelize everything, as the small cases aren't limiting in their computational time anyway.
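
For the sake of argument, counting the copy-in as serial, Amdahl's law gives speedup (N) = 1 / ((1 - p) + p / N) for a parallelizable fraction p. A quick back-of-the-envelope (illustrative numbers only):

#include <cstdio>

// Amdahl's law: the serial fraction bounds the achievable speedup no
// matter where the switch point lands.
static double
amdahl (double p, int ncores)
{
  return 1.0 / ((1.0 - p) + p / ncores);
}

int
main (void)
{
  // If copying to shared memory were ~20% of the wall time (p = 0.8),
  // 4 cores would give at most 2.5x, so a switch point of 200 versus
  // 1000 is unlikely to be visible.
  for (int n = 1; n <= 8; n *= 2)
    std::printf ("%d core(s): %.2fx\n", n, amdahl (0.8, n));

  return 0;
}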

By the way, I just realized there are further problems to be dealt
with. Watch this demo (with my yesterday's patch):

octave:1> a = rand (1e6, 1);
octave:2> b = rand (1e6, 1);
octave:3> a(end/2) = NaN;
octave:4> c = a & b;
error: invalid conversion of NaN to logical
terminate called after throwing an instance of 'octave_execution_exception'
panic: Aborted -- stopping myself...
Aborted

Bang. What happened? Yes, the answer is: exceptions. The invalid NaN-to-logical
conversion is currently handled via octave_execution_exception.
And exceptions don't play nicely with OpenMP. Current OpenMP mandates
that exceptions be caught in the same thread that raised them. For
parallel loops, that essentially means within the same iteration.
This means enclosing each loop body in a try-catch block, watching for
both execution exceptions and interrupt exceptions, storing them somehow
when caught, and rethrowing them once we're back.
This particular case is both easy to trigger and easy to fix. But what
about others? I think some of the mappers implemented in libcruft can
raise exceptions (through the Fortran XSTOPX), and lots of stuff can
raise the interrupt exception (for good reasons).
What should we do? Should Ctrl-C be disabled while multithreading is in
effect? Or shall we simply make sure manually that there are no
octave_quit or F77_XFCN calls in the parallel regions?
OK, we have to deal with exceptions in a generic manner in the applier functions. I suppose we need to do something like this:

// As in mx-inlines.cc; assumes <algorithm> and "quit.h" are available.
template <class R, class X, class Y>
inline Array<R>
do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
                 void (*op) (size_t, R *, const X *, const Y *),
                 const char *opname)
{
  dim_vector dx = x.dims (), dy = y.dims ();
  if (dx == dy)
    {
      Array<R> r (dx);
      octave_idx_type n = r.numel ();
#ifdef HAVE_OPENMP
      // A non-positive chunk size means "stay serial".
      octave_idx_type chunk = mt_compute_chunk (sizeof (R), n);
      if (chunk > 0)
        {
          // The flag is only ever set to true, so the race between
          // threads writing it is benign in practice.
          bool exception_thrown = false;
          sig_atomic_t saved_octave_exception_state = 0;
          sig_atomic_t saved_octave_interrupt_state = 0;

          R *rvec = r.fortran_vec ();
          const X *xvec = x.data ();
          const Y *yvec = y.data ();

OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec, yvec, saved_octave_exception_state, saved_octave_interrupt_state, exception_thrown) schedule (static, 1))
          for (octave_idx_type i = 0; i < n; i += chunk)
            {
              // octave_set_current_context is setjmp-based; an interrupt
              // or execution exception raised inside op () longjmps back
              // here, in the same thread, as OpenMP requires.
              octave_jmp_buf saved_context;

              octave_save_current_context (saved_context);

              if (octave_set_current_context)
                {
                  // We got here via longjmp: record what happened and
                  // let the remaining chunks run to completion.
                  octave_restore_current_context (saved_context);

                  exception_thrown = true;
                  saved_octave_exception_state = octave_exception_state;
                  saved_octave_interrupt_state = octave_interrupt_state;
                }
              else
                {
                  op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
                  octave_restore_current_context (saved_context);
                }
            }

          if (exception_thrown)
            {
              // Put the state captured in the failing thread back into
              // the globals that octave_rethrow_exception () inspects.
              octave_exception_state = saved_octave_exception_state;
              octave_interrupt_state = saved_octave_interrupt_state;
              octave_rethrow_exception ();
            }
        }
      else
#endif
        op (n, r.fortran_vec (), x.data (), y.data ());

      return r;
    }
  else
    {
      gripe_nonconformant (opname, dx, dy);
      return Array<R> ();
    }
}

That will do what we want, but it's no longer a simple matter of including a pragma wrapped in a macro. This question of exceptions will come up even more as we try to parallelize at a coarser grain, and you probably have similar issues in your parcellfun implementation already.

There are a lot of open questions here. I would prefer if we first
focused on multithreading the most computationally intensive stuff.
BLAS and FFTW can already do it. BLAS is usually parallelized through
OpenMP, so it would be nice if we wrapped omp_set_num_threads ()
so that users could influence the number of threads from Octave. IIRC
FFTW uses a different multi-threading mechanism, so maybe we could
address both in a unified way?
Are you sure that MKL and ACML both use OpenMP for multithreading? What about other vendors? In any case, wrapping all of this in a function that sets the number of threads from Octave should be easy enough. Note that Matlab deprecated the maxNumCompThreads function

http://www.mathworks.com/access/helpdesk/help/techdoc/ref/maxnumcompthreads.html
http://www.mathworks.com/access/helpdesk/help/techdoc/rn/bry1ecg-1.html#br1ask3-1

in recent versions of Matlab, exactly because libraries like BLAS and FFTW are multithreaded. So we'd be going against a decision Matlab took because they consider they can't control the number of threads used by these libraries.
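
Something along these lines should do as a wrapper (a sketch only; the name num_comp_threads is made up, and whether the BLAS actually honors it depends on how it was built):

#include <octave/oct.h>

#ifdef HAVE_OPENMP
#include <omp.h>
#endif

// Hypothetical wrapper around omp_set_num_threads (); a pthreads-based
// BLAS or FFTW would need its own mechanism.
DEFUN_DLD (num_comp_threads, args, ,
  "Query or set the number of OpenMP computation threads (sketch).")
{
#ifdef HAVE_OPENMP
  octave_value retval (double (omp_get_max_threads ()));

  if (args.length () == 1)
    omp_set_num_threads (args(0).int_value ());

  return retval;
#else
  return octave_value (1.0);
#endif
}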

I think stuff like this has a far bigger impact than parallelizing
plus and minus. Besides, Octave's got no JIT, so even if you
parallelize plus, a+b+c will still be slower than it could be if it
was done by a single serial loop.
Not that it really matters; as I already said these operations are
seldom a bottleneck, and if they are, rewriting to C++ is probably the
way to go.
Yes, but as you said yourself, it's a matter of communication. If someone does a benchmark of a+b, stupid as that may be, I'd like Octave to be just as fast as Matlab. Kai's results showed that his quad-core AMD improved the performance of the binary operators with multithreading. JIT is another issue; yes, the cases where it makes the most difference aren't the ones that have the largest impact on well-written code. But there are a lot of idiots out there writing bad code, benchmarking Octave against Matlab, and then badmouthing Octave. So yes, a JIT would be good to have as well :-)
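
To make the single-loop point concrete: without a JIT, a+b+c materializes a temporary for a+b and sweeps memory twice, whereas hand-written C++ does it in one pass (illustrative only):

#include <cstddef>
#include <vector>

// One-pass evaluation of r = a + b + c; Octave today instead computes
// t = a + b (one full pass over memory), then r = t + c (a second).
std::vector<double>
fused_add (const std::vector<double>& a,
           const std::vector<double>& b,
           const std::vector<double>& c)
{
  std::vector<double> r (a.size ());

  for (std::size_t i = 0; i < a.size (); i++)
    r[i] = a[i] + b[i] + c[i];

  return r;
}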

Yes; however, my first targets were the mapper functions, though I started with the unary and binary op appliers as a matter of starting somewhere.
Perhaps we had best identify the most computationally heavy
functions and target those first.
I still think it's not a good idea to put multithreading stuff into
3.4.0. There's been a lot of performance improvements anyway.
I think multithreading is one area where we should not blindly copy
Matlab, and it deserves a careful approach.
If it's isolated and disabled by default, I don't see why it shouldn't go into 3.4.0 to let people start playing with this code. There seem to be quite a few people regularly building and testing Octave now, and letting them have this functionality in an experimental form can only be a good thing, as long as, of course, it doesn't put the 3.4.0 release at risk.

D.



