From: David Bateman
Subject: Re: Using OpenMP in Octave
Date: Tue, 30 Mar 2010 22:29:45 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)

Jaroslav Hajek wrote:
Apparently, even exp scales well. Though personally I don't have
programs making heavy use of non-trivial elemental operations, I
daresay this is one area which is both easy enough and profitable
enough to start with.
Then let's start there.
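
For concreteness, the shape of what we're talking about for a mapper like exp is roughly this (a minimal sketch, not the actual applier code):

#include <cmath>
#include <cstddef>

// Sketch of an OpenMP-parallel elemental mapper; the real appliers in
// mx-inlines.cc are templated, but the idea is the same.
void
map_exp (std::size_t n, double *r, const double *x)
{
#pragma omp parallel for
  for (long i = 0; i < static_cast<long> (n); i++)
    r[i] = std::exp (x[i]);
}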

Why make it tunable if we've done sufficient testing that the defaults
result in faster code in every case, or at least the majority of cases, and
the slowdowns are minor?


How do you imagine "sufficient testing"? I can only test on 2 or 3
machines, with their specific configuration. Is that sufficient? I was
thinking along the lines that if we make everything customizable,
users will be more willing to tune the constants for their machine. We
are not MathWorks, we don't have a team dedicated to this task. Had
everyone agreed with your original patch, I'd bet anything that the
constants would stay as they were unless you tuned them yourself.
I would have tuned them myself, in any case.

We have no chance to determine the best constant for all machines, so
I think users should be allowed to find their own.
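
A minimal sketch of what such a user-visible knob could look like as an oct-file (the name mt_switch_point and its default are hypothetical, just to illustrate the idea):

#include <octave/oct.h>

// Hypothetical tunable: element count above which an elemental
// operation switches to the parallel code path.
static octave_idx_type Vmt_switch_point = 1000;

DEFUN_DLD (mt_switch_point, args, ,
  "Query or set the parallel switch point (sketch only).")
{
  octave_value retval (double (Vmt_switch_point));

  if (args.length () == 1)
    Vmt_switch_point = args(0).idx_type_value ();

  return retval;
}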

The bus speeds aren't that different in most processors, so generic values
will probably be fine. If the optimal changeover from one algorithm to
another for a mapper function moves from 800 to 1000, do we really care?


Depends. 800 to 1000, maybe not. But what if it is 200 to 1000? The
optimal switch point also depends on the number of cores, so a fixed
number is sub-optimal in principle.
My point was that the serial part of the code includes the copying into the memory shared by the cores, so Amdahl's law already accounts for that, and the difference in wall time between a switch point of 200 and one of 1000, whether for the serial or the parallel version, might not be that large. The switch point is probably tolerant to variations in performance. Though maybe it's just better to parallelize everything, as the small cases aren't limiting in their computational time anyway.
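
For the sake of argument, counting the copy-in as serial, Amdahl's law gives speedup (N) = 1 / ((1 - p) + p / N) for a parallelizable fraction p. A quick back-of-the-envelope (illustrative numbers only):

#include <cstdio>

// Amdahl's law: the serial fraction bounds the achievable speedup no
// matter where the switch point lands.
static double
amdahl (double p, int ncores)
{
  return 1.0 / ((1.0 - p) + p / ncores);
}

int
main (void)
{
  // If copying to shared memory were ~20% of the wall time (p = 0.8),
  // 4 cores would give at most 2.5x, so a switch point of 200 versus
  // 1000 is unlikely to be visible.
  for (int n = 1; n <= 8; n *= 2)
    std::printf ("%d core(s): %.2fx\n", n, amdahl (0.8, n));

  return 0;
}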

By the way, I just realized there are further problems to be dealt
with. Watch this demo (with my yesterday's patch):

octave:1> a = rand (1e6, 1);
octave:2> b = rand (1e6, 1);
octave:3> a(end/2) = NaN;
octave:4> c = a & b;
error: invalid conversion of NaN to logical
terminate called after throwing an instance of 'octave_execution_exception'
panic: Aborted -- stopping myself...
Aborted

Bang. What happened? Yes, the answer is: exceptions. The invalid NaN-to-logical
conversion is currently handled via octave_execution_exception.
And exceptions don't play nicely with OpenMP. Current OpenMP mandates
that exceptions be caught in the same thread that raised them. For
parallel loops, that essentially means within the same iteration.
This means enclosing each loop body in a try-catch block, watching for
both execution exceptions and interrupt exceptions, storing them somehow
when caught, and rethrowing them once we're back.
This particular case is both easy to trigger and easy to fix. But what
about others? I think some of the mappers implemented in libcruft can
raise exceptions (through the Fortran XSTOPX), and lots of stuff can
raise the interrupt exception (for good reasons).
What should we do? Should Ctrl-C be disabled while multithreading is in
effect? Or shall we simply make sure manually that there are no
octave_quit or F77_XFCN calls in the parallel regions?
OK, we have to deal with exceptions in a generic manner in the applier functions. I suppose we need to do something like this:

// As in mx-inlines.cc; assumes <algorithm> and "quit.h" are available.
template <class R, class X, class Y>
inline Array<R>
do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
                 void (*op) (size_t, R *, const X *, const Y *),
                 const char *opname)
{
  dim_vector dx = x.dims (), dy = y.dims ();
  if (dx == dy)
    {
      Array<R> r (dx);
      octave_idx_type n = r.numel ();
#ifdef HAVE_OPENMP
      // A non-positive chunk size means "stay serial".
      octave_idx_type chunk = mt_compute_chunk (sizeof (R), n);
      if (chunk > 0)
        {
          // The flag is only ever set to true, so the race between
          // threads writing it is benign in practice.
          bool exception_thrown = false;
          sig_atomic_t saved_octave_exception_state = 0;
          sig_atomic_t saved_octave_interrupt_state = 0;

          R *rvec = r.fortran_vec ();
          const X *xvec = x.data ();
          const Y *yvec = y.data ();

OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec, yvec, saved_octave_exception_state, saved_octave_interrupt_state, exception_thrown) schedule (static, 1))
          for (octave_idx_type i = 0; i < n; i += chunk)
            {
              // octave_set_current_context is setjmp-based; an interrupt
              // or execution exception raised inside op () longjmps back
              // here, in the same thread, as OpenMP requires.
              octave_jmp_buf saved_context;

              octave_save_current_context (saved_context);

              if (octave_set_current_context)
                {
                  // We got here via longjmp: record what happened and
                  // let the remaining chunks run to completion.
                  octave_restore_current_context (saved_context);

                  exception_thrown = true;
                  saved_octave_exception_state = octave_exception_state;
                  saved_octave_interrupt_state = octave_interrupt_state;
                }
              else
                {
                  op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
                  octave_restore_current_context (saved_context);
                }
            }

          if (exception_thrown)
            {
              // Put the state captured in the failing thread back into
              // the globals that octave_rethrow_exception () inspects.
              octave_exception_state = saved_octave_exception_state;
              octave_interrupt_state = saved_octave_interrupt_state;
              octave_rethrow_exception ();
            }
        }
      else
#endif
        op (n, r.fortran_vec (), x.data (), y.data ());

      return r;
    }
  else
    {
      gripe_nonconformant (opname, dx, dy);
      return Array<R> ();
    }
}

That will do what we want, but it's no longer a simple matter of including a pragma wrapped in a macro. This question of exceptions will come up even more as we try to parallelize at a coarser grain, and you probably have similar issues in your parcellfun implementation already.

There are a lot of open questions here. I would prefer if we first
focused on multithreading the most computationally intensive stuff.
BLAS and FFTW can already do it. BLAS is usually parallelized through
OpenMP, so it would be nice if we wrapped omp_set_num_threads ()
so that users could influence the number of threads from Octave. IIRC
FFTW uses a different multi-threading mechanism, so maybe we could
address both in a unified way?
Are you sure that MKL and ACML both use OpenMP for multithreading? What about other vendors? In any case, wrapping all of this in a function that sets the number of threads from Octave should be easy enough. Note that Matlab deprecated the maxNumCompThreads function

http://www.mathworks.com/access/helpdesk/help/techdoc/ref/maxnumcompthreads.html
http://www.mathworks.com/access/helpdesk/help/techdoc/rn/bry1ecg-1.html#br1ask3-1

in recent versions of Matlab, exactly because libraries like BLAS and FFTW are multithreaded. So we'd be going against a decision Matlab took because they consider they can't control the number of threads used by these libraries.
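
Something along these lines should do as a wrapper (a sketch only; the name num_comp_threads is made up, and whether the BLAS actually honors it depends on how it was built):

#include <octave/oct.h>

#ifdef HAVE_OPENMP
#include <omp.h>
#endif

// Hypothetical wrapper around omp_set_num_threads (); a pthreads-based
// BLAS or FFTW would need its own mechanism.
DEFUN_DLD (num_comp_threads, args, ,
  "Query or set the number of OpenMP computation threads (sketch).")
{
#ifdef HAVE_OPENMP
  octave_value retval (double (omp_get_max_threads ()));

  if (args.length () == 1)
    omp_set_num_threads (args(0).int_value ());

  return retval;
#else
  return octave_value (1.0);
#endif
}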

I think stuff like this has a far bigger impact than parallelizing
plus and minus. Besides, Octave's got no JIT, so even if you
parallelize plus, a+b+c will still be slower than it could be if it
was done by a single serial loop.
Not that it really matters; as I already said these operations are
seldom a bottleneck, and if they are, rewriting to C++ is probably the
way to go.
Yes, but as you said yourself, it's a matter of communication. If someone does a benchmark of a+b, stupid as that may be, I'd like Octave to be just as fast as Matlab. Kai's results showed that his quad-core AMD improved the performance of the binary operators with multithreading. JIT is another issue; yes, the cases where it makes the most difference aren't the ones that have the largest impact on well-written code. But there are a lot of idiots out there writing bad code, benchmarking Octave against Matlab, and then badmouthing Octave. So yes, a JIT would be good to have as well :-)
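
To make the single-loop point concrete: without a JIT, a+b+c materializes a temporary for a+b and sweeps memory twice, whereas hand-written C++ does it in one pass (illustrative only):

#include <cstddef>
#include <vector>

// One-pass evaluation of r = a + b + c; Octave today instead computes
// t = a + b (one full pass over memory), then r = t + c (a second).
std::vector<double>
fused_add (const std::vector<double>& a,
           const std::vector<double>& b,
           const std::vector<double>& c)
{
  std::vector<double> r (a.size ());

  for (std::size_t i = 0; i < a.size (); i++)
    r[i] = a[i] + b[i] + c[i];

  return r;
}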

Yes; however, my first targets were the mapper functions, though I started with the unary and binary op appliers as a matter of starting somewhere.
Perhaps we had best identify the most computationally heavy
functions and target those first.
I still think it's not a good idea to put multithreading stuff into
3.4.0. There's been a lot of performance improvements anyway.
I think multithreading is one area where we should not blindly copy
Matlab, and it deserves a careful approach.
If it's isolated and disabled by default, I don't see why it shouldn't go into 3.4.0 to let people start playing with this code. There seem to be quite a few people regularly building and testing Octave now, and letting them have this functionality in an experimental form can only be a good thing, as long as, of course, it doesn't put the 3.4.0 release at risk.

D.



