Using OpenMP in Octave
From: David Bateman
Subject: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 01:37:42 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)
I've had a short discussion with Jaroslav and John off list about
implementing OpenMP multi-threading in Octave and want to bring it back
to the list. The use of OpenMP in Octave 3.4 will still be too
experimental, so if we include the code now I propose to make it off by
default. The changeset I committed on Saturday adds the autoconf code
to probe for OpenMP support, but only if the "--enable-openmp"
configure option is used. Currently it only probes for OpenMP support
with gcc and msvc (though the msvc code is untested).
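For reference, such a configure probe typically just checks that a
trivial OpenMP program compiles, links and runs. A minimal sketch of
that kind of test program (the actual autoconf macro may differ) is:

#include <omp.h>

int
main (void)
{
  int nthreads = 0;
  // This only compiles when OpenMP is enabled
  // (e.g. gcc -fopenmp, msvc /openmp).
#pragma omp parallel
  {
#pragma omp master
    nthreads = omp_get_num_threads ();
  }
  return nthreads > 0 ? 0 : 1;
}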
So now the question is how we should use multithreading support in
Octave. The way I'd like to see it used is OpenMP for the fine/medium
grain parallelisation and MPI for the coarse grain (i.e. the outer loop
of the Octave m-file) parallelisation. The reason is that the coarse
grain parallelisation will in general have a low amount of network
communication as a percentage of the computation time and so is well
adapted to clusters, where MPI rules. I therefore propose not to treat
such coarse parallelisation at all and to leave the openmpi_ext code to
deal with that.
The OpenMP directives should be introduced at a level below which we
know the code to be thread-safe. A reasonable place to start looking
for multi-threading is the mx_inline code. Jaroslav Hajek suggested
something like:
template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  int nthreads = omp_get_num_threads ();
  octave_idx_type n = r.numel (), chunk = n / nthreads;
  R *rvec = r.fortran_vec ();
  const X *xvec = x.fortran_vec ();
#pragma omp parallel for
  for (octave_idx_type i = 0; i < n; i += chunk)
    op (std::min (chunk, n - i), rvec + i, xvec + i);
  return r;
}
The idea is that this applier will be used with many different
mx_inline_* functions, which limits the intrusion into the code, and
the parallelisation is chunked so that each thread treats a contiguous
block of data. There is currently no loop in these appliers, as the
underlying mx_inline_* function itself contains the loop. Therefore,
if we want to multi-thread the code at this point, we need to add a
loop as suggested by Jaroslav.
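For context, here is a sketch of how an elementwise kernel would plug
into this applier. The kernel my_conj below is hypothetical (it is not
one of the existing mx_inline_* functions), and Complex is Octave's
typedef for std::complex<double> from oct-cmplx.h:

#include <complex>
#include "oct-cmplx.h"

// Hypothetical kernel in the style of the mx_inline_* functions: it
// processes a contiguous block of n elements, so each thread can be
// handed its own chunk.
static void
my_conj (size_t n, Complex *r, const Complex *x)
{
  for (size_t i = 0; i < n; i++)
    r[i] = std::conj (x[i]);
}

// R and X are deduced from the function pointer:
// Array<Complex> y = do_mx_unary_op (x, my_conj);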
I think chunk should be calculated as

chunk = (n + nthreads - 1) / nthreads;

to really limit the number of loop iterations to the number of threads.
For example, with n = 10 and nthreads = 4 this ceiling division gives
chunk = 3 and four chunks of sizes 3, 3, 3 and 1, whereas the
truncating division above would give five chunks of 2. However, this
for loop has an overhead that might be significant, and it probably
makes sense to have a minimum value of r.numel() before threading is
used. Something like
template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  octave_idx_type n = r.numel ();
#ifdef HAVE_OPENMP
  // Only use multithreading for large arrays to avoid excessive overhead.
  // FIXME: 1000 is an arbitrary value.  Do some benchmarking.
  if (n > 1000)
    {
      // omp_get_max_threads gives the number of threads available;
      // omp_get_num_threads would return 1 outside a parallel region.
      int nthreads = omp_get_max_threads ();
      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
      R *rvec = r.fortran_vec ();
      const X *xvec = x.fortran_vec ();
      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
      for (octave_idx_type i = 0; i < n; i += chunk)
        op (std::min (chunk, n - i), rvec + i, xvec + i);
    }
  else
#endif
    op (n, r.fortran_vec (), x.fortran_vec ());
  return r;
}
where the macro OCTAVE_OMP_PRAGMA is just

#ifdef HAVE_OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif

(the #x stringification is needed because _Pragma expects a string
literal, while the callers pass the pragma text unquoted).
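As a standalone illustration of how the macro behaves (a sketch only;
it keys off the compiler-defined _OPENMP macro rather than configure's
HAVE_OPENMP so that it can be compiled outside the Octave tree):

#ifdef _OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif

#include <cstdio>

int
main (void)
{
  // Expands to _Pragma ("omp parallel for"), i.e. the same as writing
  // #pragma omp parallel for; without OpenMP it expands to nothing and
  // the loop simply runs serially.
  OCTAVE_OMP_PRAGMA (omp parallel for)
  for (int i = 0; i < 4; i++)
    std::printf ("i = %d\n", i);

  return 0;
}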
However, the value 1000 is arbitrary and a little benchmarking is
needed. I attach a first experimental changeset for those who want to
experiment; it applies the same treatment to the other mx_inline
appliers and to the dimension-wise reduction code. Configured with
"--enable-openmp" against a recent tip, this code successfully runs
through "make check", but I don't know if the choice of array size for
switching between the single- and multi-threaded code is optimal.
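One way to calibrate that threshold outside Octave is a standalone
timing loop along the following lines (a sketch; the kernel, sizes and
compile flags are illustrative only):

// Compile with e.g.:  g++ -O2 -fopenmp bench.cc -o bench
#include <omp.h>
#include <cstdio>
#include <vector>

int
main (void)
{
  for (long n = 100; n <= 10000000; n *= 10)
    {
      std::vector<double> x (n, 1.0), r (n);

      double t0 = omp_get_wtime ();
#pragma omp parallel for
      for (long i = 0; i < n; i++)
        r[i] = x[i] * x[i];
      double t1 = omp_get_wtime ();

      // Touch the result so the timed loop is not optimized away.
      double s = 0;
      for (long i = 0; i < n; i++)
        s += r[i];

      std::printf ("n = %8ld: %g s (checksum %g)\n", n, t1 - t0, s);
    }
  return 0;
}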
A couple of interesting tests might be
n = 300; a = ones(n,n,n);
tic; sum(a,1); toc
tic; sum(a,2); toc
tic; sum(a,3); toc
n = 999; a = (1+1i)*ones (n,n); tic; a = real(a); toc
n = 1001; a = (1+1i)*ones (n,n); tic; a = real(a); toc
before and after the change. Unfortunately I'm developing on an Atom,
so I won't personally see much gain from this multi-threading.
Cheers
David
# HG changeset patch
# User David Bateman <address@hidden>
# Date 1269819260 -7200
# Node ID 7fd295e13462d8c1c38b32d98e41a1a1946f1918
# Parent bcabc1c4f20c0244793a31f60261b2a3f5a2a575
Experimental OpenMP code
diff --git a/liboctave/mx-inlines.cc b/liboctave/mx-inlines.cc
--- a/liboctave/mx-inlines.cc
+++ b/liboctave/mx-inlines.cc
@@ -35,6 +35,7 @@
#include "oct-cmplx.h"
#include "oct-locbuf.h"
#include "oct-inttypes.h"
+#include "oct-openmp.h"
#include "Array.h"
#include "Array-util.h"
@@ -299,12 +300,29 @@
// a pointer, to allow the compiler reduce number of instances.
template <class R, class X>
-inline Array<R>
+inline Array<R>
do_mx_unary_op (const Array<X>& x,
void (*op) (size_t, R *, const X *))
{
Array<R> r (x.dims ());
- op (r.numel (), r.fortran_vec (), x.data ());
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ // Only use multithreading for large arrays to avoid excessive overhead.
+ // FIXME 1000 is an arbitrary value. Do some benchmarking
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ const X *xvec = x.fortran_vec ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, xvec + i);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x.fortran_vec ());
+
return r;
}
@@ -329,11 +347,23 @@
do_mx_inplace_op (Array<R>& r,
void (*op) (size_t, R *))
{
- op (r.numel (), r.fortran_vec ());
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i);
+ }
+ else
+#endif
+ op (n, r.fortran_vec ());
return r;
}
-
template <class R, class X, class Y>
inline Array<R>
do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
@@ -344,7 +374,22 @@
if (dx == dy)
{
Array<R> r (dx);
- op (r.length (), r.fortran_vec (), x.data (), y.data ());
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ const X *xvec = x.data ();
+ const Y *yvec = y.data ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec, yvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x.data (), y.data ());
return r;
}
else
@@ -360,7 +405,21 @@
void (*op) (size_t, R *, const X *, Y))
{
Array<R> r (x.dims ());
- op (r.length (), r.fortran_vec (), x.data (), y);
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ const X *xvec = x.data ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, xvec + i, y);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x.data (), y);
return r;
}
@@ -370,7 +429,21 @@
void (*op) (size_t, R *, X, const Y *))
{
Array<R> r (y.dims ());
- op (r.length (), r.fortran_vec (), x, y.data ());
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ const Y *yvec = y.data ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, yvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, x, yvec + i);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x, y.data ());
return r;
}
@@ -382,7 +455,23 @@
{
dim_vector dr = r.dims (), dx = x.dims ();
if (dr == dx)
- op (r.length (), r.fortran_vec (), x.data ());
+ {
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ const X *xvec = x.data ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, xvec + i);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x.data ());
+ }
else
gripe_nonconformant (opname, dr, dx);
return r;
@@ -393,7 +482,20 @@
do_ms_inplace_op (Array<R>& r, const X& x,
void (*op) (size_t, R *, X))
{
- op (r.length (), r.fortran_vec (), x);
+ octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+ if (n > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+ R *rvec = r.fortran_vec ();
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+ for (octave_idx_type i = 0; i < n; i += chunk)
+ op (std::min (chunk, n - i), rvec + i, x);
+ }
+ else
+#endif
+ op (n, r.fortran_vec (), x);
return r;
}
@@ -1145,7 +1247,31 @@
dims.chop_trailing_singletons ();
Array<R> ret (dims);
- mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
+#ifdef HAVE_OPENMP
+ if (l * u > 1000)
+ {
+ int nthreads = omp_get_num_threads ();
+ const T *svec = src.data ();
+ R *rvec = ret.fortran_vec ();
+ if (u < nthreads)
+ {
+ octave_idx_type chunk = (l + nthreads - 1) / nthreads;
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+ for (octave_idx_type i = 0; i < l; i += chunk)
+ mx_red_op (svec + i * n, rvec + i, std::min (chunk, l - i), n, u);
+ }
+ else
+ {
+ octave_idx_type chunk = (u + nthreads - 1) / nthreads;
+ OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+ for (octave_idx_type i = 0; i < u; i += chunk)
+ mx_red_op (svec + i * l * n, rvec + i * l , l, n,
+ std::min (chunk, u - i));
+ }
+ }
+ else
+#endif
+ mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
return ret;
}