Using OpenMP in Octave


From: David Bateman
Subject: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 01:37:42 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)

I've had a short discussion with Jaroslav and John off-list about implementing OpenMP multi-threading in Octave and want to bring it back to the list. The use of OpenMP in Octave 3.4 will still be too experimental, so if we include the code now I propose to make it off by default. The changeset I committed on Saturday adds the autoconf code to probe for OpenMP support, but only if the "--enable-openmp" configure option is used. Currently it only probes for OpenMP support with gcc and MSVC (though the MSVC code is untested).

So the question now is how we should use multi-threading in Octave. The way I'd like to see it used is OpenMP for fine/medium grain parallelisation and MPI for coarse grain parallelisation (i.e. the outer loop of an Octave m-file). The reason is that coarse grain parallelisation will in general have a low amount of network communication as a percentage of the computation time and so is well adapted to clusters, where MPI rules. I therefore propose not to treat such coarse parallelisation at all and to leave the openmpi_ext code to deal with that.

The OpenMP directives should be introduced at a level below which we know the code to be thread-safe. A reasonable place to start looking for multi-threading is the mx_inline code. Jaroslav suggested something like

Jaroslav Hajek wrote:
template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  int nthreads = omp_get_num_threads ();
  octave_idx_type n = r.numel (), chunk = n / nthreads;
  R *rvec = r.fortran_vec ();
  const X *xvec = x.fortran_vec ();
#pragma omp parallel for
  for (octave_idx_type i = 0; i < n; i += chunk)
    op (std::min (chunk, n - i), rvec + i, xvec + i);
  return r;
}
The idea is that the applier function will be used with many different mx_inline_* functions, which limits the intrusion into the code, and that the parallelisation is chunked to force each thread to treat a contiguous block of data. There is currently no loop in these appliers, as the underlying mx_inline_* function itself contains the loop, so if we want to multi-thread the code at this point we need to add a loop as Jaroslav suggests.
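
For anyone who doesn't have mx-inlines.cc in front of them, the kernels these appliers dispatch to are just plain loops over contiguous memory. The name below is invented for illustration, but the real mx_inline_* functions have the same shape and the same (size_t, R *, const X *) signature, which is what makes it safe to call them from several threads on disjoint sub-ranges of the same array:

// Illustrative stand-in for an mx_inline_* unary kernel (the name is
// made up).  It writes r[0..n), reads x[0..n) and touches no shared
// state, so separate threads can run it on non-overlapping chunks.
template <class R, class X>
void
example_unary_kernel (size_t n, R *r, const X *x)
{
  for (size_t i = 0; i < n; i++)
    r[i] = -x[i];
}

// The applier is then reused unchanged for every such kernel, e.g.
//   Array<double> y = do_mx_unary_op (x, example_unary_kernel<double, double>);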

I think chunk should be calculated as

chunk = (n + nthreads - 1) / nthreads;

This really limits the number of loop iterations to the number of threads: with Jaroslav's chunk = n / nthreads, n = 10 on 4 threads gives chunk = 2 and five iterations, whereas the ceiling form gives chunk = 3 and exactly four. However, this for loop has an overhead that might be significant, and it probably makes sense to require a minimum value of r.numel() before threading is used. Something like

template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  octave_idx_type n = r.numel ();
#ifdef HAVE_OPENMP
  // Only use multithreading for large arrays to avoid excessive overhead.
  // FIXME 1000 is an arbitrary value. Do some benchmarking
  if (n > 1000)
    {
      int nthreads = omp_get_num_threads ();
      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
      R *rvec = r.fortran_vec ();
      const X *xvec = x.fortran_vec ();
      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
      for (octave_idx_type i = 0; i < n; i += chunk)
        op (std::min (chunk, n - i), rvec + i, xvec + i);
    }
  else
#endif
    op (n, r.fortran_vec (), x.fortran_vec ());

  return r;
}

where the macro OCTAVE_OMP_PRAGMA is just

#ifdef HAVE_OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif
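
To see what that buys us, here is a tiny self-contained illustration (the scale function is invented, not from the changeset): the same source compiles whether or not OpenMP is available, because without HAVE_OPENMP the macro expands to nothing and the loop just runs serially.

#ifdef HAVE_OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif

void
scale (int n, double *r, double s)
{
  // With HAVE_OPENMP this line becomes _Pragma ("omp parallel for"),
  // which the compiler treats like the corresponding #pragma; without
  // it the macro vanishes and this is an ordinary serial loop.
  OCTAVE_OMP_PRAGMA (omp parallel for)
  for (int i = 0; i < n; i++)
    r[i] *= s;
}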

However, the value 1000 is arbitrary and a little benchmarking is needed. I attach a first experimental changeset for those who want to experiment. Configured with "--enable-openmp" against a recent tip, this code successfully runs through "make check", but I don't know whether the choice of the array size at which to switch between single-threaded and multi-threaded code is optimal.
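
For those who want to calibrate that threshold outside of Octave first, a throwaway harness along the lines below should do (just a sketch, not part of the changeset; note that it sizes its chunks with omp_get_max_threads, which reports the thread count before the parallel region is entered). Build it once with and once without "-fopenmp -DHAVE_OPENMP" and compare the timings for each n.

// Throwaway benchmark sketch (not part of the changeset) for locating
// the point where chunked multi-threading starts to pay off.
#include <algorithm>
#include <cstdio>
#include <ctime>
#include <vector>
#ifdef HAVE_OPENMP
#include <omp.h>
#endif

// Stand-in for an mx_inline_* kernel.
static void
negate (size_t n, double *r, const double *x)
{
  for (size_t i = 0; i < n; i++)
    r[i] = -x[i];
}

int
main (void)
{
  const int reps = 200;
  for (long n = 100; n <= 10000000; n *= 10)
    {
      std::vector<double> x (n, 1.0), r (n);
#ifdef HAVE_OPENMP
      // Wall-clock time for the chunked, multi-threaded path.
      double t0 = omp_get_wtime ();
      for (int rep = 0; rep < reps; rep++)
        {
          int nthreads = omp_get_max_threads ();
          long chunk = (n + nthreads - 1) / nthreads;
#pragma omp parallel for
          for (long i = 0; i < n; i += chunk)
            negate (std::min (chunk, n - i), &r[0] + i, &x[0] + i);
        }
      double elapsed = omp_get_wtime () - t0;
#else
      // CPU time is close enough to wall time for the serial build.
      clock_t t0 = clock ();
      for (int rep = 0; rep < reps; rep++)
        negate (n, &r[0], &x[0]);
      double elapsed = double (clock () - t0) / CLOCKS_PER_SEC;
#endif
      std::printf ("n = %8ld: %g s\n", n, elapsed);
    }
  return 0;
}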

Within Octave itself, a couple of interesting tests might be

n  = 300; a = ones(n,n,n);
tic; sum(a,1); toc
tic; sum(a,2); toc
tic; sum(a,3); toc
n = 999; a = (1+1i)*ones (n,n); tic; a = real(a); toc
n = 1001; a = (1+1i)*ones (n,n); tic; a = real(a); toc

before and after the change. Unfortunately, I'm developing on an Atom, so I personally won't see much gain from this multi-threading.

Cheers
David


# HG changeset patch
# User David Bateman <address@hidden>
# Date 1269819260 -7200
# Node ID 7fd295e13462d8c1c38b32d98e41a1a1946f1918
# Parent  bcabc1c4f20c0244793a31f60261b2a3f5a2a575
Experimental OpenMP code

diff --git a/liboctave/mx-inlines.cc b/liboctave/mx-inlines.cc
--- a/liboctave/mx-inlines.cc
+++ b/liboctave/mx-inlines.cc
@@ -35,6 +35,7 @@
 #include "oct-cmplx.h"
 #include "oct-locbuf.h"
 #include "oct-inttypes.h"
+#include "oct-openmp.h"
 #include "Array.h"
 #include "Array-util.h"
 
@@ -299,12 +300,29 @@
 // a pointer, to allow the compiler reduce number of instances.
 
 template <class R, class X>
-inline Array<R> 
+inline Array<R>
 do_mx_unary_op (const Array<X>& x,
                 void (*op) (size_t, R *, const X *))
 {
   Array<R> r (x.dims ());
-  op (r.numel (), r.fortran_vec (), x.data ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  // Only use multithreading for large arrays to avoid excessive overhead.
+  // FIXME 1000 is an arbitrary value. Do some benchmarking
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const X *xvec = x.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, xvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x.fortran_vec ());
+
   return r;
 }
 
@@ -329,11 +347,23 @@
 do_mx_inplace_op (Array<R>& r,
                   void (*op) (size_t, R *))
 {
-  op (r.numel (), r.fortran_vec ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec ());
   return r;
 }
 
-
 template <class R, class X, class Y>
 inline Array<R> 
 do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
@@ -344,7 +374,22 @@
   if (dx == dy)
     {
       Array<R> r (dx);
-      op (r.length (), r.fortran_vec (), x.data (), y.data ());
+      octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+      if (n > 1000)
+        {
+          int nthreads = omp_get_num_threads ();
+          octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+          R *rvec = r.fortran_vec ();
+          const X *xvec = x.data ();
+          const Y *yvec = y.data ();
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec, yvec))
+          for (octave_idx_type i = 0; i < n; i += chunk)
+            op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
+        }
+      else
+#endif
+        op (n, r.fortran_vec (), x.data (), y.data ());
       return r;
     }
   else
@@ -360,7 +405,21 @@
                  void (*op) (size_t, R *, const X *, Y))
 {
   Array<R> r (x.dims ());
-  op (r.length (), r.fortran_vec (), x.data (), y);
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const X *xvec = x.data ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, xvec + i, y);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x.data (), y);
   return r;
 }
 
@@ -370,7 +429,21 @@
                  void (*op) (size_t, R *, X, const Y *))
 {
   Array<R> r (y.dims ());
-  op (r.length (), r.fortran_vec (), x, y.data ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const Y *yvec = y.data ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, yvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, x, yvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x, y.data ());
   return r;
 }
 
@@ -382,7 +455,23 @@
 {
   dim_vector dr = r.dims (), dx = x.dims ();
   if (dr == dx)
-    op (r.length (), r.fortran_vec (), x.data ());
+    {
+      octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+      if (n > 1000)
+        {
+          int nthreads = omp_get_num_threads ();
+          octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+          R *rvec = r.fortran_vec ();
+          const X *xvec = x.data ();
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+          for (octave_idx_type i = 0; i < n; i += chunk)
+            op (std::min (chunk, n - i), rvec + i, xvec + i);
+        }
+      else
+#endif
+        op (n, r.fortran_vec (), x.data ());
+    }
   else
     gripe_nonconformant (opname, dr, dx);
   return r;
@@ -393,7 +482,20 @@
 do_ms_inplace_op (Array<R>& r, const X& x,
                   void (*op) (size_t, R *, X))
 {
-  op (r.length (), r.fortran_vec (), x);
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, x);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x);
   return r;
 }
 
@@ -1145,7 +1247,31 @@
   dims.chop_trailing_singletons ();
 
   Array<R> ret (dims);
-  mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
+#ifdef HAVE_OPENMP
+  if (l * u > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      const T *svec = src.data ();
+      R *rvec = ret.fortran_vec ();
+      if (u < nthreads)
+        {
+          octave_idx_type chunk = (l + nthreads - 1) / nthreads;
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+          for (octave_idx_type i = 0; i < l; i += chunk)
+            mx_red_op (svec + i * n, rvec + i, std::min (chunk, l - i), n, u);
+        }
+      else
+        {
+          octave_idx_type chunk = (u + nthreads - 1) / nthreads;
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+          for (octave_idx_type i = 0; i < u; i += chunk)
+            mx_red_op (svec + i * l * n, rvec + i * l , l, n,
+                       std::min (chunk, u - i));
+        }
+    }
+  else
+#endif
+    mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
 
   return ret;
 }
