Using OpenMP in Octave


From: David Bateman
Subject: Using OpenMP in Octave
Date: Mon, 29 Mar 2010 01:37:42 +0200
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090706)

I've had a short discussion with Jaroslav and John off-list about implementing OpenMP multi-threading in Octave and want to bring it back to the list. The use of OpenMP in Octave 3.4 will still be too experimental, so if we include the code now I propose to make it off by default. The changeset I committed on Saturday adds the autoconf code to probe for OpenMP support, but only if the "--enable-openmp" configure option is used. Currently it only probes for OpenMP support with gcc and MSVC (though the MSVC code is untested).

So the question now is how we should use multi-threading in Octave. The way I'd like to see it used is OpenMP for fine/medium grain parallelisation and MPI for coarse grain parallelisation (i.e. the outer loop of an Octave m-file). The reason is that coarse grain parallelisation will in general have a low amount of network communication as a percentage of the computation time and so is well adapted to clusters, where MPI rules. I therefore propose not to treat such coarse parallelisation at all and to leave the openmpi_ext code to deal with that.

The OpenMP directives should be introduced at a level below which we know the code to be thread-safe. A reasonable place to start looking for multi-threading is the mx_inline code. Jaroslav suggested something like

Jaroslav Hajek wrote:
template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  int nthreads = omp_get_num_threads ();
  octave_idx_type n = r.numel (), chunk = n / nthreads;
  R *rvec = r.fortran_vec ();
  const X *xvec = x.fortran_vec ();
#pragma omp parallel for
  for (octave_idx_type i = 0; i < n; i += chunk)
    op (std::min (chunk, n - i), rvec + i, xvec + i);
  return r;
}
The idea is that the applier function will be used with many different mx_inline_* functions, which limits the intrusion into the code, and that the parallelisation is chunked to force each thread to treat a contiguous block of data. There is currently no loop in these appliers, as the underlying mx_inline_* function itself contains the loop, so if we want to multi-thread the code at this point we need to add a loop as Jaroslav suggests.
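
For anyone who doesn't have mx-inlines.cc in front of them, the kernels these appliers dispatch to are just plain loops over contiguous memory. The name below is invented for illustration, but the real mx_inline_* functions have the same shape and the same (size_t, R *, const X *) signature, which is what makes it safe to call them from several threads on disjoint sub-ranges of the same array:

// Illustrative stand-in for an mx_inline_* unary kernel (the name is
// made up).  It writes r[0..n), reads x[0..n) and touches no shared
// state, so separate threads can run it on non-overlapping chunks.
template <class R, class X>
void
example_unary_kernel (size_t n, R *r, const X *x)
{
  for (size_t i = 0; i < n; i++)
    r[i] = -x[i];
}

// The applier is then reused unchanged for every such kernel, e.g.
//   Array<double> y = do_mx_unary_op (x, example_unary_kernel<double, double>);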

I think chunk should be calculated as

chunk = (n + nthreads - 1) / nthreads;

This really limits the number of loop iterations to the number of threads: with Jaroslav's chunk = n / nthreads, n = 10 on 4 threads gives chunk = 2 and five iterations, whereas the ceiling form gives chunk = 3 and exactly four. However, this for loop has an overhead that might be significant, and it probably makes sense to require a minimum value of r.numel() before threading is used. Something like

template <class R, class X>
inline Array<R>
do_mx_unary_op (const Array<X>& x,
                void (*op) (size_t, R *, const X *))
{
  Array<R> r (x.dims ());
  octave_idx_type n = r.numel ();
#ifdef HAVE_OPENMP
  // Only use multithreading for large arrays to avoid excessive overhead.
  // FIXME 1000 is an arbitrary value. Do some benchmarking
  if (n > 1000)
    {
      int nthreads = omp_get_num_threads ();
      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
      R *rvec = r.fortran_vec ();
      const X *xvec = x.fortran_vec ();
      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
      for (octave_idx_type i = 0; i < n; i += chunk)
        op (std::min (chunk, n - i), rvec + i, xvec + i);
    }
  else
#endif
    op (n, r.fortran_vec (), x.fortran_vec ());

  return r;
}

where the macro OCTAVE_OMP_PRAGMA is just

#ifdef HAVE_OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif
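
To see what that buys us, here is a tiny self-contained illustration (the scale function is invented, not from the changeset): the same source compiles whether or not OpenMP is available, because without HAVE_OPENMP the macro expands to nothing and the loop just runs serially.

#ifdef HAVE_OPENMP
#define OCTAVE_OMP_PRAGMA(x) _Pragma (#x)
#else
#define OCTAVE_OMP_PRAGMA(x)
#endif

void
scale (int n, double *r, double s)
{
  // With HAVE_OPENMP this line becomes _Pragma ("omp parallel for"),
  // which the compiler treats like the corresponding #pragma; without
  // it the macro vanishes and this is an ordinary serial loop.
  OCTAVE_OMP_PRAGMA (omp parallel for)
  for (int i = 0; i < n; i++)
    r[i] *= s;
}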

However, the value 1000 is arbitrary and a little benchmarking is needed. I attach a first experimental changeset for those who want to experiment. Configured with "--enable-openmp" against a recent tip, this code successfully runs through "make check", but I don't know whether the choice of the array size at which to switch between single-threaded and multi-threaded code is optimal.
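
For those who want to calibrate that threshold outside of Octave first, a throwaway harness along the lines below should do (just a sketch, not part of the changeset; note that it sizes its chunks with omp_get_max_threads, which reports the thread count before the parallel region is entered). Build it once with and once without "-fopenmp -DHAVE_OPENMP" and compare the timings for each n.

// Throwaway benchmark sketch (not part of the changeset) for locating
// the point where chunked multi-threading starts to pay off.
#include <algorithm>
#include <cstdio>
#include <ctime>
#include <vector>
#ifdef HAVE_OPENMP
#include <omp.h>
#endif

// Stand-in for an mx_inline_* kernel.
static void
negate (size_t n, double *r, const double *x)
{
  for (size_t i = 0; i < n; i++)
    r[i] = -x[i];
}

int
main (void)
{
  const int reps = 200;
  for (long n = 100; n <= 10000000; n *= 10)
    {
      std::vector<double> x (n, 1.0), r (n);
#ifdef HAVE_OPENMP
      // Wall-clock time for the chunked, multi-threaded path.
      double t0 = omp_get_wtime ();
      for (int rep = 0; rep < reps; rep++)
        {
          int nthreads = omp_get_max_threads ();
          long chunk = (n + nthreads - 1) / nthreads;
#pragma omp parallel for
          for (long i = 0; i < n; i += chunk)
            negate (std::min (chunk, n - i), &r[0] + i, &x[0] + i);
        }
      double elapsed = omp_get_wtime () - t0;
#else
      // CPU time is close enough to wall time for the serial build.
      clock_t t0 = clock ();
      for (int rep = 0; rep < reps; rep++)
        negate (n, &r[0], &x[0]);
      double elapsed = double (clock () - t0) / CLOCKS_PER_SEC;
#endif
      std::printf ("n = %8ld: %g s\n", n, elapsed);
    }
  return 0;
}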

Within Octave itself, a couple of interesting tests might be

n  = 300; a = ones(n,n,n);
tic; sum(a,1); toc
tic; sum(a,2); toc
tic; sum(a,3); toc
n = 999; a = (1+1i)*ones (n,n); tic; a = real(a); toc
n = 1001; a = (1+1i)*ones (n,n); tic; a = real(a); toc

before and after the change. Unfortunately, I'm developing on an Atom, so I personally won't see much gain from this multi-threading.

Cheers
David


# HG changeset patch
# User David Bateman <address@hidden>
# Date 1269819260 -7200
# Node ID 7fd295e13462d8c1c38b32d98e41a1a1946f1918
# Parent  bcabc1c4f20c0244793a31f60261b2a3f5a2a575
Experimental OpenMP code

diff --git a/liboctave/mx-inlines.cc b/liboctave/mx-inlines.cc
--- a/liboctave/mx-inlines.cc
+++ b/liboctave/mx-inlines.cc
@@ -35,6 +35,7 @@
 #include "oct-cmplx.h"
 #include "oct-locbuf.h"
 #include "oct-inttypes.h"
+#include "oct-openmp.h"
 #include "Array.h"
 #include "Array-util.h"
 
@@ -299,12 +300,29 @@
 // a pointer, to allow the compiler reduce number of instances.
 
 template <class R, class X>
-inline Array<R> 
+inline Array<R>
 do_mx_unary_op (const Array<X>& x,
                 void (*op) (size_t, R *, const X *))
 {
   Array<R> r (x.dims ());
-  op (r.numel (), r.fortran_vec (), x.data ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  // Only use multithreading for large arrays to avoid excessive overhead.
+  // FIXME 1000 is an arbitrary value. Do some benchmarking
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const X *xvec = x.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, xvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x.fortran_vec ());
+
   return r;
 }
 
@@ -329,11 +347,23 @@
 do_mx_inplace_op (Array<R>& r,
                   void (*op) (size_t, R *))
 {
-  op (r.numel (), r.fortran_vec ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec ());
   return r;
 }
 
-
 template <class R, class X, class Y>
 inline Array<R> 
 do_mm_binary_op (const Array<X>& x, const Array<Y>& y,
@@ -344,7 +374,22 @@
   if (dx == dy)
     {
       Array<R> r (dx);
-      op (r.length (), r.fortran_vec (), x.data (), y.data ());
+      octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+      if (n > 1000)
+        {
+          int nthreads = omp_get_num_threads ();
+          octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+          R *rvec = r.fortran_vec ();
+          const X *xvec = x.data ();
+          const Y *yvec = y.data ();
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec, yvec))
+          for (octave_idx_type i = 0; i < n; i += chunk)
+            op (std::min (chunk, n - i), rvec + i, xvec + i, yvec + i);
+        }
+      else
+#endif
+        op (n, r.fortran_vec (), x.data (), y.data ());
       return r;
     }
   else
@@ -360,7 +405,21 @@
                  void (*op) (size_t, R *, const X *, Y))
 {
   Array<R> r (x.dims ());
-  op (r.length (), r.fortran_vec (), x.data (), y);
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const X *xvec = x.data ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, xvec + i, y);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x.data (), y);
   return r;
 }
 
@@ -370,7 +429,21 @@
                  void (*op) (size_t, R *, X, const Y *))
 {
   Array<R> r (y.dims ());
-  op (r.length (), r.fortran_vec (), x, y.data ());
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      const Y *yvec = y.data ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, yvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, x, yvec + i);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x, y.data ());
   return r;
 }
 
@@ -382,7 +455,23 @@
 {
   dim_vector dr = r.dims (), dx = x.dims ();
   if (dr == dx)
-    op (r.length (), r.fortran_vec (), x.data ());
+    {
+      octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+      if (n > 1000)
+        {
+          int nthreads = omp_get_num_threads ();
+          octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+          R *rvec = r.fortran_vec ();
+          const X *xvec = x.data ();
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec, xvec))
+          for (octave_idx_type i = 0; i < n; i += chunk)
+            op (std::min (chunk, n - i), rvec + i, xvec + i);
+        }
+      else
+#endif
+        op (n, r.fortran_vec (), x.data ());
+    }
   else
     gripe_nonconformant (opname, dr, dx);
   return r;
@@ -393,7 +482,20 @@
 do_ms_inplace_op (Array<R>& r, const X& x,
                   void (*op) (size_t, R *, X))
 {
-  op (r.length (), r.fortran_vec (), x);
+  octave_idx_type n = r.numel ();
+#ifdef HAVE_OPENMP 
+  if (n > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      octave_idx_type chunk = (n + nthreads - 1) / nthreads;
+      R *rvec = r.fortran_vec ();
+      OCTAVE_OMP_PRAGMA(omp parallel for shared(n, chunk, rvec))
+      for (octave_idx_type i = 0; i < n; i += chunk)
+        op (std::min (chunk, n - i), rvec + i, x);
+    }
+  else
+#endif
+    op (n, r.fortran_vec (), x);
   return r;
 }
 
@@ -1145,7 +1247,31 @@
   dims.chop_trailing_singletons ();
 
   Array<R> ret (dims);
-  mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
+#ifdef HAVE_OPENMP
+  if (l * u > 1000)
+    {
+      int nthreads = omp_get_num_threads ();
+      const T *svec = src.data ();
+      R *rvec = ret.fortran_vec ();
+      if (u < nthreads)
+        {
+          octave_idx_type chunk = (l + nthreads - 1) / nthreads;
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+          for (octave_idx_type i = 0; i < l; i += chunk)
+            mx_red_op (svec + i * n, rvec + i, std::min (chunk, l - i), n, u);
+        }
+      else
+        {
+          octave_idx_type chunk = (u + nthreads - 1) / nthreads;
+          OCTAVE_OMP_PRAGMA(omp parallel for shared(n, l, u, chunk, rvec, svec))
+          for (octave_idx_type i = 0; i < u; i += chunk)
+            mx_red_op (svec + i * l * n, rvec + i * l , l, n,
+                       std::min (chunk, u - i));
+        }
+    }
+  else
+#endif
+    mx_red_op (src.data (), ret.fortran_vec (), l, n, u);
 
   return ret;
 }
