Hi Xtian,
the problem with that example is that SolveSudoku and even the lambda
{SolveSudoku ⍵} are defined functions and are therefore allowed to have
side effects. These side effects need to be taken care of, and that
requires either a considerable synchronization effort (like semaphores
on all APL variables) or some other concept that is not present in
GNU APL today.
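For illustration, imagine a solver written like this (the function and
the global counter are made up here, just to show the kind of side
effect I mean):

  ∇Z←SolveOne B
   TRIES←TRIES+1      ⍝ side effect: updates a global variable
   Z←SolveSudoku B    ⍝ the actual solving
  ∇

If several calls of SolveOne run at the same time, the updates of TRIES
race against each other, so the interpreter would have to serialize
them (or otherwise protect TRIES) somehow.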
The main problem is this: if we cannot achieve a reasonable speed-up on
scalar functions (which can be computed without such synchronization),
then we should not expect that things get any better if such
synchronization is needed on top of that.
In the PRAM model (which is a theoretical model) the memory bandwidth
scales linearly with the number of processors (cores). In real-world
machines, however, the memory bandwidth is typically high but still
limited by the hardware. On such machines, if you increase the number
of cores, the memory bandwidth per core decreases accordingly, and you
pretty soon reach the point where all cores are wasting most of their
processing capacity by waiting for the common memory. Caches and other
local memories can decrease the number of accesses to the shared
memory, but only by a constant factor (which translates to a constant
factor in the speedup that can be achieved).
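Just to illustrate the effect with made-up numbers (the 100 GB/s of
total memory bandwidth is only an assumption for the example):

      100 ÷ 1 10 80    ⍝ per-core share (GB/s) with 1, 10, or 80 cores
100 10 1.25

With 80 cores each core gets only 1.25 GB/s, far less than a single
core can consume when it simply streams through a large array, so most
of its time is spent waiting.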
As Xiao-Yong correctly pointed out: "We always need a large amount of
operations per thread to be worthwhile." In the Sudoku example this is
the case, and you could actually tackle the problem differently:
instead of parallelizing the built-in APL functions, you could start a
GNU APL thread (or processor) per Sudoku and collect the results at the
end. I would expect that this gives a much better performance than
computing the Sudokus in parallel in one interpreter. It even scales
with the number of processors if they (as opposed to cores/threads)
have a sufficiently large local memory.
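One way to sketch it (not working code; for simplicity this starts one
GNU APL process per core rather than literally one per Sudoku, and the
mechanism for starting the worker processes and collecting their
results is left open here):

  NPROC←8                               ⍝ assumed number of cores
  B←big_array_of_enclosed_9x9_sudoku    ⍝ one enclosed 9×9 board per item
  ⍝ round-robin share of worker W (W in 1..NPROC):
  SHARE←{((⍵-1) = NPROC | ¯1+⍳⍴B) / B}
  ⍝ worker W would compute  {SolveSudoku ⍵}¨ SHARE W  in its own
  ⍝ GNU APL process, and the starting workspace would collect the
  ⍝ NPROC partial results and put them back into the original order.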
/// Jürgen
On 08/27/2016 03:47 AM, Christian Robert wrote:
Well, I saw a couple of times where parallelism could have been very
useful. Something like:
{SolveSudoku ⍵}⍤2 ⊣ Cube_of_9x9x1000_sudoku
{SolveSudoku ⍵}¨ big_array_of_enclosed_9x9_sudoku
but I don't want the ⍤ (rank) or ¨ (foreach) operators doing
parallelism by default, so I thought of introducing a new operator
(like the reverse operator ⍨ in (3÷⍨1)) which would say "I know what
I'm doing, run the function to the left in as many threads as I have
CPUs available".
Just an idea I had a year or two ago when testing parallelism in
gnu-apl. Very few operators can really gain from parallelism; ⍤ (rank)
and ¨ (foreach) are two of them.
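Purely hypothetical syntax, just to make the idea concrete (the symbol
∥ is invented here; nothing like it exists in gnu-apl today):

  {SolveSudoku ⍵} ∥ big_array_of_enclosed_9x9_sudoku
  ⍝ ∥ would take the function on its left and apply it to each item of
  ⍝ the argument, using as many threads as there are CPUs available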
Comments?
Xtian.
On 2016-08-26 16:57, Xiao-Yong Jin wrote:
Of course. Simple scalar functions would be the worst to parallelize.
We always need a large amount of operations per thread to be
worthwhile.
Something like the following might be a good place to start:
⌹⍤2 X
where (LargeNumber = ×/¯2↓⍴X) ∧ 2<⍴⍴X
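That is:

  ⍝ 2<⍴⍴X      X has rank 3 or higher
  ⍝ ×/¯2↓⍴X    the number of 2-dimensional cells of X, i.e. how many
  ⍝            independent matrix inversions ⌹⍤2 has to perform

so it only pays off when a single call contains many independent
inversions.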
To support general functions instead of built-ins, the code would need
to analyze the data dependencies, which I believe is impossible in APL
because of ⍎ (you can't analyze it without executing it), but
supporting a subset seems good enough. Look at Data Parallel Haskell
and you know it's a really hard problem.
I wish there were an array language that could hide all the details of
the AVX/OMP/CUDA/MPI stuff. There are some efforts to bring SIMD-type
execution to array languages, but in my opinion these are just not
enough, as our central processors have more and more fast
cores/threads. (There was once a time when APL was wonderful at
utilizing computing power, but it's long gone.)
On Aug 26, 2016, at 2:34 PM, Juergen Sauermann <address@hidden> wrote:
Hi,
maybe, maybe not.
Our earlier measurements on an 80-core machine indicated that the way
the cores connect to the memories seems to determine the parallel
performance that can be achieved in GNU APL.
One can easily prove that for sufficiently large APL arrays (of
arbitrary rank) all scalar APL functions must scale linearly with N on
an N-processor PRAM: each of the N processors can work on its own
⌈M÷N⌉ of the M ravel elements without any interaction with the other
processors. See
http://link.springer.com/chapter/10.1007%2F978-3-642-61012-7_11
(in German).
One can also show that on the 80-core machine that we used, the
performance stops increasing at fairly low N (around 10), even for
very large arrays. This means that either the GNU APL parallel
implementation of scalar functions is wrong, or that the 80-core
machine we used is, performance-wise, not even near the PRAM model.
My best guess is the latter.
/// Jürgen
On 08/26/2016 08:41 PM, Xiao-Yong Jin wrote:
On Aug 26, 2016, at 1:12 PM, address@hidden wrote:
finally a computer just perfect for GNU APL:
http://thehackernews.com/2016/08/powerful-multicore-processor.html
Now is the perfect time to invest your time and effort in
improving parallel efficiency in gnu-apl.