From: Elias Mårtenson
Subject: Re: [Bug-apl] apl computer
Date: Sat, 27 Aug 2016 21:36:56 +0800

This is a subject that has interested me a lot, and after coming to a similar conclusion as Jürgen, I decided to see whether it was possible to get around many of the problems by eliminating the main cause of bad parallelism: side effects.

I have been working on an experimental APL interpreter (mostly doing design and proof-of-concept tests, with very little actual code written) which pays very little attention to the APL spec in cases where it conflicts with the goal of being highly parallelisable.

So, what are my current conclusions?

First of all, plain code that does not operate on very large arrays tends to be slower than in a more direct implementation such as GNU APL.

Any operation that is not highly parallelisable suffers: what in an implementation such as GNU APL is a simple array with very clear semantics (compiled to very tight single-threaded loops in C++) is now a complex object, and simple operations on it are much slower.

In particular, in this language array modification (such as foo[2;1]←9) is a very costly operation, since arrays are copy-on-write (CoW); the copy only happens if the array is actually realised, because operations are lazy.

That said, there are special cases that can be significantly improved, but I can't say how fast they would be in a real-world application.
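
To make that concrete, here is a minimal sketch (my own illustration, not code from the
experimental interpreter; all class and member names are hypothetical) of a copy-on-write,
lazily realised array in C++, showing why an in-place write is the expensive case:

    #include <cstddef>
    #include <functional>
    #include <memory>
    #include <vector>

    // Hypothetical sketch: an array is either realised data or a pending (lazy)
    // element-wise computation.  Copies share the realised buffer; an in-place
    // write like foo[2;1]←9 forces realisation and, if the buffer is shared,
    // a full copy -- which is why such writes are costly in this design.
    class LazyArray {
        std::shared_ptr<std::vector<double>> data;   // realised values, shared (CoW)
        std::function<double(std::size_t)> pending;  // generator used while still lazy
        std::size_t length = 0;
    public:
        static LazyArray deferred(std::size_t n, std::function<double(std::size_t)> f) {
            LazyArray a; a.length = n; a.pending = std::move(f); return a;
        }
        void realise() {                             // evaluate the pending computation once
            if (data) return;
            data = std::make_shared<std::vector<double>>(length);
            for (std::size_t i = 0; i < length; ++i) (*data)[i] = pending(i);
        }
        double get(std::size_t i) { realise(); return (*data)[i]; }
        void set(std::size_t i, double v) {          // the expensive case
            realise();
            if (data.use_count() > 1)                // buffer shared with another array
                data = std::make_shared<std::vector<double>>(*data);
            (*data)[i] = v;
        }
    };

Because nothing is mutated until an array is realised, element-wise operations on such
objects can be fused and parallelised freely; the price is exactly the copy in set().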

Regards,
Elias

On 27 August 2016 at 20:02, Juergen Sauermann <address@hidden> wrote:
Hi Xtian,

the problem with that example is that SolveSudoku, and even the lambda
{SolveSudoku ⍵}, are defined functions and are therefore allowed to have
side effects.

These side effects need to be taken care of, and that requires either a
considerable synchronization effort (such as semaphores on all APL variables)
or some other concept that is not present in GNU APL today.

The main problem is this: if we cannot achieve a reasonable speed-up on scalar
functions (which can be computed without such synchronization), then we should
not expect things to get any better when such synchronization is needed on top of that.
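
For contrast, this is the kind of work a scalar function involves: a purely element-wise
loop that can be split into disjoint slices, one per thread, with no locking at all. A minimal
sketch of the idea (my own illustration, not GNU APL's actual implementation):

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Element-wise A + B over a large array, split into one disjoint slice per
    // thread.  No synchronization is needed because no element is shared.
    void parallel_add(const std::vector<double>& a, const std::vector<double>& b,
                      std::vector<double>& out, unsigned nthreads) {
        if (nthreads == 0) nthreads = 1;
        std::vector<std::thread> workers;
        const std::size_t n = a.size();
        for (unsigned t = 0; t < nthreads; ++t) {
            const std::size_t lo = n * t / nthreads;
            const std::size_t hi = n * (t + 1) / nthreads;
            workers.emplace_back([&, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i) out[i] = a[i] + b[i];
            });
        }
        for (auto& w : workers) w.join();
    }

Even this embarrassingly parallel case still has all threads streaming a, b and out through
the same memory system, which is where the bandwidth argument below comes in.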

In the PRAM model (which is a theoretical model) the memory bandwidth scales linearly
with the number of processors (cores).

In real-world machines, however, the memory bandwidth is typically high but still limited by
the hardware. On such machines, if you increase the number of cores, the memory bandwidth
per core decreases accordingly, and you soon reach the point where all cores waste most of
their processing capacity waiting for the common memory. Caches and other local memories
can decrease the number of accesses to the shared memory, but only by a constant factor
(which translates to a constant factor in the speedup that can be achieved).
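
As a rough back-of-the-envelope summary of that argument (my own formulation, with
hypothetical symbols), the achievable speedup of a bandwidth-bound computation is capped
independently of the core count:

    S(N) \approx \min\left(N,\ \frac{B_{\mathrm{total}}}{b}\right)

where B_total is the total memory bandwidth of the machine and b is the bandwidth one core
needs to keep its pipeline busy. A cache that removes part of the traffic divides b by a
constant, which raises the ceiling by that same constant factor but does not make it scale
with N.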

As Xiao-Yong pointed out correctly:

"We always need a large amount of operations per thread to be worthwhile."

In the Sudoku example this is the case, and you could actually tackle the problem differently:
instead of parallelizing the built-in APL functions, you could start one GNU APL thread (or
processor) per Sudoku and collect the results at the end. I would expect that to give much
better performance than computing the Sudokus in parallel in one interpreter.
It even scales with the number of processors if they (as opposed to cores/threads) have a
sufficiently large local memory.
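
A minimal sketch of that "one independent worker per puzzle, collect the results at the end"
pattern (my own illustration; solve_sudoku is a hypothetical placeholder, and real GNU APL
workers would be separate interpreter processes rather than threads in one address space):

    #include <array>
    #include <future>
    #include <vector>

    using Grid = std::array<int, 81>;          // one 9x9 Sudoku, row-major

    Grid solve_sudoku(Grid g);                 // hypothetical solver, defined elsewhere

    // Launch one asynchronous worker per puzzle; each works only on its own
    // local copy, so there is no shared state to synchronize.  Results are
    // collected at the end, in order.
    std::vector<Grid> solve_all(const std::vector<Grid>& puzzles) {
        std::vector<std::future<Grid>> workers;
        for (const Grid& p : puzzles)
            workers.push_back(std::async(std::launch::async, solve_sudoku, p));
        std::vector<Grid> solved;
        for (auto& w : workers) solved.push_back(w.get());
        return solved;
    }

Each worker touches only its own 9×9 grid, so there is nothing to synchronize until the
final collection step.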

/// Jürgen



On 08/27/2016 03:47 AM, Christian Robert wrote:
Well, I have seen a couple of times where parallelism could have been very useful.

Something like:

      {SolveSudoku ⍵}⍤2 ⊣ Cube_of_9x9x1000_sudoku

      {SolveSudoku ⍵}¨ big_array_of_enclosed_9x9_sudoku

but I don't want the ⍤ (rank) or ¨ (foreach) operators doing parallelism
by default, so I thought of introducing a new operator (analogous to ⍨ in (3÷⍨1),
which reverses its operands) that would say "I know what I'm doing; run the function
to the left in as many threads as I have CPUs available".


Just an idea I had a year or two ago when testing parallelism in GNU APL. Very few
operators can really gain from parallelism; ⍤ (rank) and ¨ (foreach) are two that can.

comments ?

Xtian.



On 2016-08-26 16:57, Xiao-Yong Jin wrote:
Of course.  Simple scalar functions would be the worst to parallelize.
We always need a large amount of operations per thread to be worthwhile.
Something like the following might be a good place to start:
   ⌹⍤2 X
where (LargeNumber = ×/¯2↓⍴X) ∧ 2<⍴⍴X
To support general functions instead of built-ins, the code would need to analyze the data dependencies, which I believe is impossible in APL because ⍎ cannot be analyzed without executing it; but supporting a subset seems good enough.  Look at Data Parallel Haskell and you know it's a really hard problem.

I wish there were an array language that could hide all the details of the AVX/OMP/CUDA/MPI stuff.

There are some efforts to bring SIMD-style execution to array languages, but in my opinion these are just not enough, as our central processors gain more and more fast cores/threads.  (Once there was a time when APL was wonderful at utilizing computing power, but that time is long gone.)

On Aug 26, 2016, at 2:34 PM, Juergen Sauermann <address@hidden> wrote:

Hi,

maybe, maybe not.

Our earlier measurements on an 80-core machine indicated that the way the cores connect to
the memories seems to determine the parallel performance that can be achieved in GNU APL.

One can easily prove that, for sufficiently large APL arrays (of arbitrary rank), all scalar APL
functions must scale linearly with N on an N-processor PRAM.

See http://link.springer.com/chapter/10.1007%2F978-3-642-61012-7_11 (in German).
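
The argument, roughly (my own one-line summary of it): a scalar function touches each of
the M array elements independently, so on an N-processor PRAM each processor handles
⌈M/N⌉ elements, and for M ≫ N

    T_N \approx \left\lceil \frac{M}{N} \right\rceil \cdot c
    \qquad\Rightarrow\qquad
    \frac{T_1}{T_N} \approx N

where c is the (constant) cost of one element operation; both symbols are mine, not taken
from the paper.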

One can also show that on the 80-core machine we used, the performance stops increasing
at a fairly low N (around 10), even for very large arrays.

This means that either the GNU APL parallel implementation of scalar functions is wrong, or
the 80-core machine we used is, performance-wise, not even near the PRAM model. My best
guess is the latter.

/// Jürgen


On 08/26/2016 08:41 PM, Xiao-Yong Jin wrote:
On Aug 26, 2016, at 1:12 PM, address@hidden wrote:

finally a computer just perfect for gnuapl


http://thehackernews.com/2016/08/powerful-multicore-processor.html
Now is the perfect time to invest your time and effort in improving parallel efficiency in gnu-apl.