Hi Xtian,
the problem with that example is that SolveSudoku and even the lambda
{SolveSudoku ⍵} are defined functions and are therefore allowed to have
side effects. These side effects need to be taken care of, and that
requires either a considerable synchronization effort (like semaphores
on all APL variables) or some other concept that is not present in
GNU APL today.
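For illustration, imagine a solver written like this (the function and
the global counter are made up here, just to show the kind of side
effect I mean):

  ∇Z←SolveOne B
   TRIES←TRIES+1      ⍝ side effect: updates a global variable
   Z←SolveSudoku B    ⍝ the actual solving
  ∇

If several calls of SolveOne run at the same time, the updates of TRIES
race against each other, so the interpreter would have to serialize
them (or otherwise protect TRIES) somehow.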
The main problem is this: if we cannot achieve a reasonable speed-up on
scalar functions (which can be computed without such synchronization),
then we should not expect that things get any better if such
synchronization is needed on top of that.
In the PRAM model (which is a theoretical model) the memory bandwidth
scales linearly with the number of processors (cores). In real-world
machines, however, the memory bandwidth is typically high but still
limited by the hardware. On such machines, if you increase the number
of cores, the memory bandwidth per core decreases accordingly, and you
pretty soon reach the point where all cores are wasting most of their
processing capacity by waiting for the common memory. Caches and other
local memories can decrease the number of accesses to the shared
memory, but only by a constant factor (which translates to a constant
factor in the speedup that can be achieved).
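Just to illustrate the effect with made-up numbers (the 100 GB/s of
total memory bandwidth is only an assumption for the example):

      100 ÷ 1 10 80    ⍝ per-core share (GB/s) with 1, 10, or 80 cores
100 10 1.25

With 80 cores each core gets only 1.25 GB/s, far less than a single
core can consume when it simply streams through a large array, so most
of its time is spent waiting.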
As Xiao-Yong correctly pointed out: "We always need a large amount of
operations per thread to be worthwhile." In the Sudoku example this is
the case, and you could actually tackle the problem differently:
instead of parallelizing the built-in APL functions, you could start a
GNU APL thread (or processor) per Sudoku and collect the results at the
end. I would expect that this gives a much better performance than
computing the Sudokus in parallel in one interpreter. It even scales
with the number of processors if they (as opposed to cores/threads)
have a sufficiently large local memory.
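One way to sketch it (not working code; for simplicity this starts one
GNU APL process per core rather than literally one per Sudoku, and the
mechanism for starting the worker processes and collecting their
results is left open here):

  NPROC←8                               ⍝ assumed number of cores
  B←big_array_of_enclosed_9x9_sudoku    ⍝ one enclosed 9×9 board per item
  ⍝ round-robin share of worker W (W in 1..NPROC):
  SHARE←{((⍵-1) = NPROC | ¯1+⍳⍴B) / B}
  ⍝ worker W would compute  {SolveSudoku ⍵}¨ SHARE W  in its own
  ⍝ GNU APL process, and the starting workspace would collect the
  ⍝ NPROC partial results and put them back into the original order.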
/// Jürgen
On 08/27/2016 03:47 AM, Christian Robert wrote:
Well, I saw a couple of times where parallelism could have been very
useful. Something like:
{SolveSudoku ⍵}⍤2 ⊣ Cube_of_9x9x1000_sudoku
{SolveSudoku ⍵}¨ big_array_of_enclosed_9x9_sudoku
but I don't want the ⍤ (rank) or ¨ (foreach) operators doing
parallelism by default, so I thought of introducing a new operator
(like the reverse operator ⍨ in (3÷⍨1)) which would say "I know what
I'm doing, run the function to the left in as many threads as I have
CPUs available".
Just an idea I had a year or two ago when testing parallelism in
gnu-apl. Very few operators can really gain from parallelism; ⍤ (rank)
and ¨ (foreach) are two of them.
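Purely hypothetical syntax, just to make the idea concrete (the symbol
∥ is invented here; nothing like it exists in gnu-apl today):

  {SolveSudoku ⍵} ∥ big_array_of_enclosed_9x9_sudoku
  ⍝ ∥ would take the function on its left and apply it to each item of
  ⍝ the argument, using as many threads as there are CPUs available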
Comments?
Xtian.
On 2016-08-26 16:57, Xiao-Yong Jin wrote:
Of course. Simple scalar functions would be the worst to parallelize.
We always need a large amount of operations per thread to be
worthwhile.
Something like the following might be a good place to start:
⌹⍤2 X
where (LargeNumber = ×/¯2↓⍴X) ∧ 2<⍴⍴X
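That is:

  ⍝ 2<⍴⍴X      X has rank 3 or higher
  ⍝ ×/¯2↓⍴X    the number of 2-dimensional cells of X, i.e. how many
  ⍝            independent matrix inversions ⌹⍤2 has to perform

so it only pays off when a single call contains many independent
inversions.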
To support general functions instead of built-ins, the code would need
to analyze the data dependencies, which I believe is impossible in APL
because of ⍎ (you can't analyze it without executing it), but
supporting a subset seems good enough. Look at Data Parallel Haskell
and you know it's a really hard problem.
I wish there were an array language that could hide all the details of
the AVX/OMP/CUDA/MPI stuff. There are some efforts to bring SIMD-type
execution to array languages, but in my opinion these are just not
enough, as our central processors have more and more fast
cores/threads. (There was once a time when APL was wonderful at
utilizing computing power, but it's long gone.)
On Aug 26, 2016, at 2:34 PM, Juergen Sauermann <address@hidden> wrote:
Hi,
maybe, maybe not.
Our earlier measurements on an 80-core machine indicated that the way
the cores connect to the memories seems to determine the parallel
performance that can be achieved in GNU APL.
One can easily prove that for sufficiently large APL arrays (of
arbitrary rank) all scalar APL functions must scale linearly with N on
an N-processor PRAM: each of the N processors can work on its own
⌈M÷N⌉ of the M ravel elements without any interaction with the other
processors. See
http://link.springer.com/chapter/10.1007%2F978-3-642-61012-7_11
(in German).
One can also show that on the 80-core machine that we used, the
performance stops increasing at fairly low N (around 10), even for
very large arrays. This means that either the GNU APL parallel
implementation of scalar functions is wrong, or that the 80-core
machine we used is, performance-wise, not even near the PRAM model.
My best guess is the latter.
/// Jürgen
On 08/26/2016 08:41 PM, Xiao-Yong Jin wrote:
On Aug 26, 2016, at 1:12 PM, address@hidden wrote:
finally a computer just perfect for GNU APL:
http://thehackernews.com/2016/08/powerful-multicore-processor.html
Now is the perfect time to invest your time and effort in
improving parallel efficiency in gnu-apl.