bug-apl
Re: Parallel APL Questions


From: Dr. Jürgen Sauermann
Subject: Re: Parallel APL Questions
Date: Fri, 7 Feb 2020 20:25:51 +0100
User-agent: Mozilla/5.0 (X11; Linux i686; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Hi Andrew,

let me try to answer some of your questions inline below...

On 2/7/20 6:35 PM, Andrew wrote:
Good evening

This is my first post to this mailing list.  It is mainly some questions, not a bug report, so I hope it is appropriate to post it here.  Apologies if not.  (And apologies also for a rather long and rambling e-mail!)

No problem, you found the right list.

I recently learned of GNU APL and, having had some experience of APL on IBM mainframes in the 1980s, I was curious to know how it would work on a couple of my computers, and to use it to compare performance of two virtualised and emulated environments.

Firstly, I installed it on Ubuntu 18.04.3 running under VMware Fusion on a 2.3GHz 8-core Intel i9.  This is the latest SVN version, built using CORE_COUNT_WANTED=syl on ./configure (not make parallels, which gave me a problem with autoconf).  I then used ⎕syl[26;2] to set the number of cores.

Using ⎕ai to obtain the compute time, I tried using 1 and 4 cores for brute force prime number counting, using this expression: r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n
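
For readers less familiar with APL, the expression can be read right to left: take the candidates 2..n, build the table of remainders of every candidate divided by every candidate, count the zero remainders in each column, keep the candidates that have exactly one divisor in the table (themselves), and return how many remain. A rough pure-Python equivalent, offered only as an illustrative sketch (the name `count_primes_brute_force` is made up here and is not part of GNU APL), could look like this:

```python
def count_primes_brute_force(n):
    """Count primes <= n the same brute-force way as the APL expression."""
    candidates = list(range(2, n + 1))  # r←1↓⍳n
    primes = []
    for c in candidates:
        # One column of 0=r∘.∣r, summed: how many candidates divide c.
        divisors = sum(1 for d in candidates if c % d == 0)
        if divisors == 1:  # only c itself divides c
            primes.append(c)
    return len(primes)  # ⍴


print(count_primes_brute_force(100))  # 25
```

Like the APL original, this does work proportional to n squared, which is why it is useful as a benchmark and not as a practical prime counter.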

⎕AI is rather imprecise, even worse than ⎕TS. For performance measurements on Intel
CPUs you should use ⎕FIO ¯1 (return CPU cycle counter) and maybe ⎕FIO ¯2 (return CPU frequency).
⎕FIO ¯1 is the most precise timing source that you can get in GNU APL.
Although I could see, on the system monitor, that 4 cores were being used, the execution time with n=10000 actually took longer for the 4 core case, typically 15-20% more time than the 1 core case.

The expression above that you benchmarked is a mix of parallelized and non-parallelized APL
primitives. The execution time of each of them varies from run to run, so it is difficult to tell whether the increased
execution time is caused by the parallel execution or by that run-to-run variation.
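
One common way to separate a real difference from run-to-run noise is to repeat the measurement many times and compare the minimum or median against the spread. The following is a minimal Python sketch of that idea, not GNU APL code: `time.perf_counter_ns` stands in for the cycle counter, and `time_runs`, `summarize`, and `workload` are hypothetical names chosen for illustration.

```python
import statistics
import time


def time_runs(func, repeats=20):
    """Run func `repeats` times and return the elapsed times in nanoseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        func()
        samples.append(time.perf_counter_ns() - start)
    return samples


def summarize(samples):
    """Reduce the samples to the figures worth comparing between setups."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def workload():
    # Placeholder for the code being benchmarked.
    return sum(i * i for i in range(50_000))


stats = summarize(time_runs(workload))
print(stats)
```

If the gap between min and max within one configuration is of the same order as the 15-20% difference between configurations, the comparison probably says little about parallelization.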
However, I then tried it in a very different environment: Ubuntu 18.04.3 again, but running in an emulated IBM S/390 mainframe (using the Hercules S/370 emulator running in Ubuntu in VMWare on a 3.5 GHz 6-core Xeon).  For n=5000, this gave the opposite result: the 4 core case was approx. 45% quicker.

In my experience, using all cores of a CPU is not optimal because external events from the OS (interrupts
etc.) slow down one of the cores used for APL, so that the core(s) hit by external events increase the
execution time of each primitive. If you leave one core unused (and if you are lucky), then the scheduler
of the OS will see which cores are busy (executing APL) and will direct those events to the unused core.
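
As a small illustration of the "leave one core unused" rule, the Python sketch below picks all but one of the cores that the process is allowed to run on. It only computes a count; in GNU APL the resulting number would then be set through ⎕SYL as described earlier in this thread. The helper name `cores_for_apl` is hypothetical, and `os.sched_getaffinity` is only available on some platforms (Linux among them), hence the fallback.

```python
import os


def cores_for_apl(available):
    """Return the cores to use, keeping one core free for OS events."""
    cores = sorted(available)
    if len(cores) <= 1:
        return set(cores)  # nothing to spare
    return set(cores[:-1])  # leave the highest-numbered core unused


if hasattr(os, "sched_getaffinity"):
    available = os.sched_getaffinity(0)  # cores this process may use
else:
    available = set(range(os.cpu_count() or 1))

print(len(cores_for_apl(available)), "of", len(available), "cores")
```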

I also rather doubt that measurements in a virtual or emulated environment can tell you anything about parallelized APL.
By the way, there is a benchmarking workspace Scalar3.apl shipped with GNU APL that makes benchmarking of parallel GNU APL easier. The Intel i9 is a good platform for running that workspace, but
avoid any virtualization and ./configure it properly.

Directly comparing these two environments (one “simply” virtualized, the other emulated and virtualized) is not meaningful.  It is to be expected that the emulated one will be very substantially slower.  The more interesting point is, perhaps, that on the i9, using more cores actually slows it down whereas, in the emulated environment, which is effectively a *much* slower processor, using multiple cores does yield a modest speed-up.

The speedups that can be achieved are generally disappointing. I have also compared an Intel i7 with an Intel i9.
It seems that at the same CPU frequency and with the same core count, the i9 is substantially faster
than the i7, but at the same time the i7 benefits more from parallelization than the i9. Most likely the
CPU optimizations in the i9 (compared to the i7) aim at the same kind of parallelism, so that improvements
of one aspect (CPU architecture) are made at the expense of the other aspect (APL parallelization).

I am not sure which components of the expression (if any) would be parallelized by GNU APL.  So my questions are:

1.  Is it plausible that, on a reasonably modern CPU (the i9), using multiple cores would slow down execution of this expression?

Could very well be. The expression has a rather small amount of parallelization since the majority of
its primitives are not parallelized.
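
Amdahl's law gives a feel for how little can be gained when only part of the run time is spent in parallelized primitives. The Python sketch below is purely illustrative: the parallel fractions are made-up values, not measurements of the expression, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on the speedup when only part of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical fractions of run time spent in parallelized primitives.
for p in (0.1, 0.3, 0.9):
    print(p, round(amdahl_speedup(p, 4), 2))
```

This is only an upper bound. Any cost of starting and synchronizing the cores for each primitive comes on top of it, which is how the total can end up slower than the single-core run.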
2.  Which of the operators in the expression above would GNU APL actually parallelize?

Currently all scalar functions and inner and outer products of them. These are the ones for which
one can prove that, in theory and given the GNU APL implementation, the speedup must be linear (linear in the
number of cores). That is, on an i9 a scalar function on 4 cores should be 4 times faster than on one
core. In real life it is only 1.5 or so times faster. This points to a hardware bottleneck between the cores
and the memory. The scalar functions are so lightweight that the memory accesses (fetching the operands
and storing the results) dominate the entire execution time.
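
To put the figures above in perspective: a speedup of 1.5 on 4 cores is a parallel efficiency of 1.5/4, and under the simplifying assumption that the run time splits into one part that scales perfectly with the core count and one part that does not scale at all (for example, time spent waiting on memory), the non-scaling share can be solved for directly. This is a back-of-the-envelope sketch in Python, and both helper names are hypothetical:

```python
def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup / cores


def non_scaling_fraction(speedup, cores):
    """Share f of single-core time that does not scale (requires cores > 1).

    Solves 1/speedup = f + (1 - f)/cores for f.
    """
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


print(parallel_efficiency(1.5, 4))             # 0.375
print(round(non_scaling_fraction(1.5, 4), 3))  # 0.556
```

Under that model, more than half of the single-core time of a scalar function would be spent in the part that extra cores cannot speed up.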
3.  Are there any configuration changes that I could make to adjust the way in which parallelization is done?

If by configurations you mean ./configure options, then no. However, some ./configure options have
a performance impact on both parallel and non-parallel execution. These should be switched off.
See README-2-configure for details.
One other comment:

Before I realised that the svn version is more recent, I used the apl-1.8.tar.gz version of the code that is available on the GNU mirror.  This seems to have a minor error in Parallel.hh: two occurrences of & in the definition of PRINT_LOCKED, which cause a compilation error.  They appear to have been removed in the svn version.

Yes. In the early days of GNU APL I updated the apl-1.X.tar.gz files after every bug fix. I was then told
by the GNU project that this would mess up their mirrors so I stopped doing that. Therefore problems in
1.8 will only be fixed in 1.9, typically 1-2 years later.
Any comments or answers would be appreciated.  Thank you for taking the time to read my e-mail.

You're welcome
Jürgen
Andrew


