Hi Andrew,
let me try to answer some of your questions inline below...
On 2/7/20 6:35 PM, Andrew wrote:
Good evening
This is my first post to this mailing list. It is
mainly some questions, not a bug report, so I hope it is
appropriate to post it here. Apologies if not. (And apologies
also for a rather long and rambling e-mail!)
No problem, you found the right list.
I recently learned of GNU APL and, having had some
experience of APL on IBM mainframes in the 1980s, I was curious
to know how it would work on a couple of my computers, and to
use it to compare performance of two virtualised and emulated
environments.
Firstly, I installed it on Ubuntu 18.04.3 running
under VMWare Fusion on a 2.3GHz 8-core Intel i9. This is the
latest SVN version, built using CORE_COUNT_WANTED=syl on
./configure (not make parallel, which gave me a problem with
autoconf). I then used ⎕SYL[26;2] to set the number of cores.
Using ⎕AI to obtain the compute time, I tried using
1 and 4 cores for brute-force prime number counting, using this
expression: r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n
⎕AI is rather imprecise, even worse than ⎕TS. For performance
measurements on Intel CPUs you should use ⎕FIO ¯1 (which returns the
CPU cycle counter) and maybe ⎕FIO ¯2 (which returns the CPU frequency).
⎕FIO ¯1 is the most precise timing source that you can get in GNU APL.
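For example, a rough timing of your expression could look like this
(just a sketch; it assumes n has already been set):

    T0←⎕FIO ¯1 ◊ r←⍴(1=+⌿0=r∘.∣r)/r←1↓⍳n ◊ T1←⎕FIO ¯1
    (T1-T0)÷⎕FIO ¯2    ⍝ cycle count difference ÷ CPU frequency ≈ seconds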
Although I could see, on the system monitor, that 4
cores were being used, the execution with n=10000 actually
took longer in the 4-core case, typically 15-20% more time than
in the 1-core case.
The expression above that you benchmarked is a mix of parallelized
and non-parallelized APL primitives. Each of them is subject to
varying execution times, so it is difficult to tell whether the
increased execution time is caused by the parallel execution or
simply by that normal variation.
However, I then tried it in a very different
environment: Ubuntu 18.04.3 again, but running in an emulated
IBM S/390 mainframe (using the Hercules S/370 emulator running
in Ubuntu in VMWare on a 3.5 GHz 6-core Xeon). For n=5000, this
gave the opposite result: the 4-core case was approx. 45%
quicker.
In my experience, using all cores of a CPU is not optimal, because
external events from the OS (interrupts etc.) slow down one of the
cores used for APL, so that the core(s) hit by those events increase
the execution time of every primitive. If you leave one core unused
(and if you are lucky), then the scheduler of the OS will see which
cores are busy (executing APL) and will direct those events to the
unused core.
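For example, on a 4-core machine something like the following (using
the ⎕SYL element you already found) would keep one core free for the OS:

    ⎕SYL[26;2]←3    ⍝ let GNU APL use 3 cores, leave the 4th for OS interrupts etc.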
I also rather doubt that a virtual or emulated environment can tell
you much about parallelized APL.
By the way, there is a benchmarking workspace
Scalar3.apl
shipped with GNU APL that makes benchmarking of parallel GNU APL
easier. An Intel i9 is a good platform for running that workspace,
but avoid any virtualization and ./configure GNU APL properly.
Directly comparing these two environments (one
“simply” virtualized, the other emulated and virtualized) is not
meaningful. It is to be expected that the emulated one will be
very substantially slower. The more interesting point is,
perhaps, that on the i9, using more cores actually slows it down
whereas, in the emulated environment, which is effectively a
*much* slower processor, using multiple cores does yield a
modest speed-up.
The speedups that can be achieved are generally disappointing. I
have also compared an Intel i7 with an Intel i9.
It seems that, at the same CPU frequency and with the same core count,
the i9 is substantially faster than the i7, but at the same time the
i7 benefits more from parallelization than the i9. Most likely the
CPU optimizations in the i9 (compared to the i7) aim at the same kind
of parallelism, so that improvements of one aspect (CPU architecture)
are made at the expense of the other aspect (APL parallelization).
I am not sure which components of the expression (if
any) would be parallelized by GNU APL. So my questions are:
1. Is it plausible that, on a reasonably modern CPU
(the i9), using multiple cores would slow down execution of this
expression?
Could very well be. The expression has a rather small amount of
parallelization, since the majority of its primitives are not
parallelized.
2. Which of the operators in the expression above
would GNU APL actually parallelize?
Currently all scalar functions, and inner and outer products of them.
One can prove that, in theory and given the GNU APL implementation,
these are the ones that should have a linear speedup (linear in the
number of cores). That is, on an i9 a scalar function on 4 cores
should be 4 times faster than on one core. In real life it is only
about 1.5 times faster. This points to a hardware bottleneck between
the cores and the memory: the scalar functions are so lightweight
that the memory accesses (fetching the operands and storing the
results) dominate the entire execution time.
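Applied to your expression, my reading of that list (not a precise
profile) is roughly:

    ⍳n  and  1↓     ⍝ not parallelized
    r∘.∣r           ⍝ outer product of a scalar function: parallelized
    0=  and  1=     ⍝ scalar functions: parallelized
    +⌿              ⍝ reduction: presumably not (only inner and outer products are listed above)
    /  (compress)   ⍝ not parallelized
    ⍴               ⍝ not parallelized (and cheap anyway)

So only r∘.∣r and the two comparisons would run in parallel; the rest
is sequential.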
3. Are there any configuration changes that I could
make to adjust the way in which parallelization is done?
If by configurations you mean ./configure options, then no. However,
some ./configure options have a performance impact on both parallel
and non-parallel execution; these should be switched off.
See README-2-configure for details.
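For example, a build intended for benchmarking could be configured
roughly like this (option names quoted from memory, so please
double-check them and their defaults in README-2-configure):

    ./configure CORE_COUNT_WANTED=syl \
                VALUE_CHECK_WANTED=no \
                VALUE_HISTORY_WANTED=no \
                DYNAMIC_LOG_WANTED=no
    make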
One other comment:
Before I realised that the SVN version is more
recent, I used the apl-1.8.tar.gz version of the code that is
available on the GNU mirror. This seems to have a minor error
in Parallel.hh: two occurrences of & in the definition of
PRINT_LOCKED, which cause a compilation error. They appear to
have been removed in the SVN version.
Yes. In the early days of GNU APL I updated the apl-1.X.tar.gz files
after every bug fix. I was then told by the GNU project that this
would mess up their mirrors, so I stopped doing that. Therefore
problems in 1.8 will only be fixed in 1.9, typically 1-2 years later.
Any comments or answers would be appreciated. Thank
you for taking the time to read my e-mail.
You're welcome.
Jürgen
Andrew