On the subject of copying arrays, I was wondering, in GNU APL, where in code such as
A <- A,1 2 3
B <- A,4 5 6
A <- A,7 8 9
(in that order) A is copied. It has to be copied in the third line (unless some weird bookkeeping is done) as B referred to A's previous value, but can be modified in place in the first two lines. Also
1 2 3 + 3 4 5
where the result of + can overwrite one of its two arguments (the same goes for any scalar function).
What is the in-place situation in GNU APL?
Thanks, Louis
Hi Blake,
The state of GNU APL is, I believe, this:
I am not aware of any unnecessary copying of arrays in GNU APL. It was claimed earlier that some copying could be avoided, but it then turned out that removing those copies would corrupt other values under certain circumstances (because different functions would modify the same value). We therefore had to revert the "unnecessary copies" to a state that seems to be safe now.
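To illustrate the hazard (a hypothetical sketch in Python, not GNU APL's actual implementation): if an interpreter skipped the copy and let two variables share one buffer, an in-place update through one name would silently change the other, which is exactly the kind of corruption described above.

```python
# Sketch of why skipping a copy can corrupt a shared value.
# Python lists alias on plain assignment, which models the unsafe case.
A = [1, 2, 3]
B = A          # no copy made: B shares A's storage
A.append(4)    # "in-place" modification of A
print(B)       # prints [1, 2, 3, 4]: B changed although never assigned
```

In APL semantics B must keep its old value here, so the interpreter has to copy (or use careful reference bookkeeping) before modifying A in place.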
Regarding parallel processing, the situation seems to be that a multi-core CPU cannot be significantly faster than a single-core CPU with the same memory (interface). A significant speedup requires that the bandwidth between the cores and the memory grow with the number of cores. I built a machine like that with some students back in 1990, but current PCs with multi-core CPUs just do not provide that.
If you make parallel1 and then run ScalarBenchmark.apl (which comes with GNU APL), then you get, for example:
-------------- Mix_IRC + Mix1_IRC --------------
average sequential startup cost:   530 cycles
average parallel startup cost:    1300 cycles
per item cost sequential:          122 cycles
per item cost parallel:            195 cycles
parallel break-even length:        not reached
This means that the additional start-up cost for a parallel computation is 1300-530 = 770 cycles, or about 240 nanoseconds on a 3.2 GHz machine. This is actually a pretty good value. Before writing my own core synchronization functions I used a standard parallelization library that took about 20,000 cycles. I believe it was libmpi, but that was many years ago, so I don't quite remember. This startup cost also includes what Elias refers to as bookkeeping.
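The arithmetic can be checked directly (a trivial sketch; the cycle counts are the benchmark figures quoted above, and the 3.2 GHz clock is the one mentioned in the text):

```python
CPU_GHZ = 3.2                    # clock rate of the benchmark machine
seq_startup = 530                # cycles (from ScalarBenchmark.apl output)
par_startup = 1300               # cycles

extra_cycles = par_startup - seq_startup  # 770 cycles
extra_ns = extra_cycles / CPU_GHZ         # cycles / (cycles per ns)
print(extra_cycles, extra_ns)             # 770 cycles, roughly 240 ns
```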
What remains (and is the show-stopper) is the per-item cost. A single core needs 122 cycles for adding two numbers, while 4 cores need 195 cycles per core = 780 cycles in total. The code in both cases is exactly the same. Putting it differently: if I run alone, then some function takes 122 cycles; when my colleagues work in parallel on something that has nothing to do with my work, then I need 780 cycles.
Once the parallel startup has finished, the cores work independently and without any locks or the like between them. This cannot be explained at the software level but rather suggests that some common resource (memory?!) is slowing each core down when some other core(s) are working at the same time. The 122 cycles (~40 ns) in the single-core case is roughly the time for one DRAM access in page mode. In other words, the main memory bandwidth (at least of my machine) is just enough to feed one core, but far too low for 4 cores.
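The per-item numbers support that reading; a small check (again assuming the 3.2 GHz clock from above):

```python
CPU_GHZ = 3.2
seq_per_item = 122            # cycles per addition on 1 core
par_per_item = 195            # cycles per addition, per core, with 4 cores
cores = 4

total_par = par_per_item * cores   # 780 cycles of machine work per item
seq_ns = seq_per_item / CPU_GHZ    # ~38 ns, about one DRAM page-mode access
print(total_par, seq_ns)
```

So the 4-core run burns more than six times the machine cycles per item, which points at contention for a shared resource rather than at the software.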
GNU APL can do little to fix this bottleneck. I believe I have done everything that can be done in software (although new ideas are welcome), but if we hit hardware bottlenecks then that's it. I believe GNU APL would run perfectly on a 4-core CPU with a 4-port main memory, but that will probably remain a dream in my lifetime.
Best Regards,
Jürgen Sauermann
On 07/04/2017 06:15 PM, Blake McBride wrote:
Hi Elias,
I remember your earlier work and comments. In fact, that is the source of my knowledge of this. Putting more features on a base with known issues is problematic for the following reasons:
1. The time and effort it would take to secure the base is being spent elsewhere.
2. Adding new features is, in effect, ignoring the problem and moving on.
3. APL is good and useful as it is. If it's not being used as-is, then there is some other issue (like efficiency). I remember another post about unnecessary bloat because of the unnecessary copying, which was causing his machine to page unnecessarily. The unnecessary copying is major in terms of performance and memory efficiency. This is big.
4. APL was developed with the promise of parallel
processing. This is something that wasn't practical back
then, but is now. I am under the impression that this was one
of the big reasons Juergen created GNU APL - to take advantage
of parallel processing.
Just one opinion.
Blake