Hello guys,
I've spent some time experimenting with various performance optimisations and I would like to share my latest results with you:
I've run a lot of tests using Callgrind, which is part of the
Valgrind tool (documentation
here). In doing so, I've concluded that a disproportionate amount of time is spent copying values (this can be parallelised; more about that below).
I set out to see how much faster I could make simple test program that applied a monadic scalar function. Here is my test program:
This program calls my time operator which simply shows the amount of time it took to execute the operation. This is of course needed for benchmarking. For completeness, here is the implementation of time:
∇Z←L (OP time) R;start;end
'Time:',((end[3]+end[2]×1E6) - (start[3]+start[2]×1E6))÷1E6
The unmodified version of GNU APL runs this in 5037.00 milliseconds on my machine.
I then set out to minimise the amount of cloning of values, taking advantage of the existing temp functionality. Once I had done this, the execution time was reduced to 2577.00 ms.
I then used the
Threading Building Blocks library to parallelise two operations: The clone operation and the monadic
SkalarFunction::eval_skalar_B(). After this, on my 4-core machine, the runtime was reduced to
1430.00 ms.
Threading Building Blocks is available from the application repositories of at least Arch Linux and Ubuntu, and I'm sure it's available elsewhere too. To test in on OSX I had to download it separately.
To summarise:
- Standard: 5037.00
- Reduced cloning: 2577.00
- Parallel: 1430.00
I have attached the patch, but it's definitely not something that should be applied blindly. I have hacked around is several parts of the code, some of which I can't say I understand fully, so see it as a proof-of-concept, nothing else.
Note that the code that implements the parallelism using TBB is pretty ugly, and the code ends up being duplicated in the parallel and non-parallel version. This can, of course, be encapsulated much nicer if one wants to make this generic.
Another thing, TBB is incredibly efficient, especially on Intel CPU's. I'd be very interested to see how OpenMP performs on this same code.
Regards,
Elias