This is something I have been thinking about as well, and what concerns me most is how multiple operations can be optimised. Part of that thinking is behind my idea of keeping track of temporary objects so that they can be modified in place. The core of that was already implemented by Jürgen. Thanks for that. :-) (By the way, there are plenty more functions that could benefit from in-place modification.)
About multithreading. I have a few concerns, and I'll outline them all here for posterity:
Task dispatch and job construction
Suppose you have a large array A (say, 8k rows) and you perform the operation 1+2×A. With 8 cores, you would want to perform the following operations:
Core 1: {1+2×⍵}¨A[⍳1000;]
Core 2: {1+2×⍵}¨A[1000+⍳1000;]
Core 3: {1+2×⍵}¨A[2000+⍳1000;]
etc...
Note that with in-place modifications this can be quite efficient.
However, note that there must be some intelligent algorithm figuring out that the sequence 1+2× can be merged into a single threadable job.
If this optimisation is not done, plain non-optimised multi-core dispatching will waste a lot of time on dispatch and the subsequent collection of results:
Core 1: {2×⍵}¨A[⍳1000;]
Core 2: {2×⍵}¨A[1000+⍳1000;]
etc...
Then the results of the multiplication are collected, followed by another dispatch to execute:
Core 1: {1+⍵}¨A[⍳1000;]
Core 2: {1+⍵}¨A[1000+⍳1000;]
etc...
Unless the arrays that you are working on are very large, you are going to spend more time synchronising the threads than actually doing real work.
Thread model
Unless all threadable operations always take the same amount of time (i.e. no branching), a simple thread pool will not be enough. One would need some kind of work-stealing queue system. There are several available for C++, but as I'm not a C++ developer I don't have any experience with them. However,
Threading Building Blocks looks promising.
Synchronisation
Having the EACH operator be threadable sounds incredibly tempting, but how should the system deal with synchronisation? It's clear that most operations do not have to worry about synchronisation, since each one works only on its own small slice of an array, but what about something like this:
N←0 ◊ {⍵+N←N+1}¨⍳100
The easy way around this would be simply not to multithread functions that contain assignment. The benefit of this is that one can do away with locking almost completely.
Multithreaded memory management
Right now, instances of Value are allocated using the default allocator. This allocator calls down to malloc(), which is not the fastest in the world. Most implementations of malloc() also have a single lock, making the allocation operation effectively single-threaded. Since GNU APL does a lot of memory allocations, this will definitely be a bottleneck. I would suggest using an allocator that has a thread-local memory pool, ensuring that two parallel memory allocations will not interfere with each other.
I'm sure there are more issues, but these are the ones I have been thinking about so far. Comments welcome.
Regards,
Elias