
Re: [Bug-apl] Proposal for initial multicore support


From: Elias Mårtenson
Subject: Re: [Bug-apl] Proposal for initial multicore support
Date: Mon, 10 Mar 2014 22:31:54 +0800

This is something that I have been thinking about as well. The thing that concerns me most is how multiple operations can be optimised together. Part of that thinking is behind my idea of keeping track of temporary objects so that they can be modified in place. The core of that was already implemented by Jürgen. Thanks for that. :-)  (By the way, there are plenty more functions that could benefit from in-place modification.)
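
The idea, as I understand it, boils down to something like the following. This is a minimal self-contained sketch, not GNU APL's actual Value handling; the names and the shared_ptr-based ownership test are invented for the example:

    #include <algorithm>
    #include <memory>
    #include <vector>

    using Array = std::vector<double>;

    // If the caller std::move()s a temporary in, use_count() == 1 and we can
    // overwrite its storage; otherwise we copy first and leave the argument alone.
    std::shared_ptr<Array> scalar_fn(std::shared_ptr<Array> arg, double (*fn)(double))
    {
        std::shared_ptr<Array> result =
            (arg.use_count() == 1) ? std::move(arg)                  // temporary: reuse in place
                                   : std::make_shared<Array>(*arg);  // shared: allocate a copy
        std::transform(result->begin(), result->end(), result->begin(), fn);
        return result;
    }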

About multithreading. I have a few concerns, and I'll outline them all here for posterity:

Task dispatch and job construction

Suppose you have a large array A (say, 8,000 rows) and you perform the operation 1+2×A. If you have 8 cores, you want to perform the following operations:

    Core 1: {1+2×⍵}¨A[⍳1000;]
    Core 2: {1+2×⍵}¨A[1000+⍳1000;]
    Core 3: {1+2×⍵}¨A[2000+⍳1000;]
    etc...

Note that with in-place modifications this can be quite efficient.

However, note that there must be some intelligent algorithm figuring out that the sequence 1+2× can be merged into a single threadable job.

If this optimisation is not done, then with plain, non-optimised multi-core dispatching we will waste a lot of time on dispatch and the subsequent collection of results:

    Core 1: {2×⍵}¨A[⍳1000;]
    Core 2: {2×⍵}¨A[1000+⍳1000;]
    etc...
Then collect the result of the multiplication (call it B), followed by another dispatch to execute:
    Core 1: {1+⍵}¨B[⍳1000;]
    Core 2: {1+⍵}¨B[1000+⍳1000;]
    etc...

Unless the arrays that you are working on are very large, you are going to spend more time synchronising the threads than actually doing real work.
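
For what it's worth, a fused, chunked dispatch of 1+2×A might look roughly like this in C++. It is only a sketch on a flat double array using plain std::thread, assuming nothing about GNU APL's internals:

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Fuse "1+2×A" into a single kernel and run it once per chunk, one chunk per
    // hardware thread, modifying the array in place.
    void fused_one_plus_two_times(std::vector<double> & a)
    {
        const std::size_t n       = a.size();
        const std::size_t workers = std::max<std::size_t>(1, std::thread::hardware_concurrency());
        const std::size_t chunk   = (n + workers - 1) / workers;

        std::vector<std::thread> pool;
        for (std::size_t w = 0; w < workers; ++w)
        {
            const std::size_t lo = w * chunk;
            const std::size_t hi = std::min(n, lo + chunk);
            if (lo >= hi) break;
            pool.emplace_back([&a, lo, hi]
            {
                for (std::size_t i = lo; i < hi; ++i)
                    a[i] = 1.0 + 2.0 * a[i];      // both primitives applied in one pass
            });
        }
        for (std::thread & t : pool) t.join();    // a single synchronisation point
    }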

Thread model

Unless all threadable operations always take the same amount of time (i.e. no branching), a simple thread pool will not be enough. One would need some kind of work-stealing queue system. There are several available for C++, but as I'm not a C++ developer I don't have any experience with them. However, Threading Building Blocks looks promising.
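
From its documentation, the same fused loop expressed on top of Threading Building Blocks would look roughly like this (an untested sketch for illustration):

    #include <cstddef>
    #include <vector>
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>

    // tbb::parallel_for splits the range into tasks and its scheduler steals
    // work between threads, so uneven chunks are rebalanced automatically.
    void fused_one_plus_two_times_tbb(std::vector<double> & a)
    {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, a.size()),
            [&a](const tbb::blocked_range<std::size_t> & r)
            {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    a[i] = 1.0 + 2.0 * a[i];
            });
    }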

Synchronisation

Having the EACH operator be threadable sounds incredibly tempting, but how should the system deal with synchronisation? It's clear that most operations do not have to worry about synchronisation, since each one works only on its own small slice of an array, but what about something like this:

    N←0 ◊ {⍵+N←N+1}¨⍳100

The easy way around this would be to simply not multi-thread functions that contain the assignment operator. The benefit of this is that one can do away with locking almost completely.
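
The check itself could be very cheap, something along these lines. The token representation here is invented for the example; it is not how the interpreter actually stores function bodies:

    #include <algorithm>
    #include <vector>

    // Illustrative rule: before running a function body under a parallel EACH,
    // refuse if it contains any assignment, so no shared state can be written.
    enum class TokenTag { Value, Function, Assign };

    bool safe_to_parallelise(const std::vector<TokenTag> & body)
    {
        return std::none_of(body.begin(), body.end(),
                            [](TokenTag t) { return t == TokenTag::Assign; });
    }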

Multithreaded memory management

Right now, instances of Value are allocated using the default allocator. This allocator calls down to malloc(), which is not the fastest in the world. Many implementations of malloc() also serialise allocations behind a single lock, making allocation effectively single-threaded. Since GNU APL does a lot of memory allocation, this will definitely be a bottleneck. I would suggest using an allocator with a thread-local memory pool, ensuring that two parallel memory allocations will not interfere with each other.
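
To make the idea concrete, a thread-local pool for fixed-size blocks (e.g. sizeof(Value)) could look something like this. It is only a sketch: it ignores cross-thread frees and never bounds the size of each pool:

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Each thread keeps its own free list, so parallel allocations never
    // contend on a shared lock; malloc() is only hit when the list is empty.
    template <std::size_t BLOCK_SIZE>
    class ThreadLocalPool
    {
    public:
        static void * allocate()
        {
            std::vector<void *> & pool = free_list();
            if (!pool.empty())                    // fast path: reuse a cached block
            {
                void * p = pool.back();
                pool.pop_back();
                return p;
            }
            return std::malloc(BLOCK_SIZE);       // slow path: fall back to malloc()
        }

        static void deallocate(void * p)
        {
            free_list().push_back(p);             // cache the block in this thread's pool
        }

    private:
        static std::vector<void *> & free_list()
        {
            thread_local std::vector<void *> pool;   // one pool per thread, no locking
            return pool;
        }
    };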

I'm sure there are more issues, but these are the ones I have been thinking about so far. Comments welcome.

Regards,
Elias


On 10 March 2014 19:44, Juergen Sauermann <address@hidden> wrote:
Hi David,

I think your ideas are correct. I have planned multicore support for GNU APL 1.4, and
all help is welcome.

Actually, parallel processing was one of the main reasons for me to write GNU APL.
I published a Ph.D. thesis, "A parallel APL computer" (in German), back in 1990. We had
built a prototype based on multiple MC68020 processors and demonstrated that most
APL primitive functions and operators can benefit from parallel execution. The only
thing missing at that time was the interpreter (which is now GNU APL).

My thoughts so far have been going in the direction of OpenMP, which uses a similar approach
to the one you describe. But I haven't worked with it yet, and there are other solutions around.
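
From its documentation, a loop like the 1+2×A example above would look roughly like this
under OpenMP (an untested sketch for illustration, not GNU APL code):

    #include <cstddef>

    // OpenMP splits the iteration space across the available cores; the fused
    // kernel then runs on each chunk with no explicit thread management.
    void fused_one_plus_two_times_omp(double * a, std::size_t n)
    {
        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(n); ++i)
            a[i] = 1.0 + 2.0 * a[i];
    }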

The most important step is to find the best approach for GNU APL. That probably means a lot
of benchmarking and experimenting.

In GNU APL every primitive function is a separate object, so we can decide on a per-function
basis at what length we go multi-core.
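
One way to picture that is a per-function threshold, roughly like this. The names are
invented for the sketch and are not the actual GNU APL class layout:

    #include <cstddef>

    // Hypothetical sketch: each primitive reports its own break-even length, so
    // the dispatcher only goes multi-core when the argument is large enough.
    class PrimitiveFunction
    {
    public:
        virtual ~PrimitiveFunction() = default;

        // cheap primitives return a large threshold, expensive ones a small one
        virtual std::size_t parallel_threshold() const { return 10000; }

        bool should_parallelise(std::size_t element_count) const
        {
            return element_count >= parallel_threshold();
        }
    };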

/// Jürgen





On 03/10/2014 06:33 AM, David B. Lamkins wrote:
This weekend I spent a few hours poking around in the GNU APL source
code while thinking about what it might take to exploit a multicore CPU.
I've collected my thoughts on what it might take to do an initial
implementation (see attached).

Juergen (and anyone else who has worked in the code base), I'd
appreciate some feedback regarding the feasibility of my proposal, as
well as your comments about things I've overlooked.

This is something I'd be willing to work on in my spare time.





