
Re: [Bug-apl] Proposal for initial multicore support


From: David Lamkins
Subject: Re: [Bug-apl] Proposal for initial multicore support
Date: Mon, 10 Mar 2014 11:19:55 -0700

That's a good analysis of the larger landscape, Elias. Thanks for posting.

I intend to avoid many of those issues by working on what I think is
the simplest case: scalar functions that don't reshape the result. That
obviously doesn't help with the more general cases...

Regarding your first example of parallelizing the each operator:

{1+2×⍵}¨A

It seems suboptimal to decompose the lambda into its individual
operations. What if the lambda could (somehow) be executed as a whole
rather than decomposed? To be honest, I don't know enough about the
interpreter to know whether this is even feasible.
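
To make the idea concrete, here's a rough sketch in C++ (illustration
only: apply_fn stands in for a hypothetical interpreter entry point
that evaluates the whole lambda on one ravel element; GNU APL's real
internals will differ):

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    // Each worker applies the *whole* lambda to its slice in a single
    // pass, instead of one dispatch/collect cycle per primitive.
    void parallel_each(std::vector<double>& ravel,
                       const std::function<double(double)>& apply_fn,
                       unsigned n_workers)
    {
        std::vector<std::thread> workers;
        const size_t chunk = (ravel.size() + n_workers - 1) / n_workers;
        for (unsigned w = 0; w < n_workers; ++w) {
            const size_t lo = w * chunk;
            const size_t hi = std::min(ravel.size(), lo + chunk);
            workers.emplace_back([&ravel, &apply_fn, lo, hi] {
                for (size_t i = lo; i < hi; ++i)
                    ravel[i] = apply_fn(ravel[i]);   // whole lambda, one pass
            });
        }
        for (auto& t : workers) t.join();
    }

For {1+2×⍵} the call would pass something equivalent to
[](double x) { return 1 + 2 * x; } as apply_fn.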

Also, I admit that I haven't given more than passing thought to the
consequences of intermediate results in the lambda. It might be useful
to attempt a parallel each only when the function is a composition of
scalar functions. (That simplification would address your concern
about side-effects, too.)
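
For instance, something along these lines (a sketch only; the token
representation is invented for illustration and doesn't match GNU
APL's actual classes):

    #include <vector>

    // Invented token classification, for illustration only.
    enum class TokKind { ScalarPrimitive, Value, Assignment, Other };
    struct Tok { TokKind kind; };

    // A function qualifies for a parallel each only if it is built purely
    // from scalar primitives and constants: no assignment, no structural
    // functions, hence no side-effects and no reshaping of the result.
    bool parallel_each_safe(const std::vector<Tok>& body)
    {
        for (const Tok& t : body)
            if (t.kind != TokKind::ScalarPrimitive &&
                t.kind != TokKind::Value)
                return false;
        return true;
    }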



On Mon, Mar 10, 2014 at 7:31 AM, Elias Mårtenson <address@hidden> wrote:
> This is something that I have been thinking about as well. The thing
> that most concerns me is how multiple operations can be optimised. Part of
> that thinking is actually part of my idea to keep track of temporary objects
> so that they can be modified in place. The core of that was already
> implemented by Jürgen. Thanks for that. :-)  (By the way, there are plenty
> more functions that could benefit from in-place modification.)
>
> About multithreading. I have a few concerns, and I'll outline them all here
> for posterity:
>
> Task dispatch and job construction
>
> Suppose you have a large array A (say, 8k rows) and you perform the
> operation 1+2×A. If you have 8 cores, you want to perform the following
> operations:
>
>     Core 1: {1+2×⍵}¨A[⍳1000;]
>     Core 2: {1+2×⍵}¨A[1000+⍳1000;]
>     Core 3: {1+2×⍵}¨A[2000+⍳1000;]
>     etc...
>
> Note that with in-place modifications this can be quite efficient.
>
> However, note that there must be some intelligent algorithm figuring out
> that the sequence 1+2× can be merged into a single threadable job.
>
> If this optimisation is not done, then plain non-optimised multi-core
> dispatching will waste a lot of time on dispatch and the subsequent
> collection of results:
>
>     Core 1: {2×⍵}¨A[⍳1000;]
>     Core 2: {2×⍵}¨A[1000+⍳1000;]
>     etc...
> Then collect the results of the multiplication, followed by another dispatch
> to execute:
>     Core 1: {1+⍵}¨A[⍳1000;]
>     Core 2: {1+⍵}¨A[1000+⍳1000;]
>     etc...
>
> Unless the arrays that you are working on are very large, you are going to
> spend more time synchronising the threads than actually doing real work.
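>
> To make the cost concrete, here are the two strategies as plain C++
> loops over the ravel (illustration only, ignoring the thread dispatch
> itself):
>
>     #include <vector>
>
>     // Unfused: two jobs, two full passes over memory, two rounds of
>     // dispatch and collection.
>     void two_passes(std::vector<double>& a)
>     {
>         for (double& x : a) x = 2 * x;       // first job: 2×⍵
>         // ... join all threads, then dispatch again ...
>         for (double& x : a) x = 1 + x;       // second job: 1+⍵
>     }
>
>     // Fused: one job, one pass, one round of synchronisation.
>     void one_pass(std::vector<double>& a)
>     {
>         for (double& x : a) x = 1 + 2 * x;   // 1+2×⍵ merged
>     }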
>
> Thread model
>
> Unless all threadable operations always take the same amount of time (i.e.
> no branching), a simple thread pool will not be enough. One would need some
> kind of work-stealing queue system. There are several available for C++, but
> as I'm not a C++ developer I don't have any experience with them. However,
> Threading Building Blocks looks promising.
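>
> For what it's worth, a Threading Building Blocks version might look
> roughly like this (a sketch; TBB's scheduler does the work stealing
> behind the scenes):
>
>     #include <tbb/blocked_range.h>
>     #include <tbb/parallel_for.h>
>     #include <vector>
>
>     void parallel_map(std::vector<double>& a)
>     {
>         // TBB splits the index range into chunks; idle workers steal
>         // chunks from busy ones, so uneven per-element cost evens out.
>         tbb::parallel_for(
>             tbb::blocked_range<size_t>(0, a.size()),
>             [&a](const tbb::blocked_range<size_t>& r) {
>                 for (size_t i = r.begin(); i != r.end(); ++i)
>                     a[i] = 1 + 2 * a[i];
>             });
>     }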
>
> Synchronisation
>
> Having the EACH operator be threadable sounds incredibly tempting, but how
> should the system deal with synchronisation? It's clear that most operations
> do not have to worry about synchronisation, since each works only on its
> own small slice of an array, but what about something like this:
>
>     N←0 ◊ {⍵+N←N+1}¨⍳100
>
> The easy way around this would be simply not to multi-thread functions that
> contain the assignment operator. The benefit of this is that one can do
> away with locking almost completely.
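>
> A conservative check could be as simple as scanning the function body
> for the assignment arrow before taking the parallel path (a sketch;
> how the interpreter actually represents the body may differ):
>
>     #include <string>
>
>     // Refuse to parallelise any lambda whose source contains ←.
>     // Conservative (rejects some safe functions), but needs no locks.
>     bool may_parallelise(const std::u32string& lambda_source)
>     {
>         return lambda_source.find(U'←') == std::u32string::npos;
>     }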
>
> Multithreaded memory management
>
> Right now, instances of Value are allocated using the default allocator.
> This allocator calls down to malloc(), which is not the fastest in the
> world. Most implementations of malloc() also have a single lock, making the
> allocation operation effectively single-threaded. Since GNU APL does a lot
> of memory allocations, this will definitely be a bottleneck. I would suggest
> using an allocator that has a thread-local memory pool, ensuring that two
> parallel memory allocations will not interfere with each other.
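>
> A minimal sketch of the idea, assuming fixed-size cells that are freed
> by the thread that allocated them (real pool allocators such as those
> in tcmalloc or jemalloc are far more sophisticated):
>
>     #include <cstddef>
>     #include <cstdlib>
>
>     // Per-thread free list of same-sized blocks: each thread allocates
>     // and frees from its own pool, so no lock is ever taken.
>     struct Pool {
>         struct Node { Node* next; };
>         Node* free_list = nullptr;
>
>         void* alloc(std::size_t size)
>         {
>             if (free_list) {                  // reuse a freed block
>                 Node* n = free_list;
>                 free_list = n->next;
>                 return n;
>             }
>             return std::malloc(size);         // cold path: go to malloc
>         }
>         void release(void* p)                 // return a block to the pool
>         {
>             Node* n = static_cast<Node*>(p);
>             n->next = free_list;
>             free_list = n;
>         }
>     };
>
>     thread_local Pool value_pool;             // one pool per thread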
>
> I'm sure there are more issues, but these are the ones I have been thinking
> about so far. Comments welcome.
>
> Regards,
> Elias
>
>
> On 10 March 2014 19:44, Juergen Sauermann <address@hidden>
> wrote:
>>
>> Hi David,
>>
>> I think your ideas are correct. I have planned multicore support for GNU
>> APL 1.4, and any help is welcome.
>>
>> Actually parallel processing was one of the main reasons for me to write
>> GNU APL.
>> I published a Ph.D. thesis, "A parallel APL computer" (in German), back
>> in 1990. We had
>> built a prototype based on multiple MC68020 processors and demonstrated
>> that most
>> APL primitive functions and operators can benefit from parallel execution.
>> The only
>> thing missing at that time was the interpreter (which is now GNU APL).
>>
>> My thoughts so far have been going in the direction of OpenMP, which
>> uses a similar approach to the one you describe. But I haven't worked
>> with it yet, and there are other solutions around.
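>>
>> As a first taste, the OpenMP version of a scalar function over the
>> ravel is little more than one annotation (a sketch; scheduling and
>> chunk sizes would need the benchmarking mentioned below):
>>
>>     // Compile with -fopenmp. The pragma splits the loop across all
>>     // cores; schedule(static) gives each core an equal slice, like
>>     // the manual per-core dispatch discussed above.
>>     void scalar_fun(double* ravel, long len)
>>     {
>>     #pragma omp parallel for schedule(static)
>>         for (long i = 0; i < len; ++i)
>>             ravel[i] = 1.0 + 2.0 * ravel[i];
>>     }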
>>
>> The most important step is to find the best approach for GNU APL. That
>> probably means a lot
>> of benchmarking and experimenting.
>>
>> In GNU APL every primitive function is a separate object, so we can decide
>> on a per-function basis at what array length we go multi-core.
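>>
>> Roughly like this (a sketch with invented names; the real classes
>> differ):
>>
>>     // Each primitive function object carries its own break-even
>>     // length, found by benchmarking; short arrays stay sequential.
>>     struct PrimitiveFunction {
>>         long parallel_threshold;              // tuned per function
>>
>>         void eval(double* ravel, long len) {
>>             if (len < parallel_threshold)
>>                 eval_sequential(ravel, len);  // dispatch not worth it
>>             else
>>                 eval_parallel(ravel, len);    // e.g. OpenMP as above
>>         }
>>         void eval_sequential(double* a, long n) {
>>             for (long i = 0; i < n; ++i) a[i] = 1.0 + 2.0 * a[i];
>>         }
>>         void eval_parallel(double* a, long n) {
>>     #pragma omp parallel for
>>             for (long i = 0; i < n; ++i) a[i] = 1.0 + 2.0 * a[i];
>>         }
>>     };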
>>
>> /// Jürgen
>>
>>
>>
>>
>>
>> On 03/10/2014 06:33 AM, David B. Lamkins wrote:
>>>
>>> This weekend I spent a few hours poking around in the GNU APL source
>>> code while thinking about what it might take to exploit a multicore CPU.
>>> I've collected my thoughts on an initial implementation (see attached).
>>>
>>> Juergen (and anyone else who has worked in the code base), I'd
>>> appreciate some feedback regarding the feasibility of my proposal, as
>>> well as your comments about things I've overlooked.
>>>
>>> This is something I'd be willing to work on in my spare time.
>>>
>>>
>>
>>
>



-- 
"The secret to creativity is knowing how to hide your sources."
   Albert Einstein


http://soundcloud.com/davidlamkins
http://reverbnation.com/lamkins
http://reverbnation.com/lcw
http://lamkins-guitar.com/
http://lamkins.net/
http://successful-lisp.com/


