
Re: [Swarm-Support] Parallelism and threads


From: Marcus G. Daniels
Subject: Re: [Swarm-Support] Parallelism and threads
Date: Sat, 29 Jan 2011 11:40:57 -0700
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101207 Thunderbird/3.1.7

On 1/28/2011 4:10 PM, Goldsby, Mike E wrote:
 
> Does anyone out there know
> 1) whether swarm supports multi-host parallelism?
> 2) whether swarm supports multiple threads on each host or is single-threaded?

To extract concurrency, Swarm uses discrete time steps: an event can occur at t+1, but not at t+1.5.  When a time like t+1.5 is needed, the timescale of the whole simulation can be rescaled by a factor of 2 or 10, and so on, until every event lands on an integer step.  The advantage of this approach is that when many events share the same timestamp, they can be combined.  The data structure Swarm uses is an ordered linked list whose entries are either individual events or a wrapper denoting a fused set of concurrent events.
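A plain C sketch of that kind of schedule structure (the names here are invented for illustration; this is not Swarm's actual activity library code):

/* Illustrative only: an ordered list of integer time steps, where every
   event sharing a timestamp is collected into one node and can be
   dispatched together. */
#include <stdlib.h>

typedef void (*event_fn)(void *agent);

typedef struct event {
    event_fn      fn;
    void         *agent;
    struct event *next;
} event_t;

typedef struct timestep {            /* one node per discrete time value */
    long             t;              /* integer timestamp: t, t+1, ...   */
    event_t         *events;         /* fused set of concurrent events   */
    struct timestep *next;           /* list kept ordered by t           */
} timestep_t;

/* Insert an event, merging it into an existing node if the timestamp
   already exists; otherwise splice a new node into the ordered list. */
static void schedule_at (timestep_t **head, long t, event_fn fn, void *agent)
{
    timestep_t **p = head;
    while (*p && (*p)->t < t)
        p = &(*p)->next;
    if (!*p || (*p)->t != t) {
        timestep_t *node = calloc (1, sizeof *node);
        node->t = t;
        node->next = *p;
        *p = node;
    }
    event_t *ev = malloc (sizeof *ev);
    ev->fn = fn;
    ev->agent = agent;
    ev->next = (*p)->events;
    (*p)->events = ev;
}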

There are practical and conceptual problems with exploiting that concurrency.  Consider a simulation with thousands of agents whose activities occur on a similar timescale.  A big processor like a Magny-Cours can advance a dozen threads at once.  Coordinating data movement between caches on the same chip has a latency on the order of several nanoseconds, and pulling memory from another processor on the same node is on the order of several hundred nanoseconds.  Since one or more instructions issue every cycle (about 1/3 of a nanosecond), roughly 1000 instructions of useful work are needed to justify moving memory between processors.  The Heatbug step method is 1072 x86_64 instructions, so in terms of machine balance, multi-core (#2) would probably work for Heatbugs.
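The balance argument, spelled out with the numbers above treated as rough assumptions:

/* Back-of-the-envelope balance check (illustrative numbers only). */
#include <stdio.h>

int main (void)
{
    double remote_latency_ns = 300.0;    /* assumed remote-memory latency     */
    double ns_per_insn       = 1.0/3.0;  /* ~1 instruction/cycle at ~3 GHz    */
    double step_insns        = 1072.0;   /* Heatbug step method, x86_64       */

    double breakeven = remote_latency_ns / ns_per_insn;   /* ~900-1000 insns */
    printf ("break-even: ~%.0f instructions; Heatbug step: %.0f\n",
            breakeven, step_insns);
    return 0;
}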

A single Heatbug has 88 bytes of state associated with it on x86_64.  Moving it between nodes of a good cluster (e.g. QDR Infiniband) using MPI would take about 2.5 microseconds.  One could arrange to spatially load-balance the bugs across different regions of the space, but that requires inspecting more memory to decide which bugs to move.  A naive implementation that did not efficiently identify and group bugs for shipment would spend all of its time waiting for the network to move bugs around.  Approach #1 (multi-host) can require substantial programmer intervention to organize the simulation so that large blocks of agents move at once; otherwise it won't scale.
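For what it's worth, here is a minimal MPI sketch of the "ship agents in blocks" idea.  The struct layout and function names are invented for illustration; this is not Swarm code:

/* Pack the bugs that crossed into a neighbor's region into one contiguous
   buffer and ship them with a single message, so the ~2.5 us per-message
   cost is paid once per block rather than once per 88-byte bug. */
#include <mpi.h>

typedef struct {          /* stand-in for a Heatbug's migratable state */
    int    x, y;
    double unhappiness;
    double ideal_temp, output_heat;
} bug_state_t;

void ship_bugs (const bug_state_t *outgoing, int count, int dest_rank)
{
    /* Send the count first so the receiver knows how much follows. */
    MPI_Send (&count, 1, MPI_INT, dest_rank, 0, MPI_COMM_WORLD);
    MPI_Send ((void *) outgoing, (int)(count * sizeof (bug_state_t)),
              MPI_BYTE, dest_rank, 1, MPI_COMM_WORLD);
}

/* Receiver side: the incoming buffer is assumed large enough for the
   announced count (a real implementation would check or resize). */
void receive_bugs (bug_state_t *incoming, int src_rank, int *count)
{
    MPI_Status st;
    MPI_Recv (count, 1, MPI_INT, src_rank, 0, MPI_COMM_WORLD, &st);
    MPI_Recv (incoming, (int)(*count * sizeof (bug_state_t)),
              MPI_BYTE, src_rank, 1, MPI_COMM_WORLD, &st);
}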

Simultaneous multithreading (SMT) is a technique that can absorb the kind of unpredictable latency patterns that agent simulations tend to create.  One extreme implementation is the Threadstorm processor used in the Cray XMT; Sandia has one of these.  In the Magny-Cours case, when memory is needed from another processor, the core must stall and wait for it.  An XMT system, in contrast, can schedule other work from one of its 128 hardware threads.  Other machines that can do this, with 2-8 hardware threads, include the Sparc T3, Power 7, and most Intel processors.  nVidia GPUs also have very lightweight threads.  I think it would be interesting to investigate the scalability of this approach with agent models.

Even if the scaling problems can be addressed, there's a basic problem with simulations written in imperative (C/Java) languages.  If two agents try to update the same cell at the same time, the cell can become corrupted.  For example, suppose the cell holds a counter: agent A reads it, increments it, and writes it back; meanwhile, agent B has incremented and written the same counter.  Agent B's update is erased by agent A's write.
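A minimal C/pthreads sketch of that lost-update scenario (the names are invented for illustration):

/* Two threads ("agents") each do a non-atomic read-increment-write on the
   same cell counter, so some increments are lost.  Run it a few times:
   the total rarely reaches 200000. */
#include <pthread.h>
#include <stdio.h>

static long cell_counter = 0;          /* shared cell state */

static void *agent_step (void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++) {
        long tmp = cell_counter;       /* read                                  */
        tmp = tmp + 1;                 /* increment                             */
        cell_counter = tmp;            /* write: may erase the other's update   */
    }
    return NULL;
}

int main (void)
{
    pthread_t a, b;
    pthread_create (&a, NULL, agent_step, NULL);
    pthread_create (&b, NULL, agent_step, NULL);
    pthread_join (a, NULL);
    pthread_join (b, NULL);
    printf ("counter = %ld (expected 200000)\n", cell_counter);
    return 0;
}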

The pessimistic way to address this problem is to put locks on everything that can be touched by more than one agent.  This ends up being very slow because the synchronization points stall the processors.  Worse, it is very hard for most people to understand this kind of code.
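The pessimistic version of the counter update above, as a small sketch:

/* Guard the cell with a mutex.  Correct, but every agent that touches the
   cell now serializes on the lock. */
#include <pthread.h>

static long cell_counter = 0;
static pthread_mutex_t cell_lock = PTHREAD_MUTEX_INITIALIZER;

static void increment_cell (void)
{
    pthread_mutex_lock (&cell_lock);   /* synchronization point: others stall here */
    cell_counter++;
    pthread_mutex_unlock (&cell_lock);
}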

An optimistic way to address the problem is software transactional memory (STM).  The idea is that if most agents don't typically touch the same data, they don't interfere and thus don't require synchronization.  When they do conflict, the runtime detects it and rolls back one of the agents, which then retries its change to the simulation.  To use STM, a programmer has to mark which operations need to be atomic.  In the example above, the read, increment, and write would become one atomic operation rather than three.  Realizing this usually requires compiler support.  Resolving a conflict is more expensive than taking a lock, so it is important that the hypothesis about the agents' memory independence actually holds.
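As a rough sketch of what the annotation looks like, using GCC's transactional memory extension (compile with -fgnu-tm); this is only an illustration, not something Swarm provides:

/* The read-increment-write becomes one atomic transaction: if another
   agent touched the same cell concurrently, the runtime rolls one of them
   back and retries, instead of taking a lock up front. */
static long cell_counter = 0;

static void increment_cell (void)
{
    __transaction_atomic {
        cell_counter = cell_counter + 1;
    }
}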

I think the ideal thing for ABM would be a hybrid of STM and SMT.  Pieces of what's needed exist:  OpenCL kernels with dependencies on memory transfers could roughly implement SMT on a GPU.  GHC Haskell has an STM implementation, but there's no automatic path to compile general Haskell to an OpenCL kernel.  The GCC transactional memory implementation could probably be adapted for use with Objective C and Swarm.  The Cray XMT has metadata on memory that indicates whether it has been read or written, so building an STM on that should be feasible.

Marcus
