Re: [Swarm-Modelling] ABMs on Graphical Processor Units


From: Marcus G. Daniels
Subject: Re: [Swarm-Modelling] ABMs on Graphical Processor Units
Date: Fri, 28 Dec 2007 17:47:08 -0700
User-agent: Thunderbird 2.0.0.9 (X11/20071115)

Russell Standish wrote:
All you need to do is link statically, rather than dynamically. This
happens by default when you use MPICH, for instance. Then you are just
loading up the parts that you use. But seriously, how much local
memory do you get in a Cell local store? If it is not enough to store
a few megabytes of dynamic libraries, it will not be enough to do any
serious ABM simulation, which tends to need 100s of MB.
There's a lot of machinery in Open MPI that gets pulled in no matter what, owing in part to its multiple abstraction layers, including a component model. Perhaps MPICH would be easier to strip down, but even with static linkage it was clear to me it wasn't going to fit in < 64 KB of code (and, say, another 64 KB for heap), which is basically what you'd want in order to keep it resident in the local store. (Keep in mind you want some local store left over to do real work, and there is only 256 KB per SPU.) It is possible, using the latest GCC, to build a library for the Cell SPU into overlays and have callers automatically pull in the overlay they need. When a different overlay is needed, that means pulling it over the DMA. (Not much different in cost from evicting something from L2 cache, but nonetheless a cost that only experienced programmers even recognize.)
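To make the DMA bookkeeping concrete, here is a minimal double-buffering sketch for an SPU, assuming the Cell SDK's <spu_mfcio.h> intrinsics; the chunk size, tag choices, process_chunk() and the 128-byte-aligned effective address are made up for illustration. While chunk i is being computed on, chunk i+1 is already in flight, which is the usual way of hiding the local-store/main-memory latency:

  #include <spu_mfcio.h>
  #include <stdint.h>

  #define CHUNK 4096                            /* bytes per DMA transfer */
  static char buf[2][CHUNK] __attribute__((aligned(128)));

  extern void process_chunk(char *data, unsigned n);   /* the real work */

  /* ea is assumed to be a 128-byte-aligned effective address in main memory */
  void stream_agents(uint64_t ea, unsigned nchunks)
  {
      unsigned cur = 0;
      unsigned i;

      /* Prime the pipeline: start fetching the first chunk. */
      mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

      for (i = 0; i < nchunks; i++) {
          unsigned next = cur ^ 1;

          /* Kick off the next transfer before touching the current data. */
          if (i + 1 < nchunks)
              mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);

          /* Block only on the current buffer's tag, then compute on it. */
          mfc_write_tag_mask(1 << cur);
          mfc_read_tag_status_all();
          process_chunk(buf[cur], CHUNK);

          cur = next;
      }
  }

Overlay loads end up looking much the same underneath: a call stub notices the target section isn't resident and issues a DMA for it before branching.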

The problem of keeping any serial processor busy is one of keeping calculations close to their memory (or close to other blocking operations like I/O). The reality is that if we don't do that, or fail to tolerate the latency with built-in parallelism, then we're wasting compute cycles anyway. DDR will never be as fast as a register, and we can't just wave our hands and make all problems inherently parallel. I suppose one could wish that the SPUs each had a 24 MB local store, like the cache on a high-end Itanium. By my calculations that would be about 12 billion transistors.
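For what it's worth, the arithmetic behind that estimate goes roughly like this, assuming eight SPUs and ordinary 6-transistor SRAM cells: 8 SPUs x 24 MB x 8 bits/byte x 6 transistors/bit is about 9.7 billion transistors for the storage arrays alone, and decoders and other periphery push it into the 12-billion range.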

As a data point, Sony's distributed protein-folding (Folding@home) PS3 network hit a petaflop a few months ago. They started from the standard Gromacs codebase and kept reworking and optimizing it. It soon overshadowed the PC Folding@home network: http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats

Anyway, my point is not to push the Cell, but to say that GPUs, Cell processors, vector units, and conventional microprocessors all have tradeoffs. None of them gives you parallelism unless it can be proven from the code or already exists as an obvious part of the algorithm.
Marcus



