freepooma-devel

RE: [pooma-dev] Plan for Reducing Pooma's Running Time


From: James Crotinger
Subject: RE: [pooma-dev] Plan for Reducing Pooma's Running Time
Date: Tue, 21 Aug 2001 10:11:19 -0600



> -----Original Message-----
> From: Jeffrey Oldham [mailto:address@hidden]
> Sent: Tuesday, August 14, 2001 9:47 PM
> To: address@hidden
> Subject: [pooma-dev] Plan for Reducing Pooma's Running Time
>
[...]

> To permit comparisons between executions of different programs, we can
> compute the abstraction ratio for using data-parallel or Pooma2
> abstractions.  The abstraction ratio is the ratio of a program's
> running time to the corresponding C program's running time.  We want
> this ratio to be at most one.

Two comments:

First, I think you just said that we want the abstraction penalty to be negative ("the ratio to be at most one"), which strikes me as unlikely, especially if the C compiler takes advantage of restrict.

Second, it is impractical for us to write C code for comparison with the POOMA kernels running in parallel, or at least to do so for very many kernels. Thus we also need to look at scaling and other measurements of parallel performance.

> We need to resolve timing granularity problems and which time
> measurements to make, e.g., wall-clock or CPU time.

This is really only an issue with the Benchmark class or other experiments that simply amount to timing the execution of an entire run. The profiling tools shouldn't have this problem.

>
>
> Infrastructure:
>
> We should establish daily builds and benchmark runs to check that
> running times do not increase while we try to reduce them.  Measuring
> running times on both Irix and Linux is desirable.  We'll use QMTest
> to perform the testing.
>
> Question: Should we post these reports on a daily basis?

Probably not - if we could automate putting them on a website, that would be cool.
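Something along these lines might do it (a hypothetical crontab sketch; the paths and the exact QMTest invocation are guesses, not a working recipe):

```shell
# Hypothetical crontab entry: nightly build and QMTest run at 02:30,
# with the logs copied somewhere the web server can see them.
30 2 * * * cd $HOME/pooma && make -k > build.log 2>&1 && \
    qmtest run > nightly-results.txt 2>&1 ; \
    cp build.log nightly-results.txt /var/www/pooma-nightly/
```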

>
> We should use KCC.  Some preliminary performance measurements indicate
> that code compiled with KCC and with gcc differs in speed.  Tools to
> profile the code include
> Linux's gprof (instructions available in info pages and
> http://sources.redhat.com/binutils/docs-2.10/gprof.html) and Irix's
> ssrun(1) and prof(1).
>
> Question: Are there other profiling tools?

When I tried gprof with KCC, it crashed (gprof did, that is). I haven't yet looked at Gaby's notes to figure out if I did something wrong, or if we have a different configuration.

At any rate, gprof is OK for serial benchmarking, which is where we want to start, but we need something else when we start benchmarking in parallel. The tool that we've used before is called Tau. I think there are some links to it on the acl web site. I've never used it on linux, so we'll have to check that out. I believe this is supposed to work either with threads or with message passing, but currently it doesn't handle both. But neither does POOMA at this point, so that's OK.

> Work:
>
> Scott Haney suggests first speeding Array execution since (New)Fields
> use Arrays.  A good initial step would be checking that the KCC
> optimizer produces code similar to C code without Pooma abstraction
> overheads.

I've run ABCTest with KCC and, not too surprisingly, there is an observed abstraction penalty: the C code got about 45 MFlops for large arrays, and the POOMA (Brick) code got about 30. Given that these asymptotic results should be measuring memory speed, this is probably because the C loops are being jammed (or some similar transformation), enabling load-store optimizations that the optimizer can't perform on the POOMA code, since POOMA does not inline everything. I haven't yet looked at the C output from KCC, which can be a pain to decipher. (There used to be a product called "cback" designed to clean up the output from CFRONT; I wonder if it still exists.)

> First, we can compare C-style arrays with Brick Array
> engines on uniprocessor machines.  Then, we work with multipatch Array
> engines, trying to reduce the overhead of having multiple patches.
> Trying various patch sizes on a uniprocessor machine will demonstrate
> the overhead of having multipatches.  We'll defer threaded and
> multi-processor execution to later.

The various benchmarks and the Benchmark class were designed with these sorts of tests in mind.

>
> Stephen will soon post a list of Array benchmarks, what they test, and
> what they do not test.  We can write additional programs to fill any
> deficiencies in our testing.  Each individual researcher can speed a
> benchmark's execution.
>
> Work on the NewField should be delayed until Stephen Smith and I merge
> our work into the mainline.  Currently, there is one benchmark program
> benchmarks/Doof2d that uses NewField.h.  We will also have the Lee et
> al. stratigraphic flow code.  Are these sufficient for testing?  If
> not, should we write more test cases?  Will we want to finish the
> Caramana et al. hydrodynamics program?
>
> Question: Who besides Jeffrey has access to a multi-processor computer
> with more than a handful of processors?


I've got an account on chi now, and should be able to get back onto nirvana without too much hassle (I hope).

>
> Question: Do we need to check for memory leaks?  Bluemountain has
> purify, which should reveal leaks.  Perhaps we can modify the QMTest
> scripts to ease checking.

This isn't a performance issue, but we definitely want to put purify in our test suite.

>
> Procedure for Modifying Pooma Code:
>
> Even though we'll probably work on a separate development branch, we
> need to ensure that the Pooma code compiles at all times to permit
> multiple programmers to work on the same code.  Before committing a
> code change,
>
> 1. Make sure the Pooma library compiles with the change.  Also check
>    that associated executables still run.
> 2. Obtain patch approval from at least one other person.
> 3. Commit the patch.
> 4. Send email to address@hidden, listing
>   a. the changes and an explanation,
>   b. the test platform, and
>   c. the patch approver.

I never do step 4 - given that all this information should be in the CVS checkin message and that we have a CVS mailing list, why make a redundant post to pooma-dev?

>
> To Do List:
>
> o Complete this list.
> o Add this list to the Pooma CVS tree for easy sharing and
>   modification.
> o Describe the existing benchmarks.   Stephen
> o Determine what execution tasks are not covered by existing
> code. Stephen
> o Determine interesting benchmarks using Arrays.
>     Stephen recommends starting with benchmarks/Doof2dUMP.    Gaby?
> o Establish nightly Pooma builds for Linux and Irix, producing summary
>   reports.  Jeffrey
> o Ensure Pooma compiles with the threads package.     Jim?

I can work on SMARTS.

I think it is important that we get Tau up and working with POOMA on the platforms that we'll be profiling on. This may not be a small task.

  Jim

