From: Juergen Sauermann
Subject: Re: [Bug-apl] 80 core performance results
Date: Sat, 23 Aug 2014 13:02:39 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.0
Hi Elias,

I believe the gain from coalescing functions (or slicing the values involved) is somewhat limited and occurs only when the APL values are small. For large values, computing one function after the other has better cache locality. Coalescing also has a price: the runtime parser gets slower (you cannot always detect these sequences at ⎕FX time, and error reporting becomes dubious), and the scheme only works for scalar-like functions, so the number of functions in these sequences is small.

What slicing saves is essentially the memory allocation for the intermediate results and the fork/sync times for the intermediate functions. Memory allocation should already be reasonably fast, so there is little to gain there. And the fork/sync times, which are currently the biggest problem, need to go down significantly anyway (otherwise we can forget about parallel APL). The gain on fork/sync times is O(log(core-count) × (coalescing-length − 1)), which is not much. The penalty of non-localization, on the other hand, can be huge. For example:

      ]PSTAT 13
╔════════════════════════════════════════════════════════════════════════╗
║                    Performance Statistics (CPU cycles)                 ║
╠══════════╦══════════════════════════════╦══════════════════════════════╣
║          ║          first pass          ║      subsequent passes       ║
║ Function ╟──────────┬──────────┬────────╫──────────┬──────────┬────────╢
║          ║    N     │    μ     │ σ÷μ %  ║    N     │    μ     │ σ÷μ %  ║
╠══════════╬══════════╪══════════╪════════╬══════════╪══════════╪════════╣
║ A + B    ║        3 │      511 │   49 % ║     6047 │       39 │   69 % ║
╚══════════╩══════════╧══════════╧════════╩══════════╧══════════╧════════╝

The above shows 3 runs of A+B with integer, real, and complex data. The left column shows the (average) number of cycles for the first ravel element of each vector, while the right column shows the subsequent ravel elements. That is, in 1 2 3 4 5 + 1 2 3 4 5, the left column shows the average time for 1+1 while the right column shows the time for 2+2, 3+3, 4+4, and 5+5.
This pattern is typical for all scalar functions, and my best explanation for it is (instruction-) caching. The risk with coalescing is that if a function has a large instruction footprint, it could kick other functions out of the cache, so that we would get the first-pass cycle count (511 above) on every pass instead of only on the first, rather than the 39 above. I am planning to add more performance counters over time so that we have a more solid basis for this kind of discussion.

/// Jürgen

On 08/22/2014 06:24 PM, Elias Mårtenson wrote: