From: Juergen Sauermann
Subject: Re: [Bug-apl] Performance optimisations: Results
Date: Wed, 02 Apr 2014 14:12:45 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130330 Thunderbird/17.0.5
Hi,
the output is meant to be gnuplotted. You either copy-and-paste the data
lines into a file, or run apl > file (in that case you have to type
blindly and remove the non-data lines with an editor).

The first 256 data lines are the value of the CPU's cycle counter at the
beginning of every 4096th iteration of the loop. Looking at the result:

    0, 168
    1, 344610
    2, 673064
    3, 994497

and at the code:

    int64_t T0 = cycle_counter();
    for (c = 0; c < count; c++)
       {
         if ((c & 0x0FFF) == 0)   Tn[c >> 12] = cycle_counter();
         const Cell * cell_A = &A->get_ravel(c);
         ...
       }
    int64_t TX = cycle_counter();

we see that the loop begins at 0 cycles (actually at T0, but T0 is
subtracted from Tn when printed, so that time 0 is virtually the start
of the loop). At cycle 168 we are at the first line of the loop. At
cycle 344610 we have performed 4096 iterations of the loop, at cycle
673064 we have performed 4096 more iterations, at cycle 994497 another
4096 iterations, and so on. The last value is the cycle counter after
the loop (so the joining of the threads is included).

In file parallel, the first half of the data is supposedly the
timestamps written by one thread and the other half the timestamps
written by the other thread. On an 8-core machine this should look like:

      /|  /|  /|  /|  /|  /|  /|  /
     / | / | / | / | / | / | / | /
    /  |/  |/  |/  |/  |/  |/  |/

The interesting times are:

- T0 (showing roughly the start-up overhead of the loop), and
- the difference between the last two values compared with the average
  difference between two values (showing the joining overhead of the
  loop), and
- the last value (the total execution time of the loop).

Comparing files sequential and parallel we see that the same loop costs
81834662 cycles when run on one core and 43046192 cycles when run on two
cores. This is 2128861 cycles away from speedup 2 (43046192 minus half
of 81834662). The difference between two consecutive values is around
322500 (for 4096 iterations), which is about 79 cycles for one
iteration. Thus the break-even point where parallel is faster is at
vector length 26947 (2128861 divided by 79); the break-even is that high
because integer addition is about the fastest operation on a CPU, so
there is very little work per element to parallelize.

The real code had a call to

    expand_pointers(cell_Z, cell_A, cell_B, fun);

instead of:

    (cell_B->*fun)(cell_Z, cell_A);

so the break-even point will go down a little.

/// Jürgen

On 04/02/2014 12:43 PM, Elias Mårtenson wrote:
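
cycle_counter() itself is not shown in the snippet above. A minimal
sketch of what such a function could look like on x86-64, assuming the
RDTSC instruction is available (the actual implementation in GNU APL may
well differ):

    #include <stdint.h>

    // read the CPU's time-stamp counter (cycles since reset);
    // RDTSC returns the low 32 bits in EAX and the high 32 bits in EDX
    static inline int64_t cycle_counter()
    {
       uint32_t lo, hi;
       asm volatile ("rdtsc" : "=a" (lo), "=d" (hi));
       return ((int64_t)hi << 32) | lo;
    }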
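
The break-even arithmetic above, spelled out as a small stand-alone
sketch (the constants are the measured values quoted in the mail; the
variable names are made up for illustration):

    #include <cstdio>
    #include <cstdint>

    int main()
    {
       const int64_t seq_total = 81834662;  // loop total on one core  (file sequential)
       const int64_t par_total = 43046192;  // loop total on two cores (file parallel)
       const int64_t per_iter  = 79;        // ~322500 cycles per 4096 iterations

       // cycles lost to start-up and joining, relative to an ideal speedup of 2
       const int64_t overhead = par_total - seq_total/2;   // = 2128861

       // vector length at which the parallel loop starts to pay off
       const int64_t break_even = overhead / per_iter;     // = 26947

       printf("overhead   = %lld cycles\n",   (long long)overhead);
       printf("break-even = %lld elements\n", (long long)break_even);
       return 0;
    }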