From: Juergen Sauermann
Subject: Re: [Bug-apl] 80 core performance results
Date: Sat, 23 Aug 2014 13:02:39 +0200
User-agent: Mozilla/5.0 (X11; Linux i686; rv:31.0) Gecko/20100101 Thunderbird/31.0
Hi Elias,

I believe the gain from coalescing functions (or slicing the values involved) is somewhat limited and occurs only when the APL values are small. For large values, computing one function after the other has better cache locality. Coalescing also has a price: the runtime parser gets slower (you cannot always detect these sequences at ⎕FX time, and error reporting becomes dubious), and the scheme only works for scalar-like functions, so the number of functions in these sequences is small.

What slicing saves is essentially the memory allocation for the intermediate results and the fork/sync times for the intermediate functions. Memory allocation should already be reasonably fast, so there is little to gain there. And the fork/sync times, which are currently the biggest problem, need to go down significantly anyway (otherwise we can forget about parallel APL). The gain on fork/sync times is O(log(core-count) × (coalescing-length − 1)), which is not much. The penalty of non-localization, on the other hand, can be huge. For example:

      ]PSTAT 13
╔════════════════════════════════════════════════════════════════════════╗
║                    Performance Statistics (CPU cycles)                 ║
╠══════════╦══════════════════════════════╦══════════════════════════════╣
║          ║          first pass          ║      subsequent passes       ║
║ Function ╟──────────┬──────────┬────────╫──────────┬──────────┬────────╢
║          ║    N     │    μ     │ σ÷μ %  ║    N     │    μ     │ σ÷μ %  ║
╠══════════╬══════════╪══════════╪════════╬══════════╪══════════╪════════╣
║ A + B    ║        3 │      511 │   49 % ║     6047 │       39 │   69 % ║
╚══════════╩══════════╧══════════╧════════╩══════════╧══════════╧════════╝

The above shows 3 runs of A+B with integer, real, and complex data. The left column shows the (average) number of cycles for the first ravel element of each vector, while the right column shows the subsequent ravel elements. That is, in 1 2 3 4 5 + 1 2 3 4 5, the left column shows the average time for 1+1 while the right column shows the time for 2+2, 3+3, 4+4, and 5+5.
This pattern is typical for all scalar functions, and my best explanation for it is (instruction-) caching. The risk with coalescing is that if a function has a large instruction footprint, it could kick other functions out of the cache, so that we would get the first-pass cycle count (511 above) on every pass instead of only on the first, rather than the 39 above. I am planning to add more performance counters over time so that we have a more solid basis for this kind of discussion.

/// Jürgen

On 08/22/2014 06:24 PM, Elias Mårtenson wrote: