Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?


From: Daniel J Sebald
Subject: Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?
Date: Sun, 23 Jul 2017 16:18:08 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0

On 07/21/2017 11:34 AM, Daniel J Sebald wrote:
On 07/21/2017 10:43 AM, Daniel J Sebald wrote:
On 07/21/2017 10:17 AM, Rik wrote:
On 07/21/2017 01:24 AM, Daniel J Sebald wrote:
On 07/20/2017 11:31 PM, Rik wrote:
On 07/20/2017 04:39 PM, John W. Eaton wrote:
[snip]
But this change should only be made in cases where the size really is a
constant.

There were only a few instances of static buffers which I changed to use
local variables in this cset
(http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).

Rik, at the bottom of this changeset is

+      octave_idx_type ii = 0;
+      octave_idx_type jj;
       for (jj = 0; jj < (nc - 8 + 1); jj += 8)
         {
           for (ii = 0; ii < (nr - 8 + 1); ii += 8)

[snip]

Just curious about the use of a function pointer call and how
efficient that is.  For example, complex conjugation can probably be
done in just a few instruction cycles, whereas repeatedly placing a
variable on the stack and jumping to a function can consume much more.
Or does the template construct deal with all this, i.e., is the
T *fcn() treated more like an inline?
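
To make the question concrete, here's roughly the difference I have in
mind (a minimal sketch with made-up names, not the actual Octave code):

  #include <complex>
  #include <cstddef>

  typedef std::complex<double> Complex;

  // Call through a function pointer: typically an indirect call per element.
  void
  apply_ptr (Complex *dst, const Complex *src, std::size_t n,
             Complex (*fcn) (const Complex&))
  {
    for (std::size_t i = 0; i < n; i++)
      dst[i] = fcn (src[i]);
  }

  // Functor passed as a template parameter: the body of operator () is
  // visible at the call site, so the compiler can inline it.
  template <typename F>
  void
  apply_tmpl (Complex *dst, const Complex *src, std::size_t n, F fcn)
  {
    for (std::size_t i = 0; i < n; i++)
      dst[i] = fcn (src[i]);
  }

  struct conj_op
  {
    Complex operator () (const Complex& z) const { return std::conj (z); }
  };

With apply_tmpl (dst, src, n, conj_op ()) the conjugation typically gets
inlined into the loop body, whereas apply_ptr keeps an indirect call per
element unless the optimizer can see through the pointer.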

There are two different cases here,

  if (nr >= 8 && nc >= 8)

Wouldn't the former case still work if nr or nc were less than 8?

That's an interesting motivation, to avoid cache jumping.  Wouldn't this
idea still apply if, say, nc = 3 and nr = 1000000?

This 8x8 construct has got me thinking a bit.  It's an interesting idea,
very parallel-processing oriented in a way, but I'm curious how
efficient it is.  There is still a big stride in this loop

              for (octave_idx_type j = jj, k = 0, idxj = jj * nr;
                   j < jj + 8; j++, idxj += nr)

if nr is extremely large.  Isn't it normally the case that memory is
packed along one dimension versus the other, e.g.,

matrix-index : memory-index
(0, 0)       :       0
(1, 0)       :       1
(2, 0)       :       2
...
(999999, 0)  :  999999
(0, 1)       : 1000000
(1, 1)       : 1000001
(2, 1)       : 1000002
...
etc.

In other words, we want to make sure we always place the inner
index along the direction in which the memory is packed contiguously.
So, rather than have a block that spans 8 rows and 8 columns, don't we
want a block that spans 64 rows (or fewer) and just one column?  That way
we are more likely to stay within the cache.  Am I thinking about
this correctly?
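
To illustrate (just a sketch, names are mine, not Octave's), the loop
order matters because element (i, j) of a column-major nr-by-nc array
lives at data[i + j*nr]:

  // Contiguous: the inner index moves along the packed dimension.
  void
  scale_colmajor (double *data, long nr, long nc, double s)
  {
    for (long j = 0; j < nc; j++)
      for (long i = 0; i < nr; i++)
        data[i + j*nr] *= s;
  }

  // Strided: each inner step jumps nr elements, so with nr = 1000000
  // nearly every access is likely to be a cache miss.
  void
  scale_rowwise (double *data, long nr, long nc, double s)
  {
    for (long i = 0; i < nr; i++)
      for (long j = 0; j < nc; j++)
        data[i + j*nr] *= s;
  }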

Another thought is that rather than place that function operation within
the loop (if that fcn is in fact compiled as a stack/jump/return sequence):

                  result.xelem (j + idxi) = fcn (buf[k]);

maybe it would make more sense to first copy the matrix as a transpose
and then call a function that takes NR and NC as arguments, so that the
called function can do a simple double loop through the whole array with
a register-based equivalent of fcn, whether that is complex conjugation
or whatnot.
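
Roughly what I mean, as a sketch with made-up names and assuming
column-major storage on both sides:

  #include <complex>

  typedef std::complex<double> Complex;

  // First pass: plain transpose (result is nc-by-nr, column-major).
  // Second pass: one contiguous sweep applying the element operation,
  // which the compiler can keep in registers.
  void
  conj_transpose_two_pass (Complex *result, const Complex *src,
                           long nr, long nc)
  {
    for (long j = 0; j < nc; j++)
      for (long i = 0; i < nr; i++)
        result[j + i*nc] = src[i + j*nr];

    for (long k = 0; k < nr * nc; k++)
      result[k] = std::conj (result[k]);
  }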

I tried a lot of variations on the cache-friendly block construct and didn't find anything faster. For example, rearranging the second loop and incorporating it into the first loop so that no intermediate buffer is used resulted in a slight loss of performance.

However, I did realize that the case of large NR compared to NC takes a big hit because the last partial columns don't get blocking (it's a separate loop), and those last columns are exactly the ones most likely to suffer from cache-jumping. So I've moved those extra loops inside the main two loops.
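
The shape of that change is roughly the following (a sketch of the idea
only, not the actual patch): let the block loops run to the edges and
clamp the block extent with std::min, so the partial blocks go through
the same cache-blocked path:

  #include <algorithm>
  #include <complex>

  typedef std::complex<double> Complex;

  // 8x8 cache blocking where the last, partial blocks are handled by
  // clamping the block bounds instead of separate clean-up loops.
  void
  conj_transpose_blocked (Complex *result, const Complex *src,
                          long nr, long nc)
  {
    const long B = 8;

    for (long jj = 0; jj < nc; jj += B)
      {
        long jend = std::min (jj + B, nc);

        for (long ii = 0; ii < nr; ii += B)
          {
            long iend = std::min (ii + B, nr);

            // Within a block both arrays touch only a small footprint,
            // so the strided accesses are far less likely to miss the cache.
            for (long j = jj; j < jend; j++)
              for (long i = ii; i < iend; i++)
                result[j + i*nc] = std::conj (src[i + j*nr]);
          }
      }
  }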

The result is much more streamlined code and equally fast in all scenarios of NR x NC:

https://savannah.gnu.org/patch/index.php?9415

Dan



