octave-maintainers

Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?


From: Daniel J Sebald
Subject: Re: Replace OCTAVE_LOCAL_BUFFER implementation with std::unique_ptr?
Date: Fri, 21 Jul 2017 11:34:47 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.7.0

On 07/21/2017 10:43 AM, Daniel J Sebald wrote:
On 07/21/2017 10:17 AM, Rik wrote:
On 07/21/2017 01:24 AM, Daniel J Sebald wrote:
On 07/20/2017 11:31 PM, Rik wrote:
On 07/20/2017 04:39 PM, John W. Eaton wrote:
[snip]
But this change should only be made in cases where the size really is a constant.

There were only a few instances of static buffers which I changed to use local variables in this cset (http://hg.savannah.gnu.org/hgweb/octave/rev/cda0614beaec).

Rik, at the bottom of this changeset is

+      octave_idx_type ii = 0;
+      octave_idx_type jj;
       for (jj = 0; jj < (nc - 8 + 1); jj += 8)
         {
           for (ii = 0; ii < (nr - 8 + 1); ii += 8)

[snip]

Just curious about the use of a function pointer and call, and how efficient that is.  For example, complex conjugation can probably be done in just a few instruction cycles, whereas repeatedly pushing a variable onto the stack and jumping to a function can consume much more.  Or does the template construct deal with all this, i.e., is the T *fcn() treated more like an inline?

There are two different cases here,

  if (nr >= 8 && nc >= 8)

Wouldn't the former case still work if nr or nc were less than 8?

That's an interesting motivation, to avoid cache jumping.  Wouldn't this
idea still apply if say, nc = 3 and nr = 1000000?

This 8x8 construct has got me thinking a bit.  It's an interesting idea, very parallel-processing oriented in a way, but I'm curious how efficient it is.  There is still big spacing in this loop

              for (octave_idx_type j = jj, k = 0, idxj = jj * nr;
                   j < jj + 8; j++, idxj += nr)

if nr is extremely large. Isn't it normally the case that memory is packed along one dimension versus the other, e.g.,

matrix-index : memory-index
(0, 0)       :       0
(1, 0)       :       1
(2, 0)       :       2
...
(999999, 0)  :  999999
(0, 1)       : 1000000
(1, 1)       : 1000001
(2, 1)       : 1000002
...
etc.

In other words, we want to make sure the inner index always runs in the direction in which memory is packed contiguously.  So, rather than a block that spans 8 rows and 8 columns, don't we want a block that spans 64 rows (or fewer) and just one column?  That way we are more likely to stay within cache memory.  Am I thinking about this correctly?

Another thought is that rather than placing that function operation within the loop (if that fcn is in fact compiled as a stack/jump/return sequence):

                  result.xelem (j + idxi) = fcn (buf[k]);

maybe it would make more sense to first copy the matrix as a transpose, and then call a function taking NR and NC as arguments so that the called function can do a simple double loop through the whole array with a register-based equivalent of fcn, whether that is complex conjugation or whatnot.

Dan


