Faster Array transpose


From: David Bateman
Subject: Faster Array transpose
Date: Thu, 01 May 2008 00:08:25 +0200
User-agent: Thunderbird 2.0.0.12 (X11/20080306)

The array transpose in Octave uses an extremely simplistic method:

template <class T>
Array<T>
Array<T>::transpose (void) const
{
  assert (ndims () == 2);

  octave_idx_type nr = dim1 ();
  octave_idx_type nc = dim2 ();

  if (nr > 1 && nc > 1)
    {
      Array<T> result (dim_vector (nc, nr));

      for (octave_idx_type j = 0; j < nc; j++)
        for (octave_idx_type i = 0; i < nr; i++)
          result.xelem (j, i) = xelem (i, j);

      return result;
    }
  else
    {
      // Fast transpose for vectors and empty matrices
      return Array<T> (*this, dim_vector (nc, nr));
    }
}

This has poor performance due to the number of cache misses: the reads
are sequential, but successive writes are a whole column apart in the
result, so nearly every write touches a different cache line. I know
there were discussions about flagging a matrix as transposed rather
than actually transposing it, so that the flag could simply be handed
on to the BLAS functions, but the overhead was felt to be too high.
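To make that alternative concrete, a minimal sketch of such a
flag-based scheme might look like the following (FlaggedMatrix and
blas_trans are hypothetical names for illustration, not anything that
exists in Octave):

// Hypothetical sketch: the matrix carries a "transposed" flag, so that
// A.' costs O(1) and the flag is simply forwarded as the TRANSA/TRANSB
// character argument of the BLAS xGEMM routines.
struct FlaggedMatrix
{
  const double *data;          // column-major storage, never moved
  octave_idx_type rows, cols;  // dimensions of the stored data
  bool transposed;

  FlaggedMatrix transpose (void) const
  {
    FlaggedMatrix t = { data, rows, cols, ! transposed };
    return t;
  }

  char blas_trans (void) const { return transposed ? 'T' : 'N'; }
};

// A product op(A) * op(B) would then call
//   dgemm (A.blas_trans (), B.blas_trans (), ...)
// with no data movement; the price is that every consumer of the
// matrix must check the flag, which is the overhead referred to above.

If we can't do that, can we at least use a method like: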

template <class T>
Array<T>
Array<T>::transpose (void) const
{
  assert (ndims () == 2);

  octave_idx_type nr = dim1 ();
  octave_idx_type nc = dim2 ();

  if (nr > 1 && nc > 1)
    {
      Array<T> result (dim_vector (nc, nr));

      // Blocked transpose to attempt to avoid cache misses.
#define TRANSN 8

      // Don't use OCTAVE_LOCAL_BUFFER here as it doesn't work with bool
      // on some compilers.
      Array<T> tmp (TRANSN * TRANSN);
      T *buf = tmp.fortran_vec ();

      octave_idx_type ii = 0, jj;
      for (jj = 0; jj < (nc - TRANSN + 1); jj += TRANSN)
        {
          for (ii = 0; ii < (nr - TRANSN + 1); ii += TRANSN)
            {
              // Copy the tile to the buffer, column by column.
              for (octave_idx_type j = jj, k = 0; j < jj + TRANSN; j++)
                for (octave_idx_type i = ii; i < ii + TRANSN; i++)
                  buf[k++] = xelem (i, j);

              // Copy the buffer to the result, row by row.  Note that
              // k must start at i - ii, not 0, so that buf[k] picks up
              // the element that was stored from row i above.
              for (octave_idx_type i = ii; i < ii + TRANSN; i++)
                for (octave_idx_type j = jj, k = i - ii; j < jj + TRANSN;
                     j++, k += TRANSN)
                  result.xelem (j, i) = buf[k];
            }

          // Leftover rows of this block of columns.
          if (ii < nr)
            for (octave_idx_type j = jj; j < jj + TRANSN; j++)
              for (octave_idx_type i = ii; i < nr; i++)
                result.xelem (j, i) = xelem (i, j);
        }

      // Leftover columns need all of their rows copied, so i must
      // start from 0 here, not from the leftover value of ii.
      for (octave_idx_type j = jj; j < nc; j++)
        for (octave_idx_type i = 0; i < nr; i++)
          result.xelem (j, i) = xelem (i, j);

#undef TRANSN

      return result;
    }
  else
    {
      // Fast transpose for vectors and empty matrices
      return Array<T> (*this, dim_vector (nc, nr));
    }
}

That copies each tile to a small buffer and then recopies it into the
result matrix. This reduces the number of cache misses roughly by a
factor of 2/TRANSN, at the cost of an additional copy.
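To see where that factor comes from, here is a back-of-the-envelope
estimate, assuming 8-byte doubles and 64-byte cache lines (8 elements
per line):

  naive:   the reads are sequential (about 1 miss per 8 elements), but
           successive writes are a whole column apart in the result, so
           nearly every write misses: roughly 1 miss per element.

  blocked: an 8x8 tile touches about 8 source lines and 8 destination
           lines while the buffer stays in cache, so roughly 2*TRANSN
           misses per TRANSN^2 elements, i.e. 2/TRANSN per element.

For TRANSN = 8 and using the script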

N = [128, 129, 1024, 1025, 2048, 2049, 4096, 4097];
nruns = 10;

t = zeros (1, length (N));
for i = 1:length (N)
  A = randn (N(i), N(i));

  for j = 1:nruns
    t0 = cputime ();
    B = A.';
    t(i) = t(i) + cputime () - t0;
  endfor
  t(i) = t(i) / nruns;

  printf ("N = %4d, time = %g sec\n", N(i), t(i));
  fflush (stdout);
endfor
 
I see the times

N =  128, time = 0.0029999 sec
N =  129, time = 0.0003333 sec
N = 1024, time = 0.0279981 sec
N = 1025, time = 0.0219985 sec
N = 2048, time = 0.234985 sec
N = 2049, time = 0.0796615 sec
N = 4096, time = 0.957271 sec
N = 4097, time = 0.41564 sec

before this change and

N =  128, time = 0.0026665 sec
N =  129, time = 0.0003333 sec
N = 1024, time = 0.010666 sec
N = 1025, time = 0.0073329 sec
N = 2048, time = 0.0849944 sec
N = 2049, time = 0.0563296 sec
N = 4096, time = 0.339311 sec
N = 4097, time = 0.301647 sec

after. So, at least on my machine, such a change seems to be a
consistent win for the performance of the transpose. I made no attempt
to really optimize the above code, and perhaps larger values of TRANSN
will give even better results. We might also reference the data
directly rather than access it through the xelem method, for further
speed, since part of the index calculation can then be done in advance.
Something similar might also be done for the hermitian method.
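
For instance, a sketch of what the direct-access version of the tile
copy might look like (untested; it assumes the same TRANSN as above,
with v = data () and r = result.fortran_vec ()):

// Sketch only: v and r are the base pointers, nr and nc the source
// dimensions, and (ii, jj) the top-left corner of a full
// TRANSN x TRANSN tile.
template <class T>
static void
copy_tile (const T *v, T *r, T *buf,
           octave_idx_type nr, octave_idx_type nc,
           octave_idx_type ii, octave_idx_type jj)
{
  // Fill the buffer column by column; idxj carries j * nr so the
  // inner loop needs no multiplications.
  for (octave_idx_type j = jj, k = 0, idxj = jj * nr;
       j < jj + TRANSN; j++, idxj += nr)
    for (octave_idx_type i = ii; i < ii + TRANSN; i++)
      buf[k++] = v[i + idxj];

  // Drain the buffer row by row; idxi carries i * nc.
  for (octave_idx_type i = ii, idxi = ii * nc; i < ii + TRANSN;
       i++, idxi += nc)
    for (octave_idx_type j = jj, k = i - ii; j < jj + TRANSN;
         j++, k += TRANSN)
      r[j + idxi] = buf[k];
}

A blocked hermitian would then be the same loops with the buffer fill
conjugated, i.e. buf[k++] = std::conj (v[i + idxj]).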

Regards
David


