octave-maintainers

Re: bin2dec and cellfun improvements


From: Daniel J Sebald
Subject: Re: bin2dec and cellfun improvements
Date: Sun, 18 Mar 2012 15:33:41 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16

Thank you, Rik.  Very keen observations on CPU consumption.  Comments below.

On 03/18/2012 01:25 PM, Rik wrote:
3/18/12

Daniel,

Here is some extra background.  From the NEWS file describing the changes
between 3.4.3 and 3.6.0

  ** All .m string functions have been modified for better performance or
     greater Matlab compatibility.  Performance gains of 15X-30X have
     been demonstrated.  Operations on cell array of strings no longer pay
     quite as high a penalty as those on 2-D character arrays.

I was the author of those changes.  I did extensive benchmarking to tweak
crossover points between using different algorithms and to choose an
initial algorithm.  Most of the functions were coded to work with ordinary
strings and cell array of strings.  The procedure for handling cell array
of strings used to be to recursively call the string function using
cellfun.  A typical function layout would be:

function my_string_func (s)

   if (ischar (s))
     ... handle ordinary single string or char matrix ...
   elseif (iscellstr (s))
     cellfun (@my_string_func, s)
   else
     print_usage ()
   endif

endfunction

Although quite straightforward, this method always delivered poor
performance.  In practice, it turns out to be much faster to convert a
cellstring array to a char matrix, manipulate that, and then convert back
to a cellstring array if necessary.  The functions in Octave's core are
coded for performance first and readability second.  We expect that core
functions will be in the tightest inner loops of users' code, and therefore
the trade-off of a bit of obtuse coding for 50%, or even an order of
magnitude, performance gain is justified.
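For concreteness, a minimal sketch of that conversion pattern (hypothetical
function, with toupper standing in for the real per-string work):

function y = my_fast_string_func (s)
  if (ischar (s))
    y = toupper (s);
  elseif (iscellstr (s))
    ## One char matrix, one vectorized pass, then back to a cellstr.
    ## (Note: the char/cellstr round trip pads and then strips
    ## trailing blanks.)
    y = cellstr (toupper (char (s)));
  else
    print_usage ();
  endif
endfunction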

Certainly true in this case. But I think the amount of code you have used to do this, although reasonably small for any fluent script writer, suggests that a built-in function might be in order for base2dec.


The cellfun function is just somewhat slow.  Here is some code that I have
used for benchmarking the related function arrayfun.

----- Code Segment -----

## Declare size of tests
N = 5;
bm = zeros (N, 1);
x = rand (1e6, 1);   # test data (size assumed; not given in the original post)

for i = 1:N
   ## Clear variables for test and pre-assign array
   clear y;
   y = zeros (size (x));

   tic;
   ##################################################
   ## for loop
#  for j = 1:numel (x)
#    y(j) = tan (x(j));
#  endfor

   ## Simple arrayfun
#  y = arrayfun (@tan, x);

   ## Homebrew indexing
#  y = sin (x) ./ cos (x);

   ## Builtin function
#  y = tan (x);

   ##################################################
   bm(i) = toc;

endfor

----- End Code Segment -----

And here are the results normalized to the builtin function tan

tan ()   : 1
homebrew : 1.3548
arrayfun : 14.483
for loop : 149.77

As you can see, looping is 2 orders of magnitude slower and is really
abysmal.  But that is expected.  On the other hand, using arrayfun is still
an order of magnitude slower than calling the function directly.  Similar
results obtain for cellfun.

The homebrew result is due simply to the extra cost of computing sin and cos individually (trig functions aren't trivial). There is really no indexing issue associated with that.

But I think you've identified the limitations of arrayfun (and cellfun). My feeling is that there shouldn't be such a performance loss associated with them. In 3.2.4, arrayfun is a script that uses cellfun; is that still the case (I don't see arrayfun.m in the repository)? Anyway, whether it is arrayfun or cellfun, internally the code should be able to identify that it can use methods that are just as fast as tan(x). It might make sense for arrayfun to be a separate entity from cellfun.
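A quick way to check which it is now (hypothetical snippet; exist() returns 2 for an m-file and 5 for a built-in function):

exist ("arrayfun")
which arrayfun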


2) Internally, when a cell array is created, is there some flag or
descriptor indicating that the array has a consistent class across the
whole array?  That is, "all_same_class" would indicate the cell array is
all "char" or all "double", etc.  As an example, the cellstr() routine
could, rather than set each cell's class, set the "all_same_class"
variable and copy the contents of the string matrix.  Not much extra
work, almost the same as a copy.

As far as I know, there is no such flag.  However, this would be only
another reason why cellfun is slow and it doesn't explain why arrayfun is
also slow.  With arrayfun, one can guarantee that all of the inputs will be
of the same type, and yet it is still slow.

Yes. In the arrayfun case, inherently they are all similar. In the cellfun case it would be nice to have some internal variable indicating the cells are all the same, and I believe it might exist but isn't being utilized to its full potential. Here is a test:

cpustart = cputime(); x = repmat('abcd', 1e7, 1); cputime() - cpustart
cpustart = cputime(); y = cellstr(x); cputime() - cpustart
cpustart = cputime(); iscellstr(y)
cputime() - cpustart
cpustart = cputime(); iscellstr(y)
cputime() - cpustart

Please run the above on your machine.  I'm getting:

octave:8> cpustart = cputime(); x = repmat('abcd', 1e7, 1); cputime() - cpustart
ans =  0.035994
octave:9> cpustart = cputime(); y = cellstr(x); cputime() - cpustart
ans =  11.118
octave:10> cpustart = cputime(); iscellstr(y)
ans =  1
octave:11> cputime() - cpustart
ans =  5.4212
octave:12> cpustart = cputime(); iscellstr(y)
ans =  1
octave:13> cputime() - cpustart
ans =  0.0020000

So, some observations here:

1) We know y is a cellstr when it is assigned, because that is inherent to the cellstr command (a built-in function).

2) cellstr seems to be way slower than it need be. Again, I don't know the innards, but my gut feeling is that it should be doable much more quickly. I suppose each cell has to have its own memory allocation for potential upcoming modifications to individual cells. C memory allocation isn't flexible (one can only malloc and free big hunks of memory). Does C++ memory allocation bring anything to the table? That is, would using "new" on an object of size, in this case, 1e7 allow one to inherently compute the pointers to individual items, treat them independently, and then reassign memory? (I'm guessing perhaps not, because it would be like making Swiss cheese out of the memory that the core OS gave up.)

3) The second call to iscellstr() is extremely fast. Therefore, I conclude that there is in fact already an internal variable associated with a cell variable indicating that the cells are all the same, and it gets set as a result of the first call to iscellstr(). That's good, because otherwise could you imagine the time consumed by multiple function calls, each of which does an iscellstr()?

4) Given 3, and the observation in 1, the first call to iscellstr() above should in fact be just as fast as the second call.


My guess, and I'm just speculating at this point, is that the issue lies in
the handle that is passed to arrayfun or cellfun.  This is a polymorphic
object that has to do a lot of things.  For example, if it points to an
m-file the m-file timestamp needs to be checked and the function might need
to be compiled into an internal form.  If we could pass the data directly
to a function method I think we would see better results.

Yes, there are a number of basic, internal functions that could benefit from this approach. And often those basic internal functions do a lot of what the user is looking for.
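As a crude way to see that dispatch overhead in isolation, one could time a built-in called directly versus through a handle (hypothetical micro-benchmark; absolute numbers will vary):

x = zeros (1, 100);
f = @isempty;

tic;
for k = 1:1e5
  isempty (x);     # direct call to the built-in
endfor
t_direct = toc

tic;
for k = 1:1e5
  f (x);           # same built-in called through a function handle
endfor
t_handle = toc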


This seems to be
confirmed by certain operations in cellfun which do directly call C++
functions.

Yes, but maybe there is still some benefit to be had internally if one knows all the cells are of the same class.


  In Octave-3.4.3 these were recognized by passing in the string
name of the accelerated function.  Passing in a function handle would not
engage the acceleration.  Thus, in 3.4.3

c = cell (1e6, 1);
c{1000} = [1 2 3];
tic; y = cellfun ("isempty", c); toc
Elapsed time is 0.01 seconds.
tic; y = cellfun (@isempty, c); toc
Elapsed time is 1 seconds.

So two orders of magnitude are lost to function checking that probably
should be done only once, at the start of the loop.

Right. I assume that isempty (being a built-in function) has a method that handles cells and is very fast. (If it is built-in, there is no timestamp checking, right?)

Seeing as cellfun is also a built-in, I would think it should be able to recognize in this case, given that isempty is also built-in, that it can default to the faster isempty method call that handles cells.


There is a hierarchy of improvements when optimizing.  The largest change
comes from choosing a better algorithm, say replacing a bubble sort with a
quick sort, which can yield 1-3 orders of magnitude improvement.  The next
biggest improvement, in Octave, is to do away with looping structures and
use vectorization.  This will typically get you 1 order of magnitude.  The
next improvement is to change an m-file to a C++ compiled version of the
function.  This might get you a 2-3X improvement.
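As a familiar illustration of the vectorization step (hypothetical
example: a running sum computed with a loop versus the cumsum built-in):

x = rand (1e6, 1);

## Loop version
tic;
s = zeros (size (x));
s(1) = x(1);
for k = 2:numel (x)
  s(k) = s(k-1) + x(k);
endfor
toc

## Vectorized built-in
tic;
s2 = cumsum (x);
toc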

So, base2dec could be recoded into C++.  My benchmarking was showing 1
million 10-digit conversions taking 1.008 seconds.  This could probably be
reduced to 0.5 seconds through recoding in C++.  This seems a small enough
benefit that a programmer would be better off to concentrate on other areas
of the code such as bug fixes or areas where orders of magnitude
improvement are possible.

I understand what you are saying, but don't necessarily agree in this case for several reasons:

1) base2dec is a rather fundamental routine, so there is some justification for doing it as a built-in. If all the others (bin2dec, etc.) can utilize that one efficient routine, even more reason to do so.

2) Internally, stuff like (2.^[length(s)-1:-1:0])' with all its memory allocation and so on is gone. base2dec done as a built-in may be a very simple routine, requiring little maintenance and debugging.

3) There is no padding of any sort (and none of the extra multiplications that come with it) if internally base2dec can C-loop through the characters of the string, accumulating the result, and stop at the end of the string. (See the sketch below.)

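For what it's worth, the accumulation in 3) is just Horner's rule. A hypothetical m-code sketch of what such a C++ loop would compute (digits "0".."9" only, no input validation):

function d = base2dec_loop (s, base)
  ## Walk the string once, accumulating via Horner's rule --
  ## no power vector, no padding, no intermediate arrays.
  d = 0;
  for k = 1:length (s)
    d = d * base + (s(k) - "0");   # assumes digit characters "0".."9"
  endfor
endfunction

For example, base2dec_loop ("1010", 2) gives 10.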

  If you want to go ahead and try, however, no one
will turn down fast, debugged code.  It would be very easy to test a
different implementation by writing your own C++ function and using the
mkoctfile interface.

I'm just trying to brainstorm whether and where there could be some benefit from a little optimization here. The bigger picture of base2dec is that of more efficiently handling strings and cellstrs internally. I claim this is no small improvement.

Dan

