Re: bin2dec and cellfun improvements
From: Daniel J Sebald
Subject: Re: bin2dec and cellfun improvements
Date: Sun, 18 Mar 2012 15:33:41 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16
Thank you, Rik. Very keen observations on CPU consumption. Comments below.
On 03/18/2012 01:25 PM, Rik wrote:
3/18/12
Daniel,
Here is some extra background. From the NEWS file describing the changes
between 3.4.3 and 3.6.0
** All .m string functions have been modified for better performance or
greater Matlab compatibility. Performance gains of 15X-30X have
been demonstrated. Operations on cell array of strings no longer pay
quite as high a penalty as those on 2-D character arrays.
I was the author of those changes. I did extensive benchmarking to tweak
crossover points between using different algorithms and to choose an
initial algorithm. Most of the functions were coded to work with ordinary
strings and cell array of strings. The procedure for handling cell array
of strings used to be to recursively call the string function using
cellfun. A typical function layout would be:
function my_string_func (s)
  if (ischar (s))
    ... handle ordinary single string or char matrix ...
  elseif (iscellstr (s))
    cellfun (@my_string_func, s)
  else
    print_usage ()
  endif
endfunction
Although quite straightforward, this method always delivered poor
performance. In practice, it turns out to be much faster to convert a
cellstring array to a char matrix, manipulate that, and then convert back
to a cellstring array if necessary. The functions in Octave's core are
coded for performance first and readability second. We expect that core
functions will be in the tightest inner loops of users' code, and therefore
the trade-off of a bit of obtuse coding for a 50%, or even an
order-of-magnitude, performance gain is justified.
Certainly true in this case. But I think that the amount of code you
have used to do this, although reasonably small for any fluent script
writer, is an indication that a built-in function might be in order for
base2dec.
The cellfun function is just somewhat slow. Here is some code that I have
used for benchmarking the related function arrayfun.
----- Code Segment -----
## Declare test data and size of tests
x = rand (1e6, 1);   # test data (must be defined before timing)
N = 5;
bm = zeros (N, 1);
for i = 1:N
  ## Clear variables for test and pre-assign array
  clear y;
  y = zeros (size (x));
  tic;
  ##################################################
  ## for loop
  # for j = 1:numel (x)
  #   y(j) = tan (x(j));
  # endfor
  ## simple arrayfun
  # y = arrayfun (@tan, x);
  ## homebrew indexing
  # y = sin (x) ./ cos (x);
  ## builtin function
  y = tan (x);
  ##################################################
  bm(i) = toc;
endfor
----- End Code Segment -----
And here are the results normalized to the builtin function tan
tan () : 1
homebrew : 1.3548
arrayfun : 14.483
for loop : 149.77
As you can see, looping is two orders of magnitude slower and is really
abysmal, but that is expected. On the other hand, using arrayfun is still
an order of magnitude slower than calling the function directly. Similar
results hold for cellfun.
The homebrew result is due simply to the extra cost of computing
sin and cos individually (trig functions aren't trivial). There is
really no indexing issue associated with that.
But I think you've identified the limitations of arrayfun (and cellfun).
My feeling is that there shouldn't be such a performance loss
associated with them. In 3.2.4 arrayfun is a script that uses cellfun;
is that still the case (I don't see arrayfun.m in the repository)?
Anyway, whether it is arrayfun or cellfun, internally the code should
be able to identify that it can use methods that are just as fast as
tan(x). It might make sense for arrayfun to be a separate entity from
cellfun.
2) Internally, when a cell array is created, is there some flag or
descriptor indicating that the array has a consistent class across the
whole array? That is, "all_same_class" would indicate the cell array is all
"char" or all "double", etc. As an example, the cellstr() routine could,
rather than set each cell's class, set the "all_same_class" variable
and copy the contents of the string matrix. Not much extra work, almost the
same as a copy.
As far as I know, there is no such flag. However, this would be only
another reason why cellfun is slow and it doesn't explain why arrayfun is
also slow. With arrayfun, one can guarantee that all of the inputs will be
of the same type, and yet it is still slow.
Yes. In the arrayfun case, inherently they are all similar. In the
cellfun case it would be nice to have some internal variable indicating
the cells are all the same, and I do believe it might exist but isn't
being utilized to full potential. Here is a test:
cpustart = cputime(); x = repmat('abcd', 1e7, 1); cputime() - cpustart
cpustart = cputime(); y = cellstr(x); cputime() - cpustart
cpustart = cputime(); iscellstr(y)
cputime() - cpustart
cpustart = cputime(); iscellstr(y)
cputime() - cpustart
Please run the above on your machine. I'm getting:
octave:8> cpustart = cputime(); x = repmat('abcd', 1e7, 1); cputime() - cpustart
ans = 0.035994
octave:9> cpustart = cputime(); y = cellstr(x); cputime() - cpustart
ans = 11.118
octave:10> cpustart = cputime(); iscellstr(y)
ans = 1
octave:11> cputime() - cpustart
ans = 5.4212
octave:12> cpustart = cputime(); iscellstr(y)
ans = 1
octave:13> cputime() - cpustart
ans = 0.0020000
So, some observations here:
1) We know y is a cellstr when it is assigned because that is inherent
to the cellstr (built-in function) command.
2) cellstr seems to be way slower than it needs to be. Again, I don't know
the innards, but my gut feeling is that it should be doable much more
quickly. I suppose each cell has to have its own memory allocation for
potential upcoming modifications to individual cells. C memory
allocation isn't flexible (one can only malloc and free big hunks of
memory). Does C++ memory allocation bring anything to the table? That
is, does using "new" on an object of size, in this case, 1e7 allow one to
inherently compute the pointers to individual items, treat them
independently, and then reassign memory? (I'm guessing perhaps not,
because it would be like creating Swiss cheese out of the memory that the
core OS gave up.)
3) The second call to iscellstr() is extremely fast. Therefore, I
conclude that there is in fact already an internal variable associated
with a cell variable indicating that the cells are all the same, and it
gets set as a result of the first call to iscellstr(). That's good,
because otherwise could you imagine the time consumed by multiple
function calls, each of which does an iscellstr()?
4) Given 3, and the observation in 1, the first call to iscellstr()
above should in fact be equally as fast as the second call.
My guess, and I'm just speculating at this point, is that the issue lies in
the handle that is passed to arrayfun or cellfun. This is a polymorphic
object that has to do a lot of things. For example, if it points to an
m-file the m-file timestamp needs to be checked and the function might need
to be compiled into an internal form. If we could pass the data directly
to a function method I think we would see better results.
Yes, there are a number of basic, internal functions that could benefit
from this approach. And often those basic internal functions do a lot
of what the user is looking for.
This seems to be
confirmed by certain operations in cellfun which do directly call C++
functions.
Yes, but maybe there is benefit to be had still internally if one knows
all the cells are of the same class.
In Octave-3.4.3 these were recognized by passing in the string
name of the accelerated function. Passing in a function handle would not
engage the acceleration. Thus, in 3.4.3
c = cell (1e6, 1);
c(1000) = [1 2 3];
tic; y = cellfun ("isempty", c); toc
Elapsed time is 0.01 seconds.
tic; y = cellfun (@isempty, c); toc
Elapsed time is 1 seconds.
So two orders of magnitude are lost to function checking that probably
should be done only once, at the start of the loop.
Right. I assume that isempty (being a built-in function) has a method
that handles cells and is very fast. (If it is built-in, there is no
timestamp checking, right?)
Seeing as cellfun is also built-in, I would think it should be able to
recognize, in this case, that it can default to the faster isempty
method call that handles cells.
There is a hierarchy of improvements when optimizing. The largest change
comes from choosing a better algorithm, say replacing a bubble sort with a
quick sort, which can yield 1-3 orders of magnitude improvement. The next
biggest improvement, in Octave, is to do away with looping structures and
use vectorization. This will typically get you 1 order of magnitude. The
next improvement is to change an m-file to a C++ compiled version of the
function. This might get you a 2-3X improvement.
So, base2dec could be recoded into C++. My benchmarking was showing 1
million 10-digit conversions taking 1.008 seconds. This could probably be
reduced to 0.5 second through recoding in C++. This seems a small enough
benefit that a programmer would be better off to concentrate on other areas
of the code such as bug fixes or areas where orders of magnitude
improvement are possible.
I understand what you are saying, but don't necessarily agree in this
case for several reasons:
1) base2dec is a rather fundamental routine, so there is some
justification for it being done as a built-in. If all the other
bin2dec, etc. can utilize that one efficient routine, then even more
reason to do so.
2) Internally, stuff like (2.^[length(s)-1:-1:0])' with all its memory
allocation and so on is gone. base2dec may be a very simple routine
done as a built-in, requiring little maintenance and debugging.
3) There is no padding of any sort (and the extra multiplications that
come with it) if internally base2dec can C-loop through the characters
of the string, accumulating the number, and stop at the end of the
string.
If you want to go ahead and try, however, no one
will turn down fast, debugged code. It would be very easy to test a
different implementation by writing your own C++ function and using the
mkoctfile interface.
I'm just trying to brainstorm if and where there could be some benefit
of a little optimization here. The bigger picture of base2dec is that
of more efficiently handling strings and cellstrs internally. I claim
this is no small improvement.
Dan