octave-maintainers

Re: Speed improvements and slowups 3.0.x to latest 3.1.x


From: David Bateman
Subject: Re: Speed improvements and slowups 3.0.x to latest 3.1.x
Date: Sat, 07 Feb 2009 09:11:40 +0100
User-agent: Mozilla-Thunderbird 2.0.0.17 (X11/20081018)

Jaroslav Hajek wrote:
> Now, theoretically memcpy is faster than std::copy because it has
> stronger assumptions (non-overlapping ranges), but I always thought
> the differences were minuscule. You could easily choose the fastest
> algorithm if you could safely compare any two pointers within your
> address space, which you can't in C/C++, but maybe you can at the
> system level? I see g++ 4.3 defers std::copy of POD types to an
> internal function (__builtin_memmove), so it's a question how good
> that is. Were you compiling with an older version / a different
> compiler? Basically, we could mimic the POD dispatching done in g++'s
> standard library and then put memcpy back into the game, but I don't
> really like this idea much; I think this is the sort of thing
> compilers should handle.
Although the compiler was the same, in one case I was using the prebuilt Debian version of Octave, while building 3.1.51+ myself. It seems you built both yourself and didn't see the effect I was seeing, so I take it back.

> Regarding sort, there are a number of further optimization possibilities.
>
> 1. The algorithm, borrowed from Python, is really targeted at sorting
> arrays containing already-sorted segments, i.e. re-sorting. I did a
> few experiments here and it seems that std::sort from gcc 4.3,
> implementing introsort, wins for random arrays but gets beaten for
> arrays that contain a lot of sorted segments. Maybe we could create a
> hybrid, falling back to std::sort if no long runs are detected.
The Python algorithm is perhaps 70% slower for random data but many times faster for partially sorted data, which is likely the more common case inside a loop. So I'd say the penalty for random data is acceptable given the gain for partially sorted lists.
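A hybrid of the shape Jaroslav suggests might be sketched like this (illustrative only: `hybrid_sort` and the run-length threshold are assumptions, and `std::stable_sort` merely stands in for the Python-derived merge sort):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Illustrative hybrid: probe the length of the initial ascending run.
// If it is short relative to n, treat the data as random and use
// introsort (std::sort); otherwise use a merge-based sort that benefits
// from pre-existing order (std::stable_sort as a stand-in for the
// timsort-style algorithm borrowed from Python).
template <typename T>
void hybrid_sort (std::vector<T>& v)
{
  std::size_t run = 1;
  while (run < v.size () && ! (v[run] < v[run-1]))
    run++;

  if (run * 16 < v.size ())      // threshold chosen arbitrarily here
    std::sort (v.begin (), v.end ());
  else
    std::stable_sort (v.begin (), v.end ());
}
```

A real implementation would presumably probe several runs rather than just the first, since a single short run at the front says little about the rest of the array.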


> 2. The comparison is currently done by passing in a function pointer.
> For simple POD types, this is a significant overhead. In fact I
> experimented with allowing it to be inlined by giving octave_sort a
> second template parameter, and for random arrays it was a win of up to
> 20% (IIRC). For partially sorted arrays, the win was smaller
> (apparently because the number of comparisons was smaller relative to
> the bookkeeping).
> The result was, however, more complicated, because you no longer have
> one external octave_sort instance per type, and the instantiations got
> a bit messy - it took me a while before I sorted out all the undefined
> references at linking. And I didn't like the idea of making the
> algorithm completely inline and further slowing down compilation.
> Now, the mess could be avoided if we give up on optimizing descending
> sorts and just optimize the ascending ones, i.e. the case where
> octave_sort::compare is null. Currently this is tested in each
> comparison (see IFLT in octave_sort.cc), which is probably sub-optimal
> even though branch prediction in modern CPUs can cope amazingly well
> with stuff like that.
> I think it is possible to move the dispatch up a little, using
> templates or just code duplication, which could again win us a few
> percent.
Maybe this one would be reasonable.
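The trade-off between the two comparator strategies can be sketched as follows (hypothetical names, not the actual octave_sort interface): a comparison passed as a function pointer costs an indirect call per element, while a comparator supplied as a template parameter is visible to the compiler and can be inlined.

```cpp
#include <cstddef>
#include <utility>

// Function-pointer flavour: one instantiation per element type T, but
// every comparison is an indirect call the compiler cannot inline.
template <typename T>
void insertion_sort (T *v, std::size_t n, bool (*lt) (const T&, const T&))
{
  for (std::size_t i = 1; i < n; i++)
    for (std::size_t j = i; j > 0 && lt (v[j], v[j-1]); j--)
      std::swap (v[j], v[j-1]);
}

// Template-comparator flavour: Comp::operator() can be inlined into
// the inner loop, at the cost of extra instantiations per comparator.
template <typename T, typename Comp>
void insertion_sort (T *v, std::size_t n, Comp lt)
{
  for (std::size_t i = 1; i < n; i++)
    for (std::size_t j = i; j > 0 && lt (v[j], v[j-1]); j--)
      std::swap (v[j], v[j-1]);
}

// Stateless comparator; the call compiles down to a plain '<'.
struct ascending
{
  template <typename T>
  bool operator() (const T& a, const T& b) const { return a < b; }
};
```

This also shows why the instantiations get messy: each (type, comparator) pair is a distinct instantiation, rather than one external instance per element type.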
> I guess I could try to realize one or both of these ideas. Do you
> know of a serious Octave application where sorting is a bottleneck?
Not really; it was just a regression I saw against the sciview benchmark.

D.

--
David Bateman                                address@hidden
35 rue Gambetta                              +33 1 46 04 02 18 (Home)
92100 Boulogne-Billancourt FRANCE            +33 6 72 01 06 33 (Mob)


