octave-maintainers

Re: Speed improvements and slowups 3.0.x to latest 3.1.x


From: Jaroslav Hajek
Subject: Re: Speed improvements and slowups 3.0.x to latest 3.1.x
Date: Mon, 9 Feb 2009 13:26:10 +0100

On Sat, Feb 7, 2009 at 9:11 AM, David Bateman <address@hidden> wrote:
> Jaroslav Hajek wrote:
>>
>> Now, theoretically memcpy is faster than std::copy because it makes a
>> stronger assumption (non-overlapping ranges), but I always thought
>> the differences were minuscule. You can easily choose the fastest
>> algorithm if you can safely compare any two pointers within your
>> address space, which you can't in C/C++, but maybe you can at the
>> system level? I see g++ 4.3 defers std::copy of POD types to an
>> internal function (__builtin_memmove), so the question is how good
>> that is. Were you compiling with an older version or a different compiler?
>> Basically, we could mimic the POD dispatching done in g++'s standard
>> library files and then put memcpy back into the game, but I really
>> don't like this idea much, I think this is the sort of stuff compilers
>> should handle.
>>
>
> Although the compiler was the same, in one case I'm using the prebuilt
> Debian version of Octave and building it myself for 3.1.51+. It seems you
> built both yourself and didn't see the effect I was seeing, so I take it
> back.
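
Just to make the POD dispatching concrete, it would be something along
these lines (a sketch only; libstdc++ does this internally with compiler
builtins rather than the C++11 trait used here, and the names are made up):

// Dispatch a copy helper to memcpy for trivially copyable types and to
// std::copy otherwise.  Illustrative only, not the actual library code.
#include <algorithm>
#include <cstring>
#include <type_traits>

template <typename T>
T *copy_elements (const T *first, const T *last, T *dest)
{
  if (std::is_trivially_copyable<T>::value)
    {
      // The ranges are assumed not to overlap, so memcpy is legal here.
      std::memcpy (dest, first, (last - first) * sizeof (T));
      return dest + (last - first);
    }

  // Non-trivial types still go through element-wise copy assignment.
  return std::copy (first, last, dest);
}
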
>
>> regarding sort, there are a number of further optimization possibilities.
>>
>> 1. the algorithm, borrowed from Python, is really targeted to sorting
>> arrays containing already sorted segments, i.e. re-sorting. I did a
>> few experiments here and it seems that std::sort from gcc 4.3,
>> implementing introsort, wins for random arrays but gets beaten for
>> arrays that contain a lot of sorted segments. Maybe we could create a
>> hybrid, falling back to std::sort if no long runs are detected.
>>
>
> The Python algorithm is perhaps 70% slower for random data but many times
> faster for partially sorted data, which is likely to be a more common case
> inside a loop. So I'd say the penalty for random data is acceptable given
> the gain for partially sorted lists.
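
For reference, the hybrid I have in mind would be roughly like this (a
sketch only: std::stable_sort stands in for the merge-based algorithm,
and the threshold is arbitrary):

// Probe for a long presorted run and fall back to std::sort (introsort)
// when none is found.  Illustrative only, not octave_sort's actual code.
#include <algorithm>
#include <cstddef>
#include <vector>

template <typename T>
void hybrid_sort (std::vector<T>& v)
{
  const std::size_t n = v.size ();

  // Length of the initial non-descending run.
  std::size_t run = 1;
  while (run < n && ! (v[run] < v[run-1]))
    run++;

  // A real implementation would probe more than just the start.
  if (n > 64 && run < 32)
    std::sort (v.begin (), v.end ());          // random data: introsort wins
  else
    std::stable_sort (v.begin (), v.end ());   // presorted runs: merging wins
}
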
>
>
>> 2. the comparison is currently done by passing in a function pointer.
>> For simple POD types, this is a significant overhead. In fact I
>> experimented with allowing it to be inlined by giving octave_sort a
>> second template parameter, and for random arrays it was a win by up to
>> 20% (IIRC). For partially sorted arrays, the win was smaller
>> (apparently because the number of comparisons was smaller relative to the
>> bookkeeping).
>> The result was, however, more complicated because you no longer have
>> one external octave_sort instance per type, and the instantiations got a
>> bit messy - it took me a while before I sorted out all the undefined
>> references at link time. And I didn't like the idea of making the algorithm
>> completely inline and further slowing down compilation.
>> Now, the mess could be avoided if we give up on optimizing descending
>> sorts and just optimize the ascending ones, i.e. the case when
>> octave_sort::compare is null. Currently this is tested in each
>> comparison (see IFLT in octave_sort.cc), which is probably sub-optimal
>> even though branch prediction in modern CPUs can cope amazingly well
>> with stuff like that.
>> I think it is possible to move the dispatch up a little, using
>> templates or just code duplication, which could again win us a few
>> percent.
>>
>
> Maybe this one would be reasonable ..
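
To make the dispatch idea concrete, something like the following sketch
(illustrative only; octave_sort's real interface differs):

// A functor template parameter can be inlined while a function pointer
// cannot, and the null-comparator test (cf. IFLT) is hoisted out of the
// inner loop and done once.
#include <algorithm>
#include <cstddef>
#include <functional>

template <typename T, typename Comp>
void insertion_sort (T *v, std::size_t n, Comp lt)
{
  for (std::size_t i = 1; i < n; i++)
    for (std::size_t j = i; j > 0 && lt (v[j], v[j-1]); j--)
      std::swap (v[j], v[j-1]);
}

// Dispatch once: a null pointer means "plain ascending sort".
template <typename T>
void sort_maybe_custom (T *v, std::size_t n,
                        bool (*lt) (const T&, const T&) = 0)
{
  if (lt)
    insertion_sort (v, n, lt);               // user-supplied comparison
  else
    insertion_sort (v, n, std::less<T> ());  // inlinable operator<
}
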
>>
>> I guess I could try to realize one or both of these ideas. Do you know of a
>> serious Octave application where sorting is a bottleneck?
>>
>
> Not really, just that it was a regression I saw against the sciview
> benchmark
>
> D.
>
> --
> David Bateman                                address@hidden
> 35 rue Gambetta                              +33 1 46 04 02 18 (Home)
> 92100 Boulogne-Billancourt FRANCE            +33 6 72 01 06 33 (Mob)
>
>

Hi David,
the changes are upstream.

I've tried the following simple benchmark (Core 2 @ 2.83 GHz, g++ 4.3 with -O2):

# purely random array
n = 4e6;
a = randn (1, n);
tic; b = sort (a); toc
tic; [b, i] = sort (a); toc

# partially ordered array
for k = 1:100
  j = ceil (rand * (n-1));
  # rotate the array at a random split point
  b = [b(j+1:end), b(1:j)];
endfor
tic; a = sort (b); toc
tic; [a, i] = sort (b); toc

## unique
a = round ((1:n) + 1.2*rand(1,n) - 0.6);
tic; [b, i] = unique (a); toc

## setdiff
tic; d = setdiff (b, 1:10:101); toc

3.0.3:
Elapsed time is 0.744449 seconds.
Elapsed time is 1.70849 seconds.
Elapsed time is 0.0697792 seconds.
Elapsed time is 0.196957 seconds.
Elapsed time is 0.344748 seconds.
Elapsed time is 0.715953 seconds.

previous tip:
Elapsed time is 0.757062 seconds.
Elapsed time is 1.58256 seconds.
Elapsed time is 0.0594368 seconds.
Elapsed time is 0.140261 seconds.
Elapsed time is 0.175813 seconds.
Elapsed time is 0.444922 seconds.

new tip:
Elapsed time is 0.640009 seconds.
Elapsed time is 0.870252 seconds.
Elapsed time is 0.0450239 seconds.
Elapsed time is 0.105679 seconds.
Elapsed time is 0.121528 seconds.
Elapsed time is 0.35074 seconds.

I would especially note that the indexed sort (which I think is more
important than plain sort) is sped up by almost a factor of two compared
to 3.0.3. As a consequence, various functions that depend on indexed
sorting are sped up as well.

PS. In sorting real arrays, there was an interesting bit-flipping trick
used to compare doubles as 64-bit integers. I don't, however, fully
understand the motivation. Is comparing 64-bit integers significantly
faster than comparing doubles?
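
As far as I can tell, the mapping is roughly the following (a sketch, not
necessarily the exact code in octave_sort.cc):

// The usual IEEE 754 bit-flipping trick: map each double to a 64-bit
// unsigned integer whose ordering matches the double ordering (NaNs
// excluded), so the sort can compare integers instead of doubles.
#include <cstdint>
#include <cstring>

inline std::uint64_t flip_double (double x)
{
  std::uint64_t u;
  std::memcpy (&u, &x, sizeof u);                  // safe type pun
  const std::uint64_t sign = std::uint64_t (1) << 63;

  // Negative values: flip all bits, which reverses their magnitude order.
  // Non-negative values: set the sign bit so they sort above negatives.
  return (u & sign) ? ~u : (u | sign);
}

// flip_double (a) < flip_double (b) holds exactly when a < b for
// ordinary (non-NaN) doubles.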

PS2. octave_sort is now able to do an indexed sort directly (permuting
the index array alongside the data as it sorts, which probably accounts
for most of the speed-up seen above), so there is no longer any need for
helpers like vec_index for that purpose. I haven't, however, removed all
such usages, so I guess a few remain (for instance in the sparse code).
You (or anyone else) are welcome to finish the conversion.
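
In a rough sketch, the idea is the following (insertion sort standing in
for the real merge-based algorithm, and the names are made up):

// Whenever a data element moves, its index moves with it, so the sorting
// permutation falls out of the sort for free.  Afterwards v is sorted
// and idx[k] is the original position of v[k].
#include <algorithm>
#include <cstddef>

template <typename T>
void indexed_sort (T *v, std::size_t *idx, std::size_t n)
{
  for (std::size_t k = 0; k < n; k++)
    idx[k] = k;                           // start from the identity permutation

  for (std::size_t i = 1; i < n; i++)
    for (std::size_t j = i; j > 0 && v[j] < v[j-1]; j--)
      {
        std::swap (v[j], v[j-1]);         // move the data ...
        std::swap (idx[j], idx[j-1]);     // ... and its index alongside
      }
}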

PS3. I think there's still room to optimize certain inner loops (see
the comments in octave_sort.cc), but that didn't seem to me worth
doing unless there's a real demand.

cheers

-- 
RNDr. Jaroslav Hajek
computing expert
Aeronautical Research and Test Institute (VZLU)
Prague, Czech Republic
url: www.highegg.matfyz.cz

