chicken-users

Re: [Chicken-users] Optimizing inner loops


From: Will M Farr
Subject: Re: [Chicken-users] Optimizing inner loops
Date: Tue, 29 Aug 2006 16:07:56 -0400

Hello Carlos (and everyone),

Thanks for the reference to the inline egg---I'd seen that, but had forgotten about it. Below is a little bit of code you can use to test the speed of the various methods I discussed in my last email for different sizes of vectors.

Each of the attached files can be compiled to an executable which is invoked as

./test_{c,scm,fast_scm} <n> <n-iter>

which adds two vectors of double-precision floating-point numbers of length <n> and stores the result in a third vector, repeated <n-iter> times. On my machine (PowerBook G4, 800 MHz, Mac OS X 10.4.7), user timings with n = 1000 and n-iter = 1000000 (conveniently 1 GFLOP) are as follows:

csr-dyn-69:/tmp farr$ time ./test_c 1000 1000000

real    0m7.889s
user    0m6.407s
sys     0m0.057s


csr-dyn-69:/tmp farr$ time ./test_fast_scm 1000 1000000

real    0m7.879s
user    0m6.296s
sys     0m0.094s


csr-dyn-69:/tmp farr$ time ./test_scm 1000 1000000
^C

real    30m56.616s
user    19m30.921s
sys     1m14.798s

As you can see, I terminated the test_scm run prematurely because it was taking *forever*. I guess the moral of this story is that the pure-Scheme implementation isn't going to come close (it was over two orders of magnitude behind when I killed it) to the speed of the C version. A few other observations about this test:

1. Putting inline C code in the inner loops of your Scheme program is just as fast as pure C for 1000-element vectors. This parity degrades as the vectors get shorter (because less work is done in C before looping back into Scheme). In general, the more you do inside the foreign-lambda* C code, the closer you come to pure-C performance, because the Scheme looping calls become a smaller and smaller fraction of the run time. On my machine, test_fast_scm is about 15 times slower than test_c when n = 3.

2. Cache effects will become important at some point (somewhere above n = 1000 on my machine). That is, if you run ./test_{...} 1000000 1000, which is also 1 GFLOP of operations, the relative timings will change a lot. With such a long vector to add, and so little work done on each vector element, the benchmark becomes a test of how fast your computer can move vector elements from main memory into the cache; the speed of the add instruction becomes much less relevant, so you can probably better afford the extra work done in the pure Scheme code. In the paper http://repository.readscheme.org/ftp/papers/sw2000/lucier.pdf, the authors claim that their Gambit-C code runs just as fast as C. They are correct, but (based on some other tests I've done) this only holds when you're bus-speed limited---Gambit is, in fact, a bit slower (maybe a factor of 2 or 3) than C in its arithmetic. That factor clearly doesn't matter when you're stalled on a cache miss (though I bet even extreme cache missing wouldn't disguise the two orders of magnitude in the native-Scheme Chicken code).

Feel free to run this code yourself with values of <n> and <n-iter> relevant to your project. That should at least give you an idea of the maximum performance you can expect.

Will

P.S.---My compile options were:

gcc: -O3

csc: -block -optimize-level 3 -debug-level 0 -lambda-lift -disable-interrupts

On my system, csc invokes gcc with (among other path-setting options) -Os.
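Put together, the builds would look something like this (the output names are my guess, chosen to match the ./test_{c,scm,fast_scm} invocations above; adjust to taste):

```shell
# C version
gcc -O3 -o test_c test.c

# Chicken Scheme versions (csc is Chicken's compiler driver)
csc -block -optimize-level 3 -debug-level 0 -lambda-lift \
    -disable-interrupts -o test_scm test.scm
csc -block -optimize-level 3 -debug-level 0 -lambda-lift \
    -disable-interrupts -o test_fast_scm test-fast.scm
```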

P.P.S.---I'm surprised at how badly the pure-Scheme code does here. I'd love to hear from the other Chicken experts out there whether I've made a mistake somewhere....

Attachment: test-fast.scm
Description: Binary data

Attachment: test.c
Description: Binary data

Attachment: test.scm
Description: Binary data
