
Re: [Discuss-gnuradio] Using volk kernels on basic operations of gr_complex, in my own custom blocks.


From: West, Nathan
Subject: Re: [Discuss-gnuradio] Using volk kernels on basic operations of gr_complex, in my own custom blocks.
Date: Sun, 28 Feb 2016 22:23:55 -0500

On Sun, Feb 28, 2016 at 5:39 PM, Douglas Geiger <address@hidden> wrote:
The phenomenon Sylvain is pointing at is basically that as compilers improve, you should expect the 'optimized' proto-kernels to no longer show as dramatic an improvement over the generic ones. As to your question of 'is it worth it' - that comes down to a couple of things: for example, how much of an improvement do you require to be 'worth it' (i.e., how much is your time worth, and/or how much of a performance improvement does your application require)? Similarly, is it worth it to you to get cross-platform improvements (which is one of the features of VOLK)? Or, perhaps, is it worth it to you just to learn how to use VOLK?

A couple of thoughts here: in general, when I have a flowgraph that is not meeting my performance requirements, my first step is to do some coarse profiling (e.g. via gr-perf-monitorx) to determine whether a single block is my primary performance bottleneck. If so, that is the block I will concentrate on for optimizations (via VOLK and/or any algorithmic improvements - e.g. can I turn any run-time calculations into a look-up table computed either at compile time or in the constructor?).
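The look-up-table idea above can be sketched roughly like this (a plain Python class standing in for a GNU Radio block, with a made-up example mapping phase indices to complex exponentials - the class and names are mine, not from any real block):

```python
import numpy as np

# Hypothetical illustration of the look-up-table idea: instead of calling
# an expensive transcendental per sample inside work(), precompute the
# values once in the constructor. A plain class, not a real gr.sync_block,
# so it stays self-contained.
class PhaseMapper:
    def __init__(self, table_size=1024):
        # Pay the cost of exp() once, at construction time.
        self.table_size = table_size
        phases = 2 * np.pi * np.arange(table_size) / table_size
        self.lut = np.exp(1j * phases).astype(np.complex64)

    def work(self, phase_indices):
        # The per-sample path is now just an indexed load - no sin/cos.
        return self.lut[np.asarray(phase_indices) % self.table_size]
```

A real block would do the same precomputation in its constructor and index the table inside work()/general_work().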
If there is not a clear bottleneck, then next I look a little deeper using perf/oprofile to see which functions my flowgraph is spending a lot of time in: can I, e.g., create a faster version of some primitive calculation that all my blocks use heavily, and thereby get a speed-up across many blocks, which should translate into a faster overall application?

Finally, if I still need more improvement I would look at collecting many blocks together into a single, larger block. This is generally less desirable, since you now have a (more) application-specific block and it becomes harder to re-use in later projects, but if your performance requirements drive you there, it absolutely is an option. At this point you likely have multiple operations being applied to your incoming samples, and it becomes easy to collect all of those into a single larger VOLK call (and from there, create a SIMD-ized proto-kernel that targets your particular platform). So, while re-usability of code drives you away from this scenario, it offers the greatest potential for performance improvements, and thus is where many applications with high performance requirements tend to gravitate. Ideally you can strike a balance between the two: i.e. have widely re-usable blocks, but with a set of operations inside them that can take advantage of e.g. SIMD-ized function calls to make them high-performance. If you can craft the block to be widely re-usable for a certain class of things, so much the better (e.g. look at how the OFDM blocks are set up to be easily re-configurable for the many ways an OFDM waveform can be crafted). In the long run, having more knobs to turn to customize your existing code base for whatever new scenario you are looking at 1/2/10 years from now is always better than a brittle solution that solves today's problem but is difficult to re-use to deal with tomorrow's.
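A tiny illustration of the "collect operations into one pass" idea, in plain NumPy with hypothetical gain/offset values of my own choosing - three separate passes over the samples versus one fused expression, which is the kind of thing you would fold into one larger VOLK call in a consolidated block:

```python
import numpy as np

def three_passes(x, gain, offset):
    # Each step walks the whole buffer: three rounds of memory traffic.
    y = x * gain
    y = y + offset
    return np.conj(y)

def one_pass(x, gain, offset):
    # Same math as a single combined pass over the samples - the shape
    # a consolidated block (or custom proto-kernel) would take.
    return np.conj(x * gain + offset)
```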

Hope that was helpful. If you are interested in learning more about how to use VOLK - certainly have a look at libvolk.org - the documentation is (I think) fairly good at introducing the concepts and intent, as well as how the API looks/works.

Aww, thanks :-)
 
And certainly don't be shy about asking more questions here.

 Good luck,
  Doug

On Sun, Feb 28, 2016 at 1:58 AM, Sylvain Munaut <address@hidden> wrote:
> Just wanted to ask the more experienced users if you think this idea is
> worth a shot, or the performance improvement will be marginal.

Performance improvement is vastly dependent on the operation you're doing.

You can get an idea of the improvement by comparing the volk_profile
output for the generic kernel (coded in pure C) and the sse/avx ones.

For instance, on my laptop: for some very simple ones (like float
add), the generic is barely slower than the SIMD version. Most likely
because it's so simple that even the compiler itself was able to
SIMD-ize it.
But for other things (like complex multiply), the SIMD version is 10x faster ...


Cheers,

   Sylvain


I agree with everything that's been said, but wanted to chime in with a slightly different perspective.

Let's take an example of doing y[t] = sum(a[i]*x[t-i]). The first approach will always be to write a loop that iterates over samples, doing the point-by-point multiplies and adding them up.
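That loop approach, sketched in plain Python for concreteness (the function and variable names are mine, not from any GNU Radio block):

```python
def fir_naive(a, x, t):
    # y[t] = sum over i of a[i] * x[t-i]: one multiply and one add per
    # tap, all in an explicit scalar loop.
    acc = 0j
    for i in range(len(a)):
        acc += a[i] * x[t - i]
    return acc
```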


There's a couple of VOLK kernels that are "relevant" to this (let's assume everything is complex). You could do volk_32fc_x2_multiply_32fc(y, x, a, i). Now you need to sum y. This is not going to be much better than not using the VOLK call (depending on your compiler and CPU architecture), because multiplies are very simple and compilers are pretty good at generating code for them.

You could write a kernel to sum a complex vector (since I don't think that exists), or you could just do it in your work function. That new kernel might help a little bit (it probably would); however, doing a volk_32fc_x2_dot_prod_32fc(y, x, a, i) is going to be much better. The primary reason is that you can use an accumulator, so that once you do a multiply the sum is done on data in registers rather than having to write to and read back from memory (probably cache) later.
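In NumPy terms (just to illustrate the data movement, not the actual VOLK implementation), the two approaches look like this: the first materializes the full elementwise product in memory and then reads it back to sum it, while the second keeps a running sum in an accumulator, which is what the dot-product kernel can hold in registers:

```python
import numpy as np

def multiply_then_sum(x, a):
    # Multiply-kernel-style: the elementwise product y is written out
    # to memory in full, then read back to be summed.
    y = x * a          # intermediate vector round-trips through memory
    return y.sum()

def dot_prod(x, a):
    # Dot-product-style: multiply and accumulate are fused, so partial
    # sums never round-trip through an output vector.
    acc = 0j
    for xi, ai in zip(x, a):
        acc += xi * ai
    return acc
```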

So basically, even if you can replace a bunch of multiplies with VOLK versions you won't see a huge improvement. You'll see much better improvement if you can use the dot_prod, exp, sinusoid, rotator, etc. kernels. You'll also likely see improvements if you write your own kernels (assuming what you need does not exist), but that requires the most effort (and also has the biggest payoff).

Basically, don't think of VOLK as just a library of fixed routines you can only call. View it also as a tool that lets you place the generic version you already have next to an optimized version for whatever architectures you care about. From a performance perspective, you're going to get the most improvement by using SIMD *and* avoiding unnecessary memory access.
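As a toy sketch of that "generic next to optimized" model (my own illustrative dispatcher, not VOLK's actual machinery or API): several implementations of the same kernel live side by side, with the portable one always available as a fallback, and the best available one is picked at runtime - roughly the role volk_profile's results play:

```python
import numpy as np

# Toy version of VOLK's dispatch model: implementations of one kernel,
# keyed by "architecture", with generic always present. Names here are
# illustrative only, not VOLK identifiers.
KERNEL_IMPLS = {
    "generic": lambda x, a: sum(xi * ai for xi, ai in zip(x, a)),
    "numpy":   lambda x, a: complex(np.dot(x, a)),  # stand-in for a tuned proto-kernel
}

def dispatch(preferred=("numpy", "generic")):
    # Return the first implementation that exists, the way a machine
    # definition gets chosen from profiling results.
    for name in preferred:
        if name in KERNEL_IMPLS:
            return KERNEL_IMPLS[name]
    raise RuntimeError("no implementation available")
```

The point is that callers only ever see the one kernel name; which body runs underneath is a deployment-time detail.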

nw
