
From: Yu-Hua Yang
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
Date: Wed, 1 Jul 2009 05:27:19 -0400

Thank you very much for your reply. I have some more questions. I understand you won't be looking at or updating this branch anymore due to the release of OpenCL, but thanks for helping now!

Before you dive in, one thing I am concerned about is this output, which appears every time I run CUDA-enabled code:

gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = 3 initialization error
gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000 size=32768 *2

Are these two lines of output normal? Are they supposed to appear? If not, how do I fix this?

2009/6/30 Martin DvH <address@hidden>
On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote:
> Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
> exist 5 different kernel functions to do the multiplication, where
> the default one,
>
>         __global__ void
>         cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
>                                       const int noutput_items, const float konst)
>
> is completely blank. But regardless, nobody calls these kernel
> functions.
cuda_multiply_const_ff was never finished, or I accidentally removed some
code here.
Either way, this is clearly a bug.
The empty kernel should call one of the implemented kernels.
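For example, a naive, untested sketch that would at least make the default kernel do the multiply (the other kernels in that file may take a different approach):

    __global__ void
    cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                                  const int noutput_items, const float konst)
    {
      // One output sample per thread; grid-stride loop so any grid size works.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < noutput_items;
           i += gridDim.x * blockDim.x)
      {
        g_odata[i] = g_idata[i] * konst;
      }
    }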


> Then, in the same file, which is called by cuda_multiply_const_ff.cc,
> in this function
>
>         int
>         get_cuda_multiply_const_ff_kernel_params( cuda_multiply_const_ff_kernel_params *params )
>         {
>           int result=0;
>           //const unsigned int max_num_threads_per_block = MAX_NUM_THREADS_ALL;  //can use the maximum number of threads if wanted
>           //unsigned int max_num_blocks = MAX_NUM_BLOCKS_ALL;
>
>           unsigned int num_blocks=4096;            // = gridDim.x;   //NUM_CUDABLOCKS
>           unsigned int num_threads_per_block=512;  // = blockDim.x;  //NUM_THREADS;
>           unsigned int num_outputs_per_block=num_threads_per_block;
>
>           const unsigned int num_outputs_per_grid=num_outputs_per_block*num_blocks;  //(blockDim.x)*gridDim.x
>
>           size_t dynamic_shared_mem_size = 0;  //256*sizeof(float);//num_threads_per_block*sizeof(gr_complex);
>           dim3  griddim( num_blocks, 1, 1);
>           dim3  threaddim( num_threads_per_block, 1, 1);
>
>           params->griddim=griddim;
>           params->threaddim=threaddim;
>           params->dynamic_shared_mem_size=dynamic_shared_mem_size;
>           params->num_outputs_padded=num_outputs_per_grid;
>           params->num_inputs_padded=num_outputs_per_grid;
>           params->num_inputs=0;   //num_outputs_per_grid;//num_outputs;
>           params->num_outputs=0;  //num_outputs_per_grid;//num_outputs;
>
>           //Now you can do the kernel invocation like this:
>           //cuda_multiply_const_ff_filter_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);
>           return result;
>         }
>
> The kernel invocation is completely commented out! The result is
> initialized to 0 at the top and simply returned. All the work in between
> to specify and allocate thread and block sizes does not seem to matter.
> Not sure why this code exists this way; did someone make an edit, or did
> Martin specifically comment out the kernel invocation?
Yes, the kernel invocation is deliberately commented out.
The commented-out kernel invocation is only there as documentation on
how to do the actual kernel invocation.

There is still a small typo: it should refer to cuda_multiply_const_ff_kernel and not
cuda_multiply_const_ff_filter_kernel.
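That is, the documentation comment should read:

    //cuda_multiply_const_ff_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);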



This method's name is get_cuda_multiply_const_ff_kernel_params.
It does just that: determine the parameters needed for the kernel
invocation.

The actual kernel invocation is done in:
cuda_multiply_const_ff_work_device


Both get_cuda_multiply_const_ff_kernel_params and
cuda_multiply_const_ff_work_device are called from normal C++ code in
cuda_multiply_const_ff_kernel.cu

Here you mean called by cuda_multiply_const_ff.cc, right? The .cu file itself seems to just define the kernel functions, and everything is called from cuda_multiply_const_ff.cc. Maybe I am wrong, which leads me to my question: where in cuda_multiply_const_ff.cc is cuda_multiply_const_ff_work_device called? It doesn't seem to be, but maybe it happens through some OOP mechanism I missed... anyway, not that important at the moment.

Better examples to look at and benchmark would be
cuda_quadrature_demod_cuda_cf
cuda_fir_filter_fff

These are both used in
testbed/wfm/cuda_wfm_rcv.py,
which calls the complete CUDA implementation of a WFM receiver in
cuda_wfm_rcv_cuda.py.

You will notice a cuda_multiply_const_ff block is instantiated as
volume_control but not used in the final connect, because it didn't work.
Now I know this is because of the bug you found.


> Is this supposed to be this way? I don't see how this can be a proper
> benchmark if we only test thread/block allocation on the device and
> memory access times, but don't really do any computation.
That is not a good benchmark indeed.
cuda_multiply_const_ff should be fixed first.

The FIR filter is, however, the thing you really want to benchmark
(it is also in there).
The more taps the FIR filter has, the more computations it has to do.
Note that the CUDA implementation is at the moment limited in the number
of taps it can support at a given decimation factor.
If I remember correctly, decimation_factor * number_of_taps should stay below 2048
(or was it 512?).

Ways of improving the results include having GNU Radio do the calculations in
bigger chunks.
This can be done in several ways.
One way is to have a block at the end of the processing chain which has
a large output_multiple requirement and needs one input for every
output.
In other words, connect a block to the end of the processing chain which
calls set_output_multiple(large_number) in its constructor.

You can use gr_test for this.

I decided to abandon and comment out all the cuda.multiply_const_ff calls and concentrate on cuda.fir_filter_fff as suggested. Here are my questions/concerns:

1. I increased output_multiple by setting "options.output_multiple = xxx", but it has no effect on the computation time of either CUDA or CPU. Did I do something wrong?
2. I increased the taps with "taps = range(1,256)" and also increased the number of fir_filter blocks in the code, and voila, I am now able to get CUDA to be faster than the CPU alone. However, if I use something like "taps = range(1,512)", the CUDA part becomes extremely slow (~20 seconds) while the CPU is still fine (~2 sec). Why? This may be related to what you were saying about the maximum number of taps... although why is the CPU still able to compute?
3. I had to increase the number of fir_filter blocks to 14 before I started seeing CUDA outperform the CPU. Experimentally that's fine, I achieved my objective, but how is this "increased computation" justified in normal GNU Radio operation? I mean, when would a normal GNU Radio application require a chain of 14 fir_filters? I guess this goes beyond just "benchmarking" and asks where else I can take advantage of CUDA's computational power in GNU Radio in "normal" operation.
4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of cuda_fir_filter, it seems you are using shared memory, right? Just making sure we are not using global or local memory, which would disastrously slow down the CUDA computations.


There are other ways, which include changing the scheduler and the buffer
instantiation code.

I would rather not pursue these approaches because of the increased complexity. I would rather increase the computation; that seems fastest/easiest to me, I'm just not sure where.

I hope this helps,

Martin

It has been most helpful! Thank you again!


> I am probably way off here; it doesn't make any sense... someone please
> clarify!
>


