

From: Yu-Hua Yang
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
Date: Thu, 2 Jul 2009 01:08:17 -0400

Thanks Martin, for your generous effort to help me.

It appears only one time so I think I am in the clear.

I decided to abandon and comment out all the cuda.multiply_const_ff function calls and concentrate on cuda.fir_filter_fff, as suggested. A few questions/concerns:

1. I increased output_multiple by setting "options.output_multiple = xxx", but this has no effect on the computing time of either CUDA or the CPU. Did I do something wrong?
2. I increased the taps with "taps = range(1,256)" and also increased the number of fir_filter blocks in the code, and voila, I am now able to get CUDA to run faster than the CPU alone. However, if I use something like "taps = range(1,512)", the CUDA part becomes extremely slow (~20 seconds) while the CPU is still fine (~2 sec). Why? This may be related to what you said about the maximum number of taps, but why is the CPU still able to compute it?
3. I had to increase the chain to 14 fir_filter blocks before CUDA started to out-perform the CPU. Experimentally that's fine, and I achieved my objective, but how is this increased computation justified in normal GNURadio operation? When would a normal GNURadio flowgraph require a chain of 14 fir_filters? I suppose this goes beyond just benchmarking and asks: where else can I take advantage of CUDA's computational power in GNURadio in "normal" operation?
4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of cuda_fir_filter, it seems you are using shared memory, right? Just making sure we are not using global or local memory, which would drastically slow down the CUDA computation.
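For reference, the arithmetic being benchmarked in points 2 and 3 can be modeled in plain Python. This is not the CUDA kernel or the GNU Radio block from the branch, just a hypothetical sketch showing why the work grows linearly with both the tap count and the number of chained filter blocks on the CPU, whereas a CUDA kernel can additionally hit a per-block resource limit (e.g. shared memory) once the taps no longer fit:

```python
# Plain-Python reference model of a chain of FIR filters.
# Multiply-accumulates per chain = len(samples) * len(taps) * num_blocks,
# so going from 256 to 512 taps only doubles the CPU work, while a CUDA
# kernel that stages taps in shared memory can fall off a cliff instead.

def fir_filter(taps, samples):
    """Direct-form FIR: each output is a dot product over len(taps) inputs."""
    n = len(taps)
    # Prepend zeros so the filter has history for the first outputs.
    padded = [0.0] * (n - 1) + list(samples)
    return [
        sum(taps[k] * padded[i + n - 1 - k] for k in range(n))
        for i in range(len(samples))
    ]

def fir_chain(taps, samples, num_blocks):
    """Run the same filter num_blocks times in series, as in the benchmark."""
    out = samples
    for _ in range(num_blocks):
        out = fir_filter(taps, out)
    return out

if __name__ == "__main__":
    # An impulse input returns the taps themselves (impulse response).
    print(fir_filter([1.0, 2.0, 3.0], [1.0, 0.0, 0.0, 0.0]))
```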


Thank you again for taking the time to help me!

2009/7/1 Martin DvH <address@hidden>
On Wed, 2009-07-01 at 05:27 -0400, Yu-Hua Yang wrote:
> Thank you very much for your reply. I have some more questions. I
> understand you won't be looking at or updating this branch anymore due
> to the release of OpenCL, but thanks for helping right now!
>
> Before you dive in, one thing I am concerned about is this output,
> which appears every time I run CUDA-enabled code:
>
> gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 =
> 3 initialization error
> gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000
> size=32768 *2
>
If I remember correctly, this error is not a problem if you get it once
per instantiated CUDA block.

It is a problem if you get it multiple times for the same block.


If I remember correctly, this has to do with CUDA not being happy with
multithreaded applications accessing the same device buffer from
different host threads.

Gnuradio using CUDA should always use the single-threaded scheduler.
(This is not the default.)
I don't remember whether my code does this automatically or whether you
have to force it using an environment variable.
(You can select the scheduler at runtime by setting the GR_SCHEDULER
environment variable. E.g., for the multithreaded thread-per-block
scheduler:
 $ GR_SCHEDULER=TPB ./my-gnuradio-application.py ...

or for the single-threaded scheduler:
 $ GR_SCHEDULER=STS ./my-gnuradio-application.py ...
)

If you have to use the environment variable, then this should go into
the gnuradio-CUDA docs (if it is not already there).


Greetings,
Martin