From: Martin DvH
Subject: Re: [Discuss-gnuradio] CUDA-Enabled GNURadio gr_benchmark10 possible improvements
Date: Wed, 01 Jul 2009 21:24:49 +0200
On Wed, 2009-07-01 at 05:27 -0400, Yu-Hua Yang wrote:
> Thank you very much for your reply. I have some more questions. I
> understand you won't be looking at or updating this branch anymore due
> to the release of OpenCL, but thanks for helping right now!
>
> Before you dive in, one thing I am concerned about is this output,
> which appears every time I run CUDA-enabled code:
>
> gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 =
> 3 initialization error
> gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000
> size=32768 *2
>
If I remember correctly, this error is not a problem if you get it once
per instantiated CUDA block. It is a problem if you get it multiple
times for the same block.
If I remember correctly, it has to do with CUDA not coping well with
multithreaded applications in which several host threads access the
same device buffer.
GNU Radio with CUDA should therefore always use the single-threaded
scheduler (this is not the default).
I don't remember whether my code selects it automatically or whether
you have to force it with an environment variable.
(You can select the scheduler used at runtime by setting the
GR_SCHEDULER environment variable. E.g., for the multithreaded
Thread-Per-Block scheduler:
$ GR_SCHEDULER=TPB ./my-gnuradio-application.py ...
or for the Single-Threaded Scheduler:
$ GR_SCHEDULER=STS ./my-gnuradio-application.py ...
)
If you have to use the environment variable, then this should go into
the gnuradio-CUDA docs (if it is not already there).
Greetings,
Martin
>
> Are these 2 outputs normal? Are they supposed to happen? If not, how
> do I fix them?
>
> 2009/6/30 Martin DvH <address@hidden>
> On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Yang wrote:
> > Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
> > exist 5 different kernel functions to do the multiplication, where
> > the default one,
> >
> > __global__ void
> > cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
> >                               const int noutput_items, const float konst)
> >
> > is completely blank. But regardless, nobody calls these kernel
> > functions.
>
> cuda_multiply_const_ff was never finished, or I accidentally removed
> some code here. Either way, this is clearly a bug.
> The empty kernel should call one of the implemented kernels.
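>
> A minimal working body for the default kernel could be something like
> this (untested sketch; alternatively the host code could simply launch
> one of the implemented variants instead):
>
> __global__ void
> cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
>                               const int noutput_items, const float konst)
> {
>   // One output per thread; the grid-stride loop makes the kernel
>   // correct for any grid/block configuration.
>   for (int i = blockIdx.x * blockDim.x + threadIdx.x;
>        i < noutput_items;
>        i += gridDim.x * blockDim.x)
>     g_odata[i] = g_idata[i] * konst;
> }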
>
>
>
> > Then, in the same file, whose functions are called by
> > cuda_multiply_const_ff.cc, there is this function:
> >
> > int
> > get_cuda_multiply_const_ff_kernel_params
> >   ( cuda_multiply_const_ff_kernel_params *params )
> > {
> >   int result = 0;
> >   //const unsigned int max_num_threads_per_block = MAX_NUM_THREADS_ALL;
> >   //                 //can use the maximum number of threads if wanted
> >   //unsigned int max_num_blocks = MAX_NUM_BLOCKS_ALL;
> >
> >   unsigned int num_blocks = 4096;           // = gridDim.x;  //NUM_CUDABLOCKS
> >   unsigned int num_threads_per_block = 512; // = blockDim.x; //NUM_THREADS;
> >   unsigned int num_outputs_per_block = num_threads_per_block;
> >
> >   const unsigned int num_outputs_per_grid =
> >       num_outputs_per_block * num_blocks;   //(blockDim.x)*gridDim.x
> >
> >   size_t dynamic_shared_mem_size =
> >       0; //256*sizeof(float);//0;//num_threads_per_block*sizeof(gr_complex);
> >   dim3 griddim( num_blocks, 1, 1);
> >   dim3 threaddim( num_threads_per_block, 1, 1);
> >
> >   params->griddim = griddim;
> >   params->threaddim = threaddim;
> >   params->dynamic_shared_mem_size = dynamic_shared_mem_size;
> >   params->num_outputs_padded = num_outputs_per_grid;
> >   params->num_inputs_padded = num_outputs_per_grid;
> >   params->num_inputs = 0;  //num_outputs_per_grid;//num_outputs;
> >   params->num_outputs = 0; //num_outputs_per_grid;//num_outputs;
> >
> >   //Now you can do the kernel invocation like this:
> >   //cuda_multiply_const_ff_filter_kernel<<< params->griddim,
> >   //  params->threaddim, params->dynamic_shared_mem_size
> >   //  >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);
> >   return result;
> > }
> >
> > The kernel invocation is completely commented out! The result is
> > initialized to 0 at the top and returned at the bottom. All the work
> > in between to specify and allocate thread and block sizes does not
> > seem to matter. Not sure why this code exists this way; did someone
> > make an edit, or did Martin specifically comment out the kernel
> > invocation?
>
> Yes, the kernel invocation is deliberately commented out.
> The commented-out kernel invocation is only there as documentation of
> how to do the actual kernel invocation.
>
> There is still a small typo: it should refer to
> cuda_multiply_const_ff_kernel, not
> cuda_multiply_const_ff_filter_kernel.
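>
> With that typo fixed, the documented invocation would read (still a
> sketch; g_idata, g_odata, X and konst come from the calling context):
>
> //cuda_multiply_const_ff_kernel<<< params->griddim,
> //  params->threaddim, params->dynamic_shared_mem_size
> //  >>>(g_idata, g_odata, params->num_outputs_padded*X, konst);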
>
>
>
>
> This method's name is get_cuda_multiply_const_ff_kernel_params, and it
> does just that: determine the params needed for the kernel invocation.
>
> The actual kernel invocation is done in:
> cuda_multiply_const_ff_work_device
>
> Both get_cuda_multiply_const_ff_kernel_params and
> cuda_multiply_const_ff_work_device are called from normal C++ code in
> cuda_multiply_const_ff_kernel.cu.
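>
> Roughly, the glue looks like this (a sketch from memory; the exact
> signature in the code may differ):
>
> void
> cuda_multiply_const_ff_work_device(const float *g_idata, float *g_odata,
>                                    int noutput_items, float konst)
> {
>   cuda_multiply_const_ff_kernel_params params;
>   get_cuda_multiply_const_ff_kernel_params(&params);
>
>   // Launch the kernel with the grid/block dimensions and shared-memory
>   // size that get_cuda_multiply_const_ff_kernel_params determined.
>   cuda_multiply_const_ff_kernel<<< params.griddim, params.threaddim,
>                                    params.dynamic_shared_mem_size >>>
>     (g_idata, g_odata, noutput_items, konst);
> }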
>
> Here you mean called by cuda_multiply_const_ff.cc, right? The kernel
> file itself seems to just define the kernel functions, and everything
> is called from cuda_multiply_const_ff.cc. Maybe I am wrong, but this
> leads to my question: where does cuda_multiply_const_ff call
> cuda_multiply_const_ff_work_device? It doesn't seem to, but maybe it
> is some OOP mechanism I missed... anyway, not that important at the
> moment.
>
>
> Better examples to look at and benchmark would be
> cuda_quadrature_demod_cuda_cf
> cuda_fir_filter_fff
>
> These are both used in
> testbed/wfm/cuda_wfm_rcv.py
> which calls the complete CUDA implementation of a WFM receiver in
> cuda_wfm_rcv_cuda.py
>
> You will notice that a cuda_multiply_const_ff block is instantiated as
> volume_control but not used in the final connect, because it didn't
> work. Now I know this is because of the bug you found.
>
>
> > Is this supposed to be this way? I don't see how this can be a
> > proper benchmark if we only test allocating threads and blocks on
> > the device and memory access times, but don't really do any
> > computation.
>
> That is not a good benchmark indeed.
> cuda_multiply_const_ff should be fixed first.
>
> The fir-filter, however, is the thing you really want to benchmark
> (which is also in there). The more taps the fir-filter has, the more
> computations it will have to make.
> Note that the CUDA implementation is at the moment limited in the
> number of taps it can support at a given decimation factor.
> If I remember correctly, decimationfactor*numberoftaps should stay
> below 2048 (or was it 512); at a decimation factor of 8, for example,
> that would mean at most 256 taps.
>
> Ways of improving the results include having GNU Radio do the
> calculations in bigger chunks.
> This can be done in several ways.
> One way is to have a block at the end of the processing chain which
> has a large output_multiple requirement and needs 1 input for every
> output. In other words: connect a block to the end of the processing
> chain which has set_output_multiple(large_number) in its constructor.
>
> You can use gr_test for this.
>
> I decided to abandon and comment out all the cuda.multiply_const_ff
> function calls and concentrate on cuda.fir_filter_fff as suggested.
> Things I have questions/concerns about:
>
> 1. I increased output_multiple by doing "options.output_multiple =
> xxx", and this has no effect on the computing time of either CUDA or
> the CPU. Did I do something wrong?
> 2. I increased the taps by doing "taps = range(1,256)" and also
> increased the number of fir_filter blocks in the code, and voila, I
> am now able to get CUDA to be faster than the CPU alone. However, if
> I use something like "taps = range(1,512)", the CUDA part becomes
> extremely slow (~20 seconds) while the CPU is still fine (~2 sec).
> Why? This may be related to what you were saying about the max number
> of taps... although why is the CPU still able to compute?
> 3. I had to increase the number of fir_filter blocks to 14 before I
> started seeing CUDA out-perform the CPU. Experimentally that's fine,
> I achieved my objective, but how is this "increased computation"
> justified in a normal GNURadio operation? I mean, when would a normal
> GNURadio operation require a chain of 14 fir_filters? I guess this
> goes beyond just "benchmarking" and asks where else I can take
> advantage of CUDA's computation power in GNURadio in a "normal"
> operation.
> 4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of
> cuda_fir_filter, it seems you are using shared memory, right? Just
> making sure we are not using global or local memory, which would
> disastrously slow down the CUDA computations (rough sketch below).
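>
> The pattern I would expect is something like this (my own sketch to
> check my understanding, not your actual kernel):
>
> __global__ void
> fir_fff_sketch(const float *in, float *out, const float *taps,
>                int ntaps, int noutputs)
> {
>   extern __shared__ float s_taps[];  // taps cached in shared memory
>   for (int t = threadIdx.x; t < ntaps; t += blockDim.x)
>     s_taps[t] = taps[t];
>   __syncthreads();
>
>   int i = blockIdx.x * blockDim.x + threadIdx.x;
>   if (i < noutputs) {
>     // in[] must hold noutputs + ntaps - 1 samples.
>     float acc = 0.0f;
>     for (int t = 0; t < ntaps; t++)
>       acc += in[i + t] * s_taps[t];
>     out[i] = acc;
>   }
> }
>
> launched with a dynamic shared memory size of ntaps * sizeof(float).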
>
>
>
> There are other ways, which include changing the scheduler and the
> buffer-instantiation code.
>
> I'd rather not pursue these because of the increased complexity. I'd
> rather increase the computation; that seems the fastest/easiest route
> to me, I'm just not sure where.
>
>
>
> I hope this helps,
>
> Martin
>
> It has been most helpful! Thank you again!
>
>
>
> > I am probably way off here, this doesn't make any sense... someone
> > please clarify!
> >
>
> > _______________________________________________
> > Discuss-gnuradio mailing list
> > address@hidden
> > http://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>
>