
[Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)


From: Dennis Glatting
Subject: [Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)
Date: Thu, 23 Jul 2015 12:48:43 -0700

I copied out the dc_block_cc block from 3.7.8 and ran some performance
tests against it, which I've summarized in a table below.

I had to make some modifications to the original code, such as:

 * I removed the make wrapper. 
 * I tested against different containers. 
 * Different containers have different access/management methods,
   which meant some changes to the code body (I tried to be consistent).
 * On input I passed a std::vector to work() rather than complex*.
   Although this changes the flavor of work(), I figure the results
   are still comparable relative to each other.
 * I only used long_form and deleted the short_form code. I used 
   the key part of the original code.

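The three containers are the original std::deque, plus std::queue and
std::list. The results are interesting. (std::queue is a container
adaptor that wraps std::deque by default, so those two should behave
about the same.) I probably should have looked at other containers such
as std::vector, but that might require recoding.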
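
For anyone curious what the std::vector recoding might look like, here is
a rough sketch, not code from my test files. It keeps the same arithmetic
as filter() but replaces push_back/pop_front with a fixed-size ring
buffer. The class names, the "length - 1 zeros" initial delay line, and
the outputs_match() helper are all assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <deque>
#include <vector>

typedef std::complex<float> CPLX;

// Deque version, following the filter() body quoted in this message.
// Assumption: the delay line starts out as (length - 1) zeros.
class averager_deque {
public:
  explicit averager_deque(std::size_t length)
    : d_length(length), d_delay_line(length - 1, CPLX(0.0f, 0.0f)) {}

  CPLX filter(const CPLX& x) {
    d_out_d1 = d_out;
    d_delay_line.push_back(x);
    d_out = d_delay_line.front();
    d_delay_line.pop_front();
    CPLX y = x - d_out_d1 + d_out_d2;
    d_out_d2 = y;
    return y / static_cast<float>(d_length);
  }

private:
  std::size_t d_length;
  CPLX d_out, d_out_d1, d_out_d2;   // std::complex default-constructs to zero
  std::deque<CPLX> d_delay_line;
};

// Ring-buffer version: storage is allocated once and never grows.
// Requires length >= 2 so the buffer holds at least one element.
class averager_ring {
public:
  explicit averager_ring(std::size_t length)
    : d_length(length), d_head(0), d_buf(length - 1, CPLX(0.0f, 0.0f)) {
    assert(length >= 2);
  }

  CPLX filter(const CPLX& x) {
    d_out_d1 = d_out;
    d_out = d_buf[d_head];                  // oldest sample in the line
    d_buf[d_head] = x;                      // overwrite it with the newest
    d_head = (d_head + 1) % d_buf.size();   // advance to the next-oldest
    CPLX y = x - d_out_d1 + d_out_d2;
    d_out_d2 = y;
    return y / static_cast<float>(d_length);
  }

private:
  std::size_t d_length;
  std::size_t d_head;
  CPLX d_out, d_out_d1, d_out_d2;
  std::vector<CPLX> d_buf;
};

// Feeds the same made-up input to both versions and compares every output.
bool outputs_match(std::size_t length, std::size_t n_samples) {
  averager_deque a(length);
  averager_ring b(length);
  for (std::size_t i = 0; i < n_samples; ++i) {
    const CPLX x(static_cast<float>(i % 7), static_cast<float>(i % 3) - 1.0f);
    if (a.filter(x) != b.filter(x)) return false;
  }
  return true;
}
```

Calling outputs_match(32, 1000) compares the two versions over 1000
samples with a 32-element length. Whether the ring buffer is actually
faster than std::deque here is untested.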

I also compiled with and without -std=c++11 because when I looked at
the container source I saw a bunch of #ifdefs for >= c++0x.

These are some of the problems I see with the original dc_block:

* Arguments are passed by value rather than by reference.
* Nothing is declared inline.
* const is missing where it should be used.

So in a second copy of dc_block I fixed those things. I found one case
(filter()) where it returns by value, and I left that one alone.
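
To illustrate the by-value versus by-reference point, here is a small
sketch. It is not taken from dc_block; counted_buffer and the two sum
functions are hypothetical names, and the copy counter only exists to
make the extra copy visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical payload that counts how often it is copied.
struct counted_buffer {
  static int copies;
  std::vector<float> samples;

  explicit counted_buffer(std::size_t n) : samples(n, 1.0f) {}
  counted_buffer(const counted_buffer& other) : samples(other.samples) {
    ++copies;
  }
};
int counted_buffer::copies = 0;

// By value: the caller's buffer is copied on every call.
float sum_by_value(counted_buffer buf) {
  float total = 0.0f;
  for (std::size_t i = 0; i < buf.samples.size(); ++i)
    total += buf.samples[i];
  return total;
}

// By const reference, marked inline: no copy is made, and the callee
// cannot modify the caller's data.
inline float sum_by_ref(const counted_buffer& buf) {
  float total = 0.0f;
  for (std::size_t i = 0; i < buf.samples.size(); ++i)
    total += buf.samples[i];
  return total;
}

// Number of copies triggered by a single by-value call.
int copies_for_value_call() {
  counted_buffer b(1024);
  counted_buffer::copies = 0;
  const float total = sum_by_value(b);
  assert(total == 1024.0f);
  (void)total;
  return counted_buffer::copies;
}

// Number of copies triggered by a single by-reference call.
int copies_for_ref_call() {
  counted_buffer b(1024);
  counted_buffer::copies = 0;
  const float total = sum_by_ref(b);
  assert(total == 1024.0f);
  (void)total;
  return counted_buffer::copies;
}
```

The by-value call copies all 1024 floats once per call and the
by-reference call copies nothing. How much of the measured speedup comes
from this, as opposed to inlining, I have not separated out.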

The table below summarizes the results. "Old" means my reasonable(?)
facsimile of the original dc_block. "+c11" means I added -std=c++11 to
the compile line. "Opt" is my optimized copy of the code where I added
references, inlines, etc. "Special" is "opt" but with different compile
options. All of the output is included at the end of this message.

The numbers you'll see for old/old+c11/etc. are the amount of time, in
seconds, it took to process /one/ sample. In "old+deque" for example (the
first item), it took 701 us to process a sample. One of the surprising
results is that std::list is by far the slowest: about 1.9 to 2.4 ms per
sample in every build, against 0.23 to 0.71 ms for the other two. Also,
when looking at the assembly language for filter() (copy below) I see
reallocs(). That's not surprising, and it's probably bad for performance.
(BTW, "CPLX" is: "typedef std::complex<float> CPLX;".)

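inline const CPLX
moving_averager_c_list::filter( const CPLX& x ) {

  // Remember the previous delayed output, then rotate the delay line:
  // the newest sample goes in at the back, the oldest comes off the front.
  d_out_d1 = d_out;
  d_delay_line.push_back(x);
  d_out = d_delay_line.front();
  d_delay_line.pop_front();

  // Recursive running sum, kept in d_out_d2 for the next call.
  CPLX y = x - d_out_d1 + d_out_d2;
  d_out_d2 = y;

  // Scale the sum by the averager length.
  return (y / (float)(d_length));
}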

The "size" numbers in the table are the text segment sizes reported by
"size a.out". The "block size" is simply sizeof(d_delay_line), which
is really sizeof(std::deque<CPLX>), for example.
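
A note on what "block size" does and does not measure, with a short
sketch. sizeof reports only the container object itself (its internal
pointers and bookkeeping), not the heap memory holding the elements. The
function names below are made up, and the printed values depend on the
standard library in use.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <deque>
#include <iostream>
#include <list>
#include <queue>

typedef std::complex<float> CPLX;

// Prints the size of each container object, independent of its contents.
void print_block_sizes() {
  std::cout << "deque: " << sizeof(std::deque<CPLX>) << "\n"
            << "queue: " << sizeof(std::queue<CPLX>) << "\n"
            << "list:  " << sizeof(std::list<CPLX>) << "\n";
}

// sizeof is a compile-time constant, so filling the container
// does not change the value it reports.
std::size_t deque_handle_size(std::size_t n_elements) {
  std::deque<CPLX> d(n_elements);
  assert(d.size() == n_elements);
  return sizeof(d);
}
```

So the 16 reported for std::list says nothing about its per-element cost;
each list element lives in its own separately allocated node.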

One other note. I compiled "special" with -Ofast and it failed the
content integrity check (the "Data error i=0" line at the end of that
run). Probably a bad option to use. :)
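
I have not diagnosed the -Ofast failure. One possible factor, offered
only as an assumption: the GCC documentation says -Ofast turns on
-ffast-math, which lets the compiler reorder floating-point arithmetic,
and float addition is not associative. The sketch below (hypothetical
function names, assuming IEEE-754 single precision) shows two orderings
of the same three-term sum giving different answers.

```cpp
#include <cassert>

// Computes (a + b) + c. The volatile forces the intermediate result
// to be stored as a float before the second addition.
float sum_left_to_right(float a, float b, float c) {
  volatile float partial = a + b;
  return partial + c;
}

// Computes a + (b + c), the same sum with the other grouping.
float sum_right_to_left(float a, float b, float c) {
  volatile float partial = b + c;
  return a + partial;
}

// True when the two groupings disagree for the given inputs.
bool grouping_matters(float a, float b, float c) {
  return sum_left_to_right(a, b, c) != sum_right_to_left(a, b, c);
}
```

With a = 1e8f, b = -1e8f, c = 1.0f the first grouping gives 1 and the
second gives 0, because -1e8f + 1.0f rounds back to -1e8f in single
precision. The running sum in filter() is the kind of expression where
such reordering could change the output.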


My os:       Ubuntu 15.04.
My compiler: gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13)
My system:   AMD FX(tm)-9590 Eight-Core Processor @ 4.7GHz

I'm happy to send copies of the test code (two files) for review if
someone wants to put them on the web. The three main code blocks are
pretty simple:

  { dc_blocker_cc_deque dc( NUM_ELEM );

    std::cout << "deque:" << std::endl;

    t_start = gr::high_res_timer_now();
    for( int i = 0; i < NUM_LOOPS; ++i )
      for( int j = 0; j < NUM_COMPLEX; ++j )
        dc.work( data, dc_deque );
    timing( t_start, gr::high_res_timer_now(), NUM_LOOPS*NUM_COMPLEX );

  }


#define NUM_LOOPS   5
#define NUM_COMPLEX 10000
#define NUM_ELEM    32
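
For anyone who wants to reproduce the timing without GNU Radio, here is a
rough sketch of the same measurement. It uses std::chrono::steady_clock
(C++11) as a stand-in for gr::high_res_timer_now(); seconds_per_item()
and time_per_call() are hypothetical helpers, not my timing() function.

```cpp
#include <cassert>
#include <chrono>

// Converts a raw tick count into seconds per item, which is how the
// "t/ea" figure relates to "total_t" and the clock's ticks per second.
double seconds_per_item(long long total_ticks,
                        long long ticks_per_second,
                        long long items) {
  assert(ticks_per_second > 0 && items > 0);
  const double seconds = static_cast<double>(total_ticks) /
                         static_cast<double>(ticks_per_second);
  return seconds / static_cast<double>(items);
}

// Calls fn() the given number of times and returns the average
// wall-clock time per call, in seconds.
template <typename Fn>
double time_per_call(Fn fn, long long calls) {
  typedef std::chrono::steady_clock clock;
  const clock::time_point start = clock::now();
  for (long long i = 0; i < calls; ++i)
    fn();
  const clock::time_point stop = clock::now();
  const long long ns =
      std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
          .count();
  return seconds_per_item(ns, 1000000000LL, calls);
}
```

As a check on the arithmetic, seconds_per_item(35051914970, 1000000000,
5 * 10000) comes out at about 0.000701038, matching the first deque
result in the output further down.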


Here's the summary table:


Time per item (t/ea, in seconds):

        old          old+c11      opt          opt+c11      special

deque:  0.000701038  0.000705963  0.000235234  0.00023607   0.000234233
queue:  0.00069784   0.000705617  0.00023619   0.00023222   0.000237184
list:   0.00194583   0.00243208   0.00191296   0.00193926   0.00194809

text
size:   28902        21712        29574        23080        23112

text
orig:   33821        26502


        block size (bytes):

deque:  80
queue:  80
list:   16






Original facsimile (not c++11):

address@hidden:~/dc_test$ c++ -O3  main.cc
address@hidden:~/dc_test$ size a.out 
   text    data     bss     dec     hex filename
  28902     856     280   30038    7556 a.out

address@hidden:~/dc_test$ ./a.out 
Building complex number data...
Done.
GNURadio hi-res clock tps: 1000000000
GNURadio sizeof(gr_complex): 8
GNURadio sizeof(CPLX): 8

dc_blocker_cc_deque: delay_line size=80
deque:
Done: total_t: 35051914970, sec_t: 35.0519, t/ea: 0.000701038

dc_blocker_cc_queue: delay_line size=80
queue:
Done: total_t: 34892023951, sec_t: 34.892, t/ea: 0.00069784

dc_blocker_cc_list: delay_line size=16
list:
Done: total_t: 97291349192, sec_t: 97.2913, t/ea: 0.00194583




Original facsimile (c++11):

address@hidden:~/dc_test$ c++ -O3 -std=c++11 main.cc
address@hidden:~/dc_test$ size a.out 
   text    data     bss     dec     hex filename
  21712     848     280   22840    5938 a.out

address@hidden:~/dc_test$ ./a.out 
Building complex number data...
Done.
GNURadio hi-res clock tps: 1000000000
GNURadio sizeof(gr_complex): 8
GNURadio sizeof(CPLX): 8

dc_blocker_cc_deque: delay_line size=80
deque:
Done: total_t: 35298153446, sec_t: 35.2982, t/ea: 0.000705963

dc_blocker_cc_queue: delay_line size=80
queue:
Done: total_t: 35280849767, sec_t: 35.2808, t/ea: 0.000705617

dc_blocker_cc_list: delay_line size=16
list:
Done: total_t: 121603777765, sec_t: 121.604, t/ea: 0.00243208



Optimized code (not c++11):

address@hidden:~/dc_test$ c++ -O3 -finline main_opt.cc
address@hidden:~/dc_test$ size a.out 
   text    data     bss     dec     hex filename
  29574     856     280   30710    77f6 a.out

address@hidden:~/dc_test$ ./a.out 
Building complex number data...
Done.
GNURadio hi-res clock tps: 1000000000
GNURadio sizeof(gr_complex): 8
GNURadio sizeof(CPLX): 8

dc_blocker_cc_deque: delay_line size=80
deque:
Done: total_t: 11761720007, sec_t: 11.7617, t/ea: 0.000235234

dc_blocker_cc_queue: delay_line size=80
queue:
Done: total_t: 11809516472, sec_t: 11.8095, t/ea: 0.00023619

dc_blocker_cc_list: delay_line size=16
list:
Done: total_t: 95647805916, sec_t: 95.6478, t/ea: 0.00191296


Optimized code (c++11):

address@hidden:~/dc_test$ c++ -O3 -finline -std=c++11 main_opt.cc
address@hidden:~/dc_test$ size a.out 
   text    data     bss     dec     hex filename
  23080     848     280   24208    5e90 a.out


address@hidden:~/dc_test$ ./a.out 
Building complex number data...
Done.
GNURadio hi-res clock tps: 1000000000
GNURadio sizeof(gr_complex): 8
GNURadio sizeof(CPLX): 8

dc_blocker_cc_deque: delay_line size=80
deque:
Done: total_t: 11803504003, sec_t: 11.8035, t/ea: 0.00023607

dc_blocker_cc_queue: delay_line size=80
queue:
Done: total_t: 11610977298, sec_t: 11.611, t/ea: 0.00023222

dc_blocker_cc_list: delay_line size=16
list:
Done: total_t: 96962902014, sec_t: 96.9629, t/ea: 0.00193926


special (opt+c++11):

address@hidden:~/dc_test$ c++ -Ofast -Wsign-compare -Wall
-Wno-uninitialized -fvisibility=hidden -finline -std=c++11 main_opt.cc
address@hidden:~/dc_test$ size a.out 
   text    data     bss     dec     hex filename
  23112     856     280   24248    5eb8 a.out


address@hidden:~/dc_test$ ./a.out 
Building complex number data...
Done.
GNURadio hi-res clock tps: 1000000000
GNURadio sizeof(gr_complex): 8
GNURadio sizeof(CPLX): 8

dc_blocker_cc_deque: delay_line size=80
deque:
Done: total_t: 11711630308, sec_t: 11.7116, t/ea: 0.000234233

dc_blocker_cc_queue: delay_line size=80
queue:
Done: total_t: 11859205796, sec_t: 11.8592, t/ea: 0.000237184

dc_blocker_cc_list: delay_line size=16
list:
Done: total_t: 97404287524, sec_t: 97.4043, t/ea: 0.00194809

Data error i=0

















