discuss-gnuradio
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)


From: Tom Rondeau
Subject: Re: [Discuss-gnuradio] Data (was: Re: Run graph/ scheduler overhead)
Date: Fri, 24 Jul 2015 10:51:50 -0400

Dennis,

This is great. I'd like to see how to get some of this wrapped back as a patch for GNU Radio. However, a few things. We can't (yet) support C++11 standards. In 3.7, we allow older versions of compilers and dependencies that don't have to support that, so neither can we enforce it. I'm planning on doing something about this for our 3.8 release. One thing we can think about doing is having cmake discover if the compiler can us C++11 and add that as a flag, though none of our code can explicitly depend on this.

Second, we try not to use C++ templates, mostly because VOLK is very type specific. I'm really surprised here that it made such a difference. However, Nathan West and I are looking at this now thinking that using a container of this kind might not be the write thing to do, anyways. Plus, Nathan is thinking about how to use VOLK here :)

If you can put together a patch that gives us a bit of a boost here, that'd be great. But as you say, it doesn't look like this algorithm as it is will ever be fantastically fast. It was definitely meant more for hardware than this case.

Tom



On Thu, Jul 23, 2015 at 10:20 PM, Dennis Glatting <address@hidden> wrote:
Was able to do better by rewriting the moving average (below). Didn't
touch the blocker class itself other than to templatize it.


address@hidden:~/dc_test$ c++ -O3 -Wsign-compare -Wall
-Wno-uninitialized -fvisibility=hidden -finline -std=c++11 main_test.cc
address@hidden:~/dc_test$ size a.out
   text    data     bss     dec     hex filename
  28835     864     408   30107    759b a.out

                original        opt+c11

template:       ----------      0.00019348
deque:          0.000701038     0.0002474
queue:          0.00069784      0.000248843
list:           0.00194583      0.00212156


For my own amusement I ran these tests on a FreeBSD machine that is
slower than my Ubuntu machine. This machine has a different processor
(AMD 8350) at 4GHz.

Compile lines:

address@hidden g++49 -O3 -Wsign-compare -Wall -Wno-uninitialized
-fvisibility=hidden -finline -std=c++11 main_test.cc

address@hidden g++5 -O3 -Wsign-compare -Wall -Wno-uninitialized
-fvisibility=hidden -finline -std=c++11 main_test.cc

address@hidden clang++ -O3 -Wsign-compare -Wall -Wno-uninitialized
-fvisibility=hidden -finline -std=c++11 -I /usr/local/include
main_test.cc

address@hidden clang++-devel -O3 -Wsign-compare -Wall -Wno-uninitialized
-fvisibility=hidden -finline -std=c++11 -I /usr/local/include
main_test.cc


          g++49         g++5        clang++(3.4.1) clang++(3.7.0)

template: 0.000226272  0.00022474   0.000287082    0.00023594
deque:    0.000297232  0.000357486  0.000693899    0.000632054
queue:    0.000297098  0.000357484  0.000690884    0.000627877
list:     0.00457583   0.00418757   0.00411387     0.0040383


(I also compiled against gcc6 snapshot 20150719 but it's similar in
performance to gcc5.)

It's interesting the performance for deque/queue is much worse under
clang and the templated moving average is fairly steady across all
compilers.

I think what this says is the following. First, we can improve the
performance of dc_block but against the current algorithm it's never
going to be awesome. Second, the performance of the template library
containers is fairly good and consistent and the choice of std::deque
was a fairly good one. Third, different compilers and different template
libraries at different optimization levels is going to impact
performance (duh), perhaps negatively.

One of the things GNURadio is missing is some indication of
block/library performance. For example, maybe -O2 is better than -O3 and
maybe -std=c++11 worse than the default on a given platform. However I
have little idea how we'd incorporate those metrics (seems kind of nasty
when you think about the mechanics).




template<typename T>
class moving_average_t {

private:

  // Relationships of the following variables:
  //
  //  d_len   = d_delay_line.size().
  //  d_index = (d_index % d_len)
  //

  // The length of the moving average queue and the current index into
  // it.
  //

  size_t d_len, d_index;

  // The moving average delay line.
  //
  // The delay line is an array of type T, length d_len, and where
  // d_index is the oldest entry.
  //

  std::vector<T> d_delay_line;

  T d_out, d_out_d2;

  // Rather than returning a constructed sample from filter(), which
  // of course can have construction/destruction overhead, store the
  // sample here and return a const reference.
  //

  T d_temp;

public:

           moving_average_t( void ) = delete;
           moving_average_t( size_t len );
  virtual ~moving_average_t( void );

  moving_average_t( const moving_average_t& t ) = delete;
  moving_average_t& operator=( const moving_average_t& t ) = delete;

  // Set the length of the delay line, reinitializing content.
  //

  void set_len( size_t len );

  // Add a sample to the delay line and return the oldest and averaged
  // sample (removed from the delay line).
  //

  const T& filter( const T& t );

  // The oldest, unadjusted (i.e., averaged) signal in the delay line.
  //

  const T& delayed_sig( void ) const;

};

template<typename T>
moving_average_t<T>::moving_average_t( size_t len )
  : d_len( 0 ), d_index( 0 ),
    d_out( 0 ), d_out_d2( 0 ),
    d_temp( 0 ) {

  set_len( len );
}

template<typename T>
moving_average_t<T>::~moving_average_t( void ) {}

template<typename T>
void
moving_average_t<T>::set_len( size_t len ) {

  d_len   = len;
  d_index = 0;
  d_temp  = 0;

  d_out = d_out_d2 = 0;

  d_delay_line.clear();
  for( size_t i = 0; i < d_len - 1; ++i )
    d_delay_line.push_back( T(0));

}

template<typename T>
inline const T&
moving_average_t<T>::delayed_sig( void ) const {

  return d_out;
}

template<typename T>
inline const T&
moving_average_t<T>::filter( const T& t ) {

  T d_out_d1 ( d_out );

  // Cache the oldest signal used in the average.
  //

  d_out = d_delay_line[d_index];

  // Stuff the passed sample into the delay line and bump the index.
  //

  d_delay_line[d_index] = t;
  d_index = (( d_index + 1 ) % ( d_len - 1 ));

  // Do the math.
  //

  d_out_d2 = (t - d_out_d1 + d_out_d2 );
  d_temp   = d_out_d2 / (float)(d_len);

  return d_temp;
}















On Thu, 2015-07-23 at 12:48 -0700, Dennis Glatting wrote:
> I copied out the dc_block_cc block from 3.7.8 and ran some performance
> tests against it, which I've summarized in a table below.
>
> I had to make some modifications to the original code, such as:
>
>  * I removed the make wrapper.
>  * I tested against different containers.
>  * Different containers have different access/management methods
>    which meant some changes to code body (I tried to be consistent).
>  * On input I passed a std::vector to work() rather than complex*.
>    Although this changes the flavor of work() I figure it's relative.
>  * I only used long_form and deleted the short_form code. I used
>    the key part of the original code.
>
> The three containers are the original std::deque then std::queue and
> std::list. The results are interesting. I probably should have looked at
> other containers such as std::vector but that might require recoding.
>
> I also compiled with and without -std=c++11 because when i looked at
> container source I saw a bunch of #ifdefs for >= c++0x.
>
> These are some of the problems with the original dc_block:
>
> * Passing by value rather than by reference.
> * No inlines.
> * const needed where const should be.
>
> So in a second copy of dc_block I did those things. I found a case
> (filter()) where it returns by value and I left that one alone.
>
> The table below summarizes the results. "Old" means my reasonable(?)
> facsimile of the original dc_block. "+c11" means I added -std=c++11 to
> the compile line. "Opt" is my optimized copy of the code where I added
> references, inlines, etc. "Special" is "opt" but with different compile
> options. All of the output is included at the end of this message.
>
> The numbers you'll see for old/c++1/etc is the amount of time it took to
> process /one/ sample. In "old+deque" for example (the first item), it
> took 701us to process a sample. One of the surprising numbers is that
> std::list sucks. Also, when looking at the assembly language for
> filter() (copy below) I see reallocs(). That's not surprising and
> probably badness. (BTW, "CPLX" is: "typedef std::complex<float> CPLX;".)
>
> inline const CPLX
> moving_averager_c_list::filter( const CPLX& x ) {
>
>   d_out_d1 = d_out;
>   d_delay_line.push_back(x);
>   d_out = d_delay_line.front();
>   d_delay_line.pop_front();
>
>   CPLX y = x - d_out_d1 + d_out_d2;
>   d_out_d2 = y;
>
>   return (y / (float)(d_length));
> }
>
> The "size" numbers in the table are the text segment size returned using
> "size a.out". The "block size" is simply a sizeof(d_delay_line), which
> is really sizeof(std:deque<CPLX>) for example.
>
> One other note. I compiled "special" with -Ofast and it failed content
> integrity check. Probably a bad option to use. :)
>
>
> My os:       Ubuntu 15.04.
> My compiler: gcc version 4.9.2 (Ubuntu 4.9.2-10ubuntu13)
> My system:   AMD FX(tm)-9590 Eight-Core Processor @ 4.7GHz
>
> I'm happy to send copies of the test code (two files) for review if
> someone wants to put them on the web. The three main code blocks are
> pretty simple:
>
>   { dc_blocker_cc_deque dc( NUM_ELEM );
>
>     std::cout << "deque:" << std::endl;
>
>     t_start = gr::high_res_timer_now();
>     for( int i = 0; i < NUM_LOOPS; ++i )
>       for( int j = 0; j < NUM_COMPLEX; ++j )
>       dc.work( data, dc_deque );
>     timing( t_start, gr::high_res_timer_now(), NUM_LOOPS*NUM_COMPLEX );
>
>   }
>
>
> #define NUM_LOOPS   5
> #define NUM_COMPLEX 10000
> #define NUM_ELEM    32
>
>
> Here's the summary table:
>
>
>         old          old+c11      opt         opt+c11      special
>
> deque:  0.000701038  0.000705963  0.000235234  0.00023607  0.000234233
> queue:  0.00069784   0.000705617  0.00023619   0.00023222  0.000237184
> list:   0.00194583   0.00243208   0.00191296   0.00193926  0.00194809
>
> text
> size:   26502        28902        21712         29574      23112
>
> text
> orig:   33821        26502
>
>
>         block size:
>
> deque:  80
> queue:  80
> list:   16
>
>
>
>
>
>
> Original facsimile (not c++11):
>
> address@hidden:~/dc_test$ c++ -O3  main.cc
> address@hidden:~/dc_test$ size a.out
>    text          data     bss     dec     hex filename
>   28902           856     280   30038    7556 a.out
>
> address@hidden:~/dc_test$ ./a.out
> Building complex number data...
> Done.
> GNURadio hi-res clock tps: 1000000000
> GNURadio sizeof(gr_complex): 8
> GNURadio sizeof(CPLX): 8
>
> dc_blocker_cc_deque: delay_line size=80
> deque:
> Done: total_t: 35051914970, sec_t: 35.0519, t/ea: 0.000701038
>
> dc_blocker_cc_queue: delay_line size=80
> queue:
> Done: total_t: 34892023951, sec_t: 34.892, t/ea: 0.00069784
>
> dc_blocker_cc_list: delay_line size=16
> list:
> Done: total_t: 97291349192, sec_t: 97.2913, t/ea: 0.00194583
>
>
>
>
> Original facsimile (c++11):
>
> address@hidden:~/dc_test$ c++ -O3 -std=c++11 main.cc
> address@hidden:~/dc_test$ size a.out
>    text          data     bss     dec     hex filename
>   21712           848     280   22840    5938 a.out
>
> address@hidden:~/dc_test$ ./a.out
> Building complex number data...
> Done.
> GNURadio hi-res clock tps: 1000000000
> GNURadio sizeof(gr_complex): 8
> GNURadio sizeof(CPLX): 8
>
> dc_blocker_cc_deque: delay_line size=80
> deque:
> Done: total_t: 35298153446, sec_t: 35.2982, t/ea: 0.000705963
>
> dc_blocker_cc_queue: delay_line size=80
> queue:
> Done: total_t: 35280849767, sec_t: 35.2808, t/ea: 0.000705617
>
> dc_blocker_cc_list: delay_line size=16
> list:
> Done: total_t: 121603777765, sec_t: 121.604, t/ea: 0.00243208
>
>
>
> Optimized code (not c++11):
>
> address@hidden:~/dc_test$ c++ -O3 -finline main_opt.cc
> address@hidden:~/dc_test$ size a.out
>    text          data     bss     dec     hex filename
>   29574           856     280   30710    77f6 a.out
>
> address@hidden:~/dc_test$ ./a.out
> Building complex number data...
> Done.
> GNURadio hi-res clock tps: 1000000000
> GNURadio sizeof(gr_complex): 8
> GNURadio sizeof(CPLX): 8
>
> dc_blocker_cc_deque: delay_line size=80
> deque:
> Done: total_t: 11761720007, sec_t: 11.7617, t/ea: 0.000235234
>
> dc_blocker_cc_queue: delay_line size=80
> queue:
> Done: total_t: 11809516472, sec_t: 11.8095, t/ea: 0.00023619
>
> dc_blocker_cc_list: delay_line size=16
> list:
> Done: total_t: 95647805916, sec_t: 95.6478, t/ea: 0.00191296
>
>
> Optimized code (c++11):
>
> address@hidden:~/dc_test$ c++ -O3 -finline -std=c++11 main_opt.cc
> address@hidden:~/dc_test$ size a.out
>    text          data     bss     dec     hex filename
>   23080           848     280   24208    5e90 a.out
>
>
> address@hidden:~/dc_test$ ./a.out
> Building complex number data...
> Done.
> GNURadio hi-res clock tps: 1000000000
> GNURadio sizeof(gr_complex): 8
> GNURadio sizeof(CPLX): 8
>
> dc_blocker_cc_deque: delay_line size=80
> deque:
> Done: total_t: 11803504003, sec_t: 11.8035, t/ea: 0.00023607
>
> dc_blocker_cc_queue: delay_line size=80
> queue:
> Done: total_t: 11610977298, sec_t: 11.611, t/ea: 0.00023222
>
> dc_blocker_cc_list: delay_line size=16
> list:
> Done: total_t: 96962902014, sec_t: 96.9629, t/ea: 0.00193926
>
>
> special (opt+c++11):
>
> address@hidden:~/dc_test$ c++ -Ofast -Wsign-compare -Wall
> -Wno-uninitialized -fvisibility=hidden -finline -std=c++11 main_opt.cc
> address@hidden:~/dc_test$ size a.out
>    text          data     bss     dec     hex filename
>   23112           856     280   24248    5eb8 a.out
>
>
> address@hidden:~/dc_test$ ./a.out
> Building complex number data...
> Done.
> GNURadio hi-res clock tps: 1000000000
> GNURadio sizeof(gr_complex): 8
> GNURadio sizeof(CPLX): 8
>
> dc_blocker_cc_deque: delay_line size=80
> deque:
> Done: total_t: 11711630308, sec_t: 11.7116, t/ea: 0.000234233
>
> dc_blocker_cc_queue: delay_line size=80
> queue:
> Done: total_t: 11859205796, sec_t: 11.8592, t/ea: 0.000237184
>
> dc_blocker_cc_list: delay_line size=16
> list:
> Done: total_t: 97404287524, sec_t: 97.4043, t/ea: 0.00194809
>
> Data error i=0

reply via email to

[Prev in Thread] Current Thread [Next in Thread]