Subject: [Libunwind-devel] Updated fast trace patch with initial performance results
From: Lassi Tuura
Date: Wed, 23 Mar 2011 00:50:35 +0100
Hi,
I've finished a round of testing on a patch set which combines my original
fast trace (1), Jason Evans' improvements (2), plus integration directly into
backtrace() and reuse of more of libunwind's existing internals (attached).
I ran the following tests on a RHEL5-based x86_64 system. The system libraries
(libc, libm, etc.) are compiled with Red Hat's GCC 4.1.2; everything else was
built with GCC 4.3.4. I used the igprof memory and performance profilers as the
test tool. I ran every test four times, twice normally and twice with taskset
restricted to one core, and picked the best result; the variation was small.
The system is a 4-core Intel Xeon L5520 at 2.27 GHz with 24 GB RAM and
hyper-threading on. There were two profiling crashes caused by bad unwind
info from GCC 4.3.4-generated code; when that happened I reran the test,
having verified the crash was indeed caused by incorrect unwind info.
Without any profiling or libunwind use the program runs at 99% CPU use for 405
seconds, of which 2.5 s is system time and the rest user time. The program is
single-threaded and loads 650 shared libraries, of which 22 are system
libraries (/lib64, /usr/lib64) and 628 are our own.
Here are the results, with this classification:
P = performance profile
M = memory allocation profile
NI = non-instrumented bare application
NT = normal "slow" trace using unw_step() loop
FT = fast trace as I originally submitted it
JE = Jason Evans' improved version
BT = updated fast trace using unw_backtrace()
Timing with average/rms of stack depths seen and of stack walk time, including
per-stack-level time. The times are in clock cycles from the rdtsc instruction.
VSIZE and RSS are in megabytes. Best viewed in a fixed-width font.
         Time     %   VSIZE   RSS   Walks       Depth     Walk Time   Per Stack Level
-/NI     405s  100%    1548  1251       -       - / -         - / -             - / -
P/NT     412s  102%    1614  1299   68460  29.1 / 7.7   227k / 175k      8729 / 8206
P/FT     408s  101%    1610  1294   67743  29.1 / 7.7     61k / 97k      2980 / 7887
P/JE     408s  101%    1584  1270   67749  29.2 / 7.6     53k / 90k      2580 / 7137
P/BT     410s  101%    1593  1273   68154  29.1 / 7.6    65k / 105k      3111 / 7838
M/NT    5384s 1330%    3078  2774    387M  28.0 / 6.3     26k / 30k      949.4 / [x]
M/FT    1314s  325%    2820  2515    387M  28.0 / 6.3   2767 / 1948    100.5 / 121.0
M/JE    1348s  333%    2800  2494    387M  28.0 / 6.3   2933 / 2869    106.0 / 144.8
M/BT    1251s  309%    2815  2501    387M  28.0 / 6.3   2313 / 2650     84.5 / 114.3
[x] Ran out of 64-bit integer range when calculating the RMS of the per-level timing.
All runs except M/NT had negligible system time, as mentioned above.
M/NT had 60% user time and 40% system time; in fact it varied between
60% and 88% user time with the rest system time, but the total time
stayed roughly the same.
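For reference, walk times of the kind shown in the table can be gathered
with a wrapper of this shape: read the time-stamp counter around one stack
walk. This is an illustrative sketch, not the igprof instrumentation
itself; backtrace() from <execinfo.h> stands in for unw_backtrace(), and
the inline rdtsc is x86-specific.

```c
/* Measure one stack walk in rdtsc clock cycles. Sketch only: no
 * serialization around rdtsc, so individual samples are noisy; the
 * numbers in the table aggregate many walks. */
#include <execinfo.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t) hi << 32) | lo;
}

/* Walk the stack once; returns the cycle cost, stores the frame count
 * in *depth. backtrace() stands in for unw_backtrace() here. */
static uint64_t timed_walk(int *depth)
{
    void *frames[128];
    uint64_t t0 = rdtsc();
    *depth = backtrace(frames, 128);
    return rdtsc() - t0;
}
```

Dividing the cycle count by the depth gives the per-stack-level figure
reported in the last column.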
I tried to make the new patches as we discussed. Please review and note:
- This is still preliminary in that I haven't run libunwind tests yet.
I also haven't yet tried any system other than RHEL5/x86_64.
- I had to rename 'backtrace' to 'unw_backtrace' and add a prototype;
otherwise it didn't get used: the dynamic linker resolved the calls to
the system backtrace() instead. I am somewhat uncomfortable about
attempting to override the system's backtrace() in libunwind anyway :)
- unw_backtrace() now returns one more frame than it did before.
- The frame caches are thread-local and internal to unw_tdep_trace().
- The hashes and the caches are allocated using mempool, which handles
fixed-size objects only, so this is a hybrid of Jason's hash growth
code and fixed-size objects. There's no alloc/free callback now.
- My hashes grow faster and remain emptier than Jason's, and I reset
the contents (rather than rehash) on resize. This uses slightly more
memory, but seems to perform slightly better under heavy use.
- I used a simpler hash function than Jason's code had.
- I didn't find any meaningful way to reuse the existing register-set
caches in the dwarf code. They are LRU caches with a generation
change, neither of which applies to the frame cache.
- I will still try to squeeze better performance, but would welcome
feedback on general design - let me know if you find it reasonable.
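To make the reset-on-resize behaviour concrete, here is a minimal sketch of
such a cache: open addressing, growth well before the table fills (keeping
it emptier), and a plain clear instead of a rehash when it grows. All names
and sizes are illustrative, not the actual patch code; since this is a
cache, dropping the contents on resize only costs refills, never
correctness.

```c
/* Sparse open-addressing cache of return addresses that clears
 * instead of rehashing on growth. Key 0 is reserved as "empty". */
#include <stdint.h>
#include <stdlib.h>

typedef struct
{
    uint64_t *keys;      /* 1 << log_size slots; 0 marks an empty slot */
    size_t log_size;
    size_t used;
} addr_cache;

/* (Re)allocate an empty table of 1 << log_size slots. */
static void cache_reset(addr_cache *c, size_t log_size)
{
    free(c->keys);
    c->log_size = log_size;
    c->used = 0;
    c->keys = calloc((size_t) 1 << log_size, sizeof *c->keys);
}

/* A simple multiplicative hash; simpler than full bit mixing. */
static size_t cache_slot(const addr_cache *c, uint64_t key)
{
    return (size_t) ((key * 0x9e3779b97f4a7c15ULL) >> (64 - c->log_size));
}

/* Insert a key; returns 1 if it was already present, else 0.
 * Grows (and empties) the table before it becomes half full, so old
 * entries are simply forgotten and refilled as frames recur. */
static int cache_insert(addr_cache *c, uint64_t key)
{
    size_t mask, i;

    if (2 * (c->used + 1) > ((size_t) 1 << c->log_size))
        cache_reset(c, c->log_size + 1);

    mask = ((size_t) 1 << c->log_size) - 1;
    for (i = cache_slot(c, key); c->keys[i] != 0; i = (i + 1) & mask)
        if (c->keys[i] == key)
            return 1;
    c->keys[i] = key;
    c->used++;
    return 0;
}
```

Keeping the table at most half full keeps probe chains short, which is
where the "emptier but slightly larger" memory/speed trade-off comes from.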
I think my two patches should really be squashed together, but I kept
them separate so you can compare the code. The underlying fast trace
patch had a few changes:
- Added unw_tdep_trace() / unw_tdep_frame_t for other architectures.
- Adjusted unw_tdep_frame_t field widths based on value profiling:
cfa_reg_offset 20 -> 30 bits, r{s,b}p_cfa_offset 20 -> 15 bits wide.
- Improved the tdep_stash_frame() debug message and fixed an off-by-one
bug in the RSP/RBP offset range validation.
- Adapted the patch for Makefile.am changes.
- Fixed white space warnings.
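To illustrate the field widths and the range check involved, here is a
hypothetical packed frame entry in the spirit of unw_tdep_frame_t; the
struct name and exact layout are a sketch, not the patch's actual
definition. The helper shows the asymmetric signed range where an
off-by-one like the one fixed above can hide:

```c
/* Illustrative packed frame-cache entry using the bit widths from the
 * patch notes: 30 bits for the CFA register offset, 15 bits each for
 * the RBP/RSP offsets from the CFA. Sketch only. */
#include <stdint.h>

typedef struct
{
    uint64_t virtual_address;   /* instruction address this entry caches */
    int frame_type     : 2;     /* e.g. standard / signal / other */
    int cfa_reg_rsp    : 1;     /* CFA computed from RSP (vs RBP) */
    int cfa_reg_offset : 30;    /* was 20 bits, widened to 30 */
    int rbp_cfa_offset : 15;    /* was 20 bits, narrowed to 15 */
    int rsp_cfa_offset : 15;    /* was 20 bits, narrowed to 15 */
} frame_entry;

/* A value fits an n-bit two's-complement field iff it lies in
 * [-2^(n-1), 2^(n-1) - 1]; note the asymmetric upper bound, which is
 * exactly the kind of detail an off-by-one check can get wrong. */
static int fits_signed(int64_t value, int bits)
{
    int64_t lim = (int64_t) 1 << (bits - 1);
    return value >= -lim && value <= lim - 1;
}
```

Validation with fits_signed(off, 15) before stashing lets the code fall
back to the slow path for the rare frames whose offsets don't fit.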
My apologies especially to Jason Evans for not responding to feedback
on these patches sooner - but thanks for your interest and input :-)
Regards,
Lassi
(1) http://thread.gmane.org/gmane.comp.lib.unwind.devel/680
(2) http://thread.gmane.org/gmane.comp.lib.unwind.devel/821
Attachment: fast-trace.tar.bz2 (BZip2 compressed data)