Re: Performance improvement for large keysets

Interesting... I would try the 'hash-map' implementation from gnulib.

unordered_set is effectively a hash-map implementation, and a pretty good one. Moreover, I stored the elements by pointer without copying them, yet gperf was still extremely slow, so slow that it couldn't even finish. Perhaps I had a bug, but the code did look OK.

With thousands of keys, it has to build a hash map with thousands of entries, which might be why it was so slow.
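
To make "stored elements by pointers" concrete, here is a minimal sketch of an unordered_set that hashes and compares through the pointer, so no key is ever copied. It is not the code that was actually tried: Key, KeyPtrHash, KeyPtrEqual and count_duplicates are hypothetical names, and gperf's real keyword classes are not used.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a keyword; not gperf's real keyword class.
struct Key {
    std::string text;
};

// Hash through the pointer, so the set stores only addresses.
struct KeyPtrHash {
    std::size_t operator()(const Key* k) const {
        return std::hash<std::string>{}(k->text);
    }
};

// Two pointers are "equal" when the keys they point to have the same text.
struct KeyPtrEqual {
    bool operator()(const Key* a, const Key* b) const {
        return a->text == b->text;
    }
};

using KeyPtrSet = std::unordered_set<const Key*, KeyPtrHash, KeyPtrEqual>;

// Returns how many keys repeat the text of an earlier key.
std::size_t count_duplicates(const std::vector<Key>& keys) {
    KeyPtrSet seen;
    seen.reserve(keys.size());  // size the table once instead of rehashing
    std::size_t duplicates = 0;
    for (const Key& k : keys) {
        if (!seen.insert(&k).second) ++duplicates;
    }
    return duplicates;
}
```

The pointers stay valid only as long as the vector that owns the keys is not resized or destroyed, so the set should not outlive it.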

By the way, which profiler would you recommend for CPU-profiling of a program
like gperf? I have a couple of old notes regarding profiling (below), but
can't really tell which one to start with.

I strongly prefer Visual Studio; it comes with both sampling and instrumentation profilers. It readily shows which functions take the most time and which lines in those functions consume the most CPU. For example, this is the heaviest function and line in gperf:

https://i.ibb.co/X8P66dx/image.png

As you can see, "_collision_detector->set_bit(hashcode)" takes 23% of the CPU time of the entire program.
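
For readers who have not looked at that code, the following is a rough sketch of what a set_bit-style collision check can look like. It is an illustration under assumptions, not gperf's actual source: the class name CollisionDetector, its members and the generation-counter scheme are all hypothetical here. One plausible reason such a small function shows up so prominently is that hash codes index the array in an essentially random order, which may cause many cache misses once the array is large.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical collision detector; names are illustrative, not gperf's API.
class CollisionDetector {
public:
    explicit CollisionDetector(std::size_t size)
        : stamps_(size, 0), generation_(1) {}

    // Marks `index` as used. Returns true if it was already marked in the
    // current generation, i.e. two keys produced the same hash code.
    bool set_bit(std::size_t index) {
        assert(index < stamps_.size());
        if (stamps_[index] == generation_) return true;
        stamps_[index] = generation_;
        return false;
    }

    // Forgets all marks in O(1) by bumping the generation counter; the
    // array is rewritten only when the counter wraps around to zero.
    void clear() {
        if (++generation_ == 0) {
            stamps_.assign(stamps_.size(), 0);
            generation_ = 1;
        }
    }

private:
    std::vector<unsigned> stamps_;  // generation in which each slot was last set
    unsigned generation_;           // current generation, never 0
};
```

A caller would invoke clear() before each candidate hash function and stop at the first set_bit() call that returns true.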

If you have a Mac, Xcode also has a decent, usable profiler.

Bruno

--- --- --- --- --- --- --- --- --- --- --- --- ---

Profilers with call-graph functionality
=======================================

See https://en.wikipedia.org/wiki/Call_graph#Free_software_call-graph_generators

Comparisons:
http://gernotklingler.com/blog/gprof-valgrind-gperftools-evaluation-tools-application-level-cpu-profiling-linux/

Profiling with perf
-------------------
Doc:
https://perf.wiki.kernel.org/index.php/Tutorial#Period_and_rate
http://www.brendangregg.com/perf.html

Works on: Linux with packages 'perf' and 'linux-tools-<version>' installed.

To get just the important methods:

# perf record -c 1000 src/wc -Lm < mbc.txt
# perf report

To get the call graph as well:

# perf record -c 1000 -a --call-graph fp src/wc -Lm < mbc.txt
# perf record -c 2000 -a --call-graph dwarf src/wc -Lm < mbc.txt
# perf report --call-graph --stdio
???

Profiling with valgrind
-----------------------
Doc:
http://valgrind.org/docs/manual/cl-manual.html

$ valgrind --tool=callgrind src/wc -m < mbc.txt
$ callgrind_annotate callgrind.out.10379
$ callgrind_annotate --tree=calling callgrind.out.10379
$ kcachegrind callgrind.out.10379 ; switch to callee map

Works on: Linux (sampling + call-tree), macOS (only sampling, not call-tree)

Profiling with gprof
--------------------
Compile and link with "-pg".
Visualization: https://stackoverflow.com/questions/2439060/is-it-possible-to-get-a-graphical-representation-of-gprof-results

Works on Linux and other systems with libc_g. Not clang!

Profiling with gperftools
-------------------------
Doc: https://github.com/gperftools/gperftools

1) Link your executable with -lprofiler
2) Run your executable with the CPUPROFILE environment var set:
$ CPUPROFILE=/tmp/prof.out <path/to/binary> [binary args]
3) Run pprof to analyze the CPU usage
$ pprof --text <path/to/binary> /tmp/prof.out # -pg-like text output
$ pprof --gv <path/to/binary> /tmp/prof.out # really cool graphical output
On macOS: Cannot map addresses to symbols. => Unusable.

From:	Pavel P
Subject:	Re: Performance improvement for large keysets
Date:	Thu, 30 Jan 2020 12:49:22 +0600