[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[lmi] Results of comparison of {boost,std}::regex, CTRE and PCRE
From: |
Vadim Zeitlin |
Subject: |
[lmi] Results of comparison of {boost,std}::regex, CTRE and PCRE |
Date: |
Thu, 10 Jun 2021 23:36:28 +0200 |
Hello,
I think I finally know enough to suggest the best replacement for
Boost.Regex in test_coding_rules.cpp, so even though I didn't fully finish
everything yet, I'd like to already summarize my results so far to see if
you agree with my choice.
First of all, here are the results of benchmarking the run-time of
test_coding_rules using the following command (in zsh, to be able to use
the "~" glob character in order to discard the effect of changes to this
file itself in the different versions)
$ test_coding_rules *~test_coding_rules.cpp
on one particular (not very fast) machine for the following versions of lmi
(please note that CTRE/PCRE ones are not finished and rely on headers not
included in the repository, so you won't be able to compile them currently,
I only link to them in order to show the changes to the lmi sources that
would be needed to use them):
0. Original f9276c05f (Rework minimum bounds for solves, 2021-05-27).
1. Using std::regex (https://github.com/vadz/lmi/commit/2026e7227).
2. Using CTRE (https://github.com/vadz/lmi/commit/67b7cf33c).
3. Using PCRE (https://github.com/vadz/lmi/commit/f1a35217b).
+------------------+------+------+------+------+
| Compiler\Version | 0 | 1 | 2 | 3 |
|------------------+------+------+------+------+
| gcc 11 | 10.5 | 17.8 | 1.3 | 1.6 |
| clang 12 | 6.1 |238 | 24 | 1.7 |
+------------------+------+------+------+------+
(result for (1) for clang is _not_ a typo).
As previously mentioned, all the results can be improved by a factor of
2-3 by using GNU parallel, here are the timings for
$ parallel -m test_coding_rules ::: *~test_coding_rules.cpp
for completeness:
+------------------+------+------+------+------+
| Compiler\Version | 0 | 1 | 2 | 3 |
|------------------+------+------+------+------+
| gcc 11 | 3.0 | 5.8 | 0.6 | 0.7 |
| clang 12 | 1.8 | 72 | 7.8 | 0.7 |
+------------------+------+------+------+------+
And the last table I'd like to show is the one for compilation time of
test_coding_rules.o:
+------------------+------+------+------+------+
| Compiler\Version | 0 | 1 | 2 | 3 |
|------------------+------+------+------+------+
| gcc 11 | 4.8 | 8.3 | 42.9 | 4.2 |
| clang 12 | 4.3 | 5.6 | 33.0 | 3.3 |
+------------------+------+------+------+------+
It's a bit of a letdown because I had started by thinking that PCRE was
going to be the best choice even before starting to do all this, but
looking at the tables above it's IMO hard to argue that it isn't.
Experimenting with CTRE was interesting and I don't think the time spent on
it was wasted and, at the very least, I can confirm that CTRE does have
excellent performance, at least with gcc. For the simplest possible regex,
matching a literal string, it is only ~2 times slower compared to calling
contains(), while PCRE is ~10 times slower and {boost,std}::regex are ~50
times slower. Of course, for such simple tests using contains() is still
faster and simpler, but, generally speaking, CTRE beats PCRE (although
there are some exceptions).
However CTRE has a number of disadvantages too:
- It's described as experimental and not deemed for production by its
author and while we were not planning to use it for anything critical,
it still risks being a problem because I'm almost sure that its API will
change in incompatible ways if it's going to evolve at all.
- It lacks several features used in the current code, such as
case-insensitive matching (hence the need to write /[wW][iI][nN]32/ in
enforce_taboos()) and doesn't implement backtracking properly, which
means that some regexes become more complicated (although maybe more
efficient), and probably some more that are not used yet, but might be in
the future, i.e. it's not feature-complete neither.
- Its compile times are pretty bad. The version benchmarked here still uses
std::regex for many tests that were not worth optimizing. If we wanted to
switch all regexes to CTRE, the compilation time would probably double as
it's directly proportional to the number of regexes used in the code.
- It still has to be combined with something else (std::regex in the
version tested) for matches that can't be done at compile-time.
- Using it requires a lot of changes to the existing code. Compare the
output of "git show --stat test_coding_rules.cpp" for the various
commits:
- std::regex: 89 insertions(+), 86 deletions(-)
- PCRE: 63 insertions(+), 91 deletions(-)
- CTRE: 209 insertions(+), 210 deletions(-)
And the changes when switching to CTRE are much less trivial and involve
changing both the code and some of the regexes, while switching to
std::regex requires only changing a couple of regexes and switching to
PCRE requires only a small change to a single regex.
So the only reasonable choices, IMO, are either sticking with std::regex
and living with its bad performance, maybe by partially compensating it
with GNU parallel, or switching to PCRE.
The latter has a number of advantages:
- It is significantly faster. Waiting 18s for the full test_coding_rules
run on this machine is too annoying, and while using GNU parallel helps,
6s is still quite slow, while with the PCRE version I don't even need to
use parallel as 1.6s is fast enough. I also didn't spend any time at all
on optimizing the PCRE version, as its performance was already good
enough for me, but I'm pretty sure that I could improve it more if
needed.
- Its performance is consistent between compilers and platforms. Moreover,
it's good (basically the same) even when not using the optimizations, as
all the real work is done inside the PCRE library, while the performance
of debug/non-optimized builds of all the other versions is much worse.
- It is robust and personally I trust PCRE, used by many critical projects,
more than I trust std::regex that has a (not completely undeserved, even
if it is clearly much improved in the latest libstdc++ versions)
reputation of being low quality, and so probably is not used that much.
I didn't run into any problems with it, while I found several in CTRE
and a couple with libstdc++ std::regex implementation.
Of course, it does have 2 disadvantages too:
1. It requires PCRE library.
2. It needs a C++ wrapper as using PCRE C API directly is too painful.
For (2), I've written a ~500 line (including a lot of blank lines and
comments) pcre_regex.hpp header, which I don't show yet, because I need to
clean it up and add more extensive tests for it, which provides pcre::regex
class and pcre::search() and pcre::replace() functions that are very
similar to boost::regex_search() and regex_replace(), as well as
pcre::search_all() which allows for much simpler iteration over regex
matches than boost or std::sregex_iterator using a simple range for loop:
for(auto const& z : pcre::search_all(f.data(), R"(...)"))
{
... use z[0] for the full match or z[1] for the first captured
expression here ...
}
I don't know if you'd like to make this header part of lmi or handle it as
an external dependency, but in any case I'm ready to maintain it. And while
having this separate header is a disadvantage, compared to just using
<regex> from the standard library, the API above is clearly much nicer than
dealing with sregex_iterator directly.
For (1), we can rely on the system package under Linux, but we'd need to
compile PCRE ourselves under MSW. This is not difficult to do, but, as I
mentioned before, I'd like to also switch to using PCRE for wxRegEx, and if
I do this, we could get the compiled PCRE library as a byproduct of
building wx without doing anything extra in lmi itself.
Solving both of these problems shouldn't take that long, but I'd still
like to ask for your agreement before continuing to work on it. It's not
urgent at all, as I have plenty of other things to do in the meanwhile, but
I don't want to spend more time on polishing my PCRE C++ wrapper if you
decide that we don't need it and std::regex is, finally, good enough, for
example.
To summarize, the current state is:
- I have a branch with ready to be submitted changes with some minor
improvements to the current version, that could (and should) be merged
in any case.
- I have a commit replacing boost::regex with std::regex that is also
(almost) ready to be submitted -- I just need to test it a bit more.
- I also have not quite finished, but already working, changes using PCRE
that I need to finalize before submitting them.
Please let me know how would you like to proceed and thanks in advance,
VZ
pgp1Ek0Ub14yh.pgp
Description: PGP signature
- [lmi] Results of comparison of {boost,std}::regex, CTRE and PCRE,
Vadim Zeitlin <=