This is interesting. The parallel speedup on your
machine using TBB is in the same ballpark as on my
machine using OpenMP, and they're both delivering less
than a 2:1 speedup.
I ran some informal experiments two nights ago to try to
characterize the behavior. On my machine, with OpenMP
#pragmas on the scalar loops, the ratio of multi-threaded
to single-threaded runtimes held stubbornly at about 0.7
(roughly a 1.4:1 speedup)
regardless of the size of the problem. I tried integer and
float data, addition and power, with ravels up to 100
million elements. (My smallest test set was a million
elements; I still need to try smaller sets to see whether
I can find a knee where the thread setup overhead
dominates and results in a runtime ratio greater than 1.)
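For concreteness, here's a minimal sketch of the kind of loop
I'm timing, with made-up names (the real code operates on GNU
APL's internal cell objects, not raw arrays of doubles). The
if() clause is how I'd expect to express the knee once I find
it, so that small ravels skip the thread setup entirely:

    // Hypothetical cutoff below which thread setup overhead is
    // assumed to dominate; finding the real value is the point
    // of trying the smaller test sets.
    static const long PARALLEL_CUTOFF = 1000000;

    // Element-wise addition over the ravels of two arrays.
    void add_ravel(const double * a, const double * b,
                   double * z, long len)
    {
        // Fork the thread team only for large ravels; below the
        // cutoff the loop runs single-threaded.
        #pragma omp parallel for if (len >= PARALLEL_CUTOFF)
        for (long j = 0; j < len; ++j)
            z[j] = a[j] + b[j];
    }

Built with -fopenmp; without that flag the pragma is ignored and
the loop runs serially, which makes the single-threaded baseline
easy to produce.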
I'm not sure what this means yet. I'd hoped to see
some further improvement as the ravel size increased,
despite the internal inefficiencies. TBH, I didn't find
and annotate the copy loop(s); that might have a lot to do
with my results. (I did find and annotate another loop in
check_value(), though. Maybe parallelizing that will
improve your results.) I'm hoping that the poor showing so
far isn't a result of memory bandwidth limitations.
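For what it's worth, the annotation I have in mind for the copy
loop is the same pragma on the element copy, roughly like this
(again with stand-in names; the real loop copies APL cells, not
doubles). A straight copy is almost pure memory traffic, so if
bandwidth really is the limiter, this particular pragma may not
buy much:

    // Hypothetical stand-in for the result-copy loop; the real
    // code clones APL cells rather than copying raw doubles.
    void copy_ravel(const double * src, double * dst, long len)
    {
        // Memory-bound: each element is touched once, so extra
        // threads mostly compete for the same bandwidth.
        #pragma omp parallel for
        for (long j = 0; j < len; ++j)
            dst[j] = src[j];
    }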
I hope to spend some more time on this over the weekend.