I think one problem with your current test is that it does way too much (random) doubling. You get huge total scores coming out, which adds loads of noise to the results and makes it hard to be confident about what they mean.
A simpler test might be to compare gnubg's best strategy against a "dumber" doubling strategy that, say, never offers the cube and always takes. In that world the cube never grows past 2, which eliminates the noise from randomly humongous cube values.
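For concreteness, the dumb strategy is just two trivial decision rules. Here's a minimal sketch in Python; the function names and the position argument are hypothetical illustrations, not gnubg's actual API:

```python
def should_double(position):
    # The dumb strategy never offers the cube.
    return False

def should_take(position, cube_value):
    # And it accepts every double it's offered.
    return True
```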
If you do that, then you might wonder how many games those two strategies need to play against each other to see a difference. First, you'd need to estimate the size of the signal we're trying to find. If we assume gnubg is "perfect", then any time the dumber strategy takes when it should pass, it loses expected value in the amount of the equity error. Say that happens once every three games, and the average error size is 0.1; then we'd expect the dumber strategy to lose, on average, about 0.03 points per game.
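As arithmetic, that back-of-the-envelope signal estimate (using the once-every-three-games frequency and 0.1 error size assumed above) is just:

```python
p_bad_take = 1 / 3       # assumed: one take-instead-of-pass error every three games
avg_equity_error = 0.1   # assumed: typical equity cost of each bad take

signal = p_bad_take * avg_equity_error
print(signal)            # ~0.033 points per game lost by the dumb strategy
```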
How many games do you need to simulate such that the statistical measurement error on the average score is much less than 0.03? The standard deviation of the score in a regular backgammon money game is something like 1.3, IIRC; so the statistical measurement error on the average is around 1.3 / sqrt(N), where N is the number of games you play. If you want that to be, say, 0.006 (5x smaller than the 0.03 signal we're trying to find), then N would be about 50k games.
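Checking that sample-size arithmetic, i.e. solving 1.3 / sqrt(N) = 0.006 for N:

```python
import math

sigma = 1.3          # rough per-game standard deviation of money-game score
target_se = 0.006    # want the standard error ~5x below the 0.03 signal

# Standard error of the mean is sigma / sqrt(N); solve for N.
n_games = math.ceil((sigma / target_se) ** 2)
print(n_games)       # ~47,000, i.e. roughly 50k games
```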
So that tells you how many games you need to play in your simulation to see if there's a measurable difference: around 50k. Maybe my numbers are wrong by a factor of 2 here or there, but that's the idea.
So you could run that and see whether the dumb strategy does, in fact, lose in head-to-head play against gnubg's standard strategy; or whether it's about even, and all this fancy cube stuff is nonsense.
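If you do wire this up, the measurement side is simple. A sketch, assuming a hypothetical play_game() callable that plays one money game (gnubg vs. the dumb cube strategy) and returns the dumb strategy's signed score:

```python
import math
import statistics

def run_trial(n_games, play_game):
    # play_game() is hypothetical: plays one money game and returns
    # the dumb strategy's signed score for that game.
    scores = [play_game() for _ in range(n_games)]
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(n_games)
    return mean, stderr
```

If the mean comes out several standard errors below zero, the cube handling is worth real equity; if it's within a standard error of zero at 50k games, whatever effect exists is smaller than the 0.03 estimate.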