I believe your 0.9 output works beyond the bit timing loop because of a sampling rate difference between 0.9 and 3.X: 0.9 samples at either 20MS/s or 21.52MS/s, while 3.X uses 19.2MS/s. The test files located at
http://comsec.com/data/README are sampled at 20MS/s. Feeding the 0.9 output into the later stages works because only the interpolator, fpll, and bit timing loop care about the sampling rate (at least, those are the only blocks in which I found an explicit reference to one).
So you have shown that field_sync_demux and everything after it are working, which means anything before it is potentially at fault. I'm trying to narrow the issue down, but haven't found anything yet.
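
To put numbers on the rate mismatch, here is a rough Python sketch (not GNU Radio code) that computes how many samples per symbol each of the three rates gives. It assumes the standard ATSC 8-VSB symbol rate of 4.5 MHz * 684 / 286 (about 10.762 Msym/s), which I am bringing in from the ATSC spec rather than from the code; the function names are made up for illustration. It also prints the rational ratio you would need if you wanted to resample a 20MS/s capture down to 19.2MS/s, which is only one possible workaround and not something I have tested.

```python
from fractions import Fraction

# Assumption: standard ATSC 8-VSB symbol rate, 4.5 MHz * 684 / 286
# (about 10.762 Msym/s). This value is not taken from the 0.9 or 3.X source.
ATSC_SYMBOL_RATE = Fraction(4_500_000) * 684 / 286


def samples_per_symbol(sample_rate_hz):
    """Exact number of samples per ATSC symbol at the given sample rate."""
    return Fraction(sample_rate_hz) / ATSC_SYMBOL_RATE


def resample_ratio(from_rate_hz, to_rate_hz):
    """Reduced interpolation/decimation ratio to go from one rate to another."""
    return Fraction(to_rate_hz, from_rate_hz)


if __name__ == "__main__":
    # The three rates mentioned above: 0.9 (20 or 21.52 MS/s) and 3.X (19.2 MS/s).
    for rate in (20_000_000, 21_520_000, 19_200_000):
        sps = float(samples_per_symbol(rate))
        print(f"{rate / 1e6:.2f} MS/s -> {sps:.4f} samples/symbol")

    # Hypothetical fix-up: resample a 20 MS/s test file to 19.2 MS/s.
    ratio = resample_ratio(20_000_000, 19_200_000)
    print(f"20 -> 19.2 MS/s: interpolate by {ratio.numerator}, "
          f"decimate by {ratio.denominator}")
```

Any block that has a samples-per-symbol figure baked in for 19.2MS/s will be off by roughly 4% when handed 20MS/s data, which would be consistent with the early blocks failing while the symbol-rate blocks downstream are unaffected.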