octave-maintainers

Re: splinefit test failures


From: Daniel J Sebald
Subject: Re: splinefit test failures
Date: Thu, 02 Aug 2012 12:45:54 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.24) Gecko/20111108 Fedora/3.1.16-1.fc14 Thunderbird/3.1.16

On 08/02/2012 11:44 AM, Rik wrote:
On 08/01/2012 09:59 AM, address@hidden wrote:
Message: 7
Date: Wed, 01 Aug 2012 11:59:02 -0500
From: Daniel J Sebald<address@hidden>
To: "John W. Eaton"<address@hidden>
Cc: octave maintainers mailing list<address@hidden>
Subject: Re: random numbers in tests

On 08/01/2012 11:39 AM, John W. Eaton wrote:


Since all I had done was rename some files, I couldn't understand what
could have caused the problem.  After determining that the changeset
that renamed the files was definitely the one that resulted in the
failed tests, and noting that running the tests from the command line
worked, I was really puzzled.  Only after all of that did I finally
notice that the tests use random data.

It seems the reason the change reliably affected "make check" was that
by renaming the DLD-FUNCTION directory to dldfcn, the tests were run
in a different order.  Previously, the tests from files in the
DLD-FUNCTION directory were executed first.  Now they were done later,
after many other tests, some of which have random values, and some
that may set the random number generator state.
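
A minimal sketch of the effect (not part of the original message): any
test that consumes values from rand or randn without resetting the state
changes what every later test sees, so reordering the test files
reorders the random data.

rand ("state", 42);     # some earlier test fixes the generator state
a = rand (1, 3);        # data seen by test A when it runs first
b = rand (1, 3);        # data seen by test B when it runs second

rand ("state", 42);
b2 = rand (1, 3);       # now run test B first ...
a2 = rand (1, 3);       # ... and test A gets different data
isequal (a, a2)         # => 0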

Is this sort of thing also what caused the recent problem with the
svds test failure?
It sure looks like it.  Some of the examples I gave yesterday showed
that the SVD-on-sparse-data algorithm had results varying by at least
four times eps(), and that was just one or two examples.  If one were to
look at hundreds or thousands of examples, I would think it is very
likely to exceed 10*eps.

Spline fits and simulations can be less accurate as well, so the
10*eps tolerance is an even bigger question.
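
For a rough sense of the scale involved, a sketch along these lines
compares svds against the dense svd and expresses the largest difference
in units of eps (the matrix and the numbers are illustrative only):

A  = sprandn (200, 200, 0.05);
s5 = sort (svds (A, 5), "descend");   # five largest singular values (ARPACK)
sd = svd (full (A));                  # dense reference, sorted descending
max (abs (s5 - sd(1:5))) / eps        # typically a handful of eps, sometimes more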


Should we always set the random number generator state for tests so
that they can be reproducible?  If so, should this be done
automatically by the testing functions, or left to each individual
test?
I would say that putting in a fixed input that passes is not the thing
to do.  The problem with that approach is that if the library changes
its algorithm slightly, these same issues might pop up again when the
library is updated and people will wonder what is wrong once again.
I also think we shouldn't "fix" the random data by initializing the seed in
test.m.  For complete testing one needs both directed tests, created by
programmers, and random tests to cover the cases that no human would think
of, but which are legal.  I think the current code re-organization is a
great chance to expose latent bugs.

Agreed.


Instead, I think the sort of approach that Ed suggested yesterday is
the thing to do.  I.e., come up with a reasonable estimate of how
accurate such an algorithm should be and use that.  Octave is testing
functionality here, not the ultimate accuracy of the algorithm, correct?
Actually we are interested in both things.  Users rely on an Octave
algorithm to do what it says (functionality) and to do it accurately
(tolerance).  For example, the square root function could use many
different algorithms.  One simple replacement for the sqrt() mapper
function on to the C library (the current Octave solution) would be to use
a root finding routine like fzero.  So, hypothetically,

function y = sqrt_rep (x)
   ## naive sqrt replacement: find a root of z^2 - x with fzero
   y = fzero (@(z) z*z - x, 0);
endfunction

If I try "sqrt_rep (5)" I get "-2.2361".  Excepting the sign of the
result, the answer is accurate to the 5 digits displayed.  However, if I
try abs (ans) - sqrt (5) I get 1.4e-8, so the ultimate accuracy of this
algorithm isn't very good even though the algorithm is functional.
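
Spelled out at the prompt, using the numbers quoted above (output
formatting approximate):

octave:1> sqrt_rep (5)
ans = -2.2361
octave:2> abs (ans) - sqrt (5)
ans = 1.4e-08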

Yes, that is fairly inaccurate in terms of computer precision, but at
the same time it may be adequate for many applications; 1e-8 isn't awful
for some purposes.  (I wouldn't use it to compute a square root, but if
it solved some general problem whose answer I knew was sqrt(5), I might
be satisfied.)  In the past I've tested some of Octave's Runge-Kutta
differential equation solvers and had accuracy on the order of 1e-5 or
so.  I would have liked better.  The point is that one should check the
tools being used and know their limitations.

But you've swayed me on the "trust Octave accuracy" point.


Also, we do want more than just a *reasonable* estimate of the accuracy.
We try to test close to the bounds of the accuracy of the algorithm
because, even with a good algorithm, there are plenty of ways that the
implementation can be screwed up.  Perhaps we cast intermediate results
to float and thereby throw away accuracy.  What if we have an off-by-1
error in a loop condition that stops us from doing the final iteration
that drives the accuracy below eps?  Having tight tolerances helps us
understand whether it is the algorithm or the programmer that is
failing.  If it can be determined with certainty that it is the
algorithm, rather than the implementation, that is underperforming, then
I think it is acceptable at that point to raise tolerances to stop
%!test failures.
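
A concrete, hypothetical illustration of the first failure mode:
squeezing an intermediate result through single precision inflates the
final error by many orders of magnitude, which a tight tolerance catches
immediately.

x  = sqrt (5);                     # all-double computation
xs = double (single (sqrt (5)));   # same value cast through single
abs (x*x - 5)                      # on the order of eps (~8.9e-16)
abs (xs*xs - 5)                    # on the order of 1e-7, far beyond 10*eps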

Well, if Octave is using ARPACK, which is outside of Octave's control,
then there isn't much alternative.  Nonetheless, I think that tolerances
of 2*eps are just too small; 100*eps is more like it.  If something is
accurate to 1e-14, that's amazing.  1e-8 is something we can fathom, but
1e-14 is practically subatomic in my consciousness.  Check this result:

octave:1> sqrt(5)*sqrt(5) - 5
ans =  8.8818e-16

sqrt() isn't even accurate to 2*eps, so expecting other numerical
techniques to be that good is too tight a tolerance.
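
For scale, the discrepancy above expressed in units of eps is about
4*eps:

octave:2> (sqrt(5)*sqrt(5) - 5) / eps
ans = 4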


I tried running some of the examples to see how accurate the spline fit
is, but kept getting errors about some pp structure not having 'n' as a
member.
That is really odd.  You might try 'which splinefit' to make sure you
are pulling the correct one from the scripts/polynomial directory.  I
run 'test splinefit' and it works fine on revision dda73cb60ac5.
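
Something along these lines at the prompt should confirm which copy is
being picked up (paths will vary):

which splinefit   # should point at scripts/polynomial/splinefit.m in the source tree
test splinefit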

A good way to test whether there are repeatability issues is not to run
'make check', which takes too long, but to run 'test ("suspect_fcn")'.  I
ran the following, after commenting out the line that initializes the randn
seed to 13.

fid = fopen ("tst_spline.err", "w");            # log failing tests here
for i = 1:1000
   bm(i) = test ("splinefit", "quiet", fid);    # 1 if all tests pass, 0 if any fail
endfor
sum (bm)                                        # number of fully successful runs
fclose (fid);

The benchmark sum shows 898, so approximately 10% of the time the tests
fail.  Looking through the results in the tst_spline.err log, I see that
it is only when randn has returned a value exceptionally far from the
expected mean of 0 that we get a test failure.  Given that randn can
return any real number in (-Inf, +Inf), we might be better off testing
the function with a narrower input.

Replacing
%! yb = randn (size (xb));          # range is (-Inf, Inf)
with
%! yb = 2*rand (size (xb)) - 1;     # range is [-1, 1]

changes the success rate to 999/1000.
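
A rough illustration of why the narrower input helps (numbers vary from
run to run): the largest |randn| draw over a vector of even 1000 samples
routinely lands 3 or more standard deviations from the mean, while
2*rand - 1 never leaves [-1, 1].

max (abs (randn (1000, 1)))        # typically > 3, occasionally much larger
max (abs (2*rand (1000, 1) - 1))   # always <= 1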

Seems a worthwhile test to me.  Again, I'd loosen the tolerance a bit,
to the point where random deviations outside the tolerance become rare.
And if that doesn't happen, then it needs to be fixed.

Dan

