bug-gplusplus

Re: Floating point failure (NaN)


From: Ron Miller
Subject: Re: Floating point failure (NaN)
Date: Thu, 7 Jun 2001 12:22:46 -0700

Hi Greg,

Thanks for the input and for trying the testcase on your own system.

We've done a bit more debugging on our end, and right now it is looking like
the culprit may be the version of the kernel that we are using. From the
Red Hat bug reports, there is a known problem with context switching that can
sometimes cause this behavior. That bug was reported against a slightly
different Red Hat release than the one we are using, but it is close enough
that we are going with that theory.

However ... I liked your hypothesis about radiation :-)

Ron

p.s. The problem Linux machine is in Sunnyvale, CA and not Colorado.

"Greg Chicares" <address@hidden> wrote in message
news:address@hidden
> Ron Miller wrote:
> >
> > I have spent a lot of the last week trying to track down a bug that appears
> > to be OS/hardware related.
> >
> > The problem that I am seeing is that NaN (not a number) is mysteriously
> > appearing in some of my floating point variables even when it should not.
> >
> > I posted a message last week, but since then I have simplified the code.
> >
> > I have a code fragment that looks like the following:
> >
> >     double da, db;
> >     while ( 1 ) {
> >         da = 1.0;
> >         db = da;
> >         if ( (da!=da) || (db!=db) ) {
> >             printf( "Found NaN\n" );
> >         }
> >     }
> >
> > This will run for hours or even days on a multi-processor machine, and then
> > at the same time, several of the jobs will start printing out that they
> > found NaN (maybe 3 or 4 messages) and then go on acting normally again.
> > Some machines also seem more susceptible to the problem, although it seems
> > to eventually fail on all machines that I have tried.
> >
> > My operating system / environment is:
> >
> >         Redhat 6.2
> >         gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)
> >         (also tried gcc version 2.95.2 19991024 (release))
> >         2 x 700 MHz Intel Pentium III
> >         100 MHz Bus Clock Speed
> >         2 GB RAM
> >         4 GB Swap Space
> >
> > Changing compilers or changing compiler flags doesn't seem to fix the
> > problem.
> >
> > It also fails with floats as well as doubles.
> >
> > It seems like it might be a problem with the hardware (FPU overheating?),
> > but I have been able to get it to fail on machines from multiple vendors.
> >
> > Has anyone else seen anything like this before?
>
> I've never seen anything like that, but I haven't watched a program run
> as long as you have. Well, I did once manage to keep win95 running for
> several days, but that wasn't repeatable.
>
> It sure sounds like a hardware failure. Isolated bursts of errors
> would seem to support that hypothesis.
>
> The old 80287 chip always ran hot. I once saw an intel guy say
> you could test it by touching it with your finger; if you got
> a burn that said '78208' then it passed the test. But now the
> FPU is on the same die as the other stuff, isn't it?
>
> I tried running your old program overnight, but couldn't get
> it to fail.
>
> >         2 GB RAM
>
> Is this parity-checked RAM? Using the statistics here
>   http://www.sciam.com/1999/1099issue/1099cyber.html
> I figure you can go about 43 days on average without a
> cosmic-ray RAM incident. Oh, wait, now that you've
> simplified the program, I can see that it's going to
> execute in cache, at least if nothing else is running
> at the same time. Well, anyway, have a look here:
>   http://patrec.com/rico/ecc/
>
> I just had to look up whether your company's offices
> are in Colorado, since according to this URL
>   http://www.research.ibm.com/journal/rd/421/ziegler.html
> Denver gets pelted with four times as many hadrons as
> NYC, and Leadville should have thirteen times the 'soft
> fail' rate as NYC; but it looks like you've got lots of
> locations. Are you up in the mountains or in a brick
> building or something? Have you had the place checked
> for radon?




