swarm-support

On portability and testing


From: Benedikt Stefansson
Subject: On portability and testing
Date: Fri, 30 May 1997 15:48:33 +0200

I have also been trying to work with Swarm on a Silicon Graphics Irix
workstation, which is the third or fourth platform and the umpteenth
computer I use for Swarm development. This experience, in addition to
something that occurred a few weeks ago, leads me to write this long
e-mail.

One of the interesting (harrowing?) things I've discovered is that
running your Swarm sim on more than one platform will invariably
lead you to discover hidden bugs (or features)...

So let me share these two stories with you, one which has a happy
ending, the other which still doesn't have an ending at all.

         Part 1: Too much RAM can be a bad thing

      In which the author learns to stop worrying and love 
      Linux when it bombs...

My first encounter with this phenomenon occurred a few weeks ago when
I made a few seemingly insignificant changes in an old sim. This
caused it to start crashing in the worst way on my Linux box. 

To add insult to injury, the errors spewing forth from the debugger
were apparently totally random; sometimes the crash seemed to occur
within the ModelSwarm, in other cases in updating one of the graphs or
in a million other places.

But to give an odd twist to the whole affair, when I
compiled the same code on a Sun Sparc station, everything seemed to be
fine and dandy. The sim would run for thousands of clicks without
ever crashing.

At first I was pissed off at Linux, and thought that somewhere in the
bowels of GCC or the Tcl/Tk libraries for Linux a bug lay in hiding,
buried deep enough so that I, a mere mortal, would never be able to lure
it out of its lair.

Well, to cut a long and painful story short the hero of the story turned
out to be Good Ol' Linux after all: In my infinite wisdom I had
programmed an array overrun (an obvious explanation once you come to
your senses). 

In the 16 MB of RAM of the Linux box this was a serious matter. The OS
did the right thing and crashed the program. In the 1 GB+ of RAM of the
Sun Sparc station this did not seem to matter much: the OS was not going
to bother to bring down the sim even if the program was splattering
information all over memory.
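
For illustration only, here is a minimal C sketch of a bounds-checked
store, which makes an overrun fail the same way on every platform
instead of depending on what happens to sit next to the array. The
Grid type, N_CELLS and grid_set are hypothetical names, not code from
the sim described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical fixed-size table standing in for the overrun array. */
#define N_CELLS 8

typedef struct {
    double cells[N_CELLS];
} Grid;

/* Store value at index.
   Returns 0 on success, -1 if index is out of range.
   Nothing is written to memory when the index is rejected. */
int grid_set(Grid *g, size_t index, double value)
{
    assert(g != NULL);
    if (index >= N_CELLS) {
        fprintf(stderr, "grid_set: index %lu out of range (size %d)\n",
                (unsigned long)index, N_CELLS);
        return -1;
    }
    g->cells[index] = value;
    return 0;
}
```

A call such as grid_set(&g, N_CELLS, 2.0) is rejected with an error
message on any machine, however much RAM it has.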

          Part 2: Though it runs it may still be crashing

     In which the author learns to "diff" his data but 
     remains perplexed.

My second encounter with the phenomenon has come in these last couple of
days. Since I'm 9000 km and a veeery slow Internet link away from the
Sparc station mentioned above, I'm trying to use an SGI Indy to speed up
the parameter sweeps of a different simulation (which unfortunately
takes 3 1/2 hrs to finish 5,000 periods on a Linux Pentium 90 MHz box).

Well, although the program compiles on the Indy and runs OK with the
GUI, it exhibits some odd behavior. 

Firstly, when I run the program in -batchmode it stalls on trying to
load the BatchSwarm setup file. Doesn't crash - just seems to go into an
infinite loop.

Removing this method call in the BatchSwarm seems to cure the problem -
oddly enough, because ModelSwarm is still loading the setup file using
the same ObjectLoader library. 

Secondly, although the program will start up and run for a few hundred
periods in either -batchmode or with the GUI, it now crashes (at random
of course, what else) with the message that one of the agents doesn't
recognize a method, which it does and has for the previous 100-odd periods. 

Predictably, I can't replicate the error on the Linux box.

Thirdly, to make matters worse, using "diff" on the data reveals that the
behavior generated by some of the agents on the SGI is not the same as
the behavior generated on the Linux box. 
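
A textual "diff" flags any difference at all. As a rough sketch, and
assuming the output of each run can be read into an array of doubles,
a comparison with a tolerance can help tell last-digit rounding noise
from genuinely different behavior. The function name first_difference
and the tolerance parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Compare two series of n values from two runs.
   Returns the index of the first pair whose absolute difference
   exceeds tol, or n if the series agree everywhere within tol. */
size_t first_difference(const double *a, const double *b,
                        size_t n, double tol)
{
    size_t i;

    assert(a != NULL && b != NULL);
    for (i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0)
            d = -d;             /* absolute value without math.h */
        if (d > tol)
            return i;
    }
    return n;
}
```

Knowing the first period at which the two platforms diverge narrows
down where to start looking.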

At present I'm using the default random number generator (and was
under the impression that this would cause the same random number seed
to be used in both cases). The GCC version on the SGI is 2.7.1 but 2.7.2
on the Linux box, so that might possibly explain some of this behavior. 

Although the source of the bugs might simply be a stray pointer or
overrun, the bottom line is that we are dealing with a complex piece of
software here, so it pays to test whatever you generate carefully, not
just on your trusted OS but even on different platforms.  

----

Don't get me wrong, I'm not putting any blame on Swarm here. Obviously
one of the goals of the Swarm project is to give lousy programmers like
me the ability to create portable and testable simulations based on
standard libraries and compilers. 

The bottom line is that we have to follow up on this promise and
actually port and test extensively, even though the results may not be
pleasant.

Regards,
-Benedikt


