Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04


From: Albert Chu
Subject: Re: [Freeipmi-users] bmc-watchdog 0.7.15-2 exiting under Ubuntu 10.04
Date: Tue, 01 Feb 2011 17:20:01 -0800

Hi Robert,

On Tue, 2011-02-01 at 11:40 -0800, Robert Hardy wrote:
> It is possible that a BIOS option which starts the watchdog is 
> enabled.
> Once I get a chance, I will dig around in the BIOS and see.

I think a more likely scenario would be the IPMI kernel driver is
starting up the watchdog and racing w/ the FreeIPMI one.  Are you
loading the IPMI kernel driver?
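
A quick way to check, as a rough sketch: the helper below (the function name ipmi_watchdog_loaded is just made up for illustration) looks for the Linux ipmi_watchdog module in /proc/modules-style input.  It won't notice a driver that is compiled into the kernel rather than loaded as a module.

```shell
#!/bin/sh
# Succeed (exit status 0) if a module named ipmi_watchdog appears in
# /proc/modules- or lsmod-style input on stdin, where the first field
# of each line is the module name.
ipmi_watchdog_loaded() {
    awk '$1 == "ipmi_watchdog" { found = 1 } END { exit !found }'
}

# Example use:
#   ipmi_watchdog_loaded < /proc/modules \
#       && echo "kernel IPMI watchdog driver is loaded"
```

If it reports the module as loaded, that would point at the kernel driver rather than the BIOS as the thing arming the timer.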

> I would think it would be much better behaviour on startup to do the 
> equivalent of bmc-watchdog -y and then start the watchdog.

I had to look this up (b/c I couldn't remember, but was fairly certain):
the IPMI spec indicates that the watchdog timer is required to be turned
off when a node is rebooted (27.1).

> Failing to start simply because the BIOS started the countdown seems 
> very, very bad to me, especially without logging anything.

The logging portion of this issue should be fixed w/ the next release.

> You're left in 
> a state where the watchdog dies quietly and the server hard reboots 
> every couple of minutes.

If the BIOS happens to be starting the countdown, that's *REALLY* bad on
the part of the BIOS programmers.  Whoever starts the countdown needs to
manage it.  Some other random piece of software can't be trusted to
handle it.

So just so I understand the situation correctly, when you disable the
bmc-watchdog daemon, does the problem go away?  The FreeIPMI
bmc-watchdog does not start any timer until it determines the timer is
stopped.  Since the timer is already running, it never starts it.
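
To confirm that from the init script side, something like the following could classify the timer state before the daemon is launched.  This is only a sketch: timer_state is a hypothetical helper, and it assumes that bmc-watchdog --get prints a line of the form "Timer: Running" or "Timer: Stopped", which you should check against the output of your version.

```shell
#!/bin/sh
# Classify the watchdog timer state from `bmc-watchdog --get` output on stdin.
# ASSUMPTION: the output contains a line such as "Timer: Running" or
# "Timer: Stopped"; adjust the pattern if your version prints otherwise.
# Prints "running", "stopped", or "unknown".
timer_state() {
    awk -F: '
        /^Timer:/ {
            gsub(/[ \t]/, "", $2)   # strip padding around the value
            state = tolower($2)
        }
        END {
            if (state == "running" || state == "stopped")
                print state
            else
                print "unknown"
        }'
}

# Example use in an init script, before launching the daemon:
#   state=$(bmc-watchdog --get | timer_state)
#   [ "$state" = "running" ] && \
#       logger -t bmc-watchdog "timer already running before daemon start"
```

Logging the state rather than stopping the timer keeps the init script from quietly taking over a countdown that something else started.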

Al
  

> I'm willing to test anything you send my way. The server isn't really in 
> production yet but will be soon.
> 
> Ultimately I'm trying to package some better .debs for use on Ubuntu. 
> The current ones are badly packaged, to the point of being unusable.  
> I've re-written the init script for Ubuntu but I'd really like to see an 
> upstart based one....
> 
> Rob
> 
> On 2011-02-01 12:54 PM, Albert Chu wrote:
> > Hey Robert,
> >
> > I think I see the problem(s).  I call _err_exit(), which writes to
> > stderr, instead of _daemon_error_exit() which writes to the log.  That's
> > the error logging issue, which is secondary to the real one.
> >
> > As for the real issue, I think this is being hit:
> >
> >    if (timer_state == IPMI_BMC_WATCHDOG_TIMER_TIMER_STATE_RUNNING)
> >      _err_exit ("watchdog timer must be stopped before running daemon");
> >
> > For some reason, your BMC thinks the watchdog is running from the
> > start.  You could verify w/ bmc-watchdog --get if you don't start the
> > timer.  Perhaps it's a hardware bug?
> >
> > As an experiment, would you be willing to try a beta that removed this
> > check?  The issue is, I have no idea what the consequences of removing
> > this check will be on your motherboard if there's a bug in it.
> >
> > Al
> >
> > On Mon, 2011-01-31 at 15:11 -0800, Robert Hardy wrote:
> >> That would be /var/log/freeipmi/bmc-watchdog.log here and nothing is
> >> logged at startup (or after the unexpected exit) during bootup.
> >>
> >> I've put all sorts of debugging lines in my init script for bmc-watchdog.
> >>
> >> I finally ended up doing this as root:
> >> mv /usr/sbin/bmc-watchdog /usr/sbin/bmc-watchdog.real
> >>
> >> and then putting this in /usr/sbin/bmc-watchdog:
> >> #!/bin/bash
> >> strace -fFv -o /tmp/bmcstrace.log -- /usr/sbin/bmc-watchdog.real "$@"
> >>
> >> At bootup the bmc-watchdog initscript does launch a process with a new
> >> PID but it does NOT log the regular "starting bmc-watchdog daemon". It
> >> in fact logs nothing at all to /var/log/freeipmi/bmc-watchdog.log DURING
> >> BOOT UP.
> >>
> >> The strace above captured bmc-watchdog running at bootup and the same
> >> process exiting here at the last few lines:
> >>
> >> 1584  semop(229383, {{0, 1, SEM_UNDO}}, 1) = 0
> >> 1584  nanosleep({0, 1000}, NULL)        = 0
> >> 1584  write(2, "bmc-watchdog.real: watchdog time"..., 72) = -1 EBADF (Bad file descriptor)
> >> 1584  exit_group(1)                     = ?
> >>
> >> I've posted the entire strace here:
> >> http://webcon.ca/~rhardy/bmcdrop/
> >>
> >> Can you parse that and make any suggestions as to why it would exit
> >> uncleanly and only on boot up?
> >>
> >> I'm not quite sure what is going on, but it seems to be trying to write
> >> on a bad file descriptor, getting an error and then exiting.
> >> From the strace, file descriptor 2 is in fact closed, so that error
> >> makes sense to me. The real question is: why is it trying to write to FD 2?
> >>
> >> When I restart bmc-watchdog and it gets to the same place, it properly
> >> writes the startup message on file descriptor 0, which is the log file
> >> that was opened earlier...
> >>
> >> 2466  write(0, "[Jan 31 18:03:23]: starting bmc-"..., 48) = 48
> >>
> >> I'm open to debugging suggestions too... Ideas?
> >>
> >> Thanks for your help,
> >> Rob
> >>
> >> On 2011-01-28 5:37 PM, Albert Chu wrote:
> >>> Hey Robert,
> >>>
> >>> That is indeed strange.  Does the bmc-watchdog log say anything? (I
> >>> can't remember the exact location, but I think it's /var/log/freeipmi/
> >>> something).
> >>>
> >>> Al
> >>>
> >>> On Thu, 2011-01-27 at 13:14 -0800, Robert Hardy wrote:
> >>>> I'm running bmc-watchdog 0.7.15-2 under a current Ubuntu 10.04 64 bit on
> >>>> several fairly new unloaded Supermicro servers.
> >>>>
> >>>> On only one of four servers (always the same one), the bmc-watchdog
> >>>> process quietly exits shortly after start up, leaving the system set up
> >>>> for a hard reset shortly after bootup.
> >>>>
> >>>> The options and builds are identical on all of the servers. These are my
> >>>> options: OPTIONS="-d -u 2 -p 0 -a 1 -F -P -L -S -O -i 300 -e 60"
> >>>>
> >>>> Through debugging I've confirmed on boot up:
> >>>>
> >>>> - The init script gets run
> >>>>
> >>>> - It launches bmc-watchdog and saves a new PID correctly in
> >>>> /var/run/bmc-watchdog.pid.
> >>>>
> >>>> - Checking for a bmc-watchdog process in rc.local shows it isn't running
> >>>> and the timer is counting down.
> >>>>
> >>>> - There is no shutdown message logged when the process disappears
> >>>> during bootup.
> >>>>
> >>>> - There are no messages suggesting the process was killed
> >>>>
> >>>> On shutdown the init script gets as far as removing
> >>>> /var/run/bmc-watchdog.pid and seems to work fine.
> >>>>
> >>>> If I stuff this in rc.local the bmc-watchdog starts up properly and never
> >>>> seems to die again until the next reboot:
> >>>> /usr/sbin/service bmc-watchdog stop
> >>>> /usr/sbin/service bmc-watchdog start
> >>>>
> >>>> All in all this is very weird behaviour. Is it possible a newer version
> >>>> of bmc-watchdog would address this? i.e. is this a known bug?
> >>>>
> >>>> Any other ideas why this is happening (or how I can debug further)?
> >>>>
> >>>> Regards,
> >>>> Rob
> >>>>
> 
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



