Re: [monit] Monit fails to create PID file on restart
From: Jonathan Maddox
Subject: Re: [monit] Monit fails to create PID file on restart
Date: Tue, 13 Jul 2010 14:34:35 +1000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.10) Gecko/20100527 Thunderbird/3.0.5

Hello,
We have seen the same issue with in-house daemons running on CentOS
(with init scripts derived from those in stock CentOS packages) and
monitored by monit. I am sure that the same problem can occur with
stock daemons such as Apache with certain workloads. It is the result
of a race condition between the init script and the monit 'restart'
action, which is triggered when a daemon takes several seconds to shut
down when signalled.
What happens is this: Monit's 'restart' action first launches the 'stop'
command in a background process, while the main process polls once per
second until the daemon no longer exists. It decides this by reading the
pidfile and checking for a process with the specified pid. As soon as the
pidfile and a running daemon no longer match, monit runs the 'start' command.
The 'stop' command is often an init script. The stock CentOS and RedHat
init scripts read the pidfile for the daemon and send several signals to
that pid with sleeps in between, as follows (this code is in
/etc/init.d/functions, in the function killproc()):
    delay=3
    ...
    if checkpid $pid 2>&1; then
        # TERM first, then KILL if not dead
        kill -TERM $pid >/dev/null 2>&1
        usleep 100000
        if checkpid $pid && sleep 1 &&
           checkpid $pid && sleep $delay &&
           checkpid $pid ; then
            kill -KILL $pid >/dev/null 2>&1
            usleep 100000
        fi
    fi
    ...
    rm -f "${pid_file:-/var/run/$base.pid}"
(I've elided irrelevant bits that depend on special options passed to
this function. This is the default behaviour.)
The race condition is that monit's polling can notice that the daemon is
gone while the init script is still in one of its sleeps, so monit runs
the 'start' command well before the 'stop' command has completed. By the
time the 'stop' command wakes up, the pidfile for the *new* invocation of
the daemon has already been created, and 'stop' then removes that new
pidfile.
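
To make the ordering concrete, here is a small simulation of the race.
It is only a sketch: there is no real daemon and no monit involved, the
pids 1111 and 2222 are made up, and the sleeps are shortened stand-ins
for the ones in killproc(). The function name simulate_race is
hypothetical.

```shell
#!/bin/sh
# Simulation only: a temp file plays the pidfile, fixed numbers play the pids.
simulate_race() {
    pid_file=$(mktemp)
    echo 1111 > "$pid_file"      # pidfile written by the old daemon

    # 'stop' runs in the background. The old daemon is already gone, but
    # the script is still sleeping before its unconditional rm -f.
    ( sleep 2; rm -f "$pid_file" ) &
    stop_job=$!

    # Meanwhile the poller sees that the old process has vanished and
    # runs 'start'; the new daemon writes its own pidfile.
    sleep 1
    echo 2222 > "$pid_file"

    # 'stop' now wakes up and removes whatever pidfile is there.
    wait "$stop_job"
    if [ -f "$pid_file" ]; then
        echo "pidfile survived"
    else
        echo "pidfile removed by late stop"
    fi
    rm -f "$pid_file"
}
```

Calling simulate_race shows the new daemon's pidfile disappearing even
though that daemon would still be running.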
There are several ways to fix this.
One way would be simply to remove the line in the init script which
deletes the pid file. Since stale pid files are commonplace after many
error cases (e.g. daemon crashes and hardware failure) and scripts all
seem to be written to cope with them, removing the pid file does not win
anything.
Another, more complete fix would be for monit not to run its 'start'
command until after 'stop' has returned. It would then work even with
'broken' init scripts.
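
The ordering I have in mind can be sketched in shell as below. This is
not monit's code (monit is not a shell script), and restart_serialized
is a hypothetical name; it only illustrates "wait for 'stop' to return,
then run 'start'".

```shell
#!/bin/sh
# Hypothetical illustration of a serialized restart; not monit's implementation.
restart_serialized() {
    stop_cmd=$1
    start_cmd=$2
    # Run 'stop' in the foreground: this blocks until the stop command,
    # including any sleeps and its final rm -f, has returned.
    sh -c "$stop_cmd"
    # Only now does 'start' get to create the new pidfile.
    sh -c "$start_cmd"
}
```

For example, restart_serialized "/etc/init.d/mydaemon stop"
"/etc/init.d/mydaemon start", where mydaemon is a placeholder name.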
We have dealt with the problem locally by replacing relevant parts of
the init scripts, only for those daemons which are watched by monit.
Our scripts no longer unconditionally remove the pid file after 'stop',
but will delete it only if it still contains the pid which has just been
killed. This is actually still open to a race, but not a race across a
sleep of one or more seconds. It also assumes that the kernel will not
recycle the same pid for the same daemon ... probably not completely
valid but seems reliable enough to us for now.
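
A rough sketch of that check follows. It is not the actual code from our
scripts; the function name remove_pidfile_if_owner and its two arguments
are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: delete the pidfile only if it still names the
# pid that 'stop' has just killed.
remove_pidfile_if_owner() {
    pid_file=$1
    killed_pid=$2
    # Nothing to do if the pidfile is already gone.
    [ -f "$pid_file" ] || return 0
    current_pid=$(cat "$pid_file" 2>/dev/null)
    # If a new daemon has already written its own pid, leave the file alone.
    if [ "$current_pid" = "$killed_pid" ]; then
        rm -f "$pid_file"
    fi
    return 0
}
```

In the 'stop' path this would stand in for the unconditional rm -f,
called as remove_pidfile_if_owner "$pid_file" "$pid". The window between
reading the file and removing it is the remaining race mentioned above.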
I should now raise bugs with monit and CentOS and/or RedHat upstreams to
let them know about the issue, now that I know that other people have
found it to be a problem in the wild.
Looking superficially at the init scripts provided by a couple of other
distributions, I see that some don't seem ever to remove the pid file so
they are obviously immune to this race.
I hope this helps.
regards,
Jonathan Maddox