Re: New user with several major monit problems
From: Jonathan Wheeler
Subject: Re: New user with several major monit problems
Date: Tue, 13 Sep 2005 02:17:38 +1200
User-agent: Mozilla Thunderbird 1.0.2 (X11/20050423)
Martin Pala wrote:
> Jonathan Wheeler wrote:
>
>> Martin Pala wrote:
>>
>>
>>> Jonathan Wheeler wrote:
>>>
>>>
>>>> Most annoyingly, for my cluster monit -g node1 stop all (as taken
>>>> directly from your documentation) kills the *entire* server (see
>>>> problem 1)
>>>
>>>
>>>
>>> Yet one thing - the described node shutdown sounds me like some
>>> watchdog driven shutdown - do you use heartbeat's watchdog capability
>>> or some other external check which is able to panic the node?
>>>
>>
>> No I don't, nothing fancy at all yet :)
>>
>> Any thoughts on how I might troubleshoot this further? Syslog is killed
>> itself, so I don't have any information in the logs at all. Local
>> console is also booted out, so even sitting in front of the server
>> doesn't help.
>
>
> I think it is either watchdog or some stonith method (power off/cycle
> the machine). You can try for example 'lsof | grep watchdog' to see
> whether the watchdog device is opened.
>
> If you can supply your heartbeat, monit and scripts configuration as
> described Hauk, then it will be much easier to find the problem.
>
> Martin
Hi Team,
As promised I've done some more digging into this. After rebooting my
two test servers, I was unable to replicate the problem by simply
running monit -g node1 stop all. So I went back to my HA/monit
configuration again to see what would happen.
I was then in a position where monit -g node1 stop all would kick me out
of my ssh sessions to the machine, and according to the monit http
interface it was restarting all services (one of which is sshd, it
should be noted), regardless of group.
Then I realised that monit stop drbdfs or monit stop heartbeat would
also kick me out.
I commented out and stopped heartbeat at this point.
Syslogd was defined in my monitrc file, so I commented it out, reloaded
monit, and ran monit -g node1 stop all. I was booted out, and
reconnected to find that this time syslog hadn't restarted itself. So it
would appear that monit has been (re)starting syslog (and sshd) for me
after all processes are killed. With syslog stopping it's very hard to
tell exactly what is happening, and of course it shouldn't be stopping
in the first place.
My assumption is that monit is only surviving because it is
running/respawning directly from init, and according to monit's uptime
numbers it too is being restarted.
It then clicked that 'monit stop drbdfs' killing the system was probably
a very important clue, and when I ran its stop program manually, I was
also kicked out!
YAY, so with HA and monit removed I'm still able to replicate the problem:
inertia:~# /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs stop
/mnt/nfstest: 1rce 2rc 3rc 4rc 21rc 46rc
47rc 48rc 49rc 185rc 205rc 312rc 1867rc 7041rce 7045rce
7056rce 7071rce 7086rce 7102rce 7108rce 7118rce 7209rce 7216rce
7253rce 7256rce 7311rce 7314rce 7324rce 7327re 16681rc 16688rc
kill 7256: No such process
Connection to inertia closed by remote host.
Connection to inertia closed.
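A note on reading the listing above: fuser appends access codes to each
PID (c = current directory, e = executable being run, r = root
directory), and the list starts at PID 1, so every process named there,
init included, appears to have been a target of the -k in the script's
stop branch. The following is only a sketch and not part of the
heartbeat script: fuser_lists_init is a made-up helper name, and it
assumes psmisc-style output. It flags such a listing before anything is
killed.

```shell
# Hypothetical helper: succeed (exit 0) if a fuser listing read from
# stdin names PID 1. Assumes psmisc-style output, i.e. an optional
# "<path>:" label followed by PIDs with optional access letters.
fuser_lists_init() {
    awk '{
        sub(/^[^:]*:/, "")              # drop the leading "<path>:" label
        for (i = 1; i <= NF; i++) {
            pid = $i
            sub(/[a-zA-Z]+$/, "", pid)  # strip access codes such as "rce"
            if (pid == "1") found = 1
        }
    } END { exit(found ? 0 : 1) }'
}

# Possible use (list only, no -k), shown as a comment so nothing runs here:
#   fuser -m /mnt/nfstest 2>&1 | fuser_lists_init &&
#       echo "fuser -mk here would target init; not killing"
```

Running fuser -m without -k first costs nothing and shows the same list
that -k would act on.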
Now I realise that I've more or less ruled out monit as the cause of
this, but I wonder if you'd be so kind as to cast your eyes over this
script and let me know if you see anything out of place, because I then
rebooted, ran the script manually, and WASN'T kicked out, as below.
These scripts, and indeed HA, were working for the most part before I
added monit to the equation, and the fact that this script worked this
time around after a reboot adds further murkiness. I do realise that it
is still perhaps a little hasty to conclude from that that monit is at
fault, but any assistance you can provide would be greatly appreciated.
inertia:/mnt# /etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs stop
inertia:/mnt# ps ax
PID TTY STAT TIME COMMAND
1 ? S 0:00 init [2]
2 ? SN 0:00 [ksoftirqd/0]
3 ? S< 0:00 [events/0]
4 ? S< 0:00 [khelper]
21 ? S< 0:00 [kblockd/0]
46 ? S 0:00 [pdflush]
47 ? S 0:00 [pdflush]
49 ? S< 0:00 [aio/0]
48 ? S 0:00 [kswapd0]
185 ? S 0:00 [kseriod]
205 ? S< 0:00 [ata/0]
312 ? S< 0:00 [reiserfs/0]
1894 ? S 0:00 [drbd0_worker]
1907 ? S 0:00 [drbd0_receiver]
1917 ? S 0:00 [drbd0_asender]
3555 tty1 Ss+ 0:00 -bash
3596 tty2 Ss+ 0:00 -bash
3608 tty3 Ss+ 0:00 /sbin/getty 38400 tty3
3629 tty4 Ss+ 0:00 /sbin/getty 38400 tty4
3649 tty5 Ss+ 0:00 /sbin/getty 38400 tty5
3661 tty6 Ss+ 0:00 /sbin/getty 38400 tty6
3772 ? Ss 0:00 /usr/sbin/monit -Ic /etc/monit/monitrc
3807 ? Ss 0:00 /usr/sbin/sshd
3814 ? Ss 0:00 /usr/sbin/exim4 -bd -q30m
3868 ? Ss 0:00 sshd: address@hidden/1
3871 pts/1 Ss+ 0:00 -bash
3956 ? Ss 0:00 sshd: address@hidden/0
3959 pts/0 Ss 0:00 -bash
4033 pts/0 R+ 0:00 ps ax
Filesystem script:
inertia:~# cat /etc/ha.d/resource.d/Filesystem | grep -v \#
unset LC_ALL; export LC_ALL
unset LANGUAGE; export LANGUAGE
prefix=/usr
exec_prefix=/usr
. /etc/ha.d/shellfuncs
MODPROBE=/sbin/modprobe
FSCK=/sbin/fsck
FUSER=/bin/fuser
MOUNT=/bin/mount
UMOUNT=/bin/umount
BLOCKDEV=/sbin/blockdev
check_util () {
if [ ! -x "$1" ] ; then
ha_log "ERROR: setup problem: Couldn't find utility $1"
exit 1
fi
}
usage() {
cat <<-EOT;
usage: $0 <device> <directory> <fstype> [<options>] {start|stop|status}
<device> : name of block device for the filesystem. e.g. /dev/sda1, /dev/md0
OR -LFileSystemLabel OR -Uuuid or an NFS specification
<directory> : the mount point for the filesystem
<fstype> : name of the filesystem type. e.g. ext2
<options> : options to be given as -o options to mount.
$Id: Filesystem.in,v 1.10 2003/07/03 02:14:14 alan Exp $
EOT
}
flushbufs() {
if
[ "$BLOCKDEV" != "" -a -x "$BLOCKDEV" ]
then
case $1 in
-*|[^/]*:/*) ;;
*) $BLOCKDEV --flushbufs $1;;
esac
fi
}
DEVICE=$1
MOUNTPOINT=$2
FSTYPE=$3
case $DEVICE in
-*)
;;
[^/]*:/*)
;;
*) if [ ! -b "$DEVICE" ] ; then
ha_log "ERROR: Couldn't find device $DEVICE. Expected /dev/??? to exist"
usage
exit 1
fi;;
esac
if [ ! -d "$MOUNTPOINT" ] ; then
ha_log "ERROR: Couldn't find directory $MOUNTPOINT to use as a mount point"
usage
exit 1
fi
check_util $MODPROBE
check_util $FSCK
check_util $FUSER
check_util $MOUNT
check_util $UMOUNT
case $# in
4) operation=$4; options="";;
5) operation=$5; options="-o $4";;
*) usage; exit 1;;
esac
case "$operation" in
start)
$MOUNT | cut -d' ' -f3 | grep -e "^$MOUNTPOINT$" >/dev/null
if [ $? -ne 1 ] ; then
ha_log "ERROR: Filesystem $MOUNTPOINT is already mounted!"
exit 1;
fi
$MODPROBE scsi_hostadapter >/dev/null 2>&1
$MODPROBE $FSTYPE >/dev/null 2>&1
grep -e "$FSTYPE"'$' /proc/filesystems >/dev/null
if [ $? != 0 ] ; then
ha_log "ERROR: Couldn't find filesystem $FSTYPE in /proc/filesystems"
usage
exit 1
fi
if
case $FSTYPE in
ext3|reiserfs|xfs|jfs|vfat|fat|nfs) false;;
*) true;;
esac
then
ha_log "info: Starting filesystem check on $DEVICE"
$FSCK -t $FSTYPE -a $DEVICE
if
[ $? -ge 4 ]
then
ha_log "ERROR: Couldn't sucessfully fsck filesystem for $DEVICE"
exit 1
fi
fi
flushbufs $DEVICE
if
$MOUNT -t $FSTYPE $options $DEVICE $MOUNTPOINT
then
: Mount worked!
else
ha_log "ERROR: Couldn't mount filesystem $DEVICE on $MOUNTPOINT"
exit 1
fi
;;
stop)
if
$MOUNT | grep -e " on $MOUNTPOINT " >/dev/null
then
$FUSER -mk $MOUNTPOINT
DEV=`$MOUNT | grep "on $MOUNTPOINT " | cut -d' ' -f1`
$UMOUNT $MOUNTPOINT
if [ $? -ne 0 ] ; then
ha_log "ERROR: Couldn't unmount $MOUNTPOINT"
exit 1
fi
flushbufs $DEV
else
ha_log "WARNING: Filesystem $MOUNTPOINT not mounted?"
fi
;;
status)
$MOUNT | grep -e "on $MOUNTPOINT " >/dev/null
if [ $? = 0 ] ; then
echo "$MOUNTPOINT is mounted (running)"
else
echo "$MOUNTPOINT is unmounted (stopped)"
fi
;;
*)
echo "This script should be run with a fourth argument of 'start', 'stop', or 'status'"
usage
exit 1
;;
esac
exit 0;
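Both the stop and status branches above decide whether the filesystem is
mounted by grepping mount's output for "on $MOUNTPOINT ", which is a
regular-expression match on part of the line. As a hedged sketch of a
stricter check, the same question can be answered by comparing whole
fields. mounted_device is a hypothetical helper, not something heartbeat
ships, and it assumes the "<device> on <dir> type <fstype> (<options>)"
line layout and a mount point without spaces or backslashes.

```shell
# Hypothetical helper: read mount(8)-style lines from stdin and print the
# device mounted exactly on the directory given as $1.
# Exit status is 0 if such a line was found, 1 otherwise.
mounted_device() {
    awk -v mp="$1" '
        $2 == "on" && $3 == mp { print $1; found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Possible use inside a stop branch, shown as a comment only:
#   if DEV=$($MOUNT | mounted_device "$MOUNTPOINT"); then
#       ... only now consider fuser and umount ...
#   fi
```

The point of the design is that nothing destructive runs unless an exact
line for the mount point exists; it does not by itself explain the
behaviour described earlier.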
My monitrc:
set daemon 60
set logfile syslog facility log_daemon
set mailserver localhost port 25, willow.griffous.net
set mail-format { from: address@hidden }
set alert address@hidden
set httpd port 2812 and
allow 10.0.10.6
allow 192.168.1.133
check process sshd with pidfile /var/run/sshd.pid
start program "/etc/init.d/ssh start"
stop program "/etc/init.d/ssh stop"
if failed port 22 protocol ssh then restart
if 5 restarts within 5 cycles then timeout
group system
check process exim4 with pidfile /var/run/exim4/exim.pid
start program "/etc/init.d/exim4 start"
stop program "/etc/init.d/exim4 stop"
if failed port 25 protocol smtp then restart
if 5 restarts within 5 cycles then timeout
group system
check device drbd path /proc/drbd
start program = "/etc/ha.d/resource.d/drbddisk r0 start"
stop program = "/etc/ha.d/resource.d/drbddisk r0 stop"
mode manual
group node1
check directory drbdfs path /mnt/nfstest/nfs
start program = "/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs start"
stop program = "/etc/ha.d/resource.d/Filesystem /dev/drbd0 /mnt/nfstest reiserfs stop"
mode manual
depends drbd
group node1
check process nfsd with pidfile /var/run/nfsd.pid
start program = "/etc/init.d/nfs-kernel-server start"
stop program = "/etc/init.d/nfs-kernel-server stop"
mode manual
depends on drbdfs
group node1
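Since the question is which services monit -g node1 stop all is entitled
to touch, it can help to list group membership straight from the file.
This is a rough sketch only: services_in_group is a hypothetical helper,
and it assumes one "check <type> <name> ..." line per service with
"group <name>" on a line of its own, as in the monitrc above.

```shell
# Hypothetical helper: read a monitrc from stdin and print the names of
# the services whose "group" statement matches $1.
services_in_group() {
    awk -v g="$1" '
        $1 == "check"            { name = $3 }   # remember current service
        $1 == "group" && $2 == g { print name }  # report it if group matches
    '
}

# Possible use, shown as a comment only:
#   services_in_group node1 < /etc/monit/monitrc
```

Run against the monitrc above, it should list drbd, drbdfs and nfsd for
node1, with sshd and exim4 appearing only under system.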
inertia:/mnt# monit -V
This is monit version 4.5
Copyright (C) 2000-2005 by the monit project group. All Rights Reserved.
Thanks,
Jonathan