Hi monit gurus,
I'm absolutely stumped, and have been for more than a month, trying to
chase a problem down. I'm using Monit 4.10.1 on openSUSE 11.0
64-bit.
Monit SOMETIMES starts multiple copies of the same job. Not always, not
never, SOMETIMES.
Monit can read the PID file for the job, the PID is defined, written out
to the file, permissions are correct, ownership is correct, and the PID
file contains a PID of one of the multiple executions of the same job.
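For reference, the by-hand version of that PID file check can be sketched roughly as below. This is only a minimal sketch and not how Monit itself does it: the function name `pidfile_status` is made up for illustration, and `kill -0` only reports a process as alive when the caller is allowed to signal it, so it is assumed to run as root or as the user that owns the job.

```shell
#!/bin/sh
# Hypothetical helper (not part of Monit): report whether the PID stored
# in a pid file belongs to a live process.
pidfile_status() {
    pidfile=$1
    if [ ! -r "$pidfile" ]; then
        echo "unreadable"
        return 2
    fi
    # read strips surrounding whitespace; an empty file leaves pid empty
    read -r pid < "$pidfile" || true
    case $pid in
        ''|*[!0-9]*) echo "invalid"; return 2 ;;
    esac
    # kill -0 sends no signal; it only tests that the process exists
    # and that we have permission to signal it
    if kill -0 "$pid" 2>/dev/null; then
        echo "alive $pid"
        return 0
    fi
    echo "stale $pid"
    return 1
}
```

It would be called as `pidfile_status /var/run/jboss/tm_prod03catalogedge01.pid`; an "alive" answer only says the recorded PID exists, not that it is the only copy running.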
The job in question is the tm_prod03catalogedge01 job (see the -v output
below for more specifics). The start/stop commands call helper scripts
that do the heavy lifting. The "sleep 30" at the end of the script is
an attempt to slow monit down so it doesn't try to start multiple
instances of the same job. It doesn't work. When multiple copies of
the same job are started, there is NOT a 30-second delay between their
start times as shown by ps.
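To put a number on the missing delay, the start times can be compared mechanically. The sketch below assumes the input is one line per process in the form "PID START_EPOCH" (start time in seconds since the epoch, gathered by hand, for example by converting the `lstart` column of ps); the function name `close_starts` and that input format are assumptions for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: read "PID START_EPOCH" lines on stdin and print
# each pair of consecutive starts that are less than $1 seconds apart,
# as "earlier_pid later_pid gap_in_seconds".
close_starts() {
    sort -k2,2n | awk -v min="$1" '
        NR > 1 && ($2 - prev_t) < min { print prev_pid, $1, $2 - prev_t }
        { prev_pid = $1; prev_t = $2 }'
}
```

For example, `printf '100 1000\n101 1002\n102 1040\n' | close_starts 30` prints `100 101 2`, flagging the two processes that started two seconds apart.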
Has anyone else run into a bug where Monit very quickly starts multiple
instances of the same job? I'm seeing this on dozens of different
hosts at different times, so it's not isolated to a single monit
instance or a single job definition. The only thing they have in common
is that all of the jobs are JBoss servers.
I've been anxiously watching the Monit 5.0 betas, hoping a final
release comes soon. These are production servers, and I'd rather
not run beta code if at all possible. However, I will if this is a
known bug that has been fixed and I just couldn't match this problem up
to the entries in the Changelog.
--
monit_run.sh:
#!/bin/ksh
DATE=`date +%Y%m%d-%H%M%S`
CONSOLE_LOG=/opt/jboss/server/${4}/log/console.log
if [ -a ${CONSOLE_LOG} ]; then
mv ${CONSOLE_LOG} ${CONSOLE_LOG}-${DATE}
fi
logger "Running /opt/jboss/bin/run.sh for ${2}"
cd /opt/jboss/bin; ./${4} $* | tee ${CONSOLE_LOG}
# sticking in a sleep to try to get monit to stop spawning multiple procs
sleep 30
--
monitrc:
set daemon 20
set logfile syslog facility log_daemon
set mailserver localhost # primary mailserver
set eventqueue
basedir /opt/monit/eventqueue # set the base directory where events will be stored
set mail-format { Subject: monit alert for $HOST -- $EVENT $SERVICE }
set alert address@hidden # receive all alerts
set httpd port 2812 and
use address localhost # only accept connection from localhost
allow localhost # allow localhost to connect to the server and
include /opt/monit/jobs/*
check system localhost
noalert address@hidden
--
monit -v output:
[dpaper]:[18:07:48]:/opt/jboss/bin> sudo monit -v
monit: Debug: Adding host allow 'localhost'
monit: Debug: Skipping redundant host 'localhost'
monit: Debug: Skipping redundant host 'localhost'
monit: Debug: Skipping redundant host 'localhost'
monit: Debug: Skipping redundant host 'localhost'
monit: Debug: Skipping redundant host 'localhost'
Runtime constants:
Control file = /opt/monit/etc/monitrc
Log file = syslog
Pid file = /var/run/monit.pid
Debug = True
Log = True
Use syslog = True
Is Daemon = True
Use process engine = True
Poll time = 20 seconds
Event queue = base directory /opt/monit/eventqueue with
unlimited slots
Mail server(s) = localhost:25
Mail from = address@hidden
Mail subject = monit alert for $HOST -- $EVENT $SERVICE
Mail message = $EVENT Service $SERV..(truncated)
Start monit httpd = True
httpd bind address = localhost
httpd portnumber = 2812
httpd signature = True
Use ssl encryption = False
httpd auth. style = Host/Net allow list
Alert mail to = address@hidden
Alert on = All events
The service list contains the following entries:
Process Name = tm_prod03catalogedge01
Pid file = /var/run/jboss/tm_prod03catalogedge01.pid
Monitoring mode = active
Start program = '/opt/jboss/bin/monit_run.sh -b
prod03catalogedge01.dc03.totalmusic.net -c prod03catalogedge01' as uid
8002 as gid 8002 timeout 1 cycle(s)
Stop program = '/bin/bash -c /opt/jboss/bin/monit_stop.sh
prod03catalogedge01.dc03.totalmusic.net > /tmp/stop.log 2>&1' as uid
8002 as gid 8002 timeout 1 cycle(s)
Pid = if changed 1 times within 1 cycle(s) then alert
Ppid = if changed 1 times within 1 cycle(s) then alert
Port = if failed
prod03catalogedge01.dc03.totalmusic.net:8080 [DEFAULT via TCP] with
timeout 5 seconds 5 times within 10 cycle(s) then alert else if passed 1
times within 1 cycle(s) then alert
System Name = localhost
Monitoring mode = active
Alert mail to = address@hidden
Alert on = No events
-------------------------------------------------------------------------------
monit daemon at 1850 awakened
--
Thanks!
-dave
--
Dave Paper address@hidden
MCSE is to computers as McDonalds Certified Chef is to fine cuisine.