hi all,
I have a very, very _very_ strange problem with apache2. It
appears to be going down randomly, and monit seems to be sort of
screwing up the restart. I'll explain:
Monit has for a while been reporting that its connection to
apache fails, tries to restart it, the restart fails, then tries
to start it, then successfully connects. This process takes over
2 minutes in some cases (I have a 2min monit monitoring interval).
Analysis of apache logs and var/log/* wasn't turning up much of
anything, so I wrote a little script that runs from cron every 15
seconds. It does a simple grep on curl mysite.com and if it can't
connect will send a very useful, detailed dump of the state of
the machine (monit status output for apache, process list,
specific grep of the last 100 log entries from apache logs, etc).
Here's an excerpt from syslog - note that the "logger" entries
with apache2 test PASSED|FAILED are my little custom script which
is running every 15 seconds.
################ /var/log/syslog
Apr 20 16:05:46 myhost last message repeated 3 times
Apr 20 16:06:02 myhost logger: apache2 test PASSED
Apr 20 16:06:33 myhost last message repeated 2 times
Apr 20 16:06:47 myhost logger: apache2 test PASSED
Apr 20 16:07:02 myhost logger: apache2 test PASSED
Apr 20 16:07:46 myhost last message repeated 3 times
Apr 20 16:08:02 myhost logger: apache2 test PASSED
Apr 20 16:08:46 myhost last message repeated 3 times
Apr 20 16:09:01 myhost /USR/SBIN/CRON[17282]: (root) CMD ( [ -d /
var/ lib/php4 ] && find /var/lib/php4/ -type f -cmin +$(/usr/lib/
php4/ maxlifetime) -print0 | xargs -r -0 rm)
### remove old session files
Apr 20 16:09:03 myhost logger: apache2 test PASSED
Apr 20 16:09:17 myhost logger: apache2 test PASSED
Apr 20 16:09:34 myhost logger: apache2 test PASSED
Apr 20 16:09:42 myhost monit[24001]: HTTP: error receiving data
-- Resource temporarily unavailable
Apr 20 16:09:42 myhost monit[24001]: 'apache2' failed protocol
test [HTTP] at INET[www.mywebsite.com:80].
Apr 20 16:09:42 myhost monit[24001]: 'apache2' trying to restart
Apr 20 16:09:42 myhost monit[24001]: 'apache2' stop: /etc/init.d/
apache2
Apr 20 16:09:46 myhost monit[24001]: 'apache2' start: /etc/init.d/
apache2
## note here that apache is not running for a FULL 2 minutes - ouch
Apr 20 16:09:46 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:10:01 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:10:16 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:10:31 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:10:46 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:11:01 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:11:16 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:11:31 myhost logger: apache2 test FAILED - sending
notification
Apr 20 16:11:46 myhost logger: apache2 test FAILED - sending
notification
## samples of the notification output from my script are below.
Apr 20 16:11:46 myhost monit[24001]: 'apache2' process is not running
Apr 20 16:11:46 myhost monit[24001]: 'apache2' trying to restart
Apr 20 16:11:46 myhost monit[24001]: 'apache2' start: /etc/init.d/
apache2
Apr 20 16:11:47 myhost monit[14467]: 'apache2' failed to start
## I'm guessing monit just doesn't wait long enough to run the
protocol test
## once the service fails, which is why it thinks that it failed
to start.
Apr 20 16:11:48 myhost monit[32720]: 'apache2' started
Apr 20 16:12:02 myhost logger: apache2 test PASSED
Apr 20 16:12:47 myhost last message repeated 3 times
Apr 20 16:13:01 myhost logger: apache2 test PASSED
Apr 20 16:13:46 myhost last message repeated 3 times
Apr 20 16:13:46 myhost monit[24001]: 'apache2' process is running
with pid 32724
Apr 20 16:13:50 myhost monit[24001]: 'apache2' connection passed
Apr 20 16:14:02 myhost logger: apache2 test PASSED
Apr 20 16:14:47 myhost last message repeated 3 times
Apr 20 16:15:02 myhost logger: apache2 test PASSED
Apr 20 16:15:46 myhost last message repeated 3 times
Apr 20 16:16:02 myhost logger: apache2 test PASSED
Apr 20 16:16:47 myhost last message repeated 3 times
Apr 20 16:17:01 myhost /USR/SBIN/CRON[24860]: (root) CMD ( run-
parts --report /etc/cron.hourly)
Apr 20 16:17:02 myhost logger: apache2 test PASSED
Apr 20 16:17:46 myhost last message repeated 3 times
Apr 20 16:18:02 myhost logger: apache2 test PASSED
#etc...
################ last output from my notification script above
MONIT STATUS
------------------------------------------
The monit daemon 4.5 uptime: 20d 19h 7m
System 'lamp.turingstudio.com'
load average [0.39] [0.23] [0.16]
cpu 21.3%us 9.5%sy 0.0%wa
memory usage 4069324 kB [68.4%]
data collected Thu Apr 20 16:09:30 2006
Process 'apache2'
status Connection failed
monitoring status monitored
pid -1
parent pid -1
uptime 0m
childrens 0
memory kilobytes 0
memory kilobytes total 0
memory percent 0.0%
memory percent total 0.0%
cpu percent 0.0%
cpu percent total 0.0%
port response time -1.000s to www.mywebsite.com:
80/ catalog/list.php [HTTP]
data collected Thu Apr 20 16:09:46 2006
File 'apache2_init'
status accessible
monitoring status monitored
permission 700
uid 0
gid 0
timestamp Wed Mar 8 17:55:52 2006
size 3583 B
data collected Thu Apr 20 16:07:30 2006
File 'apache2_bin'
status accessible
monitoring status monitored
permission 755
uid 0
gid 0
timestamp Sun Oct 30 15:11:32 2005
size 385484 B
data collected Thu Apr 20 16:07:30 2006
PROCESS LIST
------------------------------------------
PID TTY STAT TIME COMMAND
1 ? Ss 9:13 ini
13870 ? Ss 1:37 /sbin/syslogd
20745 ? S 0:00 /bin/sh /usr/bin/mysqld_safe
20781 ? S 0:00 \_ /bin/sh /usr/bin/mysqld_safe
20782 ? S 0:13 \_ /usr/sbin/mysqld --basedir=/usr
-- datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/
mysqld.pid --skip-locking --port=3306 --socket=/var/run/mysqld/
mysqld.sock
20784 ? S 6:55 | \_ /usr/sbin/mysqld --
basedir=/ usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/run/mysqld/ mysqld.pid --skip-locking --port=3306 --socket=/
var/run/mysqld/ mysqld.sock
20785 ? S 0:00 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20786 ? S 0:00 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20787 ? S 0:00 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20788 ? S 0:00 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20801 ? S 97:36 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20802 ? S 70:01 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20803 ? S 0:00 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20804 ? S 0:22 | \_ /usr/sbin/mysqld --
basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/
var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/
var/run/ mysqld/mysqld.sock
20783 ? S 0:00 \_ logger -p daemon.err -t
mysqld_safe -i -t mysqld
29792 ? Ss 0:39 /usr/sbin/sshd
13825 ? Ss 0:00 \_ sshd: turing [priv]
14435 ? S 0:00 \_ sshd: address@hidden/0
14625 pts/0 Ss 0:00 \_ -bash
14917 pts/0 S+ 0:00 \_ bash
31705 ? Ss 1:13 /usr/lib/postfix/master
31708 ? S 0:38 \_ qmgr -l -t fifo -u -c
3397 ? S 0:00 \_ pickup -l -t fifo -u -c
2496 ? S 0:00 \_ cleanup -z -t unix -u -c
2624 ? S 0:00 \_ trivial-rewrite -n rewrite -t unix -
u -c
2883 ? S 0:00 \_ smtp -t unix -u -c
3074 ? S 0:00 \_ smtp -t unix -u -c
6016 ? S 0:00 \_ proxymap -t unix -u
24001 ? Ss 5:34 /usr/sbin/monit -Ic /etc/monit/monitrc
24033 ? S 0:50 \_ /usr/sbin/monit -Ic /etc/monit/monitrc
24066 ? S 1:40 \_ /usr/sbin/monit -Ic /etc/monit/
monitrc
14467 ? S 0:00 \_ /usr/sbin/monit -Ic /etc/monit/
monitrc
12992 ? Ss 0:04 /usr/sbin/cron
18441 ? S 0:00 \_ /USR/SBIN/CRON
18446 ? Ss 0:00 \_ /bin/sh -c sleep 45; /root/
lamp- apache2-watcher.sh
31797 ? S 0:00 \_ /bin/sh -c sleep 45; /root/
lamp-apache2-watcher.sh
32064 ? R 0:00 \_ ps axf
### note the absence of apache from the processlist!!
I'm so glad to be running rails and lighttpd now. I have never had
a moment's trouble with a crashing rails app - they all have
insanely detailed backtraces - and lighty simply never goes down -
it also starts up in under a second.
Apache2 has turned into a big ugly pig. :(
_a
--
alex black, founder
the turing studio, inc.
510.666.0074
address@hidden
http://www.turingstudio.com
2600 10th street, suite 635
berkeley, ca 94710
--
To unsubscribe:
http://lists.nongnu.org/mailman/listinfo/monit-general