From: Mike Schmidt
Subject: monit tries to resolve mail host too early; after that it seems unable to get to the network
Date: Sat, 16 Apr 2011 14:40:24 -0400
Hi,
I have some 50 systems running monit. I start monit with a 60-second delay. However, after a reboot, monit sometimes does not wait the 60 seconds: it starts up and tries to resolve the mail host too early, at a point where the network may not be ready. After that, I receive no alerts. I get this in the message file (I xxxx-ed out the hostname):
Apr 16 03:02:54 actiforme-1 monit[2564]: Starting monit HTTP server at [actiforme-1.vpn.impacts.xxxx.com:2812]
Apr 16 03:02:54 actiforme-1 monit[2564]: monit HTTP server started
Apr 16 03:02:54 actiforme-1 monit[2564]: 'system' Monit started
Apr 16 03:03:14 actiforme-1 monit[2564]: M/Monit: cannot open a connection to http://mon2.xxxx.com:8080/impact/collector -- Success
Apr 16 03:03:14 actiforme-1 monit[2564]: M/Monit: trying next server http://mon1.xxxx.com:8080/impact/collector
Apr 16 03:03:34 actiforme-1 monit[2564]: M/Monit: cannot open a connection to http://mon1.xxxx.com:8080/impact/collector -- Success
Apr 16 03:03:34 actiforme-1 monit[2564]: M/Monit: no server available
Apr 16 03:03:54 actiforme-1 monit[2564]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Success
Apr 16 03:03:54 actiforme-1 monit[2564]: No mail servers are available
Apr 16 03:03:54 actiforme-1 monit[2564]: Aborting event
Apr 16 03:03:54 actiforme-1 monit[2564]: M/Monit heartbeat started
Apr 16 03:03:54 actiforme-1 monit[2564]: 'date-time' process is not running
Apr 16 03:04:14 actiforme-1 monit[2564]: M/Monit: cannot open a connection to http://mon2.xxxx.com:8080/impact/collector -- Success
Apr 16 03:04:14 actiforme-1 monit[2564]: M/Monit: trying next server http://mon1.xxxx.com:8080/impact/collector
Apr 16 03:04:34 actiforme-1 monit[2564]: M/Monit: cannot open a connection to http://mon1.xxxx.com:8080/impact/collector -- Success
Apr 16 03:04:34 actiforme-1 monit[2564]: M/Monit: no server available
Apr 16 03:04:54 actiforme-1 monit[2564]: Cannot open a connection to the mailserver 'mailman.xxxx.com:25' -- Success
Apr 16 03:04:54 actiforme-1 monit[2564]: No mail servers are available
Apr 16 03:04:54 actiforme-1 monit[2564]: Aborting event
Apr 16 03:04:54 actiforme-1 monit[2564]: 'date-time' trying to restart
Apr 16 03:04:54 actiforme-1 monit[2564]: 'date-time' start: /sbin/service
Apr 16 03:05:15 actiforme-1 monit[2564]: 'Impact3' failed, cannot open a connection to INET[impact3.xxxx.com:443] via TCP
Apr 16 03:05:35 actiforme-1 monit[2564]: 'Impact4' failed, cannot open a connection to INET[impact4.xxxx.com:443] via TCP
..... more of the same
When I logged on to that system this morning, there was no trouble accessing the two HTTPS sites.
Here's the config:
check host Impact3 with address impact3.xxxx.com
if failed port 443 for 2 times within 3 cycles then alert
check host Impact4 with address impact4.xxxx.com
if failed port 443 for 2 times within 3 cycles then alert
Monit was still trying 8 hours later, and still reported that it could not reach impact3 and impact4.
Meanwhile, impact3 and impact4 were accessible: the application that uses them had been able to check for updates every 5 minutes since just after the reboot.
Does anybody have any idea why this happens? In this case no alerts are sent, and services are marked down when they are not.
--
Mike SCHMIDT CTO
Intello Technologies Inc. address@hidden