monit-general

Re: Failing to synchronize 'unmonitor' actions with ongoing checks: Sola


From: Nestor Urquiza
Subject: Re: Failing to synchronize 'unmonitor' actions with ongoing checks: Solaris 10 monit 5.5 possible bug
Date: Mon, 24 Sep 2012 15:29:41 -0400

Just to report that this also happens when monitoring is switched back on, for example:
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:34] info     : 'server1' monitor action done
[EDT Sep 24 15:18:34] info     : Awakened by User defined signal 1
[EDT Sep 24 15:18:34] info     : 'server2' monitor on user request
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:34] info     : 'server2' monitor action done
[EDT Sep 24 15:18:34] info     : 'server3' monitor on user request
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:34] error    : 'server1' connection failed, INET[server1:80] via TCP is not ready for i|o -- Interrupted system call
[EDT Sep 24 15:18:34] info     : 'server6' monitor on user request
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:34] info     : 'server4' monitor on user request
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:34] info     : 'server5' monitor on user request
[EDT Sep 24 15:18:34] info     : monit daemon with PID 17391 awakened
[EDT Sep 24 15:18:35] info     : 'server3' monitor action done
[EDT Sep 24 15:18:35] info     : 'server6' monitor action done
[EDT Sep 24 15:18:35] info     : 'server4' monitor action done
[EDT Sep 24 15:18:35] info     : 'server5' monitor action done
[EDT Sep 24 15:18:35] info     : Awakened by User defined signal 1
[EDT Sep 24 15:18:35] info     : 'server1' connection succeeded to INET[server1:80] via TCP


On Mon, Sep 24, 2012 at 10:14 AM, Nestor Urquiza <address@hidden> wrote:

Hi guys,

I'm not sure whether this is a problem on other OSs as well, but I believe I have found a bug in monit 5.5: at least on Solaris 10, it fails to synchronize unmonitor actions with ongoing checks. Here is how to reproduce it (tested on two different physical Solaris boxes, both Intel):

1. Configure monit to check every minute. Create several entries like the one below, checking several external servers and ports:

check host myhost with address myhost
   if failed port myport type tcp with timeout 15 seconds
   then alert

2. Issue the command below right as monit starts its check cycle (when the clock shows hh:mm:59):

monit unmonitor all

3. At random, you get an alert for at least one of the host/port combinations even though the host/port is actually available. For example:

Action: alert, Description: connection failed, INET[mssql:1433] via TCP is not ready for i|o -- Interrupted system call, Service: ptrsvr, Tested From Host: myhost

4. After issuing 'monit monitor all', no alert about the service being back up is sent, but 'monit status' does show the service as up.
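
Hitting hh:mm:59 by hand in step 2 is fiddly, so here is a minimal sketch of a helper that waits for that second and then runs the command. It assumes the check cycle starts at the top of the minute, as it did on the boxes above; the function names are made up for illustration and are not part of monit.

```python
import subprocess
import time
from datetime import datetime


def seconds_until(target_second, now=None):
    """Seconds to wait until the clock next shows hh:mm:<target_second>."""
    now = now or datetime.now()
    current = now.second + now.microsecond / 1_000_000
    # Modulo 60 wraps around to the next minute if the target has passed.
    return (target_second - current) % 60


def unmonitor_at_cycle_boundary(target_second=59):
    """Sleep until hh:mm:59, then issue the command from step 2."""
    time.sleep(seconds_until(target_second))
    # check=False: just hand back monit's exit status to the caller.
    return subprocess.run(["monit", "unmonitor", "all"], check=False).returncode
```

Calling unmonitor_at_cycle_boundary() a few times in a row should make the spurious alert from step 3 easier to catch, if the timing assumption holds on your system.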


IMO monit has a bug: it does not synchronize calls to unmonitor with the checks that are already in progress. When monit receives "unmonitor all" it should either wait for all current checks to finish, or cancel them and ignore any alert messages they would send.
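
To make the second option concrete, here is a rough sketch of the idea in Python. It is only a model of the proposed behaviour, not monit's actual code (monit is written in C), and the AlertGate class and its methods are hypothetical names: a failure reported by a check is dropped if the service was unmonitored while that check was still running.

```python
import threading


class AlertGate:
    """Hypothetical model: drop failures from services no longer monitored."""

    def __init__(self, services):
        self._lock = threading.Lock()   # serializes unmonitor and failure reports
        self._monitored = set(services)
        self.alerts = []

    def unmonitor_all(self):
        # From here on, results from in-flight checks are ignored.
        with self._lock:
            self._monitored.clear()

    def report_failure(self, name, reason):
        # Called by a check when it fails, including a check that was
        # interrupted by the unmonitor request itself.
        with self._lock:
            if name not in self._monitored:
                return False            # stale result: no alert
            self.alerts.append((name, reason))
            return True
```

With this in place, an "Interrupted system call" failure that arrives after "unmonitor all" would be discarded instead of triggering an alert.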


Makes sense?


Thanks!

-Nestor


