monit-general

Re: Monit not restarting a service reliably


From: address@hidden
Subject: Re: Monit not restarting a service reliably
Date: Mon, 3 Jun 2019 17:37:56 +0200

Hi,

Since Monit 5.16.0, the exec action runs only once, when the service changes 
state. In your case the service never transitioned back to the "succeeded" 
state, so Monit saw no new state change and did not repeat the exec action.

If you want the exec action retried for as long as the service remains in the 
failed state, use the "repeat" option.

Snip from monit 5.16.0 changelog which provides more details:

--8<--
New: The exec action is now executed only once, on state change, same way as
the alert action. The new "repeat" option allows to repeat the exec action
after given number of cycles if the error persists.  Syntax:
        if <test> then exec <script> repeat every <x> cycles
If you want to get the old behaviour, use "repeat every 1 cycle". Example:
        if failed port 1234 then exec "/usr/bin/myscript.sh" repeat every 5 cycles
--8<--
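
Applied to the configuration quoted below, the rule might look roughly like 
this. Treat it as a sketch that I have not tested against your setup: it reuses 
your own test and command unchanged, adds only the "repeat" clause, and shows 
only the port test, not the rest of your control file. With "set daemon 60", 
"repeat every 1 cycle" should re-run the restart about once a minute while the 
test keeps failing; pick a larger number if the service needs more time to 
come up.

        # Sketch only: same check as in the original message, plus "repeat"
        check host mysite.com with address mysite.com
            if failed
                port 443
                protocol https
                with ssl options {verify: enable}
                for 2 cycles
            then exec "/usr/bin/supervisorctl restart mysite"
                repeat every 1 cycle

Running "monit -t" checks the control file syntax before you reload.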

Best regards,
Martin


> On 31 May 2019, at 19:14, Jan Rychter <address@hidden> wrote:
> 
> Hi,
> 
> I'm looking for help, because I can't figure out what I'm doing wrong. I have 
> a simple monit setup, which is supposed to monitor a web server and restart 
> it if anything seems wrong.
> 
> This seems to work but not always. Monit does restart the service, but on 
> subsequent failures it just notices that the service isn't working and 
> doesn't act anymore.
> 
> Example from the log, where the service was restarted, but went down again, 
> and monit didn't do anything:
> 
> [CEST May 31 06:44:11] info     : 'triac.mysite.com' Monit 5.16 started
> [CEST May 31 09:36:29] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:37:39] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:37:39] info     : 'mysite.com' exec: /usr/bin/supervisorctl
> [CEST May 31 09:38:49] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:39:59] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:41:09] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:42:19] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:43:29] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:44:39] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:45:50] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:47:00] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> [CEST May 31 09:48:10] error    : 'mysite.com' failed protocol test [HTTP] at [mysite.com]:443 [TCP/IP SSL] -- HTTP: Error receiving data -- Resource temporarily unavailable
> 
> The net result is that the service doesn't work and monit just sits there, 
> knowing that the service failed the protocol test, but doing nothing about it.
> 
> I suspect this is because monit does not notice that the service was OK after 
> restarting for a moment, so it does not notice another transition from OK to 
> failed.
> 
> Here is the relevant part of the configuration (nearly all of it):
> 
> set daemon 60
> check host mysite.com with address mysite.com
> if failed
>  port 443
>  protocol https
>  with ssl options {verify: enable}
>  for 2 cycles
> then exec "/usr/bin/supervisorctl restart mysite"
> if 20 restarts within 60 cycles then unmonitor
> 
> Is there a way to achieve unconditional actions? E.g. "even though I haven't 
> noticed the service to transition from failed to working, restart it anyway 
> after 60 seconds if it is still in the failed state"
> 
> Any help would be much appreciated.
> 
> --J.



