monit-dev

Re: failing more than once before alert


From: Martin Pala
Subject: Re: failing more than once before alert
Date: Tue, 02 Aug 2005 00:46:40 +0200
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050513 Debian/1.7.8-1

Hi,

event-ratio-dependent rules are on our todo list: http://www.tildeslash.com/monit/doc/next.php#20

The syntax there is just an idea and may change, because it is probably not suitable for error-level zones (such as critical/warning/info).

Here is another rough idea. It would let a rule define the error severity, the threshold value, and the related event ratio needed to move the event state to a different error level (in either direction, to a worse or a better state):

 --8<--
 # If the space usage is greater than 70% for 10 cycles,
 # set the service state to failed[informational].
 # If the space usage falls below 70% and does not exceed
 # it again for 15 cycles, reset the state to passed and
 # send an alert:
 if space usage > 70% for 10 cycles then alert
    and if recovered for 15 cycles then alert

 # If the space usage exceeds 80% for 20 cycles, increase the
 # severity to failed[warning] and send an alert. As soon as
 # the usage falls below 80% (by default on the first occurrence,
 # as in the current monit version), clear the warning severity,
 # try to start the process (stopped by the critical rule below)
 # and send an alert. The previous rule will still match, however,
 # so the state stays failed and the severity only drops to
 # informational unless the usage falls below 70%:
 if space usage > 80% for 20 cycles
    then severity warning and alert
    and if recovered for 30 cycles then exec 'start the_process'

 # If the space usage exceeds 99%, set the state to
 # failed[critical] and stop the process that writes to the
 # filesystem. When the usage falls below 99%, the state
 # changes to warning (the nearest lower error zone). If the
 # usage falls below 80% as well, monit will attempt to
 # start the process again (see the rule above):
 if space usage > 99%
    then severity critical and exec 'pkill the_process'
 --8<--

The severity can be optional (defaulting to failed[informational]), and the cycle count can be optional too (when omitted, monit acts on the first occurrence, as in the current monit version).
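
To make the intended behaviour a bit more concrete, here is a rough sketch in Python (not monit code, and not a committed design). The names CycleRule and worst_severity are invented for this illustration, and it assumes that "for N cycles" means N consecutive cycles and that the service reports the highest severity among the rules currently failed:

```python
from dataclasses import dataclass
from typing import Optional

SEVERITY_ORDER = ["informational", "warning", "critical"]


@dataclass
class CycleRule:
    """Hypothetical rule: fail after `fail_cycles` consecutive readings above
    `limit`, recover after `recover_cycles` consecutive readings at or below it."""
    limit: float
    fail_cycles: int = 1        # default: act on the first occurrence
    recover_cycles: int = 1
    severity: str = "informational"
    failed: bool = False
    _streak: int = 0            # consecutive readings contradicting the state

    def update(self, value: float) -> Optional[str]:
        """Feed one cycle's reading; return 'failed' or 'recovered' on a change."""
        exceeded = value > self.limit
        if exceeded == self.failed:
            # The reading agrees with the current state, so the streak is broken.
            self._streak = 0
            return None
        self._streak += 1
        needed = self.recover_cycles if self.failed else self.fail_cycles
        if self._streak < needed:
            return None
        self.failed = not self.failed
        self._streak = 0
        return "failed" if self.failed else "recovered"


def worst_severity(rules) -> str:
    """Return the highest severity among failed rules, or 'passed' if none."""
    active = [r.severity for r in rules if r.failed]
    return max(active, key=SEVERITY_ORDER.index) if active else "passed"
```

With one rule per zone (70%, 80%, 99%), dropping out of a higher zone leaves the lower rules failed, so the reported severity falls to the nearest lower zone rather than straight to passed.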


Martin


Ben Hartshorne wrote:
Hi, all,

I have been getting an incredible number of false positive pages
recently.  I have to believe that it's something having to do with my
application, but most of the pages I get correct themselves one cycle
later.  I put in a test to hit google on port 80, and even that paged me
once in the middle of the night.

This pissed me off enough to do something about it.  Reading through the
list archives, I found this post:
http://lists.gnu.org/archive/html/monit-general/2005-04/msg00016.html
It gave me a nice idea (and I followed his example) but I really didn't
like the fact that after a single failure, the service requires human
intervention to restart monitoring (since the timeout function disables
monitoring for that service).

So I started making code changes.  Unfortunately, I didn't do it the
*right* way, because it's been way too long since I played with flex
etc.  Instead, I took advantage of the "if x restarts in y cycles then
timeout," but eviscerated the ACTION_TIMEOUT functionality.  It no
longer actually times out, it just alerts.

What I really wanted was "if x restarts in y cycles then alert," but I
couldn't figure the right way to do it.

Since the timeout functionality was designed to start counting at the
first failure and, if a service actually times out, to stop monitoring
it, the counter manipulation didn't work so well when timeouts could be
triggered and recovered often.

The end result:  I have a rule like:
set alert address@hidden {timeout}
check host RadixTest with address cryptio.net
        start program = "/bin/true"
        stop program = "/bin/true"
        if 2 restarts within 3 cycles then timeout
                if failed url http://cryptio.net/~ben/lilo.conf
                      and content == "default=linuxprep"
                      then restart

and I only get paged if it fails twice within three attempts.
(i.e.:
fail pass fail == page
fail pass pass == no page
fail fail pass == page)
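
The paging behaviour listed above can be reproduced with a plain sliding window. The following Python sketch is only an illustration and is not what the attached patch does (the patch reuses monit's restart counter); WindowAlert and its parameters are names made up here:

```python
from collections import deque


class WindowAlert:
    """Hypothetical helper: page when at least `failures` of the last
    `window` check results were failures."""

    def __init__(self, failures: int = 2, window: int = 3):
        self.failures = failures
        # A bounded deque silently drops the oldest result once it is full.
        self.recent = deque(maxlen=window)

    def update(self, ok: bool) -> bool:
        """Record one check result; return True if a page should go out."""
        self.recent.append(ok)
        failed = sum(1 for result in self.recent if not result)
        return failed >= self.failures


def pages_after(results) -> bool:
    """Feed a whole sequence and report whether the last cycle pages."""
    alert = WindowAlert()
    page = False
    for ok in results:
        page = alert.update(ok)
    return page
```

Here `pages_after([False, True, False])` and `pages_after([False, False, True])` page, while `pages_after([False, True, True])` does not.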

I also made it decrement the pass-counter slowly, so that fail pass fail pass fail == failure-page, recovery-page, failure-page
i.e. if it has failed recently, be more paranoid.

One annoyance is that the check_timeout function comes before the
service test instead of afterwards, so I actually get paged at the
beginning of the cycle following the failure condition.  I'm checking
every 60 seconds, so I can deal with that.  A correct solution wouldn't
exhibit this problem...  ;)

Another less-than-desirable trait: IMHO, the right way to do this
kind of thing is to use the leaky-bucket algorithm (as many network
protocols do), in which failures add up quickly but subside slowly.  You
would have to specify a rate at which the failure counter drops in
addition to the thresholds.
e.g. 5 failures within 10 attempts, decrease the failure counter at a
rate of 1 per 5 successes.
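
A rough Python sketch of that leaky-bucket idea follows. It is an illustration only and not part of the attached patch; LeakyBucket, threshold and leak_every are names invented here, and the sketch assumes the "within 10 attempts" window is replaced by the leak rate itself:

```python
class LeakyBucket:
    """Hypothetical failure counter: each failure adds 1, and every
    `leak_every` successes remove 1. Alert while the level is at or
    above `threshold`."""

    def __init__(self, threshold: int = 5, leak_every: int = 5):
        self.threshold = threshold
        self.leak_every = leak_every
        self.level = 0        # current failure count in the bucket
        self.successes = 0    # successes seen since the last leak

    def update(self, ok: bool) -> bool:
        """Record one check result; return True while an alert is due."""
        if ok:
            self.successes += 1
            if self.successes == self.leak_every:
                # Enough successes accumulated: drain one failure.
                self.level = max(0, self.level - 1)
                self.successes = 0
        else:
            # Failures fill the bucket immediately. The success count is
            # deliberately not reset here (an assumption of this sketch).
            self.level += 1
        return self.level >= self.threshold
```

A hard failure alerts after `threshold` consecutive failed checks, a service that alternates between failing and passing still fills the bucket, and an occasional failure (one in six checks with these defaults) drains away before it can accumulate.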

This allows a certain amount of flakiness, but alerts you quickly on a
hard failure, and alerts you if it gets too flaky.

Anyway...   In case any of you are interested, I have attached a patch
of the modifications I made (to the head of the CVS tree)

-ben

p.s.  is this the right list? or should I have posted this to the
monit-general?  It seems much more high volume -- all I see go by here
are announcements of newly checked in files...



