Hi,
I have a bunch of Monit rules that perform checks on a service:
- One "check process" rule (existence and port checks)
  - if does not exist for 5 cycles then start
  - if failed port XXXX for 6 times within 8 cycles then restart
  - if failed port YYYY for 6 times within 8 cycles then restart
  - if failed port ZZZZ for 6 times within 8 cycles then restart
- Three "check program" rules with custom checks
  - if status != 0 for 5 times within 10 cycles then restart
  - if status != 0 for 5 times within 10 cycles then restart
  - if status != 0 for 5 times within 10 cycles then restart
- One rule to check log content
  - check file + if content = "BIG ERROR" then restart
The start/stop rules are:
start program = "/bin/systemctl start myservice"
stop program = "/bin/systemctl stop myservice"
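
For illustration, here is a minimal sketch of what a configuration like the one described above might look like. The check names, the pidfile path, the script path, the log path and the group name are all hypothetical placeholders, and XXXX/YYYY/ZZZZ stand for the real port numbers. Only one of the three "check program" rules is shown; the other two would follow the same pattern. As far as I understand, each check needs its own start/stop programs for "restart" to work, which is why every check can trigger a restart independently of the others.

```
# Sketch only: names, paths and group are hypothetical placeholders.
# XXXX / YYYY / ZZZZ must be replaced by the real port numbers.

check process myservice with pidfile /run/myservice.pid
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if does not exist for 5 cycles then start
  if failed port XXXX for 6 times within 8 cycles then restart
  if failed port YYYY for 6 times within 8 cycles then restart
  if failed port ZZZZ for 6 times within 8 cycles then restart

# One of the three custom checks (hypothetical script path)
check program myservice_custom1 with path "/usr/local/bin/myservice_check1.sh"
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if status != 0 for 5 times within 10 cycles then restart

# Log content check (hypothetical log path)
check file myservice_log with path /var/log/myservice.log
  group mygroup
  start program = "/bin/systemctl start myservice"
  stop program  = "/bin/systemctl stop myservice"
  if content = "BIG ERROR" then restart
```

With this layout, five independent checks can each run the same stop/start pair without knowing about the others.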
There is no dependency between the checks at the Monit level, but they are all part of the same set of groups.
The problem is that, due to multiple issues, I got a "restart" storm:
- some port checks failed -> restart issued
- this led to errors in the custom scripts -> restart issued
- the log content check lags behind -> restart issued
Either MyService or its systemd configuration is not well designed, so I got an "already bind exception" as systemd tried to start several instances at the same time 🤔
So the port checks failed again, systemd killed the wrong instance, MyService was blocked, another restart was issued, and so on.
I had to shut down Monit to prevent further actions (I could also have used monit -g group unmonitor), kill every instance of my service, start it correctly, and then reactivate Monit.
Questions:
- Is there a native way to prevent Monit from issuing the same start/stop commands more than once within a defined time frame?
- Could Monit's dependency feature between checks help? I don't see how it would.
- Any other hints/proposals (aside from increasing the values of "for N times within T cycles" to delay the restart)?
Remark: maybe exploring the systemd settings StartLimitIntervalSec and StartLimitBurst could help.
Best Regards.