monit-dev
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: monit restart bug/race condition (3.1 and current CVS version behavi


From: Martin Pala
Subject: Re: monit restart bug/race condition (3.1 and current CVS version behavior)
Date: Mon, 10 Feb 2003 02:08:56 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20021226 Debian/1.2.1-9

Jan-Henrik Haukeland wrote:

Martin Pala <address@hidden> writes:

[..]
You can replicate the problem easily when you let monit run in
daemon mode and wakeup every 1s.

I think we will need to add a restart action to check_process() which
does the restart in one atomic operation (instead of the current
check_process(stop) -> wait -> check_process(start). That is, call
do_stop -> do_start in sequence in check_process(). And to avoid any
race condition (if the monit daemon thread has a low poll frequency)
use an action-mutex so only one thread can execute an action in
check_process() at the same time.

Something like

check_process(P, restart) {

 LOCK(action_mutex)
     do_stop(P);
     do_start(P);
 END_LOCK;

}

I can look into it, so you can concentrate on the device stuff :-)

I made i fix for the 2.) problem - the code is now simplier little bit. I though about mutex for race condition too, but there's one problem - it isn't possible to use it just in check_process() and d_check_process(), otherwise it will break stop method, because of following test on the start of do_validate():

if(do_not_validate(p)) return;

In the case, that service stop will be requested (monit running in daemon mode), monit httpd will lock the action mutex, run program's stop method and wait for program to stop (which could take some time, because the application need to finish - databases, etc.). After program stop, it will set do_validate to false, which will disable monitoring.

However, in the meantime between stop method call and do_validate flag disabling (by monit httpd), monit main thread passes do_not_validate(p) test (because do_validate is yet true). It detects that the service don't answer and tries to start the service. It will block because of action_mutex - as soon as the mutex is freed, it will run start method and start application and its monitoring again (which is bad in this case).

I tried it on simple test - in this case monit without action mutex is less affected then monit with action mutex in check_process(), which failes every time. To solve the race condition, it will be needed to implement probably per-process mutex before do_validate flag test and cervlet check_process() calls. I did it too, but i found a deadlock in that case => it needs more work (last 'development' patch attached - probably there could be better way to do it).

I must go to work tomorrow morning, so i will continue either tomorrow evening (or you can try it :)

Btw. what about to release 3.2 after race condition fix? There were few bugs in 3.1, so probably we can make bugfix release 3.2 (plus few extensions which are already made for it) and schedule device tests for 3.3, which i hope will be ready soon :)

Cheers,
Martin

diff -Naur monit/http/cervlet.c monit.race_condition/http/cervlet.c
--- monit/http/cervlet.c        2003-02-10 01:53:36.000000000 +0100
+++ monit.race_condition/http/cervlet.c 2003-02-10 01:42:04.000000000 +0100
@@ -539,7 +539,9 @@
 
        if(p->start)
 
-         check_process(name, action);
+         LOCK(p->mutex)
+           check_process(name, action);
+         END_LOCK;
 
        else {
 
@@ -555,7 +557,9 @@
 
        if(p->stop)
 
-         check_process(name, action);
+         LOCK(p->mutex)
+           check_process(name, action);
+         END_LOCK;
 
        else {
 
@@ -570,8 +574,10 @@
       if( is(action, "restart") ) {
 
        if(p->start && p->stop) {
-
-         check_process(name, action);
+ 
+         LOCK(p->mutex)
+           check_process(name, action);
+         END_LOCK;
 
        } else {
 
diff -Naur monit/monitor.h monit.race_condition/monitor.h
--- monit/monitor.h     2003-02-10 01:53:33.000000000 +0100
+++ monit.race_condition/monitor.h      2003-02-10 01:44:00.000000000 +0100
@@ -311,6 +311,7 @@
   ProcInfo_T procinfo;                        /**< Data for the procfs check */
   
   /** For internal use */
+  pthread_mutex_t mutex;          /**< Mutex used for action synchronization */
   struct myprocess *next;                         /**< next process in chain */
   struct myprocess *next_depend;           /**< next depend process in chain */
 } *Process_T;
diff -Naur monit/validate.c monit.race_condition/validate.c
--- monit/validate.c    2003-02-10 01:53:35.000000000 +0100
+++ monit.race_condition/validate.c     2003-02-10 01:38:31.000000000 +0100
@@ -130,6 +130,9 @@
 
   ASSERT(p);
   
+  /* Obtain process action mutex */
+  LOCK(p->mutex)
+
   /* First, check for pre-conditions */
   if(do_not_validate(p)) return;
   
@@ -253,6 +256,9 @@
   
   /* Remove the SIGTERM block */
   pthread_sigmask(SIG_SETMASK, &os, NULL);
+
+  /* Release process action mutex */
+  END_LOCK;
  
 }
 

reply via email to

[Prev in Thread] Current Thread [Next in Thread]