monit-dev

Re: solaris and kill(pid, 0)


From: Jan-Henrik Haukeland
Subject: Re: solaris and kill(pid, 0)
Date: 23 Oct 2002 12:00:57 +0200
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Civil Service)

Christian Hopp <address@hidden> writes:

> On 23 Oct 2002, Jan-Henrik Haukeland wrote:
> 
> > Christian Hopp <address@hidden> writes:
> >
> > > kill(pid, 0) does not work on Solaris if you do not own the process!
> > > )-: There must be a different possibility to find out if a process
> > > is running or not.
> >
> > What's the problem?  We use following code to check if a process is
> > running in util.c:is_process_running
> >
> >     if(kill(pid, 0) == 0 || errno == EPERM)
> >
> > Now, if you own the process the first part is TRUE, if you do not own
> > the process *but* the process is running, the second part is true.
> > Doesn't this little hack work anymore on your system?
> 
> It sometimes does... sometimes not. )-:
> 
> During "validate" it works... in the httpd server code it doesn't.
> In the httpd server (not status) errno is set to zero and kill is
> returning -1.
> 
> I don't know what's wrong there???

Could this be a classic synchronization problem? If errno is a single
global variable shared by both of our threads, either thread can
overwrite it by calling any function that sets errno. This usually goes
unnoticed because the monit daemon thread sleeps between polls and the
http thread isn't used much[1]. But with a short poll cycle and
frequent visits to the monit web pages, one thread could in theory
change errno between the other thread's kill() call and its errno
check.

  thread 1: 

    errno= 0;
    if(kill(pid, 0) == 0 || errno == EPERM)
                         
                           ^
                           |    
                           |

  thread 2 comes in here and sets errno to 0 while we expect
  EPERM.

It may be far-fetched, but this example or something similar matches
the pattern you have experienced pretty well, that is, it works only
sometimes. Anyway, what is the exact code in the http module that
fails? Maybe it's possible to deduce more from the code and context.
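
For reference, here is a minimal sketch of the check under discussion,
written so that errno is copied into a local variable right after
kill() fails. The function body is an assumption for illustration, not
the actual code in util.c. Copying errno only narrows the window; if
errno really is one variable shared by both threads, the race is still
there.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical variant of is_process_running: returns 1 if a process
 * with this pid exists, 0 otherwise. */
int is_process_running(pid_t pid)
{
    int saved_errno;

    if (pid <= 0)
        return 0;            /* kill() treats pid <= 0 as a group target */

    errno = 0;
    if (kill(pid, 0) == 0)
        return 1;            /* process exists and we may signal it */

    saved_errno = errno;     /* copy before another call can overwrite it */
    return saved_errno == EPERM;  /* exists, but owned by another user */
}
```

Calling is_process_running(getpid()) should return 1, and a pid that
no longer exists should give 0, because kill() then fails with ESRCH.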


[1] Aha, maybe the kill(pid, 0) signal on Solaris wakes up the monit
daemon in its sleep phase. This sounds plausible, and then you would
have a race condition on which thread sets errno. One solution could be
to mask out the 0 signal, if that is possible at all, with
signal(0, SIG_IGN)?
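
Whether signal 0 can be masked is easy to probe. POSIX describes 0 as
the null signal: kill() only performs its error checking and sends
nothing, and signal() is specified to fail with EINVAL for an invalid
signal number. The sketch below checks both; the function names are
made up for this example, and I have not run it on Solaris.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Returns 1 if signal(0, SIG_IGN) is rejected with EINVAL, meaning
 * the null signal cannot be given a disposition at all. */
int null_signal_is_rejected(void)
{
    errno = 0;
    return signal(0, SIG_IGN) == SIG_ERR && errno == EINVAL;
}

/* Returns 1 if probing our own pid with the null signal succeeds.
 * Nothing is delivered; kill() only does the existence check. */
int probe_self(void)
{
    return kill(getpid(), 0) == 0;
}
```

If both functions return 1, there is nothing to mask and the wake-up
theory would need another explanation.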

Another solution could be to look for the process in the /proc file
system, since we already have the code for that. Unfortunately this
will not work on FreeBSD, since only root is allowed to read kvm, and
I'm not sure about Solaris. Since monit should be runnable by any
user, I feel this is an absolute last-resort solution.
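
A rough sketch of what such a /proc lookup might look like, assuming
the platform exposes one directory per process as /proc/<pid> and lets
an ordinary user stat() it. That assumption does not hold everywhere,
so the helper (a hypothetical name, not existing monit code) returns
-1 when it cannot even see its own entry.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* 1 if the directory /proc/<pid> can be stat()ed, 0 otherwise. */
static int has_proc_entry(pid_t pid)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof path, "/proc/%ld", (long)pid);
    return stat(path, &st) == 0;
}

/* Hypothetical helper: 1 = running, 0 = not running,
 * -1 = /proc is missing or unusable on this system. */
int proc_is_process_running(pid_t pid)
{
    if (!has_proc_entry(getpid()))
        return -1;           /* cannot see ourselves: do not trust /proc */
    return has_proc_entry(pid);
}
```

A caller would treat -1 as "fall back to kill(pid, 0)" rather than as
"not running".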

-- 
Jan-Henrik Haukeland



