Re: NFS is going down, et al [was: pidfiles aka. Re: [CVS] unix socket s
From: Christian Hopp
Subject: Re: NFS is going down, et al [was: pidfiles aka. Re: [CVS] unix socket support added]
Date: Tue, 6 Aug 2002 16:14:24 +0200 (CEST)
On 6 Aug 2002, Jan-Henrik Haukeland wrote:
Hi everyone!
> Christian Hopp <address@hidden> writes:
>
(...)
> > Let me cite "man mount" on this:
>
> > The program accessing a file on a NFS mounted file system
> > will hang when the server crashes. The process cannot be
> > interrupted or killed unless you also specify intr.
>
> Yes, and I believe that alarm is exactly such an interrupt. If a process
> receives the alarm signal it _has_ to act on it. The default behavior
> is to terminate the process unless an alarm handler was installed. So
> alarm (2) should not have any problems jolting monit out of a file
> read block.
I have attached a little bit of code to prove that you are wrong.
The program opens an ISO image from my NFS-mounted home directory and
reads it for 5 seconds, after which it gets a timeout. I ran it once
normally and a second time with the network cable pulled out. I timed
both runs, and you can see...
(address@hidden) ~/compile/monit/tests> time ./nfs_test
Timeout occured!
Time spent in user mode (CPU seconds) : 0.030s
Time spent in kernel mode (CPU seconds) : 0.810s
Total time : 0:05.00s
CPU utilisation (percentage) : 16.8%
Times the process was swapped : 0
Times of major page faults : 62
Times of minor page faults : 152
Exit 1
(address@hidden) ~/compile/monit/tests> time ./nfs_test
Timeout occured!
Time spent in user mode (CPU seconds) : 0.030s
Time spent in kernel mode (CPU seconds) : 0.780s
Total time : 0:22.72s
CPU utilisation (percentage) : 3.5%
Times the process was swapped : 0
Times of major page faults : 62
Times of minor page faults : 127
Exit 1
The alarm signal is still handled... but later (after 22.72 s instead
of 5 s)! So, as I told you, a process blocked on NFS can't be woken up
by anything.
I saw something even better on my Linux machine last week: processes
that tried to access /usr/include on ext3 hung and could not be
killed, not even with KILL. A reboot helped... to change it to
/usr/share/something. But monit would simply be stuck in that situation.
> > We have unfortunately a very unreliable network right now.
>
> You're in an excellent position to test this then :) I'm betting you a
> bottle of beer that alarm will work.
Our address is on our web page. (-: But for me something
non-alcoholic, please!
(...)
> > If we are in "what if" discussions here are some other things to think
> > about.
> >
> > * Monit checks a server which is defunct, i.e. a zombie. Is it in
> > "good health" or not? Pidfile and pid do match. I don't know what
> > its ports do (do they still connect or not?).
>
> They should not accept a connection, but I'm not quite sure since the
> kernel handles socket connections and delivers them to the process.
> Anyway, a zombie process, even if it accepts a connection, should not
> pass the default connection test (the one with select) and of course
> not any protocol test.
I can test that some other day; it's easy to build a zombie test bench. (-:
> But it's still a valid question, especially for daemons without
> network code, like crond. This could be solved if I ever get around
> to hack the process status code I was planning to do (see item 6. in
> the next release plan). Maybe you would like to give it a stab?
I think I can do it. I have already looked into /proc access on
Linux and Solaris. It's going to be quite OS dependent. I will give it
a generic framework so that it can be accessed in an OS-independent
way (or can anyone recommend a good library for this?), but the
backend will differ, especially between Linux and Solaris.
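As a rough illustration of the Linux side of such a backend, here is a minimal sketch that reads the process state from /proc/<pid>/stat and reports whether a process is a zombie. The function names proc_state and is_zombie are hypothetical and not part of monit. Solaris exposes /proc in a different, binary format, so it would need its own backend behind the same interface.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Linux-only sketch: return the state letter from /proc/<pid>/stat
 * ('R' running, 'S' sleeping, 'D' uninterruptible, 'Z' zombie, ...),
 * or 0 if the process does not exist or the file cannot be parsed.
 */
char proc_state(pid_t pid)
{
    char path[64];
    char line[512];
    char *ok, *p;
    FILE *f;

    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL)
        return 0;
    ok = fgets(line, sizeof line, f);
    fclose(f);
    if (ok == NULL)
        return 0;

    /* Format: "pid (comm) state ...". comm may itself contain spaces
     * or parentheses, so search for the LAST ')' in the line. */
    p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return 0;
    return p[2];
}

/* A zombie has exited but has not yet been reaped by its parent. */
int is_zombie(pid_t pid)
{
    return proc_state(pid) == 'Z';
}
```

A check like is_zombie(pid_from_pidfile) would catch the case where the pidfile and pid match but the process is defunct, including daemons without network code.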
> > * A start/stop script returns with error, should monit still try to
> > (re)start/stop the process?
>
> Good one, I thought about this someday when I was looking at the code.
> At least an alert message should be sent if monit cannot start the
> process. Now, only a log entry is made.
What I mean is that some programs take a lot of CPU and memory when
they start, and that adds up if a start is attempted every cycle. Of
course the timeout statement can help with that. How should this be
dealt with: not starting the process again, or just sending a mail?
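To make the two options concrete, here is a minimal sketch of one possible policy: run the start script, look at its exit status, and stop retrying (and alert) after a fixed number of consecutive failures. Everything here is hypothetical, including struct service, try_start and the action names; it is not how monit currently behaves.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical per-service bookkeeping. */
struct service {
    const char *start_cmd;   /* shell command that starts the service */
    int failures;            /* consecutive failed start attempts */
    int max_failures;        /* give up after this many failures */
};

enum action { STARTED, RETRY_LATER, GIVE_UP_AND_ALERT };

/* Run cmd via the shell; return its exit status, or -1 if it could
 * not be run or did not exit normally (e.g. killed by a signal). */
int run_command(const char *cmd)
{
    int status = system(cmd);
    if (status == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* One start attempt per monitoring cycle. */
enum action try_start(struct service *s)
{
    if (s->failures >= s->max_failures)
        return GIVE_UP_AND_ALERT;        /* already given up: do not rerun */

    if (run_command(s->start_cmd) == 0) {
        s->failures = 0;                 /* success resets the counter */
        return STARTED;
    }

    s->failures++;
    return s->failures >= s->max_failures ? GIVE_UP_AND_ALERT : RETRY_LATER;
}
```

With max_failures set to 1 this means "never start it again after an error"; a larger value allows a few retries before the alert mail is sent.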
Bye,
Christian
--
Christian Hopp email: address@hidden
Institut für Elektrische Informationstechnik fon: +49-5323-72-2113
Technische Universität Clausthal fax: +49-5323-72-3197
pgpkey: https://www.iei.tu-clausthal.de/pgp-keys/chopp.key.asc (2001-11-22)
nfs_test.c
Description: Text Data