Re: Reliability of RPC services

l4-hurd

From:	Bas Wijnen
Subject:	Re: Reliability of RPC services
Date:	Tue, 25 Apr 2006 22:46:17 +0200
User-agent:	Mutt/1.5.11+cvs20060403
On Tue, Apr 25, 2006 at 05:31:43PM +0200, Marcus Brinkmann wrote:
> > > > Obviously, A is not distinguishable from S taking a long time to reply
> > > > (except by code inspection, which is not considered here).  However,
> > > > at some point the user will become impatient.  Assuming that he can
> > > > track the problem down to S, he will kill S.  In that case, this
> > > > situation becomes the same as D.
> > > 
> > > This is true, although my suspicion is that the user will, at first,
> > > only be able to track down the problem to C.  Only with further
> > > "probing" may he be able to narrow it down to something else.
> > 
> > Yes.  Before S gets killed, probably several instances of C have been killed
> > already. :-)  But at some point, S itself will (hopefully) get killed.
> 
> I was saying that myself, but I am not sure.  A typical user may never
> be able to identify S in some situations.

This is true.  On current systems, this type of problem typically leads to
people not using the full capacity of their systems (because they don't know
it's available).  If we have a solution to this (for the case of hanging
servers), then I think that would be a big improvement over the current
situation.

> > > So, at least in the case where somebody (the user or C) can identify a
> > > problem in C, we have a solution: Giving C a nudge between the ribs
> > > (sending it a signal).
> > 
> > Sure.  But I was also thinking about situations where C isn't actually
> > something directly started by the user, but indirectly.  A daemon doing
> > some nice things for him, for example.  The user may not be aware at all
> > that C is running (or S, for that matter).
> 
> Ok, good point.  However, in that case, you have a big problem: How do
> you identify that there exists a problem in the first place?  Say at
> some point the search function on the desktop stops working, because
> the indexing daemon is hung up on some failed filesystem.  How can you
> ever figure that one out as a normal user?  I think that in such
> cases, you must use very different techniques (watchdogs?).

I like Shapiro's idea about watchdogs which don't bite, but only bark. ;-)
(He didn't use that terminology, but I think it's nice. :-) )  In other words,
the watchdog doesn't kill the process, it just notifies the user that there
may be a problem with it.  Suppose there is a list of processes with problems,
which is usually empty.  If some processes appear in it under heavy load,
nothing is wrong.  However, if some processes are there even without load,
there is likely a problem.
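
A minimal sketch of such a barking watchdog, in Python for brevity.  All
names here (BarkingWatchdog, heartbeat, bark) are hypothetical, and the
heartbeat mechanism and the "loaded" flag are my assumptions about how
"has a problem" and "heavy load" would be detected; nothing here is an
existing Hurd interface.

```python
import time


class BarkingWatchdog:
    """Tracks heartbeats and reports late processes; it never kills anything."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before a process is suspect
        self.clock = clock          # injectable, so the sketch can be tested
        self.last_seen = {}         # process name -> time of last heartbeat

    def heartbeat(self, name):
        """Record that the named process is still making progress."""
        self.last_seen[name] = self.clock()

    def suspects(self):
        """Return the names of processes that have been silent for too long."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

    def bark(self, system_loaded):
        """Return a message for the user, or None if there is nothing to report.

        Late processes under heavy load are expected, so they are not reported
        (an assumption of this sketch).
        """
        late = self.suspects()
        if not late or system_loaded:
            return None
        return "possibly hung: " + ", ".join(late)
```

The point of the design is that bark() only returns a message; what to do
with the suspect process is left to the user.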

> > A user logs in to a terminal, and his session receives the corresponding
> > capabilities.  One of them will be for the USB bus, where he can register
> > any device driver he wants.  The user registers driver D with the system
> > and starts using it through program P.  That is, P calls the device driver
> > framework, which eventually calls D to perform the operation.
> > 
> > D is buggy and dies without responding.  Now the device driver framework
> > is still waiting for it and will continue to do so until it is somehow
> > notified (or aborted, but that isn't a good idea, I suppose).
> > 
> > The user doesn't like the long wait and cancels the operation in P.  Now
> > the device driver framework is still waiting for D, and likely a new
> > request from P is considered a protocol violation.  As is a late response
> > to P, which is no longer accepted.
> 
> Are the device drivers (D) provided by the operating system or the
> user?

By the user.  The idea is that you can bring your own driver for your device,
and don't need the system administrator to install it in order to use your
device.  I'd expect a Hurd system to be able to do that. :-)

> If they are provided by the user, I don't think the driver framework should
> make an upcall to them.

Why not?  If I bring a USB printer with me for which no driver is installed,
and I bring a driver as well, I should be able to use it.  However, programs
using the printer cannot accept my driver.  They will call the device driver
framework and ask it for a list of printers.  So in order to use it, the
printer must show up in that list.  Which means it must be plugged into the
device driver framework.
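
To illustrate what "plugged into the framework" means here, a rough
sketch in Python.  The DriverFramework class and its register and
list_devices methods are made up for this example; they only show that a
user-supplied driver has to be registered before programs can see the
device in the list.

```python
class DriverFramework:
    """Toy registry: programs ask it for devices, drivers register with it."""

    def __init__(self):
        self._drivers = {}  # device name -> (device class, driver object)

    def register(self, device_name, device_class, driver):
        """Plug a driver (system-provided or user-provided) into the framework."""
        self._drivers[device_name] = (device_class, driver)

    def list_devices(self, device_class):
        """Return the names of all registered devices of the given class."""
        return sorted(name for name, (cls, _) in self._drivers.items()
                      if cls == device_class)
```

A program asking for list_devices("printer") sees my printer only after
my driver has been registered.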

> If instead all drivers are system code, and the user just activates and
> deactivates them, I agree with your scenario.

I don't think it matters much; the scenario can also happen in that case.

> > This can be solved with cancellation support.  But conceptually this is
> > the wrong thing to do: the system knows that there will not be a response.
> 
> This depends on how "D" dies.  Does it fault without being able to
> process the fault?  In that case, if it was started by the driver
> framework, the driver framework will probably get the fault message
> and clean up the process.  In that case, it could potentially notice
> that and restart it, or otherwise recover.

No, it's not started by the driver framework.  Even if the framework insists
on starting the driver itself, it can be useful to write a proxy device driver
which gets a capability to the actual driver at startup.  In that case too,
the driver framework isn't the parent of the actual driver (and neither is the
proxy).

> If it just hangs, then the system does not know that there will not be
> a response.

Indeed.  That is the usual problem for hangs.  A barking watchdog seems
attractive for that problem.

> > Also, I'm not sure if cancellation always solves this problem.
> 
> It doesn't.  It can only clean up as far as the first failure in the
> invocation chain, unless you nudge (SIGHUP?) other processes within
> the chain.

Hmm, yes.  That may be tricky, because it is likely that the user doesn't know
what these processes are.
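
A small sketch of why cancellation stops at the first failure, again in
Python.  The Process class, its alive flag and the chain helper are
hypothetical; a process with alive=False stands for one that is dead or
hung and therefore cannot forward anything.

```python
class Process:
    """One link in an invocation chain, e.g. P -> framework -> D."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive          # False models a dead or hung process
        self.callee = None          # next process in the chain, if any
        self.cancelled = False

    def cancel(self):
        """Cancel the pending call here and forward the cancellation downstream.

        Returns the names of the processes that actually cleaned up.
        """
        if not self.alive:
            return []               # cannot react, so the cancellation stops here
        self.cancelled = True
        cleaned = [self.name]
        if self.callee is not None:
            cleaned += self.callee.cancel()
        return cleaned


def chain(*processes):
    """Link processes so that each one calls the next; return the first."""
    for caller, callee in zip(processes, processes[1:]):
        caller.callee = callee
    return processes[0]
```

With a chain P -> framework -> S -> X where S hangs, cancelling at P
cleans up P and the framework only.  X never hears about it unless
someone nudges it directly, and for that you need to know it exists.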

Thanks,
Bas

-- 
I encourage people to send encrypted e-mail (see http://www.gnupg.org).
If you have problems reading my e-mail, use a better reader.
Please send the central message of e-mails as plain text
   in the message body, not as HTML and definitely not as MS Word.
Please do not use the MS Word format for attachments either.
For more information, see http://129.125.47.90/e-mail.html