Re: Reliability of RPC services

l4-hurd

From:	Marcus Brinkmann
Subject:	Re: Reliability of RPC services
Date:	Tue, 25 Apr 2006 17:31:43 +0200
User-agent:	Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Tue, 25 Apr 2006 16:09:11 +0200,
Bas Wijnen <address@hidden> wrote:
> > pretty uninteresting.  I know how "move only" capabilities can be
> > implemented, it's not a big deal.  So, if we come to the conclusion
> > that we absolutely and desperately want them, we won't be stuck.
> 
> Good point.  But I tend to trust Shapiro's judgement on whether such a thing
> has unacceptably high costs.  So if after he understands what I'm saying (I
> don't think we're there yet :-( ) he still thinks it's a bad idea to use them,
> then I'm not so optimistic about using them.

I think he understands the proposals at hand very well, but really,
really does not like them ;) I think the overhead of
move-only/send-once capabilities (whatever, you and I both know what I
mean ;) is not disastrous, but as I said before, the much more
interesting question is whether they are the appropriate technique.
There are big question marks right there.

[...]
> failure on the other side, you simply won't bother.  However, if it's a
> trivial thing to do, then we may find that having more recovery boundaries is
> actually a very good thing, which makes the system much more robust.

Good point.  I would say let's start with the cases where we feel a
recovery boundary is absolutely necessary, and work our way towards
other cases.

[...]

> > > Obviously, A is not distinguishable from S taking a long time to reply
> > > (except by code inspection, which is not considered here).  However, at
> > > some point the user will become impatient.  Assuming that he can track the
> > > problem down to S, he will kill S.  In that case, this situation becomes
> > > the same as D.
> > 
> > This is true, although my suspicion is that the user will, at first,
> > only be able to track down the problem to C.  Only further "probing"
> > he may be able to narrow it down to something else.
> 
> Yes.  Before S gets killed, probably several instances of C have been killed
> already. :-)  But at some point, S itself will (hopefully) get killed.

I was saying that myself, but I am not sure.  A typical user may never
be able to identify S in some situations.

> > So, at least in the case where somebody (the user or C) can identify a
> > problem in C, we have a solution: Giving C a nudge between the ribs
> > (sending it a signal).
> 
> Sure.  But I was also thinking about situations where C isn't actually
> something directly started by the user, but indirectly.  A daemon doing some
> nice things for him, for example.  The user may not be aware at all that C is
> running (or S, for that matter).

Ok, good point.  However, in that case, you have a big problem: How do
you identify that there exists a problem in the first place?  Say at
some point the search function on the desktop stops working, because
the indexing daemon is hung up on some failed filesystem.  How can you
ever figure that one out as a normal user?  I think that in such
cases, you must use very different techniques (watchdogs?).
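
To make the watchdog idea a bit more concrete, here is a minimal
sketch in Python.  Everything in it is an assumption for illustration:
the Watchdog class, the service names, and the idea that daemons report
a periodic heartbeat are hypothetical and not part of any existing Hurd
interface.  It only shows that a silent daemon can be noticed without
the user having to know that it exists.

```python
import time


class Watchdog:
    """Records heartbeats from named services and reports the silent ones."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence tolerated (assumed policy)
        self.clock = clock       # injectable clock, so the sketch is testable
        self.last_seen = {}      # service name -> time of last heartbeat

    def heartbeat(self, name):
        """Called by (or on behalf of) a service to say it is still alive."""
        self.last_seen[name] = self.clock()

    def stale(self):
        """Return the names of services silent for longer than the timeout."""
        now = self.clock()
        return sorted(name for name, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

Note that this is still a timeout, only moved from the user into a
system component, so it inherits the usual problem: a slow daemon and a
hung daemon look the same to it.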

> > Another way to make progress on this issue is to consider very
> > specific use cases.  Ie, define clearly what C and S are, and what
> > they do and how they can fail.
> 
> Right, that's what I was planning:
> 
> A user logs in to a terminal, and his session receives the corresponding
> capabilities.  One of them will be for the usb bus, where he can register any
> device driver he wants.  The user registers driver D with the system and
> starts using it through program P.  That is, P calls the device driver
> framework, which eventually calls D to perform the operation.
> 
> D is buggy and dies without responding.  Now the device driver framework is
> still waiting for it and will continue to do so until it is somehow notified
> (or aborted, but that isn't a good idea, I suppose).
> 
> The user doesn't like the long wait and cancels the operation in P.  Now the
> device driver framework is still waiting for D, and likely a new request from
> P is considered a protocol violation.  As is a late response to P, which is no
> longer accepted.

Are the device drivers (D) provided by the operating system or the
user?  If they are provided by the user, I don't think the driver
framework should make an upcall to them.  If instead all drivers are
system code, and the user just activates and deactivates them, I agree
with your scenario.

> This can be solved with cancellation support.  But conceptually this is the
> wrong thing to do: the system knows that there will not be a response.

This depends on how "D" dies.  Does it fault without being able to
process the fault?  In that case, if it was started by the driver
framework, the driver framework will probably get the fault message
and clean up the process.  The framework could then notice the
failure and restart the driver, or otherwise recover.

If it just hangs, then the system does not know that there will not be
a response.

If D is removed by an explicit request, for example a "deactivation"
(maybe triggered by unplugging and re-plugging the device), or by a
well-managed fault, the device driver framework again gets some
explicit notification that it is removed (think SIGCHLD).  In fact,
because the driver is started by the framework, the framework is
always in the know: There is an explicit parent-child relationship
between the driver framework and the driver, and I think that is
enough to ensure that there are sufficient mechanisms (or potential
mechanisms) to handle this case.
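
As a rough illustration of the parent-child argument, here is a small
Python sketch.  It assumes a POSIX-like process model, which is only an
analogy and not how a driver framework on L4 would necessarily be
built, and run_driver is a hypothetical name.  The point is just that
whoever started the child is told how it ended.

```python
import subprocess
import sys


def run_driver(argv):
    """Start a driver as a child process and report how it ended."""
    child = subprocess.Popen(argv)
    status = child.wait()   # the parent is always told when the child exits
    if status == 0:
        return "exited cleanly"
    if status < 0:
        # On POSIX, a negative value means the child was killed by a signal.
        return f"killed by signal {-status}"
    return f"failed with status {status}"


if __name__ == "__main__":
    # A stand-in "driver" that dies with exit status 3.
    print(run_driver([sys.executable, "-c", "raise SystemExit(3)"]))
```

This covers the driver dying, not the driver hanging: a child that
never exits keeps wait() blocked, which is exactly the case where the
system does not know that there will be no response.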

> Also, I'm not sure if cancellation always solves this problem.

It doesn't.  It can only clean up as far as the first failure in the
invocation chain, unless you nudge (SIGHUP?) other processes within
the chain.
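
A toy model of that limitation, in Python.  It assumes that a cancel
is forwarded hop by hop along the invocation chain and that a hung or
dead service forwards nothing.  The function and the service names are
made up for illustration.

```python
def propagate_cancel(chain, failed):
    """Return the services that can act on a cancel forwarded along chain.

    chain  -- service names in invocation order, caller first
    failed -- set of services that are hung or dead and forward nothing
    """
    reached = []
    for service in chain:
        if service in failed:
            break            # the cancel stops at the first failure
        reached.append(service)
    return reached
```

With chain ["P", "framework", "D", "bus"] and D failed, only P and the
framework clean up.  Whatever sits behind D keeps waiting unless it is
nudged directly.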

> And of course, it really is a timeout, except that the user is the
> clock, which ticks at an unpredictable speed.  But it can still be
> considered a timeout and has all the problems that come with it.
>
> I was planning to think of more use cases, but I'll delay that, since writing
> this already takes me too much time.

No hurry.  It's a deep problem, no sense in rushing it.

Thanks,
Marcus
