Re: Reliability of RPC services

l4-hurd

From:	Bas Wijnen
Subject:	Re: Reliability of RPC services
Date:	Tue, 25 Apr 2006 16:09:11 +0200
User-agent:	Mutt/1.5.11+cvs20060403
On Tue, Apr 25, 2006 at 01:06:32PM +0200, Marcus Brinkmann wrote:
> At Tue, 25 Apr 2006 12:22:23 +0200,
> Bas Wijnen <address@hidden> wrote:
> > > Coyotos will not implement "move only" capabilities. Period.
> > 
> > I'm sorry to hear that.  Not that I am convinced that this is the perfect
> > solution, but it is a solution to the problem we found, and timeouts seem
> > like a very inferior replacement as a solution.
> 
> Bas, I am not yet sure if they really are a solution to the problem.
> I am not even sure I know what the exact problem is at this point.
> This discussion has focussed on implementation details, but those are
> pretty uninteresting.  I know how "move only" capabilities can be
> implemented, it's not a big deal.  So, if we come to the conclusion
> that we absolutely and desperately want them, we won't be stuck.

Good point.  But I tend to trust Shapiro's judgement on whether such a thing
has unacceptably high costs.  So if after he understands what I'm saying (I
don't think we're there yet :-( ) he still thinks it's a bad idea to use them,
then I'm not so optimistic about using them.

> However, I am _much_ more interested in discussing what the actual
> problem is we are trying to solve.  It has to do with recovery, it has
> to do with sharing of resources, and it has to do with predictability
> of service.  These are hard problems, and it's worth thinking about
> them thoroughly before reaching any conclusions.

That would make sense, I suppose.  I have the feeling I already have a pretty
clear view of the problem, but some use cases wouldn't hurt at least.  I'll
try to come up with some below.

> One strategy to think about these problems is to try to identify what
> we want the "recovery boundaries" of the system to be (we have yet
> to define the term "recovery boundary" precisely).  Some applications
> stay and fail as a unit, while for others there is a clear
> distinction.  In some cases we may _want_ a separation but may find
> out it is impossible to achieve.

I would like to make a comparison here to capabilities vs UIDs.  When using
UIDs, the unit of security is a user account.  That unit has proven to be
small enough for many people so far.  We want to make a system with smaller
units, using capabilities.  People who see that for the first time might say
"what does it matter, the unit of security is a user account anyway".  But
because of capabilities, we can now have more security than that.  It takes
some getting used to, but once you're comfortable with it, it's really good.

I think the same is true here.  If you use normal capabilities, and you need
to set up all kinds of (expensive) things in order to be able to recover from
failure on the other side, you simply won't bother.  However, if it's a
trivial thing to do, then we may find that having more recovery boundaries is
actually a very good thing, which makes the system much more robust.

This may or may not be the case.  But to me it feels like we aren't familiar
enough with the idea of (cheaply) recovering to really grasp the consequences.

> Once we have staked out the requirements, we may approach the question
> of what techniques are appropriate to address any problems we find.
> This may be more than one.

Yes, I mumbled this several times as well. :-)  However, I currently don't see
more than two (the task-server capability replacement, and the move-only
capabilities).  But I suppose there are more.

> > There is a client C, which wants to make a call to S.  The programmer
> > thinks of it like this:
> > 
> >     call => result
> >     result is the interesting value, or the reason why the call failed.
> > 
> > This call is implemented as two steps:
> > 1. C invokes a capability to S, providing a reply capability.
> > 2. S invokes the reply capability with the result or reason of failure.
> > 
> > The problem now is that 2 may not happen.  And in that case C will not be
> > able to recover if it cannot discriminate this from S taking a long time
> > to respond.  So what are the reasons that 2 doesn't happen:
> > 
> > A. S is malicious and wants C to wait forever.
> > B. S overwrites the capability. (because of a bug)
> > D. S dies before replying. (because of a bug, or user intervention)
> > 
> > Obviously, A is not distinguishable from S taking a long time to reply
> > (except by code inspection, which is not considered here).  However, at
> > some point the user will become impatient.  Assuming that he can track the
> > problem down to S, he will kill S.  In that case, this situation becomes
> > the same as D.
> 
> This is true, although my suspicion is that the user will, at first,
> only be able to track down the problem to C.  Only with further "probing"
> may he be able to narrow it down to something else.

Yes.  Before S gets killed, probably several instances of C have been killed
already. :-)  But at some point, S itself will (hopefully) get killed.
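
To make the difference between cases A and D concrete, here is a toy Python
model of the two-step call.  Everything in it (ReplyCapability, call, the
"dropped" notification) is hypothetical and only illustrates the idea; it is
not the Coyotos or L4 interface, and it assumes the system can tell when the
only copy of the reply capability is gone.

```python
import queue
import threading


class ReplyCapability:
    """Hypothetical single-use reply capability (not a real kernel API)."""

    def __init__(self):
        self._inbox = queue.Queue(maxsize=1)
        self._lock = threading.Lock()
        self._used = False

    def _deliver(self, message):
        # Only the first message is delivered; later ones are ignored.
        with self._lock:
            if self._used:
                return False
            self._used = True
        self._inbox.put(message)
        return True

    def reply(self, result):
        """Step 2: S invokes the reply capability with the result."""
        return self._deliver(("ok", result))

    def drop(self):
        """Assumed system behaviour when the capability is lost unused."""
        return self._deliver(("dropped", None))

    def wait(self):
        return self._inbox.get()


def run_server(server, request, cap):
    try:
        server(request, cap)
    except Exception:
        pass  # S died before replying (case D)
    finally:
        cap.drop()  # no effect if S already replied


def call(server, request):
    """Step 1: C invokes S, providing a fresh reply capability."""
    cap = ReplyCapability()
    worker = threading.Thread(target=run_server, args=(server, request, cap))
    worker.start()
    outcome = cap.wait()
    worker.join()
    return outcome


def good_server(request, cap):
    cap.reply(request.upper())


def crashing_server(request, cap):
    raise RuntimeError("bug in S")


def forgetful_server(request, cap):
    return  # returns without ever replying (stands in for case B)
```

With this model, call(good_server, "ping") yields a result, while the
crashing and forgetful servers both yield a "dropped" outcome, so C can
recover.  A server that keeps the capability and never returns (case A) would
still block C forever, which is exactly the case that cannot be distinguished
from a slow S.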

> So, at least in the case where somebody (the user or C) can identify a
> problem in C, we have a solution: Giving C a nudge between the ribs
> (sending it a signal).

Sure.  But I was also thinking about situations where C isn't actually
something directly started by the user, but indirectly.  A daemon doing some
nice things for him, for example.  The user may not be aware at all that C is
running (or S, for that matter).

> The problem seems to occur if you have a sequence C->S->T, where T is broken.

Yes, this is what I meant, except that I omitted C (and used different
letters).

> > H. User intervention.  Assume everything is fine, and let the user figure it
> >    out if it isn't.  That is, the user may send some sort of signal to C
> >    telling it that the operation failed, or he just kills C completely.  The
> >    obvious drawback of this approach is that it requires very detailed
> >    knowledge on the part of the user about the inner workings of the system.
> 
> Actually, no, this part is simple.  The user just presses a button in
> the title bar of the application window that reads: "abort whatever
> you are doing, it takes too long".
> 
> The problem is not notifying C, but processes between C and S.

I considered any call operation to be a C->S interaction.  So the thing I call
C may not have a title bar.

> Another way to make progress on this issue is to consider very
> specific use cases.  Ie, define clearly what C and S are, and what
> they do and how they can fail.

Right, that's what I was planning:

A user logs in to a terminal, and his session receives the corresponding
capabilities.  One of them will be for the USB bus, where he can register any
device driver he wants.  The user registers driver D with the system and
starts using it through program P.  That is, P calls the device driver
framework, which eventually calls D to perform the operation.

D is buggy and dies without responding.  Now the device driver framework is
still waiting for it and will continue to do so until it is somehow notified
(or aborted, but that isn't a good idea, I suppose).

The user doesn't like the long wait and cancels the operation in P.  Now the
device driver framework is still waiting for D, and likely a new request from
P is considered a protocol violation.  As is a late response to P, which is no
longer accepted.

This can be solved with cancellation support.  But conceptually this is the
wrong thing to do: the system already knows that there will not be a response,
so it could simply tell the caller.  The only information such a notification
reveals is "this server is misbehaving", which is not a security problem: a
server that withholds the reply on purpose (so it isn't a bug) will not drop
the capability, but keep it, and then no notification is sent.

Also, I'm not sure if cancellation always solves this problem.  And of course,
it really is a timeout, except that the user is the clock, which ticks at an
unpredictable speed.  But it can still be considered a timeout and has all the
problems that come with it.
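
As a rough sketch of what cancellation support in the device driver framework
might look like, here is a small Python model.  The Framework class, its
method names and the request ids are all assumptions made for illustration;
nothing like this has been specified.  The point is only that a cancel removes
the pending request, so a new request from P is valid again and a late reply
from D is discarded instead of being forwarded.

```python
class Framework:
    """Hypothetical driver framework that tracks requests forwarded to D."""

    def __init__(self):
        self._pending = {}  # request id -> list collecting the client's results
        self._next_id = 0

    def submit(self, client, operation):
        """P starts an operation; the framework would forward it to D."""
        request_id = self._next_id
        self._next_id += 1
        self._pending[request_id] = client
        return request_id

    def cancel(self, request_id):
        """The user gave up waiting: forget the request."""
        return self._pending.pop(request_id, None) is not None

    def driver_reply(self, request_id, result):
        """D answers; replies to unknown or cancelled requests are dropped."""
        client = self._pending.pop(request_id, None)
        if client is None:
            return False
        client.append(result)
        return True
```

Note that nothing in this sketch decides when cancel is called.  That decision
still comes from the user, so the user remains the clock.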

I was planning to think of more use cases, but I'll delay that, since writing
this already takes me too much time.

Thanks,
Bas

-- 
I encourage people to send encrypted e-mail (see http://www.gnupg.org).
If you have problems reading my e-mail, use a better reader.
Please send the central message of e-mails as plain text
   in the message body, not as HTML and definitely not as MS Word.
Please do not use the MS Word format for attachments either.
For more information, see http://129.125.47.90/e-mail.html