Re: Reliability of RPC services

l4-hurd

From:	Marcus Brinkmann
Subject:	Re: Reliability of RPC services
Date:	Tue, 25 Apr 2006 13:06:32 +0200
User-agent:	Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Tue, 25 Apr 2006 12:22:23 +0200,
Bas Wijnen <address@hidden> wrote:
> > Coyotos will not implement "move only" capabilities. Period.
> 
> I'm sorry to hear that.  Not that I am convinced that this is the perfect
> solution, but it is a solution to the problem we found, and timeouts seem like
> a very inferior replacement as a solution.

Bas, I am not yet sure if they really are a solution to the problem.
I am not even sure I know what the exact problem is at this point.
This discussion has focussed on implementation details, but those are
pretty uninteresting.  I know how "move only" capabilities can be
implemented, it's not a big deal.  So, if we come to the conclusion
that we absolutely and desperately want them, we won't be stuck.

However, I am _much_ more interested in discussing what the actual
problem is we are trying to solve.  It has to do with recovery, it has
to do with sharing of resources, and it has to do with predictability
of service.  These are hard problems, and it is worth thinking about
them thoroughly before reaching any conclusions.

One strategy to think about these problems is to try to identify what
we want the "recovery boundaries" of the system to be (we have yet
to define the term "recovery boundary" precisely).  Some applications
stand or fall as a unit, while for others there is a clear
distinction.  In some cases we may _want_ a separation but may find
out it is impossible to achieve.

Once we have staked out the requirements, we may approach the question
of what techniques are appropriate to address any problems we find.
There may be more than one.

> There is a client C, which wants to make a call to S.  The programmer thinks of
> it like this:
> 
>       call => result
>       result is the interesting value, or the reason why the call failed.
> 
> This call is implemented as two steps:
> 1. C invokes a capability to S, providing a reply capability.
> 2. S invokes the reply capability with the result or reason of failure.
> 
> The problem now is that 2 may not happen.  And in that case C will not be able
> to recover if it cannot discriminate this from S taking a long time to
> respond.  So what are the reasons that 2 doesn't happen:
> 
> A. S is malicious and wants C to wait forever.
> B. S overwrites the capability. (because of a bug)
> D. S dies before replying. (because of a bug, or user intervention)
> 
> Obviously, A is not distinguishable from S taking a long time to reply (except
> by code inspection, which is not considered here).  However, at some point the
> user will become impatient.  Assuming that he can track the problem down to S,
> he will kill S.  In that case, this situation becomes the same as D.

This is true, although my suspicion is that the user will, at first,
only be able to track down the problem to C.  Only by further "probing"
may he be able to narrow it down to something else.

Cancellation of the request in C is no problem.  However, this
cancellation will have no influence on anything but C (at this point,
I am not considering cancellation forwarding).

This is not a big issue, as we have only two processes involved, C
and S.  We are now in a situation where C has fully recovered and S is
still broken, but S was broken in the first place, so we are now fine.

So, at least in the case where somebody (the user or C) can identify a
problem in C, we have a solution: Giving C a nudge between the ribs
(sending it a signal).
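
As a rough sketch of this two-process case (illustrative only, not
code from any Hurd or Coyotos tree): asyncio tasks stand in for
processes and for the reply capability, and the names call,
broken_server and good_server are made up here.  C waits for either
the reply or the nudge, whichever comes first:

```python
import asyncio

async def call(server, request, cancel):
    """Invoke `server`; return when it replies or when `cancel` is set."""
    reply = asyncio.ensure_future(server(request))   # stands in for the reply capability
    nudge = asyncio.ensure_future(cancel.wait())     # stands in for the signal sent to C
    done, pending = await asyncio.wait(
        {reply, nudge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                                # C stops waiting on whatever is left
    if reply in done:
        return ("ok", reply.result())
    return ("cancelled", None)

async def broken_server(request):
    await asyncio.Event().wait()                     # hypothetical S that never replies

async def good_server(request):
    return request.upper()                           # hypothetical S that replies at once

async def demo():
    cancel = asyncio.Event()
    # The "user" loses patience after 50 ms and nudges C.
    asyncio.get_running_loop().call_later(0.05, cancel.set)
    hung = await call(broken_server, "read", cancel)
    fine = await call(good_server, "read", asyncio.Event())
    return hung, fine

RESULT = asyncio.run(demo())
```

In this sketch C recovers from the hung server without S taking part
at all, which matches the point above: the cancellation influences
nothing but C.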

The problem seems to occur if you have a sequence C->S->T, where T is broken.
Let me give an example:

C has transitively read-only access to a directory.  The directory
server now must downgrade the permissions to all files in the
directory.  Downgrading a permission could be a kernel function, using
a user-accessible chunk of the protected payload that can be bit-wise
cleared.  In this case, the directory server does not need to make a
call.  However, in general, not all proxying functions can be done in
the kernel, so this is a limited solution.  In general, the directory
server would have to invoke the object to ask for a mutated capability
(say, one with a reduced permission, or a different facet type).
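
To make the "bit-wise cleared" idea concrete, here is a minimal
sketch.  The Capability type, the permission bits and the downgrade
function are all hypothetical names chosen for illustration; they are
not an existing kernel interface.  The only property it is meant to
show is that the operation can drop rights but never add them, so no
call to the object's server is needed:

```python
from dataclasses import dataclass

# Hypothetical permission bits in the user-accessible part of the payload.
READ, WRITE, EXEC = 0b001, 0b010, 0b100

@dataclass(frozen=True)
class Capability:
    obj: str      # the object this capability designates
    perms: int    # user-accessible permission bits

def downgrade(cap, clear_mask):
    """Return a copy of `cap` with the bits in `clear_mask` cleared.

    Bits can only be cleared, never set, so the result never carries
    more authority than the original capability.
    """
    return Capability(cap.obj, cap.perms & ~clear_mask)

rw = Capability("file", READ | WRITE)
ro = downgrade(rw, WRITE)          # drop write permission
same = downgrade(ro, WRITE)        # clearing an already clear bit changes nothing
```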

The directory server now sits between C and the servers it has entries
for, and invokes these servers.  Some of these servers may not be very
reliable.  If the server fails while the directory server is in a
call, and the client C cancels, the directory server ends up stuck in
the middle.
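
The same situation as a small simulation, again only a sketch under
assumptions: asyncio tasks stand in for processes, and the names
broken_T and directory_server are invented for the example.  C gives
up on its request, but nothing tells the directory server, which
remains blocked in its own call:

```python
import asyncio

async def broken_T(request):
    await asyncio.Event().wait()            # hypothetical T that never replies

async def directory_server(request, state):
    state["in_call"] = True                 # D is now inside its call to T
    result = await broken_T(request)
    state["in_call"] = False                # never reached while T is broken
    return result

async def demo():
    state = {"in_call": False}
    cancel = asyncio.Event()
    handler = asyncio.ensure_future(directory_server("lookup", state))
    nudge = asyncio.ensure_future(cancel.wait())
    # The "user" nudges C after 50 ms.
    asyncio.get_running_loop().call_later(0.05, cancel.set)
    done, _ = await asyncio.wait(
        {handler, nudge}, return_when=asyncio.FIRST_COMPLETED)
    client = "ok" if handler in done else "cancelled"
    # C has recovered, but D was never told and is still waiting on T.
    stuck = state["in_call"] and not handler.done()
    handler.cancel()                        # tidy up the simulation only
    return client, stuck

CLIENT, STUCK = asyncio.run(demo())
```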

Now, there is a solution to this: We can start a new directory server
specifically for each client process that needs to be proxied
("transitive read-only" etc).  Then, a whole range of techniques can
be used to make C and this directory server (call it D) cooperate on
the matter.  However, this solution has performance costs as well as
complexity costs.  Starting a new D, D', for the special use of C, is
only the beginning of a solution, which adds a whole bunch of new
problems as well.

Also, this strategy may not be applicable to all "recovery boundary"
issues.  It's not a uniform solution.

> H. User intervention.  Assume everything is fine, and let the user figure it
>    out if it isn't.  That is, the user may send some sort of signal to C
>    telling it that the operation failed, or he just kills C completely.  The
>    obvious drawback of this approach is that it requires very detailed
>    knowledge on the part of the user about the inner workings of the system.

Actually, no, this part is simple.  The user just presses a button in
the title bar of the application window that reads: "abort whatever
you are doing, it takes too long".

The problem is not notifying C, but notifying the processes between C and S.

Maybe cancellation support is actually what we want.  I have to think
about that.
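
For what it is worth, one possible shape of such cancellation support
is forwarding the cancellation down the call chain.  The following is
only a sketch under assumptions: asyncio tasks stand in for processes,
asyncio's task cancellation stands in for whatever mechanism the
system would really provide, and the names are invented.  When C
abandons its request, the intermediate server is told as well and can
give up its own outstanding call:

```python
import asyncio

async def broken_T(request):
    await asyncio.Event().wait()            # hypothetical T that never replies

async def directory_server(request, log):
    try:
        return await broken_T(request)
    except asyncio.CancelledError:
        # D learns that its client gave up and can release its resources.
        log.append("D: call to T abandoned")
        raise

async def demo():
    log = []
    handler = asyncio.ensure_future(directory_server("lookup", log))
    await asyncio.sleep(0.05)               # stands in for the user losing patience
    handler.cancel()                        # C forwards its cancellation to D
    try:
        await handler
    except asyncio.CancelledError:
        log.append("C: request cancelled")
    return log

LOG = asyncio.run(demo())
```

Note that T itself is not repaired by any of this; the sketch only
shows the processes between C and T getting unstuck.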

Another way to make progress on this issue is to consider very
specific use cases.  I.e., define clearly what C and S are, and what
they do and how they can fail.

Thanks,
Marcus
