Re: Reliability of RPC services


From: Jonathan S. Shapiro
Subject: Re: Reliability of RPC services
Date: Sat, 22 Apr 2006 10:22:25 -0400

On Sat, 2006-04-22 at 10:28 +0200, Marcus Brinkmann wrote:
> At Sat, 22 Apr 2006 00:40:53 -0400,
> "Jonathan S. Shapiro" <address@hidden> wrote:

> > I think that it might be useful to make this more precise. What you mean
> > is that S must not be *destroyed* before T has a chance to reply. If S
> > simply overwrites the capability, no problem will arise.
> 
> Oh!  But that is insufficient, because it does not achieve the level
> of robustness I think is important to achieve.  If the kernel only
> generates reply messages on destruction, but not on overwrite, then
> accidental overwrite due to a bug can cause the caller to hang
> indefinitely.

Marcus:

You are making a mistake that a designer must try to avoid: you are
issuing requirements before you have data. At this time, you have
no evidence that this is a problem in practice, yet you are preparing to
insist on overwhelming design damage in order to achieve an objective
whose actual need has not been demonstrated. A comment, and then several
points of explanation.

COMMENT:

The KeyKOS mechanism, send on containing object destroy, was used
successfully in high-reliability production systems for nearly 25 years.
At the very least, this suggests that careful thought should be given
before you declare that it is insufficient. Perhaps it *is*
insufficient, but the two or three hours of thought that you have given
this question is not enough to come to any conclusion, and concrete
experience is needed before making a change that is so potentially
damaging.

EXPLANATIONS/COMMENTS:

1. The behavior that you want can *only* be accomplished with reference
counting. Implementing reference counting requires that every time a
capability is copied or overwritten, the target object associated with
that capability must be brought into memory. Even if this is only done
for sender capabilities, you may reasonably assume that, on average,
this will add a significant *multiplier* (probably 3x or 4x) to the
*average* IPC cost. (A sketch of where this cost shows up follows these
numbered points.)

2. Notifying on overwrite violates isolation. It discloses to the
destination process whenever a capability is dropped by a third party.
In general the receiver has no right to know that the holder has the
capability at all.

3. In fact, notifying on containing object destroy has the same problem,
and I don't like it at all. The only reason that this was accepted in
KeyKOS was that it cannot be defended against by the server -- the
server has no control over its storage being revoked.

4. You are failing to consider the larger problem. There are *hundreds*
of ways that a server can fail to meet the requirements of a client.
This is just one of them. Given that this is true, it is not at all
obvious that fixing this issue at such great cost is justified.

5. Note that reference counting doesn't really solve the problem either.
A server could simply store an extra copy of the reply descriptor and
forget about it for a very long time. The existence of this descriptor
is sufficient to prevent the "last drop" message from being sent for an
indefinite time.

6. You will soon generalize this to RcvQ send capabilities, but in that
case the problem is unsolvable because the message will commonly be
delayed (therefore lost).
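
To make concrete where the reference-counting cost in point 1 shows up,
here is a rough sketch. Everything in it -- the structure layout and
the function names -- is invented for illustration; it is not KeyKOS,
EROS, or any real kernel code.

struct object_header {
    unsigned long refcount;
    /* ... the rest of the object ... */
};

struct cap_slot {
    struct object_header *target;   /* NULL when the slot is empty */
    /* ... rights bits, type tag, etc. ... */
};

/* Hypothetical: deliver the "last reference dropped" message.  The
 * only point here is that it runs in the middle of the IPC path. */
static void send_last_drop_notification(struct object_header *obj)
{
    (void)obj;   /* a real kernel would queue or send a message here */
}

static void cap_slot_overwrite(struct cap_slot *slot,
                               struct object_header *new_target)
{
    struct object_header *old = slot->target;

    if (old != NULL) {
        /* The expensive step: 'old' may be paged out, so what was a
         * plain pointer store becomes an object fault to fetch it. */
        if (--old->refcount == 0)
            send_last_drop_notification(old);
    }

    if (new_target != NULL)
        new_target->refcount++;

    slot->target = new_target;
}

Every capability transferred by an IPC has to run something like this
on its destination slot, which is where the 3x-4x multiplier comes
from.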

> I think the following
> condition should be sufficient: The kernel guarantees that a reply
> message is sent _at the latest_ when the callee process is destroyed.
> This should hold true independent of what the callee does between
> being invoked and exiting.  In particular, simply dropping the reply
> capability should not change this guarantee (which in effect means
> that the kernel has to invoke the reply capability when it is
> dropped).

Several problems:

1. This requires dynamic storage allocation in the kernel. Dynamic
storage allocation in the kernel implies denial of resource
vulnerabilities and makes any statement of kernel robustness impossible.

2. Your description fails in the case where C calls S which forwards to
T, because the exit of S will cause an improper reply. (A sketch of
this scenario follows these points.)

3. Your proposal seems to have the side effect (I am not certain) of
dictating a hierarchical calling relationship. This is bad.
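
Here is the scenario from point 2, sketched with entirely hypothetical
names (cap_t, lookup_service, forward_request, process_exit are
inventions for illustration, not any real API):

typedef unsigned long cap_t;

/* Hypothetical helpers, stubbed so the sketch is self-contained. */
static cap_t lookup_service(const char *name) { (void)name; return 1; }
static void  forward_request(cap_t dest, cap_t reply_cap)
{ (void)dest; (void)reply_cap; }
static void  process_exit(int code) { (void)code; }

/* C calls S; S hands C's reply capability on to T and then exits.
 * Under "reply at the latest when the callee exits", the kernel
 * generates a reply to C at S's exit -- even though T still holds
 * the reply capability and fully intends to answer later. */
static void server_S(cap_t reply_from_C)
{
    cap_t T = lookup_service("T");
    forward_request(T, reply_from_C);
    process_exit(0);   /* kernel would reply to C here, prematurely */
}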

> Moreover, your semantics
> (issuing notifications on destruction only, not on overwrite), break
> down completely if the callee is malicious (for example because it has
> been compromised).

Marcus: have some more beer. You are not thinking clearly.

ANY time that a client sends to a hostile recipient, the client cannot
rely on ANYTHING. It cannot rely on getting a correct answer. It cannot
rely on getting a well-formed answer. In fact, it cannot rely on getting
an answer at all!

The only solution to this is that clients must not rely on unreliable
code for anything at all. This is axiomatic.

> In short: I think overwriting a capability and destroying it should
> behave the same with regards to this issue.

So let me see if I have this right: you want to send a death notice on
overwrite.

So this implies that every IPC must check the destination slots to see
if they cause such an overwrite, and must issue death notice calls on
those capabilities. If the IPC payload contains up to N capabilities,
and we assume that the death notice itself does not transfer a
capability, then every IPC has just been multiplied by up to N IPCs.
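
To illustrate (names invented; this is only a sketch of the shape of
the problem, not a proposal for how the check would actually be coded):

struct cap {
    void *target;    /* NULL when the slot is empty */
};

/* Hypothetical: one extra kernel-generated message per overwrite. */
static void send_death_notice(struct cap *victim)
{
    (void)victim;   /* a real kernel would send a message here */
}

/* Delivering a message that carries n capabilities: every destination
 * slot must be inspected, and each occupied slot that gets overwritten
 * costs a death notice -- up to n extra sends tacked onto one IPC. */
static void deliver_caps(struct cap *dest, const struct cap *src, int n)
{
    for (int i = 0; i < n; i++) {
        if (dest[i].target != NULL)
            send_death_notice(&dest[i]);
        dest[i] = src[i];
    }
}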

This just won't work, Marcus. Technically it can be done, but the
resulting system will perform much worse than Mach.

> I am happy to defer discussion of implementation details, but I would
> like to clarify the issue of (accidentally, maliciously) dropped
> reply capabilities.

I believe that I have offered a compelling argument for why it should
not be done: in order to solve a very rare problem, your "highly robust"
solution imposes a 400% overhead on the common case operation!


I am inclined to think that disk GC provides a better approach to the
whole problem, but I haven't really thought about it enough to have a
sensible opinion about this.


shap




