l4-hurd

Re: Reliability of RPC services


From: Marcus Brinkmann
Subject: Re: Reliability of RPC services
Date: Sat, 22 Apr 2006 10:28:35 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

At Sat, 22 Apr 2006 00:40:53 -0400,
"Jonathan S. Shapiro" <address@hidden> wrote:
> 
> On Sat, 2006-04-22 at 03:52 +0200, Marcus Brinkmann wrote:
> 
> > Interesting that you propose this.  I thought about this earlier
> > today, and called it a "send at least once" capability when describing
> > it to Neal.  Neal pointed out that this may be inconvenient in actual
> > practice: If a server S wants to propagate (forward) a request to
> > another server T, then S must be careful to not destroy its own copy
> > of the reply capability before T had a chance to reply.
> 
> I think that it might be useful to make this more precise. What you mean
> is that S must not be *destroyed* before T has a chance to reply. If S
> simply overwrites the capability, no problem will arise.

Oh!  But that is insufficient, because it does not achieve the level
of robustness I think is important to achieve.  If the kernel only
generates reply messages on destruction, but not on overwrite, then
an accidental overwrite due to a bug can cause the caller to hang
indefinitely.

Going back to the simple case of one caller and one callee, where the
callee does not invoke any other process, I think the following
condition should be sufficient: The kernel guarantees that a reply
message is sent _at the latest_ when the callee process is destroyed.
This should hold true independent of what the callee does between
being invoked and exiting.  In particular, simply dropping the reply
capability should not change this guarantee (which in effect means
that the kernel has to invoke the reply capability when it is
dropped).
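
To make the proposed guarantee concrete, here is a toy model (in
Python, purely illustrative; all class and function names are
invented, and this is not Hurd, EROS, or Coyotos code) of a kernel
that ensures every reply capability is invoked exactly once: by the
callee, or by the kernel itself when the capability is overwritten,
dropped, or its holder is destroyed:

```python
# Toy model of the proposed semantics: a reply capability that the
# kernel invokes on the caller's behalf if the callee drops it.

class ReplyCap:
    """A one-shot reply channel back to the caller."""
    def __init__(self, caller_inbox):
        self.caller_inbox = caller_inbox
        self.consumed = False

    def invoke(self, message):
        # At most one reply is ever delivered.
        if not self.consumed:
            self.consumed = True
            self.caller_inbox.append(message)

class Kernel:
    @staticmethod
    def write_slot(slots, index, new_cap):
        """Overwrite a capability slot.  If the old contents were an
        unconsumed reply capability, synthesize an error reply so the
        caller is unblocked rather than hanging indefinitely."""
        old = slots[index]
        if isinstance(old, ReplyCap):
            old.invoke(("error", "reply capability dropped"))
        slots[index] = new_cap

    @staticmethod
    def destroy_process(slots):
        """Destroying a process drops all of its capabilities, which
        triggers the same synthesized replies as an overwrite."""
        for i in range(len(slots)):
            Kernel.write_slot(slots, i, None)

# A buggy callee overwrites its reply capability instead of replying;
# the kernel unblocks the caller with an error reply.
inbox = []
callee_slots = [ReplyCap(inbox)]
Kernel.write_slot(callee_slots, 0, None)   # accidental overwrite
assert inbox == [("error", "reply capability dropped")]
```

The point of the sketch is only that overwrite, drop, and destroy all
funnel through the same path, so the caller is unblocked in every
case.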

This is really crucial: otherwise you increase the robustness from a
practical point of view, but you do not completely eliminate the
window of vulnerability.  Moreover, your semantics (issuing
notifications on destruction only, not on overwrite) break down
completely if the callee is malicious (for example because it has
been compromised).  With the semantics I proposed, we can structure
the system in a way that allows callers of malicious servers to
predictably clean up their resources once the malicious server has
been identified and shot down.  If the malicious server can attack
this cleanup process by dropping capabilities, we are no better off
than before.

(This is where the send-once semantics win: they make it possible to
reason about which process is currently responsible for making the
caller wait.  This is also possible with the semantics you proposed,
but it is more complicated, because more than one process can be
involved at the same time.)
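
A toy sketch of that reasoning (Python, names invented, not kernel
code): because copying a send-once capability invalidates the source,
at most one process holds the live reply capability at any moment, so
responsibility for the waiting caller is always a single process:

```python
# Send-once semantics: a capability copy is really a move, so exactly
# one process can hold the live reply capability at any time.

def move_send_once(src, i, dst, j):
    """Transfer a send-once capability; the source slot is cleared."""
    dst[j] = src[i]
    src[i] = None   # send-once: the copy invalidates the source

caller_reply = "reply-to-caller"
S = [caller_reply]      # server S initially holds the reply capability
T = [None]              # server T does not

move_send_once(S, 0, T, 0)   # S forwards the request to T

# Exactly one process is now responsible for replying to the caller.
holders = [p for p in (S, T) if caller_reply in p]
assert holders == [T]
```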

In short: I think overwriting a capability and destroying it should
behave the same with regard to this issue.

> The issue at hand has absolutely nothing to do with schedule donation or
> priority inheritance. The issue is timeouts. Please identify a bounded
> timeout duration -- ANY bounded timeout duration -- that operates
> correctly under all conditions of load.

[I will let this die here, because I am not the right person to defend
this option.]

> > > The semantics of send-once rights is an abomination. The cost of them is
> > > considerable, and the overhead of manipulating them correctly from the
> > > application perspective is a serious problem. Coyotos will not under any
> > > circumstances implement "send-once" or "grant-only" capabilities.
> > 
> > Well, let me try to understand the cost factor.  As far as I see it,
> > the only cost involved in the copy operation is setting a data field
> > in the source capability to mark it as invalid.
> 
> There is the cost to test whether this is a "special handling"
> capability. There is also the data-driven branch pipeline delay. There
> is the branch mispredict. There is the write of the source cache line,
> which would otherwise be read-only. On many processors that branch
> mispredict will be extremely noticeable. All of this will occur for
> every capability copy that happens in the kernel.

Ok, thanks for going into the details.
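
For illustration, the branch in question might be sketched like this
(a Python toy model, not kernel code; the names are invented).  The
data-dependent test on every copy, and the write to the otherwise
read-only source slot in the send-once case, are exactly the costs
being discussed:

```python
# Sketch of the per-copy cost: every capability copy must test for
# the "send-once" case and, when it applies, write the source slot
# to invalidate it (a move rather than a copy).

SEND_ONCE = "send_once"

def copy_cap(src_slots, i, dst_slots, j):
    cap = src_slots[i]
    dst_slots[j] = cap
    # The data-dependent branch taken on *every* copy:
    if cap is not None and cap[0] == SEND_ONCE:
        src_slots[i] = None   # extra write: invalidate the source

src = [(SEND_ONCE, "reply-to-A"), ("ordinary", "file")]
dst = [None, None]
copy_cap(src, 0, dst, 0)   # send-once: moved, source cleared
copy_cap(src, 1, dst, 1)   # ordinary: copied, source untouched
assert src == [None, ("ordinary", "file")]
assert dst == [(SEND_ONCE, "reply-to-A"), ("ordinary", "file")]
```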
 
> > Also, the semantics you propose require the kernel to make an attempt
> > to send a message for every destroyed reply capability, which entails
> > accessing the FCRB (cache pressure?).  And this although in the common
> > case (>99.999% of the cases), the FCRB will already contain the reply
> > message and thus not be available.
> 
> It is not the reply capability that is being destroyed. It is the
> containing object (which is probably a process, not an FCRB).
> 
> Destroy is dynamically rare. The cache pressure is actually not a big
> deal. The case in question would arise in destroy of a process, not
> destroy of an FCRB.

See above for why I think it is not sufficient to do this on destroy
alone; it also needs to happen on overwrite.  I can imagine what this
adds to the complexity of certain paths in the system, but I don't
see how it can be avoided if we want to achieve the level of
robustness I was thinking about.

> If the FCRB already contains the reply message, then the protocol has
> completed successfully and you do not want a death notice. This is
> exactly the outcome that you want!
> 
> The real problem here comes in destroying processes and capability
> pages. We now require a scan of these objects to learn whether a valid
> reply capability is present, and we potentially require that the target
> FCRBs be paged in so that we can determine whether they are valid and
> then invoke those reply capabilities to send the death notice.
> 
> Yes, it is terribly expensive. This is why I took it out of EROS. There
> are various ways to optimize it.
> 
> But notice that your "send once" capability has exactly the same
> problem: we still need to scan these objects and find the send once
> capabilities, and we still need to page in the target FCRBs in order to
> send the death notice.

Yes.
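
The shared expensive step might be sketched as follows (a Python toy
model with invented names; `page_in` stands in for faulting in the
target FCRB): destroying an object means walking its capability
slots, finding every live reply capability, and invoking each one to
deliver a death notice:

```python
# The scan both proposals require: on destroy, walk the dying
# object's capability slots and send a death notice through every
# live reply capability found there.

def destroy(slots, page_in):
    """Destroy an object's slots, delivering death notices.
    Returns the number of notices sent."""
    notices = 0
    for i, cap in enumerate(slots):
        if isinstance(cap, dict) and cap.get("kind") == "reply":
            page_in(cap["target"])            # may fault in the FCRB
            cap["target"].append("death notice")
            notices += 1
        slots[i] = None
    return notices

fcrb_a, fcrb_b = [], []
paged = []
n = destroy([{"kind": "reply", "target": fcrb_a},
             {"kind": "data"},
             {"kind": "reply", "target": fcrb_b}],
            page_in=lambda t: paged.append(t))
assert n == 2
assert fcrb_a == ["death notice"] and fcrb_b == ["death notice"]
```

The cost is the same whether the live capabilities are reply
capabilities or send-once capabilities: both designs must do this
scan and may have to page in each target.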

> > > Worse, it has the disadvantage that every capability copy must be
> > > preceded by a capability type check, so that the sender knows whether it
> > > is losing the capability as a side effect. This violates encapsulation
> > > in a fairly fundamental way.
> > 
> > Good point.  However, I think that it would suffice to apply such
> > checks on incoming capabilities rather than outgoing, for those places
> > where it is actually relevant.  In many situations, the check can
> > probably be omitted (for example if the capability will be dropped
> > anyway).
> 
> The total number of incoming and outgoing capabilities is (obviously)
> identical, so that doesn't help.
> 
> The kernel can never know that a capability will be dropped, so that
> doesn't help.
> 
> In the end, the "overwrite on send" idea isn't really going to help you.
> The real cost is in getting those death notices to be sent.
> 
> I think that this is something where we should defer the argument until
> we can actually test it in a real system. Any of the options you are
> considering can be done. The question is: what do we really want to
> have? I don't see any way to avoid the "search for caps in dying
> objects" approach, which is the real killer in *both* proposals.

I am happy to defer discussion of implementation details, but I would
like to clarify the issue of (accidentally or maliciously) dropped
reply capabilities.

Thanks,
Marcus




