Re: Reliability of RPC services

l4-hurd

From:	Bas Wijnen
Subject:	Re: Reliability of RPC services
Date:	Tue, 25 Apr 2006 12:22:23 +0200
User-agent:	Mutt/1.5.11+cvs20060403
On Mon, Apr 24, 2006 at 09:21:44PM -0400, Jonathan S. Shapiro wrote:
> On Tue, 2006-04-25 at 00:46 +0200, Bas Wijnen wrote:
> > So forget about all others for the moment, please.
> 
> I am not interested in considering move-only capabilities. We looked at
> them several times over many years and decided that (a) they create more
> problems than they solve, (b) they don't actually solve any of the
> problems that people *think* they solve, and (c) they have negative
> performance consequences.

Are there some archives of those discussions of previous years?  I'd like to
see the arguments for these statements.

> Coyotos will not implement "move only" capabilities. Period.

I'm sorry to hear that.  I'm not convinced that this is the perfect solution,
but it is a solution to the problem we found, and timeouts seem a very
inferior replacement.  Of course there's the death-notification idea (similar
to the task-info capabilities in L4.X2).  I haven't thought much about it,
but:
- I don't have much confidence in it.
- I'm pretty sure you don't want that in the kernel either.

> > > > The case when many capabilities are overwritten with a single IPC is
> > > > most likely a bug in the server.
> > > 
> > > Actually, it is the near-universal practice for a single-threaded
> > > server. Arguments are commonly accepted in a way that overwrites the
> > > arguments from the last invocation.
> > 
> > This is not a problem.  The invocation only happens when valid send-once
> > capabilities get overwritten.  Each and every valid send-once capability
> > is directly related to a client waiting for a response.  If you overwrite
> > it, it will never get that response, because you are guaranteed to be the
> > only party who is capable of responding.
> 
> Yes, so now you have a situation where client A is notified of the
> server's mishandling of client B. This is a security error. Coyotos will
> not expose this fact.

I must be missing something here.  A and B have a communication channel (the
reply capability), so the fact that A is sending a message to B cannot be the
problem.  A has given B a move-only reply-capability.  Both A and B can see
this, and know what it means.  If B doesn't like it, then it can reply "I
don't accept these, because I want to be able to crash or be compromised
without you noticing".  Then A can try the same operation again with a normal
reply capability.  Or it can think "You must have gone completely crazy, so
I'll stop talking to you".  Actually the latter seems more reasonable to me,
because I cannot think of a valid reason why A is not supposed to know that B
is overwriting the reply capability without using it.
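
For concreteness, here is a minimal Python sketch of the overwrite case.  It
is a toy model, not Coyotos or L4 code: ReplyCap, Client,
SingleThreadedServer and the destroy() hook are hypothetical names, and the
"kernel" is simulated by receive() calling destroy() on whatever capability it
overwrites.  The two clients are called first and second to avoid clashing
with the A/B naming above.

```python
class Client:
    """A client that collects whatever arrives on its reply channel."""

    def __init__(self, name):
        self.name = name
        self.inbox = []


class ReplyCap:
    """Toy move-only, send-once reply capability (hypothetical model)."""

    def __init__(self, client):
        self.client = client
        self.spent = False

    def reply(self, value):
        # Send-once: a second use is an error.
        if self.spent:
            raise RuntimeError("send-once capability already used")
        self.spent = True
        self.client.inbox.append(("reply", value))

    def destroy(self):
        # Assumed kernel hook: runs when the only copy is overwritten or lost.
        # An already used capability produces no notification.
        if not self.spent:
            self.spent = True
            self.client.inbox.append(("dropped", None))


class SingleThreadedServer:
    """Accepts every request into the same slot, as in the quoted practice."""

    def __init__(self):
        self.reply_slot = None

    def receive(self, cap):
        old = self.reply_slot
        self.reply_slot = cap  # the new argument overwrites the old one
        if old is not None:
            old.destroy()      # toy kernel notices the overwritten capability

    def answer(self, value):
        cap, self.reply_slot = self.reply_slot, None
        cap.reply(value)


first, second = Client("first"), Client("second")
server = SingleThreadedServer()
server.receive(ReplyCap(first))   # first request arrives
server.receive(ReplyCap(second))  # bug: overwrites the unused capability
server.answer("done")             # only the second client can be answered

print(first.inbox)   # [('dropped', None)]
print(second.inbox)  # [('reply', 'done')]
```

In this model the notification fires only when an unused capability is
overwritten; a server that answers before accepting the next request never
triggers it.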

It reminds me of the DRM discussion we had:
        The client doesn't _have to be_ notified when its capability is
        dropped, but it can agree with the kernel that it will be.  Kernel
        support is needed for this agreement, though.  So the question is: why
        do you want to prevent such agreements from being made?

> > As Marcus also pointed out already, this does not define the send-once
> > capabilities we were talking about.
> 
> You are confusing "send once" with "move only". Please pick one.

Ok, you have a good point there.  "send-once" was a very bad choice of words.
As will have become clear by now, we meant "move-only and send-once".

I have some more thoughts that I'd like to share:
 
If the checks for move-only capabilities cause too big a performance hit, it
might be possible to make them a different object type, with a parallel set
of kernel operations (except that they move, not copy).  I'm not sure how
feasible that would be, but it can be considered if the move-only
capabilities really solve a problem (I think they do), but are just too
costly in performance to implement otherwise.  Separating them would improve
branch prediction for both, which should improve performance.  On the other
hand, some other checks would have to be doubled (from "are there
capabilities" to "are there capabilities or move-only capabilities"), so in
total it may not actually be an improvement.  You can probably estimate
whether it is.

I want to describe the problem I am talking about again, because it surprises
me how unimportant you seem to think it is:
There is a client C, which wants to make a call to a server S.  The
programmer thinks of it like this:

        call => result
        result is the interesting value, or the reason why the call failed.

This call is implemented as two steps:
1. C invokes a capability to S, providing a reply capability.
2. S invokes the reply capability with the result or reason of failure.

The problem now is that 2 may not happen.  In that case C will not be able to
recover if it cannot distinguish this from S taking a long time to respond.
So what are the reasons that 2 might not happen?

A. S is malicious and wants C to wait forever.
B. S overwrites the capability (because of a bug).
D. S dies before replying (because of a bug, or user intervention).

Obviously, A is not distinguishable from S taking a long time to reply (except
by code inspection, which is not considered here).  However, at some point the
user will become impatient.  Assuming that he can track the problem down to S,
he will kill S.  In that case, this situation becomes the same as D.

Getting back to the beginning, C is waiting for S, but for the above reasons
is never going to receive a reply.  This is conceptually too hard for the
programmer, so we add an extra possible result: C is notified (in some way)
that there will not be a reply.  Now things are easy for the programmer again,
because C may simply wait for a reply, check if it is an error condition, and
continue as usual.  The notification is simply one of the possible error
conditions.
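
To illustrate what this looks like from C's side, here is a minimal Python
sketch.  Everything in it is an assumption made for illustration: ReplyCap,
NoReply and call() are hypothetical names, S runs as a thread, and the
"kernel" is simulated by a finally block that destroys the reply capability
once S has lost it.

```python
import queue
import threading


class NoReply(Exception):
    """Raised in C when it is notified that no reply will ever arrive."""


class ReplyCap:
    """Toy send-once reply capability with notify-on-destroy (hypothetical)."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._lock = threading.Lock()
        self._spent = False

    def _deliver(self, message):
        # Only the first message gets through: the capability is send-once.
        with self._lock:
            if self._spent:
                return False
            self._spent = True
        self._mailbox.put(message)
        return True

    def reply(self, value):
        if not self._deliver(("reply", value)):
            raise RuntimeError("send-once capability already used")

    def destroy(self):
        # No effect if the capability was already used for a reply.
        self._deliver(("dropped", None))

    def wait(self):
        kind, value = self._mailbox.get()
        if kind == "dropped":
            raise NoReply("reply capability was destroyed unused")
        return value


def call(handler, request):
    """C's view: one call, one result, where 'no reply' is just an error."""
    cap = ReplyCap()

    def run_server():
        try:
            handler(request, cap)  # step 1: S receives the reply capability
        except Exception:
            pass                   # the toy server "dies"
        finally:
            cap.destroy()          # toy kernel: S no longer holds the cap

    threading.Thread(target=run_server).start()
    return cap.wait()              # step 2: the reply, or the notification


def good(request, cap):
    cap.reply(request * 2)


def forgetful(request, cap):
    pass  # case B: the capability is lost without being used


def crashing(request, cap):
    raise RuntimeError("server bug")  # case D: S dies before replying


print(call(good, 21))  # 42
for bad in (forgetful, crashing):
    try:
        call(bad, 21)
    except NoReply as exc:
        print(bad.__name__, "->", exc)
```

The caller handles NoReply like any other failed call; it never has to guess
how long to wait.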

But how does this magic notification step take place?  I see several
options:
E. Using the move-only-send-once-notify-on-destroy capabilities that we
   discussed in this thread.  This will solve the problem.  (You say above
   that it doesn't; I'd like to hear why, if you still think so.)  It will
   however cost some performance, which may or may not be worth it.
F. Using death-notifications which have to be registered with the kernel.
G. Using time-outs.  C defines how long it allows S to take, and if things
   take longer, it assumes something has gone wrong and "notifies" itself
   about it.  The problem with this approach is that it yields false
   positives under load.
H. User intervention.  Assume everything is fine, and let the user figure it
   out if it isn't.  That is, the user may send some sort of signal to C
   telling it that the operation failed, or he just kills C completely.  The
   obvious drawback of this approach is that it requires very detailed
   knowledge on the part of the user about the inner workings of the system.
J. Do nothing and take the hit.  If this happens, well, tough luck, let's do
   other things.  C will be waiting forever.  This may be acceptable on a
   non-persistent system, where you can at least clean up the junk every now
   and then by rebooting, but it isn't on a persistent system, where this
   means a permanent memory-leak.
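
The false positives of option G can be shown with a small Python sketch.  The
names call_with_timeout and slow_but_healthy are made up for this example, S
is modelled as a thread, and the 0.2 second sleep merely stands in for a
loaded system; none of this is taken from an actual kernel interface.

```python
import queue
import threading
import time


def call_with_timeout(handler, request, timeout):
    """Option G: C decides by itself how long S is allowed to take."""
    mailbox = queue.Queue()
    worker = threading.Thread(
        target=lambda: mailbox.put(handler(request)),
        daemon=True,  # do not keep the program alive for a late reply
    )
    worker.start()
    try:
        return mailbox.get(timeout=timeout)
    except queue.Empty:
        # C "notifies" itself; it cannot know whether S really failed.
        raise TimeoutError("no reply within %.2f s" % timeout) from None


def slow_but_healthy(request):
    time.sleep(0.2)  # S is fine, the system is just busy
    return request + 1


# With a generous limit the call succeeds.
print(call_with_timeout(slow_but_healthy, 1, timeout=5.0))  # 2

# With a tight limit C gives up, although S would have answered correctly.
try:
    call_with_timeout(slow_but_healthy, 1, timeout=0.01)
except TimeoutError as exc:
    print("false positive:", exc)
```

No choice of timeout avoids the trade-off: a short one reports failures that
did not happen, and a long one delays the detection of real ones.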

To me, H and J are unacceptable.  G really is unacceptable as well, but a bit
less so.  E and F (and perhaps other similar solutions) are fine, and should
be compared with respect to performance in particular.

So far, it seems you want to do J (combined with H?) and recently perhaps also
G.  This worries me a bit.

Thanks,
Bas

-- 
I encourage people to send encrypted e-mail (see http://www.gnupg.org).
If you have problems reading my e-mail, use a better reader.
Please send the central message of e-mails as plain text
   in the message body, not as HTML and definitely not as MS Word.
Please do not use the MS Word format for attachments either.
For more information, see http://129.125.47.90/e-mail.html