l4-hurd

From: Jonathan S. Shapiro
Subject: Re: Reliability of RPC services
Date: Mon, 24 Apr 2006 11:48:22 -0400

I don't recall who it was, but somebody finally used the magic words
"the client is waiting for the dead server". This isn't true, and it
made me realize that we have been asking the wrong question -- or at
least we may not have framed the question correctly.

Indeed, the difficulty with death notification is exactly that the
client *isn't* waiting. If the client were waiting, it would be
blocked on a queue, and the death of that queue would provide the
necessary means to know when to notify it.

But in Coyotos, the scheduler activations stuff means that matters are
more subtle: we must distinguish between "waiting" (which is an
application-level abstraction) and "blocked" (i.e. unable to make useful
progress). Let me expand on this and see if it helps us.

To discuss it, I need a three-process example. Let us imagine that there
is an original client C, a "middle" process M and a server S. M is *not*
delegating to S in this example. It is simply calling S in order to do
some step that is necessary while handling the request from C.

So suppose we are now in the state where the following messages have
been sent:

        C ---> M ----> S

and M is "waiting" for the reply from S. Previously we have worried
about the server not replying to the client. In the discussion that
follows, we are concerned about the server not responding to M (which is
a client of S, but not the ultimate client).

Also, in the discussion below, I will use the notation

    FCRB->X     A send FCRB whose messages go to X.
    RFCRB->X    A send-once FCRB whose messages go to X. This
                is typically a reply port, thus the "R".

Remember that there is no CALL operation. Actually, M has performed a
SEND to S on some FCRB->S. If M is emulating CALL semantics, then client
M has *voluntarily* given up the CPU.

With this notation and reminder introduced, I can expand the diagram

        C ---> M ----> S

into:

    1. C has invoked some FCRB->M, passing some RFCRB->C
    2. C yields the CPU
    3. M has been activated by arrival of C's message
    4. M has invoked some FCRB->S, passing some RFCRB->M
    5. M yields the CPU
    6. S has been activated by arrival of M's message
    ?. S *may* eventually invoke RFCRB->M, but M cannot be
       sure that it will. This is the ``problem.''
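The six steps above can be rendered as a toy simulation. Plain queues
stand in for FCRBs and send-once reply FCRBs, and threads stand in for
processes; these are illustrative assumptions, not the Coyotos API:

```python
# Toy rendition of the C ---> M ----> S expansion. Queues stand in
# for FCRBs; threads stand in for processes. None of these names
# come from Coyotos.
import queue
import threading

fcrb_M  = queue.Queue()   # FCRB->M : requests to M arrive here
fcrb_S  = queue.Queue()   # FCRB->S : requests to S arrive here
rfcrb_C = queue.Queue()   # RFCRB->C: C's reply port
rfcrb_M = queue.Queue()   # RFCRB->M: M's reply port

results = []

def C():
    fcrb_M.put(("req", rfcrb_C))   # 1. C invokes FCRB->M, passing RFCRB->C
    results.append(rfcrb_C.get())  # 2. C yields (modeled as a blocking get)

def M():
    msg, reply_to_C = fcrb_M.get()  # 3. M activated by arrival of C's message
    fcrb_S.put((msg, rfcrb_M))      # 4. M invokes FCRB->S, passing RFCRB->M
    # 5. M "yields". In Coyotos M is waiting, not blocked: other
    #    activations could still arrive. This toy models the wait
    #    as a blocking get, which is exactly the simplification
    #    the discussion warns against.
    reply_to_C.put(rfcrb_M.get())   # ?. works only if S cooperates

def S():
    msg, reply_to_M = fcrb_S.get()  # 6. S activated by arrival of M's message
    reply_to_M.put(("done", msg))   # here S chooses to invoke RFCRB->M

threads = [threading.Thread(target=f) for f in (C, M, S)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If S never executed its reply line, M (and transitively C) would hang
forever in this toy; that is the failure mode the rest of the message
addresses.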

Notice that when we expand the story this way, it becomes obvious that M
is not blocked at all. In particular, there are other events that M can
still receive in this state:

   1. Incoming messages on any valid, unblocked FCRB.
   2. Especially, incoming idempotent messages.

So one way to guard against a failing server is to use idempotent timer
events to implement a "heartbeat" -- in much the way that TCP does.
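One way to render the heartbeat idea as code, with the caveat that the
queue endpoints, the function names, and the timeout parameters are
stand-ins of my own, not anything Coyotos defines:

```python
# Hypothetical sketch: guarding M's emulated CALL with a periodic,
# idempotent timer event rather than a kernel-level IPC timeout.
# The queue-based "FCRB" endpoints are stand-ins, not Coyotos API.
import queue
import threading

def call_with_heartbeat(send_to_server, reply_fcrb, request,
                        beat=0.05, max_missed=3):
    """Emulate CALL: SEND to the server, then wait on our reply port.

    A periodic timer wakes us (an idempotent event); after max_missed
    beats with no reply we declare the server dead, much as TCP's
    retransmission timer does. The interval is chosen by the
    application designer, not imposed by the IPC layer.
    """
    send_to_server(request, reply_fcrb)
    missed = 0
    while missed < max_missed:
        try:
            # The timeout plays the role of the timer event.
            return reply_fcrb.get(timeout=beat)
        except queue.Empty:
            missed += 1   # one heartbeat interval with no reply
    raise TimeoutError("server missed %d heartbeats" % max_missed)

def live_server(request, reply_fcrb):
    # A well-behaved S: eventually invokes RFCRB->M with a reply.
    threading.Thread(target=lambda: reply_fcrb.put(("ok", request))).start()

def dead_server(request, reply_fcrb):
    # A failed S: never replies; only the heartbeat rescues M.
    pass
```

Calling `call_with_heartbeat(live_server, queue.Queue(), "req")`
returns normally, while the same call against `dead_server` raises
`TimeoutError` after three missed beats, surfacing the new error class
discussed below.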

I like this much better than complicating the invocation mechanism or
the capability overwrite mechanism, because the majority of interprocess
interactions are between components of the same application. These have
been separated into processes for reasons of isolation, reuse, and
testability, but they still fail as a unit. We do not want to impose
capability semantics that discourage this pattern, and death notices
between such processes are undesirable.

The heartbeat does introduce a new specification problem. Basically, we
are introducing a new class of error that is visible all the way up to
the user (X timed out) and a new requirement for wall-clock response
time limits.

My objection to timeouts at the IPC level is that they are too
low-level to be correct, and they tend to get used in an ad-hoc way.
Introducing a timeout should require a conscious choice on the part of
the designer, so that the design issues get considered explicitly.


shap




