l4-hurd
Re: Reliability of RPC services


From: Jonathan S. Shapiro
Subject: Re: Reliability of RPC services
Date: Mon, 24 Apr 2006 15:39:04 -0400

On Mon, 2006-04-24 at 20:46 +0200, Tom Bachmann wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Jonathan S. Shapiro wrote:
> >    1. Incoming messages on any valid, unblocked FCRB.
> >    2. Especially, incoming idempotent messages.
> > 
> > So one way to guard against a failing server is to use idempotent timer
> > events to implement a "heartbeat" -- in much the way that TCP does.
> > 
> 
> I don't get it. What do we gain from artificially awaking M? Is the
> FCRB->M C invoked still blocked after step 5?

We need to avoid words like "blocked" in this discussion, because that
is what is causing the confusion. First, let me repeat the steps for
reference:

>    1. C has invoked some FCRB->M, passing some RFCRB->C
>    2. C yields the CPU
>    3. M has been activated by arrival of C's message
>    4. M has invoked some FCRB->S, passing some RFCRB->M
>    5. M yields the CPU
>    6. S has been activated by arrival of M's message
>    ?. S *may* eventually invoke RFCRB->M, but cannot be
>       sure. This is the ``problem.''

To answer your question, it isn't important whether FCRB->M is available
after step 5. That depends only on whether M wishes to be multi-threaded,
and it has nothing to do with the problem we are trying to solve, which
is that S may fail to return to M.

Assuming that everybody does the *desired* thing, the next few steps
would look like:

    7. S invokes RFCRB->M, providing the reply to M's message.
    8. After some processing, M invokes RFCRB->C, providing the
       reply to C's original message.

The failure case we are considering is that step 7 does not occur --
either because S is hostile or because S fails. The consequence of this
failure is that M will never perform step 8, because step 8 cannot occur
before step 7.
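
As a rough illustration of the sequence above, here is a toy model in
Python. It is only a sketch: the names (run, s_replies, the message
queue) are hypothetical, and nothing in it corresponds to a real kernel
interface. It shows that when S skips step 7, nothing ever reaches C.

```python
from collections import deque


def run(s_replies):
    """Simulate steps 1-8; return the replies that reach C.

    s_replies is a hypothetical switch: False models an S that is
    hostile or has failed and never invokes RFCRB->M.
    """
    queue = deque()        # pending (target, message) deliveries
    replies_to_c = []

    # Steps 1-2: C invokes FCRB->M, passing RFCRB->C, then yields.
    queue.append(("M", "request"))

    while queue:
        target, msg = queue.popleft()
        if target == "M" and msg == "request":
            # Steps 3-5: M is activated, invokes FCRB->S, then yields.
            queue.append(("S", "request"))
        elif target == "S":
            # Steps 6-7: S is activated and *may* invoke RFCRB->M.
            if s_replies:
                queue.append(("M", "reply"))
        elif target == "M" and msg == "reply":
            # Step 8: M invokes RFCRB->C.
            replies_to_c.append("reply")
    return replies_to_c
```

With run(True) the reply reaches C; with run(False) the queue simply
drains and C hears nothing, which is the failure case described above.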

The assumption in my example is that the "recovery boundary" is between
M and S, and that C relies fully on M to perform recovery. In this
scenario, the purpose of waking up M is to allow it to notice that S has
not replied within a reasonable amount of time.
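
A minimal sketch of such a wakeup, again in Python and again with
hypothetical names (Middle, timeout_ticks, on_timer and so on are my
inventions for illustration, not part of any proposed interface). The
timer message carries an absolute tick, so delivering it twice, or
delivering a stale one, changes nothing. That is what I mean by an
idempotent timer event.

```python
class Middle:
    """Toy model of M sitting at the recovery boundary (hypothetical)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.now = 0           # latest tick M has seen
        self.deadline = None   # tick at which S is given up on; None when idle
        self.sent_to_c = []    # what M has sent on RFCRB->C

    def on_request_from_c(self):
        # Steps 3-5: forward to S (not modeled) and note when to give up.
        self.deadline = self.now + self.timeout_ticks

    def on_reply_from_s(self, value):
        # Step 7 arrived. Ignore it if M has already recovered.
        if self.deadline is None:
            return
        self.deadline = None
        self.sent_to_c.append(("ok", value))   # step 8

    def on_timer(self, tick):
        # Heartbeat: duplicate or stale ticks are harmless.
        self.now = max(self.now, tick)
        if self.deadline is not None and self.now >= self.deadline:
            self.deadline = None
            self.sent_to_c.append(("error", "S timed out"))
```

For example, after m = Middle(3) and m.on_request_from_c(), feeding
ticks 1, 2, 3, 3, 4 to m.on_timer produces exactly one error reply to C,
and a late m.on_reply_from_s(...) is dropped. Note that all of the
policy lives in M; the only thing assumed from below is that timer
messages get delivered.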

The key points are:

  1. You do not need a watchdog everywhere. Only at recovery perimeters.
  2. The kernel does not need to support this directly.
  3. Given that sometimes you don't *want* a watchdog, the kernel should
     not force you to implement one by setting any policy for last drop
     behavior.

shap







