l4-hurd

Re: Reliability of RPC services


From: Marcus Brinkmann
Subject: Re: Reliability of RPC services
Date: Mon, 24 Apr 2006 19:30:11 +0200
User-agent: Wanderlust/2.14.0 (Africa) SEMI/1.14.6 (Maruoka) FLIM/1.14.7 (Sanjō) APEL/10.6 Emacs/21.4 (i486-pc-linux-gnu) MULE/5.0 (SAKAKI)

Hi,

At Mon, 24 Apr 2006 11:48:22 -0400,
"Jonathan S. Shapiro" <address@hidden> wrote:
> So one way to guard against a failing server is to use idempotent timer
> events to implement a "heartbeat" -- in much the way that TCP does.
>
> I like this much better than complicating the invocation mechanism or
> the capability overwrite mechanism, because the majority of interprocess
> interactions are between components of the same application. These have
> been separated into processes for reasons of isolation, reuse, and
> testability, but they still fail as a unit. We do not want to impose
> capability semantics that discourage this pattern, and death notices
> between such processes are undesirable.
> 
> The heartbeat does introduce a new specification problem. Basically, we
> are introducing a new class of error that is visible all the way up to
> the user (X timed out) and a new requirement for wall-clock response
> time limits.

If you are going this way, it seems to make more sense to me to design
the system as a real time operating system in the first place, because
then one can at least precisely define what the requirements for
wall-clock (or even CPU) response time limits are.

The key phrase you use above is that the processes "fail as a unit".
This is quite pessimistic, and I am not sure I accept it yet.  Some
components in the system have many dependencies and are depended upon
by many other components (like the user's shell).  If one of the
dependencies of such a critical piece fails, one would like a way to
recover from that without losing the critical piece and everything
that depends on it.  For example, a failure in a minor device driver
should not cause permanent damage to a whole user session.

Timeouts do not scale, and they create constant background noise
that, depending on the details, I suspect would cause performance and
power-management problems.
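To make the trade-off in this exchange concrete, here is a minimal Python sketch of a client-side heartbeat, simulated with a discrete clock instead of real timers. Everything in it (HeartbeatMonitor, simulate, the period and timeout parameters, the server names) is hypothetical and is not part of any Hurd or L4 interface. It illustrates both sides of the argument: a dead server is only noticed once the timeout has elapsed, and the number of pings grows as servers × (duration / period) even when every server is healthy, which is the background noise objected to above.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical client-side heartbeat monitor driven by a simulated clock."""
    period: int    # ticks between pings to each server (assumed parameter)
    timeout: int   # ticks of silence before a server is reported as timed out
    last_reply: dict[str, int] = field(default_factory=dict)
    wakeups: int = 0  # pings sent, whether or not anything is wrong

    def add_server(self, name: str, now: int = 0) -> None:
        self.last_reply[name] = now

    def reply(self, name: str, now: int) -> None:
        # Idempotent: a duplicated or reordered reply can only move the
        # timestamp forward, so handling it twice is harmless.
        self.last_reply[name] = max(self.last_reply[name], now)

    def timed_out(self, now: int) -> list[str]:
        return [s for s, t in self.last_reply.items() if now - t > self.timeout]


def simulate(num_servers, duration, period, timeout, dead=frozenset()):
    """Run for `duration` ticks; return (wakeups, first failure tick per server)."""
    mon = HeartbeatMonitor(period, timeout)
    for i in range(num_servers):
        mon.add_server(f"s{i}")
    failures: dict[str, int] = {}
    for now in range(period, duration + 1, period):
        for name in mon.last_reply:
            mon.wakeups += 1          # one ping per server per period
            if name not in dead:      # a dead server never answers
                mon.reply(name, now)
        for name in mon.timed_out(now):
            failures.setdefault(name, now)
    return mon.wakeups, failures


# An idle system with only healthy servers still pays for every heartbeat:
print(simulate(100, 1000, 10, 25))              # (10000, {})
print(simulate(3, 100, 10, 25, dead={"s1"}))    # (30, {'s1': 30})
```

Doubling the number of monitored servers, or halving the period, doubles the wakeups in this model. Picking the timeout value is exactly the wall-clock specification problem raised in the quoted text.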

My vote is on nothing yet :)

Thanks,
Marcus

