l4-hurd

Re: Reliability of RPC services


From: Michal Suchanek
Subject: Re: Reliability of RPC services
Date: Wed, 26 Apr 2006 10:56:58 +0200

On 4/25/06, Jonathan S. Shapiro <address@hidden> wrote:
> On Tue, 2006-04-25 at 17:47 +0200, Michal Suchanek wrote:
> > On 4/25/06, Jonathan S. Shapiro <address@hidden> wrote:
> > > On Tue, 2006-04-25 at 11:54 +0200, Michal Suchanek wrote:
> > >
> > > > ad (b) Imagine a few scenarios:
> > > > ...
> > > > And I do not think that timeouts or watchdogs solve [these] on
> > > > a non-realtime system.
> > >
> > > I agree. However, this mis-states the issue. You are talking about what
> > > happens when you have already decided to recover (e.g. by killing a
> > > non-performing renderer). The purpose of the timeout is to help
> > > determine when recovery is required.
> > >
> > > Also, in each of the examples that you gave, an asynchronous interface
> > > is appropriate. Recovering on an asynchronous interface is relatively
> > > straightforward.
> > >
> >
> > So you say that the timeouts and watch dogs actually solve a different
> > kind of problem.
>
> No. Watchdogs and timeouts are the same thing. You were talking about
> cases where a user says "kill that rendering agent, because it is
> misbehaving."

I meant different from the problem we are trying to solve with the
reference counted capabilities.
Watchdogs and timeouts sure are pretty much the same thing.

>
> >
> > The send-once + reference-counted capabilities serve to notify when a
> > service has already failed. This allows the client to restart the
> > action or use different means for obtaining the service. Or just free
> > any resources associated with the failed service in case of a proxy.
> >
> > But the watchdog is used to identify a service that is slow to respond
> > and may be the one that is failing so that the user may remove it and
> > trigger the recovery.
>
> Notice that the first is subsumed by the second. The only question is to
> decide what latency is acceptable before noticing that a server has been
> destroyed. This will determine whether a timeout is sufficient.

No. Timeouts are unreliable, so they can only provide a hint to the
user. There may be cases where some process acts automatically when
communication times out, but that cannot be accepted as the general
solution.
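
To make "hint, not verdict" concrete, here is a minimal sketch in
Python. Everything in it is an assumption made for illustration:
call_with_hint, fast_service and slow_service are hypothetical names,
and a thread stands in for a separate server process. The point is
only that an expired timeout reports "slow" and leaves the decision
to kill the service to someone else.

```python
import queue
import threading
import time

def call_with_hint(service, request, timeout_s):
    """Call service(request); return ("ok", reply) or ("slow", None).

    A "slow" result is only a hint: the service is neither killed nor
    assumed dead, because on a non-realtime system it may just be late.
    """
    replies = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=lambda: replies.put(service(request)), daemon=True)
    worker.start()
    try:
        return ("ok", replies.get(timeout=timeout_s))
    except queue.Empty:
        return ("slow", None)

def fast_service(request):
    # Hypothetical well-behaved server.
    return request.upper()

def slow_service(request):
    # Hypothetical server that is merely late, not dead.
    time.sleep(0.5)
    return request.upper()

fast_result = call_with_hint(fast_service, "ping", timeout_s=1.0)
slow_result = call_with_hint(slow_service, "ping", timeout_s=0.05)
```

Note that slow_service would still have answered correctly given more
time, which is exactly why the timeout alone cannot justify automatic
recovery in the general case.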

So there needs to be a reliable way to recover once the misbehaving
process has been identified and terminated.
You say that there are systems that do not need reference-counted
capabilities for this, and that they work.
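
For reference, this is roughly what I mean by recovery through
send-once, reference-counted capabilities. It is only a sketch under
my own assumptions: ReplyCapability and Server are hypothetical
classes, not any real kernel interface. The client is told reliably
that no reply can ever arrive, because the last reference to its
reply capability went away unused when the server was killed.

```python
class ReplyCapability:
    """Hypothetical send-once reply capability with a reference count."""

    def __init__(self, on_dropped):
        self._refs = 1              # the reference handed to the server
        self._used = False
        self._on_dropped = on_dropped
        self.reply = None

    def copy(self):
        # Another holder now shares the capability.
        self._refs += 1
        return self

    def send(self, reply):
        # Send-once: a second reply is an error.
        if self._used:
            raise RuntimeError("reply capability already used")
        self._used = True
        self.reply = reply

    def release(self):
        # Dropping the last reference without replying notifies the client.
        self._refs -= 1
        if self._refs == 0 and not self._used:
            self._on_dropped()


class Server:
    """Hypothetical server that holds reply capabilities for pending calls."""

    def __init__(self):
        self.pending = []

    def accept(self, cap):
        self.pending.append(cap)

    def kill(self):
        # Destroying the server releases every capability it held.
        while self.pending:
            self.pending.pop().release()


events = []
server = Server()
cap = ReplyCapability(on_dropped=lambda: events.append("server lost"))
server.accept(cap)
server.kill()   # client learns of the failure without any timeout
```

After kill(), events contains "server lost", so the client can restart
the action or free the associated resources right away.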

But note that these are very specialized systems. They are used for
servers, not user desktops. So one can expect that there are
administrators who know (or eventually find out) that when buggy
service S is replaced with a fixed version, dependent programs A, C, F
and service I have to be restarted as well. Such information will
probably be provided with the update.

You also proposed an alternative to reference counting: garbage
collection. Since it takes long and cannot run all the time, it does
not seem appropriate at first glance. But if the user wants to see
the results of killing S immediately, there may be a way for the user
to trigger the garbage collection, and to get some notice when the
garbage collection finishes.
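
A rough sketch of that idea, again purely illustrative: collect, the
table layout and the on_finished callback are names I made up, and a
real collector would of course walk kernel capability tables rather
than Python dictionaries. The user kills S, triggers one collection
pass, and is notified when the pass is done.

```python
def collect(cap_tables, live_services, on_finished):
    """Sweep capabilities whose target service no longer exists.

    cap_tables maps a holder name to {capability name: target service}.
    on_finished is called once with the number of capabilities removed.
    """
    removed = 0
    for caps in cap_tables.values():
        dangling = [name for name, target in caps.items()
                    if target not in live_services]
        for name in dangling:
            del caps[name]
            removed += 1
    on_finished(removed)
    return removed


# Hypothetical state: programs A and C hold capabilities to S and I.
cap_tables = {
    "A": {"render": "S", "index": "I"},
    "C": {"render": "S"},
}
live_services = {"S", "I"}
notices = []

live_services.discard("S")          # the user kills service S
collect(cap_tables, live_services,  # ...and triggers collection by hand
        on_finished=lambda n: notices.append(f"collected {n} capabilities"))
```

Here the notice tells the user that the two capabilities to S are
gone, while A's capability to I is untouched.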

Thanks

Michal
