l4-hurd

RE: Reliability of RPC services


From: Christopher Nelson
Subject: RE: Reliability of RPC services
Date: Tue, 25 Apr 2006 11:45:18 -0600

> > If the designer cannot come up
> > with a good metric, then that recovery mechanism is not appropriate 
> > for that use case.  It is *never* appropriate to expect an 
> end user to 
> > be the watchdog.
> 
> I agree partially.  It's true that the user probably can't 
> really know when an operation is just taking long because the 
> machine is busy, or because there is something botched.  
> However, I believe that the user should always be in control, 
> so if an operation is taking too long, and the user does want 
> to abort the operation, he should have a chance to do so.

I certainly don't mean that you should *take away* the ability of a user
to kill a process.  It should be made as easy as possible, within a
certain security framework.  However, I do not think that "the user
kills the process" should be counted on as part of the recovery process.
That the user *may* kill the process should most definitely be taken
into consideration.
 
> However, what's the alternative you suggest?  We are at the 
> limit of what we can make computers do at this point of 
> the discussion.  A computer can not in all cases successfully 
> argue about the validity of its own operation (sounds 
> equivalent to the halting problem to me).
> 
> Hard real-time systems define an appropriate response to this 
> challenge, but that response is not the most useful in all cases.

I like hard real-time systems.  I have thought a lot about the recovery
aspect of system design.  To me it seems like you have two situations:

(1) The user may have some idea what they're doing and in certain
situations you can thus present them with reasonable choices.  If there
are no reasonable choices to present the user, then they should be
ignored for purposes of recovery.

(2) The user cannot possibly have any good idea what's going on because
it's so internal to the application or server that only the authors of
the server would have any reasonable idea what's going on at that point.
Users should be ignored for purposes of recovery.

In both cases, you should fail quickly and gracefully.  It may be
possible to notify the user that Something Bad(tm) has happened, and it
may not.  That should be decided for each case.  Quickly may be defined
as "as soon as possible."  Gracefully may be defined as "try not to
break anything on your way out."  

In all cases, IMHO, all apps and servers should always expect that their
communication with the external world (where external is defined as
outside the current thread of execution) may be cut off or stalled.  In
other words, Bad Things Will Happen.

Think about lock-free (non-blocking) synchronization primitives.  The
idea is that at least ONE thread will make progress.  So say you're
appending an item to a list:  you create the node, set it up for the
append, then take a copy of the tail pointer.  Using a
platform-dependent compare-and-exchange primitive, you then compare
your copy against the actual tail pointer; if they still match, the
tail is atomically swapped to point at your new node, otherwise
another thread got there first and you have to back up and retry.
Someone, somewhere, is guaranteed to make progress.
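A minimal sketch of that retry-until-the-CAS-succeeds append, using C11
atomics (the single shared tail pointer and the post-CAS link step here
are illustrative; a production lock-free list needs more care around
concurrent traversal):

```c
#include <stdatomic.h>
#include <stddef.h>

struct node {
    int value;
    struct node *next;
};

/* Shared tail pointer; NULL while the list is empty. */
static _Atomic(struct node *) tail;

/* Append by compare-and-exchange: copy the tail, then try to install
 * the new node.  If another thread changed the tail in the meantime,
 * the CAS fails and we retry with a freshly observed value -- so at
 * least one of the competing threads always makes progress. */
void append(struct node *n)
{
    struct node *old;
    n->next = NULL;
    do {
        old = atomic_load(&tail);
    } while (!atomic_compare_exchange_weak(&tail, &old, n));
    if (old)
        old->next = n;   /* link the previous tail to the new node */
}
```

Note that no thread ever blocks holding a lock: a failed CAS costs only
a retry, never a stall for everyone else.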

This might be extended to IPC by doing something similar.  It may not
ever be necessary to know "when" to stop retrying.  It may be possible
to indicate to a user that the requested operation is taking longer than
expected, and to give the user the opportunity to cancel the request.
Other servers (such as a mail server) may have a settings file which
dictates how "long" it should keep retrying an operation.  
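As a sketch of that idea (the function names and return codes are mine,
not an existing API): a generic retry loop where the "when to give up"
policy comes from the user or a settings file rather than being baked
in at compile time:

```c
#include <stdbool.h>
#include <time.h>

/* Keep retrying an operation until it succeeds, the user cancels, or a
 * configurable deadline expires.  retry_secs < 0 means "retry forever
 * unless cancelled" -- the policy comes from the caller (a settings
 * file, a dialog box), not from a compile-time constant. */
int retry_operation(bool (*op)(void), bool (*cancelled)(void),
                    double retry_secs)
{
    time_t start = time(NULL);
    for (;;) {
        if (op())
            return 0;    /* success */
        if (cancelled())
            return 1;    /* user aborted */
        if (retry_secs >= 0.0 && difftime(time(NULL), start) > retry_secs)
            return 2;    /* configured timeout expired */
        /* a real server would sleep or back off here before retrying */
    }
}

/* Illustrative stubs: an operation that succeeds on its third attempt,
 * and a user who never cancels. */
static int calls;
static bool third_try(void) { return ++calls >= 3; }
static bool never(void)     { return false; }
```

An interactive app would pass a `cancelled` hook wired to the UI; a
mail server would pass one that always returns false and set
`retry_secs` from its settings file.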

In these situations, the metric for timing out may not be some
compile-time constant, but can be dependent on what the user has said
should happen.  (In the case of a settings file, it is probably a
"knowledgeable" user, since all servers should come set with reasonable
defaults.)

One other idea that may not be feasible is in regard to timeouts being
flaky in the case of heavy load.  Perhaps it would be better to
stipulate that the watchdog should keep track of how many requests have
been processed, and how many are pending.  Over time this indicates an
"average load".  If this number starts to rise sharply, the watchdog may
assume that it is now under a heavier load, and can use some metric to
back off on its abort policy.  Think about how Ethernet adapters use
binary exponential backoff to recover when two systems transmit at
once, without any explicit session policy.
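A toy version of that load-adaptive backoff (the `load_factor` input and
all numbers are illustrative assumptions, not a worked-out policy): the
watchdog doubles its patience for each doubling of observed load, up to
a ceiling.

```c
/* Binary-exponential backoff for a watchdog's abort deadline.
 * load_factor is the ratio of current load to the long-run average
 * (rounded to an integer); each doubling of load doubles the timeout,
 * clamped to max_ms so a runaway load can't disable the watchdog. */
unsigned backoff_timeout(unsigned base_ms, unsigned load_factor,
                         unsigned max_ms)
{
    unsigned t = base_ms;
    while (load_factor > 1 && t < max_ms) {
        t *= 2;            /* double the patience per doubling of load */
        load_factor /= 2;
    }
    return t < max_ms ? t : max_ms;
}
```

Under normal load the watchdog aborts at `base_ms`; under a 4x load
spike it waits four times as long before crying wolf, so a busy machine
is not mistaken for a hung one.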

Essentially, apps and servers need to be smarter and need to expect
things to go wrong.  

-={C}=-



