l4-hurd

Re: Reliability of RPC services


From: Jonathan S. Shapiro
Subject: Re: Reliability of RPC services
Date: Tue, 25 Apr 2006 16:39:00 -0400

On Tue, 2006-04-25 at 22:06 +0200, Bas Wijnen wrote:
> > Your proposal is to use a watchdog, which is not what is meant by "send
> > exactly once".
> 
> Well, a watchdog combined with send-on-destroy.  I didn't mention this, but I
> was still talking about move-only-send-exactly-once capabilities, which must
> have send-on-destroy (otherwise they are send-at-most-once).

I do not believe this statement is correct. It appears (to me) that what
you are really doing is using a watchdog so that you can achieve
"receive exactly once" behavior on top of single-copy, send-at-most-once
capabilities.

Can you confirm, or if not, can you explain why you see it differently?

> Even though strictly speaking it is correct that also inside a CPU or on a
> motherboard a signal might get lost, the chance that you still have a
> functioning system when that happens is minimal.  Compared to a network where
> a broken cable doesn't actually break the system, this is a very different
> situation.  I think on one machine it is reasonable to assume that wires
> aren't broken, and signals don't get lost.  If they do, the computer should be
> replaced (or at least a part of it).

That is a tempting view, but it is a very dangerous view: it leads
directly to bad application development.

A correctly written program A should actively manage any situation where
it speaks to a second program B such that failure-domain(A) !=
failure-domain(B). This should be true in the local case as well as the
remote case.

Programmers are (generally speaking) both lazy and stupid. If a
programmer can rely on robust behavior in the local case, and also gets
it 99%+ of the time in the network case, they will write programs that
assume that this behavior is universally true, and these programs will
fail when the bad thing actually happens. Such conditions are extremely
hard to test, and they really do happen in the real world, because a
0.02% likely event happens quite often when measured over 100,000
machines across the world.

Empirical evidence for my statement: run grep on any large body of
source code. Measure the percentage of calls to read() where the error
result is actually checked. How many programs recover from bad disk
blocks? Hell, how many Linux *FS implementations* check for them?

The only situation where hiding the failure in the API is okay is when
the network proxy can simulate some expected *local* failure that the
program *is* likely to deal with, and the action taken in response to
this failure causes the right behavior in the local application.

Fundamentally, I am arguing that a well-designed API does not encourage
the programmer to have unrealistic expectations about reliability.
Instead, a well-designed API is structured (a) to encourage the
programmer to actually deal with these issues, and (b) where possible,
makes it straightforward to deal with them. But it *never* deludes the
programmer into believing that the situation is better than it actually
is. [Delusional behavior of this form is called "religion", or sometimes
"politics". :-)]


> > I am not sure how you inferred "crappy hardware" from "network
> > transparency".
> 
> When I think about writing an OS, I think about writing it on one computer.
> That means, as I wrote above, that I assume that things work.

Ah. Since you believe this, you obviously have never used a disk drive.
In my lab alone, we have had more disk drive failures over the last 5
years than network failures. Yes, this is because we have a depressingly
large number of disk drives, but the point is that local behavior is
definitely *not* robust in the way that you assume.

You may say that disk drive failure is easier to detect in advance of
fatality, and is easy to mask with things like mirroring and RAID. I
agree. But running out of space *isn't* easy to mask, and then I ask
what percentage of calls to write() [or fwrite()] have their error
result unchecked.

> 
> So what I'm saying is that if you consider a network as the "machine" to write
> an OS for, then the failure-rate of that machine is very high, so the hardware
> is crappy compared to "usual" computers.

Hey, it could be worse! Imagine if you had to deal with Pentium chips at
the same time! :-)

> You wrote you want to be able to expand the system to a network "machine", and
> therefore you don't want to use certain constructs which perhaps cannot handle
> the fragility of such a machine.  Or at least that's how I understood it, but
> please correct me if I'm wrong. :-)

I'm not certain that I want to do that. At the moment, I simply do not
want to throw that option away yet, and I don't want to build an API
that encourages delusional programming practices.


shap




