From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,XPRIO autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Thread: 103376,c473e498c84938dd,start
X-Google-Attributes: gid103376,public
X-Google-Language: ENGLISH,ASCII-7-bit
Path: 
 g2news1.google.com!news2.google.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: "Marc A. Criley" <mcNOSPAM@mckae.com>
Newsgroups: comp.lang.ada
Subject: Orders of Fault Management
Date: Tue, 27 Jul 2004 15:12:30 -0500
Message-ID: <2mnr9kFnpbivU1@uni-berlin.de>
X-Trace: news.uni-berlin.de /lGSRjMT6Y1DiNOtoQ8YXQyCF2odNk5e9RZ/FXTG/kI3h0wtxP
X-Priority: 3
X-MSMail-Priority: Normal
X-Newsreader: Microsoft Outlook Express 6.00.2800.1437
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1441
Xref: g2news1.google.com comp.lang.ada:2423
Date: 2004-07-27T15:12:30-05:00
List-Id: <comp.lang.ada>

There is a hierarchy of ignorance, which has been summarized into Orders of
Ignorance (see http://www.corvusintl.com/CACM_Oct_2000.htm).

It strikes me that there's an analogous hierarchy of Fault Management for
software systems, which I summarize as follows:

Fault Management Order 0: Nothing can go wrong.
   - Short of hardware failure, proper verification of the software ensures
no faults. (See www.sparkada.com).

Fault Management Order 1: I know what can go wrong.
   - And then plan for it. Timeouts with appropriate retry or other recovery
processing, exception handling (such as End_Error being raised when reading
a file), and validation (with 'Valid and other checks) of data received
via an external interface are in place and ready to handle the faults that
are known to occur, no matter how egregious they may be.

Fault Management Order 2: I don't know what can go wrong.
   - Assuming Fault Management Order 1 is properly addressed, this
predominantly involves bugs. E.g., an interface believed to be Order 0 or 1
violates what was thought to be known about it, or a bug in the system
manifests itself. Recovery from such situations could involve restarting the
system, or the individual component in which the problem occurred.

Fault Management Order 3: I wouldn't know if something went wrong.
   - This can involve not checking return codes or the results of resource
requests, or blithely using "when others => null" exception handlers. The
system will continue to run, with its users ignorant of the degradation and
errors that may be accumulating.
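
As a minimal sketch of the Order 1 measures above, the following shows 'Valid
rejecting a bad value from an external interface and a handler for End_Error
on a truncated file. The Command type, its encodings, the file name and the
"fall back to defaults" recovery are all hypothetical, chosen only for
illustration.

```ada
with Ada.Text_IO;              use Ada.Text_IO;
with Ada.Unchecked_Conversion;

procedure FMO_1_Sketch is

   --  Hypothetical command set arriving as one raw byte from an
   --  external interface; only 1, 2 and 4 are legal encodings.
   type Command is (Start, Stop, Reset);
   for Command use (Start => 1, Stop => 2, Reset => 4);
   for Command'Size use 8;

   type Byte is mod 2 ** 8;
   for Byte'Size use 8;

   function To_Command is new Ada.Unchecked_Conversion (Byte, Command);

   --  Known fault: the sender may transmit a byte that is not a Command.
   procedure Handle (Raw : Byte) is
      C : constant Command := To_Command (Raw);
   begin
      if C'Valid then
         Put_Line ("Accepted " & Command'Image (C));
      else
         Put_Line ("Rejected raw value" & Byte'Image (Raw));
      end if;
   end Handle;

   --  Known fault: the file ends before a complete record is read.
   procedure Read_Pair (Name : String) is
      F : File_Type;
   begin
      Open (F, In_File, Name);
      declare
         First  : constant String := Get_Line (F);
         Second : constant String := Get_Line (F);
      begin
         Put_Line ("Read: " & First & ", " & Second);
      end;
      Close (F);
   exception
      when End_Error =>
         --  Anticipated, so the recovery is planned rather than improvised.
         Put_Line ("Record truncated, falling back to defaults");
         Close (F);
   end Read_Pair;

   Demo_File : constant String := "fmo1_demo.txt";  --  hypothetical name
   F         : File_Type;

begin
   Handle (2);   --  a legal encoding
   Handle (3);   --  not a legal encoding

   --  Write a one-line file so that the second Get_Line hits end of file.
   Create (F, Out_File, Demo_File);
   Put_Line (F, "alpha");
   Close (F);
   Read_Pair (Demo_File);
end FMO_1_Sketch;
```

Only End_Error is handled in Read_Pair; anything else (a missing file, say)
still propagates, because it was not one of the faults planned for here.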


FMO-3 is unacceptable of course. You shouldn't even be programming if your
code "handles" faults this way. You can at least turn FMO-3 into FMO-2 with
techniques like asserting return codes and resource request results, and
removing all exception handlers whose purpose is not explicit. This means
not just "when others =>", but probably also most "when Constraint_Error =>"
and "when Program_Error =>" appearances. Even if you don't know why
something somewhere may have gone wrong, at least know when it's gone
_right_.
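
A rough sketch of that FMO-3 to FMO-2 conversion, under some assumptions:
Reserve_Buffer is a hypothetical C-style resource request returning 0 on
success, and Level stands in for any constrained state. The check uses the
Ada.Assertions.Assert procedure from Ada 2005, which takes effect regardless
of the assertion policy; pragma Assert would serve equally well where
assertions are enabled.

```ada
with Ada.Assertions; use Ada.Assertions;
with Ada.Text_IO;    use Ada.Text_IO;

procedure FMO_3_To_2_Sketch is

   subtype Percent is Integer range 0 .. 100;
   Level : Percent := 50;

   --  Hypothetical resource request in the C style: 0 means success.
   function Reserve_Buffer (Size : Natural) return Integer is
   begin
      if Size <= 1024 then
         return 0;
      else
         return -1;
      end if;
   end Reserve_Buffer;

   --  FMO-3: the fault is swallowed and Level silently keeps a stale value.
   procedure Update_Silently (Reading : Integer) is
   begin
      Level := Reading;
   exception
      when others =>
         null;
   end Update_Silently;

   --  FMO-2: no handler, so an out-of-range reading propagates as
   --  Constraint_Error instead of being hidden.
   procedure Update_Loudly (Reading : Integer) is
   begin
      Level := Reading;
   end Update_Loudly;

   Status : Integer;

begin
   Update_Silently (250);
   Put_Line ("After silent update, Level =" & Integer'Image (Level));

   --  Assert the return code rather than ignoring it.
   Status := Reserve_Buffer (512);
   Assert (Status = 0, "Reserve_Buffer (512) failed");
   Put_Line ("Buffer reserved");

   --  Raises Constraint_Error; with no handler the program stops here.
   Update_Loudly (250);
   Put_Line ("Not reached");
end FMO_3_To_2_Sketch;
```

The silent version carries on with Level unchanged and nobody the wiser; the
loud version ends the run with an unhandled exception, which is at least
something you can see and fix.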

FMO-2 is what I always find problematic. The statement "All software has
bugs" gets thrown around, and through gritted teeth I have to agree, but too
often I hear that used as an excuse for a lack of development rigor. And just
today I discovered a new term, "software rejuvenation", that addresses FMO-2
by preemptively and regularly restarting a system
(http://www.stsc.hill.af.mil/crosstalk/2004/08/0408Bernstein.html). The
authors' research shows that it's been used and is effective, but I just
want to sigh "You're giving up! Fix the bugs!"

FMO-2 can be attacked with well-defined engineering practices, a good
software engineering oriented language (Ada) and liberal use of [pragma]
assert, but what do you do about the effect of those bugs that slip by? Or when
a trusted interface unexpectedly misbehaves? You can't anticipate these
specific occurrences, so how do you deal with them?

My predilection is to let the system fail (definitely so during the
development and integration phases); at least then you can fix the bug, or
turn an FMO-2 instance into FMO-1. But the counter-scenario is of course,
"What if that system has been delivered and is flying your plane?"

I bring this up because I often see exception handling discussed as it
pertains to what amount to FMO-1 and FMO-2 scenarios, but with the
participants not clearly aware of the distinction between the two, even
though exceptions serve a different purpose in each.
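
One way to picture the two roles, as a sketch only: Read_Sensor, the
Sensor_Timeout exception and the restart limit below are hypothetical, and
"restarting the component" is modelled simply as calling it again. The named
handler is FMO-1 (a known fault with a planned recovery); the "when others"
handler is FMO-2, and its purpose is made explicit by reporting what it
caught before restarting.

```ada
with Ada.Exceptions; use Ada.Exceptions;
with Ada.Text_IO;    use Ada.Text_IO;

procedure Handler_Roles_Sketch is

   Sensor_Timeout : exception;   --  hypothetical, anticipated fault

   Max_Restarts : constant := 3;
   Calls        : Natural  := 0;

   --  Hypothetical component: times out on the first call, trips over
   --  a bug on the second, and works on the third.
   procedure Read_Sensor is
      subtype Small is Integer range 1 .. 10;
      Value : Small;
   begin
      Calls := Calls + 1;
      case Calls is
         when 1 =>
            raise Sensor_Timeout;
         when 2 =>
            Value := Small'Last + Calls;   --  the bug: out of range
         when others =>
            Value := 5;
      end case;
      Put_Line ("Sensor value:" & Integer'Image (Value));
   end Read_Sensor;

begin
   for Attempt in 1 .. Max_Restarts loop
      begin
         Read_Sensor;
         exit;
      exception
         when Sensor_Timeout =>
            --  FMO-1: I know this can go wrong, and retrying is the plan.
            Put_Line ("Timeout on attempt" & Integer'Image (Attempt)
                      & ", retrying");
         when E : others =>
            --  FMO-2: I did not know this could go wrong. Record it so
            --  the bug can be fixed, then restart the component.
            Put_Line ("Unexpected " & Exception_Name (E)
                      & ", restarting component");
      end;
   end loop;
end Handler_Roles_Sketch;
```

Whether the FMO-2 handler should restart, as here, or let the system fail is
exactly the judgment call discussed above; the sketch only shows where that
decision lives.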

Something to ponder...

Marc A. Criley
McKae Technologies
www.mckae.com