comp.lang.ada
* Orders of Fault Management
From: Marc A. Criley @ 2004-07-27 20:12 UTC


There is a hierarchy of ignorance, which has been summarized into Orders of
Ignorance (see http://www.corvusintl.com/CACM_Oct_2000.htm).

It strikes me that there's an analogous hierarchy of Fault Management for
software systems, which I summarize as follows:

Fault Management Order 0: Nothing can go wrong.
   - Short of hardware failure, proper verification of the software ensures
no faults. (See www.sparkada.com).

Fault Management Order 1: I know what can go wrong.
   - And then plan for it. Timeouts with appropriate retry or other recovery
processing, exception handling (such as End_Error being raised when reading
a file), and validation (with 'Valid and other checks) of data received
via an external interface are in place and ready to handle the faults that
are known to occur, no matter how egregious they may be.

Fault Management Order 2: I don't know what can go wrong.
   - Assuming Fault Management Order 1 is properly addressed, this
predominantly involves bugs. E.g., an interface believed to be Order 0 or 1
violates what was thought to be known about it, or a bug in the system
manifests itself. Recovery from such situations could involve restarting the
system, or the individual component in which the problem occurred.

Fault Management Order 3: I wouldn't know if something went wrong.
   - This can involve not checking return codes or the results of resource
requests, or blithely using "when others => null" exception handlers. The
system will continue to run, with its users ignorant of the degradation and
errors that may be accumulating.
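
To make the FMO-1 items above concrete, here is a minimal sketch of two of
them: a 'Valid check on a byte arriving from an external interface, and a
handler for End_Error when reading a file. The Command type, its encoding,
the Decode procedure and the fallback to Stop are all made up for
illustration; only Ada.Text_IO, Interfaces and Ada.Unchecked_Conversion are
standard.

```ada
with Ada.Text_IO;
with Ada.Unchecked_Conversion;
with Interfaces;

procedure FMO_1_Sketch is
   use Ada.Text_IO;

   --  Hypothetical command set with gaps in its encoding, so that some
   --  byte values (0, 3, 5 .. 255) do not represent any Command.
   type Command is (Start, Stop, Restart);
   for Command use (Start => 1, Stop => 2, Restart => 4);
   for Command'Size use 8;

   function To_Command is
     new Ada.Unchecked_Conversion (Interfaces.Unsigned_8, Command);

   --  Known fault: the external interface may deliver a byte that is
   --  not a Command. Check with 'Valid before the value is used.
   procedure Decode
     (Raw    : in  Interfaces.Unsigned_8;
      Result : out Command;
      Good   : out Boolean)
   is
      Candidate : constant Command := To_Command (Raw);
   begin
      Good := Candidate'Valid;
      if Good then
         Result := Candidate;
      else
         Result := Stop;  --  assumed safe fallback, purely illustrative
      end if;
   end Decode;

   Cmd  : Command;
   Good : Boolean;
   File : File_Type;
begin
   Decode (2, Cmd, Good);
   Put_Line ("raw 2 valid? " & Boolean'Image (Good));
   Decode (3, Cmd, Good);
   Put_Line ("raw 3 valid? " & Boolean'Image (Good));

   --  Known fault: reading past the end of a file raises End_Error.
   Create (File, Out_File);          --  unnamed temporary file
   Put_Line (File, "only line");
   Reset (File, In_File);
   loop
      begin
         Put_Line ("read: " & Get_Line (File));
      exception
         when End_Error =>           --  one specific, anticipated fault
            Put_Line ("end of input; stopping cleanly");
            exit;
      end;
   end loop;
   Close (File);
end FMO_1_Sketch;
```

Both handlers name exactly the fault they expect, which is what separates
this from the catch-all handling described under Order 3.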


FMO-3 is unacceptable, of course. You shouldn't even be programming if your
code "handles" faults this way. You can at least turn FMO-3 into FMO-2 with
techniques like asserting return codes and resource request results, and
removing all exception handlers whose purpose is not explicit. This means
not just "when others =>", but probably also most "when Constraint_Error =>"
and "when Program_Error =>" appearances. Even if you don't know why
something somewhere may have gone wrong, at least know when it's gone
_right_.
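
A rough sketch of that FMO-3 to FMO-2 conversion, under made-up names
(Request_Buffer, its Status codes and Resource_Failure are hypothetical
stand-ins for whatever resource request a real system makes):

```ada
with Ada.Text_IO;
with Ada.Exceptions;

procedure FMO_3_To_2_Sketch is
   use Ada.Text_IO;

   --  Hypothetical resource request that reports failure in a status code.
   type Status is (Success, Out_Of_Buffers);

   procedure Request_Buffer (Size : in Natural; Result : out Status) is
   begin
      if Size > 1024 then
         Result := Out_Of_Buffers;
      else
         Result := Success;
      end if;
   end Request_Buffer;

   Resource_Failure : exception;

   --  FMO-3: the status is never examined and any exception is swallowed,
   --  so the caller cannot tell that the request failed.
   procedure Careless is
      Ignored : Status;
   begin
      Request_Buffer (4096, Ignored);
   exception
      when others => null;
   end Careless;

   --  FMO-2: the status is checked and a failure is made loud. A
   --  pragma Assert (Result = Success) states the same intent, but it
   --  only takes effect when assertion checks are enabled.
   procedure Careful is
      Result : Status;
   begin
      Request_Buffer (4096, Result);
      if Result /= Success then
         raise Resource_Failure
           with "buffer request returned " & Status'Image (Result);
      end if;
   end Careful;

begin
   Careless;
   Put_Line ("Careless returned; the failure went unnoticed");
   Careful;
   Put_Line ("not reached when the request fails");
exception
   when E : Resource_Failure =>
      Put_Line ("detected: " & Ada.Exceptions.Exception_Message (E));
end FMO_3_To_2_Sketch;
```

Careful still has no idea why the request failed, so this is FMO-2 at best,
but the failure can no longer pass silently. The outer handler is there only
to show the message; letting the exception propagate and stop the program
would serve the same purpose.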

FMO-2 is what I always find problematic. The statement "All software has
bugs" gets thrown around, and through gritted teeth I have to agree, but too
often I hear that used as an excuse for lack of development rigor. And just
today I discovered a new term, "software rejuvenation", that addresses FMO-2
by preemptively and regularly restarting a system
(http://www.stsc.hill.af.mil/crosstalk/2004/08/0408Bernstein.html). The
authors' research shows that it's been used and is effective, but I just
want to sigh "You're giving up! Fix the bugs!"

FMO-2 can be attacked with well-defined engineering practices, a good
software-engineering-oriented language (Ada), and liberal use of [pragma]
assert, but what do you do about the effect of those bugs that slip by? Or when
a trusted interface unexpectedly misbehaves? You can't anticipate these
specific occurrences, so how do you deal with them?

My predilection is to let the system fail (definitely so during the
development and integration phases); at least then you can fix the bug, or
turn an FMO-2 instance into FMO-1. But the counter-scenario is, of course,
"What if that system has been delivered and is flying your plane?"

I bring this up because I often see exception handling discussed as it
pertains to what amount to both FMO-1 and FMO-2 scenarios, with the
participants not clearly aware of the distinction between the two, or that
exceptions serve a different purpose in each.

Something to ponder...

Marc A. Criley
McKae Technologies
www.mckae.com




