From: "Marc A. Criley"
Newsgroups: comp.lang.ada
Subject: Orders of Fault Management
Date: Tue, 27 Jul 2004 15:12:30 -0500
Message-ID: <2mnr9kFnpbivU1@uni-berlin.de>

There is a hierarchy of ignorance, which has been summarized into Orders
of Ignorance (see http://www.corvusintl.com/CACM_Oct_2000.htm).

It strikes me that there's an analogous hierarchy of Fault Management
for software systems, which I summarize as follows:

Fault Management Order 0: Nothing can go wrong.
 - Short of hardware failure, proper verification of the software
   ensures no faults. (See www.sparkada.com).

Fault Management Order 1: I know what can go wrong.
 - And then plan for it. Timeouts with appropriate retry or other
   recovery processing, exception handling (such as End_Error being
   raised when reading a file), and validation (with 'Valid and other
   checks) of bad data received via an external interface are in place
   and ready to handle the faults that are known to be possible, no
   matter how egregious they may be.

Fault Management Order 2: I don't know what can go wrong.
 - Assuming Fault Management Order 1 is properly addressed, this
   predominantly involves bugs.
   For example, an Order 0 or 1 interface violates what was thought to
   be known about it, or a bug in the system manifests itself. Recovery
   from such situations could involve restarting the system, or the
   individual component in which the problem occurred.

Fault Management Order 3: I wouldn't know if something went wrong.
 - This can involve not checking return codes or the results of
   resource requests, or blithely using "when others => null"
   exception handlers. The system will continue to run, with its users
   ignorant of the degradation and errors that may be accumulating.

FMO-3 is unacceptable, of course. You shouldn't even be programming if
your code "handles" faults this way.

You can at least turn FMO-3 into FMO-2 with techniques like asserting
return codes and resource request results, and removing all exception
handlers whose purpose is not explicit. This means not just
"when others =>", but probably also most "when Constraint_Error =>" and
"when Program_Error =>" appearances. Even if you don't know why
something somewhere may have gone wrong, at least know when it's gone
_right_.

FMO-2 is what I always find problematic. The statement "All software
has bugs" gets thrown around, and through gritted teeth I have to
agree, but too often I hear that used as an excuse for lack of
development rigor. And just today I discovered a new term, "software
rejuvenation", that addresses FMO-2 by preemptively and regularly
restarting a system
(http://www.stsc.hill.af.mil/crosstalk/2004/08/0408Bernstein.html).
The authors' research shows that it's been used and is effective, but I
just want to sigh "You're giving up! Fix the bugs!"

FMO-2 can be attacked with well-defined engineering practices, a good
software-engineering-oriented language (Ada), and liberal use of
[pragma] assert, but what do you do about the effect of those bugs that
slip by? Or when a trusted interface unexpectedly misbehaves? You can't
anticipate these specific occurrences, so how do you deal with them?
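
To make the FMO-1 validation and the FMO-3 to FMO-2 conversion mentioned
above concrete, here is a minimal Ada sketch. It is an illustration only,
not code from the post: Mode, Decode, Acquire, Require and
Unexpected_Result are hypothetical names invented for the example, and the
"raise ... with" form assumes an Ada 2005 or later compiler. Require stands
in for pragma Assert, since some compilers ignore pragma Assert unless
assertion checking is enabled.

```ada
with Ada.Text_IO;
with Ada.Exceptions;
with Ada.Unchecked_Conversion;
with Interfaces;

procedure Fault_Orders is

   --  Hypothetical mode byte received over an external interface.
   --  Only the values 1, 2 and 4 are legal on the wire.
   type Mode is (Idle, Active, Standby);
   for Mode use (Idle => 1, Active => 2, Standby => 4);
   for Mode'Size use 8;

   function To_Mode is
     new Ada.Unchecked_Conversion (Interfaces.Unsigned_8, Mode);

   --  FMO-1: bad external data is a known fault, so check it with 'Valid
   --  and apply a planned recovery (here, fall back to Idle).
   function Decode (Raw : Interfaces.Unsigned_8) return Mode is
      Candidate : constant Mode := To_Mode (Raw);
   begin
      if Candidate'Valid then
         return Candidate;
      else
         return Idle;
      end if;
   end Decode;

   type Status is (Ok, Failed);

   --  Hypothetical resource request reporting its outcome as a status.
   procedure Acquire (Units : in Natural; Result : out Status) is
   begin
      if Units <= 4 then
         Result := Ok;
      else
         Result := Failed;
      end if;
   end Acquire;

   Unexpected_Result : exception;

   --  FMO-3 -> FMO-2: never drop a result silently. This plays the role
   --  of pragma Assert, but cannot be switched off by compiler settings.
   procedure Require (Condition : in Boolean; Message : in String) is
   begin
      if not Condition then
         raise Unexpected_Result with Message;
      end if;
   end Require;

   Result : Status;

begin
   --  3 is not a legal Mode representation, so Decode falls back to Idle.
   Ada.Text_IO.Put_Line ("Decoded: " & Mode'Image (Decode (3)));

   --  The FMO-3 version would call Acquire and never look at Result.
   Acquire (Units => 2, Result => Result);
   Require (Result = Ok, "Acquire (2) returned " & Status'Image (Result));

   begin
      Acquire (Units => 8, Result => Result);
      Require (Result = Ok, "Acquire (8) returned " & Status'Image (Result));
   exception
      --  A handler with an explicit purpose: report the one fault we
      --  raised ourselves. No "when others => null" anywhere.
      when E : Unexpected_Result =>
         Ada.Text_IO.Put_Line
           ("Detected: " & Ada.Exceptions.Exception_Message (E));
   end;
end Fault_Orders;
```

The inner handler exists only so the sketch runs to completion. Removing
it lets Unexpected_Result propagate and stop the program, which is the
"let the system fail" behavior for an unanticipated fault.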

My predilection is to let the system fail (definitely so during the
development and integration phases); at least then you can fix the bug,
or turn an FMO-2 instance into FMO-1. But the counter-scenario is, of
course, "What if that system has been delivered and is flying your
plane?"

I bring this up because I often see exception handling discussed in
terms of what amount to FMO-1 and FMO-2 scenarios, but with the
participants not clearly aware of the distinction between the two, or
that exceptions serve a different purpose in each.

Something to ponder...

Marc A. Criley
McKae Technologies
www.mckae.com