From: simonb@pact.srf.ac.uk (Simon Bluck)
Subject: Ariane 5 - not an exception?
Date: 1996/07/25
Organization: University of Bristol, England
Newsgroups: comp.software-eng,comp.lang.ada

The Ariane 501 flight failure was due to the raising of an unexpected Ada
exception, which was handled by switching off the computer. The report on
this:

  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

is clear and hard-hitting, and it will result in much improved software.
But does it get right to the bottom of the issues? Does the software
community appreciate that there are fundamental software control problems
which can directly give rise to such enormous failures - in this particular
case thankfully without loss of life?

It is most unfortunate, but must be accepted as true, that if the Ariane
software had been written in a less powerful language, the numeric overflow
might have gone unnoticed, the computers would have remained switched on,
and the rocket would have continued its upward flight.

Exceptions and assertions are both used, in Ada and in C/C++, to detect
software and hardware anomalies. When one of these trips, it is frequently
very difficult for the designer to know how best to handle the problem. To
continue may result in corrupt data; to abort is drastic, but eliminates
the possibility that further processing will compound the problem.
The more checks you have, the more likely it is that one of them will trip.
If you can't think of good ways of handling these checks, the end result,
for the user, may well be very much worse than if the check had never been
performed in the first place.

Of the two handling options, neither is really acceptable. However, there
is a third option which ought to be considered: to continue, but mark the
processed data as suspect. I.e. each data item would have a truth value of
1.0 for good data, 0.0 for absolutely rotten data, utilising values in
between if you have some idea how good the data is. If you have numeric
overflow, you could set the data to the largest value available and mark it
as suspect. Any data further derived from suspect data must also be marked
as suspect.

Taking a probabilistic attitude to data would bring a lot of software into
the real world, where failures can happen at all levels. Using this
approach would make complex mission-critical software, like the failing
Ariane software, much easier to understand and control. Data would be
processed along the same path regardless of whether it is suspect or
entirely valid. Only the end-users of the data would be affected. Where
duplication of systems provides redundancy, the algorithm would be to
switch to the backup on receiving suspect data, and switch back to the main
source if the backup was suspect. If both sources are suspect, take the
least suspect source. This is simple, and you don't lose your vital input
data.

The data truth values would be passed on from system to system along with
the data. You _never_ switch off a computer, but you may have cause to mark
all data emanating from it as suspect. Leave it up to the users of the data
to decide whether they want to use it or not - they may have no choice.

Along with the data truth attribute, you need a data type attribute. This
is tending to become relatively standard stuff now that objects are around
and need to know what kind of object they are.
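To make the truth-value idea concrete, here is a minimal sketch in C++ (one
of the languages mentioned above). All the names - Tagged, to_int16,
select_source, the 0.5 threshold - are invented for illustration; this is
one possible shape for the scheme, not a definitive design. It shows
suspect-marking on overflow, propagation of the lowest truth value through
derived data, and the backup-switching rule just described:

```cpp
#include <algorithm>

// A data item tagged with a truth value: 1.0 for good data, 0.0 for
// absolutely rotten data, values in between for partly trusted data.
struct Tagged {
    double value;
    double truth;   // in [0.0, 1.0]
};

// Data derived from suspect data must also be marked suspect: the
// result can be no more trustworthy than its least trustworthy input.
Tagged add(Tagged a, Tagged b) {
    return { a.value + b.value, std::min(a.truth, b.truth) };
}

// Instead of raising an exception on overflow (and switching the
// computer off), saturate to the largest representable value and
// mark the result as suspect.
Tagged to_int16(Tagged a) {
    if (a.value > 32767.0)  return { 32767.0, 0.0 };
    if (a.value < -32768.0) return { -32768.0, 0.0 };
    return a;
}

// Redundancy: use the main source unless it is suspect, fall back to
// the backup, and if both are suspect take the least suspect.  The
// 0.5 threshold is an arbitrary choice for this sketch.
Tagged select_source(Tagged main_src, Tagged backup,
                     double threshold = 0.5) {
    if (main_src.truth >= threshold) return main_src;
    if (backup.truth  >= threshold) return backup;
    return (main_src.truth >= backup.truth) ? main_src : backup;
}
```

Note that nothing here ever aborts: every path returns a value, and the
truth field is the only place where trouble is recorded.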
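The data type attribute can be sketched in the same spirit. The tag values
and the Message layout below are again invented for illustration, but the
point is the one made above: a receiver that checks a type tag cannot
misread diagnostic data as attitude data.

```cpp
#include <cstdint>

// Each message carries a type tag alongside its payload, so the
// receiver never relies on implicit agreement with the sender about
// what is on the bus.  (Tags and layout invented for illustration.)
enum class MsgType : std::uint8_t { Attitude, Diagnostic };

struct Message {
    MsgType type;
    double  payload;
};

// The flight-control side checks the tag before using the payload,
// and refuses wrong-kind data rather than flying on it.
bool read_attitude(const Message& m, double& out) {
    if (m.type != MsgType::Attitude)
        return false;   // diagnostic data: refuse, don't misinterpret
    out = m.payload;
    return true;
}
```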
But adding a data type field is still something that designers skimp on if
it is not supplied by the language, relying instead on implicit coding of
type information in the senders and receivers of data. Lack of type
information accounts for why the Ariane flight control was able to
interpret diagnostic data as attitude data, virtually guaranteeing
catastrophic failure. At least if the attitude data had merely been cut
short, the rocket could have continued in a straight line.

Well, those are what I think are the important lessons to be learned. The
main reasons cited for Ariane 501's failure are typical human ones which
will be made again on the next big project. I.e. inadequate testing,
particularly of the complete system in its (simulated) environment -
surprise, surprise, this turns out to be too difficult and too costly to
achieve thoroughly. And small system mistakes which stress the adequate
functioning of the system as a whole (like thinking that the Ariane 4
alignment process didn't need changing for Ariane 5). These will happen
time and again; we're only human. But with more realistic data processing,
the system as a whole would stand a better chance of survival.

SimonB

[All my own opinions, of course.]