From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.3 required=5.0 tests=BAYES_00,INVALID_MSGID
	autolearn=no autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,f55a4f84e352c8ec
X-Google-Attributes: gid103376,public
From: "Samuel T. Harris" <samuel_t_harris@Raytheon.com>
Subject: Re: Ariane (yet again...)
Date: 2000/01/19
Message-ID: <38864C08.E9C68954@Raytheon.com>#1/1
X-Deja-AN: 574915474
Content-Transfer-Encoding: 7bit
References: <3882120e_3@news.jps.net>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
Organization: Raytheon Aerospace Engineering Services
Mime-Version: 1.0
Newsgroups: comp.lang.ada
Date: 2000-01-19T00:00:00+00:00
List-Id: <comp.lang.ada>

Mike Silva wrote:
> 
> Before anybody starts throwing anything, my question is very specific --
> does anybody know exactly what the Ariane report means when it speaks of
> "protecting" conversions?  The subject came up in alt.folklore.computers and
> it seems there are at least three possible meanings: (a) turn off the
> runtime checks for a given conversion, (b) put some code before the
> conversion to explicitly check for in-range, or (c) have a local exception
> handler to catch the error.  Anybody know what exactly was / was not done
> (or even better, have an actual code fragment)?  I've just always been
> curious...
> 
> Mike

You should read the report itself. It is an excellent read on
the nature of cascading failures and how good technical effort
can be fouled up by poor management practices. I'll try to summarize
below ...


The report did not specify the nature of the conversion code.
However, given the nature of the problem, it might have been a scaled
conversion of an integer-type sensor reading to a floating- or
fixed-point type for the code to use. This is pretty normal
in this problem domain. It probably was not a simple
Unchecked_Conversion, since the sizes required for the two
mentioned types do not match.

The report specified that the conversion resulted in a value
being out of range. It did not specify how the code determined this.
Namely, did a normal Ada runtime check on the resulting value
see that it was outside the range of the type or did the code
use some explicit range check? We don't know from the report.
The report did specify that an exception was raised.
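
To make the three readings of a "protected" conversion from the
question concrete, here is a minimal sketch. It is in Python purely
for illustration and is not the flight code, which the report does
not show. The 16-bit signed target range, the exception name
ConversionRangeError (a stand-in for Ada's Constraint_Error), and
all three function names are assumptions made for this example.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed target


class ConversionRangeError(Exception):
    """Hypothetical stand-in for Ada's Constraint_Error."""


def convert_checked(value: float) -> int:
    """Reading (b): explicit in-range test before the conversion."""
    if not (INT16_MIN <= value <= INT16_MAX):
        raise ConversionRangeError(f"{value!r} does not fit in 16 bits")
    return int(round(value))


def convert_with_handler(value: float) -> int:
    """Reading (c): a local handler catches the error and saturates."""
    try:
        return convert_checked(value)
    except ConversionRangeError:
        return INT16_MAX if value > 0 else INT16_MIN


def convert_unchecked(value: float) -> int:
    """Reading (a): no check at all; the result silently wraps
    modulo 2**16 (finite inputs only)."""
    return (int(round(value)) + 32768) % 65536 - 32768
```

With an input of 40000.0, convert_checked raises, convert_with_handler
returns the saturated value 32767, and convert_unchecked returns a
wrapped value with the wrong sign. Which of these is "right" depends
entirely on what the surrounding system assumes about the data.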

The report did specify that an exception handler was not provided,
based on the exhaustive analysis of the Ariane 4 teams proving
that such an exception would never occur. Several exception handlers
were omitted on the basis of similar analysis, to save processing
time and memory. Such analysis is common in this field
and very reliable given the known constraints of the Ariane 4
trajectory and acceleration profile. This kind of analysis is
vital in supporting the assumption that bad data is the result
of a hardware failure. That is what happened on the Ariane 5:
the bad data was interpreted as a hardware failure, so the component
went into diagnostic mode. In fact, the backup component
failed before the main component.

Had an exception handler been present, I'm not sure what
it could have done except indicate a hardware failure, which is
the supported assumption of the component in the intended
environment (the Ariane 4).

In this diagnostic mode, the system sends diagnostic information
to the central processor. The central processor misinterpreted
this as real attitude and altitude information and commanded
the thrusters to maximum deflection to correct the "course"
of the rocket. This caused the rocket to turn sideways, introducing
catastrophic stresses on the fuselage as the air flow moved
from the nose to the side of the rocket. Sensors detected
the impending failure of the superstructure and the rocket commanded
a self-destruct to ensure lots of little bits of debris fell downrange
instead of two or three very large sections.

If there is a real "bug", it is the misinterpretation of this
diagnostic information by the command processor. It should
have known it was not real attitude and altitude information.
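
One way a consumer can tell the two kinds of data apart is for every
message to carry an explicit tag. The sketch below illustrates that
general idea only. It says nothing about how the Ariane bus traffic
was actually laid out, and the names Kind, Message and attitude_words
are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Kind(Enum):
    FLIGHT_DATA = "flight_data"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class Message:
    kind: Kind               # explicit tag set by the sender
    words: Tuple[int, ...]   # raw payload words


def attitude_words(msg: Message) -> Optional[Tuple[int, ...]]:
    """Return the payload only when it is tagged as flight data."""
    if msg.kind is Kind.FLIGHT_DATA:
        return msg.words
    # Diagnostic payloads are never handed on as attitude data;
    # the caller has to decide on a fallback.
    return None
```

With a tag like this, a diagnostic dump yields None instead of being
fed to the guidance calculation as if it were a measurement.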

The sad part of this is that the code in question is used on the
Ariane 4 to enable a quick reset should a launch be aborted. This
code is useless on the Ariane 5.

The real problem was that this code was not being used on an Ariane 4,
but was being reused on an Ariane 5 without any verification whatsoever.
The Ariane 5 has a significantly different acceleration
and trajectory profile. These differences made all the work
proving the Ariane 4 would never raise the exception inapplicable,
yet similar work was not done to verify this code on the Ariane 5.
The contractor was not given the expected acceleration and trajectory
profile of the Ariane 5, nor was the contractor required to test
against them.

The report also noted that no simulations were run and speculated
that a single simulation of the involved components, either
individually or in an integrated environment, would have quickly
identified the problem.

A big case of overreliance on code reuse and under-employment of basic
verification methods. Just because something worked in the past does
not mean it will work in the future, especially when the enclosing
environment changes. The design teams considered each and every
verification method and decided, for each of them, that it was not
worth doing. The problem is that they never stepped back to see that
they had really done nothing at all to verify the Ariane 4 code. While
each decision had some merit when considered individually, all together
they present an insane position to take.

A management problem after all.

-- 
Samuel T. Harris, Principal Engineer
Raytheon, Aerospace Engineering Services
"If you can make it, We can fake it!"