From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,885dab3998d28a4
X-Google-Attributes: gid103376,public
From: "Samuel T. Harris" <s_harris@gsde.hso.link.com>
Subject: Re: Ariane 5 failure
Date: 1996/10/18
Message-ID: <3267D929.167E@gsde.hso.link.com>
X-Deja-AN: 190486188
references: <96100111162774@psavax.pwfl.com>
 <mheaney-ya023180000210962238070001@news.ni.net>
 <32555A39.E38@lmtas.lmco.com> <dewar.844518011@schonberg>
 <mheaney-ya023180001410962319550001@news.ni.net>
 <326506D2.1E40@lmtas.lmco.com> <DzG9vM.3u8@thomsoft.com>
content-type: text/plain; charset=us-ascii
organization: Hughes Training Inc. - Houston Operations
mime-version: 1.0
newsgroups: comp.lang.ada
x-mailer: Mozilla 3.01b1 (X11; I; IRIX 5.3 IP19)
Date: 1996-10-18T00:00:00+00:00
List-Id: <comp.lang.ada>


Keith Thompson wrote:
> 
> In <326506D2.1E40@lmtas.lmco.com> Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
> 
> This is one thing that's confused me about this report.  There is no
> predefined exception in Ada called Operand_Error.  Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.
> 

Remember, the report does NOT state that an unchecked_conversion
was used (as some on this thread have assumed). It only states
a "data conversion from 64-bit floating point to 16-bit signed
integer value". As someone (I forget who) pointed out early
in the thread weeks ago, a standard practice is to scale down
the range of a float value to fit into an integer variable.
This may not have been an unchecked_conversion at all, but
some mathematical expression.
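
To make the distinction concrete, here is a rough sketch (in
Python rather than Ada, purely for illustration) of a scaled
conversion into a 16-bit signed range. The report does not show
the actual SRI code, so every name here, including the scale
parameter and the OperandError class, is my own assumption. The
first function fails loudly when the scaled value does not fit;
the second clamps instead, which is one way a variable could be
"protected".

```python
# Illustrative sketch only: NOT the SRI code, which the report does not show.
# Assumes finite float inputs (no NaN or infinity).

INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Hypothetical stand-in for the exception named in the report."""


def to_int16_checked(value: float, scale: float = 1.0) -> int:
    """Scale a float and convert it, raising if it does not fit in 16 bits."""
    scaled = int(value * scale)  # int() truncates toward zero
    if not INT16_MIN <= scaled <= INT16_MAX:
        raise OperandError(f"{value} * {scale} does not fit in 16 bits")
    return scaled


def to_int16_saturating(value: float, scale: float = 1.0) -> int:
    """Scale a float and convert it, clamping to the 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, int(value * scale)))
```

Either version is an ordinary expression plus a range check, with
no unchecked_conversion involved. Which behavior is right for a
given variable depends on the range of inputs the environment can
actually produce, which is the whole point of this thread.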

Whenever software is reused, it must be reverified AND
revalidated. The report cites several reasons for not
reverifying the reuse of the SRI from the Ariane 4. Any
one of which may be justifiable. However, a cardinal rule
of risk management is that any risk to which NO measures
are applied remains a risk. Here they justified their way
into applying no measures at all toward ensuring the stuff
would work.

The report also states that the code which contained the
conversion was part of a feature which was now obsolete
for the Ariane 5. It was left in "presumably based on the view that,
unless proven necessary, it was not wise to make changes in software
which worked well on Ariane 4." While this does make good sense,
it is not by any means a verification nor a validation.
It just seems to mitigate your risk, but it really does
no such thing. You can't let such thinking lull you into
a false sense of security.

The analysis which led to protecting four variables from
Operand_Error and leaving three unprotected was not revisited
with the new environment in mind. How could it be, since
the Ariane 5 trajectory data was not included as a functional
requirement? Hence this measure does not apply to the risk
of the Ariane 5, though some involved in the decision may have
relied upon it for just that protection.

Then they went as far as not revalidating the SRI in an
Ariane 5 environment, which was the real hurt. While the
report states the Ariane 5 flight data was not included as
a functional requirement, someone should have asked for it
if they needed it. Its omission means any verification testing
which was done would not have taken it into account.
So it would have been verified (which is testing against what
the user said he wanted). However, validation testers (who
test what the user actually wants and are supposed to be
smart enough NOT to take the specification at face value)
should have insisted on such data, included or not.
That's the silly part about the whole affair, validation
testing also was not performed.

The report then goes on to discuss why the SRI's were not
included in a closed-loop test. So even if the Ariane 5
trajectory data had been included as a functional requirement,
it would not have helped. While the technical reasons
cited are appropriate for a verification test, the report
correctly points out that the goals of validation testing
are not so stringently dependent on the fidelity of the test
environment, so those reasons just don't justify not having
the SRI's in at least one validation test using Ariane 5
trajectory data, especially when other measures have NOT
been taken to ensure a compatible reuse of software.

In fact, section 2.1 states "The SRI internal events that
led to the failure have been reproduced by simulation calculations."
I wonder if they compiled and ran the Ada code on another
platform (which is a viable way of doing a lot of testing
for embedded software prior to embedding the software).
The report does not state if such testing was performed
by the developer. If the developer had done such testing, then
the Ariane 5 trajectory data would have exposed the flaw.
If such testing was done, someone would have had to ask
explicitly for such data.

The end of section 2.3 summarizes the fact that the reviews
did not pick up the fact that of all potential measures which
could have been applied to determine a compatible reuse of
software into the Ariane 5 operational environment, NONE of
them were actually performed. This left the reviewers
blissfully ignorant of an unmitigated risk glaring them in
the face.

Of the SRI, I conclude ...

No design error (though it could have done something better).
No programming error (given the design).
An arguable specification error (but without appropriate testing).
A lapse in validation testing (assuming other non-existent measures).
A grave risk management and oversight problem.

Bottom line, a management (both customer and contractor) problem.

The OBC and main computer are another matter entirely.

I've not seen anyone on this thread address the entries
3.1.f and g concerning the SRI sending diagnostic data (item f)
which was interpreted as flight data by the launcher's main
computer (item g). Section 2.1 states the backup failed first and
declared a failure, and the OBC could not switch to it because
it had already ceased to function. It seems the OBC knew about
the failures, so why did the main computer still interpret
any data from a failed component as flight data?

That seems like a design or programming problem. It was
blind luck that the diagnostic data caused the main computer
to try to correct the trajectory via extreme positions of the
thruster nozzles which caused the rocket to turn sideways
to the air flow which caused buckling in the superstructure
which caused the self-destruct to engage.

Given the design philosophy of the designers, had the main
computer known both SRI's had failed, it should have signaled a
self-destruct right then and there. What would have happened
if the "diagnostic" data caused minor course corrections and
brought the rocket over a population area before the subsequent
course of events (or the ground flight controllers themselves)
signaled a self-destruct?
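
The kind of check I have in mind can be sketched roughly as
follows (again in Python, for illustration only). This is not the
actual OBC design or bus protocol, neither of which the report
describes in detail. The Frame type, the Kind tag, and the
failed_units set are all hypothetical names of my own. The idea is
simply that data from a unit already declared failed, or data not
tagged as flight data, is never handed to the guidance logic.

```python
# Illustrative sketch only: all types and names are hypothetical,
# not taken from the Ariane 5 flight software.
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Set, Tuple


class Kind(Enum):
    FLIGHT = "flight"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class Frame:
    source: str                  # which unit sent the frame, e.g. "SRI1"
    kind: Kind                   # what the frame claims to contain
    payload: Tuple[float, ...]   # raw values carried by the frame


def select_flight_data(frames: Iterable[Frame],
                       failed_units: Set[str]) -> Optional[Frame]:
    """Return the first frame that is usable as flight data, else None."""
    for frame in frames:
        if frame.source in failed_units:
            continue  # never trust a unit that has declared failure
        if frame.kind is not Kind.FLIGHT:
            continue  # diagnostic output is not attitude data
        return frame
    return None  # no valid source left: caller must take a safe action
```

What the caller does with None (self-destruct, hold last command,
or something else) is a design decision in its own right, but at
least it becomes an explicit one.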

The report does not delve into this aspect of the problem
which I consider to be even more important. This tends to
tell me the SRI simulators in the closed-loop testing which
was performed were not used to check malfunctions, or if
they were, then the test scenarios are incomplete or flawed.

How many other interface/protocol/integration problems
are waiting to crop up? Which reused Ariane 4 software component
will fail next? Stay tuned for these and other provocative
questions on "As the Ariane Burns" ;)

I wonder how the payload insurance companies will respond with
their pricing for the next couple of launches.

-- 
Samuel T. Harris, Senior Engineer
Hughes Training, Inc. - Houston Operations
2224 Bay Area Blvd. Houston, TX 77058-2099
"If you can make it, We can fake it!"