comp.lang.ada
From: Ken Garlington <garlingtonke@lmtas.lmco.com>
Subject: Re: Ariane 5 Failure - Summary Report
Date: 1996/07/26
Message-ID: <31F8DCDF.190F@lmtas.lmco.com>
In-Reply-To: 4t6opg$4cp@goanna.cs.rmit.edu.au


++ robin wrote:
>
>         >Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
>
>         >Don't know what happened there, but I was just going to point out
>         >that the Ariane 5 report is at:
>
>         >  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html
>
>         >Be sure to read the full report, which is linked to this page. It
>         >goes into some length about the sequence of events (which includes
>         >an Ada exception I never heard of before, Operand Error?
>
> ---That's fixed-point overflow.

Could you send me an Ada RM cite? I couldn't find it...

> Converting a 64-bit
> floating-point value to a 16 bit signed integer.
> The conversion was unchecked (programming error--

I don't know if I would call this a programming error or
a requirements error. Apparently, there was an analysis done
to see if the check should be required, and the analysis said
that it wasn't. Given the 80% utilization, I'm sure there was
some not-so-subtle pressure to leave out any code that wasn't
absolutely necessary.

> 1.  The size of the variable to hold the value (16 bits) was
>     inadequate;

Not that this was mentioned in the report, but the communications
link between the INS and the flight control computer uses a
MIL-STD-1553B data bus, which is a 16-bit protocol. They could
have used multiple words to contain this value, but it is common
for 1553B applications to convert floating point numbers to scaled
16-bit values, so long as the precision is still acceptable. What
apparently happened was that the scaling for the Ariane 4 was acceptable,
but was not updated for the Ariane 5 based on an analysis that said the
ranges should be maintained.
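
A rough sketch of the kind of scaling I mean, in Python for brevity. This
is not the SRI code (that was Ada, and I haven't seen it); the function
name and the full-scale figure are made up for illustration, since the
report does not give the real scaling.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_scaled_int16(value, full_scale):
    """Scale a float in [-full_scale, +full_scale) to a signed 16-bit count.

    full_scale is a hypothetical engineering-unit limit chosen by analysis.
    Raises OverflowError instead of producing a bogus count when the value
    falls outside that limit.
    """
    lsb = full_scale / 32768.0      # engineering units per count
    count = round(value / lsb)
    if not INT16_MIN <= count <= INT16_MAX:
        raise OverflowError(
            f"{value} does not fit in 16 bits with full scale {full_scale}"
        )
    return count
```

With a full scale of 100.0, a value of 50.0 maps to 16384 counts, while
250.0 is rejected. A full scale that suits one vehicle's trajectory can
be too small for another's, and nothing in the conversion itself tells
you that.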

> 2.  It was assumed that the value would not be large enough
>     to overflow ; therefore, it was not checked; and

Yes - clearly the analysis done here was not adequately revisited for
the new environment.

> 3.  The folly that a floating-point value of some
>     58 significant bits could be converted "safely" to
>     16 bits.

This is actually quite routine for IRS to flight control interfaces.
The IRS usually has to do high-precision calculations internally,
but the flight control system does not need this precision. In and of
itself, there's no problem dumping the extra bits of precision, so long
as the range is preserved (which, in this case, it wasn't).
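
To put a number on "dumping the extra bits": inside the scaled range, the
error from rounding to a 16-bit count is never more than half of one
count. A small Python sketch, with a made-up full scale of 100.0 and a
hypothetical helper name:

```python
def quantization_error(value, full_scale):
    """Return (absolute error, lsb) after rounding value to a 16-bit count.

    Assumes value lies inside [-full_scale, +full_scale); full_scale is an
    illustrative figure, not one taken from the report.
    """
    lsb = full_scale / 32768.0          # engineering units per count
    count = round(value / lsb)          # nearest representable count
    return abs(count * lsb - value), lsb


err, lsb = quantization_error(12.3456789, 100.0)
within_half_lsb = err <= lsb / 2        # holds for any in-range value
```

So the lost precision is bounded and predictable. The bound says nothing
about a value outside the range, which is the case that mattered here.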

>    An error-handler for overflow should have been included,
> but should have returned control directly to the program
> (this only as an emergency resort).  The code should have
> included a check for data out of range (or better, storage
> of adequate size.)

I agree with the second part. However, it's not clear that returning
to the program would have helped. This is one area in which I think
the final report is too optimistic. It suggests that the correct
response to the exception was to provide the "best data available."
That might be possible, but in general it's a tricky business.

I wonder more why the IRS message to the flight controls did not
include (a) an indication as to IRS mode (alignment, in this case) and
(b) an indication that the IRS had detected a data error.
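
Something along these lines, sketched in Python. The message layout,
field names and mode values are all hypothetical; I don't know what the
real SRI message looks like.

```python
from dataclasses import dataclass
from enum import Enum


class IrsMode(Enum):
    ALIGNMENT = 0
    OPERATIONAL = 1


@dataclass(frozen=True)
class IrsMessage:
    mode: IrsMode           # (a) which mode the IRS is in
    data_valid: bool        # (b) False if the IRS detected a data error
    attitude_counts: int    # scaled 16-bit payload


def usable_attitude(msg):
    """Return the payload only if the sender says it can be trusted."""
    if msg.mode is IrsMode.OPERATIONAL and msg.data_valid:
        return msg.attitude_counts
    return None             # caller must decide what to fly on instead
```

This only lets the flight controls know not to believe the data. What
they should do when they get None is still the hard system-design
question.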

>    This project might well have been written in PL/I, which
> has excellent real-time facilities, including error
> handling, error simulation and validation facilities.
> The language has robust compilers, and experts with many
> years of PL/I programming experience.
>
>    As to PL/I facilities, I refer to the SIGNAL statement,
> with which given conditions (errors such as fixed-point
> overflow) can be signalled as if the condition (error)
> actually occurred.

The language in which it was actually written (Ada) has equivalent
facilities, so I'm not sure how PL/I would have helped here. Having
programmed in both PL/I and Ada, I can't think of anything specific
in this area. As noted in the final report, this was a system requirements
and design error, not a programming error.

>    This alone would have showed up the deficiency of the
> overall design (that the system would shut itself down for
> fixed-point overflow).

It might have shown that this was the result, but as pointed out in the
report, the system designers knew this could happen, and discounted it
as improbable. So, I'm not sure it would have made a difference.

>    Further, an ON unit can return control simply and easily
> to some re-start point, or another convenient point in the
> program, or even pass control to the following statement.

Again, it's not clear to me that any of these options would have saved
the vehicle. An IRS in alignment mode simply cannot generate valid data,
period. Furthermore, since the designers felt this error couldn't really
occur, it's unlikely they would have made the right choice as to what to
do when the SIGNAL was raised.

>         >With Definitely good "lessons learned" about:
>
>         >1. The limits of exceptions (they are only as good as what you can do
>         >when they are raised).
>
> ---There's a lot you can do with an exception.  One of
> them isn't to shut down the computer.  I've already itemized
> what can be done with an exception.  But in this case,
> the proper course is to ensure that values are within
> range and to take appropriate action, rather than
> to let it get as far as the error handler, which should
> be a last resort for catching something overlooked (and
> hopefully, there's none of those).

Sure. Now, what was the appropriate action here? Switch from alignment
to operational mode? The whole point of alignment mode is to make the
values computed during operational mode accurate. If it doesn't finish,
the results are suspect.

Particularly for feedback systems, having lots of options at the language
level does not equate to having an adequate solution to the existence of
an error. That's a system design problem, not a language issue.

> ---No, this is a clear programming error.  A PL/I programmer
> experienced with real time systems, would have CHALLENGED
> such a stupid requirement that the computer be shut down by the
> error-handler in the event of a fixed-point overflow.  He would
> have had it changed.

To what?

>    I'd go further to say that no experienced PL/I programmer
> would have shut down the system as a result of a fixed-point
> overflow.

What would he have done?

>    Furthermore, he would have included a check that the value
> did not go out of range;
>
>    Skills in PL/I and real time systems would not have gone
> astray here.  And probably skills in Ada too.

I will agree that experience in real time systems (which these
folks had, by the way) is definitely a useful thing here. However,
these guys simply came up with the wrong answer in their analysis.
It wasn't that they didn't realize there was a risk - the report
clearly states that the issue was discussed and the resolution
approved at several levels of management. They weren't ignorant,
just wrong. There is a difference.

>
> ___________________________________________________________
>
> Extract from full report:
>
> "  * The internal SRI software exception was caused during execution of a
>      data conversion from 64-bit floating point to 16-bit signed integer
>      value. The floating point number which was converted had a value
>      greater than what could be represented by a 16-bit signed integer.
>      This resulted in an Operand Error. The data conversion instructions
>      (in Ada code) were not protected from causing an Operand Error,
>      although other conversions of comparable variables in the same place
>      in the code were protected."

Tell me, if you read in the paper that a drunk driver was speeding, and killed
someone, do you blame the auto manufacturer for providing a vehicle that could
go fast enough to kill someone?

Read the report. Its conclusions are definitely at odds with yours.

--
LMTAS - "Our Brand Means Quality"



