comp.lang.ada
* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
@ 1996-07-24  0:00   ` Byron B. Kauffman
  1996-07-24  0:00     ` Stephen D. House
  1996-07-25  0:00     ` Theodore E. Dennison
  1996-07-25  0:00   ` Alan Brain
                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: Byron B. Kauffman @ 1996-07-24  0:00 UTC (permalink / raw)



From the Summary Report:

"...The same requirement does not apply to Ariane 5, which has a
different preparation sequence and it was maintained for commonality
reasons, presumably based on the view that, unless proven necessary, it 
was not wise to make changes in software which worked well on Ariane 
4..."

Is anyone else sick and tired of the COTS argument for getting rid of
the Ada mandate? Of course, the C crowd will want to blame this
scenario on Ada, but it appears to me that the same results would have
occurred no matter what language the original software was written in.
Does anyone but me think that COTS was a bad idea conjured up by a
hardware guy stuck in an office somewhere in Dayton who thinks he
knows something about software?

I guess we're going to have to trade in our 'software engineering' hats
for 'software scavenger/cut-and-paste/rewrite-the-interfaces' hats
(otherwise known as HACKING). 

Just my opinion, of course.





* Re: Ariane 5 Failure - Summary Report
       [not found] <31F60E8A.2D74@lmtas.lmco.com>
@ 1996-07-24  0:00 ` Ken Garlington
  1996-07-24  0:00   ` Byron B. Kauffman
                     ` (6 more replies)
  0 siblings, 7 replies; 18+ messages in thread
From: Ken Garlington @ 1996-07-24  0:00 UTC (permalink / raw)



Ken Garlington wrote: <nothing!>

Don't know what happened there, but I was just going to point out
that the Ariane 5 report is at:

  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html

Be sure to read the full report, which is linked to this page. It
goes into some length about the sequence of events (which includes
an Ada exception I never heard of before, Operand Error? Maybe it's user 
defined, or there's a language difference at work).

Definitely good "lessons learned" about:

1. The limits of exceptions (they are only as good as what you can do
when they are raised).

2. The problems with reusing items outside their original environment.

3. The need to check inputs and outputs aggressively.

4. The pitfalls of assuming that testing all of the components of a 
system equates to testing the system, as well as the need to use 
realistic test scenarios.

5. The problems with isolating the safety-critical components of a 
system.

So, anyway, we now have another software package written in Ada that
caused the loss of a system, and again specification and design issues 
outside Ada's control are the culprit. 

-- 
LMTAS - "Our Brand Means Quality"





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00   ` Byron B. Kauffman
@ 1996-07-24  0:00     ` Stephen D. House
  1996-07-25  0:00     ` Theodore E. Dennison
  1 sibling, 0 replies; 18+ messages in thread
From: Stephen D. House @ 1996-07-24  0:00 UTC (permalink / raw)



Byron B. Kauffman wrote:
> I guess we're going to have to trade in our 'software engineering' hats
> for 'software scavenger/cut-and-paste/rewrite-the-interfaces' hats
> (otherwise known as HACKING).
> 
> Just my opinion, of course.

For a rocket, you might be right.  BUT...

One of the advantages of "visual" programming languages is that they are
languages which tie together components.  The software crisis will not
be reduced until productivity goes up.  Productivity isn't the number of
lines of code you can write a month, it's how much functionality you can
give your customer per month.  Unless software houses start building
products by putting together subsystems instead of subprograms, no gains
will be made.

COTS is one way.  In-house components (ones which are understood, domain
specific, consistent with other components, etc.) are better solutions.
I don't think that companies are doing enough with reuse.  They'd
rather buy a magic bullet from somebody else than dig through their own
attic for something.





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00   ` Byron B. Kauffman
  1996-07-24  0:00     ` Stephen D. House
@ 1996-07-25  0:00     ` Theodore E. Dennison
  1 sibling, 0 replies; 18+ messages in thread
From: Theodore E. Dennison @ 1996-07-25  0:00 UTC (permalink / raw)



Byron B. Kauffman wrote:
> 
> the Ada mandate? Of course, the C crowd will want to blame this
> scenario on Ada, but it appears to me that the same results would have
> occurred no matter what language the original software was written in.

Not true. In C, the operation that performed the type conversion (cast)
would have corrupted nearby memory locations and continued running with
faulty data (or possibly instructions). The resulting error would have
occurred at a seemingly random place with seemingly random frequency, and
the commission studying the failure would never have been able to isolate
it.

-- 
T.E.D.          
                |  Work - mailto:dennison@escmail.orl.mmc.com  |
                |  Home - mailto:dennison@iag.net              |
                |  URL  - http://www.iag.net/~dennison         |





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
                     ` (3 preceding siblings ...)
  1996-07-25  0:00   ` ++           robin
@ 1996-07-25  0:00   ` Dale Stanbrough
  1996-07-26  0:00     ` OS2 User
  1996-07-26  0:00   ` Con Bradley
  1996-08-01  0:00   ` root
  6 siblings, 1 reply; 18+ messages in thread
From: Dale Stanbrough @ 1996-07-25  0:00 UTC (permalink / raw)



Ken Garlington writes:

"1. The limits of exceptions (they are only as good as what you can do
 when they are raised)."
 

Now I know what is meant when people say "exceptions can be expensive" :-).

Dale





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
  1996-07-24  0:00   ` Byron B. Kauffman
  1996-07-25  0:00   ` Alan Brain
@ 1996-07-25  0:00   ` ++           robin
  1996-07-26  0:00     ` Ken Garlington
  1996-07-26  0:00     ` ++           robin
  1996-07-25  0:00   ` ++           robin
                     ` (3 subsequent siblings)
  6 siblings, 2 replies; 18+ messages in thread
From: ++           robin @ 1996-07-25  0:00 UTC (permalink / raw)



        Ken Garlington <garlingtonke@lmtas.lmco.com> writes:

        >Ken Garlington wrote: <nothing!>

        >Don't know what happened there, but I was just going to point out
        >that the Ariane 5 report is at:

        >  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html

        >Be sure to read the full report, which is linked to this page. It
        >goes into some length about the sequence of events (which includes
        >an Ada exception I never heard of before, Operand Error?

---That's fixed-point overflow.  Converting a 64-bit
floating-point value to a 16-bit signed integer.
The conversion was unchecked (programming error--
the other conversions in the same module were
checked; the assumption was made that the value would
be within range); consequently the error condition was raised.
The exception-handling routine was to record the
status of the error and to then shut down the system.

         Maybe it's user
        >defined, or there's a language difference at work).

        >Definitely good "lessons learned" about:

        >1. The limits of exceptions (they are only as good as what you can do
        >when they are raised).

        >2. The problems with reusing items outside their original environment.

        >3. The need to check inputs and outputs aggressively.

        >4. The pitfalls of assuming that testing all of the components of a
        >system equates to testing the system, as well as the need to use
        >realistic test scenarios.

        >5. The problems with isolating the safety-critical components of a
        >system.

        >So, anyway, we now have another software package written in Ada that
        >caused the loss of a system, and again specification and design issues
        >outside Ada's control are the culprit.





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
                     ` (2 preceding siblings ...)
  1996-07-25  0:00   ` ++           robin
@ 1996-07-25  0:00   ` ++           robin
  1996-07-25  0:00   ` Dale Stanbrough
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 18+ messages in thread
From: ++           robin @ 1996-07-25  0:00 UTC (permalink / raw)



	Ken Garlington <garlingtonke@lmtas.lmco.com> writes:

	>Ken Garlington wrote: <nothing!>

	>Don't know what happened there, but I was just going to point out
	>that the Ariane 5 report is at:

	>  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html

	>Be sure to read the full report, which is linked to this page. It
	>goes into some length about the sequence of events (which includes
	>an Ada exception I never heard of before, Operand Error?

---That's fixed-point overflow.  Converting a 64-bit 
floating-point value to a 16-bit signed integer.
The conversion was unchecked (programming error--
the other conversions in the same module were
checked; the assumption was made that the value would
be within range); consequently the error condition was raised.
The exception-handling routine was to record the
status of the error and to then shut down the system.

	 Maybe it's user 
	>defined, or there's a language difference at work).

	>Definitely good "lessons learned" about:

	>1. The limits of exceptions (they are only as good as what you can do
	>when they are raised).

	>2. The problems with reusing items outside their original environment.

	>3. The need to check inputs and outputs aggressively.

	>4. The pitfalls of assuming that testing all of the components of a 
	>system equates to testing the system, as well as the need to use 
	>realistic test scenarios.

	>5. The problems with isolating the safety-critical components of a 
	>system.

	>So, anyway, we now have another software package written in Ada that
	>caused the loss of a system, and again specification and design issues 
	>outside Ada's control are the culprit. 





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
  1996-07-24  0:00   ` Byron B. Kauffman
@ 1996-07-25  0:00   ` Alan Brain
  1996-07-29  0:00     ` Ken Garlington
  1996-07-25  0:00   ` ++           robin
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 18+ messages in thread
From: Alan Brain @ 1996-07-25  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@lmtas.lmco.com> wrote:

>So, anyway, we now have another software package written in Ada that
>caused the loss of a system, and again specification and design issues 
>outside Ada's control are the culprit. 

Not just design and specification, the implementation as well.

Firstly, the brain-dead attitude of "handle all exceptions by shutting down and 
going to the backup" on a complex piece of equipment without many, many redundancies 
is ... incredible. Only duplication? Glad I'm not riding it... So that's a 
Specification fault.

Secondly, the notion that conversion from a 64-bit value to a 16-bit value will
always be OK, and that any time it isn't means a total failure of the unit, is a bit 
hard to swallow. In a complex piece of software, incapable of strict mathematical 
verification, I'd expect this to happen sometimes, not because of any soft failure 
or random hardware failure, but because Software Has Bugs. That's no excuse for 
losing a payload! This is a design fault.

Thirdly, assuming either of the above, not checking an arithmetic operation of
this kind before it's fully complete is just plain silly. And such a check is a
piece of cake. This is an implementation fault.

Jeez, Ada provides safety belts, anti-lock brakes, etc., but if people don't buckle
up, and don't even bother to use the brake pedal, what can you do?








* Re: Ariane 5 Failure - Summary Report
  1996-07-25  0:00   ` ++           robin
@ 1996-07-26  0:00     ` Ken Garlington
  1996-07-30  0:00       ` Theodore E. Dennison
  1996-07-26  0:00     ` ++           robin
  1 sibling, 1 reply; 18+ messages in thread
From: Ken Garlington @ 1996-07-26  0:00 UTC (permalink / raw)



++ robin wrote:
>
>         >Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
>
>         >Don't know what happened there, but I was just going to point out
>         >that the Ariane 5 report is at:
>
>         >  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html
>
>         >Be sure to read the full report, which is linked to this page. It
>         >goes into some length about the sequence of events (which includes
>         >an Ada exception I never heard of before, Operand Error?
>
> ---That's fixed-point overflow.

Could you send me an Ada RM cite? I couldn't find it...

> Converting a 64-bit
> floating-point value to a 16-bit signed integer.
> The conversion was unchecked (programming error--

I don't know if I would call this a programming error or
a requirements error. Apparently, there was an analysis done
to see if the check should be required, and the analysis said
that it wasn't. Given the 80% utilization, I'm sure there was
some not-so-subtle pressure to leave out any code that wasn't
absolutely necessary.

> 1.  The size of the variable to hold the value (16 bits) was
>     inadequate;

Not that this was mentioned in the report, but the communications
link between the INS and the flight control computer uses a
MIL-STD-1553B data bus, which is a 16-bit protocol. They could
have used multiple words to contain this value, but it is common
for 1553B applications to convert floating point numbers to scaled
16-bit values, so long as the precision is still acceptable. What
apparently happened was that the scaling for the Ariane 4 was acceptable,
but was not updated for the Ariane 5 based on an analysis that said the
ranges should be maintained.
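
The kind of scaling described here can be sketched roughly as follows. This
is only an illustration in Python, not the flight code: the function names
encode_scaled and decode_scaled and the full-scale value used in the usage
note are assumptions invented for the example. A physical value is mapped
onto a signed 16-bit word using a fixed full-scale range, and anything
outside that range is rejected instead of being sent.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def encode_scaled(value, full_scale):
    """Map a value in [-full_scale, +full_scale) onto a signed 16-bit word.

    Raises OverflowError when the scaled value does not fit in 16 bits.
    (Hypothetical helper, for illustration only.)
    """
    lsb = full_scale / 32768.0       # physical units per count
    counts = round(value / lsb)      # precision below one LSB is dropped
    if not INT16_MIN <= counts <= INT16_MAX:
        raise OverflowError(f"{value} exceeds full scale {full_scale}")
    return counts


def decode_scaled(counts, full_scale):
    """Recover the (reduced-precision) physical value from a 16-bit word."""
    return counts * (full_scale / 32768.0)
```

With an assumed full scale of 100.0, encode_scaled(50.0, 100.0) gives 16384,
while a value of 150.0 no longer fits and raises OverflowError. Losing
precision is harmless here; a range chosen for one vehicle and reused for
another is what breaks.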

> 2.  It was assumed that the value would not be large enough
>     to overflow ; therefore, it was not checked; and

Yes - clearly the analysis done here was not adequately revisited for
the new environment.

> 3.  The folly that a floating-point value of some
>     58 significant bits could be converted "safely" to
>     16 bits.

This is actually quite routine for IRS to flight control interfaces.
The IRS usually has to do high-precision calculations internally,
but the flight control system does not need this precision. In and of
itself, there's no problem dumping the extra bits of precision, so long
as the range is preserved (which, in this case, it wasn't).

>    An error-handler for overflow should have been included,
> but should have returned control directly to the program
> (this only as an emergency resort).  The code should have
> included a check for data out of range (or better, storage
> of adequate size.)

I agree with the second part. However, it's not clear that returning
to the program would have helped. This is one area in which I think
the final report is too optimistic. It suggests that the correct
response to the exception was to provide the "best data available."
That might be possible, but in general it's a tricky business.

I wonder more why the IRS message to the flight controls did not
include (a) an indication as to IRS mode (alignment, in this case) and
(b) an indication that the IRS had detected a data error.
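
As a rough picture of what such a message could look like (a sketch only:
the field names, the Mode values and the usable_attitude function are
hypothetical and are not taken from the report or from any real IRS
interface):

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ALIGNMENT = 0
    OPERATIONAL = 1


@dataclass(frozen=True)
class IrsMessage:
    mode: Mode            # (a) what the IRS is currently doing
    data_valid: bool      # (b) False when the IRS has detected a data error
    attitude_counts: int  # payload, e.g. a scaled 16-bit word


def usable_attitude(msg):
    """Return the payload only if the receiver may trust it, else None."""
    if msg.mode is not Mode.OPERATIONAL or not msg.data_valid:
        return None
    return msg.attitude_counts
```

A receiver written this way has to decide explicitly what to do when it gets
None, instead of treating whatever arrives as flight data.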

>    This project might well have been written in PL/I, which
> has excellent real-time facilities, including error
> handling, error simulation and validation facilities.
> The language has robust compilers, and experts with many
> years of PL/I programming experience.
>
>    As to PL/I facilities, I refer to the SIGNAL statement,
> with which given conditions (errors such as fixed-point
> overflow) can be signalled as if the condition (error)
> actually occurred.

The language in which it was actually written (Ada) has equivalent
facilities, so I'm not sure how PL/I would have helped here. Having
programmed in both PL/I and Ada, I can't think of anything specific
in this area. As noted in the final report, this was a system requirements
and design error, not a programming error.

>    This alone would have shown up the deficiency of the
> overall design (that the system would shut itself down for
> fixed-point overflow).

It might have shown that this was the result, but as pointed out in the
report, the system designers knew this could happen, and discounted it
as improbable. So, I'm not sure it would have made a difference.

>    Further, an ON unit can return control simply and easily
> to some re-start point, or another convenient point in the
> program, or even pass control to the following statement.

Again, it's not clear to me that any of these options would have saved
the vehicle. An IRS in alignment mode simply cannot generate valid data,
period. Furthermore, since the designers felt this error couldn't really
occur, it's unlikely they would have made the right choice as to what to
do when the SIGNAL was raised.

>         >Definitely good "lessons learned" about:
>
>         >1. The limits of exceptions (they are only as good as what you can do
>         >when they are raised).
>
> ---There's a lot you can do with an exception.  One of
> them isn't to shut down the computer.  I've already itemized
> what can be done with an exception.  But in this case,
> the proper course is to ensure that values are within
> range and to take appropriate action, rather than
> to let it get as far as the error handler, which should
> be a last resort for catching something overlooked (and
> hopefully, there's none of those).

Sure. Now, what was the appropriate action here? Switch from alignment
to operational mode? The whole point of alignment mode is to make the
values computed during operational mode accurate. If it doesn't finish,
the results are suspect.

Particularly for feedback systems, having lots of options at the language
level does not equate to having an adequate solution to the existence of
an error. That's a system design problem, not a language issue.

> ---No, this is a clear programming error.  A PL/I programmer
> experienced with real time systems, would have CHALLENGED
> such a stupid requirement that the computer be shut down by the
> error-handler in the event of a fixed-point overflow.  He would
> have had it changed.

To what?

>    I'd go further to say that no experienced PL/I programmer
> would have shut down the system as a result of a fixed-point
> overflow.

What would he have done?

>    Furthermore, he would have included a check that the value
> did not go out of range;
>
>    Skills in PL/I and real time systems would not have gone
> astray here.  And probably skills in Ada too.

I will agree that experience in real time systems (which these
folks had, by the way) is definitely a useful thing here. However,
these guys simply came up with the wrong answer in their analysis.
It wasn't that they didn't realize there was a risk - the report
clearly states that the issue was discussed and the resolution
approved at several levels of management. They weren't ignorant,
just wrong. There is a difference.

>
> ___________________________________________________________
>
> Extract from full report:
>
> "  * The internal SRI software exception was caused during execution of a
>      data conversion from 64-bit floating point to 16-bit signed integer
>      value. The floating point number which was converted had a value
>      greater than what could be represented by a 16-bit signed integer.
>      This resulted in an Operand Error. The data conversion instructions
>      (in Ada code) were not protected from causing an Operand Error,
>      although other conversions of comparable variables in the same place
>      in the code were protected."

Tell me, if you read in the paper that a drunk driver was speeding, and killed
someone, do you blame the auto manufacturer for providing a vehicle that could
go fast enough to kill someone?

Read the report. Its conclusions are definitely at odds with yours.

--
LMTAS - "Our Brand Means Quality"





* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
                     ` (4 preceding siblings ...)
  1996-07-25  0:00   ` Dale Stanbrough
@ 1996-07-26  0:00   ` Con Bradley
  1996-07-26  0:00     ` Peter Hermann
  1996-07-26  0:00     ` P. Cnudde VH14 (8218)
  1996-08-01  0:00   ` root
  6 siblings, 2 replies; 18+ messages in thread
From: Con Bradley @ 1996-07-26  0:00 UTC (permalink / raw)



I have read the report on the Ariane 5 failure and feel that
somebody should congratulate ESA for their remarkable candour
in making this report so widely available. 

It is a pity that other organizations are not so willing to
go public on their mistakes.


------------------------------------------------------------------
Con Bradley                     "A pint of plain is your only man"
SGS Thomson Microelectronics Limited
10 Priory Road, Clifton
Bristol, BS8 1TU
e-mail: ceb@bristol.st.com
------------------------------------------------------------------






* Re: Ariane 5 Failure - Summary Report
  1996-07-26  0:00   ` Con Bradley
@ 1996-07-26  0:00     ` Peter Hermann
  1996-07-26  0:00     ` P. Cnudde VH14 (8218)
  1 sibling, 0 replies; 18+ messages in thread
From: Peter Hermann @ 1996-07-26  0:00 UTC (permalink / raw)



Con Bradley (ceb@pact.srf.ac.uk) wrote:
: I have read the report on the Ariane 5 failure and feel that
: somebody should congratulate ESA for their remarkable candour
: in making this report so widely available. 

agreed

: It is a pity that other organizations are not so willing to
: go public on their mistakes.

Politicians could learn a lot: because they obscure many facts,
they are mistrusted as a consequence.

--
Peter Hermann  Tel:+49-711-685-3611 Fax:3758 ph@csv.ica.uni-stuttgart.de
Pfaffenwaldring 27, 70569 Stuttgart Uni Computeranwendungen
Team Ada: "C'mon people let the world begin" (Paul McCartney)





* Re: Ariane 5 Failure - Summary Report
  1996-07-25  0:00   ` ++           robin
  1996-07-26  0:00     ` Ken Garlington
@ 1996-07-26  0:00     ` ++           robin
  1 sibling, 0 replies; 18+ messages in thread
From: ++           robin @ 1996-07-26  0:00 UTC (permalink / raw)



	>Ken Garlington <garlingtonke@lmtas.lmco.com> writes:

	>Don't know what happened there, but I was just going to point out
	>that the Ariane 5 report is at:

	>  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html

	>Be sure to read the full report, which is linked to this page. It
	>goes into some length about the sequence of events (which includes
	>an Ada exception I never heard of before, Operand Error?

---That's fixed-point overflow.  Converting a 64-bit
floating-point value to a 16-bit signed integer.
The conversion was unchecked (programming error--
other conversions in the same module were
checked; the assumption was made that the value would
be within range); consequently the error condition was raised.
The exception-handling routine was to record the
status of the error and to then shut down the system.

	>Maybe it's user
	>defined, or there's a language difference at work).

---A user-defined data conversion that went unchecked.  Three
programming mistakes were made here:

1.  The size of the variable to hold the value (16 bits) was
    inadequate; and

2.  It was assumed that the value would not be large enough
    to overflow; therefore, it was not checked; and

3.  The folly that a floating-point value of some
    58 significant bits could be converted "safely" to
    16 bits.

  The problem then went to the error-handler, which was
designed to shut down the system.  This was a major
blunder.

   An error-handler for overflow should have been included,
but should have returned control directly to the program
(this only as an emergency resort).  The code should have
included a check for data out of range (or better, storage
of adequate size.)
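
A minimal sketch of that arrangement, in Python rather than Ada or PL/I. The
names to_int16_checked and to_int16_or_saturate are hypothetical, and
clamping to the 16-bit limits is just one possible emergency response chosen
for the example; whether a clamped value is safe to use is a system-level
question that the code cannot answer.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(x):
    """Convert a float to a 16-bit signed integer, raising on overflow."""
    n = int(x)  # truncates toward zero; int() itself raises for infinities
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in 16 bits")
    return n


def to_int16_or_saturate(x, error_log):
    """Last-resort handler: record the error, clamp, and return control.

    NaN is not handled in this sketch (int() raises ValueError for it).
    """
    try:
        return to_int16_checked(x)
    except OverflowError as exc:
        error_log.append(str(exc))  # keep a record for later analysis
        return INT16_MAX if x > 0 else INT16_MIN
```

The caller keeps running after an overflow and can inspect error_log to see
that the returned value was clamped.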

   This project might well have been written in PL/I, which
has excellent real-time facilities, including error
handling, error simulation and validation facilities.
The language has robust compilers, and experts with many
years of PL/I programming experience.

   As to PL/I facilities, I refer to the SIGNAL statement,
with which given conditions (errors such as fixed-point
overflow) can be signalled as if the condition (error)
actually occurred.

   This alone would have shown up the deficiency of the
overall design (that the system would shut itself down for 
fixed-point overflow).

   Further, an ON unit can return control simply and easily
to some re-start point, or another convenient point in the
program, or even pass control to the following statement.

        >Definitely good "lessons learned" about:

	>1. The limits of exceptions (they are only as good as what you can do
	>when they are raised).

---There's a lot you can do with an exception.  One of
them isn't to shut down the computer.  I've already itemized
what can be done with an exception.  But in this case,
the proper course is to ensure that values are within
range and to take appropriate action, rather than
to let it get as far as the error handler, which should
be a last resort for catching something overlooked (and
hopefully, there's none of those).

	>2. The problems with reusing items outside their original environment.

	>3. The need to check inputs and outputs aggressively.

	>4. The pitfalls of assuming that testing all of the components of a
	>system equates to testing the system, as well as the need to use
	>realistic test scenarios.

	>5. The problems with isolating the safety-critical components of a
	>system.

	>So, anyway, we now have another software package written in Ada that
	>caused the loss of a system, and again specification and design issues
	>outside Ada's control are the culprit.

---No, this is a clear programming error.  A PL/I programmer
experienced with real-time systems would have CHALLENGED
such a stupid requirement that the computer be shut down by the
error-handler in the event of a fixed-point overflow.  He would
have had it changed.

   I'd go further to say that no experienced PL/I programmer
would have shut down the system as a result of a fixed-point
overflow.

   Furthermore, he would have included a check that the value
did not go out of range;

   Skills in PL/I and real time systems would not have gone
astray here.  And probably skills in Ada too.

___________________________________________________________

Extract from full report:

"  * The internal SRI software exception was caused during execution of a
     data conversion from 64-bit floating point to 16-bit signed integer
     value. The floating point number which was converted had a value
     greater than what could be represented by a 16-bit signed integer.
     This resulted in an Operand Error. The data conversion instructions
     (in Ada code) were not protected from causing an Operand Error,
     although other conversions of comparable variables in the same place
     in the code were protected."





* Re: Ariane 5 Failure - Summary Report
  1996-07-26  0:00   ` Con Bradley
  1996-07-26  0:00     ` Peter Hermann
@ 1996-07-26  0:00     ` P. Cnudde VH14 (8218)
  1 sibling, 0 replies; 18+ messages in thread
From: P. Cnudde VH14 (8218) @ 1996-07-26  0:00 UTC (permalink / raw)



Con Bradley wrote:
> 
> I have read the report on the Ariane 5 failure and feel that
> somebody should congratulate ESA for their remarkable candour
> in making this report so widely available.

I agree fully with you on this point, but you should not forget
that they are using our money (our = European taxpayers) to finance
their projects, so they owe something to the public.

I read the report and I found it most interesting reading. My personal conclusion
is that no matter what effort you put into a system, it can always go wrong.
(I don't say you should not put all possible effort into it; otherwise it
will certainly go wrong.)

> 
> It is a pity that other organizations are not so willing to
> go public on their mistakes.

It's even worse: people get fired when they try to make mistakes public.

> SGS Thomson Microelectronics Limited
A Microelectronics colleague in comp.lang.ada, interesting!

-- 


   ____________          Peter Cnudde
   \          /          Alcatel Telecom
    \ ALCATEL/           Switching Systems Division 
     \ BELL /            Microelectronics Design Center
      \    /             
       \  /              F. Wellesplein 1, B-2018 Antwerp
        \/                                        BELGIUM
                         e-mail  : cnuddep@sh.bel.alcatel.be
                         Phone   : +32 3 240 82 18
                         Fax     : +32 3 240 99 47





* Re: Ariane 5 Failure - Summary Report
  1996-07-25  0:00   ` Dale Stanbrough
@ 1996-07-26  0:00     ` OS2 User
  0 siblings, 0 replies; 18+ messages in thread
From: OS2 User @ 1996-07-26  0:00 UTC (permalink / raw)



Dale Stanbrough (dale@goanna.cs.rmit.edu.au) wrote:
: Ken Garlington writes:

: "1. The limits of exceptions (they are only as good as what you can do
:  when they are raised)."
:  

: Now I know what is meant when people say "exceptions can be expensive" :-).

 In fact, the most expensive thing is not the exception but the
 misunderstanding of it.
 For me it is the /ostrich complex/:
 if you do not see it, you imagine it does not exist. (That is not true!)

 An exception is a reporting structure. If you specify how it is raised for
 each component, you can:
 - use it for control (see sentence (1));
 - avoid it with checks (you must be 101% reliable);
 - ignore it, kill the process, crash the rocket :-(



: Dale

--

					Christophe Faure

	 _______________________________________________________
	|	               *	   # #      ###    #    |
	|  Christophe FAURE     *         # # #  @  #  @  # @   |
	|  Tel :+33()55.45.72.32 *       #     #  ###    #      | 
	|  Fax :+33()55.45.73.15  *                             |
	|        		   *     Laboratoire M.S.I    	|
	|  e-mail : Faure@unilim.fr *    Universite de Limoges	|
	|			     *   F-87060 LIMOGES	|
	|_______________________________________________________|







* Re: Ariane 5 Failure - Summary Report
  1996-07-25  0:00   ` Alan Brain
@ 1996-07-29  0:00     ` Ken Garlington
  1996-07-30  0:00       ` John McCabe
  0 siblings, 1 reply; 18+ messages in thread
From: Ken Garlington @ 1996-07-29  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> Thirdly, assuming either of the above, not checking an arithmetic operation of
> this kind before it's fully complete is just plain silly. And such a check is a
> piece of cake. This is an implementation fault.

It's a question of perception. If a system designer says, "Don't add this check," and
I as an implementer don't add this check (possibly only after asking the designer,
"Are you _sure_?"), is this a design or an implementation fault?

It appears to me, from reading the report, that the lack of a check was an intentional
_design_ decision, not just something that was required but inadvertently left out of
the code. I consider this a design fault (if not a specification fault).

In the final analysis, you could call all of this an implementation error (since the
implementation is the only part of the process that was actually on the system), but
to me that seems to be a poor way to understand the chain of events.

> Jeez, Ada provides safety belts, Anti-lock brakes, etc but if people don't buckle
> up, and don't even bother to use the brake pedal, what can you do?

Certainly, if people don't buckle up, you don't blame the car implementer 
(manufacturer)!

-- 
LMTAS - "Our Brand Means Quality"





* Re: Ariane 5 Failure - Summary Report
  1996-07-26  0:00     ` Ken Garlington
@ 1996-07-30  0:00       ` Theodore E. Dennison
  0 siblings, 0 replies; 18+ messages in thread
From: Theodore E. Dennison @ 1996-07-30  0:00 UTC (permalink / raw)



Ken Garlington wrote:
> 
> ++ robin wrote:
> 
> >    An error-handler for overflow should have been included,
> > but should have returned control directly to the program
> > (this only as an emergency resort).  The code should have
> > included a check for data out of range (or better, storage
> > of adequate size.)
> 
> I agree with the second part. However, it's not clear that returning
> to the program would have helped. This is one area in which I think
> the final report is too optimistic. It suggests that the correct
> response to the exception was to provide the "best data available."
> That might be possible, but in general it's a tricky business.
> 

I'm glad to see that I wasn't the only one bothered by that suggestion.
The thought of a missile cruising around randomly based on "best data
available" from a faulty computer is frankly a little scary.

Perhaps the commission is hoping that the next missile blows up EuroDisney.
:-)

-- 
T.E.D.          
                |  Work - mailto:dennison@escmail.orl.mmc.com  |
                |  Home - mailto:dennison@iag.net              |
                |  URL  - http://www.iag.net/~dennison         |





* Re: Ariane 5 Failure - Summary Report
  1996-07-29  0:00     ` Ken Garlington
@ 1996-07-30  0:00       ` John McCabe
  0 siblings, 0 replies; 18+ messages in thread
From: John McCabe @ 1996-07-30  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@lmtas.lmco.com> wrote:

>Alan Brain wrote:
>> 
>> Thirdly, assuming either of the above, not checking that an arithmetic operation of
>> this kind before it's fully complete is just plain silly. And such a check is un
>> morceau de gateaux. This is an implementation fault.

>It's a question of perception. If a system designer says, "Don't add this check," and
>I as an implementer don't add this check (possibly only after asking the designer,
>"Are you _sure_?"), is this a design or an implementation fault?

>It appears to me, from reading the report, that the lack of a check was an intentional
>_design_ decision, not just something that was required but inadvertently left out of
>the code. I consider this a design fault (if not a specification fault).

I agree entirely. The rest of this article is my response to a similar
comment in a similar thread, which I thought might interest people who
missed that thread because of a spelling error in its title (Adriane crash).

rav@goanna.cs.rmit.edu.au (++           robin) wrote:

>	john@assen.demon.co.uk (John McCabe) writes:

>	>JOINT ESA/CNES PRESS RELEASE N  33-96  -  Paris, 23 July 1996

>	>Ariane 501 - Presentation of Inquiry Board report

>	>-------------------------------------------------------------------

>	>Hope this is useful. So basically it _was_ a software fault

>---Is this a euphemism for a programming error?  because that's
>what it was -- a programming error.

Having read the report, I don't consider it to be a programming error,
it was a design and management error. It sounds like whoever designed
the system didn't pay enough attention to the requirements, and
whoever was managing it didn't pay enough attention to its conformance
to the requirements.

I don't think the overflow was due to a programming oversight: the
analyses had been done and a decision not to check that variable had
been made (*see additional note below). And since that variable should
not have been in use at that point, I don't think you can blame whoever
wrote that code.

>   The error was in assuming that a value would not overflow.
>The specific error was that a conversion of a double-precision
>floating-point value (~58 significant bits) to 15 significant
>bits caused fixed-point overflow.  The conversion was not
>checked for overflow.  It should have been.  This is, after all,
>a real-time system.  It's a fundamental check that a programmer
>experienced in real-time systems should have carried out.

>   Control was then passed to the interrupt handler, which
>shut down the system.
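
For readers who want to see what such a narrowing conversion involves,
here is a minimal sketch in Python (not Ada, and not the flight code).
It assumes the target is a 16-bit two's-complement signed integer; the
names wrap_to_int16 and checked_to_int16 are invented for illustration.
The first function shows what a narrowing with no check at all can
produce; the second shows the range check under discussion.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(value: float) -> int:
    """Unchecked narrowing: keep only the low 16 bits (two's complement)."""
    return (int(value) + 32768) % 65536 - 32768


def checked_to_int16(value: float) -> int:
    """Checked narrowing: reject anything outside the 16-bit signed range."""
    truncated = int(value)  # truncates toward zero
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} is outside {INT16_MIN}..{INT16_MAX}")
    return truncated
```

With these definitions, wrap_to_int16(40000.0) yields -25536, a
plausible-looking but wrong value, while checked_to_int16(40000.0)
raises OverflowError. What the caller should then do with that error
is the question the rest of this thread argues about.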

>   The question is, basically, why was Ada used for this work?

ESA Ada preference/mandate(?).

<..snip..>

*Note: I hope this makes ESA look a bit closer at why they want to
limit processor loading and how the margin should be reduced through
the design and development phases. My own project has an ESA enforced
limit of 70% which is quite ridiculous given the equipment we're using
(GPS MA31750 10MHz MIL-STD-1750 processor). We cannot meet that but
have requested a waiver on that - I believe that's much better than
compromising the safety of the mission.

ESA's loading margins are really supposed to take account of a
requirement for future modifications to software once it has been
delivered. There's no way this should have been enforced for Ariane 5.


From the sound of the report, I think a pretty poor job has been done,
not by the programmers who wrote the code and performed the analysis
of which variables could safely be left unchecked, but by whoever
performed the requirements analysis. They, and all levels of management
and reviewers above them, have been completely negligent.

Best Regards
John McCabe <john@assen.demon.co.uk>








* Re: Ariane 5 Failure - Summary Report
  1996-07-24  0:00 ` Ariane 5 Failure - Summary Report Ken Garlington
                     ` (5 preceding siblings ...)
  1996-07-26  0:00   ` Con Bradley
@ 1996-08-01  0:00   ` root
  6 siblings, 0 replies; 18+ messages in thread
From: root @ 1996-08-01  0:00 UTC (permalink / raw)



In article <838748001.3682.0@assen.demon.co.uk> john@assen.demon.co.uk (John McCabe) writes:
[SNIP]
   Having read the report, I don't consider it to be a programming error,
   it was a design and management error. It sounds like whoever designed
   the system didn't pay enough attention to the requirements, and
   whoever was managing it didn't pay enough attention to its conformance
   to the requirements.
[SNIP]

Agreed.

Electronics Weekly (UK freebie) put it as a "Mindset Error" (can't
remember exact phrase right now). I respect their judgement and I
think it about sums up the whole thing rather neatly.

Chris Morgan

chris.morgan@baesema.co.uk




