comp.lang.ada
* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
@ 1996-09-25  0:00         ` Ken Garlington
  1996-09-26  0:00         ` Byron Kauffman
  1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-09-25  0:00 UTC (permalink / raw)



A. Grant wrote:
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

When it comes to building embedded safety-critical systems, trust me:
He's a student!

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
       [not found] <agrapsDy4oJH.29G@netcom.com>
@ 1996-09-25  0:00 ` @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
                     ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: @@           robin @ 1996-09-25  0:00 UTC (permalink / raw)



	agraps@netcom.com (Amara Graps) writes:

	>I read the following message from my co-workers that I thought was
	>interesting. So I'm forwarding it to here.

	>(begin quote)
	>Ariane 5 failure was attributed to a faulty DOUBLE -> INT conversion
	>(as the proximate cause) in some ADA code in the inertial guidance
	>system.  Diagnostic error messages from the (faulty) inertial guidance
	>system software were interpreted by the steering system as valid data.

	>English text of the inquiry board's findings is at
	>  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
	>(end quote)

	>Amara Graps                         email: agraps@netcom.com
	>Computational Physics               vita:  finger agraps@best.com

There's a little more to it...

The unchecked data conversion in the Ada program resulted
in the shutdown of the computer. The backup computer had
already shut down a whisker of a second before. Consequently,
the on-board computer was unable to switch to the backup, and
used the error codes from the shut-down computer as
flight data.
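
To make the failure mode concrete, here is a minimal Ada sketch of an
unguarded 64-bit-float to 16-bit-integer conversion of the kind at issue;
the type names and the value are invented for illustration, not taken
from the actual SRI code:

   procedure Convert_Demo is
      type Horizontal_Bias is range -32_768 .. 32_767;  -- 16-bit signed
      Raw : Long_Float := 40_000.0;  -- hypothetical out-of-range reading
      BH  : Horizontal_Bias;
   begin
      BH := Horizontal_Bias (Raw);   -- raises Constraint_Error at run time
      --  Left unhandled, as on Ariane 5, the exception takes the whole
      --  computer down.
   end Convert_Demo;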

This is not the first time that such a programming error
(integer out of range) has occurred.

In 1981, the manned STS-2 was preparing to take off, but because
some fuel was accidentally spilt and some tiles were accidentally
dislodged, takeoff was delayed by a month.

During that time, the astronauts decided to get in some
more practice with the simulator.

During a simulated descent, the 4 computing systems (the main
and the 3 backups) got stuck in a loop, with complete
loss of control.

The cause?  An integer out of range -- the same class of problem
as with Ariane 5.

In the STS-2 case, the precise cause was a computed GOTO
with a bad index (similar to a CASE statement without
an OTHERWISE clause).

In both cases, the programming error could have been detected
with a simple test, but in both cases, no test was included.

One would have thought that, having had at least one failure
from an integer out of range, the implementors of the software
for Ariane 5 would have been extra careful to ensure that
all data conversions were within range -- since any kind
of interrupt would result in destruction of the spacecraft.

There's a case for a review of the programming language used.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
@ 1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00     ` Chris Morgan
  1996-09-25  0:00     ` Byron Kauffman
  1996-09-25  0:00   ` Bob Kitzberger
  1996-09-27  0:00   ` John McCabe
  2 siblings, 2 replies; 58+ messages in thread
From: Michel OLAGNON @ 1996-09-25  0:00 UTC (permalink / raw)



In article <52a572$9kk@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
>[reports of Ariane and STS-2 bugs deleted]
>
>
>In both cases, the programming error could have been detected
>with a simple test, but in both cases, no test was included.
>
>One would have thought that, having had at least one failure
>from an integer out of range, the implementors of the software
>for Ariane 5 would have been extra careful to ensure that
>all data conversions were within range -- since any kind
>of interrupt would result in destruction of the spacecraft.
>

Maybe the main reason for the lack of testing and care was
that the conversion exception could only occur after lift-off,
and that that particular piece of program was of no use after
lift-off. It was only kept running for 50 s in order to
speed up countdown restart in case of an interruption between
H0-9 and H0-5.

Conclusion: Never compute values that are of no use when you can
avoid it!

>There's a case for a review of the programming language used.


Michel
-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00     ` Chris Morgan
@ 1996-09-25  0:00     ` Byron Kauffman
  1996-09-25  0:00       ` A. Grant
  1 sibling, 1 reply; 58+ messages in thread
From: Byron Kauffman @ 1996-09-25  0:00 UTC (permalink / raw)



Michel OLAGNON wrote:
> 
> Maybe the main reason for the lack of testing and care was
> that the conversion exception could only occur after lift-off,
> and that that particular piece of program was of no use after
> lift-off. It was only kept running for 50 s in order to
> speed up countdown restart in case of an interruption between
> H0-9 and H0-5.
> 
> Conclusion: Never compute values that are of no use when you can
> avoid it!
> 
> >There's a case for a review of the programming language used.
> 
> Michel
> --
> | Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
> | IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|

Of course, Michel, you've got a great point, but let me give you some
advice, assuming you haven't read this thread for the last few months
(seems like years). Robin's whole point is that he firmly believes that
the problem would not have occurred if PL/I had been used instead of
Ada. Several EXTREMELY competent and experienced engineers who actually
have written flight-control software have patiently, and in some cases
(though I can't blame them) impatiently, attempted to explain the
situation - that this was a bad design/management decision combined with
a fatal oversight in testing - to this poor student, but alas, to no
avail.

My advice, Michel - blow it off and don't let ++robin (or is it
@@robin?) get to you, because "++robin" is actually an alias for John
Cleese. He's gathering material for a sequel to "The Argument Sketch"...    :-)




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00     ` Byron Kauffman
@ 1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
                           ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: A. Grant @ 1996-09-25  0:00 UTC (permalink / raw)



In article <32492E5C.562@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
>Several EXTREMELY competent and experienced engineers who actually have
>written flight-control software have patiently, and in some cases
>(though I can't blame them) impatiently attempted to explain the
>situation - that this was a bad design/management decision combined with
>a fatal oversight in testing - to this poor student, but alas, to no
>avail.

Robin is not a student.  He is a senior lecturer at the Royal
Melbourne Institute of Technology, a highly reputable institution.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
@ 1996-09-25  0:00   ` Bob Kitzberger
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-27  0:00   ` John McCabe
  2 siblings, 1 reply; 58+ messages in thread
From: Bob Kitzberger @ 1996-09-25  0:00 UTC (permalink / raw)



@@           robin (rav@goanna.cs.rmit.edu.au) wrote:
: The cause?  An integer out of range -- the same problem
: as with Ariane 5, where an integer became out of range.
...
: There's a case for a review of the programming language used.

Why do you persist?  

Ada _has_ range checks built into the language.  They were explicitly
disabled in this case.
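
A minimal sketch of what "disabled" means here (the names and value are
invented; delete the pragma and the assignment raises Constraint_Error
instead of silently storing garbage):

   procedure Suppress_Demo is
      pragma Suppress (All_Checks);  -- one way to disable the checks
      type Sensor_Value is range -32_768 .. 32_767;
      V : Sensor_Value;
   begin
      --  With the check suppressed, this is erroneous execution:
      --  no exception is raised, the out-of-range bits are just stored.
      V := Sensor_Value (Long_Float'(100_000.0));
   end Suppress_Demo;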

What are you failing to grasp?

--
Bob Kitzberger	      Rational Software Corporation       rlk@rational.com
http://www.rational.com http://www.rational.com/pst/products/testmate.html




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Michel OLAGNON
@ 1996-09-25  0:00     ` Chris Morgan
  1996-09-25  0:00     ` Byron Kauffman
  1 sibling, 0 replies; 58+ messages in thread
From: Chris Morgan @ 1996-09-25  0:00 UTC (permalink / raw)



In article <ag129.804.0011F709@ucs.cam.ac.uk> ag129@ucs.cam.ac.uk
(A. Grant) writes:

   Robin is not a student.  He is a senior lecturer at the Royal
   Melbourne Institute of Technology, a highly reputable institution.

I'm tempted to say "not so reputable to readers of this newsgroup"
after the ridiculous statements made by Robin w.r.t. Ariane 5 but
Richard A. O'Keefe's regular excellent postings more than balance them
out.

Chris
-- 
Chris Morgan                     |email         cm@mihalis.demon.co.uk (home) 
http://www.mihalis.demon.co.uk/  |       or chris.morgan@baesema.co.uk (work)




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Bob Kitzberger
@ 1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
                         ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: Ronald Kunne @ 1996-09-26  0:00 UTC (permalink / raw)



In article <52bm1c$gvn@rational.rational.com>
rlk@rational.com (Bob Kitzberger) writes:
 
>Ada _has_ range checks built into the language.  They were explicitly
>disabled in this case.
 
The problem of constructing bug-free real-time software seems to me
to be a trade-off between safety and speed of execution (and maybe
available memory?). In other words: including tests on array boundaries
might make the code safer, but also slower.
 
Comments?
 
Greetings,
Ronald




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
@ 1996-09-26  0:00         ` Byron Kauffman
  1996-09-27  0:00           ` A. Grant
  1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 1 reply; 58+ messages in thread
From: Byron Kauffman @ 1996-09-26  0:00 UTC (permalink / raw)



A. Grant wrote:
> 
> In article <32492E5C.562@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
> >Several EXTREMELY competent and experienced engineers who actually have
> >written flight-control software have patiently, and in some cases
> >(though I can't blame them) impatiently attempted to explain the
> >situation - that this was a bad design/management decision combined with
> >a fatal oversight in testing - to this poor student, but alas, to no
> >avail.
> 
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

A. -

Thank you for confirming my long-held theory that those who inhabit the
ivory towers of engineering/CS academia should spend 2 of every 5 years
working at a real job out in the real world. My intent is not to slam
professors who are in touch with reality, of course (e.g., Feldman,
Dewar, et al), but the idealistic theoretical side often is a far cry
from the practical, just-get-it-done world we have to deal with once
we're out of school.

I just KNOW there's a good Dilbert strip here somewhere...




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
  1996-09-26  0:00         ` Byron Kauffman
@ 1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 0 replies; 58+ messages in thread
From: Sandy McPherson @ 1996-09-26  0:00 UTC (permalink / raw)



A. Grant wrote:
> 
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

Why doesn't he wise up and act like one then? 

I don't know the man, and I suspect he has been winding everybody up
just for a laugh. But, if this is not the case, the thought of such a
closed mind teaching students is quite horrific.

"Use PL/I mate, you'll be tucker",

-- 
Sandy McPherson	MBCS CEng.	tel: 	+31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
@ 1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
                           ` (2 more replies)
  1996-09-27  0:00       ` Alan Brain
                         ` (2 subsequent siblings)
  3 siblings, 3 replies; 58+ messages in thread
From: Matthew Heaney @ 1996-09-26  0:00 UTC (permalink / raw)



In article <1780E8471.KUNNE@frcpn11.in2p3.fr>, KUNNE@frcpn11.in2p3.fr
(Ronald Kunne) wrote:

>In article <52bm1c$gvn@rational.rational.com>
>rlk@rational.com (Bob Kitzberger) writes:
> 
>>Ada _has_ range checks built into the language.  They were explicitly
>>disabled in this case.
> 
>The problem of constructing bug-free real-time software seems to me
>to be a trade-off between safety and speed of execution (and maybe
>available memory?). In other words: including tests on array boundaries
>might make the code safer, but also slower.
> 
>Comments?

Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
I'm not clear what the value of "faster execution" is.  The rocket's gone,
so what difference does it make how fast the code executed?  If you left
the range checks in, your code would be *marginally* slower, but you'd
still have your rocket, now wouldn't you?

>Ronald

Matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mheaney@ni.net
(818) 985-1271




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
@ 1996-09-27  0:00         ` Wayne Hayes
  1996-09-27  0:00           ` Richard Pattis
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-28  0:00         ` Ken Garlington
  2 siblings, 1 reply; 58+ messages in thread
From: Wayne Hayes @ 1996-09-27  0:00 UTC (permalink / raw)



In article <mheaney-ya023180002609962252500001@news.ni.net>,
Matthew Heaney <mheaney@ni.net> wrote:
>Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is.  The rocket's gone,
>so what difference does it make how fast the code executed?  If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

Your point is moot.  In this case, catching the error wouldn't have
helped.  The out-of-bounds error happened in a piece of code designed
for the Ariane-4, in which it was *physically impossible* for the value
to overflow (the Ariane-4 didn't go that fast, and it was a velocity
variable).  Then the code was used, as-is, in the Ariane-5, without an
analysis of how the code would react in the new hardware, which flew
faster.  Had the analysis been done, they wouldn't have added bounds
checking, they would have modified the code to actually *work*, because
they would have realized that the code was *guaranteed* to fail on the
first flight.

-- 
        "And a woman needs a man...        || Wayne Hayes, wayne@cs.utoronto.ca
      like a fish needs a bicycle..."      || Astrophysics & Computer Science
-- U2 (apparently quoting Gloria Steinem?) || http://www.cs.utoronto.ca/~wayne




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
@ 1996-09-27  0:00       ` Alan Brain
  1996-09-28  0:00         ` Ken Garlington
  1996-09-27  0:00       ` Ken Garlington
  1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 1 reply; 58+ messages in thread
From: Alan Brain @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:

> The problem of constructing bug-free real-time software seems to me
> to be a trade-off between safety and speed of execution (and maybe
> available memory?). In other words: including tests on array boundaries
> might make the code safer, but also slower.
> 
> Comments?

Bug-free software is not a reasonable criterion for success in a
safety-critical system, IMHO. A good program should meet the
requirements for safety etc. despite bugs. Also despite hardware
failures, soft failures, and so on. A really good safety-critical
program should be remarkably difficult to debug, as the only way you
know it's got a major problem is by examining the error log and
calculating that its performance is below theoretical expectations.

And if it runs too slow, many times in the real world you can spend 2
years of development time and many megabucks kludging the software, or
wait 12 months and get the new 400 MHz chip instead of your current 133.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
@ 1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
                             ` (2 more replies)
  1996-09-28  0:00         ` Ken Garlington
  2 siblings, 3 replies; 58+ messages in thread
From: Ronald Kunne @ 1996-09-27  0:00 UTC (permalink / raw)



In article <mheaney-ya023180002609962252500001@news.ni.net>
mheaney@ni.net (Matthew Heaney) writes:
 
>>The problem of constructing bug-free real-time software seems to me
>>to be a trade-off between safety and speed of execution (and maybe
>>available memory?). In other words: including tests on array boundaries
>>might make the code safer, but also slower.
 
>Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is.  The rocket's gone,
>so what difference does it make how fast the code executed?  If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?
 
Despite the sarcasm, I will elaborate.
 
Suppose an array goes from 0 to 100, and the calculated index is known
not to go outside this range. Why would one insist on putting the
range test in, which will slow down the code? This might be a problem
if the particular piece of code is heavily used, and the code executes
too slowly otherwise. "Marginally slower" if it happens only once, but
such checks on indices and function arguments (like square roots) are
necessary *everywhere* in code, if one is consistent.
 
Actually, this was the case here: the code was taken from an Ariane 4
code where it was physically impossible that the index would go out
of range: a test would have been a waste of time.
Unfortunately this was no longer the case in the Ariane 5.
 
Friendly greetings,
Ronald Kunne




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00         ` Byron Kauffman
@ 1996-09-27  0:00           ` A. Grant
  0 siblings, 0 replies; 58+ messages in thread
From: A. Grant @ 1996-09-27  0:00 UTC (permalink / raw)



In article <324A7C1C.6718@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
>A. Grant wrote:
>> Robin is not a student.  He is a senior lecturer at the Royal
>> Melbourne Institute of Technology, a highly reputable institution.

>Thank you for confirming my long-held theory that those who inhabit the
>ivory towers of engineering/CS academia should spend 2 of every 5 years 
>working at a real job out in the real world. My intent is not to slam 
>professors who are in touch with reality, of course (e.g., Feldman, 
>Dewar, et al), but the idealistic theoretical side often is a far cry 
>from the practical, just-get-it-done world we have to deal with once
>we're out of school.

You're being a bit hard on theoretical computer scientists here.
Just because it's called computer science doesn't mean it has to be
able to instantly make money on real computers.  And the Ariane 5 
failure was due to pragmatism (reusing old stuff to save money)
not idealism (applying theoretical proofs of correctness).

But in any case RMIT is noted for its involvement with industry.
(I used to work for a start-up company out of RMIT premises.)
If PL/I is being pushed by RMIT it's probably because the DP
managers in Collins St. want it.  Australia doesn't have much call
for aerospace systems.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00       ` Alan Brain
@ 1996-09-27  0:00       ` Ken Garlington
  1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> In article <52bm1c$gvn@rational.rational.com>
> rlk@rational.com (Bob Kitzberger) writes:
> 
> >Ada _has_ range checks built into the language.  They were explicitly
> >disabled in this case.
> 
> The problem of constructing bug-free real-time software seems to me
> to be a trade-off between safety and speed of execution (and maybe
> available memory?). In other words: including tests on array boundaries
> might make the code safer, but also slower.

Particularly for fail-operate systems that must continue to function in
harsh environments, memory and throughput can be tight. This usually happens
because the system must continue to operate on emergency power and/or
cooling. At least until recently, the processing systems that had lots of
memory and CPU power also had larger power and cooling requirements, so they
couldn't always be used in this class of systems. (That's changing, somewhat.) So,
the tradeoff you describe can occur.

The trade-off I find even more interesting is the safety gained from
adding extra features vs. the safety _lost_ by adding those features. Every
time you add a check, whether it's an explicit check or one automatically
generated by the compiler, you have to have some way to gain confidence that
the check will not only work, but won't create some side-effect that causes
a different problem. The effort expended to get confidence for that additional
feature is effort that can't be spent gaining assurance of other features in
the system, assuming finite resources. There is no magic formula I've ever
seen to make that trade-off - ultimately, it's human judgement.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00   ` Bob Kitzberger
@ 1996-09-27  0:00   ` John McCabe
  1996-10-01  0:00     ` Michael Dworetsky
  1996-10-04  0:00     ` @@           robin
  2 siblings, 2 replies; 58+ messages in thread
From: John McCabe @ 1996-09-27  0:00 UTC (permalink / raw)



rav@goanna.cs.rmit.edu.au (@@           robin) wrote:

<..snip..>

Just a point for your information. From clari.tw.space:

	 "An inquiry board investigating the explosion concluded in  
July that the failure was caused by software design errors in a 
guidance system."

Note software DESIGN errors - not programming errors.



Best Regards
John McCabe <john@assen.demon.co.uk>





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
@ 1996-09-27  0:00           ` Lawrence Foard
  1996-10-04  0:00             ` @@           robin
  1996-09-28  0:00           ` Ken Garlington
  1996-09-29  0:00           ` Alan Brain
  2 siblings, 1 reply; 58+ messages in thread
From: Lawrence Foard @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> Actually, this was the case here: the code was taken from an Ariane 4
> code where it was physically impossible that the index would go out
> of range: a test would have been a waste of time.
> Unfortunately this was no longer the case in the Ariane 5.

Actually it would still present a danger on Ariane 4. If the sensor
which apparently was no longer needed during flight became defective,
then you could get a value out of range.

-- 
The virgin birth of Pythagoras via Apollo. The martyrdom of 
St. Socrates. The Gospel according to Iamblichus. 
--  Have an 18.9cents/minute 6 second billed calling card tomorrow --
                  http://www.vwis.com/cards.html




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Wayne Hayes
@ 1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Chris McKnight
                               ` (3 more replies)
  0 siblings, 4 replies; 58+ messages in thread
From: Richard Pattis @ 1996-09-27  0:00 UTC (permalink / raw)



As an instructor in CS1/CS2, this discussion interests me. I try to talk about
designing robust, reusable code, and actually have students reuse code that
I have written as well as some that they (and their peers) have written.
The Ariane failure adds a new view to robustness, having to do with future
use of code, and mathematical proof vs. "engineering" considerations.

Should a software engineer remove safety checks if he/she can prove - based on
physical limitations, like a rocket not exceeding a certain speed - that they
are unnecessary? Or, knowing that his/her code will be reused (in an unknown
context, by someone who is not so skilled, and will probably not think to
redo the proof), should such checks not be optimized out? What rule of thumb
should be used to decide (e.g., what if the proof assumes the rocket speed
will not exceed that of light)? Since software operates in the real world (not
the world of mathematics), should mathematical proofs about code always yield
to engineering rules of thumb to expect the unexpected?

  "In the Russian theatre, every 5 years an unloaded gun accidentally 
   discharges and kills someone; every 20 years a broom does."

What is the rule of thumb about when should mathematics be believed? 

  As to saving SPEED by disabling the range checks: did the code not meet its
speed requirements with range checks on? Only in this case would I have turned
them off. Does "real time" mean fast enough or as fast as possible? To
misquote Einstein, "Code should run as fast as necessary, but no faster...."
since something is always traded away to increase speed.

If I were to try to create a lecture on this topic, what other similar
failures should I know about (besides the legendary Venus probe)?
Your comments?

Rich




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
  1996-09-27  0:00         ` Ronald Kunne
@ 1996-09-28  0:00         ` Ken Garlington
  2 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Matthew Heaney wrote:
> 




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
@ 1996-09-28  0:00           ` Ken Garlington
  1996-09-28  0:00             ` Ken Garlington
  1996-09-29  0:00           ` Alan Brain
  2 siblings, 1 reply; 58+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> In article <mheaney-ya023180002609962252500001@news.ni.net>
> mheaney@ni.net (Matthew Heaney) writes:
> 
> >>The problem of constructing bug-free real-time software seems to me
> >>to be a trade-off between safety and speed of execution (and maybe
> >>available memory?). In other words: including tests on array boundaries
> >>might make the code safer, but also slower.
> 
> >Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
> >I'm not clear what the value of "faster execution" is.  The rocket's gone,
> >so what difference does it make how fast the code executed?  If you left
> >the range checks in, your code would be *marginally* slower, but you'd
> >still have your rocket, now wouldn't you?
> 
> Despite the sarcasm, I will elaborate.
> 
> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

I might agree with the conclusion, but probably not with the argument.
If the array is statically typed to go from 0 to 100, and everything
that indexes it is statically typed for that range or smaller, most
modern Ada compilers won't generate _any_ code for the check.
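
A minimal sketch of that static case (illustrative only; exactly which
checks get eliminated is compiler-dependent, but this pattern is the
easy one):

   procedure Static_Index_Demo is
      subtype Index is Integer range 0 .. 100;
      Table : array (Index) of Float := (others => 0.0);
      Sum   : Float := 0.0;
   begin
      for I in Index loop         -- I is statically within 0 .. 100
         Sum := Sum + Table (I);  -- no run-time index check needed
      end loop;
   end Static_Index_Demo;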

I still believe the more interesting issue has to do with the _consequences_
of the check. If your environment doesn't lend itself to a reasonable response
to the check (quite possible in fail-operate systems inside systems that move
really fast), and you have to test the checks to make sure they don't _create_
a problem, then you've got a hard decision on your hands: suppress the check
(which might trigger a compiler bug or some other problems), or leave the check in 
(which might introduce a problem, or divert your attention away from some other
problem).

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00       ` Alan Brain
@ 1996-09-28  0:00         ` Ken Garlington
  0 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> Ronald Kunne wrote:
> 
> > The problem of constructing bug-free real-time software seems to me
> > to be a trade-off between safety and speed of execution (and maybe
> > available memory?). In other words: including tests on array boundaries
> > might make the code safer, but also slower.
> >
> > Comments?
> 
> Bug-free software is not a reasonable criterion for success in a
> safety-critical system, IMHO. A good program should meet the
> requirements for safety etc despite bugs.

An OK statement for a fail-safe system. How do you propose to implement
this theory for a fail-operate system, particularly if there are system
constraints on weight, etc. that preclude hardware backups?

> Also despite hardware
> failures, soft failures, and so on.

A system which will always meet its requirements despite any combination
of failures is in the same regime as the perpetual motion machine. If
you build one, you'll probably make a lot of money, so go to it!

> A really good safety-critical
> program should be remarkably difficult to debug, as the only way you
> know it's got a major problem is by examining the error log and
> calculating that its performance is below theoretical expectations.
> And if it runs too slow, many times in the real world you can spend 2
> years of development time and many megabucks kludging the software, or
> wait 12 months and get the new 400 MHz chip instead of your current 133.

I really need to change jobs. It sounds so much simpler to build 
software for ground-based PCs, where you don't have to worry about the 
weight, power requirements, heat dissipation, physical size, 
vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-28  0:00           ` Ken Garlington
@ 1996-09-28  0:00             ` Ken Garlington
  0 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



From the  "There's always time to test it the second time around"
department...

      ORBITAL JUNK:  The second Ariane 5 to be launched in April at the
      earliest will put two dummy satellites, worth less than $3
      million, into orbit. The first Ariane 5 exploded in June carrying
      four uninsured satellites worth $500 million.  (Financial Times)

I wonder if the test labs at Arianespace, etc. are keeping busy... :)




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00           ` Alan Brain
@ 1996-09-29  0:00             ` Robert A Duff
  1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00             ` Ken Garlington
  1 sibling, 1 reply; 58+ messages in thread
From: Robert A Duff @ 1996-09-29  0:00 UTC (permalink / raw)



In article <324F1157.625C@dynamite.com.au>,
Alan Brain  <aebrain@dynamite.com.au> wrote:
>Brain's law:
>"Software Bugs and Hardware Faults are no excuse for the Program not to
>work".   
>
>So: it costs peanuts, and may save your hide.

This reasoning doesn't sound right to me.  The hardware part, I mean.
The reason checks-on costs only 5% or so is that compilers aggressively
optimize out almost all of the checks.  When the compiler proves that a
check can't fail, it assumes that the hardware is perfect.  So, hardware
faults and cosmic rays and so forth are just as likely to destroy the
RTS, or cause the program to take a wild jump, or destroy the call
stack, or whatever -- as opposed to raising a Constraint_Error and
recovering gracefully.  After all, the compiler doesn't range-check the
return address just before doing a return instruction!

- Bob




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
@ 1996-09-29  0:00             ` Chris McKnight
  1996-09-29  0:00               ` Real-world education (was: Ariane 5 failure) Michael Feldman
  1996-09-29  0:00             ` Ariane 5 failure Alan Brain
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 58+ messages in thread
From: Chris McKnight @ 1996-09-29  0:00 UTC (permalink / raw)



In article Hzz@beaver.cs.washington.edu, pattis@cs.washington.edu (Richard Pattis) writes:
>As an instructor in CS1/CS2, this discussion interests me. I try to talk about
>designing robust, reusable code, and actually have students reuse code that
>I have written as well as some that they (and their peers) have written.
>The Ariane failure adds a new view to robustness, having to do with future
>use of code, and mathematical proof vs. "engineering" considerations.

  An excellent bit of teaching, IMHO. Glad to hear they're putting some
  more of the real-world issues in the classroom.

>Should a software engineer remove safety checks if he/she can prove - based on
>physical limitations, like a rocket not exceeding a certain speed - that they
>are unnecessary? Or, knowing that his/her code will be reused (in an unknown
>context, by someone who is not so skilled, and will probably not think to
>redo the proof), should such checks not be optimized out? What rule of thumb
>should be used to decide (e.g., what if the proof assumes the rocket speed
>will not exceed that of light)? Since software operates in the real world (not
>the world of mathematics), should mathematical proofs about code always yield
>to engineering rules of thumb to expect the unexpected?

 A good question.  For the most part, I'd go with engineering rules of thumb
 (what did you expect, I'm an engineer).  As an engineer, you never know what
 may happen in the real world (in spite of what you may think), so I prefer
 error detection and predictable recovery.  The key factors to consider include
 the likelihood and the cost of failures, and the cost of leaving in (or adding
 where your language doesn't already provide it) the checks.

 Consider these factors, likelihood and cost of failures:

    In a real-time embedded system, both of these factors are often high.  Of
    the two, I think people most often get caught on misbeliefs on likelihood of
    failure.  As an example, I've argued more than once with engineers who think
    that since a device is only "able" to give them a value in a certain range, 
    they needn't check for out of range values.  I've seen enough failed hardware
    to know that anything is possible, regardless of what the manufacturer may
    claim.  Consider your speed of light example: what if the sensor goes bonkers
    and tells you that you're going faster?  Your "proof" that you can't get that
    value falls apart then.  Your point about reuse is also well made.  Who knows
    what someone else may want to use your code for?

    As for cost of failure, it's usually obvious; in dollars, in lives, or both.
 
 As for cost of leaving checks in (or putting them in):

    IMHO, the cost is almost always insignificant.  If the timing is so tight that 
    removing checks makes the difference, it's probably time to redesign anyway.
    After all, in the real world there's always going to be fixes, new features,
    etc. that need to be added later, so you'd better plan for it.  Also, it's
    been my experience that removing checks is somewhere in the single digits
    on % improvement.  If you're really that tight, a good optimizer can yield
    10%-15% or more (actual mileage may vary of course).  But again, if that
    makes the difference, you'd better rethink your design.
   
 So the rule of thumb I use is, unless a device is not physically capable (as
 opposed to theoretically capable) of giving me out of range data, I'm going
 to range check it.  I.e., if there are 3 bits, you'd better check for 8 values
 regardless of the number of values you think you can get.
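
 To make that rule concrete, a small sketch (the names and values are
 invented): declare the full range the wire can physically deliver, and
 validate before trusting the value.

    procedure Read_Mode_Demo is
       type Raw_Mode is range 0 .. 7;                -- all 8 values 3 bits can carry
       subtype Valid_Mode is Raw_Mode range 0 .. 4;  -- what the spec says can occur
       R : Raw_Mode := 6;  -- hypothetical: the device "can't" send this, but did
    begin
       if R in Valid_Mode then
          null;  -- normal processing would go here
       else
          null;  -- predictable recovery path, not an unhandled exception
       end if;
    end Read_Mode_Demo;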

 That having been said, it's often not up to the engineer to make these 
 decisions.  Such things as political considerations, customer demands, and 
 (more often than not) management decisions  have been known to succeed in
 convincing me to turn checks off.  As a rule, however, I fight to keep them
 in, at very least through development and integration. 

>  As to saving SPEED by disabling the range checks: did the code not meet its
>speed requirements with range checks on? Only in this case would I have turned
>them off. Does "real time" mean fast enough or as fast as possible? To
>misquote Einstein, "Code should run as fast as necessary, but no faster...."
>since something is always traded away to increase speed.

  Precisely!  And when what's being traded is safety, it's not worth it.


  Cheers,

     Chris


=========================================================================

"I was gratified to be able to answer promptly.  I said I don't know".  
  -- Mark Twain

=========================================================================





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-world education (was: Ariane 5 failure)
  1996-09-29  0:00             ` Chris McKnight
@ 1996-09-29  0:00               ` Michael Feldman
  0 siblings, 0 replies; 58+ messages in thread
From: Michael Feldman @ 1996-09-29  0:00 UTC (permalink / raw)





In article <1996Sep29.193602.17369@enterprise.rdd.lmsc.lockheed.com>,
Chris McKnight <cmcknigh@hercii.lasc.lockheed.com> wrote:

[Rich Pattis' good stuff snipped.]
>
>  An excellent bit of teaching, IMHO. Glad to hear they're putting some
>  more of the real world issues in the class room.

Rich Pattis is indeed an experienced, even gifted teacher of
introductory courses, with a very practical view of what they
should be about.

Without diminishing Rich Pattis' teaching experience or skill one bit,
I am somewhat perplexed at the unfortunate stereotypical view you
seem to have of CS profs. Yours is the second post today to have
shown evidence of that stereotypical view; both you and the other
poster have industry addresses.

This is my 22nd year as a CS prof, I travel a lot in CS education
circles, and - while we, like any population, tend to hit a bell
curve - I've found that there are a lot more of us out here than
you may think with Pattis-like commitment to bring the real world
into our teaching.

Sure, there are theorists, as there are in any field, studying
and teaching computing just because it's "beautiful", with little
reference to real application, and there's a definite place in the
teaching world for them.  Indeed, exposure to their "purity" of
approach is healthy for undergraduates - there is no harm at all
in taking on computing - sometimes - as purely an intellectual
exercise.

But it's a real reach from there to an assumption that most of us
are in that theoretical category.

I must say that there's a definite connection between an interest
in Ada and an interest in real-world software; certainly most of
the Ada teachers I've met are more like Pattis than you might think.
Indeed, it's probably our commitment to that "engineering" view
of computing that brings us to like and teach Ada.

But it's not just limited to Ada folks. I had the pleasure of
participating in a SIGCSE panel last March entitled "the first
year beyond language." Organized by Owen Astrachan of Duke,
a C++ fan, this panel consisted of 6 teachers of first-year
courses, each using a different language. Pascal, C++, Ada,
Scheme, Eiffel, and (as I recall) ML were represented.

The challenge Owen made to each of us was to give a 10-minute
"vision statement" for first-year courses, without identifying
which language we "represented." Owen revealed the languages to
the audience only after the presentations were done.

It was _really_ gratifying that - with no prior agreement or
discussion among us - five of the six of us presented very similar
visions, in the "computing as engineering" category. It doesn't
matter which language the 6th used; the important thing was that,
considering the diversity of our backgrounds, teaching everywhere
from small private colleges to big public universities, we were
in _amazing_ agreement.

The message for me in the stereotype presented above is that it's
probably out of date and certainly out of touch. I urge my
industry friends to get out of _their_ ivory towers, and come
visit us. Find out what we're _really_ doing. I think you'll
be pleasantly surprised.

Especially, check out those of us who are introducing students
to _Ada_ as their first, foundation language.

Mike Feldman

------------------------------------------------------------------------
Michael B. Feldman -  chair, SIGAda Education Working Group
Professor, Dept. of Electrical Engineering and Computer Science
The George Washington University -  Washington, DC 20052 USA
202-994-5919 (voice) - 202-994-0227 (fax) 
http://www.seas.gwu.edu/faculty/mfeldman
------------------------------------------------------------------------
       Pork is all that money the government gives the other guys.
------------------------------------------------------------------------
WWW: http://lglwww.epfl.ch/Ada/ or http://info.acm.org/sigada/education
------------------------------------------------------------------------




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Chris McKnight
  1996-09-29  0:00             ` Ariane 5 failure Alan Brain
@ 1996-09-29  0:00             ` Dann Corbit
  1996-10-01  0:00             ` Ken Garlington
  3 siblings, 0 replies; 58+ messages in thread
From: Dann Corbit @ 1996-09-29  0:00 UTC (permalink / raw)



I propose a software IC metaphor for high-reliability
projects (and eventually all projects).

Currently, the software industry goes by
what I call a "software schematic" metaphor.
We put in components that are tested, but
we do not necessarily know the performance
curves.

If you look at S. Moshier's code in the
Cephes Library on Netlib, you will see that
he offers statistical evidence that his 
programs are robust.  So you can at least
infer, on a probability basis, what the odds
are of a component failing.  So instead of
just dropping in a resistor or a transistor,
we read the little gold band, or the spec
on the transistor that shows what voltages
it can operate under.
For simple components with, say, five bytes
of input, we could exhaustively test all 
possible inputs and outputs.  For more
complicated procedures with many bytes of
inputs, we could perform probability testing,
and test other key values.

Imagine a database like the following:
TABLE: MODULES
int      ModuleUniqueID
int      ModuleCategory
char*60  ModuleName
char*255 ModuleDescription
text     ModuleCode
text     TestRoutineUsed
bit      CompletelyTested

TABLE: TestResults (many result sets for one module)
int      TestResultUniqueID
int      ModuleUniqueID
char*60  OperatingSystem
char*60  CompilerUsed
binary   ResultChart
text     ResultDescription
float    ProbabilityOfFailure
float    RmsErrorObserved
float    MaxErrorObserved

TABLE: KnownBugs  (many known bugs for one module)
int      KnownBugUniqueID
int      ModuleUniqueID
char*60  KnownBugDescription
text     BugDefinition
text     PossibleWorkAround

Well, this is just a rough outline, but the value of
a database like this would be obvious.  This could
easily be improved and expanded. (More domain tables,
tables for defs of parameters to the module, etc.)

If we had a tool like that, we would be using
software IC's, not software schematics.
-- 
"I speak for myself and all of the lawyers of the world"
If I say something dumb, then they will have to sue themselves.





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
  1996-09-28  0:00           ` Ken Garlington
@ 1996-09-29  0:00           ` Alan Brain
  1996-09-29  0:00             ` Robert A Duff
  1996-10-01  0:00             ` Ken Garlington
  2 siblings, 2 replies; 58+ messages in thread
From: Alan Brain @ 1996-09-29  0:00 UTC (permalink / raw)



Ronald Kunne wrote:

> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

Why insist?
1. Suppressing all checks in Ada-83 makes about a 5% difference in
execution speed, in typical real-time and avionics systems. (For
example, the B2 simulator, CSU-90 sonar, COSYS-200 combat system.) If
your hardware budget is this tight, you'd better not have lives at
risk, or a lot of money, as the technical risk is appallingly high.

2. If you know the range is 0-100, and you get 101, what does this show?
a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
soft failure, as in a stray cosmic ray zapping a bit. d) A faulty
analysis of your "can't happen" situation, as in re-use, or where your
array comes from an I/O channel with noise on it...

Type a) and d) failures should be caught during testing. Most of them.
OK, some of them. Range checking here is a necessary debugging aid. But
type b) and c) failures can happen out in the real world too, and if
you don't test for an error early, you often can't recover the
situation. Lives or $ lost.
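
A minimal sketch of testing for the error early (the values are
invented; the point is that a handled Constraint_Error becomes a
logged, recoverable event rather than a dead computer):

   with Ada.Text_IO;
   procedure Recover_Demo is
      type Bearing is range 0 .. 359;
      Raw  : Integer := 400;  -- hypothetical corrupted input: bit flip, noise
      Safe : Bearing := 0;    -- last known good value
   begin
      Safe := Bearing (Raw);  -- check left on: raises Constraint_Error here
   exception
      when Constraint_Error =>
         Ada.Text_IO.Put_Line ("fault logged; continuing on last good value");
         --  Safe still holds its previous value; the program keeps working.
   end Recover_Demo;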

Brain's law:
"Software Bugs and Hardware Faults are no excuse for the Program not to
work".   

So: it costs peanuts, and may save your hide.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
                         ` (2 preceding siblings ...)
  1996-09-27  0:00       ` Ken Garlington
@ 1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 0 replies; 58+ messages in thread
From: Louis K. Scheffer @ 1996-09-29  0:00 UTC (permalink / raw)



KUNNE@frcpn11.in2p3.fr (Ronald Kunne) writes:

>The problem of constructing bug-free real-time software seems to me
>to be a trade-off between safety and speed of execution (and maybe
>available memory?). In other words: including tests on array boundaries
>might make the code safer, but also slower.
> 
>Comments?

True in this case, but not in the way you might expect.  The software group
decided that they wanted the guidance computers to be no more than 80 percent 
busy.  Range checking ALL the variables took too much time, so they analyzed 
the situation and only checked those that might overflow.  In the Ariane 4,
this particular variable could not overflow unless the trajectory was wildly 
off, so they left out the range checking.

I think you could make a good case for range checking in the Ariane
software making it less safe, rather than more safe.  The only reason they
check for overflow is to find hardware errors - since the software is designed
not to overflow, any overflow must be because of a hardware problem, so
if any processor detects an overflow it shuts down.  So on the one hand, each
additional range check increases the odds of catching a hardware error before
it does damage, but increases the odds that a processor shuts down while it
could still be delivering useful data. (Say the overflow occurs while 
computing unimportant results, as on the Ariane 5).   Given the relative
odds of hardware and software errors, it's not at all obvious to me that
range checking helps at all in this case!

The real problem is that they did not re-examine this software for the
Ariane 5. If they had either simulated it or examined it closely, they
would probably have found this problem.
   -Lou Scheffer




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Chris McKnight
@ 1996-09-29  0:00             ` Alan Brain
  1996-09-29  0:00             ` Dann Corbit
  1996-10-01  0:00             ` Ken Garlington
  3 siblings, 0 replies; 58+ messages in thread
From: Alan Brain @ 1996-09-29  0:00 UTC (permalink / raw)



Richard Pattis wrote:
> 
> As an instructor in CS1/CS2, this discussion interests me. I try to talk about
> designing robust, reusable code.... --->8----

> The Ariane falure adds a new view to robustness, having to do with future
> use of code, and mathematical proof vs "engineering" considerations..
> 
> Should a software engineer remove safety checks if he/she can prove - based on
> physical limitations, like a rocket not exceeding a certain speed - that they
> are unnecessary. Or, knowing that his/her code will be reused (in an unknown
> context, by someone who is not so skilled, and will probably not think to
> redo the proof) should such checks not be optimized out? What rule of thumb
> should be used to decide (e.g., what if the proof assumes the rocket speed
> will not exceed that of light)? Since software operates in the real world (not
> the world of mathematics) should mathematical proofs about code always yield
> to engineering rules of thumb to expect the unexpected.

> What is the rule of thumb about when should mathematics be believed?
> 

Firstly, I wish there were more CS teachers like you. These are
excellent engineering questions.

Secondly, answers:
I tend towards the philosophy of "Leave every check in". In 12+ years
of Ada programming, I've never seen pragma Suppress (All_Checks) make
the difference between success and failure. At best it gives a 5%
improvement. This means that in order to debug the code quickly, it's
useful to have such checks, even when not strictly necessary.

For re-use, you then often have the Ariane problem. That is, the
(now unnecessary) checks you included come around and bite you, as the
assumptions you were making in the previous project become invalid.

So you make sure the assumptions/consequences get put into a separate,
system-specific package that will be changed when re-used. Which means
that if the subsystem gets re-used a lot, the system-specific stuff
will eventually be re-written so as to allow for easy re-use.
Example: a car's cruise control: MAX_SPEED : constant := 200.0*MPH;
It gets re-used in an airliner - change it to 700.0*MPH. Then onto an
SST - 2000.0*MPH. Eventually, you make it 2.98E8*MetresPerSec. Then
some Bunt invents a Warp Drive, and you're wrong again.
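
In sketch form (the names are illustrative, not from any real system),
the idea is one tiny package that owns the assumption:

   package Vehicle_Limits is
      MPH       : constant := 0.447_04;     -- metres per second per mph
      Max_Speed : constant := 200.0 * MPH;  -- the ONLY line to edit on re-use
   end Vehicle_Limits;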

Summary: Label the constraints and assumptions, stick them as comments
in the code and design notes, put them in a separate package... and some
dill will still stuff up, but that's the best you can do. And in the
meantime, you allow the possibility of finding a number of errors
early.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00             ` Robert A Duff
@ 1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00                 ` Ken Garlington
  1996-10-03  0:00                 ` Richard A. O'Keefe
  0 siblings, 2 replies; 58+ messages in thread
From: Wayne L. Beavers @ 1996-09-30  0:00 UTC (permalink / raw)



I have been reading this thread awhile, and one topic that I have not seen mentioned is protecting the code 
area from damage.  When I code in PL/I or any other reentrant language I always make sure that the executable 
code is executing from read-only storage.  There is no way to put the data areas in read-only storage 
(obviously), but I can't think of any reason to put the executable code in writeable storage. 

I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another.  The 
single most common error I had to correct was incorrect usage of pointer variables.  I caught a lot of them 
whenever they attempted to accidentally store into the code area.  At that point it is trivial to correct the 
bug.  This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-30  0:00               ` Wayne L. Beavers
@ 1996-10-01  0:00                 ` Ken Garlington
  1996-10-01  0:00                   ` Wayne L. Beavers
  1996-10-03  0:00                 ` Richard A. O'Keefe
  1 sibling, 1 reply; 58+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Wayne L. Beavers wrote:
> 
> I have been reading this thread awhile, and one topic that I have not seen mentioned is protecting the code
> area from damage.  When I code in PL/I or any other reentrant language I always make sure that the executable
> code is executing from read-only storage.  There is no way to put the data areas in read-only storage
> (obviously), but I can't think of any reason to put the executable code in writeable storage.

That's actually a pretty common rule of thumb for safety-critical systems. 
Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors 
can cause a random change in the memory. So, it's not a perfect fix.

> 
> I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another.  The
> single most common error I had to correct was incorrect usage of pointer variables.  I caught a lot of them
> whenever they attempted to accidentally store into the code area.  At that point it is trivial to correct the
> bug.  This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
                               ` (2 preceding siblings ...)
  1996-09-29  0:00             ` Dann Corbit
@ 1996-10-01  0:00             ` Ken Garlington
  3 siblings, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Richard Pattis wrote:
> 
[snip]
> If I were to try to create a lecture on this topic, what other similar
> failures should I know about (beside the legendary Venus probe)?
> Your comments?

"Safeware" by Levison has some additional good examples about what can
go wrong with software. The RISKS conference also has a lot of info on
this.

There was a study done several years ago by a Dr. Avizienis (I always screw
up that spelling, and I'm always too lazy to go look it up...) trying to
show the worth of N-version programming. He had five teams of students write
code for part of a flight control system. Each team was given the same set
of control law diagrams (which are pretty detailed, as requirements go), and
each team used the same sort of meticulous software engineering approach that
you would expect for a safety-critical system (no formal methods, however).
Each team's software was almost error-free, based on tests done using the
same test data as the actual delivered flight controls.

Note I said "almost". Every team made one mistake. Worse, it was the _same_
mistake. The control law diagrams were copies. The copier apparently wasn't
a good one, because a comma in one of the gains ended up looking like a
decimal point (or maybe it was the other way around -- I forget). Anyway,
the gain was accidentally coded as 2.345 vs 2,345, or something like that.
That kind of error makes a big difference!
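
For what it's worth, this is one place Ada's range constraints can help
even against a bad photocopier: give the gain a subtype with physically
plausible bounds, and the misread value never gets off the ground. A
minimal sketch, with invented numbers, assuming the true gain was 2,345:

   package Control_Gains is

      --  ASSUMPTION: plausible physical bounds for this gain; the
      --  numbers here are invented for illustration.
      subtype Pitch_Gain is Float range 1_000.0 .. 10_000.0;

      --  Misreading the comma as a decimal point is now caught at
      --  compile time (static value out of range) or at elaboration
      --  (Constraint_Error), rather than in flight:
      K_Pitch : constant Pitch_Gain := 2.345;   -- should be 2_345.0

   end Control_Gains;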

In the face of that kind of error, I've never felt that formal methods had a
chance. That's not to say that formal methods can't detect a lot of different
kinds of failures, but at some level some engineer has to be able to say: "That
doesn't make sense..."

If you want to try to find this study, I believe it was reported at a Digital
Avionics Systems Conference many years ago (in San Jose?), probably around 1986.

> 
> Rich

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00           ` Alan Brain
  1996-09-29  0:00             ` Robert A Duff
@ 1996-10-01  0:00             ` Ken Garlington
  1 sibling, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> 1. Suppressing all checks in Ada-83 makes about a 5% difference in
> execution speed, in typical real-time and avionics systems. (For
> example, B2 simulator, CSU-90 sonar, COSYS-200 Combat system). If your
> hardware budget is this tight, you'd better not have lives at risk, or
> a lot of money, as technical risk is appallingly high.

Actually, I've seen systems where checks make much more than a 5% difference.
For example, in a flight control system, checks done in the redundancy
management monitor (comparing many redundant inputs in a tight loop) can
easily add 10% or more.

I have also seen flight-critical systems where 5% is a big deal, and where you
can _not_ add a more powerful processor to fix the problem. Flight control
software usually exists in a flight control _system_, with system issues of
power, cooling, space, etc. to consider. On a missile, these are important
issues. You might consider the technical risk "appallingly high," but the fix
for that risk can introduce equally dangerous risks in other areas.
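
When the budget really is that tight, one middle ground in Ada is to
suppress checks only in the measured hot spot and keep them everywhere
else. A hypothetical sketch; the redundancy-management names and the
voting computation are invented:

   package Redundancy_Monitor is

      type Sensor_Value is digits 6;
      type Sensor_Array is array (Positive range <>) of Sensor_Value;

      function Average (Inputs : Sensor_Array) return Sensor_Value;

   end Redundancy_Monitor;

   package body Redundancy_Monitor is

      function Average (Inputs : Sensor_Array) return Sensor_Value is
         --  Checks suppressed in this routine only, after profiling
         --  showed the voting loop dominates the frame; the rest of
         --  the system keeps full checking.
         pragma Suppress (All_Checks);
         Sum : Sensor_Value := 0.0;
      begin
         for I in Inputs'Range loop
            Sum := Sum + Inputs (I);
         end loop;
         return Sum / Sensor_Value (Inputs'Length);
      end Average;

   end Redundancy_Monitor;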

> 2. If you know the range is 0-100, and you get 101, what does this show?
> a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
> soft failure, as in a stray cosmic ray zapping a bit. d) a faulty
> analysis of your "can't happen" situation. As in re-use, or where your
> array comes from an IO channel with noise on....

You forgot (e) - a failure in the inputs. The range may be calculated,
directly or indirectly, from an input to the system. In practice, at least
for the systems I'm familiar with, that's usually where the error came
from -- either a connector fell off, or some wiring shorted out, or a bird
strike took out half of your sensors. I definitely would say that, when we
have a failure reported in operation, it's not usually because of a bug in
the software for our systems!
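
Case (e) is also the one Ada 95's 'Valid attribute was added for: a
scalar that came in over a channel can be tested before it is trusted,
instead of letting it trip a check somewhere downstream. A minimal
sketch with invented names (the constant stands in for a real read):

   with Ada.Unchecked_Conversion;

   procedure Check_Gear_Input is

      type Raw_Octet is mod 2 ** 8;   -- one byte from the I/O channel

      type Gear_State is (Up, Down, In_Transit);
      for Gear_State use (Up => 1, Down => 2, In_Transit => 4);
      for Gear_State'Size use 8;

      function To_Gear is
         new Ada.Unchecked_Conversion (Raw_Octet, Gear_State);

      Raw   : constant Raw_Octet := 16#FF#;  -- stand-in for a channel read
      State : Gear_State;

   begin
      State := To_Gear (Raw);
      if State'Valid then
         null;   -- safe to use State
      else
         null;   -- wiring/connector fault: report it, don't die on a check
      end if;
   end Check_Gear_Input;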

> Type a) and d) failures should be caught during testing. Most of them.
> OK, some of them. Range checking here is a neccessary debugging aid. But
> type b) and c) can happen too out in the real world, and if you don't
> test for an error early, you often can't recover the situation. Lives or
> $ lost.
> 
> Brain's law:
> "Software Bugs and Hardware Faults are no excuse for the Program not to
> work".

Too bad that law can't be enforced :)

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                 ` Ken Garlington
@ 1996-10-01  0:00                   ` Wayne L. Beavers
  1996-10-01  0:00                     ` Ken Garlington
  0 siblings, 1 reply; 58+ messages in thread
From: Wayne L. Beavers @ 1996-10-01  0:00 UTC (permalink / raw)



Ken Garlington wrote:

> That's actually a pretty common rule of thumb for safety-critical systems.
> Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> can cause a random change in the memory. So, it's not a perfect fix.

  You're right, but the risk and probability of memory failures are pretty low, I would think.  I have never seen 
or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current 
technology is, but I can remember quite a while ago that at least one vendor was claiming that ALL double-bit 
memory errors were fully detectable and recoverable, and ALL triple-bit errors were detectable but only some were 
correctable.  But I also don't work on real-time systems; my experience is with commercial systems.

  Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you 
referring to ground-based systems that don't have similar constraints?

  Does anyone know just how good memory ECC is these days?

Wayne L. Beavers   wayneb@beyond-software.com
Beyond Software, Inc.      
The Mainframe/Internet Company
http://www.beyond-software.com/




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                   ` Wayne L. Beavers
@ 1996-10-01  0:00                     ` Ken Garlington
  1996-10-02  0:00                       ` Sandy McPherson
  0 siblings, 1 reply; 58+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Wayne L. Beavers wrote:
> 
> Ken Garlington wrote:
> 
> > That's actually a pretty common rule of thumb for safety-critical systems.
> > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > can cause a random change in the memory. So, it's not a perfect fix.
> 
>   You're right, but the risk and probability of memory failures are pretty low, I would think.  I have never seen
> or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current
> technology is, but I can remember quite a while ago that at least one vendor was claiming that ALL double-bit
> memory errors were fully detectable and recoverable, and ALL triple-bit errors were detectable but only some were
> correctable.  But I also don't work on real-time systems; my experience is with commercial systems.
> 
>   Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you
> referring to ground-based systems that don't have similar constraints?

On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment 
you can get quite a few failure _sources_, including mechanical failures (stress 
fractures, solder loss due to excessive heat, etc.), electrical failures (EMI, 
lightning), and so forth. You don't have to take out the actual chip, of course: just 
as bad is a failure in the address or data lines connecting the memory to the CPU. Add 
a memory management unit to the mix, along with various I/O devices mapped into the 
memory space, and you can get a whole slew of memory-related failure modes.

You can also get into some neat system failures. For example, some "read-only" memory 
actually allows writes to the execution space in certain modes, to allow quick 
reprogramming. If you have a system failure that allows writes at the wrong time, 
coupled with a failure that does a write where it shouldn't...




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00   ` John McCabe
@ 1996-10-01  0:00     ` Michael Dworetsky
  1996-10-04  0:00       ` Steve Bell
  1996-10-04  0:00     ` @@           robin
  1 sibling, 1 reply; 58+ messages in thread
From: Michael Dworetsky @ 1996-10-01  0:00 UTC (permalink / raw)



In article <843845039.4461.0@assen.demon.co.uk> john@assen.demon.co.uk (John McCabe) writes:
>rav@goanna.cs.rmit.edu.au (@@           robin) wrote:
>
><..snip..>
>
>Just a point for your information. From clari.tw.space:
>
>	 "An inquiry board investigating the explosion concluded in  
>July that the failure was caused by software design errors in a 
>guidance system."
>
>Note software DESIGN errors - not programming errors.
>

Indeed, the problems were in the specifications given to the programmers, 
not in the coding activity itself.  They wrote exactly what they were 
asked to write, as far as I could see from reading the report summary.

The problem was caused by using software developed for Ariane 4's flight
characteristics, which were different from those of Ariane 5.  When the
launch vehicle exceeded the boundary parameters of the Ariane-4 software,
it sent an error message and, as specified by the remit given to
programmers, a critical guidance system shut down in mid-flight. Ka-boom. 


-- 
Mike Dworetsky, Department of Physics  | Haiku: Nine men ogle gnats
& Astronomy, University College London |         all lit
Gower Street, London WC1E 6BT  UK      |   till last angel gone.
   email: mmd@star.ucl.ac.uk           |       Men in Ukiah.





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                     ` Ken Garlington
@ 1996-10-02  0:00                       ` Sandy McPherson
  0 siblings, 0 replies; 58+ messages in thread
From: Sandy McPherson @ 1996-10-02  0:00 UTC (permalink / raw)



Ken Garlington wrote:
> 
> Wayne L. Beavers wrote:
> >
> > Ken Garlington wrote:
> >
> > > That's actually a pretty common rule of thumb for safety-critical systems.
> > > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > > can cause a random change in the memory. So, it's not a perfect fix.
> >
> >   You're right, but the risk and probability of memory failures are pretty low, I would think.  I have never seen
> > or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current
> > technology is, but I can remember quite a while ago that at least one vendor was claiming that ALL double-bit
> > memory errors were fully detectable and recoverable, and ALL triple-bit errors were detectable but only some were
> > correctable.  But I also don't work on real-time systems; my experience is with commercial systems.
> >
> >   Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you
> > referring to ground-based systems that don't have similar constraints?
> 
> On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment
> you can get quite a few failure _sources_, including mechanical failures (stress
> fractures, solder loss due to excessive heat, etc.), electrical failures (EMI,
> lightning), and so forth. You don't have to take out the actual chip, of course: just
> as bad is a failure in the address or data lines connecting the memory to the CPU. Add
> a memory management unit to the mix, along with various I/O devices mapped into the
> memory space, and you can get a whole slew of memory-related failure modes.
> 
> You can also get into some neat system failures. For example, some "read-only" memory
> actually allows writes to the execution space in certain modes, to allow quick
> reprogramming. If you have a system failure that allows writes at the wrong time,
> coupled with a failure that does a write where it shouldn't...

It depends upon what you mean by a memory failure. I can imagine that
the chances of your memory being trashed completely are very, very low, and
in rad-hardened systems the chance of a single-event upset (SEU) is
also low, but it has to be guarded against. I have recently been working on
a system where the specified hardware has a parity bit for each octet of
memory, so SEUs which flip bit values in the memory can be detected.
This parity check is built into the system's micro-code.
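
As a toy illustration only (the real check lives in micro-code, as
above), even parity over an octet is just a bit count. A minimal Ada
sketch, names invented:

   package Parity_Check is

      type Octet is mod 2 ** 8;

      --  True when the octet has an even number of one bits; compare
      --  the result with the stored parity bit to flag a probable SEU.
      function Has_Even_Parity (B : Octet) return Boolean;

   end Parity_Check;

   package body Parity_Check is

      function Has_Even_Parity (B : Octet) return Boolean is
         V    : Octet   := B;
         Ones : Natural := 0;
      begin
         while V /= 0 loop
            Ones := Ones + Natural (V and 1);
            V    := V / 2;
         end loop;
         return Ones mod 2 = 0;
      end Has_Even_Parity;

   end Parity_Check;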

Similarly, the definition of what is and isn't read-only memory is
usually a feature of the processor and/or operating system being used. A
compiler cannot put code into read-only areas of memory unless the
processor, its micro-code, and/or the O/S are playing ball as well. If you
are unfortunate enough to be in this situation (are there any such systems
left?), then the only thing you can do is DIY, but the compiler can't
help you much, other than the "for ... use at" representation clause.

I once read an interesting definition of two types of bugs in
"Transaction Processing" by Gray & Reuter: Heisenbugs and Bohrbugs.

Identification of potential Heisenbugs, estimation of their probability of
occurrence, their impact on the system when they occur, and appropriate recovery
procedures are part of the risk analysis. An SEU is a classic Heisenbug,
which IMO is out of scope of compiler checks, because it can result in
a valid but incorrect value for a variable and is just as likely to
occur in the code section as the data section of your application. A
complete memory failure is of course beyond the scope of the compiler.

IMO an Ada compiler's job (when used properly) is to make sure that
syntactic Bohrbugs do not enter a system and all semantic Bohrbugs get
detected at runtime (as Bohrbugs, by definition, have a fixed location
and are certain to occur under given conditions - the Ariane 5 bug was
definitely a Bohrbug). The compiler cannot do anything about Heisenbugs
(because they only have a probability of occurrence). To handle
Heisenbugs generally you need to have a detection, reporting and
handling mechanism, built using the hardware's error detection, generally
accepted software practices (e.g. duplicate storage, process-pairs) and
an application-dependent exception handling mechanism. Ada provides the
means to trap the error condition once it has been reported, but it does
not implement exception handlers for you, other than the default "I'm
gone..."; additionally, if the underlying system does not provide the
means to detect a probable error, you have to implement the means of
detecting the problem and reporting it through the Ada exception
handling yourself.
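
A skeletal Ada sketch of that division of labour, with invented names:
the detection itself is system-specific (a parity flag, a duplicate-copy
compare), and Ada only carries the report and lets the application
choose the policy:

   with Interfaces;

   package Memory_Guard is

      Memory_Fault : exception;

      --  Hypothetical duplicate-storage check: the caller keeps a
      --  shadow copy and any mismatch is reported as a probable
      --  Heisenbug.
      procedure Check_Octet (Stored, Shadow : Interfaces.Unsigned_8);

   end Memory_Guard;

   package body Memory_Guard is

      procedure Check_Octet (Stored, Shadow : Interfaces.Unsigned_8) is
      begin
         if Stored /= Shadow then
            raise Memory_Fault;
         end if;
      end Check_Octet;

   end Memory_Guard;

   --  A caller then decides the policy instead of the default "I'm gone...":
   --
   --     begin
   --        Memory_Guard.Check_Octet (Live, Shadow);
   --     exception
   --        when Memory_Guard.Memory_Fault =>
   --           Restore_From_Shadow;   -- hypothetical recovery action
   --     end;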


-- 
Sandy McPherson	MBCS CEng.	tel: 	+31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Real-world education (was: Ariane 5 failure)
@ 1996-10-02  0:00 Simon Johnston
  0 siblings, 0 replies; 58+ messages in thread
From: Simon Johnston @ 1996-10-02  0:00 UTC (permalink / raw)



Michael Feldman wrote:
> In article <1996Sep29.193602.17369@enterprise.rdd.lmsc.lockheed.com>,
> Chris McKnight <cmcknigh@hercii.lasc.lockheed.com> wrote:
>
> [Rich Pattis' good stuff snipped.]
> >
> >  An excellent bit of teaching, IMHO. Glad to hear they're putting some
> >  more of the real-world issues in the classroom.
>
> Rich Pattis is indeed an experienced, even gifted teacher of
> introductory courses, with a very practical view of what they
> should be about.
>
> Without diminishing Rich Pattis' teaching experience or skill one bit,
> I am somewhat perplexed at the unfortunate stereotypical view you
> seem to have of CS profs. Yours is the second post today to have
> shown evidence of that stereotypical view; both you and the other
> poster have industry addresses.

I think some of it must come from experience. I have met some really
good, industry-focused profs who teach with a real "useful" view (my
first serious language was COBOL!). I have also met the "computer
science" guys, without whom we would never move forward. I have also met
some in between who really don't have that engineering focus or the
science.

> This is my 22nd year as a CS prof, I travel a lot in CS education
> circles, and - while we, like any population, tend to hit a bell
> curve - I've found that there are a lot more of us out here than
> you may think with Pattis-like commitment to bring the real world
> into our teaching.

Mike, I know from your books and postings here the level of engineering
you bring to your teaching; we are discussing (I believe) the balance in
teaching computing as an engineering discipline or as an ad-hoc
individual "art".

> Sure, there are theorists, as there are in any field, studying
> and teaching computing just because it's "beautiful", with little
> reference to real application, and there's a definite place in the
> teaching world for them.  Indeed, exposure to their "purity" of
> approach is healthy for undergraduates - there is no harm at all
> in taking on computing - sometimes - as purely an intellectual
> exercise.
>=20
> But it's a real reach from there to an assumption that most of us
> are in that theoretical category.

I don't think many of the people I work with have made this leap.

> I must say that there's a definite connection between an interest
> in Ada and an interest in real-world software; certainly most of
> the Ada teachers I've met are more like Pattis than you must think.
> Indeed, it's probably our commitment to that "engineering" view
> of computing that brings us to like and teach Ada.

Certainly it (or in my case COBOL) leads you into an application-oriented
way of thinking which makes you think about requirements, testing, etc.

 [snip]

Let me give you a little anecdote of my own.
I recently went for a job interview with a very large, well-known
software firm. Firstly they wanted me to write the code to traverse a
binary tree, for which they described the (C) data structures. Then I was
asked to write code to insert a node in a linked list (I had to ask what
the requirements for cases such as the list being empty or the node
already existing were). Finally I was asked to write the code to find
all the anagrams in a given string.
There were no business-type questions, no true analytical questions, the
things which as an engineer I have to do each day. The problems they set
have a single, simple answer which I don't write each day. I am sure
you can recite offhand the way to traverse a binary tree, but I have to
stop and think, because I wrote it ONCE, AGES AGO, and wrote it as a
GENERIC which I can REUSE. I know an understanding of these algorithms
is required so that I can decide which of my generics to use, but that
is why I invest in good books!
By the way, I happen to know someone who works for this firm who told me
that graduate programmers seem to do well in their interview process; he
once interviewed an engineer with 20 years' industry experience and a PhD
who got up and left halfway through the interview in disgust.
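
The write-it-once point, as a minimal Ada sketch (the names are mine,
not the interview's):

   generic
      type Element is private;
   package Binary_Trees is

      type Node;
      type Tree is access Node;
      type Node is record
         Item        : Element;
         Left, Right : Tree;
      end record;

      generic
         with procedure Visit (E : in Element);
      procedure In_Order (T : in Tree);   -- written once, reused forever

   end Binary_Trees;

   package body Binary_Trees is

      procedure In_Order (T : in Tree) is
      begin
         if T /= null then
            In_Order (T.Left);
            Visit (T.Item);
            In_Order (T.Right);
         end if;
      end In_Order;

   end Binary_Trees;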

with StandardDisclaimer; use StandardDisclaimer;
package Sig is
--,-------------------------------------------------------------------------.
--|Simon K. Johnston - Development Engineer (C++/Ada95) |ICL Retail Systems |
--|-----------------------------------------------------|3/4 Willoughby Road|
--|Internet : skj@acm.org                               |Bracknell          |
--|Telephone: +44 (0)1344 476320 Fax: +44 (0)1344 476302|Berkshire          |
--|Internal : 7261 6320   OP Mail: S.K.Johnston@BRA0801 |RG12 8TJ           |
--|WWW URL  : http://www.acm.org/~skj/                  |United Kingdom     |
--`-------------------------------------------------------------------------'
end Sig;




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00                 ` Ken Garlington
@ 1996-10-03  0:00                 ` Richard A. O'Keefe
  1 sibling, 0 replies; 58+ messages in thread
From: Richard A. O'Keefe @ 1996-10-03  0:00 UTC (permalink / raw)



"Wayne L. Beavers" <wayneb@beyond-software.com> writes:

>I have been reading this thread awhile and one topic that I have not
>seen mentioned is protecting the code area from damage.

I imagine that everyone else has taken this for granted.
UNIX compilers have been doing it for years, and so I believe have VMS ones.

>When I code in PL/I or any other reentrant language I always make sure
>that the executable code is executing from read-only storage.

(a) This is not something that the programmer should normally have to be
    concerned with, it just happens.
(b) It cannot always be done. Run-time code generation is a practical and
    important technique.  (Making a page read-only after new code has been
    written to it is a good idea, of course.)

>There is no way to put the data areas in read-only storage (obviously)

It may be obvious, but in important cases it isn't true.
UNIX (and I believe VMS) compilers have for years had the ability to put
_selected_ data in read-only storage.  And of course it is perfectly
feasible in many operating systems (certainly UNIX and VMS) to write data
into a page and then ask the operating system to make that page read-only.

>but I can't think of any reason to put the executable code in writeable
>storage.

Run-time binary translation.  Some approaches to relocation.  How many
reasons do you want?

>I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable
>code from one system to another.

In a language where the last revision of the standard was 1976?
You have my deepest sympathy.

-- 
Australian citizen since 14 August 1996.  *Now* I can vote the xxxs out!
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00   ` John McCabe
  1996-10-01  0:00     ` Michael Dworetsky
@ 1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
                         ` (2 more replies)
  1 sibling, 3 replies; 58+ messages in thread
From: @@           robin @ 1996-10-04  0:00 UTC (permalink / raw)



	john@assen.demon.co.uk (John McCabe) writes:

	>Just a point for your information. From clari.tw.space:

	>	 "An inquiry board investigating the explosion concluded in  
	>July that the failure was caused by software design errors in a 
	>guidance system."

	>Note software DESIGN errors - not programming errors.

	>Best Regards
	>John McCabe <john@assen.demon.co.uk>

---If you read the Report, you'll see that that's not the case.
This is what the report says:


    "* The internal SRI software exception was caused during execution of a
     data conversion from 64-bit floating point to 16-bit signed integer
     value. The floating point number which was converted had a value
     greater than what could be represented by a 16-bit signed integer.
     This resulted in an Operand Error. The data conversion instructions
     (in Ada code) were not protected from causing an Operand Error,
     although other conversions of comparable variables in the same place
     in the code were protected.

    "In the failure scenario, the primary technical causes are the Operand Error
    when converting the horizontal bias variable BH, and the lack of protection
    of this conversion which caused the SRI computer to stop."

---As you can see, it's clearly a programming error.  It's a failure
to check for overflow on converting a double-precision value to
a 16-bit integer.
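
The missing protection amounts to a few lines around the conversion.
A minimal sketch of the idiom, with invented names and types (this is
not the flight code):

   procedure Convert_Bias is

      type Float_64   is digits 15;                       -- 64-bit float
      type Integer_16 is range -2 ** 15 .. 2 ** 15 - 1;   -- 16-bit integer

      BH      : constant Float_64 := 40_000.0;  -- out of range on purpose
      BH_Word : Integer_16;

   begin
      if BH in Float_64 (Integer_16'First) .. Float_64 (Integer_16'Last)
      then
         BH_Word := Integer_16 (BH);
      else
         BH_Word := Integer_16'Last;  -- saturate, or raise: a design choice
      end if;
   end Convert_Bias;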




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
@ 1996-10-04  0:00       ` Michel OLAGNON
  1996-10-09  0:00         ` @@           robin
  1996-10-17  0:00       ` Ralf Tilch
  2 siblings, 1 reply; 58+ messages in thread
From: Michel OLAGNON @ 1996-10-04  0:00 UTC (permalink / raw)



In article <532k32$r4r@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
>	john@assen.demon.co.uk (John McCabe) writes:
>
>	>Just a point for your information. From clari.tw.space:
>
>	>	 "An inquiry board investigating the explosion concluded in  
>	>July that the failure was caused by software design errors in a 
>	>guidance system."
>
>	>Note software DESIGN errors - not programming errors.
>
>	>Best Regards
>	>John McCabe <john@assen.demon.co.uk>
>
>---If you read the Report, you'll see that that's not the case.
>This is what the report says:
>
>    "* The internal SRI software exception was caused during execution of a
>     data conversion from 64-bit floating point to 16-bit signed integer
>     value. The floating point number which was converted had a value
>     greater than what could be represented by a 16-bit signed integer.
>     This resulted in an Operand Error. The data conversion instructions
>     (in Ada code) were not protected from causing an Operand Error,
>     although other conversions of comparable variables in the same place
>     in the code were protected.
>
>    "In the failure scenario, the primary technical causes are the Operand Error
>    when converting the horizontal bias variable BH, and the lack of protection
>    of this conversion which caused the SRI computer to stop."
>
>---As you can see, it's clearly a programming error.  It's a failure
>to check for overflow on converting a double precision value to
>a 16-bit integer.

But if you read a bit further on, it is stated that

    The reason why three conversions, including the horizontal bias variable one,
    were not protected, is that it was decided that they were physically bounded
    or had a wide safety margin (...) The decision was a joint one of the project
    partners at various contractual levels.

Deciding at various contractual levels is not what one usually means by
``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
can give any word any meaning.
And it may well be that the action taken in the case of a protected conversion,
and an exception, would also have been to stop the SRI computer, because such a high
horizontal bias would have meant that it was broken....

Michel

-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00     ` Michael Dworetsky
@ 1996-10-04  0:00       ` Steve Bell
  1996-10-07  0:00         ` Ken Garlington
  1996-10-09  0:00         ` @@           robin
  0 siblings, 2 replies; 58+ messages in thread
From: Steve Bell @ 1996-10-04  0:00 UTC (permalink / raw)



Michael Dworetsky wrote:
> 
> >Just a point for your information. From clari.tw.space:
> >
> >        "An inquiry board investigating the explosion concluded in
> >July that the failure was caused by software design errors in a
> >guidance system."
> >
> >Note software DESIGN errors - not programming errors.
> >
> 
> Indeed, the problems were in the specifications given to the programmers,
> not in the coding activity itself.  They wrote exactly what they were
> asked to write, as far as I could see from reading the report summary.
> 
> The problem was caused by using software developed for Ariane 4's flight
> characteristics, which were different from those of Ariane 5.  When the
> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
> it sent an error message and, as specified by the remit given to
> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
> 

I work for an aerospace company, and we received a fairly detailed accounting of what 
went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch 
pad, run a guidance program that updates their position and velocity in reference to 
a coordinate frame whose origin is at the center of the earth (usually called an 
Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4 
hours before launch and is allowed to run all the way until liftoff, so that the 
rocket will know where it's at and how fast it's going at liftoff. Although called 
"ground software," (because it runs while the rocket is on the ground), it resides 
inside the rocket's guidance computer(s), and for the Titan family of launch vehicles, 
the code is exited at t=0 (liftoff). This code is designed knowing that the 
rocket is rotating on the surface of the earth, and the algorithms expect only very 
mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff). 
Well, the French do things a little differently (but probably now they don't). The 
Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs 
past liftoff. They do (did) this in case there are any unanticipated holds in the 
countdown right close to liftoff. In this way, this position and velocity updating 
code would *not* have to be reset if they could get off the ground within just a few 
seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad, 
because at about 30 secs, it was pulling some accelerations that caused floating point 
overflows in the still functioning ground software. The actual flight software (which 
was also running, naturally) was computing the positions and velocities that were 
being used to actually fly the rocket, and it was doing just fine - no overflow errors 
there because it was designed to expect high accelerations. There are two flight 
computers on the Ariane 5 - a primary and a backup - and each was designed to shut 
down if an error such as a floating point overflow occurred, thinking that the other 
one would take over. Both computers were running the ground software, and both 
experienced the floating point errors. Actually, the primary went belly-up first, and 
then the backup within a fraction of a second later. With no functioning guidance 
computer on board, well, ka-boom as you say.

Apparently the Ariane 4 gets off the ground with smaller accelerations than the 5, and 
this never happened with a 4. You might take note that this would never happen with a 
Titan because we don't execute this ground software after liftoff. Even if we did, we 
would have caught the floating point overflows way before launch because we run all 
code in what's called "Real-Time Simulations" where actual flight hardware and software 
are subjected to any and all known physical conditions. This was another finding of 
the investigation board - apparently the French don't do enough of this type of 
testing because it's real expensive. Oh well, they probably do now!

-- 
Clear skies,
Steve Bell
sb635@delphi.com
http://people.delphi.com/sb635 - Astrophoto page




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
@ 1996-10-04  0:00       ` Joseph C Williams
  1996-10-06  0:00         ` Wayne Hayes
  1996-10-04  0:00       ` Michel OLAGNON
  1996-10-17  0:00       ` Ralf Tilch
  2 siblings, 1 reply; 58+ messages in thread
From: Joseph C Williams @ 1996-10-04  0:00 UTC (permalink / raw)



Why didn't they run the code against an Ariane 5 simulator to
reverify the Ariane 4 software that was reused?  A good real-time
engineering simulation would have caught the problem.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Lawrence Foard
@ 1996-10-04  0:00             ` @@           robin
  0 siblings, 0 replies; 58+ messages in thread
From: @@           robin @ 1996-10-04  0:00 UTC (permalink / raw)



	Lawrence Foard <entropy@vwis.com> writes:

	>Ronald Kunne wrote:

	>> Actually, this was the case here: the code was taken from an Ariane 4
	>> code where it was physically impossible that the index would go out
	>> of range: a test would have been a waste of time.

---A test for overflow in a system that aborts if unexpected overflow
occurs is never a waste of time.

   Recall Murphy's Law: "If anything can go wrong, it will."
Then there's Robert's Law: "Even if it can't go wrong, it will."

	>> Unfortunately this was no longer the case in the Ariane 5.

	>Actually it would still present a danger on Ariane 4. If the sensor
	>which apparently was no longer needed during flight became defective,
	>then you could get a value out of range.

---Good point Lawrence.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Joseph C Williams
@ 1996-10-06  0:00         ` Wayne Hayes
  0 siblings, 0 replies; 58+ messages in thread
From: Wayne Hayes @ 1996-10-06  0:00 UTC (permalink / raw)



In article <32551A66.41C6@gsde.hso.link.com>,
Joseph C Williams  <u6p35@gsde.hso.link.com> wrote:
>Why didn't they run the code against an Ariane 5 simulator to
>reverify the Ariane 4 software that was reused?

Money.  (The more cynical among us may say this translates to "stupidity".)

-- 
"Unix is simple and coherent, but it takes || Wayne Hayes, wayne@cs.utoronto.ca
a genius (or at any rate, a programmer) to || Astrophysics & Computer Science
appreciate its simplicity." -Dennis Ritchie|| http://www.cs.utoronto.ca/~wayne




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Steve Bell
@ 1996-10-07  0:00         ` Ken Garlington
  1996-10-09  0:00         ` @@           robin
  1 sibling, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-10-07  0:00 UTC (permalink / raw)



Steve Bell wrote:

> Well, the French do things a little differently (but probably now they don't). The
> Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
> past liftoff. They do (did) this in case there are any unanticipated holds in the
> countdown right close to liftoff. In this way, this position and velocity updating
> code would *not* have to be reset if they could get off the ground within just a few
> seconds of nominal.

But why 40 seconds? Why not 1 second (or one millisecond, for that matter)?

> You might take note that this would never happen with a
> Titan because we don't execute this ground software after liftoff. Even if we did, we
> would have caught the floating point overflows way before launch because we run all
> code in what's called "Real-Time Simulations" where actual flight harware and software
> are subjected to any and all known physical conditions. This was another finding of
> the investigation board - apparently the French don't do enough of this type of
> testing because it's real expensive.

Going way back into my history, I believe this is also true for Atlas.

> --
> Clear skies,
> Steve Bell
> sb635@delphi.com
> http://people.delphi.com/sb635 - Astrophoto page

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Michel OLAGNON
@ 1996-10-09  0:00         ` @@           robin
  0 siblings, 0 replies; 58+ messages in thread
From: @@           robin @ 1996-10-09  0:00 UTC (permalink / raw)



	molagnon@ifremer.fr (Michel OLAGNON) writes:

	>In article <532k32$r4r@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
	>>	john@assen.demon.co.uk (John McCabe) writes:
	>>
	>>	>Just a point for your information. From clari.tw.space:
	>>
	>>	>	 "An inquiry board investigating the explosion concluded in  
	>>	>July that the failure was caused by software design errors in a 
	>>	>guidance system."
	>>
	>>	>Note software DESIGN errors - not programming errors.
	>>
	>>	>Best Regards
	>>	>John McCabe <john@assen.demon.co.uk>
	>>
	>>---If you read the Report, you'll see that that's not the case.
	>>This is what the report says:
	>>
	>>    "* The internal SRI software exception was caused during execution of a
	>>     data conversion from 64-bit floating point to 16-bit signed integer
	>>     value. The floating point number which was converted had a value
	>>     greater than what could be represented by a 16-bit signed integer.
	>>     This resulted in an Operand Error. The data conversion instructions
	>>     (in Ada code) were not protected from causing an Operand Error,
	>>     although other conversions of comparable variables in the same place
	>>     in the code were protected.
	>>
	>>    "In the failure scenario, the primary technical causes are the Operand Error
	>>    when converting the horizontal bias variable BH, and the lack of protection
	>>    of this conversion which caused the SRI computer to stop."
	>>
	>>---As you can see, it's clearly a programming error.  It's a failure
	>>to check for overflow on converting a double precision value to
	>>a 16-bit integer.

	>But if you read a bit further on, it is stated that

	>    The reason why three conversions, including the horizontal bias variable one,
	>    were not protected, is that it was decided that they were physically bounded
	>    or had a wide safety margin (...) The decision was a joint one of the project
	>    partners at various contractual levels.

	>Deciding at various contractual levels is not what one usually means by
	>``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
	>can give any word any meaning.
	>And it may well be that the action taken in the case of a protected conversion,
	>and an exception, would also have been to stop the SRI computer, because such a high
	>horizontal bias would have meant that it was broken....

	>| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|

But if you read further on ....

   "However, three of the variables were left unprotected. No reference to
    justification of this decision was found directly in the source code. Given
    the large amount of documentation associated with any industrial
    application, the assumption, although agreed, was essentially obscured,
    though not deliberately, from any external review."

.... you'll see that there was no documentation in the code to
explain why these particular 3 (dangerous) conversions were
left unprotected.  There is the implication that one or more
of them might have been overlooked...  Don't place
too much reliance on the conclusion of the report when
the detail is right there in the body of the report.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Steve Bell
  1996-10-07  0:00         ` Ken Garlington
@ 1996-10-09  0:00         ` @@           robin
  1996-10-09  0:00           ` Steve O'Neill
  1 sibling, 1 reply; 58+ messages in thread
From: @@           robin @ 1996-10-09  0:00 UTC (permalink / raw)



Steve Bell <sb635@delphi.com> writes:

	>Michael Dworetsky wrote:
	>> 
	>> >Just a point for your information. From clari.tw.space:
	>> >
	>> >        "An inquiry board investigating the explosion concluded in
	>> >July that the failure was caused by software design errors in a
	>> >guidance system."
	>> >
	>> >Note software DESIGN errors - not programming errors.
	>> >
	>> 
	>> Indeed, the problems were in the specifications given to the programmers,
	>> not in the coding activity itself.  They wrote exactly what they were
	>> asked to write, as far as I could see from reading the report summary.
	>> 
	>> The problem was caused by using software developed for Ariane 4's flight
	>> characteristics, which were different from those of Ariane 5.  When the
	>> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
	>> it sent an error message and, as specified by the remit given to
	>> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
	>> 

	>I work for an aerospace company, and we received a fairly detailed accounting of what 
	>went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch 
	>pad, run a guidance program that updates their position and velocity in reference to 
	>a coordinate frame whose origin is at the center of the earth (usually called an 
	>Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4 
	>hours before launch and is allowed to run all the way until liftoff, so that the 
	>rocket will know where it's at and how fast it's going at liftoff. Although called 
	>"ground software," (because it runs while the rocket is on the ground), it resides 
	>inside the rocket's guidance computer(s), and for the Titan family of launch vehicles, 
	>the code is exited at t=0 (liftoff). This code is designed knowing that the 
	>rocket is rotating on the surface of the earth, and the algorithms expect only very 
	>mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff). 
	>Well, the French do things a little differently (but probably now they don't). The 
	>Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs 
	>past liftoff. They do (did) this in case there are any unanticipated holds in the 
	>countdown right close to liftoff. In this way, this position and velocity updating 
	>code would *not* have to be reset if they could get off the ground within just a few 
	>seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad, 
	>because at about 30 secs, it was pulling some accelerations that caused floating point 
	>overflows

---Definitely not.  No floating-point overflow occurred.  In
Ariane 5, the overflow occurred on converting a double-precision
(some 56 bits?) floating-point value to a 16-bit integer (15
significant bits).

   That's why it was so important to have a check that the
conversion couldn't overflow!


	>in the still functioning ground software. The actual flight software (which 
	>was also running, naturally) was computing the positions and velocities that were 
	>being used to actually fly the rocket, and it was doing just fine - no overflow errors 
	>there because it was designed to expect high accelerations. There are two flight 
	>computers on the Ariane 5 - a primary and a backup - and each was designed to shut 
	>down if an error such as a floating point overflow occurred,

---Again, not at all.  It was designed to shut down if any interrupt
occurred.  It wasn't intended to be shut down for such a routine thing
as a conversion of floating-point to integer.

	>thinking that the other 
	>one would take over. Both computers were running the ground software, and both 
	>experienced the floating point errors.


---No, the backup SRI experienced the programming error (UNCHECKED
CONVERSION from floating-point to integer) first, and shut itself
down, then the active SRI computer experienced the same programming
error, then it shut itself down.

	>Actually, the primary went belly-up first, and 
	>then the backup within a fraction of a second later. With no functioning guidance 
	>computer on board, well, ka-boom as you say.




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-09  0:00         ` @@           robin
@ 1996-10-09  0:00           ` Steve O'Neill
  1996-10-12  0:00             ` Alan Brain
  0 siblings, 1 reply; 58+ messages in thread
From: Steve O'Neill @ 1996-10-09  0:00 UTC (permalink / raw)



@@ robin wrote:
> ---Definitely not.  No floating-point overflow occurred.  In
> Ariane 5, the overflow occurred on converting a double-precision
> (some 56 bits?) floating-point to a 16-bit integer (15
> significant bits).
> 
>    That's why it was so important to have a check that the
> conversion couldn't overflow!

Agreed.  Yes, the basic reason for the destruction of a billion dollar 
vehicle was for want of a couple of lines of code.  But it reflects a 
systemic problem much more damaging than what language was used.

I would have expected that in a mission/safety critical application 
the proper checks would have been implemented, no matter what. And in a 
'belts-and-suspenders' mode I would also expect an exception handler to 
take care of unforeseen possibilities at the lowest possible level and 
raise things to a higher level only when absolutely necessary.  Had these 
precautions been taken there would probably be lots of entries in an 
error log but the satellites would now be orbiting.  

As outsiders we can only second guess as to why this approach was not 
taken but the review board implies that 1) the SRI software developers 
had an 80% max utilization requirement and 2) careful consideration 
(including faulty assumptions) was used in deciding what to protect and 
not protect.

>It was designed to shut down if any interrupt occurred.
                                     ^^^^^^^^^ exception, actually
>It wasn't intended to be shut down for such a routine thing as a
>conversion of floating-point to integer.

This was based on the (faulty) system-wide assumption that any exception 
was the result of a random hardware failure.  This is related to the 
other faulty assumption that "software should be considered correct until 
it is proven to be at fault".  But that's what the specification said.

> ---No, the backup SRI experienced the programming error (UNCHECKED
> CONVERSION from floating-point to integer) first, and shut itself
> down, then the active SRI computer experienced the same programming
> error, then it shut itself down.

Yes, according to the report the backup died first (by 0.05 seconds).  
Probably not as a result of an unchecked_conversion, though - the source 
and target are of different sizes, which would not be allowed.  Most 
likely just a conversion of a float to a sixteen-bit integer.  This 
would have raised a Constraint_Error (or Operand_Error in this 
environment).  This error could have been handled within the context of 
this procedure (and the mission continued) but obviously was not.  
Instead it appears to have been propagated to a global exception handler 
which performed the specified actions admirably.  Unfortunately these 
included committing suicide and, in doing so, dooming the mission.
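
In Ada terms, the local "belts-and-suspenders" handler being described
can be as small as this sketch (names and the fallback value are
invented; the real choice of fallback belongs to the safety analysis):

   procedure Update_Bias is

      type Float_64   is digits 15;
      type Integer_16 is range -2 ** 15 .. 2 ** 15 - 1;

      BH      : constant Float_64 := 40_000.0; -- deliberately out of range
      BH_Word : Integer_16;

   begin
      begin
         BH_Word := Integer_16 (BH);   -- raises Constraint_Error here
      exception
         when Constraint_Error =>
            BH_Word := Integer_16'Last;  -- log it, saturate, keep flying
      end;
   end Update_Bias;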

-- 
Steve O'Neill                      | "No,no,no, don't tug on that!
Sanders, A Lockheed Martin Company |  You never know what it might
smoneill@sanders.lockheed.com      |  be attached to." 
(603) 885-8774  fax: (603) 885-4071|    Buckaroo Banzai




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-09  0:00           ` Steve O'Neill
@ 1996-10-12  0:00             ` Alan Brain
  0 siblings, 0 replies; 58+ messages in thread
From: Alan Brain @ 1996-10-12  0:00 UTC (permalink / raw)



Steve O'Neill wrote:

> I would have expected that in a mission/safety critical application
> the proper checks would have been implemented, no matter what. And in a
> 'belts-and-suspenders' mode I would also expect an exception handler to
> take care of unforeseen possibilities at the lowest possible level and
> raise things to a higher level only when absolutely necessary.  Had these
> precautions been taken there would probably be lots of entries in an
> error log but the satellites would now be orbiting.

Concur completely. This should be Standard Operating Procedure, a matter
of habit. Frankly, it's just good engineering practice. But it is honoured
more in the breach than in the observance, it seems, because....
 
> As outsiders we can only second guess as to why this approach was not
> taken but the review board implies that 1) the SRI software developers
> had an 80% max utilization requirement and 2) careful consideration
> (including faulty assumptions) was used in deciding what to protect and
> not protect.

... as some very reputable people, working for very reputable firms, have
tried to pound into my thick skull, they are used to working with 15%, no
more, tolerances. And with diamond-grade Hard Real Time slices, where any
over-run, no matter how slight, means disaster. In this case, Formal Proof
and strict attention to the number of CPU cycles in all possible paths
seems the only way to go.
But this leaves you so open to error in all but the simplest, most trivial
tasks (just the race analysis would be nightmarish) that these slices had
better be a very small part of the task, or the task itself must be very
simple indeed. Either way, not having much bearing on the vast majority of
problems I've encountered.
If the tasks are not simple... then can I please ask the firms concerned to
tell me which aircraft their software is on, so I can take appropriate
action?

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
  1996-10-04  0:00       ` Michel OLAGNON
@ 1996-10-17  0:00       ` Ralf Tilch
  1996-10-17  0:00         ` Ravi Sundaram
  2 siblings, 1 reply; 58+ messages in thread
From: Ralf Tilch @ 1996-10-17  0:00 UTC (permalink / raw)




Hello,

I have followed the discussion of the ARIANE 5 failure.
I didn't read all the mails, and I am quite astonished
how far, and in how much detail, it can be discussed.
Like:
which programming language would have been the best,
.....

It's good to know what happened.
I think more important is this:
you build something new (very complex).
You invest some billions to develop it.
You build it (an ARIANE 5, carrying several satellites).
Its price is several hundred million,
and yet you don't check it as much as possible,
make a 'very complete check',
especially of the software.

The reason that the software wasn't checked:
It was too 'expensive'?!?!.

They forgot Murphy's law, which always 'works'.


I think you can't design a new car without
testing it completely.

We test 95% of the construction, and six months after
the new car goes on sale, a wheel falls off at 160 km/h.
OK, there was a small problem in the construction software,
some wrong values due to some over- or underflows or
whatever.

The result: the company will probably have to pay quite a
lot, and probably have to close!

--------------------------------------------------------
-DON'T TRUST YOURSELF, TRUST MURPHY'S LAW !!!! 

"If anything can go wrong, it will."
--------------------------------------------------------
With this, have fun and continue the discussion about
conversion from 64-bit to 16-bit values, etc.

RT


________________|_______________________________________|_                
                | E-mail : R.Tilch@gmd.de               |
                | Tel.   : (+49) (0)2241/14-23.69       |
________________|_______________________________________|_
                |                                       |




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-17  0:00       ` Ralf Tilch
@ 1996-10-17  0:00         ` Ravi Sundaram
  1996-10-22  0:00           ` shmuel
  0 siblings, 1 reply; 58+ messages in thread
From: Ravi Sundaram @ 1996-10-17  0:00 UTC (permalink / raw)



Ralf Tilch wrote:
> The reason that the software wasn't checked:
> It was too 'expensive'?!?!.

	Yeah, isn't hindsight a wonderful thing?  
	They, whoever were in charge of these decisions,
	knew too that testing is important.  But it is impossible
	to test every subcomponent under every possible
	condition.  There is simply not enough money or time
	available to do that.

	Take the space shuttle, for example.  The total computing
	power available on board is probably as much as is used
	in a Nintendo Game Boy.  The design was frozen in the 1970s.
	Upgrading the computers and software would be so expensive
	to test and prove that they approach it with much trepidation.

	Richard Feynman was examining the practices of NASA and
	found that the workers who assembled some large bulkheads
	had to count bolts from two reference points.  He thought
	providing four reference points would simplify the job.
	NASA rejected the proposal because it would involve
	too many changes to the documentation, procedures and
	testing.  ("Surely You're Joking, Mr. Feynman!" - I? or II?)

	So praise them for conducting a no nonsense investigation
	and owning up to the mistakes.  Learn to live with
	failed space shots.  They will become as reliable as
	air travel once we have launched about 10 million rockets.

-- 
Ravi Sundaram.  
10/17/96
PS:	I am out of here. Going on vacation. Wont read followups
	for a month.
                                (Opinions are mine, not Ansoft's.)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-17  0:00         ` Ravi Sundaram
@ 1996-10-22  0:00           ` shmuel
  1996-10-22  0:00             ` Jim Carr
  0 siblings, 1 reply; 58+ messages in thread
From: shmuel @ 1996-10-22  0:00 UTC (permalink / raw)



In <3266741B.4DAA@ansoft.com>, Ravi Sundaram <ravi@ansoft.com> writes:
>Ralf Tilch wrote:
>> The reason that the software wasn't checked:
>> It was too 'expensive'?!?!.
>
>	Yeah, isn't hindsight a wonderful thing?  
>	They, whoever were in charge of these decisions,
>	knew too that testing is important.  But it is impossible
>	to test every subcomponent under every possible
>	condition.  There is simply not enough money or time
>	available to do that.

Why do you assume that it was hindsight? They violated fundamental
software engineering principles, and anyone who has been in this business
for long should have expected chickens coming home to roost, even if they
couldn't predict what would go wrong first.

>	Richard Feynman was examining the practices of NASA and
>	found that the workers who assembled some large bulkheads
>	had to count bolts from two reference points.  He thought
>	providing four reference points would simplify the job.
>	NASA rejected the proposal because it would involve
>	too many changes to the documentation, procedures and
>	testing.  ("Surely You're Joking, Mr. Feynman!" - I? or II?)
>
>	So praise them for conducting a no nonsense investigation
>	and owning up to the mistakes.  Learn to live with
>	failed space shots. They will become as reliable as
>	air travel once we have launched about 10 million rockets.

I hope that you're talking about Ariane and not NASA Challenger; Feynman's
account of the behavior of most of the Rogers Commission, in "What Do
You Care What Other People Think?", sounds more like a failed coverup than
like "owning up to their mistakes", and Feynman had to threaten to air a
dissenting opinion on television before they agreed to publish it in their
report.

	Shmuel (Seymour J.) Metz
	Atid/2





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-22  0:00           ` shmuel
@ 1996-10-22  0:00             ` Jim Carr
  1996-10-24  0:00               ` hayim
  0 siblings, 1 reply; 58+ messages in thread
From: Jim Carr @ 1996-10-22  0:00 UTC (permalink / raw)



shmuel.metz@os2bbs.com writes:
>
>I hope that you're talking about Ariane and not NASA Challenger; Feynman's
>account of the behavior of most of the Rogers Commission, in "What Do
>You Care What Other People Think?", sounds more like a failed coverup than
>like "owning up to their mistakes", ...

The coverup was not entirely unsuccessful.  Feynman did manage to break
through and get his dissenting remarks on NASA's reliability estimates
into the report (as well as into Physics Today), but the coverup did
succeed in keeping most people ignorant of the fact that the astronauts
did not die until impact with the ocean, despite a Miami Herald story
pointing that out to its mostly regional audience.

Did you ever see a picture of the crew compartment? 

-- 
 James A. Carr   <jac@scri.fsu.edu>     |  Raw data, like raw sewage, needs 
    http://www.scri.fsu.edu/~jac        |  some processing before it can be
 Supercomputer Computations Res. Inst.  |  spread around.  The opposite is
 Florida State, Tallahassee FL 32306    |  true of theories.  -- JAC




^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-22  0:00             ` Jim Carr
@ 1996-10-24  0:00               ` hayim
  1996-10-25  0:00                 ` Michel OLAGNON
  1996-10-25  0:00                 ` Ken Garlington
  0 siblings, 2 replies; 58+ messages in thread
From: hayim @ 1996-10-24  0:00 UTC (permalink / raw)



Unfortunately, I missed the original article describing the Ariane failure.
If someone could please either point me in the right direction as to where
I can get a copy, or even send it to me, I would greatly appreciate it.

Thanks very much,

Hayim Hendeles

E-mail: hayim@platsol.com





^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-24  0:00               ` hayim
@ 1996-10-25  0:00                 ` Michel OLAGNON
  1996-10-25  0:00                 ` Ken Garlington
  1 sibling, 0 replies; 58+ messages in thread
From: Michel OLAGNON @ 1996-10-25  0:00 UTC (permalink / raw)



In article <54oht1$ln1@orchard.la.platsol.com>, <hayim> writes:
>Unfortunately, I missed the original article describing the Ariane failure.
>If someone could please either point me in the right direction as to where
>I can get a copy, or even send it to me, I would greatly appreciate it.
>

It may be worth repeating the source address for the full report, since
many comments seem to be based only on a presentation summary:

http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

Michel

-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: Ariane 5 failure
  1996-10-24  0:00               ` hayim
  1996-10-25  0:00                 ` Michel OLAGNON
@ 1996-10-25  0:00                 ` Ken Garlington
  1 sibling, 0 replies; 58+ messages in thread
From: Ken Garlington @ 1996-10-25  0:00 UTC (permalink / raw)



hayim wrote:
> 
> Unfortunately, I missed the original article describing the Ariane failure.
> If someone could please either point me in the right direction as to where
> I can get a copy, or even send it to me, I would greatly appreciate it.
> 
> Thanks very much,
> 
> Hayim Hendeles
> 
> E-mail: hayim@platsol.com

See:
  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~1996-10-25  0:00 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <agrapsDy4oJH.29G@netcom.com>
1996-09-25  0:00 ` Ariane 5 failure @@           robin
1996-09-25  0:00   ` Michel OLAGNON
1996-09-25  0:00     ` Chris Morgan
1996-09-25  0:00     ` Byron Kauffman
1996-09-25  0:00       ` A. Grant
1996-09-25  0:00         ` Ken Garlington
1996-09-26  0:00         ` Byron Kauffman
1996-09-27  0:00           ` A. Grant
1996-09-26  0:00         ` Sandy McPherson
1996-09-25  0:00   ` Bob Kitzberger
1996-09-26  0:00     ` Ronald Kunne
1996-09-26  0:00       ` Matthew Heaney
1996-09-27  0:00         ` Wayne Hayes
1996-09-27  0:00           ` Richard Pattis
1996-09-29  0:00             ` Chris McKnight
1996-09-29  0:00               ` Real-world education (was: Ariane 5 failure) Michael Feldman
1996-09-29  0:00             ` Ariane 5 failure Alan Brain
1996-09-29  0:00             ` Dann Corbit
1996-10-01  0:00             ` Ken Garlington
1996-09-27  0:00         ` Ronald Kunne
1996-09-27  0:00           ` Lawrence Foard
1996-10-04  0:00             ` @@           robin
1996-09-28  0:00           ` Ken Garlington
1996-09-28  0:00             ` Ken Garlington
1996-09-29  0:00           ` Alan Brain
1996-09-29  0:00             ` Robert A Duff
1996-09-30  0:00               ` Wayne L. Beavers
1996-10-01  0:00                 ` Ken Garlington
1996-10-01  0:00                   ` Wayne L. Beavers
1996-10-01  0:00                     ` Ken Garlington
1996-10-02  0:00                       ` Sandy McPherson
1996-10-03  0:00                 ` Richard A. O'Keefe
1996-10-01  0:00             ` Ken Garlington
1996-09-28  0:00         ` Ken Garlington
1996-09-27  0:00       ` Alan Brain
1996-09-28  0:00         ` Ken Garlington
1996-09-27  0:00       ` Ken Garlington
1996-09-29  0:00       ` Louis K. Scheffer
1996-09-27  0:00   ` John McCabe
1996-10-01  0:00     ` Michael Dworetsky
1996-10-04  0:00       ` Steve Bell
1996-10-07  0:00         ` Ken Garlington
1996-10-09  0:00         ` @@           robin
1996-10-09  0:00           ` Steve O'Neill
1996-10-12  0:00             ` Alan Brain
1996-10-04  0:00     ` @@           robin
1996-10-04  0:00       ` Joseph C Williams
1996-10-06  0:00         ` Wayne Hayes
1996-10-04  0:00       ` Michel OLAGNON
1996-10-09  0:00         ` @@           robin
1996-10-17  0:00       ` Ralf Tilch
1996-10-17  0:00         ` Ravi Sundaram
1996-10-22  0:00           ` shmuel
1996-10-22  0:00             ` Jim Carr
1996-10-24  0:00               ` hayim
1996-10-25  0:00                 ` Michel OLAGNON
1996-10-25  0:00                 ` Ken Garlington
1996-10-02  0:00 Real-world education (was: Ariane 5 failure) Simon Johnston

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox