comp.lang.ada
* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
@ 1996-09-25  0:00         ` Ken Garlington
  1996-09-26  0:00         ` Byron Kauffman
  1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-09-25  0:00 UTC (permalink / raw)



A. Grant wrote:
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

When it comes to building embedded safety-critical systems, trust me:
He's a student!

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
       [not found] <agrapsDy4oJH.29G@netcom.com>
@ 1996-09-25  0:00 ` @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
                     ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: @@           robin @ 1996-09-25  0:00 UTC (permalink / raw)



	agraps@netcom.com (Amara Graps) writes:

	>I read the following message from my co-workers that I thought was
	>interesting. So I'm forwarding it to here.

	>(begin quote)
	>Ariane 5 failure was attributed to a faulty DOUBLE -> INT conversion
	>(as the proximate cause) in some ADA code in the inertial guidance
	>system.  Diagnostic error messages from the (faulty) inertial guidance
	>system software were interpreted by the steering system as valid data.

	>English text of the inquiry board's findings is at
	>  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
	>(end quote)

	>Amara Graps                         email: agraps@netcom.com
	>Computational Physics               vita:  finger agraps@best.com

There's a little more to it . . .

The unchecked data conversion in the Ada program resulted
in the shutdown of the computer. The backup computer had
already shut down a whisker of a second before.  Consequently,
the on-board computer was unable to switch to the backup, and
used the error codes from the shut-down computer as
flight data.
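
A minimal Ada sketch of the kind of conversion at issue (the names
and types are hypothetical, not the actual SRI code; it assumes the
implementation provides a 64-bit Long_Float):

	procedure Conversion_Demo is
	   type Bias_16 is range -2**15 .. 2**15 - 1;  -- 16-bit signed
	   Horizontal_Bias : Long_Float := 40_000.0;   -- exceeds 32_767
	   B : Bias_16;
	begin
	   B := Bias_16 (Horizontal_Bias);  -- Constraint_Error when checked;
	                                    -- silently wrong when suppressed
	end Conversion_Demo;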

This is not the first time that such a programming error
(integer out of range) has occurred.

In 1981, the manned STS-2 was preparing to take off, but because
some fuel was accidentally spilt and some tiles accidentally
dislodged, takeoff was delayed by a month.

During that time, the astronauts decided to get in some
more practice with the simulator.

During a simulated descent, the 4 computing systems (the main
and the 3 backups) got stuck in a loop, with the complete
loss of control.

The cause?  An integer out of range -- the same kind of
problem as with Ariane 5.

In the STS-2 case, the precise cause was a computed GOTO
with a bad index (similar to a CASE statement without
an OTHERWISE clause).
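
In Ada terms the analogy can be sketched as follows (the phases and
their handling are hypothetical).  Ada requires the alternatives of a
case statement to cover every possible value of the index, so the
"when others" arm below plays the role of the OTHERWISE clause the
STS-2 code lacked:

	procedure Dispatch (Phase : in Integer) is
	begin
	   case Phase is
	      when 1 => null;   -- handle phase 1, say
	      when 2 => null;   -- handle phase 2
	      when 3 => null;   -- handle phase 3
	      when others =>
	         raise Program_Error;  -- bad index caught, not a wild jump
	   end case;
	end Dispatch;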

In both cases, the programming error could have been detected
with a simple test, but in both cases, no test was included.

One would have thought that, having had at least one failure
for integer out-of-range, the implementors of the software
for Ariane 5 would have been extra careful in ensuring that
all data conversions were within range -- since any kind
of interrupt would result in destruction of the spacecraft.

There's a case for a review of the programming language used.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
@ 1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00     ` Chris Morgan
  1996-09-25  0:00     ` Byron Kauffman
  1996-09-25  0:00   ` Bob Kitzberger
  1996-09-27  0:00   ` John McCabe
  2 siblings, 2 replies; 105+ messages in thread
From: Michel OLAGNON @ 1996-09-25  0:00 UTC (permalink / raw)



In article <52a572$9kk@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
>[reports of Ariane and STS-2 bugs deleted]
>
>
>In both cases, the programming error could have been detected
>with a simple test, but in both cases, no test was included.
>
>One would have thought that, having had at least one failure
>for integer out-of-range, the implementors of the software
>for Ariane 5 would have been extra careful in ensuring that
>all data conversions were within range -- since any kind
>of interrupt would result in destruction of the spacecraft.
>

Maybe the main reason for the lack of testing and care was
that the conversion exception could only occur after lift off,
and that that particular piece of program was of no use after
lift off. It was only kept running for 50 s in order to
speed up countdown restart in case of an interruption between
H0-9 and H0-5. 

Conclusion: Never compute values that are of no use when you can
avoid it !

>There's a case for a review of the programming language used.


Michel
-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00     ` Chris Morgan
@ 1996-09-25  0:00     ` Byron Kauffman
  1996-09-25  0:00       ` A. Grant
  1 sibling, 1 reply; 105+ messages in thread
From: Byron Kauffman @ 1996-09-25  0:00 UTC (permalink / raw)



Michel OLAGNON wrote:
> 
> Maybe the main reason for the lack of testing and care was
> that the conversion exception could only occur after lift off,
> and that that particular piece of program was of no use after
> lift off. It was only kept running for 50 s in order to
> speed up countdown restart in case of an interruption between
> H0-9 and H0-5.
> 
> Conclusion: Never compute values that are of no use when you can
> avoid it !
> 
> >There's a case for a review of the programming language used.
> 
> Michel
> --
> | Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
> | IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|

Of course, Michel, you've got a great point, but let me give you some
advice, assuming you haven't read this thread for the last few months
(seems like years). Robin's whole point is that he firmly believes that the
problem would not have occurred if PL/I had been used instead of Ada.
Several EXTREMELY competent and experienced engineers who actually have
written flight-control software have patiently, and in some cases
(though I can't blame them) impatiently attempted to explain the
situation - that this was a bad design/management decision combined with
a fatal oversight in testing - to this poor student, but alas, to no
avail.

My advice, Michel - blow it off and don't let ++robin (or is it
@@robin?) get to you, because "++robin" is actually an alias for John
Cleese. He's gathering material for a sequel to "The Argument
Sketch"...    :-)




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00     ` Byron Kauffman
@ 1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
                           ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: A. Grant @ 1996-09-25  0:00 UTC (permalink / raw)



In article <32492E5C.562@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
>Several EXTREMELY competent and experienced engineers who actually have
>written flight-control software have patiently, and in some cases
>(though I can't blame them) impatiently attempted to explain the
>situation - that this was a bad design/management decision combined with
>a fatal oversight in testing - to this poor student, but alas, to no
>avail.

Robin is not a student.  He is a senior lecturer at the Royal
Melbourne Institute of Technology, a highly reputable institution.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
@ 1996-09-25  0:00   ` Bob Kitzberger
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-27  0:00   ` John McCabe
  2 siblings, 1 reply; 105+ messages in thread
From: Bob Kitzberger @ 1996-09-25  0:00 UTC (permalink / raw)



@@           robin (rav@goanna.cs.rmit.edu.au) wrote:
: The cause?  An integer out of range -- the same problem
: as with Ariane 5, where an integer became out of range.
...
: There's a case for a review of the programming language used.

Why do you persist?  

Ada _has_ range checks built into the language.  They were explicitly
disabled in this case.
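
For readers who haven't seen the mechanism, this is roughly how such
disabling is written (a minimal sketch; the procedure and values are
hypothetical):

	procedure Fast_Path is
	   pragma Suppress (Range_Check);
	   subtype Percent is Integer range 0 .. 100;
	   X : Integer := 101;
	   P : Percent;
	begin
	   P := X;  -- out of range, but unchecked: execution is erroneous
	end Fast_Path;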

What are you failing to grasp?

--
Bob Kitzberger	      Rational Software Corporation       rlk@rational.com
http://www.rational.com http://www.rational.com/pst/products/testmate.html




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Michel OLAGNON
@ 1996-09-25  0:00     ` Chris Morgan
  1996-09-25  0:00     ` Byron Kauffman
  1 sibling, 0 replies; 105+ messages in thread
From: Chris Morgan @ 1996-09-25  0:00 UTC (permalink / raw)



In article <ag129.804.0011F709@ucs.cam.ac.uk> ag129@ucs.cam.ac.uk
(A. Grant) writes:

   Robin is not a student.  He is a senior lecturer at the Royal
   Melbourne Institute of Technology, a highly reputable institution.

I'm tempted to say "not so reputable to readers of this newsgroup"
after the ridiculous statements made by Robin w.r.t. Ariane 5 but
Richard A. O'Keefe's regular excellent postings more than balance them
out.

Chris
-- 
--
Chris Morgan                     |email         cm@mihalis.demon.co.uk (home) 
http://www.mihalis.demon.co.uk/  |       or chris.morgan@baesema.co.uk (work)




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00   ` Bob Kitzberger
@ 1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
                         ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Ronald Kunne @ 1996-09-26  0:00 UTC (permalink / raw)



In article <52bm1c$gvn@rational.rational.com>
rlk@rational.com (Bob Kitzberger) writes:
 
>Ada _has_ range checks built into the language.  They were explicitly
>disabled in this case.
 
The problem of constructing bug-free real-time software seems to me
a trade-off between safety and speed of execution (and maybe available
memory?). In other words: including tests on array boundaries might
make the code safer, but also slower.
 
Comments?
 
Greetings,
Ronald




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
@ 1996-09-26  0:00         ` Byron Kauffman
  1996-09-27  0:00           ` A. Grant
  1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 1 reply; 105+ messages in thread
From: Byron Kauffman @ 1996-09-26  0:00 UTC (permalink / raw)



A. Grant wrote:
> 
> In article <32492E5C.562@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
> >Several EXTREMELY competent and experienced engineers who actually have
> >written flight-control software have patiently, and in some cases
> >(though I can't blame them) impatiently attempted to explain the
> >situation - that this was a bad design/management decision combined with
> >a fatal oversight in testing - to this poor student, but alas, to no
> >avail.
> 
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

A. -

Thank you for confirming my long-held theory that those who inhabit
the ivory towers of engineering/CS academia should spend 2 of every 5
years working at a real job out in the real world. My intent is not to
slam professors who are in touch with reality, of course (e.g.,
Feldman, Dewar, et al), but the idealistic theoretical side often is a
far cry from the practical, just-get-it-done world we have to deal
with once we're out of school.


I just KNOW there's a good Dilbert strip here somewhere...




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00       ` A. Grant
  1996-09-25  0:00         ` Ken Garlington
  1996-09-26  0:00         ` Byron Kauffman
@ 1996-09-26  0:00         ` Sandy McPherson
  2 siblings, 0 replies; 105+ messages in thread
From: Sandy McPherson @ 1996-09-26  0:00 UTC (permalink / raw)



A. Grant wrote:
> 
> Robin is not a student.  He is a senior lecturer at the Royal
> Melbourne Institute of Technology, a highly reputable institution.

Why doesn't he wise up and act like one then? 

I don't know the man, and I suspect he has been winding everybody up
just for a laugh. But, if this is not the case, the thought of such a
closed mind teaching students is quite horrific.

"Use PL/I mate, you'll be tucker",

-- 
Sandy McPherson	MBCS CEng.	tel: 	+31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
@ 1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
                           ` (2 more replies)
  1996-09-27  0:00       ` Ken Garlington
                         ` (2 subsequent siblings)
  3 siblings, 3 replies; 105+ messages in thread
From: Matthew Heaney @ 1996-09-26  0:00 UTC (permalink / raw)



In article <1780E8471.KUNNE@frcpn11.in2p3.fr>, KUNNE@frcpn11.in2p3.fr
(Ronald Kunne) wrote:

>In article <52bm1c$gvn@rational.rational.com>
>rlk@rational.com (Bob Kitzberger) writes:
> 
>>Ada _has_ range checks built into the language.  They were explicitly
>>disabled in this case.
> 
>The problem of constructing bug-free real-time software seems to me
>a trade-off between safety and speed of execution (and maybe available
>memory?). In other words: including tests on array boundaries might
>make the code safer, but also slower.
> 
>Comments?

Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
I'm not clear what the value of "faster execution" is.  The rocket's gone,
so what difference does it make how fast the code executed?  If you left
the range checks in, your code would be *marginally* slower, but you'd
still have your rocket, now wouldn't you?

>Ronald

Matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mheaney@ni.net
(818) 985-1271




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
@ 1996-09-27  0:00         ` Wayne Hayes
  1996-09-27  0:00           ` Richard Pattis
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-28  0:00         ` Ken Garlington
  2 siblings, 1 reply; 105+ messages in thread
From: Wayne Hayes @ 1996-09-27  0:00 UTC (permalink / raw)



In article <mheaney-ya023180002609962252500001@news.ni.net>,
Matthew Heaney <mheaney@ni.net> wrote:
>Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is.  The rocket's gone,
>so what difference does it make how fast the code executed?  If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?

The point is moot.  In this case, catching the error wouldn't have
helped.  The out-of-bounds error happened in a piece of code designed
for the Ariane-4, in which it was *physically impossible* for the value
to overflow (the Ariane-4 didn't go that fast, and it was a velocity
variable).  Then the code was used, as-is, in the Ariane-5, without an
analysis of how the code would react in the new hardware, which flew
faster.  Had the analysis been done, they wouldn't have added bounds
checking, they would have modified the code to actually *work*, because
they would have realized that the code was *guaranteed* to fail on the
first flight.

-- 
        "And a woman needs a man...        || Wayne Hayes, wayne@cs.utoronto.ca
      like a fish needs a bicycle..."      || Astrophysics & Computer Science
-- U2 (apparently quoting Gloria Steinem?) || http://www.cs.utoronto.ca/~wayne




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00       ` Ken Garlington
@ 1996-09-27  0:00       ` Alan Brain
  1996-09-28  0:00         ` Ken Garlington
  1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 1 reply; 105+ messages in thread
From: Alan Brain @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:

> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.
> 
> Comments?

Bug-free software is not a reasonable criterion for success in a
safety-critical system, IMHO. A good program should meet the
requirements for safety etc despite bugs. Also despite hardware
failures, soft failures, and so on. A really good safety-critical
program should be remarkably difficult to debug, as the only way you
know it's got a major problem is by examining the error log, and
calculating that its performance is below theoretical expectations.

And if it runs too slow, many times in the real-world you can spend 2
years of development time and many megabucks kludging the software, or
wait 12 months and get the new 400 Mhz chip instead of your current 133.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
@ 1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
                             ` (2 more replies)
  1996-09-28  0:00         ` Ken Garlington
  2 siblings, 3 replies; 105+ messages in thread
From: Ronald Kunne @ 1996-09-27  0:00 UTC (permalink / raw)



In article <mheaney-ya023180002609962252500001@news.ni.net>
mheaney@ni.net (Matthew Heaney) writes:
 
>>The problem of constructing bug-free real-time software seems to me
>>a trade-off between safety and speed of execution (and maybe available
>>memory?). In other words: including tests on array boundaries might
>>make the code safer, but also slower.
 
>Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is.  The rocket's gone,
>so what difference does it make how fast the code executed?  If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?
 
Despite the sarcasm, I will elaborate.
 
Suppose an array goes from 0 to 100, and the calculated index is known
not to go outside this range. Why would one insist on putting the
range test in, which will slow down the code? This might be a problem
if the particular piece of code is heavily used, and the code executes
too slowly otherwise. "Marginally slower" if it happens only once, but
such checks on indices and function arguments (like square roots) are
necessary *everywhere* in code, if one is consistent.
 
Actually, this was the case here: the code was taken from Ariane 4
code where it was physically impossible that the index would go out
of range: a test would have been a waste of time.
Unfortunately this was no longer the case in the Ariane 5.
 
Friendly greetings,
Ronald Kunne




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00         ` Byron Kauffman
@ 1996-09-27  0:00           ` A. Grant
  0 siblings, 0 replies; 105+ messages in thread
From: A. Grant @ 1996-09-27  0:00 UTC (permalink / raw)



In article <324A7C1C.6718@lmtas.lmco.com> Byron Kauffman <KauffmanBB@lmtas.lmco.com> writes:
>A. Grant wrote:
>> Robin is not a student.  He is a senior lecturer at the Royal
>> Melbourne Institute of Technology, a highly reputable institution.

>Thank you for confirming my long-held theory that those who inhabit the
>ivory towers of engineering/CS academia should spend 2 of every 5 years 
>working at a real job out in the real world. My intent is not to slam 
>professors who are in touch with reality, of course (e.g., Feldman, 
>Dewar, et al), but the idealistic theoretical side often is a far cry 
>from the practical, just-get-it-done world we have to deal with once
>we're out of school.

You're being a bit hard on theoretical computer scientists here.
Just because it's called computer science doesn't mean it has to be
able to instantly make money on real computers.  And the Ariane 5 
failure was due to pragmatism (reusing old stuff to save money)
not idealism (applying theoretical proofs of correctness).

But in any case RMIT is noted for its involvement with industry.
(I used to work for a start-up company out of RMIT premises.)
If PL/I is being pushed by RMIT it's probably because the DP
managers in Collins St. want it.  Australia doesn't have much call
for aerospace systems.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
  1996-09-26  0:00       ` Matthew Heaney
@ 1996-09-27  0:00       ` Ken Garlington
  1996-09-27  0:00       ` Alan Brain
  1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> In article <52bm1c$gvn@rational.rational.com>
> rlk@rational.com (Bob Kitzberger) writes:
> 
> >Ada _has_ range checks built into the language.  They were explicitly
> >disabled in this case.
> 
> The problem of constructing bug-free real-time software seems to me
> a trade-off between safety and speed of execution (and maybe available
> memory?). In other words: including tests on array boundaries might
> make the code safer, but also slower.

Particularly for fail-operate systems that must continue to function in
harsh environments, memory and throughput can be tight. This usually happens
because the system must continue to operate on emergency power and/or
cooling. At least until recently, the processing systems that had lots of
memory and CPU power also had larger power and cooling requirements, so they
couldn't always be used in this class of systems. (That's changing, somewhat.) So,
the tradeoff you describe can occur.

The trade-off I find even more interesting is the safety gained from
adding extra features vs. the safety _lost_ by adding those features. Every
time you add a check, whether it's an explicit check or one automatically
generated by the compiler, you have to have some way to gain confidence that
the check will not only work, but won't create some side-effect that causes
a different problem. The effort expended to get confidence for that additional
feature is effort that can't be spent gaining assurance of other features in
the system, assuming finite resources. There is no magic formula I've ever
seen to make that trade-off - ultimately, it's human judgement.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-25  0:00 ` Ariane 5 failure @@           robin
  1996-09-25  0:00   ` Michel OLAGNON
  1996-09-25  0:00   ` Bob Kitzberger
@ 1996-09-27  0:00   ` John McCabe
  1996-10-01  0:00     ` Michael Dworetsky
  1996-10-04  0:00     ` @@           robin
  2 siblings, 2 replies; 105+ messages in thread
From: John McCabe @ 1996-09-27  0:00 UTC (permalink / raw)



rav@goanna.cs.rmit.edu.au (@@           robin) wrote:

<..snip..>

Just a point for your information. From clari.tw.space:

	 "An inquiry board investigating the explosion concluded in  
July that the failure was caused by software design errors in a 
guidance system."

Note software DESIGN errors - not programming errors.



Best Regards
John McCabe <john@assen.demon.co.uk>





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
@ 1996-09-27  0:00           ` Lawrence Foard
  1996-10-04  0:00             ` @@           robin
  1996-09-28  0:00           ` Ken Garlington
  1996-09-29  0:00           ` Alan Brain
  2 siblings, 1 reply; 105+ messages in thread
From: Lawrence Foard @ 1996-09-27  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> Actually, this was the case here: the code was taken from Ariane 4
> code where it was physically impossible that the index would go out
> of range: a test would have been a waste of time.
> Unfortunately this was no longer the case in the Ariane 5.

Actually it would still present a danger on Ariane 4. If the sensor
which apparently was no longer needed during flight became defective,
then you could get a value out of range.

-- 
The virgin birth of Pythagoras via Apollo. The martyrdom of 
St. Socrates. The Gospel according to Iamblichus. 
--  Have an 18.9cents/minute 6 second billed calling card tomorrow --
                  http://www.vwis.com/cards.html




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Wayne Hayes
@ 1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Dann Corbit
                               ` (3 more replies)
  0 siblings, 4 replies; 105+ messages in thread
From: Richard Pattis @ 1996-09-27  0:00 UTC (permalink / raw)



As an instructor in CS1/CS2, this discussion interests me. I try to talk about
designing robust, reusable code, and actually have students reuse code that
I have written as well as some that they (and their peers) have written.
The Ariane failure adds a new view to robustness, having to do with future
use of code, and mathematical proof vs. "engineering" considerations.

Should a software engineer remove safety checks if he/she can prove - based on
physical limitations, like a rocket not exceeding a certain speed - that they
are unnecessary?  Or, knowing that his/her code will be reused (in an unknown
context, by someone who is not so skilled, and will probably not think to
redo the proof), should such checks not be optimized out?  What rule of thumb
should be used to decide (e.g., what if the proof assumes the rocket speed
will not exceed that of light)?  Since software operates in the real world (not
the world of mathematics), should mathematical proofs about code always yield
to engineering rules of thumb to expect the unexpected?

  "In the Russian theatre, every 5 years an unloaded gun accidentally 
   discharges and kills someone; every 20 years a broom does."

What is the rule of thumb about when should mathematics be believed? 

  As to saving SPEED by disabling the range checks: did the code not meet its
speed requirements with range checks on? Only in this case would I have turned
them off. Does "real time" mean fast enough or as fast as possible? To
misquote Einstein, "Code should run as fast as necessary, but no faster...."
since something is always traded away to increase speed.

If I were to try to create a lecture on this topic, what other similar
failures should I know about (besides the legendary Venus probe)?
Your comments?

Rich




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00       ` Matthew Heaney
  1996-09-27  0:00         ` Wayne Hayes
  1996-09-27  0:00         ` Ronald Kunne
@ 1996-09-28  0:00         ` Ken Garlington
  2 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Matthew Heaney wrote:
> 




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
@ 1996-09-28  0:00           ` Ken Garlington
  1996-09-28  0:00             ` Ken Garlington
  1996-09-29  0:00           ` Alan Brain
  2 siblings, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Ronald Kunne wrote:
> 
> In article <mheaney-ya023180002609962252500001@news.ni.net>
> mheaney@ni.net (Matthew Heaney) writes:
> 
> >>The problem of constructing bug-free real-time software seems to me
> >>a trade-off between safety and speed of execution (and maybe available
> >>memory?). In other words: including tests on array boundaries might
> >>make the code safer, but also slower.
> 
> >Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
> >I'm not clear what the value of "faster execution" is.  The rocket's gone,
> >so what difference does it make how fast the code executed?  If you left
> >the range checks in, your code would be *marginally* slower, but you'd
> >still have your rocket, now wouldn't you?
> 
> Despite the sarcasm, I will elaborate.
> 
> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

I might agree with the conclusion, but probably not with the argument.
If the array is statically typed to go from 0 to 100, and everything
that indexes it is statically typed for that range or smaller, most
modern Ada compilers won't generate _any_ code for the check.
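
A sketch of that point (names hypothetical): when everything that
touches the array carries the array's own index subtype, the compiler
can prove the check redundant and emit no code for it.

	procedure Sum_Table is
	   subtype Index is Integer range 0 .. 100;
	   Table : array (Index) of Float := (others => 0.0);
	   Sum   : Float := 0.0;
	begin
	   for I in Index loop         -- I can hold nothing but 0 .. 100,
	      Sum := Sum + Table (I);  -- so no run-time check is generated
	   end loop;
	end Sum_Table;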

I still believe the more interesting issue has to do with the _consequences_
of the check. If your environment doesn't lend itself to a reasonable response
to the check (quite possible in fail-operate systems inside systems that move
really fast), and you have to test the checks to make sure they don't _create_
a problem, then you've got a hard decision on your hands: suppress the check
(which might trigger a compiler bug or some other problems), or leave the check in 
(which might introduce a problem, or divert your attention away from some other
problem).

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00       ` Alan Brain
@ 1996-09-28  0:00         ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> Ronald Kunne wrote:
> 
> > The problem of constructing bug-free real-time software seems to me
> > a trade-off between safety and speed of execution (and maybe available
> > memory?). In other words: including tests on array boundaries might
> > make the code safer, but also slower.
> >
> > Comments?
> 
> Bug-free software is not a reasonable criterion for success in a
> safety-critical system, IMHO. A good program should meet the
> requirements for safety etc despite bugs.

An OK statement for a fail-safe system. How do you propose to implement
this theory for a fail-operate system, particularly if there are system
constraints on weight, etc. that preclude hardware backups?

> Also despite hardware
> failures, soft failures, and so on.

A system which will always meet its requirements despite any combination
of failures is in the same regime as the perpetual motion machine. If
you build one, you'll probably make a lot of money, so go to it!

> A really good safety-critical
> program should be remarkably difficult to debug, as the only way you
> know it's got a major problem is by examining the error log, and
> calculating that its performance is below theoretical expectations.
> And if it runs too slow, many times in the real-world you can spend 2
> years of development time and many megabucks kludging the software, or
> wait 12 months and get the new 400 Mhz chip instead of your current 133.

I really need to change jobs. It sounds so much simpler to build 
software for ground-based PCs, where you don't have to worry about the 
weight, power requirements, heat dissipation, physical size, 
vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-28  0:00           ` Ken Garlington
@ 1996-09-28  0:00             ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-09-28  0:00 UTC (permalink / raw)



From the  "There's always time to test it the second time around"
department...

      ORBITAL JUNK:  The second Ariane 5 to be launched in April at the
      earliest will put two dummy satellites, worth less than $3
      million, into orbit. The first Ariane 5 exploded in June carrying
      four uninsured satellites worth $500 million.  (Financial Times)

I wonder if the test labs at Arianespace, etc. are keeping busy... :)




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00           ` Alan Brain
@ 1996-09-29  0:00             ` Robert A Duff
  1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00             ` Ken Garlington
  1 sibling, 1 reply; 105+ messages in thread
From: Robert A Duff @ 1996-09-29  0:00 UTC (permalink / raw)



In article <324F1157.625C@dynamite.com.au>,
Alan Brain  <aebrain@dynamite.com.au> wrote:
>Brain's law:
>"Software Bugs and Hardware Faults are no excuse for the Program not to
>work".   
>
>So: it costs peanuts, and may save your hide.

This reasoning doesn't sound right to me.  The hardware part, I mean.
The reason checks-on costs only 5% or so is that compilers aggressively
optimize out almost all of the checks.  When the compiler proves that a
check can't fail, it assumes that the hardware is perfect.  So, hardware
faults and cosmic rays and so forth are just as likely to destroy the
RTS, or cause the program to take a wild jump, or destroy the call
stack, or whatever -- as opposed to getting a Constraint_Error and
recovering gracefully.  After all, the compiler doesn't range-check the
return address just before doing a return instruction!

- Bob




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Dann Corbit
  1996-09-29  0:00             ` Alan Brain
@ 1996-09-29  0:00             ` Chris McKnight
  1996-09-29  0:00               ` Real-world education (was: Ariane 5 failure) Michael Feldman
  1996-10-01  0:00             ` Ariane 5 failure Ken Garlington
  3 siblings, 1 reply; 105+ messages in thread
From: Chris McKnight @ 1996-09-29  0:00 UTC (permalink / raw)



In article Hzz@beaver.cs.washington.edu, pattis@cs.washington.edu (Richard Pattis) writes:
>As an instructor in CS1/CS2, this discussion interests me. I try to talk about
>designing robust, reusable code, and actually have students reuse code that
>I have written as well as some that they (and their peers) have written.
>The Ariane failure adds a new view to robustness, having to do with future
>use of code, and mathematical proof vs. "engineering" considerations.

  An excellent bit of teaching, IMHO. Glad to hear they're putting some
  more of the real world issues in the class room.

>Should a software engineer remove safety checks if he/she can prove - based on
>physical limitations, like a rocket not exceeding a certain speed - that they
>are unnecessary?  Or, knowing that his/her code will be reused (in an unknown
>context, by someone who is not so skilled, and will probably not think to
>redo the proof), should such checks not be optimized out?  What rule of thumb
>should be used to decide (e.g., what if the proof assumes the rocket speed
>will not exceed that of light)?  Since software operates in the real world (not
>the world of mathematics), should mathematical proofs about code always yield
>to engineering rules of thumb to expect the unexpected?

 A good question.  For the most part, I'd go with engineering rules of thumb
 (what did you expect, I'm an engineer).  As an engineer, you never know what
 may happen in the real world (in spite of what you may think), so I prefer
 error detection and predictable recovery.  The key factors to consider include
 the likelihood and the cost of failures, and the cost of leaving in (or adding
 where your language doesn't already provide it) the checks.

 Consider these factors, likelihood and cost of failures:

    In a real-time embedded system, both of these factors are often high.  Of
    the two, I think people most often get caught by misbeliefs about the likelihood of
    failure.  As an example, I've argued more than once with engineers who think
    that since a device is only "able" to give them a value in a certain range, 
    they needn't check for out of range values.  I've seen enough failed hardware
    to know that anything is possible, regardless of what the manufacturer may
    claim.  Consider your speed of light example, what if the sensor goes bonkers
    and tells you that you're going faster?  Your "proof" that you can't get that
    value falls apart then.  Your point about reuse is also well made.  Who knows
    what someone else may want to use your code for?

    As for cost of failure, it's usually obvious; in dollars, in lives, or both.
 
 As for cost of leaving checks in (or putting them in):

    IMHO, the cost is almost always insignificant.  If the timing is so tight that 
    removing checks makes the difference, it's probably time to redesign anyway.
    Afterall, in the real world there's always going to be fixes, new features, 
    etc.. that need to be added later, so you'd better plan for it.  Also, it's
    been my experience that removing checks is somewhere in the single digits
    on % improvement.  If you're really that tight, a good optimizer can yield
    10%-15% or more (actual mileage may vary of course).  But again, if that
    makes the difference, you'd better rethink your design.
   
 So the rule of thumb I use is, unless a device is not physically capable (as
 opposed to theoretically capable) of giving me out of range data, I'm going
 to range check it.  I.e., if there's 3 bits, you'd better check for 8 values
 regardless of the number of values you think you can get.
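
 A sketch of that rule in Ada (the device and its modes are
 hypothetical): the input is typed to everything 3 bits can represent,
 and all 8 values are handled, even though only a few are documented.

    type Raw_Field is range 0 .. 7;  -- every value 3 bits can hold
    type Sensor_Mode is (Off, Standby, Track, Fault);

    function Decode (Raw : Raw_Field) return Sensor_Mode is
    begin
       case Raw is
          when 0      => return Off;
          when 1      => return Standby;
          when 2 | 3  => return Track;
          when others => return Fault;  -- 4 .. 7: "can't happen" inputs
       end case;
    end Decode;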

 That having been said, it's often not up to the engineer to make these 
 decisions.  Such things as political considerations, customer demands, and 
 (more often than not) management decisions  have been known to succeed in
 convincing me to turn checks off.  As a rule, however, I fight to keep them
 in, at very least through development and integration. 

>  As to saving SPEED by disabling the range checks: did the code not meet its
>speed requirements with range checks on? Only in this case would I have turned
>them off. Does "real time" mean fast enough or as fast as possible? To
>misquote Einstein, "Code should run as fast as necessary, but no faster...."
>since something is always traded away to increase speed.

  Precisely!  And when what's being traded is safety, it's not worth it.


  Cheers,

     Chris


=========================================================================

"I was gratified to be able to answer promptly.  I said I don't know".  
  -- Mark Twain

=========================================================================





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Real-world education (was: Ariane 5 failure)
  1996-09-29  0:00             ` Chris McKnight
@ 1996-09-29  0:00               ` Michael Feldman
  0 siblings, 0 replies; 105+ messages in thread
From: Michael Feldman @ 1996-09-29  0:00 UTC (permalink / raw)





In article <1996Sep29.193602.17369@enterprise.rdd.lmsc.lockheed.com>,
Chris McKnight <cmcknigh@hercii.lasc.lockheed.com> wrote:

[Rich Pattis' good stuff snipped.]
>
>  An excellent bit of teaching, IMHO. Glad to hear they're putting some
>  more of the real world issues in the class room.

Rich Pattis is indeed an experienced, even gifted teacher of
introductory courses, with a very practical view of what they
should be about.

Without diminishing Rich Pattis' teaching experience or skill one bit,
I am somewhat perplexed at the unfortunate stereotypical view you
seem to have of CS profs. Yours is the second post today to have
shown evidence of that stereotypical view; both you and the other
poster have industry addresses.

This is my 22nd year as a CS prof, I travel a lot in CS education
circles, and - while we, like any population, tend to hit a bell
curve - I've found that there are a lot more of us out here than
you may think with Pattis-like commitment to bring the real world
into our teaching.

Sure, there are theorists, as there are in any field, studying
and teaching computing just because it's "beautiful", with little
reference to real application, and there's a definite place in the
teaching world for them.  Indeed, exposure to their "purity" of
approach is healthy for undergraduates - there is no harm at all
in taking on computing - sometimes - as purely an intellectual
exercise.

But it's a real reach from there to an assumption that most of us
are in that theoretical category.

I must say that there's a definite connection between an interest
in Ada and an interest in real-world software; certainly most of
the Ada teachers I've met are more like Pattis than you must think.
Indeed, it's probably our commitment to that "engineering" view
of computing that brings us to like and teach Ada.

But it's not just limited to Ada folks. I had the pleasure of
participating in a SIGCSE panel last March entitled "the first
year beyond language." Organized by Owen Astrachan of Duke,
a C++ fan, this panel consisted of 6 teachers of first-year
courses, each using a different language. Pascal, C++, Ada,
Scheme, Eiffel, and (as I recall) ML were represented.

The challenge Owen made to each of us was to give a 10-minute
"vision statement" for first-year courses, without identifying
which language we "represented." Owen revealed the languages to
the audience only after the presentations were done.

It was _really_ gratifying that - with no prior agreement or
discussion among us - five of the six of us presented very similar
visions, in the "computing as engineering" category. It doesn't
matter which language the 6th used; the important thing was that,
considering the diversity of our backgrounds, teaching everywhere
from small private colleges to big public universities, we were
in _amazing_ agreement.

The message for me in the stereotype presented above is that it's
probably out of date and certainly out of touch. I urge my
industry friends to get out of _their_ ivory towers, and come
visit us. Find out what we're _really_ doing. I think you'll
be pleasantly surprised.

Especially, check out those of us who are introducing students
to _Ada_ as their first, foundation language.

Mike Feldman

------------------------------------------------------------------------
Michael B. Feldman -  chair, SIGAda Education Working Group
Professor, Dept. of Electrical Engineering and Computer Science
The George Washington University -  Washington, DC 20052 USA
202-994-5919 (voice) - 202-994-0227 (fax) 
http://www.seas.gwu.edu/faculty/mfeldman
------------------------------------------------------------------------
       Pork is all that money the government gives the other guys.
------------------------------------------------------------------------
WWW: http://lglwww.epfl.ch/Ada/ or http://info.acm.org/sigada/education
------------------------------------------------------------------------




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
@ 1996-09-29  0:00             ` Dann Corbit
  1996-09-29  0:00             ` Alan Brain
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 105+ messages in thread
From: Dann Corbit @ 1996-09-29  0:00 UTC (permalink / raw)



I propose a software IC metaphor for high-
reliability projects (and eventually for all projects).

Currently, the software industry goes by
what I call a "software schematic" metaphor.
We put in components that are tested, but
we do not necessarily know the performance
curves.

If you look at S. Moshier's code in the
Cephes Library on Netlib, you will see that
he offers statistical evidence that his 
programs are robust.  So you can at least
infer, on a probability basis, what the odds
are of a component failing.  So instead of
just dropping in a resistor or a transistor,
we read the little gold band, or the spec
on the transistor that shows what voltages
it can operate under.
For simple components with, say, five bytes
of input, we could exhaustively test all 
possible inputs and outputs.  For more
complicated procedures with many bytes of
inputs, we could perform probability testing,
and test other key values.

Imagine a database like the following:
TABLE: MODULES
int      ModuleUniqueID
int      ModuleCategory
char*60  ModuleName
char*255 ModuleDescription
text     ModuleCode
text     TestRoutineUsed
bit      CompletelyTested

TABLE: TestResults (many result sets for one module)
int      TestResultUniqueID
int      ModuleUniqueID
char*60  OperatingSystem
char*60  CompilerUsed
binary   ResultChart
text     ResultDescription
float    ProbabilityOfFailure
float    RmsErrorObserved
float    MaxErrorObserved

TABLE: KnownBugs  (many known bugs for one module)
int      KnownBugUniqueID
int      ModuleUniqueID
char*60  KnownBugDescription
text     BugDefinition
text     PossibleWorkAround

Well, this is just a rough outline, but the value of
a database like this would be obvious.  This could
easily be improved and expanded. (More domain tables,
tables for defs of parameters to the module, etc.)

If we had a tool like that, we would be using
software IC's, not software schematics.
-- 
"I speak for myself and all of the lawyers of the world"
If I say something dumb, then they will have to sue themselves.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00         ` Ronald Kunne
  1996-09-27  0:00           ` Lawrence Foard
  1996-09-28  0:00           ` Ken Garlington
@ 1996-09-29  0:00           ` Alan Brain
  1996-09-29  0:00             ` Robert A Duff
  1996-10-01  0:00             ` Ken Garlington
  2 siblings, 2 replies; 105+ messages in thread
From: Alan Brain @ 1996-09-29  0:00 UTC (permalink / raw)



Ronald Kunne wrote:

> Suppose an array goes from 0 to 100, and the calculated index is known
> not to go outside this range. Why would one insist on putting the
> range test in, which will slow down the code? This might be a problem
> if the particular piece of code is heavily used, and the code executes
> too slowly otherwise. "Marginally slower" if it happens only once, but
> such checks on indices and function arguments (like square roots) are
> necessary *everywhere* in code, if one is consistent.

Why insist?
1. Suppressing all checks in Ada-83 makes about a 5% difference in
execution speed in typical real-time and avionics systems (for
example, the B2 simulator, CSU-90 sonar, and COSYS-200 combat system).
If your hardware budget is this tight, you'd better not have lives or
a lot of money at risk, as the technical risk is appallingly high.

2. If you know the range is 0-100, and you get 101, what does this show?
a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
soft failure, as in a stray cosmic ray zapping a bit. d) A faulty
analysis of your "can't happen" situation, as in re-use, or where your
array comes from an IO channel with noise on it....

Type a) and d) failures should be caught during testing. Most of them.
OK, some of them. Range checking here is a necessary debugging aid. But
type b) and c) failures can happen out in the real world too, and if you
don't test for an error early, you often can't recover the situation.
Lives or $ lost.
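
In Ada, testing for the error early can be sketched like this (all
names hypothetical; Read_Sensor stands for whatever delivers the raw,
untrusted value): the check fires next to the source, and the handler
degrades gracefully instead of passing garbage downstream.

	declare
	   subtype Reading is Integer range 0 .. 100;
	   Value, Last_Good : Reading := 0;
	begin
	   loop
	      begin
	         Value := Reading (Read_Sensor);  -- check fires right here
	         Last_Good := Value;
	      exception
	         when Constraint_Error =>
	            Value := Last_Good;  -- recover; don't pass garbage on
	      end;
	   end loop;
	end;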

Brain's law:
"Software Bugs and Hardware Faults are no excuse for the Program not to
work".   

So: it costs peanuts, and may save your hide.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-26  0:00     ` Ronald Kunne
                         ` (2 preceding siblings ...)
  1996-09-27  0:00       ` Alan Brain
@ 1996-09-29  0:00       ` Louis K. Scheffer
  3 siblings, 0 replies; 105+ messages in thread
From: Louis K. Scheffer @ 1996-09-29  0:00 UTC (permalink / raw)



KUNNE@frcpn11.in2p3.fr (Ronald Kunne) writes:

>The problem of constructing bug-free real-time software seems to me
>a trade-off between safety and speed of execution (and maybe available
>memory?). In other words: including tests on array boundaries might
>make the code safer, but also slower.
> 
>Comments?

True in this case, but not in the way you might expect.  The software group
decided that they wanted the guidance computers to be no more than 80 percent 
busy.  Range checking ALL the variables took too much time, so they analyzed 
the situation and only checked those that might overflow.  In the Ariane 4,
this particular variable could not overflow unless the trajectory was wildly 
off, so they left out the range checking.

I think you could make a good case for range checking in the Ariane
software making it less safe, rather than more safe.  The only reason they
check for overflow is to find hardware errors - since the software is designed
not to overflow, any overflow must be due to a hardware problem, so
if any processor detects an overflow it shuts down.  So on the one hand, each
additional range check increases the odds of catching a hardware error before
it does damage, but increases the odds that a processor shuts down while it
could still be delivering useful data. (Say the overflow occurs while 
computing unimportant results, as on the Ariane 5).   Given the relative
odds of hardware and software errors, it's not at all obvious to me that
range checking helps at all in this case!

The real problem is that they did not re-examine this software for the
Ariane 5.  If they had either simulated it or examined it closely, they
would probably have found this problem.
   -Lou Scheffer




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
  1996-09-29  0:00             ` Dann Corbit
@ 1996-09-29  0:00             ` Alan Brain
  1996-09-29  0:00             ` Chris McKnight
  1996-10-01  0:00             ` Ariane 5 failure Ken Garlington
  3 siblings, 0 replies; 105+ messages in thread
From: Alan Brain @ 1996-09-29  0:00 UTC (permalink / raw)



Richard Pattis wrote:
> 
> As an instructor in CS1/CS2, this discussion interests me. I try to talk about
> designing robust, reusable code.... --->8----

> The Ariane failure adds a new view to robustness, having to do with future
> use of code, and mathematical proof vs. "engineering" considerations.
> 
> Should a software engineer remove safety checks if he/she can prove - based on
> physical limitations, like a rocket not exceeding a certain speed - that they
> are unnecessary?  Or, knowing that his/her code will be reused (in an unknown
> context, by someone who is not so skilled, and will probably not think to
> redo the proof), should such checks not be optimized out?  What rule of thumb
> should be used to decide (e.g., what if the proof assumes the rocket speed
> will not exceed that of light)?  Since software operates in the real world (not
> the world of mathematics), should mathematical proofs about code always yield
> to engineering rules of thumb to expect the unexpected?

> What is the rule of thumb about when should mathematics be believed?
> 

Firstly, I wish there were more CS teachers like you. These are
excellent engineering questions.

Secondly, answers:
I tend towards the philosophy of "Leave every check in". In 12+ years
of Ada programming, I've never seen pragma Suppress (All_Checks) make
the difference between success and failure. At best it gives a 5%
improvement. This means that in order to debug the code quickly, it's
useful to have such checks, even when not strictly necessary.

For re-use, you then often have the Ariane problem. That is, the
unnecessary checks you included come around and bite you, as the
assumptions you were making in the previous project become invalid.

So.... You make sure the assumptions/consequences get put into a
separate package: a system-specific package that will be changed when
re-used. Which means that if the subsystem gets re-used a lot, the
system-specific stuff will eventually be re-written so as to allow for
re-use easily.
Example: a car's cruise control: MAX_SPEED : constant := 200.0 * MPH;
Gets re-used in an airliner - change to 700.0 * MPH. Then onto an
SST - 2000.0 * MPH. Eventually, you make it 2.998E8 * Metres_Per_Sec,
the speed of light. Then some Bunt invents a Warp Drive, and you're
wrong again.
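
A sketch of such a system-specific package (names and values are
hypothetical); re-targeting the subsystem then means editing only
this one unit:

	package Vehicle_Limits is
	   MPH       : constant := 0.447_04;     -- one mile/hour in m/s
	   Max_Speed : constant := 200.0 * MPH;  -- car; airliner: 700.0 * MPH
	end Vehicle_Limits;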

Summary: Label the constraints and assumptions, stick them as comments
in the code and design notes, put them in a separate package... and some
dill will still stuff up, but that's the best you can do. And in the
meantime, you allow the possibility of finding a number of errors
early.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00             ` Robert A Duff
@ 1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00                 ` Ken Garlington
  1996-10-03  0:00                 ` Richard A. O'Keefe
  0 siblings, 2 replies; 105+ messages in thread
From: Wayne L. Beavers @ 1996-09-30  0:00 UTC (permalink / raw)



I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code 
area from damage.  When I code in PL/I or any other reentrant language I always make sure that the executable 
code is executing from read-only storage.  There is no way to put the data areas in read-only storage 
(obviously) but I can't think of any reason to put the executable code in writeable storage. 

I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another.  The
single most common error I had to correct was incorrect usage of pointer variables.  I caught a lot of them
whenever they attempted to accidentally store into the code area.  At that point it is trivial to correct the
bug.  This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-30  0:00               ` Wayne L. Beavers
@ 1996-10-01  0:00                 ` Ken Garlington
  1996-10-01  0:00                   ` Wayne L. Beavers
  1996-10-03  0:00                 ` Richard A. O'Keefe
  1 sibling, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Wayne L. Beavers wrote:
> 
> I have been reading this thread awhile and one topic that I have not seen mentioned is protecting the code
> area from damage.  When I code in PL/I or any other reentrant language I always make sure that the executable
> code is executing from read-only storage.  There is no way to put the data areas in read-only storage
> (obviously) but I can't think of any reason to put the executable code in writeable storage.

That's actually a pretty common rule of thumb for safety-critical systems. 
Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors 
can cause a random change in the memory. So, it's not a perfect fix.

> 
> I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable code, from one system to another.  The
> single most common error I had to correct was incorrect usage of pointer variables.  I caught a lot of them
> whenever they attempted to accidentally store into the code area.  At that point it is trivial to correct the
> bug.  This technique certainly doesn't catch all pointer failures, but it will catch at least some of them.

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Richard Pattis
                               ` (2 preceding siblings ...)
  1996-09-29  0:00             ` Chris McKnight
@ 1996-10-01  0:00             ` Ken Garlington
  3 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Richard Pattis wrote:
> 
[snip]
> If I were to try to create a lecture on this topic, what other similar
> failures should I know about (beside the legendary Venus probe)?
> Your comments?

"Safeware" by Levison has some additional good examples about what can
go wrong with software. The RISKS forum (comp.risks) also has a lot of
info on this.

There was a study done several years ago by Dr. Avizienis (I always screw
up that spelling, and I'm always too lazy to go look it up...) trying to
show the worth of N-version programming. He had five teams of students write
code for part of a flight control system. Each team was given the same set
of control law diagrams (which are pretty detailed, as requirements go), and
each team used the same sort of meticulous software engineering approach that
you would expect for a safety-critical system (no formal methods, however).
Each team's software was almost error-free, based on tests done using the
same test data as the actual delivered flight controls.

Note I said "almost". Every team made one mistake. Worse, it was the _same_
mistake. The control law diagrams were copies. The copier apparently wasn't
a good one, because a comma in one of the gains ended up looking like a
decimal point (or maybe it was the other way around -- I forget). Anyway,
the gain was accidentally coded as 2.345 vs 2,345, or something like that.
That kind of error makes a big difference!

In the face of that kind of error, I've never felt that formal methods had a
chance. That's not to say that formal methods can't detect a lot of different
kinds of failures, but at some level some engineer has to be able to say: "That
doesn't make sense..."

If you want to try to find this study, I believe it was reported at a Digital
Avionics Systems Conference many years ago (in San Jose?), probably around 1986.

> 
> Rich

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-29  0:00           ` Alan Brain
  1996-09-29  0:00             ` Robert A Duff
@ 1996-10-01  0:00             ` Ken Garlington
  1 sibling, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> 1. Suppressing all checks in Ada-83 makes about a 5% difference in
> execution speed, in typical real-time and avionics systems. (For
> example, B2 simulator, CSU-90 sonar, COSYS-200 Combat system). If your
> hardware budget is this tight, you'd better not have lives at risk,
> or a lot of money, as the technical risk is appallingly high.

Actually, I've seen systems where checks make much more than a 5% difference.
For example, in a flight control system, checks done in the redundancy
management monitor (comparing many redundant inputs in a tight loop) can
easily add 10% or more.

I have also seen flight-critical systems where 5% is a big deal, and where you
can _not_ add a more powerful processor to fix the problem. Flight control
software usually exists in a flight control _system_, with system issues of
power, cooling, space, etc. to consider. On a missile, these are important
issues. You might consider the technical risk "appallingly high," but the fix
for that risk can introduce equally dangerous risks in other areas.

> 2. If you know the range is 0-100, and you get 101, what does this show?
> a) A bug in the code (99.9999....% probable). b) A hardware fault. c) A
> soft failure, as in a stray cosmic ray zapping a bit. d) a faulty
> analysis of your "can't happen" situation. As in re-use, or where your
> array comes from an IO channel with noise on....

You forgot (e) - a failure in the inputs. The range may be calculated,
directly or indirectly, from an input to the system. In practice, at least
for the systems I'm familiar with, that's usually where the error came
from -- either a connector fell off, or some wiring shorted out, or a bird
strike took out half of your sensors. I definitely would say that, when we
have a failure reported in operation, it's not usually because of a bug in
the software for our systems!
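
To make (e) concrete, here's a minimal sketch - names invented - of
treating an IO word as suspect rather than proven:

   --  Hypothetical input screening: the raw word came off a channel,
   --  so an out-of-range value means "bad input", not "software bug".
   type Percent is range 0 .. 100;

   procedure Read_Sensor (Raw     : in  Integer;
                          Value   : out Percent;
                          Healthy : out Boolean) is
   begin
      if Raw in 0 .. 100 then
         Value   := Percent (Raw);
         Healthy := True;
      else
         Value   := 0;       --  safe default
         Healthy := False;   --  flag the source, e.g. a dead sensor
      end if;
   end Read_Sensor;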

> Type a) and d) failures should be caught during testing. Most of them.
> OK, some of them. Range checking here is a necessary debugging aid. But
> type b) and c) can happen too out in the real world, and if you don't
> test for an error early, you often can't recover the situation. Lives or
> $ lost.
> 
> Brain's law:
> "Software Bugs and Hardware Faults are no excuse for the Program not to
> work".

Too bad that law can't be enforced :)

-- 
LMTAS - "Our Brand Means Quality"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                 ` Ken Garlington
@ 1996-10-01  0:00                   ` Wayne L. Beavers
  1996-10-01  0:00                     ` Ken Garlington
  0 siblings, 1 reply; 105+ messages in thread
From: Wayne L. Beavers @ 1996-10-01  0:00 UTC (permalink / raw)



Ken Garlington wrote:

> That's actually a pretty common rule of thumb for safety-critical systems.
> Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> can cause a random change in the memory. So, it's not a perfect fix.

  You're right, but the risk and probability of memory failures is pretty low, I would think.  I have never seen 
or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current 
technology is but I can remember quite awhile ago that at least one vendor was claiming that ALL double bit 
memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were 
correctable.  But I also don't work on realtime systems, my experience is with commercial systems.

  Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you 
referring to ground-based systems that don't have similar constraints?

  Does anyone know just how good memory ECC is these days?

Wayne L. Beavers   wayneb@beyond-software.com
Beyond Software, Inc.      
The Mainframe/Internet Company
http://www.beyond-software.com/




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                   ` Wayne L. Beavers
@ 1996-10-01  0:00                     ` Ken Garlington
  1996-10-02  0:00                       ` Sandy McPherson
  0 siblings, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-10-01  0:00 UTC (permalink / raw)



Wayne L. Beavers wrote:
> 
> Ken Garlington wrote:
> 
> > That's actually a pretty common rule of thumb for safety-critical systems.
> > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > can cause a random change in the memory. So, it's not a perfect fix.
> 
>   You're right, but the risk and probability of memory failures is pretty low, I would think.  I have never seen
> or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current
> technology is but I can remember quite awhile ago that at least one vendor was claiming that ALL double bit
> memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were
> correctable.  But I also don't work on realtime systems, my experience is with commercial systems.
> 
>   Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you
> referring to ground-based systems that don't have similar constraints?

On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment 
you can get quite a few failure _sources_, including mechanical failures (stress 
fractures, solder loss due to excessive heat, etc.), electrical failures (EMI, 
lightning), and so forth. You don't have to take out the actual chip, of course: just 
as bad is a failure in the address or data lines connecting the memory to the CPU. Add 
a memory management unit to the mix, along with various I/O devices mapped into the 
memory space, and you can get a whole slew of memory-related failure modes.

You can also get into some neat system failures. For example, some "read-only" memory 
actually allows writes to the execution space in certain modes, to allow quick 
reprogramming. If you have a system failure that allows writes at the wrong time, 
coupled with a failure that does a write where it shouldn't...




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00   ` John McCabe
@ 1996-10-01  0:00     ` Michael Dworetsky
  1996-10-04  0:00       ` Steve Bell
  1996-10-04  0:00     ` @@           robin
  1 sibling, 1 reply; 105+ messages in thread
From: Michael Dworetsky @ 1996-10-01  0:00 UTC (permalink / raw)



In article <843845039.4461.0@assen.demon.co.uk> john@assen.demon.co.uk (John McCabe) writes:
>rav@goanna.cs.rmit.edu.au (@@           robin) wrote:
>
><..snip..>
>
>Just a point for your information. From clari.tw.space:
>
>	 "An inquiry board investigating the explosion concluded in  
>July that the failure was caused by software design errors in a 
>guidance system."
>
>Note software DESIGN errors - not programming errors.
>

Indeed, the problems were in the specifications given to the programmers, 
not in the coding activity itself.  They wrote exactly what they were 
asked to write, as far as I could see from reading the report summary.

The problem was caused by using software developed for Ariane 4's flight
characteristics, which were different from those of Ariane 5.  When the
launch vehicle exceeded the boundary parameters of the Ariane-4 software,
it sent an error message and, as specified by the remit given to the
programmers, a critical guidance system shut down in mid-flight. Ka-boom. 


-- 
Mike Dworetsky, Department of Physics  | Haiku: Nine men ogle gnats
& Astronomy, University College London |         all lit
Gower Street, London WC1E 6BT  UK      |   till last angel gone.
   email: mmd@star.ucl.ac.uk           |       Men in Ukiah.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-02  0:00 ` Robert I. Eachus
  1996-10-02  0:00 ` Matthew Heaney
  0 siblings, 2 replies; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-01  0:00 UTC (permalink / raw)



Matthew Heaney <mheaney@NI.NET> writes:
>
>Why, yes.  If the rocket blows up, at the cost of millions of dollars, then
>I'm not clear what the value of "faster execution" is.  The rocket's gone,
>so what difference does it make how fast the code executed?  If you left
>the range checks in, your code would be *marginally* slower, but you'd
>still have your rocket, now wouldn't you?
>
    It's not a case of saving a few CPU cycles so you can run Space
    Invaders in the background. Quite often (and in particular in
    *space* systems which are limited to rather antiquated
    processors) the decision is to a) remove the runtime checks from
    the compiled image and run with the possible risk of undetected
    constraint errors, etc. or b) give up and go home because there's
    no way you are going to squeeze the necessary logic into the box
    you've got with all the checks turned on.

    It's not as if we take these decisions lightly and are just being
    stingy with CPU cycles so we can save them up for our old age. We
    remove the checks typically because there's no other choice.
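
    For the record, the standard Ada mechanism is scoped, so the
    checks can stay on everywhere except the units where the budget
    forces the issue. A sketch (the procedure is invented):

       --  Checks suppressed only in the reviewed, time-critical
       --  unit; the rest of the program keeps them.
       procedure Control_Loop is
          pragma Suppress (Range_Check);
          pragma Suppress (Overflow_Check);
       begin
          null;  --  the time-critical computations go here
       end Control_Loop;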

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "Some people say a front-engine car handles best. Some people say
    a rear-engine car handles best. I say a rented car handles best."

        --  P. J. O'Rourke
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-02  0:00 ` Alan Brain
  0 siblings, 1 reply; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-01  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:
>Alan Brain wrote:
>> A really good safety-critical
>> program should be remarkably difficult to de-bug, as the only way you
>> know it's got a major problem is by examining the error log, and
>> calculating that its performance is below theoretical expectations.
>> And if it runs too slow, many times in the real-world you can spend 2
>> years of development time and many megabucks kludging the software, or
>> wait 12 months and get the new 400 Mhz chip instead of your current 133.
>
>I really need to change jobs. It sounds so much simpler to build
>software for ground-based PCs, where you don't have to worry about the
>weight, power requirements, heat dissipation, physical size,
>vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
>
    I personally like the part about "performance is below theoretical
    expectations". Where I live, I have a 5 millisecond loop which
    *must* finish in 5 milliseconds. If it runs in 7 milliseconds, we
    will fail to close the loop in sufficient time to keep valves from
    "slamming into stops", causing them to break, rendering someone's
    billion dollar rocket and billion dollar payload "unserviceable".
    In this business, that's what *we* mean by "performance is below
    theoretical expectations" and why runtime checks which seem
    "trivial" to most folks can mean the difference between having a
    working system and having an interesting exercise in computer
    science which isn't going to go anywhere.
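
    For the curious, the general shape of such a frame, assuming Ada
    95's Ada.Real_Time (this is just the shape, not our actual code):

       --  A hard 5 mSec frame: if the body overruns, the deadline is
       --  simply missed - no language feature can stretch the frame.
       with Ada.Real_Time; use Ada.Real_Time;
       procedure Valve_Frames is
          Period : constant Time_Span := Milliseconds (5);
          Next   : Time := Clock;
       begin
          loop
             null;  --  close the control loop here (valve commands)
             Next := Next + Period;
             delay until Next;
          end loop;
       end Valve_Frames;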

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "Some people say a front-engine car handles best. Some people say
    a rear-engine car handles best. I say a rented car handles best."

        --  P. J. O'Rourke
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-02  0:00 ` Ken Garlington
  0 siblings, 1 reply; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-01  0:00 UTC (permalink / raw)



Robert A Duff <bobduff@WORLD.STD.COM> writes:
>Alan Brain  <aebrain@dynamite.com.au> wrote:
>>Brain's law:
>>"Software Bugs and Hardware Faults are no excuse for the Program not to
>>work".
>>
>>So: it costs peanuts, and may save your hide.
>
>This reasoning doesn't sound right to me.  The hardware part, I mean.
>The reason checks-on costs only 5% or so is that compilers aggressively
>optimize out almost all of the checks.  When the compiler proves that a
>check can't fail, it assumes that the hardware is perfect.  So, hardware
>faults and cosmic rays and so forth are just as likely to destroy the
>RTS, or cause the program to take a wild jump, or destroy the call
>stack, or whatever -- as opposed to getting a Constraint_Error and
>recovering gracefully.  After all, the compiler doesn't range-check the
>return address just before doing a return instruction!
>
    Typically, this is why you build dual-redundant systems. If a
    cosmic ray flips some bits in one processor causing bad data which
    does/does not get range-checked, then computer "A" goes crazy and
    computer "B" takes control. Hopefully they don't *both* get hit by
    cosmic rays at the same time.
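
    A toy sketch of the cross-channel compare (invented names; real
    redundancy management is considerably more involved):

       --  Accept the channels if they agree within a tolerance;
       --  otherwise declare a miscompare and fail over.
       function Agree (A, B : Float) return Boolean is
          Tolerance : constant Float := 0.01;
       begin
          return abs (A - B) <= Tolerance;
       end Agree;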

    The real danger is a common mode failure where a design flaw
    exists in the software used by both channels - they both see the
    same inputs and both make the same mistake. Of course trapping
    those exceptions doesn't necessarily guarantee success since
    either the exception handler or the desired accommodation could
    also be flawed and the flaw will, by definition, exist in both
    channels.

    If all you're protecting against is software design failures (not
    hardware failures) then obviously being able to analyze code and
    prove that a particular case can never happen should be sufficient
    to permit the removal of runtime checks.
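
    That's also why "checks on" is often cheaper than people fear. A
    sketch of a provable case, where a decent compiler emits no check
    at all:

       --  The loop index is drawn from the array's own index
       --  subtype, so the index check provably cannot fail.
       procedure Zero_Table is
          subtype Index is Integer range 1 .. 10;
          Table : array (Index) of Float;
       begin
          for I in Index loop
             Table (I) := 0.0;
          end loop;
       end Zero_Table;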

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "Some people say a front-engine car handles best. Some people say
    a rear-engine car handles best. I say a rented car handles best."

        --  P. J. O'Rourke
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00                     ` Ken Garlington
@ 1996-10-02  0:00                       ` Sandy McPherson
  0 siblings, 0 replies; 105+ messages in thread
From: Sandy McPherson @ 1996-10-02  0:00 UTC (permalink / raw)



Ken Garlington wrote:
> 
> Wayne L. Beavers wrote:
> >
> > Ken Garlington wrote:
> >
> > > That's actually a pretty common rule of thumb for safety-critical systems.
> > > Unfortunately, read-only memory isn't exactly read-only. For example, hardware errors
> > > can cause a random change in the memory. So, it's not a perfect fix.
> >
> >   You're right, but the risk and probability of memory failures is pretty low, I would think.  I have never seen
> > or heard of a memory failure in any of the systems that I have worked on.  I don't know what the current
> > technology is but I can remember quite awhile ago that at least one vendor was claiming that ALL double bit
> > memory errors were fully detectable and recoverable, ALL triple bit errors were detectable but only some were
> > correctable.  But I also don't work on realtime systems, my experience is with commercial systems.
> >
> >   Are you referring to on-board systems for aircraft, where weight and vibration are also a factor, or are you
> > referring to ground-based systems that don't have similar constraints?
> 
> On-board systems. The failure _rate_ is usually pretty low, but in a harsh environment
> you can get quite a few failure _sources_, including mechanical failures (stress
> fractures, solder loss due to excessive heat, etc.), electrical failures (EMI,
> lightening), and so forth. You don't have to take out the actual chip, of course: just
> as bad is a failure in the address or data lines connecting the memory to the CPU. Add
> a memory management unit to the mix, along with various I/O devices mapped into the
> memory space, and you can get a whole slew of memory-related failure modes.
> 
> You can also get into some neat system failures. For example, some "read-only" memory
> actually allows writes to the execution space in certain modes, to allow quick
> reprogramming. If you have a system failure that allows writes at the wrong time,
> coupled with a failure that does a write where it shouldn't...

It depends upon what you mean by a memory failure. I can imagine that
the chances of your memory being trashed completely are very, very low,
and that in rad-hardened systems the chance of a single-event upset (SEU) is
also low, but has to be guarded against. I have recently been working on
a system where the specified hardware has a parity bit for each octet of
memory, so SEUs which flip bit values in the memory can be detected.
This parity check is built into the system's micro-code. 

Similarly, the definition of what is and isn't read-only memory is
usually a feature of the processor and/or operating system being used. A
compiler cannot put code into read-only areas of memory unless the
processor, its microcode, and/or the OS are playing ball as well. If you
are unfortunate enough to be in this situation (are there any such
systems left?), then the only thing you can do is DIY, but the compiler
can't help you much, other than the "for ... use at" address clause.
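
For reference, the modern spelling of that clause looks something like
this (the address and the names are made up):

   --  Hypothetical: pin an object to a fixed hardware location, the
   --  Ada 95 form of the Ada 83 "for X use at ..." clause.
   with System.Storage_Elements;
   package HW_Map is
      Status : Integer;
      for Status'Address use
         System.Storage_Elements.To_Address (16#FF00#);
      pragma Volatile (Status);
   end HW_Map;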

I once read an interesting definition of two types of bugs in
"transaction processing" by Gray & Reuter, Heisenbugs and Bohrbugs. 

Identification of potential Heisenbugs, estimation of their probability
of occurrence, their impact on the system when they occur, and the
appropriate recovery procedures are part of the risk analysis. An SEU is
a classic Heisenbug, which IMO is out of scope for compiler checks,
because it can result in a valid but incorrect value for a variable and
is just as likely to occur in the code section as in the data section of
your application. A complete memory failure is of course beyond the
scope of the compiler.

IMO an Ada compiler's job (when used properly) is to make sure that
syntactic Bohrbugs do not enter a system and that all semantic Bohrbugs
get detected at runtime (Bohrbugs, by definition, have a fixed location
and are certain to occur under given conditions - the Ariane 5 bug was
definitely a Bohrbug). The compiler cannot do anything about Heisenbugs
(because they only have a probability of occurrence). To handle
Heisenbugs you generally need a detection, reporting and handling
mechanism, built using the hardware's error detection, generally
accepted software practices (e.g. duplicate storage, process-pairs) and
an application-dependent exception handling mechanism. Ada provides the
means to trap the error condition once it has been reported, but it does
not implement exception handlers for you, other than the default "I'm
gone..."; additionally, if the underlying system does not provide the
means to detect a probable error, you have to implement the means of
detecting the problem and reporting it through the Ada exception
handling yourself. 
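
The shape I mean, as a minimal sketch - the handlers and their
policies are the application's to write, and invented here:

   --  Ada delivers the exception; deciding what "recover" means is
   --  entirely up to the application.
   procedure Step_With_Recovery is
      procedure Run_Step           is begin null; end Run_Step;
      procedure Use_Last_Good_Data is begin null; end Use_Last_Good_Data;
   begin
      Run_Step;
   exception
      when Constraint_Error =>
         Use_Last_Good_Data;   --  report and continue; don't just die
      when others =>
         raise;                --  unknown state: let the next level act
   end Step_With_Recovery;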


-- 
Sandy McPherson	MBCS CEng.	tel: 	+31 71 565 4288 (w)
ESTEC/WAS
P.O. Box 299
NL-2200AG Noordwijk




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-02  0:00 ` Alan Brain
  1996-10-02  0:00   ` Ken Garlington
  0 siblings, 1 reply; 105+ messages in thread
From: Alan Brain @ 1996-10-02  0:00 UTC (permalink / raw)



Marin David Condic, 407.796.8997, M/S 731-93 wrote:
> 
> Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:

> >I really need to change jobs. It sounds so much simpler to build
> >software for ground-based PCs, where you don't have to worry about the
> >weight, power requirements, heat dissipation, physical size,
> >vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
> >

The particular system I was talking about was for a Submarine. Very
tight constraints indeed, on power (it was a diesel sub), physical size
(had to fit in a torpedo hatch), heat dissipation (a bit), vulnerability
to 100% humidity, salt, chlorine etc etc. Been there, Done that, Got the
T-shirt.

I'm a Software Engineer who works mainly in Systems. Or maybe a Systems
Engineer with a hardware bias. Regardless, in the initial Systems
Engineering phase, when one gets all the HWCIs and CSCIs defined, it is
only good professional practice to build in plenty of slack. If the
requirement is to fit in a 21" hatch, you DON'T design something that's
20.99999" wide. If you can, make it 16", 18 at max. It'll probably grow.
Similarly, if you require a minimum of 25 MFlops, make sure there's a
growth path to at least 100. It may well be less expensive and less
risky to build a chip factory to make a faster CPU than to lose a
rocket, or a sub, due to a software failure that could have been
prevented. Usually such ridiculously extreme measures are not necessary.
The Hardware guys bitch about the cost-per-CPU going through the roof.
Heck, it could cost $10 million. But if it saves 2 years of Software
effort, that's a net saving of $90 million. (All numbers are
representative, ie plucked out of mid-air, and as you USAians say, Your
Mileage May Vary.)
  
>     I personally like the part about "performance is below theoretical
>     expectations". Where I live, I have a 5 millisecond loop which
>     *must* finish in 5 milliseconds. If it runs in 7 milliseconds, we
>     will fail to close the loop in sufficient time to keep valves from
>     "slamming into stops", causing them to break, rendering someone's
>     billion dollar rocket and billion dollar payload "unserviceable".
>     In this business, that's what *we* mean by "performance is below
>     theoretical expectations" and why runtime checks which seem
>     "trivial" to most folks can mean the difference between having a
>     working system and having an interesting exercise in computer
>     science which isn't going to go anywhere.

In this case, "theoretical expectations" for a really tight 5 MuSec loop
should be less than 1 MuSec. Yes, I'm dreaming. OK, 3 MuSec, that's my
final offer. For the vast majority of cases, if your engineering is
closer to
the edge than that, it'll cost big bucks to fix the over-runs you always
get.

Typical example: I had a big bun-fight with project management about a
hefty data transfer rate required for a broadband sonar. They wanted to
hand-code the lot in assembler, as the requirements were really, really
tight. No time for any of this range-check crap, the data was always
good. I eventually threw enough of a professional tantrum to wear down
even a group of German Herr Professor Doktors, and we did it in Ada-83.
If only as a first pass, to see what the rate really would be.
The spec called for 160 MB/Sec. First attempt was 192 MB/Sec, and after
some optimisation, we got over 250. After the hardware flaws were fixed
(the ones the "un-necessary" range-bound checking detected) this was
above 300. Now that's too close for my druthers, but even 161 I could
live with. Saved maybe 16 months on the project, about 100 people at
$15K a month. After the transfer, the data really was trustworthy -
which saved a lot of time downstream on the applications in debug time.
Note that even with (minor) hardware flaws, the system still worked.
Note also that by paying big $ for more capable hardware than strictly
necessary, you can save bigger $ on the project.
Many projects spend many months and many $ Million to fix, by hacking,
kludging, and sheer Genius, what a few lousy $100K of extra hardware
cost would make un-necessary. A good software engineer in the
Risk-management team, and on the Systems Engineering early on, one with
enough technical nous in hardware to know what's feasible, enough
courage to cost the firm millions in initial costs, and enough power to
make it stick, that's what's necessary. I've seen it; it works.

But it's been tried less than a dozen times in 15 years in my experience
:(
  
----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-02  0:00 ` Robert I. Eachus
  1996-10-02  0:00   ` Ken Garlington
  1996-10-02  0:00 ` Matthew Heaney
  1 sibling, 1 reply; 105+ messages in thread
From: Robert I. Eachus @ 1996-10-02  0:00 UTC (permalink / raw)



In article <96100111162774@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <condicma@PWFL.COM> writes:

   Marin David Condic

   > It's not a case of saving a few CPU cycles so you can run Space
   > Invaders in the background. Quite often (and in particular in
   > *space* systems which are limited to rather antiquated
   > processors) the decision is to a) remove the runtime checks from
   > the compiled image and run with the possible risk of undetected
   > constraint errors, etc. or b) give up and go home because there's
   > no way you are going to squeeze the necessary logic into the box
   > you've got with all the checks turned on.

   > It's not as if we take these decisions lightly and are just being
   > stingy with CPU cycles so we can save them up for our old age. We
   > remove the checks typically because there's no other choice.

   In this case though, management threw out the baby with the
bathwater.  To preserve a 20% margin in the presence of a kludge
already known to be applicable only to the Ariane 4, they took out
checks that would be vital if the kludge ran on the Ariane 5, then
forgot to take the kludge out.

   The proper solution was to recognize in the performance specs that
the load was 81% or whatever until the inertial alignment software
shut down after launch.


--

					Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-02  0:00 ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-02  0:00 UTC (permalink / raw)



Marin David Condic, 407.796.8997, M/S 731-93 wrote:
> 
>     The real danger is a common mode failure where a design flaw
>     exists in the software used by both channels - they both see the
>     same inputs and both make the same mistake. Of course trapping
>     those exceptions doesn't necessarily guarantee success since
>     either the exception handler or the desired accommodation could
>     also be flawed and the flaw will, by definition, exist in both
>     channels.

The problem also exists if you have a common-mode _hardware_ failure (e.g.
a hardware design fault, or an external upset like lightning that hits
both together).

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00 ` Robert I. Eachus
@ 1996-10-02  0:00   ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-02  0:00 UTC (permalink / raw)



Robert I. Eachus wrote:
> 
> In article <96100111162774@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <condicma@PWFL.COM> writes:
> 
>    Marin David Condic
> 
>    > It's not a case of saving a few CPU cycles so you can run Space
>    > Invaders in the background. Quite often (and in particular in
>    > *space* systems which are limited to rather antiquated
>    > processors) the decision is to a) remove the runtime checks from
>    > the compiled image and run with the possible risk of undetected
>    > constraint errors, etc. or b) give up and go home because there's
>    > no way you are going to squeeze the necessary logic into the box
>    > you've got with all the checks turned on.
> 
>    > It's not as if we take these decisions lightly and are just being
>    > stingy with CPU cycles so we can save them up for our old age. We
>    > remove the checks typically because there's no other choice.
> 
>    In this case though, management threw out the baby with the
> bathwater.  To preserve a 20% margin in the presence of a kludge
> already known to be applicable only to the Ariane 4, they took out
> checks that would be vital if the kludge ran on the Ariane 5, then
> forgot to take the kludge out.

The critical part of this correct statement, of course, being "In this 
case..". In another context, this might have been the right decision.

It's also important to remember that Ariane 5 didn't exist when the 
Ariane 4 team made this decision. They may have been short-sighted, but 
they weren't idiots based on what they knew at the time.

The Ariane _5 management not doing sufficient re-analysis and re-test of 
this "off-the-shelf" system is, to me, much less excusable.

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00 ` Alan Brain
@ 1996-10-02  0:00   ` Ken Garlington
  1996-10-02  0:00     ` Matthew Heaney
  1996-10-03  0:00     ` Alan Brain
  0 siblings, 2 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-02  0:00 UTC (permalink / raw)
  To: aebrain


Alan Brain wrote:
> 
> Marin David Condic, 407.796.8997, M/S 731-93 wrote:
> >
> > Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:
> 
> > >I really need to change jobs. It sounds so much simpler to build
> > >software for ground-based PCs, where you don't have to worry about the
> > >weight, power requirements, heat dissipation, physical size,
> > >vulnerability to EMI/radiation/salt fog/temperature/etc. of your system.
> > >
> 
> The particular system I was talking about was for a Submarine. Very
> tight
> constraints indeed, on power (it was a diesel sub), physical size (had
> to
> fit in a torpedo hatch), heat dissipation (a bit), vulnerability to 100%
> humidity, salt, chlorine etc etc. Been there, Done that, Got the
> T-shirt.

So what did you do when you needed to build a system that was bigger than the
torpedo hatch? Re-design the submarine? You have physical limits that you just can't
exceed. On a rocket, or an airplane, you have even stricter limits.

Oh for the luxury of a diesel generator! We have to be able to operate on basic
battery power (and we share that bus with emergency lighting, etc.)

> I'm a Software Engineer who works mainly in Systems. Or maybe a Systems
> Engineer with a hardware bias. Regardless, in the initial Systems
> Engineering
> phase, when one gets all the HWCIs and CSCIs defined, it is only good
> professional practice to build in plenty of slack. If the requirement is
> to fit
> in a 21" hatch, you DON'T design something that's 20.99999" wide. If you
> can,
> make it 16", 18 at max. It'll probably grow.

Exactly. You build a system that has slack. Say, 15% slack. Which is exactly
why the INU design team didn't want to add checks unless they had to. Because
they were starting to eat into that slack.

> Similarly, if you require a
> minimum
> of 25 MFlops, make sure there's a growth path to at least 100. It may
> well be less
> expensive and less risky to build a chip factory to make a faster CPU
> than to
> lose a rocket, or a sub due to software failure that could have been
> prevented.

What if your brand new CPU requires more power than your diesel generator
can generate?

What if your brand new CPU requires a technology that doesn't let you meet
your heat dissipation?

Doesn't sound like you had to make a lot of tradeoffs in your system.
Unfortunately, airborne systems, particular those that have to operate in
lower-power, zero-cooling situations (amazing how hot the air gets around
Mach 1!), don't have such luxuries.

> Usually such ridiculously extreme measures are not neccessary. The
> Hardware guys
> bitch about the cost-per-CPU going through the roof. Heck, it could cost
> $10 million.
> But if it saves 2 years of Software effort, that's a net saving of $90
> million.

What does maintenance costs have to do with this discussion?

> In this case, "theoretical expectations" for a really tight 5 MuSec loop
> should be less than 1 MuSec. Yes, I'm dreaming. OK, 3 MuSec, that's my
> final offer. For the vast majority of cases, if your engineering is
> closer to
> the edge than that, it'll cost big bucks to fix the over-runs you always
> get.

I've never had a project yet where we didn't routinely cut it that fine,
and we've yet to spend the big bucks. If you're used to developing systems
with those kind of constraints, you know how to make those decisions.
Occasionally, you make the wrong decision, as the Ariane designers discovered.
Welcome to engineering.

> Typical example: I had a big bun-fight with project management about a
> hefty
> data transfer rate required for a broadband sonar. They wanted to
> hand-code
> the lot in assembler, as the requirements were really, really tight. No
> time
> for any of this range-check crap, the data was always good.
> I eventually threw enough of a professional tantrum to wear down even a
> group
> of German Herr Professor Doktors, and we did it in Ada-83. If only as a
> first
> pass, to see what the rate really would be.
> The spec called for 160 MB/Sec. First attempt was 192 MB/Sec, and after
> some optimisation, we got over 250. After the hardware flaws were fixed
> (the ones
> the "un-neccessary" range-bound checking detected ) this was above 300.

And, if you had only got 20MB per second after all that, you would have
done...?

Certainly, if you just throw out range checking without knowing its cost,
you're an idiot. However, no one has shown that the Ariane team did this.
I guarantee you (and am willing to post object code to prove it) that
range checking is not always zero cost, and in the right circumstances can
cause you to bust your budget.

> Note also
> that by paying big $ for more capable hardware than strictly neccessary,
> you
> can save bigger $ on the project.

Unfortunately, cost is not the only controlling variable.

Interesting that a $100K difference in per-unit cost in your systems is
negligible. No wonder people think military systems are too expensive!

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-02  0:00 ` Robert I. Eachus
@ 1996-10-02  0:00 ` Matthew Heaney
  1996-10-04  0:00   ` Ken Garlington
  1 sibling, 1 reply; 105+ messages in thread
From: Matthew Heaney @ 1996-10-02  0:00 UTC (permalink / raw)



In article <96100111162774@psavax.pwfl.com>, "Marin David Condic,
407.796.8997, M/S 731-93" <condicma@PWFL.COM> wrote:

>    It's not a case of saving a few CPU cycles so you can run Space
>    Invaders in the background. Quite often (and in particular in
>    *space* systems which are limited to rather antiquated
>    processors) the decision is to a) remove the runtime checks from
>    the compiled image and run with the possible risk of undetected
>    constraint errors, etc. or b) give up and go home because there's
>    no way you are going to squeeze the necessary logic into the box
>    you've got with all the checks turned on.
>
>    It's not as if we take these decisions lightly and are just being
>    stingy with CPU cycles so we can save them up for our old age. We
>    remove the checks typically because there's no other choice.

Funny you mention that, because I would have said take option b.  My
attitude is that there is a state of the art today, and it's not cost
effective to try to push too far beyond that.

I'm not unsympathetic to your situation, as my own background is in
real-time (ground-based) systems.  But when you try to push the technology
envelope beyond what is (easily) available today, the cost of your system
and the risk of failure shoots up.

To do what you wanted to do with your existing hardware meant you had to
turn off checks.  Fair enough.  But that decision very much increased your
risk that something bad would happen from which you wouldn't be able to
recover.

I heard those satellites cost $500 million.  Was turning off those
checks really worth the risk of losing that much money?  To me you were
just gambling.

I would have said that, no, the risk is too great.  Scale back the
requirements and let's do something less ambitious.  If you really want to
do that, wait 18 months and Dr. Moore will give you hardware that's twice
as fast.  But if you want to do it today, and you have turn the checks off,
well then, you're just rolling the dice.

The state of software art today is such that we can't deploy a provably
correct system, and we have to resort to run-time checks to catch logical
flaws.  I accept this "limitation," and I accept that there are certain
kinds of systems we can't do today (because to do them would require
turning off checks).

Buyers of mission-critical software should think very carefully before they
commit any financial resources to implementing a software system that
requires checks be turned off.  I'd say take your money instead to Las
Vegas: your odds for success are better there.

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mheaney@ni.net
(818) 985-1271




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00   ` Ken Garlington
@ 1996-10-02  0:00     ` Matthew Heaney
  1996-10-04  0:00       ` Robert S. White
  1996-10-03  0:00     ` Alan Brain
  1 sibling, 1 reply; 105+ messages in thread
From: Matthew Heaney @ 1996-10-02  0:00 UTC (permalink / raw)



In article <3252B46C.5E9D@lmtas.lmco.com>, Ken Garlington
<garlingtonke@lmtas.lmco.com> wrote:

>Interesting that a $100K difference in per-unit cost in your systems is
>negligible. No wonder people think military systems are too expensive!

I think he meant "negligible compared to the programming cost that would be
required to get the software to run on the cheaper hardware."

It's never cost effective to skimp on hardware if it means human
programmers have to write more complex software.

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mheaney@ni.net
(818) 985-1271




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-30  0:00               ` Wayne L. Beavers
  1996-10-01  0:00                 ` Ken Garlington
@ 1996-10-03  0:00                 ` Richard A. O'Keefe
  1 sibling, 0 replies; 105+ messages in thread
From: Richard A. O'Keefe @ 1996-10-03  0:00 UTC (permalink / raw)



"Wayne L. Beavers" <wayneb@beyond-software.com> writes:

>I have been reading this thread awhile and one topic that I have not
>seen mentioned is protecting the code area from damage.

I imagine that everyone else has taken this for granted.
UNIX compilers have been doing it for years, and so I believe have VMS ones.

>When I code in PL/I or any other reentrant language I always make sure
>that the executable code is executing from read-only storage.

(a) This is not something that the programmer should normally have to be
    concerned with, it just happens.
(b) It cannot always be done. Run-time code generation is a practical and
    important technique.  (Making a page read-only after new code has been
    written to it is a good idea, of course.)

>There is no way to put the data areas in read-only storage (obviously)

It may be obvious, but in important cases it isn't true.
UNIX (and I believe VMS) compilers have for years had the ability to put
_selected_ data in read-only storage.  And of course it is perfectly
feasible in many operating systems (certainly UNIX and VMS) to write data
into a page and then ask the operating system to make that page read-only.
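
On UNIX that last step is the mprotect() system call; a sketch of
calling it from Ada follows (mprotect is real, while the wrapper and
the page-alignment assumption are mine):

   --  Ask the OS to make an already page-aligned region read-only.
   --  PROT_READ's value should be checked against <sys/mman.h>.
   with Interfaces.C; use Interfaces.C;
   with System;
   procedure Freeze_Page (Page : System.Address; Length : size_t) is
      PROT_READ : constant int := 1;
      function mprotect (Addr : System.Address;
                         Len  : size_t;
                         Prot : int) return int;
      pragma Import (C, mprotect, "mprotect");
   begin
      if mprotect (Page, Length, PROT_READ) /= 0 then
         raise Program_Error;  --  the OS refused the protection change
      end if;
   end Freeze_Page;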

>but I can't think of any reason to put the executable code in writeable
>storage.

Run-time binary translation.  Some approaches to relocation.  How many
reasons do you want?

>I once had to port 8,000 subroutines in PL/I, 24 megabytes of executable
>code from one system to another.

In a language where the last revision of the standard was 1976?
You have my deepest sympathy.

-- 
Australian citizen since 14 August 1996.  *Now* I can vote the xxxs out!
Richard A. O'Keefe; http://www.cs.rmit.edu.au/%7Eok; RMIT Comp.Sci.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00   ` Ken Garlington
  1996-10-02  0:00     ` Matthew Heaney
@ 1996-10-03  0:00     ` Alan Brain
  1996-10-04  0:00       ` Ken Garlington
  1 sibling, 1 reply; 105+ messages in thread
From: Alan Brain @ 1996-10-03  0:00 UTC (permalink / raw)



Ken Garlington wrote:

> So what did you do when you needed to build a system that was bigger than the
> torpedo hatch? Re-design the submarine? 

Nope, we re-designed the system so it fit anyway. Actually, we designed
the thing in the first place so that the risk of it physically growing
too big and needing re-design was tolerable (ie contingency money was
allocated for doing this, if we couldn't accurately estimate the risk as
being small).

> Oh for the luxury of a diesel generator! We have to be able to operate on basic
> battery power (and we share that bus with emergency lighting, etc.)

Well ours had a generator connected to a hamster wheel with a piece of
cheese as backup ;-).... but seriously folks, yes we have a diesel. Why?
to charge the batteries. Use of the Diesel under many conditions - eg
when taking piccies in Vladivostok Harbour - would be unwise.
 
> Exactly. You build a system that has slack. Say, 15% slack. Which is exactly
> why the INU design team didn't want to add checks unless they had to. Because
> they were starting to eat into that slack.

I'd be very, very suspicious of a slack like "15%". This implies you
know to within 2 significant figures what the load is going to be. Which
in my experience is not the case. "About a Seventh" is more accurate, as
it implies more imprecision. And I'd be surprised if any Bungee-Jumper
would tolerate that small amount of safety margin using new equipment.  
Then again, slack is supposed to be used up. It's for the unforeseen.
When you come across a problem during development, you shouldn't be
afraid of using up that slack, that's what it's there for! One is
reminded of the apocryphal story of the quartermaster at Pearl Harbor,
who refused to hand out ammunition as it could have been needed more
later. 

> What if your brand new CPU requires more power than your diesel generator
> can generate? 
> What if your brand new CPU requires a technology that doesn't let you meet
> your heat dissipation?

But it doesn't. When you did your initial systems engineering, you made
sure there was enough slack - OR had enough contingency money so that
you could get custom-built stuff.
 
> Doesn't sound like you had to make a lot of tradeoffs in your system.
> Unfortunately, airborne systems, particular those that have to operate in
> lower-power, zero-cooling situations (amazing how hot the air gets around
> Mach 1!), don't have such luxuries.

I see your zero-cooling situations, and I raise you H2, CO2, CO, Cl, H3O
conditions etc. The constraints on a sub are different, but the same in
scope. Until such time as you do work on a sub, or I do more than just a
little work on aerospace, we may have to leave it at that.
 
> > Usually such ridiculously extreme measures are not neccessary. The
> > Hardware guys
> > bitch about the cost-per-CPU going through the roof. Heck, it could cost
> > $10 million.
> > But if it saves 2 years of Software effort, that's a net saving of $90
> > million.
> 
> What does maintenance costs have to do with this discussion?

Sorry I didn't make myself clear: I was talking development costs, not
maintenance.

> I've never had a project yet where we didn't routinely cut it that fine,
> and we've yet to spend the big bucks.

Then I guess either a) You're one heck of a better engineer than me (and
I freely admit the distinct possibility) or b) You've been really lucky
or c) You must tolerate a lot more failures than the organisations I've
worked for.

>  If you're used to developing systems
> with those kind of constraints, you know how to make those decisions.
> Occasionally, you make the wrong decision, as the Ariane designers discovered.
> Welcome to engineering.

My work has only killed 2 people (Iraqi pilots - that particular system
worked as advertised in the Gulf). There might be as many as 5000 people
whose lives depend on my work at any time, more if War breaks out. I
guess we have a different view of "acceptable losses" here, and your
view may well be more correct. Why? Because such a conservative view as
my own may mean I just can't attempt some risky things. Things which
your team (sometimes at least) gets working, thereby saving more lives.
Yet I don't think so.  

> And, if you had only got 20MB per second after all that, you would have
> done...?

20 MB? First, re-check all calculations. Examine hardware options. Then
(probably) set up a "get-well" program using 5-6 different tracks and
pick the best. Most probably though, we'd give up: it's not doable
within the budget. The difficult case is 150 MB. In this case, assembler
coding might just make the difference - I do get your point, BTW.
 
> Certainly, if you just throw out range checking without knowing its cost,
> you're an idiot. However, no one has shown that the Ariane team did this.
> I guarantee you (and am willing to post object code to prove it) that
> range checking is not always zero cost, and in the right circumstances can
> cause you to bust your budget.

Agree. There's always pathological cases where general rules don't
apply. Being fair, I didn't say "zero cost", I said "typically 5%
measured". In doing the initial Systems work, I'd usually budget for
10%, as I'm paranoid.
 
> Unfortunately, cost is not the only controlling variable.
> 
> Interesting that a $100K difference in per-unit cost in your systems is
> negligible. No wonder people think military systems are too expensive!

You get what you pay for, IF you're lucky. My point though is that many
of the hacks, kludges etc in software are caused by insufficient
foresight in systems design. Case in point: RAN Collins class submarine.
Now many years late due to software problems. Last time I heard, they're
still trying to get that last 10% performance out of the 68020s on the
cards. Which were leading-edge when the systems work was done. Putting
in 68040s a few years ago would have meant the Software would have been
complete by now, as the hacks wouldn't have been neccessary.

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  0 siblings, 0 replies; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-03  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:
>So what did you do when you needed to build a system that was bigger than the
>torpedo hatch? Re-design the submarine? You have physical limits that
>you just can't exceed. On a rocket, or an airplane, you have even
>stricter limits.
>
>Oh for the luxury of a diesel generator! We have to be able to operate on basic
>battery power (and we share that bus with emergency lighting, etc.)
>
    Just as you have physical limits and need to leave physical
    margins, software has timing limits and needs to leave timing
    margins. Both to accommodate the inevitable change and growth as
    production units are fielded, but also as a *safety* factor. What
    would happen to the Ariane 5 if that 80% utilization went to 105%
    because the software hit an untested "corner case"? It's a good
    reason to insist on leaving some margin.

    You have emergency lighting? Lucky dog!

>What if your brand new CPU requires more power than your diesel generator
>can generate?
>
>What if your brand new CPU requires a technology that doesn't let you meet
>your heat dissipation?
>
>Doesn't sound like you had to make a lot of tradeoffs in your system.
>Unfortunately, airborne systems, particular those that have to operate in
>lower-power, zero-cooling situations (amazing how hot the air gets around
>Mach 1!), don't have such luxuries.
>
    You get zero-cooling? Lucky dog! My box just keeps getting hotter
    and hotter until it burns up. Hopefully *after* the mission is
    over.

    You get *air???!*! And never mind that Mach 1 stuff - my box is
    strapped to the side of a blow-torch!

    You're absolutely right about the engineering tradeoffs - In
    flight systems especially since the biggest constraint is
    typically weight & space. (Two commodities that are *much* easier
    to compromise on when you get to sit on the ground - or sink under
    the ocean) I'd gladly give my eye teeth to get double the CPU
    speed I've got. Unfortunately, this is the best that can be done
    within the current CPU technology and adding a second processor is
    out of the question at this time: The box can't get heavier or
    bigger without risking payload, power consumption and heat
    dissipation go up, etc. etc. etc. If it weren't for the megabucks
    and the chance to meet chicks, I'd quit the engineering business
    because of the headaches.

>And, if you had only got 20MB per second after all that, you would have
>done...?
>
    Anyone can afford to be a purist right up to the point where they
    have to tell their boss that they're at 105% utilization and that
    the project they've invested millions on won't work. At that
    point, you start looking at what you might inline to avoid
    procedure call overhead, recode sections in assembler because you
    can be smarter at it than the compiler, and yes, remove all those
    extraneous runtime checks and prove out your code instead.

>Certainly, if you just throw out range checking without knowing its cost,
>you're an idiot. However, no one has shown that the Ariane team did this.
>I guarantee you (and am willing to post object code to prove it) that
>range checking is not always zero cost, and in the right circumstances can
>cause you to bust your budget.
>
    Amen! Let's say you have 20 computations. Let's say that the
    runtime checks cost 5uSec per computation. (Not unrealistic on many
    processors where the average instruction uses 0.5 to 1.0uSec.)
    That's 100uSec in total. Suppose this code needs to run once every
    1mSec. Your runtime checks just consumed 10% of your CPU.

    We did *exactly* this sort of analysis (both bench checking and
    running sample code) and concluded that the runtime checks were
    out or the project wouldn't work. And we're using one of the
    *best* Ada compilers available for the 1750a - the EDS-Scicon
    XD-Ada compiler.
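
    The "running sample code" half is easy to reproduce. A rough
    sketch, using the Ada 95 clock for convenience (on a bare 1750a
    board you'd read a hardware timer instead); build it once with
    checks on and once with them suppressed, and compare:

       --  Times a representative computation; print the result so
       --  the loop isn't optimized away entirely.
       with Ada.Real_Time; use Ada.Real_Time;
       with Ada.Text_IO;
       procedure Bench_Checks is
          subtype Percent is Integer range 0 .. 100;
          V     : Percent := 0;
          Start : constant Time := Clock;
       begin
          for I in 1 .. 1_000_000 loop
             V := (V + 7) mod 101;  --  range check lands on this store
          end loop;
          Ada.Text_IO.Put_Line
            (Duration'Image (To_Duration (Clock - Start)) &
             Integer'Image (V));
       end Bench_Checks;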

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    Glendower: "I can call spirits from the vasty deep."
    Hotspur: "Why so can I, or so can any man; but will they come when
    you do call for them?"

        -- Shakespeare, "Henry IV"
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  0 siblings, 0 replies; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-03  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:
>> Brain's law:
>> "Software Bugs and Hardware Faults are no excuse for the Program not to
>> work".
>
>Too bad that law can't be enforced :)
>
    Yup! Hardware faults - such as a CPU out to lunch - can pretty
    much be impossible to fix with the software that's running on it.
    As for software faults, isn't it a little like being in the
    "Physician, heal thyself!" mode? I am insane - let me diagnose and
    cure my own insanity... But being insane, can I know that my
    diagnosis and/or cure isn't also insane? A bit of a paradox, no?

    Yes, yes, yes. Exception handlers and so on can do a remarkable
    job of catching problems and fixing them. But out of the set of
    all possible software bugs, there is a non-empty set containing
    software bugs which mean your program has gone insane.

    You can only accommodate the bugs and/or faults which you can
    think of. What about the few hundred bugs/faults you *didn't*
    think of? Bet your donkey that they're going to happen someday,
    somewhere, and the only way you're going to learn about them is by
    having them rear their ugly heads. Ask the engineers who designed
    the Tacoma Narrows Bridge or the O-rings on the space shuttle
    about it.

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    Glendower: "I can call spirits from the vasty deep."
    Hotspur: "Why so can I, or so can any man; but will they come when
    you do call for them?"

        -- Shakespeare, "Henry IV"
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  0 siblings, 0 replies; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-03  0:00 UTC (permalink / raw)



Ken Garlington <garlingtonke@LMTAS.LMCO.COM> writes:
>Wayne L. Beavers wrote:
>>
>> I have been reading this thread awhile and one topic that I have not seen
>> mentioned is protecting the code area from damage.  When I code in PL/I or
>> any other reentrant language I always make sure that the executable code is
>> executing from read-only storage.  There is no way to put the data areas in
>> read-only storage (obviously) but I can't think of any reason to put the
>> executable code in writeable storage.
>
>That's actually a pretty common rule of thumb for safety-critical systems.
>Unfortunately, read-only memory isn't exactly read-only. For example,
>hardware errors can cause a random change in the memory. So, it's not a
>perfect fix.
>
    Actually there is a reason for sucking the code out of EEPROM and
    into RAM. EEPROMs (as I understand what the hardware dweebs tell
    me) are unusually susceptible to single-event upsets (SEUs) if you
    have lots of gamma radiation hanging around in the neighborhood.
    RAMs, on the other hand, are easier to make rad-hard and survive
    this stuff better.

    This poses problems for us software geeks to solve when creating
    the bootstrap, but there are apparently good engineering reasons
    for doing so. It would be nice if we could simply put an S.E.P.
    Field (S.omebody E.lse's P.roblem) around the hardware issues, but
    once in a while the software guys have to bail out the hardware
    guys because of physics.

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    Glendower: "I can call spirits from the vasty deep."
    Hotspur: "Why so can I, or so can any man; but will they come when
    you do call for them?"

        -- Shakespeare, "Henry IV"
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00   ` John McCabe
  1996-10-01  0:00     ` Michael Dworetsky
@ 1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
                         ` (2 more replies)
  1 sibling, 3 replies; 105+ messages in thread
From: @@           robin @ 1996-10-04  0:00 UTC (permalink / raw)



	john@assen.demon.co.uk (John McCabe) writes:

	>Just a point for your information. From clari.tw.space:

	>	 "An inquiry board investigating the explosion concluded in  
	>July that the failure was caused by software design errors in a 
	>guidance system."

	>Note software DESIGN errors - not programming errors.

	>Best Regards
	>John McCabe <john@assen.demon.co.uk>

---If you read the Report, you'll see that that's not the case.
This is what the report says:


    "* The internal SRI software exception was caused during execution of a
     data conversion from 64-bit floating point to 16-bit signed integer
     value. The floating point number which was converted had a value
     greater than what could be represented by a 16-bit signed integer.
     This resulted in an Operand Error. The data conversion instructions
     (in Ada code) were not protected from causing an Operand Error,
     although other conversions of comparable variables in the same place
     in the code were protected.

    "In the failure scenario, the primary technical causes are the Operand Error
    when converting the horizontal bias variable BH, and the lack of protection
    of this conversion which caused the SRI computer to stop."

---As you can see, it's clearly a programming error.  It's a failure
to check for overflow on converting a double precision value to
a 16-bit integer.
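
In Ada terms, the two cases are easy to sketch (the type and variable
names below are invented; the report only names the variable BH, and
Long_Float is assumed here to be the 64-bit type):

   type Bias_Counts is range -2**15 .. 2**15 - 1;  -- 16-bit signed
   Horizontal_Bias  : Long_Float;  -- the 64-bit value to convert
   BH               : Bias_Counts;

   --  Unprotected, as flown: with checks suppressed, the bare machine
   --  conversion traps with an Operand Error when the value is out of
   --  range (with checks enabled, Ada would raise Constraint_Error).
   BH := Bias_Counts (Horizontal_Bias);

   --  Protected, as the report says comparable conversions nearby were:
   if Horizontal_Bias > Long_Float (Bias_Counts'Last) then
      BH := Bias_Counts'Last;    -- saturate instead of trapping
   elsif Horizontal_Bias < Long_Float (Bias_Counts'First) then
      BH := Bias_Counts'First;
   else
      BH := Bias_Counts (Horizontal_Bias);
   end if;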




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
@ 1996-10-04  0:00       ` Michel OLAGNON
  1996-10-09  0:00         ` @@           robin
  1996-10-17  0:00       ` Ralf Tilch
  2 siblings, 1 reply; 105+ messages in thread
From: Michel OLAGNON @ 1996-10-04  0:00 UTC (permalink / raw)



In article <532k32$r4r@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
>	john@assen.demon.co.uk (John McCabe) writes:
>
>	>Just a point for your information. From clari.tw.space:
>
>	>	 "An inquiry board investigating the explosion concluded in  
>	>July that the failure was caused by software design errors in a 
>	>guidance system."
>
>	>Note software DESIGN errors - not programming errors.
>
>	>Best Regards
>	>John McCabe <john@assen.demon.co.uk>
>
>---If you read the Report, you'll see that that's not the case.
>This is what the report says:
>
>    "* The internal SRI software exception was caused during execution of a
>     data conversion from 64-bit floating point to 16-bit signed integer
>     value. The floating point number which was converted had a value
>     greater than what could be represented by a 16-bit signed integer.
>     This resulted in an Operand Error. The data conversion instructions
>     (in Ada code) were not protected from causing an Operand Error,
>     although other conversions of comparable variables in the same place
>     in the code were protected.
>
>    "In the failure scenario, the primary technical causes are the Operand Error
>    when converting the horizontal bias variable BH, and the lack of protection
>    of this conversion which caused the SRI computer to stop."
>
>---As you can see, it's clearly a programming error.  It's a failure
>to check for overflow on converting a double precision value to
>a 16-bit integer.

But if you read a bit further on, it is stated that

    The reason why three conversions, including the horizontal bias variable one,
    were not protected, is that it was decided that they were physically bounded
    or had a wide safety margin (...) The decision was a joint one of the project
    partners at various contractual levels.

Deciding at various contractual levels is not what one usually means by
``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
can give any word any meaning.
And it may well be that the action taken in case of a protected conversion,
and an exception, would also have been to stop the SRI computer, because such
a high horizontal bias would have meant that it was broken....

Michel

-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-01  0:00     ` Michael Dworetsky
@ 1996-10-04  0:00       ` Steve Bell
  1996-10-07  0:00         ` Ken Garlington
  1996-10-09  0:00         ` @@           robin
  0 siblings, 2 replies; 105+ messages in thread
From: Steve Bell @ 1996-10-04  0:00 UTC (permalink / raw)



Michael Dworetsky wrote:
> 
> >Just a point for your information. From clari.tw.space:
> >
> >        "An inquiry board investigating the explosion concluded in
> >July that the failure was caused by software design errors in a
> >guidance system."
> >
> >Note software DESIGN errors - not programming errors.
> >
> 
> Indeed, the problems were in the specifications given to the programmers,
> not in the coding activity itself.  They wrote exactly what they were
> asked to write, as far as I could see from reading the report summary.
> 
> The problem was caused by using software developed for Ariane 4's flight
> characteristics, which were different from those of Ariane 5.  When the
> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
> it send an error message and, as specified by the remit given to
> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
> 

I work for an aerospace company, and we received a fairly detailed accounting
of what went wrong with the Ariane 5. Launch vehicles, while they are sitting
on the launch pad, run a guidance program that updates their position and
velocity in reference to a coordinate frame whose origin is at the center of
the earth (usually called an Earth-Centered-Inertial (ECI) frame). This
program is usually started up from 1 to 3-4 hours before launch and is allowed
to run all the way until liftoff, so that the rocket will know where it's at
and how fast it's going at liftoff. Although called "ground software" (because
it runs while the rocket is on the ground), it resides inside the rocket's
guidance computer(s), and for the Titan family of launch vehicles, the code is
exited at t=0 (liftoff). This code is designed with the knowledge that the
rocket is rotating on the surface of the earth, and the algorithms expect only
very mild accelerations (as compared to when the rocket hauls ass off the pad
at liftoff).

Well, the French do things a little differently (but probably now they
don't). The Ariane 4 and the first Ariane 5 allow(ed) this program to keep
running for 40 secs past liftoff. They do (did) this in case there are any
unanticipated holds in the countdown right close to liftoff. In this way,
this position and velocity updating code would *not* have to be reset if they
could get off the ground within just a few seconds of nominal. Well, it
appears that the Ariane 5 really hauls ass off the pad, because at about 30
secs, it was pulling some accelerations that caused floating point overflows
in the still functioning ground software.

The actual flight software (which was also running, naturally) was computing
the positions and velocities that were being used to actually fly the rocket,
and it was doing just fine - no overflow errors there because it was designed
to expect high accelerations. There are two flight computers on the Ariane 5
- a primary and a backup - and each was designed to shut down if an error
such as a floating point overflow occurred, thinking that the other one would
take over. Both computers were running the ground software, and both
experienced the floating point errors. Actually, the primary went belly-up
first, and then the backup within a fraction of a second later. With no
functioning guidance computer on board, well, ka-boom as you say.

Apparently the Ariane 4 gets off the ground with smaller accelerations than
the 5, and this never happened with a 4. You might take note that this would
never happen with a Titan, because we don't execute this ground software
after liftoff. Even if we did, we would have caught the floating point
overflows way before launch, because we run all code in what's called
"Real-Time Simulations," where actual flight hardware and software are
subjected to any and all known physical conditions. This was another finding
of the investigation board - apparently the French don't do enough of this
type of testing because it's real expensive. Oh well, they probably do now!

-- 
Clear skies,
Steve Bell
sb635@delphi.com
http://people.delphi.com/sb635 - Astrophoto page




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
@ 1996-10-04  0:00       ` Joseph C Williams
  1996-10-06  0:00         ` Wayne Hayes
  1996-10-04  0:00       ` Michel OLAGNON
  1996-10-17  0:00       ` Ralf Tilch
  2 siblings, 1 reply; 105+ messages in thread
From: Joseph C Williams @ 1996-10-04  0:00 UTC (permalink / raw)



Why didn't they run the code against an Ariane 5 simulator to
reverify the Ariane 4 software that was reused?  A good real-time
engineering simulation would have caught the problem.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-09-27  0:00           ` Lawrence Foard
@ 1996-10-04  0:00             ` @@           robin
  0 siblings, 0 replies; 105+ messages in thread
From: @@           robin @ 1996-10-04  0:00 UTC (permalink / raw)



	Lawrence Foard <entropy@vwis.com> writes:

	>Ronald Kunne wrote:

	>> Actually, this was the case here: the code was taken from an Ariane 4
	>> code where it was physically impossible that the index would go out
	>> of range: a test would have been a waste of time.

---A test for overflow in a system that aborts if unexpected overflow
occurs is never a waste of time.

   Recall Murphy's Law: "If anything can go wrong, it will."
Then there's Robert's Law: "Even if it can't go wrong, it will."

	>> Unfortunately this was no longer the case in the Ariane 5.

	>Actually it would still present a danger on Ariane 4. If the sensor
	>which apparently was no longer needed during flight became defective,
	>then you could get a value out of range.

---Good point Lawrence.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-03  0:00     ` Alan Brain
@ 1996-10-04  0:00       ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-04  0:00 UTC (permalink / raw)



Alan Brain wrote:
> 
> Ken Garlington wrote:
> 
> > So what did you do when you needed to build a system that was bigger than the
> > torpedo hatch? Re-design the submarine?
> 
> Nope, we re-designed the system so it fit anyway.

Tsk, tsk! You violated your own design constraint of "always provide enough
margin for growth." Just think how much money you would have saved if you had
built it bigger to begin with!

> Actually, we designed
> the thing in the first place so that the risk of it physically growing
> too big and needing re-design was tolerable (ie contingency money was
> allocated for doing this, if we couldn't accurately estimate the risk as
> being small).

I'm sure the Arianespace folks had the same contingency funding. In fact, they're
spending it right now. :)

> 
> > Oh for the luxury of a diesel generator! We have to be able to operate on basic
> > battery power (and we share that bus with emergency lighting, etc.)
> 
> Well ours had a generator connected to a hamster wheel with a piece of
> cheese as backup ;-).... but seriously folks, yes we have a diesel. Why?
> to charge the batteries.

Batteries, plural? Wow!

> I'd be very, very suspicious of a slack like "15%". This implies you
> know to within 2 significant figures what the load is going to be. Which
> in my experience is not the case. "About a Seventh" is more accurate, as
> it implies more imprecision. And I'd be surprised if any Bungee-Jumper
> would tolerate that small amount of safety margin using new equipment.
> Then again, slack is supposed to be used up. It's for the unforeseen.
> When you come across a problem during development, you shouldn't be
> afraid of using up that slack, that's what it's there for!

Actually, no. For most military programs, slack is for a combination of
growth _after_ the initial development, or for unforeseen variations in
the production system (e.g., a processor that's a little slower than spec.)
And, 15% is a common number for such slack.

I think you're confusing "slack" with "management reserve," which is usually
a number set by the development organization and used up (if needed) during
development. The 15% number is usually imposed by a prime on a subcontractor
for the reasons described above.

> > What if your brand new CPU requires more power than your diesel generator
> > can generate?
> > What if your brand new CPU requires a technology that doesn't let you meet
> > your heat dissipation?
> 
> But it doesn't. When you did your initial systems engineering, you made
> sure there was enough slack - OR had enough contingency money so that
> you could get custom-built stuff.

How much money is required to violate the laws of physics? _That's_ the
kind of limitations we're talking about when you get into power, cooling,
heat dissipation, etc.

> I see your zero-cooling situations, and I raise you H2, CO2, CO, Cl, H3O
> conditions etc. The constraints on a sub are different, but the same in
> scope. Until such time as you do work on a sub, or I do more than just a
> little work on aerospace, we may have to leave it at that.

But we _already_ have these same restrictions, since we have to operate in
Naval environments. We also have _extra_ requirements.

Considering that the topic of this thread is an aerospace system, I think
it's not enough to "leave it at that."

> 
> > > Usually such ridiculously extreme measures are not neccessary. The
> > > Hardware guys
> > > bitch about the cost-per-CPU going through the roof. Heck, it could cost
> > > $10 million.
> > > But if it saves 2 years of Software effort, that's a net saving of $90
> > > million.
> >
> > What does maintenance costs have to do with this discussion?
> 
> Sorry I didn't make myself clear: I was talking development costs, not
> maintenance.

Then you're not talking about inertial nav systems. On most of the projects
I've seen, the total software development time is two years or less. You're
not going to save 2 years of software effort for a new system!

> >  If you're used to developing systems
> > with those kind of constraints, you know how to make those decisions.
> > Occasionally, you make the wrong decision, as the Ariane designers discovered.
> > Welcome to engineering.
> 
> My work has only killed 2 people (Iraqi pilots - that particular system
> worked as advertised in the Gulf). There might be as many as 5000 people
> whose lives depend on my work at any time, more if War breaks out. I
> guess we have a different view of "acceptable losses" here, and your
> view may well be more correct.

You're missing the point. It's not a question of whether it's OK for the
system to fail. It's a question of humans having to make decisions that
don't include "well, if we throw enough money at it, we'll get everything we
want." You cannot optimize software development time and ignore all other
factors! In some cases, you have to compromise software development/maintenance
efficiencies to meet other requirements. Sometimes, you make the wrong
decision. Anyone who says they've always made the right call is a lawyer, not
an engineer.

> Why? Because such a conservative view as
> my own may mean I just can't attempt some risky things. Things which
> your team (sometimes at least) gets working, teherby saving more lives.

However, if you build a system with the latest and greatest CPU, thereby
having the maximum amount of horsepower to permit the software engineers
to avoid turning off certain checks, etc., you _have_ attempted a risky
thing. The latest hardware technology is the least used.

> Yet I don't think so.
> 
> > And, if you had only got 20MB per second after all that, you would have
> > done...?
> 
> 20 MB? First, re-check all calculations. Examine hardware options. Then
> (probably) set up a "get-well" program using 5-6 different tracks and
> pick the best. Most probably though, we'd give up: it's not doable
> within the budget.

That's the difference. We would not go to our management and say, "The
only solutions we have require us to make compromises in our software
approach, therefore it can't be done. Take your multi-billion project
and go home." We'd work with the other engineering disciplines to come
up with the best compromise. It's the difference, in my mind, between a
computer scientist and a software engineer. The software engineer is paid
to find a way to make it work -- even if (horrors) he has to write it in
assembly, or use Unchecked_Conversion, or whatever.


> The difficult case is 150 MB. In this case, assembler
> coding might just make the difference - I do get your point, BTW.
> 
> > Certainly, if you just throw out range checking without knowing its cost,
> > you're an idiot. However, no one has shown that the Ariane team did this.
> > I guarantee you (and am willing to post object code to prove it) that
> > range checking is not always zero cost, and in the right circumstances can
> > cause you to bust your budget.
> 
> Agree. There's always pathological cases where general rules don't
> apply. Being fair, I didn't say "zero cost", I said "typically 5%
> measured". In doing the initial Systems work, I'd usually budget for
> 10%, as I'm paranoid.

I've seen checks in just the wrong place that cause differences of 30% or
more in a high-rate process. It's just not that trivial.

> You get what you pay for, IF you're lucky. My point though is that many
> of the hacks, kludges etc in software are caused by insufficient
> foresight in systems design. 

And I wouldn't argue that. However, it's a _big_ leap to say ALL hacks
are caused by such problems. Also, having gone through the system design
process a few times, I've never had "sufficient foresight." There's always
been at least one choice I made then that I would have made differently today.
(Why didn't I see the obvious answer in 1985: HTML for my documentation! :)

That's why reuse is always so tricky in safety-critical systems. It's very
easy to make reasonable decisions then that don't make sense now. That's
why I laugh at people who say, "reused code is safer; you don't have to
test it once you get it working once!"

> Case in point: RAN Collins class submarine.
> Now many years late due to software problems. Last time I heard, they're
> still trying to get that last 10% performance out of the 68020s on the
> cards. Which were leading-edge when the systems work was done. Putting
> in 68040s a few years ago would have meant the Software would have been
> complete by now, as the hacks wouldn't have been neccessary.

68040s? I didn't think you could get mil-screened 68040s anymore. They're
already obsolete.

Not easy to make those foresighted decisions, is it? :)

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00 ` Matthew Heaney
@ 1996-10-04  0:00   ` Ken Garlington
  1996-10-05  0:00     ` Robert Dewar
  0 siblings, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-10-04  0:00 UTC (permalink / raw)



Matthew Heaney wrote:
> 
> Buyers of mission-critical software should think very carefully before they
> commit any financial resources to implementing a software system that
> requires checks be turned off.  I'd say take your money instead to Las
> Vegas: your odds for success are better there.

Better not drive or fly there: more than likely, the software systems running
in your car, plane, etc. are written in a language without any built-in
support for checks.

Checks are not a magic wand. They do not inherently make systems safer. What
matters is how you use the checks. If your ABS software fails in the middle
of winter, printing out a stack dump is not going to make you much safer!

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-02  0:00     ` Matthew Heaney
@ 1996-10-04  0:00       ` Robert S. White
  1996-10-05  0:00         ` Robert Dewar
  1996-10-05  0:00         ` Alan Brain
  0 siblings, 2 replies; 105+ messages in thread
From: Robert S. White @ 1996-10-04  0:00 UTC (permalink / raw)



In article <mheaney-ya023180000210962257430001@news.ni.net>, mheaney@ni.net 
says...

>It's never cost effective to skimp on hardware if it means human
>programmers have to write more complex software.

  Not if the ratio is tilted very heavily towards recurring cost versus
Non-Recurring Engineering (NRE).  How about 12 staff-months versus $300 extra 
hardware cost on 60,000 units?

___________________________________________________________________________
Robert S. White                    -- an embedded systems software engineer
WhiteR@CRPL.Cedar-Rapids.lib.IA.US -- It's long, but I pay for it!





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Robert S. White
  1996-10-05  0:00         ` Robert Dewar
@ 1996-10-05  0:00         ` Alan Brain
  1996-10-06  0:00           ` Robert S. White
  1 sibling, 1 reply; 105+ messages in thread
From: Alan Brain @ 1996-10-05  0:00 UTC (permalink / raw)



Robert S. White wrote:
> 
> In article <mheaney-ya023180000210962257430001@news.ni.net>, mheaney@ni.net
> says...
> 
> >It's never cost effective to skimp on hardware if it means human
> >programmers have to write more complex software.
> 
>   Not if the ratio is tilted very heavy towards reoccuring cost versus
> Non-Reoccuring Engineering (NRE).  How about 12 staff-months versus $300 extra
> hardware cost on 60,000 units?

$300 extra on 60,000 units. That's $18 Million, right?

vs

12 Staff-months. Now if your staff is 1, that's maybe $200,000 for
a single top-notch pro. If your staff is 200, each at $100,000 cost (i.e.
average wage is about $50K/year), then that's $20 million. But say you
only have the one guy. And say it adds 50% to the risk of failure, with
consequent and liquidated damages of $100 million. Then it's really
costing you $50 million, 200 thousand.

Feel free to make whatever strawman case you want. The above figures are
based on 2 different projects (actually the liquidated damages one
involved 6 people, rather than 1, and an estimated 70% increased chance
of failure, but I digress). 

Summary: In the real world, and with the current state-of-the-art, I
agree with the original statement as an excellent general rule.
  
----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Robert S. White
@ 1996-10-05  0:00         ` Robert Dewar
  1996-10-05  0:00         ` Alan Brain
  1 sibling, 0 replies; 105+ messages in thread
From: Robert Dewar @ 1996-10-05  0:00 UTC (permalink / raw)



Robert White said

">It's never cost effective to skimp on hardware if it means human
>programmers have to write more complex software.

  Not if the ratio is tilted very heavily towards recurring cost versus
Non-Recurring Engineering (NRE).  How about 12 staff-months versus $300 extra
hardware cost on 60,000 units?"



Of course this is true at some level, but the critical thing is that a proper
cost comparison here must take into account:

a) full life cycle costs of the software, not just development costs
b) time-to-market delays caused by more complex software
c) decreased quality and reliability caused by more complex software

There are certainly cases where careful consideration of these three factors
still results in a decision to use less hardware and more complex software,
but I think we have all seen cases where such decisions were made and in
retrospect turned out to be huge mistakes.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00   ` Ken Garlington
@ 1996-10-05  0:00     ` Robert Dewar
  1996-10-06  0:00       ` Keith Thompson
                         ` (2 more replies)
  0 siblings, 3 replies; 105+ messages in thread
From: Robert Dewar @ 1996-10-05  0:00 UTC (permalink / raw)



Matthew said

"> Buyers of mission-critical software should think very carefully before they
> commit any financial resources to implementing a software system that
> requires checks be turned off.  I'd say take your money instead to Las
> Vegas: your odds for success are better there."

To the extent that checks are used for catching hardware failures this might
be true, but in practice the runtime checks of Ada are not a well tuned
tool for this purpose, although I have seen programs that work hard to take
more advantage of such checks. For example:

   type My_Boolean is new Boolean;
   for My_Boolean use (2#0101#, 2#1010#);

so that 1-bit errors cannot give valid Boolean values (check and see if your
compiler supports this, it is not required to do so!)
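
Ada 95 also gives you the 'Valid attribute to test explicitly for such a
corrupted value before using it; a minimal sketch (Recover is invented):

   Flag : My_Boolean;
   ...
   if not Flag'Valid then
      Recover;  -- illegal bit pattern detected; don't trust Flag
   end if;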

However, to the extent that checks are used to catch programming errors,
I think that I would prefer that a safety critical system NOT depend on
such devices. A programming error in a checks-on program may indeed result
in a constraint error, but it may also cause the plane to dive into the sea
without raising a constraint error.

I find the second outcome here unacceptable, so the methodology must simply
prevent such errors completely. Indeed if you look at safety critical
subsets for Ada they often omit exceptions precisely because of this
consideration. After all exceptions make the language and compiler more
complex, and that itself may introduce concerns at the safety critical
level.

Note also that exceptions are a double edged sword. An exception that is
not handled properly can be much worse than no exception at all. If you
have a section of code doing non-critical calculations (e.g. how much
time to wait before showing the movie in the main cabin), it really does
not matter too much if that calculation overflows and shows the movie a
bit early, but if it causes an unhandled exception that wipes out the
entire passenger control system, and turns out all the reading lights
etc. that can be much worse. Even in a safety critical system, there will
be calculations that are relatively unimportant.

For example, a low priority task may cause an overflow. If ignored, an
unimportant result is simply wrong. If not ignored, the handling of the
exception may cause that low priority task to overrun its CPU slice, and
cause chaos elsewhere.
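
A sketch of handling such an overflow locally, so an unimportant result
stays unimportant (all names invented for illustration):

   begin
      Minutes_To_Movie := Compute_Movie_Delay (Cabin_Data);
   exception
      when Constraint_Error =>
         Minutes_To_Movie := Default_Delay;  -- wrong, but harmless
   end;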

As Ken says, checks are not a magic wand. They are a powerful tool, but
like any tool, subject to abuse. A chain saw with a kickback guard on the
end is definitely a safer tool to use, especially for an amateur, than
one without (something I appreciate while clearing paths through the woods
at my Vermont house), but it does not mean that now the tool is a completely
safe one, and indeed a real expert with a chain saw will often feel that it
is safer to operate without the guard, because then the behavior of the
chainsaw is simpler and more predictable.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Joseph C Williams
@ 1996-10-06  0:00         ` Wayne Hayes
  0 siblings, 0 replies; 105+ messages in thread
From: Wayne Hayes @ 1996-10-06  0:00 UTC (permalink / raw)



In article <32551A66.41C6@gsde.hso.link.com>,
Joseph C Williams  <u6p35@gsde.hso.link.com> wrote:
>Why didn't they run the code against an Ariane 5 simulator to
>reverify the Ariane 4 software that was reused?

Money.  (The more cynical among us may say this translates to "stupidity".)

-- 
"Unix is simple and coherent, but it takes || Wayne Hayes, wayne@cs.utoronto.ca
a genius (or at any rate, a programmer) to || Astrophysics & Computer Science
appreciate its simplicity." -Dennis Ritchie|| http://www.cs.utoronto.ca/~wayne




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-05  0:00         ` Alan Brain
@ 1996-10-06  0:00           ` Robert S. White
  0 siblings, 0 replies; 105+ messages in thread
From: Robert S. White @ 1996-10-06  0:00 UTC (permalink / raw)



In article <3256ED61.7952@dynamite.com.au>, aebrain@dynamite.com.au says...
>
>12 Staff-months. Now if your staff is 1, then that's maybe $200,000 for
>a single top-notch profi. If your staff is 200, each at 100,000 cost (ie
>average wage is about 50K/year), then that's 20 million. 

   Number of staff * amount of time = staff months 
         (with a dash of reality for reasonable parallel tasks)

   The type of strawman that I had in mind could be 1 person for a year, two 
persons for six months, to a limit of 4 persons for 3 months.  And watch out 
for the mythical man-machine month!

> But say you
>only have the one guy. And say it adds 50% to the risk of failure. With
>consequent and liquidated damages of 100 Million. Then that's 50
>million, 200 thousand it's really costing.

  Projects these days also have a "Risk Management Plan" per SEI CMM 
recommendations.  That 50% added to the risk of failure has to be assigned an 
estimated cost and factored into the decision.


>Feel free to make whatever strawman case you want. The above figures are
>based on 2 different projects (actually the liquidated damages one
>involved 6 people, rather than 1, and an estimated 70% increased chance
>of failure, but I digress).

  I've seen a lot of successes.  Failures most often can be attributed to poor 
judgement by incompetent personnel.  That can be tough to manage when the 
managers don't want to hear bad news or risk projections.  Especially when they 
set up a project and move on before it is done. 
>
>Summary: In the real world, and with the current state-of-the-art, I
>agree with the original statement as an excellent general rule.

  I beg to disagree in the case of higher volume markets.  I do agree very much 
for lower volumes or when the type of development task is new to the engineers 
and managers.  You must have a good understanding of the problem and the 
solution domain to do proper cost tradeoffs that involve significant risk.

___________________________________________________________________________
Robert S. White                    -- an embedded systems software engineer
WhiteR@CRPL.Cedar-Rapids.lib.IA.US -- It's long, but I pay for it!





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-05  0:00     ` Robert Dewar
@ 1996-10-06  0:00       ` Keith Thompson
  1996-10-10  0:00       ` Ken Garlington
  1996-10-14  0:00       ` Matthew Heaney
  2 siblings, 0 replies; 105+ messages in thread
From: Keith Thompson @ 1996-10-06  0:00 UTC (permalink / raw)



In <dewar.844518011@schonberg> dewar@schonberg.cs.nyu.edu (Robert Dewar) writes:
> To the extent that checks are used for catching hardware failures this might
> be true, but in practice the runtime checks of Ada are not a well tuned
> tool for this purpose, although I have seen programs that work hard to take
> more advantage of such checks. For example:
> 
>    type My_Boolean is new Boolean;
>    for My_Boolean use (2#0101#, 2#1010#);
> 
> so that 1 bit errors cannot give valid Boolean values (check and see if your
> compiler supports this, it is not required to do so!)

But then there's still no guarantee that an invalid Boolean value will
be detected.  The code generated for an if statement, for example, is
unlikely to check its Boolean condition for validity.

Of course, there will probably be a runtime call to a routine that
converts from My_Boolean to Boolean, and this routine will *probably*
do something sensible (raise Program_Error) for an invalid argument.

Anyone doing something this tricky is presumably already examining the generated
code to make sure there are no surprises.

-- 
Keith Thompson (The_Other_Keith) kst@thomsoft.com <*>
TeleSoft^H^H^H^H^H^H^H^H Alsys^H^H^H^H^H Thomson Software Products
10251 Vista Sorrento Parkway, Suite 300, San Diego, CA, USA, 92121-2706
FIJAGDWOL




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Steve Bell
@ 1996-10-07  0:00         ` Ken Garlington
  1996-10-09  0:00         ` @@           robin
  1 sibling, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-07  0:00 UTC (permalink / raw)



Steve Bell wrote:

> Well, the French do things a little differently (but probably now they don't). The
> Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs
> past liftoff. They do (did) this in case there are any unanticipated holds in the
> countdown right close to liftoff. In this way, this position and velocity updating
> code would *not* have to be reset if they could get off the ground within just a few
> seconds of nominal.

But why 40 seconds? Why not 1 second (or one millisecond, for that matter)?

> You might take note that this would never happen with a
> Titan because we don't execute this ground software after liftoff. Even if we did, we
> would have caught the floating point overflows way before launch because we run all
> code in what's called "Real-Time Simulations" where actual flight harware and software
> are subjected to any and all known physical conditions. This was another finding of
> the investigation board - apparently the French don't do enough of this type of
> testing because it's real expensive.

Going way back into my history, I believe this is also true for Atlas.

> --
> Clear skies,
> Steve Bell
> sb635@delphi.com
> http://people.delphi.com/sb635 - Astrophoto page

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Michel OLAGNON
@ 1996-10-09  0:00         ` @@           robin
  0 siblings, 0 replies; 105+ messages in thread
From: @@           robin @ 1996-10-09  0:00 UTC (permalink / raw)



	molagnon@ifremer.fr (Michel OLAGNON) writes:

	>In article <532k32$r4r@goanna.cs.rmit.edu.au>, rav@goanna.cs.rmit.edu.au (@@           robin) writes:
	>>	john@assen.demon.co.uk (John McCabe) writes:
	>>
	>>	>Just a point for your information. From clari.tw.space:
	>>
	>>	>	 "An inquiry board investigating the explosion concluded in  
	>>	>July that the failure was caused by software design errors in a 
	>>	>guidance system."
	>>
	>>	>Note software DESIGN errors - not programming errors.
	>>
	>>	>Best Regards
	>>	>John McCabe <john@assen.demon.co.uk>
	>>
	>>---If you read the Report, you'll see that that's not the case.
	>>This is what the report says:
	>>
	>>    "* The internal SRI software exception was caused during execution of a
	>>     data conversion from 64-bit floating point to 16-bit signed integer
	>>     value. The floating point number which was converted had a value
	>>     greater than what could be represented by a 16-bit signed integer.
	>>     This resulted in an Operand Error. The data conversion instructions
	>>     (in Ada code) were not protected from causing an Operand Error,
	>>     although other conversions of comparable variables in the same place
	>>     in the code were protected.
	>>
	>>    "In the failure scenario, the primary technical causes are the Operand Error
	>>    when converting the horizontal bias variable BH, and the lack of protection
	>>    of this conversion which caused the SRI computer to stop."
	>>
	>>---As you can see, it's clearly a programming error.  It's a failure
	>>to check for overflow on converting a double precision value to
	>>a 16-bit integer.

	>But if you read a bit further on, it is stated that

	>    The reason why three conversions, including the horizontal bias variable one,
	>    were not protected, is that it was decided that they were physically bounded
	>    or had a wide safety margin (...) The decision was a joint one of the project
	>    partners at various contractual levels.

	>Deciding at various contractual levels is not what one usually means by
	>``programming''. It looks closer to ``design'', IMHO. But, of course, anyone
	>can give any word any meaning.
	>And it might be probable that the action taken in case of protected conversion,
	>and exception, would also have been stop the SRI computer because such a high
	>horizontal bias would have meant that it was broken....

	>| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|

But if you read further on ....

   "However, three of the variables were left unprotected. No reference to
    justification of this decision was found directly in the source code. Given
    the large amount of documentation associated with any industrial
    application, the assumption, although agreed, was essentially obscured,
    though not deliberately, from any external review."

.... you'll see that there was no documentation in the code to
explain why these particular 3 (dangerous) conversions were
left unprotected.  There is the implication that one or more
of them might have been overlooked...  Don't place too much
reliance on the conclusion of the report when the detail is
right there in the body of the report.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00       ` Steve Bell
  1996-10-07  0:00         ` Ken Garlington
@ 1996-10-09  0:00         ` @@           robin
  1996-10-09  0:00           ` Steve O'Neill
  1 sibling, 1 reply; 105+ messages in thread
From: @@           robin @ 1996-10-09  0:00 UTC (permalink / raw)



Steve Bell <sb635@delphi.com> writes:

	>Michael Dworetsky wrote:
	>> 
	>> >Just a point for your information. From clari.tw.space:
	>> >
	>> >        "An inquiry board investigating the explosion concluded in
	>> >July that the failure was caused by software design errors in a
	>> >guidance system."
	>> >
	>> >Note software DESIGN errors - not programming errors.
	>> >
	>> 
	>> Indeed, the problems were in the specifications given to the programmers,
	>> not in the coding activity itself.  They wrote exactly what they were
	>> asked to write, as far as I could see from reading the report summary.
	>> 
	>> The problem was caused by using software developed for Ariane 4's flight
	>> characteristics, which were different from those of Ariane 5.  When the
	>> launch vehicle exceeded the boundary parameters of the Ariane-4 software,
	>> it send an error message and, as specified by the remit given to
	>> programmers, a critical guidance system shut down in mid-flight. Ka-boom.
	>> 

	>I work for an aerospace company, and we received a fairly detailed accounting of what 
	>went wrong with the Ariane 5. Launch vehicles, while they are sitting on the launch 
	>pad, run a guidance program that updates their position and velocity in reference to 
	>a coordinate frame whose origin is at the center of the earth (usually called an 
	>Earth-Centered-Inertial (ECI) frame). This program is usually started up from 1 to 3-4 
	>hours before launch and is allowed to run all the way until liftoff, so that the 
	>rocket will know where it's at and how fast it's going at liftoff. Although called 
	>"ground software," (because it runs while the rocket is on the ground), it resides 
	>inside the rocket's guidance computer(s), and for the Titan family of launch vehicles, 
	>the code is exited at t=0 (liftoff). This code is designed with knowing that the 
	>rocket is rotating on the surface of the earth, and the algorithms expect only very 
	>mild accelerations (as compared to when the rocket hauls ass off the pad at liftoff). 
	>Well, the French do things a little differently (but probably now they don't). The 
	>Ariane 4 and the first Ariane 5 allow(ed) this program to keep running for 40 secs 
	>past liftoff. They do (did) this in case there are any unanticipated holds in the 
	>countdown right close to liftoff. In this way, this position and velocity updating 
	>code would *not* have to be reset if they could get off the ground within just a few 
	>seconds of nominal. Well, it appears that the Ariane 5 really hauls ass off the pad, 
	>because at about 30 secs, it was pulling some accelerations that caused floating point 
	>overflows

---Definitely not.  No floating-point overflow occurred.  In
Ariane 5, the overflow occurred on converting a 64-bit
floating-point value to a 16-bit integer (15 significant
bits).

   That's why it was so important to have a check that the
conversion couldn't overflow!


	>in the still functioning ground software. The actual flight software (which 
	>was also running, naturally) was computing the positions and velocities that were 
	>being used to actually fly the rocket, and it was doing just fine - no overflow errors 
	>there because it was designed to expect high accelerations. There are two flight 
	>computers on the Ariane 5 - a primary and a backup - and each was designed to shut 
	>down if an error such as a floating point overflow occurred,

---Again, not at all.  It was designed to shut down if any interrupt
occurred.  It wasn't intended to be shut down for a routine thing such
as a conversion of floating-point to integer.

	>thinking that the other 
	>one would take over. Both computers were running the ground software, and both 
	>experienced the floating point errors.


---No, the backup SRI experienced the programming error (UNCHECKED
CONVERSION from floating-point to integer) first, and shut itself
down, then the active SRI computer experienced the same programming
error, then it shut itself down.

	>Actually, the primary went belly-up first, and 
	>then the backup within a fraction of a second later. With no functioning guidance 
	>computer on board, well, ka-boom as you say.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-09  0:00         ` @@           robin
@ 1996-10-09  0:00           ` Steve O'Neill
  1996-10-12  0:00             ` Alan Brain
  0 siblings, 1 reply; 105+ messages in thread
From: Steve O'Neill @ 1996-10-09  0:00 UTC (permalink / raw)



@@ robin wrote:
> ---Definitely not.  No floating-point overflow occurred.  In
> Ariane 5, the overflow occurred on converting a 64-bit
> floating-point value to a 16-bit integer (15 significant bits).
> 
>    That's why it was so important to have a check that the
> conversion couldn't overflow!

Agreed.  Yes, the basic reason for the destruction of a billion dollar 
vehicle was for want of a couple of lines of code.  But it reflects a 
systemic problem much more damaging than what language was used.

I would have expected that in a mission/safety critical application 
the proper checks would have been implemented, no matter what. And in a 
'belts-and-suspenders' mode I would also expect an exception handler to 
take care of unforeseen possibilities at the lowest possible level and 
raise things to a higher level only when absolutely necessary.  Had these 
precautions been taken there would probably be lots of entries in an 
error log but the satellites would now be orbiting.  
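
That lowest-level handler might look something like this (a sketch only; 
the names are invented, not taken from the Ariane code):

   begin
      BH := Bias_Counts (Horizontal_Bias);
   exception
      when Constraint_Error =>
         Log_Error (Alignment_Overflow);  -- an entry in the error log...
         BH := Bias_Counts'Last;          -- ...then degrade gracefully
   end;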

As outsiders we can only second guess as to why this approach was not 
taken but the review board implies that 1) the SRI software developers 
had an 80% max utilization requirement and 2) careful consideration 
(including faulty assumptions) was used in deciding what to protect and 
not protect.

>It was designed to shut down if any interrupt occurred.  It wasn't
                                     ^^^^^^^^^ exception, actually
>intended to be shut down for a routine thing such as a conversion of
>floating-point to integer.

This was based on the (faulty) system-wide assumption that any exception 
was the result of a random hardware failure.  This is related to the 
other faulty assumption that "software should be considered correct until 
it is proven to be at fault".  But that's what the specification said.

> ---No, the backup SRI experienced the programming error (UNCHECKED
> CONVERSION from floating-point to integer) first, and shut itself
> down, then the active SRI computer experienced the same programming
> error, then it shut itself down.

Yes, according to the report the backup died first (by 0.05 seconds).  
Probably not as a result of an unchecked_conversion though - the source 
and target are of different sizes, which would not be allowed.  Most 
likely just a conversion of a float to a sixteen-bit integer.  This 
would have raised a Constraint_Error (or Operand_Error in this 
environment).  This error could have been handled within the context of 
this procedure (and the mission continued) but obviously was not.  
Instead it appears to have been propagated to a global exception handler 
which performed the specified actions admirably.  Unfortunately these 
included committing suicide and, in doing so, dooming the mission.

-- 
Steve O'Neill                      | "No,no,no, don't tug on that!
Sanders, A Lockheed Martin Company |  You never know what it might
smoneill@sanders.lockheed.com      |  be attached to." 
(603) 885-8774  fax: (603) 885-4071|    Buckaroo Banzai




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-05  0:00     ` Robert Dewar
  1996-10-06  0:00       ` Keith Thompson
@ 1996-10-10  0:00       ` Ken Garlington
  1996-10-14  0:00       ` Matthew Heaney
  2 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-10  0:00 UTC (permalink / raw)



Robert Dewar wrote:
> 
> I find the second outcome here unacceptable, so the methodology must simply
> prevent such errors completely. Indeed if you look at safety critical
> subsets for Ada they often omit exceptions precisely because of this
> consideration. After all exceptions make the language and compiler more
> complex, and that itself may introduce concerns at the safety critical
> level.

I'm also starting to be convinced, after some anecdotal evidence with the systems
I work, that _suppressing_ checks can also make the compiler more fragile. My guess is
that fewer people in general suppress all checks for most compilers, so those
paths in the compiler that run with checks suppressed are used less often,
and so they have a higher probability of containing bugs. I also suspect that most
vendors do not run their standard tests suites (including ACVCs) with checks
suppressed (how could you, for the part of the test suite that validates exception
raising and handling?), so there's less coverage from that source as well.

I'm not saying that it's dumb to suppress checks (or not suppress checks) for
safety-critical systems. I'm just saying the answer appears to be a lot more
complicated than I thought it was 10 years ago (or even 2 years ago).

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-09  0:00           ` Steve O'Neill
@ 1996-10-12  0:00             ` Alan Brain
  0 siblings, 0 replies; 105+ messages in thread
From: Alan Brain @ 1996-10-12  0:00 UTC (permalink / raw)



Steve O'Neill wrote:

> I would have expected that in a mission/safety critical application
> the proper checks would have been implemented, no matter what. And in a
> 'belts-and-suspenders' mode I would also expect an exception handler to
> take care of unforeseen possibilities at the lowest possible level and
> raise things to a higher level only when absolutely necessary.  Had these
> precautions been taken there would probably be lots of entries in an
> error log but the satellites would now be orbiting.

Concur completely. This should be Standard Operating Procedure, a matter
of habit. Frankly, it's just good engineering practice. But is honoured
more in the breach than the observance it seems, because....
 
> As outsiders we can only second guess as to why this approach was not
> taken but the review board implies that 1) the SRI software developers
> had an 80% max utilization requirement and 2) careful consideration
> (including faulty assumptions) was used in deciding what to protect and
> not protect.

... as some very reputable people, working for very reputable firms, have
tried to pound into my thick skull, they are used to working with 15%, no
more, tolerances. And with diamond-grade Hard Real Time slices, where any
over-run, no matter how slight, means disaster. In this case, Formal Proof
and strict attention to the number of CPU cycles in all possible paths seems
the only way to go.

But this leaves you so open to error in all but the simplest, most trivial
tasks (just the race analysis would be nightmarish) that these slices had
better be a very small part of the task, or the task itself must be very
simple indeed. Either way, not having much bearing on the vast majority of
problems I've encountered.

If the tasks are not simple.... then can I please ask the firms concerned to
tell me which aircraft their software is on, so I can take appropriate
action?

----------------------      <> <>    How doth the little Crocodile
| Alan & Carmel Brain|      xxxxx       Improve his shining tail?
| Canberra Australia |  xxxxxHxHxxxxxx _MMMMMMMMM_MMMMMMMMM
---------------------- o OO*O^^^^O*OO o oo     oo oo     oo  
                    By pulling Maerklin Wagons, in 1/220 Scale




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-05  0:00     ` Robert Dewar
  1996-10-06  0:00       ` Keith Thompson
  1996-10-10  0:00       ` Ken Garlington
@ 1996-10-14  0:00       ` Matthew Heaney
  1996-10-15  0:00         ` Robert Dewar
  1996-10-16  0:00         ` Ken Garlington
  2 siblings, 2 replies; 105+ messages in thread
From: Matthew Heaney @ 1996-10-14  0:00 UTC (permalink / raw)



In article <dewar.844518011@schonberg>, dewar@schonberg.cs.nyu.edu (Robert
Dewar) wrote:

>As Ken says, checks are not a magic wand. They are a powerful tool, but
>like any tool, subject to abuse. A chain saw with a kickback guard on the
>end is definitely a safer tool to use, especially for an amateur, than
>one without (something I appreciate while clearing paths through the woods
>at my Vermont house), but it does not mean that now the tool is a completely
>safe one, and indeed a real expert with a chain saw will often feel that it
>is safer to operate without the guard, because then the behavior of the
>chainsaw is simpler and more predictable.

I think we're all in basic agreement.

As you stated, exceptions are only a tool.  They don't replace the need for
(mental) reasoning about the correctness of my program, nor should they be
used to guard against sloppy programming.  Exceptions don't correct the
problem for you, but at least they let you know that a problem exists.

And in spite of all the efforts of the Ariane 5 developers, a problem did
exist, significant enough to cause mission failure.  Don't you think an
exception was justified in this case?

Yes, I agree that there may be times when you don't need any sophisticated
exception handling, and you could safely turn checks off.  But surely there
are important sections of code, say for a critical algorithm, that justify
the use of checks.

Believe me, I would love to write a software system that I knew was
(formally) correct and didn't require run-time checks.  But I am not able
to build that system today.  So what should I do?

Though I may be the most practiced walker of tightropes, I still like
having that safety net underneath me.

-matt

--------------------------------------------------------------------
Matthew Heaney
Software Development Consultant
mheaney@ni.net
(818) 985-1271




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-14  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-15  0:00 ` Robert I. Eachus
  1996-10-23  0:00 ` robin
  0 siblings, 2 replies; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-14  0:00 UTC (permalink / raw)



Alan Brain <aebrain@DYNAMITE.COM.AU> writes:
>more, tolerances. And with diamond-grade Hard Real Time slices, where
>any
>over-run, no matter how slight, means disaster. In this case, Formal
>Proof
>and strict attention to the no of CPU cycles in all possible paths seems
>the only way to go.
>But this leaves you so open to error in all but the simplest, most
>trivial
>tasks, (just the race analysis would be nightmarish) that these slices
>had
>better be a very small part of the task, or the task itself must be very
>simple indeed. Either way, not having much bearing on the vast majority
>
    In my experience with this sort of "Hard Real Time" code, you are
    typically talking about relatively straightforward code - albeit
    difficult to develop. (Ask A. Einstein how long it took him to
    write the "E := M * C**2 ;" function.)

    The parts which typically have hard deadlines tend to be heavy on
    math or data motion and rather light on branching and call chain
    complexity. You want your "worst case" timing to be your nominal
    path and you'd like for it to be easily analyzed and very
    predictable. Usually, it's a relatively small part of the system
    and maybe (MAYBE!) you can turn off runtime checks for just this
    portion of the code, leaving it in for the things which run at a
    lower duty cycle.
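
    As a sketch of that kind of scoping (illustrative only - the two
    procedures are empty stand-ins, and note that pragma Suppress
    merely gives the compiler permission to omit the checks; it does
    not oblige it to):

    procedure CONTROL_LOOP is
        procedure INTEGRATE_STATE is begin null ; end INTEGRATE_STATE ;
        procedure MONITOR_HEALTH  is begin null ; end MONITOR_HEALTH ;
    begin
        declare
            pragma Suppress (All_Checks) ; --Hard-deadline section only!
        begin
            INTEGRATE_STATE ;              --Timing-critical work here.
        end ;
        MONITOR_HEALTH ;                   --Checks stay on at the lower duty cycle.
    end CONTROL_LOOP ;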

    Of course the truly important thing to remember is that compiler
    generated runtime checks are not a panacea. They *may* have helped
    with the Ariane 5, if there was an appropriate accommodation once
    the error was detected. (Think about it. If the accommodation was
    "Shut down the channel and pass control to the other side" {Very
    common in a dual-redundant system} it would have made no
    difference.) But most of the errors I've encountered in realtime
    systems have been of the "logic" variety. ("Gee! We thought 'x'
    was the proper course of action when this condition comes up and
    really it should have been 'y'" or "I didn't know the control
    would go unstable if parameter 'x' would slew across its range
    that quickly!?!?!") Runtime checks aren't ever going to save us
    from that sort of mistake - and those are the ones which show up
    most often. (Unless, of course, you program in C ;-)

    An aside which has something to do with Ada language constructs:
    In most of our work (control systems) it would be far more useful
    for math over/underflows to saturate and continue on, rather than
    raise an exception and halt processing. Ada never defined any
    numeric types with this sort of behavior - and I find it difficult
    to believe that many others in similar embedded applications
    wouldn't also desire this behavior from some predefined floating,
    fixed, and integer types. Of course, the language allows us to
    define our own types and (if there's proper hardware and compiler
    support for dealing with it) efficient "home-brew" solutions can
    be built. Still, it would have seemed appropriate for the language
    designers to have built some direct support for a very common
    embedded need.
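
    For instance, a "home-brew" saturating conversion might look
    something like this (purely illustrative - the names are mine,
    not from any real flight code):

    type INT_16 is range -2**15 .. 2**15 - 1 ;

    function SATURATE (X : FLOAT) return INT_16 is
    begin
        if X >= FLOAT (INT_16'Last) then
            return INT_16'Last ;   --Clamp high instead of raising.
        elsif X <= FLOAT (INT_16'First) then
            return INT_16'First ;  --Clamp low instead of raising.
        else
            return INT_16 (X) ;    --In range: ordinary checked conversion.
        end if ;
    end SATURATE ;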

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "The speed with which people can change a courtesy into an
    entitlement is awe-inspiring."

        --  Miss Manners, February 8, 1994
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-14  0:00       ` Matthew Heaney
@ 1996-10-15  0:00         ` Robert Dewar
  1996-10-16  0:00         ` Ken Garlington
  1 sibling, 0 replies; 105+ messages in thread
From: Robert Dewar @ 1996-10-15  0:00 UTC (permalink / raw)



Matthew says

"Believe me, I would love to write a software system that I knew was
(formally) correct and didn't require run-time checks.  But I am not able
to build that system today.  So what should I do?"


First of all, I would object to the "formally" and even the word "correct"
here. These are technical terms which relate to, but are not identical with,
the important concept, which is reliability.

It *is* possible to write reliable programs, though it is expensive. If you
need to do this, and are not able to do it, then the answer is to investigate
the tools that make this possible, and understand the necessary investment
(which is alarmingly high). Some of these tools are related to correctness,
but that's not the main focus. There are reliable incorrect programs and
correct unreliable programs, and what we are interested in is reliability.

For an example of toolsets that help achieve this aim, take a look at the
Praxis tools. There are many other examples of methodologies and tools that
can be used to achieve high reliability. 

Now of course informally we would like to make all programs reliable, but
there is a cost/benefit trade off. For most non-safety critical programming
(but not all), it is simply not cost effective to demand total reliability.





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-14  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-15  0:00 ` Robert I. Eachus
  1996-10-15  0:00   ` Robert Dewar
  1996-10-23  0:00 ` robin
  1 sibling, 1 reply; 105+ messages in thread
From: Robert I. Eachus @ 1996-10-15  0:00 UTC (permalink / raw)



In article <96101416363982@psavax.pwfl.com> "Marin David Condic, 407.796.8997, M/S 731-93" <condicma@PWFL.COM> writes:

  > In most of our work (control systems) it would be far more useful
  > for math over/underflows to saturate and continue on, rather than
  > raise an exception and halt processing. Ada never defined any
  > numeric types with this sort of behavior - and I find it difficult
  > to believe that many others in similar embedded applications
  > wouldn't also desire this behavior from some predefined floating,
  > fixed, and integer types. Of course, the language allows us to
  > define our own types and (if there's proper hardware and compiler
  > support for dealing with it) efficient "home-brew" solutions can
  > be built. Still, it would have seemed appropriate for the language
  > designers to have built some direct support for a very common
  > embedded need.

    They did.  First, look at 'Machine_Overflows.  It is perfectly
legal for even Float'Machine_Overflows to be false and for the
implementation to return, say, IEEE nonsignaling NaNs in such a case.
Also, RM95 3.5.5(26) and 3.5.6(8) allow for nonstandard integer and
real types respectively, and mention saturation types as one possible
use for the feature.

   Talk to your vendor or check out what GNAT actually does on your
hardware.
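
   A quick probe of what your compiler does (a minimal sketch; the
   answer is compiler- and target-specific):

   with Ada.Text_IO;
   procedure Probe is
   begin
      Ada.Text_IO.Put_Line
        ("Float'Machine_Overflows = "
         & Boolean'Image (Float'Machine_Overflows));
   end Probe;

   If it prints FALSE, an overflowing predefined float operation need
   not raise Constraint_Error, and a nonsignaling NaN is one permitted
   result.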

--

					Robert I. Eachus

with Standard_Disclaimer;
use  Standard_Disclaimer;
function Message (Text: in Clever_Ideas) return Better_Ideas is...




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-15  0:00 ` Robert I. Eachus
@ 1996-10-15  0:00   ` Robert Dewar
  1996-10-16  0:00     ` Michael F Brenner
  0 siblings, 1 reply; 105+ messages in thread
From: Robert Dewar @ 1996-10-15  0:00 UTC (permalink / raw)



Marin said

"  > In most of our work (control systems) it would be far more useful
  > for math over/underflows to saturate and continue on, rather than
  > raise an exception and halt processing. Ada never defined any
  > numeric types with this sort of behavior - and I find it difficult
  > to believe that many others in similar embedded applications
  > wouldn't also desire this behavior from some predefined floating,
  > fixed, and integer types. Of course, the language allows us to
  > define our own types and (if there's proper hardware and compiler
  > support for dealing with it) efficient "home-brew" solutions can
  > be built. Still, it would have seemed appropriate for the language
  > designers to have built some direct support for a very common
  > embedded need."


Well there is always a certain kind of viewpoint that wants more, more, more
when it comes to features in a language, but I think that saturating types
would be overkill in terms of predefined integral types. Adding new classes
of integral types adds a lot of stuff to the language, just look at all the
stuff for supporting modular types.

I think a much more reasonable approach for saturating operators is to
define the necessary operators. If you need some very clever efficient
code for these operators, then either use inlined machine code, or persuade
your vendor to implement these as efficient intrinsics, that is always
allowed.
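
As a sketch of the operator approach (illustrative only; a production
version would presumably be an intrinsic or inlined machine code, as
noted above):

   package Saturating is
      type Sat_16 is range -2**15 .. 2**15 - 1;
      function "+" (L, R : Sat_16) return Sat_16;
      pragma Inline ("+");
   end Saturating;

   package body Saturating is
      type Wide is range -2**16 .. 2**16;  -- wide enough for any sum
      function "+" (L, R : Sat_16) return Sat_16 is
         Sum : constant Wide := Wide (L) + Wide (R);
      begin
         if Sum > Wide (Sat_16'Last) then
            return Sat_16'Last;            -- saturate instead of overflowing
         elsif Sum < Wide (Sat_16'First) then
            return Sat_16'First;
         else
            return Sat_16 (Sum);
         end if;
      end "+";
   end Saturating;

The explicit declaration overrides the predefined "+" for Sat_16, so
ordinary-looking arithmetic on the type saturates.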





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-14  0:00       ` Matthew Heaney
  1996-10-15  0:00         ` Robert Dewar
@ 1996-10-16  0:00         ` Ken Garlington
  1996-10-18  0:00           ` Keith Thompson
  1996-10-23  0:00           ` robin
  1 sibling, 2 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-16  0:00 UTC (permalink / raw)



Matthew Heaney wrote:
> 
> As you stated, exceptions are only a tool.  They don't replace the need for
> (mental) reasoning about the correctness of my program, nor should they be
> used to guard against sloppy programming.  Exceptions don't correct the
> problem for you, but at least they let you know that a problem exists.
> 
> And in spite of all the efforts of the Ariane 5 developers, a problem did
> exist, significant enough to cause mission failure.  Don't you think an
> exception was justified in this case?

Not necessarily. Keep in mind that an exception _was_ raised -- a predefined 
exception (Operand_Error according to the report). There was sufficient telemetry 
to determine where the error occurred (obviously, otherwise we wouldn't know what 
happened!). If the real Ariane 5 trajectory had been tested in an integrated 
laboratory environment, then (assuming the environment was realistic enough to 
trigger the problem), the fault would have been seen (and presumably analyzed and 
fixed) prior to launch. So, the issue is not the addition of a user-defined 
exception to find the error -- the issue is the addition of a new exception 
_handler_ to _recover_ from the error in flight.

Assuming that a new exception _handler_ had been added, then it _might_ have made 
a difference. If it did nothing more than the system exception handler (shutting 
down the channel), then the only potential advantage of the exception _handler_ 
might have been to allow fault isolation to happen faster (e.g. if the exception 
were logged in some manner). This assumes that either the exception message was 
sent out with the telemetry, or else the on-board fault logging survived the 
crash. On the other hand, if it had shut down just the alignment function, then 
it might have saved the system. Without more knowledge about the IRS 
architecture, there's no way to say.
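
For illustration only (every name below is hypothetical -- the actual IRS 
architecture is unknown to us), a handler that sacrifices just the alignment 
function might look like:

   with Ada.Text_IO;
   procedure Alignment_Step is
      type Int_16 is range -2**15 .. 2**15 - 1;
      Alignment_Active : Boolean := True;
      Velocity         : Float   := 1.0E9;  -- contrived out-of-range input
      Bias             : Int_16;
   begin
      Bias := Int_16 (Velocity);            -- raises Constraint_Error
      Ada.Text_IO.Put_Line (Int_16'Image (Bias));
   exception
      when Constraint_Error =>
         Ada.Text_IO.Put_Line ("alignment fault logged");  -- telemetry stand-in
         Alignment_Active := False;  -- degrade locally; the channel stays up
   end Alignment_Step;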

> Yes, I agree that there may be times when you don't need any sophisticated
> exception handling, and you could safely turn checks off.  But surely there
> are important sections of code, say for a critical algorithm, that justify
> the use of checks.
> 
> Believe me, I would love to write a software system that I knew was
> (formally) correct and didn't require run-time checks.  But I am not able
> to build that system today.  So what should I do?
> 
> Though I may be the most practiced walker of tightropes, I still like
> having that safety net underneath me.

Just make sure that your safety net isn't lying directly on the ground. Without 
the use of a frame (exception handlers that actually do the right thing to 
recover the system), you'll find the landing is just as hard with or without the 
net!

You might also want to make sure that the net isn't suspended so high that you're 
walking _below_ it, or even worse that you hit your head on the net and it knocks 
you off the rope (just to stretch this analogy a bit further). In other words, a 
complex exception handling structure might actually _detract_ from the 
reliability of your system. There is some merit to the Keep It Simple, Stupid 
principle.

> 
> -matt
> 
> --------------------------------------------------------------------
> Matthew Heaney
> Software Development Consultant
> mheaney@ni.net
> (818) 985-1271

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-16  0:00     ` Michael F Brenner
@ 1996-10-16  0:00       ` Robert Dewar
  0 siblings, 0 replies; 105+ messages in thread
From: Robert Dewar @ 1996-10-16  0:00 UTC (permalink / raw)



Michael Brenner said

"(2) do not generate code for a given instantiation of unchecked_conversion,"

Most of the points are dubious, but I concentrate on this one, because it
is a common confusion. In general, almost any unchecked conversion you can
think of will require code on some architecture. The attempt to legislate
such code out of existence is pragmatically badly flawed, never mind being
completely impractical to specify formally (at the level of a language
definition there is no such thing as code!)





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-16  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-18  0:00 ` Ken Garlington
  0 siblings, 1 reply; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-16  0:00 UTC (permalink / raw)



Robert Dewar <dewar@MERV.CS.NYU.EDU> writes:
>It *is* possible to write reliable programs, though it is expensive. If you
>need to do this, and are not able to do it, then the answer is to investigate
>the tools that make this possible, and understand the necessary investment
>(which is alarmingly high). Some of these tools are related to correctness,
>but that's not the main focus. There are reliable incorrect programs and
>correct unreliable programs, and what we are interested in is reliability.
>
<snip>
>Now of course informally we would like to make all programs reliable, but
>there is a cost/benefit trade off. For most non-safety critical programming
>(but not all), it is simply not cost effective to demand total reliability.
>
    You are absolutely correct about the cost. The control software we
    build is tested exhaustively from the module level on up to the
    integration with physical sensors & actuators well before it gets
    to drive an engine on a test stand - much less fly. It *is*
    enormously expensive - but in the present day it's the only way to
    be sure you aren't trying to fly something that will break.

    The point is that our software testing was derived from the same
    mindset as our hardware testing (turbine blades, pumps, bearings,
    etc.) We probably test a hardware component for an engine even
    more rigorously and at greater expense than we do for software -
    which is, after all, just another "part" for the engine. The
    mistake that is often made when looking at software is to think
    that somehow (because it passed the "smoke" test?) it doesn't need
    the same sort of rigorous testing we'd demand of any physical
    device in order to be proven reliable.

    Who would want to fly in an airplane powered by engines, the
    design for which had been verified by powering up a single
    prototype once and running it for 10 minutes? You'd probably feel
    a lot safer if we ran a couple of prototypes right into the
    ground, including making them ingest a few birds and deliberately
    cutting loose a turbine blade or two at speed. If you want
    reliable software, the testing can be no less rigorous.

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "The speed with which people can change a courtesy into an
    entitlement is awe-inspiring."

        --  Miss Manners, February 8, 1994
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-15  0:00   ` Robert Dewar
@ 1996-10-16  0:00     ` Michael F Brenner
  1996-10-16  0:00       ` Robert Dewar
  0 siblings, 1 reply; 105+ messages in thread
From: Michael F Brenner @ 1996-10-16  0:00 UTC (permalink / raw)



R. Dewar said:

    > I think that saturating types
    > would be overkill in terms of predefined integral types. Adding new classes
    > of integral types adds a lot of stuff to the language, just look at all the
    > stuff for supporting modular types.
    >
    > I think a much more reasonable approach for saturating operators is to
    > define the necessary operators. If you need some very clever efficient
    > code for these operators, then either use inlined machine code, or persuade
    > your vendor to implement these as efficient intrinsics, that is always
    > allowed.

This hops on both sides of the horse at the same time. It was good to add
modular types to Ada 95, but it was bad to add a lot of stuff to the language.
It was an unnecessary management decision, not related to the technical
requirement for efficient modular types, to add anything to the language
Other Than clever, efficient, reliable code for modular operators. A different
management decision would have been to keep the way all Ada 83 compilers
with modular types did it, leaving them Represented as ordinary integers,
but overloading an alternate set of arithmetic operators over those ordinary
integers, so that conversion between two's complement and modular binary
would not require any code to be generated (except a possibly optimized
away copy of the integer). This would still permit inefficient BCD
implementations of modular arithmetic wherever the efficient hardware
operators are not available on a given target architecture. 

Had this alternate decision been made, then the second half of R. Dewar's
comment could have been focussed on with more energy, namely, how can
more efficient code be generated for several different kinds of operators. 
Solutions sometimes available are interfacing to assembler language and
inline machine code. Solutions available to those with larger than normal
amounts of funding include paying a compiler maintainer to implement 
an efficient intrinsic function. But Another Way, for future consideration,
is to permit users to implement attributes or efficient intrinsic functions
by permitting pragmas which Demand certain performance requirements
of the generated code. As Dr. Dewar has repeatedly pointed out, performance
requirements are currently beyond the scope of the language definition. 
However, many programs have performance requirements, and having
a way to specify them (in an Appendix) would not detract from the 
language, but make it more useful in the realtime world. Examples of
such specifications include: (1) the topic of this thread (saturating overflows),
(2) do not generate code for a given instantiation of unchecked_conversion,
(3) do not even generate a copy for invocations of a given instantiation
of unchecked_conversion, (4) permit modular operations on an ordinary
user-defined range type, (5) use a particular run-time routine to implement
a particular array slice or Others initialization, (6) use a particular
machine code instruction to implement a particular array slice or Others
initialization, (7) truly deallocate a variable now, (8) truly deallocate
all variables of a given subtype now, (9) permit the use of all bits in 
the word in a given fixed-point arithmetic type, etc.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-04  0:00     ` @@           robin
  1996-10-04  0:00       ` Joseph C Williams
  1996-10-04  0:00       ` Michel OLAGNON
@ 1996-10-17  0:00       ` Ralf Tilch
  1996-10-17  0:00         ` Ravi Sundaram
  2 siblings, 1 reply; 105+ messages in thread
From: Ralf Tilch @ 1996-10-17  0:00 UTC (permalink / raw)



--

Hello,

I followed the discussion of the ARIANE 5 failure.
I didn't read all the mails, and I am quite astonished
how far and in how much detail it can be discussed.
Like,
which programming language would have been the best,
.....

It's good to know what happened.
I think what matters more is this:
you build something new (very complex).
You invest some billions to develop it.
You build it (an ARIANE 5, carrying several satellites).
Its price is several hundred millions,
and yet you don't check it as much as possible,
make a 'very complete check',
especially the software.

The reason that the software wasn't checked:
it was too 'expensive'?!?!

They forgot Murphy's law, which always 'works'.


I think you can't design a new car without
testing it completely.

Say we test 95% of the construction, and six months after
the new car goes on sale, a wheel falls off at 160 km/h.
OK, there was a small problem in the construction software:
some wrong values, due to some over- or underflows or
whatever.

The result: the company will probably have to pay quite a
lot, and probably have to close!

--------------------------------------------------------
-DON'T TRUST YOURSELF, TRUST MURPHY'S LAW !!!! 

"If anything can go wrong, it will."
--------------------------------------------------------
With this, have fun and continue the discussion about
conversion from 64-bit to 16-bit values, etc.

RT


________________|_______________________________________|_                
                | E-mail : R.Tilch@gmd.de               |
                | Tel.   : (+49) (0)2241/14-23.69       |
________________|_______________________________________|_
                |                                       |




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-17  0:00       ` Ralf Tilch
@ 1996-10-17  0:00         ` Ravi Sundaram
  1996-10-22  0:00           ` shmuel
  0 siblings, 1 reply; 105+ messages in thread
From: Ravi Sundaram @ 1996-10-17  0:00 UTC (permalink / raw)



Ralf Tilch wrote:
> The reason that the software wasn't checked:
> It was too 'expensive'?!?!.

	Yeah, isn't hindsight a wonderful thing?
	They, whoever were in charge of these decisions,
	also knew testing is important.  But it is impossible
	to test every subcomponent under every possible
	condition. There is simply not enough money or time
	available to do that.

	Take the space shuttle, for example. The total computing
	power available on board is probably about as much as is used
	in a Nintendo Game Boy. The design was frozen in the 1970s.
	Upgrading the computers and software would be so expensive
	to test and prove that they approach it with much trepidation.

	Richard Feynman examined the practices of NASA and
	found that the workers who assembled some large bulkheads
	had to count bolts from two reference points. He thought
	providing four reference points would simplify the job.
	NASA rejected the proposal because it would involve
	too many changes to the documentation, procedures and
	testing. (Surely You're Joking, Mr. Feynman! -- I? or II?)

	So praise them for conducting a no-nonsense investigation
	and owning up to the mistakes.  Learn to live with
	failed space shots. They will become as reliable as
	air travel once we have launched about 10 million rockets.

-- 
Ravi Sundaram.  
10/17/96
PS:	I am out of here. Going on vacation. Won't read follow-ups
	for a month.
                                (Opinions are mine, not Ansoft's.)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-18  0:00           ` Keith Thompson
  1996-10-18  0:00             ` Samuel T. Harris
@ 1996-10-18  0:00             ` Ken Garlington
  1 sibling, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-18  0:00 UTC (permalink / raw)



Keith Thompson wrote:
> 
> In <326506D2.1E40@lmtas.lmco.com> Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
> 
> This is one thing that's confused me about this report.  There is no
> predefined exception in Ada called Operand_Error.  Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.

It confused me too. I'm guessing that language differences are part of the
answer here, but I have no idea. It's also possible that the CPU hardware
specification has something called an "Operand Error" interrupt which is
generated during an overflow, which I assume gets mapped into Constraint_Error
(as is common with the MIL-STD-1750 CPU, for instance).

I also would be interested in any information about "Operand_Error".

> 
> --
> Keith Thompson (The_Other_Keith) kst@thomsoft.com <*>
> TeleSoft^H^H^H^H^H^H^H^H Alsys^H^H^H^H^H Thomson Software Products
> 10251 Vista Sorrento Parkway, Suite 300, San Diego, CA, USA, 92121-2706
> FIJAGDWOL

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-18  0:00           ` Keith Thompson
@ 1996-10-18  0:00             ` Samuel T. Harris
  1996-10-21  0:00               ` Ken Garlington
  1996-10-18  0:00             ` Ken Garlington
  1 sibling, 1 reply; 105+ messages in thread
From: Samuel T. Harris @ 1996-10-18  0:00 UTC (permalink / raw)



Keith Thompson wrote:
> 
> In <326506D2.1E40@lmtas.lmco.com> Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
> 
> This is one thing that's confused me about this report.  There is no
> predefined exception in Ada called Operand_Error.  Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.
> 

Remember, the report does NOT state that an unchecked_conversion
was used (as some on this thread have assumed). It only states
a "data conversion from 64-bit floating point to 16-bit signed
integer value". As someone (I forget who) pointed out early
in the thread weeks ago, a standard practice is to scale down
the range of a float value to fit into an integer variable.
This may not have been an unchecked_conversion at all, but
some mathematical expression.
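
A sketch of that practice (the scale factor is invented, and "BH" merely
echoes the report's horizontal-bias variable; this assumes a compiler
whose Long_Float is 64 bits): the conversion is an ordinary checked one
buried in an arithmetic expression, with no unchecked_conversion anywhere.

   procedure Scale_Down is
      type Int_16 is range -2**15 .. 2**15 - 1;
      Scale  : constant Long_Float := 0.5;  -- invented scale factor
      BH     : Long_Float := 1.0E9;         -- out of range even when scaled
      Biased : Int_16;
   begin
      Biased := Int_16 (BH * Scale);  -- checked conversion; raises
   end Scale_Down;                    -- Constraint_Error at run time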

Whenever software is reused, it must be reverified AND
revalidated. The report cites several reasons for not
reverifying the reuse of the SRI from the Ariane 4. Any
one of which may be justifiable. However, a cardinal rule
of risk management is that any risk to which NO measures
are applied remains a risk. Here they justified their way
into applying no measures at all toward ensuring the stuff
would work.

The report also states that the code which contained the
conversion was part of a feature which was now obsolete
for the Ariane 5. It was left in "presumably based on the view that,
unless proven necessary, it was not wise to make changes in software
which worked well on Ariane 4." While this does make good sense,
it is not by any means a verification nor a validation.
It just seems to mitigate your risk, but it really does
no such thing. You can't let such thinking lull you into
a false sense of security.

The analysis which led to protecting four variables from
Operand_Error and leaving 3 unprotected was not revisited
with the new environment in mind. How could it be, since
the Ariane 5 trajectory data was not included as a functional
requirement? Hence this measure does not apply to the risk
of the Ariane 5, though some in the decision may have relied
upon it for just that protection.

Then they went as far as not revalidating the SRI in an
Ariane 5 environment, which was the real hurt. While the
report states the Ariane 5 flight data was not included as
a functional requirement, someone should have asked for it
if they needed it. Its omission means any verification testing
which was done would not have taken it into account.
So it would have been verified (which is testing against what
the user said he wanted). However, validation testers (who
test what the user actually wants and are supposed to be
smart enough NOT to take the specification at face value)
should have insisted on such data, included or not.
That's the silly part about the whole affair: validation
testing also was not performed.

The report then goes on to discuss why the SRI's were not
included in a closed-loop test. So even if the Ariane 5
trajectory data had been included as a functional requirement,
it would not have helped. While the technical reasons
cited are appropriate for a verification test, the report
correctly points out that the goals of validation testing
are not so stringently dependent on the fidelity of the test
environment, so those reasons just don't justify not having
the SRI's in at least one validation test using Ariane 5
trajectory data, especially when other measures have NOT
been taken to ensure a compatible reuse of software.

In fact, section 2.1 states "The SRI internal events that
led to the failure have been reproduced by simulation calculations."
I wonder if they compiled and ran the Ada code on another
platform (which is a viable way of doing a lot of testing
for embedded software prior to embedding the software).
The report does not state if such testing was performed
by the developer. If the developer had done such testing, then
the Ariane 5 trajectory data would have spotted the flaw.
But for that to happen, someone would have had to ask
explicitly for the Ariane 5 data.

The end of section 2.3 summarizes the fact that the reviews
did not pick up on the fact that, of all the potential measures which
could have been applied to determine a compatible reuse of
software in the Ariane 5 operational environment, NONE of
them were actually performed. This left the reviewers
blissfully ignorant of an unmitigated risk staring them in
the face.

Of the SRI, I conclude ...

No design error (though it could have done something better).
No programming error (given the design).
An arguable specification error (but without appropriate testing).
A lapse in validation testing (assuming other non-existent measures).
A grave risk management and oversight problem.

Bottom line, a management (both customer and contractor) problem.

The OBC and main computer are another matter entirely.

I've not seen anyone on this thread address entries
3.1.f and g concerning the SRI sending diagnostic data (item f)
which was interpreted as flight data by the launcher's main
computer (item g). Section 2.1 states the backup failed first and
declared a failure, and the OBC could not switch to it because
it had already ceased to function. It seems the OBC knew about
the failures, so why did the main computer still interpret
data from a failed component as flight data?

That seems like a design or programming problem. It is
blind luck that the diagnostic data caused the main computer
to try to correct the trajectory via extreme positions of the
thruster nozzles which caused the rocket to turn sideways
to the air flow which caused buckling in the superstructure
which caused the self-destruct to engage.

Given the design philosophy of the designers, had the main
computer known both SRIs had failed, it should have signaled a
self-destruct right then and there. What would have happened
if the "diagnostic" data had caused minor course corrections and
brought the rocket over a populated area before the subsequent
course of events (or the ground flight controllers themselves)
signaled a self-destruct?

The report does not delve into this aspect of the problem
which I consider to be even more important. This tends to
tell me the SRI simulators in the closed-loop testing which
was performed were not used to check malfunctions, or if
they were, then the test scenarios are incomplete or flawed.

How many other interface/protocol/integration problems
are waiting to crop up? Which reused Ariane 4 software component
will fail next? Stay tuned for these and other provocative
questions on "As the Ariane Burns" ;)

I wonder how the payload insurance companies will respond with
their pricing for the next couple of launches.

-- 
Samuel T. Harris, Senior Engineer
Hughes Training, Inc. - Houston Operations
2224 Bay Area Blvd. Houston, TX 77058-2099
"If you can make it, We can fake it!"




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-16  0:00         ` Ken Garlington
@ 1996-10-18  0:00           ` Keith Thompson
  1996-10-18  0:00             ` Samuel T. Harris
  1996-10-18  0:00             ` Ken Garlington
  1996-10-23  0:00           ` robin
  1 sibling, 2 replies; 105+ messages in thread
From: Keith Thompson @ 1996-10-18  0:00 UTC (permalink / raw)



In <326506D2.1E40@lmtas.lmco.com> Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
[...]
> Not necessarily. Keep in mind that an exception _was_ raised -- a
> predefined exception (Operand_Error according to the report).

This is one thing that's confused me about this report.  There is no
predefined exception in Ada called Operand_Error.  Either the overflow
raised Constraint_Error (or Numeric_Error if they were using an Ada
83 compiler that doesn't follow AI-00387), or a user-defined exception
called Operand_Error was raised explicitly.

-- 
Keith Thompson (The_Other_Keith) kst@thomsoft.com <*>
TeleSoft^H^H^H^H^H^H^H^H Alsys^H^H^H^H^H Thomson Software Products
10251 Vista Sorrento Parkway, Suite 300, San Diego, CA, USA, 92121-2706
FIJAGDWOL




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-16  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-18  0:00 ` Ken Garlington
  1996-10-19  0:00   ` Frank Manning
  0 siblings, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-10-18  0:00 UTC (permalink / raw)



Marin David Condic, 407.796.8997, M/S 731-93 wrote:

>     Who would want to fly in an airplane powered by engines, the
>     design for which had been verified by powering up a single
>     prototype once and running it for 10 minutes? You'd probably feel
>     a lot safer if we ran a couple of prototypes right into the
>     ground, including making them ingest a few birds and deliberately
>     cutting loose a turbine blade or two at speed. If you want
>     reliable software, the testing can be no less rigorous.

Well, I know that on the YF-22 program, one of the engine manufacturers
did in fact cut loose a few turbine blades during system test -- although,
in that case, it was unintentional. We also ran one of the aircraft into the
ground -- again, unintentionally.

As for the birds, there is an interesting test done here in Fort Worth. (At
least, we used to do it -- I haven't actually witnessed one of these tests lately). 
To determine if the canopy will survive a bird strike, they actually take a bird
(presumably of mil-spec size and weight), load it into a cannon-type device, and
fire the bird at the canopy. By the way, it's not a good idea to use a _frozen_ 
bird for this test...

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-18  0:00 ` Ken Garlington
@ 1996-10-19  0:00   ` Frank Manning
  1996-10-21  0:00     ` Norman H. Cohen
  0 siblings, 1 reply; 105+ messages in thread
From: Frank Manning @ 1996-10-19  0:00 UTC (permalink / raw)



In article <32678222.6F5C@lmtas.lmco.com> Ken Garlington
<garlingtonke@lmtas.lmco.com>

> As for the birds, there is an interesting test done here in Fort Worth.
> (At least, we used to do it -- I haven't actually witnessed one of these
> tests lately). To determine if the canopy will survive a bird strike,
> they actually take a bird (presumably of mil-spec size and weight), load
> it into a cannon-type device, and fire the bird at the canopy. By the
> way, it's not a good idea to use a _frozen_ bird for this test...

When I was in the Air Force, I heard a rumor there was an Air
Force facility that used chickens for similar testing. At one
time the guy in charge was a certain Colonel Sanders...

-- Frank Manning




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-19  0:00   ` Frank Manning
@ 1996-10-21  0:00     ` Norman H. Cohen
  0 siblings, 0 replies; 105+ messages in thread
From: Norman H. Cohen @ 1996-10-21  0:00 UTC (permalink / raw)



Frank Manning wrote:
> 
> In article <32678222.6F5C@lmtas.lmco.com> Ken Garlington
> <garlingtonke@lmtas.lmco.com>
> 
> > As for the birds, there is an interesting test done here in Fort Worth.
> > (At least, we used to do it -- I haven't actually witnessed one of these
> > tests lately). To determine if the canopy will survive a bird strike,
> > they actually take a bird (presumably of mil-spec size and weight), load
> > it into a cannon-type device, and fire the bird at the canopy. By the
> > way, it's not a good idea to use a _frozen_ bird for this test...
> 
> When I was in the Air Force, I heard a rumor there was an Air
> Force facility that used chickens for similar testing. At one
> time the guy in charge was a certain Colonel Sanders...

Similar testing was done in the Chinese air force.  The program was so
successful that its director, Colonel Tso, was promoted to the rank of
general.

:-)

-- 
Norman H. Cohen
mailto:ncohen@watson.ibm.com
http://www.research.ibm.com/people/n/ncohen




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-18  0:00             ` Samuel T. Harris
@ 1996-10-21  0:00               ` Ken Garlington
  0 siblings, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-21  0:00 UTC (permalink / raw)



Samuel T. Harris wrote:
> 
> Keith Thompson wrote:
> >
> > In <326506D2.1E40@lmtas.lmco.com> Ken Garlington <garlingtonke@lmtas.lmco.com> writes:
> > [...]
> > > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > > predefined exception (Operand_Error according to the report).
> >
> > This is one thing that's confused me about this report.  There is no
> > predefined exception in Ada called Operand_Error.  Either the overflow
> > raised Constraint_Error (or Numeric_Error if they were using an Ada
> > 83 compiler that doesn't follow AI-00387), or a user-defined exception
> > called Operand_Error was raised explicitly.
> >
> 
> Remember, the report does NOT state that an unchecked_conversion
> was used (as some on this thread have assumed). It only states
> a "data conversion from 64-bit floating point to 16-bit signed
> integer value". As someone (I forget who) pointed out early
> in the thread weeks ago, a standard practice is to scale down
> the range of a float value to fit into an integer variable.
> This may not have been an unchecked_conversion at all, but
> some mathematical expression.

In fact, I would be very surprised if unchecked_conversion was used.
It wouldn't make much sense to convert from float to fixed using UC.
More than likely, the constraint error/hardware interrupt was raised
due to an overflow of the 16-bit value during the type conversion part
of the scaling equation.

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-21  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-22  0:00 ` Adam Beneschan
  0 siblings, 1 reply; 105+ messages in thread
From: Marin David Condic, 407.796.8997, M/S 731-93 @ 1996-10-21  0:00 UTC (permalink / raw)



Frank Manning <frank@BIGDOG.ENGR.ARIZONA.EDU> writes:
>In article <32678222.6F5C@lmtas.lmco.com> Ken Garlington
><garlingtonke@lmtas.lmco.com>
>
>> As for the birds, there is an interesting test done here in Fort Worth.
>> (At least, we used to do it -- I haven't actually witnessed one of these
>> tests lately). To determine if the canopy will survive a bird strike,
>> they actually take a bird (presumably of mil-spec size and weight), load
>> it into a cannon-type device, and fire the bird at the canopy. By the
>> way, it's not a good idea to use a _frozen_ bird for this test...
>
>When I was in the Air Force, I heard a rumor there was an Air
>Force facility that used chickens for similar testing. At one
>time the guy in charge was a certain Colonel Sanders...
>
    There is, in fact, a Mil Spec bird for bird-ingestion tests on jet
    engines. (Similar procedure - fire 'em out of a cannon into the
    turbine blades and film at high speed so you can watch it get
    sliced into cold-cuts.) The specification may well apply to canopy
    impact tests also since it would be seeing similar takeoff/landing
    profiles.

    I hear the Navy has its own standard for bird-ingestion. The
    birds that follow aircraft carriers are apparently larger than
    a Mark One/Mod Zero Air Force bird.

    MDC
Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "If you don't say anything, you won't be called on to repeat it."

        --  Calvin Coolidge
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-17  0:00         ` Ravi Sundaram
@ 1996-10-22  0:00           ` shmuel
  1996-10-22  0:00             ` Jim Carr
  0 siblings, 1 reply; 105+ messages in thread
From: shmuel @ 1996-10-22  0:00 UTC (permalink / raw)



In <3266741B.4DAA@ansoft.com>, Ravi Sundaram <ravi@ansoft.com> writes:
>Ralf Tilch wrote:
>> The reason that the software wasn't checked:
>> It was too 'expensive'?!?!.
>
>	Yeah, isn't hindsight a wonderful thing?  
>	They, whoever were in charge of these decisions,
>	also knew testing is important.  But it is impossible
>	to test every subcomponent under every possible
>	condition. There is simply not enough money or time
>	available to do that.

Why do you assume that it was hindsight? They violated fundamental
software engineering principles, and anyone who has been in this business
for long should have expected chickens coming home to roost, even if they
couldn't predict what would go wrong first.

>	Richard Feynman examined the practices of NASA and
>	found that the workers who assembled some large bulkheads
>	had to count bolts from two reference points. He thought
>	providing four reference points would simplify the job.
>	NASA rejected the proposal because it would involve
>	too many changes to the documentation, procedures and
>	testing. (Surely You're Joking, Mr. Feynman! -- I? or II?)
>
>	So praise them for conducting a no nonsense investigation
>	and owning up to the mistakes.  Learn to live with
>	failed space shots. They will become as reliable as
>	air travel once we have launched about 10 million rockets.

I hope that you're talking about Ariane and not the NASA Challenger; Feynman's
account of the behavior of most of the Rogers Commission, in "What Do
You Care ...", sounds more like a failed coverup than like "owning up to 
their mistakes", and Feynman had to threaten to air a dissenting opinion
on television before they agreed to publish it in their report.

	Shmuel (Seymour J.) Metz
	Atid/2





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-22  0:00           ` shmuel
@ 1996-10-22  0:00             ` Jim Carr
  1996-10-24  0:00               ` hayim
  0 siblings, 1 reply; 105+ messages in thread
From: Jim Carr @ 1996-10-22  0:00 UTC (permalink / raw)



shmuel.metz@os2bbs.com writes:
>
>I hope that you're talking about Ariane and not the NASA Challenger; Feynman's
>account of the behavior of most of the Rogers Commission, in "What Do
>You Care ...", sounds more like a failed coverup than like "owning up to 
>their mistakes", ...

The coverup was not entirely unsuccessful.  Feynman did manage to break 
through and get his dissenting remarks on NASA reliability estimates 
into the report (as well as into Physics Today), but the coverup did 
succeed in keeping most people ignorant of the fact that the astronauts 
did not die until impact with the ocean despite a Miami Herald story 
pointing that out to its mostly-regional audience. 

Did you ever see a picture of the crew compartment? 

-- 
 James A. Carr   <jac@scri.fsu.edu>     |  Raw data, like raw sewage, needs 
    http://www.scri.fsu.edu/~jac        |  some processing before it can be
 Supercomputer Computations Res. Inst.  |  spread around.  The opposite is
 Florida State, Tallahassee FL 32306    |  true of theories.  -- JAC




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-21  0:00 Marin David Condic, 407.796.8997, M/S 731-93
@ 1996-10-22  0:00 ` Adam Beneschan
  0 siblings, 0 replies; 105+ messages in thread
From: Adam Beneschan @ 1996-10-22  0:00 UTC (permalink / raw)



"Marin David Condic, 407.796.8997, M/S 731-93" <condicma@PWFL.COM> writes:

 >    There is, in fact, a Mil Spec bird for bird-ingestion tests on jet
 >    engines. (Similar procedure - fire 'em out of a cannon into the
 >    turbine blades and film at high speed so you can watch it get
 >    sliced into cold-cuts.) . . .

So that's where those MRE's come from . . .

:)

                                -- Adam




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-14  0:00 Marin David Condic, 407.796.8997, M/S 731-93
  1996-10-15  0:00 ` Robert I. Eachus
@ 1996-10-23  0:00 ` robin
  1 sibling, 0 replies; 105+ messages in thread
From: robin @ 1996-10-23  0:00 UTC (permalink / raw)



	"Marin David Condic, 407.796.8997, M/S 731-93" <condicma@PWFL.COM> writes:

	>    The parts which typically have hard deadlines tend to be heavy on
	>    math or data motion and rather light on branching and call chain
	>    complexity. You want your "worst case" timing to be your nominal
	>    path and you'd like for it to be easily analyzed and very
	>    predictable. Usually, it's a relatively small part of the system
	>    and maybe (MAYBE!) you can turn off runtime checks for just this
	>    portion of the code, leaving it in for the things which run at a
	>    lower duty cycle.

	>    Of course the truly important thing to remember is that compiler
	>    generated runtime checks are not a panacea. They *may* have helped
	>    with the Ariane 5,

The Report said that it could have been done, and obviously,
it should have.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-16  0:00         ` Ken Garlington
  1996-10-18  0:00           ` Keith Thompson
@ 1996-10-23  0:00           ` robin
  1 sibling, 0 replies; 105+ messages in thread
From: robin @ 1996-10-23  0:00 UTC (permalink / raw)



	Ken Garlington <garlingtonke@lmtas.lmco.com> writes:

	>Matthew Heaney wrote:
	>> 
	>> As you stated, exceptions are only a tool.  They don't replace the need for
	>> (mental) reasoning about the correctness of my program, nor should they be
	>> used to guard against sloppy programming.  Exceptions don't correct the
	>> problem for you, but at least they let you know that a problem exists.
	>> 
	>> And in spite of all the efforts of the Ariane 5 developers, a problem did
	>> exist, significant enough to cause mission failure.  Don't you think an
	>> exception was justified in this case?

	>Not necessarily. Keep in mind that an exception _was_ raised -- a predefined 
	>exception (Operand_Error according to the report). There was sufficient telemetry 
	>to determine where the error occurred (obviously, otherwise we wouldn't know what 
	>happened!). If the real Ariane 5 trajectory had been tested in an integrated 
	>laboratory environment, then (assuming the environment was realistic enough to 
	>trigger the problem), the fault would have been seen (and presumably analyzed and 
	>fixed) prior to launch. So, the issue is not the addition of a user-defined 
	>exception to find the error -- the issue is the addition of a new exception 
	>_handler_ to _recover_ from the error in flight.

---The issue was not the addition of a new exception handler.
The issue was that a magnitude check should have been
performed on a conversion from double precision floating
point to 16-bit integer, but it wasn't.
Of course, having an exception handler for this specific purpose
would have helped, and should have been included as a fallback.

	>Assuming that a new exception _handler_ had been added, then it _might_ have made 
	>a difference.

You can be absolutely certain that it would have helped.




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-22  0:00             ` Jim Carr
@ 1996-10-24  0:00               ` hayim
  1996-10-25  0:00                 ` Michel OLAGNON
  1996-10-25  0:00                 ` Ken Garlington
  0 siblings, 2 replies; 105+ messages in thread
From: hayim @ 1996-10-24  0:00 UTC (permalink / raw)



Unfortunately, I missed the original article describing the Ariane failure.
If someone could please either point me in the right direction as to where
I can get a copy, or even send it to me, I would greatly appreciate it.

Thanks very much,

Hayim Hendeles

E-mail: hayim@platsol.com





^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-24  0:00               ` hayim
@ 1996-10-25  0:00                 ` Michel OLAGNON
  1996-10-25  0:00                 ` Ken Garlington
  1 sibling, 0 replies; 105+ messages in thread
From: Michel OLAGNON @ 1996-10-25  0:00 UTC (permalink / raw)



In article <54oht1$ln1@orchard.la.platsol.com>, <hayim> writes:
>Unfortunately, I missed the original article describing the Ariane failure.
>If someone could please, either point me in the right direction as to where
>I can get a copy, or could even send it to me, I would greatly appreciate it.
>

It may be useful to remind the source address for the full report, since
many comments seem based only on a presentation summary:

http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

Michel

-- 
| Michel OLAGNON                       email : Michel.Olagnon@ifremer.fr|
| IFREMER: Institut Francais de Recherches pour l'Exploitation de la Mer|







^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-24  0:00               ` hayim
  1996-10-25  0:00                 ` Michel OLAGNON
@ 1996-10-25  0:00                 ` Ken Garlington
  1 sibling, 0 replies; 105+ messages in thread
From: Ken Garlington @ 1996-10-25  0:00 UTC (permalink / raw)



hayim wrote:
> 
> Unfortunately, I missed the original article describing the Ariane failure.
> If someone could please, either point me in the right direction as to where
> I can get a copy, or could even send it to me, I would greatly appreciate it.
> 
> Thanks very much,
> 
> Hayim Hendeles
> 
> E-mail: hayim@platsol.com

See:
  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-28  0:00 Marin David Condic, 561.796.8997, M/S 731-93
  0 siblings, 0 replies; 105+ messages in thread
From: Marin David Condic, 561.796.8997, M/S 731-93 @ 1996-10-28  0:00 UTC (permalink / raw)



robin <rav@GOANNA.CS.RMIT.EDU.AU> writes:
>        >    Of course the truly important thing to remember is that compiler
>        >    generated runtime checks are not a panacea. They *may* have helped
>        >    with the Ariane 5,
>
>The Report said that it could have been done, and obviously,
>it should have.
>
    The point of my statement was in the part of my previous message
    which was inadvertently clipped: I do not disagree that the
    runtime checks should have been done (20/20 hindsight is a
    wonderful thing.) But failure detection is not, in and of itself,
    sufficient. Had the accommodation for the detected failure been
    "Shut down the channel and pass control to the other side", they
    would have been in *exactly* the same place they were without the
    runtime checks. (And this is a *VERY* common defined accommodation
    for dual redundant systems for large classes of errors.)

    procedure ARIANE_FIVE_OPERATION is
    begin
        DO_STUFF_THATS_COOL_TO_RUN_THE_ARIANE_FIVE_ROCKET ;
    exception
        when CONSTRAINT_ERROR | NUMERIC_ERROR => --Yup! Got them runtime checks!
            SHUT_DOWN_THE_CHANNEL_AND_PASS_CONTROL_TO_THE_OTHER_SIDE ; --Boom!
    end ARIANE_FIVE_OPERATION ;

    In other words, the most serious problems with software are bad
    engineering decisions - not the use, or lack thereof, of any given
    language attribute. It's a little like the company which makes
    concrete life-preservers getting ISO-9000 certification. By gum,
    they have a procedure and it's written down and it's adhered to
    with religious fervor by every single employee and they make an
    absolutely flawless concrete life-preserver. But there's still
    something fundamentally wrong with this picture, isn't there?

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "If you don't say anything, you won't be called on to repeat it."

        --  Calvin Coolidge
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-28  0:00 Marin David Condic, 561.796.8997, M/S 731-93
  1996-10-29  0:00 ` Ken Garlington
  0 siblings, 1 reply; 105+ messages in thread
From: Marin David Condic, 561.796.8997, M/S 731-93 @ 1996-10-28  0:00 UTC (permalink / raw)



robin <rav@GOANNA.CS.RMIT.EDU.AU> writes:
>---The issue was not the addition of a new exception handler.
>The issue was that a magnitude check should have been
>performed on a conversion from double precision floating
>point to 16-bit integer, but it wasn't.
>Of course, having an exception handler for this specific purpose
>would have helped, and should have been included as a fallback.
>
>        >Assuming that a new exception _handler_ had been added, then it _might_
>        >have made a difference.
>
>You can be absolutely certain that it would have helped.
>
    procedure ARIANE_FIVE_OPERATION is
        X   : FLOAT range -65536.0..65535.0 := 65535.0 ;
        Y   : INTEGER range -32768..32767 := 0 ;
    begin
        if (X not in -32768.0..32767.0) then
            SHUT_DOWN_THE_CHANNEL_AND_PASS_CONTROL_TO_THE_OTHER_SIDE ; --Boom!
        else
            Y := INTEGER (X) ;
        end if ;
    end ARIANE_FIVE_OPERATION ;

    *Absolutely* certain?

    Seems that the above algorithm *DOES* have runtime checks to
    constrain the assignment of a float to an integer, yet the
    accommodation is still a *BAD* engineering decision. We can polish
    and polish and polish the software all we want and perform all
    sorts of runtime checks, but if the accommodation remains wrong,
    we're polishing a turd.
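
    (For illustration only -- a minimal sketch of a less brittle
    accommodation, reusing the hypothetical names from the fragment
    above; this is not the actual SRI code:)

    procedure ARIANE_FIVE_OPERATION is
        X   : FLOAT range -65536.0..65535.0 := 65535.0 ;
        Y   : INTEGER range -32768..32767 := 0 ;
    begin
        if X > 32767.0 then
            Y := 32767 ;        -- saturate high: keep flying on the best estimate
        elsif X < -32768.0 then
            Y := -32768 ;       -- saturate low
        else
            Y := INTEGER (X) ;  -- in range: convert normally
        end if ;
    end ARIANE_FIVE_OPERATION ;

    Here the same runtime check fires, but the response substitutes a
    clamped value and keeps the channel alive, rather than handing
    control to a backup that is about to fail in exactly the same way.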

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "If you don't say anything, you won't be called on to repeat it."

        --  Calvin Coolidge
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-28  0:00 Marin David Condic, 561.796.8997, M/S 731-93
@ 1996-10-29  0:00 ` Ken Garlington
  1996-11-08  0:00   ` robin
  0 siblings, 1 reply; 105+ messages in thread
From: Ken Garlington @ 1996-10-29  0:00 UTC (permalink / raw)



Marin David Condic, 561.796.8997, M/S 731-93 wrote:
> 
> robin <rav@GOANNA.CS.RMIT.EDU.AU> writes:
> >---The issue was not the addition of a new exception handler.
> >The issue was that a magnitude check should have been
> >performed on a conversion from double precision floating
> >point to 16-bit integer, but it wasn't.
> >Of course, having an exception handler for this specific purpose
> >would have helped, and should have been included as a fallback.
> >
> >        >Assuming that a new exception _handler_ had been added, then it _might_
> >        >have made a difference.
> >
> >You can be absolutely certain that it would have helped.
> >
>     procedure ARIANE_FIVE_OPERATION is
>         X   : FLOAT range -65536.0..65535.0 := 65535.0 ;
>         Y   : INTEGER range -32768..32767 := 0 ;
>     begin
>         if (X not in -32768.0..32767.0) then
>             SHUT_DOWN_THE_CHANNEL_AND_PASS_CONTROL_TO_THE_OTHER_SIDE ; --Boom!
>         else
>             Y := INTEGER (X) ;
>         end if ;
>     end ARIANE_FIVE_OPERATION ;
> 
>     *Absolutely* certain?

Well, in PL/I, it won't permit an exception handler to shut down the channel. :)

(I wasn't going to answer the silly assertion posted above, but I'm glad someone
did.)

> 
>     Seems that the above algorithm *DOES* have runtime checks to
>     constrain the assignment of a float to an integer, yet the
>     accommodation is still a *BAD* engineering decision. We can polish
>     and polish and polish the software all we want and perform all
>     sorts of runtime checks, but if the accommodation remains wrong,
>     we're polishing a turd.
> 
>     MDC
> 
> Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
> M/S 731-96                                      Technet:    796.8997
> Pratt & Whitney, GESP                           Fax:        561.796.4669
> P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
> West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
> ===============================================================================
>     "If you don't say anything, you won't be called on to repeat it."
> 
>         --  Calvin Coolidge
> ===============================================================================

-- 
LMTAS - "Our Brand Means Quality"
For more info, see http://www.lmtas.com or http://www.lmco.com




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
@ 1996-10-31  0:00 Marin David Condic, 561.796.8997, M/S 731-93
  0 siblings, 0 replies; 105+ messages in thread
From: Marin David Condic, 561.796.8997, M/S 731-93 @ 1996-10-31  0:00 UTC (permalink / raw)



Adam Beneschan <adam@IRVINE.COM> writes:
> >    There is, in fact, a Mil Spec bird for bird-ingestion tests on jet
> >    engines. (Similar procedure - fire 'em out of a cannon into the
> >    turbine blades and film at high speed so you can watch it get
> >    sliced into cold-cuts.) . . .
>
>So that's where those MRE's come from . . .
>
>:)
>
    We think that's how the cafeteria around here developed UMS
    (Universal Meat Substitute) - or possibly OFS (Other Food
    Substitute).

    BTW: I misspoke myself above: You fire the bird into the
    compressor, not the turbine (although bird-whiz will pass thru the
    turbine, eventually.) And the first thing it will hit is not
    blades, but inlet guide vanes, which are stationary. A little like
    that Ronco Veg-O-Matic - It Slices! It Dices! It never needs
    ironing! Turns a sandwich into a banquet!...

    MDC

Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
M/S 731-96                                      Technet:    796.8997
Pratt & Whitney, GESP                           Fax:        561.796.4669
P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM
===============================================================================
    "A man who has a million dollars is as well off as if he were
    rich"

        --  John Jacob Astor.
===============================================================================




^ permalink raw reply	[flat|nested] 105+ messages in thread

* Re: Ariane 5 failure
  1996-10-29  0:00 ` Ken Garlington
@ 1996-11-08  0:00   ` robin
  0 siblings, 0 replies; 105+ messages in thread
From: robin @ 1996-11-08  0:00 UTC (permalink / raw)



	Ken Garlington <garlingtonke@lmtas.lmco.com> writes:

	>Marin David Condic, 561.796.8997, M/S 731-93 wrote:
	>> 
	>> robin <rav@GOANNA.CS.RMIT.EDU.AU> writes:
	>> >---The issue was not the addition of a new exception handler.
	>> >The issue was that a magnitude check should have been
	>> >performed on a conversion from double precision floating
	>> >point to 16-bit integer, but it wasn't.
	>> >Of course, having an exception handler for this specific purpose
	>> >would have helped, and should have been included as a fallback.
	>> >
	>> >        >Assuming that a new exception _handler_ had been added, then it _might_
	>> >        >have made a difference.
	>> >
	>> >You can be absolutely certain that it would have helped.
	>> >
	>>     procedure ARIANE_FIVE_OPERATION is
	>>         X   : FLOAT range -65536.0..65535.0 := 65535.0 ;
	>>         Y   : INTEGER range -32768..32767 := 0 ;
	>>     begin
	>>         if (X not in -32768.0..32767.0) then
	>>             SHUT_DOWN_THE_CHANNEL_AND_PASS_CONTROL_TO_THE_OTHER_SIDE ; --Boom!
	>>         else
	>>             Y := INTEGER (X) ;
	>>         end if ;
	>>     end ARIANE_FIVE_OPERATION ;
	>> 
	>>     *Absolutely* certain?

---Absolutely.  The Report stated that the best estimate of the
value could have been used.

Thus:

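   /* Saturate at the 16-bit limits: substitute the best estimate rather than fail */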
   IF X > 32767 THEN
	Y = 32767;
   ELSE IF X < -32767 THEN
	Y = -32767;
   ELSE
	Y = X;

But back to the point, which you've apparently missed: the overflow
could also have been trapped locally, as a fallback, with a simple
ON FIXEDOVERFLOW statement.
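
(In the Ada of the fragments quoted above, that local fallback might
look something like this -- a sketch only, with illustrative names,
not the actual SRI code:)

   subtype INT16 is INTEGER range -32768 .. 32767 ;

   procedure CONVERT (X : in FLOAT ; Y : out INT16) is
   begin
      Y := INT16 (X) ;   -- range check fires here if X will not fit
   exception
      when CONSTRAINT_ERROR | NUMERIC_ERROR =>
         -- trap locally and substitute the best estimate,
         -- instead of shutting the channel down
         if X > 0.0 then
            Y := 32767 ;
         else
            Y := -32767 ;
         end if ;
   end CONVERT ;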

	>Well, in PL/I, it won't permit an exception handler to shut down the channel. :)

	>>     Seems that the above algorithm *DOES* have runtime checks to
	>>     constrain the assignment of a float to an integer, yet the
	>>     accommodation is still a *BAD* engineering decision.

You have to be joking.

	>> We can polish
	>>     and polish and polish the software all we want and perform all
	>>     sorts of runtime checks, but if the accommodation remains wrong,
	>>     we're polishing a turd.
	>> 
	>>     MDC
	>> 
	>> Marin David Condic, Senior Computer Engineer    ATT:        561.796.8997
	>> M/S 731-96                                      Technet:    796.8997
	>> Pratt & Whitney, GESP                           Fax:        561.796.4669
	>> P.O. Box 109600                                 Internet:   CONDICMA@PWFL.COM
	>> West Palm Beach, FL 33410-9600                  Internet:   CONDIC@FLINET.COM




^ permalink raw reply	[flat|nested] 105+ messages in thread

end of thread, other threads:[~1996-11-08  0:00 UTC | newest]

Thread overview: 105+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <agrapsDy4oJH.29G@netcom.com>
1996-09-25  0:00 ` Ariane 5 failure @@           robin
1996-09-25  0:00   ` Michel OLAGNON
1996-09-25  0:00     ` Chris Morgan
1996-09-25  0:00     ` Byron Kauffman
1996-09-25  0:00       ` A. Grant
1996-09-25  0:00         ` Ken Garlington
1996-09-26  0:00         ` Byron Kauffman
1996-09-27  0:00           ` A. Grant
1996-09-26  0:00         ` Sandy McPherson
1996-09-25  0:00   ` Bob Kitzberger
1996-09-26  0:00     ` Ronald Kunne
1996-09-26  0:00       ` Matthew Heaney
1996-09-27  0:00         ` Wayne Hayes
1996-09-27  0:00           ` Richard Pattis
1996-09-29  0:00             ` Dann Corbit
1996-09-29  0:00             ` Alan Brain
1996-09-29  0:00             ` Chris McKnight
1996-09-29  0:00               ` Real-world education (was: Ariane 5 failure) Michael Feldman
1996-10-01  0:00             ` Ariane 5 failure Ken Garlington
1996-09-27  0:00         ` Ronald Kunne
1996-09-27  0:00           ` Lawrence Foard
1996-10-04  0:00             ` @@           robin
1996-09-28  0:00           ` Ken Garlington
1996-09-28  0:00             ` Ken Garlington
1996-09-29  0:00           ` Alan Brain
1996-09-29  0:00             ` Robert A Duff
1996-09-30  0:00               ` Wayne L. Beavers
1996-10-01  0:00                 ` Ken Garlington
1996-10-01  0:00                   ` Wayne L. Beavers
1996-10-01  0:00                     ` Ken Garlington
1996-10-02  0:00                       ` Sandy McPherson
1996-10-03  0:00                 ` Richard A. O'Keefe
1996-10-01  0:00             ` Ken Garlington
1996-09-28  0:00         ` Ken Garlington
1996-09-27  0:00       ` Ken Garlington
1996-09-27  0:00       ` Alan Brain
1996-09-28  0:00         ` Ken Garlington
1996-09-29  0:00       ` Louis K. Scheffer
1996-09-27  0:00   ` John McCabe
1996-10-01  0:00     ` Michael Dworetsky
1996-10-04  0:00       ` Steve Bell
1996-10-07  0:00         ` Ken Garlington
1996-10-09  0:00         ` @@           robin
1996-10-09  0:00           ` Steve O'Neill
1996-10-12  0:00             ` Alan Brain
1996-10-04  0:00     ` @@           robin
1996-10-04  0:00       ` Joseph C Williams
1996-10-06  0:00         ` Wayne Hayes
1996-10-04  0:00       ` Michel OLAGNON
1996-10-09  0:00         ` @@           robin
1996-10-17  0:00       ` Ralf Tilch
1996-10-17  0:00         ` Ravi Sundaram
1996-10-22  0:00           ` shmuel
1996-10-22  0:00             ` Jim Carr
1996-10-24  0:00               ` hayim
1996-10-25  0:00                 ` Michel OLAGNON
1996-10-25  0:00                 ` Ken Garlington
1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-02  0:00 ` Alan Brain
1996-10-02  0:00   ` Ken Garlington
1996-10-02  0:00     ` Matthew Heaney
1996-10-04  0:00       ` Robert S. White
1996-10-05  0:00         ` Robert Dewar
1996-10-05  0:00         ` Alan Brain
1996-10-06  0:00           ` Robert S. White
1996-10-03  0:00     ` Alan Brain
1996-10-04  0:00       ` Ken Garlington
  -- strict thread matches above, loose matches on Subject: below --
1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-02  0:00 ` Ken Garlington
1996-10-01  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-02  0:00 ` Robert I. Eachus
1996-10-02  0:00   ` Ken Garlington
1996-10-02  0:00 ` Matthew Heaney
1996-10-04  0:00   ` Ken Garlington
1996-10-05  0:00     ` Robert Dewar
1996-10-06  0:00       ` Keith Thompson
1996-10-10  0:00       ` Ken Garlington
1996-10-14  0:00       ` Matthew Heaney
1996-10-15  0:00         ` Robert Dewar
1996-10-16  0:00         ` Ken Garlington
1996-10-18  0:00           ` Keith Thompson
1996-10-18  0:00             ` Samuel T. Harris
1996-10-21  0:00               ` Ken Garlington
1996-10-18  0:00             ` Ken Garlington
1996-10-23  0:00           ` robin
1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-03  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-14  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-15  0:00 ` Robert I. Eachus
1996-10-15  0:00   ` Robert Dewar
1996-10-16  0:00     ` Michael F Brenner
1996-10-16  0:00       ` Robert Dewar
1996-10-23  0:00 ` robin
1996-10-16  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-18  0:00 ` Ken Garlington
1996-10-19  0:00   ` Frank Manning
1996-10-21  0:00     ` Norman H. Cohen
1996-10-21  0:00 Marin David Condic, 407.796.8997, M/S 731-93
1996-10-22  0:00 ` Adam Beneschan
1996-10-28  0:00 Marin David Condic, 561.796.8997, M/S 731-93
1996-10-29  0:00 ` Ken Garlington
1996-11-08  0:00   ` robin
1996-10-28  0:00 Marin David Condic, 561.796.8997, M/S 731-93
1996-10-31  0:00 Marin David Condic, 561.796.8997, M/S 731-93
