comp.lang.ada
* Ariane Failure
@ 1996-06-28  0:00 Robert B. Love 
  1996-07-01  0:00 ` Ken Garlington
  0 siblings, 1 reply; 29+ messages in thread
From: Robert B. Love  @ 1996-06-28  0:00 UTC (permalink / raw)


If I understand what I read here the earlier comments about
the failed Ariane-5 indicated that the flight s/w was coded 
in Ada.  The blurb I've read in Space News says the code
that failed resided in the inertial measurement units.  This
is different than the flight software.  Does anybody know 
what the embedded code for the IMUs is coded in?

Overall, it seems a design failure.  The IMUs couldn't handle
the flight profile of the Ar-5 and the test bed was killed to 
save money.

----------------------------------------------------------------
Bob Love, rlove@neosoft.com (local)        MIME & NeXT Mail OK
rlove@raptor.rmnug.org  (permanent)        PGP key available
----------------------------------------------------------------






* Re: Ariane Failure
  1996-06-28  0:00 Robert B. Love 
@ 1996-07-01  0:00 ` Ken Garlington
  0 siblings, 0 replies; 29+ messages in thread
From: Ken Garlington @ 1996-07-01  0:00 UTC (permalink / raw)



Robert B. Love wrote:
> 
> If I understand what I read here the earlier comments about
> the failed Ariane-5 indicated that the flight s/w was coded
> in Ada.  The blurb I've read in Space News says the code
> that failed resided in the inertial measurement units.

From what I saw, the European Space Agency preliminary announcement
didn't refer to _any_ code, in either the flight controller or the
INU. It said that there was a fault in the INU _system_. We won't know
until at least the final report later in July whether this was a 
software fault, a dual hardware fault, an interface/design error, or 
what.

-- 
LMTAS - "Our Brand Means Quality"





* Re: Ariane Failure
       [not found] <ee2a195b.0203260725.a02dbfe@posting.google.com>
@ 2002-03-29 18:56 ` Richard Riehle
  2002-03-29 20:56   ` Michael Feathers
  2002-04-01 15:08   ` Marin David Condic
  0 siblings, 2 replies; 29+ messages in thread
From: Richard Riehle @ 2002-03-29 18:56 UTC (permalink / raw)


rjk wrote:

> What is XPers response to this? (I was going to ask a more specific
> question, but I thought I'd leave it broad until an interesting question is
> found).

The problem with Ariane V begins with Systems Engineering management.
The decisions about what to do when an exception occurs were wrong, and
not tested.    Although Design By Contract might have helped,  I doubt that
Eiffel would have been appropriate because of other issues related to
Eiffel.   I like Eiffel, but don't consider it appropriate for a project such
as Ariane V.    The SPARK approach to Design By Contract (they don't
call it that, but that is what it is)  could have worked well, especially
since it was programmed in Ada.   By the way, the Ada code worked as
it was directed to work, but it was given bad directions.

The other problem was one of software reuse.    We often tout the value of
software reuse, but here is a case where it was not working as expected.

The software module that failed was originally used in Ariane IV, where
it worked fine.   Without testing, it was used on Ariane V which had
slightly different launch characteristics.   A perfectly good module from
one context was used in another context without considering the full
range of issues in that new context.

We could draw the analogy of a physician who prescribes a medicine
for a patient, knowing that this medicine has worked well for other
patients with the same illness.   If the physician fails to do a complete
medical history, including an evaluation of the other medications being
used by that new patient,  there is the possibility that contraindicated
drug interactions might kill this hapless patient.

When we reuse existing modules, in safety-critical software, it is
imperative that we both inspect and test for interactions that might
kill our software.   For embedded real-time software these are
often timing issues.   Those are hard to detect.

As to the contention that XP would have prevented the failure of Ariane V,
that is mostly wishful thinking, reminiscent of what is often called
"Monday morning quarterbacking."    There are some XP practices
that might have been useful (features that predate XP by some considerable
amount of time), but XP itself would not have saved Ariane V, nor would
most of the other suggestions made by the Monday morning Quarterbacks.

At present, the French are launching the current version of Ariane quite
safely.

There are other project failures where XP might have been able to save
the project.   The one that comes to mind quickly is the Denver airport
baggage handling system.    I am sure there are others.   However, I don't
want to be accused of Monday morning quarterbacking.   The fact is that
building software is hard and it is easy to make engineering errors.  Our
tools and methods can help us do it right, but neither the languages nor
the processes can consistently save us from ourselves.

Richard Riehle






* Re: Ariane Failure
  2002-03-29 18:56 ` Ariane Failure Richard Riehle
@ 2002-03-29 20:56   ` Michael Feathers
  2002-03-30  1:02     ` Bill
  2002-04-01 15:08   ` Marin David Condic
  1 sibling, 1 reply; 29+ messages in thread
From: Michael Feathers @ 2002-03-29 20:56 UTC (permalink / raw)



"Richard Riehle" <richard@adaworks.com> wrote in message
news:3CA4B8E5.72909C9B@adaworks.com...
> rjk wrote:
>
> > What is XPers response to this? (I was going to ask a more specific
> > question, but I thought I'd leave it broad until an interesting question is
> > found).
>
> The problem with Ariane V begins with Systems Engineering management.
> The decisions about what to do when an exception occurs were wrong, and
> not tested.    Although Design By Contract might have helped,  I doubt that
> Eiffel would have been appropriate because of other issues related to
> Eiffel.   I like Eiffel, but don't consider it appropriate for a project such
> as Ariane V.    The SPARK approach to Design By Contract (they don't
> call it that, but that is what it is)  could have worked well, especially
> since it was programmed in Ada.   By the way, the Ada code worked as
> it was directed to work, but it was given bad directions.


IIRC, there's also the issue of casting integers across sizes.  It is great
when you can hide representation and promote or demote its size as needed.


Michael Feathers
www.objectmentor.com







* Re: Ariane Failure
  2002-03-29 20:56   ` Michael Feathers
@ 2002-03-30  1:02     ` Bill
  2002-03-30  3:20       ` Keith Ray
  2002-03-30 13:36       ` Michael Feathers
  0 siblings, 2 replies; 29+ messages in thread
From: Bill @ 2002-03-30  1:02 UTC (permalink / raw)




Michael Feathers wrote:<snip>

>
> IIRC, there's also the issue of casting integers across sizes.  It is great
> when you can hide representation and promote or demote its size as needed.

<snip>
Promoting and demoting size as needed was part of the problem. Because of
limitations of typical launch vehicles, in particular their downlink
capabilities to ground operations, but also limited on-board storage and
central processing, it is often necessary to reduce the size of a value from
a larger storage representation to a smaller storage representation, typically
from floats or doubles to 8- or 16-bit integers. In order to ensure that the
real-time constraints of the system are met, there has to be an explicit
decision as to what information needs to be communicated, at what rate, and
at what precision. It is tempting to maintain more precision than you need, just
to be certain you haven't misjudged the need, by applying an offset and scale
factor prior to the conversion to an integer, such that all possible values of
the rescaled number just fit within the range of values of the integer. However,
that decision is subject to the error of underestimating the range of possible
values of the original number before rescaling. In particular, a velocity scale
factor that was valid for the Ariane IV, when applied to the actual and planned
operating conditions of the Ariane V, resulted in a value that exceeded the
range of the chosen integer size, because the Ariane V has a larger
acceleration and more horizontal trajectory than the Ariane IV.
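
A minimal Python sketch of the offset-and-scale conversion described above. It assumes a 16-bit signed target and a hypothetical envelope of +/-1000 units standing in for the Ariane IV range; the real variable names and limits are not given in this thread. The explicit range check plays the role of the hardware overflow trap.

```python
INT16_MIN, INT16_MAX = -32768, 32767

# Hypothetical envelope: the largest magnitude expected on the old vehicle.
# The scale factor is chosen so that this value "just fits" in an int16.
OLD_ENVELOPE = 1000.0
SCALE = INT16_MAX / OLD_ENVELOPE


def to_scaled_int16(value: float, offset: float = 0.0, scale: float = SCALE) -> int:
    """Apply an offset and scale factor, then convert to a 16-bit signed integer.

    Raises OverflowError when the rescaled value does not fit, standing in
    for the hardware overflow trap on an unprotected conversion.
    """
    scaled = round((value - offset) * scale)
    if not INT16_MIN <= scaled <= INT16_MAX:
        raise OverflowError(f"{value} rescales to {scaled}, outside the int16 range")
    return scaled


print(to_scaled_int16(850.0))  # inside the old envelope: converts normally
try:
    to_scaled_int16(1200.0)    # a larger value from a new flight profile
except OverflowError as err:
    print("overflow:", err)
```

Nothing in the function is wrong in isolation; the failure comes entirely from the envelope assumption baked into SCALE.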

Note that information hiding per se doesn't help with this. If the writer of
the software has made the explicit decision to rescale, and chosen the rescale
factor to use, but doesn't communicate that information to others, so that they
cannot make decisions based on knowledge of the rescale factor, the rescale
factor could still be inappropriate and cause breakage. Also, designing the
software to automatically rescale using more global heuristics can cause other
problems, as additional information about its decisions then needs to be
communicated to the ground station so that it can interpret the rescaled data.
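
The point that the raw integer means nothing without its offset and scale can be sketched in Python. This is a hypothetical illustration, assuming the ground station decodes with its own copy of the scaling: if that copy is stale, the data decode without any error but are wrong.

```python
from dataclasses import dataclass

INT16_MIN, INT16_MAX = -32768, 32767


@dataclass(frozen=True)
class Scaling:
    """Offset and scale factor needed to interpret a raw telemetry integer."""
    offset: float
    scale: float


def encode(value: float, s: Scaling) -> int:
    """On board: rescale and pack into an int16, refusing values that do not fit."""
    raw = round((value - s.offset) * s.scale)
    if not INT16_MIN <= raw <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in int16 with {s}")
    return raw


def decode(raw: int, s: Scaling) -> float:
    """On the ground: undo the rescaling. Only correct with the encoder's Scaling."""
    return raw / s.scale + s.offset


on_board = Scaling(offset=0.0, scale=16.0)     # hypothetical current factor
ground_copy = Scaling(offset=0.0, scale=32.0)  # hypothetical stale factor on the ground

raw = encode(500.0, on_board)
print(decode(raw, on_board))     # 500.0
print(decode(raw, ground_copy))  # 250.0: wrong, yet no error is raised
```

Sending the scaling with each frame, or versioning it, is one way to keep both ends consistent, at the cost of the extra downlink bandwidth mentioned above.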





* Re: Ariane Failure
  2002-03-30  1:02     ` Bill
@ 2002-03-30  3:20       ` Keith Ray
  2002-03-30 12:12         ` John Roth
  2002-03-30 13:36       ` Michael Feathers
  1 sibling, 1 reply; 29+ messages in thread
From: Keith Ray @ 2002-03-30  3:20 UTC (permalink / raw)


Weinberg's "The Psychology of Computer Programming", first published in 
1971, mentions the explosion of an 18 million dollar rocket because "one 
instruction on the tape was left out".

I wonder how long that programming error was debated.
-- 
C. Keith Ray

<http://homepage.mac.com/keithray/xpminifaq.html>




* Re: Ariane Failure
  2002-03-30  3:20       ` Keith Ray
@ 2002-03-30 12:12         ` John Roth
  0 siblings, 0 replies; 29+ messages in thread
From: John Roth @ 2002-03-30 12:12 UTC (permalink / raw)



"Keith Ray" <k1e2i3t4h5r6a7y@1m2a3c4.5c6o7m> wrote in message
news:k1e2i3t4h5r6a7y-630660.19202829032002@netnews.attbi.com...
> Weinberg's "The Psychology of Computer Programming", first published in
> 1971, mentions the explosion of an 18 million dollar rocket because "one
> instruction on the tape was left out".
>
> I wonder how long that programming error was debated.

Until everyone got tired of the snake oil salesmen - or the next
major disaster took people's fickle attention away.

John Roth

> --
> C. Keith Ray
>
> <http://homepage.mac.com/keithray/xpminifaq.html>






* Re: Ariane Failure
  2002-03-30  1:02     ` Bill
  2002-03-30  3:20       ` Keith Ray
@ 2002-03-30 13:36       ` Michael Feathers
  2002-04-01 15:22         ` Marin David Condic
       [not found]         ` <a8oo51$tsk$2@slb2.atl.mindspring.net>
  1 sibling, 2 replies; 29+ messages in thread
From: Michael Feathers @ 2002-03-30 13:36 UTC (permalink / raw)



"Bill" <wclodius@lanl.gov> wrote in message
news:3CA50E9A.CBF24F1B@lanl.gov...
>
> Michael Feathers wrote:<snip>
>
> > IIRC, there's also the issue of casting integers across sizes.  It is great
> > when you can hide representation and promote or demote its size as needed.
>
> <snip>
> Promoting and demoting size as needed was part of the problem. Because of
> limitations of typical launch vehicles, in particular their downlink
> capabilities to ground operations, but also limited on-board storage and
> central processing, it is often necessary to reduce the size of a value from
> a larger storage representation to a smaller storage representation, typically
> from floats or doubles to 8- or 16-bit integers. In order to ensure that the
> real-time constraints of the system are met, there has to be an explicit
> decision as to what information needs to be communicated, at what rate, and
> at what precision. It is tempting to maintain more precision than you need, just
> to be certain you haven't misjudged the need, by applying an offset and scale
> factor prior to the conversion to an integer, such that all possible values of
> the rescaled number just fit within the range of values of the integer. However,
> that decision is subject to the error of underestimating the range of possible
> values of the original number before rescaling. In particular, a velocity scale
> factor that was valid for the Ariane IV, when applied to the actual and planned
> operating conditions of the Ariane V, resulted in a value that exceeded the
> range of the chosen integer size, because the Ariane V has a larger
> acceleration and more horizontal trajectory than the Ariane IV.
>
> Note that information hiding per se doesn't help with this. If the writer of
> the software has made the explicit decision to rescale, and chosen the rescale
> factor to use, but doesn't communicate that information to others, so that they
> cannot make decisions based on knowledge of the rescale factor, the rescale
> factor could still be inappropriate and cause breakage. Also, designing the
> software to automatically rescale using more global heuristics can cause other
> problems, as additional information about its decisions then needs to be
> communicated to the ground station so that it can interpret the rescaled data.

Yes.  It seems like the error prone part is going back to integers at all.
Since it is a safety consideration it seems like it would be great to
revisit that as processing and communications speeds increase.

Michael Feathers
www.objectmentor.com







* Re: Ariane Failure
  2002-03-29 18:56 ` Ariane Failure Richard Riehle
  2002-03-29 20:56   ` Michael Feathers
@ 2002-04-01 15:08   ` Marin David Condic
  2002-04-02 18:32     ` Wes Groleau
  1 sibling, 1 reply; 29+ messages in thread
From: Marin David Condic @ 2002-04-01 15:08 UTC (permalink / raw)


I beg to differ on the "Bad Directions" part. Note that the software in
question was designed for the Ariane IV which had a different flight
profile. The FDA (fault detection and accommodation) thinking for the module
in question went sort of like this:
"Any number that shows up here big enough to generate a hardware overflow
interrupt has got to be so far out of the flight profile that it would most
likely indicate a bad sensor. The accommodation for this failure should be
to transfer control to the other side where we might still have a good
sensor..." This logic worked fine in Ariane 4 and would likely have detected
a sensor failure and accommodated it appropriately. In my mind, that sounded
a lot like "Good Directions" :-)

The problem arose when the assumption was made that software that was
designed for Ariane 4 and that worked just fine in that environment was
therefore fit to fly Ariane 5 WITHOUT being tested and validated against the
Ariane 5 flight profile. That's a pretty basic and fundamental error that
goes well outside the realm of control of a programming language or
methodology.

MDC
--
Marin David Condic
Senior Software Engineer
Pace Micro Technology Americas    www.pacemicro.com
Enabling the digital revolution
e-Mail:    marin.condic@pacemicro.com


"Richard Riehle" <richard@adaworks.com> wrote in message
news:3CA4B8E5.72909C9B@adaworks.com...
>
> The problem with Ariane V begins with Systems Engineering management.
> The decisions about what to do when an exception occurs were wrong, and
> not tested.    Although Design By Contract might have helped,  I doubt that
> Eiffel would have been appropriate because of other issues related to
> Eiffel.   I like Eiffel, but don't consider it appropriate for a project such
> as Ariane V.    The SPARK approach to Design By Contract (they don't
> call it that, but that is what it is)  could have worked well, especially
> since it was programmed in Ada.   By the way, the Ada code worked as
> it was directed to work, but it was given bad directions.
>







* Re: Ariane Failure
  2002-03-30 13:36       ` Michael Feathers
@ 2002-04-01 15:22         ` Marin David Condic
       [not found]         ` <a8oo51$tsk$2@slb2.atl.mindspring.net>
  1 sibling, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-01 15:22 UTC (permalink / raw)


Given infinite processor speed and infinite memory, they might have done a
whole lot with the software to make it more flexible and safe. But that sort
of sounds like doing engineering in Heaven. :-) These sorts of compromises
are made all the time in real world engineering and you have to ask if they
were reasonable for the conditions at hand.

In my mind, the decisions made by the Ariane 4 engineers in designing their
software were very similar to the decisions made by data processing
programmers years ago in using only two digits to store a year - thus
creating The Great Y2K Disaster. Back in the 70's & 80's, they were
confronted with smaller storage devices and saving those extra couple of
bytes in every date was important. The thinking at the time was "It's a known
limitation and the useful life of this software ought to be something less
than twenty years, so when they build the next system they can accommodate
4-digit years..." I thought that was a reasonable compromise in order to
deal with the constraints of technology at the time. Even though hardware
and resources get more abundant in the future, we'll still be making
compromises - just different ones.
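
The two-digit-year compromise can be shown in a few lines of Python (illustrative only; the function names are hypothetical). Arithmetic on the shortened years is correct right up to the century boundary, then goes wrong without raising any error.

```python
def to_two_digit(year: int) -> int:
    """Keep only the last two digits, as many older record layouts did."""
    return year % 100


def years_elapsed(start: int, end: int) -> int:
    """Plain subtraction: correct for four-digit years, fragile for two-digit ones."""
    return end - start


# A record created in 1975, checked in 1999 and again in 2000.
print(years_elapsed(to_two_digit(1975), to_two_digit(1999)))  # 24
print(years_elapsed(to_two_digit(1975), to_two_digit(2000)))  # -75
print(years_elapsed(1975, 2000))                               # 25
```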

MDC


"Michael Feathers" <mfeathers@objectmentor.com> wrote in message
news:a84f5p$nlm$1@slb5.atl.mindspring.net...
>
> Yes.  It seems like the error prone part is going back to integers at all.
> Since it is a safety consideration it seems like it would be great to
> revisit that as processing and communications speeds increase.
>







* Re: Ariane Failure
  2002-04-01 15:08   ` Marin David Condic
@ 2002-04-02 18:32     ` Wes Groleau
  2002-04-02 18:42       ` Marin David Condic
  0 siblings, 1 reply; 29+ messages in thread
From: Wes Groleau @ 2002-04-02 18:32 UTC (permalink / raw)




> sensor..." This logic worked fine in Ariane 4 and would likely have detected
> a sensor failure and accommodated it appropriately. In my mind, that sounded
> a lot like "Good Directions" :-)
> 
> The problem arose when the assumption was made that software that was
> designed for Ariane 4 and that worked just fine in that environment was
> therefore fit to fly Ariane 5 WITHOUT being tested and validated against the
> Ariane 5 flight profile. That's a pretty basic and fundamental error that

"It worked before, so no review or test is necessary.  Make it so."

Sounds like bad directions to me.

-- 
Wes Groleau
http://freepages.rootsweb.com/~wgroleau




* Re: Ariane Failure
  2002-04-02 18:32     ` Wes Groleau
@ 2002-04-02 18:42       ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-02 18:42 UTC (permalink / raw)


O.K. "Good Directions" from the perspective of the software directing the
IRS (or the developers directing the software...) "Bad Directions" from the
perspective of the project management missing the boat on the system
constraints.

MDC


"Wes Groleau" <wesgroleau@despammed.com> wrote in message
news:3CA9F956.CCCFD3CD@despammed.com...
>
> "It worked before, so no review or test is necessary.  Make it so."
>
> Sounds like bad directions to me.
>







* Re: Ariane Failure
       [not found]         ` <a8oo51$tsk$2@slb2.atl.mindspring.net>
@ 2002-04-08 13:59           ` Marin David Condic
  2002-04-09 12:49             ` John Roth
                               ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-08 13:59 UTC (permalink / raw)


"Dennis Lee Bieber" <wlfraed@ix.netcom.com> wrote in message
news:a8oo51$tsk$2@slb2.atl.mindspring.net...
>
>         I do have to confess to having only the general explanation of the
> problem, not details of the code internals -- it does sound, from a quick
> perusal of this message thread, that some sort of overflow in integer
> processing occurred. This is new to me; the general report tended to the
> concept that the measured rates were accurate, but exceeded what the
> Ariane IV code deemed proper, and attempts to correct this "faulty rate"
> led to vehicle instability...
>
Yes and no. The report was clearly not written by software guys since it
otherwise would have explained in more accurate terms exactly what happened
at the software level. Hence, you kind of have to read between the lines and
interpret it some from the perspective of a more generalized engineer's
view.

The software module in question was originally analyzed on Ariane 4 with a
view toward improving speed. They had a shortage of CPU cycles and had
identified this one module as a major consumer of resources. They changed
the code to eliminate all the range checking and other "safety features"
(not at all uncommon in this business) in order to speed it up. This was not
without analysis that examined the possible valid ranges for various numbers
and mathematically reasoning about it & coming to the conclusion that any
values that would possibly generate a hardware overflow error could not be
in the valid flight path of the Ariane 4 - hence it was likely to be a
sensor failure and the proper accommodation would be to transfer control to
the other channel. The ISR for that overflow error did just that. So the
design was valid and correct for the Ariane 4.

The problem for Ariane 5 was that nobody tested or checked the assumptions
on the software intended to run on a different rocket. Had they run the unit
through the flight profile, they would have spotted the error in a cocaine
heartbeat.
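
The check described here amounts to replaying the reused unit against the new vehicle's flight profile before flying it. A minimal Python sketch with made-up profiles and a hypothetical scale factor; the real trajectories and variables are not given in the thread.

```python
from typing import List, Optional, Tuple

INT16_MIN, INT16_MAX = -32768, 32767
SCALE = 32.0  # hypothetical scale factor sized for the old vehicle


def fits_int16(value: float, scale: float = SCALE) -> bool:
    """True if the scaled value converts to an int16 without overflow."""
    raw = round(value * scale)
    return INT16_MIN <= raw <= INT16_MAX


def first_overflow(profile: List[Tuple[int, float]]) -> Optional[int]:
    """Replay (time, value) samples; return the first time that would overflow, or None."""
    for t, value in profile:
        if not fits_int16(value):
            return t
    return None


# Made-up trajectories: the new vehicle's value grows faster.
old_profile = [(t, 20.0 * t) for t in range(41)]  # peaks at 800.0
new_profile = [(t, 30.0 * t) for t in range(41)]  # peaks at 1200.0

print(first_overflow(old_profile))  # None: passes
print(first_overflow(new_profile))  # 35: fails well before the end of the run
```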

MDC






* Re: Ariane Failure
  2002-04-08 13:59           ` Marin David Condic
@ 2002-04-09 12:49             ` John Roth
  2002-04-09 14:58               ` Steve O'Neill
  2002-04-09 15:04             ` Steve O'Neill
  2002-04-09 19:07             ` Bill
  2 siblings, 1 reply; 29+ messages in thread
From: John Roth @ 2002-04-09 12:49 UTC (permalink / raw)



"Marin David Condic" <dont.bother.mcondic.auntie.spam@[acm.org> wrote in
message news:a8s7nu$ibo$1@nh.pace.co.uk...
> "Dennis Lee Bieber" <wlfraed@ix.netcom.com> wrote in message
> news:a8oo51$tsk$2@slb2.atl.mindspring.net...
> >
> >         I do have to confess to having only the general explanation of the
> > problem, not details of the code internals -- it does sound, from a quick
> > perusal of this message thread, that some sort of overflow in integer
> > processing occurred. This is new to me; the general report tended to the
> > concept that the measured rates were accurate, but exceeded what the
> > Ariane IV code deemed proper, and attempts to correct this "faulty rate"
> > led to vehicle instability...
> >
> Yes and no. The report was clearly not written by software guys since it
> otherwise would have explained in more accurate terms exactly what happened
> at the software level. Hence, you kind of have to read between the lines and
> interpret it some from the perspective of a more generalized engineer's
> view.

I think the technical report went into more detail. However, this
particular thread got started by a post that referenced an article which
claimed that if the implementers had used Eiffel with Design by Contract,
the problem would not have occurred.

This is patently absurd. The proximate cause, as several posters
have pointed out, was the failure to recertify and test a component
designed for one rocket for use with a different rocket with different
characteristics.

Drilling deeper, the next level cause was attempting to do too much
function for a given combination of processor / language. This caused
performance-motivated shortcuts in the implementation.

Thus the 'solution' would have been to use a processor with higher
performance, or a language with less overhead. Pursuing this path,
we come to the inescapable conclusion that the problem would
not have occurred if the implementors had used either Assembler
or Forth!

> The software module in question was originally analyzed on Ariane 4 with a
> view toward improving speed. They had a shortage of CPU cycles and had
> identified this one module as a major consumer of resources. They changed
> the code to eliminate all the range checking and other "safety features"
> (not at all uncommon in this business) in order to speed it up. This was not
> without analysis that examined the possible valid ranges for various numbers
> and mathematically reasoning about it & coming to the conclusion that any
> values that would possibly generate a hardware overflow error could not be
> in the valid flight path of the Ariane 4 - hence it was likely to be a
> sensor failure and the proper accommodation would be to transfer control to
> the other channel. The ISR for that overflow error did just that. So the
> design was valid and correct for the Ariane 4.
>
> The problem for Ariane 5 was that nobody tested or checked the assumptions
> on the software intended to run on a different rocket. Had they run the unit
> through the flight profile, they would have spotted the error in a cocaine
> heartbeat.
>
> MDC
> --
> Marin David Condic
> Senior Software Engineer
> Pace Micro Technology Americas    www.pacemicro.com
> Enabling the digital revolution
> e-Mail:    marin.condic@pacemicro.com

John Roth

>






* Re: Ariane Failure
  2002-04-09 12:49             ` John Roth
@ 2002-04-09 14:58               ` Steve O'Neill
  0 siblings, 0 replies; 29+ messages in thread
From: Steve O'Neill @ 2002-04-09 14:58 UTC (permalink / raw)


John Roth wrote:
> Thus the 'solution' would have been to use a processor with higher
> performance, or a language with less overhead. Pursuing this path,
> we come to the inescapable conclusion that the problem would
> not have occured if the implementors had used either Assembler
> or Forth!

No, no, no... this was settled years ago after a lengthy discussion here
on cla that the problem never would have occurred had the developers
used PL/I. :-)




* Re: Ariane Failure
  2002-04-08 13:59           ` Marin David Condic
  2002-04-09 12:49             ` John Roth
@ 2002-04-09 15:04             ` Steve O'Neill
  2002-04-09 23:00               ` John Roth
  2002-04-09 19:07             ` Bill
  2 siblings, 1 reply; 29+ messages in thread
From: Steve O'Neill @ 2002-04-09 15:04 UTC (permalink / raw)


Marin David Condic wrote:
> The software module in question was originally analyzed on Ariane 4 with a
> view toward improving speed. They had a shortage of CPU cycles and had
> identified this one module as a major consumer of resources. They changed
> the code to eliminate all the range checking and other "safety features"
> (not at all uncommon in this business) in order to speed it up. This was not
> without analysis that examined the possible valid ranges for various numbers
> and mathematically reasoning about it & coming to the conclusion that any
> values that would possibly generate a hardware overflow error could not be
> in the valid flight path of the Ariane 4 - hence it was likely to be a
> sensor failure and the proper accommodation would be to transfer control to
> the other channel. 

And here was another of the fatal system design flaws that should never
have been made... it seems that this 'other channel' was an *identical*
system which, of course, reacted in the same manner.  Leaving the poor
flight control computer with no valid data.  Ooops!
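
The failover rule discussed in this thread, and the weakness pointed out here, can be sketched together. This is a hypothetical Python illustration, not the flight code: each channel converts its own reading, and an overflow is treated as a sensor fault that hands control to the next channel. Because both channels run identical software, a genuine out-of-range value trips both.

```python
from typing import List, Optional

INT16_MIN, INT16_MAX = -32768, 32767
SCALE = 32.0  # hypothetical scale factor, identical on both channels


def convert(reading: float) -> int:
    """One channel's scaled conversion; raises OverflowError like the hardware trap."""
    raw = round(reading * SCALE)
    if not INT16_MIN <= raw <= INT16_MAX:
        raise OverflowError(f"{reading} overflows int16")
    return raw


def failover(readings: List[float]) -> Optional[int]:
    """Use the first channel whose reading converts cleanly.

    An overflow is taken to mean that channel's sensor has failed, so the
    next channel is tried. None means no channel produced valid data.
    """
    for reading in readings:
        try:
            return convert(reading)
        except OverflowError:
            continue  # declare this channel failed and fall over to the next
    return None


# Independent fault: only the first channel's sensor is wild.
print(failover([5000.0, 800.0]))   # 25600: the second channel takes over
# Common mode: both channels see the same genuine, out-of-range value.
print(failover([1200.0, 1200.0]))  # None: both trap, no valid data left
```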




* Re: Ariane Failure
  2002-04-08 13:59           ` Marin David Condic
  2002-04-09 12:49             ` John Roth
  2002-04-09 15:04             ` Steve O'Neill
@ 2002-04-09 19:07             ` Bill
  2002-04-09 19:44               ` Marin David Condic
  2 siblings, 1 reply; 29+ messages in thread
From: Bill @ 2002-04-09 19:07 UTC (permalink / raw)




Marin David Condic wrote:

> <snip>This was not
> without analysis that examined the possible valid ranges for various numbers
> and mathematically reasoning about it & coming to the conclusion that any
> values that would possibly generate a hardware overflow error could not be
> in the valid flight path of the Ariane 4 - hence it was likely to be a
> sensor failure and the proper accommodation would be to transfer control to
> the other channel. The ISR for that overflow error did just that. So the
> design was valid and correct for the Ariane 4.

<snip>

Are you sure this was their reasoning? My interpretation of the reasoning was
that it had to be a hardware failure, but the only hardware they could do
anything about was the processor interpreting the sensor data, so they
transferred control to another processor handling the same sensor data, with
the same program.





* Re: Ariane Failure
  2002-04-09 19:07             ` Bill
@ 2002-04-09 19:44               ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-09 19:44 UTC (permalink / raw)


Not having been on the design team, I obviously can't state definitively
what their reasoning was. This was my best possible interpretation of the
situation after reading the report. It's been quite a while (yet still this
topic comes up! :-) since I last read the report but having been involved in
similar system designs (dual-redundant engine controls rather than dual-
redundant IRSs) my best interpretation was that they had two computers
looking at two separate sets of sensors. (I'll bow to a more authoritative
source on this - but that's my best recollection.)

Your big risk is not so much that the computer itself will fail (which you
can't do much about with software anyway, right?) but that a sensor or
actuator will fail. Dual redundant computers that are looking at the same
set of sensors would create a common-mode failure and loss of a sensor would
make both computers useless. Not much point in dual redundancy then is
there? :-)
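
The redundancy argument can be put as a rough calculation. Assume, purely for illustration, that each sensor set is lost during a mission with probability $p$, independently of any other sensor set, and that the computers themselves never fail:

```latex
% Each computer reads its own sensor set (independent failures):
P(\text{no valid channel}) = p \cdot p = p^{2}

% Both computers read one shared sensor set (common-mode failure):
P(\text{no valid channel}) = p
```

With a hypothetical $p = 10^{-3}$, that is about one in a million for independent sensor sets versus one in a thousand for a shared set: the second computer buys nothing against the shared element.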

MDC


"Bill" <wclodius@lanl.gov> wrote in message
news:3CB33C0A.9125A6A7@lanl.gov...
>
> Are you sure this was their reasoning? My interpretation of the reasoning was
> that it had to be a hardware failure, but the only hardware they could do
> anything about was the processor interpreting the sensor data, so they
> transferred control to another processor handling the same sensor data, with
> the same program.
>






* Re: Ariane Failure
  2002-04-09 15:04             ` Steve O'Neill
@ 2002-04-09 23:00               ` John Roth
  2002-04-10 12:52                 ` Steve O'Neill
  0 siblings, 1 reply; 29+ messages in thread
From: John Roth @ 2002-04-09 23:00 UTC (permalink / raw)



"Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message
news:3CB3031A.26E08904@gbr.msd.ray.com...
> Marin David Condic wrote:
> > The software module in question was originally analyzed on Ariane 4 with a
> > view toward improving speed. They had a shortage of CPU cycles and had
> > identified this one module as a major consumer of resources. They changed
> > the code to eliminate all the range checking and other "safety features"
> > (not at all uncommon in this business) in order to speed it up. This was not
> > without analysis that examined the possible valid ranges for various numbers
> > and mathematically reasoning about it & coming to the conclusion that any
> > values that would possibly generate a hardware overflow error could not be
> > in the valid flight path of the Ariane 4 - hence it was likely to be a
> > sensor failure and the proper accommodation would be to transfer control to
> > the other channel.
>
> And here was another of the fatal system design flaws that should never
> have been made... it seems that this 'other channel' was an *identical*
> system which, of course, reacted in the same manner.  Leaving the poor
> flight control computer with no valid data.  Ooops!

Not exactly. The assumption was that failures would be hardware,
so dual coding the software wasn't an objective.
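
A simplified sketch of why the channel switch could not help in this case: both channels run identical code, so the same out-of-range input trips the same trap in each. The 16-bit signed target range, the function names, and the sequential active/backup switch are illustrative assumptions, and the hardware overflow trap is modelled as a Python `OverflowError`.

```python
from typing import Callable

INT16_MIN, INT16_MAX = -32768, 32767  # assumed 16-bit signed target, for illustration


def convert_unchecked(x: float) -> int:
    """Conversion with the explicit range check removed: an out-of-range
    value is not handled here, it just hits the (modelled) overflow trap."""
    n = int(x)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"operand error converting {x}")  # models the hardware trap
    return n


def convert_saturating(x: float) -> int:
    """One protected alternative: clamp to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(x)))


def run_channels(x: float, convert: Callable[[float], int]) -> list[str]:
    """Active and backup channels run IDENTICAL code on the same input.
    A trap is treated as a hardware fault: that channel shuts down and
    control passes to the other channel."""
    status = []
    for name in ("active", "backup"):
        try:
            convert(x)
        except OverflowError:
            status.append(f"{name}: shut down")
            continue  # assume a hardware fault and try the other channel
        status.append(f"{name}: ok")
        break  # a healthy channel keeps control
    return status


if __name__ == "__main__":
    print(run_channels(1_000.0, convert_unchecked))   # in range: active channel keeps control
    print(run_channels(40_000.0, convert_unchecked))  # out of range: both channels shut down
    print(run_channels(40_000.0, convert_saturating))
```

Clamping is only one possible accommodation and has its own risk (the value passed on is wrong, merely bounded); the point is that identical redundancy only covers faults that are independent between the channels.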

John Roth


* Re: Ariane Failure
  2002-04-09 23:00               ` John Roth
@ 2002-04-10 12:52                 ` Steve O'Neill
  2002-04-10 12:59                   ` Marin David Condic
  2002-04-11 12:12                   ` fdebruin
  0 siblings, 2 replies; 29+ messages in thread
From: Steve O'Neill @ 2002-04-10 12:52 UTC (permalink / raw)


John Roth wrote:
> 
> "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message
> > And here was another of the fatal system design flaws that should never
> > have been made... it seems that this 'other channel' was an *identical*
> > system which, of course, reacted in the same manner.  Leaving the poor
> > flight control computer with no valid data.  Ooops!
> 
> Not exactly. The assumption was that failures would be hardware,
> so dual coding the software wasn't an objective.

Well, no matter where you assume the failures will or will not occur you
should never design a dual-redundant system where both strings are
identical.

Steve


* Re: Ariane Failure
  2002-04-10 12:52                 ` Steve O'Neill
@ 2002-04-10 12:59                   ` Marin David Condic
  2002-04-11  0:48                     ` Steve O'Neill
  2002-04-11 13:47                     ` Ted Dennison
  2002-04-11 12:12                   ` fdebruin
  1 sibling, 2 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-10 12:59 UTC (permalink / raw)


"Never" is a really long time! :-) Seriously. There are lots of good
engineering reasons to develop multi-redundant identical systems. See my
other post relating to that. See RAID and JABOD drives as one example of
how/why this can be a good thing.

MDC
--
Marin David Condic


"Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message
news:3CB435A4.8A011DF1@gbr.msd.ray.com...
>
> Well, no matter where you assume the failures will or will not occur you
> should never design a dual-redundant system where both strings are
> identical.
>
> Steve



* Re: Ariane Failure
  2002-04-10 12:59                   ` Marin David Condic
@ 2002-04-11  0:48                     ` Steve O'Neill
  2002-04-11 13:17                       ` Marin David Condic
  2002-04-11 13:47                     ` Ted Dennison
  1 sibling, 1 reply; 29+ messages in thread
From: Steve O'Neill @ 2002-04-11  0:48 UTC (permalink / raw)


Marin David Condic wrote:
> 
> "Never" is a really long time! :-) 

Well, I'll give you that... and concede that one should never say never.

> There are lots of good engineering reasons to develop multi-redundant 
> identical systems. 

Agreed... except when the potential result may be raining down flaming
pieces of a billion dollars' worth of satellite.  As I recall the photos
were very impressive.

Steve


* Re: Ariane Failure
  2002-04-10 12:52                 ` Steve O'Neill
  2002-04-10 12:59                   ` Marin David Condic
@ 2002-04-11 12:12                   ` fdebruin
  2002-04-11 14:33                     ` Larry Kilgallen
  1 sibling, 1 reply; 29+ messages in thread
From: fdebruin @ 2002-04-11 12:12 UTC (permalink / raw)


Steve O'Neill <oneils@gbr.msd.ray.com> writes:

>John Roth wrote:
>> 
>> "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message
>> > And here was another of the fatal system design flaws that should never
>> > have been made... it seems that this 'other channel' was an *identical*
>> > system which, of course, reacted in the same manner.  Leaving the poor
>> > flight control computer with no valid data.  Ooops!
>> 
>> Not exactly. The assumption was that failures would be hardware,
>> so dual coding the software wasn't an objective.

>Well, no matter where you assume the failures will or will not occur you
>should never design a dual-redundant system where both strings are
>identical.

The strings were not identical because they were using physically different
hardware components. 

You could argue that the strings should have no commonalities at
all for maximum safety and robustness. This will only work if you
have endless resources in terms of money and time.

For example, in the case of Ariane 501 the IRS software in the two units
could have been developed independently by two different companies. Maybe
the problem would have occurred in only one of them.

This will immediately raise the issue of deciding which software is 
providing you the correct data, in case they differ. This will lead to
some kind of arbitrator (a new single point of failure) or a third
software package to allow majority voting.
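
A minimal sketch of the 2-out-of-3 voting Frank describes, for three independently developed channels producing a numeric output. The agreement `tolerance` is a hypothetical parameter: independently written code rarely produces bit-identical floating-point results, so "agreement" has to be defined somehow, and that definition is itself a design decision.

```python
from typing import Optional, Sequence


def majority_vote(outputs: Sequence[float], tolerance: float) -> Optional[float]:
    """2-out-of-3 voter.

    Two outputs agree if they differ by at most `tolerance`. Returns the mean
    of the first agreeing pair, or None when no two channels agree, in which
    case the voter cannot tell which channel is right.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    a, b, c = outputs
    for x, y in ((a, b), (a, c), (b, c)):
        if abs(x - y) <= tolerance:
            return (x + y) / 2.0
    return None


if __name__ == "__main__":
    print(majority_vote([10.0, 10.02, 55.0], tolerance=0.05))  # channel 3 outvoted
    print(majority_vote([1.0, 5.0, 9.0], tolerance=0.05))      # no majority: None
```

The voter is exactly the extra component mentioned above: new code that must itself be correct, with a `None` case that still needs a defined accommodation. An error in a shared specification would make all three channels agree on the same wrong answer.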

You will easily triple the cost of your software. In addition, I am not
convinced that the added robustness makes up for the added complexity.
Furthermore, you are still not fully covered because your software
specification might contain an error that will be common to all independently
developed software.

Redundancy is a vital concept but it is not *the* solution, it is just
contributing to it. It all boils down to a tradeoff. To the extreme: the
costs of your redundancy should not be higher than the recurrent costs
for building a new launcher.


Frank de Bruin


* Re: Ariane Failure
  2002-04-11  0:48                     ` Steve O'Neill
@ 2002-04-11 13:17                       ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-11 13:17 UTC (permalink / raw)


"Steve O'Neill" <oneills@top.monad.net> wrote in message
news:3CB4DD65.99F17199@top.monad.net...
>
> Agreed... except when the potential result may be raining down flaming
> pieces of a billion dollars' worth of satellite.  As I recall the photos
> were very impressive.
>
Well, I'm impressed by the photos too. It can be very educational to
engineers to look over the videos and photos of various engineering
disasters. There are plenty to choose from.

I'll still disagree that dual-redundant identical systems are a bad idea in
rocket technology and that they are somehow inherently less safe than
dissimilar systems. Having worked in that field I know some of the thinking
that goes into these sorts of designs and lots of highly reliable identical
systems have been built. "Dissimilar" only protects you from common design
errors - maybe. It also increases the probability that there *will* be a
design error. When considering the potential designs for a given piece of
avionics, you need to look very carefully at all the possible failure modes
you can think of and look at the probabilities of those failures occurring
and ask how well a given design strategy will minimize the risk. Dual
redundant, identical systems can and do function very well and at very high
levels of reliability and it isn't automatically clear that for a given
application a dual redundant dissimilar system is going to improve
reliability. In fact, quite the opposite might be the case.
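
A toy model of that trade-off, with every probability hypothetical: an identical pair shares its design error (common mode), while a dissimilar pair has two separately developed designs, each assumed to be more error-prone, plus a residual shared-specification error that neither design escapes. The function names and numbers are illustrative only; the point is that the comparison can go either way.

```python
def loss_identical(p_hw: float, p_design: float) -> float:
    """Identical pair: a design error takes out both channels at once;
    otherwise the system is lost only if both channels' hardware fails."""
    return p_design + (1.0 - p_design) * p_hw ** 2


def loss_dissimilar(p_hw: float, p_design_each: float, p_spec: float) -> float:
    """Dissimilar pair: a shared-specification error is still common mode;
    otherwise each channel fails independently from its own design error
    or its own hardware, and the system is lost if both fail."""
    channel = 1.0 - (1.0 - p_design_each) * (1.0 - p_hw)
    return p_spec + (1.0 - p_spec) * channel ** 2


if __name__ == "__main__":
    p_hw, p_design, p_spec = 1e-4, 1e-5, 5e-6  # all hypothetical
    base = loss_identical(p_hw, p_design)
    for p_each in (1e-3, 5e-3):  # how much worse is each dissimilar design?
        alt = loss_dissimilar(p_hw, p_each, p_spec)
        better = "dissimilar" if alt < base else "identical"
        print(f"p_design_each={p_each:g}: identical={base:.2e} dissimilar={alt:.2e} -> {better}")
```

With these made-up numbers the dissimilar pair wins when each design's error probability is 1e-3 but loses at 5e-3.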

MDC
--
Marin David Condic


* Re: Ariane Failure
  2002-04-10 12:59                   ` Marin David Condic
  2002-04-11  0:48                     ` Steve O'Neill
@ 2002-04-11 13:47                     ` Ted Dennison
  2002-04-11 14:15                       ` Marin David Condic
  1 sibling, 1 reply; 29+ messages in thread
From: Ted Dennison @ 2002-04-11 13:47 UTC (permalink / raw)


"Marin David Condic" <dont.bother.mcondic.auntie.spam@[acm.org> wrote in message news:<a91cv1$5s6$1@nh.pace.co.uk>...
> "Never" is a really long time! :-) Seriously. There are lots of good
> engineering reasons to develop multi-redundant identical systems. See my
> other post relating to that. See RAID and JABOD drives as one example of
> how/why this can be a good thing.

In fact, I can tell you firsthand that NASA's STGT satellite/shuttle
ground station is designed exactly that way.


-- 
T.E.D.
Home     -  mailto:dennison@telepath.com (Yahoo: Ted_Dennison)
Homepage -  http://www.telepath.com/dennison/Ted/TED.html


* Re: Ariane Failure
  2002-04-11 13:47                     ` Ted Dennison
@ 2002-04-11 14:15                       ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-11 14:15 UTC (permalink / raw)


"Ted Dennison" <dennison@telepath.com> wrote in message
news:4519e058.0204110547.36730467@posting.google.com...
>
> In fact, I can tell you firsthand that NASA's STGT satellite/shuttle
> ground station is designed exactly that way.
>
>
From similar experience, engine controls for commercial and military jet
engines are typically dual redundant, identical systems. (I don't know what
the guys over at The Light Bulb did but that's at least the way Pratt built
them.) While I don't have firsthand experience with them, I'd suspect that
lots of fly-by-wire flight controls are of similar design - and that's
pretty critical if it were to fail and turn the plane into a lawn dart.
"Dissimilar" doesn't necessarily equate to "Better" - and even if it did, it
might just be the enemy of "Good Enough".

MDC
--
Marin David Condic


* Re: Ariane Failure
  2002-04-11 12:12                   ` fdebruin
@ 2002-04-11 14:33                     ` Larry Kilgallen
  2002-04-11 18:16                       ` Ted Dennison
  0 siblings, 1 reply; 29+ messages in thread
From: Larry Kilgallen @ 2002-04-11 14:33 UTC (permalink / raw)


In article <a93uj0$1be$1@news1.xs4all.nl>, fdebruin@xs3.xs4all.nl (fdebruin) writes:

> For example, in the case of Ariane 501 the IRS software in the two units
> could have been developed independently by two different companies. Maybe
> the problem would have occurred in only one of them.
> 
> This will immediately raise the issue of deciding which software is 
> providing you the correct data, in case they differ. This will lead to
> some kind of arbitrator (a new single point of failure) or a third
> software package to allow majority voting.
> 
> You will easily triple the cost of your software. In addition, I am not
> convinced that the added robustness makes up for the added complexity.
> Furthermore, you are still not fully covered because your software
> specification might contain an error that will be common to all independently
> developed software.

Just have each implementation use a different specification :-)


* Re: Ariane Failure
  2002-04-11 14:33                     ` Larry Kilgallen
@ 2002-04-11 18:16                       ` Ted Dennison
  2002-04-11 18:30                         ` Marin David Condic
  0 siblings, 1 reply; 29+ messages in thread
From: Ted Dennison @ 2002-04-11 18:16 UTC (permalink / raw)


Kilgallen@SpamCop.net (Larry Kilgallen) wrote in message news:<Tz$ACDLOR7ku@eisner.encompasserve.org>...
> In article <a93uj0$1be$1@news1.xs4all.nl>, fdebruin@xs3.xs4all.nl (fdebruin) writes:
> > For example, in the case of Ariane 501 the IRS software in the two units
> > could have been developed independently by two different companies. Maybe
...
> > Furthermore, you are still not fully covered because your software
> > specification might contain an error that will be common to all independently
> > developed software.
> 
> Just have each implementation use a different specification :-)

...or just send satellites up on a Proton, Atlas, and Titan as well,
and hope one of them makes it. :-)


-- 
T.E.D.


* Re: Ariane Failure
  2002-04-11 18:16                       ` Ted Dennison
@ 2002-04-11 18:30                         ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-11 18:30 UTC (permalink / raw)


"Ted Dennison" <dennison@telepath.com> wrote in message
news:4519e058.0204111016.22d01aef@posting.google.com...
>
> ...or just send satellites up on a Proton, Atlas, and Titan as well,
> and hope one of them makes it. :-)
>
Now *that's* an example of multi-redundant, dissimilar systems that had not
occurred to me. And a good illustration of overkill engineering.

I'm reminded of:

    "Insisting on perfect safety is for people who don't have the
    balls to live in the real world."
        -- Mary Shafer, NASA Ames Dryden

MDC
--
Marin David Condic


Thread overview: 29+ messages
     [not found] <ee2a195b.0203260725.a02dbfe@posting.google.com>
2002-03-29 18:56 ` Ariane Failure Richard Riehle
2002-03-29 20:56   ` Michael Feathers
2002-03-30  1:02     ` Bill
2002-03-30  3:20       ` Keith Ray
2002-03-30 12:12         ` John Roth
2002-03-30 13:36       ` Michael Feathers
2002-04-01 15:22         ` Marin David Condic
     [not found]         ` <a8oo51$tsk$2@slb2.atl.mindspring.net>
2002-04-08 13:59           ` Marin David Condic
2002-04-09 12:49             ` John Roth
2002-04-09 14:58               ` Steve O'Neill
2002-04-09 15:04             ` Steve O'Neill
2002-04-09 23:00               ` John Roth
2002-04-10 12:52                 ` Steve O'Neill
2002-04-10 12:59                   ` Marin David Condic
2002-04-11  0:48                     ` Steve O'Neill
2002-04-11 13:17                       ` Marin David Condic
2002-04-11 13:47                     ` Ted Dennison
2002-04-11 14:15                       ` Marin David Condic
2002-04-11 12:12                   ` fdebruin
2002-04-11 14:33                     ` Larry Kilgallen
2002-04-11 18:16                       ` Ted Dennison
2002-04-11 18:30                         ` Marin David Condic
2002-04-09 19:07             ` Bill
2002-04-09 19:44               ` Marin David Condic
2002-04-01 15:08   ` Marin David Condic
2002-04-02 18:32     ` Wes Groleau
2002-04-02 18:42       ` Marin David Condic
1996-06-28  0:00 Robert B. Love 
1996-07-01  0:00 ` Ken Garlington
