* Re: Ariane Failure [not found] <ee2a195b.0203260725.a02dbfe@posting.google.com> @ 2002-03-29 18:56 ` Richard Riehle 2002-03-29 20:56 ` Michael Feathers 2002-04-01 15:08 ` Marin David Condic 0 siblings, 2 replies; 29+ messages in thread From: Richard Riehle @ 2002-03-29 18:56 UTC (permalink / raw) rjk wrote: > What is XPers response to this? (I was going to ask a more specific > question, but I thought I'd leave it broad until an interesting question is > found). The problem with Ariane V begins with Systems Engineering management. The decisions about what to do when an exception occurs were wrong, and not tested. Although Design By Contract might have helped, I doubt that Eiffel would have been appropriate because of other issues related to Eiffel. I like Eiffel, but don't consider it appropriate for a project such as Ariane V. The SPARK approach to Design By Contract (they don't call it that, but that is what it is) could have worked well, especially since it was programmed in Ada. By the way, the Ada code worked as it was directed to work, but it was given bad directions. The other problem was one of software reuse. We often tout the value of software reuse, but here is a case where it was not working as expected. The software module that failed was originally used in Ariane IV, where it worked fine. Without testing, it was used on Ariane V, which had slightly different launch characteristics. A perfectly good module from one context was used in another context without considering the full range of issues in that new context. We could draw the analogy of a physician who prescribes a medicine for a patient, knowing that this medicine has worked well for other patients with the same illness. If the physician fails to do a complete medical history, including an evaluation of the other medications being used by that new patient, there is the possibility that contraindicated drug interactions might kill this hapless patient. 
When we reuse existing modules in safety-critical software, it is imperative that we both inspect and test for interactions that might kill our software. For embedded real-time software these are often timing issues. Those are hard to detect. As to the contention that XP would have prevented the failure of Ariane V, that is mostly wishful thinking, reminiscent of what is often called "Monday morning quarterbacking." There are some XP practices that might have been useful (features that predate XP by some considerable amount of time), but XP itself would not have saved Ariane V, nor would most of the other suggestions made by the Monday morning quarterbacks. At present, the French are launching the current version of Ariane quite safely. There are other project failures where XP might have been able to save the project. The one that comes to mind quickly is the Denver airport baggage handling system. I am sure there are others. However, I don't want to be accused of Monday morning quarterbacking. The fact is that building software is hard and it is easy to make engineering errors. Our tools and methods can help us do it right, but neither the languages nor the processes can consistently save us from ourselves. Richard Riehle ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-03-29 18:56 ` Ariane Failure Richard Riehle @ 2002-03-29 20:56 ` Michael Feathers 2002-03-30 1:02 ` Bill 2002-04-01 15:08 ` Marin David Condic 1 sibling, 1 reply; 29+ messages in thread From: Michael Feathers @ 2002-03-29 20:56 UTC (permalink / raw) "Richard Riehle" <richard@adaworks.com> wrote in message news:3CA4B8E5.72909C9B@adaworks.com... > rjk wrote: > > > What is XPers response to this? (I was going to ask a more specific > > question, but I thought I'd leave it broad until an interesting question is > > found). > > The problem with Ariane V begins with Systems Engineering management. > The decisions about what to do when an exception occurs were wrong, and > not tested. Although Design By Contract might have helped, I doubt that > Eiffel would have been appropriate because of other issues related to > Eiffel. I like Eiffel, but don't consider it appropriate for a project such > as Ariane V. The SPARK approach to Design By Contract (they don't > call it that, but that is what it is) could have worked well, especially > since it was programmed in Ada. By the way, the Ada code worked as > it was directed to work, but it was given bad directions. IIRC, there's also the issue of casting integers across sizes. It is great when you can hide representation and promote or demote its size as needed. Michael Feathers www.objectmentor.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-03-29 20:56 ` Michael Feathers @ 2002-03-30 1:02 ` Bill 2002-03-30 3:20 ` Keith Ray 2002-03-30 13:36 ` Michael Feathers 0 siblings, 2 replies; 29+ messages in thread From: Bill @ 2002-03-30 1:02 UTC (permalink / raw) Michael Feathers wrote:<snip> > > IIRC, there's also the issue of casting integers across sizes. It is great > when you can hide representation and promote or demote its size as needed. <snip> Promoting and demoting size as needed was part of the problem. Because of the limitations of typical launch vehicles, in particular their downlink capabilities to ground operations, but also their limited on-board storage and central processing, it is often necessary to reduce the size of a value from a larger storage representation to a smaller one, typically from floats or doubles to 8 or 16 bit integers. In order to ensure that the real time constraints of the system are met, there has to be an explicit decision as to what information needs to be communicated, at what rate, and at what precision. It is tempting to maintain more precision than you need, just to be certain you haven't misjudged the need, by applying an offset and scale factor prior to the conversion to an integer, such that all possible values of the rescaled number just fit within the range of values of the integer. However, that decision is subject to the error of underestimating the range of possible values of the original number before rescaling. In particular, a velocity scale factor that was valid for the Ariane IV resulted, under the actual and planned operating conditions of the Ariane V, in a value that exceeded the range of the chosen integer size, because the Ariane V has a larger acceleration and more horizontal trajectory than the Ariane IV. Note that information hiding per se doesn't help with this. 
If the writer of the software has made the explicit decision to rescale and chosen the rescale factor to use, but doesn't communicate that information to others, so that they cannot make decisions based on knowledge of the rescale factor, the rescale factor could still be inappropriate and cause breakage. Also, designing the software to rescale automatically using more global heuristics can cause other problems, as additional information about its decisions then needs to be communicated to the ground station so that it can interpret the rescaled data. ^ permalink raw reply [flat|nested] 29+ messages in thread
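The offset-and-scale conversion described in the message above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `make_scaler`, `encode`, `decode`, and `ScaleRangeError`, and the design range of -100.0 to 100.0, are hypothetical and are not taken from the Ariane software, which was written in Ada. The point is only that the scale factor is derived from an assumed range, so a value outside that range cannot be represented in 16 bits.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class ScaleRangeError(ValueError):
    """Raised when a value falls outside the range the scaling was designed for."""


def make_scaler(lo, hi):
    """Build encode/decode functions mapping the assumed range [lo, hi] onto 16 bits."""
    # The scale factor is fixed by the range the designer believes is possible.
    counts_per_unit = (INT16_MAX - INT16_MIN) / (hi - lo)

    def encode(x):
        # Apply the offset and scale factor, then convert to an integer.
        raw = round((x - lo) * counts_per_unit) + INT16_MIN
        # Explicit range check: this is the kind of check that costs CPU cycles.
        if not INT16_MIN <= raw <= INT16_MAX:
            raise ScaleRangeError(f"{x} is outside the assumed range [{lo}, {hi}]")
        return raw

    def decode(raw):
        # The receiver must know the same offset and scale factor to recover the value.
        return (raw - INT16_MIN) / counts_per_unit + lo

    return encode, decode


# Hypothetical design range, chosen only for the example.
encode, decode = make_scaler(-100.0, 100.0)
word = encode(42.0)        # fits in a signed 16-bit word
restored = decode(word)    # close to 42.0, within one count of resolution
```

Note that `decode` needs the same `lo` and `hi` as `encode`, which is why automatic rescaling on board would require sending the chosen factors to the ground station as well.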
* Re: Ariane Failure 2002-03-30 1:02 ` Bill @ 2002-03-30 3:20 ` Keith Ray 2002-03-30 12:12 ` John Roth 2002-03-30 13:36 ` Michael Feathers 1 sibling, 1 reply; 29+ messages in thread From: Keith Ray @ 2002-03-30 3:20 UTC (permalink / raw) Weinberg's "The Psychology of Computer Programming", first published in 1971, mentions the explosion of an 18 million dollar rocket because "one instruction on the tape was left out". I wonder how long that programming error was debated. -- C. Keith Ray <http://homepage.mac.com/keithray/xpminifaq.html> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-03-30 3:20 ` Keith Ray @ 2002-03-30 12:12 ` John Roth 0 siblings, 0 replies; 29+ messages in thread From: John Roth @ 2002-03-30 12:12 UTC (permalink / raw) "Keith Ray" <k1e2i3t4h5r6a7y@1m2a3c4.5c6o7m> wrote in message news:k1e2i3t4h5r6a7y-630660.19202829032002@netnews.attbi.com... > Weinberg's "The Psychology of Computer Programming", first published in > 1971, mentions the explosion of an 18 million dollar rocket because "one > instruction on the tape was left out". > > I wonder how long that programming error was debated. Until everyone got tired of the snake oil salesmen - or the next major disaster took people's fickle attention away. John Roth > -- > C. Keith Ray > > <http://homepage.mac.com/keithray/xpminifaq.html> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-03-30 1:02 ` Bill 2002-03-30 3:20 ` Keith Ray @ 2002-03-30 13:36 ` Michael Feathers 2002-04-01 15:22 ` Marin David Condic [not found] ` <a8oo51$tsk$2@slb2.atl.mindspring.net> 1 sibling, 2 replies; 29+ messages in thread From: Michael Feathers @ 2002-03-30 13:36 UTC (permalink / raw) "Bill" <wclodius@lanl.gov> wrote in message news:3CA50E9A.CBF24F1B@lanl.gov... > > Michael Feathers wrote:<snip> > > > IIRC, there's also the issue of casting integers across sizes. It is great > > when you can hide representation and promote or demote its size as needed. > > <snip> > Promoting and demoting size as needed was part of the problem. Because of > limitations of typical launch vehicles, in particular their down link > capabilities to ground operations, but also limited on board storage and > central processing, it is often necessary to reduce the size of a value from > larger storage representations to a smaller storage representations, typically > from floats or doubles to 8 or 16 bit integers. In order to ensure that the > real time constraints of the system are met, there has to be an explicit > decision as to what information needs to be communicated, at what rate, and > precision. It is tempting to maintain more precision than you need, just to be > certain you haven't misjudged the need, by applying an offset and scale factor > prior to the conversion to an integer, such that all possible values of the > rescaled number just fit within the range of values of the integer. However, > that decision is subject to the error of underestimating the range of possible > values of the original number before rescaling. In particular, a velocity scale > factor that was valid for the Ariane IV, for the actual and planned operating > conditions of the Ariane V, resulted in a value that exceeded the integer range > of the desired integer size, because the Ariane V has a larger acceleration and > more horizontal trajectory than the Ariane IV. 
> > Note that information hiding per se doesn't help with this. If the writer of > the software has made the explicit decision to rescale and the rescale factor > to use, but doesn't communicate that information to others so they can make no > decisions based on a knowledge of the rescale factor, the rescale factor could > still be inappropriate and cause breakage. Also designing the software to > automatically rescale using more global heuristcs, can cause other problems as > additional information about its decisions then needs to be communicated to the > ground station so that it can interpret the rescaled data. Yes. It seems like the error prone part is going back to integers at all. Since it is a safety consideration it seems like it would be great to revisit that as processing and communications speeds increase. Michael Feathers www.objectmentor.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-03-30 13:36 ` Michael Feathers @ 2002-04-01 15:22 ` Marin David Condic [not found] ` <a8oo51$tsk$2@slb2.atl.mindspring.net> 1 sibling, 0 replies; 29+ messages in thread From: Marin David Condic @ 2002-04-01 15:22 UTC (permalink / raw) Given infinite processor speed and infinite memory, they might have done a whole lot with the software to make it more flexible and safe. But that sort of sounds like doing engineering in Heaven. :-) These sorts of compromises are made all the time in real world engineering and you have to ask if they were reasonable for the conditions at hand. In my mind, the decisions made by the Ariane 4 engineers in designing their software were very similar to the decisions made by data processing programmers years ago in using only two digits to store a year - thus creating The Great Y2K Disaster. Back in the 70's & 80's, they were confronted with smaller storage devices and saving those extra couple of bytes in every date was important. The thinking at the time was "It's a known limitation and the useful life of this software ought to be something less than twenty years, so when they build the next system they can accommodate 4-digit years..." I thought that was a reasonable compromise in order to deal with the constraints of technology at the time. Even though hardware and resources get more abundant in the future, we'll still be making compromises - just different ones. MDC -- Marin David Condic Senior Software Engineer Pace Micro Technology Americas www.pacemicro.com Enabling the digital revolution e-Mail: marin.condic@pacemicro.com "Michael Feathers" <mfeathers@objectmentor.com> wrote in message news:a84f5p$nlm$1@slb5.atl.mindspring.net... > > Yes. It seems like the error prone part is going back to integers at all. > Since it is a safety consideration it seems like it would be great to > revisit that as processing and communications speeds increase. > ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <a8oo51$tsk$2@slb2.atl.mindspring.net>]
* Re: Ariane Failure [not found] ` <a8oo51$tsk$2@slb2.atl.mindspring.net> @ 2002-04-08 13:59 ` Marin David Condic 2002-04-09 12:49 ` John Roth ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Marin David Condic @ 2002-04-08 13:59 UTC (permalink / raw) "Dennis Lee Bieber" <wlfraed@ix.netcom.com> wrote in message news:a8oo51$tsk$2@slb2.atl.mindspring.net... > > I do have to confess to having only the general explanation of the > problem, not details of the code internals -- it does sound, from a quick > perusal of this message thread, that some sort of overflow in integer > processing occurred. This is new to me; the general report tended to the > concept that the measured rates were accurate, but exceeded what the > Ariane IV code deemed proper, and attempts to correct this "faulty rate" > led to vehicle instability... > Yes and no. The report was clearly not written by software guys since it otherwise would have explained in more accurate terms exactly what happened at the software level. Hence, you kind of have to read between the lines and interpret it some from the perspective of a more generalized engineer's view. The software module in question was originally analyzed on Ariane 4 with a view toward improving speed. They had a shortage of CPU cycles and had identified this one module as a major consumer of resources. They changed the code to eliminate all the range checking and other "safety features" (not at all uncommon in this business) in order to speed it up. This was not without analysis that examined the possible valid ranges for various numbers and mathematically reasoning about it & coming to the conclusion that any values that would possibly generate a hardware overflow error could not be in the valid flight path of the Ariane 4 - hence it was likely to be a sensor failure and the proper accommodation would be to transfer control to the other channel. The ISR for that overflow error did just that. 
So the design was valid and correct for the Ariane 4. The problem for Ariane 5 was that nobody tested or checked the assumptions on the software intended to run on a different rocket. Had they run the unit through the flight profile, they would have spotted the error in a cocaine heartbeat. MDC -- Marin David Condic Senior Software Engineer Pace Micro Technology Americas www.pacemicro.com Enabling the digital revolution e-Mail: marin.condic@pacemicro.com ^ permalink raw reply [flat|nested] 29+ messages in thread
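As a rough illustration of the point about running the unit through the flight profile, here is a small Python sketch. It is not a model of the real inertial reference software: the `Channel` class, the `first_total_loss` helper, and both sample profiles are hypothetical, with values chosen only so that one profile stays inside a signed 16-bit range and the other leaves it. The overflow handler follows the design described above, treating an overflow as a hardware fault and retiring the channel.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16(x):
    """Convert to a 16-bit integer, raising where hardware would trap on overflow."""
    value = int(x)
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in 16 bits")
    return value


class Channel:
    """One of two identical channels running the same conversion code."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def step(self, reading):
        try:
            return to_int16(reading)
        except OverflowError:
            # Design assumption: an overflow can only mean a sensor or hardware
            # fault, so this channel declares itself failed.
            self.healthy = False
            return None


def first_total_loss(profile):
    """Replay a profile; return the first step with no valid output, else None."""
    channels = [Channel("primary"), Channel("backup")]
    for t, reading in enumerate(profile):
        outputs = [ch.step(reading) for ch in channels if ch.healthy]
        if all(out is None for out in outputs):
            return t
    return None


# Hypothetical profiles: the first stays within range, the second does not.
old_profile = [500.0, 12000.0, 30000.0]
new_profile = [500.0, 20000.0, 40000.0]
```

Replaying `old_profile` returns `None`, while `new_profile` loses both channels at step 2: identical code fed the same out-of-range input fails identically, and a replay of the new profile exposes that immediately.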
* Re: Ariane Failure 2002-04-08 13:59 ` Marin David Condic @ 2002-04-09 12:49 ` John Roth 2002-04-09 14:58 ` Steve O'Neill 2002-04-09 15:04 ` Steve O'Neill 2002-04-09 19:07 ` Bill 2 siblings, 1 reply; 29+ messages in thread From: John Roth @ 2002-04-09 12:49 UTC (permalink / raw) "Marin David Condic" <dont.bother.mcondic.auntie.spam@[acm.org> wrote in message news:a8s7nu$ibo$1@nh.pace.co.uk... > "Dennis Lee Bieber" <wlfraed@ix.netcom.com> wrote in message > news:a8oo51$tsk$2@slb2.atl.mindspring.net... > > > > I do have to confess to having only the general explanation of the > > problem, not details of the code internals -- it does sound, from a quick > > perusal of this message thread, that some sort of overflow in integer > > processing occurred. This is new to me; the general report tended to the > > concept that the measured rates were accurate, but exceeded what the > > Ariane IV code deemed proper, and attempts to correct this "faulty rate" > > led to vehicle instability... > > > Yes and no. The report was clearly not written by software guys since it > otherwise would have explained in more accurate terms exactly what happened > at the software level. Hence, you kind of have to read between the lines and > interpret it some from the perspective of a more generalized engineer's > view. I think the technical report went into more detail. However, this particular thread got started by a post that referenced an article which claimed that if the implementers had used Eiffel with Design by Contract, the problem would not have occurred. This is patently absurd. The proximate cause, as several posters have pointed out, was the failure to recertify and test a component designed for one rocket for use with a different rocket with different characteristics. Drilling deeper, the next level cause was attempting to do too much function for a given combination of processor / language. This caused performance-motivated shortcuts in the implementation. 
Thus the 'solution' would have been to use a processor with higher performance, or a language with less overhead. Pursuing this path, we come to the inescapable conclusion that the problem would not have occurred if the implementors had used either Assembler or Forth! > The software module in question was originally analyzed on Ariane 4 with a > view toward improving speed. They had a shortage of CPU cycles and had > identified this one module as a major consumer of resources. They changed > the code to eliminate all the range checking and other "safety features" > (not at all uncommon in this business) in order to speed it up. This was not > without analysis that examined the possible valid ranges for various numbers > and mathematically reasoning about it & coming to the conclusion that any > values that would possibly generate a hardware overflow error could not be > in the valid flight path of the Ariane 4 - hence it was likely to be a > sensor failure and the proper accommodation would be to transfer control to > the other channel. The ISR for that overflow error did just that. So the > design was valid and correct for the Ariane 4. > > The problem for Ariane 5 was that nobody tested or checked the assumptions > on the software intended to run on a different rocket. Had they run the unit > through the flight profile, they would have spotted the error in a cocaine > heartbeat. > > MDC > -- > Marin David Condic > Senior Software Engineer > Pace Micro Technology Americas www.pacemicro.com > Enabling the digital revolution > e-Mail: marin.condic@pacemicro.com John Roth > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-09 12:49 ` John Roth @ 2002-04-09 14:58 ` Steve O'Neill 0 siblings, 0 replies; 29+ messages in thread From: Steve O'Neill @ 2002-04-09 14:58 UTC (permalink / raw) John Roth wrote: > Thus the 'solution' would have been to use a processor with higher > performance, or a language with less overhead. Pursuing this path, > we come to the inescapable conclusion that the problem would > not have occured if the implementors had used either Assembler > or Forth! No, no, no... this was settled years ago after a lengthy discussion here on cla that the problem never would have occurred had the developers used PL/I. :-) ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-08 13:59 ` Marin David Condic 2002-04-09 12:49 ` John Roth @ 2002-04-09 15:04 ` Steve O'Neill 2002-04-09 23:00 ` John Roth 2002-04-09 19:07 ` Bill 2 siblings, 1 reply; 29+ messages in thread From: Steve O'Neill @ 2002-04-09 15:04 UTC (permalink / raw) Marin David Condic wrote: > The software module in question was originally analyzed on Ariane 4 with a > view toward improving speed. They had a shortage of CPU cycles and had > identified this one module as a major consumer of resources. They changed > the code to eliminate all the range checking and other "safety features" > (not at all uncommon in this business) in order to speed it up. This was not > without analysis that examined the possible valid ranges for various numbers > and mathematically reasoning about it & coming to the conclusion that any > values that would possibly generate a hardware overflow error could not be > in the valid flight path of the Ariane 4 - hence it was likely to be a > sensor failure and the proper accommodation would be to transfer control to > the other channel. And here was another of the fatal system design flaws that should never have been made... it seems that this 'other channel' was an *identical* system which, of course, reacted in the same manner. Leaving the poor flight control computer with no valid data. Ooops! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-09 15:04 ` Steve O'Neill @ 2002-04-09 23:00 ` John Roth 2002-04-10 12:52 ` Steve O'Neill 0 siblings, 1 reply; 29+ messages in thread From: John Roth @ 2002-04-09 23:00 UTC (permalink / raw) "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message news:3CB3031A.26E08904@gbr.msd.ray.com... > Marin David Condic wrote: > > The software module in question was originally analyzed on Ariane 4 with a > > view toward improving speed. They had a shortage of CPU cycles and had > > identified this one module as a major consumer of resources. They changed > > the code to eliminate all the range checking and other "safety features" > > (not at all uncommon in this business) in order to speed it up. This was not > > without analysis that examined the possible valid ranges for various numbers > > and mathematically reasoning about it & coming to the conclusion that any > > values that would possibly generate a hardware overflow error could not be > > in the valid flight path of the Ariane 4 - hence it was likely to be a > > sensor failure and the proper accommodation would be to transfer control to > > the other channel. > > And here was another of the fatal system design flaws that should never > have been made... it seems that this 'other channel' was an *identical* > system which, of course, reacted in the same manner. Leaving the poor > flight control computer with no valid data. Ooops! Not exactly. The assumption was that failures would be hardware, so dual coding the software wasn't an objective. John Roth ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-09 23:00 ` John Roth @ 2002-04-10 12:52 ` Steve O'Neill 2002-04-10 12:59 ` Marin David Condic 2002-04-11 12:12 ` fdebruin 0 siblings, 2 replies; 29+ messages in thread From: Steve O'Neill @ 2002-04-10 12:52 UTC (permalink / raw) John Roth wrote: > > "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message > > And here was another of the fatal system design flaws that should never > > have been made... it seems that this 'other channel' was an *identical* > > system which, of course, reacted in the same manner. Leaving the poor > > flight control computer with no valid data. Ooops! > > Not exactly. The assumption was that failures would be hardware, > so dual coding the software wasn't an objective. Well, no matter where you assume the failures will or will not occur you should never design a dual-redundant system where both strings are identical. Steve ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-10 12:52 ` Steve O'Neill @ 2002-04-10 12:59 ` Marin David Condic 2002-04-11 0:48 ` Steve O'Neill 2002-04-11 13:47 ` Ted Dennison 2002-04-11 12:12 ` fdebruin 1 sibling, 2 replies; 29+ messages in thread From: Marin David Condic @ 2002-04-10 12:59 UTC (permalink / raw) "Never" is a really long time! :-) Seriously. There are lots of good engineering reasons to develop multi-redundant identical systems. See my other post relating to that. See RAID and JBOD drives as one example of how/why this can be a good thing. MDC -- Marin David Condic Senior Software Engineer Pace Micro Technology Americas www.pacemicro.com Enabling the digital revolution e-Mail: marin.condic@pacemicro.com "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message news:3CB435A4.8A011DF1@gbr.msd.ray.com... > > Well, no matter where you assume the failures will or will not occur you > should never design a dual-redundant system where both strings are > identical. > > Steve ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-10 12:59 ` Marin David Condic @ 2002-04-11 0:48 ` Steve O'Neill 2002-04-11 13:17 ` Marin David Condic 2002-04-11 13:47 ` Ted Dennison 1 sibling, 1 reply; 29+ messages in thread From: Steve O'Neill @ 2002-04-11 0:48 UTC (permalink / raw) Marin David Condic wrote: > > "Never" is a really long time! :-) Well, I'll give you that... and concede that one should never say never. > There are lots of good engineering reasons to develop multi-redundant > identical systems. Agreed... except when the potential result may be raining down flaming pieces of a billion dollars worth of satellite. As I recall the photos were very impressive. Steve ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-11 0:48 ` Steve O'Neill @ 2002-04-11 13:17 ` Marin David Condic 0 siblings, 0 replies; 29+ messages in thread From: Marin David Condic @ 2002-04-11 13:17 UTC (permalink / raw) "Steve O'Neill" <oneills@top.monad.net> wrote in message news:3CB4DD65.99F17199@top.monad.net... > > Agreed... except when the potential result may be raining down flaming > pieces of a billion dollars worth of satellite. As I recall the photos > were very impressive. > Well, I'm impressed by the photos too. It can be very educational to engineers to look over the videos and photos of various engineering disasters. There are plenty to choose from. I'll still disagree that dual-redundant identical systems are a bad idea in rocket technology and that they are somehow inherently less safe than dissimilar systems. Having worked in that field I know some of the thinking that goes into these sorts of designs and lots of highly reliable identical systems have been built. "Dissimilar" only protects you from common design errors - maybe. It also increases the probability that there *will* be a design error. When considering the potential designs for a given piece of avionics, you need to look very carefully at all the possible failure modes you can think of and look at the probabilities of those failures occurring and ask how well a given design strategy will minimize the risk. Dual redundant, identical systems can and do function very well and at very high levels of reliability and it isn't automatically clear that for a given application a dual redundant dissimilar system is going to improve reliability. In fact, quite the opposite might be the case. MDC -- Marin David Condic Senior Software Engineer Pace Micro Technology Americas www.pacemicro.com Enabling the digital revolution e-Mail: marin.condic@pacemicro.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-10 12:59 ` Marin David Condic 2002-04-11 0:48 ` Steve O'Neill @ 2002-04-11 13:47 ` Ted Dennison 2002-04-11 14:15 ` Marin David Condic 1 sibling, 1 reply; 29+ messages in thread From: Ted Dennison @ 2002-04-11 13:47 UTC (permalink / raw) "Marin David Condic" <dont.bother.mcondic.auntie.spam@[acm.org> wrote in message news:<a91cv1$5s6$1@nh.pace.co.uk>... > "Never" is a really long time! :-) Seriously. There are lots of good > engineering reasons to develop multi-redundant identical systems. See my > other post relating to that. See RAID and JBOD drives as one example of > how/why this can be a good thing. In fact, I can tell you firsthand that NASA's STGT satellite/shuttle ground station is designed exactly that way. -- T.E.D. Home - mailto:dennison@telepath.com (Yahoo: Ted_Dennison) Homepage - http://www.telepath.com/dennison/Ted/TED.html ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-11 13:47 ` Ted Dennison @ 2002-04-11 14:15 ` Marin David Condic 0 siblings, 0 replies; 29+ messages in thread From: Marin David Condic @ 2002-04-11 14:15 UTC (permalink / raw) "Ted Dennison" <dennison@telepath.com> wrote in message news:4519e058.0204110547.36730467@posting.google.com... > > In fact, I can tell you firsthand that NASA's STGT satellite/shuttle > ground station is designed exactly that way. > > From similar experience, engine controls for commercial and military jet engines are typically dual redundant, identical systems. (I don't know what the guys over at The Light Bulb did but that's at least the way Pratt built them.) While I don't have firsthand experience with them, I'd suspect that lots of fly-by-wire flight controls are of similar design - and that's pretty critical if it were to fail and turn the plane into a lawn-dart. "Dissimilar" doesn't necessarily equate to "Better" - and even if it did, it might just be the enemy of "Good Enough". MDC -- Marin David Condic Senior Software Engineer Pace Micro Technology Americas www.pacemicro.com Enabling the digital revolution e-Mail: marin.condic@pacemicro.com ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Ariane Failure 2002-04-10 12:52 ` Steve O'Neill 2002-04-10 12:59 ` Marin David Condic @ 2002-04-11 12:12 ` fdebruin 2002-04-11 14:33 ` Larry Kilgallen 1 sibling, 1 reply; 29+ messages in thread From: fdebruin @ 2002-04-11 12:12 UTC (permalink / raw) Steve O'Neill <oneils@gbr.msd.ray.com> writes: >John Roth wrote: >> >> "Steve O'Neill" <oneils@gbr.msd.ray.com> wrote in message >> > And here was another of the fatal system design flaws that should never >> > have been made... it seems that this 'other channel' was an *identical* >> > system which, of course, reacted in the same manner. Leaving the poor >> > flight control computer with no valid data. Ooops! >> >> Not exactly. The assumption was that failures would be hardware, >> so dual coding the software wasn't an objective. >Well, no matter where you assume the failures will or will not occur you >should never design a dual-redundant system where both strings are >identical. The strings were not identical because they were using physically different hardware components. You could argue that the strings should have no commonalities at all for maximum safety and robustness. This will only work if you have endless resources in terms of money and time. For example, in the case of Ariane 501 the IRS software in the two units could have been developed independently by two different companies. Maybe the problem would have occurred in only one of them. This will immediately raise the issue of deciding which software is providing you the correct data, in case they differ. This will lead to some kind of arbitrator (a new single point of failure) or a third software package to allow majority voting. You will easily triple the cost of your software. In addition, I am not convinced that the added robustness makes up for the added complexity. Furthermore, you are still not fully covered because your software specification might contain an error that will be common to all independently developed software. 
Redundancy is a vital concept but it is not *the* solution, it is just contributing to it. It all boils down to a tradeoff. To the extreme: the costs of your redunancy should not be higher than the recurrent costs for building a new launcher. Frank de Bruin ^ permalink raw reply [flat|nested] 29+ messages in thread
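
The arbitration problem raised in this message can be made concrete with a small sketch. It is an illustration only, not the Ariane design: the function names, the tolerance, and the sample readings are all assumptions. It shows that two dissimilar channels can detect a disagreement but cannot resolve it, while a third channel allows majority voting.

```python
from statistics import median
from typing import Optional, Sequence


def two_channel_select(a: float, b: float, tolerance: float) -> Optional[float]:
    """Two dissimilar channels: a disagreement is detectable but not resolvable."""
    if abs(a - b) <= tolerance:
        return (a + b) / 2.0
    return None  # nothing tells us which channel is the wrong one


def majority_vote(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Three or more channels: accept the value a strict majority agrees on.

    Hypothetical voter: a reading 'agrees' if it lies within `tolerance`
    of the median of all readings.
    """
    mid = median(readings)
    agreeing = [r for r in readings if abs(r - mid) <= tolerance]
    if len(agreeing) * 2 > len(readings):
        return sum(agreeing) / len(agreeing)
    return None  # no majority, so the voter itself has to declare failure


if __name__ == "__main__":
    print(two_channel_select(10.0, 50.0, 0.5))     # disagreement, no answer
    print(majority_vote([10.0, 10.2, 50.0], 0.5))  # the outlier is outvoted
    print(majority_vote([50.0, 50.0, 50.0], 0.5))  # a common error still wins
```

The last call reflects the point about specifications: if every independently developed package inherits the same specification error, all channels agree on the wrong value and the voter accepts it.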
* Re: Ariane Failure
  2002-04-11 12:12 ` fdebruin
@ 2002-04-11 14:33 ` Larry Kilgallen
  2002-04-11 18:16 ` Ted Dennison
  0 siblings, 1 reply; 29+ messages in thread
From: Larry Kilgallen @ 2002-04-11 14:33 UTC

In article <a93uj0$1be$1@news1.xs4all.nl>, fdebruin@xs3.xs4all.nl (fdebruin) writes:

> For example, in the case of Ariane 501 the IRS software in the two units
> could have been developed independently by two different companies. Maybe
> the problem would have occurred in only one of them.
>
> This will immediately raise the issue of deciding which software is
> providing you the correct data, in case they differ. This will lead to
> some kind of arbitrator (a new single point of failure) or a third
> software package to allow majority voting.
>
> You will easily triple the cost of your software. In addition, I am not
> convinced that the added robustness makes up for the added complexity.
> Furthermore, you are still not fully covered because your software
> specification might contain an error that will be common to all independently
> developed software.

Just have each implementation use a different specification :-)
* Re: Ariane Failure
  2002-04-11 14:33 ` Larry Kilgallen
@ 2002-04-11 18:16 ` Ted Dennison
  2002-04-11 18:30 ` Marin David Condic
  0 siblings, 1 reply; 29+ messages in thread
From: Ted Dennison @ 2002-04-11 18:16 UTC

Kilgallen@SpamCop.net (Larry Kilgallen) wrote in message news:<Tz$ACDLOR7ku@eisner.encompasserve.org>...
> In article <a93uj0$1be$1@news1.xs4all.nl>, fdebruin@xs3.xs4all.nl (fdebruin) writes:
> > For example, in the case of Ariane 501 the IRS software in the two units
> > could have been developed independently by two different companies. Maybe
...
> > Furthermore, you are still not fully covered because your software
> > specification might contain an error that will be common to all independently
> > developed software.
>
> Just have each implementation use a different specification :-)

...or just send satellites up on a Proton, Atlas, and Titan as well,
and hope one of them makes it. :-)

--
T.E.D.
Home     - mailto:dennison@telepath.com (Yahoo: Ted_Dennison)
Homepage - http://www.telepath.com/dennison/Ted/TED.html
* Re: Ariane Failure
  2002-04-11 18:16 ` Ted Dennison
@ 2002-04-11 18:30 ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-11 18:30 UTC

"Ted Dennison" <dennison@telepath.com> wrote in message
news:4519e058.0204111016.22d01aef@posting.google.com...
>
> ...or just send satellites up on a Proton, Atlas, and Titan as well,
> and hope one of them makes it. :-)
>

Now *that's* an example of multi-redundant, dissimilar systems that had not
occurred to me. And a good illustration of overkill engineering. I'm
reminded of:

"Insisting on perfect safety is for people who don't have the balls to live
in the real world."
        -- Mary Shafer, NASA Ames Dryden

MDC
--
Marin David Condic
Senior Software Engineer
Pace Micro Technology Americas    www.pacemicro.com
Enabling the digital revolution
e-Mail:    marin.condic@pacemicro.com
* Re: Ariane Failure
  2002-04-08 13:59 ` Marin David Condic
  2002-04-09 12:49 ` John Roth
  2002-04-09 15:04 ` Steve O'Neill
@ 2002-04-09 19:07 ` Bill
  2002-04-09 19:44 ` Marin David Condic
  2 siblings, 1 reply; 29+ messages in thread
From: Bill @ 2002-04-09 19:07 UTC

Marin David Condic wrote:

> <snip>This was not
> without analysis that examined the possible valid ranges for various numbers
> and mathematically reasoning about it & coming to the conclusion that any
> values that would possibly generate a hardware overflow error could not be
> in the valid flight path of the Ariane 4 - hence it was likely to be a
> sensor failure and the proper accommodation would be to transfer control to
> the other channel. The ISR for that overflow error did just that. So the
> design was valid and correct for the Ariane 4. <snip>

Are you sure this was their reasoning? My interpretation of the reasoning was
that it had to be a hardware failure, but the only hardware they could do
anything about was the processor interpreting the sensor data, so they
transferred control to another processor handling the same sensor data with
the same program.
* Re: Ariane Failure
  2002-04-09 19:07 ` Bill
@ 2002-04-09 19:44 ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-09 19:44 UTC

Not having been on the design team, I obviously can't state definitively
what their reasoning was. This was my best possible interpretation of the
situation after reading the report.

It's been quite a while (yet still this topic comes up! :-) since I last read
the report but having been involved in similar system designs
(dual-redundant engine controls rather than dual redundant IRS's) my best
interpretation was that they had two computers looking at two separate sets
of sensors. (I'll bow to a more authoritative source on this - but that's my
best recollection.) Your big risk is not so much that the computer itself
will fail (which you can't do much about with software anyway, right?) but
that a sensor or actuator will fail. Dual redundant computers that are
looking at the same set of sensors would create a common-mode failure and
loss of a sensor would make both computers useless. Not much point in dual
redundancy then is there? :-)

MDC
--
Marin David Condic
Senior Software Engineer
Pace Micro Technology Americas    www.pacemicro.com
Enabling the digital revolution
e-Mail:    marin.condic@pacemicro.com

"Bill" <wclodius@lanl.gov> wrote in message
news:3CB33C0A.9125A6A7@lanl.gov...
>
> Are you sure this was their reasoning? My interpretation of the reasoning was
> that it had to be a hardware failure, but the only hardware they could do
> anything about was the processor interpreting the sensor data, so they
> transferred control to another processor handling the same sensor data with
> the same program.
>
* Re: Ariane Failure
  2002-03-29 18:56 ` Ariane Failure Richard Riehle
  2002-03-29 20:56 ` Michael Feathers
@ 2002-04-01 15:08 ` Marin David Condic
  2002-04-02 18:32 ` Wes Groleau
  1 sibling, 1 reply; 29+ messages in thread
From: Marin David Condic @ 2002-04-01 15:08 UTC

I beg to differ on the "Bad Directions" part. Note that the software in
question was designed for the Ariane IV which had a different flight
profile. The FDA thinking for the module in question went sort of like this:
"Any number that shows up here big enough to generate a hardware overflow
interrupt has got to be so far out of the flight profile that it would most
likely indicate a bad sensor. The accommodation for this failure should be
to transfer control to the other side where we might still have a good
sensor..." This logic worked fine in Ariane 4 and would likely have detected
a sensor failure and accommodated it appropriately. In my mind, that sounded
a lot like "Good Directions" :-)

The problem arose when the assumption was made that software that was
designed for Ariane 4 and that worked just fine in that environment was
therefore fit to fly Ariane 5 WITHOUT being tested and validated against the
Ariane 5 flight profile. That's a pretty basic and fundamental error that
goes well outside the realm of control of a programming language or
methodology.

MDC
--
Marin David Condic
Senior Software Engineer
Pace Micro Technology Americas    www.pacemicro.com
Enabling the digital revolution
e-Mail:    marin.condic@pacemicro.com

"Richard Riehle" <richard@adaworks.com> wrote in message
news:3CA4B8E5.72909C9B@adaworks.com...
>
> The problem with Ariane V begins with Systems Engineering management.
> The decisions about what to do when an exception occurs were wrong, and
> not tested. Although Design By Contract might have helped, I doubt that
> Eiffel would have been appropriate because of other issues related to
> Eiffel. I like Eiffel, but don't consider it appropriate for a project such
> as Ariane V. The SPARK approach to Design By Contract (they don't
> call it that, but that is what it is) could have worked well, especially
> since it was programmed in Ada. By the way, the Ada code worked as
> it was directed to work, but it was given bad directions.
>
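
The failure-accommodation logic described in this message can be sketched in a few lines. This is an illustration only, not the flight code (which was written in Ada): the 16-bit limits, the function names, and the sample values are all assumptions, standing in for "a number big enough to generate a hardware overflow".

```python
from typing import Optional, Tuple

# Assumed limits: a signed 16-bit range stands in for the hardware overflow bound.
INT16_MIN, INT16_MAX = -32768, 32767


class ChannelFault(Exception):
    """Raised when a value cannot be represented; treated as a sensor failure."""


def to_int16(value: float) -> int:
    """Convert a reading, declaring a fault instead of overflowing silently."""
    if not INT16_MIN <= value <= INT16_MAX:
        raise ChannelFault(f"value {value} is outside the representable range")
    return int(value)


def select_channel(primary: float, backup: float) -> Optional[Tuple[str, int]]:
    """Use the primary channel; on a fault, transfer control to the backup."""
    for name, reading in (("primary", primary), ("backup", backup)):
        try:
            return name, to_int16(reading)
        except ChannelFault:
            continue  # accommodation: try the other side
    return None  # both channels declared failure, no valid data remains


if __name__ == "__main__":
    # Hypothetical old flight profile: an overflow really does mean a bad sensor.
    print(select_channel(40000.0, 1200.0))   # fails over to the backup
    # Hypothetical new flight profile: the large value is legitimate on both sides.
    print(select_channel(40000.0, 40000.0))  # both channels shut down
```

Under the first profile the accommodation works as intended. Under the second, identical software on both sides treats a valid reading as a fault, which is why validating the reused module against the new flight profile matters more than the language it was written in.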
* Re: Ariane Failure
  2002-04-01 15:08 ` Marin David Condic
@ 2002-04-02 18:32 ` Wes Groleau
  2002-04-02 18:42 ` Marin David Condic
  0 siblings, 1 reply; 29+ messages in thread
From: Wes Groleau @ 2002-04-02 18:32 UTC

> sensor..." This logic worked fine in Ariane 4 and would likely have detected
> a sensor failure and accommodated it appropriately. In my mind, that sounded
> a lot like "Good Directions" :-)
>
> The problem arose when the assumption was made that software that was
> designed for Ariane 4 and that worked just fine in that environment was
> therefore fit to fly Ariane 5 WITHOUT being tested and validated against the
> Ariane 5 flight profile. That's a pretty basic and fundamental error that

"It worked before, so no review or test is necessary. Make it so."

Sounds like bad directions to me.

--
Wes Groleau
http://freepages.rootsweb.com/~wgroleau
* Re: Ariane Failure
  2002-04-02 18:32 ` Wes Groleau
@ 2002-04-02 18:42 ` Marin David Condic
  0 siblings, 0 replies; 29+ messages in thread
From: Marin David Condic @ 2002-04-02 18:42 UTC

O.K. "Good Directions" from the perspective of the software directing the
IRS (or the developers directing the software...) "Bad Directions" from the
perspective of the project management missing the boat on the system
constraints.

MDC
--
Marin David Condic
Senior Software Engineer
Pace Micro Technology Americas    www.pacemicro.com
Enabling the digital revolution
e-Mail:    marin.condic@pacemicro.com

"Wes Groleau" <wesgroleau@despammed.com> wrote in message
news:3CA9F956.CCCFD3CD@despammed.com...
>
> "It worked before, so no review or test is necessary. Make it so."
>
> Sounds like bad directions to me.
>
* Ariane Failure
@ 1996-06-28  0:00 Robert B. Love
  1996-07-01  0:00 ` Ken Garlington
  0 siblings, 1 reply; 29+ messages in thread
From: Robert B. Love @ 1996-06-28  0:00 UTC

If I understand what I read here the earlier comments about
the failed Ariane-5 indicated that the flight s/w was coded
in Ada. The blurb I've read in Space News says the code
that failed resided in the inertial measurement units.

This is different than the flight software. Does anybody
know what the embedded code for the IMUs is coded in?

Overall, it seems a design failure. The IMU's couldn't handle
the flight profile of the Ar-5 and the test bed was killed
to save money.

----------------------------------------------------------------
Bob Love, rlove@neosoft.com (local)        MIME & NeXT Mail OK
rlove@raptor.rmnug.org (permanent)         PGP key available
----------------------------------------------------------------
* Re: Ariane Failure
  1996-06-28  0:00 Robert B. Love
@ 1996-07-01  0:00 ` Ken Garlington
  0 siblings, 0 replies; 29+ messages in thread
From: Ken Garlington @ 1996-07-01  0:00 UTC

Robert B. Love wrote:
>
> If I understand what I read here the earlier comments about
> the failed Ariane-5 indicated that the flight s/w was coded
> in Ada. The blurb I've read in Space News says the code
> that failed resided in the inertial measurement units.

From what I saw, the European Space Agency preliminary announcement didn't
refer to _any_ code, in either the flight controller or the INU. It said
that there was a fault in the INU _system_. We won't know until at least
the final report later in July whether this was a software fault, a dual
hardware fault, an interface/design error, or what.

--
LMTAS - "Our Brand Means Quality"