* Re: Ariane 5 Failure - Summary Report [not found] <31F60E8A.2D74@lmtas.lmco.com> @ 1996-07-24 0:00 ` Ken Garlington 1996-07-24 0:00 ` Byron B. Kauffman ` (6 more replies) 0 siblings, 7 replies; 18+ messages in thread From: Ken Garlington @ 1996-07-24 0:00 UTC (permalink / raw) Ken Garlington wrote: <nothing!> Don't know what happened there, but I was just going to point out that the Ariane 5 report is at: http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html Be sure to read the full report, which is linked to this page. It goes into some length about the sequence of events (which includes an Ada exception I never heard of before, Operand Error? Maybe it's user defined, or there's a language difference at work). Definitely good "lessons learned" about: 1. The limits of exceptions (they are only as good as what you can do when they are raised). 2. The problems with reusing items outside their original environment. 3. The need to check inputs and outputs aggressively. 4. The pitfalls of assuming that testing all of the components of a system equates to testing the system, as well as the need to use realistic test scenarios. 5. The problems with isolating the safety-critical components of a system. So, anyway, we now have another software package written in Ada that caused the loss of a system, and again specification and design issues outside Ada's control are the culprit. -- LMTAS - "Our Brand Means Quality" ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Ariane 5 Failure - Summary Report Ken Garlington @ 1996-07-24 0:00 ` Byron B. Kauffman 1996-07-24 0:00 ` Stephen D. House 1996-07-25 0:00 ` Theodore E. Dennison 1996-07-25 0:00 ` ++ robin ` (5 subsequent siblings) 6 siblings, 2 replies; 18+ messages in thread From: Byron B. Kauffman @ 1996-07-24 0:00 UTC (permalink / raw) From the Summary Report: "...The same requirement does not apply to Ariane 5, which has a different preparation sequence and it was maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4..." Is anyone else sick and tired of the COTS argument for getting rid of the Ada mandate? Of course, the C crowd will want to blame this scenario on Ada, but it appears to me that the same results would have occurred no matter what language the original software was written in. Does anyone but me think that COTS was a bad idea conjured up by a hardware guy stuck in an office somewhere in Dayton who thinks he knows something about hardware? I guess we're going to have to trade in our 'software engineering' hats for 'software scavenger/cut-and-paste/rewrite-the-interfaces' hats (otherwise known as HACKING). Just my opinion, of course. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Byron B. Kauffman @ 1996-07-24 0:00 ` Stephen D. House 1996-07-25 0:00 ` Theodore E. Dennison 1 sibling, 0 replies; 18+ messages in thread From: Stephen D. House @ 1996-07-24 0:00 UTC (permalink / raw) Byron B. Kauffman wrote: > I guess we're going to have to trade in our 'software engineering' hats > for 'software scavenger/cut-and-paste/rewrite-the-interfaces' hats > (otherwise known as HACKING). > > Just my opinion, of course. For a rocket, you might be right. BUT... One of the advantages of "visual" programming languages is that they are languages which tie components together. The software crisis will not be eased until productivity goes up. Productivity isn't the number of lines of code you can write in a month; it's how much functionality you can give your customer per month. Unless software houses start building products by putting together subsystems instead of subprograms, no gains will be made. COTS is one way. In-house components (ones which are understood, domain specific, consistent with other components, etc.) are better solutions. I don't think that companies are doing enough with reuse. They'd rather buy a magic bullet from somebody else than dig through their own attic for something. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Byron B. Kauffman 1996-07-24 0:00 ` Stephen D. House @ 1996-07-25 0:00 ` Theodore E. Dennison 1 sibling, 0 replies; 18+ messages in thread From: Theodore E. Dennison @ 1996-07-25 0:00 UTC (permalink / raw) Byron B. Kauffman wrote: > > the Ada mandate? Of course, the C crowd will want to blame this > scenario on Ada, but it appears to me that the same results would have > occurred no matter what language the original software was written in. Not true. In C, the operation that performed the type conversion (cast) would have corrupted nearby memory locations and continued running with faulty data (or possibly instructions). The resulting error would have occurred at a seemingly random place with seemingly random frequency, and the commission studying the failure would never have been able to isolate it. -- T.E.D. | Work - mailto:dennison@escmail.orl.mmc.com | | Home - mailto:dennison@iag.net | | URL - http://www.iag.net/~dennison | ^ permalink raw reply [flat|nested] 18+ messages in thread
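To make the "continued running with faulty data" scenario concrete, here is a minimal sketch in Python of a narrowing conversion with no range check. It is an illustration only, not the flight code. The helper name `unchecked_to_int16` is hypothetical. `ctypes.c_int16` is used because it keeps only the low 16 bits of its initializer without any overflow check. Note that in ISO C the result of converting an out-of-range floating-point value to an integer type is undefined, so silent wraparound is one commonly observed outcome, not a guarantee.

```python
import ctypes


def unchecked_to_int16(value: float) -> int:
    """Truncate toward zero, then keep only the low 16 bits.

    Hypothetical helper: mimics a narrowing conversion that performs
    no range check, so out-of-range inputs wrap around silently.
    """
    return ctypes.c_int16(int(value)).value


print(unchecked_to_int16(1234.9))   # 1234  (in range, value preserved)
print(unchecked_to_int16(40000.0))  # -25536 (out of range, wrapped, no error raised)
```

The second result is a plausible-looking number with the wrong sign and magnitude, and nothing downstream is told that anything went wrong.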
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Ariane 5 Failure - Summary Report Ken Garlington 1996-07-24 0:00 ` Byron B. Kauffman @ 1996-07-25 0:00 ` ++ robin 1996-07-25 0:00 ` Dale Stanbrough ` (4 subsequent siblings) 6 siblings, 0 replies; 18+ messages in thread From: ++ robin @ 1996-07-25 0:00 UTC (permalink / raw) Ken Garlington <garlingtonke@lmtas.lmco.com> writes: >Ken Garlington wrote: <nothing!> >Don't know what happened there, but I was just going to point out >that the Ariane 5 report is at: > http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html >Be sure to read the full report, which is linked to this page. It >goes into some length about the sequence of events (which includes >an Ada exception I never heard of before, Operand Error? ---That's fixed-point overflow. Converting a 64-bit floating-point value to a 16 bit signed integer. The conversion was unchecked (programming error-- the other conversions in the same module were checked; the assumption was made that the value would be within range); consequently the error condition was raised. The exception-handling routine was to record the status of the error and to then shut down the system. Maybe it's user >defined, or there's a language difference at work). >Definitely good "lessons learned" about: >1. The limits of exceptions (they are only as good as what you can do >when they are raised). >2. The problems with reusing items outside their original environment. >3. The need to check inputs and outputs aggressively. >4. The pitfalls of assuming that testing all of the components of a >system equates to testing the system, as well as the need to use >realistic test scenarios. >5. The problems with isolating the safety-critical components of a >system. >So, anyway, we now have another software package written in Ada that >caused the loss of a system, and again specification and design issues >outside Ada's control are the culprit. 
^ permalink raw reply [flat|nested] 18+ messages in thread
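The conversion described in the message above (64-bit floating point to 16-bit signed integer) can be sketched with an explicit range check. This is a minimal Python illustration, not the Ada code from the SRI. `OperandError` and `to_int16_checked` are hypothetical names chosen to echo the report's wording.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(ArithmeticError):
    """Hypothetical exception name, borrowed from the report's wording."""


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, refusing bad inputs."""
    if not math.isfinite(value):
        raise OperandError(f"non-finite value: {value!r}")
    result = math.trunc(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value!r} does not fit in 16 bits")
    return result


print(to_int16_checked(123.7))  # 123
try:
    to_int16_checked(40000.0)
except OperandError as exc:
    print("rejected:", exc)
```

The check itself is cheap to write. What the caller should do once it fails is the harder question, and it is the one the rest of the thread argues about.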
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Ariane 5 Failure - Summary Report Ken Garlington 1996-07-24 0:00 ` Byron B. Kauffman 1996-07-25 0:00 ` ++ robin @ 1996-07-25 0:00 ` Dale Stanbrough 1996-07-26 0:00 ` OS2 User 1996-07-25 0:00 ` ++ robin ` (3 subsequent siblings) 6 siblings, 1 reply; 18+ messages in thread From: Dale Stanbrough @ 1996-07-25 0:00 UTC (permalink / raw) Ken Garlington writes: "1. The limits of exceptions (they are only as good as what you can do when they are raised)." Now I know what is meant when people say "exceptions can be expensive" :-). Dale ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-25 0:00 ` Dale Stanbrough @ 1996-07-26 0:00 ` OS2 User 0 siblings, 0 replies; 18+ messages in thread From: OS2 User @ 1996-07-26 0:00 UTC (permalink / raw) Dale Stanbrough (dale@goanna.cs.rmit.edu.au) wrote: : Ken Garlington writes: : "1. The limits of exceptions (they are only as good as what you can do : when they are raised)." : : Now I know what is meant when people say "exceptions can be expensive" :-). In fact, the most expensive thing is not the exception but the misunderstanding of it. For me it is the /Ostrich complex/: if you do not see it, you imagine it does not exist (which is not true!). An exception is a reporting structure; if you specify how it is raised for each component, you can: - use it for control (see sentence (1)) - avoid it with checks (you must be 101% reliable) - ignore it, kill the process, crash the rocket :-( : Dale -- Christophe Faure | Tel: +33()55.45.72.32 | Fax: +33()55.45.73.15 | e-mail: Faure@unilim.fr | Laboratoire M.S.I, Universite de Limoges, F-87060 LIMOGES ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Ariane 5 Failure - Summary Report Ken Garlington ` (2 preceding siblings ...) 1996-07-25 0:00 ` Dale Stanbrough @ 1996-07-25 0:00 ` ++ robin 1996-07-26 0:00 ` ++ robin 1996-07-26 0:00 ` Ken Garlington 1996-07-25 0:00 ` Alan Brain ` (2 subsequent siblings) 6 siblings, 2 replies; 18+ messages in thread From: ++ robin @ 1996-07-25 0:00 UTC (permalink / raw) Ken Garlington <garlingtonke@lmtas.lmco.com> writes: >Ken Garlington wrote: <nothing!> >Don't know what happened there, but I was just going to point out >that the Ariane 5 report is at: > http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html >Be sure to read the full report, which is linked to this page. It >goes into some length about the sequence of events (which includes >an Ada exception I never heard of before, Operand Error? ---That's fixed-point overflow. Converting a 64-bit floating-point value to a 16 bit signed integer. The conversion was unchecked (programming error-- the other conversions in the same module were checked; the assumption was made that the value would be within range); consequently the error condition was raised. The exception-handling routine was to record the status of the error and to then shut down the system. Maybe it's user >defined, or there's a language difference at work). >Definitely good "lessons learned" about: >1. The limits of exceptions (they are only as good as what you can do >when they are raised). >2. The problems with reusing items outside their original environment. >3. The need to check inputs and outputs aggressively. >4. The pitfalls of assuming that testing all of the components of a >system equates to testing the system, as well as the need to use >realistic test scenarios. >5. The problems with isolating the safety-critical components of a >system. 
>So, anyway, we now have another software package written in Ada that >caused the loss of a system, and again specification and design issues >outside Ada's control are the culprit. ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-25 0:00 ` ++ robin @ 1996-07-26 0:00 ` ++ robin 1996-07-26 0:00 ` Ken Garlington 1 sibling, 0 replies; 18+ messages in thread From: ++ robin @ 1996-07-26 0:00 UTC (permalink / raw) >Ken Garlington <garlingtonke@lmtas.lmco.com> writes: >Don't know what happened there, but I was just going to point out >that the Ariane 5 report is at: > http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html >Be sure to read the full report, which is linked to this page. It >goes into some length about the sequence of events (which includes >an Ada exception I never heard of before, Operand Error? ---That's fixed-point overflow. Converting a 64-bit floating-point value to a 16 bit signed integer. The conversion was unchecked (programming error-- other conversions in the same module were checked; the assumption was made that the value would be within range); consequently the error condition was raised. The exception-handling routine was to record the status of the error and to then shut down the system. >Maybe it's user >defined, or there's a language difference at work). ---A user-defined data conversion that went unchecked. Three programming mistakes were made here: 1. The size of the variable to hold the value (16 bits) was inadequate; and 2. It was assumed that the value would not be large enough to overflow ; therefore, it was not checked; and 3. The folly that a floating-point value of some 58 significant bits could be converted "safely" to 16 bits. The problem then went to the error-handler, which was designed to shut down the system. This was a major blunder. An error-handler for overflow should have been included, but should have returned control directly to the program (this only as an emergency resort). The code should have included a check for data out of range (or better, storage of adequate size.) 
This project might well have been written in PL/I, which has excellent real-time facilities, including error handling, error simulation and validation facilities. The language has robust compilers, and experts with many years of PL/I programming experience. As to PL/I facilities, I refer to the SIGNAL statement, with which given conditions (errors such as fixed-point overflow) can be signalled as if the condition (error) actually occurred. This alone would have showed up the deficiency of the overall design (that the system would shut itself down for fixed-point overflow). Further, an ON unit can return control simply and easily to some re-start point, or another convenient point in the program, or even pass control to the following statement. >With Definitely good "lessons learned" about: >1. The limits of exceptions (they are only as good as what you can do >when they are raised). ---There's a lot you can do with an exception. One of them isn't to shut down the computer. I've already itemized what can be done with an exception. But in this case, the proper course is to ensure that values are within range and to take appropriate action, rather than to let it get as far as the error handler, which should be a last resort for catching something overlooked (and hopefully, there's none of those). >2. The problems with reusing items outside their original environment. >3. The need to check inputs and outputs aggressively. >4. The pitfalls of assuming that testing all of the components of a >system equates to testing the system, as well as the need to use >realistic test scenarios. >5. The problems with isolating the safety-critical components of a >system. >So, anyway, we now have another software package written in Ada that >caused the loss of a system, and again specification and design issues >outside Ada's control are the culprit. ---No, this is a clear programming error. 
A PL/I programmer experienced with real time systems, would have CHALLENGED such a stupid requirement that the computer be shut down by the error-handler in the event of a fixed-point overflow. He would have had it changed. I'd go further to say that no experienced PL/I programmer would have shut down the system as a result of a fixed-point overflow. Furthermore, he would have included a check that the value did not go out of range; Skills in PL/I and real time systems would not have gone astray here. And probably skills in Ada too. ___________________________________________________________ Extract from full report: " * The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer. This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, although other conversions of comparable variables in the same place in the code were protected." ^ permalink raw reply [flat|nested] 18+ messages in thread
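The idea of signalling a condition "as if it actually occurred", in order to see how the overall design reacts, can be imitated in most languages by injecting a failing operation during a test. Below is a minimal Python sketch under stated assumptions. `Unit`, `step` and `inject_overflow` are hypothetical names, and the shut-down-on-overflow policy is modelled only because the thread describes it. Python has no SIGNAL statement, so raising the exception from a substitute function stands in for it.

```python
class Unit:
    """Hypothetical stand-in for a unit whose error policy is under test."""

    def __init__(self):
        self.running = True
        self.log = []

    def step(self, convert, value):
        """Run one conversion; on overflow, record it and shut down."""
        try:
            return convert(value)
        except OverflowError as exc:
            self.log.append(f"overflow: {exc}")
            self.running = False  # the policy being examined
            return None


def inject_overflow(_value):
    """Simulate the condition without needing a real out-of-range input."""
    raise OverflowError("simulated")


unit = Unit()
normal = unit.step(int, 12.0)
still_running = unit.running
injected = unit.step(inject_overflow, 12.0)
print(normal, still_running, injected, unit.running)  # 12 True None False
```

A test like this shows what the handler does. It does not show whether that response is acceptable at system level, which is a separate judgment.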
* Re: Ariane 5 Failure - Summary Report 1996-07-25 0:00 ` ++ robin 1996-07-26 0:00 ` ++ robin @ 1996-07-26 0:00 ` Ken Garlington 1996-07-30 0:00 ` Theodore E. Dennison 1 sibling, 1 reply; 18+ messages in thread From: Ken Garlington @ 1996-07-26 0:00 UTC (permalink / raw) ++ robin wrote: > > >Ken Garlington <garlingtonke@lmtas.lmco.com> writes: > > >Don't know what happened there, but I was just going to point out > >that the Ariane 5 report is at: > > > http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html > > >Be sure to read the full report, which is linked to this page. It > >goes into some length about the sequence of events (which includes > >an Ada exception I never heard of before, Operand Error? > > ---That's fixed-point overflow. Could you send me an Ada RM cite? I couldn't find it... > Converting a 64-bit > floating-point value to a 16 bit signed integer. > The conversion was unchecked (programming error-- I don't know if I would call this a programming error or a requirements error. Apparently, there was an analysis done to see if the check should be required, and the analysis said that it wasn't. Given the 80% utilization, I'm sure there was some not-so-subtle pressure to leave out any code that wasn't absolutely necessary. > 1. The size of the variable to hold the value (16 bits) was > inadequate; Not that this was mentioned in the report, but the communications link between the INS and the flight control computer uses a MIL-STD-1553B data bus, which is a 16-bit protocol. They could have used multiple words to contain this value, but it is common for 1553B applications to convert floating point numbers to scaled 16-bit values, so long as the precision is still acceptable. What apparently happened was that the scaling for the Ariane 4 was acceptable, but was not updated for the Ariane 5 based on an analysis that said the ranges should be maintained. > 2.
It was assumed that the value would not be large enough > to overflow ; therefore, it was not checked; and Yes - clearly the analysis done here was not adequately revisited for the new environment. > 3. The folly that a floating-point value of some > 58 significant bits could be converted "safely" to > 16 bits. This is actually quite routine for IRS to flight control interfaces. The IRS usually has to do high-precision calculations internally, but the flight control system does not need this precision. In and of itself, there's no problem dumping the extra bits of precision, so long as the range is preserved (which, in this case, it wasn't). > An error-handler for overflow should have been included, > but should have returned control directly to the program > (this only as an emergency resort). The code should have > included a check for data out of range (or better, storage > of adequate size.) I agree with the second part. However, it's not clear that returning to the program would have helped. This is one area in which I think the final report is too optimistic. It suggests that the correct response to the exception was to provide the "best data available." That might be possible, but in general it's a tricky business. I wonder more why the IRS message to the flight controls did not include (a) an indication as to IRS mode (alignment, in this case) and (b) an indication that the IRS had detected a data error. > This project might well have been written in PL/I, which > has excellent real-time facilities, including error > handling, error simulation and validation facilities. > The language has robust compilers, and experts with many > years of PL/I programming experience. > > As to PL/I facilities, I refer to the SIGNAL statement, > with which given conditions (errors such as fixed-point > overflow) can be signalled as if the condition (error) > actually occurred.
The language in which it was actually written (Ada) has equivalent facilities, so I'm not sure how PL/I would have helped here. Having programmed in both PL/I and Ada, I can't think of anything specific in this area. As noted in the final report, this was a system requirements and design error, not a programming error. > This alone would have showed up the deficiency of the > overall design (that the system would shut itself down for > fixed-point overflow). It might have shown that this was the result, but as pointed out in the report, the system designers knew this could happen, and discounted it as improbable. So, I'm not sure it would have made a difference. > Further, an ON unit can return control simply and easily > to some re-start point, or another convenient point in the > program, or even pass control to the following statement. Again, it's not clear to me that any of these options would have saved the vehicle. An IRS in alignment mode simply cannot generate valid data, period. Furthermore, since the designers felt this error couldn't really occur, it's unlikely they would have made the right choice as to what to do when the SIGNAL was raised. > >With Definitely good "lessons learned" about: > > >1. The limits of exceptions (they are only as good as what you can do > >when they are raised). > > ---There's a lot you can do with an exception. One of > them isn't to shut down the computer. I've already itemized > what can be done with an exception. But in this case, > the proper course is to ensure that values are within > range and to take appropriate action, rather than > to let it get as far as the error handler, which should > be a last resort for catching something overlooked (and > hopefully, there's none of those). Sure. Now, what was the appropriate action here? Switch from alignment to operational mode? The whole point of alignment mode is to make the values computed during operational mode accurate. If it doesn't finish, the results are suspect. 
Particularly for feedback systems, having lots of options at the language level does not equate to having an adequate solution to the existence of an error. That's a system design problem, not a language issue. > ---No, this is a clear programming error. A PL/I programmer > experienced with real time systems, would have CHALLENGED > such a stupid requirement that the computer be shut down by the > error-handler in the event of a fixed-point overflow. He would > have had it changed. To what? > I'd go further to say that no experienced PL/I programmer > would have shut down the system as a result of a fixed-point > overflow. What would he have done? > Furthermore, he would have included a check that the value > did not go out of range; > > Skills in PL/I and real time systems would not have gone > astray here. And probably skills in Ada too. I will agree that experience in real time systems (which these folks had, by the way) is definitely a useful thing here. However, these guys simply came up with the wrong answer in their analysis. It wasn't that they didn't realize there was a risk - the report clearly states that the issue was discussed and the resolution approved at several levels of management. They weren't ignorant, just wrong. There is a difference. > > ___________________________________________________________ > > Extract from full report: > > " * The internal SRI software exception was caused during execution of a > data conversion from 64-bit floating point to 16-bit signed integer > value. The floating point number which was converted had a value > greater than what could be represented by a 16-bit signed integer. > This resulted in an Operand Error. The data conversion instructions > (in Ada code) were not protected from causing an Operand Error, > although other conversions of comparable variables in the same place > in the code were protected." 
Tell me, if you read in the paper that a drunk driver was speeding, and killed someone, do you blame the auto manufacturer for providing a vehicle that could go fast enough to kill someone? Read the report. Its conclusions are definitely at odds with yours. -- LMTAS - "Our Brand Means Quality" ^ permalink raw reply [flat|nested] 18+ messages in thread
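The scaling point made in the message above (a floating-point quantity sent as a scaled 16-bit word, with precision traded away but range to be preserved) can be sketched as follows. This is a hedged Python illustration. The function names, the least-significant-bit weights and the sample values are all assumptions chosen for exact arithmetic. None of them come from the report or from MIL-STD-1553B itself.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def encode_scaled(value: float, lsb: float) -> int:
    """Encode value as a signed 16-bit count, where one count equals lsb."""
    counts = round(value / lsb)
    if not INT16_MIN <= counts <= INT16_MAX:
        raise OverflowError(f"{value} exceeds the range for lsb={lsb}")
    return counts


def decode_scaled(counts: int, lsb: float) -> float:
    """Recover the (quantised) value from a 16-bit count."""
    return counts * lsb


OLD_LSB = 2.0 ** -7  # hypothetical scaling: range is about +/-256
NEW_LSB = 2.0 ** -6  # coarser scaling: range is about +/-512

print(encode_scaled(150.0, OLD_LSB))  # 19200, fits the old envelope
print(encode_scaled(400.0, NEW_LSB))  # 25600, fits once the scaling is widened
try:
    encode_scaled(400.0, OLD_LSB)     # same word format, larger input
except OverflowError as exc:
    print("overflow:", exc)
```

With the old weight the word simply cannot represent the larger value. Halving the resolution doubles the representable range, which is the kind of trade an updated range analysis would have to make.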
* Re: Ariane 5 Failure - Summary Report 1996-07-26 0:00 ` Ken Garlington @ 1996-07-30 0:00 ` Theodore E. Dennison 0 siblings, 0 replies; 18+ messages in thread From: Theodore E. Dennison @ 1996-07-30 0:00 UTC (permalink / raw) Ken Garlington wrote: > > ++ robin wrote: > > > An error-handler for overflow should have been included, > > but should have returned control directly to the program > > (this only as an emergency resort). The code should have > > included a check for data out of range (or better, storage > > of adequate size.) > > I agree with the second part. However, it's not clear that returning > to the program would have helped. This is one area in which I think > the final report is too optimistic. It suggests that the correct > response to the exception was to provide the "best data available." > That might be possible, but in general it's a tricky business. > I'm glad to see that I wasn't the only one bothered by that suggestion. The thought of a missile cruising around randomly based on "best data available" from a faulty computer is frankly a little scary. Perhaps the commission is hoping that the next missile blows up EuroDisney. :-) -- T.E.D. | Work - mailto:dennison@escmail.orl.mmc.com | | Home - mailto:dennison@iag.net | | URL - http://www.iag.net/~dennison | ^ permalink raw reply [flat|nested] 18+ messages in thread
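One way to picture the concern about "best data available", together with the mode and data-error indications suggested earlier in the thread, is a message that carries its own status so the consumer can refuse suspect values. The Python sketch below is an assumption-laden illustration. The `Sample` layout, the "ALIGN"/"NAV" mode names and the clamping policy are hypothetical and are not taken from the report or from any real IRS interface.

```python
import math
from dataclasses import dataclass

INT16_MIN, INT16_MAX = -32768, 32767


@dataclass(frozen=True)
class Sample:
    counts: int       # 16-bit payload
    mode: str         # hypothetical mode names: "ALIGN" or "NAV"
    data_valid: bool  # False when the sender detected a data error


def make_sample(value: float, mode: str) -> Sample:
    """Build a sample, clamping and flagging values that do not fit."""
    if not math.isfinite(value):
        return Sample(0, mode, False)
    counts = math.trunc(value)
    if INT16_MIN <= counts <= INT16_MAX:
        return Sample(counts, mode, True)
    clamped = max(INT16_MIN, min(INT16_MAX, counts))
    return Sample(clamped, mode, False)


def usable(sample: Sample) -> bool:
    """Consumer-side rule: trust only valid data produced in NAV mode."""
    return sample.data_valid and sample.mode == "NAV"


print(usable(make_sample(100.9, "NAV")))    # True
print(usable(make_sample(1.0e6, "NAV")))    # False (clamped and flagged)
print(usable(make_sample(5.0, "ALIGN")))    # False (wrong mode)
```

The flag does not make the data any better. It only lets the receiving side decide what to do, which remains a system design question.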
* Re: Ariane 5 Failure - Summary Report 1996-07-24 0:00 ` Ariane 5 Failure - Summary Report Ken Garlington ` (3 preceding siblings ...) 1996-07-25 0:00 ` ++ robin @ 1996-07-25 0:00 ` Alan Brain 1996-07-29 0:00 ` Ken Garlington 1996-07-26 0:00 ` Con Bradley 1996-08-01 0:00 ` root 6 siblings, 1 reply; 18+ messages in thread From: Alan Brain @ 1996-07-25 0:00 UTC (permalink / raw) Ken Garlington <garlingtonke@lmtas.lmco.com> wrote: >So, anyway, we now have another software package written in Ada that >caused the loss of a system, and again specification and design issues >outside Ada's control are the culprit. Not just design and specification, the implementation as well. Firstly, the brain-dead attitude of "handle all exceptions by shutting down and going to the backup" on a complex piece of equipment without many, many redundancies is ... incredible. Only duplication? Glad I'm not riding it... So that's a Specification fault. Secondly, the notion that conversion from a 64-bit value to a 16 bit value will always be OK, and that any time it isn't means a total failure of the unit, is a bit hard to swallow. In a complex piece of software, incapable of strict mathematical verification, I'd expect this to happen sometimes, not because of any soft failure or random hardware failure, but because Software Has Bugs. That's no excuse for losing a payload! This is a design fault. Thirdly, assuming either of the above, not checking that an arithmetic operation of this kind before it's fully complete is just plain silly. And such a check is un morceau de gateaux. This is an implementation fault. Jeez, Ada provides safety belts, Anti-lock brakes, etc but if people don't buckle up, and don't even bother to use the brake pedal, what can you do? ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-25 0:00 ` Alan Brain @ 1996-07-29 0:00 ` Ken Garlington 1996-07-30 0:00 ` John McCabe 0 siblings, 1 reply; 18+ messages in thread From: Ken Garlington @ 1996-07-29 0:00 UTC (permalink / raw) Alan Brain wrote: > > Thirdly, assuming either of the above, not checking that an arithmetic operation of > this kind before it's fully complete is just plain silly. And such a check is un > morceau de gateaux. This is an implementation fault. It's a question of perception. If a system designer says, "Don't add this check," and I as an implementer don't add this check (possibly only after asking the designer, "Are you _sure_"?), is this a design or an implementation fault? It appears to me, from reading the report, that the lack of a check was an intentional _design_ decision, not just something that was required but inadvertently left out of the code. I consider this a design fault (if not a specification fault). In the final analysis, you could call all of this an implementation error (since the implementation is the only part of the process that was actually on the system), but to me that seems to be a poor way to understand the chain of events. > Jeez, Ada provides safety belts, Anti-lock brakes, etc but if people don't buckle > up, and don't even bother to use the brake pedal, what can you do? Certainly, if people don't buckle up, you don't blame the car implementer (manufacturer)! -- LMTAS - "Our Brand Means Quality" ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Ariane 5 Failure - Summary Report 1996-07-29 0:00 ` Ken Garlington @ 1996-07-30 0:00 ` John McCabe 0 siblings, 0 replies; 18+ messages in thread From: John McCabe @ 1996-07-30 0:00 UTC (permalink / raw) Ken Garlington <garlingtonke@lmtas.lmco.com> wrote: >Alan Brain wrote: >> >> Thirdly, assuming either of the above, not checking that an arithmetic operation of >> this kind before it's fully complete is just plain silly. And such a check is un >> morceau de gateaux. This is an implementation fault. >It's a question of perception. If a system designer says, "Don't add this check," and >I as an implementer don't add this check (possibly only after asking the designer, >"Are you _sure_"?), is this a design or an implementation fault? >It appears to me, from reading the report, that the lack of a check was an intentional >_design_ decision, not just something that was required but inadvertently left out of >the code. I consider this a design fault (if not a specification fault). I agree entirely. The rest of this article is my response to a similar comment in a similar thread, but I thought it may interest people who missed the other thread through a spelling error (Adriane crash). rav@goanna.cs.rmit.edu.au (++ robin) wrote: > john@assen.demon.co.uk (John McCabe) writes: > >JOINT ESA/CNES PRESS RELEASE N 33-96 - Paris, 23 July 1996 > >Ariane 501 - Presentation of Inquiry Board report > >------------------------------------------------------------------- > >Hope this is useful. So basically it _was_ a software fault >---Is this a euphemism for a programming error? because that's >what it was -- a programming error. Having read the report, I don't consider it to be a programming error, it was a design and management error. It sounds like whoever designed the system didn't pay enough attention to the requirements, and whoever was managing it didn't pay enough attention to its conformance to the requirements.
I think the fact that the overflow occurred was not due to a programming oversight, after all the analyses had been done and a decision to not check that variable had been made (*see additional note below), but seeing as that variable should not have been in use at that point, I don't think you can blame whoever wrote that code. > The error was in assuming that a value would not overflow. >The specific error was that a conversion of a double-precision >floating-point value (~58 significant bits) to 15 significant >bits caused fixed-point overflow. The conversion was not >checked for overflow. It should have been. This is, after all, >a real-time system. It's a fundamental check that a programmer >experienced in real-time systems should have carried out. > Control was then passed to the interrupt handler, which >shut down the system. > The question is, basically, why was Ada used for this work? ESA Ada preference/mandate(?). <..snip..> *Note: I hope this makes ESA look a bit closer at why they want to limit processor loading and how the margin should be reduced through the design and development phases. My own project has an ESA enforced limit of 70% which is quite ridiculous given the equipment we're using (GPS MA31750 10MHz MIL-STD-1750 processor). We cannot meet that but have requested a waiver on that - I believe that's much better than compromising the safety of the mission. ESA's loading margins are really supposed to take account of a requirement for future modifications to software once it has been delivered. There's no way this should have been enforced for Ariane 5. From the sound of the report, I think a pretty poor job has been done, not by the programmers who wrote the code and performed the analysis of what variables could safely be left unchecked, instead I think whoever performed the requirement analysis and all levels of management / reviewers above that have been completely negligent.
Best Regards
John McCabe <john@assen.demon.co.uk>
* Re: Ariane 5 Failure - Summary Report
From: Con Bradley @ 1996-07-26 0:00 UTC

I have read the report on the Ariane 5 failure and feel that somebody should congratulate ESA for their remarkable candour in making this report so widely available.

It is a pity that other organizations are not so willing to go public on their mistakes.

------------------------------------------------------------------
Con Bradley "A pint of plain is your only man"
SGS Thomson Microelectronics Limited
10 Priory Road, Clifton
Bristol, BS8 1TU
e-mail: ceb@bristol.st.com
------------------------------------------------------------------
* Re: Ariane 5 Failure - Summary Report
From: P. Cnudde VH14 (8218) @ 1996-07-26 0:00 UTC

Con Bradley wrote:
>
> I have read the report on the Ariane 5 failure and feel that
> somebody should congratulate ESA for their remarkable candour
> in making this report so widely available.

I agree fully with you on this point, but you should not forget that they are using our money (our = European taxpayers) to finance their projects, so they owe something to the public.

I read the report and found it a most interesting read. My personal conclusion is that no matter what effort you put into a system, it can always go wrong. (I don't say you should not put all possible effort into it; otherwise it will certainly go wrong.)

>
> It is a pity that other organizations are not so willing to
> go public on their mistakes.

It's even worse: people get fired when they try to make mistakes public.

> SGS Thomson Microelectronics Limited

A Microelectronics colleague in comp.lang.ada, interesting!

--
 ____________   Peter Cnudde
 \          /   Alcatel Telecom
  \ ALCATEL/    Switching Systems Division
   \ BELL /     Microelectronics Design Center
    \    /
     \  /       F. Wellesplein 1, B-2018 Antwerp
      \/        BELGIUM

e-mail : cnuddep@sh.bel.alcatel.be
Phone : +32 3 240 82 18
Fax : +32 3 240 99 47
* Re: Ariane 5 Failure - Summary Report
From: Peter Hermann @ 1996-07-26 0:00 UTC

Con Bradley (ceb@pact.srf.ac.uk) wrote:
: I have read the report on the Ariane 5 failure and feel that
: somebody should congratulate ESA for their remarkable candour
: in making this report so widely available.

agreed

: It is a pity that other organizations are not so willing to
: go public on their mistakes.

Politicians could learn a lot: because they obscure many facts, they are mistrusted.

--
Peter Hermann Tel:+49-711-685-3611 Fax:3758 ph@csv.ica.uni-stuttgart.de
Pfaffenwaldring 27, 70569 Stuttgart Uni Computeranwendungen
Team Ada: "C'mon people let the world begin" (Paul McCartney)
* Re: Ariane 5 Failure - Summary Report
From: root @ 1996-08-01 0:00 UTC

In article <838748001.3682.0@assen.demon.co.uk> john@assen.demon.co.uk (John McCabe) writes:

[SNIP]

   Having read the report, I don't consider it to be a programming error, it was a design and management error. It sounds like whoever designed the system didn't pay enough attention to the requirements, and whoever was managing it didn't pay enough attention to its conformance to the requirements.

[SNIP]

Agreed. Electronics Weekly (UK freebie) put it as a "Mindset Error" (can't remember exact phrase right now). I respect their judgement and I think it about sums up the whole thing rather neatly.

Chris Morgan
chris.morgan@baesema.co.uk