From: rav@goanna.cs.rmit.edu.au (++ robin)
Subject: Re: Ariane 5 Failure - Summary Report
Date: 1996/07/26
Message-ID: <4t9tcp$gvo@goanna.cs.rmit.edu.au>
expires: 1 October 1996 00:00:00 GMT
references: <31F60E8A.2D74@lmtas.lmco.com> <31F629B8.5FFB@lmtas.lmco.com> <4t6opg$4cp@goanna.cs.rmit.edu.au>
organization: Comp Sci, RMIT, Melbourne, Australia
newsgroups: comp.lang.ada,comp.lang.pl1,rmit.cs.100
nntp-posting-user: rav

>Ken Garlington writes:
>
>Don't know what happened there, but I was just going to point out
>that the Ariane 5 report is at:
>
> http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html
>
>Be sure to read the full report, which is linked to this page. It
>goes into some length about the sequence of events (which includes
>an Ada exception I never heard of before, Operand Error?

---That's fixed-point overflow: converting a 64-bit floating-point
value to a 16-bit signed integer. The conversion was unchecked (a
programming error; other conversions in the same module were
checked, and the assumption was made that the value would be within
range); consequently the error condition was raised. The
exception-handling routine was to record the status of the error
and then to shut down the system.

>Maybe it's user
>defined, or there's a language difference at work).

---A user-defined data conversion that went unchecked. Three
programming mistakes were made here:

1. The size of the variable to hold the value (16 bits) was
   inadequate; and
2. It was assumed that the value would not be large enough to
   overflow; therefore, it was not checked; and
3. It was folly to think that a floating-point value of some 58
   significant bits could be converted "safely" to 16 bits.

The problem then went to the error-handler, which was designed to
shut down the system. This was a major blunder. An error-handler
for overflow should have been included, but should have returned
control directly to the program (this only as an emergency resort).
The code should have included a check for data out of range (or
better, storage of adequate size).

This project might well have been written in PL/I, which has
excellent real-time facilities, including error handling, error
simulation and validation facilities. The language has robust
compilers, and there are experts with many years of PL/I
programming experience.

As to PL/I facilities, I refer to the SIGNAL statement, with which
given conditions (errors such as fixed-point overflow) can be
signalled as if the condition (error) had actually occurred. This
alone would have shown up the deficiency of the overall design
(that the system would shut itself down for fixed-point overflow).

Further, an ON unit can return control simply and easily to some
re-start point, or another convenient point in the program, or even
pass control to the following statement.

>With Definitely good "lessons learned" about:

>1. The limits of exceptions (they are only as good as what you can do
>when they are raised).

---There's a lot you can do with an exception. Shutting down the
computer isn't one of them. I've already itemized what can be done
with an exception. But in this case, the proper course is to ensure
that values are within range and to take appropriate action, rather
than to let it get as far as the error handler, which should be a
last resort for catching something overlooked (and hopefully, there
are none of those).

>2. The problems with reusing items outside their original environment.

>3. The need to check inputs and outputs aggressively.

>4. The pitfalls of assuming that testing all of the components of a
>system equates to testing the system, as well as the need to use
>realistic test scenarios.

>5. The problems with isolating the safety-critical components of a
>system.

>So, anyway, we now have another software package written in Ada that
>caused the loss of a system, and again specification and design issues
>outside Ada's control are the culprit.

---No, this is a clear programming error. A PL/I programmer
experienced with real-time systems would have CHALLENGED such a
stupid requirement that the computer be shut down by the
error-handler in the event of a fixed-point overflow. He would have
had it changed.

I'd go further and say that no experienced PL/I programmer would
have shut down the system as a result of a fixed-point overflow.
Furthermore, he would have included a check that the value did not
go out of range.

Skills in PL/I and real-time systems would not have gone astray
here. And probably skills in Ada too.
___________________________________________________________

Extract from full report:

" * The internal SRI software exception was caused during execution
of a data conversion from 64-bit floating point to 16-bit signed
integer value. The floating point number which was converted had a
value greater than what could be represented by a 16-bit signed
integer. This resulted in an Operand Error. The data conversion
instructions (in Ada code) were not protected from causing an
Operand Error, although other conversions of comparable variables
in the same place in the code were protected."
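
For what it's worth, the kind of range check argued for above can be sketched in a few lines. This is Python, purely for illustration: the flight code was Ada, and nothing here is taken from it. The function names and the choice to clamp (and to map NaN to 0) are assumptions made up for this example; what the right fallback value would be is a design decision for the system in question.

```python
# Illustrative sketch only; names and fallback policy are hypothetical.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert a float to a 16-bit signed integer, raising if it cannot fit."""
    if value != value:  # NaN is the only value not equal to itself
        raise OverflowError("NaN cannot be converted to a 16-bit integer")
    if not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value!r} does not fit in 16 bits")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """Convert, but hand control back with a usable value on overflow."""
    try:
        return to_int16_checked(value)
    except OverflowError:
        # The handler resumes the computation instead of halting it.
        if value != value:
            return 0  # arbitrary choice for NaN (an assumption)
        return INT16_MAX if value > 0 else INT16_MIN
```

The first function is the explicit check; the second shows a handler that returns control to the caller rather than shutting anything down.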