From mboxrd@z Thu Jan 1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level:
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 101deb,e8bee2f0c75c8235
X-Google-Attributes: gid101deb,public
X-Google-Thread: 12b42c,cefd160f01546cf0
X-Google-Attributes: gid12b42c,public
X-Google-Thread: 103376,82c2596e4584d057
X-Google-Attributes: gid103376,public
From: Ken Garlington
Subject: Re: Ariane 5 Failure - Summary Report
Date: 1996/07/26
Message-ID: <31F8DCDF.190F@lmtas.lmco.com>
X-Deja-AN: 170637199
references: <31F60E8A.2D74@lmtas.lmco.com> <31F629B8.5FFB@lmtas.lmco.com> <4t6opg$4cp@goanna.cs.rmit.edu.au>
content-type: text/plain; charset=us-ascii
organization: Lockheed Martin Tactical Aircraft Systems
mime-version: 1.0
newsgroups: comp.lang.ada,comp.lang.pl1,rmit.cs.100
x-mailer: Mozilla 2.02 (Macintosh; I; 68K)
Date: 1996-07-26T00:00:00+00:00
List-Id:

++ robin wrote:
>
> >Ken Garlington writes:
>
> >Don't know what happened there, but I was just going to point out
> >that the Ariane 5 report is at:
>
> >   http://www.esrin.esa.it/htdocs/tidc/Press/Press96/press33.html
>
> >Be sure to read the full report, which is linked to this page. It
> >goes into some length about the sequence of events (which includes
> >an Ada exception I never heard of before, Operand Error?
>
> ---That's fixed-point overflow.

Could you send me an Ada RM cite? I couldn't find it...

> Converting a 64-bit
> floating-point value to a 16 bit signed integer.
> The conversion was unchecked (programming error--

I don't know if I would call this a programming error or a
requirements error. Apparently, there was an analysis done to see if
the check should be required, and the analysis said that it wasn't.
Given the 80% utilization, I'm sure there was some not-so-subtle
pressure to leave out any code that wasn't absolutely necessary.

> 1.
> The size of the variable to hold the value (16 bits) was
> inadequate;

Not that this was mentioned in the report, but the communications link
between the INS and the flight control computer uses a MIL-STD-1553B
data bus, which is a 16-bit protocol. They could have used multiple
words to contain this value, but it is common for 1553B applications
to convert floating point numbers to scaled 16-bit values, so long as
the precision is still acceptable. What apparently happened was that
the scaling for the Ariane 4 was acceptable, but was not updated for
the Ariane 5 based on an analysis that said the ranges should be
maintained.

> 2. It was assumed that the value would not be large enough
> to overflow ; therefore, it was not checked; and

Yes - clearly the analysis done here was not adequately revisited for
the new environment.

> 3. The folly that a floating-point value of some
> 58 significant bits could be converted "safely" to
> 16 bits.

This is actually quite routine for IRS to flight control interfaces.
The IRS usually has to do high-precision calculations internally, but
the flight control system does not need this precision. In and of
itself, there's no problem dumping the extra bits of precision, so
long as the range is preserved (which, in this case, it wasn't).

> An error-handler for overflow should have been included,
> but should have returned control directly to the program
> (this only as an emergency resort). The code should have
> included a check for data out of range (or better, storage
> of adequate size.)

I agree with the second part. However, it's not clear that returning
to the program would have helped. This is one area in which I think
the final report is too optimistic. It suggests that the correct
response to the exception was to provide the "best data available."
That might be possible, but in general it's a tricky business.
I wonder more why the IRS message to the flight controls did not
include (a) an indication as to IRS mode (alignment, in this case) and
(b) an indication that the IRS had detected a data error.

> This project might well have been written in PL/I, which
> has excellent real-time facilities, including error
> handling, error simulation and validation facilities.
> The language has robust compilers, and experts with many
> years of PL/I programming experience.
>
> As to PL/I facilities, I refer to the SIGNAL statement,
> with which given conditions (errors such as fixed-point
> overflow) can be signalled as if the condition (error)
> actually occurred.

The language in which it was actually written (Ada) has equivalent
facilities, so I'm not sure how PL/I would have helped here. Having
programmed in both PL/I and Ada, I can't think of anything specific in
this area. As noted in the final report, this was a system
requirements and design error, not a programming error.

> This alone would have showed up the deficiency of the
> overall design (that the system would shut itself down for
> fixed-point overflow).

It might have shown that this was the result, but as pointed out in
the report, the system designers knew this could happen, and
discounted it as improbable. So, I'm not sure it would have made a
difference.

> Further, an ON unit can return control simply and easily
> to some re-start point, or another convenient point in the
> program, or even pass control to the following statement.

Again, it's not clear to me that any of these options would have saved
the vehicle. An IRS in alignment mode simply cannot generate valid
data, period. Furthermore, since the designers felt this error
couldn't really occur, it's unlikely they would have made the right
choice as to what to do when the SIGNAL was raised.

> >With Definitely good "lessons learned" about:
>
> >1. The limits of exceptions (they are only as good as what you can do
> >when they are raised).
>
> ---There's a lot you can do with an exception. One of
> them isn't to shut down the computer. I've already itemized
> what can be done with an exception. But in this case,
> the proper course is to ensure that values are within
> range and to take appropriate action, rather than
> to let it get as far as the error handler, which should
> be a last resort for catching something overlooked (and
> hopefully, there's none of those).

Sure. Now, what was the appropriate action here? Switch from alignment
to operational mode? The whole point of alignment mode is to make the
values computed during operational mode accurate. If it doesn't
finish, the results are suspect.

Particularly for feedback systems, having lots of options at the
language level does not equate to having an adequate solution to the
existence of an error. That's a system design problem, not a language
issue.

> ---No, this is a clear programming error. A PL/I programmer
> experienced with real time systems, would have CHALLENGED
> such a stupid requirement that the computer be shut down by the
> error-handler in the event of a fixed-point overflow. He would
> have had it changed.

To what?

> I'd go further to say that no experienced PL/I programmer
> would have shut down the system as a result of a fixed-point
> overflow.

What would he have done?

> Furthermore, he would have included a check that the value
> did not go out of range;
>
> Skills in PL/I and real time systems would not have gone
> astray here.

And probably skills in Ada too. I will agree that experience in real
time systems (which these folks had, by the way) is definitely a
useful thing here. However, these guys simply came up with the wrong
answer in their analysis. It wasn't that they didn't realize there was
a risk - the report clearly states that the issue was discussed and
the resolution approved at several levels of management. They weren't
ignorant, just wrong. There is a difference.
>
> ___________________________________________________________
>
> Extract from full report:
>
> " * The internal SRI software exception was caused during execution of a
> data conversion from 64-bit floating point to 16-bit signed integer
> value. The floating point number which was converted had a value
> greater than what could be represented by a 16-bit signed integer.
> This resulted in an Operand Error. The data conversion instructions
> (in Ada code) were not protected from causing an Operand Error,
> although other conversions of comparable variables in the same place
> in the code were protected."

Tell me, if you read in the paper that a drunk driver was speeding,
and killed someone, do you blame the auto manufacturer for providing a
vehicle that could go fast enough to kill someone?

Read the report. Its conclusions are definitely at odds with yours.

-- LMTAS - "Our Brand Means Quality"
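
For readers who want to see the kind of conversion the thread is arguing
about, here is a minimal sketch in Python. It is not taken from the report
or from any flight software. The names `BusWord` and `to_scaled_int16`, the
scale factor `lsb`, and the choice to clamp and flag an out-of-range value
are all assumptions made for illustration. It shows a floating-point value
being scaled into a signed 16-bit word with an explicit range check and a
validity flag, rather than an unprotected conversion.

```python
from typing import NamedTuple

INT16_MIN = -32768
INT16_MAX = 32767


class BusWord(NamedTuple):
    """One 16-bit data word plus a validity flag (hypothetical layout)."""
    raw: int      # scaled value, always within the signed 16-bit range
    valid: bool   # False when the source value was NaN or out of range


def to_scaled_int16(value: float, lsb: float) -> BusWord:
    """Scale a float into a signed 16-bit integer with a range check.

    `lsb` is the engineering value of one count (an assumed scale factor).
    Out-of-range input is clamped and flagged instead of raising.
    """
    if lsb <= 0.0:
        raise ValueError("lsb must be positive")
    if value != value:  # NaN is the only value not equal to itself
        return BusWord(0, False)
    scaled = value / lsb
    # Check the range before rounding, so anything beyond the limits,
    # including infinities, is reported as invalid.
    if scaled > INT16_MAX:
        return BusWord(INT16_MAX, False)
    if scaled < INT16_MIN:
        return BusWord(INT16_MIN, False)
    return BusWord(int(round(scaled)), True)


# In range: precision is dropped, the range is preserved.
print(to_scaled_int16(100.0, 0.5))
# Out of range for this scaling: clamped and marked invalid.
print(to_scaled_int16(20000.0, 0.5))
```

The check only detects the problem. Whether a clamped, flagged value is any
use to the receiving system is the system design question raised in the
post, and the sketch does not answer it.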