From mboxrd@z Thu Jan 1 00:00:00 1970
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,f55a4f84e352c8ec
X-Google-Attributes: gid103376,public
From: "Samuel T. Harris"
Subject: Re: Ariane (yet again...)
Date: 2000/01/19
Message-ID: <38864C08.E9C68954@Raytheon.com>#1/1
X-Deja-AN: 574915474
Content-Transfer-Encoding: 7bit
References: <3882120e_3@news.jps.net>
X-Accept-Language: en
Content-Type: text/plain; charset=us-ascii
Organization: Raytheon Aerospace Engineering Services
Mime-Version: 1.0
Newsgroups: comp.lang.ada
List-Id:

Mike Silva wrote:
>
> Before anybody starts throwing anything, my question is very specific --
> does anybody know exactly what the Ariane report means when it speaks of
> "protecting" conversions? The subject came up in alt.folklore.computers and
> it seems there are at least three possible meanings: (a) turn off the
> runtime checks for a given conversion, (b) put some code before the
> conversion to explicitly check for in-range, or (c) have a local exception
> handler to catch the error. Anybody know what exactly was / was not done
> (or even better, have an actual code fragment)? I've just always been
> curious...
>
> Mike

You should read the report itself. It is an excellent read on the
nature of cascading failures and how good technical effort can be
fouled up by poor management practices. I'll try to summarize below ...

The report did not specify the nature of the conversion code.
However, given the nature of the problem, it might have been a scaled
conversion of an integer-type sensor reading to a floating- or
fixed-point type for the code to use. This is pretty normal in this
problem domain. This probably was not a simple unchecked_conversion.
The sizes required for the two mentioned types do not match.

The report specified that the conversion resulted in a value that was
out of range. It did not specify how the code determined this. Namely,
did a normal Ada runtime check on the resulting value see that it was
outside the range of the type, or did the code use some explicit
range check? We don't know from the report.

The report did specify that an exception was raised. The report also
specified that no exception handler was provided, because exhaustive
analysis by the Ariane 4 teams had proved that such an exception would
never occur. Several exception handlers were omitted on the basis of
similar analysis, to save processing time and memory. Such analysis is
common in this field and very reliable given the known constraints of
the Ariane 4 trajectory and acceleration profile.

This kind of analysis is vital in supporting the assumption that bad
data is the result of a hardware failure. That assumption was applied
on the Ariane 5. The bad data was interpreted as a hardware failure,
so the component went into diagnostic mode. In fact, the backup
component actually failed before the main component. Had an exception
handler been present, I'm not sure what it could have done except
indicate a hardware failure, which is the supported assumption for the
component in its intended environment (the Ariane 4).

In this diagnostic mode, the system sends diagnostic information to
the central processor. The central processor misinterpreted this as
real attitude and altitude information and commanded the thrusters to
maximum deflection to correct the "course" of the rocket. This caused
the rocket to turn sideways, introducing catastrophic stresses on the
fuselage as the air flow moved from the nose to the side of the
rocket. Sensors detected the impending failure of the superstructure
and the rocket commanded a self-destruct to ensure that lots of little
bits of debris fell downrange instead of two or three very large
sections.
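
To make Mike's three readings of "protecting" a conversion concrete,
here is a rough model of each one. It is only a sketch and not the
flight code, which the report does not show. It assumes the two types
are a 64-bit float and a 16-bit signed integer, it uses Python as a
stand-in for Ada, and every name in it (ConstraintError,
checked_to_int16, and so on) is made up for illustration. The
wraparound in (a) and the saturation in (b) are just plausible
outcomes I picked, not what the Ariane software did.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class ConstraintError(Exception):
    """Hypothetical stand-in for Ada's Constraint_Error."""


def checked_to_int16(x: float) -> int:
    """Conversion with the runtime range check left on and no handler."""
    n = round(x)  # Ada's tie-breaking rule is not modelled exactly here
    if not INT16_MIN <= n <= INT16_MAX:
        raise ConstraintError(f"{x} does not fit in 16 bits")
    return n


def unchecked_to_int16(x: float) -> int:
    """(a) Check turned off: out-of-range input silently gives a wrong value.

    Wraparound is assumed here as one possible outcome.
    """
    n = round(x) & 0xFFFF
    return n - 0x10000 if n >= 0x8000 else n


def guarded_to_int16(x: float) -> int:
    """(b) Explicit in-range test before converting; saturates (assumed policy)."""
    clamped = min(max(x, INT16_MIN), INT16_MAX)
    return checked_to_int16(clamped)


def handled_to_int16(x: float, fallback: int) -> int:
    """(c) Local exception handler; returns a caller-chosen fallback value."""
    try:
        return checked_to_int16(x)
    except ConstraintError:
        return fallback


if __name__ == "__main__":
    for value in (1234.4, 40000.0):
        print(value,
              unchecked_to_int16(value),
              guarded_to_int16(value),
              handled_to_int16(value, fallback=0))
```

Note that none of the three variants knows what a sensible value is
once the input is out of range. That decision has to come from
analysis of the expected flight profile, which is the point of the
rest of this post.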

If there is a real "bug", it is the central processor's
misinterpretation of this diagnostic information. It should have known
it was not real attitude and altitude information.

The sad part is that the code in question is used on the Ariane 4 to
enable a quick reset should a launch be aborted. This code is useless
on the Ariane 5.

The real problem was that this code was not being used on an Ariane 4,
but was being reused on an Ariane 5 without any verification
whatsoever. The Ariane 5 has a significantly different acceleration
and trajectory profile. These differences made all the work proving
that the Ariane 4 would never raise the exception inapplicable, but
similar work was not done to verify this code on the Ariane 5. The
contractor was not given the expected acceleration and trajectory
profile of the Ariane 5, nor was the contractor required to test
against them. The report also noted that no simulations were run, and
speculated that a single simulation of the involved components, either
individually or in an integrated environment, would have quickly
identified the problem.

A big case of overreliance on code reuse and under-employment of
basic verification methods. Just because something worked in the past
does not mean it will work in the future, especially when the
enclosing environment changes.

The design teams considered each and every verification method and
decided, for each of them, that it was not worth doing. The problem is
that they never stepped back to see that, taken together, these
decisions meant they did nothing at all to verify the Ariane 4 code on
the Ariane 5. While each decision had some merit when considered
individually, all together they present an insane position to take.

A management problem after all.

--
Samuel T. Harris, Principal Engineer
Raytheon, Aerospace Engineering Services
"If you can make it, We can fake it!"