From: "Samuel T. Harris"
Subject: Re: Ariane 5 failure
Date: 1996/10/18
Message-ID: <3267D929.167E@gsde.hso.link.com>
References: <96100111162774@psavax.pwfl.com> <32555A39.E38@lmtas.lmco.com> <326506D2.1E40@lmtas.lmco.com>
Organization: Hughes Training Inc. - Houston Operations
Newsgroups: comp.lang.ada

Keith Thompson wrote:
>
> In <326506D2.1E40@lmtas.lmco.com> Ken Garlington writes:
> [...]
> > Not necessarily. Keep in mind that an exception _was_ raised -- a
> > predefined exception (Operand_Error according to the report).
>
> This is one thing that's confused me about this report. There is no
> predefined exception in Ada called Operand_Error. Either the overflow
> raised Constraint_Error (or Numeric_Error if they were using an Ada
> 83 compiler that doesn't follow AI-00387), or a user-defined exception
> called Operand_Error was raised explicitly.
>

Remember, the report does NOT state that an unchecked_conversion was used (as some on this thread have assumed). It only states a "data conversion from 64-bit floating point to 16-bit signed integer value". As someone (I forget who) pointed out early in the thread weeks ago, a standard practice is to scale down the range of a float value to fit into an integer variable. This may not have been an unchecked_conversion at all, but some mathematical expression.

Whenever software is reused, it must be reverified AND revalidated.
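To make the conversion point concrete, here is a minimal sketch of scaling a 64-bit float into a 16-bit signed integer with an explicit range check. It is written in Python rather than Ada, and it is not the SRI code, which the report does not reproduce. The function name, the scale factor, the truncating conversion, and the OperandError class (a stand-in for the Operand_Error named in the report) are all my own assumptions for illustration.

```python
INT16_MIN = -32768
INT16_MAX = 32767


class OperandError(Exception):
    """Hypothetical stand-in for the Operand_Error named in the report."""


def scale_to_int16(value: float, scale: float = 1.0) -> int:
    """Scale a float, then convert it to a 16-bit signed integer.

    Raises OperandError when the scaled value cannot be represented.
    """
    scaled = value * scale
    # NaN and infinities fail this range test too, so they are rejected.
    if not (INT16_MIN <= scaled <= INT16_MAX):
        raise OperandError(f"{scaled!r} does not fit in 16 bits")
    # Assumption: truncate toward zero; a real system might round instead.
    return int(scaled)


if __name__ == "__main__":
    print(scale_to_int16(100.7))            # fits
    try:
        scale_to_int16(40000.0)             # outside the 16-bit range
    except OperandError as exc:
        print("conversion failed:", exc)
```

Whether an out-of-range value raises, saturates, or is never checked at all is a design decision made per variable, which is why the analysis behind that decision matters when the operating environment changes.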
The report cites several reasons for not reverifying the reuse of the SRI from the Ariane 4. Any one of them may be justifiable. However, a cardinal rule of risk management is that any risk to which NO measures are applied remains a risk. Here they justified their way into applying no measures at all toward ensuring the stuff would work.

The report also states that the code which contained the conversion was part of a feature which was now obsolete for the Ariane 5. It was left in "presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4." While this does make good sense, it is by no means a verification or a validation. It just seems to mitigate your risk, but it really does no such thing. You can't let such thinking lull you into a false sense of security.

The analysis which led to protecting four variables from Operand_Error and leaving three unprotected was not revisited with the new environment in mind. How could it be, since the Ariane 5 trajectory data was not included as a functional requirement? Hence this measure does not apply to the risk of the Ariane 5, though some involved in the decision may have relied upon it for just that protection.

Then they went as far as not revalidating the SRI in an Ariane 5 environment, which was the real hurt. While the report states the Ariane 5 flight data was not included as a functional requirement, someone should have asked for it if they needed it. Its omission means any verification testing which was done would not have taken it into account. So it would have been verified (which is testing against what the user said he wanted). However, validation testers (who test what the user actually wants and are supposed to be smart enough NOT to take the specification at face value) should have insisted on such data, included or not. That's the silly part about the whole affair: validation testing also was not performed.
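The verification/validation distinction above can be sketched in a few lines of Python. Everything here is hypothetical: the function name is mine, and both data sets are made-up numbers chosen only to show the shape of the argument. None of it comes from the report or from real Ariane 4 or Ariane 5 data.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def overflows(samples):
    """Return the samples that would not fit a 16-bit signed integer."""
    return [x for x in samples if not (INT16_MIN <= x <= INT16_MAX)]


# Made-up numbers for illustration only; NOT taken from the report.
specified_profile = [1200.0, 8500.0, 21000.0]     # what the spec asked for
operational_profile = [1500.0, 19000.0, 64000.0]  # what the user really flies

# Verification: test against what the user said he wanted.
verification_failures = overflows(specified_profile)

# Validation: test against what the user actually wants.
validation_failures = overflows(operational_profile)

print("verification failures:", verification_failures)
print("validation failures:", validation_failures)
```

With these invented values, the verification run reports nothing and the validation run flags the one out-of-range sample. The code under test is identical in both runs; only the question of whose data gets fed to it differs.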
The report then goes on to discuss why the SRIs were not included in a closed-loop test. So even if the Ariane 5 trajectory data had been included as a functional requirement, it would not have helped. While the technical reasons cited are appropriate for a verification test, the report correctly points out that the goals of validation testing are not so stringently dependent on the fidelity of the test environment. Those reasons just don't justify leaving the SRIs out of at least one validation test using Ariane 5 trajectory data, especially when other measures have NOT been taken to ensure a compatible reuse of software.

In fact, section 2.1 states "The SRI internal events that led to the failure have been reproduced by simulation calculations." I wonder if they compiled and ran the Ada code on another platform (which is a viable way of doing a lot of testing for embedded software prior to embedding the software). The report does not state whether such testing was performed by the developer. If the developer had done such testing with the Ariane 5 trajectory data, it would have spotted the flaw. But for that to happen, someone would have had to ask explicitly for such data.

The end of section 2.3 summarizes the fact that the reviews did not pick up that, of all the potential measures which could have been applied to determine a compatible reuse of software in the Ariane 5 operational environment, NONE were actually performed. That left the reviewers blissfully ignorant of an unmitigated risk staring them in the face.

Of the SRI, I conclude ...

  No design error (though it could have done something better).
  No programming error (given the design).
  An arguable specification error (but without appropriate testing).
  A lapse in validation testing (assuming other, nonexistent measures).
  A grave risk management and oversight problem.

Bottom line, a management (both customer and contractor) problem.

The OBC and main computer are another matter entirely.
I've not seen anyone on this thread address entries 3.1.f and g, concerning the SRI sending diagnostic data (item f) which was interpreted as flight data by the launcher's main computer (item g). Section 2.1 states that the backup failed first and declared a failure, and that the OBC could not switch to it because it had already ceased to function. It seems the OBC knew about the failures, so why did the main computer still interpret any data from a failed component as flight data? That seems like a design or programming problem.

It is blind luck that the diagnostic data caused the main computer to try to correct the trajectory via extreme positions of the thruster nozzles, which caused the rocket to turn sideways to the air flow, which caused buckling in the superstructure, which caused the self-destruct to engage. Given the design philosophy of the designers, had the main computer known both SRIs had failed, it should have signaled a self-destruct right then and there. What would have happened if the "diagnostic" data had caused minor course corrections and brought the rocket over a populated area before the subsequent course of events (or the ground flight controllers themselves) signaled a self-destruct?

The report does not delve into this aspect of the problem, which I consider to be even more important. This tends to tell me the SRI simulators in the closed-loop testing which was performed were not used to check malfunctions, or if they were, then the test scenarios were incomplete or flawed.

How many other interface/protocol/integration problems are waiting to crop up? Which reused Ariane 4 software component will fail next? Stay tuned for these and other provocative questions on "As the Ariane Burns" ;)

I wonder how the payload insurance companies will respond with their pricing for the next couple of launches.

--
Samuel T. Harris, Senior Engineer
Hughes Training, Inc. - Houston Operations
2224 Bay Area Blvd. Houston, TX 77058-2099
"If you can make it, We can fake it!"