From: g1006@fs1.mar.lmco.com (Francis Lipski)
Subject: Re: Ariane 5 - not an exception?
Date: 1996/08/06
Message-ID: <4u7fdm$e6m@morgan.vf.lmco.com>
distribution: world
sender: g1006@fs1.mar.lmco.com (Francis Lepski)
references: <4t9vdg$jfb@goanna.cs.rmit.edu.au> <31FE35BC.1A0D@sanders.lockheed.com> <4totv7$o9f@goanna.cs.rmit.edu.au> <32065615.77C7@sanders.lockheed.com>
organization: Lockheed Martin Corp, Valley Forge PA
newsgroups: comp.software-eng,comp.lang.ada,comp.lang.pl1

In article <32065615.77C7@sanders.lockheed.com>, you write:
> ++ robin wrote:
> > Steve O'Neill writes:
> > >I disagree completely! The language was not the
> > >problem the design decisions in how the language
> > >was used were.
> >
> > ---The choice of language is indeed very relevant.
> > What I wrote in an earlier posting on this topic is highly
> > apt:
> >
> > "A PL/I programmer
> > experienced with real time systems, would have CHALLENGED
> > such a stupid requirement that the computer be shut down by the
> > error-handler in the event of a fixed-point overflow. He would
> > have had it changed.

Not always possible. If you are in the minority and cannot bring the
others around to your point of view, what do you do? As a previous
message in this thread asked, what should someone do? Say, "To hell
with the requirements, I'm going to code what I think is correct"?
You can argue your position for only so long.
If you haven't convinced others that your position is correct after a
reasonable time, then either you can't argue effectively or maybe your
position is wrong. My recommendation would be to document your position
and your attempts to persuade others. Then if something like the
Ariane 5 happens, you can say, "See, I told you so." Not that that's a
big consolation after the loss of a rocket, or if people's lives have
been lost.

> >
> > "I'd go further to say that no experienced PL/I programmer
> > would have shut down the system as a result of a fixed-point
> > overflow.
>
> Substitute Ada (or C or FORTRAN or Assembly) for PL/I here and you see my
> point. It's not the language that makes the developer challange the
> ridiculous requirement to shut down it is the developer "experienced with
> real-time systems". Just because I am programming in PL/I doesn't mean I
> am magically a better real-time developer. As a real-time designer
> concerned with the system-wide aspects of completely shutting down any
> sensor I would question this approach regardless of the language in use.
> This has nothing to do with the fact that much of my experience is with
> Ada.
>
> The (flawed) reasoning for why certain conversions were not protected was
> also covered in the report. Invalid assumptions were made and we know
> what assuming does don't we (makes an ASS out of U and ME). This was
> compounded by the requirement for 20% spare capacity. Spare capacity for
> what we don't know. Especially considering that the very software which
> failed didn't need to be and should not have been running at the time
> consuming some of that precious spare.
>
> Certainly you and I would not have shut down the system but what about
> the vast majority of developers without as much experience or who thought
> that their job was to implement the requirements that they were given?
>

The report states that the rationale was based on the "culture within
the Ariane programme of only addressing random hardware failures. From
this point of view exception - or error- handling mechanisms are
designed for a random hardware failure which can quite rationally be
handled by a backup system".

If all conversions and other possible overflow conditions are protected,
and then an overflow occurs, what action should be taken? From that
point of view, the system has just had a random hardware failure.
Continue to operate with known bad hardware? In the case of an overflow,
set to max value, continue and hope for the best? While the design in
this case clearly did not protect itself sufficiently, and compounded
its errors by not handling a simultaneous failure of both processors,
what action should be taken on an overflow if not to shut down? With
flight controls or inertial systems, partitioning into tasks and then
restarting the offending task is not an option. It would take entirely
too long to restart the task to be able to recover effectively.

Regarding the spare requirements: the reason to have spare time is to
ensure that all hard deadlines are met and to allow growth for future
versions of the SW. Allowing room for growth is necessary in development
programs; however, the requirement is usually never relaxed as more
functionality is added. That is another story. Still, it is necessary to
ensure that sufficient time is available to complete all the processing
within the allotted time. The execution time of the software is at best
a statistical problem; at the least, the hardware times can be
statistical. If each piece of SW is measured at its worst-case time and
these are all added together, one cannot allow the total to meet or
exceed the allowable time, given the statistical nature of the HW. So
how much spare time should be allotted? If 20% is unrealistic, what
number should be used: 10%, 1%, 0.001%?
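
To make the spare-time arithmetic concrete, here is a minimal sketch in
Python (which is not the language of the flight software). The task
times, the 10 ms frame, and the function names are all made up for
illustration; only the 20% figure comes from this discussion. It sums
the measured worst-case times and compares the leftover fraction of the
frame against the required spare.

```python
def spare_fraction(worst_case_times_ms, frame_ms):
    """Fraction of the frame left over after summing worst-case task times."""
    if frame_ms <= 0:
        raise ValueError("frame_ms must be positive")
    total_ms = sum(worst_case_times_ms)
    return (frame_ms - total_ms) / frame_ms


def meets_spare_requirement(worst_case_times_ms, frame_ms, required_spare=0.20):
    """True if the leftover fraction is at least the required spare."""
    return spare_fraction(worst_case_times_ms, frame_ms) >= required_spare


if __name__ == "__main__":
    # Hypothetical worst-case times for three tasks in a 10 ms frame.
    tasks_ms = [2.0, 3.5, 1.5]
    frame_ms = 10.0
    print("spare:", spare_fraction(tasks_ms, frame_ms))
    print("meets 20%:", meets_spare_requirement(tasks_ms, frame_ms))

    # Adding one more hypothetical 1.5 ms task (say, a few extra range
    # checks) eats into the margin and the 20% test no longer passes.
    tasks_ms.append(1.5)
    print("meets 20% after growth:", meets_spare_requirement(tasks_ms, frame_ms))
```

Lowering required_spare to 0.10 would let the larger task set pass, but
the sketch says nothing about which number is right. That depends on how
much the measured worst-case times can be trusted.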
> >
> > "Furthermore, he would have included a check that the value
> > did not go out of range;"
> >
> > ---But all it needed was a check that the value was in range.
> > Such checks had been included on other similar conversions in
> > the vicinity!
> >
>
> Yes, and there was mention in the report that 'they' thought that this
> would violate that precious spare requirement. So they set about picking
> and choosing which conversions to protect. I find it extremely hard to
> believe that the (small) handful of instructions to do a range check
> would have been too much! And, in hindsight, well worth it.
>
> The issue of the OBC interpreting the 'essentially diagnostic data' as
> valid sensor data really makes me wonder. In a system with a reasonable
> interface between the two devices this should *never* happen. I am
> surprised that this misinterpretation didn't cause a similar overflow in
> the OBC and resulting shutdown! :(

I was also amazed by the poor design of an interface that didn't detect
this problem. Probably, given enough time, some form of error would have
occurred, resulting in the OBC shutting down.

>
> I think that we agree in our assessment of the situation and the fact
> that these problems could have been avoided with a better overall system
> design and more extensive testing. Essentially the same conclusions that
> the review board came to. My only disagreement is with your _opinion_
> that the simple choice of a different language would have saved the day.
> And with this point I will continue to disagree.
>
> --
> Steve O'Neill                      | "No,no,no, don't tug on that!
> Sanders, A Lockheed Martin Company |  You never know what it might
> smoneill@sanders.lockheed.com      |  be attached to."
> (603) 885-8774  fax: (603) 885-4071|         Buckaroo Banzai

----------------------------------------------------------------------
Standard Disclaimer applies.
Frank Lipski
lipski@fs1.mar.lmco.com
770-494-8322

"The most exciting phrase to hear in science, the one that heralds new
discoveries, is not "Eureka!" ("I found it!") but rather
"hmm....that's funny..."
                -- Isaac Asimov
----------------------------------------------------------------------