From: rav@goanna.cs.rmit.edu.au (++ robin)
Subject: Re: Ariane 5 - not an exception?
Date: 1996/08/13
Message-ID: <4up6jg$h3e@goanna.cs.rmit.edu.au>
references: <4t9vdg$jfb@goanna.cs.rmit.edu.au> <31FE35BC.1A0D@sanders.lockheed.com> <4totv7$o9f@goanna.cs.rmit.edu.au> <32065615.77C7@sanders.lockheed.com>
organization: Comp Sci, RMIT, Melbourne, Australia
newsgroups: comp.software-eng,comp.lang.ada,comp.lang.pl1

g1006@fs1.mar.lmco.com (Francis Lipski) writes:

>In article <32065615.77C7@sanders.lockheed.com>, you write:
>> ++ robin wrote:
>> > Steve O'Neill writes:
>> > >I disagree completely! The language was not the
>> > >problem the design decisions in how the language
>> > >was used were.
>> >
>> > ---The choice of language is indeed very relevant.
>> > What I wrote in an earlier posting on this topic is highly
>> > apt:
>> >
>> > "A PL/I programmer
>> > experienced with real time systems, would have CHALLENGED
>> > such a stupid requirement that the computer be shut down by the
>> > error-handler in the event of a fixed-point overflow. He would
>> > have had it changed.

> Not always possible. If you are in the minority and are unsuccessful
>to argue others to your point, what do you do?

---Don't be absurd. The checks WERE included in all but 3 of the type conversions in the vicinity of the conversion that blew up.
> As a previous message in this thread had stated, what
>should someone do? Say to hell with the requirements,
>I'm going to code what I think is correct.

---The requirements were that any kind of interrupt was going to be handled by the interrupt handler (which would then shut down the computer). A *real* real-time PL/I programmer would have included a test to make certain that the interrupt could not occur. That was NOT going against the specifications. But, as I wrote in a previous post, a belt-and-braces approach should have been taken, viz, to include an error handler for fixed-point overflow, as an interrupt was to be taken as SUDDEN DEATH for the project. This is where a PL/I programmer would have had the specification changed.

>> > "I'd go further to say that no experienced PL/I programmer
>> > would have shut down the system as a result of a fixed-point
>> > overflow.

>> Substitute Ada (or C or FORTRAN or Assembly) for
>> PL/I here and you see my point.

---Neither C nor Fortran has error-handling. Ada *was* used, and look what happened. Hence the suggestion that PL/I expertise on the project would have been an advantage. You see, real-time programming in PL/I has been part of the scene since 1966!

>> It's not the language that makes the developer challenge the
>> ridiculous requirement to shut down it is the developer "experienced with
>> real-time systems". Just because I am programming in PL/I doesn't mean I
>> am magically a better real-time developer. As a real-time designer
>> concerned with the system-wide aspects of completely shutting down any
>> sensor I would question this approach regardless of the language in use.
>> This has nothing to do with the fact that much of my experience is with
>> Ada.
>>
>> The (flawed) reasoning for why certain conversions were not protected was
>> also covered in the report. Invalid assumptions were made

---Yes; it was assumed that the value would not overflow but it did!
They have forgotten Murphy's Law: "If anything can go wrong, it will". And Robert's Law: "Even if it *can't* go wrong, it will".

>> Certainly you and I would not have shut down the system but what about
>> the vast majority of developers without as much experience or who thought
>> that their job was to implement the requirements that they were given?

---They could have implemented the "requirements" WITHOUT raising a fixed-point interrupt, just by checking for overflow!

> The report states that the rationale was based on the "culture within the
>Ariane programme of only addressing random hardware failures. From this
>point of view exception - or error- handling mechanisms are designed for a random
>hardware failure which can quite rationally be handled by a backup system"

> If all conversions and other possible overflow
>conditions are protected,
>and then an overflow occurs, what action should be taken?

---Action should be taken to deal with a fixed-point overflow! Something was overlooked. It needed to be dealt with. That it was not is a fundamental error! That's why error-handling is provided! To provide a margin of safety.

> The system has
>just had a random hardware failure. Continue to operate with known bad
>hardware? In the case of an overflow, set to max value, continue and
>hope for the best?

---Good idea, already suggested in the report. But the report also suggested that the design needed to take into account programmer error.

> While clearly the design, in this case, did not protect itself sufficiently,
>and compounded errors by not handling the case of a simultaneous failure of
>both processors, what action should be taken on an overflow if not to shut
>down. With flight controls or inertial systems, partitioning into tasks and
>then restarting the offending task is not an option. It would take entirely
>too long to restart the task to be able to effectively recover.

> Regarding the spare requirements.
>The answer as to why to have spare time
>is to ensure that all hard deadlines are met and to allow growth for future
>versions of SW. Allowing room for growth is necessary in development programs
>however, the requirement is usually never relaxed as more functionality is
>added. That is another story. However, it is necessary to ensure sufficient
>time is available to complete all the processing within the allotted time.
>The execution time of the software is at best a statistical problem, at least
>the hardware times can be statistical. If the SW is always measured as a worst
>case time, and all these are added together one cannot allow this time
>to meet or exceed the allowable time, given the statistical nature of the HW.
>So how much spare time should be allotted? If 20% is unrealistic, what
>number should be used, 10%, 1%, 0.001%?

>> >
>> > "Furthermore, he would have included a check that the value
>> > did not go out of range;"
>> >
>> > ---But all it needed was a check that the value was in range.
>> > Such checks had been included on other similar conversions in
>> > the vicinity!

>> Yes, and there was mention in the report that 'they' thought that this
>> would violate that precious spare requirement.

---That's a red herring.

>> So they set about picking
>> and choosing which conversions to protect.

---This doesn't appear specifically in the report as regards this conversion and the 2 others in the vicinity. There's the implication that these conversions were overlooked. In any case, the test would have introduced a trivial number of additional instructions.

>> I find it extremely hard to
>> believe that the (small) handful of instructions to do a range check
>> would have been too much!

---Agreed.

>> And, in hindsight, well worth it.

---Agreed again.

>> The issue of the OBC interpreting the 'essentially diagnostic data' as
>> valid sensor data really makes me wonder.
>> In a system with a reasonable
>> interface between the two devices this should *never* happen. I am
>> surprised that this misinterpretation didn't cause a similar overflow in
>> the OBC and resulting shutdown! :(

---Yes.

> I was also amazed by the poor design of the interface that didn't detect
>this problem. Probably given enough time, some form of error would
>have occurred resulting in the OBC shutting down.

---There were a number of inadequacies revealed in the design.

>> I think that we agree in our assessment of the situation and the fact
>> that these problems could have been avoided with a better overall system
>> design and more extensive testing. Essentially the same conclusions that
>> the review board came to. My only disagreement is with your _opinion_
>> that the simple choice of a different language would have saved the day.

---As I stated, a PL/I programmer experienced in real-time programming, would not have made this stupid mistake.

>> And with this point I will continue to disagree.

---You do not appear to have grounds for this opinion.

>> Steve O'Neill                      | "No,no,no, don't tug on that!
>> Sanders, A Lockheed Martin Company | You never know what it might
>> smoneill@sanders.lockheed.com      | be attached to."
>> (603) 885-8774  fax: (603) 885-4071| Buckaroo Banzai
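
The range check and the "set to max value, continue" fallback argued over in this thread can be sketched roughly as follows. This is only an illustration in Python, not the Ariane flight code (which was written in Ada). The 16-bit signed target range, the function name to_int16_saturating, and the choice to return 0 for a NaN input are all assumptions made for the example.

```python
import math
from typing import Tuple

# Assumed target range: a 16-bit signed integer (an assumption for this sketch).
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Convert a float to a 16-bit signed integer without ever overflowing.

    Returns (result, in_range). When the input does not fit, the result is
    clamped to the nearest limit and in_range is False, so the caller can
    log or report the condition instead of the processor being shut down.
    """
    if math.isnan(value):
        # No meaningful value to clamp to; 0 is an arbitrary choice here.
        return 0, False
    if value > INT16_MAX:
        return INT16_MAX, False
    if value < INT16_MIN:
        return INT16_MIN, False
    # Safe: the value is known to fit, so truncation cannot overflow.
    return int(value), True


if __name__ == "__main__":
    for sample in (123.7, 1.0e9, -1.0e9):
        print(sample, to_int16_saturating(sample))
```

The check costs two comparisons per conversion, which is the "handful of instructions" referred to above. Whether clamping is an acceptable response is a system-level decision; the flag is returned so that the decision is made by the caller rather than hidden in the conversion.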