From: rav@goanna.cs.rmit.edu.au (++ robin)
Subject: Re: Ariane 5 - not an exception?
Date: 1996/07/26
Message-ID: <4t9vdg$jfb@goanna.cs.rmit.edu.au>
organization: Comp Sci, RMIT, Melbourne, Australia
newsgroups: comp.software-eng,comp.lang.ada,comp.lang.pl1

simonb@pact.srf.ac.uk (Simon Bluck) writes:

>The Ariane 501 flight failure was due to the raising of an unexpected
>Ada exception,

---An exception, yes, but not unexpected. The programming mistake made was
in assuming that a floating-point value of some 58 significant bits would
somehow "fit" into a 15-bit integer. There was no check that the data
conversion would not result in overflow, so the problem went to the error
handler, which shut down the system.

>which was handled by switching off the computer.  The
>report on this:
>  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html
>is clear and hard-hitting: it will result in much improved software.
>But does it get right to the bottom of the issues, and does the
>software community appreciate that there are fundamental software
>control problems which can directly give rise to such enormous
>failures, in this particular case thankfully without loss of life?
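
For reference, the missing check on the conversion discussed at the top of this reply can be sketched in a few lines. This is only an illustration in Python, not the Ariane code (which was Ada); the function names and the choice of a 16-bit signed range are assumptions made for the example. It shows two policies: refuse the conversion, or clamp to the nearest representable value.

```python
# Hypothetical sketch: not the Ariane code, which was written in Ada.
# The names and the 16-bit signed range are assumptions for illustration.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it cannot fit."""
    if value != value:  # NaN never compares equal to itself
        raise OverflowError("NaN has no integer representation")
    if not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"{value!r} is outside the 16-bit signed range")
    return int(value)  # truncates toward zero


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    if value != value:
        raise ValueError("NaN has no integer representation")
    return int(max(INT16_MIN, min(INT16_MAX, value)))
```

Which policy is appropriate depends on what the consumer of the value can tolerate; the checked form only helps if the caller has a sensible recovery path.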
>It is most unfortunate, but must be accepted as true, that if the
>Ariane software had been written in a less powerful language the
>numeric overflow might have gone unnoticed, the computers would have
>remained switched on, and the rocket would have continued its upward
>flight.

>Exceptions and assertions are both used, in Ada and C/C++,

---and PL/I

>to detect
>software/hardware anomalies.  When one of these trips, it is
>frequently very difficult for the designer to know how best to handle
>the problem.

---Not in the case of a simple fixed-point overflow -- as was the case
with Ariane. It is a fact that real-time programming has been available
in PL/I for some 30 years, and recovery from errors is standard
established practice.

> To continue may result in corrupt data;

---To continue in this case probably would need the value to be set to
the maximum. And it wouldn't be corrupt data.

>to abort is
>drastic but eliminates the possibility that further processing will
>compound the problem.

---What? Here, the lack of further processing resulted in destruction of
the project!

>The more checks you have, the more likely it is that one of them will
>trip.  If you can't think of good ways of handling these checks, the
>end result, for the user, may well be very much worse than if the
>check had never been performed in the first place.

>Of the two handling options, neither is really acceptable.  However,
>there is a third option which ought to be considered: to continue but
>mark the processed data as suspect.

---There are other better approaches. One is to continue with the maximum
value; another is to avoid the use of a 16-bit variable, and to use a
variable of the same size and type (here floating-point storage), thus
avoiding the problem altogether.

>I.e. each data item would have a truth value of 1.0 for good data,
>0.0 for absolutely rotten data, utilising values in between if you
>have some idea how good the data is.
>If you have numeric overflow,
>you could set the data to the largest value available, and mark it as
>suspect.

>Any data further derived from suspect data must also be marked as
>suspect.

>Taking a probabilistic attitude to data would bring a lot of software
>into the real world where failures can happen at all levels.  Using
>this approach would make complex mission-critical software like the
>failing Ariane software much easier to understand and control.  Data
>would be processed along the same path regardless of whether it is
>suspect or entirely valid.  Only the end-users of the data would be
>affected, and where duplication of systems provides redundancy, the
>algorithm would be to switch to the backup on receiving suspect data,
>and switch back to the main source if the backup was suspect.

---In Ariane, both the active processor and the backup failed at the same
time, because it was a *programming* error that was encountered at the
same time in both processors, and both processors were shut down at the
same time by their respective error handlers.

> If both sources are suspect, then take the least suspect source.  This
>is simple and you don't lose your vital input data.  The data truth
>values would be passed on from system to system along with the data.

>You _never_ switch off a computer, but you may have cause to mark all
>data emanating from it as suspect.  Leave it up to the users of the
>data to decide if they want to use it or not - they may have no
>choice.

---Indeed.

>Along with the data truth attribute, you need a data type attribute.
>This is tending to be relatively standard stuff now that objects are
>around and need to know what kind of object they are.  But adding a
>data type field is still something that designers skimp on if not
>supplied by the language, relying instead on implicit coding of type
>information in the senders and receivers of data.
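
The truth-value scheme quoted above (saturate on overflow and mark as suspect, propagate suspicion to derived data, take the least suspect source) might look roughly like the following. This is a hedged sketch in Python, not anything from the Ariane software: the `Tagged` type, the function names, the default suspect level of 0.5, and the rule that derived data takes the minimum truth value of its inputs are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# All names and constants here are hypothetical, chosen for illustration.
INT16_MIN = -32768
INT16_MAX = 32767


@dataclass(frozen=True)
class Tagged:
    """A value paired with a truth value in [0.0, 1.0] (1.0 = good data)."""
    value: float
    truth: float = 1.0


def clamp_int16(x: Tagged, suspect_truth: float = 0.5) -> Tagged:
    """Clamp to the 16-bit signed range; mark the result suspect if clamped."""
    clamped = float(max(INT16_MIN, min(INT16_MAX, x.value)))
    if clamped != x.value:
        return Tagged(clamped, min(x.truth, suspect_truth))
    return Tagged(clamped, x.truth)


def add(a: Tagged, b: Tagged) -> Tagged:
    """Derived data is assumed no more trustworthy than its worst input."""
    return Tagged(a.value + b.value, min(a.truth, b.truth))


def select(main: Tagged, backup: Tagged) -> Tagged:
    """Prefer the main source unless the backup is strictly less suspect."""
    return backup if backup.truth > main.truth else main
```

For example, `select(clamp_int16(Tagged(1e9)), Tagged(100.0))` returns the backup reading, since the clamped main reading carries a truth value of 0.5. Note that when both sources hit the same overflow, as in the reply above, both readings are equally suspect and the selection gains nothing.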
>Lack of type information accounts for why the Ariane flight control
>was able to interpret diagnostic data as attitude data, virtually
>guaranteeing catastrophic failure.  At least if attitude data had
>been cut short it could have continued in a straight line.

---This is more of a lack of communication between the two programs.
Another design error.

>Well, those are what I think are the important lessons to be learned.

---I think the real lessons are that

1. real-time programming requires special expertise.

2. the choice of language is suspect. A better-established language such
   as PL/I -- specifically designed for real-time programming -- with
   robust compilers, and with its base of experienced programming staff
   could well have prevented this disaster.