From: simonb@pact.srf.ac.uk (Simon Bluck)
Subject: Ariane 5 - not an exception?
Date: 1996/07/25
Organization: University of Bristol, England
Newsgroups: comp.software-eng,comp.lang.ada

The Ariane 501 flight failure was due to the raising of an unexpected Ada
exception, which was handled by switching off the computer. The report on
this:

  http://www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html

is clear and hard-hitting, and it will result in much improved software.
But does it get right to the bottom of the issues? Does the software
community appreciate that there are fundamental software control problems
which can directly give rise to such enormous failures - in this particular
case thankfully without loss of life?

It is most unfortunate, but must be accepted as true, that if the Ariane
software had been written in a less powerful language, the numeric overflow
might have gone unnoticed, the computers would have remained switched on,
and the rocket would have continued its upward flight.

Exceptions and assertions are both used, in Ada and in C/C++, to detect
software and hardware anomalies. When one of these trips, it is frequently
very difficult for the designer to know how best to handle the problem. To
continue may result in corrupt data; to abort is drastic, but eliminates
the possibility that further processing will compound the problem.
The more checks you have, the more likely it is that one of them will trip.
If you can't think of good ways of handling these checks, the end result,
for the user, may well be very much worse than if the check had never been
performed in the first place.

Of the two handling options, neither is really acceptable. However, there
is a third option which ought to be considered: to continue, but mark the
processed data as suspect. I.e. each data item would have a truth value of
1.0 for good data, 0.0 for absolutely rotten data, utilising values in
between if you have some idea how good the data is. If you have numeric
overflow, you could set the data to the largest value available and mark it
as suspect. Any data further derived from suspect data must also be marked
as suspect.

Taking a probabilistic attitude to data would bring a lot of software into
the real world, where failures can happen at all levels. Using this
approach would make complex mission-critical software, like the failing
Ariane software, much easier to understand and control. Data would be
processed along the same path regardless of whether it is suspect or
entirely valid. Only the end-users of the data would be affected. Where
duplication of systems provides redundancy, the algorithm would be to
switch to the backup on receiving suspect data, and switch back to the main
source if the backup was suspect. If both sources are suspect, take the
least suspect source. This is simple, and you don't lose your vital input
data.

The data truth values would be passed on from system to system along with
the data. You _never_ switch off a computer, but you may have cause to mark
all data emanating from it as suspect. Leave it up to the users of the data
to decide whether they want to use it or not - they may have no choice.

Along with the data truth attribute, you need a data type attribute. This
is tending to become relatively standard stuff now that objects are around
and need to know what kind of object they are.
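To make the truth-value idea concrete, here is a minimal sketch in C++ (one
of the languages mentioned above). All the names - Tagged, to_int16,
select_source, the 0.5 threshold - are invented for illustration; this is
one possible shape for the scheme, not a definitive design. It shows
suspect-marking on overflow, propagation of the lowest truth value through
derived data, and the backup-switching rule just described:

```cpp
#include <algorithm>

// A data item tagged with a truth value: 1.0 for good data, 0.0 for
// absolutely rotten data, values in between for partly trusted data.
struct Tagged {
    double value;
    double truth;   // in [0.0, 1.0]
};

// Data derived from suspect data must also be marked suspect: the
// result can be no more trustworthy than its least trustworthy input.
Tagged add(Tagged a, Tagged b) {
    return { a.value + b.value, std::min(a.truth, b.truth) };
}

// Instead of raising an exception on overflow (and switching the
// computer off), saturate to the largest representable value and
// mark the result as suspect.
Tagged to_int16(Tagged a) {
    if (a.value > 32767.0)  return { 32767.0, 0.0 };
    if (a.value < -32768.0) return { -32768.0, 0.0 };
    return a;
}

// Redundancy: use the main source unless it is suspect, fall back to
// the backup, and if both are suspect take the least suspect.  The
// 0.5 threshold is an arbitrary choice for this sketch.
Tagged select_source(Tagged main_src, Tagged backup,
                     double threshold = 0.5) {
    if (main_src.truth >= threshold) return main_src;
    if (backup.truth  >= threshold) return backup;
    return (main_src.truth >= backup.truth) ? main_src : backup;
}
```

Note that nothing here ever aborts: every path returns a value, and the
truth field is the only place where trouble is recorded.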
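The data type attribute can be sketched in the same spirit. The tag values
and the Message layout below are again invented for illustration, but the
point is the one made above: a receiver that checks a type tag cannot
misread diagnostic data as attitude data.

```cpp
#include <cstdint>

// Each message carries a type tag alongside its payload, so the
// receiver never relies on implicit agreement with the sender about
// what is on the bus.  (Tags and layout invented for illustration.)
enum class MsgType : std::uint8_t { Attitude, Diagnostic };

struct Message {
    MsgType type;
    double  payload;
};

// The flight-control side checks the tag before using the payload,
// and refuses wrong-kind data rather than flying on it.
bool read_attitude(const Message& m, double& out) {
    if (m.type != MsgType::Attitude)
        return false;   // diagnostic data: refuse, don't misinterpret
    out = m.payload;
    return true;
}
```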
But adding a data type field is still something that designers skimp on if
it is not supplied by the language, relying instead on implicit coding of
type information in the senders and receivers of data. Lack of type
information accounts for why the Ariane flight control was able to
interpret diagnostic data as attitude data, virtually guaranteeing
catastrophic failure. At least if the attitude data had merely been cut
short, the rocket could have continued in a straight line.

Well, those are what I think are the important lessons to be learned. The
main reasons cited for Ariane 501's failure are typical human ones which
will be made again on the next big project. I.e. inadequate testing,
particularly of the complete system in its (simulated) environment -
surprise, surprise, this turns out to be too difficult and too costly to
achieve thoroughly. And small system mistakes which stress the adequate
functioning of the system as a whole (like thinking that the Ariane 4
alignment process didn't need changing for Ariane 5). These will happen
time and again; we're only human. But with more realistic data processing,
the system as a whole would stand a better chance of survival.

SimonB

[All my own opinions, of course.]