From: Simon Wright
Newsgroups: comp.lang.ada
Subject: Re: How would Ariane 5 have behaved if overflow checking were not turned off?
Date: Thu, 17 Mar 2011 20:58:59 +0000
Organization: A noiseless patient Spider
References: <4d80b13f$0$43832$c30e37c6@exi-reader.telstra.net> <4d8200ce$0$43837$c30e37c6@exi-reader.telstra.net>

"robin" writes:

> Simon Wright wrote in message ...
>>"robin" writes:
>>
>>> Anyone competent in real-time programming would never have let the
>>> software go with unhandled overflow, because such an event would
>>> result in failure of the mission.
>>
>>The engineers, being competent in tightly-constrained real-time
>>programming, found that installing exception handlers cost cpu cycles
>>they could not afford, so looked at all the potential overflow sites and
>>found that _this_ one could only occur if there was a hardware
>>failure.
>
> Nonsense. The Full Report says nothing of the kind.

Oh yes it does. Well, very very nearly. See
http://esamultimedia.esa.int/docs/esa-x-1819eng.pdf page 5 second
para. Especially note the last sentence.

>> Even if they had installed an exception handler, the only proper
>>response would have been to shutdown this node and hand over to the
>>alternate;
>
> No, the exception handler could have done something sensible, such as
> using the maximum integer value, thus allowing the trajectory to
> continue. Better still would have been to include a magnitude test in
> the code that avoided the need for an error handler.
>
>> and this was the action that would result from not having an
>>exception handler in the first place. So, after considerable thought,
>>they decided against having an exception handler.
>
> There were 7 places in the code in the vicinity where overflow could
> occur. Four were chosen for protection, but three were not. That was
> the fatal flaw.

I know that the last but one paragraph on that page (5) starts
"Although the failure was due to a systematic software design error..."
but .. where I come from there are system designers and software
designers. The system people work out the requirements and the software
people - after making sure that the requirements appear sensible and
questioning them if they don't - just get on and do what has been agreed
by people probably on a higher pay grade and certainly with the assigned
responsibility.

So I don't agree that it was a software design error. You may say that
it makes no difference; I say it affects who should get fired (or
sued). Of course, for Ariane 4 it wasn't even a system design error.

I remember a Kalman-filter-based target motion analysis for passive
sonar (which only gives you bearings, of course). At one point, there
was a value named Range_Squared. The programmer used a natural float
(ie, not allowed to go negative) and, when tests revealed to him that it
sometimes did go negative, he decided to limit the value to >= 0.0.
Unfortunately the underlying quantity was actually complex at this
point, and the result of this well-intentioned change was that the
algorithm could become very very unstable. The mathematician responsible
was not pleased.

Reverting to the Report, the last paragraph on page 6 says "This means
that critical software - in the sense that failure of the software puts
the mission at risk - must be identified at a very detailed level, that
exceptional behaviour must be confined, and that a reasonable back-up
policy must take software failures into account."

It seems obvious to me that you cannot take software failures into
account by having two identical systems. You might get away with it for
some tight race conditions, but for processing input data I just don't
see it. You really need diversity.
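
To make the two responses argued over above concrete (raise on overflow
and let something else deal with it, versus "use the maximum integer
value"), here is a minimal sketch in Python. It is not the Ariane code,
which was Ada; the function names and the signed 16-bit target range are
assumptions chosen purely for illustration.

```python
import math

# Assumed target range for illustration: a signed 16-bit integer.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(x: float) -> int:
    """Truncate x to an integer, raising OverflowError if it does not fit."""
    if math.isnan(x):
        raise ValueError("cannot convert NaN to an integer")
    n = int(x)  # int() itself raises OverflowError for +/- infinity
    if n < INT16_MIN or n > INT16_MAX:
        raise OverflowError(f"{x!r} does not fit in a signed 16-bit integer")
    return n


def to_int16_saturating(x: float) -> int:
    """Truncate x to an integer, clamping out-of-range values to the limits."""
    if math.isnan(x):
        raise ValueError("cannot convert NaN to an integer")
    # Clamp while still a float so that infinities are handled too.
    clamped = max(float(INT16_MIN), min(float(INT16_MAX), x))
    return int(clamped)
```

The saturating version keeps the computation going, but as with
Range_Squared it also hides the fact that the input is no longer
meaningful. Whether that is "sensible" is a system-level decision, not
something the conversion routine can know.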