From mboxrd@z Thu Jan  1 00:00:00 1970
From: Simon Wright <simon@pushface.org>
Newsgroups: comp.lang.ada
Subject: Re: How would Ariane 5 have behaved if overflow checking were not
 turned off?
Date: Thu, 17 Mar 2011 20:58:59 +0000
Organization: A noiseless patient Spider
Message-ID: <m2lj0dh3ak.fsf@pushface.org>
References: 
 <a8387564-0835-467d-a461-60a093c38133@k15g2000prk.googlegroups.com>
	<4d80b13f$0$43832$c30e37c6@exi-reader.telstra.net>
	<m239mnhzmb.fsf@pushface.org>
	<4d8200ce$0$43837$c30e37c6@exi-reader.telstra.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
List-Id: <comp.lang.ada>

"robin" <robin51@dodo.com.au> writes:

> Simon Wright wrote in message ...
>>"robin" <robin51@dodo.com.au> writes:
>>
>>> Anyone competent in real-time programming would never have let the
>>> software go with unhandled overflow, because such an event would
>>> result in failure of the mission.
>>
>>The engineers, being competent in tightly-constrained real-time
>>programming, found that installing exception handlers cost cpu cycles
>>they could not afford, so looked at all the potential overflow sites and
>>found that _this_ one could only occur if there was a hardware
>>failure.
>
> Nonsense.  The Full Report says nothing of the kind.

Oh yes it does. Well, very very nearly. See
http://esamultimedia.esa.int/docs/esa-x-1819eng.pdf page 5, second
paragraph. Note especially the last sentence.

>>Even if they had installed an exception handler, the only proper
>>response would have been to shut down this node and hand over to the
>>alternate;
>
> No, the exception handler could have done something sensible, such as
> using the maximum integer value, thus allowing the trajectory to
> continue.  Better still would have been to include a magnitude test in
> the code that avoided the need for an error handler.
>
>> and this was the action that would result from not having an
>>exception handler in the first place. So, after considerable thought,
>>they decided against having an exception handler.
>
> There were 7 places in the code in the vicinity where overflow could
> occur.  Four were chosen for protection, but three were not.  That was
> the fatal flaw.
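
To make the options being argued over concrete, here is a rough sketch
of a checked conversion, a saturating one, and a magnitude test. It is
Python purely for illustration (the flight code was Ada), the function
names are made up, and it assumes the conversion in question is from a
floating-point value to a 16-bit signed integer, which is how the
Report describes it. It shows the mechanisms only; whether a pinned
value would have been safe to act on is exactly what is in dispute.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # range of a 16-bit signed integer


def fits_int16(x: float) -> bool:
    """Magnitude test: would x, once rounded, fit in 16 bits?"""
    return INT16_MIN <= round(x) <= INT16_MAX


def to_int16_checked(x: float) -> int:
    """Checked conversion: raise rather than return a wrong value."""
    if not fits_int16(x):
        raise OverflowError(f"{x!r} does not fit in a 16-bit signed integer")
    return round(x)


def to_int16_saturating(x: float) -> int:
    """Saturating conversion: pin out-of-range values to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, round(x)))
```

With these definitions, to_int16_saturating(40000.0) gives 32767, while
to_int16_checked(40000.0) raises OverflowError and leaves the decision
to whoever catches it (or to nobody).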

I know that the last but one paragraph on that page (5) starts "Although
the failure was due to a systematic software design error..." but ...
where I come from there are system designers and software
designers. The system people work out the requirements and the software
people - after making sure that the requirements appear sensible and
questioning them if they don't - just get on and do what has been agreed
by people probably on a higher pay grade and certainly with the assigned
responsibility. So I don't agree that it was a software design
error. You may say that it makes no difference; I say it affects who
should get fired (or sued). Of course, for Ariane 4 it wasn't even a
system design error.

I remember a Kalman-filter-based target motion analysis for passive
sonar (which only gives you bearings, of course). At one point, there
was a value named Range_Squared. The programmer used a natural float
(i.e., not allowed to go negative) and, when tests revealed to him that
it sometimes did go negative, he decided to limit the value to >= 0.0.

Unfortunately the underlying quantity was actually complex at this
point, and the result of this well-intentioned change was that the
algorithm could become very very unstable. The mathematician responsible
was not pleased.

Reverting to the Report, the last paragraph on page 6 says "This means
that critical software - in the sense that failure of the software puts
the mission at risk - must be identified at a very detailed level, that
exceptional behaviour must be confined, and that a reasonable back-up
policy must take software failures into account."

It seems obvious to me that you cannot take software failures into
account by having two identical systems. You might get away with it for
some tight race conditions, but for processing input data I just don't
see it. You really need diversity.
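
A toy sketch of that last point, in Python for illustration only. The
channel functions and first_survivor are hypothetical names, and the
"diverse" channel here is just a wider checked conversion standing in
for an independently designed unit; real diversity means different
designs, teams or hardware, not merely a bigger integer.

```python
from typing import Callable, List, Optional, Tuple


def narrow_channel(x: float) -> int:
    """Checked conversion to a 16-bit signed integer."""
    n = round(x)
    if not -32768 <= n <= 32767:
        raise OverflowError("value does not fit in 16 bits")
    return n


def wide_channel(x: float) -> int:
    """Stand-in for a differently designed unit: checked 32-bit conversion."""
    n = round(x)
    if not -2**31 <= n <= 2**31 - 1:
        raise OverflowError("value does not fit in 32 bits")
    return n


Channel = Tuple[str, Callable[[float], int]]


def first_survivor(channels: List[Channel], sample: float) -> Optional[Tuple[str, int]]:
    """Feed the same sample to each channel in turn; return the first result.

    A channel that raises is treated as shut down, and the next one takes over.
    Returns None if every channel fails.
    """
    for name, convert in channels:
        try:
            return name, convert(sample)
        except OverflowError:
            continue
    return None
```

Two identical channels given the same out-of-range sample both shut
down, so first_survivor([("primary", narrow_channel), ("backup",
narrow_channel)], 40000.0) returns None. Swap the backup for
wide_channel and the same call returns ("backup", 40000).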