From mboxrd@z Thu Jan  1 00:00:00 1970
X-Spam-Checker-Version: SpamAssassin 3.4.4 (2020-01-24) on polar.synack.me
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=ham
	autolearn_force=no version=3.4.4
X-Google-Language: ENGLISH,ASCII-7-bit
X-Google-Thread: 103376,5ac12f5a60b1bfe
X-Google-Attributes: gid103376,public
X-Google-Thread: 101deb,f96f757d5586710a
X-Google-Attributes: gid101deb,public
X-Google-Thread: f43e6,5ac12f5a60b1bfe
X-Google-Attributes: gidf43e6,public
From: rav@goanna.cs.rmit.edu.au (++           robin)
Subject: Re: Ariane 5 - not an exception?
Date: 1996/08/13
Message-ID: <4up6jg$h3e@goanna.cs.rmit.edu.au>
X-Deja-AN: 173853773
expires: 1 November 1996 00:00:00 GMT
references: <Dv45EJ.8r@fsa.bris.ac.uk> <4t9vdg$jfb@goanna.cs.rmit.edu.au>
 <31FE35BC.1A0D@sanders.lockheed.com> <4totv7$o9f@goanna.cs.rmit.edu.au>
 <32065615.77C7@sanders.lockheed.com>
organization: Comp Sci, RMIT, Melbourne, Australia
newsgroups: comp.software-eng,comp.lang.ada,comp.lang.pl1
nntp-posting-user: rav
Date: 1996-08-13T00:00:00+00:00
List-Id: <comp.lang.ada>


	g1006@fs1.mar.lmco.com (Francis Lipski) writes:

	>In article <32065615.77C7@sanders.lockheed.com>, you write:
	>> ++ robin wrote:
	>> >         Steve O'Neill <smoneill@sanders.lockheed.com> writes:
	>> >         >I disagree completely!  The language was not the
	>> >         >problem the design decisions in how the language
	>> >         >was used were.
	>> > 
	>> > ---The choice of language is indeed very relevant.
	>> > What I wrote in an earlier posting on this topic is highly
	>> > apt:
	>> > 
	>> > "A PL/I programmer
	>> > experienced with real time systems, would have CHALLENGED
	>> > such a stupid requirement that the computer be shut down by the
	>> > error-handler in the event of a fixed-point overflow.  He would
	>> > have had it changed.

	>   Not always possible.  If you are in the minority and are unsuccessful
	>to argue others to your point, what do you do?  

---Don't be absurd.  The checks WERE included in all but 3
of the type conversions in the vicinity of the conversion
that blew up.

	>  As a previous message in this thread had stated, what
	>should someone do?  Say to hell with the requirements,
	>I'm going to code what I think is correct.

---The requirements were that any kind of interrupt was
going to be handled by the interrupt handler (which would
then shut down the computer).

   A *real* real-time PL/I programmer would have included
a test to make certain that the interrupt could not occur.
That was NOT going against the specifications.

   But, as I wrote in a previous post, a belt-and-braces
approach should have been taken, viz, to include an
error handler for fixed-point overflow, as an interrupt
was to be taken as SUDDEN DEATH for the project.

   This is where a PL/I programmer would have had the
specification changed.

	>> > "I'd go further to say that no experienced PL/I programmer
	>> > would have shut down the system as a result of a fixed-point
	>> > overflow.

	>> Substitute Ada (or C or FORTRAN or Assembly) for
	>> PL/I here and you see my point.

---Neither C nor Fortran has error-handling.
Ada *was* used, and look what happened.
Hence the suggestion that PL/I expertise on the
project would have been an advantage.  You see,
real-time programming in PL/I has been part of the scene
since 1966!

	>> It's not the language that makes the developer challenge the 
	>> ridiculous requirement to shut down; it is the developer "experienced with 
	>> real-time systems".  Just because I am programming in PL/I doesn't mean I 
	>> am magically a better real-time developer.  As a real-time designer 
	>> concerned with the system-wide aspects of completely shutting down any 
	>> sensor I would question this approach regardless of the language in use. 
	>> This has nothing to do with the fact that much of my experience is with 
	>> Ada.

	>> The (flawed) reasoning for why certain conversions were not protected was 
	>> also covered in the report.  Invalid assumptions were made

---Yes; it was assumed that the value would not overflow,
but it did!  They have forgotten Murphy's Law:
"If anything can go wrong, it will".  And Robert's
Law: "Even if it *can't* go wrong, it will".

	>> Certainly you and I would not have shut down the system but what about 
	>> the vast majority of developers without as much experience or who thought 
	>> that their job was to implement the requirements that they were given?

---They could have implemented the "requirements"
WITHOUT raising a fixed-point interrupt,
just by checking for overflow!
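
   A minimal sketch of such a check follows.  It is in Python purely
for illustration (the flight code was Ada).  The 16-bit target width
and the names to_int16_checked, INT16_MIN and INT16_MAX are my own
assumptions, not anything taken from the report.  The point is only
that an out-of-range value is reported to the caller and never
reaches the conversion itself:

```python
from typing import Tuple

# Hypothetical limits: a 16-bit signed target is assumed for illustration only.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_checked(value: float) -> Tuple[bool, int]:
    """Convert value to a 16-bit integer only if it is in range.

    Returns (True, converted) on success, or (False, 0) when the value
    is NaN or outside [INT16_MIN, INT16_MAX]; nothing is ever raised.
    """
    if value != value:  # NaN compares unequal to itself
        return False, 0
    if value < INT16_MIN or value > INT16_MAX:
        return False, 0
    return True, int(value)  # int() truncates toward zero


ok, result = to_int16_checked(40000.0)
if not ok:
    # The caller decides what to do; no interrupt has been raised.
    result = 0
```

   The test costs two comparisons per conversion, which is the
"handful of instructions" under discussion.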

	>  The report states that the rationale was based on the "culture within the
	>Ariane programme of only addressing random hardware failures.  From this
	>point of view exception- or error-handling mechanisms are designed for a random
	>hardware failure which can quite rationally be handled by a backup system"

	>  If all conversions and other possible overflow
	>conditions are protected,
	>and then an overflow occurs, what action should be taken?

---Action should be taken to deal with a fixed-point overflow!
Something was overlooked.  It needed to be dealt with.  That
it was not is a fundamental error!  That's why error-handling
is provided!  To provide a margin of safety.

	> The system has
	>just had a random hardware failure.  Continue to operate with known bad 
	>hardware?  In the case of an overflow, set to max value, continue and
	>hope for the best?  

---Good idea, already suggested in the report.  But the
report also suggested that the design needed to
take into account programmer error.
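
   The "set to max value and continue" idea can be sketched as a
saturating conversion.  Again this is Python for illustration only.
The 16-bit width, the name to_int16_saturating, and the choice of 0
for a NaN input are all my own assumptions.  The second returned
value tells the caller that clamping took place, so the event can
still be logged or acted upon:

```python
import math
from typing import Tuple

# Hypothetical limits: a 16-bit signed target is assumed for illustration only.
INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Clamp value into the 16-bit range, then convert it.

    Returns (converted, saturated); saturated is True when the input
    had to be altered to fit.  NaN maps to 0 (an arbitrary choice).
    """
    if math.isnan(value):
        return 0, True
    clamped = max(INT16_MIN, min(INT16_MAX, value))
    return int(clamped), clamped != value


reading, saturated = to_int16_saturating(40000.0)
if saturated:
    # Carry on with the limiting value, but record that it happened.
    print("value clamped to", reading)
```

   Whether carrying on with a clamped value is acceptable depends on
what the value is used for; the sketch shows only the mechanism.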

	>  While clearly the design, in this case, did not protect itself sufficiently,
	>and compounded errors by not handling the case of a simultaneous failure of
	>both processors, what action should be taken on an overflow if not to shut
	>down?  With flight controls or inertial systems, partitioning into tasks and
	>then restarting the offending task is not an option.  It would take entirely
	>too long to restart the task to be able to effectively recover.

	>  Regarding the spare requirements.  The answer as to why to have spare time
	>is to ensure that all hard deadlines are met and to allow growth for future
	>versions of SW.   Allowing room for growth is necessary in development programs
	>however, the requirement is usually never relaxed as more functionality is
	>added.  That is another story.  However, it is necessary to ensure sufficient
	>time is available to complete all the processing within the allotted time.
	>The execution time of the software is at best a statistical problem, at least
	>the hardware times can be statistical.  If the SW is always measured as a worst
	>case time, and all these are added together, one cannot allow this time
	>to meet or exceed the allowable time, given the statistical nature of the HW.
	>So how much spare time should be allotted?  If 20% is unrealistic, what
	>number should be used, 10%, 1%, 0.001%?  

	>> > 
	>> > "Furthermore, he would have included a check that the value
	>> > did not go out of range;"
	>> > 
	>> > ---But all it needed was a check that the value was in range.
	>> > Such checks had been included on other similar conversions in
	>> > the vicinity!

	>> Yes, and there was mention in the report that 'they' thought that this 
	>> would violate that precious spare requirement.

---That's a red herring.

        >> So they set about picking
        >> and choosing which conversions to protect.
   
---This doesn't appear specifically in the report as regards
this conversion and the 2 others in the vicinity.  There's
the implication that these conversions were overlooked.
In any case, the test would have introduced a trivial
number of additional instructions.

        >>  I find it extremely hard to
        >> believe that the (small) handful of instructions to do a range check
        >> would have been too much!

---Agreed.

        >>  And, in hindsight, well worth it.

---Agreed again.

        >> The issue of the OBC interpreting the 'essentially diagnostic data' as
        >> valid sensor data really makes me wonder.  In a system with a reasonable
        >> interface between the two devices this should *never* happen.  I am
        >> surprised that this misinterpretation didn't cause a similar overflow in
        >> the OBC and resulting shutdown! :(

---Yes.

        > I was also amazed by the poor design of the interface that didn't detect
        >this problem.  Probably given enough time, some form of error would
        >have occurred resulting in the OBC shutting down.

---There were a number of inadequacies revealed in the design.

        >> I think that we agree in our assessment of the situation and the fact
        >> that these problems could have been avoided with a better overall system
        >> design and more extensive testing.  Essentially the same conclusions that
        >> the review board came to.  My only disagreement is with your _opinion_
        >> that the simple choice of a different language would have saved the day.

---As I stated, a PL/I programmer experienced in real-time
programming would not have made this stupid mistake.

        >>  And with this point I will continue to disagree.

---You do not appear to have grounds for this opinion.

        >> Steve O'Neill                      | "No,no,no, don't tug on that!
        >> Sanders, A Lockheed Martin Company |  You never know what it might
        >> smoneill@sanders.lockheed.com      |  be attached to."
        >> (603) 885-8774  fax: (603) 885-4071|    Buckaroo Banzai