From: "Marin David Condic, 561.796.8997, M/S 731-96"
Subject: Re: Safety-critical development in Ada and Eiffel
Date: 1997/07/24
Message-ID: <97072410111462@psavax.pwfl.com>
Sender: Ada programming language
Newsgroups: comp.lang.ada

Ken Garlington writes:
>> If you have a control loop executing code, say, every 5 mSec,
>> sensing some inputs and doing some loop closure, you know by the
>> rules of Ada that there are some exception possibilities you can't
>> disable.
>
> Realistically, you can disable all of them (and we have in the past).

    Seems like the last time I checked the Ada 83 standard, when my
project had an issue with this, there were some cases where disabling
the checks would have been difficult and probably unwise. (I'm thinking
specifically of PROGRAM_ERROR, STORAGE_ERROR and TASKING_ERROR. Given
the sorts of things that can raise these errors, they may be beyond
your direct control. Of course, you may get some variance based on
compilers.) Where there's a whip, there's a way, so I'll agree you can
turn off all exceptions, albeit you may get stuck doing some tweaking
of your RTK. (Doesn't scare me - just makes one more thing I've got to
document & test.)

>> Hence they could be raised by code beyond your control.
>> You insert an exception handler in the loop to catch any of these,
>> possibly logging them for telemetry (or at least ticking off a
>> counter somewhere so you know it happened in lab testing!) then
>> allow the loop to restart.
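    The quoted loop-with-handler pattern might be sketched in Ada
roughly as follows. This is a minimal illustration, not the actual
engine control code: Read_Inputs, Close_The_Loop and the fault counter
are invented placeholders, and the timing uses the Ada 95 "delay
until" form.

```ada
with Ada.Real_Time; use Ada.Real_Time;

--  A minimal sketch: a 5 mSec cyclic loop whose frame body is wrapped
--  in a catch-all handler, so an unexpected exception costs at most
--  one frame of data.
procedure Control_Loop is
   Period     : constant Time_Span := Milliseconds (5);
   Next_Frame : Time := Clock;
   Faults     : Natural := 0;   --  "ticking off a counter somewhere"
begin
   loop
      begin
         Read_Inputs;           --  placeholder: sample the sensors
         Close_The_Loop;        --  placeholder: one pass of loop closure
      exception
         when others =>         --  Program_Error, Storage_Error, ...
            Faults := Faults + 1;   --  log for telemetry / lab testing
            --  fall through: the loop simply restarts next frame
      end;
      Next_Frame := Next_Frame + Period;
      delay until Next_Frame;
   end loop;
end Control_Loop;
```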
>
> Yes, we do this with interrupt handlers (although we resume where we
> left off, rather than restart). The problem with restart is blowing
> off a frame of data. For high-gain data, you might see a significant
> transient, which could have very bad effects structurally,
> operationally, etc.
>

    Interrupt handlers are similar, but different enough to warrant
some special consideration. First off, you *can* return to where you
left off with an interrupt - not so with exceptions. (I'm sure you
realize this, but it needed to be stated.)

    We took this view of interrupts: some we were using, and they had
appropriate code to do what needed to be done. Others we weren't
using, so they were masked off. On the off chance that some mysterious
code messed up the mask, or gamma rays punched holes in the mask at
the same instant that a spurious interrupt happened, we had handlers
tied to the unused interrupts "just in case". You could presume that
the most probable cause of receiving one of these interrupts was a
hardware failure of some sort - which may have been either transient
or permanent. Software *might* have caused it (accidentally performed
some XIO instructions with the wrong addresses & data which just so
happened to unmask, then trigger, an interrupt?) and again you could
presume it was either transient or permanent. So the catch-all
handlers were designed to log the error, report it in telemetry and,
if it was the third occurrence, presume something was broken
permanently and transfer control to the other side. You could devise a
dozen variants for accommodating these errors, all of which have
strengths & weaknesses - but eventually you've got to fly with only
one of them and live with its weaknesses. In any event, if the
interrupt was occurring from a flaw in the software design - well,
you're truly intercoursed and there's no way around it when the design
is "common mode".
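    The catch-all policy above - log the error, report it in
telemetry, and on the third occurrence presume a permanent failure and
transfer to the other channel - could be sketched like this. Every
identifier is hypothetical, and an Ada 95 protected object stands in
for whatever interrupt mechanism the real system used.

```ada
--  Hypothetical sketch of the "log, report, three strikes, transfer"
--  accommodation for a spurious interrupt on an unused vector.
protected Spurious_Handler is
   procedure Handle;
   --  would be attached to each unused interrupt, e.g. with
   --  pragma Attach_Handler in Ada 95
private
   Count : Natural := 0;
end Spurious_Handler;

protected body Spurious_Handler is
   procedure Handle is
   begin
      Count := Count + 1;
      Log_Fault (Spurious_Interrupt);  --  placeholder: telemetry report
      if Count >= 3 then
         Transfer_To_Other_Side;  --  presume permanent hardware failure
      end if;
      --  otherwise presume a transient and resume where we left off
   end Handle;
end Spurious_Handler;
```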
    It's the same deal you get with identical processor board designs
- if some transistor is plugged into the design to operate at its
ragged-edge limit of failure, and some corner case drives the
transistor just a little harder than that, you let the smoke out.
Guess what? The same transistor on the other side is seeing the same
corner case, and it's probably letting the smoke out too. Kiss the
rocket goodbye.

> The bottom line is, there is no intrinsically "safe" general-purpose
> approach to handling exceptions. For the ones you can't suppress (or
> figure out how to handle otherwise), you end up basically making the
> best of a bad situation.

    True. We looked at them and used the outer-most-loop scheme I
outlined. Again, the accommodation was *usually* to log it three
times, then shut down, presuming the channel to be broken. This was
the general philosophy for all our system-level FDA - not just
exceptions. Some errors, we reasoned, might be fixed by reloading out
of E**2, so we'd reboot after the transfer. For some, we'd presume
broken hardware and stay down. For some, the three occurrences would
have to be in a row - if it cleared for a given cycle, you'd reset the
counters. All of it had to be based on an analysis of the errors we
could detect, looking at the most probable cause, then deriving a
reasonable accommodation.

    However, you can't fix everything, right? What if the processor
gets fried by gamma rays? What's your software going to do to clear
*that* problem? (Seriously got asked that question!) What if the
common-mode design is flawed? What if the sun expands suddenly and
totally engulfs the rocket in fire? Some things you can't fix with
software.

>> What you're saying is this: "On pass N everything was fine. On
>> pass N+1, something went haywire and interrupted normal execution.
>> Because quitting operation is not an acceptable alternative, what
>> I'm betting on is that on pass N+2, the problem will clear
>> itself."
>
> OK for transient input problems (we use input filtering to handle
> those, however), or for transient hardware problems (and you should
> read the beating Ariane took for assuming that!), but there's
> absolutely no reason to assume a software design fault will act this
> way.
>
> That's not to say that your approach is wrong, but if it fails...
> what will your inquiry board's report look like?

    Agreed. Software design faults are probably the most difficult to
accommodate because a) you don't know what the fault is going to do,
so you don't even know if you can detect it, and b) unless you know
what its nature is, you can't devise a reasonable accommodation. (And
if I knew what its nature was, I'd probably have gotten rid of it and
ensured that it would never happen anyway!)

    Our approach with the outer-loop exception handler was based on
some assumptions: If something goes wrong for which you don't already
have some accommodation (the stuff you didn't know about), then it is
probably better for the control to press on trying to run the system
than it would be to shut down and leave the engine fail-fixed. If the
problem is serious enough not to clear, eventually your watchdog timer
is going to shut the channel down anyway (or some other FDA is going
to come into play), and you *probably* didn't do any harm by
continuing to try to run.

    For transient data problems and such, we'd naturally go do some
form of input filtering, range checking, invalid-input logic,
whatever. But that's all for the errors you know about and anticipate.
The toughies are the ones you *don't* know about and *don't*
anticipate. Those are always the ones that kill you. And I don't know
of any design strategy or theory or rule-of-thumb method that's going
to help you with the problems that fit in this category.

>> This would potentially give you a viable use for raising
>> exceptions on the fly.
>> Granted, you wouldn't do this for any sort
>> of expected conditions with planned-for accommodations, but
>> strictly for those sorts of errors that should never occur, but
>> might just do so anyway. Your accommodation at that point might be
>> something like resetting all of memory to its initial state and
>> hoping that the next batch of inputs gets you back to where you
>> should be.
>
> We actually have a top-level handler on some programs that does a
> warm start if a really serious event happens, that's similar to what
> you describe. However, it's more of wishful thinking than anything
> else that says this will save the system. It's the last line of
> defense, not the first, and certainly not something you want to
> depend on to say your system is safe!

    You're right - it's wishful thinking. But if the alternative is to
shut down the system and let the rocket fall in the ocean - well, the
mission's over anyway; you might just as well try *something*, no
matter how desperate. I agree, it's the last resort - not the first
line of defense. But I don't think most of us would run off raising
exceptions for things we could easily detect and accommodate as we're
reading the data and making our computations.

    Given that little Ariane event, I'd think that if you went to all
the trouble of putting in the assertion to range-check the input, you
might just as well have saturated the number and set the "bad data"
flag riding with it (assuming we're redesigning the system). Our
practical experience with control systems indicates that saturated
arithmetic most often "does the right thing" for out-of-range
situations. But again, this is all 20/20 hindsight and, as I've
observed before, the Ariane software was an adequate design in its
original context.

MDC
Marin David Condic, Senior Computer Engineer    ATT:      561.796.8997
Pratt & Whitney GESP, M/S 731-96, P.O.B. 109600 Fax:      561.796.4669
West Palm Beach, FL, 33410-9600                 Internet: CONDICMA@PWFL.COM
===============================================================================
    "They can't get you for what you didn't say."
        -- Calvin Coolidge
===============================================================================