From: "Marin David Condic, 561.796.8997, M/S 731-96"
Subject: Re: Safety-critical development in Ada and Eiffel
Date: 1997/07/24
Message-ID: <97072410111462@psavax.pwfl.com>
Sender: Ada programming language
Newsgroups: comp.lang.ada

Ken Garlington writes:
>> If you have a control loop executing code, say, every 5 mSec,
>> sensing some inputs and doing some loop closure, you know by the
>> rules of Ada that there are some exception possibilities you can't
>> disable.
>
> Realistically, you can disable all of them (and we have in the past).

    Seems like the last time I checked the Ada 83 standard, when my
project had an issue with this, there were some cases where disabling
the checks would have been difficult and probably unwise. (I'm thinking
specifically of PROGRAM_ERROR, STORAGE_ERROR and TASKING_ERROR. Given
the sorts of things that can raise these errors, they may be beyond
your direct control. Of course, you may get some variance based on
compilers.) Where there's a whip, there's a way, so I'll agree you can
turn off all exceptions, albeit you may get stuck doing some tweaking
of your RTK. (Doesn't scare me - just makes one more thing I've got to
document & test.)

>> Hence they could be raised by code beyond your control.
>> You insert an exception handler in the loop to catch any of these,
>> possibly logging them for telemetry (or at least ticking off a
>> counter somewhere so you know it happened in lab testing!) then
>> allow the loop to restart.
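    The quoted loop-with-handler pattern might be sketched in Ada
roughly as follows. This is a minimal illustration, not the actual
engine control code: Read_Inputs, Close_The_Loop and the fault counter
are invented placeholders, and the timing uses the Ada 95 "delay
until" form.

```ada
with Ada.Real_Time; use Ada.Real_Time;

--  A minimal sketch: a 5 mSec cyclic loop whose frame body is wrapped
--  in a catch-all handler, so an unexpected exception costs at most
--  one frame of data.
procedure Control_Loop is
   Period     : constant Time_Span := Milliseconds (5);
   Next_Frame : Time := Clock;
   Faults     : Natural := 0;   --  "ticking off a counter somewhere"
begin
   loop
      begin
         Read_Inputs;           --  placeholder: sample the sensors
         Close_The_Loop;        --  placeholder: one pass of loop closure
      exception
         when others =>         --  Program_Error, Storage_Error, ...
            Faults := Faults + 1;   --  log for telemetry / lab testing
            --  fall through: the loop simply restarts next frame
      end;
      Next_Frame := Next_Frame + Period;
      delay until Next_Frame;
   end loop;
end Control_Loop;
```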
>
> Yes, we do this with interrupt handlers (although we resume where we
> left off, rather than restart). The problem with restart is blowing
> off a frame of data. For high-gain data, you might see a significant
> transient, which could have very bad effects structurally,
> operationally, etc.
>

    Interrupt handlers are similar, but different enough to warrant
some special consideration. First off, you *can* return to where you
left off with an interrupt - not so with exceptions. (I'm sure you
realize this, but it needed to be stated.)

    We took this view of interrupts: some we were using, and they had
appropriate code to do what needed to be done. Others we weren't
using, so they were masked off. On the off chance that some mysterious
code messed up the mask, or gamma rays punched holes in the mask at
the same instant that a spurious interrupt happened, we had handlers
tied to the unused interrupts "just in case". You could presume that
the most probable cause of receiving one of these interrupts was a
hardware failure of some sort - which may have been either transient
or permanent. Software *might* have caused it (accidentally performed
some XIO instructions with the wrong addresses & data which just so
happened to unmask, then trigger, an interrupt?) and again you could
presume it was either transient or permanent. So the catch-all
handlers were designed to log the error, report it in telemetry and,
if it was the third occurrence, presume something was broken
permanently and transfer control to the other side. You could devise a
dozen variants for accommodating these errors, all of which have
strengths & weaknesses - but eventually you've got to fly with only
one of them and live with its weaknesses. In any event, if the
interrupt was occurring from a flaw in the software design - well,
you're truly intercoursed and there's no way around it when the design
is "common mode".
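    The catch-all policy above - log the error, report it in
telemetry, and on the third occurrence presume a permanent failure and
transfer to the other channel - could be sketched like this. Every
identifier is hypothetical, and an Ada 95 protected object stands in
for whatever interrupt mechanism the real system used.

```ada
--  Hypothetical sketch of the "log, report, three strikes, transfer"
--  accommodation for a spurious interrupt on an unused vector.
protected Spurious_Handler is
   procedure Handle;
   --  would be attached to each unused interrupt, e.g. with
   --  pragma Attach_Handler in Ada 95
private
   Count : Natural := 0;
end Spurious_Handler;

protected body Spurious_Handler is
   procedure Handle is
   begin
      Count := Count + 1;
      Log_Fault (Spurious_Interrupt);  --  placeholder: telemetry report
      if Count >= 3 then
         Transfer_To_Other_Side;  --  presume permanent hardware failure
      end if;
      --  otherwise presume a transient and resume where we left off
   end Handle;
end Spurious_Handler;
```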
    It's the same deal you get with identical processor board designs
- if some transistor is plugged into the design to operate at its
ragged-edge limit of failure, and some corner case drives the
transistor just a little harder than that, you let the smoke out.
Guess what? The same transistor on the other side is seeing the same
corner case, and it's probably letting the smoke out too. Kiss the
rocket goodbye.

> The bottom line is, there is no intrinsically "safe" general-purpose
> approach to handling exceptions. For the ones you can't suppress (or
> figure out how to handle otherwise), you end up basically making the
> best of a bad situation.

    True. We looked at them and used the outer-most-loop scheme I
outlined. Again, the accommodation was *usually* to log it three
times, then shut down, presuming the channel to be broken. This was
the general philosophy for all our system-level FDA - not just
exceptions. Some errors, we reasoned, might be fixed by reloading out
of E**2, so we'd reboot after the transfer. For some, we'd presume
broken hardware and stay down. For some, the three occurrences would
have to be in a row - if it cleared for a given cycle, you'd reset the
counters. All of it had to be based on an analysis of the errors we
could detect, looking at the most probable cause, then deriving a
reasonable accommodation.

    However, you can't fix everything, right? What if the processor
gets fried by gamma rays? What's your software going to do to clear
*that* problem? (Seriously got asked that question!) What if the
common-mode design is flawed? What if the sun expands suddenly and
totally engulfs the rocket in fire? Some things you can't fix with
software.

>> What you're saying is this: "On pass N everything was fine. On
>> pass N+1, something went haywire and interrupted normal execution.
>> Because quitting operation is not an acceptable alternative, what
>> I'm betting on is that on pass N+2, the problem will clear
>> itself."
>
> OK for transient input problems (we use input filtering to handle
> those, however), or for transient hardware problems (and you should
> read the beating Ariane took for assuming that!), but there's
> absolutely no reason to assume a software design fault will act this
> way.
>
> That's not to say that your approach is wrong, but if it fails...
> what will your inquiry board's report look like?

    Agreed. Software design faults are probably the most difficult to
accommodate because a) you don't know what the fault is going to do,
so you don't even know if you can detect it, and b) unless you know
what its nature is, you can't devise a reasonable accommodation. (And
if I knew what its nature was, I'd probably have gotten rid of it and
ensured that it would never happen anyway!)

    Our approach with the outer-loop exception handler was based on
some assumptions: If something goes wrong for which you don't already
have some accommodation (the stuff you didn't know about), then it is
probably better for the control to press on trying to run the system
than it would be to shut down and leave the engine fail-fixed. If the
problem is serious enough not to clear, eventually your watchdog timer
is going to shut the channel down anyway (or some other FDA is going
to come into play), and you *probably* didn't do any harm by
continuing to try to run.

    For transient data problems and such, we'd naturally go do some
form of input filtering, range checking, invalid-input logic,
whatever. But that's all for the errors you know about and anticipate.
The toughies are the ones you *don't* know about and *don't*
anticipate. Those are always the ones that kill you. And I don't know
of any design strategy or theory or rule-of-thumb method that's going
to help you with the problems that fit in this category.

>> This would potentially give you a viable use for raising
>> exceptions on the fly.
>> Granted, you wouldn't do this for any sort
>> of expected conditions with planned-for accommodations, but
>> strictly for those sorts of errors that should never occur, but
>> might just do so anyway. Your accommodation at that point might be
>> something like resetting all of memory to its initial state and
>> hoping that the next batch of inputs gets you back to where you
>> should be.
>
> We actually have a top-level handler on some programs that does a
> warm start if a really serious event happens, that's similar to what
> you describe. However, it's more of wishful thinking than anything
> else that says this will save the system. It's the last line of
> defense, not the first, and certainly not something you want to
> depend on to say your system is safe!

    You're right - it's wishful thinking. But if the alternative is to
shut down the system and let the rocket fall in the ocean - well, the
mission's over anyway; you might just as well try *something*, no
matter how desperate. I agree, it's the last resort - not the first
line of defense. But I don't think most of us would run off raising
exceptions for things we could easily detect and accommodate as we're
reading the data and making our computations.

    Given that little Ariane event, I'd think that if you went to all
the trouble of putting in the assertion to range-check the input, you
might just as well have saturated the number and set the "bad data"
flag riding with it (assuming we're redesigning the system). Our
practical experience with control systems indicates that saturated
arithmetic most often "does the right thing" for out-of-range
situations. But again, this is all 20/20 hindsight and, as I've
observed before, the Ariane software was an adequate design in its
original context.

MDC
Marin David Condic, Senior Computer Engineer    ATT:      561.796.8997
Pratt & Whitney GESP, M/S 731-96, P.O.B. 109600 Fax:      561.796.4669
West Palm Beach, FL, 33410-9600                 Internet: CONDICMA@PWFL.COM
===============================================================================
    "They can't get you for what you didn't say."
        -- Calvin Coolidge
===============================================================================