[SystemSafety] Critical Design Checklist

Les Chambers les at chambers.com.au
Fri Aug 30 01:55:01 CEST 2013


Hi Kevin

How's it going? I hope you're enjoying the responses from the list. Here are
some more contributions:

Configuration audit - boring but necessary

I experienced this on a distributed control systems project with 200
computing nodes.

The hazard was: loss of control due to installation of the wrong version of
one or more of three node components: the control logic file, the
configuration file or the operating system itself. The hazardous event
actually occurred at 2 AM one morning when a gang of commissioning engineers
attempted to install the wrong version of all three. Luckily there was no
safety incident but it was highly embarrassing and very expensive in
everyone's time. This event turned some laissez-faire systems architects and
developers into hard-core card-carrying configuration audit zealots. I was
truly impressed with the speed with which they implemented an automated
system to run a real-time audit of the operational baseline. A configuration
baseline definition was held in a redundant set of supervisory computers and
a check was run every few minutes. We ran a 32-bit CRC over all files and
used that as a unique identifier of each system component. In a design
review I would ask the question: what measures are you taking to secure the
integrity of the operational baseline? You could get some interesting
answers.
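
For illustration only, here is a minimal Python sketch of that kind of
baseline audit. It is not the system described above: the file names, the
baseline dictionary layout and the function names are all assumptions made
up for the example. It computes a CRC-32 over each component file and
compares it with the value recorded in the baseline definition.

```python
import zlib
from pathlib import Path


def crc32_of_file(path, chunk_size=65536):
    """Compute the CRC-32 of a file, reading it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def audit_node(node_dir, baseline):
    """Compare a node's files with the baseline.

    baseline maps a (hypothetical) component file name to its expected
    CRC-32. Returns a list of (component, problem) pairs; an empty list
    means the node matches the baseline.
    """
    findings = []
    for component, expected_crc in baseline.items():
        path = Path(node_dir) / component
        if not path.is_file():
            findings.append((component, "missing"))
        elif crc32_of_file(path) != expected_crc:
            findings.append((component, "crc mismatch"))
    return findings
```

A supervisory process could call audit_node() for every node on a timer and
raise an alarm on any non-empty result. Note that a 32-bit CRC catches
accidental mix-ups and corruption but can collide and is not
tamper-resistant; where that matters a cryptographic hash would be the
usual substitute.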

Glue tracking

And then there was the glue incident. I should expand on that as it has
configuration management ramifications. The heatsinks in all the control
nodes were glued to the main processor chip. A factory in Shanghai used a
bad batch of glue. Once the computers heated up in operation the glue melted
and the heatsinks fell off, rattling around inside the card cage. The stuff
of nightmares for a maintenance engineer. Once again this did not trigger a
hazardous event because the control node configurations were highly
redundant with much health checking going on. But it did cause some
consternation for some time. Somewhat like the sword of Damocles hanging
over the heads of the commissioning guys. The core problem was: no one
thought to include the glue batch number in the configuration definition of
a control computer. It turns out that software engineers are not strong in
glue tracking. Who knows, there may not even have been such a thing as a
glue batch number. We therefore did not know which of the 200 computing
nodes had been assembled with the defective glue. Anyway, at least that
bunch of guys are now card-carrying glue trackers. In a design review I
would ask the general question: what measures have been taken to avoid
common cause failures, and to what degree are components traceable to their
manufacture?
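
As a rough sketch of what traceability to manufacture could look like in a
configuration definition, the Python below records a batch identifier per
part for each node and answers the question we could not: which nodes
contain a given batch? Every name here (NodeRecord, the part name
"heatsink_glue", the batch labels) is hypothetical and invented for the
example.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    node_id: str
    # Maps a part name to the manufacturer's batch identifier.
    # A part that is absent here was never recorded for this node.
    part_batches: dict = field(default_factory=dict)


def nodes_with_batch(records, part, batch):
    """Return the ids of nodes built with the given batch of a part."""
    return [r.node_id for r in records if r.part_batches.get(part) == batch]


def nodes_untraceable(records, part):
    """Return the ids of nodes with no batch recorded for a part."""
    return [r.node_id for r in records if r.part_batches.get(part) is None]
```

The second query matters as much as the first: a node with no recorded
batch has to be treated as suspect until it is inspected.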

Self-modifying code

I mention this because I see the strategy that calls for system components
to modify each other at run time coming back into vogue. It used to be a
cool thing to do in assembler language programming when you had limited
working memory. It naturally died out when working memory became like air,
infinite. Now it's back. It is standard practice in web apps for code to
modify web page HTML at run time. In fact I do it myself, I'm ashamed to
admit. The thing is you are forced down this path if you want to create
dynamic webpages (my excuse is: PHP made me do it!!). My concern is that if
someone should use these architectural design components for something
serious such as provisioning or tasking an armed drone there could be
serious consequences. Self-modifying code is difficult to review. You sort
of have to imagine what it's going to look like at run time. In a design
review I would ask: justify any strategy that involves design components
modifying themselves or others at run time.

 

Sidebar:

There seems to be some confusion on this list as to how to answer your
simple request for design review questions in the context of safety critical
systems. I would encourage everyone to get your collective heads out of the
meta world and just let it all hang out. Random accounts of things that went
wrong because of a design flaw can be extremely useful to other designers.
This is an excellent opportunity to build a library of same. Some meta
person can sort it out later. How hard can that be?

Good luck with your checklist.

Cheers

Les

From: systemsafety-bounces at lists.techfak.uni-bielefeld.de
[mailto:systemsafety-bounces at lists.techfak.uni-bielefeld.de] On Behalf Of
Driscoll, Kevin R
Sent: Tuesday, August 27, 2013 6:38 AM
To: systemsafety at techfak.uni-bielefeld.de
Subject: [SystemSafety] Critical Design Checklist

 

For NASA, we are creating a Critical Design Checklist:

*  Objective
   -  A checklist for designers to help them determine if a safety-critical
      design has met its safety requirements
   -  Not a "Have you done ..." checklist
      o  Too easy to just check "yes" without doing sufficient work
      o  Instead, "What have you done ..."
      o  Prove what you have done is sufficient

*  We are looking for inputs to include in this checklist

*  Do you have any inputs that should be included?
   -  Meta-question:  "If you were asked to participate in a design review
      of a safety-critical design, what questions would you ask?"
      (Particularly, general questions you would have before seeing the
      details of a design.)
   -  Inverse meta-question:  "If you were presenting a design, what
      questions would you dread being asked?"  :-}
      o  Where are the bodies buried?

 

We are finishing the Checklist by next week and would like to include any
good questions you may have that we have overlooked.   Realizing this is an
imposition on your time, I am hoping some of you would be so kind as to
spend just a few minutes to send questions or even question fragments.

 

--

P.S.

I am also looking for unusual failure scenarios to add to my collection,
like those I've described in my series of "Murphy was an Optimist"
presentations (e.g.
http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf).

 
