[SystemSafety] A comparison of STPA and ARP 4761

Peter Bernard Ladkin ladkin at rvs.uni-bielefeld.de
Tue Jul 29 07:42:12 CEST 2014



On 2014-07-29 01:49, Matthew Squair wrote:
> To be absolutely fair, the comparison is between the worked example provided in ARP 4761 and a STAMP
> rerun of the same example. 

Redoing a worked example is a standard method of comparison. But I am inclined to be wary of
overgeneralising the conclusion (unlike the authors cited by M. Fabre :-) )

There is a real, pressing issue here which the cited conclusion addresses. Put simply, it is that we
have to do better than the "traditional" methods of hazard and risk analysis.

We do indeed "... need to create and employ more powerful and inclusive approaches to evaluating
safety" than those referred to in the current aerospace standards. FMEA as it is currently
performed, FTA and RBD are the three techniques referred to explicitly in 14 CFR 25.1309 for
auxiliary kit, as commonly executed in aerospace contexts. They all miss things that need to be
identified and mitigated. This is well known to people like us, but there are people out there in
industry who swear by (their version of) FMEA, FTA and so on, and to my mind the message really
needs to be put out there much more visibly that there are issues which will cause problems with
your kit and which these methods do not identify.

I don't know about the 4761 example, but there are well-known actual cases. The "usual techniques"
did not identify the error in the boot-up configuration SW of the Boeing 777 FMS which led to the
uncommanded pitch excursions of the Malaysian Airlines Boeing 777 out of Perth in 2005. And it
should be clear that they could never do so. Neither did they identify that spiky, misleading
output emanating from a sporadically faulty ADIRU could be accepted as veridical by the primary
flight control computer SW and lead (also) to uncommanded pitch excursions (and some injured people
this time) in the 2008 Learmonth A330 accident, or to a similar incident a couple of months later
on a sister ship. And it should be clear ... etc. Neither did they adequately assess the risk of
the Byzantine fault in <famous airplane> databus processes, recounted by Driscoll, Hall, Sivencrona
and Zumsteg in their classic SAFECOMP 2003 paper, which almost led to the withdrawal of the
airworthiness certificate.

It is also evident (at least to those on this list, I hope!) that these traditional methods do not
address important issues of how operators work with the systems they are to operate - what is
called HF, human factors. There are many recent examples: try Turkish in Amsterdam in 2009 or
Asiana in San Francisco in 2013. These two are different in that the Turkish crew had the
autothrottle slaved to faulty kit and left themselves under autothrottle control, whereas nothing
was wrong with any of the kit on Asiana; but they are similar in that experienced crews didn't
monitor the basics on final approach until it was too late to recover (and we really are talking
about stuff that every pilot is taught on his/her first flying lesson and that is emphasised
throughout primary training). The puzzle is why not. That's puzzling everyone at the moment. (But
I take the likelihood that one could wave some magical method at the problem and have the answer
pop out to be just about zero, if not slightly less.)

What happens, constantly in my experience, and likely that of the MIT people also, is that
development engineering says "I do this-and-this and this like the guidance says, and it shows me
that-and-that and then I fix it. <And, implicitly, anything I didn't see when doing this-and-this
doesn't exist/isn't my concern/etc> So, job done and the certification agencies are happy."
And that's just not enough. Methods should be used that identify *all* the hazards, if possible.
Such methods are available and, as the paper says, STPA is one. Traditional FMEA, FTA and RBD
aren't such methods.

But I think people can do much better than they do at the moment with incremental improvements,
rather than having to learn an entire sociological-system model (this is presumably where I differ
in my view from the authors of the cited article :-) ). One obvious lack in the traditional methods
is any kind of reliable self-check. You do an FMEA - how do you know the one you got is
right/sufficient/adequate? Same with FTA. The answer is: traditionally you don't - there are no
technical correctness/completeness checks. So, devise a valid check, add it on, and you are already
doing much better.
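
To make that concrete, here is a minimal sketch in Python of the kind of mechanical self-check one
could bolt onto an FMEA: cross-check the worksheet against an independently maintained failure-mode
taxonomy and flag anything not considered. The taxonomy, the components and the worksheet rows
below are all invented for illustration; nothing here comes from any standard.

# Sketch of a mechanical completeness check on an FMEA worksheet.
# All component names, failure modes and rows are invented.

# Independently maintained taxonomy: for each component type, the
# failure modes any FMEA of it must at least consider.
TAXONOMY = {
    "valve":  {"fails open", "fails closed", "leaks", "responds slowly"},
    "sensor": {"no output", "stuck output", "plausible-but-wrong output"},
    "pump":   {"fails off", "fails on", "degraded flow"},
}

# The system under analysis: component name -> component type.
COMPONENTS = {
    "relief_valve":    "valve",
    "pressure_sensor": "sensor",
    "feed_pump":       "pump",
}

# The worksheet as produced: (component, failure mode, effect, mitigation or None).
WORKSHEET = [
    ("relief_valve",    "fails closed", "overpressure",            "burst disc"),
    ("relief_valve",    "fails open",   "loss of pressure",        "low-pressure alarm"),
    ("pressure_sensor", "no output",    "operator blind",          "redundant sensor"),
    ("pressure_sensor", "stuck output", "overpressure undetected", None),
    ("feed_pump",       "fails off",    "process stops",           "standby pump"),
]

def check_fmea(components, taxonomy, worksheet):
    """Return (required pairs never considered, rows with no mitigation)."""
    covered = {(comp, mode) for comp, mode, _, _ in worksheet}
    required = {(comp, mode) for comp, ctype in components.items()
                             for mode in taxonomy[ctype]}
    missing = sorted(required - covered)
    unmitigated = [(comp, mode) for comp, mode, _, mit in worksheet if mit is None]
    return missing, unmitigated

if __name__ == "__main__":
    missing, unmitigated = check_fmea(COMPONENTS, TAXONOMY, WORKSHEET)
    for comp, mode in missing:
        print(f"NOT CONSIDERED: {comp}: {mode}")
    for comp, mode in unmitigated:
        print(f"NO MITIGATION:  {comp}: {mode}")

The check is of course only as good as the taxonomy you check against, but that is rather the
point: the completeness criterion becomes explicit and auditable instead of living in the
analyst's head.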

I am not the only person to have noticed this, of course. There are myriad papers out there which
propose an enhancement, take an example, redo it using the enhancement, and show how much better
the results are. But a common response is "sure, every PhD student and his/her mother can enhance
a technique to work better on a given specific example". One could instead observe that if there
are a ton of papers out there showing by example how the traditional method doesn't catch
everything that needs catching, then it is very likely indeed that using the traditional method
risks not catching everything you should be catching. But this reasonable conclusion is rarely
heard.

So, for example, "everyone" redoes the pressure-vessel example in the Fault Tree Handbook (kudos,
BTW, to Bedford/Cooke, who don't, while having good material on FTA). Kammen and Hassenzahl did in
their 1999 text and Leveson in her 1995 monograph. We had a go too in 2000. Some simple causal
analysis of the control loops in the pressure-vessel system, followed by a syntactic transformation
into a tree, yields an FT that is obviously "more complete" than the one in the FTH, than Kammen
and Hassenzahl's, or than Leveson's. I've been sitting around for a decade and a half waiting for a
paper from some PhD student taking the same example and doing even better than Vesely, Kammen,
Hassenzahl, Leveson and Ladkin. I haven't seen one, so either people are bored with playing the
game or I've not been keeping up with the literature. Or maybe our fault tree was perfect :-)
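
For flavour, here is a toy sketch of the "tree" end of such a transformation - not the causal
analysis itself, and nothing like the actual FTH example: the gates and events below are invented.
The point is only that a fault tree is a purely syntactic object, over which things like minimal
cut sets can be computed mechanically.

# Toy fault-tree representation and minimal-cut-set computation.
# Structure and event names are invented for illustration.

from itertools import product

def cut_sets(node):
    """Return the cut sets of a fault-tree node as a set of frozensets."""
    kind = node[0]
    if kind == "basic":                    # leaf event
        return {frozenset([node[1]])}
    children = [cut_sets(c) for c in node[2]]
    if kind == "or":                       # any child's cut set suffices
        return set().union(*children)
    if kind == "and":                      # one cut set from each child
        return {frozenset().union(*combo) for combo in product(*children)}
    raise ValueError(f"unknown gate {kind!r}")

def minimal(sets):
    """Drop any cut set that strictly contains another."""
    return {s for s in sets if not any(t < s for t in sets)}

# Top event: vessel overpressure. The AND branch reflects a control-loop
# failure - the kind of cause a component-by-component analysis can miss.
TREE = ("or", "overpressure", [
    ("basic", "relief valve stuck closed"),
    ("and", "control loop fails to act", [
        ("or", "no valid pressure signal", [
            ("basic", "sensor stuck at low reading"),
            ("basic", "sensor wiring fault"),
        ]),
        ("basic", "operator does not intervene"),
    ]),
])

if __name__ == "__main__":
    for cs in sorted(minimal(cut_sets(TREE)), key=sorted):
        print(" & ".join(sorted(cs)))

Once the tree is an explicit structure like this, "is there a branch missing for this control
loop?" becomes a question one can pose systematically, which is the gist of the completeness point
above.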

For a more recent example, in 2010 Daniel Jackson picked an example out of Nancy Leveson's recent
book and showed how an Alloy analysis picked up some things which STPA hadn't. (See his November
2010 entry "How to Prevent Disasters" at http://people.csail.mit.edu/dnj/talks/ .) Then Jan Sanders
in my group started an OHA (Ontological Hazard Analysis) and picked up some additional things which
Jackson's analysis seemed also to have missed. There followed a discussion of completeness and how
to check for it. (I have some blog posts on it and the mailing-list discussion is in the York
archives.)

The main point is that the traditional methods don't work well, better is available, indeed much
better is available, and people should be using it. As the citation says, "STPA is one possibility."
Even adding some half-decent self-checks to FMEA or FTA would be better than what's currently done.

The question is how we get there, socially. Prominently picking prominent examples and redoing them
prominently is helpful, but it is susceptible to the "everybody and his/her mother can do that"
response above, usually followed by "and I can't speak for our competitors, but all *our*
engineers can do a decent FMEA and we don't get it wrong".

Others have mooted that things will change when the compensation lawsuits start mounting. Having
been involved in some of those processes, I am not so sure. As others here with similar experience
can testify, mostly only a tiny fraction of any such negotiations concern the technical engineering
details, and very few of them get to open court like Bookout/Toyota. Indeed, that case was notable
not only for its visibility but also for the fact that it was decided directly on engineering
details. A participant in such discussions who has had to disclose a hazard analysis (often an FMEA)
and had it successfully trashed by the opposition can choose to fault the engineers who produced it
rather than the inadequate method the company required them to use.

One could maybe hope for progress coming through engineering training. Say, in engineering
departments at universities. York and MIT do their best, but system safety methods are not taught in
any detail in most places. The head of the division responsible for risk analysis at a prominent
German transportation safety-assessment company once told me he'd been looking for a new engineering
graduate to perform FTAs and ETAs, and could find only one German university at which FTA was
taught to a usable level (that was Stuttgart). He's right; I checked. There are lots of people who
*mention* FTAs in coursework and have a couple in their lecture slides, including yours truly, but
there's nobody except perhaps Stuttgart who can say "oh, XYZ passed our course so he/she can
certainly do a decent FTA for you". These are still, mainly, methods learnt on the job, and if you
learn a method on the job because it's company standard, you mostly don't learn about where it
doesn't work.

The frustration in some quarters is palpable. In a discussion on the FMEA standard a few years ago,
Jens Braband told me of the number of people he encounters who think FMEA is, in that wonderful
German phrase, the eierlegende Wollmilchsau - the egg-laying wool/milk sow, the Swiss army knife of
farm animals. I've met some of these people. It is true that a well-performed FMEA can solve
complex logistics problems that people couldn't solve any other way: you have a very complex piece
of kit out there in the field, with thousands to millions of components, each with its own MTBF,
and you have to figure out a maintenance schedule and parts-supply schedule which keeps it running
while minimising maintenance down-time. I've had someone explain to me in detail over many hours
how FMEA can do that wonderfully. Then I say: what it can't do is reliably identify all the hazards
when you're performing a functional system safety analysis. The reply: "OK, but <there follows a
further encomium to the wonders of FMEA>".
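
And the logistics computation really is straightforward once the MTBF data are in hand. A
back-of-envelope sketch, assuming exponentially distributed failures; the component names and
figures are entirely invented:

# Back-of-envelope preventive-maintenance scheduling from MTBF data.
# Assumes exponentially distributed failures; all figures invented.

import math

MTBF_HOURS = {                 # component -> mean time between failures
    "hydraulic pump":  20_000,
    "pressure sensor": 50_000,
    "controller card": 100_000,
}

TARGET = 0.01                  # tolerate at most a 1% chance of in-service failure

def replacement_interval(mtbf, target=TARGET):
    """Longest t with P(fail by t) = 1 - exp(-t/mtbf) <= target."""
    return -mtbf * math.log(1.0 - target)

if __name__ == "__main__":
    for part, mtbf in MTBF_HOURS.items():
        t = replacement_interval(mtbf)
        spares_per_year = 8760 / t          # replacements per unit per year
        print(f"{part:16s} replace every {t:7.0f} h "
              f"(~{spares_per_year:4.1f} spares/unit/year)")

Which is the point: that is a bookkeeping-and-optimisation use of the failure data. It tells you
nothing about the hazards the worksheet never considered.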

The cited conclusion is that modern kit is so interactively complex that traditional risk analysis
methods don't work and we need something better. That is just so right. It is a sad comment on the
current state of engineering affairs that M. Fabre should need to observe "I suspect that this
conclusion will generate some controversy."

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de





