[SystemSafety] The bomb again

Peter Bernard Ladkin ladkin at rvs.uni-bielefeld.de
Thu Oct 10 10:58:48 CEST 2013


Matthew,

I am not going to indulge in a detailed discussion of STAMP, because I don't use it, and there are others here whose experience with it is far richer than mine. 

On 10 Oct 2013, at 02:59, Matthew Squair <mattsquair at gmail.com> wrote:
> Nor was I trying to make a case for STAMP as a methodology for accident analysis, though clearly you can use it for that.

STAMP was originally presented as a "new accident model". 

> To rephrase my question a little, if you accept the thesis that various sociologists make that you can't divorce the safety of a 'hard system' from its organisational context, then surely that implies that you should expend some effort on designing the containing organisational system, or at least critically reviewing it?

I would have thought so. 

I have specific experience and expertise that do not extend to designing organisational structures (no matter how much I might mouth off about them here). My professional code of conduct says I should get someone who is expert in that to do it: a clever organisational theorist with relevant experience.

Which isn't to say that the results can't appear in engineering analyses. For example, I find Nancy's STAMP analysis of the 1994 Iraq friendly-fire shootdown insightful, as well as much shorter and more precise than that of the organisational theorist Snook. Snook makes some obvious mistakes; for example, he claims as a basis for his analysis that applying the Counterfactual Test is insufficient. (First, his attempt at applying the CT is full of mistakes: see a paper of ours from 2003. Second, all of his analysis can be represented in a WBG, which of course *only* uses the CT as its criterion.) But most of what he observes in his book is insightful and pertinent.

Amongst other organisational-behavioural (OB) phenomena, Snook makes good use of Rasmussen's 1997 notion of Migration to the Boundary, and of the notion of crowd indecision, the "Kitty Genovese" phenomenon, as well as of information-flow analyses which one could routinely consider to be engineering phenomena. I wonder whether either the STAMP analysis or the WBA would exist without his work having appeared first as a point of reference. That is, it has to occur to an analyst that MttB and the Genovese phenomenon are present and explanatory; it doesn't just drop out of some magical methodological back pocket. (Whereas I think the info-flow observations do.) Engineers not specially trained in OB can't be expected to know about or observe instances of MttB or Genovese. Both the military investigation and the Commission missed them, which was part of the motivation for Snook, as I understand it.
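
For anyone unfamiliar with it, the Counterfactual Test asks of a candidate causal factor A of an outcome B: had A not occurred, all else being equal, would B still have occurred? If not, A counts as a necessary causal factor, and that is the criterion a WBG is built on. Purely as an illustration, with an invented toy scenario that is not taken from the Black Hawk case or any real WBA, the test can be sketched mechanically like this:

    # Toy sketch only: the factor names and the "world model" are invented for illustration.
    def mishap_occurs(factors):
        """A made-up model: the mishap occurs only if both a misidentification
        and a missed radio check are present."""
        return "misidentification" in factors and "no_radio_check" in factors

    def passes_counterfactual_test(factor, actual_factors):
        """A factor passes the CT if the mishap occurred and would not have
        occurred had that factor (alone) been absent."""
        return mishap_occurs(actual_factors) and not mishap_occurs(actual_factors - {factor})

    actual = {"misidentification", "no_radio_check", "good_weather"}
    for f in sorted(actual):
        print(f, passes_counterfactual_test(f, actual))
    # good_weather False; misidentification True; no_radio_check True

The point of the sketch is only that the criterion itself is mechanical; what cannot be mechanised is noticing that something like MttB or the Genovese phenomenon belongs in the set of factors in the first place.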

> And if you do so, then what are the system design and analysis tools and methodologies that can support you?

I thought I had already answered that question. Let me answer it in a little more detail.

As you suggest above, I do get to analyse the resulting proposals for sociotechnical system design, being a systems safety expert. And for that I use OHA plus the techniques I already mentioned: CCFD, WBA-on-CCFD, PARDIA and RCM analyses (I guess I hadn't mentioned the CCFD-involved techniques). Those won't derive Perrin's insight either, but, as I said, I would explicitly rely on the organisational theorists for such observations.

None of the engineering analyses performed on TCAS identified the decision-theoretic weaknesses of the sociotechnical algorithm which RCM identifies. There is also at least one flight configuration, which I identified, that no one has shown TCAS II/ACAS II can resolve. (It's a math problem in trajectories; having discussed it with some outstanding mathematicians expert in the field, I made a public offer in 2002 of a PhD for a resolution of the issue. Eleven years later, it's still open.)

Yet the algorithm was "formally verified" by two teams led by two outstanding computer scientists, Nancy Leveson and Nancy Lynch. Automated proof checkers and the lot. 

Those analyses turned out to be quite influential. I have continued to meet engineers who ignore the results of the RCM analysis, dismiss the unresolved configuration issue, and explain that TCAS is quite satisfactory as it is, by repeating the usual trope (the "philosophy") and the claim that it was "formally verified". (The people at Eurocontrol involved in the design of the new ACAS X are well aware of the matters above, so they might well get addressed in the new system definition.)

Now, how can all that happen? Part of the answer lies, unsurprisingly, in the assumptions imposed upon the Leveson and Lynch projects from outside (for example, to analyse just two-aircraft configurations). Those assumptions were not, as far as I know, explicitly discussed anywhere during the TCAS project analyses, although I think people were quite aware of them (I am open to being corrected on this, if people know of such analyses). Even after the incidents which preceded Überlingen, what I take to be the necessary analyses were not performed.

(And then there was the requirements failure which manifested itself in the Überlingen accident, which I mentioned in my comment a few weeks later. It turns out that Eurocontrol, and indeed RTCA, knew about this, and that Eurocontrol had already had a Change Proposal filed for two years before the accident.)

Organisational theorists can explain all these social phenomena in some detail. They haven't yet, but I am sure some tenacious PhD student of someone like Pinch, Downer or Slayton is going to do so sometime.

I am sceptical that analyses performed by engineers, using STAMP or OHA or whatever, are going to have much impact on these more general phenomena. What the engineers do is very controlled and compartmentalised, and the results are then summarised in two words, such as "formally verified", and broadcast without the caveats. It drives me nuts when people utter those words without context.

The first time somebody in Germany has a real accident involving an electric car being charged, somebody is going to turn around and say "the charging infrastructure was analysed and verified using OHA", followed by "somebody at the pointy end screwed up". (That second claim will come, as it always does, from the state prosecutor's office.) Is there any engineering method which predicts that behavior? No. Not STAMP, not OHA, not RCM, not any. It comes from experience and observation of the organisations and social structures involved, and I hope some organisational empiricist has noted it down and is working on it. Because I think it needs to be made explicit in system development that things work that way.

Did I say the first time? I meant the second. The first has already happened: a home for mentally-challenged people in Southern Germany burned last year because of a fault arising while a utility vehicle was being charged overnight. Luckily, or miraculously, no one was hurt. Damage was (only) half a million euros. I don't know yet what the investigation showed; I think it is sub judice and might remain so. I'd like to know, to see if we have it covered in the HazAn.

But the social mechanisms are not in place for such results to be shared with the relevant committee of the engineering organisation responsible for standards. There: yet another phenomenon for the organisational theorists to get their teeth into. You don't solve it by having an engineer impose a social model which not everybody buys, and then say "look, see, there's a feedback loop missing!" It's true that in this case there is a feedback loop missing and it needs to be put in. But if that is going to happen, it is going to happen by having some organisational theorist make observations, analyse, relate the phenomenon convincingly to other similar social phenomena where things went wrong, and ask whether we want things to go wrong in the same way here too.

You know, I could go on and on. I bet people are worried that I will! So I'll shut up.

BTW, I heard that Rebecca Slayton's book on anti-missile missile systems (Star Wars and Patriot and so on) is out. I've seen a couple of chapters and was gripped. Let me recommend it!


> Interesting you should mention the hierarchical nature of DB, Alfred Chandler makes the point in The Visible Hand that the technology of the railways drove a specific centralised, hierarchical organisational form, which I imagine DB adheres to. 

Yes and no. It is now a bunch of nominally-independent organisations contracting with each other. Planning-and-booking, catering, stations, network infrastructure, maintenance, regional rail, long-distance rail, freight. All notionally separate companies. All owned by the government still, though, and all overseen by the same board of governors.  

(Some changes are obvious. Local trains used to wait for long-distance connections. Now they don't, because it ruins their timeliness statistics. That means loads of people miss their connections where they didn't before, while the timeliness statistics of the individual companies continue to improve. DB claims its long-distance trains have a 97% "on-time" record. But my record of completing a long-distance journey with the planned connections hovers around two-thirds, even though I almost never travel with a connection time of under 10 minutes, which lies outside the "on-time" margin of DB's statistics. If I include bus/tram connections it's about 50%. We need an organisational theorist to collect the data and point out in public that the wrong thing is being measured.)
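
To illustrate why those two figures needn't contradict each other, here's a back-of-the-envelope sketch with invented numbers (assumptions for illustration, not DB data): suppose each leg counts as 97% "on-time" under DB's threshold, but independently has, say, a 12% chance of being late enough to break a typical connection. Then a journey with two connections completes only about 77% of the time, and one with three connections about 68%, while every individual train still reports 97%.

    # Illustrative sketch only: the 12% figure is an assumption, not measured data.
    def journey_completion_probability(p_break, n_connections):
        """Probability that every connection on a journey is made, assuming each
        connection is independently broken with probability p_break."""
        return (1.0 - p_break) ** n_connections

    for n in (1, 2, 3):
        print(n, round(journey_completion_probability(0.12, n), 2))
    # 1 0.88, 2 0.77, 3 0.68

The numbers are made up; the point is that the published statistic measures each train's delay against a threshold, not whether the passenger's journey succeeds.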

And then there is the railway law. The railway law imposes certain procedures which often have a hierarchical form because, well, the law is about who is responsible for what and that imposes hierarchy. (That also helps explain why STAMP wasn't really helpful in that one example I gave; to follow its recommendations would mean having to change a half-dozen laws because of one accident which luckily didn't hurt anyone too badly. First, that's just not going to happen. Second, the legal justification for changing a law because of an accident would be the Counterfactual Test, and so a procedural (and therefore legal) change which STAMP recommends but which does not pass the CT just couldn't be implemented. There were quite a few of those. It's not really surprising that people working in such an environment would prefer an analysis method which imposes the CT.) 

I thought I said above that I'd stop? Now, I really will.

PBL

Prof. Peter Bernard Ladkin, University of Bielefeld and Causalis Limited

