[SystemSafety] Component Reliability and System Safety [was: New Paper on MISRA C]

Peter Bernard Ladkin ladkin at causalis.com
Fri Sep 14 09:03:25 CEST 2018


Paul has raised the issue of component reliability, for example, the dependability that may accrue
to programs written to the MISRA C coding standard and checked with static analysis tools aimed at that
standard, and what that might have to do with system safety, referencing Nancy Leveson's book
Engineering a Safer World.

There is quite a history of discussion on this list and its York predecessor about STAMP/STPA and
other analysis methods. Here is something from 15 years ago (July 2003) comparing STAMP (as it then
was) with WBA in the analysis of a railway derailment accident
http://ifev.rz.tu-bs.de/Bieleschweig/pdfB2/WBA_STAMP_Vergl.pdf  (This was at the second Bieleschweig
Workshop. The list of Bieleschweig Workshops and slides and documents from most of the contributions
may be found at https://rvs-bi.de/Bieleschweig/ )

What the authors found was that many of the supposedly "causal factors" identified through the STAMP
analysis were not what they and others wished to identify as causal factors. For example, the German
railways DB had and have a hierarchical command structure in which information and requirements
flow from "above" to "below" and often lack the feedback loops from "below" to "above" which STAMP
at the time required. The lack of loops was, at the time, classified by STAMP as "causal factor(s)"
according to Nancy's "new" conception of causality.

Except that the DB could have said, with some justification, "we have been running a railway like
this for 150 years and accidents of this sort are few". And it is not clear that modifying the
organisational structure to implement those feedback loops would be advisable. It would be a change
of fundamental culture, and not all such changes have the desired effect. I was involved in one such
translation of Anglo organisational culture into German organisational culture, with the
introduction of university Bachelor's/Master's degree programs to replace the traditional German
Diplom programs. The result has been very mixed. I think the main benefit is that it makes
transfer of course credits across international boundaries easier, but lots of other things of
local (Uni BI) importance are worse than they were.

The point I wish to make with this example is that STAMP (as it then was) embedded a conception of
organisational culture that was and is by no means universal. And such an embedding is going to
mislead analysts when the culture being investigated does not match the preconception. As our Brühl
analysts found out.

I prefer, and have followed, the alternative approach of developing tools to target certain tasks in
safety analysis: WBA, OHA, OPRA, Why-Because Graphs, Causal Fault Graphs. Nancy may wish to claim,
as Paul quotes, that
> "Accidents are complex processes involving the entire socio-technical system. Traditional event-chain models cannot describe this process adequately.

(I might counter that weird concepts of causality cannot describe causal factors adequately, citing
the Brühl analysis. It would be an equally trivial intervention. So I won't.)

She once described WBA as a "chain of events model" in order to conclude it couldn't describe
necessary components of an accident. I responded that the only appropriate word in that description
was "of" - WBGs are almost never a chain; almost never include only events, and are not models but
descriptions (showing states and events and processes and the relation of causality between them).
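
For readers who have not seen one: a WBG is a directed acyclic graph whose nodes are states, events,
processes (and other factor kinds) and whose edges record the necessary-causal-factor relation. A
minimal sketch of that structure in C follows; the type and field names are mine, chosen purely for
illustration, and are not taken from any WBA tool.

    #include <stddef.h>

    /* Illustrative only: a WBG vertex. Names are mine, not from any WBA tool. */
    enum factor_kind { FACTOR_STATE, FACTOR_EVENT, FACTOR_PROCESS };

    struct wbg_node {
        const char        *description;  /* what happened or held, in plain words     */
        enum factor_kind   kind;         /* states and processes, not only events     */
        struct wbg_node  **causes;       /* necessary causal factors of this node     */
        size_t             n_causes;     /* usually more than one: a DAG, not a chain */
    };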

I have also not seen any STAMP example which could not have arisen in similar form through WBA
performed by an analyst sensitive to sociotechnical features. What STAMP brings, I would suggest, is
a certain structure to the socio/organisational/governmental/legal aspects of accident and hazard
analysis which WBA & OHA do not explicitly include. There is value in that, of course, although
one must beware, as above, of implicit assumptions which do not match the analysis task at hand.

You could also use WBA, in particular the Counterfactual Test (would the outcome still have occurred
had this factor not occurred?), as part of STAMP if you wish. Nancy was never receptive to that idea
- WBGs were "chain of events models" and she "knew" that couldn't work.

The original STAMP idea was prompted by Jens Rasmussen's Accimaps, I understand. Andrew Hopkins used
these quite successfully in various studies of accidents in Australia. Hopkins's analyses neatly
stratified the causal factors in a similar way to that in which STAMP does, but were typically far
less detailed than a WBG of the same event. The stopping rule and the level of abstraction (and the
abstraction processes used by the analyst) are decisive components of any analysis. (We deal with
the abstraction level in OPRA analysis; the process of deriving a stopping rule is less well
researched.)

The thing inhibiting almost any sophisticated analysis of the STAMP or WBA variety is that there are
a lot of system-failure incidents in which the sociological/organisational components become clear
but people have a vested interest in ignoring them. We have had people stop using WBA because it
highlighted too much that they couldn't change. STAMP would encounter a similar issue. For such
applications, one can imagine an "organisation-relative" WBA. An "organisation-relative" STAMP would
negate its purpose, I imagine.

An example. Sometime over 15 years ago, one of the systems on our RVS network, then a physical
subnetwork of the Uni Bielefeld campus network, was penetrated. Somebody working late at night
noticed a SysAdmin on the system and wanted to chat. Instead, odd things started to happen and the
SysAdmin went away. Telephone calls were made. It was an impostor, who in retreating had trashed 3GB
of log files and other material. My guys eventually analysed much of the trashed material in about a
person-month of forensics and figured out who the impostor was.

The day after the evening incident, the guys called me. The main question for me was how the
impostor had obtained SysAdmin account credentials. The (true) SysAdmin often worked from home. He
came in over the telephone lines to the Uni network and pursued a login process over the Uni
Ethernet. The chances that anyone was listening in on the telephone data transfer were minimal; the
login must have been intercepted within the Uni Ethernet by someone who knew weaknesses in the login process code
and was able to read his login name and password. At the time we had, I thought, a rely-guarantee
arrangement with the Uni network admin that they could give us a network infrastructure which
excluded non-authorised users. Our network security policy was explicitly based on this
rely-guarantee arrangement. I called up their administrator on a Saturday morning: "your security
has been breached; we found out last night". The response was: "oh, well, we can't do everything. We'll
look at stuff next week, I guess. You should take better precautions."

I hope it is obvious that a breach of a rely-guarantee arrangement was a causal factor in the
incident. We thought that arrangement existed, and our colleague in the Uni SysAdmin conveniently
forgot it. It is hard to see that a STAMP analysis would have changed his mind.
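
Reading the incident back in rely-guarantee terms (my reconstruction after the fact, not a
specification we had written down at the time):

    Rely R:      the campus network infrastructure excludes non-authorised users
                 (the Uni network administration's part).
    Guarantee G: given R, the RVS security policy suffices to protect the RVS systems
                 (our part).

The impostor's presence on our subnet shows R did not hold, and with R gone the guarantee is void -
which is the breach-of-arrangement causal factor just described.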

Of course, nowadays it would be daft to think any normal network administration could provide such a
secure communication environment.

On to some comments of Clayton:

On 2018-09-13 20:22, clayton at veriloud.com wrote:
>  Component reliability, where bugs are found, fixed, and test cases are passed, do not account for the insidious failures that occur in the real world where "component interaction" can be so complex, no amount of testing can anticipate it. 

There are a bunch of things mixed in here. Let me take one.

>  Component reliability do[es] not account for the insidious failures that occur in the real world where "component interaction" can be so complex .......

Civil air transport, which has progressively become one of the safest of complex technologies, has
been built on the principle (and the regulation) of rigorously pursued component reliability, and still is.

Let me take another:

>  [Procedures] where bugs are found, fixed, and test cases are passed, do not account for the
> insidious failures that occur in the real world where .... no amount of testing can anticipate it.

That is a basic statistical observation that is, or should be, part of any system safety engineering
course. The usual references are Butler & Finelli, IEEE TSE 1993 (for a frequentist interpretation)
and Littlewood and Strigini, CACM 1993 (for a Bayesian interpretation). If you are a
frequentist/Laplacian, it all follows from simple arithmetical observations concerning the
exponential distribution.
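
To spell the arithmetic out: with exponentially distributed times to failure at rate lambda, the
probability of seeing no failure in T hours of operational testing is exp(-lambda*T). So T
failure-free hours let you claim lambda <= lambda0 at confidence (1 - alpha) only when
exp(-lambda0*T) <= alpha, that is T >= -ln(alpha)/lambda0. The few lines of C below just evaluate
that inequality; they are my illustration of the point, not anything taken from the cited papers.

    #include <math.h>
    #include <stdio.h>

    /* Failure-free test time needed to claim a failure rate no worse than
       lambda0 (per hour) with confidence (1 - alpha), assuming exponentially
       distributed times to failure. Illustrative arithmetic only. */
    static double hours_needed(double lambda0, double alpha)
    {
        return -log(alpha) / lambda0;
    }

    int main(void)
    {
        /* 1e-9 failures/hour at 99% confidence: about 4.6e9 hours, i.e.
           roughly half a million years of failure-free testing. */
        printf("%.3g hours\n", hours_needed(1.0e-9, 0.01));
        return 0;
    }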

> .. As Leveson and many have stated before, most failures arise of out system requirements flaws due to lack of rigor (e.g. poor hazard analysis and mitigation). 

It indeed seems to be the case that most mission failures of *carefully developed* systems arise out
of some mismatch between requirements and actual operating conditions (which I like to trace back to
Robyn Lutz's study for NASA, because her 1993 paper is still on-line and easily accessible). It is
also likely true (if not a truism) that most failures arise from flaws due to a lack of rigour
somewhere. But I don't know of any reliable study that has shown that most system failures arise out
of a *lack of rigour in system requirements*. Almost none of the well-known cybersecurity incidents
such as Stuxnet, Triton/TriSis, "goto fail", Heartbleed or NotPetya arose out of a lack of rigour in
system requirements, as far as we know. I would venture to suggest that is true of most of the
advisories coming out of ICS-CERT; certainly the two significant ones of which I was advised this
week (reading between the lines of one of those, it seems a major reliable industrial communication
system had a protocol susceptible to deadlock).

> In this email thread, its interesting nobody seems to be commenting on the subject’s paper, have you read it? 

Um, yes.

> It is about .....

My reading is that it is about why people use C for programming small embedded systems and why
following the MISRA C coding standard may allow such development to be more dependable than it
otherwise would be.

Another way to allow such development to be more dependable than it otherwise would be is to switch
to SPARK and then use a back-end C generator if your code has to be in C.

Or develop in SCADE and use its C code generator.

Or .........

>> [Paul Sherwood, I think] Why is MISRA C still considered relevant to system safety in 2018?

(Banal question? Banal answer!) Because many people use C for programming small embedded systems and
adhering to MISRA C coding guidelines enables the use of static analysis tools which go some way
(but not all the way) to showing that the code does what you have said you want it to do.
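
For concreteness, here is the flavour of thing such tools catch. The fragment below is legal C and
compiles, but contains constructs which MISRA-style guidelines restrict and which a static analyser
aimed at such guidelines will flag. I am paraphrasing the general tenor of the guidance from memory,
not quoting rule numbers, and the code is written solely for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* Deliberately poor code: legal C that a MISRA-oriented analyser would flag. */
    uint8_t scale(int16_t raw, bool clamp)
    {
        uint8_t out = 0u;
        if (clamp = false) {     /* assignment used where '==' was intended;
                                    'clamp' is silently overwritten            */
            out = 255u;
        } else {
            out = raw / 4;       /* implicit signed-to-unsigned narrowing:
                                    a negative 'raw' wraps around modulo 256   */
        }
        return out;
    }

Rewriting this with an explicit comparison, an explicit range check and an explicit conversion is
trivial here; the value of tool-checked guidelines is that such trivia do not slip through in the
ten-thousandth line.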

PBL

Prof. Peter Bernard Ladkin, Bielefeld, Germany
MoreInCommon
Je suis Charlie
Tel+msg +49 (0)521 880 7319  www.rvs-bi.de




