[SystemSafety] "Reliability" culture versus "safety" culture

SPRIGGS, John J John.SPRIGGS at nats.co.uk
Mon Jul 29 16:00:50 CEST 2013


Many safety people use James Reason's model that involves layers of cheese with holes in.  The cheese represents safety barriers; the idea is that when the holes line up, you can poke grissini through and cause an accident.  Many of these same people will quite happily perform an FMEA, which proceeds by assuming that only the hole under consideration exists and all the others are blocked.  Good for assessing local effects, but not good for identifying hazards, let alone assessing their likelihoods...
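
To make the contrast concrete, here is a minimal sketch in Python; the barrier figures are
invented and the barriers are assumed independent, which real defences rarely are. An FMEA-style
pass looks at each hole on its own, whereas the accident path needs every hole to line up.

# Illustrative sketch only: single-fault FMEA view versus the "holes line up" view.
# Barrier failure probabilities are hypothetical; independence is assumed for simplicity.

barrier_hole_probabilities = [0.01, 0.02, 0.005]  # hypothetical per-demand values

# FMEA-style view: consider each hole on its own, with all other barriers intact.
for i, p in enumerate(barrier_hole_probabilities, start=1):
    print(f"Barrier {i} alone fails with probability {p} -> local effect only")

# Swiss Cheese view: an accident path exists only when every hole lines up.
p_accident = 1.0
for p in barrier_hole_probabilities:
    p_accident *= p
print(f"All holes aligned (independence assumed): {p_accident:.2e}")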

John

-----Original Message-----
From: systemsafety-bounces at lists.techfak.uni-bielefeld.de [mailto:systemsafety-bounces at lists.techfak.uni-bielefeld.de] On Behalf Of Peter Bernard Ladkin
Sent: 29 July 2013 13:38
To: systemsafety at techfak.uni-bielefeld.de
Subject: [SystemSafety] "Reliability" culture versus "safety" culture

As a few of you know, I have recently been involved in what appears to be a technical-culture clash
between "reliability" and "safety" engineers, which has led, and continues to lead, to
organisational problems, for example over the scope of technical standards. Some suspect that such
a culture clash is fairly entrenched. I would like to identify as many specific technical
differences as I can. It is moderately important to me that the expression of such differences
attain universal assent (that is, from both cultures as well as any others...).

Here are some I know about already.

1. Root Cause Analysis. Reliability people set store by methods such as Five Whys and Fishbone
Diagrams, which people analysing accidents or serious incidents consider hopelessly inadequate (in
Nancy's word, "silly").

2. Root Cause Analysis. Reliability people often look to identify "the" root cause of a quality
problem, and many methods are geared to identifying "the" root cause. Accident analysts are
(usually) adamant that there is hardly ever (in the words of many, "never") just *one* cause which
can be called root.

3. FMEA. With today's complex systems there are considerable questions about how to calculate
maintenance cycles. Even a military road vehicle can nowadays be considered a "system of systems",
in that the system-subsystem hierarchy is quite deep. Calculating maintenance cycles requires
obtaining some idea of the MTBFs of components. Components may be simple, or line-replaceable
units, or units that require shop maintenance. Physical components may or may not correspond to
functional blocks (there is a widely used notation, Functional Block Diagrams or FBDs). There are
ways of calculating MTBFs and maintenance procedures for components hierarchically arranged in
FBDs. These may well control the complexity well enough to determine the requirements for regular
maintenance.
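
To illustrate the kind of roll-up calculation meant here, a minimal sketch in Python follows. It is
not any particular standard's method; the block names and MTBF figures are invented, and the usual
constant-failure-rate (exponential) assumption is made.

# Minimal sketch: rolling up MTBFs through a functional-block hierarchy,
# assuming constant failure rates. All names and figures are hypothetical.

def series_mtbf(mtbfs):
    """Series blocks: failure rates add, so MTBF = 1 / sum(1/MTBF_i)."""
    return 1.0 / sum(1.0 / m for m in mtbfs)

def parallel_mtbf_two_identical(mtbf):
    """Two identical redundant blocks, exponential failures: MTBF = 3/(2*lambda)."""
    lam = 1.0 / mtbf
    return 3.0 / (2.0 * lam)

# Hypothetical subsystem: sensor and processor in series, plus a duplicated actuator.
sensor, processor, actuator = 20_000.0, 50_000.0, 10_000.0  # hours
subsystem = series_mtbf([sensor, processor, parallel_mtbf_two_identical(actuator)])
print(f"Roll-up MTBF: {subsystem:.0f} h")  # feeds the regular-maintenance interval estimate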

However, if functional failures contribute to hazards, these methods, which are approximate, do not
appear to work well for assessing the likelihoods of hazards arising. (This is true even for those
hazards which arise exclusively as a result of failures.)

4. FMEA. People who work with FMEA for reliability goals are not so concerned with completeness.
Indeed, I have had reliability-FMEA experts dismiss the subject when I brought it up, claiming it to
be "impossible". However, people who use FMEA for the analysis of failures of safety-relevant
systems and their hazards must be very concerned, as a matter of due diligence, that their analyses
(their listing of failure modes) as far as possible leave nothing out (in other words, that they are
as complete as possible).

5. Testing. Safety people generally know (or can be presumed to know) of the work which tells them
that assessing software-based systems for high reliability through testing cannot practically be
accomplished if the reliability to be demonstrated is better than roughly one failure in 10,000 to
100,000 operational hours (e.g., Littlewood/Strigini and Butler/Finelli, both 1993).
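
For a rough sense of why, the standard zero-failure demonstration bound (not a formula taken from
either paper) says that the failure-free test time needed to support a failure-rate claim of lambda
with confidence C is T = -ln(1 - C) / lambda; a short Python sketch:

# Back-of-the-envelope sketch of the testing limit; figures chosen to match the
# 10,000 to 100,000 operational-hour range mentioned above.

import math

def required_test_hours(lambda_per_hour, confidence):
    # Zero failures observed: exp(-lambda*T) <= 1 - C  gives  T >= -ln(1 - C)/lambda.
    return -math.log(1.0 - confidence) / lambda_per_hour

for lam in (1e-4, 1e-5):  # one failure per 10,000 and per 100,000 hours
    t = required_test_hours(lam, 0.99)
    print(f"lambda = {lam:g}/h at 99% confidence needs ~{t:,.0f} failure-free test hours")
# Pushing lambda towards the 1e-7 to 1e-9 per hour region makes the test time infeasible.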

Reliability people, by contrast, believe that statistical analysis of testing is practical and
worthwhile. For example, from a paper in the 2000 IEEE R&M Symposium:
 > Abstract: When large hardware-software systems are run in, or acceptance testing is carried out,
 > a problem is when to stop the test and deliver/accept the system. The same problem exists when a
 > large software program is tested with simulated operational data. Based on two theses from the
 > Technical University of Denmark, the paper describes and evaluates 7 possible algorithms. Of
 > these algorithms, the three most promising are tested with simulated data. 27 different systems
 > are simulated, and 50 Monte Carlo simulations are made on each system. The stop times generated
 > by the algorithms are compared with the known perfect stop time. Of the three algorithms, two are
 > selected as good. These two algorithms are then tested on 10 sets of real data. The algorithms
 > are tested with three different levels of confidence. The numbers of correct and wrong stop
 > decisions are counted. The conclusion is that the Weibull algorithm with a 90% confidence level
 > takes the right decision in every one of the 10 cases.
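
The abstract does not reproduce the Weibull algorithm itself, so what follows is only an
illustrative sketch of a stop rule of this general flavour, using the standard Crow-AMSAA
(power-law) model rather than the paper's method; the failure times and target intensity are
invented.

# Generic "when to stop testing" sketch: fit a power-law NHPP (Crow-AMSAA) to the
# failure history and stop when the estimated current intensity drops below a target.

import math

def crow_amsaa_intensity(failure_times, total_time):
    """MLE of the current failure intensity under a power-law NHPP (time-terminated test)."""
    n = len(failure_times)
    beta = n / sum(math.log(total_time / t) for t in failure_times)
    lam = n / total_time ** beta
    return lam * beta * total_time ** (beta - 1)

failure_times = [12.0, 35.0, 80.0, 190.0, 410.0]  # hypothetical hours into the run-in
total_time = 600.0                                 # hypothetical test time so far
target = 0.005                                     # hypothetical acceptable failures/hour

rho = crow_amsaa_intensity(failure_times, total_time)
print(f"Estimated current intensity: {rho:.4f} failures/hour")
print("Stop testing" if rho <= target else "Keep testing")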

6. ... and onwards. I would like to collect as many examples as possible of such differences. Do
some of you have other contrasts to contribute? I would like to share them with colleagues, and I
intend to attribute each example to its contributor if that is OK. (Examples whose contributors
wish to remain anonymous will, of course, be kept anonymous.)

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de



