[SystemSafety] Fault, Failure and Reliability Again (short)

Wed Mar 4 08:19:27 CET 2015

Matthew,

Fair point (and interesting example).

Michael.

From: Matthew Squair [mailto:mattsquair at gmail.com] 
Sent: 04 March 2015 07:14
To: M.Pont at safetty.net
Cc: systemsafety at lists.techfak.uni-bielefeld.de
Subject: Re: [SystemSafety] Fault, Failure and Reliability Again (short)

Michael,

Nope, it was a software fault. The fault was introduced by a modification to improve capability; the effect was identified from field feedback; advice to operators was issued (albeit not very helpful); and a fix was dispatched but didn't get there until the day after the failure.

The tech background is that the registers were 24 bits wide, which bounds the accuracy of converting the integer time register to floating point and thereby gives you a propagation error. Normally this doesn't matter because you are working with t_next relative to t_now, so there's negligible error.

The modification called a new algorithm to provide greater range gate accuracy, but when it was patched in the coders unfortunately missed one instance. So we have a situation where the old value is subtracted from the new value, and there's now an error which is proportional to both track velocity and system up time. Run the system for long enough, say 16-20 hours, and the values diverge sufficiently to cause a system failure (e.g. track loss).
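
A rough sketch of how such a drift accumulates, in Python. The figures are assumptions taken from commonly cited analyses of the incident rather than from this thread: a 0.1 s clock tick, 23 fractional bits kept when the tick is stored in the 24-bit register, and a target speed of 1676 m/s. It models only the simplified case where one time value carries the truncation error and the other does not, so the error no longer cancels.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumption: fractional bits kept in the 24-bit register
TICK = Fraction(1, 10)    # assumption: the clock counts tenths of a second


def truncated_tick(bits=FRACTION_BITS):
    """Return 0.1 chopped (not rounded) to the given number of fractional bits."""
    scale = 2 ** bits
    return Fraction(int(TICK * scale), scale)


def clock_error_seconds(uptime_hours, bits=FRACTION_BITS):
    """Accumulated time error after uptime_hours of counting truncated ticks."""
    ticks = uptime_hours * 3600 * 10
    return float(ticks * (TICK - truncated_tick(bits)))


def range_gate_shift_m(uptime_hours, velocity_m_s):
    """Range error when only one of the two time values carries the drift."""
    return clock_error_seconds(uptime_hours) * velocity_m_s


if __name__ == "__main__":
    for hours in (1, 8, 20, 100):
        err = clock_error_seconds(hours)
        shift = range_gate_shift_m(hours, 1676)  # assumed target speed in m/s
        print(f"{hours:4d} h  clock error {err:.4f} s  range shift {shift:.0f} m")
```

Under these assumptions the error grows linearly with up time and with velocity, reaching roughly a third of a second after 100 hours. The point at which the range gate shift becomes large enough to lose track depends on the gate width, which is not modelled here.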

There's a GAO report on the incident (GAO/IMTEC-92-26); there's also a 2005 Defence Science Board task force report on Patriot performance (20301-3140).

On Wed, Mar 4, 2015 at 4:24 PM, Michael J. Pont <M.Pont at safetty.net> wrote:

Matthew,

Clock drift is (in my view) a hardware issue – the software didn’t change.

(The impact of EMI, for example, may even change the software instructions – but this too is a hardware issue.)

Michael.

Michael J. Pont

SafeTTy Systems Ltd

From: systemsafety-bounces at lists.techfak.uni-bielefeld.de [mailto:systemsafety-bounces at lists.techfak.uni-bielefeld.de] On Behalf Of Matthew Squair
Sent: 04 March 2015 02:43
To: Nick Tudor; systemsafety at lists.techfak.uni-bielefeld.de
Subject: Re: [SystemSafety] Fault, Failure and Reliability Again (short)

An example of software cycle (time) driven failure.

The Patriot missile systems suffered from clock drift. Despite that, for an expected operational use of less than X hours, performance was quite adequate. During operations the operators ran the system well past the critical X hours and the clock drift shifted the radar's range gate sufficiently to ensure failure.

So failure was in this instance driven by time not inputs.

On Wed, Mar 4, 2015 at 8:50 AM, Nick Tudor <njt at tudorassoc.com> wrote:

Hi Peter 

I have had some further thoughts wrt the reliability argument you present in the blog and have presented previously. Your proposition is as follows, I believe:

 "Software S exhibits reliability R when subject to input distribution D."

Software in this statement can be replaced with 'system' or 'structure' and I believe it would still hold, because the hardware defect is subject to an input distribution and the hardware may or may not fail. However, there is a crucial factor omitted from the above argument which does not hold for software: time.

For example, a wing has a huge input range (in the continuous domain) and any defect may not necessarily cause a failure. It can, of course, fail if subject to stress outside the design range, and this test is done by manufacturers to ensure that the design is resistant, within acceptable limits, to internal design weaknesses and material defects. Meanwhile, back to operations. Over time, the same distribution may exacerbate the defect to the point where a failure occurs. The same input range therefore did not always cause a failure, only after sufficient build-up of stress (or whatever) over time allowed the defect to become a fault. We can measure this and attribute a mean time between failures. The other way of thinking about this is that we know that the system (electronics as well as structures), in the given environment, will fail at some point.

For software, the time-based element is irrelevant: if the circumstances required to hit the bug occur, it manifests itself as unexpected behaviour at the system level, and it will occur every time. There is no wear-out mechanism for software (as noted by Michael earlier). The distribution D will always cause a system failure at the specific point of the defect in the software; the software does not fail, the system does. It therefore makes no sense to talk about reliability of software because of the irrelevance of the time-based aspect. The other way of thinking about this is that if the specific circumstances that would trigger the defect do not occur, then the software will always work as expected.

The riposte to this might be to argue that the reliability of software can therefore be calculated as the probability of a set of circumstances in distribution D that would cause the software defect to have a system effect. However, as you don’t know what the defect is (you would have removed it, if you knew of it), nor its effect, nor the set of circumstances, the result of this exercise is something of a guess and hence of dubious measurable value.
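
A minimal sketch of that riposte, in Python. Everything in it is hypothetical: the routine `scale_reading`, its seeded divide-by-zero defect at input 200, and the two input distributions. It shows only that the same deterministic defect gives different failure probabilities under different distributions D; exhaustive enumeration of this kind is rarely feasible for a real input space.

```python
from fractions import Fraction


def scale_reading(x):
    """Hypothetical routine with one seeded defect: it fails when x == 200."""
    return 1000 // (x - 200)


def failure_probability(distribution):
    """Sum the probability mass of every input on which the routine fails."""
    p_fail = Fraction(0)
    for x, p in distribution.items():
        try:
            scale_reading(x)
        except ZeroDivisionError:
            p_fail += p
    return p_fail


# Two hypothetical input distributions D over integer inputs.
uniform_d = {x: Fraction(1, 256) for x in range(256)}    # all of 0..255 equally likely
low_range_d = {x: Fraction(1, 100) for x in range(100)}  # only 0..99 ever occur

if __name__ == "__main__":
    for name, d in (("uniform", uniform_d), ("low range", low_range_d)):
        print(name, "probability of failure per demand:", failure_probability(d))
```

The routine fails every time it is given 200 and never otherwise; the probability attached to it comes entirely from how often D presents that input.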

It may also be possible to talk about the reliability of defect detection techniques and hence make some claim about the subsequent defect freedom of software. For instance, there are bug hunting tools that claim to find certain classes of bugs. These tools rarely claim to have found all bugs of a certain class, all of the time. So one might be able to claim that there are fewer bugs. But all that has done is change from one unknown level of bugginess to another (probably lower, but ...) level of bugginess. So, again, this is of dubious measurable value.

What can be said, I think, is that every defect removal technique has a set of assumptions that have to be validated (by humans) and that therefore there is a level of uncertainty. The technique itself, whether review, analysis (static or otherwise) or test, has a limit, and the boundaries of acceptability are set in standards such as DO-178, where a level of activity is agreed to be undertaken based on a set of Objectives that have to be met in order to support a System Design Assurance Level (DAL). In this way, the reliability question is readily, acceptably and evidently addressed.

Regards

Nick Tudor

Tudor Associates Ltd

Mobile: +44(0)7412 074654

www.tudorassoc.com

77 Barnards Green Road

Malvern

Worcestershire

WR14 3LR
Company No. 07642673

VAT No:116495996

www.aeronautique-associates.com 

On 3 March 2015 at 07:11, Peter Bernard Ladkin <ladkin at rvs.uni-bielefeld.de> wrote:

I had some private discussion with someone here who claims software cannot fail. I first heard this
trope a quarter century ago, and I am informed indirectly by another colleague that it is still rife
in certain critical-engineering areas. I address it this morning in a blog post at
http://www.abnormaldistribution.org/2015/03/03/fault-failure-reliability-again/

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Je suis Charlie
Tel+msg +49 (0)521 880 7319   www.rvs.uni-bielefeld.de

_______________________________________________
The System Safety Mailing List
systemsafety at TechFak.Uni-Bielefeld.DE

-- 

Matthew Squair

MIEAust CPEng

Mob: +61 488770655

Email: MattSquair at gmail.com

Website: www.criticaluncertainties.com
