[SystemSafety] Fault, Failure and Reliability Again (short)

DREW Rae d.rae at griffith.edu.au
Wed Mar 4 11:15:10 CET 2015


In agreement with Matthew: talking about time in the Patriot problem can
get confusing, because "time" probably was a formal input, but it was the
"elapsed time since initiation" that caused the problem. Elapsed time was
no more an input than "current phase of the moon" or "amount of vibration
the chips were experiencing". The underlying point holds, that
software _can_ exhibit degraded performance over time. Saying that the
degraded performance is "caused by hardware" isn't an escape clause either.
I might as well say that my wing _design_ doesn't experience fatigue,
because the wing failure is caused by the hardware.
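
To make the elapsed-time point concrete, here's a minimal sketch in Python
of the mechanism usually reported for the Patriot case: a 0.1 s clock tick
chopped to a fixed-point constant, so the tracking error grows with elapsed
time even though the code computes exactly the same function on every
cycle. The fraction width and target speed below are my illustrative
assumptions, not the fielded system's values.

    # Sketch of the accumulation mechanism commonly reported for the Patriot
    # range-gate problem; constants are illustrative reconstructions only.
    FRAC_BITS = 23                                  # assumed fixed-point fraction width
    TENTH = int(0.1 * 2**FRAC_BITS) / 2**FRAC_BITS  # 0.1 s tick, chopped: ~0.0999999
    ERROR_PER_TICK = 0.1 - TENTH                    # ~9.5e-8 s lost per tenth of a second

    def clock_error(hours):
        """Accumulated timing error after `hours` of continuous operation."""
        ticks = hours * 3600 * 10                   # the clock ticks every 0.1 s
        return ticks * ERROR_PER_TICK

    for h in (8, 20, 100):
        err = clock_error(h)
        # A target at ~1700 m/s travels this far during the timing error,
        # which is roughly how far the range gate is displaced.
        print("%4d h: clock error %.3f s, gate shift ~%.0f m" % (h, err, err * 1700))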

That's my main concern with software reliability calculations. Getting the
maths right is like performance optimising software without profiling it
first. The maths was never the weak point - it's the assumptions necessary
to make the maths even relevant. As the hardware ages, both the input
profile to the software and the running environment of the software change.
Do they change significantly enough to invalidate the calculations? That's
an empirical question that can't be answered by justifying how great the
calculations are.
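
As a toy illustration of that empirical question (all numbers invented):
the same program, with the same defect, gets a very different "reliability"
figure once the input profile drifts, which is exactly what ageing hardware
can do to the inputs its software sees.

    # Toy model (all numbers invented): a fixed defect region, two input profiles.
    import random

    def software(x):
        """Toy program: misbehaves only for inputs in a narrow defect region."""
        return not (0.98 < x < 1.00)        # True = behaves as intended

    def estimated_reliability(draw_input, trials=200000):
        """Monte Carlo estimate of P(correct behaviour) under a given input profile."""
        ok = sum(software(draw_input()) for _ in range(trials))
        return ok / trials

    random.seed(1)
    assumed = lambda: random.gauss(0.5, 0.1)   # profile used for the assessment
    drifted = lambda: random.gauss(0.8, 0.1)   # profile once the hardware has aged

    print("reliability under assumed profile: %.5f" % estimated_reliability(assumed))
    print("reliability under drifted profile: %.5f" % estimated_reliability(drifted))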

My safety podcast: disastercast.co.uk
My mobile (from October 6th): 0450 161 361

On 4 March 2015 at 10:01, Matthew Squair <mattsquair at gmail.com> wrote:

> I kind of disagree. In the wing fatigue case that you gave earlier, would
> you also say that 'time' was an input?
>
> I don't think time per se is an input in either the software or the wing
> example. In both it's actually use cycles that are pertinent, and there's
> only an arbitrary relation with time. We can work out as many cycles per
> hour or per flight as you like, but it's still cycles and the relation is
> arbitrary.
>
> In both cases, run through the requisite number of cycles over some period
> of time and bad things will suddenly happen. To an external observer both
> exhibit a catastrophic wear-out failure mode at the system boundary, and
> we can use Laprie's taxonomy to describe both cases.
>
> On Wed, Mar 4, 2015 at 4:24 PM, Nick Tudor <njt at tudorassoc.com> wrote:
>
>> Hi Matt
>>
>> Firstly, the input in the case of your example was time. Secondly, the
>> reliability of the system was great and the software (apparently) worked
>> really well under its expected conditions of use, right up to the point
>> where it went beyond those conditions, at which point every Patriot
>> system would have failed. So is the system reliable? Yes, if you use it
>> as intended; not necessarily if you don't. Finally, if an input is time,
>> then there needs to be verification that time-based effects have been
>> accounted for. This is required by DO-178C and DO-278A.
>>
>>
>> On Wednesday, 4 March 2015, Matthew Squair <mattsquair at gmail.com> wrote:
>>
>>> An example of software cycle (time) driven failure.
>>>
>>> The Patriot missile system suffered from clock drift. Despite that, for
>>> an expected operational use of less than X hours, performance was quite
>>> adequate. During operations the operators ran the system well past the
>>> critical X hours, and the clock drift shifted the radar's range gate
>>> sufficiently to ensure failure.
>>>
>>> So failure was in this instance driven by time not inputs.
>>>
>>>
>>> On Wed, Mar 4, 2015 at 8:50 AM, Nick Tudor <njt at tudorassoc.com> wrote:
>>>
>>>> Hi Peter
>>>>
>>>>
>>>>
>>>> I have had some further thoughts wrt the reliability argument you
>>>> present in the blog and have done previously.  Your proposition is as
>>>> follows, I believe:
>>>>
>>>>
>>>>
>>>>  "Software S exhibits reliability R when subject to input distribution
>>>> D."
>>>>
>>>>
>>>>
>>>> 'Software' in this statement can be replaced with 'system' or 'structure'
>>>> and I believe it would still hold, because the hardware defect is subject
>>>> to an input distribution and may or may not fail.  However, there is a
>>>> crucial factor omitted from the above argument which does not hold for
>>>> software: time.
>>>>
>>>>
>>>>
>>>> For example, a wing has a huge input range (in the continuous domain)
>>>> and any defect may not necessarily cause a failure.  It can, of course,
>>>> fail if subjected to stress outside the design range, and this testing is
>>>> done by manufacturers to ensure that the design tolerates internal design
>>>> weaknesses and material defects within acceptable limits.  Meanwhile, back
>>>> to operations.  Over *time*, the same distribution may exacerbate the
>>>> defect to the point where a failure occurs.  The same input range therefore
>>>> did not *always* cause a failure; only after sufficient build-up of
>>>> stress (or whatever) over *time* did the defect become a
>>>> fault.  We can measure this and attribute a mean *time* between
>>>> failures.  The other way of thinking about this is that we know that the
>>>> system (electronics as well as structures), in the given environment, will
>>>> fail at some point.
>>>>
>>>>
>>>>
>>>> For software, the time-based element is irrelevant: if the
>>>> circumstances required to hit the bug occur, it manifests itself as
>>>> unexpected behaviour at the system level, and it will do so *every*
>>>> time.  There is no wear-out mechanism for software (as Michael noted
>>>> earlier).  The distribution D will always cause a system failure at the
>>>> specific point of the defect in the software; the software does not fail,
>>>> the system does.  It therefore makes no sense to talk about reliability of
>>>> software, because the time-based aspect is irrelevant.  The other
>>>> way of thinking about this is that if the specific circumstances that would
>>>> trigger the defect do not occur, then the software will *always* work as
>>>> expected.
>>>>
>>>>
>>>>
>>>> The riposte to this might be to argue that the reliability of software
>>>> can therefore be calculated as the probability that distribution D produces
>>>> a set of circumstances which cause the software defect to have a system
>>>> effect.  However, as you don't know what the defect is (you would have
>>>> removed it if you knew of it), nor its effect, nor the set of circumstances,
>>>> the result of this exercise is essentially a guess and hence of dubious
>>>> measurable value.
>>>>
>>>>
>>>>
>>>> It may also be possible to talk about the reliability of defect
>>>> detection techniques and hence make some claim about the subsequent defect
>>>> freedom of software.  For instance, there are bug-hunting tools that claim
>>>> to find certain classes of bugs.  These tools rarely claim to have found
>>>> all bugs of a certain class, all of the time.  So one might be able to
>>>> claim that there are fewer bugs.  But all that has done is change from one
>>>> unknown level of bugginess to another (probably lower, but ….) level of
>>>> bugginess.  So, again, this is of dubious measurable value.
>>>>
>>>>
>>>>
>>>> What can be said, I think, is that every defect removal technique has a
>>>> set of assumptions that have to be validated (by humans) and that therefore
>>>> there is a level of uncertainty.  The technique itself, whether review,
>>>> analysis (static or otherwise) or test, has limits, and the boundaries of
>>>> acceptability are set in standards such as DO-178, where a level of activity
>>>> is agreed based on a set of objectives that have to be met in order to
>>>> support a system Design Assurance Level (DAL).  In this way, the
>>>> reliability question is readily, acceptably and evidently addressed.
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Nick Tudor
>>>> Tudor Associates Ltd
>>>> Mobile: +44(0)7412 074654
>>>> www.tudorassoc.com
>>>>
>>>> *77 Barnards Green Road*
>>>> *Malvern*
>>>> *Worcestershire*
>>>> *WR14 3LR*
>>>> *Company No. 07642673*
>>>> *VAT No:116495996*
>>>>
>>>> *www.aeronautique-associates.com*
>>>>
>>>> On 3 March 2015 at 07:11, Peter Bernard Ladkin <
>>>> ladkin at rvs.uni-bielefeld.de> wrote:
>>>>
>>>>> I had some private discussion with someone here who claims software
>>>>> cannot fail. I first heard this
>>>>> trope a quarter century ago, and I am informed indirectly by another
>>>>> colleague that it is still rife
>>>>> in certain critical-engineering areas. I address it this morning in a
>>>>> blog post at
>>>>>
>>>>> http://www.abnormaldistribution.org/2015/03/03/fault-failure-reliability-again/
>>>>>
>>>>> PBL
>>>>>
>>>>> Prof. Peter Bernard Ladkin, Faculty of Technology, University of
>>>>> Bielefeld, 33594 Bielefeld, Germany
>>>>> Je suis Charlie
>>>>> Tel+msg +49 (0)521 880 7319  www.rvs.uni-bielefeld.de
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Matthew Squair*
>>> MIEAust CPEng
>>>
>>> Mob: +61 488770655
>>> Email: MattSquair at gmail.com
>>> Website: www.criticaluncertainties.com
>>>
>>>
>>
>> --
>> Nick Tudor
>> Tudor Associates Ltd
>> Mobile: +44(0)7412 074654
>> www.tudorassoc.com
>>
>> *77 Barnards Green Road*
>> *Malvern*
>> *Worcestershire*
>> *WR14 3LR*
>> *Company No. 07642673*
>> *VAT No:116495996*
>>
>> *www.aeronautique-associates.com*
>>
>>
>
>
> --
> *Matthew Squair*
> MIEAust CPEng
>
> Mob: +61 488770655
> Email: MattSquair at gmail.com
> Website: www.criticaluncertainties.com
>
>
> _______________________________________________
> The System Safety Mailing List
> systemsafety at TechFak.Uni-Bielefeld.DE
>
>

