[SystemSafety] RE : Qualifying SW as "proven in use"

Peter Bernard Ladkin ladkin at rvs.uni-bielefeld.de
Thu Jun 27 13:35:13 CEST 2013


Matthew,

Scenarios such as those Bertrand describes are not that far-fetched. Unfortunately, in some places senior managers have the same lack of expertise that Bertrand describes. That is a problem of professional qualification, which I would prefer to treat as a separate issue.

On 27 Jun 2013, at 09:18, Matthew Squair <mattsquair at gmail.com> wrote:
> I've been thinking about Peter's example a good deal, the developer seems to me to have made an implicit assumption that one can use a statistical argument based on successful hours run to justify the safety of the software.

It is not an assumption. It is a well-rehearsed statistical argument with a few decades of universal acceptance, as well as various successful applications in the assessment of emergency systems in certain English nuclear power plants.

> I don't think that's true,

You might like to take that up with, for example, the editorial board of IEEE TSE.

> in fact I'd go further and say that whether you operate for a thousand hours or a million hours has no bearing on demonstrating software safety, because what we're interested in are systematic failures rather than random ones.

I presume you would want to argue that the occurrence of a failure caused by a systematic fault is functionally dependent on the inputs, and that is what distinguishes it from what you call "random". However, if your inputs have a stochastic nature, then anything functionally dependent on them will also exhibit stochastic behavior. Failures caused by systematic faults thus exhibit stochastic behavior.
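A toy illustration of that point, as a minimal sketch only: the routine `controller`, its fault region, and the demand rate are all hypothetical, invented for this example. The fault is entirely deterministic (the same input always fails), yet because the inputs are drawn at random, the time to first failure varies from run to run.

```python
import random

def controller(x: float) -> bool:
    """Hypothetical routine: True means success, False means failure.

    It contains a purely systematic fault: it fails on every input in
    the narrow region [0.5, 0.5001) and succeeds everywhere else.
    """
    return not (0.5 <= x < 0.5001)

def hours_to_first_failure(rng: random.Random,
                           demands_per_hour: int = 10,
                           max_hours: int = 100_000):
    """Feed random inputs to controller() and return the hour of the
    first failure, or None if no failure occurs within max_hours.

    Assumption: inputs are independent and uniform on [0, 1), which
    stands in for a fixed operational profile.
    """
    for hour in range(1, max_hours + 1):
        for _ in range(demands_per_hour):
            if not controller(rng.random()):
                return hour
    return None

if __name__ == "__main__":
    # Same fault, same environment, different input histories.
    for seed in range(5):
        print(seed, hours_to_first_failure(random.Random(seed)))
```

With these assumed numbers the fault is hit with probability 10^(-4) per demand, so the first failure typically arrives after something on the order of a thousand hours, but the exact hour differs with each input history.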

> Example, I have a piece of software and (despite my best efforts) there's a latent fatal fault within it, however testing hasn't discovered it and I'm also in luck in that the operating environment is sufficiently close to the test environment that the fault is not triggered in the operating environment. Now I could run the system for one, one hundred or a thousand years in that operating environment and I wouldn't see a problem. So according to the statistical treatment the software is safe, even with a fatal flaw isn't it?

No. According to the statistical treatment, if you have seen 3 x 10^X operational hours without failure, *and* you are guaranteed to have had perfect failure detection, *and* the future operating environment has the exact same statistical properties as the previous (not "similar" but exact, statistically), then you may be 90% confident that you will see failures with a likelihood of not more than 10^(-X) per operating hour. How that might relate to a claim that "the software is safe" is up to you. Also, you didn't express what level of confidence you might need in such a claim.
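The arithmetic behind such a bound can be sketched as follows. This is a minimal sketch, assuming failures arrive at a constant rate per operating hour (an exponential model, the usual simplification for this argument) together with the conditions stated above: perfect failure detection and an unchanged operational profile. The function names are illustrative and not taken from any standard or library.

```python
import math

def confidence_rate_below(bound_per_hour: float,
                          failure_free_hours: float) -> float:
    """Confidence that the true failure rate is at most bound_per_hour,
    given failure_free_hours of operation with no failure observed.

    Assumption: constant failure rate. If the rate were exactly at the
    bound, the chance of seeing no failure in t hours would be
    exp(-bound * t); the confidence is one minus that.
    """
    return 1.0 - math.exp(-bound_per_hour * failure_free_hours)

def hours_needed(bound_per_hour: float, confidence: float) -> float:
    """Failure-free hours required to claim the bound at the given
    confidence, under the same constant-rate assumption."""
    return -math.log(1.0 - confidence) / bound_per_hour

if __name__ == "__main__":
    bound = 1e-4  # example value: X = 4, i.e. 10^(-4) per operating hour
    print(confidence_rate_below(bound, 3e4))  # 3 x 10^4 failure-free hours
    print(hours_needed(bound, 0.90))
    print(confidence_rate_below(bound, 100))  # only 100 failure-free hours
```

Under this model, 3 x 10^X failure-free hours correspond to a confidence of 1 - exp(-3), roughly 95%, so the 90% figure is on the cautious side; about 2.3 x 10^X hours would suffice for 90%. A mere 100 failure-free hours supports the same 10^(-4) bound with only about 1% confidence.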

> So logically if the number of hours you run in service in a particular environment has nothing to do with proving the safety of software, why couldn't I say that after one hundred hours the software was 'proven in use', for that specific environment. Why not one hour?

It is correct that the number of hours ... has nothing to do with proving the safety of software, if by "proving" you mean establishing it beyond a shadow of doubt. Neither does any practical statistical reasoning. The usual level of confidence in statistical reasoning is 95%, well away from certainty.

You can of course say that, after 100 hours of failure-free operation, the SW is "proven in use", whatever that might mean to you. What you cannot do is attribute to that assertion anything other than a very, very low level of confidence. The same goes for one hour, with an appropriately lower level of confidence (= epsilon indistinguishable from zero, I would hope).

> In Peter's example the number of hours run on the original software version could have been one, or ten million and there still would have been the same end result, e.g. a failure when put into a new operational context. In other words one hour of operations has as much weight as one thousand (in the same environment).

I am not sure what you mean here. To me, "new operational context" and "same environment" are contradictory, so maybe I don't understand the way you are using these terms.

> Another question, say I have developed a piece of software, it's now running in three quite different operating environments, in terms of evidence of 'safety' would I weight 300 hours of operation in a single environment the same as 100 hours from each of these different environments? If so why? 

What you have is 100 hours of experience from each of three different distributions. You could superimpose the distributions if you want, but the only reason to do that is if you are thinking of deploying the SW in an environment identical to that superimposition and want to get a clue as to its viability.

PBL

Prof. Peter Bernard Ladkin, University of Bielefeld and Causalis Limited


