[SystemSafety] Software reliability (or whatever you would prefer to call it)

E. Douglas Jensen jensen at real-time.org
Sun Mar 8 21:36:33 CET 2015


Thank you Bev. Please see below.

Best regards,

Doug
----
E. Douglas Jensen
jensen at real-time.org, http://www.real-time.org
Home voice 508-653-5653, Cell phone voice: 508-728-0809
Facebook https://www.facebook.com/edouglas.jensen
LinkedIn www.linkedin.com/pub/e-douglas-doug-jensen/3/82/65a/

From: Littlewood, Bev [mailto:Bev.Littlewood.1 at city.ac.uk]
Sent: Sunday, March 08, 2015 2:31 PM
To: E. Douglas Jensen
Cc: Littlewood, Bev; systemsafety at lists.techfak.uni-bielefeld.de; ladkin Ladkin
Subject: Re: [SystemSafety] Software reliability (or whatever you would prefer to call it)

Hi Doug

Thanks for this, and for correcting my gender (not sure that is well-expressed…) in your private email. My wife used to joke to my colleagues that my scientific career really took off in the 70s, with lots of papers in big US conferences, because the programme committees thought I’d be the ideal candidate for Statutory Woman on the conference programme.
[EDJ] I am trying to recover from my embarrassment …

You make good points here. The processes you describe seem extremely complex, so that detailed mathematical/probabilistic modelling to reproduce all their properties might be nigh impossible. Certainly, they do not seem candidates for the simple Bernoulli/Poisson processes that prompted this debate.
[EDJ] Modelling attempts have been made, but “all” is not a goal; “some of the most important properties” is the most we can hope for.

But never say never. Remember that it is the failure processes we are discussing here. It may sometimes be possible to model these without modelling all the complex activity that is going on in the wider system. There are, for example, theorems that show that quite complex point processes can converge to Poisson processes under quite plausible limiting conditions.
[EDJ] Yes, I like that second sentence. I am not a failure expert, rather I try to help people figure out how to reason about properties (e.g., mission effectiveness) of a highly dynamic and uncertain scenario (cruise missile defense, etc.) in which multiple concurrent complex partial software and hardware failures occur. We have used such theorems but not much for failure processes per se—we typically assert that it is necessary to make simplifying assumptions about some types of failures because there are so many different kinds to accommodate as uncertainties. Our community has people who focus on software failures in our application context, and give us information to use in the system analysis. I have to hope they learn from the discussions in this forum.

For example, I did some work many years ago which modelled program execution as a Markov process involving exchanges of control between fallible sub-programs. The exact process of failures in such a case is rather complex, and not a Poisson process, of course: the inter-event times are complex random sums of exponential sojourn times spent executing the sub-programs between successive failures. But it can be shown (among other interesting things) that in the limit (as, plausibly, the exchanges of control between the sub-programs are much more frequent than the failure rates of the latter) the overall failure process converges to a Poisson process. What is nice is that this is so even when the underlying process is semi-Markov, i.e. the sojourn times are not exponentially distributed.
[EDJ] Interesting, we have done much the same thing with useful results. Our various sojourn times are rarely exponentially distributed.
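
For readers who have not met this kind of limit result, here is a minimal simulation sketch of the idea (the two-module structure, sojourn distributions, and failure rates below are hypothetical, not the original model): control alternates between fallible sub-programs with non-exponential sojourn times, and because failures are rare relative to exchanges of control, the observed inter-failure times come out approximately exponential, i.e. the failure process looks approximately Poisson.

import math
import random

# Hypothetical two-module program: control alternates deterministically,
# sojourn times are uniform (so the process is semi-Markov, not Markov),
# and each module has a small constant failure rate while it executes.
TRANSITION = {0: 1, 1: 0}
SOJOURN = {0: (0.5, 1.5), 1: (0.1, 0.3)}   # uniform sojourn-time bounds
FAIL_RATE = {0: 1e-4, 1: 5e-4}             # failures per unit execution time

def inter_failure_times(n_failures=2000, seed=1):
    random.seed(seed)
    t, module, last_failure, gaps = 0.0, 0, 0.0, []
    while len(gaps) < n_failures:
        stay = random.uniform(*SOJOURN[module])
        # Probability of at least one failure during this sojourn; rates are
        # small, so the chance of two failures in one sojourn is negligible.
        if random.random() < 1.0 - math.exp(-FAIL_RATE[module] * stay):
            failure_time = t + random.uniform(0.0, stay)
            gaps.append(failure_time - last_failure)
            last_failure = failure_time
        t += stay
        module = TRANSITION[module]
    return gaps

gaps = inter_failure_times()
mean = sum(gaps) / len(gaps)
var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
# For an exponential distribution the coefficient of variation is 1.
print("mean gap:", mean, "coefficient of variation:", var ** 0.5 / mean)

With failure rates this small relative to the exchange rate, the printed coefficient of variation comes out close to 1, which is what the convergence result described above leads one to expect.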

I mention this not to suggest it has any relevance for your example, but to show that there may be simplicities in the failure processes that are not apparent in the complete descriptions of system behavior.
[EDJ] Thank you, it does have some relevance to my cases. Our clients need improved understanding of their system and subsystem behaviors, which is immensely challenging. We depend on appropriate understanding and analysis of the lower levels’ behaviors (including failures) all the way down to integrated circuits and software. This forum’s discussions help me assess the software failure information I am given.

My main point, I think, is that you have not convinced me that the practical complexities you have preclude the failure process from being stochastic. After all, it just means “uncertain in time”!
[EDJ] With considerable trepidation, I question that: in very simplified terms, does “stochastic” not mean “being randomly determined”? To me, “randomly determined” is a special case of “uncertain in time.” Chaotic behavior can be uncertain in time but is not usually considered stochastic.

I thought we might be in strong agreement when I saw your remark about the limitations of frequentist statistics. I couldn’t agree more. In fact I was tempted to preach to this list about the necessity to use Bayesian probability here, but thought that was a bridge too far. However, in your private mail you say you use imprecise probabilities. My problem with that approach (and things like Shafer/Dempster and fuzzy/possibility) is that it’s hard to fit it into a wider safety case (say), where probabilities will be the norm. I also find such things really hard to understand at the very basic level of “what do the numbers mean?”
[EDJ] We are in strong agreement on the limitations of frequentist statistics, and the wide usefulness of Bayesian probability. I would not have thought that was controversial here, although I haven’t seen Bayesian probability raised. I think non-frequentist (for example, Bayesian) probability is intuitive and employed every day in everyone’s personal and professional lives—e.g., in making estimates of when to leave home to pick up one’s child at school based on prior experience with uncertainties such as traffic and weather conditions.
[EDJ] We (attempt to) use imprecise probabilities, Shafer/Dempster, and fuzzy/possibility theories frequently, often with successful results. Maybe that’s because it is at the system (as opposed to just the safety) level, in cases when we have succeeded in not being confined to using conventional probabilities. I too wonder “what do the numbers mean,” but certainly not as insightfully as you do.
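
To make “what do the numbers mean” concrete in the Bayesian case, here is a minimal sketch with assumed numbers (the Beta(1,1) prior and the observation counts are illustrative, not anyone’s actual assessment): with a Bernoulli demand model and a conjugate Beta prior on the probability of failure on demand (pfd), the observed demands update the prior directly, and a credible bound can be read straight off the posterior.

from scipy import stats

a_prior, b_prior = 1.0, 1.0   # Beta(1,1): a uniform prior on the pfd (an assumption)
demands, failures = 5000, 0   # hypothetical observations: 5000 demands, none failed

# Conjugate update: Beta prior + Binomial observations -> Beta posterior.
posterior = stats.beta(a_prior + failures, b_prior + demands - failures)

print("posterior mean pfd:             ", posterior.mean())
print("95% credible upper bound on pfd:", posterior.ppf(0.95))

Here the number has a direct reading: given the prior and the observations, the probability that the pfd lies below the printed bound is 95%. A reading that direct is harder to give for Dempster/Shafer belief and plausibility intervals or possibility measures, which is the “what do the numbers mean?” difficulty raised above.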

Cheers
[Likewise]

Bev
[Doug]


On 8 Mar 2015, at 16:37, E. Douglas Jensen <jensen at real-time.org> wrote:

I hardly consider myself sufficiently qualified on the topic of system safety to credibly thank Bev Littlewood for her concise synopsis.

On the other hand, after decades of dealing with uncertainties—software and otherwise—in military platforms and their combat contexts (“The main source of uncertainty lies in software’s interaction with the outside world.”), and after about a decade of immersion in the principles and practices of quantitative calculi for dealing with uncertainties, I am puzzled by her assertions “The only candidate for a quantitative calculus of uncertainty is probability. Thus the failure process is a stochastic process.”

There are numerous scholarly publications that contradict the first sentence, unless I misunderstand (her words or the publications). I have my professional colleagues and graduate students read some of these publications, starting with Hamming’s easy book “The Art of Probability,” then deeper ones—for example Gillies’ “Philosophical Theories of Probability” and Klir’s “Uncertainty-Based Information”—plus selected journal and conference papers.

There is a consensus within the military combat theory and practice communities that few if any warfare uncertainties (specifically, here, software failures) can realistically be characterized as stochastic. Combat and its concomitant exogenously caused software failures are generally intermittent, irregular, interdependent, competitive, non-linear, and chaotic. Modeling these software failures as stochastic is highly lossy and generally infeasible (often resulting in property destruction and human death). Reasoning about such failures is better addressed with formalisms more expressive and adaptive than classical (e.g., frequentist) probability theory. Current examples of candidate formalisms can be found in the literature on reasoning under uncertainty.

I look forward to having my understanding clarified if need be.

Best regards,

Doug
----
E. Douglas Jensen
jensen at real-time.org, http://www.real-time.org
Home voice 508-653-5653, Cell phone voice: 508-728-0809
Facebook https://www.facebook.com/edouglas.jensen
LinkedIn www.linkedin.com/pub/e-douglas-doug-jensen/3/82/65a/

From: systemsafety-bounces at lists.techfak.uni-bielefeld.de [mailto:systemsafety-bounces at lists.techfak.uni-bielefeld.de] On Behalf Of Littlewood, Bev
Sent: Sunday, March 08, 2015 10:03 AM
To: systemsafety at lists.techfak.uni-bielefeld.de
Cc: ladkin Ladkin
Subject: [SystemSafety] Software reliability (or whatever you would prefer to call it)

As I am the other half of the authorial duo that has prompted this tsunami of postings on our list, my friends may be wondering why I’ve kept my head down. Rather mundane reason, actually - I’ve been snowed under with things happening in my day job (and I’m supposed to be retired…).

So I’d like to apologise to my friend and co-author of the offending paper, Peter Ladkin, for leaving him to face all this stuff alone. And I would like to express my admiration for his tenacity and patience in dealing with it over the last few days. I hope others on this list appreciate it too!

I can’t respond here to everything that has been said, but I would like to put a few things straight.

First of all, the paper in question was not intended to be at all controversial - and indeed I don’t think it is. It has a simple purpose: to clean up the currently messy and incoherent Annex D of 61508. Our aim here was not to innovate in any way, but to take the premises of the original annex, and make clear the assumptions underlying the (very simple) mathematics/statistics for any practitioners who wished to use it. The technical content of the annex, such as it is, concerns very simple Bernoulli and Poisson process models for (respectively) on-demand (discrete time) and continuous time software-based systems. Our paper addresses the practical concerns that a potential user of the annex needs to address - in order, for example, to use the tables there. Thus there is an extensive discussion of the issue of state, and how this affects the plausibility of the assumptions needed to justify claims for Bernoulli or Poisson behaviour.
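
As a rough illustration of the kind of reasoning behind such tables (a sketch only; the target pfd and confidence level below are made up, not values taken from the annex): under a Bernoulli model with zero observed failures, the number of failure-free demands needed to support a claim at a given confidence follows from requiring (1 - pfd)^n <= 1 - confidence.

import math

def demands_needed(target_pfd, confidence):
    """Failure-free demands needed so that observing zero failures would have
    probability at most 1 - confidence if the true pfd exceeded the target --
    the standard zero-failure Bernoulli argument."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_pfd))

# Illustrative numbers only: a claim of pfd <= 1e-3 at 99% confidence
print(demands_needed(1e-3, 0.99))   # roughly 4600 failure-free demands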

Note that there is no advocacy here. We do not say “Systems necessarily fail in Bernoulli/Poisson processes, so you must assess their reliability in this way”. Whilst these are, we think, plausible models for many systems, they are clearly not applicable to all systems. Our concern was to set down what conditions a user would need to assure in order to justify the use of the results of the annex. If his system did not satisfy these requirements, then so be it.

So why has our innocuous little offering generated so much steam?

Search me. But reading some of the postings took me back forty years. “There’s no such thing as software reliability.” “Software is deterministic (or its failures are systematic), therefore probabilistic treatments are inappropriate.” Even, God help us, “Software does not fail.” (Do these people not use MS products?) “Don’t bother me with the science, I’m an engineer and I know what’s what” (is that an unfair caricature of a couple of the postings?). “A lot of this stuff came from academics, and we know how useless and out-of-touch with the real world they are (scientific peer-review? do me a favour - just academics talking to one another)”. Sigh.

Here are a few comments on a couple of the topics of recent discussions. Some of you may wish to stop reading here!

1 Deterministic, systematic…and stochastic.

Here is some text I first used thirty years ago (only slightly modified). This is not the first time I’ve had to reuse it in the intervening years.
"It used to be said – in fact sometimes still is – that 'software failures are systematic and therefore it does not make sense to talk of software reliability'. It is true, of course, that software fails systematically, in the sense that if a program fails in certain circumstances, it will always fail when those circumstances are exactly repeated. Where then, it is asked, lies the uncertainty that requires the use of probabilistic measures of reliability?
"The main source of uncertainty lies in software’s interaction with the world outside. There is inherent uncertainty about the inputs it will receive in the future, and in particular about when it will receive an input that will cause it to fail. Execution of software is thus a stochastic (random) process. It follows that many of the classic measures of reliability that have been used for decades in hardware reliability are also appropriate for software: examples include failure rate (for continuously operating systems, such as reactor control systems); probability of failure on demand (pfd) (for demand-based systems, such as reactor protection systems); mean time to failure; and so on.
"This commonality of measures of reliability between software and hardware is important, since practical interest will centre upon the reliability of systems comprising both. However, the mechanism of failure of software differs from that of hardware, and we need to understand this in order to carry out reliability evaluation.”  (it goes on to discuss this - no room to do it here)

At the risk of being repetitive: The point here is that uncertainty - "aleatory uncertainty" in the jargon - is an inevitable property of the failure process. You cannot eliminate such uncertainty (although you may be able to reduce it). The only candidate for a quantitative calculus of uncertainty is probability. Thus the failure process is a stochastic process.

Similar comments to the above can be made about “deterministic” as used in the postings. Whilst this is, of course, an important and useful concept, it has nothing to do with this particular discourse.

2. Terminology, etc.

Serious people have thought long and hard about this. The Avizienis-Laprie-Randell-Neumann paper is the result of this thinking. You may not agree with it (I have a few problems myself), but it cannot be dismissed after a few moments’ thought, as it seems to have been in a couple of postings. If you have problems with it, you need to engage in serious debate. It’s called science.

3. You can’t measure it, etc.

Of course you can. Annex D of 61508, in its inept way, shows how - in those special circumstances that our note addresses in some detail.

Society asks “How reliable?”, “How safe?”, “Is it safe enough?”, even “How confident are you (and should we be) in your claims?” Answers to the first three are claims about the stochastic processes of failures. If you don’t accept that, how else would you answer? I might accept that you are a good engineer, working for a good company, using best practices of all kinds - but I still would not have answers to the first three questions.

The last question above raises the interesting issue of epistemic uncertainty about claims for systems. No space to discuss that here - but members of the list will have seen Martyn Thomas’ numerous questions about how confidence will be handled (and his rightful insistence that it must be handled).

4. But I’ll never be able to claim 10^-9….

That’s probably true.

Whether 10^-9 (probability of failure per hour) is actually needed in aerospace is endlessly debated. But you clearly need some dramatic number. Years ago, when I talked to Mike deWalt about these things, he said that the important point was that aircraft safety needed to improve continuously. Otherwise, with the growth of traffic, we would see more and more frequent accidents, and this would be socially unacceptable. The current generation of airplanes is impressively safe, so new ones face a very high hurdle. Boeing annually provide a fascinating summary of detailed statistics on world-wide airplane safety (www.boeing.com/news/techissues/pdf/statsum.pdf). From this you can infer that current critical computer systems have demonstrated, in hundreds of millions of hours of operation, something like 10^-8 pfh (e.g. for the Airbus A320 and its ilk). To satisfy Mike’s criterion, new systems need to demonstrate that they are better than this. This needs to be done before they are certified. Can it?
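
The kind of back-of-envelope inference behind that “something like 10^-8 pfh” figure can be sketched as follows (the operating hours below are assumed for illustration, not figures from the Boeing summary): with zero failures observed in T hours under a constant-rate Poisson model, a one-sided upper confidence bound on the rate is -ln(1 - confidence) / T.

import math

def upper_bound_rate(hours, confidence):
    """Largest constant failure rate consistent, at the given confidence,
    with observing zero failures over `hours` of operation."""
    return -math.log(1.0 - confidence) / hours

# Illustrative: 1e8 failure-free operating hours, 95% confidence
print(upper_bound_rate(1e8, 0.95))   # about 3e-8 failures per hour

Pushing such a bound well below 10^-8 before entry into service would require far more failure-free exposure than a new type can have accumulated, which is the difficulty the next paragraph and its references point to.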

Probably not. See Butler and Finelli (IEEE Trans Software Engineering, 1993), or Littlewood and Strigini (Comm ACM, 1993) for details.

Michael Holloway’s quotes from 178B and 178C address this issue, and have always intrigued me. The key phrase is "...currently available methods do not provide results in which confidence can be placed at the level required for this purpose…” Um. This could be taken to mean: “Yes, we could measure it, but for reasons of practical feasibility, we know the results would fall far short of what’s needed (say 10^-8ish). So we are not going to do it.” This feels a little uncomfortable to me. Perhaps best not to fly on a new aircraft type until it has got a few million failure-free hours under its belt (as I have heard a regulator say).

By the way, my comments here are not meant to be critical of the industry’s safety achievements, which I think are hugely impressive (see the Boeing statsum data).

5. Engineers, scientists…academics...and statisticians...

…a descending hierarchy of intellectual respectability?

With very great effort I’m going to resist jokes about alpha-male engineers. But I did think Michael’s dig at academics was a bit below the belt. Not to mention a couple of postings that appear to question the relevance of science to engineering. Sure, science varies in quality and relevance. As do academics. But if you are engineering critical systems it seems to me you have a responsibility to be aware of, and to use, the best relevant science. Even if it comes from academics. Even if it is statistical.


My apologies for the length of this. A tentative excuse: if I’d spread it over several postings, it might have been even longer…

Cheers

Bev
_______________________________________________

Bev Littlewood
Professor of Software Engineering
Centre for Software Reliability
City University London EC1V 0HB

Phone: +44 (0)20 7040 8420  Fax: +44 (0)20 7040 8585

Email: b.littlewood at csr.city.ac.uk

http://www.csr.city.ac.uk/
_______________________________________________


