[SystemSafety] RE : Qualifying SW as "proven in use"

Nancy Leveson leveson.nancy8 at gmail.com
Thu Jun 27 16:23:43 CEST 2013


1. Software is *not*, by itself, unsafe. It is an abstraction without any
physical reality. It cannot itself cause physical damage. The safety of
software depends on
   -- the software logic itself,
   -- the behavior of the hardware on which the software executes,
   -- the state of the system that is being controlled or somehow affected
by the outputs of the software (the encompassing "system"), and
   -- the state of the environment.
All of these things determine safety, so a change in any one of them can
affect the so-called "software safety." For example, the Ariane 5 was designed
to fly a steeper trajectory than the Ariane 4, and that change led to the
software contributing to the explosion. The environment does matter. All the
usage of that software on the Ariane 4 meant nothing with respect to its use
in the Ariane 5.

Any change in the environment, in the controlled system, in the underlying
hardware, or in the software invalidates all previous experience unless one
can prove that the change will not lead to an accident (and that proof
cannot be based on a statistical argument). Does anyone know any
non-trivial software, for example, that is not changed in any way over
decades of use? Or even years of use? And what about changes in the
behavior of human operators, of the system itself, and of the environment?

Someone wrote:
> I've been thinking about Peter's example a good deal, the developer seems
> to me to have made an implicit assumption that one can use a statistical
> argument based on successful hours run to justify the safety of the
> software.
And Peter responded:
>> It is not an assumption. It is a well-rehearsed statistical argument with
>> a few decades of universal acceptance, as well as various successful
>> applications in the assessment of emergency systems in certain English
>> nuclear power plants.

"Well-rehearsed statistical arguments with a few decades of universal
acceptance" are not proof. They are only well-rehearsed arguments. Saying
something multiple times is not a proof. The fact that nuclear power plants
in Britain have not experienced any major accidents (they have had minor
incidents by the way) rises only to the level of anecdote, and not proof.
And that experience (and well-rehearsed arguments) cannot be carried over
to other systems.
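
For concreteness, here is a minimal sketch (illustrative Python, not a
reconstruction of any particular assessment) of the kind of bound that
"hours of successful operation" arguments rest on. Note what it has to
assume: failures arriving as a constant-rate Poisson process, that is,
independent demands drawn from an operational profile that never changes;
that is exactly the assumption that a change of environment, controlled
system, hardware, or software violates.

    import math

    def upper_bound_failure_rate(failure_free_hours, confidence=0.95):
        # Upper confidence bound on the per-hour failure rate after zero
        # failures in `failure_free_hours` of operation, under the
        # constant-rate Poisson assumption (the "rule of three" at 95%
        # confidence, since -ln(0.05) is roughly 3).
        return -math.log(1.0 - confidence) / failure_free_hours

    # 10,000 failure-free hours support a claim of only about 3e-4
    # failures per hour at 95% confidence; a claim of 1e-9 per hour would
    # need on the order of 3e9 failure-free hours.
    print(upper_bound_failure_rate(1e4))   # ~3.0e-4
    print(-math.log(0.05) / 1e-9)          # ~3.0e9 hours

Even granting every assumption, such a bound applies only to the operational
profile over which those hours were accumulated; it says nothing about a
different trajectory, a different plant, or a modified system.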

I agree with the original commenter about the implicit assumption, which the
Ariane 5 case (along with dozens of others) disproves.

2. It is not even clear what "failure" of software means when software is
merely "design abstracted from its physical realization." How can a
"design" fail? It may not satisfy its requirements (when executed on some
hardware), but a design (the equivalent of a blueprint for hardware) does not
fail, and certainly does not fail "randomly."

Perhaps the reason why software reliability modeling still has pretty poor
performance after at least 40 years of very bright people trying to get it
to work is that the assumptions underlying it are not true. These
assumptions have not been proven (only stated with great certainty) and, in
fact, there is evidence showing they are not necessarily true. I tried
raising this point a long time ago, but I was met with such a ferocious
response (as I am sure I will be here) that I simply ignored the whole
field and worked on things that seemed to have more promise. The most
common assumption is that the environment is stochastic and that the
selection of inputs (from the entire space of inputs) that will trigger a
software fault (design error) is random. There is data from NASA (gathered
using real aircraft) showing "error bursts" in the software (ref. Dave
Eckhardt). These bursts appeared to result when the aircraft flew a
trajectory that was near a "boundary point" in the software and thus set
off all the common problems in software related to boundary points. The
selection of inputs triggering the problems was not random.
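
As a deliberately contrived sketch (the clamp function and the numbers below
are hypothetical, not the NASA software), here is why fault triggering need
not look like random sampling of the input space: a single boundary defect is
essentially never hit by inputs drawn uniformly from the whole input space,
but a trajectory that dwells at the boundary triggers it in a burst.

    import random

    LIMIT = 100.0

    def buggy_clamp(x):
        # The (hypothetical) defect: the strict comparisons leave the
        # exact boundary value LIMIT unhandled.
        if x < LIMIT:
            return x
        elif x > LIMIT:
            return LIMIT
        else:
            raise ValueError("unhandled boundary value")

    def count_failures(inputs):
        failures = 0
        for x in inputs:
            try:
                buggy_clamp(x)
            except ValueError:
                failures += 1
        return failures

    random.seed(0)

    # Inputs sampled "randomly" from the whole input space essentially
    # never trigger the defect...
    uniform_inputs = [random.uniform(0.0, 200.0) for _ in range(100_000)]
    print(count_failures(uniform_inputs))   # almost certainly 0

    # ...but a trajectory that dwells at the limit (think of an aircraft
    # holding a boundary condition) triggers it in a burst.
    trajectory = [LIMIT if 400 <= t < 500 else float(t % 100)
                  for t in range(1000)]
    print(count_failures(trajectory))       # 100 consecutive failures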

As another example, Ravi Iyer looked at software failures of a widely used
operating system in an interesting experiment where he found that a cluster
of software errors appeared to precede a computer hardware failure. It made
no sense that the software could be "causing" the hardware failure.
Closer examination showed the problem. Hardware often degrades in its
behavior before it actually stops. The strange hardware behavior, if I
remember correctly, was exercising the software error handling routines
until it got beyond the capability of the software to mitigate the
problems. Again, in this case, the software was not "failing" due to
randomly selected inputs from the external input space.
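
Again purely as an illustration (a toy simulation, not Iyer's experiment or
data): a slowly degrading hardware component drives the software's error
handling harder and harder until its retry budget is exhausted, so the logged
software errors cluster just before the hardware finally stops. The clustering
is driven by the hardware's internal state, not by randomly selected external
inputs.

    import random

    random.seed(1)
    RETRY_BUDGET = 3            # consecutive transients the handler can absorb
    fault_probability = 0.001   # per-step transient fault probability

    consecutive = 0
    for step in range(20_000):
        fault_probability *= 1.001   # gradual hardware degradation
        if random.random() < min(fault_probability, 1.0):
            consecutive += 1
            if consecutive > RETRY_BUDGET:
                print(f"step {step}: error handling exhausted, system down")
                break
            print(f"step {step}: transient fault handled (retry {consecutive})")
        else:
            consecutive = 0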

When someone wrote:
> I don't think that's true,
Peter Ladkin wrote:
>> You might like to take that up with, for example, the editorial board of
>> IEEE TSE.

[As a past Editor-in-Chief of IEEE TSE, I can assure you that the entire
editorial board does not read and vet the papers; in fact, I was lucky if
one editor actually read the paper. Are you suggesting that anything that
is published should automatically be accepted as truth? That nothing
incorrect is ever published?]

Nancy

-- 
Prof. Nancy Leveson
Aeronautics and Astronautics and Engineering Systems
MIT, Room 33-334
77 Massachusetts Ave.
Cambridge, MA 02142

Telephone: 617-258-0505
Email: leveson at mit.edu
URL: http://sunnyday.mit.edu