[SystemSafety] Degraded software performance [diverged from Fault, Failure and Reliability Again]

David Haworth david.haworth at elektrobit.com
Thu Mar 5 11:21:49 CET 2015


Hi Nick (and Drew)

I cut my "safety" teeth on what was probably the forerunner of the
Eurofighter at the same place (but in the early 80s it was known
as Marconi Avionics).

If they used the same fundamental design principles for Eurofighter
as they did for the Jag, the design was intended to be fully
deterministic from the ground up. There was very little in there
that could cause non-deterministic behaviour, and what little there
was (external hardware influences such as the inputs and
outputs transferring by DMA) was tightly controlled
and accounted for. For a given set of inputs and existing state,
the computer would always provide the same outputs and new state.

I.e. none of the unpredictable factors that Drew describes
(multiple threads, incorrect exclusion methods, asynchronous
preemptions, timing factors etc. of the kind you
find on your average computer-with-operating-system).
Nice to know that it worked :-)

Disclaimer: I only worked there two years, so I take no credit
for the design. When I left, we'd just started doing simulated
response tests to step inputs and were disappointed to find that
they weren't quite as the models predicted (but fully deterministic).
I suspect we were looking at an accuracy limitation of the fixed-point
arithmetic, but I left before a reason was found.
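For illustration only (the numbers are invented, not from the Jag):
a single Q15 fixed-point multiply is perfectly repeatable, yet already
rounds away from the real-valued model:

    /* Q15 fixed point: values scaled by 2^15. Fully deterministic,
     * but each multiply rounds, so results drift from the ideal model. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t gain = (int16_t)(0.1 * 32768.0); /* 0.1 -> 3276, not 3276.8 */
        int16_t x    = 20000;
        int32_t y    = ((int32_t)gain * x) >> 15; /* 1999, every single run */
        printf("fixed point: %ld  exact model: %.1f\n",
               (long)y, 0.1 * 20000.0);
        return 0;
    }

Run it a million times and you get 1999 a million times; the model says
2000.0. Deterministic, but not quite what the mathematics predicted.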

Dave



On 2015-03-05 09:21:38 +0000, Nick Tudor wrote:
>    Hi Drew
>    So the software in each case executes completely deterministically; the
>    computers are not identical.
>    Following on from this, and to respond to an earlier input by Peter
>    regarding the paper by Kevin Driscoll: I didn't attend SAFECOMP 2003,
>    but I did see him present "Beware the Byzantine Generals" at a DASC
>    conference probably in 2002.  This was an attempt to question the
>    safety of the flight control software developed by the then Ferranti
>    chaps at Rochester in Kent.  NB Honeywell were very annoyed at not
>    getting the contract and were trying to lobby to get the 787 contract -
>    which they eventually did, but not because of the reasons discussed
>    here.  I was in a position at the time where I was also working on
>    software for the Eurofighter which included aspects of the FCS and
>    hence had access to the same team.  I was allowed to see the private
>    material between them and the CAA/FAA as part of the certification
>    process which addressed the issues raised subsequently by Kevin.  This
>    in turn addresses the 'fun' outlined by you (NB I think we have a
>    common path in our education - could be wrong...).
>    The Eurofighter FCS is based upon 4 independently operating computers
>    with voter logic; you may wish to think of them as threads and, guess
>    what, they too, given the same inputs, have different outputs, but are
>    clocked by their own internal systems.  Not only was the software
>    intensively verified by the manufacturers, it was also independently
>    verified by the team I joined in 2003 using formal techniques based
>    upon Z.  We
>    also looked at the timing aspects based upon CSP.  There are published
>    talks, not necessarily papers, on the outcome.  The software would not
>    have been certified if we had not done the independent work which
>    showed that the software and the system was deterministic.
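>    As a crude sketch only (a real FCS voter also deals with tolerance
>    bands, channel exclusion and reconfiguration, all omitted here), a
>    3-of-4 majority vote over the four channel outputs might look like:
> 
>    /* Illustrative only: 3-of-4 exact-match majority vote. */
>    #include <stdbool.h>
> 
>    bool vote4(const int v[4], int *out)
>    {
>        for (int i = 0; i < 4; i++) {
>            int agree = 0;
>            for (int j = 0; j < 4; j++)
>                if (v[j] == v[i])
>                    agree++;
>            if (agree >= 3) {       /* clear majority: use it */
>                *out = v[i];
>                return true;
>            }
>        }
>        return false;               /* no majority: flag a voter fault */
>    }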
>    My case is rested.
> 
>    Nick Tudor
>    Tudor Associates Ltd
>    Mobile: +44(0)7412 074654
>    www.tudorassoc.com
>    77 Barnards Green Road
>    Malvern
>    Worcestershire
>    WR14 3LR
>    Company No. 07642673
>    VAT No:116495996
>    www.aeronautique-associates.com
>    On 4 March 2015 at 21:19, DREW Rae <d.rae at griffith.edu.au> wrote:
> 
>    Nick,
>    You can't have learned software at a very fun school if they didn't
>    teach you how to write programs that give different outputs for the
>    same set of inputs without using a random number generator.
>    Here's one:
>     1. Write a multi-threaded function without proper exclusion or
>    alternate protection against thread interference
>     2. Put it inside a loop with a voter function outside which returns
>    the most common returned value from the original function
>     3. Compile the executable, and run it on a lab full of "identical"
>    computers
>     4. Watch as each computer _consistently_ returns the same value, but
>    the values aren't the same between the identical computers. Record
>    which computer returns which value.
>     5. Turn the computers off, and wait 24 hours
>     6. Turn the computers on again, and observe that the set of computers
>    returning each value has changed
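>    A minimal sketch of steps 1 and 2, assuming POSIX threads (the
>    thread, iteration and trial counts are arbitrary):
> 
>    /* Step 1: a deliberately unsynchronised shared counter. */
>    #include <pthread.h>
>    #include <stdio.h>
> 
>    #define NTHREADS 4
>    #define NITER    100000
>    #define NTRIALS  15
> 
>    static long counter;                  /* no mutex: races on purpose */
> 
>    static void *worker(void *arg)
>    {
>        (void)arg;
>        for (int i = 0; i < NITER; i++)
>            counter++;                    /* racy read-modify-write */
>        return NULL;
>    }
> 
>    static long trial(void)               /* one run of the racy function */
>    {
>        pthread_t t[NTHREADS];
>        counter = 0;
>        for (int i = 0; i < NTHREADS; i++)
>            pthread_create(&t[i], NULL, worker, NULL);
>        for (int i = 0; i < NTHREADS; i++)
>            pthread_join(t[i], NULL);
>        return counter;
>    }
> 
>    int main(void)                        /* step 2: the "voter" loop */
>    {
>        long results[NTRIALS], best = 0;
>        int bestcount = 0;
>        for (int i = 0; i < NTRIALS; i++) {
>            results[i] = trial();
>            int c = 0;
>            for (int j = 0; j <= i; j++)
>                if (results[j] == results[i])
>                    c++;
>            if (c > bestcount) { bestcount = c; best = results[i]; }
>        }
>        printf("most common value: %ld (%d/%d trials)\n",
>               best, bestcount, NTRIALS);
>        return 0;
>    }
> 
>    Build with "cc -pthread". Same binary, same inputs; the printed value
>    still differs between "identical" machines, and often between boots.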
>    The point is not that every part of the context is in fact an "input"
>    into the performance of the software. The point is that the probability
>    you're talking about is the _conditional_ probability of a particular
>    output given a particular context. That conditional probability is
>    useless for any practical purpose. What is relevant is the _actual_
>    probability distribution. By the time you've stripped the software away
>    from everything that causes variability in the output, you've ignored
>    most of the computer system. The small remnant that you're calling
>    "software" is of interest to the sort of computer scientist who never
>    actually writes or runs software. (That sort of computer science is a
>    legitimate field of mathematics, but like most mathematics there's a
>    long lead time between current work and relevance to practical
>    engineering).
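>    (In symbols: if C ranges over everything in the execution context,
>    P(wrong output) = sum over C of P(C) * P(wrong output | C); quoting
>    P(wrong output | C0) for one frozen context C0 silently discards
>    every other term of that sum.)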
>    Drew
> 
>    My safety podcast: disastercast.co.uk
>    My mobile (from October 6th): 0450 161 361
>    On 4 March 2015 at 18:23, Nick Tudor <njt at tudorassoc.com> wrote:
> 
>      Hi Drew
> 
>    This thread is getting hard to follow - let alone on a phone on a
>    train in the middle of rural England - so I apologise if I
>    misinterpreted some of your intent. You can now tick off 'apology' 😀
> 
>    You make some good points, but the nub of this argument is whether one
>    can attribute a reliability to software as a component in a system and,
>    if so, how.  I and others have yet to see any argument to support this
>    belief.
> 
>    To follow on from that, and to address one of your points: if it is
>    stated that inputs affect the outputs, I would of course agree. What
>    the effect is will be entirely deterministic and have no random
>    element; at least, this is what I learned at school. The probability of
>    an error in the software causing a system fault is therefore either
>    zero (it wasn't triggered by the inputs) or one (it was triggered and
>    will ALWAYS do the same thing with the same inputs).
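>    A toy illustration (the function and the trigger value are invented):
> 
>    /* Latent error: wrong for exactly one input, right for all others. */
>    int scale(int x)
>    {
>        if (x == 32767)       /* triggered: fails, identically, every time */
>            return 0;
>        return x / 2;         /* never fails for any other input */
>    }
> 
>    For any fixed input the outcome never varies; there is nothing in
>    between zero and one.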
> 
>    Cheers
> 
>    On Wednesday, 4 March 2015, DREW Rae <d.rae at griffith.edu.au> wrote:
> 
>    Nick,
>    I think you've reversed the point I was making, and then disagreed with
>    the opposite of what I was saying. What I really should have done is
>    used "computer system reliability" and refused to buy into the
>    hardware/software demarcation issue.
>    I disagree with claiming failure rates for software, regardless of
>    whether they are carefully concocted statistical estimates, or
>    "software doesn't fail". BOTH rely on making some arbitrary distinction
>    between what is software, and what is hardware. Whoever makes that
>    distinction, wherever they make it, has an obligation to state clear
>    assumptions about the other side of the distinction, and have grounds
>    for believing those assumptions to be realistic.
>    You want to say that each of my failure modes for software "is a
>    hardware issue". Fine. But you don't want to make claims for software
>    reliability either. If you're not going to make a claim for
>    reliability, any distinction between software and hardware you want to
>    create is fine by me. Anyone who wants to claim either hardware or
>    software reliability though, and also wants to make a distinction
>    between "software issues" and "hardware issues", needs to consider both
>    sides of the distinction.
>    If someone wants to say "the processor that the software runs on is not
>    software", then their standard needs to specifically address how
>    they'll make sure that the software requirements consider the aging of
>    the processor. If they want to say that changes in the input profile
>    for the software are not a software issue, then they need to go back to
>    software engineering school, because there's no universe in which a
>    changed pattern of inputs does not change the probability of an
>    incorrect output.
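>    As a sketch (the trigger value is invented): if a latent error fires
>    only on one input value, the observed failure rate is just the share
>    of the input profile that hits it, so a new profile means a new rate:
> 
>    #define TRIGGER 32767   /* invented input that excites a latent error */
> 
>    double failure_rate(const int *inputs, int n)
>    {
>        int hits = 0;
>        for (int i = 0; i < n; i++)
>            if (inputs[i] == TRIGGER)
>                hits++;
>        return (double)hits / n;   /* tracks the profile, nothing else */
>    }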
>    On the plus side, if you'll let me characterise your message as a
>    strawman (instead of an honest misinterpretation of intent, which I'm
>    sure it was) I can complete my mailing list fallacy bingo card. We've
>    already had arguments from antiquity, argument from authority, "is"
>    equals "ought", equivocation, false equivalence, and not understanding
>    the difference between false and falsifiable. I don't think we've had
>    anyone blatantly misrepresent anyone else's position though.
>    Drew
> 
>    My safety podcast: disastercast.co.uk
>    My mobile (from October 6th): 0450 161 361
>    On 4 March 2015 at 16:25, Nick Tudor <njt at tudorassoc.com> wrote:
> 
>      In-line responses, Andrew:
>      On Wednesday, 4 March 2015, DREW Rae <d.rae at griffith.edu.au> wrote:
> 
>    Michael,
>    I need to give more than one example, because the point is general,
>    rather than specific to the individual causes. In each case the
>    cumulative probability of software failure increases over time.
> 
> 
> 
>    >>if you can determine the wear-out mechanism for software I would
>    agree, but you can't, so I don't.
> 
>    1) Damage to the instruction set
>    e.g. the physical record of the instructions on a storage medium
>    changes
>    very specific e.g. bit flip on a magnetic storage device holding the
>    executable files
> 
> 
> 
>    >>this is a hardware issue.
> 
> 
> 
>    2) Increased unreliability of the physical execution environment
>    e.g. an increased rate of processor errors
>    very specific e.g. dust accumulates on part of the processor card,
>    making it run hot and produce calculation errors
>    >> this too is hardware.
>    3) Increased unreliability of input hardware
>    e.g. software is required to detect and respond correctly to an
>    increased rate and variety of sensor failure combinations
>    Note: This is the one that challenges "but we're running the software
>    in exactly the same hardware environment". Hardware environments change
>    as they get older.
>    >>ditto
> 
> 
> 
>    4) Software accumulates information during runtime
>    e.g. a count of elapsed time
>    e.g. increasing volume of stored data
>    e.g. memory leak
>    >>bad requirements and/or bad verification.
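>    A concrete sketch of the elapsed-time case, with invented sizes: a
>    32-bit millisecond tick wraps after 2^32 ms, about 49.7 days, so code
>    that compares ticks naively degrades with uptime on perfect hardware.
> 
>    #include <stdint.h>
> 
>    static uint32_t ticks_ms;          /* bumped by a 1 ms timer interrupt */
> 
>    int deadline_passed(uint32_t deadline_ms)
>    {
>        return ticks_ms > deadline_ms; /* naive compare: wrong after wrap */
>        /* safe form: (int32_t)(ticks_ms - deadline_ms) >= 0 */
>    }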
>    NB1: In all of these cases I've heard arguments "that's not the
>    software, that's X". Those arguments are only relevant if you can
>    control for X when collecting data for software reliability
>    calculation. Software without an execution environment is a design. It
>    "never fails" in the way that _no_ design fails. Once it is implemented
>    and run, it is subject to the same degradation over time as any
>    physical implementation.
> 
>    >> there is no such thing as software reliability, so don't use maths
>    (or rather statistics, claiming they are maths) inappropriately.
> 
>    NB2: I'm not claiming that failure due to physical degradation is
>    significant compared to failure due to errors in the original
>    instructions. I'm saying that we don't know, and that not knowing
>    becomes a big issue once we've tested to the point of not finding
>    errors in the original instructions. At that point, absent evidence to
>    the contrary, we should be assuming that physical degradation is
>    significant.
>    >> No one (I hope) denies that hardware effects may influence software
>    calculations. That still doesn't mean that the maths, er, statistics,
>    are the right tool for the job.
> 
> 
> 
> 
> 
>    Drew
>    On 4 March 2015 at 12:27, Michael J. Pont <M.Pont at safetty.net> wrote:
> 
>      Drew,
>      “The underlying point holds, that software _can_ exhibit degraded
>      performance over time.”
>      Can you please give me a simple example of what you mean by this.
>      Thanks,
>      Michael.
> 



> _______________________________________________
> The System Safety Mailing List
> systemsafety at TechFak.Uni-Bielefeld.DE


-- 
David Haworth B.Sc.(Hons.), OS Kernel Developer    david.haworth at elektrobit.com
Tel: +49 9131 7701-6154     Fax: -6333                  Keys: keyserver.pgp.com
Elektrobit Automotive GmbH           Am Wolfsmantel 46, 91058 Erlangen, Germany
Geschäftsführer: Alexander Kocher, Gregor Zink       Amtsgericht Fürth HRB 4886




