[SystemSafety] Component Reliability and System Safety

Paul Sherwood paul.sherwood at codethink.co.uk
Mon Sep 17 17:29:34 CEST 2018


Hi Peter,

Thanks again for your feedback.

Please see my comments inline. I'll probably drop the topic now, to 
avoid repeating myself further.

br
Paul

On 2018-09-17 13:19, Peter Bernard Ladkin wrote:
> On 2018-09-17 11:06 , Paul Sherwood wrote:
>> But software is a very big field. It seems to me that most of the 
>> software we are relying on these
>> days was developed without following coding standards in general, ....
> 
> That may be true in general; I wouldn't know. It is specifically not
> true for safety-related systems.

Hmmm. I'm sure you mean something specific by "safety-related systems" 
but I think my understanding must be different. Is there a succinct and 
agreed definition anywhere?

Are any IoT devices "safety-related systems" in your understanding? Is a 
car a safety-related system? For both of those classes of device, the 
majority of code in them is not MISRA C.

I could go on... pacemakers, heart monitors and other categories of 
critical medical equipment are increasingly connected to public-facing 
wireless, etc.

> IEC 61508-3 Table B.1 entry 1 says that "Use of coding standard to
> reduce errors" is Highly
> Recommended (HR) for all SILs. ("Highly Recommended" is the strongest
> form of encouragement for any
> specific technology in the standard).

Hmmm, again.

I'm attempting to get a refund for that document, since it was mis-sold 
to a colleague in such a way as to make it useless.

I confess I haven't read 61508 myself (and do not expect to, given its 
cost and EULA) but I'm currently hopeful that I can find useful and 
practical guidance elsewhere, including via this list.

> If you don't use a coding standard, your assessor is going to want to
> know why. (Telling him/her you
> think they are outmoded is not generally regarded as an acceptable 
> answer.)

I'm more surprised that "we follow a coding standard" would be 
considered an acceptable answer to anything that impacts system 
safety.

I wouldn't accept it myself, frankly, since I've been told lies by many 
people defending their projects over the years.

>> We could insist that the software be developed in Haskell, or Rust, or 
>> some other technology that
>> provides a higher level of control over the code creation.
> 
> Not in developing safety-related systems, you can't. At least, not at
> the moment. With Haskell, the
> need for garbage collection gets in the way of being used (it
> potentially interferes with timing
> constraints in a non-deterministic way).

Hmmm. This seems to be another throwback from decades ago. I expect 
it's possible to design for safety without requiring that all of the 
applicable software satisfies its timing constraints deterministically?

And if there were a real need, I expect we could analyse the actual 
GC code for a given Haskell implementation and establish that its 
behaviour is deterministic with respect to time for a specific 
device/toolchain/OS.
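
To make that concrete, here's a minimal sketch of the kind of structure I 
have in mind (Haskell, purely illustrative, assuming a GHC toolchain; none 
of this is from a real certified system). The idea is to trigger collection 
explicitly at a known point in each control cycle, so its cost lands at a 
predictable place in the schedule rather than at an arbitrary one:

-- Purely illustrative: collect at a known point in each cycle, in slack
-- time, rather than wherever the runtime happens to decide.
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import System.Mem (performGC)

controlCycle :: IO ()
controlCycle = putStrLn "read sensors, compute, actuate"  -- hypothetical work

main :: IO ()
main = forever $ do
  controlCycle
  performGC          -- collection happens here, at a known point
  threadDelay 10000  -- 10 ms cycle time, illustrative only

Whether the pause that performGC introduces is acceptably bounded would 
still have to be measured and argued for the specific device/toolchain/OS, 
which is exactly the analysis I'm suggesting is possible.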

>> Coding standards can actually be counter-productive, for example if
>> ......
>> - they are used when they shouldn't be
>> 
>> This ..... is exactly the reason for my original question.
> 
> Since the general E/E/PE functional safety standard HR's use of coding
> standards, what could be the
> scope of "used when they shouldn't be"?

Well, the simple in-context example here would be a case where folks 
misguidedly insist that something be created in MISRA C, without 
considering potential alternative approaches that wouldn't require it, 
e.g.:

- handle the safety requirements elsewhere
- adopt a pre-existing, proven solution
- devise a mechanical or electrical (non-software) implementation
- implement in a 'better' language (e.g. Ada SPARK)
- implement on multiple machines with redundancy/watchdogs etc

>> dependable != safe
> 
> Thank you.

I take it you agree, and knew this already, and were being ironic.

>>>  I would go further - it is important for any system
>>> which is not deliberately built to subvert the purposes of the 
>>> client.
>> 
>> Sorry, I don't understand this comment at all.
> 
> I was referring to malware.

OK. In that case, back to your point...

The 'cattle vs pets' model has gained significant traction in various 
sectors. Its core idea is that for some classes of system it's better 
simply to 'kill' a misbehaving machine/program/component and spin up a new 
one, rather than worrying about every individual instance. So while this 
may be an odd diversion from our main thread here, I think it does 
disprove your statement :-)
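
For concreteness, here's a toy sketch of the pattern in Haskell (the names 
and timings are mine, purely illustrative): a supervisor watches a heartbeat 
from a worker and, if it goes stale, kills the worker and spins up a fresh 
one rather than trying to nurse the individual instance back to health.

-- Toy sketch of the 'cattle' pattern: kill and replace a misbehaving
-- worker instead of caring for the individual instance.
import Control.Concurrent (ThreadId, forkIO, killThread, threadDelay)
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')

worker :: IORef Int -> IO ()
worker heartbeat = loop
  where
    loop = do
      modifyIORef' heartbeat (+ 1)  -- signal liveness
      threadDelay 100000            -- pretend to do 100 ms of useful work
      loop

supervisor :: IO ()
supervisor = do
  heartbeat <- newIORef 0
  let watch :: ThreadId -> Int -> IO ()
      watch tid lastSeen = do
        threadDelay 500000          -- check every 500 ms
        now <- readIORef heartbeat
        if now == lastSeen
          then do
            killThread tid                       -- 'shoot' the stalled worker...
            newTid <- forkIO (worker heartbeat)  -- ...and spin up a new one
            watch newTid now
          else watch tid now
  tid <- forkIO (worker heartbeat)
  watch tid 0

main :: IO ()
main = supervisor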

>>> A question. What important safety properties of a bicycle are *not*
>>> reducible to component reliability?
>> 
>> For simple systems, where the safety mechanisms are expressly 
>> mechanical, reliability obviously
>> matters.
> 
> Yes. The question was genuine. I couldn't think of a common safety
> property of a bicycle which is
> not ensured by a subsystem/component.

Actually, as I understand it, human science hasn't fully figured out 
why bicycles work at all [1]. Given that bicycles *do* work, I'm 
guessing that their safety must somehow rely on that mysterious 
(system-level) physics. But I'm digressing again.

In any case, the MIT work cites clear examples where safety decreases as 
component reliability increases, and others where safety is compromised 
in spite of perfect component behaviour. Having weighed those examples 
against my own practical experience, I find it easy to understand and 
accept the assumptions and assertions made by the authors.

I'm much less convinced by the reliability principle that underpins the 
"safety requires a coding standard" argument here.

Safety is a system property, not a software property, after all. But 
that's been said already, and you and others here don't seem to see the 
same implications that I do.

>> But for **safety** of complex systems, I'm guessing that current best 
>> practice must involve
>> designing-in safety from multiple directions, with failsafes, 
>> redundancy and/or similar?
> 
> Current best practice involves following the strictures of IEC 61508
> or its so-called "derivatives"
> for non-aerospace, non-medical systems, and ISO 26262 for road
> vehicles; DO-178C/EA-12C for critical
> aerospace code, including EA-217 and EA-218. (I have forgotten the
> numbers of the medical-systems
> E/E/PE safety standards. They come largely from IEC TC 62 and 66, I 
> think.)

That being the case, and assuming that the IEC document is actually 
useful (I have some concerns about that, as I am sure you already 
understand), why is IEC 61508 so ridiculously expensive that most 
engineers will never read it?

Apart from the various ecosystem players making money from the overall 
scheme, I do think that part of the underlying aim has been to maintain 
credibility via mystique and the avoidance of public scrutiny. Given that 
you're maintaining it, I expect you may be unhappy with that comment, but 
the fact remains - if the document is really what people should be 
reading, then the 3K tax per reader is a disgrace.

>> Presumably the architectural level safety considerations must include 
>> the **expectation of failure
>> in components**, and lead to designs which mitigate against expected 
>> (bound to happen) failures, to
>> satisfy safety goals?
> 
> This is too vague for me to judge. A brief review of the philosophy
> behind IEC 61508 may be found at
> https://rvs-bi.de/publications/books/ComputerSafetyBook/12-Kapitel_12.pdf

30 pages is not brief IMO. I'll do my best to digest it, though, and 
thank you for the link.

> I wrote this before I got
> involved in the maintenance of IEC 61508. The civil aerospace
> airworthiness requirements specify
> certain ultrahigh reliability requirements for individual components.
> Components here are bits of
> airplanes which do things, not SW modules.

Understood. To be clear (and I believe I said this already), I'm not 
claiming that component reliability is never required for system safety. 
In some designs/scenarios (particularly for mechanical systems) there 
may be no other sensible way to achieve it.

But software is not a mechanical system.

>> If our safety depends on the reliable behaviour of even a small 
>> program on (say) a modern multi-core
>> microprocessor interacting with other pieces of software in other 
>> devices, I think "we are lost" again.
> 
> It does, and it will do for the foreseeable future.

Did you miss my point about "modern multi-core microprocessor"? That's 
not a microcontroller-scale device. It comes bundled with a surprising 
amount of software, which is commonly swept under the rug as 'firmware', 
and isn't written in MISRA C. It probably does interesting things like 
"speculative execution"...

Then there's probably a boot loader. Not MISRA C. Device drivers. Not 
MISRA C. etc...

>> I'm worrying about autonomous vehicles and other systems of similar 
>> complexity. As I understand it
>> most of the software in these systems won't even be written in C, let 
>> alone following MISRA C rules.
> 
> I wouldn't know.

OK.

>>> However, when prefaced with "dear fellow safety professionals", one
>>> might consider them banal.
>> 
>> I'm not a "safety professional".
> 
> Ah, OK. Since it costs a four-figure sum of money, I guess that also
> means you don't have a copy of
> IEC 61508 to hand.

There's one within 20 feet of my desk, but I'm not allowed to read it.

> You also may not know the analysis methods required
> to be used for safety-related
> systems.
> 
> You may also not know what safety-related systems look like. Lots of
> little and mid-sized boxes
> plugged together would not be an inappropriate picture.

OK, so from this I take it you're not talking about (say) a pacemaker.

>> However I am relatively experienced in large scale software, and (as 
>> you can see) I'm struggling to
>> understand how 'safety professionals' can advocate the application of 
>> principles from mechanical
>> reliability engineering, plus "things we learned on 
>> microcontroller-scale projects several decades
>> ago" to complex software-intensive systems in 2018.
> 
> a. Because the standard requires it.

I think the standard may be wrong, then.

But maybe that boils down to the definition of 'safety-related systems', 
which I'll be checking properly once I've finally finished this email 
:-)

> b. Because it largely works better than purely winging it.

Interesting opinion, and probably true for some classes of software.

But as I said, most of the software we're all relying on these days was 
constructed without reference to MISRA C. Arguably the most popular and 
widely used software component as of 2018 is the Linux kernel, and the 
Linux community has come quite a long way by 'winging it', without much 
of a coding standard and certainly without applying MISRA C.

> c. Safety-related systems don't have a lot of "large scale software"
> in them.

OK, so I'm confused then... Is an autonomous vehicle a "safety-related 
system"?

If yes, I believe your c. is incorrect, since a humongous amount of 
software is definitely involved.

> They have a lot of
> relatively-small-scale software and firmware running on dedicated
> devices and a bunch of
> configuration. (I say "relatively small scale" - even jet engine
> control software, which has pretty
> straightforward tasks to perform, can end up being 8-figure megabytes.
> I can't find an easy
> reference. Other people on this list can be more specific.)

A typical car these days involves 50-100 MLOC. In broad terms, 
approximately half of that is in the infotainment system, which (as I 
understand it) may or may not be considered relevant to safety. The rest 
is mostly in 'ECU' boxes that sound a lot like your picture of lots of 
little and mid-sized boxes plugged together.

>>> Similarly, those who have never flown an airplane may wonder why
>>> checklists are used for
>>> configuration for key phases of flight such as landing. Once you have
>>> flown an airplane and learned
>>> a little of what happens to others who fly, it becomes banal.
>> 
>> Fair point, but a little off topic imo.
> 
> I don't think so. You might be surprised how useful checklists are in
> complex installations.
> Although everyone likes to decry them.

Your example is about the operation of specific systems, whereas we had 
been talking about the design and construction of software.

I'm in favour of checklists. In the context of software in 2018 I'd 
expect them to be scripted, including capture of evidence of who 
completed them, and when.
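
For example (a toy sketch in Haskell; the checklist items and log file name 
are made up): each item has to be confirmed interactively, and the script 
records who confirmed it and when, so the completed checklist doubles as 
evidence.

-- Toy sketch of a scripted checklist that records who completed each
-- item and when. Items and the log path are illustrative only.
import Data.Maybe (fromMaybe)
import Data.Time (getCurrentTime)
import System.Environment (lookupEnv)
import System.IO (hFlush, stdout)

checklist :: [String]
checklist =
  [ "Static analysis run and warnings reviewed"
  , "Unit tests passing on the target toolchain"
  , "Release notes reviewed"
  ]

confirm :: String -> IO ()
confirm item = do
  putStr (item ++ " -- done? [y/N] ")
  hFlush stdout
  answer <- getLine
  user   <- fromMaybe "unknown" <$> lookupEnv "USER"
  time   <- getCurrentTime
  appendFile "checklist-evidence.log"
    (show time ++ " " ++ user ++ " " ++ show answer ++ " " ++ item ++ "\n")

main :: IO ()
main = mapM_ confirm checklist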

<snip>
> A colleague observed
> that many or most of the CERT
> advisories still concern phenomena that could have been avoided
> through the use of (rigorous,
> reliable) strong typing.

Fine - we can enforce that via tooling and/or better languages, rather 
than via a coding standard.
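
To illustrate what I mean (a hedged sketch of my own, not anything from the 
CERT advisories themselves): in a language with cheap strong typing, raw and 
validated input can be given distinct types, so the compiler - rather than a 
reviewer armed with a coding standard - rejects the mix-up.

-- Illustrative sketch: untrusted and validated input as distinct types,
-- so passing raw input where validated input is required fails to compile.
newtype RawInput  = RawInput String
newtype Validated = Validated String

validate :: RawInput -> Maybe Validated
validate (RawInput s)
  | not (null s) && all (`elem` ['a'..'z']) s = Just (Validated s)
  | otherwise                                 = Nothing

-- Accepts only Validated; a RawInput here is a type error, not a review finding.
runCommand :: Validated -> IO ()
runCommand (Validated s) = putStrLn ("would run: " ++ s)

main :: IO ()
main = case validate (RawInput "status") of
  Just ok -> runCommand ok
  Nothing -> putStrLn "rejected untrusted input"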

<snip>
> Concerning a), many commercial airplanes used twenty years ago were
> not specifically evaluated
> against and qualified for resistance to EM fields inside the Faraday
> cage of a strength easily
> generated by consumer electronics (in particular malfunctioning
> consumer electronics). Nowadays,
> they are.

Aha, thanks for explaining that. I'm still rather concerned about the b) 
case, though.

>> In automotive I know that some user-facing (and even internet-facing) 
>> systems *do* sit on the CAN
>> bus, alongside multiple subsystems/components which are (presumed safe 
>> because they were) developed
>> in accordance with MISRA C guidelines.
> 
> *The* CAN bus?

Sorry, I'm not sure what you mean by this comment. There have been a 
reasonable number of public shamings already, though, e.g. [2]

br
Paul

[1] 
https://www.fastcompany.com/3062239/the-bicycle-is-still-a-scientific-mystery-heres-why
[2] https://www.theregister.co.uk/2015/07/24/car_hacking_using_dab/

