[SystemSafety] [cip-dev] [C-safe-secure-studygroup] Critical systems Linux

Paul Sherwood paul.sherwood at codethink.co.uk
Thu Nov 22 14:19:44 CET 2018


Hi Clive,
this is very helpful, thank you. I'm going to re-add the other lists, 
for the same reason as before, and hope you're ok with that. Please see 
my comments inline below...

On 2018-11-22 10:26, Clive Pygott wrote:
> I'll have a go at your question - FYI my background is system safety
> management (as in 61508 & DO178) and coding standards (MISRA & JSF++)
> 
> You are right that ultimately system safety is a _system_ property.
> You cannot talk about software doing harm without knowing what it's
> controlling and how it fits into its physical environment.

Understood, and I'd be surprised if anyone would challenge this 
reasoning.

> However, a standard like 61508 takes a layered approach to safety.

I'm not sure I understand "layered approach", since I've heard it 
mentioned in multiple situations outside safety (security for one, and 
general architecture/abstraction for another).

Are you saying that the aim is redundant/overlapping safety methods, to 
avoid a single point of failure, or something else?

> The topmost
> levels are system specific: how could the system behave (intentionally
> or under fault conditions) to cause harm? and what features of the
> architecture (including software requirements) mitigate these risks?
> This establishes traceability from software requirements to safety.

OK, understood. In previous discussions I've been attempting to 
understand whether there are any fundamental reasons why such 
requirements need to exist before the software is written, or whether 
they could be originated for a specific system and then 
considered/applied to pre-existing code. Is there a hard-and-fast 
argument one way or the other?

> From the software perspective, under this is the requirement to show
> that those software requirements related to safety have been
> implemented correctly, and as usual this has two components:
> 
> 	* showing that the code implements the requirements (verification -
> we've built the thing right)

OK, makes sense.

> 	* showing the code is well behaved under all circumstances
> (validation - we've built the right thing)

Here I fall off the horse. I don't believe we can be 100% certain about 
"all circumstances", except for small/constrained/simple systems. So I 
distrust claims of certainty about the behaviour of modern COTS 
multicore microprocessors, for example.

> If you are doing full formal semantic verification, the second step is
> unnecessary, as the semantic proof will consider all possible
> combinations of input and state.

It's not fair to single out any individual project/system/community, but 
as an example [1], seL4's formal proof of correctness proved to be 
insufficient in the context of Spectre/Meltdown. I'd be (pleasantly) 
surprised if any semantic proof can withstand misbehaviour of the 
hardware/firmware/OS/tooling underneath it.
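
(For concreteness, and purely as a sketch of my own rather than anything 
Clive proposed: the kind of machine-checked contract I understand "full 
formal semantic verification" to produce looks roughly like the 
ACSL/Frama-C annotation below. The prover has to discharge the ensures 
clauses for every input admitted by the requires clause, not a sampled 
subset.)

    #include <limits.h>

    /* A contract in the ACSL style used by Frama-C: the proof obligation
       covers every int x with x > INT_MIN, i.e. all combinations of input
       and state, which is exactly what testing cannot give us. */
    /*@ requires x > INT_MIN;
        ensures \result >= 0;
        ensures \result == x || \result == -x;
    */
    int abs_int(int x)
    {
        return (x < 0) ? -x : x;
    }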

> However, in practice formal proof is
> so onerous that it's almost never done. This means that verification is
> based on testing, which, no matter how thorough, is still based on
> sampling. There is an implied belief that the digital system will
> behave continuously, even though it's known that this isn't true (a
> good few years ago an early home computer had an implementation of
> BASIC with an integer ABS function that worked perfectly except for
> ABS(-32768), which gave -32768 - and it wasn't because it was limited
> to 16 bits; ABS(-32769) gave 32769, etc.).

Understood, and agreed.
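
(Incidentally, the same shape of discontinuity is still with us in C 
today - a minimal sketch of my own, assuming a typical two's-complement 
int:)

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* abs(INT_MIN) is undefined behaviour in C; on a typical
           two's-complement machine the negation wraps and the "absolute
           value" comes back negative - the same single-point surprise as
           the BASIC ABS(-32768) example above. */
        printf("%d\n", abs(INT_MIN));   /* commonly prints -2147483648 */
        return 0;
    }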

> The validation part aims to improve the (albeit flawed) belief in
> continuous behaviour by:
> 
> 	* checking that any constraints imposed by the language are respected
> 	* checking that any deviations from arithmetic logic are identified
> (i.e. flagging where underflow, overflow, truncation, wraparound or
> loss of precision may occur)
> 
> This is the domain of MISRA and JSF++: checking that the code will
> behave sensibly, without knowledge of what it should be doing.

OK, IIUC this is mainly:

- 'coding standards lead to tests we can run'. And once we are into 
tests, we have to consider the applicability, correctness and 
completeness of those tests, in addition to the "flawed belief in 
continuous behaviour".

- and possibly some untestable guidance/principles, which may or may not 
be relevant/correct.
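
(For what it's worth, I read the second bullet in Clive's list as 
translating into checks like the sketch below - the helpers and their 
names are mine and purely illustrative, not anything mandated by MISRA:)

    #include <limits.h>
    #include <stdbool.h>

    /* The kind of pre-condition check a MISRA-style review or a static
       analyser pushes you towards: signed overflow is undefined behaviour
       in C, so flag it and handle it rather than let it wrap silently. */
    static bool add_would_overflow(int a, int b)
    {
        return (b > 0 && a > INT_MAX - b) ||
               (b < 0 && a < INT_MIN - b);
    }

    int checked_add(int a, int b, bool *ok)
    {
        if (add_would_overflow(a, b)) {
            *ok = false;
            return 0;     /* caller decides how to recover */
        }
        *ok = true;
        return a + b;
    }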

> To get back to the original discussion, it is staggeringly naive to
> claim that 'I have a safe system, because I've used a certified OS
> kernel'.  I'm sure you weren't suggesting that, but I have seen
> companies try it.

I've also seen that (in part that's why I'm here) but I certainly 
wouldn't dream of suggesting it.

> What the certified kernel (or any other
> architectural component) buys you is that someone has done the
> verification and validation activities on that component, so you can
> be reasonably confident that that component will behave as advertised
> - it's a level of detail your project doesn't have to look into (though
> you may want to audit the quality of the certification evidence).

OK in principle. However, from some of the discussion, which I won't 
rehash here, it seemed to me that some of the safety folks were expressly 
not confident in some of the certified/advertised/claimed behaviours.

> As I read your original message you are asking 'why can't a wide user
> base be accepted as evidence of correctness?'

If that's the question I seemed to be asking, I apologise; certainly I 
wouldn't count a wide user base as evidence of correctness. It's 
evidence of something, though, and that evidence may be part of what 
could be assessed when considering the usefulness of software.

> The short answer is: do
> you have any evidence of which features of the component the users are
> using, and in what combination?

I totally agree - which brings us back to the need for 
required/architected behaviours/properties.

> Is my project about to use some
> combination of features in an inventive manner that no-one has
> previously tried, so the wide user base provides no evidence that it
> will work? (Again, a good few years ago, colleagues of mine were
> writing a compiler for a VAX and traced a bug to a particular
> instruction in the VAX instruction set that had an error in its
> implementation. No DEC product or other customer had ever used this
> instruction. BTW, DEC's solution was to remove it from the
> instruction set.)

Makes sense. Thanks again Clive!

br
Paul

[1] https://research.csiro.au/tsblog/crisis-security-vs-performance/

