[SystemSafety] Scalability of Formal Methods [was: Logic]

Patrick Graydon patrick.graydon at mdh.se
Thu Feb 27 12:27:43 CET 2014


On 27 Feb, 2014, at 11:27, Martyn Thomas <martyn at thomas-associates.co.uk> wrote:

> I agree with all this (except that I have never met anyone who "seemed to think that software proof was going to tell them what the real chips would do in all circumstances").  Who are these people who keep appearing anecdotally on mailing lists like this?  Can we name them and see if we can stop them clouding the issues with such nonsense?

The people who come most readily to mind were graduate students discovering formal methods for the first time.  Fortunately, such people can be trained.  Unfortunately, there will be a never-ending stream of them.  I think the most practical approach here is just to add a few words to clarify what is meant.  For example, instead of writing ‘formal verification of the software’, we could write that ‘the source code has been shown to refine the formal specification by a machine-checked proof’.  (Or the machine code, if someone has done that.  Or model checking, or hand proof, or whatever the mechanism is.  Or maybe we are only showing the absence of null-pointer dereferences or double frees of locks.)  A bit more verbose, perhaps, but its meaning and limits are less susceptible to misunderstanding by the naïve.


> I understand you to be saying that all the gaps in assurance need to be identified and filled as far as possible, and that the amount of assurance that has been achieved for each gap should be clearly stated. I agree completely. But I'm intrigued by your statement 
> "But safety depends on knowing what will happen in the real world on real chips.  And if you can’t prove that, then you need to bring in a different form of evidence to complete the picture.  Some sort of inference based on observation.  Testing, in other words."
> 
> I believe that testing a system is essential. But I'd like to understand what you think that passing a set of tests has told you about the safety of a system. To put it another way: what is it that you want to find out, and that can be discovered by constructing and running suitable tests? I ask this with a genuine desire to explore and learn.

Well, it’s pretty strong evidence that the system will behave as our software safety requirements (SSRs) say it should … in exactly the cases where it gets the inputs that were used in testing.  It’s weaker evidence that the system behaves according to the requirements from which the test cases were derived in all circumstances, because of the inductive leap.  But if we’re talking about software- or system-level testing of the release-candidate binary executable on the target hardware, it’s a form of evidence about the software’s behaviour that has the potential to find defects (i.e., things that would preclude meeting the SSRs) introduced into any of several different software development artefacts at several different points in the development lifecycle.
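
To make the ‘sampled points’ caveat concrete, here is a minimal sketch in Python (the function, the requirement, and the numbers are all invented for illustration, not taken from any real system): a test derived from an SSR demonstrates the required behaviour directly only at the inputs it actually uses; everything beyond those points rests on the inductive leap.

    # Hypothetical unit under test: clamp a torque request to +/- 50 Nm.
    def commanded_torque(request_nm: float) -> float:
        return max(-50.0, min(50.0, request_nm))

    # Invented SSR: "Commanded torque shall never exceed 50 Nm in magnitude."
    SAMPLED_INPUTS = [-1000.0, -50.0, 0.0, 49.9, 50.0, 1000.0]

    def test_torque_is_limited():
        for request in SAMPLED_INPUTS:
            assert abs(commanded_torque(request)) <= 50.0
        # A pass is strong evidence about these six inputs and only
        # inductive evidence about the rest of the input space.

    test_torque_is_limited()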

As I see this, what I want to know is whether real chips running the software behave as the software safety requirements say they should.  Formal verification of the source code and testing (at various levels) provide evidence that supports such a conclusion to different degrees.  But each of these forms of evidence is imperfect, with different strengths and weaknesses.  The SPARK verification system has the tremendous advantage of covering the entire possible input space.  But, as I pointed out, it can’t tell us about defects introduced into the delivered binary executable by a faulty compiler.  Testing of the release-candidate executable on the target chips could reveal such defects, but only if our limited set of test points stimulates the faulty behaviour.  I think we can maximise our chance of finding /any/ safety-related defect – or, in other words, maximise our confidence that the software-as-running-in-the-world meets the SSRs – by using a combination of testing and analysis techniques such that the weaknesses of one are covered by the strengths of another.
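
The coverage difference is easy to illustrate with a toy example (entirely invented; an exhaustive check over a small bounded domain stands in for what a proof tool covers symbolically over the whole input space): analysis that covers every input finds a corner-case defect that a plausible handful of test points happens to miss.

    # Deliberately faulty saturation routine: exactly x == 101 escapes
    # the upper clamp.
    def saturate(x: int) -> int:
        if x < -100:
            return -100
        if x > 101:
            return 100
        return x   # bug: 101 falls through unclamped

    spot_tests = [-200, -100, 0, 100, 200]   # a plausible hand-picked set
    whole_space = range(-1000, 1001)         # the entire (bounded) domain

    print(any(abs(saturate(x)) > 100 for x in spot_tests))    # False: tests pass
    print(any(abs(saturate(x)) > 100 for x in whole_space))   # True: defect found

The converse weakness, of course, is that the exhaustive check above is still an analysis of the source text; only testing the compiled artefact on the target chips can catch defects introduced downstream of it.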

I think the typical guidance for testing of safety-critical systems aims to achieve such a balance.  Unit testing can often cover many more inputs to a given procedure than integration, software, or system-level testing can, but the test executable is generally not the unmodified release-candidate executable, and we can only infer the relationship between unit test results and the behaviour of the whole software based on our (possibly incorrect) understanding of the software structure.  Hardware-in-the-loop testing might not achieve the same structural and input-space coverage, but generalising from it to the behaviour of the deployed system is much less tenuous.  So we call for a mix of unit, integration, software, and system-level testing.

So far, I think you and I seem to be in agreement.  If I understand your question right, it is mainly about what kind of testing we should do in cases where we are also building machine-checked proof that (most of?) the source code refines a formal specification.  And here, I must admit that I have no answer.

It has occurred to me that the ideal testing to complement proof that source code refines a specification might /not/ be the same mix of testing that we recommend where formal methods are not used.  If, in these cases, we mainly want testing so that we can find compiler-introduced defects and ways in which the real chips don’t behave as the instruction set manual says they should, perhaps testing to achieve coverage of the instruction set would find more bugs, or more bugs per testing dollar, than, say, achieving MC/DC would.  I don’t have the answer here; I don’t even know the complete list of the types of defects I’d want the testing to find in these cases.  It’s a bit of a cliché, but it might be that more research is needed here.
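
For readers less familiar with MC/DC, here is a small sketch of what that criterion asks for (a generic illustration, not a quotation of any particular standard’s wording): each condition in a decision must be shown to independently affect the decision’s outcome, which for n conditions can be achieved with as few as n + 1 test vectors.

    from itertools import combinations

    def decision(a, b, c):
        # Example decision with three conditions.
        return (a and b) or c

    def achieves_mcdc(vectors, dec):
        """For each condition, find a pair of vectors that differ only in
        that condition and produce different decision outcomes."""
        n = len(vectors[0])
        for i in range(n):
            if not any(
                    all((v1[j] != v2[j]) == (j == i) for j in range(n))
                    and dec(*v1) != dec(*v2)
                    for v1, v2 in combinations(vectors, 2)):
                return False
        return True

    # Four vectors suffice for three conditions (n + 1 in the best case).
    tests = [(True, True, False), (False, True, False),
             (True, False, False), (True, False, True)]
    print(achieves_mcdc(tests, decision))   # True

Coverage of the target instruction set, by contrast, would be measured against the compiled binary rather than against source-level conditions like these, which is why it might be a better match for the defects that remain after proof.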

Is that clearer?

— Patrick

Dr Patrick John Graydon
Postdoctoral Research Fellow
School of Innovation, Design, and Engineering (IDT)
Mälardalens Högskola (MDH), Västerås, Sweden


