[SystemSafety] 7% of mistakes in Coq proofs go undetected

David MENTRÉ David.MENTRE at bentobako.org
Sun Dec 8 21:07:54 CET 2019


Dear Mr Jones,

On 14/11/2019 at 16:16, Derek M Jones wrote:
> Some interesting results from the mutation analysis of Coq proofs:
> http://cozy.ece.utexas.edu/mcoq/
>
> Mutation analysis introduces a mistake into code by mutating existing
> code at some point (there is a small cottage industry working on new
> mutation operators); the modified code is then executed in some way.
>
> A common research use (not much used in industry) is using mutated
> code to check the quality of a test suite, i.e., a thorough test
> suite should detect the added mistake.
>
> "Mutation Analysis for Coq" does what it says on the tin.  It mutates
> Coq proofs, and then checks whether the verification fails (which it
> should do).
>
> In 7% of cases (6.82% to be exact), verification succeeded (averaged
> over all mutations).
>
> So, for those of you who have paid for Coq proofs to be written for
> your software: if the proof contains a mistake, there is something
> like a 7% chance it has gone undetected.
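
To make the mechanism concrete, here is a minimal sketch of my own
(the names are hypothetical, not taken from the paper): a mutation
replaces a subterm of a definition, and all the proofs are then
re-checked. A strong lemma kills the mutant; a weak one lets it live.

  Fixpoint sum (l : list nat) : nat :=
    match l with
    | nil => 0
    | x :: t => x + sum t
    end.

  (* An mCoq-style mutant could replace "x + sum t" with "sum t".
     This lemma kills it: under the mutant, "sum (x :: l)" computes
     to "sum l", so the proof no longer checks. *)
  Lemma sum_cons : forall x l, sum (x :: l) = x + sum l.
  Proof. intros; reflexivity. Qed.

  (* A suite containing only "sum nil = 0" would let the mutant
     live: that proof still succeeds. *)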

In fact, I find this 7% figure a pretty good score; I would have
expected it to be much higher. Formal proofs are relative to a formal
specification. If this specification is incorrect, the proof will
prove the incorrect thing. We all know that this is the weak point of
formally verified software. Various complementary approaches are
therefore used to get correct software, such as adequate testing and
thorough review of the specifications.
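
To make the underspecification point concrete, here is a minimal
sketch (a toy example of mine): a deliberately wrong "reverse",
namely the identity function, satisfies a specification that only
constrains the length of its result.

  (* Hypothetical example: a wrong "reverse" (the identity). *)
  Definition bad_rev (A : Type) (l : list A) : list A := l.

  (* An underspecified property: reversal preserves length.
     bad_rev satisfies it, so a machine-checked proof of this
     spec alone guarantees nothing about element order. *)
  Lemma bad_rev_preserves_length :
    forall (A : Type) (l : list A), length (bad_rev A l) = length l.
  Proof. intros A l. reflexivity. Qed.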

Looking at the qualitative analysis is interesting. 45% of live
mutants are related to underspecified definitions. The others are
related to dangling definitions, i.e. definitions not used elsewhere
(similar, in my view, to dead code), and to semantically equivalent
mutations (i.e. the mutation does not change the program's meaning).
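
As an illustration of a semantically equivalent mutation (again a toy
example of mine, not one from the paper): swapping the operands of a
commutative operator yields a mutant that computes exactly the same
function, so no specification can kill it.

  Definition both (a b : bool) : bool := andb a b.

  (* Mutant: operands swapped. Since andb is commutative, the two
     definitions are extensionally equal; this mutant is necessarily
     live, and its survival says nothing about proof quality. *)
  Definition both_mutant (a b : bool) : bool := andb b a.

  Lemma mutant_equivalent : forall a b, both a b = both_mutant a b.
  Proof. intros [] []; reflexivity. Qed.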

Among the paper's examples of the 45% of live mutants related to
underspecified definitions, some cases are harmless because they
preserve the algorithm's behavior but change its complexity (the LRT
mutant in MathComp), or because the change has no impact thanks to
systematic assumptions made in other parts of the proofs (BFT in
Flocq). Other live mutants do find genuine bugs in the specification.

This (good) 7% score is unfortunately related to the fragility of
formal proofs: a slight change can break them. Proof engineering
still has much to learn from software engineering in that regard.
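
A small illustration of this fragility (hypothetical, not from the
paper): the script below succeeds only because the goal has the exact
shape "n + 0 = n". Changing the definition to "0 + n" keeps the lemma
true, but the rewrite step fails and the proof must be repaired by
hand.

  Definition f (n : nat) : nat := n + 0.

  (* Brittle script: it depends on the syntactic shape of the goal
     after unfolding f, not only on the truth of the statement. *)
  Lemma f_id : forall n, f n = n.
  Proof. intro n. unfold f. rewrite <- plus_n_O. reflexivity. Qed.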

Best regards,
D. Mentré
