[SystemSafety] Bugs in LLM generated proofs

Wed Jan 14 17:23:31 CET 2026

All,

You have probably heard about how LLMs have been solving
mathematical problems, and how the LLMs have used the Lean
proof assistant to verify the correctness of the proof.

Guess what?  Not only do LLMs generate buggy code, they also
generate buggy proofs.  A particularly insidious bug occurs
when the LLM rewrites the original to create something that is
easier to prove, but does not prove that the rewritten
question is the same as the original one (sometimes it isn't).

Lots of discussion and links here
https://www.lesswrong.com/posts/rhAPh3YzhPoBNpgHg/lies-damned-lies-and-proofs-formal-methods-are-not-slopless

The main reason I prefer Deepseek and Kimi for solving maths
problems is that they provide chain-of-thought.  So I can see
how they have interpreted my question (not always how I intended),
and the simplifications they make (not always applicable).

ChatGPT and Grok often just give an answer, although ChatGPT
is starting to give more chain-of-thought feedback.

-- 
Derek M. Jones           Evidence-based software engineering
blog:https://shape-of-code.com