[SystemSafety] Bugs in LLM generated proofs
Derek M Jones
derek at knosof.co.uk
Wed Jan 14 17:23:31 CET 2026
All,
You have probably heard about how LLMs have been solving
mathematical problems, and how the LLMs have used the Lean
proof assistant to verify the correctness of the proof.
Guess what? Not only do LLMs generate buggy code, they also
generate buggy proofs. A particularly insidious bug occurs
when the LLM rewrites the original to create something that is
easier to prove, but does not prove that the rewritten
question is the same as the original one (sometimes it isn't).
Lots of discussion and links here
https://www.lesswrong.com/posts/rhAPh3YzhPoBNpgHg/lies-damned-lies-and-proofs-formal-methods-are-not-slopless
The main reason I prefer Deepseek and Kimi for solving maths
problems is that they provide chain-of-thought. So I can see
how they have interpreted my question (not always how I intended),
and the simplifications they make (not always applicable).
ChatGPT and Grok often just give an answer, although ChatGPT
is starting to give more chain-of-thought feedback.
--
Derek M. Jones Evidence-based software engineering
blog:https://shape-of-code.com
More information about the systemsafety
mailing list