[SystemSafety] Bugs in LLM generated proofs

Derek M Jones derek at knosof.co.uk
Wed Jan 14 17:23:31 CET 2026


All,

You have probably heard about how LLMs have been solving
mathematical problems, and how the LLMs have used the Lean
proof assistant to verify the correctness of the proof.

Guess what?  Not only do LLMs generate buggy code, they also
generate buggy proofs.  A particularly insidious bug occurs
when the LLM rewrites the original to create something that is
easier to prove, but does not prove that the rewritten
question is the same as the original one (sometimes it isn't).

Lots of discussion and links here
https://www.lesswrong.com/posts/rhAPh3YzhPoBNpgHg/lies-damned-lies-and-proofs-formal-methods-are-not-slopless

The main reason I prefer Deepseek and Kimi for solving maths
problems is that they provide chain-of-thought.  So I can see
how they have interpreted my question (not always how I intended),
and the simplifications they make (not always applicable).

ChatGPT and Grok often just give an answer, although ChatGPT
is starting to give more chain-of-thought feedback.

-- 
Derek M. Jones           Evidence-based software engineering
blog:https://shape-of-code.com



More information about the systemsafety mailing list