A Summary of My Recent Experience with Lean 4
In the past few months, I have been using the Lean 4 proof assistant to formalise programming language semantics and safety properties in my HDL line of research (e.g., Anvil). In this blog post, I would like to share my thoughts on and experience with using Lean.
Where It Helps
Lean is a proof assistant, so of course we expect it to assist.
One obvious way it helps is that once I write down a proof, Lean can type-check it to confirm it is correct. Once Lean says my proof of a theorem is correct, I don’t need to look back at that proof anymore (unless it references some unproven lemmas, in which case I do need to make sure those lemmas eventually get proven as well).
Beyond this kind of proof checking, I find Lean also helpful in the following ways:
- Maintenance of proof context. The kinds of proofs I need to write are tedious and involve manipulating a large number of items. Proving something by structural induction on a type, for example, requires considering ten or more cases (subgoals), and coming up with the corresponding induction hypotheses for each. This would be time-consuming and error-prone if done manually.
Lean maintains the proof context not just so that it can construct the desired proof terms from the tactics it is given, but also to present to the user as a reference. Sometimes a proof does not go through because of an unintended bug in a definition or a corner case that was overlooked, and reading the proof context that describes a subgoal makes the problem much easier to spot (see the sketch after this list). I have discovered issues this way on multiple occasions.
- Sensitivity to details. Formalising a programming language involves fixing a lot of subtle details, such as reduction rules and typing judgments. Let’s say I already have a version with a collection of theorems proven about its properties. Now if I want to adjust one of these details, as a human I would tend to rely on intuition and assume the existing proofs still go through. Even if I realised that some proofs would be affected, locating them all without missing any would be difficult. Lean helps here because it is sensitive to every change and precisely locates the places where proofs break. I can therefore adjust definitions while working on proofs without worrying about missing details.
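To make the first point concrete, here is a toy example (a standard arithmetic lemma, not from my development); the comment shows roughly what Lean’s InfoView displays at that point:

```lean
example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih =>
    -- Proof context shown in the InfoView here (abridged):
    --   k : Nat
    --   ih : 0 + k = k
    --   ⊢ 0 + (k + 1) = k + 1
    show (0 + k) + 1 = k + 1
    rw [ih]
```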
There’s a limit to both these points, however, and over-relying on Lean for them can lead to problems.
Where It Doesn’t Help
When I first started with Lean, I leaned on it somewhat blindly for the two points mentioned above.
First, I would look at the context it presented, try to understand what was going on, try some combination of tactics following my gut feeling, and see whether that solved the goal. If it didn’t, I would look at the new context and repeat the whole process.
One problem with this is that reading the proof context directly is time-consuming. Don’t get me wrong: as I mentioned above, I find reading the proof context useful for uncovering mistakes. But because it is such a mental burden to process, I have realised that ideally one reads it in full only on selected occasions: to double-check things when I already suspect something is wrong, or to pick up where I previously left off. Another problem is that this greedy strategy, focused on individual subgoals, ignores the big picture of the proof, and so it often fails to actually complete the whole proof. Quite frequently, I did not realise that the direction I was heading in would not work until I had already spent a lot of time on this sort of greedy proving, all of which ended up wasted.
So, to summarise this first point: a proof assistant like Lean does not help with figuring out the high-level proof strategy. To use it effectively, one needs to already have a rough sketch of the proof, developed on scratch paper or a whiteboard. While using Lean, it is also ideal to be able to predict how the context will be updated after each tactic; this way, the user mentally maintains a higher-level context that always matches the concrete one Lean maintains. In addition, when a proof breaks down into subgoals, tackle the more complex and tricky cases first (rather than in whatever order Lean happens to present them), so that we can be more confident the proof strategy will work out.
The second mistake I made when I started using Lean was to try to complete every tiny piece of the proof the first time I reached it. Yes, Lean is pedantic, and I want to be certain my proof has no loose ends, but spending effort proving things like x + y = y + x in the middle of figuring out an important theorem is just inefficient. It breaks the flow of thinking due to context switching. The effort is also wasted if, at a more critical part later on, we realise that the proof will not work out.
A better idea is to use sorry liberally: as a placeholder for the proof of a conclusion that is almost certainly true, or for a proof that requires too much context switching to handle immediately. Focus on setting up the scaffold first. Once the scaffold is in place, we tackle the individual sorrys, starting from the most non-trivial ones. This increases the likelihood that if there is a hole in the proof, we discover it earlier rather than later.
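As a contrived sketch of this workflow (the statement is purely illustrative):

```lean
-- Scaffold first: set up the case split and defer each branch with
-- sorry; then return to the branches, most substantial one first.
theorem double_ne_zero (n : Nat) (h : n ≠ 0) : n + n ≠ 0 := by
  cases n with
  | zero => sorry   -- follows directly from h; easy, fill in later
  | succ k => sorry -- the interesting case; come back to this first
```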
So overall, Lean is more like a screwdriver that helps us tighten screws. It does not help us design the blueprint, which is still crucial and cannot be omitted.
Automation
Aesop
Aesop helped me a lot with very obvious cases. When I could see that there was no existential quantifier in the goal and that it could be proven just by composing a sequence of existing lemmas and constructors, I could use aesop to save the effort of manually looking up these lemmas and constructors and typing a bunch of applys. It is a low-level, dumb tool, though, so it does not really help figure out how to prove something.
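A minimal sketch of the kind of goal I mean, assuming the Aesop library is available (the Even predicate is made up for illustration, and its constructors are registered as safe rules):

```lean
import Aesop

-- Registering the constructors lets aesop chain them automatically.
@[aesop safe constructors]
inductive Even : Nat → Prop where
  | zero : Even 0
  | plusTwo : ∀ n, Even n → Even (n + 2)

-- No existential to instantiate: aesop finds the constructor chain,
-- saving us from typing three applys by hand.
example : Even 6 := by
  aesop
```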
Autocompletion
I used GitHub Copilot’s autocompletion in VS Code and found it quite helpful. It was sometimes able to complete a proof in one shot if the proof was trivial (but tedious), or if it was repetitive or symmetrical. For symmetrical cases, once I had proven one of the cases, autocompletion could almost always complete the remaining work perfectly by itself. It was also useful when I needed to make a large number of small, repetitive edits after adjusting a lemma statement: again, once I had manually made the edit in one place, it would pick up the pattern and automatically complete the rest. It was unreliable, however, in most cases where one had to look at a bigger context to know which lemma to apply; there it would hallucinate lemmas that did not exist.
AI Agent
Towards the end of my effort, when I had already completed the most important proofs, I left some of the more obvious and less critical lemmas for an AI agent to figure out for me. I used Copilot with Claude Sonnet 4.6. The outcome was hit and miss: it completed about half of the cases without manual intervention, but for the remaining half, it would try something that failed the checks and then give up. It often got stuck on silly mistakes, e.g., applying lemmas with the wrong argument labels, something a human user would figure out instantly. I suppose the way an AI agent works on Lean proofs is quite different from how a human does: it reads and reasons extensively and writes up the whole proof before checking its correctness with lake build, whereas a human proceeds more incrementally and interactively, with frequent feedback. Perhaps more interactivity in the proof process could lead to better AI performance too.
The agent did help me find a bug in my formalisation once. It gave up on proving a lemma and proposed adjusting the lemma statement so that it could complete the proof. I checked the lemma and realised that the statement was indeed incorrect.
Problems Regarding Recursion and Induction
The sort of PL-related formalisation I was doing required a lot of recursive/inductive definitions. Now a basic requirement for the theorem prover to be sound is that all such definitions must be well-founded: All functions must provably terminate, and constructors for inductive types are also subject to syntactic restrictions (e.g., positivity).
Termination is undecidable, however, so the best Lean can do is apply sound but incomplete heuristics that cover common cases, e.g., where some argument to the function decreases on every recursive call. For cases not covered, the user needs to provide a termination proof explicitly.
I encountered this in a couple of cases where I had mutual recursion involving multiple functions.
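As a toy illustration (not my actual functions), the measure can be spelled out explicitly with termination_by when the heuristics fall short:

```lean
-- Mutually recursive functions. Lean can often find the decreasing
-- measure itself, but termination_by lets us state one explicitly
-- (and decreasing_by would let us prove the resulting goals by hand).
mutual
  def isEven : Nat → Bool
    | 0 => true
    | n + 1 => isOdd n
  termination_by n => n

  def isOdd : Nat → Bool
    | 0 => false
    | n + 1 => isEven n
  termination_by n => n
end
```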
As for induction, under the hood it boils down to applying an induction principle (or recursor), which states how a proof that a predicate holds for every value of a type can be constructed recursively. Although Lean automatically generates an induction principle for each inductive definition, I encountered cases where the generated principle was insufficient for my proof. Specifically, these cases involve nested induction. For example, a constructor foo for a type term might take a List term as its argument: foo : List term -> term. I want to use an induction principle like

forall P, ... -> (forall l, (forall t in l, P t) -> P (foo l)) -> forall t, P t

so that I have forall t in l, P t as the induction hypothesis in the foo case, but the automatically generated induction principle is something like

forall P, ... -> (forall l, P (foo l)) -> forall t, P t
The workaround is basically to define the induction principle I want by hand, which is some extra work. In that case I also need to prove termination, since the principle is implemented as a recursive function that Lean cannot automatically verify to terminate.
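In Lean syntax, the situation looks roughly like this (Tm and foo are illustrative names, and the proof of the custom principle is elided):

```lean
-- A nested inductive: the constructor takes a list of terms.
inductive Tm where
  | foo : List Tm → Tm

-- The induction principle I actually want: the foo case receives an
-- induction hypothesis for every element of the list. Proving it
-- means writing a recursive function whose termination Lean cannot
-- verify automatically; the body is elided here.
theorem Tm.nestedInd (P : Tm → Prop)
    (hfoo : ∀ l : List Tm, (∀ t ∈ l, P t) → P (.foo l)) :
    ∀ t, P t := by
  sorry
```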
Wishlist
Three things would make my life easier:
- Nicer rendering of the proof context. This connects to my discussion above of the difficulty of reading the proof context. Right now it is pure monospace text with some Unicode characters for mathematical symbols. I believe it would be easier to read if we could define a rendering process that maps Lean terms to formulae in LaTeX or Typst. Note that this is purely a frontend feature, so it is an entirely one-way mapping; we do not need to worry about defining a grammar for parsing these LaTeX or Typst formulae.
- A way to disable continuous checking after each edit (similar to the default mode in Rocq). Currently, if I make an edit in the middle of a file, Lean rechecks everything after the edited location. When the file I am working on is large (more than a few hundred lines), this creates a lot of unnecessary computation on my machine and hurts the user experience.
- Tighter integration of the “interactive” feature with AI tools. Interactivity is indispensable for human users, and from my experience so far, AI can also benefit from it: reacting to much faster feedback instead of rerunning lake build only after completing a whole proof. I have yet to explore the dedicated AI tools for theorem proving, so this may already exist.