comparing formal proofs of aime_1984_p1
(I work at Anthropic, but far from any efforts related to math reasoning or formalization; opinions are my own)
This is an incredibly specific rabbit-hole I fell into in 2022 but never wrote up, which I was reminded of while thinking about LLMs' math abilities recently. I don't think it provides any direct signal about how good they are today, but there might be something to take away about how we frame the question.
In February 2022, Polu et al. trained a GPT-3βlike model to solve some formal math problems. They commented that the model's proof for the problem aime_1984_p1
was "rather short" (p. 25) compared to a human formalization. The problem:
(AIME 1984 P1) Find the value of if is an arithmetic progression with common difference 1, and .
The formalized statement (which provides the answer and just asks to prove it) and the model's proof:
theorem aime_1984_p1
(u : β β β)
(hβ : β n, u (n + 1) = u n + 1)
(hβ : β k in finset.range 98, u k.succ = 137) :
β k in finset.range 49, u (2 * k.succ) = 93 :=
begin
revert_all,
simp [finset.sum_range_succ],
ring,
simp [pow_succ, mul_comm, add_left_comm, add_assoc],
simp [two_mul],
rintro p,
revert p,
simp [add_comm],
intros p hp,
simp [hp], ring_nf,
intros,
nlinarith
end
For a human, the trick to solving this problem is noticing that the desired expression is closely related to .
- Every term in is one after the corresponding term in , and therefore one more than it. Thus, .
- But also, and together cover every term in the known sum exactly once, so .
Solving the two equations gives the answer. This approach roughly describes the linked human formalization as well.
Why is the machine's proof shorter? The formalization writes the problem condition as and the goal as , and finset.sum_range_succ
is roughly the following identity:
The step simp [finset.sum_range_succ]
repeatedly simplifies the problem by applying this identity.
, so gets simplified to .
But , so that gets simplified to . And so on. The same simplification applies repeatedly to the goal expression. So, this step expands both sums into explicitly written out lists of 98 and 49 terms, respectively.
The next important step is simp [hp]
, which here repeatedly simplifies the problem by applying the property that
So is simplified to , which is simplified to , which is simplified to , and so on, until we get .1 This happens to every other term too. After all that simplification, the condition and goal are both just linear in , and the proof can be finished with standard Lean steps.
To summarize, I would describe the model's solution informally as:
By repeatedly applying and a few hundred times each, we can simplify the given condition to and the expression we're computing to . Solving the equation for and plugging it into the expression gives the answer.
To be clear, a human could solve the problem in a superficially similar way by writing
then evaluating by recognizing that it's a triangular number, so it equals . This is not what the proof does. Even though it has few steps, when those steps are executed during verification, each is turned into as a separate computation, and then the constants are added together one by one to produce 4851, just as with, say, the Python expression sum(i + 1 for i in range(97))
.
Given this strategy, the model's proof is actually rather circuitous. A briefer formalization (that works in the same version of Lean):
theorem aime_1984_p1
(u : β β β)
(hβ : β n, u (n + 1) = u n + 1)
(hβ : β k in finset.range 98, u k.succ = 137) :
β k in finset.range 49, u (2 * k.succ) = 93 :=
begin
revert hβ,
simp [finset.sum_range_succ, two_mul, hβ],
intros,
nlinarith
end
appendix: technical notes on reproduction
Lean 3 has been deprecated (in favor of Lean 4) for a year or two, so verifying these proofs today is challenging. Based on the lean gym leanpkg.toml I directly installed Lean 3.39.1. (When I tried the latest Lean 3 version, it complained that two_mul
had nothing to do.) I also got leanproject
via pipx install mathlibtools
, which still worked. Prepending these imports (subsetted from minif2f_import.lean
) to the proofs produced complete Lean 3 files that verified (by running lean filename.lean
).
import algebra.big_operators.basic
import data.finset.basic
import data.rat.basic
import tactic
open_locale big_operators
open_locale nat
open_locale rat
I wasn't able to set up a working editor that interacted with Lean 3 to inspect the intermediate proof states, so I just printed them by inserting the tactics trace
and trace_state
, like so:
begin
revert_all,
trace "STATE 1", trace_state,
simp [finset.sum_range_succ],
trace "STATE 2", trace_state,
ring,
...
end
Separately, prepending the additional line
set_option trace.simplify.rewrite true
(very verbose!) shows the specific actions simp
performs.
appendix: bear blog sum
\displaystyle
doesn't appear to work, but \sum\limits_a^b
works.
The formalization sneakily introduces an term to the definition of because it's easier to start at 0 in Lean (and most other proof systems). The sums start at for the same reason. But the problem is otherwise the same.↩