Blog: Expanded Universe

comparing formal proofs of aime_1984_p1

(I work at Anthropic, but far from any efforts related to math reasoning or formalization; opinions are my own)

This is an incredibly specific rabbit-hole I fell into in 2022 but never wrote up, which I was reminded of while thinking about LLMs' math abilities recently. I don't think it provides any direct signal about how good they are today, but there might be something to take away about how we frame the question.


In February 2022, Polu et al. trained a GPT-3–like model to solve some formal math problems. They commented that the model's proof for the problem aime_1984_p1 was "rather short" (p. 25) compared to a human formalization. The problem:

(AIME 1984 P1) Find the value of a2+a4+a6+a8++a98 if a1,a2,a3 is an arithmetic progression with common difference 1, and a1+a2+a3++a98=137.

The formalized statement (which provides the answer and just asks to prove it) and the model's proof:

theorem aime_1984_p1
  (u : β„• β†’ β„š)
  (hβ‚€ : βˆ€ n, u (n + 1) = u n + 1)
  (h₁ : βˆ‘ k in finset.range 98, u k.succ = 137) :
  βˆ‘ k in finset.range 49, u (2 * k.succ) = 93 :=
begin
  revert_all,
  simp [finset.sum_range_succ],
  ring,
  simp [pow_succ, mul_comm, add_left_comm, add_assoc],
  simp [two_mul],
  rintro p,
  revert p,
  simp [add_comm],
  intros p hp,
  simp [hp], ring_nf,
  intros,
  nlinarith
end

For a human, the trick to solving this problem is noticing that the desired expression S:=a2+a4+a6+a8++a98 is closely related to S:=a1+a3+a5+a7++a97.

  1. Every term in S is one after the corresponding term in S, and therefore one more than it. Thus, S=S+49.
  2. But also, S and S together cover every term in the known sum a1+a2+a3+a4++a98 exactly once, so S+S=137.

Solving the two equations gives the answer. This approach roughly describes the linked human formalization as well.

Why is the machine's proof shorter? The formalization writes the problem condition as k=097ak+1=137 and the goal as k=048a2(k+1), and finset.sum_range_succ is roughly the following identity:

k=0n+1f(k)=f(n+1)+k=0nf(k).()

The step simp [finset.sum_range_succ] repeatedly simplifies the problem by applying this identity. 97=96+1, so k=097ak+1 gets simplified to a98+k=096ak+1. But 96=95+1, so that gets simplified to a98+a97+k=196ak. And so on. The same simplification applies repeatedly to the goal expression. So, this step expands both sums into explicitly written out lists of 98 and 49 terms, respectively.

The next important step is simp [hp], which here repeatedly simplifies the problem by applying the property that

an+1=an+1.()

So a98 is simplified to a97+1, which is simplified to a96+2, which is simplified to a95+3, and so on, until we get a0+98.1 This happens to every other term too. After all that simplification, the condition and goal are both just linear in a0, and the proof can be finished with standard Lean steps.

To summarize, I would describe the model's solution informally as:

By repeatedly applying () and () a few hundred times each, we can simplify the given condition to 98a0+4851=137 and the expression we're computing to 49a0+2450. Solving the equation for a0 and plugging it into the expression gives the answer.

To be clear, a human could solve the problem in a superficially similar way by writing k=097ak+1=k=097(a0+k+1)=98a0+k=097(k+1), then evaluating k=097(k+1) by recognizing that it's a triangular number, so it equals 98·992=4851. This is not what the proof does. Even though it has few steps, when those steps are executed during verification, each ak+1 is turned into a0+k+1 as a separate computation, and then the constants are added together one by one to produce 4851, just as with, say, the Python expression sum(i + 1 for i in range(97)).

Given this strategy, the model's proof is actually rather circuitous. A briefer formalization (that works in the same version of Lean):

theorem aime_1984_p1
  (u : β„• β†’ β„š)
  (hβ‚€ : βˆ€ n, u (n + 1) = u n + 1)
  (h₁ : βˆ‘ k in finset.range 98, u k.succ = 137) :
  βˆ‘ k in finset.range 49, u (2 * k.succ) = 93 :=
begin
  revert h₁,
  simp [finset.sum_range_succ, two_mul, hβ‚€],
  intros,
  nlinarith
end

appendix: technical notes on reproduction

Lean 3 has been deprecated (in favor of Lean 4) for a year or two, so verifying these proofs today is challenging. Based on the lean gym leanpkg.toml I directly installed Lean 3.39.1. (When I tried the latest Lean 3 version, it complained that two_mul had nothing to do.) I also got leanproject via pipx install mathlibtools, which still worked. Prepending these imports (subsetted from minif2f_import.lean) to the proofs produced complete Lean 3 files that verified (by running lean filename.lean).

import algebra.big_operators.basic
import data.finset.basic
import data.rat.basic
import tactic

open_locale big_operators
open_locale nat
open_locale rat

I wasn't able to set up a working editor that interacted with Lean 3 to inspect the intermediate proof states, so I just printed them by inserting the tactics trace and trace_state, like so:

begin
  revert_all,
  trace "STATE 1", trace_state,
  simp [finset.sum_range_succ],
  trace "STATE 2", trace_state,
  ring,
  ...
end

Separately, prepending the additional line

set_option trace.simplify.rewrite true

(very verbose!) shows the specific actions simp performs.

appendix: bear blog sum

\displaystyle doesn't appear to work, but \sum\limits_a^b works.

  1. The formalization sneakily introduces an a0 term to the definition of ai because it's easier to start at 0 in Lean (and most other proof systems). The sums start at k=0 for the same reason. But the problem is otherwise the same.