VeriBench

Why This Benchmark Exists

Most coding benchmarks stop at passing tests. HumanEval needed HumanEval+, and SWE-Bench needed SWE-Bench Verified because finite test suites keep missing behaviorally important bugs. A model can pass every available test and still be wrong in ways the benchmark never observed.

Say you're formalizing a simple Python program. An LLM might hand you this Lean artifact:

namespace MyAdd

def myAdd : Nat → Nat → Nat := Nat.add

/-- A couple of tests pass, so this looks promising. -/
example : myAdd 1 2 = 3 := by native_decide
example : myAdd 0 0 = 0 := by native_decide

/-- But the actual semantic claims are left unproved. -/
theorem commutativeThm (a b : Nat) :
  myAdd a b = myAdd b a := by sorry

theorem correctnessThm (a b : Nat) :
  myAdd a b = a + b := by sorry

end MyAdd

The file compiles. A few tests pass. It has named theorems. But the key claims are sorry — the actual reasoning was never done. VeriBench is built to tell apart artifacts that look formal from artifacts that contain meaningful, checkable verification content.

Our Approach

VeriBench treats autoformalization as an end-to-end verification problem, not just a code-writing problem. Starting from Python, an agent must produce a Lean 4 artifact with code, tests, specifications, and theorem statements that say something real about the program.

This matters because several very different failure modes can all look similar from far away. A model can write Lean that parses. It can even write Lean that compiles. But that still does not mean it captured the right specification, or that its theorems say enough to be useful.

So our evaluation looks beyond text matching. We check whether the file compiles, whether it contains nontrivial formal content, and whether its theorems cover the gold reference specification. When exact comparison is too brittle, we use a calibrated semantic judge to estimate theorem coverage. That gives us a practical way to measure the gap between "looks formal" and "actually says the right thing."

Dataset Breakdown

VeriBench is not a tiny toy set. The full release contains 884 tasks. A five-split canonical core spans introductory programs, classical algorithms, HumanEval-style functions, Python standard-library code, and a large security-focused subset.

In the repo, these show up as familiar splits like easy_set, cs_set, humaneval_set, real_code, security_6858, and security_python. Each Python file is paired with a matching Lean target in a mirrored directory tree, so the benchmark is easy to inspect and extend.

Explore The Dataset

Click a task to see real Python source, what makes its formal specification non-trivial, and why a shallow theorem would miss the point.

Intro / explicit specs Representative slice

`easy_set`: clean benchmark primitives

This split makes it easy to see whether a model can track a straightforward precondition, executable behavior, and a crisp theorem target all at once.

# easy_set/1_MyAdd.py
def pre(a: int, b: int) -> bool:
    return a >= 0 and b >= 0

def prog(a: int, b: int) -> int:
    return a + b

Good for seeing if an autoformalizer can move from toy Python into simple but meaningful specifications.

On top of the canonical core, the release adds a 282-task high-assurance-inspired expansion across 14 domains, including cryptography, aerospace, medical devices, and compilers. The result is a benchmark that mixes small teaching examples with more realistic and security-sensitive code, which gives a clearer view of where current autoformalization agents hold up and where they fail.

What a Gold Lean Artifact Looks Like

Every VeriBench task pairs a Python source file with a hand-curated Lean 4 gold artifact that follows a fixed 9-part schema. Here is a condensed view for the binary search task from the CS set.

-- 1. Implementation (fuelled recursion for totality)
def binarySearch (arr : List Nat) (target : Nat) : Option Nat := ...

-- 2. Unit tests (positive + negative + edge cases)
example : binarySearch [1,2,3,4,5] 3 = some 2 := by native_decide
example : binarySearch [1,2,3,4,5] 6 = none  := by native_decide

-- 3. Pre-condition (non-trivial: sorted input required)
def Pre (arr : List Nat) : Prop :=
  ∀ i j xi xj, i ≤ j → arr[i]? = some xi → arr[j]? = some xj → xi ≤ xj

-- 4. Property theorems (soundness + completeness, proved)
theorem soundness_thm (arr : List Nat) (target idx : Nat) :
    binarySearch arr target = some idx → arr[idx]? = some target := ...

theorem completeness_thm (arr : List Nat) (target : Nat) :
    Pre arr → (∃ i, arr[i]? = some target) →
    ∃ idx, binarySearch arr target = some idx := ...

-- 5. Correctness theorem (Pre → Post, bundles all properties)
theorem correctness_thm (arr : List Nat) (target : Nat) :
    Pre arr → soundness_prop arr target ∧ completeness_prop arr target := ...

-- 6. Imperative implementation + equivalence theorem
def binarySearchImp (arr : List Nat) (target : Nat) : Option Nat :=
  Id.run do ...

theorem equivalence_thm (arr : List Nat) (target : Nat) :
    binarySearch arr target = binarySearchImp arr target := by ...

The schema separates executable behavior (the implementation and tests), admissible inputs (the pre-condition), atomic proof obligations (the property theorems), and bundled correctness claims. This structure is what makes VeriBench's evaluation partial-credit: even if an agent cannot prove everything, we can measure how much of the schema it recovered.

Gold artifacts are not raw model output — every theorem is written by a human curator and checked to close without sorry in the benchmark's Lean 4.22.0 + Mathlib toolchain. The remaining sorry rate across the gold set is tracked as D₂ ≈ 0.57 and reported alongside agent scores so benchmark immaturity does not inflate agent-skill claims.

Paired Safe and Unsafe Security Tasks

The security_6858 split contains 227 paired tasks, each consisting of a vulnerable and a safe Python function. The formal specification must not just describe what the function does — it must capture the safety condition that distinguishes the two versions.

# VULNERABLE — no bounds check (0_unsafeCopy.py)
def unsafe_copy(dst: bytearray, src: bytearray) -> None:
    for i, b in enumerate(src):
        dst[i] = b  # raises IndexError if len(src) > len(dst)

# SAFE — guards against overflow (0_unsafeCopy_safe.py)
def safe_copy(dst: bytearray, src: bytearray) -> None:
    if len(src) > len(dst):
        raise ValueError("source longer than destination")
    dst[:len(src)] = src

A shallow formalization of safe_copy that only restates what the function returns will score the same as a formalization of unsafe_copy. The theorem that actually discriminates is the safety property:

-- The safety theorem that distinguishes safe from unsafe:
theorem no_overflow (dst src : List Nat) :
    safeCopy dst src ≠ none → src.length ≤ dst.length := by ...

-- Wrong implementations that pass all unit tests but fail this theorem:
-- fun dst src => some src   (never raises, always "succeeds")
-- fun dst src => if dst.isEmpty then none else some src

This is why VeriBench measures theorem coverage against a gold reference: the gold file contains the safety property, and any candidate that omits it scores lower, even if it compiles cleanly and passes the unit tests.

How We Built the Dataset

Every task went through a three-stage pipeline: an AI first draft (o3 generates the Lean artifact following the 9-section schema), a human curator review against a written rubric, and Lean kernel compilation. A human must sign off before any task enters the dataset — automation can flag issues, but the final inclusion decision is always human.

The 460-task security split was the most complex to build. The 227 paired safe/unsafe tasks came from MIT 6.858 Computer Systems Security lecture content spanning 20+ topics — buffer overflows, symbolic execution, web security, Spectre, and more. For each lecture, the process identified 2–4 formalizable security concepts and generated both a vulnerable and a safe version, verifying that both model the same underlying mechanism, not just superficially related bugs.

The hardest curation judgment is theorem quality. The most common failure is duplication — two properties with different names but identical statements — and vacuity — theorems that hold for any implementation, not just the correct one. The rubric requires at least three semantically distinct property shapes per task, and this single requirement most often sends a draft back for revision.

How We Score a Solution

The core question in VeriBench is simple: did the agent produce a Lean artifact that both works and says the right things?

In the paper, this shows up as the Smooth Conjunctive Score for Code Verification. The friendly version is: we do not want to give full credit for a file that only clears one hurdle. A strong solution needs to survive several checks at once.

Step 1

Does the Lean file compile?

First, the generated artifact has to typecheck. If it does not compile, it cannot be a usable verification artifact.

Step 2

Does it contain real formal content?

We look for actual tests, specifications, and theorem statements, not just a thin wrapper around the implementation or a polished file full of placeholders.

Step 3

Do the theorems cover the gold reference?

The biggest question is whether the agent's claims match the human-written target. This is where many impressive-looking outputs still fall short.

Intuition: A VeriBench score should rise when a solution both compiles and covers the right theorem obligations. If one part is weak, the final score should stay low instead of hiding the failure.

Human Calibration

Human raters are given the same evaluation guidelines as the LLM judge and asked to score candidate Lean files against gold reference Lean files. We then compare those human scores with the judge's scores to measure how well the judge tracks human judgment.

Theorem coverage (TC1) is estimated by an LLM judge calibrated against five human raters. The held-out Pearson correlation is about 0.70 — strong enough to make the judge a practical proxy for semantic coverage at benchmark scale. The judge is fit on human evaluation data and then calibrated to those scores so it better reflects variation in human judgment.

The score is trying to reward the thing we actually care about: verified understanding, not just formal-looking syntax.

Results So Far

The short version is that current frontier agents are better at producing Lean files that compile than Lean files that capture the right theorem obligations.

Agent	Compiles (IC1)	No-sorry theorem rate (IC2)	Theorem coverage (TC1)	Agent skill (S_skill)
Codex (GPT-5.4)	1.000	0.237	0.102	0.289
Claude Code (Sonnet 4.6)	1.000	0.114	0.098	0.224
Baseline (Sonnet 4.6)	0.310	0.282	0.105	0.209
Gemini 3-flash preview	0.206	0.043	0.156	0.111
Leanstral v2	0.292	0.065	0.058	0.103

On the directly comparable benchmark runs in the paper, Codex leads with an agent-skill score of about 0.29, Claude Code follows at about 0.22, and Leanstral v2 lands at about 0.10. Those are not disaster-level numbers, but they are far from what we would want from an end-to-end formal verification system.

The more interesting pattern is deeper than the ranking itself. Codex and Claude Code can often produce files that typecheck cleanly. In other words, they can generate Lean that looks structurally coherent. But when we ask whether the generated theorem statements actually cover the gold reference, the scores remain strikingly low.

In the paper, overall theorem coverage stays at or below about 0.156 across the evaluated agents, and for the strongest headline agents it stays at or below about 0.105. That means the real bottleneck is not just proof search. It is getting the model to state the right claims in the first place.

Just as importantly, the gold-side quality term stays roughly stable across agents, at about 0.72 to 0.74. That means the ordering is not being driven by one model getting an easier slice of the benchmark. The benchmark can still be imperfect, but the main headline is agent-side: current systems are struggling to formulate theorem sets that really match the intended specification.

The numbers reveal a consistent pattern. All three failure modes below produce Lean files that look plausible — VeriBench's conjunctive score penalizes each differently:

Failure A

Compiles, sorry everywhere

IC1 = 1, IC2 ≈ 0 — the file typechecks but every theorem is a placeholder. Codex and Claude Code both hit this.

Failure B

No sorry, but vacuous

IC1 = 1, IC2 = 1, TC1 ≈ 0 — theorems close, but say nothing about the program's actual behavior.

What we want

Meaningful theorem, proved

IC1 = 1, IC2 = 1, TC1 high — compiles, closes, and captures the right specification. Rare in practice.

Conclusion

The central finding is a theorem-coverage gap. Frontier agents can compile Lean. They can write tests and produce files that look polished and formal. But they have not yet recovered the key semantic obligations that a human verifier would write down. That gap matters for scalable oversight: as systems scale, humans cannot manually validate every generated artifact, so we need agents that can be trusted to formalize the right properties — not just produce compilable Lean.

VeriBench changes the target of code-verification evaluation. It shifts the conversation away from isolated proof search or test passing and toward end-to-end conjunctive grounding: can an agent start from real source code, produce a structured Lean artifact, and state the right theorem obligations in a form that a checker can actually use?

The next step is not just to build stronger proof search. It is to build agents that formulate better specifications, state better theorems, and improve under human-calibrated oversight at scale. If VeriBench is useful, it will be because it gives that next generation of verifiable coding agents a concrete target to beat.

The big lesson is simple: passing the Lean compiler is not the end of the story. In VeriBench, the hard part is often stating the right theorems, not just getting the file to typecheck. That is the core reframing of the benchmark: code-verification evaluation should move from isolated proof search to end-to-end conjunctive grounding, where theorem formulation is treated as a first-class bottleneck.

VeriBench: An End-to-End Formal Verification Benchmark for AI Coding Agents

Interactive Example

Candidate A Compiles, but shallow

Candidate B Stronger specification

What gets measured

Why This Benchmark Exists

Our Approach

Dataset Breakdown

Explore The Dataset

`easy_set`: clean benchmark primitives

What a Gold Lean Artifact Looks Like

Paired Safe and Unsafe Security Tasks

How We Built the Dataset

How We Score a Solution

Does the Lean file compile?

Does it contain real formal content?

Do the theorems cover the gold reference?

Human Calibration

Results So Far

Compiles, sorry everywhere

No sorry, but vacuous

Meaningful theorem, proved

Conclusion

Citation

Why This Benchmark Exists

Our Approach

Dataset Breakdown

Explore The Dataset

easy_set: clean benchmark primitives

What a Gold Lean Artifact Looks Like

Paired Safe and Unsafe Security Tasks

How We Built the Dataset

How We Score a Solution

Does the Lean file compile?

Does it contain real formal content?

Do the theorems cover the gold reference?

Human Calibration

Results So Far

Compiles, sorry everywhere

No sorry, but vacuous

Meaningful theorem, proved

Conclusion

Citation

`easy_set`: clean benchmark primitives