Verification as the Engine of Compounding Intelligence

Original Title: 🔬Scaling Past Informal AI - Carina Hong, Axiom Math

The race for verified AI isn’t about catching errors--it’s about compounding intelligence in a way that redefines what machines can achieve. Carina Hong of Axiom Math argues that formal verification isn’t a compliance chore but the missing engine for scaling brilliance, turning isolated insights into reusable, trustworthy knowledge. This reframing reveals a hidden consequence: teams betting on informal reasoning may hit performance ceilings while those embracing verified generation unlock recursive improvement. The real advantage isn’t accuracy--it’s acceleration over time. Anyone building AI systems, from enterprise architects to frontier researchers, gains by understanding this shift: verification isn’t the bottleneck, it’s the bridge. This conversation argues that the systems that win won’t be those that scale fastest in the short term, but those designed to compound correctness across interactions, models, and time.

Why the Obvious Fix Makes Things Worse

Most AI labs treat hallucination as a surface-level problem--something to be smoothed over with better prompting, larger models, or human-in-the-loop judging. The dominant response? Scale data and compute, then polish outputs through statistical refinement. But this approach ignores a deeper dynamic: the more you rely on informal signals, the harder it becomes to verify progress at the frontier. As Carina Hong points out, “I think [verifiability] is probably the hardest problem right now, because as the models get better, it can be harder and harder to find the faults on the system.” This creates a feedback loop where better models require more sophisticated evaluation, which in turn demands even better models--except now, the evaluation burden grows faster than the capabilities.

This is where the informal approach runs into a wall. Reinforcement Learning from Human Feedback (RLHF), or policy-optimization methods like GRPO trained against approximate reward signals, work well up to a point. But they break down when the task exceeds human evaluators’ bandwidth--like checking novel mathematical proofs or verifying distributed system invariants. At that scale, humans can’t reliably distinguish correct from plausible. You end up optimizing for convincingness, not correctness. And since rewards are based on approximation, the model learns to game the signal, not the truth.

Axiom’s alternative--verified generation--sidesteps this by using formal systems like Lean as both training signal and inference output. Lean isn’t just a type checker; it’s a trustworthy substrate. When a model generates a Lean proof, the proof checker either accepts it or rejects it. No ambiguity. No expert judgment call. This turns verification from a cost into a multiplier. Failed rollouts are rejected by the checker, and Axiom can retain every valid proof as high-confidence training data. Over time, this compounds: each verified proof becomes scaffolding for the next, creating a flywheel of ever-deeper reasoning.
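To make "each verified proof becomes scaffolding for the next" concrete, here is a minimal Lean 4 sketch. It is an illustration only, not taken from Axiom's system: the theorem names `my_add_comm` and `swap_sum` are hypothetical, and the only library fact assumed is `Nat.add_comm` from Lean's core. The first theorem is accepted by the checker, and the second reuses it as a building block.

```lean
-- A small fact, checked mechanically by Lean (name is hypothetical).
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A later proof reuses the verified fact instead of re-deriving it.
theorem swap_sum (a b c : Nat) : (a + b) + c = c + (b + a) := by
  rw [my_add_comm a b, my_add_comm (b + a) c]
```

If either proof had a gap, Lean would reject the file; there is no partial credit for a plausible-looking argument.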

"Verification to me is about scaling brilliance, compounding brilliance."

-- Carina Hong

The phrase “compounding brilliance” isn’t marketing fluff--it’s a systems-level insight. Ramanujan’s genius wasn’t diminished by rigorous proof; it was amplified. By forcing intuition into rigorous form, Hardy unlocked new dimensions in Ramanujan’s thinking. Similarly, formal verification forces AI models to expose their reasoning, making it inspectable, composable, and open to iteration. The brilliance isn’t just preserved--it’s made collaborative. A proof in Lean isn’t just an answer; it’s a reusable module that other models, or humans, can build on.

The Hidden Cost of Fast Solutions

Frontier labs like OpenAI and Anthropic have focused on coding as a gateway to reasoning. And they’ve had success: Claude Code now powers large swaths of enterprise development. But this path assumes that coding proficiency transfers cleanly to deeper domains like math or scientific discovery. Axiom’s Putnam result challenges that assumption. They scored 120/120--full marks--while the best informal model, DeepSeek, scored 103. Humans, even top students, maxed out at 110.

This gap isn’t due to raw intelligence. It’s due to signal quality. Coding tasks have clear pass/fail metrics--does the code compile? Does it pass tests? But math olympiad problems require deeper logical coherence. An informal model might generate a plausible chain of reasoning that fails at a subtle inference step. Without a formal verifier, that error goes undetected--and worse, gets reinforced.

The cost of fast, informal solutions is untraceable degradation. Each rollout introduces noise. Over time, that noise accumulates as technical debt in the model’s reasoning fabric. You can scale compute to push through, but eventually, you hit diminishing returns. This is why Axiom, with far fewer resources, can outperform giants: they’re not just training smarter--they’re training cleaner.

Their result on the VeriNa benchmark, 99% (187/189) on ProofGen versus 4.9% for OpenAI, isn’t just a win on paper. It reveals a divergence in system design. Axiom isn’t just generating proofs; they’re building a verification-native pipeline. Every component, from data curation to reward shaping, assumes that correctness can be checked mechanically. This changes the economics of training: mistakes are caught early, verified outputs can be retained indefinitely, and the training signal stays trustworthy.
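The "catch mistakes early, retain verified outputs" idea can be sketched as a filter over rollouts. This is a toy under stated assumptions, not Axiom's pipeline: the task (integer square roots), the `noisy_proposer` standing in for a model, and the function names are all hypothetical. The point is only that an exact checker decides what enters the corpus.

```python
import random


def check(n, r):
    """Exact verifier: True only if r is the integer square root of n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


def noisy_proposer(n, rng):
    """Hypothetical stand-in for a model rollout: often right, often off by one or two."""
    return max(0, int(n ** 0.5) + rng.choice([-2, -1, 0, 0, 1]))


def collect_verified(problems, rollouts_per_problem, rng):
    """Keep only (problem, answer) pairs that the verifier accepts."""
    corpus = []
    for n in problems:
        for _ in range(rollouts_per_problem):
            r = noisy_proposer(n, rng)
            if check(n, r):  # rejected rollouts never reach the corpus
                corpus.append((n, r))
                break  # one verified answer per problem is enough here
    return corpus


rng = random.Random(0)
corpus = collect_verified(range(1, 50), rollouts_per_problem=8, rng=rng)
print(len(corpus), "verified pairs retained")
```

Because every retained pair passed an exact check, the corpus contains no label noise, whatever the error rate of the proposer.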

But there’s a catch: formal generation is expensive. Lean proofs are hard to produce. The system must decompose problems recursively, backtrack from dead ends, and manage long chains of dependencies. Yet once a proof exists, verifying it is trivial. This asymmetry--expensive to generate, cheap to verify--is what enables scalability. It means you can outsource verification to lightweight checkers, freeing the model to explore deeper reasoning paths.
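The generate/verify asymmetry is not specific to Lean. As a rough analogy (not a model of proof search itself), the sketch below uses subset sum: finding a certificate takes a brute-force search over subsets, while checking a proposed certificate is a single pass. The function names and the example numbers are illustrative assumptions.

```python
from itertools import combinations


def verify(numbers, target, certificate):
    """Cheap check: certificate is a list of distinct valid indices summing to target."""
    valid = all(0 <= i < len(numbers) for i in certificate)
    distinct = len(set(certificate)) == len(certificate)
    return valid and distinct and sum(numbers[i] for i in certificate) == target


def generate(numbers, target):
    """Expensive search: tries subsets in order of size (exponential in the worst case)."""
    indices = range(len(numbers))
    for size in range(len(numbers) + 1):
        for candidate in combinations(indices, size):
            if sum(numbers[i] for i in candidate) == target:
                return list(candidate)
    return None  # no subset sums to target


numbers = [3, 34, 4, 12, 5, 2]
certificate = generate(numbers, 9)
print(certificate, verify(numbers, 9, certificate))
```

The checker never needs to know how the certificate was found, which is why verification can be handed to a small, trusted component while the generator is free to be large, slow, and fallible.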

How the System Routes Around Your Solution

Even with perfect verification, there’s a lurking fragility: the specification itself. As Carina notes, “Anything that can be specified can be proven. Humans are bad at specifying everything we want.” This means that a formally correct program might solve the wrong problem. You can verify that a sorting algorithm meets its specification, but if the specification describes the wrong data or a weaker property than you intended, the verification tells you little.
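A small sketch shows how a program can satisfy a specification and still solve the wrong problem. The names `weak_spec`, `full_spec`, and `lazy_sort` are hypothetical, and runtime checks stand in here for what would be formal statements in a proof assistant. The weak specification only says "the output is ordered" and forgets to say "the output contains the same elements as the input".

```python
from collections import Counter


def weak_spec(inp, out):
    """Under-specified: only requires the output to be in non-decreasing order."""
    return all(out[i] <= out[i + 1] for i in range(len(out) - 1))


def full_spec(inp, out):
    """Intended property: ordered AND a rearrangement of the input."""
    return weak_spec(inp, out) and Counter(inp) == Counter(out)


def lazy_sort(xs):
    """Meets weak_spec for every input, yet is useless as a sort."""
    return []


data = [3, 1, 2]
print(weak_spec(data, lazy_sort(data)))  # the weak spec is satisfied
print(full_spec(data, lazy_sort(data)))  # the intended spec is not
print(full_spec(data, sorted(data)))     # a real sort satisfies both
```

A verifier can only certify what was written down, so the missing permutation clause is a human error that no amount of proof checking would catch.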

This isn’t a flaw in formal methods--it’s a feature of complex systems. The real bottleneck isn’t proving correctness; it’s defining the right thing to prove. This is where Axiom’s dual investment in mathematical discovery becomes strategic. Discovery tools help mathematicians form intuitions, generate conjectures, and explore constructions--before formalization. They’re not trying to replace human insight; they’re trying to amplify it.

"Proof is not enough for math. In fact, before you start proving something, you don’t know where you want to start."

-- Carina Hong

This insight reframes the role of AI in science. Most systems focus on answer generation. Axiom focuses on question shaping. By open-sourcing their discovery toolkit, they’re inviting a broader community to participate in the pre-formal phase of reasoning. This is critical because formal verification only works when the problem is well-specified. If the AI doesn’t know what to prove, no amount of correctness checking will help.

The system responds by shifting the bottleneck. Instead of verification, the new constraint becomes taste--the human ability to identify which problems are worth solving. As models generate more proofs, attention becomes the scarce resource. The future belongs not to those who can generate the most proofs, but to those who can curate the most meaningful ones.

Where Immediate Pain Creates Lasting Moats

Axiom’s strategy is a masterclass in embracing short-term discomfort for long-term advantage. Training on Lean proofs is slow. The data is sparse. The tooling is brittle. Most organizations would abandon it. But Axiom’s team--staffed by elite mathematicians and formal methods experts--treats this as a filter, not a flaw.

Their $200M Series A wasn’t a vote of confidence in current performance--it was a bet on compounding correctness. While others chase quarterly benchmarks, Axiom is building a corpus of verified knowledge that grows in value over time. Each proof they generate becomes a node in a growing knowledge graph. New models don’t start from scratch--they inherit a foundation of trustable reasoning.

This creates a moat that’s hard to replicate. You can’t just copy their data. You need the culture to sustain it--the tolerance for delayed payoff, the patience to formalize, the willingness to admit when a proof fails. These aren’t technical choices; they’re organizational ones. And that’s why Axiom says, unapologetically, “We do not believe there is any other possible future” for AGI.


  • Invest in verification-native training loops now -- Over the next 6 months, shift from statistical rewards to formal verification signals where possible. This creates higher sample efficiency and reduces error accumulation.
  • Build discovery tools alongside provers -- Within the next quarter, prototype a system that generates conjectures or constructions before formal proof. This closes the loop between intuition and verification.
  • Treat specifications as first-class artifacts -- Start now by auditing how problems are defined before verification. This prevents costly misalignment between intent and correctness.
  • Open-source verification tooling to attract ecosystem contributions -- Axiom’s release of AXLE shows this works: within weeks, users combined it with Claude to formalize new proofs. This pays off in 12--18 months through community-driven innovation.
  • Prioritize long-horizon problems over short-term benchmarks -- Avoid the trap of optimizing for Putnam-style exams alone. Instead, focus on recursive improvement: can today’s proofs help tomorrow’s models reason deeper?
  • Embrace discomfort in formalization -- The teams that wait for perfect tooling will lose. Start with brittle systems; the act of formalizing is the training signal.
  • Recognize that attention is the new bottleneck -- As proof generation scales, invest in human curation and taste. This creates advantage in 18--24 months as the volume of AI-generated proofs overwhelms consumption capacity.

---
This content is a personally curated review and synopsis derived from the original podcast episode.