
Testing the Limits of AI in Mathematical Research

Feb 9, 2026

A new "first proof" experiment seeks an independent measure of machine creativity.
Martin Hairer, a mathematician at the Swiss Federal Institute of Technology in Lausanne (EPFL), splits his time between there and Imperial College London (source: Aurelien Bergot for The New York Times).


A high school student recently asked Fields Medalist Martin Hairer whether artificial intelligence might strip mathematics of its magic. If machines become better problem solvers than humans, does the discipline lose its essence? Hairer responded that mathematics remains “safe.” While large language models excel at solving structured or synthetic problems, he has yet to see evidence of genuinely new mathematical ideas emerging from them, he told The New York Times.

That skepticism helped inspire “First Proof,” a new paper Hairer coauthored with Mohammed Abouzaid of Stanford University, Lauren Williams of Harvard University, and Tamara Kolda of MathSci.ai. The project aims to create a more meaningful benchmark for evaluating AI’s research-level mathematical ability. Instead of relying on contrived or company-provided test problems, the team collected authentic questions drawn from their own unpublished research. Each contributor supplied one problem along with its solution; the solutions have been encrypted and will be revealed later to check each model’s accuracy.
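The article does not say which encryption scheme the team used. Purely as an illustration, a simple commit-reveal protocol would achieve the stated goal: contributors publish a tamper-evident fingerprint of each solution now and disclose the plaintext later, so nobody can quietly rewrite a solution after seeing the AI's attempts. All names and the sample proof text below are hypothetical.

```python
import hashlib
import secrets

def commit(solution: str) -> tuple[str, str]:
    """Publish the digest now; keep the solution and nonce secret until reveal."""
    nonce = secrets.token_hex(16)  # random salt prevents guessing short solutions
    digest = hashlib.sha256((nonce + solution).encode()).hexdigest()
    return digest, nonce

def reveal_ok(solution: str, nonce: str, digest: str) -> bool:
    """At reveal time, anyone can recompute the hash and verify the commitment."""
    return hashlib.sha256((nonce + solution).encode()).hexdigest() == digest

# Hypothetical contributor workflow:
proof = "Lemma 1 follows from a compactness argument; see Section 3."
digest, nonce = commit(proof)           # digest is made public immediately
assert reveal_ok(proof, nonce, digest)              # honest reveal verifies
assert not reveal_ok(proof + " (edited)", nonce, digest)  # tampering is caught
```

This is a sketch of the general idea, not the paper's actual mechanism; real schemes might instead encrypt the full solution text so it can be decrypted, rather than merely verified, at reveal time.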

The novelty lies in grounding evaluation in real mathematical practice. Research, the authors emphasize, involves more than solving tidy exercises. It includes formulating the right big questions, developing frameworks, and then proving smaller, tractable results. “First Proof” focuses on the third stage: producing correct proofs to well-defined problems. That component is measurable and avoids the difficulty of judging AI’s ability to originate conceptual breakthroughs.

Preliminary tests using leading AI systems revealed limitations. Models sometimes produced partial arguments, looped endlessly, or reasoned confidently but incorrectly. They were adept at stringing together familiar techniques but often glossed over critical steps, resembling an overconfident graduate student. Still, the researchers acknowledged moments of technical fluency.

The broader goal is not to dismiss AI but to temper hype. The team hopes to prevent exaggerated claims that mathematics is “solved” while encouraging nuanced understanding. Future rounds will expand community participation, allowing ideas to “ferment” before structured benchmarking resumes. Ultimately, the project seeks clarity about AI’s true capabilities, and its limits, in advancing mathematical thought.