
The Strange Review article challenges one of the central assumptions driving the AI-for-science movement: that strong benchmark performance automatically reflects meaningful scientific progress. The article argues that most current evaluation systems measure only the aspects of science that artificial intelligence already handles well, while overlooking the slower, more uncertain processes that produce genuine discovery.
The article begins with a widely discussed example involving Google’s AI Co-Scientist system. Researchers at Imperial College London reportedly presented the AI with an unpublished biological problem related to antibiotic resistance that had taken their team nearly a decade to investigate. Within two days, the system generated the same hypothesis the researchers had eventually validated experimentally. Impressive as this was, the article argues that public discussion exaggerated the achievement: the AI accelerated hypothesis generation, but it did not shorten the years of laboratory experimentation required to confirm the result.
This distinction forms the core of the article’s criticism. Existing AI benchmarks largely test convergent reasoning tasks involving known answers, such as literature retrieval, reproducing published research, or solving predefined problems. Benchmarks such as PaperBench and LitQA2 measure whether AI systems can reconstruct or retrieve established knowledge efficiently. According to the author, these evaluations fail to capture the most important parts of scientific work: asking original questions, designing experiments, interpreting ambiguous results, and discovering genuinely unexpected phenomena.
The article also warns that AI may unintentionally narrow scientific exploration. A Nature study cited in the article suggests that scientists who use AI publish more papers and accumulate citations faster, yet engage less with adjacent disciplines and concentrate more heavily on already popular research topics. The author describes this pattern as “lonely crowds”: scientific output increases while intellectual diversity declines.
Rather than focusing on publication counts or benchmark saturation, the article proposes alternative ways to measure scientific impact: whether AI expands the range of topics being studied, increases cross-disciplinary collaboration, and generates hypotheses that survive experimental validation. Such metrics would be slower to accumulate and harder to quantify, but more closely aligned with genuine scientific progress.
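The article does not say how a metric like topic breadth would be computed, but one plausible operationalization, sketched below purely as an illustration, is normalized Shannon entropy over the topic labels of a publication corpus. The corpora, labels, and the choice of entropy are assumptions for the sake of the example, not anything proposed in the article.

```python
import math
from collections import Counter

def topic_diversity(topic_labels: list[str]) -> float:
    """Normalized Shannon entropy of a corpus's topic distribution.

    Returns a value in [0, 1]: 0 when every paper shares one topic,
    1 when papers are spread evenly across all observed topics.
    """
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0  # a single topic carries no diversity
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by maximum possible entropy

# Hypothetical topic labels for two sets of publications.
baseline = ["genomics", "ecology", "materials", "genomics", "optics"]
ai_assisted = ["genomics", "genomics", "genomics", "ecology", "genomics"]

print(f"baseline diversity:    {topic_diversity(baseline):.2f}")   # ~0.96
print(f"AI-assisted diversity: {topic_diversity(ai_assisted):.2f}")  # ~0.72
```

On these toy corpora the AI-assisted set scores lower, mirroring the narrowing effect the cited Nature study describes; tracked over time, a falling score would signal exactly the concentration on popular topics the article warns about.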
Ultimately, the article argues that AI’s true value in science should not be judged by its ability to imitate existing knowledge, but by whether it helps humanity uncover ideas that were previously unknown, unexpected, and experimentally meaningful.