Benchmarking AGI: Why Measuring Artificial General Intelligence Is Harder than You Think

Sep 23, 2025

ARC, benchmarks, and the ongoing debate over how, when, and even if we'll know AGI when we see it.
Source: Eddie Guy.

IEEE Spectrum’s article digs into why defining and measuring Artificial General Intelligence (AGI) remains contentious, even as AI capabilities surge. As leading labs claim AGI could arrive in just a few years, the article argues we still lack agreed-upon yardsticks for when a system should count as “general intelligence.”

The article's primary example is the Abstraction and Reasoning Corpus (ARC), introduced in 2019 by François Chollet. ARC tests AI systems on visual puzzles whose goal is to infer an abstract rule from a few example grids, then apply it to a new input. Humans tend to do well, but AI systems struggle, suggesting that this kind of abstraction and few-shot reasoning remains hard for machine learning to master. The article notes that the benchmark's newer version, ARC-AGI-2, is harder still, and AI performance remains far below human levels.
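To make the setup concrete, here is a minimal, hypothetical sketch in Python. ARC tasks are published as JSON objects with "train" and "test" lists of input/output grids of small integers; the toy solver below, which tries only two hand-picked transforms, is meant to illustrate the few-shot rule-induction format, not how competitive ARC systems actually work.

```python
def transpose(grid):
    """Flip a grid over its main diagonal."""
    return [list(row) for row in zip(*grid)]

def flip_horizontal(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

# A deliberately tiny hypothesis space; real solvers search a far larger one.
CANDIDATE_RULES = {"transpose": transpose, "flip_horizontal": flip_horizontal}

def solve(task):
    """Return the first rule consistent with every training pair,
    applied to the test inputs; None if no candidate fits."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(p["input"]) == p["output"] for p in task["train"]):
            return name, [rule(t["input"]) for t in task["test"]]
    return None, None

# A minimal example task: the hidden rule is horizontal mirroring.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4], [0, 5]], "output": [[4, 3], [5, 0]]},
    ],
    "test": [{"input": [[7, 0], [0, 8]]}],
}

name, predictions = solve(task)
print(name, predictions)  # flip_horizontal [[[0, 7], [8, 0]]]
```

A real solver has to search an open-ended space of transformations (recoloring, symmetry, object counting, and more) from just a handful of examples, which is exactly the abstraction step the article says machines find hard.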

Another difficulty is deciding which tasks should count toward "general intelligence." Intelligence for machines may include skills that differ from human cognition: social reasoning, causal understanding, adaptability, creativity, or fluent multimodal interaction. Benchmarks often capture subsets of these but struggle to combine them reliably or to handle novel tasks outside the training data. The article raises the concern that many benchmarks reward sheer scale (huge models trained on massive data) rather than deeper generalization.

IEEE Spectrum also covers the debate over whether internal mechanisms (how the intelligence works) matter, or only behavioral performance (what tasks it can do). Some experts argue that humanlike internal representations, interpretability, value alignment, and robustness matter as much as raw benchmark scores. Findings that systems exploit shortcuts or "game" tests make benchmark results harder to treat as definitive.
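One common diagnostic for this kind of gaming is an ablation probe: re-score the system on inputs whose task-relevant content has been stripped out, and see whether accuracy survives. The sketch below is hypothetical throughout (the toy data, the longest_option_model, and the drop_question ablation are all invented for illustration); it shows the shape of the check, not any benchmark's actual protocol.

```python
import random

random.seed(0)

def accuracy(model, data):
    """Fraction of (input, label) pairs the model answers correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

def shortcut_probe(model, examples, ablate):
    """Score the model on intact inputs and on inputs with the
    task-relevant part removed. A small gap suggests the benchmark is
    being gamed via surface cues, not the skill it claims to measure."""
    intact = accuracy(model, examples)
    ablated = accuracy(model, [(ablate(x), y) for x, y in examples])
    return intact, ablated

# Toy data: each input is (question, options); the label is the correct
# option. The spurious cue: the correct option is always the longest one.
examples = []
for _ in range(200):
    answer = "word" * random.randint(3, 5)
    distractor = "word" * random.randint(1, 2)
    options = [answer, distractor]
    random.shuffle(options)
    examples.append((("which is correct?", tuple(options)), answer))

def longest_option_model(x):
    """A 'model' that ignores the question and picks the longest option."""
    _question, options = x
    return max(options, key=len)

def drop_question(x):
    """Ablation: delete the question text, keep only the options."""
    _question, options = x
    return ("", options)

intact, ablated = shortcut_probe(longest_option_model, examples, drop_question)
print(intact, ablated)  # 1.0 1.0: accuracy survives ablation, a red flag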

In the end, the article suggests that tracking AGI will require more than scoreboard comparisons. It will need benchmarks that are broad, hard, novel, and "unfair" in the sense that they test beyond what current systems have seen. It will also need consensus on definitions, responsibility around deployment (ethics, values), and humility about the progress we think we've made. AGI might still be distant, or closer than we expect, but knowing when we've crossed the line demands sharper tools and clearer agreement.