Efforts to develop a reliable test for artificial general intelligence (AGI) have sparked significant debate, as a well-known benchmark appears closer to being surpassed. However, its creators argue that this progress highlights flaws in the test rather than a genuine breakthrough in AGI research.
In 2019, François Chollet, a prominent AI researcher, introduced the ARC-AGI benchmark, short for "Abstraction and Reasoning Corpus for Artificial General Intelligence." The test was designed to evaluate whether AI systems can acquire new skills beyond the data they were trained on. Chollet has claimed ARC-AGI is the only current benchmark for measuring progress toward AGI, though other tests have been proposed.
Until recently, AI systems struggled with the test, solving less than one-third of its tasks. Chollet attributes this to the industry’s focus on large language models (LLMs), which he believes lack true reasoning abilities. He has emphasized that LLMs primarily rely on memorization and struggle with tasks that require generalization or novel problem-solving.
LLMs function as statistical tools: they learn patterns in their training data and use those patterns to make predictions. While they can memorize reasoning patterns, Chollet argues that they cannot generate new reasoning when confronted with unfamiliar situations. In his view, a system that needs extensive training on a specific kind of pattern before it can form a reusable representation of it is memorizing, not reasoning.
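To make the "statistical pattern-learning" point concrete, here is a minimal sketch of next-token prediction by counting, a toy bigram model rather than anything resembling how production LLMs actually work. It predicts whichever word most often followed the current word in its training text, which only works for patterns it has literally seen before, the memorization-style behavior Chollet contrasts with genuine reasoning.

```python
from collections import Counter, defaultdict

# Toy "language model": count which word follows which in the training text.
# Real LLMs are vastly more sophisticated, but the core idea is still
# learning statistical patterns from training data and replaying them.
training_text = "the cat sat on the mat the cat ate the fish"
words = training_text.split()

follower_counts = defaultdict(Counter)
for current, nxt in zip(words, words[1:]):
    follower_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower seen in training, or None if unseen."""
    if word not in follower_counts:
        return None  # a genuinely novel input: no stored pattern to fall back on
    return follower_counts[word].most_common(1)[0][0]

print(predict_next("the"))   # 'cat' -- a pattern memorized from training
print(predict_next("dog"))   # None  -- nothing memorized, no way to generalize
```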
In June, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition to develop an open-source AI capable of beating ARC-AGI. From nearly 18,000 entries, the top submission achieved a 55.5% success rate, roughly 20 percentage points higher than the previous year's best, but still well short of the 85% threshold considered "human-level" performance.
Despite this improvement, Knoop cautioned that it does not mean the field is 20% closer to achieving AGI. He noted that many submissions relied on brute-force methods to solve tasks, suggesting that some ARC-AGI challenges may not effectively measure progress toward general intelligence.
The ARC-AGI benchmark consists of puzzle-like tasks in which an AI is shown a few example transformations of colored grids and must infer the underlying rule well enough to produce the correct output for a new input. While the tasks aim to test adaptability to novel problems, their effectiveness in measuring AGI remains uncertain, and Knoop has acknowledged shortcomings in the test, which has remained unchanged since 2019.
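For readers unfamiliar with the benchmark, the sketch below shows roughly what an ARC-style task looks like in code. The public ARC data represents each task as a few demonstration input/output grid pairs plus held-out test inputs, with small integers standing for colors; the specific grids and the mirror-the-rows rule here are invented for illustration, not taken from the actual dataset.

```python
# A toy ARC-style task, mirroring the JSON layout used by the public ARC data:
# a handful of demonstration pairs ("train") plus held-out inputs ("test").
# Grid cells are small integers that stand for colors; this example task
# (reflect each row left-to-right) is invented for illustration.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[0, 1], [2, 0]]},
        {"input": [[3, 3, 0], [0, 0, 4]], "output": [[0, 3, 3], [4, 0, 0]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 0]]},  # solver must infer the rule and answer
    ],
}

def solve(grid):
    """Hypothetical solver for this particular task: mirror each row."""
    return [list(reversed(row)) for row in grid]

# A solver is judged on whether it reproduces the hidden output grid exactly.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # [[0, 0, 5], [0, 6, 0]]
```

The difficulty, and the point of the benchmark, is that each task uses a different hidden rule, so a solver cannot simply reuse a rule it has seen before.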
Both Chollet and Knoop have faced criticism for presenting ARC-AGI as a definitive benchmark for AGI, particularly as the definition of AGI itself is highly debated. For instance, some researchers argue that AGI already exists if defined as AI performing better than most humans at most tasks.
To address these concerns, Chollet and Knoop plan to release a second-generation ARC-AGI benchmark in 2025, alongside a new competition. Chollet has stated that their goal is to guide the research community toward solving critical challenges in AI and accelerate progress toward AGI.
However, refining benchmarks for intelligence in AI remains a complex and contentious task, much like defining intelligence for humans. The journey toward reliable AGI testing underscores the broader challenges of advancing AI research while navigating its philosophical and practical complexities.