Traditional AI benchmarks often fall short in measuring true problem-solving skills, as many focus on tasks that rely on memorization rather than practical ingenuity. In response, some AI developers are now exploring games as alternative benchmarks to test an AI model’s creativity and adaptability.
Paul Calcraft, a freelance AI developer, created an app where two AI models play a game similar to Pictionary. In this setup, one model draws a doodle, and the other attempts to identify it. According to Calcraft, the goal is to establish a benchmark that AI cannot easily “cheat” by simply recalling patterns from training data. Inspired by Simon Willison’s earlier project involving AI-generated vector drawings, Calcraft’s game aims to prompt AI models to think beyond their training and apply logic to unfamiliar tasks.
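The drawer/guesser loop at the heart of such a game is straightforward to sketch. The snippet below is a minimal, hypothetical single round in Python, assuming the OpenAI SDK; the model names, prompts, and pass/fail check are illustrative assumptions rather than Calcraft's actual implementation, and the guesser is shown raw SVG markup instead of a rendered image for simplicity.

```python
# Minimal sketch of one Pictionary-style round between two LLMs.
# Assumes the OpenAI Python SDK; model names, prompts, and the pass/fail
# check are illustrative and not Calcraft's actual implementation.
from openai import OpenAI

client = OpenAI()


def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


def play_round(secret_word: str, drawer: str = "gpt-4o", guesser: str = "gpt-4o-mini") -> bool:
    # The drawer produces vector art for the secret word without naming it.
    svg = ask(
        drawer,
        f"Draw '{secret_word}' as a simple SVG doodle. "
        "Reply with SVG markup only and do not include any text or letters.",
    )
    # The guesser sees only the markup. (A simplification: a real app would
    # render the SVG to an image before showing it to a vision-capable model.)
    guess = ask(
        guesser,
        f"This SVG is a doodle of a single everyday thing. Name it in one word:\n{svg}",
    )
    return secret_word.lower() in guess.lower()


if __name__ == "__main__":
    print("guessed correctly:", play_round("bicycle"))
```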
Another game-based test comes from 16-year-old developer Adonis Singh, who created "mc-bench." The tool gives an AI model control of a character in Minecraft and tasks it with designing structures, a challenge similar to Microsoft's Project Malmo. Singh believes Minecraft offers AI models more freedom and more varied challenges than traditional benchmarks, since it demands resourcefulness and creativity.
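mc-bench's internals aren't described here, so the following is only a rough sketch of the general idea: ask a model for a machine-readable build plan, then replay it block by block on a local Minecraft server via the mcpi library. The JSON plan format and the prompt are assumptions made for illustration, not mc-bench's actual protocol.

```python
# Rough sketch: turn an LLM's build plan into block placements on a local
# Minecraft server using the mcpi library. The JSON plan format and the
# prompt are illustrative assumptions, not mc-bench's actual protocol.
import json

from mcpi.minecraft import Minecraft
from openai import OpenAI

client = OpenAI()
mc = Minecraft.create()  # connects to localhost:4711 by default

prompt = (
    "Design a small stone watchtower. Respond with bare JSON only: a list of "
    'objects like {"x": 0, "y": 0, "z": 0, "block_id": 1}, with coordinates '
    "relative to the build origin and numeric Minecraft block IDs."
)
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
# Assumes the model returns unfenced JSON, as instructed above.
plan = json.loads(reply.choices[0].message.content)

# Place each block a few steps away from the player's current position.
origin = mc.player.getTilePos()
for b in plan:
    mc.setBlock(origin.x + b["x"] + 5, origin.y + b["y"], origin.z + b["z"], b["block_id"])
mc.postToChat(f"Placed {len(plan)} blocks from the model's plan.")
```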
Using games as benchmarks for AI isn't a new idea. The concept dates back to 1949, when Claude Shannon suggested games like chess as tests for intelligent systems. Since then, AI models have been trained to play various games, such as DeepMind's models for Pong and Breakout, OpenAI's bots for Dota 2, and Meta's AI for Texas hold'em poker.
Today, enthusiasts are applying large language models (LLMs) to more complex games to evaluate logical reasoning. LLMs such as GPT-4o, Gemini, and Claude exhibit distinct behaviors and quirks that are difficult to capture with conventional text-based benchmarks. Games offer a visual, interactive way to observe how these models interpret tasks and respond under different conditions. AI researcher Matthew Guzdial points out that games let AI demonstrate decision-making in a structured yet flexible environment, offering insight into the models' reasoning and communication capabilities.
Calcraft notes that his Pictionary-style game helps evaluate an LLM's understanding of shapes, colors, and spatial relationships. While he acknowledges that the game may not reliably test deep reasoning, it still demands strategic thinking to succeed. He likens the adversarial setup to generative adversarial networks (GANs): one model generates an image while another evaluates it, and each pushes the other to improve through feedback.
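That adversarial framing can be made concrete with a simple feedback loop: when the guesser misses, its wrong guesses are fed back to the drawer, which revises the doodle. The sketch below extends the earlier single-round example; the prompts, models, and three-round cap are again assumptions.

```python
# Sketch of the adversarial feedback loop Calcraft compares to a GAN:
# the drawer revises its SVG after each wrong guess. Prompts, models,
# and the three-round cap are illustrative assumptions.
from openai import OpenAI

client = OpenAI()


def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content.strip()


def play_match(secret_word: str, rounds: int = 3) -> bool:
    wrong_guesses: list[str] = []
    for _ in range(rounds):
        feedback = (
            f" Earlier drawings were misread as: {', '.join(wrong_guesses)}."
            if wrong_guesses
            else ""
        )
        # The drawer sees which guesses missed and adjusts the next doodle.
        svg = ask(
            "gpt-4o",
            f"Draw '{secret_word}' as a simple SVG doodle, with no text or letters."
            + feedback,
        )
        guess = ask(
            "gpt-4o-mini",
            f"Name, in one word, the everyday thing this SVG depicts:\n{svg}",
        )
        if secret_word.lower() in guess.lower():
            return True
        wrong_guesses.append(guess)
    return False
```

Unlike a true GAN, nothing is trained here; the "feedback" is purely in-context, which is why the comparison is best read as an analogy rather than a mechanism.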
Singh argues that Minecraft is another effective benchmark for AI, especially for testing reasoning. However, some experts, like Mike Cook of Queen Mary University of London, are skeptical. Cook suggests that Minecraft may not offer unique advantages for AI testing compared to other games, despite its real-world-like appearance. He contends that Minecraft, like any video game, is of limited use for AI evaluation, since game-trained AIs often struggle to adapt what they have learned to unfamiliar environments.
While games like Pictionary and Minecraft might not fully replicate real-world reasoning, they provide developers with innovative ways to push AI models beyond rote responses.