AI Giants Use Classic Pokemon Games to Test and Train Advanced Models

Leading technology companies including Google, OpenAI, and Anthropic have found an unconventional but effective way to evaluate their latest AI models: Nintendo's original Pokemon games from the 1990s. The pixelated games are becoming a popular benchmark among AI researchers for tracking progress, assessing capabilities, and working out how to deploy AI systems on complex, time-consuming objectives.

The Rise of AI Pokemon Streams

The trend gained significant momentum in February 2025 when David Hershey, the applied AI lead at Anthropic, created the "Claude Plays Pokémon" stream on the live-gaming platform Twitch. This initiative quickly inspired similar streams featuring AI models from OpenAI and Google. According to a Wall Street Journal report, independent developers initially launched "GPT Plays Pokémon" and "Gemini Plays Pokémon," which later received official support from the respective AI labs.

Together, the Twitch streams of Claude, GPT, and Gemini playing Pokemon have amassed hundreds of thousands of comments as viewers follow the models' progress in real time. In Pokemon Blue, which gained popularity on Nintendo Game Boy devices in the late 1990s, players must navigate intricate mazes, capture Pokemon creatures, and battle "gym masters" to earn badges, and each of those tasks poses its own challenge for an AI system.

Corporate Enthusiasm and Cultural Impact

The phenomenon has generated considerable excitement within the AI community. For some time, OpenAI employees maintained a live GPT Pokemon stream running on a television in their office. Google CEO Sundar Pichai publicly praised a Gemini victory during his presentation at Google's I/O conference last year. Anthropic regularly features a "Claude Plays Pokémon" booth at industry conferences, and the stream has even inspired fan art. The company maintains a live Slack channel where staff members congratulate Claude on its in-game achievements.

"We're all a whole bunch of nerds," Hershey told the Wall Street Journal, capturing the enthusiastic spirit surrounding these AI gaming experiments.

A Long Tradition of Game-Based AI Testing

The Pokemon streams continue a well-established tradition of using games to evaluate AI systems. A decade ago, Google's AlphaGo made headlines by defeating a human champion at the complex game of Go. Other games like poker, chess, and more recently, video games such as Minecraft have also served as testing grounds for artificial intelligence. Google subsidiary Kaggle even launched a "Game Arena," an open-source platform designed to evaluate AI models through competitive gaming. Its inaugural event was a chess tournament in August 2025, which OpenAI's o3 model won.

Why Pokemon Presents Unique Challenges for AI

Discussing the use of Pokemon to evaluate Claude's capabilities, Hershey explained to the Wall Street Journal: "It provides us with, like, this great way to just see how a model is doing and to evaluate it in a quantitative way."

The game demands strategic decisions: players must choose whether to train their current Pokemon or capture new ones to build a stronger team. It also includes numerous mazes and puzzles that AI systems often struggle to solve.

"The thing that has made Pokemon fun and that has captured the [machine learning] community's interest is that it's a lot less constrained than Pong or some of the other games that people have historically done this on. It's a pretty hard problem for a computer program to be able to do," Hershey elaborated.

Technical Advancements and Practical Applications

Newer versions of Claude have performed better at the game. None has completed it yet, but the latest iteration, Claude Opus 4.5, is currently playing live on Twitch. The Pokemon experiments have also helped developers build better harnesses, the specialized software frameworks that support an AI model. For instance, Hershey built a memory system that enables Claude to retain important details learned during gameplay.
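
To give a rough sense of what a memory component in a harness might look like, here is a minimal sketch in Python. It is not Hershey's or Anthropic's implementation, whose details the report does not describe. The class NoteMemory, the function build_prompt, the note keys, and the eviction rule are all hypothetical and chosen only for illustration. The idea shown is simply that notes saved on earlier turns are fed back to the model alongside the current game screen.

```python
from dataclasses import dataclass, field


@dataclass
class NoteMemory:
    """Hypothetical notebook that keeps short notes between model turns."""
    max_notes: int = 50
    notes: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        # When the notebook is full and the key is new, drop the oldest note.
        if key not in self.notes and len(self.notes) >= self.max_notes:
            oldest = next(iter(self.notes))
            del self.notes[oldest]
        self.notes[key] = value

    def as_prompt(self) -> str:
        # Render the notes as plain text for inclusion in the next prompt.
        return "\n".join(f"- {k}: {v}" for k, v in self.notes.items())


def build_prompt(memory: NoteMemory, observation: str) -> str:
    """Combine saved notes with the current (assumed textual) game observation."""
    return f"Notes so far:\n{memory.as_prompt()}\n\nCurrent screen:\n{observation}"


# Illustrative usage with made-up notes and a tiny capacity of two.
memory = NoteMemory(max_notes=2)
memory.remember("viridian_forest_exit", "north side of the maze")
memory.remember("starter", "Squirtle, level 12")
memory.remember("next_goal", "reach the Pewter City gym")  # evicts the oldest note
prompt = build_prompt(memory, "Standing outside a building.")
```

A real harness would presumably also handle screen capture, button inputs, and the model call itself; those parts are left out here because the source gives no details about them.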

Hershey, who works closely with customers at Anthropic, noted that he frequently shares insights gained from the Pokemon experiments with clients. Both OpenAI's GPT models and Google's Gemini have successfully completed the original Pokemon game, though this achievement may be attributed to the different harnesses developed around them. Currently, both models are tackling Pokemon sequel games, according to developers managing the "Gemini Plays Pokémon" and "GPT Plays Pokémon" livestreams.

Expert Perspectives on Game-Based AI Evaluation

Jonathan Verron, who initiated the "GPT Plays Pokémon" stream, stated: "This is a perfect game for AI right now. I've tried to think about other games, but I haven't found as good an example as Pokemon."

Researchers emphasize that using Pokemon to test AI offers genuine benefits. Graham Neubig from Carnegie Mellon University told the Wall Street Journal that unlike traditional benchmarks that test AI with isolated questions, Pokemon allows researchers to observe how an AI thinks, makes decisions, and works toward objectives over extended periods. This provides valuable insights into the reasoning processes and long-term planning capabilities of advanced AI systems.