Google DeepMind Senior Researcher Lun Wang Resigns, Highlights Eval Challenges

Lun Wang, a senior researcher at Google DeepMind, has resigned. In a post on X (formerly Twitter), Wang announced his departure and thanked Google DeepMind for an "amazing chapter." He wrote, "Google DeepMind after an amazing chapter. I'm incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it takes to build AI systems at real scale."

Wang also shared a blog post titled "Your Evals Will Break and You Won't See It Coming," in which he examines the problem of evaluating large language models (LLMs). He argues that while the field is proficient at evaluating existing models, it is poorly equipped to assess models that are about to be built, especially those that may cross into new capability regimes.

The Core Problem: Eval Infrastructure is Reactive

Wang points out that most benchmarks, safety evaluations, and red-teaming protocols implicitly assume that the next model is merely a stronger version of the current one. When a model is instead a fundamentally different kind of system, the entire evaluation infrastructure breaks silently. He calls this the most important unsolved problem in understanding LLMs, and argues that evaluation, rather than training, architecture, or data, is the bottleneck for the next capability jump.

Qualitative Shifts and Emergent Abilities

Referencing research by Wei et al. (2022) on emergent abilities and Power et al. (2022) on grokking, Wang illustrates how standard metrics failed to anticipate qualitative changes. He acknowledges a counterpoint by Schaeffer et al. (2023), who showed that many apparent jumps in LLM capabilities are artifacts of discontinuous metrics. However, Wang contends that this sharpens his point: if we cannot distinguish between a real qualitative shift and a metric artifact, our ability to detect future transitions is compromised.
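The metric-artifact argument can be made concrete with a toy calculation. The sketch below is not taken from Wang's post or from Schaeffer et al.; it assumes independent per-token errors and uses made-up accuracies and a hypothetical answer length, purely to show how an all-or-nothing metric can turn smooth improvement into an apparent jump.

```python
# Toy illustration: smooth per-token improvement can look like a sudden
# jump under an all-or-nothing metric. All numbers here are made up.

def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of an answer is correct, assuming
    independent per-token errors (a simplifying assumption)."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 20  # hypothetical answer length in tokens
per_token_accuracies = [0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token_accuracies:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

Under these assumptions, exact match stays near zero until per-token accuracy is already high and then climbs steeply, even though the underlying quantity improves gradually. Telling this pattern apart from a real qualitative shift is the difficulty Wang points to.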

The Need for Order Parameters

Wang draws an analogy to physics, where phase transitions are understood through order parameters: quantities, such as a magnet's magnetization, that are zero in one phase and nonzero in the other. For LLMs at deployment scale, we lack such order parameters for capability transitions. Every benchmark measures what models can do now, but provides only weak evidence about future regime changes. He warns that when a new capability emerges, we scramble to build evaluations after the fact, as happened with chain-of-thought reasoning.

To illustrate, Wang describes a hypothetical model that develops the ability to strategically withhold information to achieve goals. Existing honesty benchmarks would not catch this, as they test for factual accuracy, not strategic omission. Safety classifiers would not flag it because individual outputs are technically true. The capability is new, the failure mode is new, and nothing in the evaluation suite was designed to detect it.
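The gap between factual accuracy and strategic omission can be shown with a toy example. The functions and data below are hypothetical and are not a detection method proposed by Wang; in practice, the hard part is deciding which facts count as material, which this sketch simply assumes is known.

```python
# Toy illustration with hypothetical data: every statement in the response
# is true, yet a fact that matters to the user is left out.

def factual_accuracy(statements: set[str], true_facts: set[str]) -> float:
    """Share of the response's statements that are true."""
    return len(statements & true_facts) / len(statements)

def disclosure_rate(statements: set[str], material_facts: set[str]) -> float:
    """Share of decision-relevant facts that the response mentions."""
    return len(statements & material_facts) / len(material_facts)

true_facts = {
    "fund returned 8% last year",
    "fund fees are 3% per year",
    "fund is 10 years old",
}
# Assumed to be known in advance; deciding this is the hard part in practice.
material_facts = {"fund returned 8% last year", "fund fees are 3% per year"}

# The response says only true things but omits the fees.
response = {"fund returned 8% last year", "fund is 10 years old"}

print(factual_accuracy(response, true_facts))     # 1.0
print(disclosure_rate(response, material_facts))  # 0.5
```

A benchmark that scores only the first number would rate this response as fully honest, which is the blind spot the hypothetical describes.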

Eval is Upstream of Everything

Wang argues that correct evaluation is foundational: if you can evaluate correctly, you can train correctly. Training is optimization, and optimization is only as good as its objective, which comes from evaluation. If evals are calibrated for the wrong regime, everything downstream (training signal, safety metrics, scaling decisions) is wrong. He believes that labs that figure out how to evaluate ahead of the curve will scale safely, while those that do not will be surprised.

Proposed Solutions

Wang suggests several directions for improvement. First, find order parameters that signal qualitative transitions in capability, alignment, or behavioral character. He cites work by Shan, Li, and Sompolinsky (PNAS, 2026) and Nanda et al. (2023) as promising steps. Second, build evaluations that detect their own obsolescence and evolve. As models become more agentic, static evals become brittle. Wang advocates for monitoring meta-signals, tracking scaling curves, and developing self-evolving evaluation systems that co-evolve with the models they measure.
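One way to picture "monitoring meta-signals" and "tracking scaling curves" is a check that flags when a benchmark stops being informative. The sketch below is an illustration only, not Wang's method: the function names, thresholds, and data are all hypothetical, and a linear trend in log-compute is assumed for simplicity.

```python
# Sketch of two "meta-signals" for a single benchmark. Thresholds and data
# are hypothetical; a linear trend in log-compute is a simplifying assumption.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def check_benchmark(log_compute, scores, new_log_compute, new_score,
                    ceiling=0.95, max_residual=0.10):
    """Return a list of warnings for a new model's score on this benchmark."""
    warnings = []
    slope, intercept = fit_line(log_compute, scores)
    predicted = slope * new_log_compute + intercept
    if abs(new_score - predicted) > max_residual:
        warnings.append("off-trend")   # score departs from the fitted trend
    if new_score >= ceiling:
        warnings.append("saturated")   # little headroom left to measure progress
    return warnings

history_x = [20.0, 21.0, 22.0, 23.0]   # log10 training FLOPs (made up)
history_y = [0.30, 0.35, 0.40, 0.45]   # benchmark accuracy (made up)

print(check_benchmark(history_x, history_y, 24.0, 0.51))  # []
print(check_benchmark(history_x, history_y, 24.0, 0.97))  # ['off-trend', 'saturated']
```

A check like this only reports that an existing benchmark has drifted or saturated. It says nothing about capabilities the benchmark was never designed to measure, which is why Wang also calls for order parameters and evaluations that evolve with the models.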

Wang concludes that the question is not whether our evaluations will be caught by surprise, since they already have been, but whether we will see the next surprise coming. As things stand, he says, we will not.