AI Outsmarts Humans in Landmark Turing Test Study by UC San Diego

A study by researchers at the University of California, San Diego reports that modern artificial intelligence systems can pass the Turing test, a benchmark treated as a gold standard for machine intelligence for more than seven decades. The research, published in the Proceedings of the National Academy of Sciences, found that an advanced AI model was judged to be human more often than the actual humans it was compared with in conversational tests, with unsettling implications for online trust and deception.

The Turing Test: A Brief History

Proposed by British mathematician Alan Turing in 1950, the Turing test asks whether a machine can imitate human conversation so convincingly that a human interrogator cannot reliably distinguish it from a real person. For decades, it served as a key measure of artificial intelligence. The new study suggests that modern large language models (LLMs) have now reached this milestone.

Study Findings: AI Outperforms Humans

Led by Cameron Jones, who earned his doctorate in Cognitive Science from UC San Diego, the research had each participant act as an interrogator, chatting simultaneously with a human and an AI system and then deciding which of the two was human. Four systems were tested: GPT-4.5 and LLaMa-3.1-405B (state-of-the-art LLMs), along with two baselines, the older GPT-4o and ELIZA, a 1960s rule-based chatbot.

Results showed that GPT-4.5 was judged to be human 73% of the time, while LLaMa-3.1-405B was judged human 56% of the time. Because interrogators were choosing between two conversation partners, random guessing would produce a rate of 50%. In contrast, ELIZA and GPT-4o were selected as human only about 23% and 21% of the time, respectively. “What we found is that, if given the right prompts, advanced LLMs can exhibit the same tone, directness, humour and fallibility as humans,” said Jones, now an assistant professor of psychology at Stony Brook University.

The Role of Persona Prompts

The LLMs were given a “persona” prompt instructing them to adopt a specific human character and communication style. Co-author Ben Bergen, a professor of cognitive science at UC San Diego, explained that the models succeeded not by displaying knowledge but by mimicking human traits such as making errors. “These traits are not the kinds of maths and logic problem-solving intelligence that I think Turing was imagining,” he said.

Without the persona prompt, the models were less likely to be mistaken for humans. The share of conversations in which GPT-4.5 was picked as the human fell to 36%, and LLaMa-3.1-405B's to 38%, while the baseline systems remained low. With the detailed persona prompt, by contrast, both models were picked as human more than half the time. “They have the ability to appear human-like, but perhaps not as much the ability to figure out what it would take to appear human-like,” Bergen added.

Implications for Trust and Deception

The findings raise serious concerns about online trust. “It is relatively easy to prompt these models to be indistinguishable from humans. We need to be more alert; when you interact with strangers online, people should be much less confident that they know they are talking to a human rather than an LLM,” Jones warned.

Bergen highlighted the potential for misuse: “There are lots of people who would like to use bots to persuade people to share their social security numbers, vote for their party, or buy their product.” The study underscores that AI can now act human in five- to fifteen-minute conversations, blurring the line between real and artificial interactions.

Redefining the Turing Test

The researchers argue that the Turing test no longer gauges raw intelligence. “Increasingly, it is measuring humanlikeness,” Bergen said. As AI continues to evolve, the study calls for greater awareness of, and safeguards against, deceptive AI use in digital spaces.