In a surprising development that has sent shockwaves through the artificial intelligence community, OpenAI's ChatGPT has been ranked in a disappointing eighth place in a comprehensive study evaluating leading AI models. The research, conducted by British company Prolific, places the pioneering chatbot behind several competitors, including multiple Gemini models, Grok versions, and even offerings from relatively newer players like DeepSeek and Mistral.
The Humaine Benchmark: A New Way to Measure AI
The study, published in September, utilized a unique evaluation system called "Humaine" that the company claims is specifically "built to understand AI performance through the lens of natural human interaction." This approach marks a significant departure from traditional evaluation methods that have dominated AI assessment until now.
Prolific criticized current AI evaluation standards in their blog post, stating: "Current evaluation is heavily skewed towards metrics that are meaningful to researchers but opaque to everyday users, such as accuracy on specialised datasets, performance on esoteric reasoning tasks, etc." The company emphasized that this has created a fundamental disconnect between what gets optimized during development and what ordinary users actually value in their daily interactions with AI systems.
The Top Performers and ChatGPT's Position
According to the Humaine study results, the top 10 AI models ranked as follows:
1. Gemini 2.5 Pro (Google)
2. DeepSeek v3 (DeepSeek)
3. Magistral Medium (Mistral, France)
4. Grok 4 (xAI)
5. Grok 3 (xAI)
6. Gemini 2.5 Flash (Google)
7. DeepSeek R1 (DeepSeek)
8. ChatGPT-4.1 (OpenAI)
9. Gemma (Google)
10. Gemini 2.0 Flash (Google)
Methodology and Market Implications
The researchers addressed potential concerns about their methodology, noting that even human preference leaderboards can be flawed if not designed with scientific rigor. They pointed out that platforms requiring users to vote for their favorite model often suffer from sample bias and likely overrepresent tech-savvy users rather than the general population.
The new leaderboard aims to address these issues through automated quality monitoring designed to ensure participants engage thoughtfully with evaluation tasks. This approach attempts to create a more accurate representation of how these AI models perform in real-world scenarios that matter to everyday users.
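For readers curious how vote-based leaderboards typically turn individual preferences into a ranking, the sketch below shows a minimal Elo-style aggregation of pairwise "which answer did you prefer?" votes. It is purely illustrative: the Elo scheme, the K-factor, and the model names are assumptions made for the example and are not drawn from Prolific's Humaine methodology. It does, however, make the researchers' point concrete: the final ratings are only as representative as the pool of people casting the votes.

```python
# Illustrative sketch only: aggregating pairwise human preference votes
# into a leaderboard with a simple Elo-style update. This is NOT the
# Humaine methodology; model names and parameters are hypothetical.
from collections import defaultdict

K = 32  # update step size (a conventional Elo constant, chosen arbitrarily here)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(votes, start: float = 1500.0) -> dict:
    """votes: iterable of (winner, loser) pairs from human preference votes."""
    ratings = defaultdict(lambda: start)
    for winner, loser in votes:
        exp_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += K * (1 - exp_win)
        ratings[loser] -= K * (1 - exp_win)
    return dict(ratings)

if __name__ == "__main__":
    # Hypothetical votes; a rigorously designed study would also track who
    # is voting, to detect the sample bias the researchers describe.
    sample_votes = [("model_a", "model_b"), ("model_a", "model_c"),
                    ("model_b", "model_c"), ("model_a", "model_b")]
    leaderboard = sorted(update_ratings(sample_votes).items(),
                         key=lambda kv: kv[1], reverse=True)
    for name, rating in leaderboard:
        print(f"{name}: {rating:.1f}")
```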
It's important to note that the study was conducted before Google released its Gemini 3 Pro model and before xAI rolled out their Grok 4.1 and Grok 4.1 Thinking models. However, the timing doesn't fully explain ChatGPT's surprisingly low ranking, especially given its market dominance since its public debut in late 2022.
While Gemini 2.5 Pro topping another benchmark isn't particularly surprising given its strong performance across various leaderboards since launch, OpenAI's failure to place any model in the top five represents a significant development in the rapidly evolving AI landscape. The fact that ChatGPT ranked behind multiple models from newer entrants like DeepSeek and Grok raises important questions about the direction of AI development and what truly constitutes quality in AI-human interactions.
The researchers didn't provide specific reasoning behind ChatGPT's relatively low placement in the rankings but noted that Google's Gemini 2.5 Pro consistently ranked as the top model for the "Overall Winner" metric throughout their evaluation process.