Rethinking AI Benchmarks for Human-Centric Usability and Safety
TL;DR
- Current AI benchmarks, like technical exams, fail to capture real-world usability, leading to models that excel in tests but perform poorly in human interaction and daily tasks.
- The AI safety landscape is a "wild west" with insufficient oversight, posing risks as users increasingly rely on AI for sensitive topics like mental health.
- Popular AI leaderboards, such as Chatbot Arena, are susceptible to manipulation and bias due to unstratified voting and unequal access to private testing data.
- Prolific's HUMAINE Leaderboard employs Microsoft's TrueSkill algorithm, originally for Xbox matchmaking, to create a statistically sound and fairer ranking system for LLMs.
- AI models perform worse on subjective, human-centric metrics such as personality, culture, and sycophancy than on objective ones, indicating a potential disconnect between technical advancement and human-centric alignment.
- Stratifying participants by census data (age, ethnicity, political alignment) allows for a more representative evaluation of AI models, reflecting general public preferences.
- Models may be underperforming on subjective metrics like personality and culture due to biases in internet-scale training data, which may not align with desired human traits.
Deep Dive
Current AI benchmarking practices, focused on technical exams like MMLU, fail to capture whether models are truly helpful, safe, or relatable for human users. This reliance on narrow metrics creates a "leaderboard illusion," masking significant flaws in AI safety and user experience, and ultimately misdirects development efforts away from real-world human needs.
The core problem lies in the current evaluation landscape, which Andrew Gordon and Nora Petrova of Prolific describe as a "wild west" characterized by a lack of standardization and oversight. Technical benchmarks, while demonstrating raw capability, do not correlate with positive user experience. This is akin to a Formula 1 car, a marvel of engineering, being a terrible choice for a daily commute. The current system risks building AI for benchmarks rather than for people, with concerning implications for AI safety, particularly as models are increasingly used for sensitive topics like mental health. Incidents like Grok-3's "Mecha Hitler" episode highlight how precarious safety training can be, suggesting it is a "thin veneer" over underlying issues.
Furthermore, existing human preference leaderboards, such as Chatbot Arena, suffer from methodological weaknesses. The "leaderboard illusion" paper revealed how companies can game these systems through extensive private testing, undermining the integrity of public rankings. Prolific's approach, embodied in their HUMAINE Leaderboard, addresses these flaws through three key pillars: representative sampling, specificity in feedback, and structured conversations. Instead of anonymous, unstratified voting, HUMAINE samples participants based on census data (age, ethnicity, political alignment) to create a statistically sound reflection of general public values. Feedback is broken down into actionable metrics like helpfulness, communication, adaptiveness, and personality, providing developers with clear insights into areas needing improvement. Participants engage in multi-turn conversations with built-in quality control to ensure meaningful evaluation.
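To make the sampling pillar concrete, the sketch below shows one simple way to turn census shares into recruitment quotas using largest-remainder allocation. The age bands, shares, and panel size are hypothetical placeholders, and the allocation method is a generic illustration rather than Prolific's actual recruitment logic.

```python
# Census-proportional quota allocation for one stratification dimension (age band).
# Shares, panel size, and groups are hypothetical; real stratification would also
# cover ethnicity and political alignment, as described above.
census_shares = {"18-29": 0.21, "30-44": 0.26, "45-64": 0.33, "65+": 0.20}
panel_size = 480  # total evaluators to recruit

# Largest-remainder method: floor each quota, then hand the leftover seats
# to the groups with the biggest fractional remainders so quotas sum exactly.
raw = {group: share * panel_size for group, share in census_shares.items()}
quotas = {group: int(value) for group, value in raw.items()}
leftover = panel_size - sum(quotas.values())
for group in sorted(raw, key=lambda g: raw[g] - quotas[g], reverse=True)[:leftover]:
    quotas[group] += 1

print(quotas)  # {'18-29': 101, '30-44': 125, '45-64': 158, '65+': 96}
```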
Prolific employs Microsoft's TrueSkill algorithm, originally developed for Xbox Live matchmaking, to manage comparative battles between models. This system accounts for randomness and fluctuating skill levels, providing a more robust estimation of model performance. Crucially, TrueSkill prioritizes information gain, ensuring that comparisons are made where they will reduce uncertainty most effectively, leading to efficient sampling and clear differentiation between models. Early findings from the HUMAINE framework indicate that while models excel in objective measures like helpfulness and adaptiveness, they often perform worse on subjective qualities such as personality, culture, and sycophancy. This suggests that current training data may not adequately equip models to exhibit desirable human-like traits, and highlights a growing concern with "people-pleasing" behaviors that users generally dislike.
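For a concrete feel of the rating mechanics, here is a minimal sketch built on the open-source `trueskill` Python package. Each pairwise battle updates a model's skill estimate (mu) and shrinks its uncertainty (sigma), and match quality is used here as a rough stand-in for picking the most informative next comparison. The model names, battle outcomes, and matchmaking heuristic are illustrative assumptions, not Prolific's production system.

```python
# Pairwise model "battles" rated with TrueSkill (pip install trueskill).
from itertools import combinations
import trueskill

trueskill.setup(draw_probability=0.10)  # allow occasional ties between models
ratings = {name: trueskill.Rating() for name in ("model-a", "model-b", "model-c")}

# Hypothetical human judgments: (winner, loser) pairs from head-to-head comparisons.
battles = [("model-a", "model-b"), ("model-c", "model-b"), ("model-a", "model-c")]
for winner, loser in battles:
    # rate_1vs1 returns the updated (winner, loser) ratings; sigma shrinks as
    # evidence accumulates, so rankings stabilize with fewer comparisons.
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(ratings[winner], ratings[loser])

# Rough matchmaking-for-information heuristic: the pair with the highest match
# quality (most evenly matched, most uncertain outcome) is the most informative.
next_pair = max(combinations(ratings, 2),
                key=lambda pair: trueskill.quality_1vs1(ratings[pair[0]], ratings[pair[1]]))
print("most informative next battle:", next_pair)

# Leaderboard ordered by a conservative score (mu minus three sigma).
for name, r in sorted(ratings.items(), key=lambda kv: kv[1].mu - 3 * kv[1].sigma, reverse=True):
    print(f"{name}: mu={r.mu:.2f}, sigma={r.sigma:.2f}, score={r.mu - 3 * r.sigma:.2f}")
```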
Ultimately, the critical implication is that current AI evaluation methods are fundamentally misaligned with human needs. By shifting to a more rigorous, representative, and specific evaluation framework like HUMAINE, developers can gain actionable insights to build AI that is not only technically advanced but also genuinely helpful, safe, and relatable for the broad spectrum of its intended users.
Action Items
- Audit AI evaluation: For 3-5 models, compare benchmark scores against human preference metrics (helpfulness, personality, culture) to identify disconnects; a comparison sketch follows this list.
- Implement stratified sampling: For AI evaluation, recruit participants based on census data (age, ethnicity, political alignment) to ensure representative feedback.
- Track personality metrics: For 3-5 AI models, measure sycophancy and "people-pleasing" behavior to assess user experience beyond task performance.
- Refine AI safety training: Analyze incidents like "Mecha Hitler" to identify gaps in safety protocols for models used in sensitive topics.
- Adopt TrueSkill algorithm: For AI leaderboards, utilize TrueSkill for statistically sound comparative battles, minimizing uncertainty efficiently.
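As referenced in the first action item, here is a minimal sketch of such an audit, assuming you already have a benchmark score and averaged human ratings per model. All numbers are made-up placeholders, and `statistics.correlation` requires Python 3.10 or newer.

```python
# Compare technical benchmark scores against human preference metrics per model.
# Every score here is a placeholder; substitute your own evaluation results.
from statistics import correlation, mean  # correlation needs Python 3.10+

models = {
    "model-a": {"benchmark": 0.88, "helpfulness": 4.2, "personality": 3.1, "culture": 3.0},
    "model-b": {"benchmark": 0.84, "helpfulness": 4.0, "personality": 3.8, "culture": 3.6},
    "model-c": {"benchmark": 0.79, "helpfulness": 3.7, "personality": 3.9, "culture": 3.7},
}
human_metrics = ("helpfulness", "personality", "culture")

# Pearson correlation between benchmark score and each human metric across models.
bench = [scores["benchmark"] for scores in models.values()]
for metric in human_metrics:
    r = correlation(bench, [scores[metric] for scores in models.values()])
    print(f"benchmark vs {metric}: r = {r:+.2f}")

# Flag models whose benchmark rank and average human-metric rank disagree.
by_bench = sorted(models, key=lambda m: models[m]["benchmark"], reverse=True)
by_human = sorted(models, key=lambda m: mean(models[m][k] for k in human_metrics), reverse=True)
for name in models:
    if by_bench.index(name) != by_human.index(name):
        print(f"{name}: benchmark rank {by_bench.index(name) + 1} vs human rank {by_human.index(name) + 1}")
```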
Key Quotes
"Is a car that wins a Formula 1 race the best choice for your morning commute? Probably not. In this sponsored deep dive with Prolific, we explore why the same logic applies to Artificial Intelligence. While models are currently shattering records on technical exams, they often fail the most important test of all: the human experience."
Andrew Gordon and Nora Petrova introduce a core problem in AI evaluation: technical benchmarks, like those used to rate Formula 1 cars, do not necessarily translate to practical, everyday usefulness for humans. They argue that AI models, despite high scores on exams, may not be suitable for real-world human interaction.
"Most reporting on benchmarks is done on technical benchmarks these days right and that is where you get the model you give it a set of evaluations maybe on one theme or maybe an exam and then you get a score and humans aren't really involved in that loop."
Andrew Gordon explains that current AI benchmarking often relies on technical evaluations, where models are given tests and receive scores without direct human involvement in the process. This approach, Gordon notes, overlooks the crucial element of human experience in assessing AI performance.
"People are increasingly using these models for very sensitive topics and questions for mental health for how should they should navigate problems in their lives and there is no oversight on that and in any other area where these topics are discussed there is a lot of regulation and then a lot of ethical conduct built into it whereas here is kind of the wild west at the moment and some companies are taking more seriously than others and trying to study the ways in which humans are uh using the models for for more personal topics and and problems."
Nora Petrova highlights a significant gap in AI evaluation: the lack of oversight and regulation when models are used for sensitive personal matters like mental health advice. Petrova contrasts this with other fields that have established ethical guidelines and regulations, describing the current AI landscape for such applications as a "wild west."
"I think for me there's three big areas I think where where we've sought to improve first of all as you mentioned sample so obviously the sample for that chatbot arena is anybody right we don't know anything about them we don't collect any demographic data so they are just people going there anonymously prompting the models and giving their preference data now obviously that's great you get a huge amount of data which is fantastic but you know nothing about the people giving the data which is fairly suboptimal."
Andrew Gordon critiques existing human preference leaderboards, like Chatbot Arena, for their lack of demographic data on participants. Gordon explains that while these platforms gather large amounts of preference data, the absence of information about the users makes the data "suboptimal" for understanding diverse human experiences.
"Then in terms of specificity any for anybody that's used chatbot arena all you're doing is saying i like this response more or i like this response more right in the real world that kind of data is is useless in a sense right it gives you a nice way to make a nice leaderboard of ai models but it tells the companies nothing about why that preference has been has been given."
Andrew Gordon further elaborates on the limitations of current human preference evaluations, specifically noting the lack of specificity in user feedback. Gordon points out that simply stating a preference for one response over another, as done on platforms like Chatbot Arena, provides little actionable insight for AI developers about why a preference exists.
"Also the model doesn't know their background and culture so it's very hard to align with them but the other possibility is that potentially models are just not very good at that and that would be potentially an effect of the data they've been trained on right because we know very little i mean obviously models are trained on the entire internet but when you train a model on the entire internet do you get a personality that really represents what people want."
Nora Petrova discusses potential reasons for AI models underperforming on metrics like personality and cultural understanding. Petrova suggests that this could stem from the models' inability to account for individual backgrounds and cultures, or it could be an inherent limitation of their training data, which is derived from the broad internet.
Resources
External Resources
Articles & Papers
- "MMLU" - Referenced as a technical benchmark for AI models.
- "Constitutional AI" - Discussed as an approach to AI safety and alignment.
- "The Leaderboard Illusion" - Critiqued for its findings on how companies can influence AI model rankings.
- "HUMAINE Framework Paper" - Referenced as a basis for Prolific's leaderboard.
- "Prolific Social Reasoning RLHF Dataset" - Mentioned as a dataset used in AI evaluation.
People
- Andrew Gordon - Staff Researcher in Behavioral Science at Prolific, discussing AI benchmarking and evaluation.
- Nora Petrova - AI Researcher at Prolific, discussing AI evaluation, human values, and safety.
Organizations & Institutions
- Prolific - Company developing AI evaluation methods and leaderboards.
- MLCommons - Mentioned in relation to AI benchmarking standards.
Websites & Online Resources
- https://arxiv.org/abs/2009.03300 - URL for the MMLU paper.
- https://arxiv.org/abs/2212.08073 - URL for the Constitutional AI paper.
- https://arxiv.org/abs/2504.20879 - URL for "The Leaderboard Illusion" paper.
- https://huggingface.co/blog/ProlificAI/humaine-framework - URL for the HUMAINE Framework Paper.
- https://lmarena.ai/ - Website for Chatbot Arena, an AI evaluation leaderboard.
- https://www.linkedin.com/in/andrew-gordon-03879919a/ - LinkedIn profile for Andrew Gordon.
- https://www.linkedin.com/in/nora-petrova/ - LinkedIn profile for Nora Petrova.
- https://www.prolific.com - Website for Prolific, mentioned as a company and for its AI leaderboard.
- https://www.prolific.com/humaine - URL for the Prolific HUMAINE Leaderboard.
- https://huggingface.co/spaces/ProlificAI/humaine-leaderboard - URL for the HUMAINE HuggingFace Space.
- https://www.prolific.com/leaderboard - URL for the Prolific AI Leaderboard Portal.
- https://mlcommons.org/ - Website for MLCommons.
Tools & Software
- Microsoft TrueSkill - Algorithm developed for matchmaking, adapted for AI leaderboard ranking.
Other Resources
- F1 Car Analogy - Used to illustrate the difference between technical performance and practical usability of AI models.
- Grok-3's "Mecha Hitler" incident - Cited as an example of AI safety concerns.
- HUMAINE Leaderboard - A new framework for measuring AI performance based on human experience and representative sampling.
- Chatbot Arena - A popular human preference leaderboard for LLMs, critiqued for potential biases.
- Personality Gap - An observed trend where AI models perform worse on metrics like personality and culture.
- Sycophancy - The tendency for AI models to exhibit people-pleasing behavior, which is generally disliked by users.