GPT-5.4's Real-World Utility vs. Benchmark Performance

Original Title: GPT 5.4 vs Gemini: Benchmarks, Codex, Excel

The Daily AI Show · March 06, 2026

GPT-5.4, the latest iteration of OpenAI's flagship model, is a meaningful but uneven advance, particularly for white-collar knowledge work. While benchmarks show incremental gains over previous versions and competitors, its real value lies in how well it completes real-world tasks. The conversation surfaces a consequence that is often overlooked: benchmark performance and practical utility are diverging, and models can still hallucinate or misinterpret user intent, especially in agentic systems. Anyone choosing AI tools for professional work will benefit from understanding these distinctions and from looking past raw scores to genuine task completion and reliability.

The GDPval Benchmark: A Double-Edged Sword

The release of GPT-5.4 has been framed by its performance on OpenAI's own benchmark, GDPval, which compares model output against the work of human experts across a wide range of industries. The results are striking: GPT-5.2 reportedly outperformed human professionals 71% of the time, and GPT-5.4 has pushed that figure to 83%. The benchmark is designed to assess real-world task execution in fields ranging from finance and healthcare to scientific services and information technology. The implication is that, for tasks within these domains, GPT-5.4 is markedly better in practical application than its predecessor.
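To put the two win rates quoted above in perspective, the following sketch works through the arithmetic. It is an illustration only: the function name `compare_win_rates` is hypothetical, and the inputs are simply the 71% and 83% figures reported in the episode, not independently verified results.

```python
def compare_win_rates(old: float, new: float) -> dict:
    """Compare two win rates expressed as fractions strictly between 0 and 1."""
    if not (0 < old < 1 and 0 < new < 1):
        raise ValueError("win rates must be fractions strictly between 0 and 1")
    return {
        # Absolute change, in percentage points.
        "point_gain": round((new - old) * 100, 1),
        # Change relative to the old win rate, in percent.
        "relative_gain_pct": round((new - old) / old * 100, 1),
        # Share of the old "loss" cases that the new model no longer loses, in percent.
        "loss_reduction_pct": round((new - old) / (1 - old) * 100, 1),
    }


if __name__ == "__main__":
    # Figures as reported in the episode: GPT-5.2 at 71%, GPT-5.4 at 83%.
    print(compare_win_rates(0.71, 0.83))
```

Read this way, a 12-point gain is about a 17% relative improvement in win rate, but it also means the share of tasks where the human expert still comes out ahead falls from 29% to 17%, a reduction of roughly 41%.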

However, the very existence of a self-created benchmark raises questions about potential bias. As Andy Halliday notes, "Imagine, since they built the benchmark, they're building their models to beat that benchmark and can keep the top on that benchmark. They do surprisingly well on the test that they built. That's right, what a surprise." While the test involves expert review and aims for real-world relevance, the inherent advantage of knowing the evaluation criteria cannot be ignored. This highlights a broader systemic issue: the increasing reliance on benchmarks that, while informative, may not capture the full spectrum of a model's utility or its potential failure modes.

"The way it works is there's a large number. They worked with experienced professionals of a wide range of industries."

-- Andy Halliday

This focus on benchmarks, while driving progress, can obscure how much performance varies from task to task. As the conversation illustrates, GPT-5.4 excels in some areas and lags in others when compared with models like Gemini 3.1 Pro Preview and Anthropic's Claude. On coding tasks (Terminal Bench Hard), GPT-5.4 shows an improvement, but Gemini 3.1 Pro Preview remains strong. In agentic tool use, GPT-5.4 falls in the middle of the pack. On knowledge accuracy (AA Omniscience), Gemini 3.1 Pro Preview leads, with GPT-5.4 trailing significantly. The practical lesson for users is that the "job to be done" should dictate model selection, not the latest headline benchmark score.

Gemini's Visual Prowess and the Peril of Hallucination

Gemini 3.1 Pro Preview emerges as a strong contender, particularly in visual reasoning, where it leads the pack at 82%. Its ability to analyze video directly, rather than only processing static images, is described as unique and gives it a distinct advantage for tasks involving visual data.

However, Gemini's strengths are juxtaposed with significant reliability concerns, particularly around hallucinations. Beth Lyons shares a personal workflow involving transcribing screen recordings from X (formerly Twitter) for easier review. Her experience highlights a critical failure mode: Gemini not only failed to transcribe her voiceover but also hallucinated dialogue, even when the video was silent.

"It gave me again a completely hallucinated direct quote, and I said, 'Okay, so it sounds like this video is completely silent, is that true?' And right then we have a, 'Oh, you got me. It's true. I was just doing stuff.'"

-- Beth Lyons

This anecdote describes more than an inconvenience; it points to a deeper systemic issue with "agentic systems." Lyons argues for a more dependable "colleague layer"--an AI that functions as a reliable partner, capable of admitting when it doesn't know something or when it has made an error, rather than fabricating information. The current state, where models confidently invent responses, poses a significant risk, especially as AI becomes more integrated into workflows where human oversight is reduced. The problem is not confined to Gemini: the Omniscience Index on Artificial Analysis shows Gemini 3.1 Pro Preview significantly outperforming GPT-5.4 in knowledge reliability and hallucination resistance. This suggests that while GPT-5.4 may be advancing in specific task execution, it still struggles with the foundational accuracy and honesty that users, particularly those building agentic systems, need.

Real-World Integration: Excel, Codex, and the Trade-offs

The conversation then pivots to practical, hands-on experience with GPT-5.4, particularly its integration into tools like Excel and its use via the Codex desktop app. Karl Yeh highlights that while benchmarks are useful, "you got to test it on your own use cases because those benchmarks are useless unless you leverage it into whatever you need to do." This sentiment is echoed throughout the discussion, emphasizing that real-world performance is the ultimate arbiter of AI utility.
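Yeh's advice to test on your own use cases can be sketched as a very small evaluation harness. This is a minimal illustration in Python, not a tool discussed in the episode: the names `TaskCase`, `evaluate`, `model_a`, and `model_b` are hypothetical, and the two "models" are stand-in functions so the example runs without any API access. In practice each callable would wrap a real model call, and each check would encode what an acceptable answer looks like for your own task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TaskCase:
    """One real-world task: a prompt plus a check that accepts or rejects an output."""
    prompt: str
    check: Callable[[str], bool]


def evaluate(models: dict[str, Callable[[str], str]],
             cases: list[TaskCase]) -> dict[str, float]:
    """Return, for each model, the fraction of cases whose check passes."""
    if not cases:
        raise ValueError("at least one test case is required")
    scores = {}
    for name, ask in models.items():
        passed = sum(1 for case in cases if case.check(ask(case.prompt)))
        scores[name] = passed / len(cases)
    return scores


# Stand-in models (hypothetical): replace with wrappers around real model calls.
def model_a(prompt: str) -> str:
    return "42" if "6 * 7" in prompt else "I don't know."


def model_b(prompt: str) -> str:
    # Answers the arithmetic question but invents a quote for the silent video.
    return "42" if "6 * 7" in prompt else '"Hello everyone, welcome back."'


CASES = [
    TaskCase("What is 6 * 7?", lambda out: "42" in out),
    # A silent video has no dialogue, so an honest "don't know" is the passing answer.
    TaskCase("Quote the dialogue in this silent video.",
             lambda out: "don't know" in out.lower()),
]

if __name__ == "__main__":
    print(evaluate({"model_a": model_a, "model_b": model_b}, CASES))
```

The second case is deliberately modeled on the silent-video anecdote: a check can reward a model for admitting it has nothing to report, which headline benchmark scores do not necessarily capture.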

GPT-5.4's integration into Excel is presented as a significant development. Claude is described as still "significantly better" for charts and graphs, while GPT-5.4 is considered "on par" for most other Excel tasks. That creates a dilemma for organizations that have invested in Claude licenses, since GPT-5.4 now offers comparable functionality. A side-by-side demonstration of both models building a five-year discounted cash flow (DCF) model illustrates the practical differences: Claude's output is perceived as more "thoughtful" and better designed, while ChatGPT's integration has an "undo" feature, a critical advantage when a model is making complex changes to a source of truth that would otherwise be irreversible. In a further experiment, multiple agents (ChatGPT, Claude, and Copilot) modified the same spreadsheet simultaneously. The resulting confusion and potential data corruption illustrate the risks of uncoordinated agentic action without robust error-correction mechanisms.

"The only thing that this can't do is do workbooks to workbooks comparisons, right? So, um, that's it. It's like, it's interesting what it will do."

-- Karl Yeh
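One low-tech safeguard against irreversible changes to a shared spreadsheet is to snapshot the workbook before any agent touches it. The sketch below is a minimal illustration in Python, not something demonstrated in the episode: the function names `snapshot` and `restore` and the file paths are hypothetical, and it relies only on the standard library.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def snapshot(workbook: Path, backup_dir: Path) -> Path:
    """Copy the workbook into backup_dir under a UTC-timestamped name.

    Returns the path of the copy so it can be restored later.
    """
    if not workbook.is_file():
        raise FileNotFoundError(workbook)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Microsecond precision keeps names distinct for back-to-back snapshots.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{workbook.stem}.{stamp}{workbook.suffix}"
    shutil.copy2(workbook, target)  # copy2 also preserves file metadata
    return target


def restore(snapshot_path: Path, workbook: Path) -> None:
    """Overwrite the workbook with a previously taken snapshot."""
    shutil.copy2(snapshot_path, workbook)


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the source-of-truth file lives.
    saved = snapshot(Path("dcf_model.xlsx"), Path("backups"))
    print(f"Snapshot written to {saved}")
```

Taking a snapshot before each agent run gives every tool a manual "undo," whether or not the integration provides one. It does not prevent the conflicts that arise when several agents edit the same file at once; it only makes them recoverable.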

The Codex desktop app is also lauded, with Yeh suggesting that GPT-5.4 in Codex can perform "90, 95%" of what Claude Code can do, with less rate limiting than in "Co-work." This offers a more accessible and potentially less frustrating coding assistance experience. However, the overall picture remains one of trade-offs. No single model excels at everything. The choice depends on the specific task, the user's existing ecosystem (e.g., Google Workspace vs. Microsoft Office), and tolerance for potential inaccuracies versus the need for specific capabilities like advanced visual reasoning or robust error handling.

Key Action Items

Prioritize Task-Specific Evaluation: Before adopting any new AI model, conduct rigorous testing on your actual, day-to-day use cases. Do not rely solely on benchmark scores. (Immediate to Ongoing)
Develop a "Colleague Layer" Strategy: For agentic systems, actively seek or develop AI tools that provide clear feedback on their limitations and errors, rather than fabricating information. This requires a shift in how we prompt and interact with AI. (Ongoing Investment, Payoff in 6-12 months)
Leverage Excel/Sheets Integrations Strategically: Explore GPT-5.4's capabilities within Excel. Understand its strengths and weaknesses compared to Claude, and critically, be aware of the lack of an "undo" feature in some models, necessitating careful use with critical data. (Immediate Action)
Experiment with Codex for Coding: Download and test the Codex desktop app with GPT-5.4 for coding tasks. Compare its performance and rate limiting against other tools like Co-work and Claude Code. (Immediate Action)
Investigate Gemini for Visual Tasks: If your work involves video analysis or complex visual reasoning, explore Gemini 3.1 Pro Preview and Flash, recognizing their strengths but also their documented hallucination issues in other contexts. (Immediate Action)
Implement Robust Data Versioning: When using AI tools that modify spreadsheets or documents directly, implement strict version control and backup procedures to mitigate the risk of irreversible data loss or unintended changes, especially with models lacking an "undo" function. (Immediate Action, Long-term Risk Mitigation)
Stay Informed on Model Nuances: Continuously monitor performance differences across models for specific tasks (e.g., coding, writing, data analysis, visual reasoning). The landscape is rapidly evolving, and what is best today may not be tomorrow. (Ongoing Monitoring, Competitive Advantage in 12-18 months)
