GPT-5.4, the latest iteration of OpenAI's flagship model, is a real but nuanced advance, particularly for white-collar knowledge work. Benchmarks show incremental gains over previous versions and competitors, but its main value lies in how much better it completes real-world tasks. The conversation surfaces two often-overlooked problems: benchmark performance is drifting away from practical utility, and models can still hallucinate or misread user intent, especially in agentic systems. Anyone choosing AI tools for professional work gains an advantage by looking past raw scores to genuine task completion and reliability.
The GDPval Benchmark: A Double-Edged Sword
The release of GPT-5.4 has been framed by its performance on OpenAI's own benchmark, GDPval, which measures AI capabilities against human experts across a wide array of industries. The results are striking: GPT-5.2 reportedly outperformed human professionals 71% of the time, and GPT-5.4 has pushed this figure to 83%. The benchmark is designed to assess real-world task execution in fields ranging from finance and healthcare to scientific services and information technology. The implication is that, for tasks within these domains, GPT-5.4 shows a marked improvement in practical application.
However, the very existence of a self-created benchmark raises questions about potential bias. As Andy Halliday notes, "Imagine, since they built the benchmark, they're building their models to beat that benchmark and can keep the top on that benchmark. They do surprisingly well on the test that they built. That's right, what a surprise." While the test involves expert review and aims for real-world relevance, the inherent advantage of knowing the evaluation criteria cannot be ignored. This highlights a broader systemic issue: the increasing reliance on benchmarks that, while informative, may not capture the full spectrum of a model's utility or its potential failure modes.
"The way it works is there's a large number. They worked with experienced professionals of a wide range of industries."
-- Andy Halliday
This focus on benchmarks drives progress, but it can obscure real differences in performance from task to task. As the conversation illustrates, GPT-5.4 excels in some areas and lags in others when compared to models like Gemini 3.1 Pro Preview and Anthropic's Claude. On coding tasks (Terminal Bench Hard), GPT-5.4 shows an improvement, but Gemini 3.1 Pro Preview remains strong. In agentic tool use, GPT-5.4 falls in the middle of the pack. On knowledge accuracy (AA Omniscience), Gemini 3.1 Pro Preview leads, with GPT-5.4 trailing significantly. This fragmentation underscores a basic principle for users: the "job to be done" should dictate model selection, not the latest headline benchmark score.
Gemini's Visual Prowess and the Peril of Hallucination
Gemini 3.1 Pro Preview emerges as a strong contender, particularly in visual reasoning, where it leads the pack at 82%. It is also described as distinctive in analyzing video directly rather than only processing static images, which is a clear advantage for tasks built on visual data.
However, Gemini's strengths are juxtaposed with significant reliability concerns, particularly around hallucinations. Beth Lyons shares a personal workflow involving transcribing screen recordings from X (formerly Twitter) for easier review. Her experience highlights a critical failure mode: Gemini not only failed to transcribe her voiceover but also hallucinated dialogue, even when the video was silent.
"It gave me again a completely hallucinated direct quote, and I said, 'Okay, so it sounds like this video is completely silent, is that true?' And right then we have a, 'Oh, you got me. It's true. I was just doing stuff.'"
-- Beth Lyons
This anecdote is more than an inconvenience; it points to a deeper problem with agentic systems. Lyons argues for a more dependable "colleague layer": an AI that functions as a reliable partner and admits when it doesn't know something or has made an error, rather than fabricating information. Models that confidently invent responses pose a significant risk, especially as AI is integrated into workflows with less human oversight. Hallucination is not unique to Gemini, however: the Omniscience Index on Artificial Analysis shows Gemini 3.1 Pro Preview significantly outperforming GPT-5.4 in knowledge reliability and hallucination resistance. So while GPT-5.4 may be advancing in specific task execution, it still struggles with the foundational accuracy and honesty that users, particularly those building agentic systems, need.
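One small piece of a "colleague layer" can be approximated outside the model: before trusting a direct quote, check that it actually appears in a source you control. The sketch below is a minimal illustration in Python, not anything described in the conversation; the function names are hypothetical, and it assumes you already have a trusted transcript (an empty string for a silent video).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and reduce punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def unsupported_quotes(quotes: list[str], source_text: str) -> list[str]:
    """Return the quotes that cannot be found in the trusted source text.

    Hypothetical helper: matching is on whole normalized words, so it
    tolerates punctuation and casing differences but not paraphrase.
    """
    haystack = f" {normalize(source_text)} "
    flagged = []
    for quote in quotes:
        needle = normalize(quote)
        # An empty quote, or one absent from the source, is treated as unsupported.
        if not needle or f" {needle} " not in haystack:
            flagged.append(quote)
    return flagged


if __name__ == "__main__":
    transcript = ""  # a silent video has no spoken words
    model_quotes = ["Here is how I set up the dashboard"]
    print(unsupported_quotes(model_quotes, transcript))
```

A check like this would have flagged every "direct quote" in the silent-video case, since nothing can match an empty transcript. It only catches fabricated verbatim quotes, not subtler errors.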
Real-World Integration: Excel, Codex, and the Trade-offs
The conversation then pivots to practical, hands-on experience with GPT-5.4, particularly its integration into tools like Excel and its use via the Codex desktop app. Karl Yeh highlights that while benchmarks are useful, "you got to test it on your own use cases because those benchmarks are useless unless you leverage it into whatever you need to do." This sentiment is echoed throughout the discussion, emphasizing that real-world performance is the ultimate arbiter of AI utility.
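Testing on your own use cases can start very small. The following Python sketch is one possible shape for such a harness, under stated assumptions: each model is represented by a hypothetical callable that takes a prompt and returns text (in practice you would wrap whichever vendor client you use), and each test case pairs a prompt with your own pass/fail check.

```python
from typing import Callable

Model = Callable[[str], str]   # hypothetical wrapper: prompt in, answer out
Check = Callable[[str], bool]  # returns True if the answer is acceptable


def evaluate(models: dict[str, Model],
             cases: list[tuple[str, Check]]) -> dict[str, float]:
    """Return each model's pass rate over your own (prompt, check) cases."""
    if not cases:
        raise ValueError("at least one test case is required")
    results: dict[str, float] = {}
    for name, model in models.items():
        passed = sum(1 for prompt, check in cases if check(model(prompt)))
        results[name] = passed / len(cases)
    return results


if __name__ == "__main__":
    # Stub models stand in for real API calls so the sketch runs offline.
    def stub_a(prompt: str) -> str:
        return "42" if "6 * 7" in prompt else "unknown"

    def stub_b(prompt: str) -> str:
        return "unknown"

    my_cases = [
        ("What is 6 * 7?", lambda answer: "42" in answer),
        ("Capital of France?", lambda answer: "Paris" in answer),
    ]
    print(evaluate({"model_a": stub_a, "model_b": stub_b}, my_cases))
```

The checks are deliberately plain functions: for knowledge work they are often the hard part, and many tasks will need a human reviewer rather than a string match.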
GPT-5.4's integration into Excel is presented as a significant development. While Claude is noted as still being "significantly better" for charts and graphs, GPT-5.4 is considered "on par" for most other Excel tasks. This creates a dilemma for organizations that have invested in Claude licenses, since GPT-5.4 now offers comparable functionality. A side-by-side demonstration of both models building a five-year DCF model illustrates the practical differences: Claude's output is perceived as more "thoughtful" and better designed, while the ChatGPT integration has an "undo" feature, a critical advantage when an assistant is making complex, otherwise irreversible changes to a source of truth. In a further experiment, multiple agents (ChatGPT, Claude, and Copilot) modified the same spreadsheet simultaneously; the resulting confusion and potential data corruption starkly illustrate the risks of uncoordinated agentic action and the lack of robust error-correction mechanisms.
"The only thing that this can't do is do workbooks to workbooks comparisons, right? So, um, that's it. It's like, it's interesting what it will do."
-- Karl Yeh
The Codex desktop app is also lauded, with Yeh suggesting that GPT-5.4 in Codex can perform "90, 95%" of what Claude Code can do, with less rate limiting than in "Co-work." This offers a more accessible and potentially less frustrating coding assistance experience. However, the overall picture remains one of trade-offs. No single model excels at everything. The choice depends on the specific task, the user's existing ecosystem (e.g., Google Workspace vs. Microsoft Office), and tolerance for potential inaccuracies versus the need for specific capabilities like advanced visual reasoning or robust error handling.
Key Action Items
- Prioritize Task-Specific Evaluation: Before adopting any new AI model, conduct rigorous testing on your actual, day-to-day use cases. Do not rely solely on benchmark scores. (Immediate to Ongoing)
- Develop a "Colleague Layer" Strategy: For agentic systems, actively seek or develop AI tools that provide clear feedback on their limitations and errors, rather than fabricating information. This requires a shift in how we prompt and interact with AI. (Ongoing Investment, Payoff in 6-12 months)
- Leverage Excel/Sheets Integrations Strategically: Explore GPT-5.4's capabilities within Excel. Understand its strengths and weaknesses compared to Claude, and note that some models lack an "undo" feature, so use them carefully on critical data. (Immediate Action)
- Experiment with Codex for Coding: Download and test the Codex desktop app with GPT-5.4 for coding tasks. Compare its performance and rate limiting against other tools like Co-work and Claude Code. (Immediate Action)
- Investigate Gemini for Visual Tasks: If your work involves video analysis or complex visual reasoning, explore Gemini 3.1 Pro Preview and Flash, recognizing their strengths but also their documented hallucination issues in other contexts. (Immediate Action)
- Implement Robust Data Versioning: When using AI tools that modify spreadsheets or documents directly, implement strict version control and backup procedures to mitigate the risk of irreversible data loss or unintended changes, especially with models lacking an "undo" function. (Immediate Action, Long-term Risk Mitigation)
- Stay Informed on Model Nuances: Continuously monitor performance differences across models for specific tasks (e.g., coding, writing, data analysis, visual reasoning). The landscape is rapidly evolving, and what is best today may not be tomorrow. (Ongoing Monitoring, Competitive Advantage in 12-18 months)
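For the data versioning item above, even a timestamped copy taken before an assistant touches a workbook provides a manual "undo". The Python sketch below is a minimal illustration using only the standard library; the function names, file names, and backup directory layout are assumptions for the example, not part of any tool discussed here.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the file contents, used to verify the copy."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(path: Path, backup_dir: Path) -> Path:
    """Copy the file to backup_dir under a UTC-timestamped name and verify it."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{path.stem}.{stamp}{path.suffix}"
    shutil.copy2(path, target)  # copy2 also preserves file metadata where possible
    if file_digest(target) != file_digest(path):
        raise OSError(f"backup {target} does not match {path}")
    return target


def restore(backup: Path, path: Path) -> None:
    """Overwrite the working file with a previously taken snapshot."""
    shutil.copy2(backup, path)


if __name__ == "__main__":
    workbook = Path("dcf_model.xlsx")  # hypothetical file name
    if workbook.exists():
        saved = snapshot(workbook, Path("backups"))
        print(f"Snapshot written to {saved}")
```

This does not coordinate several agents editing at once; it only guarantees a known-good copy to return to. Teams already using a version control system or their storage platform's version history may prefer those.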