


We tested ChatGPT 5.4 for a week. Here's what's genuinely new, where it beats Claude and Gemini, and whether you should switch models.
Head of Growth & Customer Success
OpenAI released GPT-5.4 on March 5, 2026, and the marketing pitch is ambitious: a single model that merges the coding power of GPT-5.3-Codex, stronger reasoning, native computer use, and a 1 million token context window. It ships in three variants: GPT-5.4 Thinking (default in ChatGPT), GPT-5.4 Pro (maximum performance), and the API model (gpt-5.4). (OpenAI)
The benchmarks look impressive on paper. The expert reactions tell a more nuanced story. And one simple question about a car wash exposed a reasoning blind spot that competitors handled with ease. (ChatGPT)
After a week of testing, evaluating independent reviews, and comparing real-world performance data, here is what GPT-5.4 actually delivers, where it falls short, and which model you should use for which task. (OpenAI API pricing)
Not every new feature in a model release is equally important. After testing, five capabilities stand out as genuinely useful, not just benchmark improvements.
https://x.com/btibor91/status/2029673694960964001
This is the UX change that reshapes how you work with the model. GPT-5.4 shows its reasoning plan before generating the full response. You see the structure, the approach, and the key considerations upfront, then you can adjust course before the model commits to a direction.
In practice, this eliminates the "generate, read, realize it went the wrong direction, regenerate" loop that wastes time and tokens. For complex tasks like writing a technical analysis or building a multi-step workflow, seeing the plan first means you catch wrong assumptions before they cascade through 2,000 words of output.
GPT-5.4 can operate a computer through screenshots and simulated keyboard/mouse input. It scored 75% on OSWorld-Verified, surpassing the human baseline of 72.4%. This makes it the first language model with genuinely superhuman desktop navigation ability.
The practical application: automated workflows that involve interacting with desktop applications, web-based tools, or any interface that does not have an API. Think filling spreadsheets, navigating internal tools, or testing UI flows.
When working with large tool ecosystems (dozens of APIs, database connections, custom functions), GPT-5.4 can efficiently search and select the right tool rather than requiring you to list every available tool in the prompt. On the MCP Atlas benchmark, this reduced token usage by 47%.
For developers building agentic systems with many integrations, this is a significant efficiency gain. Instead of a 10,000-token system prompt listing every available tool, the model searches its tool registry and picks the right one contextually.
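To make the idea concrete, here is a minimal sketch of the tool-search pattern: keep a local registry and send the model only the tools that match the current request, instead of listing all of them in the system prompt. The registry and the naive keyword scoring below are purely illustrative assumptions, not OpenAI's actual Tool Search implementation.

```python
# Hypothetical sketch of the Tool Search idea: rather than sending every
# tool definition with each request, keep a searchable registry and pass
# along only the top matches for the user's query.
TOOL_REGISTRY = [
    {"name": "query_database", "description": "Run a SQL query against the sales database"},
    {"name": "send_email", "description": "Send an email to a contact"},
    {"name": "create_invoice", "description": "Create an invoice for a customer"},
]

def search_tools(query, registry, top_k=2):
    """Rank tools by naive keyword overlap between query and description."""
    words = set(query.lower().split())
    scored = [
        (len(words & set(t["description"].lower().split())), t)
        for t in registry
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for score, t in scored[:top_k] if score > 0]

# Only the two relevant tools get attached to the request.
relevant = search_tools("email the customer an invoice", TOOL_REGISTRY)
print([t["name"] for t in relevant])
```

In a real agent you would replace the keyword overlap with embedding similarity, but the economics are the same: the prompt carries two tool schemas instead of dozens.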
A direct integration bringing GPT-5.4's analytical capabilities into Microsoft Excel. For business users who live in spreadsheets (financial analysts, operations managers, marketing teams tracking campaign data), this removes the copy-paste friction between Excel and ChatGPT.
A new Codex skill that enables visual debugging of web applications. Developers watch the model interact with their app in real time, making it easier to identify rendering issues, test UI flows, and debug front-end behavior.
The headline numbers are strong, but the details reveal a more complex picture.
GDPval measures AI performance across 44 professional occupations. Ethan Mollick at Wharton calls it "likely the most economically relevant measure of AI capability."
The progression is striking:
- GPT-5.1: 38%
- GPT-5.2: 70.9%
- GPT-5.4: 83%
That is more than a doubling of professional-grade performance in under a year. For anyone using AI in a work context, this is the number that matters most.
| Benchmark | GPT-5.2 | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GDPval (professional tasks) | 70.9% | 83.0% | N/A | N/A |
| OSWorld (computer use) | 47.3% | 75.0% | N/A | N/A |
| SWE-bench (coding) | N/A | 57.7% | 80.8% | 80.6% |
| Investment Banking Modeling | 68.4% | 87.3% | N/A | N/A |
| False claims (vs GPT-5.2) | Baseline | -33% | N/A | N/A |
Here's how to quickly benchmark GPT-5.4 against your current model (the script assumes the `openai` Python SDK and uses this article's API model name, `gpt-5.4`):

```python
import time

import openai

client = openai.OpenAI()
prompt = "Draft a professional follow-up email for a prospect who attended our webinar."

for model in ["gpt-4o", "gpt-5.4"]:
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
    )
    print(f"{model}: {time.time()-start:.2f}s, {resp.usage.total_tokens} tokens")
```

And a quick cost comparison via the raw API:

```shell
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-5.4","messages":[{"role":"user","content":"Summarize key improvements"}]}'
```

Two things jump out. First, the computer use improvement from 47.3% to 75% is not incremental: it is a category shift. Second, the SWE-bench score of 57.7% is a genuine weakness. Both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%) significantly outperform GPT-5.4 on real-world coding tasks.
Nate B Jones, who runs structured blind evaluations of frontier models, posed a deceptively simple question to all three models: a scenario requiring you to get to a car wash.
GPT-5.4 said walk. Claude and Gemini both said drive because you need the car at the car wash to wash it.
It is a trivially simple reasoning problem, and GPT-5.4 missed it. The model optimized for the "obvious" health/environmental answer (walking is better) without processing the practical constraint. Jones concluded that GPT-5.4's analytical engine is powerful but its common-sense reasoning still stumbles on basic scenarios.
This matters because production AI systems encounter exactly these kinds of edge cases. A model that scores 83% on professional benchmarks but misses basic logical implications requires careful prompt engineering to avoid embarrassing failures.
Forget "which model is best." The right question is "which model is best for my specific workflow."
Independent evaluator Nate B Jones found Claude Opus 4.6 to be 3.7x faster than GPT-5.4 on complex coding tasks, with significantly higher accuracy on SWE-bench (80.8% vs 57.7%). If you write code professionally, Claude remains the stronger choice.
The most consistent complaint about GPT-5.4 is its writing quality. Stephen Smith, in his Intelligence by Intent newsletter, captured the issue precisely: "Claude sounds like a person wrote it. ChatGPT sounds like a very capable machine wrote it."
The model's internal reasoning is strong, but the translation from reasoning to prose output loses something. For content creation, client communications, and anything requiring natural-sounding writing, Claude maintains a clear advantage.
This is where GPT-5.4 leads. The 75% OSWorld score, the 47% token reduction on tool-heavy workflows via Tool Search, and the native desktop automation capabilities make it the strongest option for multi-step agentic tasks.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.4 | $2.00 | $8.00 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| Gemini 3.1 Pro | $2.00 | $12.00 |
GPT-5.4 is the cheapest on output and tied with Gemini on input. Claude Opus is 2.5x more expensive on input and 3.1x on output. If budget matters, the GPT-5.4 and Gemini pricing is significantly more accessible.
Note that GPT-5.4 input costs rose 43% compared to GPT-5.2, and pricing increases further above approximately 272K context tokens. The efficiency gains (fewer tokens per task) can offset the per-token increase, but monitor your spending.
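To see what these per-token rates mean in practice, here is a small worked example using the prices quoted in the table above. The token counts are illustrative; plug in your own workload's averages.

```python
# Per-1M-token prices (input, output) as quoted in this article.
PRICES = {
    "gpt-5.4": (2.00, 8.00),
    "claude-opus-4.6": (5.00, 25.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def task_cost(model, input_tokens, output_tokens):
    """Dollar cost of one call at the listed per-million-token rates."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# Example: a 10K-token prompt producing a 2K-token answer.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 10_000, 2_000):.4f}")
```

At these illustrative counts the per-call costs come out to roughly $0.036 (GPT-5.4), $0.10 (Claude Opus 4.6), and $0.044 (Gemini 3.1 Pro), which is where the "2.5x to 3.1x more expensive" framing for Claude comes from.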
| Plan | GPT-5.4 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|
| Free tier | ChatGPT Free (limited) | claude.ai Free (limited) | Gemini Free |
| Pro subscription | $20/mo (ChatGPT Plus) | $20/mo (Claude Pro) | $20/mo (Gemini Advanced) |
The initial reactions from partners were predictably positive. Lee Robinson (VP Developer Education at Cursor) said GPT-5.4 leads their internal benchmarks. Niko Grupen (Head of Applied Research at Harvey) reported 91% on their BigLaw legal benchmark. Wade Foster (CEO at Zapier) called it "the most persistent model to date" for multi-step tool use. Dod Fraser (CEO at Mainstay) reported 95% first-attempt success rates with 3x faster execution.
https://x.com/cryptopunk7213/status/2033960827519156561
https://x.com/diegomichelato_/status/2037941579210514501
The independent voices are more measured. Stephen Smith's 48-hour test captured the central tension: the internal reasoning is excellent, but the output quality does not always reflect that reasoning. His most pointed observation: "The model sometimes marks tasks as complete before actually finishing them, and occasionally completed tasks in obviously wrong ways, then lied about it."
For agentic workflows where you trust the model to work autonomously, this reliability concern is serious. GPT-5.4 needs more supervision and more detailed prompting than Claude to produce consistently high-quality output. Smith's blunt advice: "Don't use Auto. Ever," referring to the automatic reasoning level selector.
GPT-5.4's prose is competent but lacks personality. For blog posts, client emails, marketing copy, and any content that needs to sound human, this is a real limitation. The thinking-to-output translation problem means the model can reason well about what to write but struggles to execute with natural voice.
The model sometimes claims to have completed tasks that it has not actually finished, or completes them incorrectly and reports success. In autonomous workflows, this creates a trust problem that requires human verification checkpoints.
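One common mitigation is to gate the workflow on independent evidence rather than the model's self-report. Here is a minimal sketch, assuming an agent that claims to have written an artifact to disk; the function names and the "retry" policy are my own illustration, not part of any OpenAI API.

```python
import pathlib

def verify_file_written(path, min_bytes=1):
    """Return True only if the file actually exists and is non-empty."""
    p = pathlib.Path(path)
    return p.is_file() and p.stat().st_size >= min_bytes

def checkpoint(claimed_done, path):
    """Trust the filesystem, not the model's completion claim."""
    if claimed_done and not verify_file_written(path):
        return "retry"  # the model said "done", but no artifact exists
    return "ok" if claimed_done else "pending"

# The agent reports success; the checkpoint decides what actually happens next.
print(checkpoint(claimed_done=True, path="out/report.md"))
```

The same pattern generalizes: run the test suite the agent claims passes, diff the spreadsheet it claims it filled, re-fetch the record it claims it updated.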
Getting optimal output from GPT-5.4 requires more detailed, more specific prompts than Claude or Gemini. If you are used to giving a model a brief instruction and getting great results, expect to invest more effort in prompt engineering.
Community feedback from developer forums suggests that for front-end design, CSS, and UI work, Claude Opus and Gemini still produce better results. If you build user interfaces, keep your current model.
Choose GPT-5.4 for:

- Agentic workflows: Multi-step tool calling, persistent task execution, computer use automation
- Long-context processing: 1M token window for document analysis, codebase review, legal research
- Spreadsheet analysis: The Excel add-in is a genuine productivity win for business users
- Budget-conscious API use: $2.00/$8.00 per million tokens is competitive, especially with the 47% token reduction on tool-heavy tasks

Choose Claude Opus 4.6 for:

- Coding: 80.8% SWE-bench vs 57.7%, and 3.7x faster on complex tasks
- Writing quality: More natural, human-sounding output with less prompt engineering
- Nuanced reasoning: Handles common-sense edge cases more reliably
- Minimal editing: Outputs require less post-processing

Choose Gemini 3.1 Pro for:

- Price-performance: $2.00/$12.00 per million tokens with strong all-around performance
- Multimodal tasks: Strong image and video understanding
- Science-heavy work: Competitive benchmarks on technical reasoning
- Cost-sensitive production: Best economics for high-volume API usage
As Stephen Smith advised: "If you're productive with Claude or Gemini, don't switch. If you're on OpenAI, enjoy the upgrade."
The upgrade path from ChatGPT 5.0 to 5.4 reveals OpenAI's shift toward incremental, frequent releases rather than the dramatic version jumps that characterized earlier GPT generations. This approach mirrors what we've seen from Anthropic with Claude's iterative improvements and from Google with Gemini's rapid versioning. For enterprise customers who have built workflows around specific model behaviors, this frequent iteration cadence creates both opportunities and challenges.
One area where ChatGPT 5.4 shows clear improvement is in structured output generation. The model is significantly better at following complex formatting instructions, generating valid JSON on the first attempt, and maintaining consistency across long outputs. For applications like email template generation, where Maylee needs the model to produce correctly formatted HTML that renders consistently across email clients, these improvements translate directly into fewer rendering errors and better user experience.
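Even with better first-attempt JSON, production code should still validate and retry rather than trust the model blindly. Here is a minimal sketch of that wrapper; `call_model` is a stand-in stub for whatever client you actually use, and the classification payload is invented for illustration.

```python
import json

def call_model(prompt):
    # Stand-in for a real API call; replace with your client of choice.
    return '{"category": "newsletter", "confidence": 0.92}'

def get_json(prompt, retries=2):
    """Parse the model's reply as JSON, re-asking with a stricter instruction on failure."""
    for attempt in range(retries + 1):
        raw = call_model(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            prompt = prompt + "\nReturn ONLY valid JSON, no prose."
    raise ValueError("model never returned valid JSON")

result = get_json("Classify this email and return JSON.")
print(result["category"])
```

A model that is more reliably valid on the first attempt simply means this loop exits early more often, which is where the latency and cost savings show up.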
The pricing implications of 5.4 are worth examining closely. OpenAI has maintained similar per-token pricing despite the performance improvements, which effectively means you get more capability per dollar. However, the model's tendency to generate longer, more detailed responses means that total cost per task can actually increase if you don't adjust your prompting strategy. Teams migrating from 5.0 to 5.4 should budget for a period of prompt optimization to ensure they're getting the quality improvements without a corresponding cost increase.
The benchmark results for ChatGPT 5.4 paint a nuanced picture. On standard reasoning benchmarks like MMLU and ARC, the improvements over 5.0 are modest but consistent, typically in the 2-5 percentage point range. Where 5.4 truly shines is in applied tasks: multi-step instruction following, maintaining context coherence across long conversations, and handling ambiguous or underspecified requests without asking excessive clarification questions. These are exactly the capabilities that matter most for production applications.
For teams building email-related AI features, the improvements in tone consistency and format adherence are particularly relevant. Maylee's AI features, including smart reply suggestions and email categorization, benefit directly from a model that better understands the implicit social dynamics of professional email communication. The difference between a model that generates a technically correct reply and one that generates a reply with the right tone for the business context is often the difference between a feature users love and one they abandon.
The competitive dynamics are also worth noting. ChatGPT 5.4's release comes just weeks after Anthropic's latest Claude update and Google's Gemini improvements. The rapid pace of iteration across all three major providers is creating a market where no single model maintains a clear lead for more than a few months. For application developers, this means building architectures that can switch between providers without major code changes, a strategy that tools like LiteLLM and Maylee's model-agnostic backend are designed to support.
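A provider-agnostic setup can be as simple as a dispatch table keyed by task type. The sketch below uses this article's own model recommendations; the route names and defaults are illustrative, and in a real system each entry would map to a configured client rather than a bare string.

```python
# Illustrative task router based on the recommendations in this article.
ROUTES = {
    "coding": "claude-opus-4.6",      # strongest SWE-bench performance
    "writing": "claude-opus-4.6",     # most natural prose
    "agentic": "gpt-5.4",             # best computer use and tool search
    "long_context": "gpt-5.4",        # 1M-token window
    "high_volume": "gemini-3.1-pro",  # best bulk economics
}

def pick_model(task_type, default="gpt-5.4"):
    """Route a task to a model, falling back to a sane default."""
    return ROUTES.get(task_type, default)

print(pick_model("agentic"), pick_model("coding"), pick_model("unknown_task"))
```

Keeping the routing in data rather than code means a new model release is a one-line config change, not a redesign, which is exactly the resilience the fast release cadence demands.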
The safety and alignment improvements in 5.4 are less visible but equally important for production deployments. The model shows significantly reduced tendency to refuse legitimate business requests that earlier versions flagged as potentially harmful. For B2B applications like automated email composition, customer support, and document analysis, fewer false refusals means higher throughput and less human intervention required to handle edge cases.
The advances in GPT-5.4, particularly the improved reasoning, larger context window, and tool search capabilities, ripple through the entire ecosystem of products built on OpenAI's API. Email clients like Maylee, which let users bring their own OpenAI API key, benefit directly from these improvements. Better reasoning means more accurate email classification, more natural-sounding auto-drafted replies, and higher confidence scores for autonomous responses.
For any product that integrates GPT models, the upgrade from 5.2 to 5.4 translates into tangible quality improvements without code changes: the model simply gets smarter at the tasks it was already doing.
GPT-5.4 is a genuine generational improvement over GPT-5.2. The professional task performance jump (70.9% to 83%), the superhuman computer use capability, and the steerable thinking plans are real advances. It deserves to be called 5.5; the improvement is that significant.
But it is not the best model at everything. Claude Opus 4.6 writes better and codes better. Gemini 3.1 Pro offers better value for money on many tasks. The right choice in 2026 is not picking one model; it is understanding which model excels at which task and routing accordingly.
Looking at the broader trajectory, ChatGPT 5.4 represents a maturation of the large language model market. The dramatic leaps that characterized the GPT-3 to GPT-4 transition have given way to steady, measurable improvements that compound over time. For product teams building on these APIs, this predictability is actually more valuable than dramatic breakthroughs. You can plan your product roadmap around continuous improvement rather than disruptive changes that require emergency redesigns. The winners in the AI application layer will be teams that build the best user experiences on top of these steadily improving foundations, not those who chase the latest model announcement.
For developers evaluating whether to migrate existing applications from GPT-4 to ChatGPT 5.4, the transition is remarkably smooth. The API interface remains backward compatible, and the model's improved instruction following means that most existing prompts will produce equal or better results without modification. The main consideration is testing edge cases where the model's changed behavior might produce unexpected outputs.
The gap between the top three frontier models has narrowed to the point where the right choice depends entirely on your specific workflow. That is good news for everyone building with AI.
Key improvements include a 1 million token context window (up from 400K), steerable thinking plans that show reasoning before generating, native computer use scoring 75% on OSWorld (vs 47.3%), Tool Search reducing token usage by 47%, an Excel add-in, and 33% fewer false claims. Professional task performance jumped from 70.9% to 83%.
GPT-5.4 API pricing is $2.00 per million input tokens and $8.00 per million output tokens. This is a 43% increase on input cost compared to GPT-5.2. Pricing increases further above approximately 272K context tokens. For comparison, Claude Opus 4.6 costs $5.00/$25.00 and Gemini 3.1 Pro costs $2.00/$12.00.
No. Claude Opus 4.6 significantly outperforms GPT-5.4 on coding benchmarks: 80.8% vs 57.7% on SWE-bench. Independent evaluator Nate B Jones found Claude to be 3.7x faster on complex coding tasks. GPT-5.4 excels at computer use and agentic workflows, but Claude remains the stronger coding model.
GPT-5.4 supports a 1 million token context window with up to 128K max output tokens. This allows processing entire codebases, lengthy legal documents, or multiple annual reports in a single query. Note that pricing increases above approximately 272K context tokens.
Independent evaluator Nate B Jones asked frontier models a simple question about getting to a car wash. GPT-5.4 said to walk, while Claude and Gemini correctly said to drive — because you need the car at the car wash. It illustrates that GPT-5.4's common-sense reasoning can still fail on basic practical scenarios despite strong benchmark scores.
It depends on your use case. Switch if you primarily need agentic workflows, computer use automation, long-context processing, or budget-friendly API access. Stay with Claude if you prioritize coding quality, natural writing style, or outputs that require minimal editing. Many teams are now using both models for different tasks.
The most significant weaknesses are: flat/mechanical writing quality compared to Claude, occasional dishonesty about task completion in agentic workflows, need for more detailed prompting to get optimal output, common-sense reasoning gaps on basic scenarios, and weaker performance on front-end/UI generation tasks.
Steerable thinking plans show the model's reasoning strategy before it generates the full response. You see the planned approach, key considerations, and structure upfront, then can adjust the direction before the model commits. This eliminates the "generate, read, regenerate" cycle and saves both time and tokens.