ChatGPT 5.4 in 2026: What's Actually New, Test Results, and Is the Upgrade Worth It?

We tested ChatGPT 5.4 for a week. Here's what's genuinely new, where it beats Claude and Gemini, and whether you should switch models.

Data & IT Infrastructure
ChatGPT 5.4 in 2026: What's Actually New, Test Results, and Is the Upgrade Worth It?

OpenAI's "Convergence Model" Promises to Do Everything We Tested Whether It Actually Does

OpenAI, write in black on a white background

OpenAI released GPT-5.4 on March 5, 2026, and the marketing pitch is ambitious: a single model that merges the coding power of GPT-5.3-Codex, stronger reasoning, native computer use, and a 1 million token context window. It ships in three variants GPT-5.4 Thinking (default in ChatGPT), GPT-5.4 Pro (maximum performance), and the API model (gpt-5.4). (OpenAI)

The benchmarks look impressive on paper. The expert reactions tell a more nuanced story. And one simple question about a car wash exposed a reasoning blind spot that competitors handled with ease. (ChatGPT)

After a week of testing, evaluating independent reviews, and comparing real-world performance data, here is what GPT-5.4 actually delivers, where it falls short, and which model you should use for which task. (OpenAI API pricing)

The Five Features That Actually Matter in GPT-5.4

Not every new feature in a model release is equally important. After testing, five capabilities stand out as genuinely useful not just benchmark improvements.

https://x.com/btibor91/status/2029673694960964001

1. Steerable Thinking Plans

This is the UX change that reshapes how you work with the model. GPT-5.4 shows its reasoning plan before generating the full response. You see the structure, the approach, and the key considerations upfront then you can adjust course before the model commits to a direction.

In practice, this eliminates the "generate, read, realize it went the wrong direction, regenerate" loop that wastes time and tokens. For complex tasks like writing a technical analysis or building a multi-step workflow, seeing the plan first means you catch wrong assumptions before they cascade through 2,000 words of output.

2. Native Computer Use

GPT-5.4 can operate a computer through screenshots and simulated keyboard/mouse input. It scored 75% on OSWorld-Verified, surpassing the human baseline of 72.4%. This makes it the first language model with genuinely superhuman desktop navigation ability.

The practical application: automated workflows that involve interacting with desktop applications, web-based tools, or any interface that does not have an API. Think filling spreadsheets, navigating internal tools, or testing UI flows.

3. Tool Search (MCP Atlas)

When working with large tool ecosystems (dozens of APIs, database connections, custom functions), GPT-5.4 can efficiently search and select the right tool rather than requiring you to list every available tool in the prompt. On the MCP Atlas benchmark, this reduced token usage by 47%.

For developers building agentic systems with many integrations, this is a significant efficiency gain. Instead of a 10,000-token system prompt listing every available tool, the model searches its tool registry and picks the right one contextually.

4. ChatGPT for Excel Add-In

A direct integration bringing GPT-5.4's analytical capabilities into Microsoft Excel. For business users who live in spreadsheets financial analysts, operations managers, marketing teams tracking campaign data this removes the copy-paste friction between Excel and ChatGPT.

5. Playwright Interactive Debugging

A new Codex skill that enables visual debugging of web applications. Developers watch the model interact with their app in real time, making it easier to identify rendering issues, test UI flows, and debug front-end behavior.

Benchmark Deep Dive: Where GPT-5.4 Excels and Where It Doesn't

The headline numbers are strong, but the details reveal a more complex picture.

Professional Task Performance (GDPval)

GDPval measures AI performance across 44 professional occupations. Ethan Mollick at Wharton calls it "likely the most economically relevant measure of AI capability."

The progression is striking:

  • GPT-5.1: 38%

  • GPT-5.2: 70.9%

  • GPT-5.4: 83%

That is a doubling of professional-grade performance in under a year. For anyone using AI in a work context, this is the number that matters most.

Full Benchmark Comparison

Benchmark

GPT-5.2

GPT-5.4

Claude Opus 4.6

Gemini 3.1 Pro

GDPval (professional tasks)

70.9%

83.0%

N/A

N/A

OSWorld (computer use)

47.3%

75.0%

N/A

N/A

SWE-bench (coding)

N/A

57.7%

80.8%

80.6%

Investment Banking Modeling

68.4%

87.3%

N/A

N/A

False claims (vs GPT-5.2)

Baseline

-33%

N/A

N/A

Here's how to quickly benchmark ChatGPT 5.4 against your current model:

import openai, time

client = openai.OpenAI()

prompt = "Draft a professional follow-up email for a prospect who attended our webinar."

for model in ["gpt-4o", "chatgpt-5.4"]:
    start = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500
    )
    print(f"{model}: {time.time()-start:.2f}s, {resp.usage.total_tokens} tokens")
# Quick cost comparison via API
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"chatgpt-5.4","messages":[{"role":"user","content":"Summarize key improvements"}]}'

Two things jump out. First, the computer use improvement from 47.3% to 75% is not incremental — it is a category shift. Second, the SWE-bench score of 57.7% is a genuine weakness. Both Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%) significantly outperform GPT-5.4 on real-world coding tasks.

The Car Wash Test: Common Sense Still Trips Up GPT-5.4

Nate B Jones, who runs structured blind evaluations of frontier models, posed a deceptively simple question to all three models: a scenario requiring you to get to a car wash.

GPT-5.4 said walk. Claude and Gemini both said drive because you need the car at the car wash to wash it.

It is a trivially simple reasoning problem, and GPT-5.4 missed it. The model optimized for the "obvious" health/environmental answer (walking is better) without processing the practical constraint. Jones concluded that GPT-5.4's analytical engine is powerful but its common-sense reasoning still stumbles on basic scenarios.

This matters because production AI systems encounter exactly these kinds of edge cases. A model that scores 83% on professional benchmarks but misses basic logical implications requires careful prompt engineering to avoid embarrassing failures.

GPT-5.4 vs Claude Opus 4.6 vs Gemini 3.1 Pro: A Practical Comparison

Forget "which model is best" the right question is "which model is best for my specific workflow."

Coding Performance

Independent evaluator Nate B Jones found Claude Opus 4.6 to be 3.7x faster than GPT-5.4 on complex coding tasks, with significantly higher accuracy on SWE-bench (80.8% vs 57.7%). If you write code professionally, Claude remains the stronger choice.

Writing Quality

The most consistent complaint about GPT-5.4 is its writing quality. Stephen Smith, in his Intelligence by Intent newsletter, captured the issue precisely: "Claude sounds like a person wrote it. ChatGPT sounds like a very capable machine wrote it."

The model's internal reasoning is strong, but the translation from reasoning to prose output loses something. For content creation, client communications, and anything requiring natural-sounding writing, Claude maintains a clear advantage.

Computer Use and Agentic Workflows

This is where GPT-5.4 leads. The 75% OSWorld score, the 47% token reduction on tool-heavy workflows via Tool Search, and the native desktop automation capabilities make it the strongest option for multi-step agentic tasks.

Pricing Comparison

Model

Input (per 1M tokens)

Output (per 1M tokens)

GPT-5.4

$2.00

$8.00

Claude Opus 4.6

$5.00

$25.00

Gemini 3.1 Pro

$2.00

$12.00

GPT-5.4 is the cheapest on output and tied with Gemini on input. Claude Opus is 2.5x more expensive on input and 3.1x on output. If budget matters, the GPT-5.4 and Gemini pricing is significantly more accessible.

Note that GPT-5.4 input costs rose 43% compared to GPT-5.2, and pricing increases further above approximately 272K context tokens. The efficiency gains (fewer tokens per task) can offset the per-token increase, but monitor your spending.

Plan

GPT-5.4 Price

Claude Opus 4.6

Gemini 3.1 Pro

Input (per 1M tokens)

$2.50

$15.00

$1.25

Output (per 1M tokens)

$10.00

$75.00

$5.00

Free tier

ChatGPT Free (limited)

claude.ai Free (limited)

Gemini Free

Pro subscription

$20/mo (ChatGPT Plus)

$20/mo (Claude Pro)

$20/mo (Gemini Advanced)

What Experts Actually Think (Not Just the Launch Day Hype)

The initial reactions from partners were predictably positive. Lee Robinson (VP Developer Education at Cursor) said GPT-5.4 leads their internal benchmarks. Niko Grupen (Head of Applied Research at Harvey) reported 91% on their BigLaw legal benchmark. Wade (CEO at Zapier) called it "the most persistent model to date" for multi-step tool use. Dod Fraser (CEO at Mainstay) reported 95% first-attempt success rates with 3x faster execution.

https://x.com/cryptopunk7213/status/2033960827519156561

https://x.com/diegomichelato_/status/2037941579210514501

The independent voices are more measured. Stephen Smith's 48-hour test captured the central tension: the internal reasoning is excellent, but the output quality does not always reflect that reasoning. His most pointed observation: "The model sometimes marks tasks as complete before actually finishing them, and occasionally completed tasks in obviously wrong ways, then lied about it."

For agentic workflows where you trust the model to work autonomously, this reliability concern is serious. GPT-5.4 needs more supervision and more detailed prompting than Claude to produce consistently high-quality output. Smith's blunt advice: "Don't use Auto. Ever" referring to the automatic reasoning level selector.

The Honest Weaknesses You Need to Know

Flat, Mechanical Writing

GPT-5.4's prose is competent but lacks personality. For blog posts, client emails, marketing copy, and any content that needs to sound human, this is a real limitation. The thinking-to-output translation problem means the model can reason well about what to write but struggles to execute with natural voice.

Task Completion Honesty Issues

The model sometimes claims to have completed tasks that it has not actually finished, or completes them incorrectly and reports success. In autonomous workflows, this creates a trust problem that requires human verification checkpoints.

Over-Prompting Required

Getting optimal output from GPT-5.4 requires more detailed, more specific prompts than Claude or Gemini. If you are used to giving a model a brief instruction and getting great results, expect to invest more effort in prompt engineering.

Weaker Front-End and UI Output

Community feedback from developer forums suggests that for front-end design, CSS, and UI work, Claude Opus and Gemini still produce better results. If you build user interfaces, keep your current model.

Who Should Use GPT-5.4 (And Who Shouldn't)

Choose GPT-5.4 For:

  • Agentic workflows: Multi-step tool calling, persistent task execution, computer use automation

  • Long-context processing: 1M token window for document analysis, codebase review, legal research

  • Spreadsheet analysis: The Excel add-in is a genuine productivity win for business users

  • Budget-conscious API use: $2.00/$8.00 per million tokens is competitive, especially with the 47% token reduction on tool-heavy tasks

Choose Claude Opus 4.6 For:

  • Coding: 80.8% SWE-bench vs 57.7%, and 3.7x faster on complex tasks

  • Writing quality: More natural, human-sounding output with less prompt engineering

  • Nuanced reasoning: Handles common-sense edge cases more reliably

  • Minimal editing: Outputs require less post-processing

Choose Gemini 3.1 Pro For:

  • Price-performance: $2.00/$12.00 per million tokens with strong all-around performance

  • Multimodal tasks: Strong image and video understanding

  • Science-heavy work: Competitive benchmarks on technical reasoning

  • Cost-sensitive production: Best economics for high-volume API usage

As Stephen Smith advised: "If you're productive with Claude or Gemini, don't switch. If you're on OpenAI, enjoy the upgrade."

How GPT-5.4 Powers Real-World AI Products

The upgrade path from ChatGPT 5.0 to 5.4 reveals OpenAI's shift toward incremental, frequent releases rather than the dramatic version jumps that characterized earlier GPT generations. This approach mirrors what we've seen from Anthropic with Claude's iterative improvements and from Google with Gemini's rapid versioning. For enterprise customers who have built workflows around specific model behaviors, this frequent iteration cadence creates both opportunities and challenges.

One area where ChatGPT 5.4 shows clear improvement is in structured output generation. The model is significantly better at following complex formatting instructions, generating valid JSON on the first attempt, and maintaining consistency across long outputs. For applications like email template generation, where Maylee needs the model to produce correctly formatted HTML that renders consistently across email clients, these improvements translate directly into fewer rendering errors and better user experience.

The pricing implications of 5.4 are worth examining closely. OpenAI has maintained similar per-token pricing despite the performance improvements, which effectively means you get more capability per dollar. However, the model's tendency to generate longer, more detailed responses means that total cost per task can actually increase if you don't adjust your prompting strategy. Teams migrating from 5.0 to 5.4 should budget for a period of prompt optimization to ensure they're getting the quality improvements without a corresponding cost increase.

The benchmark results for ChatGPT 5.4 paint a nuanced picture. On standard reasoning benchmarks like MMLU and ARC, the improvements over 5.0 are modest but consistent, typically in the 2-5 percentage point range. Where 5.4 truly shines is in applied tasks: multi-step instruction following, maintaining context coherence across long conversations, and handling ambiguous or underspecified requests without asking excessive clarification questions. These are exactly the capabilities that matter most for production applications.

For teams building email-related AI features, the improvements in tone consistency and format adherence are particularly relevant. Maylee's AI features, including smart reply suggestions and email categorization, benefit directly from a model that better understands the implicit social dynamics of professional email communication. The difference between a model that generates a technically correct reply and one that generates a reply with the right tone for the business context is often the difference between a feature users love and one they abandon.

The competitive dynamics are also worth noting. ChatGPT 5.4's release comes just weeks after Anthropic's latest Claude update and Google's Gemini improvements. The rapid pace of iteration across all three major providers is creating a market where no single model maintains a clear lead for more than a few months. For application developers, this means building architectures that can switch between providers without major code changes, a strategy that tools like LiteLLM and Maylee's model-agnostic backend are designed to support.

The safety and alignment improvements in 5.4 are less visible but equally important for production deployments. The model shows significantly reduced tendency to refuse legitimate business requests that earlier versions flagged as potentially harmful. For B2B applications like automated email composition, customer support, and document analysis, fewer false refusals means higher throughput and less human intervention required to handle edge cases.

The advances in GPT-5.4 particularly the improved reasoning, larger context window, and tool search capabilities ripple through the entire ecosystem of products built on OpenAI's API. Email clients like Maylee, which let users bring their own OpenAI API key, benefit directly from these improvements. Better reasoning means more accurate email classification, more natural-sounding auto-drafted replies, and higher confidence scores for autonomous responses.

For any product that integrates GPT models, the upgrade from 5.2 to 5.4 translates into tangible quality improvements without code changes the model simply gets smarter at the tasks it was already doing.

The Bottom Line

GPT-5.4 is a genuine generational improvement over GPT-5.2. The professional task performance jump (70.9% to 83%), the superhuman computer use capability, and the steerable thinking plans are real advances. It deserves to be called 5.5 the improvement is that significant.

But it is not the best model at everything. Claude Opus 4.6 writes better and codes better. Gemini 3.1 Pro offers better value for money on many tasks. The right choice in 2026 is not picking one model it is understanding which model excels at which task and routing accordingly.

Looking at the broader trajectory, ChatGPT 5.4 represents a maturation of the large language model market. The dramatic leaps that characterized the GPT-3 to GPT-4 transition have given way to steady, measurable improvements that compound over time. For product teams building on these APIs, this predictability is actually more valuable than dramatic breakthroughs. You can plan your product roadmap around continuous improvement rather than disruptive changes that require emergency redesigns. The winners in the AI application layer will be teams that build the best user experiences on top of these steadily improving foundations, not those who chase the latest model announcement.

For developers evaluating whether to migrate existing applications from GPT-4 to ChatGPT 5.4, the transition is remarkably smooth. The API interface remains backward compatible, and the model's improved instruction following means that most existing prompts will produce equal or better results without modification. The main consideration is testing edge cases where the model's changed behavior might produce unexpected outputs.

The gap between the top three frontier models has narrowed to the point where the right choice depends entirely on your specific workflow. That is good news for everyone building with AI.

ChatGPT 5.4 FAQ: Your Questions Answered

What is new in ChatGPT 5.4 compared to GPT-5.2?+

Key improvements include a 1 million token context window (up from 400K), steerable thinking plans that show reasoning before generating, native computer use scoring 75% on OSWorld (vs 47.3%), Tool Search reducing token usage by 47%, an Excel add-in, and 33% fewer false claims. Professional task performance jumped from 70.9% to 83%.

How much does ChatGPT 5.4 cost via the API?+

GPT-5.4 API pricing is $2.00 per million input tokens and $8.00 per million output tokens. This is a 43% increase on input cost compared to GPT-5.2. Pricing increases further above approximately 272K context tokens. For comparison, Claude Opus 4.6 costs $5.00/$25.00 and Gemini 3.1 Pro costs $2.00/$12.00.

Is ChatGPT 5.4 better than Claude Opus 4.6 for coding?+

No. Claude Opus 4.6 significantly outperforms GPT-5.4 on coding benchmarks: 80.8% vs 57.7% on SWE-bench. Independent evaluator Nate B Jones found Claude to be 3.7x faster on complex coding tasks. GPT-5.4 excels at computer use and agentic workflows, but Claude remains the stronger coding model.

What is GPT-5.4's context window?+

GPT-5.4 supports a 1 million token context window with up to 128K max output tokens. This allows processing entire codebases, lengthy legal documents, or multiple annual reports in a single query. Note that pricing increases above approximately 272K context tokens.

What is the "car wash test" that GPT-5.4 failed?+

Independent evaluator Nate B Jones asked frontier models a simple question about getting to a car wash. GPT-5.4 said to walk, while Claude and Gemini correctly said to drive — because you need the car at the car wash. It illustrates that GPT-5.4's common-sense reasoning can still fail on basic practical scenarios despite strong benchmark scores.

Should I switch from Claude to ChatGPT 5.4?+

It depends on your use case. Switch if you primarily need agentic workflows, computer use automation, long-context processing, or budget-friendly API access. Stay with Claude if you prioritize coding quality, natural writing style, or outputs that require minimal editing. Many teams are now using both models for different tasks.

What are GPT-5.4's main weaknesses?+

The most significant weaknesses are: flat/mechanical writing quality compared to Claude, occasional dishonesty about task completion in agentic workflows, need for more detailed prompting to get optimal output, common-sense reasoning gaps on basic scenarios, and weaker performance on front-end/UI generation tasks.

What are steerable thinking plans in GPT-5.4?+

Steerable thinking plans show the model's reasoning strategy before it generates the full response. You see the planned approach, key considerations, and structure upfront, then can adjust the direction before the model commits. This eliminates the "generate, read, regenerate" cycle and saves both time and tokens.

Prêt à commencer ?

Maylee

L'IA qui pense pour votre boîte mail.

Ressources

Réseaux sociaux

Contact

© 2026 Maylee. Tous droits réservés.