ANALYSIS · February 24, 2026 · 6 min read

Claude 4 Is Winning the Coding War. Here's Proof.

By Ultrathink
ultrathink.ai

Anthropic dropped Claude Opus 4 and Sonnet 4 on May 22, 2025, and the benchmark results aren't subtle. In coding tasks — the arena that matters most to the developers actually paying for these models — Claude 4 didn't just edge ahead. It obliterated the competition. The question now isn't whether Anthropic has the best coding models. It's whether OpenAI and Google can close a gap that's starting to look structural.

The Numbers Don't Lie: Coding Dominance

Let's start with SWE-bench Verified, the industry's go-to benchmark for real-world software engineering tasks. Claude Sonnet 4 scored 72.7%. Claude Opus 4 hit 72.5%. GPT-4.1? A distant 54.6%. Gemini 2.5 Pro landed at 63.2%.

Read those numbers again. Sonnet 4 — the cheaper model — beat GPT-4.1 by 18 percentage points. That's not a rounding error. That's a generational gap dressed up in a minor version number.

Terminal-bench tells the same story. Opus 4 scored 43.2% (50.0% with high-compute mode). GPT-4.1 managed 30.3%. Gemini 2.5 Pro trailed at 25.3%. When you enable parallel test-time compute, Sonnet 4 pushes to 80.2% on SWE-bench. That's a number that would have seemed impossible 12 months ago.
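
Anthropic hasn't detailed exactly how its high-compute mode works, but the idea behind parallel test-time compute is simple: sample several candidate solutions and keep whichever one a scorer likes best. A minimal sketch of that pattern in Python, where generate_patch and score_patch are hypothetical stand-ins rather than anything from Anthropic's stack:

    import concurrent.futures
    import random

    def generate_patch(task: str, seed: int) -> str:
        # Stand-in for one model call that proposes a candidate patch.
        return f"candidate patch #{seed} for: {task}"

    def score_patch(patch: str) -> float:
        # Stand-in for a scorer, e.g. the fraction of the task's tests that pass.
        return random.random()

    def best_of_n(task: str, n: int = 8) -> str:
        # Sample n candidates in parallel, then keep the highest-scoring one.
        with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
            candidates = list(pool.map(lambda s: generate_patch(task, s), range(n)))
        return max(candidates, key=score_patch)

    print(best_of_n("fix the failing test in parser.py"))

The tradeoff is obvious: roughly n times the inference cost for one answer, which is why these results are reported as a separate high-compute configuration rather than the default.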

Claude Sonnet 4 at $3 per million input tokens outperforms GPT-4.1 on every coding benchmark we've seen. The pricing math alone should make enterprise teams rethink their stack.

Reasoning: A More Competitive Picture

Step outside coding and the landscape gets more nuanced — but Claude 4 still holds its own. On GPQA Diamond, the graduate-level reasoning benchmark, the top models are nearly indistinguishable: Claude Sonnet 4 (83.8%), Claude Opus 4 (83.3%), OpenAI o3 (83.3%), and Gemini 2.5 Pro (83.0%). We're in margin-of-error territory here.

Math tells a slightly different story. On AIME 2025 (high school math competitions), Opus 4 led with 90.0%, ahead of OpenAI o3 at 88.9% and Gemini 2.5 Pro at 83.0%. Not a blowout, but Anthropic's model is consistently at or near the top.

Where Google fights back is visual reasoning. On MMMU validation, Gemini 2.5 Pro scored 79.6% against Opus 4's 76.5%. And Gemini's video understanding capabilities remain best-in-class at 84.8% on VideoMME. If your workflow is heavy on multimodal content, Google still has a legitimate claim.

The Agentic Edge

Here's where the Claude 4 launch gets strategically interesting. Anthropic didn't just ship better models — they shipped agent infrastructure. Claude Code became generally available alongside Opus 4 and Sonnet 4, with integrations for VS Code and JetBrains, plus a new SDK for building custom agents.
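
The agent SDK is its own package, but for most teams the entry point is still an ordinary API call. A minimal sketch using the Anthropic Python SDK; the model ID below is the launch-day Sonnet 4 identifier, so check the current docs before copying it:

    # pip install anthropic
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # launch-day Sonnet 4 ID; verify against current docs
        max_tokens=1024,
        messages=[
            {"role": "user", "content": "Refactor this function to remove the nested loops: ..."}
        ],
    )
    print(response.content[0].text)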

On TAU-bench, the agentic tool use benchmark, Opus 4 scored 81.4% in retail scenarios and Sonnet 4 hit 80.5%. Neither OpenAI nor Google has published competitive numbers on this benchmark, which says something on its own.

The new API capabilities — a code execution tool, MCP connector, and Files API — signal that Anthropic is building for a world where AI agents don't just answer questions. They execute multi-step workflows autonomously. The hybrid reasoning approach, letting models toggle between instant responses and extended thinking, gives developers granular control over the speed-accuracy tradeoff in production.
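
That speed-accuracy toggle is exposed as a request parameter. A hedged sketch of turning extended thinking on for a single call; the parameter shape follows Anthropic's extended-thinking documentation, and the token budget here is an arbitrary illustration:

    import anthropic

    client = anthropic.Anthropic()

    # Extended thinking lets the model reason in a scratchpad before answering.
    # budget_tokens caps how much of max_tokens can be spent on that reasoning;
    # drop the thinking field entirely to get the instant-response behavior.
    response = client.messages.create(
        model="claude-opus-4-20250514",  # launch-day Opus 4 ID; verify against current docs
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content": "Plan a zero-downtime migration from REST to gRPC."}],
    )

    # The response interleaves thinking blocks and text blocks; print only the answer.
    for block in response.content:
        if block.type == "text":
            print(block.text)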

65% Fewer Shortcuts

One underreported detail: Anthropic claims Sonnet 4 is 65% less likely to take shortcuts than Sonnet 3.7. For developers who've been burned by models that produce plausible-looking but fundamentally broken code, this matters enormously. It suggests Anthropic is optimizing for reliability, not just benchmark scores — a distinction that separates toy demos from production systems.

The Pricing Power Play

Let's talk money. Opus 4 costs $15/$75 per million tokens (input/output). Sonnet 4 runs at $3/$15. GPT-4.1 is competitively priced, especially with its mini and nano variants. Gemini 2.5 Pro's pricing was still in preview at launch.

But the real story is the performance-per-dollar ratio. Sonnet 4 at $3 per million input tokens delivers coding performance that beats Opus-tier pricing from competitors. For startups and mid-size engineering teams, this is the model that changes procurement decisions. You get 72.7% on SWE-bench for one-fifth the cost of Opus 4. That's not a compromise — that's a steal.
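
To make the ratio concrete, here's a back-of-the-envelope monthly cost comparison at the launch list prices quoted above. The token volumes are invented for illustration, not measured usage:

    # Launch list prices in USD per million tokens.
    PRICES = {
        "claude-opus-4":   {"input": 15.0, "output": 75.0},
        "claude-sonnet-4": {"input": 3.0,  "output": 15.0},
    }

    # Hypothetical monthly volume for a mid-size engineering team.
    input_tokens = 500_000_000   # 500M input tokens
    output_tokens = 100_000_000  # 100M output tokens

    for model, price in PRICES.items():
        cost = (input_tokens / 1e6) * price["input"] + (output_tokens / 1e6) * price["output"]
        print(f"{model}: ${cost:,.0f}/month")

    # claude-opus-4: $15,000/month
    # claude-sonnet-4: $3,000/month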

Context Window: The One Gap

Claude 4 launched with a 200K token context window. GPT-4.1 offers 1 million tokens. Gemini 2.5 Pro matches that with 1 million, and Google promises 2 million is coming.

For most coding tasks, 200K is plenty. But for large codebase analysis, lengthy document processing, or complex multi-file debugging sessions, the context window disparity is real. Anthropic clearly recognized this — later models in the 4.x family expanded to 1 million tokens in beta. But at launch, this was a tangible disadvantage competitors were happy to highlight.
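
If you're not sure which side of the 200K line your workload falls on, a rough rule of thumb is about four characters per token for English-heavy source code. A quick sketch that estimates whether a repository fits; the ratio is a heuristic, not the model's actual tokenizer:

    from pathlib import Path

    CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by language and content

    def estimate_repo_tokens(root: str, exts=(".py", ".ts", ".go", ".md")) -> int:
        # Approximate the token count of all matching source files under root.
        total_chars = 0
        for path in Path(root).rglob("*"):
            if path.is_file() and path.suffix in exts:
                total_chars += len(path.read_text(errors="ignore"))
        return total_chars // CHARS_PER_TOKEN

    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens; fits in a 200K context: {tokens < 200_000}")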

What the Rapid Iteration Tells Us

Perhaps the most telling signal is what came after launch. By August 2025, Anthropic shipped Opus 4.1. September brought Sonnet 4.5. October delivered Haiku 4.5. November completed the cycle with Opus 4.5, which included a significant price reduction. By February 2026, Opus 4.6 hit 65.4% on Terminal-Bench 2.0 and 80.8% on SWE-bench Verified.

This cadence is aggressive. Anthropic is shipping meaningful capability upgrades every 6-8 weeks. That's not just iteration — it's a pace that forces OpenAI and Google to respond or concede ground.

The Verdict

Claude 4's launch established Anthropic as the clear leader in AI-assisted software engineering. The coding benchmarks aren't close. The agentic capabilities are ahead of the curve. The pricing — especially at the Sonnet tier — is disruptive.

Google retains an edge in multimodal understanding and context length. OpenAI's GPT-4.1 offers solid all-around performance and an ecosystem advantage with ChatGPT's massive user base. But if you're a developer choosing a model to write, debug, and ship production code, the data points in one direction.

Anthropic built Claude 4 for the people who build things. And right now, those people are paying attention.

Building with Claude 4, GPT-4.1, or Gemini 2.5 Pro? We want to hear about your real-world experience. Follow ultrathink.ai for ongoing coverage of the AI model wars and deep-dive developer benchmarks.

This article was ultrathought.
