Claude 4 Launch: Anthropic Enters the Frontier War
Anthropic didn't just release a new model on May 22, 2025. It released a statement of intent. Claude 4, spanning the flagship Opus 4 and the workhorse Sonnet 4, is the company's most aggressive bid yet for the frontier crown. And in a year when OpenAI's GPT-5 and Google's Gemini 2.5 Pro are vying for the same throne, the timing couldn't be more pointed.
What Shipped: Opus 4 and Sonnet 4
The Claude 4 launch delivered two models with distinct roles. Claude Opus 4 is the heavy hitter — built for complex reasoning, sustained agentic work, and deep coding tasks. Claude Sonnet 4 is the balanced mid-tier option: faster, cheaper, and surprisingly competitive on benchmarks that should, on paper, belong to its bigger sibling.
Here's the headline number: Opus 4 hit 72.5% on SWE-bench, the gold-standard benchmark for real-world software engineering. But Sonnet 4 actually edged it out at 72.7% on the same test — a fascinating quirk that tells you something about how Anthropic optimized each model for different workload profiles. Opus 4 also posted 43.2% on Terminal-bench, a newer benchmark measuring command-line proficiency in complex environments.
Both models share a 200,000-token context window and introduce genuinely new capabilities: extended thinking with tool use, parallel tool execution, and improved memory via local file access. The extended thinking feature is particularly significant — it lets the model reason through multi-step problems while actively calling external tools, rather than forcing a choice between thinking and doing.
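To make that concrete, here's a minimal sketch of what an extended-thinking-plus-tools request looks like at the Messages API level. The model ID, token budgets, and the `run_tests` tool are illustrative assumptions; only the overall request shape follows Anthropic's documented format.

```python
import json

# Hypothetical request body for the Anthropic Messages API (POST /v1/messages).
# Model ID and budgets are illustrative, not prescriptive.
request = {
    "model": "claude-opus-4-20250514",  # assumed launch-day Opus 4 model ID
    "max_tokens": 4096,
    # Extended thinking: a private reasoning budget the model can interleave
    # with tool calls, rather than choosing between thinking and doing.
    # The budget must stay below max_tokens.
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    # A user-defined tool the model may invoke mid-reasoning (hypothetical).
    "tools": [{
        "name": "run_tests",
        "description": "Run the project's test suite and return the failures.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    }],
    "messages": [{"role": "user",
                  "content": "Fix the failing test in src/parser.py"}],
}

print(json.dumps(request, indent=2))
```

The key design point is that `thinking` and `tools` coexist in one request: the model can emit tool calls from inside its reasoning loop instead of finishing a thought before acting.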
The Killer Feature: Seven-Hour Agentic Sessions
This is where Claude 4 gets interesting and where Anthropic is making its sharpest strategic bet. Opus 4 can work autonomously for up to seven hours without meaningful performance degradation. That's not a benchmark number — it's an architecture decision that signals where Anthropic thinks AI is heading.
While OpenAI and Google are optimizing for faster responses and broader multimodality, Anthropic is doubling down on sustained autonomous work. Think CI/CD pipelines that debug themselves. Think research agents that don't lose the thread after 20 minutes. Opus 4 was designed to be the model you hand a complex task to and walk away from.
Anthropic classified Opus 4 as a "Level 3" model on its internal safety scale, indicating "significantly higher risk" — an unusually candid acknowledgment that more capable models require more guardrails. The company simultaneously shipped new API features including a code execution tool, connectors via the Model Context Protocol, and a Files API.
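As a sketch of how the new server-side tooling surfaces in a request, the snippet below opts into the code execution tool. The tool type string and beta header value are assumptions modeled on Anthropic's dated-versioning naming convention, so verify them against the current API reference before relying on them.

```python
# Illustrative request enabling the launch-day code execution tool.
# The beta flag and tool "type" string are assumptions, not confirmed values.
headers = {
    "x-api-key": "<your-api-key>",
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "code-execution-2025-05-22",  # assumed beta header
}
body = {
    "model": "claude-opus-4-20250514",  # assumed launch-day model ID
    "max_tokens": 2048,
    # Server-side tool: Claude writes and runs code in a sandbox,
    # rather than the client executing tool calls locally.
    "tools": [{"type": "code_execution_20250522", "name": "code_execution"}],
    "messages": [{
        "role": "user",
        "content": "Profile this CSV and summarize the 'latency_ms' column.",
    }],
}
```

Unlike client-side tools, nothing here requires you to implement an executor; the sandbox runs on Anthropic's side and results stream back in the response.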
The Three-Way Benchmark War
Let's cut through the marketing and look at where each model actually leads. The frontier race in mid-2025 has three clear contenders, each with distinct strengths.
Coding
Claude 4 dominates here. The 72.5–72.7% SWE-bench scores from Opus and Sonnet at launch were best-in-class. GPT-5 variants later posted around 65–74.9% on SWE-bench depending on the version, while Gemini 2.5 Pro landed around 63.8% on SWE-bench Verified. Anthropic's subsequent updates, Sonnet 4.5 at 70.6% and Opus 4.6 at a staggering 80.9% on SWE-bench Verified, kept Claude at the front of the pack. If you're building developer tools, Claude is the model to beat.
Reasoning
This is GPT-5's territory. OpenAI's flagship scored 89.4% on GPQA Diamond (high mode) and 52.9% on ARC-AGI-2. Claude Opus 4.6 later matched or exceeded some of these numbers — 90.5% on GPQA Diamond with 32k thinking tokens — but at launch, GPT-5's reasoning depth was the benchmark to chase. Gemini 2.5 Pro held its own at 84.6% on GPQA Diamond, competitive but clearly in third.
Context and Multimodality
Google wins this one and it's not close. Gemini 2.5 Pro's 1-million-token context window — with a 2-million-token beta — dwarfs both Claude's 200K and GPT-5's 272–400K range. For tasks involving entire codebases, long documents, or cross-modal workflows spanning text, images, audio, and video, Gemini remains the pragmatic choice.
Pricing: Anthropic's Achilles Heel
Here's where Anthropic's positioning gets uncomfortable. Opus 4 is priced at $15/$75 per million tokens (input/output). Sonnet 4 comes in at $3/$15. Compare that to GPT-5 and Gemini 2.5 Pro, both offering standard tiers around $1.25/$10 per million tokens.
Sonnet 4 is 2.4x more expensive on input tokens ($3 vs. $1.25) and 1.5x on output ($15 vs. $10) compared with the GPT-5 and Gemini standard tiers. Opus 4 is in a different stratosphere entirely. Anthropic is betting that raw performance, especially in coding and agentic workflows, justifies the premium. For enterprise teams running complex autonomous tasks, that math might work. For everyone else, the cost-per-capability gap is hard to ignore.
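A quick back-of-envelope script using the listed prices shows how the gap compounds per request. The 50K-in / 5K-out workload is an illustrative agentic coding call, not a measured average.

```python
# Prices in USD per million tokens (input, output), from the article.
PRICES = {
    "opus-4":         (15.00, 75.00),
    "sonnet-4":       (3.00, 15.00),
    "gpt-5":          (1.25, 10.00),
    "gemini-2.5-pro": (1.25, 10.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed standard-tier rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A representative agentic coding request: 50K tokens in, 5K tokens out.
for model in PRICES:
    print(f"{model:>14}: ${request_cost(model, 50_000, 5_000):.4f}")
```

On this workload the blended gap works out to roughly 2x for Sonnet 4 and 10x for Opus 4 versus the GPT-5/Gemini standard tiers, which is the math an enterprise team has to justify.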
The Rapid Iteration Story
What's arguably more impressive than the launch itself is what came after. Anthropic followed Claude 4's May debut with Opus 4.1 in August, Haiku 4.5 in October, Opus 4.5 in November, and Opus 4.6 in February 2026. Counting the launch, that's five major releases in nine months, a cadence that makes OpenAI's release schedule look leisurely.
Each update pushed meaningful improvements. Opus 4.1 added the ability to end persistently abusive conversations. Opus 4.5 sharpened coding and workplace performance. Opus 4.6 introduced agent teams and PowerPoint integration. This isn't incremental versioning — it's a sustained sprint.
The Verdict
Claude 4 established Anthropic as the clear leader in AI-assisted software engineering and long-duration agentic work. It's not the best model for everything — GPT-5 edges it on general reasoning, and Gemini 2.5 Pro crushes it on context length and price-performance. But Anthropic made a deliberate choice: rather than building the best generalist, it built the best worker.
In a market where every frontier lab claims to have the "most capable" model, Anthropic chose to build the one that can actually finish the job.
For developers, AI engineers, and teams building autonomous coding pipelines, Claude 4 isn't just competitive — it's the default. The pricing stings, but Anthropic is betting you'll pay for results. Based on the benchmarks, they might be right.
Building with Claude 4, GPT-5, or Gemini 2.5? Follow ultrathink.ai for sharp, unfiltered analysis of the models reshaping how we build software.
This article was ultrathought.