PRODUCT February 26, 2026 5 min read

Claude 4 Is Here and It's a Coding Beast

By Ultrathink
ultrathink.ai

Anthropic just drew a line in the sand. With the launch of Claude Opus 4 and Claude Sonnet 4 on May 22, 2025, the company isn't just competing with OpenAI and Google — it's claiming the coding throne outright. And the benchmarks back it up.

Two Models, One Message: Code Is King

The Claude 4 launch delivered two models tuned for different use cases but sharing the same DNA. Claude Opus 4 is the flagship — positioned as the world's best coding model. Claude Sonnet 4 is the workhorse — balancing raw intelligence with speed and cost efficiency. Both are hybrid models capable of near-instant responses or deep, extended thinking when the problem demands it.

Opus 4 hit 72.5% on SWE-bench Verified, the industry's go-to benchmark for real-world software engineering tasks. It also scored 43.2% on Terminal-bench, a newer and brutally hard evaluation. Sonnet 4 actually edged out its bigger sibling on SWE-bench at 72.7% (yes, the smaller model scored higher), though it trailed on Terminal-bench at 35.5%.

For context: GPT-4.1 managed 54.6% on SWE-bench Verified. Gemini 2.5 Pro landed at 63.2%. Neither came close.

Extended Thinking Changes the Game

The headline feature here is Extended Thinking with Tool Use, now in beta. Both Claude 4 models can pause, reason deeply, invoke tools like web search mid-thought, and then synthesize a response. This isn't just chain-of-thought prompting dressed up in marketing language. It's a fundamentally different workflow — the model thinks, acts, thinks again, and delivers.

On GPQA Diamond, a graduate-level reasoning benchmark, Opus 4 scores 74.9% normally but jumps to 79.6% with extended thinking enabled. Sonnet 4 follows the same pattern: 70.0% standard, 75.4% with thinking. This is the gap that matters. When you let these models actually reason, they pull away from the competition.
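In API terms, extended thinking is an opt-in request parameter with a token budget for the reasoning phase. Here's a minimal sketch of what such a request looks like; the model ID and budget values are illustrative assumptions, so check Anthropic's current documentation before relying on them.

```python
# Sketch of a Messages API request with extended thinking enabled.
# Model ID and token budgets below are assumptions, not authoritative values.

def build_thinking_request(prompt: str, budget_tokens: int = 10_000) -> dict:
    """Build kwargs for a Messages API call with extended thinking on.

    With thinking enabled, the model emits "thinking" content blocks
    (its internal reasoning) before the final "text" answer block.
    """
    return {
        "model": "claude-opus-4-20250514",   # assumed model ID
        "max_tokens": 16_000,                # must exceed the thinking budget
        "thinking": {
            "type": "enabled",
            "budget_tokens": budget_tokens,  # cap on reasoning tokens
        },
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_thinking_request("Prove that sqrt(2) is irrational.")
```

With the official `anthropic` Python SDK, you'd send this as `client.messages.create(**request)` and get back thinking blocks interleaved with the final answer.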

Other capabilities worth noting:

  • Parallel tool execution: Models can use multiple tools simultaneously, slashing latency on complex agentic workflows.
  • Memory files: Given local file access, Opus 4 extracts and saves key facts across sessions, building up working knowledge over time. This is huge for long-running development projects.
  • 65% fewer shortcuts: Both models are significantly less likely to exploit loopholes or take shortcuts on agentic tasks compared to Sonnet 3.7. Anthropic is actually making reliability a feature, not just a talking point.
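The parallel-tool-execution point is worth unpacking: when a single model turn requests several tools at once, nothing forces your client code to run them sequentially. Here's a sketch of concurrent dispatch with a thread pool; the tool names and handlers are hypothetical stand-ins for real tools like web search or file reads.

```python
# Dispatch every tool_use request from one model turn concurrently,
# then collect tool_result payloads to send back. Handlers here are
# hypothetical stand-ins, not a real tool API.
from concurrent.futures import ThreadPoolExecutor

def run_tools_in_parallel(tool_calls, handlers):
    """Execute all requested tools concurrently; return tool_result
    payloads in the same order, keyed by each tool_use id."""
    def dispatch(call):
        result = handlers[call["name"]](**call["input"])
        return {"type": "tool_result",
                "tool_use_id": call["id"],
                "content": str(result)}

    with ThreadPoolExecutor() as pool:
        return list(pool.map(dispatch, tool_calls))

# Toy handlers standing in for slower real-world tools
handlers = {
    "add": lambda a, b: a + b,
    "upper": lambda text: text.upper(),
}
calls = [
    {"id": "t1", "name": "add", "input": {"a": 2, "b": 3}},
    {"id": "t2", "name": "upper", "input": {"text": "latency"}},
]
results = run_tools_in_parallel(calls, handlers)
```

For I/O-bound tools (search, HTTP calls, file reads), this is where the latency savings on agentic workflows actually come from.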

Claude Code Goes Mainstream

Claude Code is now generally available, with native integrations for VS Code and JetBrains, plus background task execution via GitHub Actions. For developers who've been using Copilot or Cursor, this is Anthropic's direct play for your daily workflow. The new Code Execution tool, MCP connector, and Files API round out an increasingly complete developer platform.

Opus 4 can sustain focus on complex tasks for several hours straight. That's not a typo. While most models degrade on long-context tasks, Anthropic specifically optimized for endurance. For enterprise teams running autonomous coding agents overnight, this is the differentiator.

The Three-Way Race: Claude vs GPT vs Gemini

Here's the honest breakdown as of mid-2025:

Coding

Claude 4 wins. It's not even particularly close on the benchmarks that matter. OpenAI's GPT-4.1 is competent but trails by nearly 20 points on SWE-bench. Gemini 2.5 Pro is better than GPT but still behind Claude. If you're building agentic coding tools, shipping production code, or debugging complex systems, Claude is the answer right now.

That said, GPT-5.2 later closed the gap significantly, hitting 80% on SWE-bench Verified by January 2026. The lead is real but perishable.

General Purpose & Creative Work

ChatGPT still owns the general-purpose crown. GPT-5's native multimodal capabilities, 256K context window, and polished consumer experience make it the default for everyday use. Its voice mode is unmatched. For creative writing that needs flair over precision, OpenAI's models still feel more natural to many users.

Claude, however, dominates structured, professional writing — business memos, literature reviews, technical documentation. It follows complex instructions better than anything else on the market and flags uncertainty instead of hallucinating with confidence.

Context & Scale

Gemini 2.5 Pro's 1 million token context window remains the industry benchmark for sheer scale. If you need to analyze an entire codebase or a library of research papers in one shot, Google's model is still the play. Claude's 200K window (1M in beta) is catching up, but Gemini's production-ready million-token context is a genuine advantage for specific workflows.

Multimodal & Ecosystem

Google wins on ecosystem integration — Gemini woven through Workspace is powerful for enterprise. OpenAI wins on consumer polish and the broadest third-party tooling. Anthropic wins on developer trust. Its safety-first approach, instruction-following precision, and coding dominance have carved out a loyal power-user base that neither competitor has matched.

Pricing: Aggressive but Not Cheap

Opus 4 runs $15 input / $75 output per million tokens. Sonnet 4 comes in at $3 / $15. That makes Sonnet 4 the obvious sweet spot for most teams — you get near-Opus coding performance at one-fifth the cost. Compared to GPT-4.1's pricing, Sonnet 4 is competitive. Opus 4 is a premium play for teams that need the absolute best and can justify the spend.
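To make the "one-fifth the cost" claim concrete, here's a back-of-envelope calculator using the per-million-token prices quoted above; the token counts are an example workload, not a measured benchmark.

```python
# Cost comparison at the quoted list prices.
# (input, output) in USD per million tokens.
PRICES = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total USD cost for one workload at list price."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example agent run: 2M input tokens, 0.5M output tokens
opus = cost("opus-4", 2_000_000, 500_000)      # 30.00 + 37.50 = 67.50
sonnet = cost("sonnet-4", 2_000_000, 500_000)  # 6.00 + 7.50 = 13.50
```

At these list prices the ratio is exactly 5x regardless of the input/output mix, which is what makes Sonnet 4 the default choice unless you've measured that Opus-only capability you need.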

The Bottom Line

Anthropic has made a bet: the future of AI is code, agents, and reliability. Claude 4 is that bet made manifest. It's not the best at everything — ChatGPT is more versatile, Gemini scales wider — but in the domains that matter most to developers and enterprises building on AI, Claude 4 sets the standard.

The real question isn't whether Claude 4 is good. It's whether Anthropic can hold this lead. OpenAI and Google are iterating just as fast, and the benchmarks shift every quarter. But right now, in this moment, if you're writing code with AI assistance, Claude 4 is the model to beat.

Building with AI agents or evaluating models for your dev team? Follow ultrathink.ai for benchmark breakdowns, hands-on reviews, and the sharpest takes on the models that actually matter.

This article was ultrathought.
