
Claude Opus 4.7 Is Here — Head-to-Head Benchmark Comparison with GPT 5.4, Gemini 3.1 Pro, and Mythos

Anthropic released Opus 4.7 on April 16, 2026. Same price as before, but SWE-bench Pro jumps 10.9 points over 4.6 — beating GPT 5.4 on coding while losing on web research. Mythos still leads by 6-14 points. Here are all the real numbers.

17 Apr 2026 · 15 min read
Tags: Claude Opus 4.7 · Anthropic · GPT 5.4 · Gemini · AI Benchmark · LLM Comparison · Software Development

TL;DR

On April 16, 2026, Anthropic released Claude Opus 4.7 — reclaiming the top spot for the most capable generally available LLM from OpenAI's GPT 5.4.

Key numbers:

  • SWE-bench Verified: 87.6% (up from 80.8% in Opus 4.6) — highest among publicly available models
  • SWE-bench Pro: 64.3% vs GPT 5.4's 57.7% — a 6.6-point lead
  • CursorBench: 70% — highest of any model with published results
  • Pricing: $5/M input, $25/M output — unchanged from Opus 4.6

But Opus 4.7 doesn't win everything. GPT 5.4 still leads BrowseComp (web research) by a full 10 points, and Mythos — available only to Project Glasswing consortium members — leads Opus 4.7 by 6-14 points on coding benchmarks.

This article compares real numbers across 16 benchmarks. No cheerleading — just what wins where and why it matters.


The Day Opus Came Back — April 16, 2026

For the past couple of months, if you were building AI agents for coding or enterprise workflows, GPT 5.4 was the go-to choice. Solid terminal task performance, strong browsing, reasonable pricing.

Then Anthropic dropped Opus 4.7.

This isn't a minor update. SWE-bench Verified jumped from 80.8% to 87.6% — nearly 7 points in a single release. If you've followed this space, you know that each additional point at this level gets exponentially harder to earn.

To put that in perspective: SWE-bench Verified measures the ability to fix real bugs in real open-source projects. Not synthetic puzzles — actual GitHub issues where the AI reads the codebase and produces a patch that makes tests pass. Going from 80.8% to 87.6% means roughly 7 more bugs out of every 100 that the model can now handle.
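A quick back-of-the-envelope calculation makes the scale of that jump concrete (scores are from this article; the per-100 framing is illustrative, not the benchmark's official task count):

```python
# Back-of-the-envelope: what a jump from 80.8% to 87.6% on SWE-bench
# Verified means. Scores from the article; framing is illustrative.
old_score = 0.808  # Opus 4.6
new_score = 0.876  # Opus 4.7

# Roughly 7 more issues resolved per 100 attempts.
extra_resolved_per_100 = (new_score - old_score) * 100
print(f"Extra issues resolved per 100: {extra_resolved_per_100:.1f}")

# Why each point gets harder: the share of *remaining* failures eliminated.
failure_reduction = (new_score - old_score) / (1 - old_score)
print(f"Share of previous failures now fixed: {failure_reduction:.0%}")
```

The second number is why a 6.8-point gain at this level is notable: it wipes out roughly a third of the failures the previous model still had.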

The new model is available across Claude products (web, app), the API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry — covering all major cloud platforms.

But it's not just the numbers. It's the direction Anthropic chose to push: coding, agentic tasks, and enterprise reliability — the exact areas where software development teams live.


What's New in Opus 4.7

1. High-res Image Input — 3.75 MP

Opus 4.7 handles images up to 2,576 px on the long edge (~3.75 megapixels) — roughly triple the pixel count of the previous 1,568 px limit (~1.15 MP).

For teams doing UI reviews or feeding screenshots of ERP dashboards to AI for analysis, this means no more cropping or resizing. Send the full screenshot directly.
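A minimal pre-flight check shows what changes in practice. The limits come from the article; the helper functions are our own sketch, not an official SDK API:

```python
# Sketch: check whether a screenshot fits the long-edge limit before
# sending. Limits from the article; helper names are illustrative.
OLD_LONG_EDGE = 1568  # previous limit (~1.15 MP)
NEW_LONG_EDGE = 2576  # Opus 4.7 limit (~3.75 MP)

def fits_without_resize(width: int, height: int, limit: int = NEW_LONG_EDGE) -> bool:
    """True if the image's long edge is within the given limit."""
    return max(width, height) <= limit

def megapixels(width: int, height: int) -> float:
    return width * height / 1_000_000

# A full 2560x1440 (QHD) dashboard screenshot:
print(fits_without_resize(2560, 1440))                 # fits the new limit
print(fits_without_resize(2560, 1440, OLD_LONG_EDGE))  # previously needed resizing
print(f"{megapixels(2560, 1440):.2f} MP")
```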

2. xhigh Effort Level

A new effort tier between "high" and "max" — xhigh — gives finer control over how much compute the model uses per request.

In practice: tasks that need high accuracy but don't justify the cost and latency of "max" now have a middle ground.
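As a sketch, here is how a caller might route requests to the new tier. The tier names come from the article; the policy function, model string, and request shape are hypothetical, not the actual SDK:

```python
# Illustrative only: picking an effort tier per request. Tier names are
# from the article; the model name and request dict are hypothetical.
EFFORT_TIERS = ["low", "medium", "high", "xhigh", "max"]

def pick_effort(accuracy_critical: bool, latency_sensitive: bool) -> str:
    """Crude policy: xhigh covers 'needs accuracy, can't justify max'."""
    if accuracy_critical and latency_sensitive:
        return "xhigh"  # the new middle ground
    if accuracy_critical:
        return "max"
    return "medium" if latency_sensitive else "high"

# Hypothetical request payload for a high-accuracy, latency-bound task:
request = {"model": "claude-opus-4-7", "effort": pick_effort(True, True)}
print(request["effort"])  # xhigh
```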

3. Better Instruction Following

Anthropic says it's "substantially better" at following instructions, and backs the claim with a concrete number: one-third fewer tool errors than Opus 4.6 on complex workflows.

If you've dealt with AI agents that skip steps in multi-tool workflows, you know this is a big deal.

4. Improved File System Recall

Better memory of files and context across multi-session work — critical for coding agents working with large codebases over multiple days.

5. Task Budgets (Public Beta)

Set token spend limits per task — like telling the AI "don't spend more than X tokens on this." For teams running AI agents in production, this is the cost-control feature they've been waiting for.
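A client-side version of the idea can be sketched as follows, assuming you can read token usage from each response. The server-side beta would enforce this natively; every name here is illustrative:

```python
# Sketch of client-side per-task token budgeting. The real Task Budgets
# beta enforces this server-side; all names here are illustrative.
class BudgetExceeded(Exception):
    pass

class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record usage for one agent step; refuse if it would overrun."""
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.spent + tokens} > {self.max_tokens}")
        self.spent += tokens

budget = TaskBudget(max_tokens=50_000)
budget.charge(30_000)      # first agent step
budget.charge(15_000)      # second step
try:
    budget.charge(10_000)  # would overrun: task is stopped instead
except BudgetExceeded:
    print("task stopped at", budget.spent, "tokens")
```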

6. Updated Tokenizer

More efficient text processing. The result: a 14% improvement over Opus 4.6 on multi-step tasks, achieved with fewer tokens.


The Benchmark Table — Numbers Don't Lie

Data compiled from Anthropic's release, Vellum, VentureBeat, and 9to5Mac:

| Benchmark | Opus 4.7 | Opus 4.6 | GPT 5.4 | Gemini 3.1 Pro | Mythos Preview |
|---|---:|---:|---:|---:|---:|
| SWE-bench Verified | 87.6% | 80.8% | 80.6% | — | 93.9% |
| SWE-bench Pro | 64.3% | 53.4% | 57.7% | 54.2% | 77.8% |
| Terminal-Bench 2.0 | 69.4% | 65.4% | 75.1%* | 68.5% | 82.0% |
| MCP-Atlas | 77.3% | 75.8% | 68.1% | 73.9% | — |
| Finance Agent v1.1 | 64.4% | 60.1% | 61.5% | 59.7% | — |
| OSWorld-Verified | 78.0% | 72.7% | 75.0% | — | 79.6% |
| BrowseComp | 79.3% | 83.7% | 89.3% | 85.9% | 86.9% |
| GPQA Diamond | 94.2% | 91.3% | 94.4% | 94.3% | 94.6% |
| HLE (with tools) | 54.7% | 53.3% | 58.7% | 51.4% | 64.7% |
| CharXiv (no tools) | 82.1% | 69.1% | — | — | 86.1% |
| CursorBench | 70% | 58% | — | — | — |
| CyberGym | 73.1% | — | 66.3% | — | 83.1% |
| XBOW visual | 98.5% | 54.5% | — | — | — |
| BigLaw Bench | 90.9% | — | — | — | — |
| GDPval-AA (Elo) | 1753 | — | 1674 | 1314 | — |
| MMMLU | 91.5% | — | 91.1% | 92.6% | — |

*GPT 5.4 Terminal-Bench uses a self-reported harness — may not be directly comparable to third-party evaluations.

Rakuten-SWE-Bench: Opus 4.7 resolves 3x more production tasks than Opus 4.6 in Rakuten's testing.

Before diving into the analysis, here's what these benchmarks actually measure:

  • SWE-bench: Fixing real bugs in real open-source projects — the closest proxy to actual developer work
  • CursorBench: Performance in a code editor environment — directly measures AI pair programming quality
  • MCP-Atlas: Multi-tool agent workflows — relevant for enterprise systems that need to call multiple APIs
  • BrowseComp: Finding hard-to-locate information on the web — deep research, not just basic search
  • GPQA Diamond: Expert-level scientific reasoning
  • GDPval-AA: Code quality measured by Elo rating
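For the Elo-based benchmark, the rating gap translates into a head-to-head preference rate via the standard Elo expectation formula. This is the textbook interpretation applied to the table's numbers (reading the 1314 entry as Gemini's), not something the benchmark publishers state:

```python
# Standard Elo expectation: probability the higher-rated model's output
# wins a head-to-head comparison. Ratings taken from the table above.
def elo_expected(r_a: float, r_b: float) -> float:
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

opus_vs_gpt = elo_expected(1753, 1674)     # 79-point gap
opus_vs_gemini = elo_expected(1753, 1314)  # 439-point gap
print(f"Opus 4.7 preferred over GPT 5.4:    {opus_vs_gpt:.0%}")
print(f"Opus 4.7 preferred over Gemini 3.1: {opus_vs_gemini:.0%}")
```

Under this reading, a 79-point gap means Opus 4.7's code is preferred roughly 61% of the time against GPT 5.4 — a real but modest edge, versus a near-certain win against a 1314-rated opponent.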

Where Opus 4.7 Wins

Looking at the table, Opus 4.7 dominates in three categories:

Category 1: Coding & Software Engineering

  • SWE-bench Pro: 64.3% vs GPT 5.4's 57.7% — a 6.6-point gap, which is significant at this level
  • CursorBench: 70% — no other model has published comparable numbers, and the 12-point jump from Opus 4.6's 58% is massive
  • XBOW visual: 98.5% vs Opus 4.6's 54.5% — a staggering improvement
  • Rakuten-SWE: 3x more production tasks resolved than the previous version

Category 2: Agentic Tasks

  • MCP-Atlas: 77.3% vs GPT 5.4's 68.1% — a 9.2-point lead. This is the benchmark most relevant to enterprise automation
  • Finance Agent: 64.4% vs GPT 5.4's 61.5%
  • OSWorld-Verified: 78.0% vs GPT 5.4's 75.0%

Category 3: Enterprise Knowledge Work

  • BigLaw Bench: 90.9% — legal document analysis, relevant for compliance work including privacy regulations
  • GDPval-AA: Elo 1753 vs GPT 5.4's 1674 — higher code quality
  • CharXiv: 82.1% — dramatically better chart reading (up from 69.1%)

CyberGym is also worth noting: Opus 4.7 scores 73.1% vs GPT 5.4's 66.3% — a 6.8-point lead in cybersecurity tasks, even though Anthropic says they intentionally reduced cyber capabilities.

Bottom line: if your work is writing code, running AI agents, or processing enterprise documents — Opus 4.7 is the best available option today.


Where GPT 5.4 Wins

It wouldn't be honest to skip this — GPT 5.4 clearly outperforms Opus 4.7 in several areas:

Web Research & Browsing

  • BrowseComp: 89.3% vs 79.3% — a full 10-point gap. For AI-powered web research or competitive intelligence, GPT 5.4 is still the better tool.

Terminal Tasks

  • Terminal-Bench 2.0: 75.1% vs 69.4% — GPT 5.4's number comes from a self-reported harness, but the margin is wide enough to call it a real lead.

Hard Reasoning

  • HLE (with tools): 58.7% vs 54.7% — on the hardest reasoning problems, GPT 5.4 holds a 4-point advantage.

One interesting observation: BrowseComp is the only benchmark where Opus 4.7 scored lower than Opus 4.6 (79.3% vs 83.7%). This suggests Anthropic may have made a deliberate trade-off — sacrificing some web browsing capability to boost coding and agentic performance.

From an engineering perspective, this trade-off makes sense. In most enterprise systems, AI doesn't need to browse the web — it needs to read code, process documents, and call APIs correctly. Prioritizing the more common use case is sound product thinking.

But if your team primarily uses AI for research tasks, this is a data point you need to weigh before switching.


Where Gemini 3.1 Pro Stands

Google's Gemini 3.1 Pro isn't fighting for the crown this round, but it has interesting data points:

Has Reasoning Hit a Ceiling?

Look at GPQA Diamond: Opus 4.7 at 94.2%, GPT 5.4 at 94.4%, Gemini 3.1 Pro at 94.3% — all three land within 0.2 percentage points of one another. That difference has no statistical meaning.

This tells us expert-level reasoning may have hit a ceiling for current architectures. Every major provider lands in the same narrow band.

Gemini's Strengths

  • MMMLU: 92.6% — highest in the table (Opus 4.7 gets 91.5%). Multilingual general knowledge remains a Google strength.
  • BrowseComp: 85.9% — beats Opus 4.7 (79.3%) but trails GPT 5.4 (89.3%).

For teams that need multilingual knowledge or browsing at a lower price point, Gemini remains a solid choice. Especially for tasks involving mixed-language documents (like Thai-English business correspondence), MMMLU's higher score is a relevant indicator.

One more thing to note: Gemini 3.1 Pro scores only 54.2% on SWE-bench Pro — well below both Opus 4.7 (64.3%) and GPT 5.4 (57.7%). For coding-specific work, Gemini isn't a top contender right now.


Mythos — The Elephant in the Room

The numbers nobody wants to talk about:

  • SWE-bench Verified: Mythos 93.9% vs Opus 4.7's 87.6% — a 6.3-point gap
  • SWE-bench Pro: Mythos 77.8% vs Opus 4.7's 64.3% — a 13.5-point gap
  • Terminal-Bench: Mythos 82.0% vs Opus 4.7's 69.4% — a 12.6-point gap

Mythos is Anthropic's internal model available only to the Project Glasswing consortium — a group of security and research organizations. Not available to the public.

What Mythos tells us: the ceiling is much higher than what's publicly available. The technology to push AI another 10-14 points on coding already exists — it just hasn't been released.

Notable: Anthropic explicitly states that Opus 4.7 has intentionally reduced cybersecurity capabilities compared to Mythos, with automatic safeguards that detect and block prohibited uses. For legitimate security researchers, there's a separate Cyber Verification Program.

The CyberGym numbers confirm this: Mythos at 83.1% vs Opus 4.7 at 73.1% — a 10-point reduction. The capability cut is real, not just marketing. But even with the reduction, Opus 4.7 still outperforms GPT 5.4 (66.3%) on the same benchmark.

The question everyone asks: when will Mythos be publicly available? No timeline yet. Anthropic is prioritizing safety testing before releasing models at this capability level — an approach that, for organizations handling data privacy compliance, is exactly what we want to see from AI providers.


Which Model Should You Pick? — It Depends on the Job

After all the numbers, here's the practical guide:

Coding & AI Agents → Opus 4.7

If you're doing software development, using coding assistants, or building AI agents that call multiple tools and APIs — Opus 4.7 is the best publicly available option today. SWE-bench Pro, MCP-Atlas, and CursorBench make the case clearly.

Web Research & Browsing → GPT 5.4

If the primary task is finding information, analyzing websites, or doing competitive research — GPT 5.4 is still better. BrowseComp's 89.3% isn't a small number.

General Reasoning → Pick Any

With GPQA Diamond at 94%+ for all three major models, reasoning quality is effectively equivalent. Pick based on price and ecosystem fit.

Budget-Conscious → Sonnet 4.6 or GPT 5.4 Mini

Not every task needs a flagship model. For routine work that doesn't require maximum capability, smaller models offer better value.

Cybersecurity Research → Apply for Mythos Access

For security research that needs full capabilities, look into Anthropic's Cyber Verification Program.

Multilingual & Knowledge-Heavy → Gemini 3.1 Pro

For tasks requiring broad multilingual knowledge or general-purpose information retrieval, Gemini 3.1 Pro still leads on MMMLU and typically offers competitive pricing.


Impact on Software Houses

For companies doing software development — including us — Opus 4.7 matters for several reasons:

Same Price, Better Performance = Free Upgrade

$5/M input, $25/M output — identical to Opus 4.6. But SWE-bench Pro jumped 10.9 points (53.4% to 64.3%). In business terms, this is a performance improvement at zero additional cost.
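The "free upgrade" framing can be made concrete by pricing a resolved task instead of an attempted one. Token counts per task below are assumptions for illustration; the prices and resolve rates are from the article:

```python
# Cost per *resolved* SWE-bench Pro task: price is flat, but the same
# spend resolves more tasks. Token usage per attempt is an assumption.
PRICE_IN, PRICE_OUT = 5.0, 25.0          # $ per million tokens (both versions)
tokens_in, tokens_out = 200_000, 30_000  # assumed tokens per attempted task

cost_per_attempt = (tokens_in / 1e6) * PRICE_IN + (tokens_out / 1e6) * PRICE_OUT
cost_per_resolved_46 = cost_per_attempt / 0.534  # Opus 4.6 resolve rate
cost_per_resolved_47 = cost_per_attempt / 0.643  # Opus 4.7 resolve rate

print(f"Per attempt:        ${cost_per_attempt:.2f}")
print(f"Per resolved (4.6): ${cost_per_resolved_46:.2f}")
print(f"Per resolved (4.7): ${cost_per_resolved_47:.2f}")
print(f"Effective saving:   {1 - cost_per_resolved_47 / cost_per_resolved_46:.0%}")
```

Whatever the actual token counts, the ratio is fixed by the resolve rates: about 17% cheaper per resolved task at identical list prices.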

More Reliable Coding Agents

1/3 fewer tool errors + 14% multi-step improvement with fewer tokens = AI agents that fail less often, finish faster, and cost less per completed task.
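The error-rate claim compounds across a workflow: if a task chains n tool calls and each call can fail independently, cutting the per-call error rate by a third lifts end-to-end success disproportionately. The baseline error rate below is an assumption; the one-third reduction is the article's claim:

```python
# How a 1/3 cut in per-call tool errors compounds over a workflow.
# p_old is an assumed baseline; the reduction factor is from the article.
def workflow_success(p_error: float, n_calls: int) -> float:
    """Probability all n independent tool calls succeed."""
    return (1 - p_error) ** n_calls

p_old = 0.06            # assumed per-call error rate on Opus 4.6
p_new = p_old * (2 / 3) # one-third fewer errors

for n in (5, 20):
    print(f"{n:>2} calls: {workflow_success(p_old, n):.0%} "
          f"-> {workflow_success(p_new, n):.0%}")
```

For a 20-call workflow the end-to-end success rate climbs from roughly 29% to 44% under these assumptions — the longer the chain, the more the reduction matters.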

High-res Images Help UI/Design Work

Supporting 3.75 MP images means ERP module screenshots, dashboards, and wireframes can be sent directly without cropping — reducing friction in the review process.

Data Privacy & Cybersecurity Posture

Anthropic's intentional reduction of cyber capabilities plus automatic safeguards is a positive signal for businesses concerned about data privacy. It shows the AI provider takes security seriously — which matters when you're helping clients with privacy compliance.

For organizations implementing PDPA or similar data protection frameworks: choosing AI tools with built-in safety mechanisms reduces compliance risk out of the box. That's better than having to build safeguards yourself.

Task Budgets Help Control Cost

The Task Budgets feature entering public beta in Opus 4.7 is significant for production deployments. In real-world systems, cost control determines whether an AI agent is viable or not. Setting per-task token budgets prevents the runaway costs that plagued earlier generations of AI agents.


For Our Team at Enersys

As a software house working with Odoo ERP, Enterprise AI, and PDPA compliance, we've been watching Opus 4.7 since the first leaks.

What matters most to us:

  1. Coding agent quality — SWE-bench Pro and CursorBench numbers directly reflect the kind of work our development tools handle. Opus 4.7's improvements translate into direct benefits for our development workflow.

  2. Agentic workflow reliability — MCP-Atlas being 9.2 points above GPT 5.4 aligns with our use cases that involve AI calling multiple APIs, processing data, and summarizing results. Fewer tool errors means fewer retries.

  3. Cost efficiency — Same price with higher performance means lower cost per completed task — good for us and good for our clients.

  4. Data privacy alignment — Anthropic's approach to reducing cyber capabilities and adding safeguards aligns with the PDPA principles we help clients implement.


Conclusion

Opus 4.7 doesn't win everything — but it wins where it matters most for software development and enterprise AI.

The ranking:

  • #1 for coding (SWE-bench Pro, CursorBench) — leads all publicly available models
  • #1 for agentic tasks (MCP-Atlas, Finance Agent) — clear points ahead of GPT 5.4
  • Loses on web research (BrowseComp) — honest truth, 10-point gap
  • Reasoning is saturated — all models at 94%+, no meaningful difference
  • Mythos still leads by far — the real ceiling is much higher

For software development teams, the recommendation is straightforward: switch now. Same price, better coding performance across the board (except a small regression in web browsing).

For teams using AI across varied tasks, consider a hybrid approach: Opus 4.7 for coding and agentic work, GPT 5.4 for web research, Gemini 3.1 Pro for multilingual tasks where price is a factor.

Because in 2026, the right question isn't "which model is the best?" — it's "which model is the best for this job?"

