The best AI coding agents in 2026 — ranked, live
Eight tools, four trade-offs, and one honest admission: no single agent wins every job. Ranked from real benchmarks and merge rates as of May 2026.
The benchmark question moves quickly. Claude Opus 4.7 leads SWE-bench Verified at 87.6%, and the unreleased Mythos Preview is closer to 94. That is the ceiling of what these tools can do, not the floor. The same model drops thirty points on SWE-bench Pro, which Scale AI rebuilds every month to keep the test set fresh.
So the ranking below weighs four things, not one: Verified scores, Pro scores, real-world PR merge rates, and how much the surrounding scaffolding lifts or drags the underlying model. In February three frameworks pointing at the same Anthropic model finished 17 issues apart on the same 731-task evaluation. The wrapper does work.
1. Claude Code
Top of the list on raw capability. Opus 4.6 underneath; 80.8% on Verified; 200K standard context with a 1M-token beta for the largest refactors. The number that surprised me most: identical-task runs show Claude Code using 33K tokens where Cursor uses 188K. Roughly 5.7 times fewer.
The downside is the bill. Heavy daily use lands at $150–200/month on the Max plan, and the rate limit still bites at that tier. The pattern that keeps showing up on r/ClaudeCode: Cursor for daily flow, Claude Code for the problems Cursor cannot finish. Run them together rather than choose.
2. Cursor
Cursor sits where you actually work. $1.2B ARR, north of a million daily users, Composer for multi-file edits, and the Supermaven Tab completion model still under 100ms. The agent itself scores 67.2% on Verified. The IDE polish is what people pay for, not the agent.
The June 2025 switch to credit-based billing damaged trust that Cursor has not yet rebuilt. Pro+ at $60 and Ultra at $200 still read as opaque to anyone tracking actual usage. Keep an eye on the token meter and you'll be fine.
3. OpenAI Codex
The unexpected comeback story. GPT-5.3-Codex is at 77.3% on Terminal-Bench 2.0 and leads SWE-bench Pro at 56.8%; the Spark variant runs on Cerebras WSE-3 silicon at over a thousand tokens a second. The cloud-task-runner pattern is the part that's new: write a spec, get back a sandboxed PR an hour later.
A year ago Codex did not exist. Today it has roughly 60% of Cursor's install base. The Rust CLI is open source under Apache-2.0, sitting at 62k stars. Pricing rides the ChatGPT Plus or Pro message budget rather than charging per call.
4. Devin
The most autonomous tool here, on its best days. Cognition reports a 67% PR merge rate on well-defined tasks, and Goldman Sachs is running Devin across its 12,000-engineer team with a claimed 3–4× productivity gain. Treat the productivity number with caution; the merge rate is the one I trust more.
The pricing dropped from $500/month to a $20 base plus $2.25 per Agent Compute Unit, which makes Devin a tool you can experiment with before committing. Bug backlogs, schema migrations, and repetitive feature work are where it shines. Open-ended product design is where it still loses.
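That metered model is easy to price out before committing. A minimal sketch using the rates above ($20 base, $2.25 per ACU); the function name and the per-task ACU figures are illustrative, since ACU consumption varies by task:

```python
def devin_monthly_cost(acus_used: float, base: float = 20.00, per_acu: float = 2.25) -> float:
    """Monthly bill under the metered plan: flat base fee plus ACUs consumed."""
    return base + per_acu * acus_used

# A month of backlog work: 30 small bug-fix tasks at ~4 ACUs each (illustrative).
print(devin_monthly_cost(30 * 4))  # 290.0, versus the old $500 flat rate
```

Even at heavy usage the bill stays legible: one base fee, one metered rate, no credit abstraction in between.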
> Three different agents pointing at the same Anthropic model finished 17 issues apart on the same 731-task evaluation. Architecture is doing real work.
5. Cline
Cline is the open-source headline at 61.3k stars and 5M installs, with bring-your-own-model across more than 30 providers. The Plan/Act split is the cleanest oversight pattern in the category: Plan reads files and reasons about them, Act is what touches disk. You approve the boundary.
You pay only for tokens, which puts a typical feature at $0.50–$2.00 against Sonnet 4.7. The April 2026 spend-limit UI is the small change that mattered most: agents quietly draining your account in a runaway loop is no longer a failure mode you have to worry about.
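That per-token billing is simple to reason about. A minimal sketch of the arithmetic behind the $0.50–$2.00 figure, assuming illustrative Sonnet-class rates of $3 per million input tokens and $15 per million output tokens; your provider's actual rates will differ:

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_per_million: float = 3.00, out_per_million: float = 15.00) -> float:
    """Bring-your-own-model billing: cost is just tokens times the provider's rates."""
    return (input_tokens / 1_000_000 * in_per_million
            + output_tokens / 1_000_000 * out_per_million)

# A mid-sized feature: ~150K tokens read in Plan mode, ~30K written in Act mode
# (volumes are illustrative).
print(f"${token_cost(150_000, 30_000):.2f}")  # $0.90, inside the quoted band
```

The spend-limit UI sits on top of exactly this arithmetic: a hard cap on the running total rather than a per-request approval.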
6. Aider
Forty-plus thousand stars and survival across several model cycles have given Aider an unusual property: every change becomes a real git commit you can read or revert. The repo map, built with tree-sitter, scales to large codebases. The interface is a terminal prompt and your diff.
For unfamiliar repos and refactors that need to be reviewable afterwards, this is the safest tool in the list. It is also the dullest one to demo, which is part of why it works.
7. Sourcegraph Amp
Cody Free and Cody Pro retired in July 2025; Cody Enterprise continues at $59 per user per month. Amp is the agentic successor, shipping in terminal, VS Code, Cursor, JetBrains and Neovim, with indexing across 300,000+ repositories under SOC 2 Type II and ISO 27001:2022.
If procurement requires audit trails and your codebase is measured in hundreds of microservices, this is the only tool here built for that scale. Smaller teams will find the price hard to justify.
8. Replit Agent 3
Replit raised $250M at a $3B valuation in January, and Agent 3 is the reason. Two-hundred-minute autonomous sessions, database provisioning, full-stack scaffolding with auth, and a public URL one click after the model finishes. For getting something in front of a stakeholder before lunch, nothing else is close.
For regulated production code on an existing codebase: the wrong tool. That is fine. Replit is not pretending otherwise.
Trade-offs by job
| Job | Default pick | Why |
|---|---|---|
| Greenfield apps | Replit Agent 3 | Browser-first, deploys in one click. |
| Multi-file refactors | Claude Code | Long context wins; ~5.7× fewer tokens than Cursor on identical work. |
| AI code review | Augment Code · Continue | 70% win rate against Copilot in head-to-head; CI checks defined as code. |
| Human-in-the-loop | Cline · Aider | Plan/Act split or per-commit diffs make every step inspectable. |
| Async delegation | Devin | End-state-only review for issues with verifiable success criteria. |
What to default to, by team shape
- Forty dollars before usage: Cursor for the editor, Claude Code for the hard problems Cursor can't finish.
- Two interactive agents and one async: wire Devin into Linear or Jira and let the bug backlog work itself down overnight.
- Audit posture and codebase indexing as defaults: add Devin where async ROI is clearly verifiable.
How CODE-25 ranks them right now
The list above is a snapshot of how the editors stack up. CODE-25 is the moving picture: the top 25 admitted agents carrying a code-generation tag — the underlying engines that builders point harnesses at, not the editors wrapping them. Equal-weight v1 rebalances Mondays at 03:00 UTC. The numbers below are the index as of 5 May 2026; the live tape is one click away.
| # | Engine | Score |
|---|---|---|
| 1 | Gemini 2.5 Pro Preview 05-06 | 76.4 |
| 2 | GPT-5.3-Codex | 68.8 |
| 3 | GPT-5 | 68.2 |
| 4 | everything-claude-code | 66.0 |
| 5 | dify | 63.1 |
| 6 | hermes-agent | 56.8 |
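The equal-weight v1 rule is the simplest possible index construction: at each Monday rebalance, every admitted constituent gets the same weight, 1/N. A minimal sketch of that mechanic using the truncated six-entry snapshot above; the live basket holds 25, and everything beyond the 1/N rule is an assumption, since the article doesn't publish the index formula:

```python
def equal_weight(constituents: list[str]) -> dict[str, float]:
    """Equal-weight v1: every admitted agent carries the same weight after a rebalance."""
    w = 1.0 / len(constituents)
    return {name: w for name in constituents}

# The six entries shown in the snapshot (the full basket holds 25).
basket = ["Gemini 2.5 Pro Preview 05-06", "GPT-5.3-Codex", "GPT-5",
          "everything-claude-code", "dify", "hermes-agent"]
weights = equal_weight(basket)
print(weights["dify"])  # 1/6 here; 1/25 = 0.04 in the full index
```

Equal weighting means a Monday admission or removal reprices every constituent at once, which is why the basket composition matters more than any single score.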
The ranking above tells you what to install on Monday. CODE-25 tells you what's moving by Friday. Both are useful; neither is the whole answer.
The CODE-25, tracked daily
Stars, benchmarks, mentions and merge rates all shift week to week. The CODE-25 index is the live tape: which coding engine is gaining momentum on AgentTape this week, which has stalled, and which just entered the basket on Monday's rebalance.
View the CODE-25