The best AI coding agents in 2026 — ranked, live
Eight tools, four trade-offs, and one honest admission: no single agent wins every job. Ranked from real benchmarks and merge rates as of May 2026.
The benchmark question moves quickly. Claude Opus 4.7 leads SWE-bench Verified at 87.6%, and the unreleased Mythos Preview is closer to 94. That is the ceiling of what these tools can do, not the floor. The same model drops thirty points on SWE-bench Pro, which Scale AI rebuilds every month to keep the test set fresh.
So the ranking below weighs four things, not one: Verified scores, Pro scores, real-world PR merge rates, and how much the surrounding scaffolding lifts or drags the underlying model. In February three frameworks pointing at the same Anthropic model finished 17 issues apart on the same 731-task evaluation. The wrapper does work.
1. Claude Code
Top of the list on raw capability. Opus 4.6 underneath; 80.8% on Verified; 200K standard context with a 1M-token beta for the largest refactors. The number that surprised me most: identical-task runs show Claude Code using 33K tokens where Cursor uses 188K. Roughly 5.7 times fewer.
The downside is the bill. Heavy daily use lands at $150–200/month on the Max plan, and the rate limit still bites at that tier. The pattern that keeps showing up on r/ClaudeCode: Cursor for daily flow, Claude Code for the problems Cursor cannot finish. Run them together rather than choose.
2. Cursor
Cursor sits where you actually work. $1.2B ARR, north of a million daily users, Composer for multi-file edits, and the Supermaven Tab completion model still under 100ms. The agent itself scores 67.2% on Verified. The IDE polish is what people pay for, not the agent.
The June 2025 switch to credit-based billing damaged trust that Cursor has not yet rebuilt. Pro+ at $60 and Ultra at $200 still read as opaque to anyone tracking actual usage. Keep an eye on the token meter and you'll be fine.
3. OpenAI Codex
The unexpected comeback story. GPT-5.3-Codex is at 77.3% on Terminal-Bench 2.0 and leads SWE-bench Pro at 56.8%; the Spark variant runs on Cerebras WSE-3 silicon at over a thousand tokens a second. The cloud-task-runner pattern is the part that's new: write a spec, get back a sandboxed PR an hour later.
A year ago Codex did not exist. Today it has roughly 60% of Cursor's install base. The Rust CLI is open source under Apache-2.0, sitting at 62k stars. Pricing rides the ChatGPT Plus or Pro message budget rather than charging per call.
4. Devin
The most autonomous tool here, on its best days. Cognition reports a 67% PR merge rate on well-defined tasks, and Goldman Sachs is running Devin across its 12,000-engineer team with a claimed 3–4× productivity gain. Treat the productivity number with caution; the merge rate is the one I trust more.
The pricing dropped from $500/month to a $20 base plus $2.25 per Agent Compute Unit, which makes Devin a tool you can experiment with before committing. Bug backlogs, schema migrations, and repetitive feature work are where it shines. Open-ended product design is where it still loses.
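That metered model is easy to price out before committing. A minimal sketch using the rates above ($20 base, $2.25 per ACU); the function name and the per-task ACU figures are illustrative, since ACU consumption varies by task:

```python
def devin_monthly_cost(acus_used: float, base: float = 20.00, per_acu: float = 2.25) -> float:
    """Monthly bill under the metered plan: flat base fee plus ACUs consumed."""
    return base + per_acu * acus_used

# A month of backlog work: 30 small bug-fix tasks at ~4 ACUs each (illustrative).
print(devin_monthly_cost(30 * 4))  # 290.0, versus the old $500 flat rate
```

Even at heavy usage the bill stays legible: one base fee, one metered rate, no credit abstraction in between.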
> Three different agents pointing at the same Anthropic model finished 17 issues apart on the same 731-task evaluation. Architecture is doing real work.
5. Cline
Cline is the open-source headline at 61.3k stars and 5M installs, with bring-your-own-model across more than 30 providers. The Plan/Act split is the cleanest oversight pattern in the category: Plan reads files and reasons about them, Act is what touches disk. You approve the boundary.
You pay only for tokens, which puts a typical feature at $0.50–$2.00 against Sonnet 4.7. The April 2026 spend-limit UI is the small change that mattered most: agents quietly draining your account in a runaway loop is no longer a failure mode you have to worry about.
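That per-token billing is simple to reason about. A minimal sketch of the arithmetic behind the $0.50–$2.00 figure, assuming illustrative Sonnet-class rates of $3 per million input tokens and $15 per million output tokens; your provider's actual rates will differ:

```python
def token_cost(input_tokens: int, output_tokens: int,
               in_per_million: float = 3.00, out_per_million: float = 15.00) -> float:
    """Bring-your-own-model billing: cost is just tokens times the provider's rates."""
    return (input_tokens / 1_000_000 * in_per_million
            + output_tokens / 1_000_000 * out_per_million)

# A mid-sized feature: ~150K tokens read in Plan mode, ~30K written in Act mode
# (volumes are illustrative).
print(f"${token_cost(150_000, 30_000):.2f}")  # $0.90, inside the quoted band
```

The spend-limit UI sits on top of exactly this arithmetic: a hard cap on the running total rather than a per-request approval.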
6. Aider
Forty-plus thousand stars and survival across several model cycles have given Aider an unusual property: every change becomes a real git commit you can read or revert. The repo map, built with tree-sitter, scales to large codebases. The interface is a terminal prompt and your diff.
For unfamiliar repos and refactors that need to be reviewable afterwards, this is the safest tool in the list. It is also the dullest one to demo, which is part of why it works.
7. Sourcegraph Amp
Cody Free and Cody Pro retired in July 2025; Cody Enterprise continues at $59 per user per month. Amp is the agentic successor, shipping in terminal, VS Code, Cursor, JetBrains and Neovim, with indexing across 300,000+ repositories under SOC 2 Type II and ISO 27001:2022.
If procurement requires audit trails and your codebase is measured in hundreds of microservices, this is the only tool here built for that scale. Smaller teams will find the price hard to justify.
8. Replit Agent 3
Replit raised $250M at a $3B valuation in January, and Agent 3 is the reason. Two-hundred-minute autonomous sessions, database provisioning, full-stack scaffolding with auth, and a public URL one click after the model finishes. For getting something in front of a stakeholder before lunch, nothing else is close.
For regulated production code on an existing codebase: the wrong tool. That is fine. Replit is not pretending otherwise.
Trade-offs by job
| Job | Default pick | Why |
|---|---|---|
| Greenfield apps | Replit Agent 3 | Browser-first, deploys in one click. |
| Multi-file refactors | Claude Code | Long context wins; ~5.7× fewer tokens than Cursor on identical work. |
| AI code review | Augment Code · Continue | 70% win rate against Copilot in head-to-head; CI checks defined as code. |
| Human-in-the-loop | Cline · Aider | Plan/Act split or per-commit diffs make every step inspectable. |
| Async delegation | Devin | End-state-only review for issues with verifiable success criteria. |
What to default to, by team shape
- Forty dollars before usage: Cursor for the editor, Claude Code for the hard problems Cursor can't finish.
- Two interactive agents and one async: wire Devin into Linear or Jira and let the bug backlog work itself down overnight.
- Audit posture and codebase indexing as defaults: add Devin where async ROI is clearly verifiable.
How CODE-25 ranks them right now
The list above is a snapshot of how the editors stack up. CODE-25 is the moving picture: the top 25 admitted agents carrying a code-generation tag — the underlying engines that builders point harnesses at, not the editors wrapping them. Equal-weight v1 rebalances Mondays at 03:00 UTC. The numbers below are the index as of 5 May 2026; the live tape is one click away.
| # | Engine | Score |
|---|---|---|
| 1 | Gemini 2.5 Pro Preview 05-06 | 76.4 |
| 2 | GPT-5.3-Codex | 68.8 |
| 3 | GPT-5 | 68.2 |
| 4 | everything-claude-code | 66.0 |
| 5 | dify | 63.1 |
| 6 | hermes-agent | 56.8 |
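The equal-weight v1 rule is the simplest possible index construction: at each Monday rebalance, every admitted constituent gets the same weight, 1/N. A minimal sketch of that mechanic using the truncated six-entry snapshot above; the live basket holds 25, and everything beyond the 1/N rule is an assumption, since the article doesn't publish the index formula:

```python
def equal_weight(constituents: list[str]) -> dict[str, float]:
    """Equal-weight v1: every admitted agent carries the same weight after a rebalance."""
    w = 1.0 / len(constituents)
    return {name: w for name in constituents}

# The six entries shown in the snapshot (the full basket holds 25).
basket = ["Gemini 2.5 Pro Preview 05-06", "GPT-5.3-Codex", "GPT-5",
          "everything-claude-code", "dify", "hermes-agent"]
weights = equal_weight(basket)
print(weights["dify"])  # 1/6 here; 1/25 = 0.04 in the full index
```

Equal weighting means a Monday admission or removal reprices every constituent at once, which is why the basket composition matters more than any single score.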
The ranking above tells you what to install on Monday. CODE-25 tells you what's moving by Friday. Both are useful; neither is the whole answer.
The CODE-25, tracked daily
Stars, benchmarks, mentions and merge rates all shift week to week. The CODE-25 index is the live tape: which coding engine is gaining momentum on AgentTape this week, which has stalled, and which just entered the basket on Monday's rebalance.
View the CODE-25