The best foundation models for AI agents in 2026
Eight models running agent workloads in production, ranked on the three failure modes human reviewers miss: brittle tool calls, context that rots past 80K, and unit cost at ten million runs.
The most important number on every flagship spec sheet is the context window. It is also the number that tells you the least about how the model will behave inside an agent. A 1M-token window with recall that collapses past 120K is a worse tool than a clean 200K window. The first thing to verify on any model you're evaluating is not the headline length — it's the depth at which the recall curve falls off.
Tool reliability. BFCL v4 is the headline. The number that matters more is whether the model still calls tools correctly on turn 40 of a multi-hour run. Most don't.
Context rot. Run a needle-in-a-haystack probe at 50K, 120K, 200K, and 500K. The cliff is usually well before the documented window; a minimal probe is sketched just below.
Unit cost. Per-million-token pricing matters less than turns-to-success. A cheaper model that needs three retries isn't cheaper; the arithmetic below this list makes it concrete.
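A minimal version of that depth probe, assuming an OpenAI-compatible endpoint. The client, the model ID, and the ten-tokens-per-sentence estimate are all placeholders; swap in whatever you're evaluating.

```python
import uuid
from openai import OpenAI

client = OpenAI()
MODEL = "model-under-test"                                  # placeholder ID
FILLER = "The quick brown fox jumps over the lazy dog. "    # ~10 tokens each

def recall_at_depth(depth_tokens: int) -> bool:
    """Bury a unique fact `depth_tokens` deep in filler and ask for it back."""
    secret = uuid.uuid4().hex[:8]
    haystack = (FILLER * (depth_tokens // 10)
                + f"The deployment password is {secret}. "
                + FILLER * 50)
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": haystack + "\nWhat is the deployment password? "
                                         "Reply with the password only."}],
    )
    return secret in resp.choices[0].message.content

for depth in (50_000, 120_000, 200_000, 500_000):
    try:
        print(f"{depth:>7}: {'recalled' if recall_at_depth(depth) else 'MISSED'}")
    except Exception as exc:          # usually a window-exceeded rejection
        print(f"{depth:>7}: rejected ({exc})")
```

Where the MISSED lines start is the number to write down, not the spec-sheet window.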
Eight models that actually run agent workloads in production today, ordered against those three axes. Numbers come from public leaderboards (BFCL v4, SWE-bench Verified, MMLU-Pro, Open LLM) plus published price sheets. Vendor self-reports are flagged where they appear, not folded in.
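Turns-to-success is worth writing down as arithmetic before any of the entries below. A sketch with illustrative numbers: the 8K-in/1K-out turn size and the retry counts are assumptions, the prices are the sheets quoted later.

```python
def cost_per_success(in_price, out_price, in_tok=8_000, out_tok=1_000,
                     turns=12, success_rate=0.9):
    """Expected $ per completed task: per-turn cost x turns, divided by the
    fraction of runs that finish. Prices are $ per 1M tokens."""
    per_turn = (in_tok * in_price + out_tok * out_price) / 1e6
    return per_turn * turns / success_rate

# A $5/$25 model finishing in 12 turns vs. a half-price model that needs
# three attempts per step -- illustrative, not measured.
print(f"${cost_per_success(5.00, 25.00):.2f}")            # ~$0.87 per success
print(f"${cost_per_success(2.50, 15.00, turns=36):.2f}")  # ~$1.40 per success
```

Cheaper tokens, pricier task. Measure your own retry rate before trusting a price sheet.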
Claude Opus 4.7
Anthropic · Top pick for agents
The model that has held its agent-harness lead since February. 87.6% on SWE-bench Verified. The reason it stays at the top isn't the headline number (Codex is within three points); it's what happens after a tool call fails. Opus tries another path. Most others retry the same call with a small variation, fail again, then drift. Routing the routine 70–80% of steps to Sonnet 4.6 and reserving Opus for the recovery is the standard production pattern at this point; a sketch of that split follows below.
Best for: Multi-hour runs that have to recover from their own mistakes. Code edits that span ten or more files.
Watch out: $5/$25 per million in/out, roughly twice GPT-5.4. 200K context, not 1M; the 1M beta exists but isn't on the standard tier.
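The Sonnet-first split described above, as a sketch. The model IDs follow the article's naming rather than any guaranteed API string, and `execute_tool_calls` is a stand-in for your harness's executor; none of this is Anthropic's published pattern.

```python
import anthropic

client = anthropic.Anthropic()
ROUTINE, RECOVERY = "claude-sonnet-4.6", "claude-opus-4.7"  # illustrative IDs

def execute_tool_calls(resp) -> bool:
    """Stand-in for your harness: run the tool calls, return True on failure."""
    raise NotImplementedError

def agent_turn(messages: list, tools: list):
    """Sonnet handles the routine step; Opus is pulled in only for recovery."""
    resp = client.messages.create(model=ROUTINE, max_tokens=2048,
                                  messages=messages, tools=tools)
    if execute_tool_calls(resp):
        # Escalate the recovery, not the whole run: Opus sees the failed
        # transcript and can pick a different path rather than retry blindly.
        resp = client.messages.create(model=RECOVERY, max_tokens=2048,
                                      messages=messages, tools=tools)
    return resp
```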
GPT-5.4 / GPT-5.3 Codex
OpenAI
The cost-quality default. GPT-5.4 at $2.50/$15 sits at roughly half Opus pricing for general work; GPT-5.3 Codex hits 85% on SWE-bench Verified and currently leads SWE-bench Pro at 56.8%. The function-calling schema remains the cleanest in the industry: every framework targets it first, which means community-tested tool definitions are easier to find for OpenAI than for anyone else (an example follows below). Drift past about fifty sequential tool calls is the known failure mode.
Best for: Anything where you'd otherwise default to Opus and want to halve the bill. Single-turn tool calls. Broad framework support.
Watch out: Drift on long sequential runs. The 270K context cap is generous but still well short of Gemini's 2M.
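For reference, the format the entry above praises: a JSON-Schema tool definition attached to each request. The schema shape is OpenAI's current Chat Completions tools format; the model ID is the article's, not a guaranteed API string.

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_issues",
        "description": "Search the issue tracker and return matching IDs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Full-text query."},
                "limit": {"type": "integer", "minimum": 1, "maximum": 50},
            },
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-5.4",            # illustrative; use your deployed model ID
    messages=[{"role": "user", "content": "Find open bugs about token drift"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```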
Gemini 3.1 Pro
Google
Two million tokens of usable context (the largest production window any Tier-1 lab is shipping), plus a 90% cache discount on repeated prompts that changes RAG economics in Gemini's favour; a back-of-envelope comparison follows below. The model is genuinely strong on long-document and long-video reasoning. SWE-bench numbers around 78% trail the top two, and function-calling has more sharp edges than OpenAI's: edge cases around streaming and parallel calls are where most production teams hit their first surprise.
Best for: RAG over codebases that don't fit elsewhere. Hour-long video. Document corpora at the scale where chunking starts to lose meaning.
Watch out: Coding agent quality trails Claude and GPT. Function-calling reliability is uneven enough to break some frameworks without retries.
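The cache economics, back-of-envelope. The input price is a placeholder (the article doesn't quote Gemini's rate) and cache-storage fees are ignored; the shape of the saving is the point.

```python
P = 1.0                      # $ per 1M input tokens -- placeholder rate
corpus, question = 1_500_000, 2_000

cold   = (corpus + question) / 1e6 * P          # full price on every query
cached = (corpus * 0.10 + question) / 1e6 * P   # 90% off the cached span
print(f"cold:   ${cold:.3f}/query")    # $1.502 at P = $1
print(f"cached: ${cached:.3f}/query")  # $0.152 -- roughly 10x cheaper
```

At query volume, that discount, not the raw per-token rate, is what tips RAG economics toward the 2M window.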
Kimi K2.6
Moonshot AI · open weights
The open-weights frontier challenger. Trillion-parameter MoE with ~32B active, trained with PARL to coordinate large numbers of tool calls across self-spawned sub-agents; the fan-out pattern is sketched below. Multi-tool runs lasting many hours without context collapse have been documented in the community, and that endurance is what earns it this rank, not the per-turn benchmarks (where it still trails the closed labs by a few points on BFCL v4). Self-hosting needs serious GPU. Vision is below Gemini.
Best for: Parallel agent swarms. Polyglot codebases. Frontier-class capability inside your own infrastructure.
Watch out: BFCL v4 trails the closed labs by 4–6 points. Hosted latency on community endpoints is uneven. Self-hosting cost is real.
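The sub-agent fan-out K2.6 is trained for, reduced to a sketch. `call_kimi` is a stub for whatever endpoint you host; none of this is Moonshot's API.

```python
import asyncio

async def call_kimi(prompt: str) -> str:
    # Stub: POST `prompt` to your hosted or self-hosted endpoint.
    return f"[stub response to: {prompt[:40]}...]"

async def run_swarm(task: str, shards: list[str]) -> str:
    """Planner splits the task, sub-agents run in parallel, planner merges."""
    plan = await call_kimi(f"Split this task across {len(shards)} workers: {task}")
    results = await asyncio.gather(
        *(call_kimi(f"{plan}\n\nYour shard: {shard}") for shard in shards)
    )
    return await call_kimi("Merge these worker results:\n" + "\n---\n".join(results))

print(asyncio.run(run_swarm("audit the repo", ["src/", "tests/", "docs/"])))
```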
Spec-sheet context is one number. Effective recall past 200K tokens is a different number. The first one sells; the second one ships.
DeepSeek V4
DeepSeek · open weights
The cheapest frontier-class reasoner per token by a wide margin. 80.6% on SWE-bench Verified, a genuine 1M context, and a reasoning-per-dollar number nothing else gets close to. The tradeoff is latency: hosted endpoints run two to three times slower than Claude on short tool loops, and self-hosting a trillion-parameter model is a project, not an evening's work.
Best for: Math, science, and dense reasoning where the bottleneck is the thinking, not the tool ergonomics. Background batch work where latency doesn't matter.
Watch out: Latency hurts on user-facing agents. Tool-use reliability is below the closed labs. Self-hosting GPU footprint is significant.
Qwen 3.6 Plus
Alibaba · open weights
The multilingual default and the open-weights long-context story. Qwen 3.6 ships a 1M window as actual open weights, dominates Chinese-language coding, and Apache 2.0 covers the smaller variants. The flagship uses the Tongyi Qianwen licence rather than Apache, which matters if your plan involves fine-tuning the flagship rather than the smaller siblings; read the licence before you commit a training run.
Best for: Multilingual agents. Long-context open-weights work where you need to fine-tune. Chinese-language codebases.
Watch out: English reasoning trails DeepSeek V4. Flagship licensing isn't Apache. Tooling ecosystem is narrower than Llama's.
Llama 4 Maverick
Meta · open weights
Pick this when fine-tuning rights and ecosystem maturity outrank peak agent capability. Every framework targets Llama first; every cloud has a Llama endpoint; every fine-tuning library assumes its tokenizer. On the specific question of agent benchmarks, Kimi K2.6 and DeepSeek V4 have moved past it. On the question of "what will my fine-tuning team have working by Friday", Llama still wins.
Best for: Enterprise fine-tuning programmes. The most permissive licence in the open-weights tier. Tooling depth.
Watch out: Agent benchmarks trail the open-weights leaders. Less interesting if you're not fine-tuning.
Mistral Medium 3.5
Mistral · open weights
77.6% on SWE-bench Verified from a 128B dense model: efficient per parameter, Apache 2.0, and the only frontier-tier lab outside the US/China duopoly. If EU data sovereignty matters to your buyer, the conversation tends to start and end here. Long context past 128K is unreliable in our testing and in the community threads, so size your RAG accordingly; a budget guard is sketched below.
Best for: EU data sovereignty stories. Dense efficiency. Apache fine-tuning without the US-China geopolitics conversation.
Watch out: BFCL v4 below Qwen and the closed labs. Long context above 128K isn't production-grade.
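"Size your RAG accordingly" in code: cap retrieved context below the point where recall degrades. The 110K budget and the injected `count_tokens` are assumptions, not Mistral guidance.

```python
BUDGET = 110_000   # headroom under the ~128K reliability line noted above

def pack_context(ranked_chunks: list[str], count_tokens) -> list[str]:
    """Greedily keep the highest-ranked chunks that fit under the budget."""
    packed, used = [], 0
    for chunk in ranked_chunks:          # assumed pre-sorted by relevance
        tokens = count_tokens(chunk)
        if used + tokens > BUDGET:
            break
        packed.append(chunk)
        used += tokens
    return packed
```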
How to choose
Claude Opus 4.7. The recovery behaviour is the differentiator. Pair with Sonnet 4.6 for routine steps to halve cost.
GPT-5.4. Right answer for most teams not bottlenecked on the hardest tasks.
Gemini 3.1 Pro. 2M tokens, the cache discount, and RAG-friendly economics. Verify function-calling reliability on a real harness first.
Kimi K2.6 for long-running orchestration. DeepSeek V4 for reasoning per dollar. Llama 4 if fine-tuning is the actual job.
The teams getting the most out of agents in 2026 aren't picking one model. The pattern that works: route hard reasoning to Opus 4.7, bulk tool calls to Sonnet 4.6 or GPT-5.4, RAG-over-corpus to Gemini, and long-running orchestration to Kimi K2.6. Single-model deployments leave a meaningful chunk of the available quality on the table — how big a chunk depends entirely on how mixed your workload is, and you'll only know once you've measured.
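That pattern, reduced to its core: routing is a lookup, not a framework. The step labels and model IDs here are illustrative, not API strings.

```python
ROUTES = {
    "hard_reasoning": "claude-opus-4.7",
    "tool_call":      "claude-sonnet-4.6",   # or gpt-5.4
    "rag_query":      "gemini-3.1-pro",
    "orchestration":  "kimi-k2.6",
}

def pick_model(step_kind: str) -> str:
    return ROUTES.get(step_kind, ROUTES["tool_call"])  # cheap default
```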
The FM-50, tracked daily
Foundation models reshuffle every time a new release lands or a benchmark refreshes. The FM-50 is the live tape — context windows, pricing, modality, and Open LLM Leaderboard scores ranked into one composite that moves with the field.