How AgentTape works.
The index is autonomously populated by software that watches the AI ecosystem and admits things on the day they start to matter. No curated seed list. Every input published.
Every change to scoring, weights, or index rules ships as a commit — see the methodology changelog.
1. Discovery — how agents enter the index
Nothing on AgentTape was added by hand. A discovery service sweeps a fixed list of public sources on its own schedule and opens a candidate row for anything matching an AI-agent or foundation-model pattern. A second pass scores each candidate and either admits it, rejects it, or leaves it pending for weekly review.
Sources swept: GitHub search (repos matching agent-frame patterns), Hugging Face trending, OpenRouter models catalogue (every published foundation model), MCP registries, npm and PyPI, arXiv, and the Hacker News firehose. Each source ingests on its own cadence — none of them gate on the others.
Promotion scores each candidate on five axes: LLM dependency, agent-vocabulary match, popularity floor, maintenance, and packaged distribution. Four small substance bonuses (each capped at 0.05) tip borderline cases — declared topics, description length, multi-ecosystem packaging, recent commit activity. Archived or disabled GitHub repos are hard-rejected.
2. Refresh tiers — how often signals update
Signals refresh on two cadences. The split balances ticker freshness against upstream rate limits — anything hitting a benchmark site or a slow JSON API runs once a day; everything else runs hourly.
| Tier | Cadence | What runs |
|---|---|---|
| Fast | ~1 hour | arXiv citations · Bluesky / HN / Mastodon / Reddit mentions · Crates.io downloads · Docker pulls · GitHub stars, forks, contributors, commits, mentions · HF downloads, likes, trending rank · MCP registry · npm weekly · Product Hunt · PyPI monthly · Stack Overflow questions. Drives the live ticker; inserts are deduped per signal so only changed values write. |
| Slow | daily | Artificial Analysis API (10+ canonical FM benchmarks, cost, speed) · FM leaderboards (lmarena, SWE-bench, MMLU-Pro, Open LLM) · llm-stats per-benchmark pages · Discord members · GitHub releases (90d) · GitHub issue close-rate · GitHub first-response hours · GitHub repos using model · Google Trends · Tech-news mentions (GDELT) · Wikipedia views. |
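The per-signal dedupe on the fast tier can be pictured with a minimal sketch. It assumes an in-memory store keyed by (agent, signal); the class name `SignalStore` and its methods are hypothetical and not AgentTape's actual storage layer.

```python
from typing import Dict, List, Tuple


class SignalStore:
    """Hypothetical store that only writes a row when a signal's value changes."""

    def __init__(self) -> None:
        # Last value written for each (agent_id, signal) pair.
        self._latest: Dict[Tuple[str, str], float] = {}
        # Append-only log of written rows: (agent_id, signal, value, timestamp).
        self.rows: List[Tuple[str, str, float, int]] = []

    def write(self, agent_id: str, signal: str, value: float, ts: int) -> bool:
        """Return True if a row was written, False if the value was unchanged."""
        key = (agent_id, signal)
        if self._latest.get(key) == value:
            return False  # same reading as last sweep: skip the insert
        self._latest[key] = value
        self.rows.append((agent_id, signal, value, ts))
        return True
```

Under this sketch, two consecutive hourly sweeps that read the same star count produce one row, not two.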
3. The AgentScore
One number, 0-100, computed as a weighted sum of pillar scores. Applications have four pillars; foundation models have five (Efficiency adds cost + speed, which doesn't apply to apps that run on user hardware).
Weights differ by entity kind because the question "what makes a coding tool good" is not the question "what makes a foundation model good". A 70 for an app and a 70 for an FM are not directly comparable — use the per-kind boards (Models, Sectors) when you want a fair comparison.
Applications
- 0.40 Adoption
- 0.20 Quality
- 0.10 Momentum
- 0.30 Community
Foundation models
- 0.25 Adoption
- 0.35 Quality
- 0.20 Efficiency
- 0.10 Momentum
- 0.10 Community
A pillar with no signals contributes zero to the headline (no redistribution, no re-normalisation). Less data is a lower ceiling, not a re-weighted average: the ceiling is 100 times the sum of the rated pillars' weights, so a 3-pillar FM caps at 80 at best, a 4-pillar one at 90 at best, and only a fully rated model can reach 100. That keeps coverage honest. The Models board lets you click any column header to re-sort by that single pillar.
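A minimal sketch of the weighted sum with missing pillars, using the weights listed above. The dictionary keys and the function name `headline_score` are illustrative assumptions, not names taken from the AgentTape codebase.

```python
from typing import Dict, Mapping, Optional

# Weights as published above; the key names are assumptions made for this sketch.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "application": {
        "adoption": 0.40, "quality": 0.20, "momentum": 0.10, "community": 0.30,
    },
    "foundation_model": {
        "adoption": 0.25, "quality": 0.35, "efficiency": 0.20,
        "momentum": 0.10, "community": 0.10,
    },
}


def headline_score(kind: str, pillars: Mapping[str, Optional[float]]) -> float:
    """Weighted sum of 0-100 pillar scores.

    A pillar that is absent or None adds zero; the remaining weights are
    not re-normalised, so missing data lowers the ceiling.
    """
    total = 0.0
    for pillar, weight in WEIGHTS[kind].items():
        score = pillars.get(pillar)
        if score is not None:
            total += weight * score
    return round(total, 2)
```

A foundation model rated 100 on Adoption, Quality and Efficiency, with no Momentum or Community signals, comes out at 80 in this sketch, which matches the ceiling described above.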
4. The pillars
What each pillar answers, and the signals that drive it for each entity kind. Signal lists below are the source of truth — they're co-located with the source code in apps/scoring/compute.py and the page renders straight from that list.
Adoption
Is anyone actually using this?
Applications
Installs, registry presence, real-world distribution. Includes the FM-style breadth signals (GitHub mentions across repos and Wikipedia views) for tools that have become household names.
- Crates.io downloads (90d)
- Docker pulls (30d)
- GitHub mentions (7d)
- GitHub stars
- HF downloads (30d)
- MCP registry listed
- npm weekly
- Product Hunt upvotes
- PyPI monthly
- Stack Overflow questions (7d)
- Tech-news mentions (30d)
- Wikipedia views (30d)
Foundation models
Production traffic and where the model's name shows up across the developer ecosystem. OpenRouter token volume is the closest public proxy for real billable usage.
- Bluesky mentions (7d)
- GitHub mentions (7d)
- GitHub repos using model
- GitHub stars
- HF downloads (30d)
- HN mentions (7d)
- Mastodon mentions (7d)
- OpenRouter token volume (30d)
- Reddit mentions (7d)
- Tech-news mentions (30d)
- Wikipedia views (30d)
Quality
How capable is this on the work that matters?
Applications
Two parts: published benchmark performance plus maintainer responsiveness. The benchmark side is sparse for most application agents. We match where the tool appears in agentic-coding leaderboards (SWE-bench Verified harness entries like "mini-SWE-agent + Claude Opus 4.7" are credited to the harness), in the Galileo and HAL agent leaderboards, or in any benchmark we scrape that names the tool directly. Most tools have no published benchmark and rely entirely on the responsiveness signals.
- Benchmark score (mean of normalised results from agentic-coding leaderboards)
- GitHub first-response hours (30d, inverted)
- GitHub issue close rate (30d)
Foundation models
Mean percentile rank across the canonical FM benchmark suite (SWE-bench Verified, GPQA Diamond, MMLU-Pro, AIME, MMMU, Terminal-Bench Hard, HLE, SciCode, IFBench, lmarena, Open LLM, etc.). Percentile rank is coverage-robust: what matters is consistently beating peers on the benchmarks tested, not the absolute number on a longer-or-shorter list. A minimum of three benchmarks is required for a model to be rated. Below that floor the model stays Unrated rather than carrying a misleading single-source score.
- Benchmark percentile rank across the FM benchmark suite
- Sources: Artificial Analysis API · lmarena-ai HF dataset · Open LLM Leaderboard · SWE-bench · TIGER-Lab MMLU-Pro
Momentum
Is interest in this growing or fading?
Applications
Rate of change on the adoption signals plus release cadence, Google Trends, and academic mindshare via arXiv citation velocity. Flat usage scores 50 and doubling scores 100; under the linear formula in section 5, halving scores 25 and a drop to zero scores 0.
- arXiv citations (7-day ROC)
- Bluesky mentions (7-day ROC)
- GitHub releases (90d)
- GitHub stars (7-day ROC)
- Google Trends
- HF downloads (7-day ROC)
- HN mentions (7-day ROC)
- Mastodon mentions (7-day ROC)
- npm weekly (7-day ROC)
- PyPI monthly (7-day ROC)
- Reddit mentions (7-day ROC)
Foundation models
Same rate-of-change treatment applied to FM-shaped signals. arXiv citation velocity lives here too (academic mindshare is an interest signal, not a capability one).
- arXiv citations (7-day ROC)
- Bluesky mentions (7-day ROC)
- GitHub mentions (7-day ROC)
- Google Trends
- HF downloads (7-day ROC)
- HN mentions (7-day ROC)
- Mastodon mentions (7-day ROC)
- OpenRouter tokens (7-day ROC)
- Reddit mentions (7-day ROC)
Community
Who's engaging beyond just using it?
Applications
Contributors, forks, points and likes: signals of investment, not just consumption. An app with 1k contributors is structurally different from one with 1k downloads.
- Bluesky mentions (7d)
- Discord members
- GitHub contributors
- GitHub forks
- HF likes
- HN points (7d)
- Mastodon mentions (7d)
- Reddit points (7d)
Foundation models
Genuinely sparse for foundation models, especially closed-weight ones. We keep the pillar, but Unrated is the honest answer for most Anthropic and OpenAI flagships: they don't have contributor lists or forks because there's nothing to fork.
- GitHub contributors (open-weight FMs only)
- HF likes
- Reddit points (7d)
Efficiency
How practical is this to ship in production?
Applications
Not used for applications: apps run on the user's hardware, and their cost and speed depend on the model they're configured with, not on the tool itself.
- None (the Application entity kind has no Efficiency pillar)
Foundation models
Cost and speed via the Artificial Analysis API. Blended $/M tokens (input + output, inverse-anchored so cheaper scores higher) and median output tokens/sec. Lets a buyer see that, say, Claude Opus 4.7 and GPT-5.1 are at similar capability but very different price points.
- Blended price (input + output $/M tokens, lower is better)
- Median output tokens/sec
5. Scoring formulas
Three families of formula, applied per signal kind. Each produces a 0-100 contribution; the pillar score is the arithmetic mean of available contributions.
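As a rough illustration of that averaging step, the sketch below assumes per-signal contributions arrive as a mapping in which a missing or excluded signal is `None`. The function name `pillar_score` and the input shape are hypothetical.

```python
from statistics import mean
from typing import Mapping, Optional


def pillar_score(contributions: Mapping[str, Optional[float]]) -> Optional[float]:
    """Arithmetic mean of the available 0-100 contributions.

    Returns None (read: Unrated) when no signal has a value.
    """
    available = [c for c in contributions.values() if c is not None]
    if not available:
        return None
    return mean(available)
```

Averaging only the available contributions means a pillar is not dragged down by signals that simply do not exist for a given agent.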
Counts (most signals)
scaled(v, anchor) = min(100, 50 × log₁₀(v + 1) / log₁₀(anchor + 1))
A log curve, so credit grows with order of magnitude rather than raw count: with an anchor of 1,000, a project with 1,000 stars scores 50 and one with 100,000 stars scores about 83, not 100× the credit. Each anchor is the value at which the signal scores exactly 50, chosen so the median agent in each population lands near the middle of the scale. The anchor table lives in source at apps/scoring/compute.py:ANCHORS.
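The count formula transcribes directly into Python. This is a sketch of the published formula, not the contents of compute.py; the 1,000 anchor used in the usage lines comes from the star example in the text, and the real anchor table may differ.

```python
import math


def scaled(v: float, anchor: float) -> float:
    """Log-scaled 0-100 score: 0 at v = 0, 50 at v = anchor, capped at 100.

    Assumes v >= 0 and anchor > 0.
    """
    return min(100.0, 50.0 * math.log10(v + 1) / math.log10(anchor + 1))


# Usage with an assumed anchor of 1,000 stars.
at_anchor = scaled(1_000, 1_000)      # lands on the midpoint of the scale
far_above = scaled(100_000, 1_000)    # 100x the stars, well under 2x the score
```

Because the curve is capped, any value far enough above the anchor saturates at 100 rather than growing without bound.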
Benchmarks (Quality pillar)
Mean percentile rank across the agent's benchmark coverage. For each benchmark the agent has been scored on, its score is ranked against every other agent on that benchmark and converted to a 0-100 percentile. The pillar score is the mean of those percentiles.
Percentile rank is coverage-robust: a model with 5 benchmarks all at the 95th percentile beats a model with 10 benchmarks averaging the 70th. It's also head-to-head consistent — if A strictly beats B on every shared benchmark, A's mean percentile is ≥ B's. The previous formula (mean of normalised scores) failed this: a model could outrank a strictly-better competitor just by having extra easy benchmarks pulling its mean up.
A coverage floor of three distinct benchmarks is required to rate a model on Quality — below it the model stays Unrated.
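A minimal sketch of mean percentile rank with the three-benchmark floor. The text does not spell out the exact percentile definition, so this assumes "share of other entrants scored strictly lower, with ties counted as half", that higher raw scores are better, and that a sole entrant sits at 50. All function names and the input shape are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

MIN_BENCHMARKS = 3  # coverage floor stated in the text


def percentile(score: float, others: List[float]) -> float:
    """0-100 percentile of `score` against the other entrants (assumed definition)."""
    if not others:
        return 50.0  # assumption: nobody to compare against
    below = sum(1 for o in others if o < score)
    ties = sum(1 for o in others if o == score)
    return 100.0 * (below + 0.5 * ties) / len(others)


def quality_score(agent: str, results: Dict[str, Dict[str, float]]) -> Optional[float]:
    """`results` maps benchmark name -> {agent name -> raw score}.

    Returns the mean percentile, or None (Unrated) below the coverage floor.
    """
    ranks = []
    for scores in results.values():
        if agent in scores:
            others = [s for name, s in scores.items() if name != agent]
            ranks.append(percentile(scores[agent], others))
    if len(ranks) < MIN_BENCHMARKS:
        return None
    return mean(ranks)
```

Because each benchmark is converted to a rank before averaging, raw scores on different scales (percentages, Elo-style ratings, pass rates) can be combined without a separate normalisation step.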
Momentum (7-day rate of change)
roc_7d = (now − then) / max(then, 1)
scaled_roc = clamp(50 + 50 × roc_7d, 0, 100)
0% growth → 50, +100% → 100, −50% → 25, −100% → 0. A signal first seen inside the 7-day window (no "then" reading) gets a 60: a small "newly arrived" bias, not the punitive 0 a missing baseline would otherwise imply.
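The momentum formula and its missing-baseline rule can be sketched as follows. The function signature, and the use of `None` to represent a missing "then" reading, are assumptions made for illustration.

```python
from typing import Optional

NEW_SIGNAL_SCORE = 60.0  # "newly arrived" score stated in the text


def scaled_roc(now: float, then: Optional[float]) -> float:
    """7-day rate of change mapped onto 0-100, clamped at both ends."""
    if then is None:
        # No reading from seven days ago: apply the newly-arrived score.
        return NEW_SIGNAL_SCORE
    roc_7d = (now - then) / max(then, 1)  # max(..., 1) guards a zero baseline
    return max(0.0, min(100.0, 50.0 + 50.0 * roc_7d))
```

Flat usage returns 50, a doubling returns 100, and anything growing faster than that is clamped at 100.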
Special cases
A handful of signals don't fit the count log-curve. MCP registry listed is binary (0 → 0, 1 → 75 — a hand-picked credibility bonus). HF trending rank and GitHub first-response hours are inverted: lower input = higher score, with the same log shape mirrored. Cost (blended price) is also inverted — cheaper scores higher, anchored at $5/M tokens.
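One possible reading of these special cases, as a sketch. The binary value (75) and the price anchor ($5/M tokens) come from the text; interpreting "mirrored" as `100 - scaled(v, anchor)` is an assumption, and the function names are hypothetical.

```python
import math

MCP_LISTED_SCORE = 75.0        # binary credibility bonus from the text
PRICE_ANCHOR_USD_PER_M = 5.0   # blended-price anchor from the text


def scaled(v: float, anchor: float) -> float:
    """The count log curve: 50 at the anchor, capped at 100."""
    return min(100.0, 50.0 * math.log10(v + 1) / math.log10(anchor + 1))


def mcp_listed_score(listed: bool) -> float:
    """Binary signal: listed scores 75, unlisted scores 0."""
    return MCP_LISTED_SCORE if listed else 0.0


def inverted(v: float, anchor: float) -> float:
    """Lower input scores higher.

    Assumed mirror of the log curve: 100 at v = 0, 50 at the anchor, 0 once
    the un-inverted curve would have hit its cap.
    """
    return max(0.0, 100.0 - scaled(v, anchor))


# Usage: a model priced at the anchor lands mid-scale; a cheaper one scores higher.
mid_price = inverted(PRICE_ANCHOR_USD_PER_M, PRICE_ANCHOR_USD_PER_M)
cheap = inverted(1.0, PRICE_ANCHOR_USD_PER_M)
```

Under this reading the inverted curve keeps the property that the anchor scores 50, so inverted and non-inverted signals stay comparable within a pillar mean.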
6. Why some agents are Unrated
A pillar with no signals on file is Unrated rather than zero. The pillar's card on an agent page reads "Unrated" (no data), while at the headline level that pillar contributes zero (because the headline is a weighted sum and you can't add an unknown to a sum). The two read like a contradiction; they describe different layers.
Quality has the additional coverage floor — fewer than three benchmarks means Unrated rather than a noisy single-source score. New flagship releases often sit Unrated on Quality for a few days until enough leaderboards pick them up.
7. Manipulation resistance
Three patterns trigger automatic flags. Flagged signals are excluded from that day's score; the agent's record carries the reason. The manipulation_resistance confidence on every score envelope reflects how clean the inputs were.
- star_spike_no_contrib_diversity: a 10× star jump in 24 hours from very few distinct contributors. Excludes github_stars for that tick.
- hf_surge_no_github: a Hugging Face download surge with no accompanying GitHub activity. Excludes hf_downloads_30d.
- coordinated_hn_posting: a burst of HN mentions with low account-age diversity. Excludes hn_mentions_7d and hn_points_7d.
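The exclusion step can be sketched as a lookup from flag to excluded signals. Flag and signal names are the ones listed above; the function name `apply_flags`, the dictionary-based input, and the `npm_weekly` key in the usage example are assumptions. Detection of the flags themselves is out of scope here.

```python
from typing import Dict, Iterable, Set

# Flag -> signals dropped from scoring while the flag is active.
EXCLUDED_BY_FLAG: Dict[str, Set[str]] = {
    "star_spike_no_contrib_diversity": {"github_stars"},
    "hf_surge_no_github": {"hf_downloads_30d"},
    "coordinated_hn_posting": {"hn_mentions_7d", "hn_points_7d"},
}


def apply_flags(signals: Dict[str, float], flags: Iterable[str]) -> Dict[str, float]:
    """Return a copy of `signals` without the ones excluded by any active flag."""
    excluded: Set[str] = set()
    for flag in flags:
        excluded |= EXCLUDED_BY_FLAG.get(flag, set())
    return {name: value for name, value in signals.items() if name not in excluded}


# Usage: a flagged HN burst removes both HN signals and leaves the rest intact.
signals = {"github_stars": 1200, "hn_mentions_7d": 40, "hn_points_7d": 300, "npm_weekly": 900}
clean = apply_flags(signals, ["coordinated_hn_posting"])
```

Dropping the signal rather than zeroing it means the pillar mean is taken over the remaining signals, so a flag removes suspect input without scoring it as a zero.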
8. Indexes
Six indexes at launch — each with eligibility rules published in code, equal-weight v1, rebalanced Mondays at 03:00 UTC. Every diff is logged with a short narrative.
- TAPE-100 — top 100 across both kinds.
- FM-50 — top 50 foundation models.
- CODE-25 — top 25 coding agents (capability:code-generation, application).
- WEB-25 — top 25 browser agents (capability:browsing, application).
- OSS-50 — top 50 open-source applications.
- MCP-25 — top 25 MCP servers (deployment:mcp-server).
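A minimal sketch of an equal-weight index build. It assumes members are ranked by AgentScore and that eligibility is a predicate over tags like those shown above; the `Agent` type, `build_index`, and the sample data are hypothetical and not the published eligibility code.

```python
from typing import Callable, Dict, FrozenSet, Iterable, NamedTuple


class Agent(NamedTuple):
    name: str
    kind: str                # e.g. "application" or "foundation_model" (assumed values)
    tags: FrozenSet[str]
    score: float             # AgentScore, 0-100


def build_index(
    agents: Iterable[Agent], size: int, eligible: Callable[[Agent], bool]
) -> Dict[str, float]:
    """Take the top `size` eligible agents by score and weight them equally."""
    ranked = sorted(
        (a for a in agents if eligible(a)), key=lambda a: a.score, reverse=True
    )
    members = ranked[:size]
    if not members:
        return {}
    weight = 1.0 / len(members)
    return {a.name: weight for a in members}


def is_coding_agent(a: Agent) -> bool:
    """Eligibility in the style of CODE-25: coding capability, application kind."""
    return a.kind == "application" and "capability:code-generation" in a.tags


# Usage with a toy population and a size of 2 instead of 25.
agents = [
    Agent("alpha", "application", frozenset({"capability:code-generation"}), 82.0),
    Agent("beta", "application", frozenset({"capability:browsing"}), 90.0),
    Agent("gamma", "application", frozenset({"capability:code-generation"}), 75.0),
    Agent("delta", "foundation_model", frozenset(), 95.0),
    Agent("epsilon", "application", frozenset({"capability:code-generation"}), 60.0),
]
code_index = build_index(agents, 2, is_coding_agent)
```

With equal weighting, a rebalance only changes which agents are members; every member carries the same share regardless of its score.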
9. Show your work
Every agent page exposes its raw signals as a downloadable CSV. Every index page links its rebalance log. Methodology changes are versioned in the repository; corrections welcome via Issues or Discussions.
Last revised on the most recent deploy. Comments and corrections at github.com/flmwilkinson/AgentTape/issues.