Shreyaskc/BabelJudge: LLM-as-a-judge has become the dominant approach to scalable evaluation in NLP pipelines, yet judges themselves carry systematic biases that raw accuracy hides: they favor responses placed in slot A (position bias), they prefer longer responses regardless of quality (verbosity...
Pillar = mean of 2 scaled values = 0.0.
Awaiting first reading — these signals apply to this agent and will be ingested on the next tier tick: PyPI monthly installs, SO questions (7d), Product Hunt upvotes, Docker Hub pulls, Crates.io downloads (90d), Tech-news mentions (30d)
Not applicable — this agent doesn't have the prerequisite (no GitHub repo, no HF mirror, etc.) for these signals to ever apply: HF downloads (30d), npm weekly installs
[](https://agenttape.com/agents/shreyaskc-babeljudge)
<a href="https://agenttape.com/agents/shreyaskc-babeljudge"><img src="https://agenttape.com/api/badge/shreyaskc-babeljudge.svg" alt="AgentTape" /></a>