Benchmarks & Evals

Benchmarks, leaderboards, evaluation methodology, and LLM judging. 47 releases covered on the show.

June 2026

Arena (LMArena)
Benchmarks & Evals

Agent Arena

Arena launches Agent Arena for real-world agent workflow evals

Arena (LMArena) launched Agent Arena during the episode, moving beyond one-turn chatbot preference battles to evaluate models on real agent workflows with web search, files, terminals, user corrections, and objective recovery signals. Peter Gostev joined live to explain why long-running, harder tasks need a different benchmark.

Major Features & Updates

WolfBench Token-Usage Visualization

WolfBench adds 3D token-depth bars to show model efficiency

Wolfram Ravenwolf shipped a WolfBench feature that visualizes token usage alongside benchmark score as 3D token-depth bars. Two models can look close on a leaderboard while one burns dramatically more tokens, which changes the real cost and latency story; Gemini 3.5 Flash and GPT-5.5 were compared as examples.

May 2026

Datacurve
Benchmarks & Evals
Open weights

DeepSWE

Datacurve's DeepSWE: a contamination-free coding benchmark

DeepSWE is a coding leaderboard built from 113 original tasks written from scratch and shipped as shallow clones with no git history to cheat from. GPT-5.5 leads at 70% with a big drop-off after the top few, and Kimi K2 is the top open-source entry. Replaying older benchmarks, Datacurve found that SWE-bench Pro's verifier is wrong ~32% of the time and caught Claude Opus reading the gold commit out of git history on 12-18% of passes.

70% DeepSWE leader (GPT-5.5)

Microsoft
New Models

MAI-Image-2.5

Microsoft MAI-Image-2.5 jumps to #3 on Arena text-to-image

MAI-Image-2.5 jumped to number three on Arena's text-to-image leaderboard shortly after launch, with notable strength in image cleanup, backgrounds, documents, and diagrams. Hands-on tests on the show were mixed, and it is publicly accessible through playground.microsoft.ai.

Artificial Analysis
Benchmarks & Evals

Coding Agent Index

Artificial Analysis Coding Agent Index benchmarks model + harness combos

Artificial Analysis launched the Coding Agent Index, a benchmark that evaluates model and harness combinations rather than models alone. Opus 4.7 in Cursor CLI leads at 61, GLM-5.1 tops the open-weight entries at 53, and costs vary 30x across combos for similar capability.

April 2026

Microsoft
Benchmarks & Evals

DELEGATE-52

Microsoft's DELEGATE-52 exposes stealthy document corruption

Microsoft released the DELEGATE-52 benchmark showing GPT-5.4 loses 28% of document content after 20 iterative edits. Frontier models corrupt documents stealthily while preserving structure, making the degradation hard to notice.
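
As a back-of-envelope reading of the 28% figure, the sketch below works out the per-edit loss it would imply. It assumes loss compounds at a constant rate per edit, which is an assumption made here for illustration and not something the benchmark states; `per_edit_retention` is a hypothetical helper name.

```python
def per_edit_retention(total_loss: float, num_edits: int) -> float:
    """Constant per-edit retention r such that r ** num_edits == 1 - total_loss.

    Assumption (not from the benchmark): every edit loses the same fraction
    of whatever content is still left.
    """
    if not 0.0 <= total_loss < 1.0:
        raise ValueError("total_loss must be in [0, 1)")
    if num_edits <= 0:
        raise ValueError("num_edits must be positive")
    return (1.0 - total_loss) ** (1.0 / num_edits)


# Figures quoted above: 28% of content lost after 20 iterative edits.
retention = per_edit_retention(total_loss=0.28, num_edits=20)
print(f"retained per edit: {retention:.4f}")
print(f"lost per edit:     {1 - retention:.2%}")
```

Under that assumption each edit drops only about 1.6% of the remaining content, which is one way to see why the degradation is hard to notice edit by edit.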

Datasets
Open weights

Arena historical leaderboard & prompt datasets

Arena releases 3 years of leaderboard data and prompts on Hugging Face

Arena (formerly LMArena) released three years of historical leaderboard data plus the actual user prompts as datasets on Hugging Face. Peter Gostev, who previously scraped the site by hand into Google Sheets for his charts, now builds his Compute Wars and model-trend analyses straight from the data.

Benchmarks & Evals

WolfBench

WolfBench results show Hermes Agent beating Claude Code and OpenClaw

Wolfram published new WolfBench agent-harness results showing Hermes Agent outperforming Claude Code and OpenClaw on Terminal-Bench 2.0 across most model combinations. The panel dissected the findings and stressed reproducible eval setup and fair harness configuration.

March 2026

ARC Prize Foundation
Benchmarks & Evals

ARC-AGI-3

ARC-AGI-3 launches: humans score 100%, frontier models under 1%

ARC Prize launched ARC-AGI-3, an interactive agentic reasoning benchmark of turn-based puzzle games designed to test human-like generalization in novel abstract environments. Humans hit a 100% pass rate while top frontier models score under 1%, which the panel welcomed as a healthy reality check against AGI-is-here rhetoric and easy score inflation.

<1% ARC-AGI-3 frontier model scores
100% Human completion on ARC-AGI-3

MarginLab
Benchmarks & Evals

Claude Code tracker

MarginLab tracker shows degradation in Opus 4.6 on Claude Code

MarginLab's public Claude Code tracker surfaced measurable degradation in Opus 4.6 performance, discussed in the evals and benchmarks roundup. The tracker continuously evaluates Claude Code behavior over time, making silent model regressions visible.

Weights & Biases
Benchmarks & Evals

Wolf Bench

Wolfram previews Wolf Bench, a multi-metric agent eval from W&B

Wolfram Ravenwolf gave an early preview of Wolf Bench, a Terminal-Bench-based evaluation framework from Weights & Biases that reports four metrics (average, best run, ceiling, and consistent floor) instead of a single score. It treats harness differences (Terminal-Bench vs Claude Code vs OpenClaw) as a first-class factor and publishes benchmark cost and transparency details.
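
To make the four-metric idea concrete, here is a minimal sketch of how such numbers could be derived from repeated runs. The definitions are assumptions for illustration, not Wolf Bench's published formulas: "average" is the mean per-run pass rate, "best run" is the highest single-run pass rate, "ceiling" is the share of tasks solved in at least one run, and "consistent floor" is the share solved in every run. `summarize_runs` is a hypothetical helper.

```python
from statistics import mean


def summarize_runs(runs: list[list[bool]]) -> dict[str, float]:
    """runs[i][j] is True if run i solved task j; all runs cover the same tasks."""
    if not runs or not runs[0]:
        raise ValueError("need at least one run and one task")
    num_tasks = len(runs[0])
    if any(len(run) != num_tasks for run in runs):
        raise ValueError("every run must cover the same tasks")

    # Pass rate of each individual run.
    per_run_scores = [sum(run) / num_tasks for run in runs]
    # Transpose so each tuple holds one task's outcomes across all runs.
    per_task = list(zip(*runs))

    return {
        "average": mean(per_run_scores),
        "best_run": max(per_run_scores),
        # Solved at least once across runs (assumed meaning of "ceiling").
        "ceiling": sum(any(outcomes) for outcomes in per_task) / num_tasks,
        # Solved in every run (assumed meaning of "consistent floor").
        "consistent_floor": sum(all(outcomes) for outcomes in per_task) / num_tasks,
    }


# Three hypothetical runs over four tasks.
runs = [
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, False],
]
summary = summarize_runs(runs)
print(summary)
```

In this toy data every run scores 50%, yet the ceiling is 75% and the consistent floor is 25%, which is the kind of spread a single averaged score hides.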

February 2026

Agentica
Benchmarks & Evals

ARC-AGI-3 public set result

Agentica claims to solve all public ARC-AGI-3 tasks

Agentica published a claim of solving all public ARC-AGI-3 tasks, adding to the week's theme of benchmark saturation. The panel discussed it alongside METR and ARC-AGI-2 results as part of weighing signal versus noise in headline benchmark leaps.

Confluence Labs
Benchmarks & Evals

ARC-AGI-2 SOTA result

Confluence Labs exits stealth with 97.9% SOTA on ARC-AGI-2

Confluence Labs emerged from stealth with a 97.9% state-of-the-art result on the ARC-AGI-2 benchmark, publishing code on GitHub. The panel read it as a major signal that ARC-AGI-2 is near saturation, part of a broader pattern of benchmarks getting solved faster than expected.

97.9% ARC-AGI-2

METR
Benchmarks & Evals

Time Horizon Benchmark

METR Time Horizon goes vertical: Opus 4.6 hits ~14.5-hour tasks

METR's updated Time Horizon benchmark shows Claude Opus 4.6 completing tasks equivalent to roughly 14.5 hours of expert human work, with the autonomy doubling time now cited at 49 days. The panel treated this as the week's strongest evidence that agent capability growth has entered a visibly faster phase.

14.5h METR Time Horizon
49 days Autonomy Doubling Time
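
For a sense of what a 49-day doubling time means, the sketch below extrapolates the 14.5-hour figure forward. It assumes the doubling time stays constant, which is a strong simplification made here for illustration; it is not METR's methodology, and both function names are hypothetical.

```python
import math


def projected_horizon_hours(current_hours: float, days_ahead: float,
                            doubling_days: float) -> float:
    """Exponential extrapolation: the horizon doubles every `doubling_days` days."""
    return current_hours * 2.0 ** (days_ahead / doubling_days)


def days_until(target_hours: float, current_hours: float,
               doubling_days: float) -> float:
    """Days until the horizon reaches `target_hours` under the same assumption."""
    return doubling_days * math.log2(target_hours / current_hours)


# Figures quoted above: ~14.5 hours today, doubling every 49 days.
after_two_doublings = projected_horizon_hours(14.5, days_ahead=98, doubling_days=49)
print(f"horizon after 98 days: {after_two_doublings:.1f} hours")
print(f"days until a 40-hour horizon: {days_until(40, 14.5, 49):.0f}")
```

Two doublings (98 days) would put the horizon at 58 hours if the trend held; whether it holds is exactly what the benchmark updates are meant to test.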

December 2025

Weights & Biases
Products & Apps

LLM Evaluation Jobs

W&B launches LLM Evaluation Jobs for OpenAI-compatible APIs

Weights & Biases launched LLM Evaluation Jobs, letting teams run evaluations against any OpenAI-compatible API during training cycles instead of only at the end. The show framed it as a practical workflow upgrade for getting earlier model quality signals without blindly burning compute.

November 2025

Google DeepMind
New Models

Gemini 3 Pro

Gemini 3 Pro launches with record ARC-AGI-2 scores

Gemini 3 Pro is Google's new frontier multimodal model, with a 1M-token context window and large reasoning gains. It scores 31.11% on ARC-AGI-2 (45.14% with Deep Think mode), roughly double the previous SOTA, plus 81% on MMMU-Pro and major coding improvements. Amp made it its default model on launch day, the first time it has ever switched defaults. Gemini 3 Pro is also rolling out across Gmail, Calendar, and AI Mode in Google Search.

45.14% ARC-AGI-2 (Deep Think)
31.11% ARC-AGI-2 (standard)
1M Token context window

Benchmarks & Evals
Open weights

Terminal-Bench 2.0

Terminal-Bench 2.0 and Harbor launch as new bar for coding agents

Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1,000 Discord contributors. The Warp agent tops the leaderboard at 50% with Codex CLI close behind, and the panel argued that an unsaturated 50% ceiling makes it far more meaningful than near-saturated benchmarks like MMLU.

50% Terminal-Bench 2.0 Top Score

Inworld AI
New Models

Inworld TTS

Inworld TTS takes the #1 spot on the Artificial Analysis speech benchmark

Inworld released a new version of its TTS model that claimed the #1 position on the Artificial Analysis text-to-speech benchmark. It featured in the episode's voice segment as evidence that commercial TTS quality keeps climbing fast.

September 2025

Meta AI & Hugging Face
Benchmarks & Evals
Open weights

Gaia2 + ARE

Gaia2 agent benchmark and Agents Research Environments released

Meta and Hugging Face released Gaia2, a follow-up agent benchmark, together with ARE (Agents Research Environments) for testing agents in dynamic, asynchronous settings. It fed the episode's recurring concern that evaluation has to keep up whenever agent product claims get ambitious.

OpenAI
Benchmarks & Evals

GDPval

OpenAI launches GDPval to measure models on real economic work

OpenAI introduced GDPval, an evaluation that measures model performance on real-world, economically valuable tasks drawn from a range of occupations and GDP sectors. On the show it anchored the discussion about agents moving from chat quality toward action and reliability in real environments.

Scale AI
Benchmarks & Evals
Open weights

SWE-bench Pro

Scale AI debuts SWE-bench Pro, a harder contamination-resistant eval

Scale AI released SWE-bench Pro, a tougher, contamination-resistant successor to SWE-bench for evaluating coding agents on realistic software engineering tasks. It ships with a public dataset on Hugging Face plus separate public and commercial leaderboards, and frontier models score far lower than on the original SWE-bench.

Jeremy Berman & Eric Pang
Papers & Research

ARC-AGI SOTA method

Jeremy Berman and Eric Pang set new ARC-AGI SOTA using Grok-4

Independent researchers Jeremy Berman and Eric Pang published a new state-of-the-art result on ARC-AGI, built on Grok-4 with heavy test-time compute and iterative program synthesis. Berman joined the show to walk through the method, its limitations, and why iteration matters more than leaderboard narratives; the approach is documented in a detailed write-up.

Nous Research
Benchmarks & Evals
Open weights

Husky Hold'em Bench

Nous launches Husky Hold'em Bench, an open-source pokerbot eval for LLMs

Nous Research released Husky Hold'em Bench, an open-source poker benchmark that evaluates LLM strategic play in a richer agentic environment than standard leaderboards. Guests Roger Jin and Bhavesh Kumar joined the show to explain how it measures agent behavior and decision-making under uncertainty rather than chasing another leaderboard point.

May 2025

OpenAI
Benchmarks & Evals
Open weights

HealthBench

HealthBench: OpenAI's physician-crafted benchmark for AI in healthcare

OpenAI released HealthBench, a benchmark for evaluating AI models on healthcare scenarios, built with input from physicians. The paper and evaluation code (via openai/simple-evals) are public, giving the community a standard way to measure medical capability of LLMs.

Cohere
Papers & Research

The Leaderboard Illusion

Cohere Labs paper accuses Chatbot Arena (LMArena) of structural bias

Cohere Labs published 'The Leaderboard Illusion,' claiming LMArena lets big incumbents privately A/B-test dozens of model variants (Meta ran 27 hidden Llama-4 variants in a month), cherry-pick top scores, and receive far more battle data, inflating Elo ratings. LMArena responded that the leaderboard reflects real human preferences and pre-release testing is open to all providers.
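
The cherry-picking concern is a selection effect, and a small simulation can illustrate it. This is a sketch under stated assumptions, not the paper's analysis: every variant has the same true skill, each measured rating is that skill plus Gaussian noise, and the provider publishes only the best-scoring variant. The rating scale, noise level, and function names are hypothetical; only the count of 27 variants comes from the claim above.

```python
import random
from statistics import mean


def best_of_n_score(true_skill: float, noise_sd: float, n_variants: int,
                    rng: random.Random) -> float:
    """Measured rating of the best-looking variant out of n identical ones."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))


def average_published_score(true_skill: float, noise_sd: float, n_variants: int,
                            trials: int, seed: int = 0) -> float:
    """Average published rating when only the top variant is ever released."""
    rng = random.Random(seed)
    return mean(best_of_n_score(true_skill, noise_sd, n_variants, rng)
                for _ in range(trials))


# Hypothetical scale: true rating 1000, measurement noise of 10 points.
single = average_published_score(1000.0, 10.0, n_variants=1, trials=2000)
cherry_picked = average_published_score(1000.0, 10.0, n_variants=27, trials=2000)
print(f"one submission:     {single:.1f}")
print(f"best of 27 private: {cherry_picked:.1f}")
```

Even though all variants are equally good by construction, publishing the best of 27 noisy measurements yields a visibly higher rating than a single submission, purely from selection.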

UC Berkeley
Datasets
Open weights

PromptEvals

PromptEvals: 12K+ real production assertion criteria for LLM evals

Shreya Shankar and collaborators released PromptEvals, the first large-scale corpus of production LLM guardrails: 2,087 developer prompts paired with 12,623 assertion criteria covering structure, style, grounding, and hallucination checks, about 5x larger than prior sets. Fine-tuned open Mistral-7B and Llama-3-8B checkpoints beat GPT-4o by +21 F1 at generating assertions, at a fraction of the latency. The paper was accepted to NAACL 2025.

April 2025

Weights & Biases
Major Features & Updates

W&B Weave Playground

W&B Weave Playground adds GPT-4.1 family and o3/o4-mini support

The Weights & Biases Weave Playground shipped full support for the new GPT-4.1 family and the o3/o4-mini models, letting developers evaluate and compare the week's new models for their own applications.

CoreWeave
Benchmarks & Evals

CoreWeave GB200 inference benchmark

CoreWeave hits 800 tok/s on Llama 405B with NVIDIA GB200 Blackwell

CoreWeave announced record-breaking AI inference benchmarks using NVIDIA's new GB200 Grace Blackwell superchips: 800 tokens/sec on Llama 3.1 405B, plus 33,000 tokens/sec on Llama 2 70B with H200s. It is a marker of how fast inference hardware is accelerating.

800 tok/s Llama 3.1 405B on GB200
33,000 tok/s Llama 2 70B on H200

Google DeepMind
Benchmarks & Evals

Gemini 2.5 Pro USAMO results

Gemini 2.5 Pro scores 24.4% on USAMO olympiad math, crushing the field

New evaluation results published this week showed Gemini 2.5 Pro scoring 24.4% on the USA Math Olympiad (USAMO), problems so hard that most top models score under 5%. The result showcases a step change in frontier reasoning ability on competition mathematics.

24.4% Gemini 2.5 Pro USAMO score
<5% typical score for other top models

OpenAI
Benchmarks & Evals
Open weights

PaperBench

OpenAI releases PaperBench eval and open-sources Nano-Eval framework

OpenAI published PaperBench, a tough new evaluation that tests whether AI agents can replicate cutting-edge AI research papers, with more than 8,300 graded tasks and meta-evaluation of the LLM judge. The best model managed only a 21.0% replication score versus 41.4% for human PhDs. The code and the Nano-Eval framework were open sourced on GitHub alongside the paper.

8,300+ graded tasks in the benchmark
21.0% best model replication score
41.4% human PhD baseline score

March 2025

ARC Prize Foundation
Benchmarks & Evals

ARC-AGI-2

ARC-AGI-2 benchmark revealed, thinking models score just 4%

The ARC Prize Foundation revealed ARC-AGI-2, the next iteration of the abstract reasoning benchmark. Base LLMs score 0% and even thinking models only reach about 4%, showing how far current frontier models remain from human-level fluid intelligence.

0% base LLM score on ARC-AGI-2
4% thinking model score on ARC-AGI-2

OpenAI
New Models

GPT-4o (2025-03-26)

GPT-4o gets an update, ties for #1 on LMArena beating GPT-4.5

OpenAI shipped a new GPT-4o checkpoint (2025-03-26) that jumped over GPT-4.5 to tie for #1 on LMArena. The update landed as the show was being written and was read as a direct response to Gemini 2.5's launch in the escalating frontier-model race.

Weights & Biases
Dev Tools
Open weights

Weave MCP Server

W&B ships official Weave MCP server - talk to your evals

Weights & Biases shipped an official MCP server for Weave, its LLM observability and evaluation tool, letting agents and MCP clients query and analyze your evals directly. Morgan McGuire of the W&B Applied AI team demoed it on the show, with wandb Models integration coming soon so agents can monitor loss curves for you.

February 2025

Haize Labs
Dev Tools
Open weights

Verdict

Haize Labs open-sources Verdict, a framework for composing LLM judges

Haize Labs released Verdict, an open-source framework for composing LLM judges that tackles core LLM-as-a-judge problems: self-preference bias, prompt sensitivity, and meta-evaluation. Verdict combines simpler judging primitives into more robust and efficient evaluators ('judge-time compute scaling'), achieving near state-of-the-art results on benchmarks like ExpertQA at a fraction of the cost, fast enough to use as a real-time guardrail. Co-founders Leonard Tang and Nimit joined the show to discuss it.
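
The general idea of composing simple judges into a sturdier one can be sketched in a few lines. This is a generic majority-vote illustration and does not use or reproduce Verdict's actual API; the `Judge` type, `majority_vote`, and the three stand-in heuristics are all hypothetical, with plain string checks taking the place of real LLM calls.

```python
from collections import Counter
from typing import Callable

# A judge takes (question, answer) and returns pass/fail.
Judge = Callable[[str, str], bool]


def majority_vote(judges: list[Judge]) -> Judge:
    """Combine several weak judges into one that passes on a strict majority."""
    def combined(question: str, answer: str) -> bool:
        votes = Counter(judge(question, answer) for judge in judges)
        return votes[True] > votes[False]
    return combined


# Stand-in heuristics; in practice each would be an LLM call with its own prompt.
def non_empty(question: str, answer: str) -> bool:
    return bool(answer.strip())


def mentions_question_term(question: str, answer: str) -> bool:
    terms = [w.strip("?.,!").lower() for w in question.split() if len(w) > 4]
    return any(term in answer.lower() for term in terms)


def not_a_refusal(question: str, answer: str) -> bool:
    return "i don't know" not in answer.lower()


judge = majority_vote([non_empty, mentions_question_term, not_a_refusal])
question = "What is the capital of France?"
print(judge(question, "The capital of France is Paris."))
print(judge(question, "I don't know."))
```

Each primitive is easy to fool on its own; the point of composition is that the combined verdict depends less on any single judge's quirks.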

Benchmarks & Evals
Open weights

ZeroBench

ZeroBench: the 'impossible' benchmark where all top VLMs score zero

A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.

Weights & Biases
Papers & Research

Agents Whitepaper & Course

Weights & Biases releases an AI agents whitepaper and announces agents course

Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.

January 2025

Benchmarks & Evals

Humanity's Last Exam (HLE)

Humanity's Last Exam: a deliberately unsaturated frontier benchmark

Humanity's Last Exam (HLE) launched as a new, very hard benchmark designed to stay unsaturated as models max out MMLU and math evals. It uses crowdsourced expert-level questions to measure frontier model capability at a point where existing benchmarks sit at 98-99% saturation.