Benchmarks & Evals

Benchmarks, leaderboards, evaluation methodology, and LLM judging. 50 releases covered on the show.

July 2026

Anthropic Jul 1, 2026

New Models

Sonnet 5

Claude Sonnet 5: 'our most agentic Sonnet yet' at intro pricing

Anthropic launched Sonnet 5 with near-Opus 4.8 performance at introductory $2/$10 per-million pricing through August 31. Reception split sharply: power users saw near-Opus costs for marginally inferior output at high effort levels, while casual users praised the value. The new tokenizer may also consume up to 35% more tokens. On ThursdAI, Wolfram's early WolfBench read put it slightly under Opus 4.6 at higher cost.

$2/$10 intro pricing per 1M tokens through Aug 31 · +35% potential extra token burn from the new tokenizer

🎙️ Hear our coverage →

#frontier-models #benchmarks
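The two headline numbers for Sonnet 5 interact: if the same text costs more tokens, the price per unit of text rises even though the list price does not. A minimal sketch of that arithmetic, assuming the worst-case 35% inflation applies equally to input and output tokens (the source says only "up to 35%"); `effective_price` is a hypothetical helper name.

```python
def effective_price(price_per_million: float, token_inflation: float) -> float:
    """Price per 1M old-tokenizer-equivalent tokens when the same text needs more tokens."""
    return price_per_million * (1.0 + token_inflation)

# Intro list prices from the entry above, in dollars per 1M tokens.
INTRO_INPUT, INTRO_OUTPUT = 2.00, 10.00
# Assumption: the full "up to 35%" overhead applies to every request.
WORST_CASE_INFLATION = 0.35

effective_input = effective_price(INTRO_INPUT, WORST_CASE_INFLATION)
effective_output = effective_price(INTRO_OUTPUT, WORST_CASE_INFLATION)
print(f"{effective_input:.2f} {effective_output:.2f}")  # prints "2.70 13.50"
```

Under that worst-case assumption the intro pricing behaves like roughly $2.70/$13.50 per million tokens measured with the old tokenizer; real workloads may see less overhead.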

June 2026

Sakana AI Jun 25, 2026

Dev Tools

Fugu

Sakana AI launches Fugu multi-agent orchestration API

Stefania Druga announced Fugu on air. The recursive router rewrites prompts and verifies outputs before picking a model, per the two ICLR papers behind it (Trinity and the conductor), and now plugs into Codex and OpenCode.

95.5 GPQA Diamond · 93.2 LiveCodeBench · 73.7 SWE-Bench Pro

Fugu announcement ↗ · Sakana launch tweet ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #benchmarks #api

OpenRouter Jun 18, 2026

APIs & Platforms

Fusion API

OpenRouter launches Fusion API, a panel of budget models competing with frontier models

OpenRouter launched Fusion API, which routes or ensembles a panel of lower-cost models to reach near-frontier results. The episode notes framed it as beating GPT-5.5 and Opus 4.8 in some comparisons while landing within roughly 1% of Claude Fable 5 at half the price.

~1% from Fable 5 in episode notes

OpenRouter announcement on X ↗ · Fusion beats frontier models ↗ · OpenRouter Fusion ↗

🎙️ Hear our coverage →

#api #frontier-models #benchmarks

Arena (LMArena) Jun 4, 2026

Benchmarks & Evals

Agent Arena

Arena launches Agent Arena for real-world agent workflow evals

Arena (LMArena) launched Agent Arena during the episode, moving beyond one-turn chatbot preference battles to evaluate models on real agent workflows with web search, files, terminals, user corrections, and objective recovery signals. Peter Gostev joined live to explain why long-running, harder tasks need a different benchmark.

Agent Arena announcement ↗ · Arena ↗

🎙️ Hear our coverage →

#benchmarks #agents

WolfBench (Wolfram Ravenwolf) Jun 4, 2026

Major Features & Updates

WolfBench Token-Usage Visualization

WolfBench adds 3D token-depth bars to show model efficiency

Wolfram Ravenwolf shipped a WolfBench feature that visualizes token usage alongside benchmark score as 3D token-depth bars. Two models can look close on a leaderboard while one burns dramatically more tokens, which changes the real cost and latency story; Gemini 3.5 Flash and GPT-5.5 were compared as examples.

wolfbench.ai ↗

🎙️ Hear our coverage →
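To make the "close scores, very different token burn" point concrete, here is a small sketch of a tokens-per-point comparison. The two models and all numbers are placeholders invented for illustration, not WolfBench data, and `tokens_per_point` is a hypothetical metric rather than one WolfBench is known to publish.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    score: float  # benchmark score in percent
    tokens: int   # total tokens consumed across the benchmark run

def tokens_per_point(r: Result) -> float:
    """Tokens spent per percentage point of score; lower is more efficient."""
    return r.tokens / r.score

# Placeholder numbers, NOT measured results.
a = Result("model-a", 60.0, 1_200_000)
b = Result("model-b", 58.0, 400_000)

score_gap = a.score - b.score      # 2.0 points apart on the leaderboard
token_ratio = a.tokens / b.tokens  # model-a uses 3.0x the tokens
```

In this made-up case the leaderboard gap is two points, while the token bill differs by a factor of three, which is the kind of difference a score-only chart hides.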

May 2026

Datacurve May 28, 2026

Benchmarks & Evals · Open weights

DeepSWE

Datacurve's DeepSWE: a contamination-free coding benchmark

DeepSWE is a coding leaderboard built from 113 original tasks written from scratch and shipped as shallow clones with no git history to cheat from. GPT-5.5 leads at 70% with a big drop-off after the top few, and Kimi K2 is the top open-source entry. Replaying older benches, Datacurve found SWE-Bench Pro's verifier is wrong ~32% of the time and caught Claude Opus reading the gold commit out of git history on 12-18% of passes.

70% DeepSWE leader (GPT-5.5)

DeepSWE benchmark ↗ · DeepSWE blog ↗ · DeepSWE GitHub ↗

🎙️ Hear our coverage →

#benchmarks #coding

Microsoft May 28, 2026

New Models

MAI-Image-2.5

Microsoft MAI-Image-2.5 jumps to #3 on Arena text-to-image

MAI-Image-2.5 jumped to number two on Arena's image-to-image leaderboard shortly after launch, with notable strength in image cleanup, backgrounds, documents, and diagrams. Hands-on tests on the show were mixed, and it is publicly accessible through playground.microsoft.ai.

Microsoft MAI Image 2.5 — Arena ↗ · Microsoft AI announcement ↗ · MAI-Image-2.5 announcement image ↗ · X announcement ↗

🎙️ Hear our coverage (+1 follow-up) →

#image-gen #benchmarks

Artificial Analysis May 14, 2026

Benchmarks & Evals

Coding Agent Index

Artificial Analysis Coding Agent Index benchmarks model + harness combos

Artificial Analysis launched the Coding Agent Index, a benchmark that evaluates model and harness combinations rather than models alone. Opus 4.7 in Cursor CLI leads at 61, GLM-5.1 tops the open-weight entries at 53, and costs vary 30x across combos for similar capability.

X announcement ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

CoreWeave May 14, 2026

Products & Apps

CoreWeave Sandboxes

CoreWeave Sandboxes launch in preview via the W&B SDK

CoreWeave Sandboxes is now an official Harbor provider, letting teams run agentic workloads like Terminal-Bench safely at scale on CoreWeave infrastructure. It plugs CoreWeave's isolated execution environments directly into the Harbor eval/agent ecosystem.

Docs ↗ · CoreWeave blog ↗ · CoreWeave Sandboxes ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #infrastructure #benchmarks

April 2026

Baidu Apr 30, 2026

New Models

ERNIE 5.1 Preview

Baidu ERNIE 5.1 Preview hits #13 on Arena with 6% of the compute

Baidu's ERNIE 5.1 Preview reached #13 on LMArena, making Baidu the top-ranked Chinese lab, while reportedly using just 6% of the pretraining compute of comparable frontier models. The model is available at ernie.baidu.com.

ernie.baidu.com ↗ · ERNIE for Devs on X ↗ · Arena announcement ↗

🎙️ Hear our coverage →

#frontier-models #training #benchmarks

Microsoft Apr 30, 2026

Benchmarks & Evals

DELEGATE-52

Microsoft's DELEGATE-52 exposes stealthy document corruption

Microsoft released the DELEGATE-52 benchmark showing GPT-5.4 loses 28% of document content after 20 iterative edits. Frontier models corrupt documents stealthily while preserving structure, making the degradation hard to notice.

🎙️ Hear our coverage →

#benchmarks #agents
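The DELEGATE-52 headline figure can be turned into a per-edit rate. This is a back-of-the-envelope sketch that assumes the loss compounds uniformly across the 20 edits, which the source does not state; real degradation may be front- or back-loaded.

```python
TOTAL_LOSS = 0.28  # fraction of content lost, from the entry above
NUM_EDITS = 20     # number of iterative edits, from the entry above

# Assumption: each edit retains the same fraction r, so r ** NUM_EDITS = 1 - TOTAL_LOSS.
retained_after_all = 1.0 - TOTAL_LOSS
per_edit_retention = retained_after_all ** (1.0 / NUM_EDITS)
per_edit_loss = 1.0 - per_edit_retention

print(f"{per_edit_loss:.1%} lost per edit")  # prints "1.6% lost per edit"
```

A loss of about 1.6% per edit is small enough to escape notice on any single pass, which fits the "stealthy" framing.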

Arena (formerly LMArena) Apr 9, 2026

Datasets · Open weights

Arena historical leaderboard & prompt datasets

Arena releases 3 years of leaderboard data and prompts on Hugging Face

Arena (formerly LMArena) released three years of historical leaderboard data plus the actual user prompts as datasets on Hugging Face. Peter Gostev, who previously scraped the site by hand into Google Sheets for his charts, now builds his Compute Wars and model-trend analyses straight from the data.

Peter Gostev on X ↗

🎙️ Hear our coverage →

#benchmarks #open-source

WolfBench (Wolfram Ravenwolf) Apr 2, 2026

Benchmarks & Evals

WolfBench

WolfBench results show Hermes Agent beating Claude Code and OpenClaw

Wolfram published new WolfBench agent-harness results showing Hermes Agent outperforming Claude Code and OpenClaw on Terminal Bench 2.0 across most model combinations. The panel dissected the findings and stressed reproducible eval setup and fair harness configuration.

wolfbench.ai ↗ · Viral results thread on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#benchmarks #agents #coding

March 2026

ARC Prize Foundation Mar 26, 2026

Benchmarks & Evals

ARC-AGI-3

ARC-AGI-3 launches: humans score 100%, frontier models under 1%

ARC Prize launched ARC-AGI-3, an interactive agentic reasoning benchmark of turn-based puzzle games designed to test human-like generalization in novel abstract environments. Humans hit a 100% pass rate while top frontier models score under 1%, which the panel welcomed as a healthy reality check against AGI-is-here rhetoric and easy score inflation.

<1% ARC-AGI-3 frontier model scores · 100% Human completion on ARC-AGI-3

ARC Prize announcement (X) ↗ · ARC Prize site ↗

🎙️ Hear our coverage →

#benchmarks #reasoning #agents

MarginLab Mar 5, 2026

Benchmarks & Evals

Claude Code tracker

MarginLab tracker shows degradation in Opus 4.6 on Claude Code

MarginLab's public Claude Code tracker surfaced measurable degradation in Opus 4.6 performance, discussed in the evals and benchmarks roundup. The tracker continuously evaluates Claude Code behavior over time, making silent model regressions visible.

MarginLab Claude Code tracker ↗

🎙️ Hear our coverage →

Peter Gostev Mar 5, 2026

Benchmarks & Evals

BullShit Bench

Peter Gostev publishes BullShit Bench

Peter Gostev published BullShit Bench, a new community evaluation flagged in the week's evals and benchmarks roundup. It measures how models handle nonsense or unfounded claims rather than raw capability.

BullShit Bench announcement ↗

🎙️ Hear our coverage →

Weights & Biases Mar 5, 2026

Benchmarks & Evals

Wolf Bench

Wolfram previews Wolf Bench, a multi-metric agent eval from W&B

Wolfram Ravenwolf gave an early preview of Wolf Bench, a Terminal Bench-based evaluation framework from Weights & Biases that reports four metrics (average, best run, ceiling, and consistent floor) instead of a single score. It treats harness differences (Terminal Bench vs Claude Code vs OpenClaw) as a first-class factor and publishes benchmark cost and transparency details.

🎙️ Hear our coverage →

#benchmarks #agents
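The four Wolf Bench metrics are named above but not defined. The sketch below shows one plausible reading, and every definition in it is an assumption: "average" as the mean per-run pass rate, "best run" as the highest per-run pass rate, "ceiling" as the share of tasks solved in at least one run, and "consistent floor" as the share solved in every run. The function name is hypothetical.

```python
def wolf_style_metrics(runs: list[list[bool]]) -> dict[str, float]:
    """runs[i][t] is True when task t passed in run i (all runs cover the same tasks)."""
    n_tasks = len(runs[0])
    run_scores = [sum(run) / n_tasks for run in runs]
    solved_any = [any(run[t] for run in runs) for t in range(n_tasks)]
    solved_all = [all(run[t] for run in runs) for t in range(n_tasks)]
    return {
        "average": sum(run_scores) / len(run_scores),  # mean pass rate across runs
        "best_run": max(run_scores),                   # single best run
        "ceiling": sum(solved_any) / n_tasks,          # solved at least once
        "consistent_floor": sum(solved_all) / n_tasks, # solved every time
    }

# Two toy runs over four tasks; placeholder data, not benchmark results.
runs = [
    [True, True, False, False],
    [True, False, True, False],
]
metrics = wolf_style_metrics(runs)
# average 0.5, best_run 0.5, ceiling 0.75, consistent_floor 0.25
```

Both toy runs score 50%, yet the ceiling (75%) and floor (25%) differ widely, which is the information a single score drops.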

February 2026

Agentica Feb 26, 2026

Benchmarks & Evals

ARC-AGI-3 public set result

Agentica claims to solve all public ARC-AGI-3 tasks

Agentica published a claim of solving all public ARC-AGI-3 tasks, adding to the week's theme of benchmark saturation. The panel discussed it alongside METR and ARC-AGI-2 results as part of weighing signal versus noise in headline benchmark leaps.

Agentica claim on X ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

Confluence Labs Feb 26, 2026

Benchmarks & Evals

ARC-AGI-2 SOTA result

Confluence Labs exits stealth with 97.9% SOTA on ARC-AGI-2

Confluence Labs emerged from stealth with a 97.9% state-of-the-art result on the ARC-AGI-2 benchmark, publishing code on GitHub. The panel read it as a major signal that ARC-AGI-2 is near saturation, part of a broader pattern of benchmarks getting solved faster than expected.

97.9% ARC-AGI-2

Y Combinator post on X ↗ · Confluence Labs ARC-AGI-2 GitHub repo ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

METR Feb 26, 2026

Benchmarks & Evals

Time Horizon Benchmark

METR Time Horizon goes vertical: Opus 4.6 hits ~14.5-hour tasks

METR's updated Time Horizon benchmark shows Claude Opus 4.6 completing tasks equivalent to roughly 14.5 hours of expert human work, with the autonomy doubling time now cited at 49 days. The panel treated this as the week's strongest evidence that agent capability growth has entered a visibly faster phase.

14.5h METR Time Horizon · 49 days Autonomy Doubling Time

Peter Wildeford thread on X ↗ · METR website ↗

🎙️ Hear our coverage →

#benchmarks #agents
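A doubling time implies simple exponential arithmetic. The sketch below extrapolates from the two figures cited above, assuming the 49-day doubling simply continues; that is an illustration of what the numbers imply, not a forecast, and both helper names are hypothetical.

```python
import math

BASE_HORIZON_HOURS = 14.5  # from the entry above
DOUBLING_DAYS = 49.0       # from the entry above

def projected_horizon(days_ahead: float) -> float:
    """Task horizon in hours after days_ahead, if the doubling time held constant."""
    return BASE_HORIZON_HOURS * 2.0 ** (days_ahead / DOUBLING_DAYS)

def days_to_reach(target_hours: float) -> float:
    """Days until the horizon reaches target_hours under the same assumption."""
    return DOUBLING_DAYS * math.log2(target_hours / BASE_HORIZON_HOURS)

one_doubling = projected_horizon(49)  # 29.0 hours
days_for_58h = days_to_reach(58.0)    # 98.0 days, i.e. two doublings
```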

Google DeepMind Feb 12, 2026

New Models

Gemini 3 Deep Think

Gemini 3 Deep Think scores 84% on ARC-AGI 2

Google dropped an upgraded Gemini 3 Deep Think mid-show, hitting 84% on ARC-AGI 2 — the biggest single jump in the benchmark's history, up from Opus 4.6's 68% set just one week earlier. It also scored 48.4% on Humanity's Last Exam without tools, taking state of the art on both.

84% ARC-AGI 2

Sundar Pichai announcement on X ↗

🎙️ Hear our coverage →

#reasoning #benchmarks

December 2025

Google DeepMind Dec 25, 2025

New Models

Gemini 2.5

Gemini 2.5 takes the #1 benchmark spot in March

Gemini 2.5 briefly claimed the top benchmark position in March, the moment Wolfram identified as the pivotal point where OpenAI stopped being the undisputed leader. It foreshadowed Google's full comeback later in the year.

Mar 27 Episode ↗

🎙️ Hear our coverage →

#reasoning #benchmarks

Weights & Biases Dec 4, 2025

Products & Apps

LLM Evaluation Jobs

W&B launches LLM Evaluation Jobs for OpenAI-compatible APIs

Weights & Biases launched LLM Evaluation Jobs, letting teams run evaluations against any OpenAI-compatible API during training cycles instead of only at the end. The show framed it as a practical workflow upgrade for getting earlier model quality signals without blindly burning compute.

W&B LLM Evaluation Jobs ↗ · W&B announcement on X ↗

🎙️ Hear our coverage →

#benchmarks #coding #infrastructure

November 2025

Google DeepMind Nov 20, 2025

New Models

Gemini 3 Pro

Gemini 3 Pro launches with record ARC-AGI-2 scores

Gemini 3 Pro is Google's new frontier multimodal model, with a 1M-token context window and large reasoning gains. It scores 31.11% on ARC-AGI-2 (45.14% with Deep Think mode), roughly double the previous SOTA, plus 81% on MMLU-Pro, and brings major coding improvements. Amp switched to it as their default model on launch day, the first time they have ever switched defaults. It is also rolling out across Gmail, Calendar, and AI Mode in Google Search.

45.14% ARC-AGI-2 (Deep Think) · 31.11% ARC-AGI-2 (standard) · 1M Token context window

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #multimodal #frontier-models

Laude Institute / Stanford Nov 13, 2025

Benchmarks & Evals · Open weights

Terminal-Bench 2.0

Terminal-Bench 2.0 and Harbor launch as new bar for coding agents

Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1000 Discord contributors. The Warp agent tops the leaderboard at 50% with Codex CLI close behind, and the panel argued an unsaturated 50% ceiling makes it far more meaningful than near-saturated benchmarks like MMLU.

50% Terminal Bench v2 Top Score

Announcement on X ↗ · Harbor framework ↗ · Running Terminal-Bench docs ↗ · Terminal-Bench leaderboard ↗

🎙️ Hear our coverage →

#benchmarks #agents #coding

LMArena (LMSYS) Nov 13, 2025

Benchmarks & Evals

Code Arena

LMArena launches Code Arena for live agentic coding evaluations

LMArena launched Code Arena, a live evaluation platform where models build real applications agentically and humans vote on the results. It extends the arena-style crowdsourced ranking approach to agentic coding workflows.

Arena announcement on X ↗ · Code Arena blog post ↗ · Code Arena ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

Inworld AI Nov 6, 2025

New Models

Inworld TTS

Inworld TTS takes the #1 spot on the Artificial Analysis speech benchmark

Inworld released a new version of its TTS model that claimed the #1 position on the Artificial Analysis text-to-speech benchmark. It featured in the episode's voice segment as evidence that commercial TTS quality keeps climbing fast.

🎙️ Hear our coverage →

#voice-ai #benchmarks

September 2025

Meta AI & Hugging Face Sep 25, 2025

Benchmarks & Evals · Open weights

Gaia2 + ARE

Gaia2 agent benchmark and Agents Research Environments released

Meta and Hugging Face released Gaia2, a follow-up agent benchmark, together with ARE (Agents Research Environments) for testing agents in dynamic, asynchronous settings. It fed the episode's recurring concern that evaluation has to keep up whenever agent product claims get ambitious.

🎙️ Hear our coverage →

#benchmarks #agents

OpenAI Sep 25, 2025

Benchmarks & Evals

GDPval

OpenAI launches GDPval to measure models on real economic work

OpenAI introduced GDPval, an evaluation that measures model performance on real-world, economically valuable tasks drawn from a range of occupations and GDP sectors. On the show it anchored the discussion about agents moving from chat quality toward action and reliability in real environments.

🎙️ Hear our coverage →

#benchmarks #agents

Scale AI Sep 25, 2025

Benchmarks & Evals · Open weights

SWE-bench Pro

Scale AI debuts SWE-bench Pro, a harder contamination-resistant eval

Scale AI released SWE-bench Pro, a tougher, contamination-resistant successor to SWE-bench for evaluating coding agents on realistic software engineering tasks. It ships with a public dataset on Hugging Face plus separate public and commercial leaderboards, and frontier models score far lower than on the original SWE-bench.

HF Dataset ↗ · Public Leaderboard ↗ · Commercial Leaderboard ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

Jeremy Berman & Eric Pang Sep 18, 2025

Papers & Research

ARC-AGI SOTA method

Jeremy Berman and Eric Pang set new ARC-AGI SOTA using Grok-4

Independent researchers Jeremy Berman and Eric Pang published a new state-of-the-art result on ARC-AGI, built on Grok-4 with heavy test-time compute and iterative program synthesis. Berman joined the show to walk through the method, its limitations, and why iteration matters more than leaderboard narratives; the approach is documented in a detailed write-up.

🎙️ Hear our coverage →

#reasoning #benchmarks

Nous Research Sep 4, 2025

Benchmarks & Evals · Open weights

Husky Hold'em Bench

Nous launches Husky Hold'em Bench, an open-source pokerbot eval for LLMs

Nous Research released Husky Hold'em Bench, an open-source poker benchmark that evaluates LLM strategic play in a richer agentic environment than standard leaderboards. Guests Roger Jin and Bhavesh Kumar joined the show to explain how it measures agent behavior and decision-making under uncertainty rather than chasing another leaderboard point.

🎙️ Hear our coverage →

#benchmarks #agents

May 2025

Haize Labs May 29, 2025

New Models · Open weights

j1-nano & j1-micro

Haize Labs releases j1-nano and j1-micro tiny reward models

Haize Labs shipped j1-nano (600M params) and j1-micro (1.7B params), tiny open reward models for judging LLM outputs. Despite its small size, j1-micro scores 80.7% on RewardBench, making capable reward modeling accessible on modest hardware.

Tweet ↗ · GitHub ↗ · HF j1-micro ↗ · HF j1-nano ↗

🎙️ Hear our coverage →

#open-source #training #benchmarks

OpenAI May 15, 2025

Benchmarks & Evals · Open weights

HealthBench

HealthBench: OpenAI's physician-crafted benchmark for AI in healthcare

OpenAI released HealthBench, a benchmark for evaluating AI models on healthcare scenarios, built with input from physicians. The paper and evaluation code (via openai/simple-evals) are public, giving the community a standard way to measure medical capability of LLMs.

Blog ↗ · Paper ↗ · Code (simple-evals) ↗

🎙️ Hear our coverage →

#benchmarks #research

Cohere May 1, 2025

Papers & Research

The Leaderboard Illusion

Cohere Labs paper accuses Chatbot Arena (LMArena) of structural bias

Cohere Labs published 'The Leaderboard Illusion,' claiming LMArena lets big incumbents privately A/B-test dozens of model variants (Meta ran 27 hidden Llama-4 variants in a month), cherry-pick top scores, and receive far more battle data, inflating Elo ratings. LMArena responded that the leaderboard reflects real human preferences and pre-release testing is open to all providers.

Paper (ArXiv) ↗ · LMArena reply (X) ↗

🎙️ Hear our coverage →
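The cherry-picking claim rests on a general statistical effect: publishing the best of many noisy measurements overstates the true value. The simulation below illustrates only that effect. It assumes all variants are equally good and that each measured rating carries Gaussian noise with a 10-point standard deviation; both the ratings and the noise level are invented for illustration and are not taken from the paper.

```python
import random

def best_of_n_score(true_score: float, noise_sd: float,
                    n_variants: int, rng: random.Random) -> float:
    """Measure n equally good variants with noise and report only the best one."""
    measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
    return max(measured)

rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_SCORE = 1300.0     # hypothetical true rating shared by every variant
NOISE_SD = 10.0         # hypothetical measurement noise
TRIALS = 2000

# Average reported rating when one variant is submitted vs. the best of 27.
avg_single = sum(best_of_n_score(TRUE_SCORE, NOISE_SD, 1, rng)
                 for _ in range(TRIALS)) / TRIALS
avg_best27 = sum(best_of_n_score(TRUE_SCORE, NOISE_SD, 27, rng)
                 for _ in range(TRIALS)) / TRIALS

print(f"single: {avg_single:.1f}  best-of-27: {avg_best27:.1f}")
```

Even with identical underlying quality, the best-of-27 average lands well above the single-submission average, so selection alone can move a leaderboard position.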

UC Berkeley May 1, 2025

Datasets · Open weights

PromptEvals

PromptEvals: 12K+ real production assertion criteria for LLM evals

Shreya Shankar and collaborators released PromptEvals, the first large-scale corpus of production LLM guardrails: 2,087 developer prompts paired with 12,623 assertion criteria covering structure, style, grounding and hallucination checks, about 5x larger than prior sets. Fine-tuned open Mistral-7B and Llama-3-8B checkpoints generate assertions +21 F1 better than GPT-4o at a fraction of the latency. Accepted to NAACL 2025.

NAACL paper (ArXiv) ↗ · Dataset (Hugging Face) ↗ · Models (Hugging Face) ↗

🎙️ Hear our coverage →

#benchmarks #training #coding

April 2025

OpenAI Apr 17, 2025

Benchmarks & Evals · Open weights

MRCR

OpenAI open sources the MRCR long-context benchmark dataset

OpenAI open sourced MRCR, a benchmark dataset for evaluating long-context, complex retrieval tasks, building on Gemini research from Google and publishing the dataset on Hugging Face.

Hugging Face ↗

🎙️ Hear our coverage →

#benchmarks #architecture

Weights & Biases Apr 17, 2025

Major Features & Updates

W&B Weave Playground

W&B Weave Playground adds GPT-4.1 family and o3/o4-mini support

The Weights & Biases Weave Playground shipped full support for the new GPT-4.1 family and the o3/o4-mini models, letting developers evaluate and compare the week's new models for their own applications.

X ↗ · W&B Weave ↗

🎙️ Hear our coverage →

#benchmarks #coding

CoreWeave Apr 3, 2025

Benchmarks & Evals

CoreWeave GB200 inference benchmark

CoreWeave hits 800 tok/s on Llama 405B with NVIDIA GB200 Blackwell

CoreWeave announced record-breaking AI inference benchmarks using NVIDIA's new GB200 Grace Blackwell superchips: 800 tokens/sec on Llama 3.1 405B, plus 33,000 tokens/sec on Llama 2 70B with H200s. It is a marker of how fast inference hardware is accelerating.

800 tok/s Llama 3.1 405B on GB200 · 33,000 tok/s Llama 2 70B on H200

CoreWeave press release ↗

🎙️ Hear our coverage →

#infrastructure #benchmarks

Google DeepMind Apr 3, 2025

Benchmarks & Evals

Gemini 2.5 Pro USAMO results

Gemini 2.5 Pro scores 24.4% on USAMO olympiad math, crushing the field

New evaluation results published this week showed Gemini 2.5 Pro scoring 24.4% on the USA Math Olympiad (USAMO), problems so hard that most top models score under 5%. The result showcases a step change in frontier reasoning ability on competition mathematics.

24.4% Gemini 2.5 Pro USAMO score · <5% typical score for other top models

🎙️ Hear our coverage →

#reasoning #benchmarks

OpenAI Apr 3, 2025

Benchmarks & Evals · Open weights

PaperBench

OpenAI releases PaperBench eval and open-sources Nano-Eval framework

OpenAI published PaperBench, a tough new evaluation that tests whether AI agents can replicate cutting-edge AI research papers, with more than 8,300 graded tasks and meta-evaluation of the LLM judge. The best model managed only a 21.0% replication score versus 41.4% for human PhDs. The code and the Nano-Eval framework were open sourced on GitHub alongside the paper.

8,300+ graded tasks in the benchmark · 21.0% best model replication score · 41.4% human PhD baseline score

PaperBench announcement ↗ · PaperBench code on GitHub ↗ · PaperBench paper (PDF) ↗ · Nano-Eval framework (openai/preparedness) ↗

🎙️ Hear our coverage →

#benchmarks #research #agents

March 2025

ARC Prize Foundation Mar 27, 2025

Benchmarks & Evals

ARC-AGI 2

ARC-AGI 2 benchmark revealed, thinking models score just 4%

The ARC Prize Foundation revealed ARC-AGI 2, the next iteration of the abstract reasoning benchmark. Base LLMs score 0% and even thinking models only reach about 4%, showing how far current frontier models remain from human-level fluid intelligence.

0% base LLM score on ARC-AGI 2 · 4% thinking model score on ARC-AGI 2

X announcement ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

OpenAI Mar 27, 2025

New Models

GPT-4o (2025-03-26)

GPT-4o gets an update, ties for #1 on LMArena beating GPT-4.5

OpenAI shipped a new GPT-4o checkpoint (2025-03-26) that jumped over GPT-4.5 to tie for #1 on LMArena. The update landed as the show was being written and was read as a direct response to Gemini 2.5's launch in the escalating frontier-model race.

🎙️ Hear our coverage →

#frontier-models #benchmarks

Weights & Biases Mar 27, 2025

Dev Tools · Open weights

Weave MCP Server

W&B ships official Weave MCP server - talk to your evals

Weights & Biases shipped an official MCP server for Weave, its LLM observability and evaluation tool, letting agents and MCP clients query and analyze your evals directly. Morgan McGuire of the W&B Applied AI team demoed it on the show, with wandb Models integration coming soon so agents can monitor loss curves for you.

X announcement ↗ · GitHub repo ↗ · Example W&B report ↗

🎙️ Hear our coverage →

#agents #benchmarks #coding

Roboflow Mar 20, 2025

Benchmarks & Evals · Open weights

RF100-VL

Roboflow launches RF100-VL benchmark for vision-language models

Alongside RF-DETR, Roboflow introduced RF100-VL, a new evaluation benchmark for vision-language models built from real-world detection datasets. It gives the community a grounded way to measure how well VLMs handle practical object detection tasks.

RF100-VL Benchmark ↗ · RF-DETR Blog Post ↗

🎙️ Hear our coverage →

#benchmarks #vision

February 2025

Haize Labs Feb 20, 2025

Dev Tools · Open weights

Verdict

Haize Labs open-sources Verdict, a framework for composing LLM judges

Haize Labs released Verdict, an open-source framework for composing LLM judges that tackles core LLM-as-a-judge problems: self-preference bias, prompt sensitivity, and meta-evaluation. Verdict combines simpler judging primitives into more robust and efficient evaluators ('judge-time compute scaling'), achieving near state-of-the-art results on benchmarks like ExpertQA at a fraction of the cost, fast enough to use as a real-time guardrail. Co-founders Leonard Tang and Nimit joined the show to discuss it.

Whitepaper ↗ · GitHub ↗ · Thread on X ↗

🎙️ Hear our coverage →

#benchmarks #open-source
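The idea of combining simple judging primitives into a sturdier evaluator can be sketched with a plain majority vote. This is a generic illustration and does not use Verdict's API: the three judges are keyword and length heuristics standing in for LLM calls, and every name here is hypothetical.

```python
from collections import Counter
from typing import Callable

# A judge maps an answer to a pass/fail decision. In practice each would be an LLM call.
Judge = Callable[[str], bool]

def majority_vote(judges: list[Judge], answer: str) -> bool:
    """Pass the answer only if more judges accept it than reject it."""
    votes = Counter(judge(answer) for judge in judges)
    return votes[True] > votes[False]

# Stand-in judges (simple heuristics, for illustration only).
def cites_source(answer: str) -> bool:
    return "[source]" in answer

def is_concise(answer: str) -> bool:
    return len(answer.split()) <= 50

def avoids_hedging(answer: str) -> bool:
    return "maybe" not in answer.lower()

panel = [cites_source, is_concise, avoids_hedging]
accepted = majority_vote(panel, "Paris is the capital of France [source].")  # True
rejected = majority_vote(panel, "Maybe it is Paris.")                        # False
```

Using an odd number of judges avoids ties; a tie with this rule counts as a rejection.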

University of Cambridge researchers Feb 20, 2025

Benchmarks & Evals · Open weights

ZeroBench

ZeroBench: the 'impossible' benchmark where all top VLMs score zero

A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.

Announcement on X ↗ · Project page ↗ · Paper ↗ · Hugging Face ↗

🎙️ Hear our coverage →

#benchmarks #vision

Weights & Biases Feb 20, 2025

Papers & Research

Agents Whitepaper & Course

Weights & Biases releases an AI agents whitepaper and announces agents course

Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.

Whitepaper ↗ · Agents course signup ↗

🎙️ Hear our coverage →

#agents #benchmarks #coding

January 2025

Center for AI Safety & Scale AI Jan 23, 2025

Benchmarks & Evals

Humanity's Last Exam (HLE)

Humanity's Last Exam: a deliberately unsaturated frontier benchmark

Humanity's Last Exam (HLE) launched as a new, very hard benchmark designed to stay unsaturated as models max out MMLU and math evals. It crowdsourced expert-level questions to measure frontier model capability where existing benchmarks are at 98-99% saturation.

Humanity's Last Exam website ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

Weights & Biases Jan 23, 2025

Also Released

W&B SWE-bench Verified SOTA agent

W&B programming agent breaks SOTA on SWE-bench Verified

Weights & Biases announced a state-of-the-art AI programming agent built with OpenAI's o1 that broke the SOTA score on SWE-bench Verified. The work was developed and tracked with W&B Weave, the team's LLM observability toolkit.

W&B SOTA programming agent report ↗ · W&B Weave ↗

🎙️ Hear our coverage →

#coding #agents #benchmarks