Laude Institute / Stanford

1 release covered on ThursdAI

November 2025

Benchmarks & Evals, Open weights

Terminal-Bench 2.0

Terminal-Bench 2.0 and Harbor launch as a new bar for coding agents

Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1,000 Discord contributors. The Warp agent tops the leaderboard at 50%, with Codex CLI close behind. The panel argued that a top score of only 50% leaves the benchmark unsaturated, which makes it far more meaningful than near-saturated benchmarks like MMLU.

Terminal-Bench 2.0 top score: 50%