Benchmarks & Evals · Open weights
Terminal-Bench 2.0
Terminal-Bench 2.0 and Harbor launch as new bar for coding agents
Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1,000 Discord contributors. The Warp agent tops the leaderboard at 50%, with Codex CLI close behind. The panel argued that a top score of only 50% leaves the benchmark well short of saturation, making it far more meaningful than near-saturated benchmarks like MMLU.
50% Terminal-Bench 2.0 Top Score