Benchmarks & Evals
Humanity's Last Exam (HLE)
Humanity's Last Exam: a deliberately unsaturated frontier benchmark
Humanity's Last Exam (HLE) launched as a new, very hard benchmark designed to stay unsaturated even as models max out MMLU and math evals. It crowdsources expert-level questions to measure frontier model capability in a regime where top scores on existing benchmarks already sit at 98-99%, leaving little headroom to tell models apart.