Benchmarks & Evals | Open weights
MRCR
OpenAI open sources the MRCR long-context benchmark dataset
OpenAI open sourced MRCR, a benchmark dataset for evaluating long-context, complex retrieval tasks. The benchmark builds on Gemini research from Google, and the dataset is published on Hugging Face.
Major Features & Updates
W&B Weave Playground
W&B Weave Playground adds GPT-4.1 family and o3/o4-mini support
The Weights & Biases Weave Playground shipped full support for the new GPT-4.1 family and the o3/o4-mini models, letting developers evaluate and compare the week's new models for their own applications.
Benchmarks & Evals
CoreWeave GB200 inference benchmark
CoreWeave hits 800 tok/s on Llama 405B with NVIDIA GB200 Blackwell
CoreWeave announced record-breaking AI inference benchmarks: 800 tokens/sec on Llama 3.1 405B using NVIDIA's new GB200 Grace Blackwell superchips, and 33,000 tokens/sec on Llama 2 70B using H200s. It is a marker of how fast inference hardware is accelerating.
800 tok/s Llama 3.1 405B on GB200
33,000 tok/s Llama 2 70B on H200
Benchmarks & Evals
Gemini 2.5 Pro USAMO results
Gemini 2.5 Pro scores 24.4% on USAMO olympiad math, crushing the field
New evaluation results published this week showed Gemini 2.5 Pro scoring 24.4% on the USA Math Olympiad (USAMO), problems so hard that most top models score under 5%. The result showcases a step change in frontier reasoning ability on competition mathematics.
24.4% Gemini 2.5 Pro USAMO score
<5% typical score for other top models
Benchmarks & Evals | Open weights
PaperBench
OpenAI releases PaperBench eval and open-sources Nano-Eval framework
OpenAI published PaperBench, a tough new evaluation that tests whether AI agents can replicate cutting-edge AI research papers, with more than 8,300 graded tasks and meta-evaluation of the LLM judge. The best model managed only a 21.0% replication score versus 41.4% for human PhDs. The code and the Nano-Eval framework were open sourced on GitHub alongside the paper.
8,300+ graded tasks in the benchmark
21.0% best model replication score
41.4% human PhD baseline score