New Models
Gemini 2.5 Flash
Google launches Gemini 2.5 Flash with controllable thinking budgets
Google answered OpenAI's launch week with Gemini 2.5 Flash, a fast reasoning model that introduces controllable thinking budgets so developers can dial how much the model reasons per request. It is available through the Gemini API and developer platform.
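For readers wondering what "dialing" a thinking budget looks like in practice, here is a minimal sketch that only builds a request body and sends nothing over the network. The field names (generationConfig, thinkingConfig, thinkingBudget) follow our understanding of the Gemini REST API, and the 24576-token ceiling is an assumption for illustration; the helper name build_request is hypothetical. Check the current Gemini API documentation before relying on any of it.

```python
import json

# Assumed upper bound for illustration only; the real limit may differ by model.
MAX_THINKING_BUDGET = 24576


def build_request(prompt: str, thinking_budget: int) -> dict:
    """Build a generateContent-style JSON body with a per-request thinking budget.

    Hypothetical helper: it only assembles a dict and never calls the API.
    """
    if not 0 <= thinking_budget <= MAX_THINKING_BUDGET:
        raise ValueError(
            f"thinking_budget must be between 0 and {MAX_THINKING_BUDGET}"
        )
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            # A budget of 0 asks for no reasoning tokens; larger values allow more.
            "thinkingConfig": {"thinkingBudget": thinking_budget}
        },
    }


if __name__ == "__main__":
    # Cheap request: no thinking. Harder request: allow up to 2048 reasoning tokens.
    print(json.dumps(build_request("What is 2 + 2?", 0), indent=2))
    print(json.dumps(build_request("Plan a 3-step refactor.", 2048), indent=2))
```

The point of the design is that the budget is set per request, so one deployment can serve quick lookups cheaply and reserve reasoning tokens for harder prompts.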
New Models
o3 & o4-mini
OpenAI launches o3 and o4-mini, SOTA reasoning models with tool use
OpenAI shipped o3 and o4-mini in ChatGPT and the API, with o3 setting new SOTA results on Codeforces, SWE-bench, MMMU and more. For the first time, OpenAI's reasoning models can use tools (web search, Python, image generation) during the reasoning process, and they can think visually by cropping, zooming and rotating images. o3 earned $65k on the SWE-Lancer freelance-coding eval versus o1's $28k, and o4-mini hits 99.5% on AIME when given a Python interpreter.
$65k o3 score on the SWE-Lancer eval (vs o1's $28k)
99.5% o4-mini on AIME with Python interpreter
200k Context window (tokens)
New Models · Open weights
INTELLECT-2
Prime Intellect launches INTELLECT-2, a 32B globally-distributed RL run
Prime Intellect released INTELLECT-2, a 32B reasoning model trained with globally decentralized reinforcement learning, a follow-up to the INTELLECT-1 decentralized pretraining run covered on the show in December. The release includes open weights on Hugging Face, a tech report, and the PRIME-RL training code.
New Models · Open weights
GLM-4-0414
Z.ai (formerly chatGLM) releases the GLM-4-0414 open-source family
Z.ai, the rebranded Zhipu AI / chatGLM team, released the GLM-4-0414 family of open-source models. The drop includes base, reasoning and rumination variants published on Hugging Face and GitHub.
Papers & Research
Seed-Thinking-v1.5
ByteDance publishes Seed-Thinking-v1.5 reasoning model tech report
ByteDance's Seed team published Seed-Thinking-v1.5, a new reasoning model announced via a technical report on GitHub. It was mentioned among the week's open-source LLM news, though weights were not released at the time.
New Models · Open weights
Cogito v1 Preview (3B-70B)
Deep Cogito debuts Cogito v1 Preview models from 3B to 70B, beating DeepSeek 70B
New lab Deep Cogito released the Cogito v1 Preview family of open models ranging from 3B to 70B parameters, claiming SOTA results at each size and beating DeepSeek's 70B distill. The models are available on Hugging Face, giving local AI enthusiasts the small-to-mid sizes Llama 4 skipped.
3B-70B Model size range
New Models · Open weights
Kimi-VL & Kimi-VL-Thinking
Moonshot drops Kimi-VL and Kimi-VL-Thinking, tiny A3B open vision models
Moonshot AI released Kimi-VL and Kimi-VL-Thinking, compact vision-language models with only ~3B active parameters (A3B MoE). The thinking variant adds reasoning to a tiny VLM, and both are available openly on Hugging Face.
A3B ~3B active parameters (MoE)
New Models · Open weights
Llama-3.1-Nemotron-Ultra-253B
NVIDIA ships Nemotron Ultra, a 253B pruned and distilled Llama 3.1-405B
NVIDIA released Nemotron Ultra, a pruned and distilled finetune of Llama 3.1-405B cut to 253B parameters, roughly 60% of the original size. NVIDIA's published benchmarks include Llama 4 comparisons and show the older finetuned Llama beating the new models on AIME, GPQA and more. It supports 128K context and fits on a single 8xH100 node for inference.
253B Parameters (pruned from Llama 3.1-405B)
128K Context window
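A quick back-of-the-envelope check shows why the pruning matters for the single-node claim. This sketch assumes 16-bit weights (2 bytes per parameter) and 80 GB of memory per H100, and it counts weights only, ignoring KV cache and activations; none of those assumptions come from NVIDIA's release, and the function names are hypothetical.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight footprint in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billions * bytes_per_param


def fits_on_node(
    params_billions: float,
    bytes_per_param: float,
    gpus: int = 8,
    gb_per_gpu: float = 80.0,  # assumed H100 memory size
) -> bool:
    """True if the weights alone fit in the node's combined GPU memory."""
    return weight_memory_gb(params_billions, bytes_per_param) <= gpus * gb_per_gpu


if __name__ == "__main__":
    for name, size in [("Nemotron Ultra", 253), ("Llama 3.1-405B", 405)]:
        gb = weight_memory_gb(size, 2)  # 2 bytes per parameter (16-bit weights)
        print(f"{name}: {gb} GB of weights, fits on 8x80GB: {fits_on_node(size, 2)}")
    print(f"Size ratio: {253 / 405:.2f}")
```

Under these assumptions the 253B model needs about 506 GB for weights against 640 GB on the node, while the original 405B model would need about 810 GB and would not fit without quantization or a second node.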
New Models · Open weights
DeepCoder-14B-Preview
DeepCoder-14B: open RL-finetuned coder beats DeepSeek R1 and o3-mini on coding
Together AI and Agentica (UC Berkeley Sky Computing Lab) released DeepCoder-14B-Preview, a reasoning model finetuned with RL that beats DeepSeek R1 and even o3-mini on several coding benchmarks. The project aims to democratize RL: the team open-sourced the model, the training dataset, the Weights & Biases logs, and the eval logs. Guest Michael Luo from Agentica joined the show to discuss the release.
14B Model parameters
Benchmarks & Evals
Gemini 2.5 Pro USAMO results
Gemini 2.5 Pro scores 24.4% on USAMO olympiad math, crushing the field
New evaluation results published this week showed Gemini 2.5 Pro scoring 24.4% on the USA Math Olympiad (USAMO), whose problems are so hard that most top models score under 5%. The result points to a step change in frontier reasoning ability on competition mathematics.
24.4% Gemini 2.5 Pro USAMO score
<5% Typical score for other top models
New Models
Dream 7B
Dream 7B: a diffusion language model challenger unveiled
Researchers unveiled Dream 7B, a diffusion-based language model that posts strong benchmark results, notably on planning-style tasks like Sudoku, possibly because parallel generation handles global constraints better than autoregression. It hints at viable alternative LLM architectures, but the weights were not yet released at show time, so results could not be independently verified.