Everything AI Released in May 2025

43 releases covered live on the show — every model, product, paper and tool that mattered, with links and our analysis.

🧠 New Models 22

Black Forest Labs
New Models

FLUX.1 Kontext

Black Forest Labs drops FLUX.1 Kontext, SOTA image editing

Black Forest Labs, creators of Flux, released Kontext: three models (Pro, Max, and a 12B open-weights Dev in private preview) for consistent, context-aware text and image editing. Unlike GPT-image or VEO-style regeneration, Kontext keeps identity consistent across edits, adding what you ask for without changing your face every generation. Broke as news during the show.

DeepSeek
New ModelsOpen weights

DeepSeek-R1-0528

DeepSeek drops R1-0528, an updated open reasoning model with big gains

DeepSeek released R1-0528 out of nowhere, an update to their open-weights reasoning model with serious performance jumps: AIME 91, LiveCodeBench 73, and SWE-bench Verified 57.6. They also shipped an 8B distilled version based on Qwen3 that can run on a laptop, keeping it among the best open-weight models available.

91 AIME score, beating previous R1 by a mile8B Distilled Qwen3-based version runnable on a laptop
Tencent (Hunyuan)
New Models

HunyuanPortrait

Tencent's HunyuanPortrait animates portraits from a single photo

Tencent's Hunyuan team published HunyuanPortrait, a model for high-fidelity portrait video generation from a single photo. It animates a still portrait into realistic talking-head video, with an accompanying paper.

Tencent (Hunyuan)
New ModelsOpen weights

HunyuanVideo-Avatar

Tencent releases HunyuanVideo-Avatar for audio-driven avatars

Tencent Hunyuan released HunyuanVideo-Avatar, an audio-driven full-body avatar animation model. Feed it audio and a reference image and it animates a full-body avatar in sync, pushing AI-generated humans further toward indistinguishable.

A-M Team
New ModelsOpen weights

AM-Thinking v1

AM-Thinking v1: 32B dense reasoning model beats bigger MoEs at math and code

A 32B dense open-weights reasoning LLM from a new Chinese team that takes on much larger mixture-of-experts models and comes out on top for math and code, hitting 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard. It supports a /think reasoning toggle, ships with a permissive license, is tooled for vLLM, LM Studio, and Ollama, and runs at 25 tokens/sec on a single 80GB GPU with INT4 quantization. A multilingual RLHF pass and 128k context window are in the works.

32B dense parameters85.3% AIME 202425 tokens/sec on a single 80GB GPU with INT4
Alibaba
New ModelsOpen weights

Wan 2.1

Alibaba's Wan 2.1: open-source diffusion-transformer text-to-video suite

Alibaba, the team behind the Qwen LLMs, released Wan 2.1, a full stack of open-source diffusion-transformer text-to-video foundation models. Amid the show's discussion of video-model fatigue, this was called out as a release that cuts through the noise, with weights on Hugging Face and code on GitHub.

Lightricks
New Models

LTX Video (distilled)

LTX distilled model enables near real-time video generation

Lightricks shared a distilled version of its LTX video model that generates video at near real-time speeds. It was highlighted in the vision and video segment as a notable speed milestone for video generation.

Stability AI
New ModelsOpen weights

Stable Audio Open Small

Stability AI and Arm release Stable Audio Open Small for on-device audio

Stability AI, together with Arm, released Stable Audio Open Small, a 341M-parameter open text-to-audio model built for real-world on-device deployment. The show framed it as part of a small comeback for Stability, with weights on Hugging Face and an accompanying paper.

StepFun
New ModelsOpen weights

Step1X-3D

StepFun's Step1X-3D: open two-stage framework for textured 3D assets

StepFun released Step1X-3D, an open two-stage framework for high-fidelity, controllable generation of textured 3D assets: it first synthesizes watertight geometry, then generates view-consistent textures. Trained on 2M curated meshes, the release also includes a curated dataset of 800K assets and a Hugging Face demo.

New ModelsOpen weights

Falcon-Edge

Falcon-Edge: ternary BitNet LLMs for edge deployment under 1GB VRAM

TII's Falcon-Edge project releases ternary BitNet LLMs (1B and 3B base models) that slash memory and compute requirements, enabling inference on less than 1GB of VRAM. Fine-tuners get pre-quantized checkpoints and a clear path to 1-bit LLMs.

Alibaba (Qwen)
New ModelsOpen weights

Qwen 3

Alibaba open-weights the full Qwen 3 family under Apache 2.0

Alibaba released the entire Qwen 3 stack: two MoE models (235B total/22B active and 30B/3B active) plus six dense siblings from 32B down to 0.6B, all Apache 2.0 with day-one support in LM Studio, Ollama, vLLM, MLX and llama.cpp. The headline feature is a runtime hybrid 'thinking' toggle (/think and /no_think) that trades latency for reasoning depth. Trained on ~36T tokens with 128K context and 119-language coverage, the 235B MoE rivals DeepSeek-R1, o1, o3-mini and Gemini 2.5 Pro on coding and math.

235 B Flagship MoE total parameters (22B active)30 B Qwen3-30B-A3B hit 57 tok/s on a Mac with speculative decoding36 Trillions of pre-training tokens (2x Qwen 2.5)
HiDream
New ModelsOpen weights

HiDream E1

HiDream E1: open-weights image model with standout Ghibli style

HiDream released E1, an open-weights image editing/generation model (Apache 2.0-style licensing) noted for beautiful Ghibli-style outputs. It ranks #4 on the Artificial Analysis image arena leaderboard, sitting among top contenders like Google Imagen and ReCraft.

Kyutai
New ModelsOpen weights

Helium-1

Kyutai releases Helium-1, a 2B European-language model plus dactory pipeline

Kyutai released Helium-1, a 2B-parameter model distilled from Gemma-2-9B and purpose-built for Europe's 24 official languages, under CC-BY 4.0. It sets a new state of the art for its size class on MMLU-EU, ARC-EU and FLORES translation while fitting in under 2GB VRAM for edge and phone deployment. They also open-sourced 'dactory' (MIT), their full Common Crawl data-processing pipeline that scores, dedups and tags webpages.

Microsoft
New ModelsOpen weights

Phi-4-reasoning

Microsoft ships Phi-4-reasoning and Phi-4-reasoning-plus (14B, MIT)

Microsoft fine-tuned the 14B Phi-4 on 1.4M curated chain-of-thought traces (SFT) and added a small RL stage (Plus variant) to create two MIT-licensed reasoning models. They punch far above their weight: Phi-4-reasoning-plus outperforms DeepSeek-R1-Distill-70B on AIME 25 (78% vs 51%) and sits within a few points of the full 671B DeepSeek-R1, while running on a single GPU with explicit <think> scaffolding.

OpenPipe
New ModelsOpen weights

ART·E

OpenPipe's ART·E: RL-trained open email agent that beats o3

OpenPipe released ART·E, an Apache 2.0 email research agent built on a 14B Qwen 2.5 backbone, trained on 500K Enron emails plus synthetic Q&A and refined with reinforcement learning. It tops o3 on accuracy (96% vs 90%) while running 5x faster (1.1s median) and 64x cheaper ($0.85 per 1,000 queries), using a simple three-tool loop.

Xiaomi
New ModelsOpen weights

MiMo-7B

Xiaomi enters open weights with MiMo-7B, MIT-licensed reasoning family

Xiaomi's first open-weights release is a 7B dense family (Base, SFT, RL, RL-Zero) trained from scratch on 25T tokens with a multi-token-prediction objective and rule-verifiable reinforcement learning. The RL variant matches OpenAI o1-mini on benchmark suites despite being far smaller, scoring 55.4% on AIME 2025 and 49.3% on LiveCodeBench v6, all under an MIT license with vLLM-ready weights.

🚀 Products & Apps 5

Kyutai
Products & Apps

Unmute.sh

Kyutai launches Unmute.sh, a low-latency voice wrapper for any LLM

Kyutai (the lab behind Moshi) launched Unmute.sh, a modular wrapper that adds voice to any text LLM with under 300ms latency and semantic VAD that knows a thinking pause from a breath. It preserves the underlying text model's capabilities while adding natural voice interaction, and is slated to be open-sourced.

Odyssey
Products & Apps

Odyssey Interactive Video

Odyssey debuts real-time interactive AI video at 30 FPS

Odyssey launched interactive video: real-time AI world exploration rendered at 30 FPS, letting you walk through generated worlds as they are created. A glimpse at world-model-driven media where the video responds to you instead of just playing back.

Opera
Products & Apps

Opera Neon

Opera unveils Neon, an agent-centric AI browser

Opera announced Neon, an agent-centric AI browser built for autonomous web tasks. Instead of just assisting with browsing, it is designed to act on the web for you, joining the emerging category of agentic browsers.

Google DeepMind
Products & Apps

AlphaEvolve

AlphaEvolve: Gemini-powered coding agent for discovering new algorithms

Google DeepMind announced AlphaEvolve, a Gemini-powered coding agent that designs and evolves advanced algorithms, credited on the show as one of the week's mind-bending algorithmic-discovery stories. DeepMind opened an interest form for early access rather than shipping it broadly.

Nous Research
Products & AppsOpen weights

Psyche

Nous Research launches Psyche, a decentralized cooperative-training network

Psyche is Nous Research's decentralized cooperative-training network that lets distributed participants jointly train large models over the internet. The launch includes open code on GitHub and a live dashboard tracking the first run, a 40B model called Consilience. COO Dillon Rolnick joined the show to explain the decentralized training push.

✨ Major Features & Updates 7

OpenAI
Major Features & Updates

Advanced Voice Mode

OpenAI's Advanced Voice Mode can now sing

OpenAI updated ChatGPT's Advanced Voice Mode with new capabilities, including the ability to sing. Part of a week where voice interfaces kept converging on more natural, expressive interaction.

OpenAI
Major Features & Updates

GPT-4.1 in ChatGPT

OpenAI brings the previously API-only GPT-4.1 models into ChatGPT

OpenAI's GPT-4.1 series, previously available only via the API, is now selectable in the ChatGPT interface. The crew used the news to dig into model-picker UX: seven model options in the dropdown, each with its own quirks, speed, and context length, while most casual users don't even know the dropdown exists.

Anthropic
Major Features & Updates

Claude Integrations (MCP)

Claude.ai gets Integrations: remote MCP tool support for apps

Breaking during the show: Anthropic announced Integrations, letting Claude connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare and PayPal via MCP. Developers can build their own integrations quickly, bringing tool use to Claude.ai itself rather than just the API.

OpenAI
Major Features & Updates

ChatGPT Shopping

ChatGPT adds shopping capabilities

OpenAI rolled out shopping features in ChatGPT, letting the assistant find and recommend products for users. Mentioned briefly in the big-companies roundup amid the week's OpenAI sycophancy drama.

Runway
Major Features & Updates

Gen-4 References

Runway References brings character and scene consistency to Gen-4

Runway launched References for Gen-4 on all paid plans, letting creators supply reference images (characters, outfits, locations, even selfies) and use tags in prompts to keep those elements consistent across generations. It tackles AI video's biggest pain point, frame-to-frame identity drift, at no extra credit cost per run.

🔌 APIs & Platforms 4

Mistral AI
APIs & Platforms

Mistral Agents API

Mistral launches Agents API for building tool-using agents

Mistral released an Agents API, a framework for building custom tool-using agents on top of Mistral models. It joins the wave of big-lab agent frameworks, letting developers wire up tools and orchestrate agentic workflows through Mistral's platform.

Mistral AI
APIs & Platforms

Mistral Embed

Mistral ships new state-of-the-art embedding API

Mistral announced a new state-of-the-art embedding API. The release gives developers a SOTA option for retrieval and semantic search workloads served through Mistral's platform.

Anthropic
APIs & Platforms

Web Search API

Anthropic launches Web Search API for real-time retrieval in Claude

Anthropic released a Web Search API that gives Claude models real-time web retrieval, letting developers ground responses in current information directly through the API. It was covered among the week's big-company API updates.

📄 Papers & Research 3

UC Berkeley
Papers & Research

Intuitor (Learning to Reason Without External Rewards)

Paper: models can learn to reason without external rewards

A mind-bending paper showing that reinforcement learning with internal or even random rewards can improve reasoning models. Intuitor matched or exceeded some GRPO results (the external-reward framework DeepSeek popularized with R1) when finetuning Qwen2.5 3B, questioning how much of RL's gains come from the reward signal itself.

3B Qwen2.5 model size where Intuitor matched or exceeded GRPO results
MiniMax (Hailuo)
Papers & Research

MiniMax Speech

MiniMax Speech tech report published, called the best TTS out there

MiniMax (Hailuo) published the technical report for MiniMax Speech, its text-to-speech system, which the show described as the best TTS out there. The report details the architecture behind the system on arXiv.

Cohere
Papers & Research

The Leaderboard Illusion

Cohere Labs paper accuses Chatbot Arena (LMArena) of structural bias

Cohere Labs published 'The Leaderboard Illusion,' claiming LMArena lets big incumbents privately A/B-test dozens of model variants (Meta ran 27 hidden Llama-4 variants in a month), cherry-pick top scores, and receive far more battle data, inflating Elo ratings. LMArena responded that the leaderboard reflects real human preferences and pre-release testing is open to all providers.

📦 Datasets 1

UC Berkeley
DatasetsOpen weights

PromptEvals

PromptEvals: 12K+ real production assertion criteria for LLM evals

Shreya Shankar and collaborators released PromptEvals, the first large-scale corpus of production LLM guardrails: 2,087 developer prompts paired with 12,623 assertion criteria covering structure, style, grounding and hallucination checks, about 5x larger than prior sets. Fine-tuned open Mistral-7B and Llama-3-8B checkpoints generate assertions +21 F1 better than GPT-4o at a fraction of the latency. Accepted to NAACL 2025.

📊 Benchmarks & Evals 1

OpenAI
Benchmarks & EvalsOpen weights

HealthBench

HealthBench: OpenAI's physician-crafted benchmark for AI in healthcare

OpenAI released HealthBench, a benchmark for evaluating AI models on healthcare scenarios, built with input from physicians. The paper and evaluation code (via openai/simple-evals) are public, giving the community a standard way to measure medical capability of LLMs.