Everything AI Released in February 2025

22 releases covered live on the show — every model, product, paper and tool that mattered, with links and our analysis.

🧠 New Models 12

Anthropic
New Models

Claude 3.7 Sonnet

Anthropic releases Claude 3.7 Sonnet, a coding beast with immaculate vibes

Anthropic shipped its long-awaited model update, Claude 3.7 Sonnet, which the crew called a coding BEAST with 'immaculate' vibes. It was one of the week's two huge model drops alongside GPT-4.5 and became an instant favorite for AI coding workflows like those discussed in the Windsurf interview.

Hume AI
New Models

Octave

Hume AI launches Octave, a TTS model that understands what it says

Hume AI released Octave, which it calls the first text-to-speech model that understands what it's saying, adjusting emotion, emphasis, and delivery based on the meaning of the text. It fits the episode's humanlike AI voices theme, letting users direct performances with natural-language acting instructions.

Inception Labs
New Models

Mercury

Inception Labs debuts Mercury, a commercial diffusion LLM

Inception Labs announced Mercury, billed as the first commercial-scale diffusion large language model, generating text via diffusion rather than autoregressive decoding. The approach promises dramatically faster token throughput, demoed first with the Mercury Coder playground.

Microsoft
New Models
Open weights

Phi-4-multimodal

Microsoft releases Phi-4-multimodal and Phi-4-mini open weights

Microsoft expanded the Phi family with Phi-4-multimodal-instruct, a small open-weights model that handles text, vision, and audio in a single model, alongside a compact Phi-4-mini. The weights shipped on Hugging Face, continuing Microsoft's push for capable small models that can run on-device.

OpenAI
New Models

GPT-4.5

OpenAI ships GPT-4.5, its largest model yet at roughly 10x scale

OpenAI released GPT-4.5, which broke as news during the show. It is the company's first '.5' version jump in two years and is reportedly around 10x the scale of the previous model, with speculation of 10+ trillion parameters. Sam Altman said it 'won't crush on benchmarks' against reasoning models, but early vibes praised its creative writing, vision, and medical diagnosis abilities, and it is expected to serve as the base for future o-series reasoners.

Arc Institute & NVIDIA
New Models
Open weights

Evo 2

Arc Institute and NVIDIA release Evo 2, a 40B state-of-the-art genomics model

Arc Institute and NVIDIA introduced Evo 2, a state-of-the-art genomics model with around 40 billion parameters trained on 9.3 trillion nucleotides. It uses the StripedHyena architecture to process genetic sequences up to 1 million nucleotides, enabling prediction of genetic mutation effects and even design of entire genomes. The release is fully open, with two papers, weights, data, and training and inference codebases.

Figure
New Models

Helix

Figure announces Helix, an on-robot VLA model enabling robot-to-robot handoffs

Humanoid robot company Figure announced Helix, a Vision-Language-Action (VLA) model with full upper-body control that runs entirely on the robot, pairing a 7 billion parameter VLM for understanding with an 80 million parameter transformer for control. The demo showed two robots collaborating and handing objects to each other from natural language commands, a first that Alex called 'super futuristically cool'.

Microsoft
New Models
Open weights

MUSE (WHAM)

Microsoft MUSE generates playable game worlds from a single second of video

Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions, preserving screen elements like health bars and percentages. It is based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox, with the model released on Hugging Face.

Microsoft
New Models
Open weights

OmniParser v2

Microsoft ships OmniParser v2 for faster screen parsing in GUI agents

Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.

Perplexity
New Models
Open weights

R1-1776

Perplexity releases R1-1776, a censorship-free DeepSeek R1 fine-tune

Perplexity open-sourced R1-1776, a fine-tuned version of DeepSeek R1 designed to remove Chinese government censorship on topics like Tiananmen Square and Taiwanese independence. Perplexity used human experts to identify around 300 sensitive topics and built a censorship classifier to train the bias out, claiming no significant impact on standard eval performance. The name 1776 is a nod to American independence.

StepFun
New Models
Open weights

Step-Video-T2V

StepFun open-sources Step-Video-T2V, a SOTA 30B text-to-video model

StepFun released Step-Video-T2V (plus a T2V Turbo variant), a 30 billion parameter state-of-the-art text-to-video model under an MIT license. Results impressed especially on text integration, such as rendering 'We will open source' on a scroll as a character unfurls it, marking one of the strongest open-source video drops of the week.

xAI
New Models

Grok 3

xAI launches Grok 3, claiming SOTA benchmarks and a 1M token context window

xAI dropped Grok 3 on Monday evening, claiming state-of-the-art performance on several benchmarks and a 1 million token context window, with heavy emphasis on agents and future reasoners. The launch was messy, with a bug serving Grok 2 to some users and an eval-methodology spat with OpenAI over best-of-N scores, but vibes shifted positive, with co-hosts calling the base model the best coding model out. It is free for now, 'until their GPUs melt', with no API yet for independent evaluation.
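To see why best-of-N scores are hard to compare with single-sample scores, here is a minimal sketch. It assumes independent samples and a perfect selector that always picks a correct answer when one exists; both are simplifying assumptions for illustration, not a description of how either lab ran its evals, and the function name is hypothetical.

```python
def best_of_n_accuracy(p_single: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct.

    p_single is the per-sample success probability (the single-attempt score).
    Assumes an oracle selector, so this is an upper bound on best-of-N.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    # Same underlying model, different reporting choices.
    for n in (1, 8, 64):
        print(f"N={n:>2}: {best_of_n_accuracy(0.5, n):.4f}")
```

Under these assumptions the same model looks far stronger as N grows, which is why a chart mixing single-attempt and best-of-N numbers can mislead.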

🚀 Products & Apps 1

Microsoft
Products & Apps

Majorana 1

Microsoft unveils Majorana 1 quantum chip and a new state of matter

Microsoft announced the Majorana 1 quantum chip alongside a claimed new state of matter called topological superconductivity, carving a new path for quantum computing. Alex called the announcement 'absolutely mind blowing' as a potential big deal for the future of computing.

✨ Major Features & Updates 2

xAI
Major Features & Updates

Grok Voice Mode

xAI ships Grok's unhinged voice mode

A week after launching Grok 3 without voice, xAI released Grok's voice mode, including an 'unhinged' personality option that the panel demoed live. It marks xAI's entry into real-time conversational voice AI alongside OpenAI's advanced voice mode.

xAI
Major Features & Updates

DeepSearch

xAI launches DeepSearch, an agentic research feature with live X access

Alongside Grok 3, xAI launched DeepSearch, an agentic deep-research feature comparable to Perplexity or OpenAI's Deep Research, with a leg up on real-time information thanks to native access to X search. Alex's initial tests were underwhelming, and he nicknamed it 'Shallow Search' after it spent 34 seconds on a query where OpenAI's Deep Research took 11 minutes and cited 17 sources.

🔌 APIs & Platforms 1

Google DeepMind
APIs & Platforms

Veo 2 (via FAL API)

Google's Veo 2 video model becomes available via FAL API

Google DeepMind's Veo 2 video generation model became accessible to developers through FAL's inference API. This was the first broadly available API access to Veo 2, letting builders generate high-quality video from text prompts without waiting on Google's own product surfaces.

🛠️ Dev Tools 3

DeepSeek
Dev Tools
Open weights

Open Source Week infra releases

DeepSeek open-sources its infra stack during Open Source Week

DeepSeek ran its Open Source Week, releasing a series of production infrastructure repos (including FlashMLA, DeepEP, and DeepGEMM) that power its training and inference stack. The drops gave the open-source community a rare look at the low-level kernels and communication libraries behind DeepSeek's efficient frontier models.

Haize Labs
Dev Tools
Open weights

Verdict

Haize Labs open-sources Verdict, a framework for composing LLM judges

Haize Labs released Verdict, an open-source framework for composing LLM judges that tackles core LLM-as-a-judge problems: self-preference bias, prompt sensitivity, and meta-evaluation. Verdict combines simpler judging primitives into more robust and efficient evaluators ('judge-time compute scaling'), achieving near state-of-the-art results on benchmarks like ExpertQA at a fraction of the cost, fast enough to use as a real-time guardrail. Co-founders Leonard Tang and Nimit joined the show to discuss it.
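The idea of composing simple judges into a sturdier one can be sketched in a few lines. This is not Verdict's API: every name below is hypothetical, and the stub judges use string checks where a real system would call an LLM. It only illustrates the general pattern of combining weak judging primitives with a majority vote.

```python
from collections import Counter
from typing import Callable, List

# A judge maps (question, answer) to a verdict label, "pass" or "fail".
# These stubs are hypothetical stand-ins for LLM-backed judges.
Judge = Callable[[str, str], str]


def length_judge(question: str, answer: str) -> str:
    """Fail answers shorter than three words."""
    return "pass" if len(answer.split()) >= 3 else "fail"


def keyword_judge(question: str, answer: str) -> str:
    """Pass only if the answer reuses a longer word from the question."""
    q_words = {w.lower().strip("?.,") for w in question.split() if len(w) > 3}
    a_words = {w.lower().strip("?.,") for w in answer.split()}
    return "pass" if q_words & a_words else "fail"


def no_refusal_judge(question: str, answer: str) -> str:
    """Fail answers that simply decline to answer."""
    return "fail" if "i don't know" in answer.lower() else "pass"


def majority_vote(judges: List[Judge]) -> Judge:
    """Compose judges into one judge that returns the most common verdict.

    Use an odd number of judges so a two-label vote cannot tie.
    """
    def ensemble(question: str, answer: str) -> str:
        votes = Counter(judge(question, answer) for judge in judges)
        return votes.most_common(1)[0][0]
    return ensemble


if __name__ == "__main__":
    judge = majority_vote([length_judge, keyword_judge, no_refusal_judge])
    question = "What is the capital of France?"
    print(judge(question, "The capital of France is Paris."))
    print(judge(question, "I don't know"))
```

Because the composed judge has the same signature as its parts, ensembles can themselves be nested, which is one plausible reading of scaling compute at judge time.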

Hao AI Lab
Dev Tools
Open weights

FastVideo

Hao AI Lab's FastVideo makes HunyuanVideo 3x faster with no extra training

Hao AI Lab released FastVideo, a method that makes HunyuanVideo (HY-Video) three times faster with no additional training, using a technique called Sliding Tile Attention that outperforms even FlashAttention for this workload. Faster inference makes open-source video models far more practical, and FastVideo supports HY-Video LoRAs for fine-tuned applications.

📄 Papers & Research 1

Weights & Biases
Papers & Research

Agents Whitepaper & Course

Weights & Biases releases an AI agents whitepaper and announces agents course

Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.

📊 Benchmarks & Evals 1

Benchmarks & Evals
Open weights

ZeroBench

ZeroBench: the 'impossible' benchmark where all top VLMs score zero

A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.

🌀 Also Released 1