Anthropic releases Claude 3.7 Sonnet, a coding beast with immaculate vibes
Anthropic shipped its long-awaited model update, Claude 3.7 Sonnet, which the crew called a coding BEAST with 'immaculate' vibes. It was one of the week's two huge model drops alongside GPT-4.5 and became an instant favorite for AI coding workflows like those discussed in the Windsurf interview.
Hume AI launches Octave, a TTS model that understands what it says
Hume AI released Octave, which it calls the first text-to-speech model that understands what it's saying, adjusting emotion, emphasis, and delivery based on the meaning of the text. It fits the episode's humanlike AI voices theme, letting users direct performances with natural-language acting instructions.
Inception Labs debuts Mercury, a commercial diffusion LLM
Inception Labs announced Mercury, billed as the first commercial-scale diffusion large language model, generating text via diffusion rather than autoregressive decoding. The approach promises dramatically faster token throughput, demoed first with the Mercury Coder playground.
Microsoft releases Phi-4-multimodal and Phi-4-mini open weights
Microsoft expanded the Phi family with Phi-4-multimodal-instruct, a small open-weights model that handles text, vision, and audio in a single model, alongside a compact Phi-4-mini. The weights shipped on Hugging Face, continuing Microsoft's push for capable small models that can run on-device.
OpenAI ships GPT-4.5, its largest model yet at roughly 10x scale
OpenAI released GPT-4.5, which landed as breaking news during the show. It is OpenAI's first .5-scale jump in two years and is reportedly around 10x the scale of the previous model, with speculation that it has 10+ trillion parameters. Sam Altman said it 'won't crush on benchmarks' against reasoning models, but early vibes praised its creative writing, vision, and medical diagnosis abilities, and it is expected to fuel future o-series reasoners trained on top of it.
Arc Institute and NVIDIA release Evo 2, a 40B state-of-the-art genomics model
Arc Institute and NVIDIA introduced Evo 2, a state-of-the-art genomics model with around 40 billion parameters trained on 9.3 trillion nucleotides. It uses the StripedHyena architecture to process genetic sequences up to 1 million nucleotides long, enabling prediction of genetic mutation effects and even design of entire genomes. The release is fully open: two papers, weights, data, and both training and inference codebases.
Figure announces Helix, an on-robot VLA model enabling robot-to-robot handoffs
Humanoid robot company Figure announced Helix, a Vision-Language-Action (VLA) model with full upper-body control that runs entirely on the robot, pairing a 7 billion parameter VLM for understanding with an 80 million parameter transformer for control. The demo showed two robots collaborating and handing objects to each other from natural language commands, a first that Alex called 'super futuristically cool'.
Microsoft MUSE generates playable game worlds from a single second of video
Microsoft's MUSE can generate minutes of playable gameplay from just a single second of video frames and controller actions, preserving screen elements like health bars and percentages. It is based on the World and Human Action Model (WHAM) architecture, trained on a billion gameplay images from Xbox, with the model released on Hugging Face.
Microsoft ships OmniParser v2 for faster screen parsing in GUI agents
Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.
Perplexity releases R1-1776, a censorship-free DeepSeek R1 fine-tune
Perplexity open-sourced R1-1776, a fine-tuned version of DeepSeek R1 designed to remove Chinese government censorship on topics like Tiananmen Square and Taiwanese independence. Perplexity used human experts to identify around 300 sensitive topics and built a censorship classifier to train the bias out, claiming no significant impact on standard eval performance. The name 1776 is a nod to American independence.
StepFun open-sources Step-Video-T2V, a SOTA 30B text-to-video model
StepFun released Step-Video-T2V (plus a T2V Turbo variant), a 30 billion parameter state-of-the-art text-to-video model under an MIT license. Results impressed especially on text integration, such as rendering 'We will open source' on a scroll as a character unfurls it, marking one of the strongest open-source video drops of the week.
xAI launches Grok 3, claiming SOTA benchmarks and a 1M token context window
xAI dropped Grok 3 on Monday evening, claiming state-of-the-art performance on several benchmarks and a 1 million token context window, with heavy emphasis on agents and future reasoners. The launch was messy, with a bug serving Grok 2 to some users and an eval-methodology spat with OpenAI over best-of-N scores, but vibes shifted positive, with co-hosts calling the base model the best coding model out. It is free for now, 'until their GPUs melt', with no API yet for independent evaluation.
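As a toy illustration of why best-of-N numbers and single-attempt numbers are not directly comparable, the sketch below scores the same made-up results both ways. The data and function names are hypothetical, and nothing here reflects how xAI or OpenAI actually computed their reported scores.

```python
# Toy data: each row is one question, each column is one independently
# sampled attempt (True = correct). Values are invented for illustration.
attempts = [
    [True, False, False, True],
    [False, False, True, False],
    [False, False, False, False],
    [True, True, True, True],
]


def single_attempt_accuracy(rows):
    """Score only the first attempt on each question."""
    return sum(row[0] for row in rows) / len(rows)


def best_of_n_accuracy(rows):
    """Count a question as solved if any of its N attempts is correct."""
    return sum(any(row) for row in rows) / len(rows)


print(single_attempt_accuracy(attempts))  # 0.5
print(best_of_n_accuracy(attempts))       # 0.75
```

The same model looks stronger under best-of-N simply because it gets more tries per question, which is why comparing one lab's best-of-N score against another's single-attempt score can be misleading.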
Microsoft unveils Majorana 1 quantum chip and a new state of matter
Microsoft announced the Majorana 1 quantum chip alongside a claimed new state of matter called topological superconductivity, carving a new path for quantum computing. Alex called the announcement 'absolutely mind blowing' as a potential big deal for the future of computing.
xAI releases Grok voice mode with an 'unhinged' personality option
A week after launching Grok 3 without voice, xAI released Grok's voice mode, including an 'unhinged' personality option that the panel demoed live. It marks xAI's entry into real-time conversational voice AI alongside OpenAI's advanced voice mode.
xAI launches DeepSearch, an agentic research feature with live X access
Alongside Grok 3, xAI launched DeepSearch, an agentic deep-research feature comparable to Perplexity or OpenAI's Deep Research, with a leg up on real-time information thanks to native access to X search. Alex found his initial tests underwhelming and nicknamed it 'Shallow Search' after it spent 34 seconds on a query where OpenAI's Deep Research took 11 minutes and cited 17 sources.
Google's Veo 2 video model becomes available via FAL API
Google DeepMind's Veo 2 video generation model became accessible to developers through FAL's inference API. This was the first broadly available API access to Veo 2, letting builders generate high-quality video from text prompts without waiting on Google's own product surfaces.
DeepSeek open-sources its infra stack during Open Source Week
DeepSeek ran its Open Source Week, releasing a series of production infrastructure repos (including FlashMLA, DeepEP, and DeepGEMM) that power its training and inference stack. The drops gave the open-source community a rare look at the low-level kernels and communication libraries behind DeepSeek's efficient frontier models.
Haize Labs open-sources Verdict, a framework for composing LLM judges
Haize Labs released Verdict, an open-source framework for composing LLM judges that tackles core LLM-as-a-judge problems: self-preference bias, prompt sensitivity, and meta-evaluation. Verdict combines simpler judging primitives into more robust and efficient evaluators ('judge-time compute scaling'), achieving near state-of-the-art results on benchmarks like ExpertQA at a fraction of the cost, fast enough to use as a real-time guardrail. Co-founders Leonard Tang and Nimit joined the show to discuss it.
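The idea of composing simple judges into a more robust one can be sketched in a few lines. This is a minimal illustration only: it does not use Verdict's actual API, and the three judges are hypothetical rule-based stand-ins for what would be LLM calls in practice.

```python
import string
from typing import Callable, List

# A judge takes (question, answer) and returns pass/fail.
Judge = Callable[[str, str], bool]


def non_empty_judge(question: str, answer: str) -> bool:
    """Hypothetical stub: passes any answer that is not blank."""
    return bool(answer.strip())


def on_topic_judge(question: str, answer: str) -> bool:
    """Hypothetical stub: passes if the answer reuses a longer question word."""
    words = [w.strip(string.punctuation).lower() for w in question.split()]
    return any(len(w) > 3 and w in answer.lower() for w in words)


def no_refusal_judge(question: str, answer: str) -> bool:
    """Hypothetical stub: fails answers that say 'I don't know'."""
    return "i don't know" not in answer.lower()


def majority_vote(judges: List[Judge]) -> Judge:
    """Compose several judges into one that passes on a strict majority."""
    def combined(question: str, answer: str) -> bool:
        votes = [judge(question, answer) for judge in judges]
        return sum(votes) * 2 > len(votes)
    return combined


ensemble = majority_vote([non_empty_judge, on_topic_judge, no_refusal_judge])
question = "What is the capital of France?"
print(ensemble(question, "The capital of France is Paris."))  # True
print(ensemble(question, ""))                                 # False
```

Majority voting is just one possible aggregation; the point is that several cheap, imperfect judges can be combined so that no single judge's bias decides the outcome.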
Hao AI Lab's FastVideo makes HunyuanVideo 3x faster with no extra training
Hao AI Lab released FastVideo, a method that makes HunyuanVideo (HY-Video) three times faster with no additional training, using a technique called Sliding Tile Attention that outperforms even FlashAttention for this workload. Faster inference makes open-source video models far more practical, and FastVideo supports HY-Video LoRAs for fine-tuned applications.
Weights & Biases releases an AI agents whitepaper and announces an agents course
Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.
ZeroBench: the 'impossible' benchmark where all top VLMs score zero
A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.
Hugging Face publishes the Ultra Scale Playbook for training on GPU clusters
Hugging Face released the Ultra Scale Playbook, a guide to building and scaling AI models on large GPU clusters. The team ran 4,000 scaling experiments on up to 512 GPUs to distill practical guidance for labs training big models.