NVIDIA ships Nemotron 3.5 ASR, a 600M streaming speech model
NVIDIA released Nemotron 3.5 ASR, a 600M-parameter open multilingual streaming speech-to-text model aimed at voice agents. It supports 40 languages and reportedly delivers 17x more throughput than Parakeet-style baselines at half the size, pushing the latency/accuracy frontier for open voice-agent infrastructure.
Cartesia Ink-2 tops Artificial Analysis's new STT leaderboard
Cartesia released Ink-2, which debuted as the most accurate streaming speech-to-text model with the fastest turnaround on Artificial Analysis's new STT leaderboard. It landed just after recording as part of a double post-show voice-AI drop alongside ElevenLabs Dubbing v2.
ElevenLabs Dubbing v2 preserves your performance across 90+ languages
ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that translates voices across more than 90 languages while preserving cadence, expression, intonation, and even stutters. Alex's live demos, including dubbing Nisten into Hebrew and his own voice into multiple languages, were the brain-melting moment of the episode.
MOSS-TTS-v1.5: open-source 8B TTS with 31 languages
OpenMOSS shipped MOSS-TTS-v1.5, an 8B open-source text-to-speech model supporting 31 languages with pause control, released under Apache 2.0. It is one of the larger fully open TTS models available.
Meta launches Muse Spark voice conversations across its apps and glasses
Meta rolled out Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and Ray-Ban Meta glasses. The feature includes real-time image generation, live camera AI, and instant Reels/maps integration. Alex tested it live and called it surprisingly good, the first big consumer ship from Meta Superintelligence Labs.
Mira Murati's Thinking Machines Lab released Interaction Models, a 276B-parameter MoE (12B active) trained from scratch for native real-time multimodal collaboration. It supports full-duplex audio/video/text with 0.40s turn-taking latency and scores 77.8 on FD-bench v1.5. The demo can react live to events like another person entering the camera frame.
StepAudio 2.5 TTS adds natural-language control of emotion and delivery
StepFun released StepAudio 2.5, a text-to-speech model that lets you steer emotion and delivery with natural-language instructions. It was covered in the show's Voice & Audio segment as the week's notable speech release.
Gradient Bang: first massively multiplayer fully LLM-driven voice game
Kwindla Kramer's 'side project that broke containment' is a fully LLM-driven multiplayer voice-based space game inspired by BBS-era Trade Wars, built on a new Pipecat Sub-Agents library with a class-based event bus that works locally and over the network. A Deepgram plus GPT-4.1 voice agent always responds in under 1.5 seconds while GPT-5.2 medium-thinking task agents do the work, and the React frontend is rendered from LLM-generated JSON as dynamic UI. The team also open-sourced GB Benchmarks for evaluating agent task execution.
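The class-based event bus idea can be illustrated with a small in-process sketch. This is not the Pipecat Sub-Agents API: the EventBus class, the TaskRequested and TaskCompleted events, and both agent functions are hypothetical names chosen for illustration, and the network transport mentioned above is left out. It only shows the general pattern of a fast voice-facing agent handing work to a slower task agent through typed events.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass
class TaskRequested:
    """Hypothetical event: the voice agent asks for work to be done."""
    description: str


@dataclass
class TaskCompleted:
    """Hypothetical event: a task agent reports its result."""
    description: str
    result: str


class EventBus:
    """Minimal local bus that routes each event to handlers by its class."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[type, List[Callable]] = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event: object) -> None:
        # Copy the handler list so handlers may subscribe while we iterate.
        for handler in list(self._handlers[type(event)]):
            handler(event)


bus = EventBus()
spoken: List[str] = []


def task_agent(event: TaskRequested) -> None:
    # Stand-in for a slower "thinking" model doing the actual work.
    bus.publish(TaskCompleted(event.description, result="done"))


def voice_agent(event: TaskCompleted) -> None:
    # Stand-in for the low-latency voice agent reporting back to the player.
    spoken.append(f"Finished: {event.description} ({event.result})")


bus.subscribe(TaskRequested, task_agent)
bus.subscribe(TaskCompleted, voice_agent)
bus.publish(TaskRequested("plot a course to sector 7"))
print(spoken)
```

Keying handlers on the event class keeps the two agents decoupled: neither function calls the other directly, so a networked bus could replace this local one without changing the agents.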
Gemini 3.1 Flash TTS tops TTS Arena at 1,211 Elo with 70+ languages
Google released Gemini 3.1 Flash TTS, which leads TTS Arena at 1,211 Elo, supports 70+ languages with inline audio tags, and costs about $0.03 per 60 seconds, roughly 5x cheaper than ElevenLabs. Kwindla noted it is fully promptable like an LLM rather than limited to fixed tags, but its ~3 second time-to-first-token makes it batch-only for now rather than usable in live conversational pipelines.
Fish Audio launches speech-to-text with automatic emotion tagging
Fish Audio released a speech-to-text product with automatic emotion tagging that feeds directly into its S2 TTS pipeline. The panel saw it as another sign that voice tooling is rapidly commoditizing and challenging incumbent speech providers.
Microsoft MAI ships MAI-Transcribe-1, ranked #1 in transcription
Microsoft's MAI lab released MAI-Transcribe-1, an in-house speech transcription model that debuted at #1 in transcription quality. It is part of a three-model drop showing Microsoft expanding its first-party model stack beyond its OpenAI dependence.
Microsoft MAI debuts MAI-Voice-1 expressive voice model
MAI-Voice-1 is Microsoft's expressive voice model, the third piece of the MAI in-house model drop alongside transcription and image generation. The panel discussed how Microsoft's first-party voice stack compares to specialist voice providers.
Irodori-TTS-500M: open Japanese TTS with emoji emotion control
Irodori-TTS-500M is a 500M-parameter open-weights Japanese text-to-speech model released on Hugging Face, notable for controlling emotional delivery through emojis in the input text. It landed as part of the week's wave of voice and audio releases.
Cohere Transcribe: open-source 2B ASR tops Open ASR Leaderboard at 5.42% WER
Cohere entered the ASR game with Transcribe, a 2-billion-parameter Apache 2.0 speech recognition model that immediately took the number-one spot on Hugging Face's Open ASR Leaderboard with a 5.42% word error rate versus Whisper Large v3's 7.44%. It wins 61% of human evaluations on average and 64% head-to-head against Whisper, making it a credible local-inference Whisper replacement for regulated industries.
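For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration, not the leaderboard's scoring code: it assumes whitespace tokenization and skips the text normalization a real evaluation would apply. It also works out the relative improvement implied by the two figures quoted above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline


# One substituted word out of four reference words.
example_wer = word_error_rate("turn the lights off", "turn the light off")
print(f"example WER: {example_wer:.2f}")

# Figures quoted above: Whisper Large v3 at 7.44%, Cohere Transcribe at 5.42%.
gain = relative_reduction(7.44, 5.42)
print(f"relative WER reduction: {gain:.1%}")
```

On the quoted numbers, the gap works out to a relative error reduction of about 27%.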
Google drops Gemini 3.1 Flash Live: Gemini can see, hear, and talk to you
Google released Gemini 3.1 Flash Live, a realtime multimodal model that handles voice and vision interaction in a single model path instead of stitched pipelines. The panel framed it as a major upgrade for end-to-end voice and vision agents, with AI Studio and API availability as the immediate way to experiment.
Mistral drops Voxtral TTS, a 3B open-weight text-to-speech model
Mistral released Voxtral TTS, its first text-to-speech model, as breaking news during the live show: 3 billion parameters, open weights, with emotion controls for neutral, happy, and frustrated voices. Mistral claims it beats ElevenLabs Flash v2.5 in human preference tests with a 58% win rate on flagship voices and 68% on zero-shot voice cloning, though Alex's live test found it decent rather than stunning.
xAI launches Grok TTS API with 5 voices and WebSocket streaming
xAI launched a Grok Text-to-Speech API with five voices, expressive controls, and WebSocket streaming, priced cheaper than ElevenLabs. It adds another option to a suddenly competitive voice AI market alongside open-source entrants like Fish Audio S2.
Fish Audio S2 is a fully open-source TTS model with inline emotion control via free-text bracket tags like gasp, laughter, and long pause. Alex demoed it live with an OpenClaw skill that let his 5-year-old talk to a voice clone of 'Rocky' from Project Hail Mary; Wolfram called it 'ElevenLabs V3 for free.'
OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5
OpenAI shipped gpt-audio-1.5 and gpt-realtime-1.5, updated audio and realtime voice models available through its platform. The release was covered in the week's voice and audio roundup.
Mistral's Voxtral Transcribe 2 dethrones Whisper as SOTA transcription
Mistral AI launched Voxtral Transcribe 2, state-of-the-art speech-to-text with sub-200ms latency, native diarization support, and open weights under Apache 2.0. The panel called it the first model to dethrone Whisper after roughly three years, and Alex used it to transcribe this very episode.
MiniCPM-o 4.5: first open-source full-duplex omni model
OpenBMB released MiniCPM-o 4.5, the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously. It can listen while speaking and even interrupt the user, bringing real-time conversational behavior to open weights.
Qwen3-TTS: open-source TTS family with 97ms latency and voice cloning
Alibaba's Qwen team released Qwen3-TTS, a full open-source text-to-speech family under Apache 2 that dropped 30 minutes before the show. It spans 5 models from 0.6B to 1.7B parameters, with 97ms latency, voice cloning from just 3 seconds of audio, voice description prompting, and 10-language support.
FlashLabs Chroma 1.0: open-source real-time speech-to-speech under 150ms
FlashLabs released Chroma 1.0, billed as the world's first open-source end-to-end real-time speech-to-speech model with voice cloning under 150ms latency. The 4B parameter model is built on Qwen 2.5 Omni and released under Apache 2; its live demo with RAG and document upload impressed the whole panel.
Inworld TTS-1.5 claims #1 TTS ranking at half a cent per minute
Inworld AI launched TTS-1.5, a closed-source text-to-speech model claiming the #1 ranking with sub-250ms latency. Its headline is price: roughly $5 per million characters (about half a cent per minute) versus ElevenLabs' $120 per million characters.
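The per-minute figure can be reproduced with a quick calculation. The sketch assumes roughly 1,000 characters of text per minute of speech; that rate is an assumption chosen because it matches the "half a cent" figure, and it is not stated by either vendor.

```python
# Assumption: about 1,000 characters of input text per minute of generated speech.
CHARS_PER_MINUTE = 1_000


def cost_per_minute(usd_per_million_chars: float,
                    chars_per_minute: int = CHARS_PER_MINUTE) -> float:
    """Convert a per-million-character price into a per-minute price."""
    return usd_per_million_chars * chars_per_minute / 1_000_000


inworld = cost_per_minute(5.0)       # $5 per million characters
elevenlabs = cost_per_minute(120.0)  # $120 per million characters

print(f"Inworld TTS-1.5: ${inworld:.3f}/min")
print(f"ElevenLabs:      ${elevenlabs:.3f}/min")
print(f"Ratio: {elevenlabs / inworld:.0f}x")
```

Under that assumption, the quoted prices put Inworld at about $0.005 per minute against about $0.12 per minute for ElevenLabs, a 24x gap that holds regardless of the speaking rate chosen.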
KAIST published Avatar Forcing, a framework for real-time interactive talking-head avatars with approximately 500ms latency. The paper targets responsive, live avatar interaction rather than offline video generation.
Liquid AI LFM 2.5: 1B on-device family with end-to-end audio
Liquid AI released LFM 2.5, a family of ~1.2B parameter on-device models spanning text, vision, and audio, announced at CES alongside AMD's Lisa Su. The models hit 239 tokens/sec on AMD CPU and 100 tokens/sec on iPhone 16 Pro Max, and include a revolutionary end-to-end audio model that skips the traditional ASR-LLM-TTS pipeline entirely, running in as little as 8GB of RAM.
Nemotron Speech ASR: 600M streaming model with 24ms latency
NVIDIA released Nemotron Speech ASR, a 600M parameter open source streaming speech recognition model with 24ms median latency and support for 900 concurrent streams on a single H100. Kwindla Hultman Kramer of Daily.co demoed sub-500ms voice-to-voice latency using a three-model pipeline of Nemotron ASR, Nemotron Nano LLM, and Magpie TTS.
Qwen launches speech-to-speech model with emotion handling
Qwen released a speech-to-speech model in March with internal emotion handling, joining the wave of voice-native models. It was part of the Qwen team's relentless 2025 release cadence across modalities.
Kwindla's Daily.co shipped smart turn detection during Q2, an open model that helps voice agents know when a speaker has actually finished talking. It landed in the quarter when voice agents first got attention outside the builder bubble.
Google ships a Gemini TTS model in its December run
As part of Google's December release wave, a Gemini TTS model shipped alongside realtime model updates. It rounded out Google's full-stack voice story heading into 2026.
Kokoro TTS: 82M-param Apache 2 model hits #1 on TTS Arena
Kokoro, a tiny 82M parameter text-to-speech model, went viral in January after hitting #1 on TTS Arena. Released under Apache 2.0 and small enough to run in the browser, it showed that high-quality speech synthesis no longer required huge models.
OpenAI ships two new voice models derived from GPT Realtime
In March, OpenAI released two voice models derived from its GPT Realtime speech-to-speech stack. They were part of a wave that pushed voice agents toward the mainstream over the course of 2025.
Resemble AI open-sources Chatterbox Turbo, a 350M MIT-licensed TTS
Resemble AI released Chatterbox Turbo, an MIT-licensed 350M-parameter open text-to-speech model. The company claims it beats ElevenLabs in blind listening tests, pushing high-quality TTS into fully open, accessible territory.
xAI Grok Voice Agent API ships at $0.05/min flat rate, powers Tesla
xAI launched the Grok Voice Agent API with flat-rate pricing of $0.05 per minute and integration into Tesla vehicles. xAI claims the #1 spot on Big Bench Audio at 92.3%, tightening competition in the rapidly commoditizing real-time voice stack.
Amazon announces Nova 2 family: Lite, Pro, Sonic, and Omni
Amazon rolled out the Nova 2 model suite spanning text, speech, and multimodal stacks with Lite, Pro, Sonic, and Omni variants. The launch came with major benchmark jumps over the first Nova generation and includes a fast, cost-effective reasoning model in Nova 2 Lite.
Microsoft shares VibeVoice-Realtime-0.5B with ~300ms latency TTS
Microsoft published VibeVoice-Realtime-0.5B on Hugging Face, a small realtime text-to-speech model claiming roughly 300ms latency. The show framed it as more evidence that sub-second audio response is becoming table stakes for production voice agents.
OpenAI integrates ChatGPT Voice Mode directly into chats
OpenAI integrated ChatGPT's Voice Mode directly into the chat interface instead of a separate full-screen experience. Users can now talk to ChatGPT while seeing transcripts and visual responses inline in the conversation.
ElevenLabs launches Scribe v2 Realtime speech-to-text with 150ms latency
ElevenLabs launched Scribe v2 Realtime, a streaming speech-to-text model with roughly 150ms latency and support for over 90 languages, demoed live by Paul Asjes. It auto-switches languages mid-stream and handles code, initialisms, and technical terms with context-aware transcription, outpacing Whisper on speed and accuracy.
Google rolled out an upgrade to Gemini Live's voice capabilities, making conversations more natural. It was covered in the big-companies roundup alongside GPT-5.1 and Grok 4 Fast as the voice interface race heats up.
Meta releases Omnilingual ASR covering 1,600+ languages
Meta released Omnilingual ASR, an Apache 2.0 speech recognition family supporting over 1,600 languages, including 500+ never before served by any ASR system, with character error rate under 10% for 78 languages. The release includes an open corpus of 500k+ rows of transcribed audio, and the 1B model was praised as a near drop-in state-of-the-art replacement on Hugging Face.
Inworld TTS takes the #1 spot on the Artificial Analysis speech benchmark
Inworld released a new version of its TTS model that claimed the #1 position on the Artificial Analysis text-to-speech benchmark. It featured in the episode's voice segment as evidence that commercial TTS quality keeps climbing fast.
Maya-1 open-source voice generation model released
Maya-1 is a new open-source voice generation model that was demoed on the show as part of the week's voice AI wave. The panel highlighted how quickly open voice model quality is improving, with expressive output that holds up against commercial systems.
Sandbar launches Stream voice assistant and Stream Ring wearable
Sandbar launched Stream, a voice-first personal assistant, alongside Stream Ring, a wearable described as a 'mouse for voice' that is now available for preorder. The pairing pushes always-available voice interaction into dedicated hardware rather than the phone.
Cartesia Sonic 3: real-time TTS with emotion and laughter, plus $100M raise
Cartesia launched Sonic 3, a real-time text-to-speech model that adds expressive emotion and natural laughter, announced alongside a $100M funding round. Co-founder Arjun Desai joined the show to break down the voice stack and why state-space-model approaches enable this latency and expressiveness.
MiniMax Speech 2.6: ultra-human voice AI with sub-250ms latency
MiniMax released Speech 2.6, a voice model targeting ultra-human quality with end-to-end latency under 250ms, available through the MiniMax platform API. It slots into the episode's voice arms race alongside Cartesia's Sonic 3.
Decart ships real-time lip-sync API for live AI avatars
Decart AI released a real-time lip-sync API that modifies an avatar's video frames to match generated speech on the fly. Kwindla Kramer broke down the pipeline on the show: WebRTC audio capture, Whisper transcription, an LLM response, ElevenLabs voice generation, then Decart's model syncing the avatar's lips, all at sub-two-second latency, a key step toward interactive, believable AI characters.
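One way to reason about a chained pipeline like this is to time each stage and compare the total against the latency budget. The sketch below does that with stub stages; the stage names follow the description above, but every stage body is a placeholder that only sleeps briefly, so the timings it prints say nothing about the real services.

```python
import time
from typing import Callable, Dict, List, Tuple

BUDGET_SECONDS = 2.0  # the "sub-two-second" target described above

Stage = Callable[[str], str]


def make_stub(delay: float) -> Stage:
    """Build a placeholder stage that waits `delay` seconds and passes data on."""
    def stage(payload: str) -> str:
        time.sleep(delay)
        return payload
    return stage


# Placeholder delays only; real stages would call the actual services.
STAGES: List[Tuple[str, Stage]] = [
    ("webrtc_capture", make_stub(0.01)),
    ("transcription", make_stub(0.01)),
    ("llm_response", make_stub(0.01)),
    ("voice_generation", make_stub(0.01)),
    ("lip_sync", make_stub(0.01)),
]


def run_pipeline(payload: str,
                 stages: List[Tuple[str, Stage]]) -> Tuple[str, Dict[str, float]]:
    """Run stages in order and record wall-clock seconds spent in each."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        timings[name] = time.perf_counter() - start
    return payload, timings


result, timings = run_pipeline("hello", STAGES)
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name:18s} {seconds * 1000:7.1f} ms")
print(f"total {total * 1000:.1f} ms, within budget: {total < BUDGET_SECONDS}")
```

Because the stages run strictly in sequence here, the total is the sum of the parts; a production pipeline would typically stream between stages so that later ones can start before earlier ones finish.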
Krea open-sources a 14B real-time video generation model
Krea AI open-sourced a 14-billion-parameter real-time video model, with weights on Hugging Face. It joins the week's clear trend of generative video racing toward live, interactive experiences rather than offline rendering.
Microsoft adds agentic powers and voice to Copilot Mode in Edge
Microsoft answered Atlas with agentic enhancements to Copilot Mode in Edge, including a voice mode that can see and discuss the current page, plus broader Copilot updates (and Clippy back as an easter egg via the Mico avatar). In Alex's hands-on testing the agentic features did not actually work, so real-world parity with Atlas and Comet is unproven.
Microsoft makes every Windows 11 PC an AI PC with Copilot voice input
Microsoft announced that every Windows 11 machine becomes an 'AI PC,' adding 'Hey Copilot' voice input and deeper agentic Copilot integration at the OS level. The panel discussed it as a sign of AI assistants moving into the default computing experience.
Qwen3-Omni ships open-weights any-to-any audio, vision, and text
Alongside Qwen3-VL, Alibaba released Qwen3-Omni, an end-to-end omni-modal open-weights model that takes text, image, audio, and video input and can respond with streaming speech. The show treated it as direct evidence of how fast open multimodal systems are improving, with weights on Hugging Face, a GitHub repo, demos, and availability in Qwen Chat and the Model Studio API.
Qwen3-TTS-Flash multilingual text-to-speech lands via Alibaba's API
Part of the same Qwen release streak, Qwen3-TTS-Flash is a low-latency multilingual text-to-speech model with multiple voices and dialect support, offered through Alibaba Cloud Model Studio's API rather than as open weights. It fed into the episode's closing audio-demo pileup, where voice launches were treated as product proof points.
Huxe personal audio briefing app opens to everyone
Huxe, the personal audio app from former Google NotebookLM team members that generates proactive, personalized audio briefings, just opened to the public. It came up alongside ChatGPT Pulse as another take on proactive, ambient AI products.
Reka Speech: high-throughput multilingual ASR and speech translation
Reka AI announced Reka Speech, a high-throughput multilingual speech recognition and speech translation model with timestamps, aimed at batch-scale transcription pipelines. It positions Reka in the production ASR market against incumbent transcription APIs.
OpenAI ships gpt-realtime and takes the Realtime API to GA
OpenAI shipped the gpt-realtime speech-to-speech model and moved the Realtime API to general availability. The GA release adds remote MCP tool support, image input, and SIP phone calling, making it a full production stack for voice agents and tying into the episode's voice-agents discussion with Kwindla Kramer.
Alibaba launches Qwen-TTS with human-level bilingual naturalness
The Qwen team released Qwen-TTS, a bilingual Chinese/English text-to-speech model claiming human-level naturalness, available via API with a Hugging Face demo space. It was the second voice release of the week alongside Kyutai TTS.
Kyutai releases open low-latency TTS for English and French
Kyutai Labs released an open 1.6B-parameter text-to-speech model with low latency and high voice similarity in English and French. It was one of two TTS launches closing out the episode, underscoring how quickly multimodal product quality is rising.
Anthropic releases voice mode on Claude mobile apps
Anthropic shipped a voice mode on mobile, bringing conversational voice AI to the Claude apps. It was another entry in the week's theme of every major lab giving its models a voice.
Kyutai launches Unmute.sh, a low-latency voice wrapper for any LLM
Kyutai (the lab behind Moshi) launched Unmute.sh, a modular wrapper that adds voice to any text LLM with under 300ms latency and semantic VAD that knows a thinking pause from a breath. It preserves the underlying text model's capabilities while adding natural voice interaction, and is slated to be open-sourced.
OpenAI updated ChatGPT's Advanced Voice Mode with new capabilities, including the ability to sing. The update came in a week where voice interfaces kept converging on more natural, expressive interaction.
Resemble AI open-sources Chatterbox voice cloning with emotion control
Resemble AI released Chatterbox, an open-source voice cloning model with emotion control. Weights and code are public on GitHub and Hugging Face, bringing controllable, expressive voice cloning to the open ecosystem.
LTX distilled model enables near real-time video generation
Lightricks shared a distilled version of its LTX video model that generates video at near real-time speeds. It was highlighted in the vision and video segment as a notable speed milestone for video generation.
MiniMax Speech tech report published, called the best TTS out there
MiniMax (Hailuo) published the technical report for MiniMax Speech, its text-to-speech system, which the show described as the best TTS out there. The report details the architecture behind the system on arXiv.
NotebookLM AI Audio Overviews go multilingual with 50+ languages
Google expanded NotebookLM's AI audio overviews (the podcast-style summaries) to support more than 50 languages, taking the feature global beyond its English-only debut.
Pipecat releases Smart-Turn, an open source semantic VAD model
The Pipecat team (from Daily) released Smart-Turn, an open source semantic voice activity detection model that understands when a speaker has actually finished their turn rather than just detecting silence. Kwindla Kramer joined the show to break down how semantic VAD makes voice agent conversations feel far more natural, with a community training effort at turn-training.pipecat.ai.
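The difference between silence-based and semantic end-of-turn detection can be shown with a toy comparison. This is not how Smart-Turn works internally: the 500 ms threshold, the filler-word list, and the text-based looks_complete heuristic are all hypothetical stand-ins for a trained model, used only to show why a pause alone is a weak signal.

```python
SILENCE_THRESHOLD_MS = 500  # hypothetical endpointing threshold

# Hypothetical cue: an utterance ending on one of these words is probably unfinished.
TRAILING_FILLERS = {"and", "but", "so", "um", "uh", "because", "the", "to"}


def silence_only_end_of_turn(silence_ms: int) -> bool:
    """Classic endpointing: the turn ends once the pause is long enough."""
    return silence_ms >= SILENCE_THRESHOLD_MS


def looks_complete(transcript: str) -> bool:
    """Toy stand-in for a semantic model: reject utterances ending on a filler."""
    words = [w.strip(".,?!") for w in transcript.lower().split()]
    words = [w for w in words if w]
    return bool(words) and words[-1] not in TRAILING_FILLERS


def semantic_end_of_turn(silence_ms: int, transcript: str) -> bool:
    """End the turn only when the pause is long enough AND the utterance looks done."""
    return silence_only_end_of_turn(silence_ms) and looks_complete(transcript)


utterances = [
    (600, "I want to book a flight to"),  # long pause, but mid-thought
    (600, "Book me a flight to Paris."),  # long pause, complete request
]
for silence_ms, transcript in utterances:
    print(
        f"{transcript!r}: silence-only={silence_only_end_of_turn(silence_ms)}, "
        f"semantic={semantic_end_of_turn(silence_ms, transcript)}"
    )
```

In the first utterance the silence-only rule would cut in on a speaker who is still thinking, while the combined rule keeps waiting; that is the behavior a semantic model is meant to learn from data rather than from a word list.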
Nari Labs' Dia: a wild 1.6B open source TTS model that blew up Twitter
Nari Labs released Dia, a 1.6B parameter open-weights text-to-speech model that absolutely blew up Twitter with its expressive, emotional dialogue generation, including laughs, coughs, and multi-speaker conversations. Built by a tiny team, it punches far above its weight against commercial TTS systems and supports voice cloning, with demos available on Fal.ai.
Amazon unveils Nova Sonic, a speech-to-speech foundation model
Amazon announced Nova Sonic, a foundational speech-to-speech model that unifies speech understanding and generation for real-time, natural-sounding voice conversations. It is available through Amazon Bedrock as part of the Nova family.
Gladia launched Solaria, a new speech-to-text model offered through its transcription platform. It arrived in a busy week for voice AI alongside Hailuo's Speech-02 TTS.
Hailuo Speech-02 TTS API: potentially SOTA emotional voice cloning
Hailuo (MiniMax) released the Speech-02 TTS API, which Alex called potentially state of the art for emotional control and voice cloning quality. It produces nuanced, realistic synthetic voices and was the standout voice release of the week.
OpenAI added a new "Monday" voice to ChatGPT's voice mode, an emo-flavored persona released around April 1st. It rounds out a week of OpenAI shipping across models, evals, and product.
Qwen launches Omni 7B: sees, hears, reads, and talks back
Qwen released Qwen2.5-Omni-7B, an open-weights omni-modal model that perceives text, images, audio, and video, and generates both text and speech. It packs end-to-end multimodal perception and spoken output into a 7B parameter model available on Hugging Face.
Prince Canuma releases MLX-Audio v0.0.3 for speech on Apple Silicon
Prince Canuma, creator of MLX-VLM, FastMLX, and MLX Embeddings, released MLX-Audio v0.0.3, an open-source library bringing speech and audio models to Apple Silicon via MLX. It makes powerful open-source TTS and audio models accessible locally on Mac hardware.
OpenAI updates ChatGPT advanced voice mode with semantic VAD
Alongside the image generation launch, OpenAI quietly updated ChatGPT's advanced voice mode with semantic voice activity detection. The model now understands when you have actually finished speaking rather than cutting in on pauses, leading to much more natural conversation flow.
Canopy Labs drops Orpheus 3B natural-sounding speech model
Canopy Labs released Orpheus, an open speech language model that produces natural, human-sounding speech, headlined by a 3B model with smaller variants (1B, 500M, 150M) in the family. Weights are on Hugging Face with a Colab for trying it out, and the release was discussed on the show with Daily.co CEO Kwindla Kramer in the voice AI segment.
NVIDIA Canary Flash: Apache 2 speech recognition and translation
NVIDIA released Canary 1B Flash and 180M Flash, Apache 2.0 licensed speech recognition and translation models built as Llama finetunes. The permissive license makes them freely usable for commercial ASR and translation workloads.
OpenAI launches steerable voice model and two new transcription models
OpenAI launched a new emotionally steerable text-to-speech voice model plus two new transcription models, and the show turned the announcement into a live watch party. The TTS model can be instructed how to speak (tone, emotion, character) and was demoed at openai.fm; the models are available through the API for voice agents.
Sesame's ultra-realistic conversational voice demo takes the world by storm
Sesame released a demo of its conversational speech model featuring the Maya voice, and its naturalness, with human-like pauses, laughs, and interruptions, went viral across the AI community. Alex recorded a reaction conversation with Maya showcasing how lifelike the voice model is.
xAI made Grok's voice mode available to free users, removing the paid-tier requirement. The expansion brings conversational voice AI to everyone on the Grok app.
Hume AI launches Octave, a TTS model that understands what it says
Hume AI released Octave, which it calls the first text-to-speech model that understands what it's saying, adjusting emotion, emphasis, and delivery based on the meaning of the text. It fits the episode's humanlike AI voices theme, letting users direct performances with natural-language acting instructions.
A week after launching Grok 3 without voice, xAI released Grok's voice mode, including an 'unhinged' personality option that the panel demoed live. It marks xAI's entry into real-time conversational voice AI alongside OpenAI's advanced voice mode.
YuE 7B: open-source Suno-style music generation model
The Multimodal Art Projection (M-A-P) team released YuE, a 7B open-source music generation model dubbed the 'open Suno' on the show, capable of generating full songs with vocals from lyrics. Weights are on Hugging Face with code on GitHub and a hosted demo on fal.ai.
Riffusion launches Fuzz music generation, free for now
Riffusion (written as 'Refusion' in the show notes) launched Fuzz, a hosted AI music generation product that is free to use during its initial period. It was highlighted in the voice and audio segment alongside YuE as part of a wave of new AI music tools.