Voice & Speech

Text-to-speech, speech recognition, real-time voice agents, transcription, and voice cloning. 83 releases covered on the show.

June 2026

NVIDIA
New Models, Open weights

Nemotron 3.5 ASR

NVIDIA ships Nemotron 3.5 ASR, a 600M streaming speech model

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter open multilingual streaming speech-to-text model aimed at voice agents. It supports 40 languages and reportedly delivers 17x more throughput than Parakeet-style baselines at half the size, pushing the latency/accuracy frontier for open voice-agent infrastructure.

17x Nemotron ASR throughput

May 2026

Cartesia
New Models

Ink-2

Cartesia Ink-2 tops Artificial Analysis's new STT leaderboard

Cartesia released Ink-2, which debuted as the most accurate streaming speech-to-text model with the fastest turnaround on Artificial Analysis's new STT leaderboard. It landed just after recording as part of a double post-show voice-AI drop alongside ElevenLabs Dubbing v2.

ElevenLabs
New Models

Dubbing v2

ElevenLabs Dubbing v2 preserves your performance across 90+ languages

ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that translates voices across more than 90 languages while preserving cadence, expression, intonation, and even stutters. Alex's live demos, including dubbing Nisten into Hebrew and his own voice into multiple languages, were the brain-melting moment of the episode.

Meta AI
Major Features & Updates

Muse Spark voice conversations

Meta launches Muse Spark voice conversations across its apps and glasses

Meta rolled out Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and Ray-Ban Meta glasses. The feature includes real-time image generation, live camera AI, and instant Reels/maps integration. Alex tested it live and called it surprisingly good, the first big consumer ship from Meta Superintelligence Labs.

Thinking Machines Lab
New Models

Interaction Models

Thinking Machines Lab drops Interaction Models: real-time multimodal 276B MoE

Mira Murati's Thinking Machines Lab released Interaction Models, a 276B-parameter MoE (12B active) trained from scratch for native real-time multimodal collaboration. It supports full-duplex audio/video/text with 0.40s turn-taking latency and scores 77.8 on FD-bench v1.5. The demo can react live to events like another person entering the camera frame.

276B MoE parameters
12B active parameters

April 2026

StepFun
New Models

StepAudio 2.5

StepAudio 2.5 TTS adds natural-language control of emotion and delivery

StepFun released StepAudio 2.5, a text-to-speech model that lets you steer emotion and delivery with natural-language instructions. It was covered in the show's Voice & Audio segment as the week's notable speech release.

Daily (Pipecat)
Products & Apps, Open weights

Gradient Bang

Gradient Bang: first massively multiplayer fully LLM-driven voice game

Kwindla Kramer's 'side project that broke containment' is a fully LLM-driven multiplayer voice-based space game inspired by BBS-era Trade Wars, built on a new Pipecat Sub-Agents library with a class-based event bus that works locally and over the network. A Deepgram plus GPT-4.1 voice agent always responds in under 1.5 seconds while GPT-5.2 medium-thinking task agents do the work, and the React frontend is rendered from LLM-generated JSON as dynamic UI. The team also open-sourced GB Benchmarks for evaluating agent task execution.

Google DeepMind
New Models

Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS tops TTS Arena at 1,211 Elo with 70+ languages

Google released Gemini 3.1 Flash TTS, which leads TTS Arena at 1,211 Elo, supports 70+ languages with inline audio tags, and costs about $0.03 per minute, roughly 5x cheaper than ElevenLabs. Kwindla noted it is fully promptable like an LLM rather than limited to fixed tags, but its ~3-second time-to-first-token makes it batch-only for now rather than usable in live conversational pipelines.
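
The pricing comparison above comes down to simple arithmetic, sketched here in Python. The function names are hypothetical, the $0.03 figure is the approximate per-minute price quoted above, and the "implied" comparison price is only what the "roughly 5x cheaper" ratio suggests, not a published ElevenLabs price.

```python
def tts_cost_usd(audio_seconds: float, price_per_minute_usd: float = 0.03) -> float:
    """Cost of synthesizing `audio_seconds` of speech at a flat per-minute price."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 60.0 * price_per_minute_usd


def implied_comparison_price(price_per_minute_usd: float, times_cheaper: float) -> float:
    """Per-minute price of the pricier option implied by an 'Nx cheaper' claim."""
    return price_per_minute_usd * times_cheaper


if __name__ == "__main__":
    # Both inputs are the approximate figures quoted in the entry above.
    hour_cost = tts_cost_usd(60 * 60)
    implied = implied_comparison_price(0.03, 5)
    print(f"One hour of audio: ${hour_cost:.2f}")
    print(f"Implied comparison price: ${implied:.2f} per minute")
```

At the quoted rate, an hour of generated audio comes to under two dollars, which is why the blurb frames price rather than latency as the model's strength.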

1,211 TTS Arena Elo

Microsoft
New Models

MAI-Voice-1

Microsoft MAI debuts MAI-Voice-1 expressive voice model

MAI-Voice-1 is Microsoft's expressive voice model, the third piece of the MAI in-house model drop alongside transcription and image generation. The panel discussed how Microsoft's first-party voice stack compares to specialist voice providers.

March 2026

Cohere
New Models, Open weights

Cohere Transcribe

Cohere Transcribe: open-source 2B ASR tops Open ASR Leaderboard at 5.42% WER

Cohere entered the ASR game with Transcribe, a 2-billion-parameter Apache 2.0 speech recognition model that immediately took the number-one spot on Hugging Face's Open ASR Leaderboard with a 5.42% word error rate versus Whisper Large v3's 7.44%. It wins 61% of human evaluations on average and 64% head-to-head against Whisper, making it a credible local-inference Whisper replacement for regulated industries.
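
For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal Python version with hypothetical function names. It lowercases and splits on whitespace only, whereas leaderboards typically apply heavier text normalization, so it will not reproduce the published scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic program).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional improvement of `new` over `baseline` (lower is better)."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(f"{relative_reduction(7.44, 5.42):.1%}")
```

Applied to the two scores quoted above, `relative_reduction(7.44, 5.42)` works out to roughly a 27% relative drop in word errors versus Whisper Large v3.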

2B Cohere Transcribe ASR size
5.42% Word error rate on Open ASR Leaderboard

Google DeepMind
New Models

Gemini 3.1 Flash Live

Google drops Gemini 3.1 Flash Live: Gemini can see, hear, and talk to you

Google released Gemini 3.1 Flash Live, a realtime multimodal model that handles voice and vision interaction in a single model path instead of stitched pipelines. The panel framed it as a major upgrade for end-to-end voice and vision agents, with AI Studio and API availability as the immediate way to experiment.

Mistral AI
New Models, Open weights

Voxtral TTS

Mistral drops Voxtral TTS, a 3B open-weight text-to-speech model

Mistral released Voxtral TTS, its first text-to-speech model, as breaking news during the live show: 3 billion parameters, open weights, with emotion controls for neutral, happy, and frustrated voices. Mistral claims it beats ElevenLabs Flash v2.5 in human preference tests with a 58% win rate on flagship voices and 68% on zero-shot voice cloning, though Alex's live test found it decent rather than stunning.

3B Mistral Voxtral TTS size

Fish Audio
New Models, Open weights

Fish Audio S2

Fish Audio S2 open TTS hits sub-150ms latency

Fish Audio S2 is a fully open-source TTS model with inline emotion control via free-text bracket tags like gasp, laughter, and long pause. Alex demoed it live with an OpenClaw skill that let his 5-year-old talk to a voice clone of 'Rocky' from Project Hail Mary; Wolfram called it 'ElevenLabs V3 for free.'

<150ms Fish Audio S2 TTS latency

February 2026

Mistral AI
New Models, Open weights

Voxtral Transcribe 2

Mistral's Voxtral Transcribe 2 dethrones Whisper as SOTA transcription

Mistral AI launched Voxtral Transcribe 2, state-of-the-art speech-to-text with sub-200ms latency, native diarization support, and open weights under Apache 2.0. The panel called it the first model to dethrone Whisper after roughly three years, and Alex used it to transcribe this very episode.

January 2026

Alibaba (Qwen)
New Models, Open weights

Qwen3-TTS

Qwen3-TTS: open-source TTS family with 97ms latency and voice cloning

Alibaba's Qwen team released Qwen3-TTS, a full open-source text-to-speech family under Apache 2.0 that dropped 30 minutes before the show. It spans 5 models from 0.6B to 1.7B parameters, with 97ms latency, voice cloning from just 3 seconds of audio, voice description prompting, and 10-language support.

97ms Latency

FlashLabs
New Models, Open weights

Chroma 1.0

FlashLabs Chroma 1.0: open-source real-time speech-to-speech under 150ms

FlashLabs released Chroma 1.0, billed as the world's first open-source end-to-end real-time speech-to-speech model with voice cloning under 150ms latency. The 4B-parameter model is built on Qwen 2.5 Omni and released under Apache 2.0; its live demo with RAG and document upload impressed the whole panel.

Liquid AI
New Models, Open weights

LFM 2.5

Liquid AI LFM 2.5: 1B on-device family with end-to-end audio

Liquid AI released LFM 2.5, a family of ~1.2B-parameter on-device models spanning text, vision, and audio, announced at CES alongside AMD's Lisa Su. The models hit 239 tokens/sec on an AMD CPU and 100 tokens/sec on an iPhone 16 Pro Max, and include an end-to-end audio model that skips the traditional ASR-LLM-TTS pipeline entirely, running in as little as 8GB of RAM.

NVIDIA
New Models, Open weights

Nemotron Speech ASR

Nemotron Speech ASR: 600M streaming model with 24ms latency

NVIDIA released Nemotron Speech ASR, a 600M-parameter open-source streaming speech recognition model with 24ms median latency and support for 900 concurrent streams on a single H100. Kwindla Hultman Kramer of Daily.co demoed sub-500ms voice-to-voice latency using a three-model pipeline of Nemotron ASR, Nemotron Nano LLM, and Magpie TTS.

24ms Nemotron Speech latency

December 2025

Alibaba (Qwen)
New Models, Open weights

Qwen speech-to-speech model

Qwen launches speech-to-speech model with emotion handling

In March 2025, Qwen released a speech-to-speech model with internal emotion handling, joining the wave of voice-native models. It was part of the Qwen team's relentless 2025 release cadence across modalities.

Daily (Pipecat)
New Models, Open weights

Smart Turn Detection

Daily ships smart turn detection for voice agents

Kwindla's Daily.co shipped smart turn detection in Q2 2025, an open model that helps voice agents know when a speaker has actually finished talking. It landed in the quarter when voice agents first got attention outside the builder bubble.

Google DeepMind
New Models

Gemini TTS

Google ships a Gemini TTS model in its December run

As part of Google's December release wave, a Gemini TTS model shipped alongside realtime model updates. It rounded out Google's full-stack voice story heading into 2026.

Hexgrad (Kokoro)
New Models, Open weights

Kokoro TTS

Kokoro TTS: 82M-param Apache 2 model hits #1 on TTS Arena

Kokoro, a tiny 82M-parameter text-to-speech model, went viral in January 2025 after hitting #1 on TTS Arena. Released under Apache 2.0 and small enough to run in the browser, it showed that high-quality speech synthesis no longer required huge models.

OpenAI
New Models

New voice models (GPT Realtime derivatives)

OpenAI ships two new voice models derived from GPT Realtime

In March 2025, OpenAI released two voice models derived from its GPT Realtime speech-to-speech stack. They were part of a wave that pushed voice agents toward the mainstream over the course of 2025.

Resemble AI
New Models, Open weights

Chatterbox Turbo

Resemble AI open-sources Chatterbox Turbo, a 350M MIT-licensed TTS

Resemble AI released Chatterbox Turbo, an MIT-licensed 350M-parameter open text-to-speech model. The company claims it beats ElevenLabs in blind listening tests, pushing high-quality TTS into fully open, accessible territory.

xAI
APIs & Platforms

Grok Voice Agent API

xAI Grok Voice Agent API ships at $0.05/min flat rate, powers Tesla

xAI launched the Grok Voice Agent API with flat-rate pricing of $0.05 per minute and integration into Tesla vehicles. xAI claims the #1 spot on Big Bench Audio at 92.3%, tightening competition in the rapidly commoditizing real-time voice stack.

$0.05/min Grok Voice Agent API

Amazon
New Models

Amazon Nova 2

Amazon announces Nova 2 family: Lite, Pro, Sonic, and Omni

Amazon rolled out the Nova 2 model suite spanning text, speech, and multimodal stacks with Lite, Pro, Sonic, and Omni variants. The launch came with major benchmark jumps over the first Nova generation and includes a fast, cost-effective reasoning model in Nova 2 Lite.

Microsoft
New Models, Open weights

VibeVoice-Realtime-0.5B

Microsoft shares VibeVoice-Realtime-0.5B with ~300ms latency TTS

Microsoft published VibeVoice-Realtime-0.5B on Hugging Face, a small realtime text-to-speech model claiming roughly 300ms latency. The show framed it as more evidence that sub-second audio response is becoming table stakes for production voice agents.

~300ms Claimed TTS latency
0.5B Parameters

November 2025

ElevenLabs
New Models

Scribe v2 Realtime

ElevenLabs launches Scribe v2 Realtime speech-to-text with 150ms latency

ElevenLabs launched Scribe v2 Realtime, a streaming speech-to-text model with roughly 150ms latency and support for over 90 languages, demoed live by Paul Asjes. It auto-switches languages mid-stream and handles code, initialisms, and technical terms with context-aware transcription, outpacing Whisper on speed and accuracy.

150ms Latency
90+ Languages (Scribe)

Meta AI
New Models, Open weights

Omnilingual ASR

Meta releases Omnilingual ASR covering 1,600+ languages

Meta released Omnilingual ASR, an Apache 2.0 speech recognition family supporting over 1,600 languages, including 500+ never before served by any ASR system, with character error rate under 10% for 78 languages. The release includes an open corpus of 500k+ rows of transcribed audio, and the 1B model was praised as a near drop-in state-of-the-art replacement on Hugging Face.

1600+ Languages Supported

Inworld AI
New Models

Inworld TTS

Inworld TTS takes the #1 spot on the Artificial Analysis speech benchmark

Inworld released a new version of its TTS model that claimed the #1 position on the Artificial Analysis text-to-speech benchmark. It featured in the episode's voice segment as evidence that commercial TTS quality keeps climbing fast.

Maya Research
New Models, Open weights

Maya-1

Maya-1 open-source voice generation model released

Maya-1 is a new open-source voice generation model that was demoed on the show as part of the week's voice AI wave. The panel highlighted how quickly open voice model quality is improving, with expressive output that holds up against commercial systems.

Sandbar
Products & Apps

Stream / Stream Ring

Sandbar launches Stream voice assistant and Stream Ring wearable

Sandbar launched Stream, a voice-first personal assistant, alongside Stream Ring, a wearable described as a 'mouse for voice' that is now available for preorder. The pairing pushes always-available voice interaction into dedicated hardware rather than the phone.

October 2025

Cartesia
New Models

Sonic 3

Cartesia Sonic 3: real-time TTS with emotion and laughter, plus $100M raise

Cartesia launched Sonic 3, a real-time text-to-speech model that adds expressive emotion and natural laughter, announced alongside a $100M funding round. Co-founder Arjun Desai joined the show to break down the voice stack and why state-space-model approaches enable this latency and expressiveness.

$100M funding round announced alongside the launch

Decart AI
APIs & Platforms

Real-Time Lip Sync API

Decart ships real-time lip-sync API for live AI avatars

Decart AI released a real-time lip-sync API that modifies an avatar's video frames to match generated speech on the fly. Kwindla Kramer broke down the pipeline on the show: WebRTC audio capture, Whisper transcription, an LLM response, ElevenLabs voice generation, then Decart's model syncing the avatar's lips, all at sub-two-second latency, a key step toward interactive, believable AI characters.
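
The pipeline described above can be read as a latency budget, sketched here in Python. The stage names mirror the description, but every per-stage number is a hypothetical placeholder rather than a measurement from the show, and the sketch assumes the stages run strictly one after another. Real pipelines stream and overlap stages, so a plain sum is a pessimistic upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages: list[Stage]) -> float:
    """Worst-case end-to-end latency if every stage waits for the previous one."""
    return sum(stage.latency_ms for stage in stages)


def slowest_stage(stages: list[Stage]) -> Stage:
    """The stage to optimize first when the budget is blown."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Hypothetical per-stage latencies, chosen only to illustrate the arithmetic.
PIPELINE = [
    Stage("webrtc_audio_capture", 100),
    Stage("whisper_transcription", 400),
    Stage("llm_response", 600),
    Stage("elevenlabs_voice_generation", 400),
    Stage("decart_lip_sync", 200),
]

BUDGET_MS = 2000  # the "sub-two-second" target quoted above

if __name__ == "__main__":
    total = total_latency_ms(PIPELINE)
    print(f"total: {total:.0f} ms, within budget: {total < BUDGET_MS}")
    print(f"slowest stage: {slowest_stage(PIPELINE).name}")
```

Swapping in measured numbers for each stage shows at a glance which component dominates and how much headroom remains under the two-second target.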

<2s end-to-end pipeline latency

Krea AI
New Models, Open weights

Krea Realtime Video

Krea open-sources a 14B real-time video generation model

Krea AI open-sourced a 14-billion-parameter real-time video model, with weights on Hugging Face. It joins the week's clear trend of generative video racing toward live, interactive experiences rather than offline rendering.

14B parameters

Microsoft
Major Features & Updates

Edge Copilot Mode (agentic)

Microsoft adds agentic powers and voice to Copilot Mode in Edge

Microsoft answered Atlas with agentic enhancements to Copilot Mode in Edge, including a voice mode that can see and discuss the current page, plus broader Copilot updates (and Clippy back as an easter egg via the Mico avatar). In Alex's hands-on testing the agentic features did not actually work, so real-world parity with Atlas and Comet is unproven.

September 2025

Alibaba (Qwen)
New Models, Open weights

Qwen3-Omni

Qwen3-Omni ships open-weights any-to-any audio, vision, and text

Alongside Qwen3-VL, Alibaba released Qwen3-Omni, an end-to-end omni-modal open-weights model that takes text, image, audio, and video input and can respond with streaming speech. The show treated it as direct evidence of how fast open multimodal systems are improving, with weights on Hugging Face, a GitHub repo, demos, and availability in Qwen Chat and the Model Studio API.

Alibaba (Qwen)
New Models

Qwen3-TTS-Flash

Qwen3-TTS-Flash multilingual text-to-speech lands via Alibaba's API

Part of the same Qwen release streak, Qwen3-TTS-Flash is a low-latency multilingual text-to-speech model with multiple voices and dialect support, offered through Alibaba Cloud Model Studio's API rather than as open weights. It fed into the episode's closing audio-demo pileup, where voice launches were treated as product proof points.

Huxe
Products & Apps

Huxe

Huxe personal audio briefing app opens to everyone

Huxe, the personal audio app from former Google NotebookLM team members that generates proactive, personalized audio briefings, just opened to the public. It came up alongside ChatGPT Pulse as another take on proactive, ambient AI products.

Reka AI
New Models

Reka Speech

Reka Speech: high-throughput multilingual ASR and speech translation

Reka AI announced Reka Speech, a high-throughput multilingual speech recognition and speech translation model with timestamps, aimed at batch-scale transcription pipelines. It positions Reka in the production ASR market against incumbent transcription APIs.

OpenAI
New Models

gpt-realtime

OpenAI ships gpt-realtime and takes the Realtime API to GA

OpenAI shipped the gpt-realtime speech-to-speech model and moved the Realtime API to general availability. The GA release adds remote MCP tool support, image input, and SIP phone calling, making it a full production stack for voice agents and tying into the episode's voice-agents discussion with Kwindla Kramer.

May 2025

Kyutai
Products & Apps

Unmute.sh

Kyutai launches Unmute.sh, a low-latency voice wrapper for any LLM

Kyutai (the lab behind Moshi) launched Unmute.sh, a modular wrapper that adds voice to any text LLM with under 300ms latency and semantic VAD that knows a thinking pause from a breath. It preserves the underlying text model's capabilities while adding natural voice interaction, and is slated to be open-sourced.

OpenAI
Major Features & Updates

Advanced Voice Mode

OpenAI's Advanced Voice Mode can now sing

OpenAI updated ChatGPT's Advanced Voice Mode with new capabilities, including the ability to sing. It was part of a week in which voice interfaces kept converging on more natural, expressive interaction.

Lightricks
New Models

LTX Video (distilled)

LTX distilled model enables near real-time video generation

Lightricks shared a distilled version of its LTX video model that generates video at near real-time speeds. It was highlighted in the vision and video segment as a notable speed milestone for video generation.

MiniMax (Hailuo)
Papers & Research

MiniMax Speech

MiniMax Speech tech report published, called the best TTS out there

MiniMax (Hailuo) published the technical report for MiniMax Speech, its text-to-speech system, which the show described as the best TTS out there. The report, posted on arXiv, details the architecture behind the system.

April 2025

Daily (Pipecat)
New Models, Open weights

Smart-Turn VAD

Pipecat releases Smart-Turn, an open source semantic VAD model

The Pipecat team (from Daily) released Smart-Turn, an open source semantic voice activity detection model that understands when a speaker has actually finished their turn rather than just detecting silence. Kwindla Kramer joined the show to break down how semantic VAD makes voice agent conversations feel far more natural, with a community training effort at turn-training.pipecat.ai.
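
The difference between silence-based and semantic turn detection can be sketched in a few lines of Python. This is not the Smart-Turn API: the function names, thresholds, and the `completion_prob` input (standing in for a turn-completion model's output) are all hypothetical, chosen only to show why a semantic signal avoids both cutting in on thinking pauses and waiting too long after a finished sentence.

```python
def silence_only_end_of_turn(silence_ms: float, threshold_ms: float = 800) -> bool:
    """Classic VAD endpointing: the turn ends once silence lasts long enough."""
    return silence_ms >= threshold_ms


def semantic_end_of_turn(
    silence_ms: float,
    completion_prob: float,
    min_silence_ms: float = 200,
    prob_threshold: float = 0.5,
    max_silence_ms: float = 3000,
) -> bool:
    """End the turn when a (hypothetical) model thinks the utterance is complete.

    `completion_prob` is assumed to come from a turn-completion model scoring
    the audio so far. A hard silence cap keeps the agent from waiting forever.
    """
    if silence_ms >= max_silence_ms:
        return True  # fallback: the speaker has been quiet for too long
    return silence_ms >= min_silence_ms and completion_prob >= prob_threshold


if __name__ == "__main__":
    # Thinking pause mid-sentence: long silence, utterance clearly unfinished.
    print(silence_only_end_of_turn(1000), semantic_end_of_turn(1000, 0.1))
    # Finished sentence: short silence, utterance clearly complete.
    print(silence_only_end_of_turn(300), semantic_end_of_turn(300, 0.9))
```

In the first case the silence-only detector interrupts while the semantic one keeps listening; in the second the semantic detector responds promptly while the silence-only one is still waiting out its timer.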

Nari Labs
New Models, Open weights

Dia-1.6B

Nari Labs' Dia: a wild 1.6B open source TTS model that blew up Twitter

Nari Labs released Dia, a 1.6B parameter open-weights text-to-speech model that absolutely blew up Twitter with its expressive, emotional dialogue generation, including laughs, coughs, and multi-speaker conversations. Built by a tiny team, it punches far above its weight against commercial TTS systems and supports voice cloning, with demos available on Fal.ai.

1.6B Parameters

Amazon
New Models

Nova Sonic

Amazon unveils Nova Sonic, a speech-to-speech foundation model

Amazon announced Nova Sonic, a foundational speech-to-speech model that unifies speech understanding and generation for real-time, natural-sounding voice conversations. It is available through Amazon Bedrock as part of the Nova family.

Gladia
New Models

Solaria STT

Gladia launches Solaria speech-to-text model

Gladia launched Solaria, a new speech-to-text model offered through its transcription platform. It arrived in a busy week for voice AI alongside Hailuo's Speech-02 TTS.

March 2025

Alibaba (Qwen)
New Models, Open weights

Qwen2.5-Omni-7B

Qwen launches Omni 7B: sees, hears, reads, and talks back

Qwen released Qwen2.5-Omni-7B, an open-weights omni-modal model that perceives text, images, audio, and video, and generates both text and speech. It packs end-to-end multimodal perception and spoken output into a 7B parameter model available on Hugging Face.

7B parameters

Dev Tools, Open weights

MLX-Audio v0.0.3

Prince Canuma releases MLX-Audio v0.0.3 for speech on Apple Silicon

Prince Canuma, creator of MLX-VLM, FastMLX, and MLX Embeddings, released MLX-Audio v0.0.3, an open-source library bringing speech and audio models to Apple Silicon via MLX. It makes powerful open-source TTS and audio models accessible locally on Mac hardware.

OpenAI
Major Features & Updates

ChatGPT Advanced Voice Mode (semantic VAD)

OpenAI updates ChatGPT advanced voice mode with semantic VAD

Alongside the image generation launch, OpenAI quietly updated ChatGPT's advanced voice mode with semantic voice activity detection. The model now understands when you have actually finished speaking rather than cutting in on pauses, leading to much more natural conversation flow.

Canopy Labs
New Models, Open weights

Orpheus 3B

Canopy Labs drops Orpheus 3B natural-sounding speech model

Canopy Labs released Orpheus, an open speech language model that produces natural, human-sounding speech, headlined by a 3B model with smaller variants (1B, 500M, 150M) in the family. Weights are on Hugging Face with a Colab for trying it out. The release was discussed on the show with Daily.co CEO Kwindla Kramer in the voice AI segment.

NVIDIA
New Models, Open weights

Canary 1B/180M Flash

NVIDIA Canary Flash: Apache 2 speech recognition and translation

NVIDIA released Canary 1B Flash and 180M Flash, Apache 2.0 licensed speech recognition and translation models built as Llama finetunes. The permissive license makes them freely usable for commercial ASR and translation workloads.

OpenAI
New Models

Next-gen audio models (gpt-4o-mini-tts & transcription)

OpenAI launches steerable voice model and two new transcription models

OpenAI launched a new emotionally steerable text-to-speech voice model plus two new transcription models, and the show covered the announcement live as a watch party. The TTS model can be instructed how to speak (tone, emotion, character), as demoed at openai.fm, and the models are available through the API for voice agents.

Sesame
Products & Apps

Sesame conversational voice demo (Maya)

Sesame's ultra-realistic conversational voice demo takes the world by storm

Sesame released a demo of its conversational speech model featuring the Maya voice, and its naturalness, with human-like pauses, laughs, and interruptions, went viral across the AI community. Alex recorded a reaction conversation with Maya showcasing how lifelike the voice model is.

February 2025

Hume AI
New Models

Octave

Hume AI launches Octave, a TTS model that understands what it says

Hume AI released Octave, which it calls the first text-to-speech model that understands what it's saying, adjusting emotion, emphasis, and delivery based on the meaning of the text. It fits the episode's humanlike AI voices theme, letting users direct performances with natural-language acting instructions.

xAI
Major Features & Updates

Grok Voice Mode

xAI ships Grok's unhinged voice mode

A week after launching Grok 3 without voice, xAI released Grok's voice mode, including an 'unhinged' personality option that the panel demoed live. It marks xAI's entry into real-time conversational voice AI alongside OpenAI's advanced voice mode.

January 2025

New Models, Open weights

YuE 7B

YuE 7B: open-source Suno-style music generation model

The Multimodal Art Projection (M-A-P) team released YuE, a 7B open-source music generation model dubbed the 'open Suno' on the show, capable of generating full songs with vocals from lyrics. Weights are on Hugging Face with code on GitHub and a hosted demo on fal.ai.

7B Parameters

Riffusion
Products & Apps

Fuzz

Riffusion launches Fuzz music generation, free for now

Riffusion (written as 'Refusion' in the show notes) launched Fuzz, a hosted AI music generation product that is free to use during its initial period. It was highlighted in the voice and audio segment alongside YuE as part of a wave of new AI music tools.