Voice & Speech

Text-to-speech, speech recognition, real-time voice agents, transcription, and voice cloning. 86 releases covered on the show.

July 2026

OpenAI Jul 8, 2026

Products & Apps

GPT-Live

OpenAI ships GPT-Live, full-duplex voice for ChatGPT

GPT-Live listens while it speaks, deciding many times per second whether to talk, pause, interrupt, or call a tool, and delegates harder queries to GPT-5.5 mid-conversation. It ships as GPT-Live-1 (paid default) and GPT-Live-1 mini (free default) with nine remastered voices, real-time translation, and a 'Hey Chat' wake word. It is limited to consumer tiers at launch, with no API beyond a waitlist form and no Business/Enterprise/Edu availability, and OpenAI's own system card notes small safety regressions versus Advanced Voice Mode.

150M+ Weekly ChatGPT voice users · 2 Model sizes at launch

X announcement ↗Blog ↗System card ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai

Cohere Jul 7, 2026

New Models · Open weights

Transcribe Arabic

Cohere open-sources Transcribe Arabic, topping the Arabic ASR leaderboard

A 2B-parameter Apache 2.0 speech-to-text model that leads the Hugging Face Arabic ASR leaderboard at 25.87 WER, about 11 points better than Whisper Large V3, with human evaluators preferring it in roughly 96% of head-to-head tests. It handles dialect variety, code-switching, and Arabic-English bilingual speech, with day-0 mlx-audio support.

25.87 WER (leaderboard #1) · 2B Parameters, Apache 2.0 · 96% Human preference vs Whisper

X announcement ↗

🎙️ Hear our coverage →

#voice-ai #open-source #multilingual

OpenAI Jul 6, 2026

APIs & Platforms

GPT-Realtime-2.1-mini

GPT-Realtime-2.1-mini brings reasoning and tool use to the Realtime API mini tier

Two days before GPT-Live, OpenAI upgraded the Realtime API mini lineup with reasoning and tool use at unchanged pricing, plus a p95 latency cut of at least 25% from improved caching. Notably, it does not include GPT-Live's full-duplex capability, which remains app-exclusive.

≥25% p95 latency reduction

X announcement ↗

🎙️ Hear our coverage →

#voice-ai #api #agents

June 2026

NVIDIA Jun 4, 2026

New Models · Open weights

Nemotron 3.5 ASR

NVIDIA ships Nemotron 3.5 ASR, a 600M streaming speech model

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter open multilingual streaming speech-to-text model aimed at voice agents. It supports 40 languages and reportedly delivers 17x more throughput than Parakeet-style baselines at half the size, pushing the latency/accuracy frontier for open voice-agent infrastructure.

17x Nemotron ASR throughput

Hugging Face ↗X announcement ↗STT Benchmark ↗Voice Agent Repo ↗

🎙️ Hear our coverage →

#voice-ai #open-source

May 2026

Cartesia May 28, 2026

New Models

Ink-2

Cartesia Ink-2 tops Artificial Analysis's new STT leaderboard

Cartesia released Ink-2, which debuted as the most accurate streaming speech-to-text model with the fastest turnaround on Artificial Analysis's new STT leaderboard. It landed just after recording as part of a double post-show voice-AI drop alongside ElevenLabs Dubbing v2.

Cartesia Ink-2 ↗Cartesia announcement ↗Artificial Analysis STT leaderboard ↗

🎙️ Hear our coverage (+1 follow-up) →

ElevenLabs May 28, 2026

New Models

Dubbing v2

ElevenLabs Dubbing v2 preserves your performance across 90+ languages

ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that translates voices across more than 90 languages while preserving cadence, expression, intonation, and even stutters. Alex's live demos, including dubbing Nisten into Hebrew and his own voice into multiple languages, were the brain-melting moment of the episode.

ElevenLabs Dubbing v2 ↗ElevenLabs announcement ↗ElevenLabs Creative ↗ElevenLabs Productions ↗

🎙️ Hear our coverage (+1 follow-up) →

#voice-ai #multilingual

OpenMOSS May 28, 2026

New Models · Open weights

MOSS-TTS-v1.5

MOSS-TTS-v1.5: open-source 8B TTS with 31 languages

OpenMOSS shipped MOSS-TTS-v1.5, an 8B open-source text-to-speech model supporting 31 languages with pause control, released under Apache 2.0. It is one of the larger fully open TTS models available.

MOSS-TTS-v1.5 on Hugging Face ↗MOSS-TTS GitHub ↗MOSS-TTS paper ↗MOSS announcement ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Meta AI May 14, 2026

Major Features & Updates

Muse Spark voice conversations

Meta launches Muse Spark voice conversations across its apps and glasses

Meta rolled out Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and Ray-Ban Meta glasses. The feature includes real-time image generation, live camera AI, and instant Reels/maps integration. Alex tested it live and called it surprisingly good; it is the first big consumer ship from Meta Superintelligence Labs.

X announcement ↗Announcement ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai #multimodal

Thinking Machines Lab May 14, 2026

New Models

Interaction Models

Thinking Machines Lab drops Interaction Models: real-time multimodal 276B MoE

Mira Murati's Thinking Machines Lab released Interaction Models, a 276B-parameter MoE (12B active) trained from scratch for native real-time multimodal collaboration. It supports full-duplex audio/video/text with 0.40s turn-taking latency and scores 77.8 on FD-bench v1.5. The demo can react live to events like another person entering the camera frame.

276B MoE parameters · 12B active parameters

X announcement ↗Blog ↗

🎙️ Hear our coverage →

#multimodal #voice-ai

April 2026

StepFun Apr 23, 2026

New Models

StepAudio 2.5

StepAudio 2.5 TTS adds natural-language control of emotion and delivery

StepFun released StepAudio 2.5, a text-to-speech model that lets you steer emotion and delivery with natural-language instructions. It was covered in the show's Voice & Audio segment as the week's notable speech release.

StepAudio 2.5 TTS ↗

🎙️ Hear our coverage →

Daily (Pipecat) Apr 16, 2026

Products & Apps · Open weights

Gradient Bang

Gradient Bang: first massively multiplayer fully LLM-driven voice game

Kwindla Kramer's 'side project that broke containment' is a fully LLM-driven multiplayer voice-based space game inspired by BBS-era Trade Wars, built on a new Pipecat Sub-Agents library with a class-based event bus that works locally and over the network. A Deepgram plus GPT-4.1 voice agent always responds in under 1.5 seconds while GPT-5.2 medium-thinking task agents do the work, and the React frontend is rendered from LLM-generated JSON as dynamic UI. The team also open-sourced GB Benchmarks for evaluating agent task execution.

Play Gradient Bang ↗gradient-bang on GitHub ↗Kwindla on Gradient Bang (X) ↗

🎙️ Hear our coverage →

#voice-ai #agents #open-source

Google DeepMind Apr 16, 2026

New Models

Gemini 3.1 Flash TTS

Gemini 3.1 Flash TTS tops TTS Arena at 1,211 Elo with 70+ languages

Google released Gemini 3.1 Flash TTS, which leads TTS Arena at 1,211 Elo, supports 70+ languages with inline audio tags, and costs about $0.03 per minute of audio, roughly 5x cheaper than ElevenLabs. Kwindla noted it is fully promptable like an LLM rather than limited to fixed tags, but its roughly 3-second time-to-first-token makes it batch-only for now rather than usable in live conversational pipelines.

1,211 TTS Arena Elo

Google blog: Gemini 3.1 Flash TTS ↗Try it in AI Studio ↗Logan Kilpatrick announcement (X) ↗

🎙️ Hear our coverage →

#voice-ai #audio

Fish Audio Apr 2, 2026

Products & Apps

Fish Audio STT

Fish Audio launches speech-to-text with automatic emotion tagging

Fish Audio released a speech-to-text product with automatic emotion tagging that feeds directly into its S2 TTS pipeline. The panel saw it as another sign that voice tooling is rapidly commoditizing and challenging incumbent speech providers.

Announcement (X) ↗Fish Audio app ↗Fish Audio blog ↗

🎙️ Hear our coverage →

Microsoft Apr 2, 2026

New Models

MAI-Transcribe-1

Microsoft MAI ships MAI-Transcribe-1, ranked #1 in transcription

Microsoft's MAI lab released MAI-Transcribe-1, an in-house speech transcription model that debuted at #1 in transcription quality. It is part of a three-model drop showing Microsoft expanding its first-party model stack beyond its OpenAI dependence.

Mustafa Suleyman announcement (X) ↗Transcribe blog ↗

🎙️ Hear our coverage →

Microsoft Apr 2, 2026

New Models

MAI-Voice-1

Microsoft MAI debuts MAI-Voice-1 expressive voice model

MAI-Voice-1 is Microsoft's expressive voice model, the third piece of the MAI in-house model drop alongside transcription and image generation. The panel discussed how Microsoft's first-party voice stack compares to specialist voice providers.

Mustafa Suleyman announcement (X) ↗

🎙️ Hear our coverage →

March 2026

Aratako Mar 26, 2026

New Models · Open weights

Irodori-TTS-500M

Irodori-TTS-500M: open Japanese TTS with emoji emotion control

Irodori-TTS-500M is a 500M-parameter open-weights Japanese text-to-speech model released on Hugging Face, notable for controlling emotional delivery through emojis in the input text. It landed as part of the week's wave of voice and audio releases.

Announcement (X) ↗Irodori-TTS-500M on Hugging Face ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Cohere Mar 26, 2026

New Models · Open weights

Cohere Transcribe

Cohere Transcribe: open-source 2B ASR tops Open ASR Leaderboard at 5.42% WER

Cohere entered the ASR game with Transcribe, a 2-billion-parameter Apache 2.0 speech recognition model that immediately took the number-one spot on Hugging Face's Open ASR Leaderboard with a 5.42% word error rate versus Whisper Large v3's 7.44%. It wins 61% of human evaluations on average and 64% head-to-head against Whisper, making it a credible local-inference Whisper replacement for regulated industries.

2B Cohere Transcribe ASR size · 5.42% Word error rate on Open ASR Leaderboard

Cohere announcement (X) ↗Cohere blog: Transcribe ↗Open ASR Leaderboard (Hugging Face) ↗

🎙️ Hear our coverage →

#voice-ai #open-source
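For readers unfamiliar with the metric behind the Cohere Transcribe numbers, here is a minimal sketch of word error rate: word-level edit distance divided by the number of reference words. The function names are hypothetical, and real leaderboards normalize text (casing, punctuation, number formats) before scoring, which this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers `baseline` (both are error rates)."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Relative error reduction for the two WER figures quoted in the entry.
    print(relative_reduction(7.44, 5.42))
```

Plugging in the two figures quoted above, 5.42% versus 7.44% works out to roughly a 27% relative reduction in word errors, which is a more intuitive way to read a two-point absolute gap.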

Google DeepMind Mar 26, 2026

New Models

Gemini 3.1 Flash Live

Google drops Gemini 3.1 Flash Live: Gemini can see, hear, and talk to you

Google released Gemini 3.1 Flash Live, a realtime multimodal model that handles voice and vision interaction in a single model path instead of stitched pipelines. The panel framed it as a major upgrade for end-to-end voice and vision agents, with AI Studio and API availability as the immediate way to experiment.

Google DeepMind announcement (X) ↗

🎙️ Hear our coverage →

#voice-ai #agents

Mistral AI Mar 26, 2026

New Models · Open weights

Voxtral TTS

Mistral drops Voxtral TTS, a 3B open-weight text-to-speech model

Mistral released Voxtral TTS, its first text-to-speech model, as breaking news during the live show: 3 billion parameters, open weights, with emotion controls for neutral, happy, and frustrated voices. Mistral claims it beats ElevenLabs Flash v2.5 in human preference tests with a 58% win rate on flagship voices and 68% on zero-shot voice cloning, though Alex's live test found it decent rather than stunning.

3B Mistral Voxtral TTS size

Mistral AI announcement (X) ↗Mistral blog: Voxtral TTS ↗

🎙️ Hear our coverage →

#voice-ai #open-source

xAI Mar 19, 2026

APIs & Platforms

Grok Text-to-Speech API

xAI launches Grok TTS API with 5 voices and WebSocket streaming

xAI launched a Grok Text-to-Speech API with five voices, expressive controls, and WebSocket streaming, priced cheaper than ElevenLabs. It adds another option to a suddenly competitive voice AI market alongside open-source entrants like Fish Audio S2.

xAI on X ↗Grok voice API ↗Try text-to-speech ↗

🎙️ Hear our coverage →

Fish Audio Mar 13, 2026

New Models · Open weights

Fish Audio S2

Fish Audio S2 open TTS hits sub-150ms latency

Fish Audio S2 is a fully open-source TTS model with inline emotion control via free-text bracket tags like gasp, laughter, and long pause. Alex demoed it live with an OpenClaw skill that let his 5-year-old talk to a voice clone of 'Rocky' from Project Hail Mary; Wolfram called it 'ElevenLabs V3 for free.'

<150ms Fish Audio S2 TTS latency

Fish Audio S2 on X ↗Fish Speech 2 on HuggingFace ↗fish.audio ↗

🎙️ Hear our coverage (+1 follow-up) →

#voice-ai #open-source

February 2026

OpenAI Feb 26, 2026

New Models

gpt-audio-1.5 & gpt-realtime-1.5

OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5

OpenAI shipped gpt-audio-1.5 and gpt-realtime-1.5, updated audio and realtime voice models available through its platform. The release was covered in the week's voice and audio roundup.

Release noted on X ↗OpenAI models docs ↗

🎙️ Hear our coverage →

#voice-ai #audio #api

Mistral AI Feb 5, 2026

New Models · Open weights

Voxtral Transcribe 2

Mistral's Voxtral Transcribe 2 dethrones Whisper as SOTA transcription

Mistral AI launched Voxtral Transcribe 2, state-of-the-art speech-to-text with sub-200ms latency, native diarization support, and open weights under Apache 2.0. The panel called it the first model to dethrone Whisper after roughly three years, and Alex used it to transcribe this very episode.

X announcement ↗Mistral blog ↗Docs ↗Demo ↗

🎙️ Hear our coverage →

#voice-ai #open-source

OpenBMB Feb 5, 2026

New Models · Open weights

MiniCPM-o 4.5

MiniCPM-o 4.5: first open-source full-duplex omni model

OpenBMB released MiniCPM-o 4.5, the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously. It can listen while speaking and even interrupt the user, bringing real-time conversational behavior to open weights.

X announcement ↗Hugging Face ↗GitHub ↗

🎙️ Hear our coverage →

#open-source #voice-ai #multimodal

January 2026

Decart Jan 29, 2026

New Models

Lucy 2.0

Lucy 2.0 real-time video generation model

Lucy 2.0, Decart's real-time video generation model, was discussed in the episode's AI Art segment, with the focus on its real-time capabilities.

🎙️ Hear our coverage →

#video-gen #voice-ai

NVIDIA Jan 29, 2026

New Models · Open weights

PersonaPlex-7B

NVIDIA releases PersonaPlex-7B voice model

NVIDIA released PersonaPlex-7B, an open voice/audio model published on Hugging Face with code on GitHub. Listed in the week's Voice & Audio releases.

Announcement (X) ↗Hugging Face ↗GitHub ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Alibaba (Qwen) Jan 22, 2026

New Models · Open weights

Qwen3-TTS

Qwen3-TTS: open-source TTS family with 97ms latency and voice cloning

Alibaba's Qwen team released Qwen3-TTS, a full open-source text-to-speech family under Apache 2.0 that dropped 30 minutes before the show. It spans 5 models from 0.6B to 1.7B parameters, with 97ms latency, voice cloning from just 3 seconds of audio, voice description prompting, and 10-language support.

97ms Latency

Qwen3-TTS announcement (X) ↗Qwen3-TTS on Hugging Face ↗Qwen3-TTS on GitHub ↗

🎙️ Hear our coverage →

#voice-ai #open-source

FlashLabs Jan 22, 2026

New Models · Open weights

Chroma 1.0

FlashLabs Chroma 1.0: open-source real-time speech-to-speech under 150ms

FlashLabs released Chroma 1.0, billed as the world's first open-source end-to-end real-time speech-to-speech model, with voice cloning and latency under 150ms. The 4B-parameter model is built on Qwen 2.5 Omni and released under Apache 2.0; its live demo with RAG and document upload impressed the whole panel.

FlashLabs Chroma 1.0 announcement (X) ↗FlashLabs Chroma-4B on Hugging Face ↗Chroma paper (arXiv) ↗FlashLabs Voice Agents demo ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Inworld AI Jan 22, 2026

New Models

TTS-1.5

Inworld TTS-1.5 claims #1 TTS ranking at half a cent per minute

Inworld AI launched TTS-1.5, a closed-source text-to-speech model claiming the #1 ranking with sub-250ms latency. Its headline is price: roughly $5 per million characters (about half a cent per minute) versus ElevenLabs' $120 per million characters.

Inworld AI TTS-1.5 announcement (X) ↗Inworld AI TTS playground ↗

🎙️ Hear our coverage →
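The per-character and per-minute prices in the Inworld entry can be reconciled with a little arithmetic. The sketch below assumes roughly 1,000 characters per spoken minute, a rate implied by the entry's own two figures ($5 per million characters and about half a cent per minute) rather than one published by either vendor; the function names are hypothetical.

```python
# Assumption: about 1,000 characters of text per minute of synthesized speech.
# This is inferred from the entry's own numbers, not a vendor-published rate.
CHARS_PER_MINUTE = 1_000.0


def cost_per_minute(usd_per_million_chars: float,
                    chars_per_minute: float = CHARS_PER_MINUTE) -> float:
    """Convert a per-million-character price into dollars per spoken minute."""
    return usd_per_million_chars / 1_000_000 * chars_per_minute


def price_ratio(expensive: float, cheap: float) -> float:
    """How many times more expensive one per-character price is than another."""
    return expensive / cheap


if __name__ == "__main__":
    print(f"$5/M chars   -> ${cost_per_minute(5.0):.4f} per minute")
    print(f"$120/M chars -> ${cost_per_minute(120.0):.4f} per minute")
    print(f"price ratio  -> {price_ratio(120.0, 5.0):.0f}x")
```

Under that assumption the quoted prices come to about $0.005 versus $0.12 per minute, a 24x gap. The ratio itself does not depend on the speaking-rate assumption, since both prices are per character.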

KAIST Jan 8, 2026

Papers & Research

Avatar Forcing

KAIST's Avatar Forcing: real-time interactive talking heads

KAIST published Avatar Forcing, a framework for real-time interactive talking-head avatars with approximately 500ms latency. The paper targets responsive, live avatar interaction rather than offline video generation.

Avatar Forcing Paper (KAIST) ↗

🎙️ Hear our coverage →

#video-gen #voice-ai

Liquid AI Jan 8, 2026

New Models · Open weights

LFM 2.5

Liquid AI LFM 2.5: 1B on-device family with end-to-end audio

Liquid AI released LFM 2.5, a family of ~1.2B-parameter on-device models spanning text, vision, and audio, announced at CES alongside AMD's Lisa Su. The models hit 239 tokens/sec on an AMD CPU and 100 tokens/sec on an iPhone 16 Pro Max, and include an end-to-end audio model that skips the traditional ASR-LLM-TTS pipeline entirely, running in as little as 8GB of RAM.

Liquid AI LFM 2.5 on X ↗LFM 2.5 on Hugging Face ↗

🎙️ Hear our coverage →

#open-source #on-device #voice-ai

NVIDIA Jan 8, 2026

New Models · Open weights

Nemotron Speech ASR

Nemotron Speech ASR: 600M streaming model with 24ms latency

NVIDIA released Nemotron Speech ASR, a 600M-parameter open-source streaming speech recognition model with 24ms median latency and support for 900 concurrent streams on a single H100. Kwindla Hultman Kramer of Daily.co demoed sub-500ms voice-to-voice latency using a three-model pipeline of Nemotron ASR, Nemotron Nano LLM, and Magpie TTS.

24ms Nemotron Speech latency

NVIDIA Nemotron AI Dev on X ↗Nemotron Speech on Hugging Face ↗Nemotron Speech ASR Blog ↗

🎙️ Hear our coverage →

#voice-ai #open-source

December 2025

Alibaba (Qwen) Dec 25, 2025

New Models · Open weights

Qwen speech-to-speech model

Qwen launches speech-to-speech model with emotion handling

Qwen released a speech-to-speech model in March 2025 with internal emotion handling, joining the wave of voice-native models. It was part of the Qwen team's relentless 2025 release cadence across modalities.

Mar 27 Episode ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Daily (Pipecat) Dec 25, 2025

New Models · Open weights

Smart Turn Detection

Daily ships smart turn detection for voice agents

Kwindla's Daily.co shipped smart turn detection during Q2 2025, an open model that helps voice agents know when a speaker has actually finished talking. It landed in the quarter when voice agents first got attention outside the builder bubble.

🎙️ Hear our coverage →

#voice-ai #agents

Google DeepMind Dec 25, 2025

New Models

Gemini TTS

Google ships a Gemini TTS model in its December run

As part of Google's December release wave, a Gemini TTS model shipped alongside realtime model updates. It rounded out Google's full-stack voice story heading into 2026.

🎙️ Hear our coverage →

Hexgrad (Kokoro) Dec 25, 2025

New Models · Open weights

Kokoro TTS

Kokoro TTS: 82M-param Apache 2 model hits #1 on TTS Arena

Kokoro, a tiny 82M-parameter text-to-speech model, went viral in January 2025 after hitting #1 on TTS Arena. Released under Apache 2.0 and small enough to run in the browser, it showed that high-quality speech synthesis no longer required huge models.

Jan 10 Episode ↗

🎙️ Hear our coverage →

#voice-ai #open-source

OpenAI Dec 25, 2025

New Models

New voice models (GPT Realtime derivatives)

OpenAI ships two new voice models derived from GPT Realtime

In March 2025, OpenAI released two voice models derived from its GPT Realtime speech-to-speech stack. They were part of a wave that pushed voice agents toward the mainstream over the course of 2025.

Mar 20 Episode ↗

🎙️ Hear our coverage →

Resemble AI Dec 18, 2025

New Models · Open weights

Chatterbox Turbo

Resemble AI open-sources Chatterbox Turbo, a 350M MIT-licensed TTS

Resemble AI released Chatterbox Turbo, an MIT-licensed 350M-parameter open text-to-speech model. The company claims it beats ElevenLabs in blind listening tests, pushing high-quality TTS into fully open, accessible territory.

Resemble Chatterbox Turbo (GitHub) ↗Chatterbox Turbo (HF) ↗Chatterbox Turbo blog ↗Chatterbox Turbo on X ↗

🎙️ Hear our coverage →

#voice-ai #open-source

xAI Dec 18, 2025

APIs & Platforms

Grok Voice Agent API

xAI Grok Voice Agent API ships at $0.05/min flat rate, powers Tesla

xAI launched the Grok Voice Agent API with flat-rate pricing of $0.05 per minute and integration into Tesla vehicles. xAI claims the #1 spot on Big Bench Audio at 92.3%, tightening competition in the rapidly commoditizing real-time voice stack.

$0.05/min Grok Voice Agent API

xAI Grok Voice Agent API ↗

🎙️ Hear our coverage →

#voice-ai #agents #api

Amazon Dec 4, 2025

New Models

Amazon Nova 2

Amazon announces Nova 2 family: Lite, Pro, Sonic, and Omni

Amazon rolled out the Nova 2 model suite spanning text, speech, and multimodal stacks with Lite, Pro, Sonic, and Omni variants. The launch came with major benchmark jumps over the first Nova generation and includes a fast, cost-effective reasoning model in Nova 2 Lite.

Amazon Nova 2 launch (AWS blog) ↗Amazon News announcement on X ↗

🎙️ Hear our coverage →

#frontier-models #voice-ai #reasoning

Microsoft Dec 4, 2025

New Models · Open weights

VibeVoice-Realtime-0.5B

Microsoft shares VibeVoice-Realtime-0.5B with ~300ms latency TTS

Microsoft published VibeVoice-Realtime-0.5B on Hugging Face, a small realtime text-to-speech model claiming roughly 300ms latency. The show framed it as more evidence that sub-second audio response is becoming table stakes for production voice agents.

~300ms Claimed TTS latency · 0.5B Parameters

Microsoft VibeVoice-Realtime-0.5B (Hugging Face) ↗Community post on X ↗

🎙️ Hear our coverage →

#voice-ai #open-source

November 2025

OpenAI Nov 27, 2025

Major Features & Updates

ChatGPT Voice Mode

OpenAI integrates ChatGPT Voice Mode directly into chats

OpenAI integrated ChatGPT's Voice Mode directly into the chat interface instead of a separate full-screen experience. Users can now talk to ChatGPT while seeing transcripts and visual responses inline in the conversation.

OpenAI Voice Mode Announcement on X ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai

ElevenLabs Nov 13, 2025

New Models

Scribe v2 Realtime

ElevenLabs launches Scribe v2 Realtime speech-to-text with 150ms latency

ElevenLabs launched Scribe v2 Realtime, a streaming speech-to-text model with roughly 150ms latency and support for over 90 languages, demoed live by Paul Asjes. It auto-switches languages mid-stream and handles code, initialisms, and technical terms with context-aware transcription, outpacing Whisper on speed and accuracy.

150ms Latency · 90+ Languages (Scribe)

ElevenLabs announcement on X ↗ElevenLabs Agents ↗ElevenLabs docs ↗ElevenLabs Scribe V2 Real Time ↗

🎙️ Hear our coverage (+1 follow-up) →

Google DeepMind Nov 13, 2025

Major Features & Updates

Gemini Live

Gemini Live gets a conversational voice upgrade

Google rolled out an upgrade to Gemini Live's voice capabilities, making conversations more natural. Covered in the big-companies roundup alongside GPT-5.1 and Grok 4 Fast as the voice interface race heats up.

Gemini Live upgrade on X ↗

🎙️ Hear our coverage →

Meta AI Nov 13, 2025

New Models · Open weights

Omnilingual ASR

Meta releases Omnilingual ASR covering 1,600+ languages

Meta released Omnilingual ASR, an Apache 2.0 speech recognition family supporting over 1,600 languages, including 500+ never before served by any ASR system, with character error rate under 10% for 78 languages. The release includes an open corpus of 500k+ rows of transcribed audio, and the 1B model was praised as a near drop-in state-of-the-art replacement on Hugging Face.

1,600+ Languages Supported

AI at Meta announcement on X ↗Meta blog post ↗Research paper ↗Omnilingual ASR corpus on Hugging Face ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Inworld AI Nov 6, 2025

New Models

Inworld TTS

Inworld TTS takes the #1 spot on the Artificial Analysis speech benchmark

Inworld released a new version of its TTS model that claimed the #1 position on the Artificial Analysis text-to-speech benchmark. It featured in the episode's voice segment as evidence that commercial TTS quality keeps climbing fast.

🎙️ Hear our coverage →

#voice-ai #benchmarks

Maya Research Nov 6, 2025

New Models · Open weights

Maya-1

Maya-1 open-source voice generation model released

Maya-1 is a new open-source voice generation model that was demoed on the show as part of the week's voice AI wave. The panel highlighted how quickly open voice model quality is improving, with expressive output that holds up against commercial systems.

🎙️ Hear our coverage →

#voice-ai #open-source

Sandbar Nov 6, 2025

Products & Apps

Stream / Stream Ring

Sandbar launches Stream voice assistant and Stream Ring wearable

Sandbar launched Stream, a voice-first personal assistant, alongside Stream Ring, a wearable described as a 'mouse for voice' that is now available for preorder. The pairing pushes always-available voice interaction into dedicated hardware rather than the phone.

🎙️ Hear our coverage →

#voice-ai #infrastructure #consumer-ai

October 2025

Cartesia Oct 30, 2025

New Models

Sonic 3

Cartesia Sonic 3: real-time TTS with emotion and laughter, plus $100M raise

Cartesia launched Sonic 3, a real-time text-to-speech model that adds expressive emotion and natural laughter, announced alongside a $100M funding round. Co-founder Arjun Desai joined the show to break down the voice stack and why state-space-model approaches enable this latency and expressiveness.

$100M funding round announced alongside the launch

X announcement ↗Sonic website ↗Docs ↗

🎙️ Hear our coverage →

#voice-ai #industry

MiniMax Oct 30, 2025

New Models

MiniMax Speech 2.6

MiniMax Speech 2.6: ultra-human voice AI with sub-250ms latency

MiniMax released Speech 2.6, a voice model targeting ultra-human quality with end-to-end latency under 250ms, available through the MiniMax platform API. It slots into the episode's voice arms race alongside Cartesia's Sonic 3.

<250ms latency

X announcement ↗MiniMax audio ↗API docs ↗

🎙️ Hear our coverage →

Decart AI Oct 23, 2025

APIs & Platforms

Real-Time Lip Sync API

Decart ships real-time lip-sync API for live AI avatars

Decart AI released a real-time lip-sync API that modifies an avatar's video frames to match generated speech on the fly. Kwindla Kramer broke down the pipeline on the show: WebRTC audio capture, Whisper transcription, an LLM response, ElevenLabs voice generation, then Decart's model syncing the avatar's lips, all at sub-two-second latency, a key step toward interactive, believable AI characters.

<2s end-to-end pipeline latency

🎙️ Hear our coverage →

#voice-ai #video-gen
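The pipeline Kwindla described for the Decart demo is a chain of sequential stages whose latencies add up against a two-second target. The sketch below shows one way to reason about such a budget. Every per-stage number is a made-up placeholder for illustration, not a measurement from the show or from any vendor, and the `Stage` type and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def total_latency_ms(stages: list[Stage]) -> float:
    """Sequential stages add up; no overlap or streaming is modeled here."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages: list[Stage], budget_ms: float) -> bool:
    """True when the summed stage latency stays under the target."""
    return total_latency_ms(stages) < budget_ms


def slowest_stage(stages: list[Stage]) -> Stage:
    """The stage to optimize first when the budget is blown."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Placeholder latencies (illustrative only, NOT measured values).
PIPELINE = [
    Stage("webrtc_capture", 100.0),
    Stage("whisper_transcription", 400.0),
    Stage("llm_response", 600.0),
    Stage("elevenlabs_tts", 400.0),
    Stage("decart_lip_sync", 200.0),
]

if __name__ == "__main__":
    print(f"total: {total_latency_ms(PIPELINE):.0f} ms")
    print(f"under 2s budget: {within_budget(PIPELINE, 2_000.0)}")
    print(f"slowest stage: {slowest_stage(PIPELINE).name}")
```

A strictly additive model is pessimistic: production voice pipelines typically stream partial results between stages, so real end-to-end latency can be lower than the sum of the parts.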

Krea AI Oct 23, 2025

New Models · Open weights

Krea Realtime Video

Krea open-sources a 14B real-time video generation model

Krea AI open-sourced a 14-billion-parameter real-time video model, with weights on Hugging Face. It joins the week's clear trend of generative video racing toward live, interactive experiences rather than offline rendering.

14B parameters

🎙️ Hear our coverage →

#video-gen #voice-ai #open-source

Microsoft Oct 23, 2025

Major Features & Updates

Edge Copilot Mode (agentic)

Microsoft adds agentic powers and voice to Copilot Mode in Edge

Microsoft answered Atlas with agentic enhancements to Copilot Mode in Edge, including a voice mode that can see and discuss the current page, plus broader Copilot updates (and Clippy back as an easter egg via the Mico avatar). In Alex's hands-on testing the agentic features did not actually work, so real-world parity with Atlas and Comet is unproven.

X ↗X (Edge) ↗Clippy easter egg ↗

🎙️ Hear our coverage →

#agents #consumer-ai #voice-ai

Microsoft Oct 16, 2025

Major Features & Updates

Windows 11 Copilot Voice

Microsoft makes every Windows 11 PC an AI PC with Copilot voice input

Microsoft announced that every Windows 11 machine becomes an 'AI PC,' adding 'Hey Copilot' voice input and deeper agentic Copilot integration at the OS level. The panel discussed it as a sign of AI assistants moving into the default computing experience.

Zac Bowden on X ↗Windows Blog ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai #agents

September 2025

Alibaba (Qwen) Sep 25, 2025

New Models · Open weights

Qwen3-Omni

Qwen3-Omni ships open-weights any-to-any audio, vision, and text

Alongside Qwen3-VL, Alibaba released Qwen3-Omni, an end-to-end omni-modal open-weights model that takes text, image, audio, and video input and can respond with streaming speech. The show treated it as direct evidence of how fast open multimodal systems are improving, with weights on Hugging Face, a GitHub repo, demos, and availability in Qwen Chat and the Model Studio API.

HF ↗GitHub ↗Qwen Chat ↗Demo ↗

🎙️ Hear our coverage →

#open-source #multimodal #voice-ai

Alibaba (Qwen) Sep 25, 2025

New Models

Qwen3-TTS-Flash

Qwen3-TTS-Flash multilingual text-to-speech lands via Alibaba's API

Part of the same Qwen release streak, Qwen3-TTS-Flash is a low-latency multilingual text-to-speech model with multiple voices and dialect support, offered through Alibaba Cloud Model Studio's API rather than as open weights. It fed into the episode's closing audio-demo pileup, where voice launches were treated as product proof points.

X ↗Blog ↗API ↗

🎙️ Hear our coverage →

Huxe Sep 25, 2025

Products & Apps

Huxe

Huxe personal audio briefing app opens to everyone

Huxe, the personal audio app from former Google NotebookLM team members, just opened up publicly, generating proactive personalized audio briefings. It came up alongside ChatGPT Pulse as another take on proactive, ambient AI products.

🎙️ Hear our coverage →

#consumer-ai #voice-ai

Reka AI Sep 18, 2025

New Models

Reka Speech

Reka Speech: high-throughput multilingual ASR and speech translation

Reka AI announced Reka Speech, a high-throughput multilingual speech recognition and speech translation model with timestamps, aimed at batch-scale transcription pipelines. It positions Reka in the production ASR market against incumbent transcription APIs.

🎙️ Hear our coverage →

OpenAI Sep 4, 2025

New Models

gpt-realtime

OpenAI ships gpt-realtime and takes the Realtime API to GA

OpenAI shipped the gpt-realtime speech-to-speech model and moved the Realtime API to general availability. The GA release adds remote MCP tool support, image input, and SIP phone calling, making it a full production stack for voice agents and tying into the episode's voice-agents discussion with Kwindla Kramer.

🎙️ Hear our coverage →

#voice-ai #api #agents

July 2025

Alibaba (Qwen) Jul 3, 2025

New Models

Qwen-TTS

Alibaba launches Qwen-TTS with human-level bilingual naturalness

The Qwen team released Qwen-TTS, a bilingual Chinese/English text-to-speech model claiming human-level naturalness, available via API with a Hugging Face demo space. It was the second voice release of the week alongside Kyutai TTS.

X announcement ↗Hugging Face demo ↗

🎙️ Hear our coverage →

Kyutai Jul 3, 2025

New Models · Open weights

Kyutai TTS

Kyutai releases open low-latency TTS for English and French

Kyutai Labs released an open 1.6B-parameter text-to-speech model with low latency and high voice similarity in English and French. It was one of two TTS launches closing out the episode, underscoring how quickly multimodal product quality is rising.

X announcement ↗Hugging Face model ↗

🎙️ Hear our coverage →

#voice-ai #open-source

May 2025

Anthropic May 29, 2025

Major Features & Updates

Claude Voice Mode

Anthropic releases voice mode on Claude mobile apps

Anthropic shipped a voice mode on mobile, bringing conversational voice AI to the Claude apps. Another entry in the week's theme of every major lab giving its models a voice.

Anthropic X announcement ↗

🎙️ Hear our coverage →

Kyutai May 29, 2025

Products & Apps

Unmute.sh

Kyutai launches Unmute.sh, a low-latency voice wrapper for any LLM

Kyutai (the lab behind Moshi) launched Unmute.sh, a modular wrapper that adds voice to any text LLM with under 300ms latency and semantic VAD that knows a thinking pause from a breath. It preserves the underlying text model's capabilities while adding natural voice interaction, and is slated to be open-sourced.

Try It ↗X announcement ↗

🎙️ Hear our coverage →

OpenAI May 29, 2025

Major Features & Updates

Advanced Voice Mode

OpenAI's Advanced Voice Mode can now sing

OpenAI updated ChatGPT's Advanced Voice Mode with new capabilities, including the ability to sing. The update landed in a week where voice interfaces kept converging on more natural, expressive interaction.

🎙️ Hear our coverage →

Resemble AI May 29, 2025

New ModelsOpen weights

Chatterbox

Resemble AI open-sources Chatterbox voice cloning with emotion control

Resemble AI released Chatterbox, an open-source voice cloning model with emotion control. Weights and code are public on GitHub and Hugging Face, bringing controllable, expressive voice cloning to the open ecosystem.

GitHub ↗Hugging Face ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Lightricks May 15, 2025

New Models

LTX Video (distilled)

LTX distilled model enables near real-time video generation

Lightricks shared a distilled version of its LTX video model that generates video at near real-time speeds. It was highlighted in the vision and video segment as a notable speed milestone for video generation.

Announcement on X ↗

🎙️ Hear our coverage →

#video-gen #voice-ai

MiniMax (Hailuo) May 15, 2025

Papers & Research

MiniMax Speech

MiniMax Speech tech report published, called the best TTS out there

MiniMax (Hailuo) published the technical report for MiniMax Speech, its text-to-speech system, which the show described as the best TTS out there. The report, posted on arXiv, details the architecture behind the system.

🎙️ Hear our coverage →

Google May 1, 2025

Major Features & Updates

NotebookLM Audio Overviews

NotebookLM AI Audio Overviews go multilingual with 50+ languages

Google expanded NotebookLM's AI Audio Overviews (the podcast-style summaries) to support more than 50 languages, taking the feature global beyond its English-only debut.

Google announcement (X) ↗

🎙️ Hear our coverage →

#voice-ai #multilingual

April 2025

Daily (Pipecat) Apr 24, 2025

New ModelsOpen weights

Smart-Turn VAD

Pipecat releases Smart-Turn, an open-source semantic VAD model

The Pipecat team (from Daily) released Smart-Turn, an open-source semantic voice activity detection (VAD) model that detects when a speaker has actually finished their turn rather than just detecting silence. Kwindla Kramer joined the show to break down how semantic VAD makes voice agent conversations feel far more natural, and pointed to a community training effort at turn-training.pipecat.ai.

GitHub ↗HF Model ↗Fal.ai Playground ↗Try It Demo ↗

🎙️ Hear our coverage →

#voice-ai #open-source #agents

Nari Labs Apr 24, 2025

New ModelsOpen weights

Dia-1.6B

Nari Labs' Dia: a wild 1.6B open-source TTS model that blew up Twitter

Nari Labs released Dia, a 1.6B-parameter open-weights text-to-speech model that blew up Twitter with its expressive, emotional dialogue generation, including laughs, coughs, and multi-speaker conversations. Built by a tiny team, it punches far above its weight against commercial TTS systems and supports voice cloning, with demos available on Fal.ai.

1.6B Parameters

X Post Highlight ↗HF Model ↗GitHub ↗Fal.ai Voice Clone Demo ↗

🎙️ Hear our coverage →

#voice-ai #open-source

Amazon Apr 10, 2025

New Models

Nova Sonic

Amazon unveils Nova Sonic, a speech-to-speech foundation model

Amazon announced Nova Sonic, a foundational speech-to-speech model that unifies speech understanding and generation for real-time, natural-sounding voice conversations. It is available through Amazon Bedrock as part of the Nova family.

Amazon blog: Nova Sonic ↗

🎙️ Hear our coverage →

Gladia Apr 3, 2025

New Models

Solaria STT

Gladia launches Solaria speech-to-text model

Gladia launched Solaria, a new speech-to-text model offered through its transcription platform. It arrived in a busy week for voice AI alongside Hailuo's Speech-02 TTS.

Gladia Solaria ↗

🎙️ Hear our coverage →

Hailuo AI (MiniMax) Apr 3, 2025

APIs & Platforms

Speech-02

Hailuo Speech-02 TTS API: potentially SOTA emotional voice cloning

Hailuo (MiniMax) released the Speech-02 TTS API, which Alex called potentially state of the art for emotional control and voice cloning quality. It produces nuanced, realistic synthetic voices and was the standout voice release of the week.

Hailuo Speech-02 announcement on X ↗

🎙️ Hear our coverage →

OpenAI Apr 3, 2025

Major Features & Updates

ChatGPT "Monday" voice

OpenAI ships new emo "Monday" voice in ChatGPT

OpenAI added a new "Monday" voice to ChatGPT's voice mode, an emo-flavored persona released around April 1st. It rounds out a week of OpenAI shipping across models, evals, and product.

OpenAI announcement on X ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai

March 2025

Alibaba (Qwen) Mar 27, 2025

New ModelsOpen weights

Qwen2.5-Omni-7B

Qwen launches Omni 7B: sees, hears, reads, and talks back

Qwen released Qwen2.5-Omni-7B, an open-weights omni-modal model that perceives text, images, audio, and video, and generates both text and speech. It packs end-to-end multimodal perception and spoken output into a 7B-parameter model available on Hugging Face.

7B Parameters

Hugging Face ↗

🎙️ Hear our coverage →

#open-source #multimodal #voice-ai

MLX Community (Prince Canuma) Mar 27, 2025

Dev ToolsOpen weights

MLX-Audio v0.0.3

Prince Canuma releases MLX-Audio v0.0.3 for speech on Apple Silicon

Prince Canuma, creator of MLX-VLM, FastMLX, and MLX Embeddings, released MLX-Audio v0.0.3, an open-source library bringing speech and audio models to Apple Silicon via MLX. It makes powerful open-source TTS and audio models accessible locally on Mac hardware.

GitHub repo ↗Prince Canuma on X ↗

🎙️ Hear our coverage →

#voice-ai #open-source #on-device

OpenAI Mar 27, 2025

Major Features & Updates

ChatGPT Advanced Voice Mode (semantic VAD)

OpenAI updates ChatGPT advanced voice mode with semantic VAD

Alongside the image generation launch, OpenAI quietly updated ChatGPT's advanced voice mode with semantic voice activity detection. The model now understands when you have actually finished speaking rather than cutting in on pauses, leading to much more natural conversation flow.

YouTube announcement ↗

🎙️ Hear our coverage →

Canopy Labs Mar 20, 2025

New ModelsOpen weights

Orpheus 3B

Canopy Labs drops Orpheus 3B natural-sounding speech model

Canopy Labs released Orpheus, an open speech language model that produces natural, human-sounding speech, headlined by a 3B model with smaller variants (1B, 500M, 150M) in the family. Weights are on Hugging Face with a Colab for trying it out. The release was discussed on the show with Daily.co CEO Kwindla Kramer in the voice AI segment.

Blog ↗HF ↗Colab ↗

🎙️ Hear our coverage →

#voice-ai #open-source

NVIDIA Mar 20, 2025

New ModelsOpen weights

Canary 1B/180M Flash

NVIDIA Canary Flash: Apache 2 speech recognition and translation

NVIDIA released Canary 1B Flash and 180M Flash, Apache 2.0 licensed speech recognition and translation models built on an encoder-decoder architecture (FastConformer encoder, Transformer decoder). The permissive license makes them freely usable for commercial ASR and translation workloads.

🎙️ Hear our coverage →

#voice-ai #multilingual #open-source

OpenAI Mar 20, 2025

New Models

Next-gen audio models (gpt-4o-mini-tts & transcription)

OpenAI launches steerable voice model and two new transcription models

OpenAI launched a new emotionally steerable text-to-speech voice model plus two new transcription models, which the show covered live as a watch party. The TTS model can be instructed how to speak (tone, emotion, character) and was demoed at openai.fm; all three models are available through the API for voice agents.

Blog ↗YouTube ↗openai.fm ↗Live watch party clip ↗

🎙️ Hear our coverage →

Sesame Mar 6, 2025

Products & Apps

Sesame conversational voice demo (Maya)

Sesame's ultra-realistic conversational voice demo takes the world by storm

Sesame released a demo of its conversational speech model featuring the Maya voice, which went viral across the AI community for its naturalness: human-like pauses, laughs, and interruptions. Alex recorded a reaction conversation with Maya showcasing how lifelike the voice model is.

Alex's Conversation with Maya (YouTube) ↗

🎙️ Hear our coverage →

xAI Mar 6, 2025

Major Features & Updates

Grok Voice

Grok Voice mode opens up to free users

xAI made Grok's voice mode available to free users, removing the paid-tier requirement. The expansion brings conversational voice AI to everyone on the Grok app.

Announcement (X) ↗

🎙️ Hear our coverage →

#voice-ai #industry

February 2025

Hume AI Feb 27, 2025

New Models

Octave

Hume AI launches Octave, a TTS model that understands what it says

Hume AI released Octave, which it calls the first text-to-speech model that understands what it's saying, adjusting emotion, emphasis, and delivery based on the meaning of the text. It fits the episode's humanlike AI voices theme, letting users direct performances with natural-language acting instructions.

🎙️ Hear our coverage →

xAI Feb 27, 2025

Major Features & Updates

Grok Voice Mode

xAI ships Grok's unhinged voice mode

A week after launching Grok 3 without voice, xAI released Grok's voice mode, including an 'unhinged' personality option that the panel demoed live. It marks xAI's entry into real-time conversational voice AI alongside OpenAI's advanced voice mode.

🎙️ Hear our coverage →

#voice-ai #consumer-ai

January 2025

M-A-P (Multimodal Art Projection) Jan 30, 2025

New ModelsOpen weights

YuE 7B

YuE 7B: open-source Suno-style music generation model

The Multimodal Art Projection (M-A-P) team released YuE, a 7B open-source music generation model dubbed the 'open Suno' on the show, capable of generating full songs with vocals from lyrics. Weights are on Hugging Face with code on GitHub and a hosted demo on fal.ai.

7B Parameters

Demo (fal.ai) ↗Hugging Face ↗GitHub ↗

🎙️ Hear our coverage →

#voice-ai #audio #open-source

Riffusion Jan 30, 2025

Products & Apps

Fuzz

Riffusion launches Fuzz music generation, free for now

Riffusion (written as 'Refusion' in the show notes) launched Fuzz, a hosted AI music generation product that is free to use during its initial period. It was highlighted in the voice and audio segment alongside YuE as part of a wave of new AI music tools.

Fuzz (free for now) ↗

🎙️ Hear our coverage →

#audio #voice-ai