ElevenLabs launches ElevenMusic platform with 4,000+ indie artists
ElevenLabs launched ElevenMusic, a full music platform with discovery, remixing, and royalties, debuting with over 4,000 indie artists. Alex closed the show with a slow, dreamy indie rock track with reverse vocals, generated on ElevenMusic.
Gemini 3.1 Flash TTS tops TTS Arena at 1,211 Elo with 70+ languages
Google released Gemini 3.1 Flash TTS, which leads TTS Arena at 1,211 Elo, supports 70+ languages with inline audio tags, and costs about $0.03 per 60 seconds, roughly 5x cheaper than ElevenLabs. Kwindla noted it is fully promptable like an LLM rather than limited to fixed tags, but its ~3 second time-to-first-token makes it batch-only for now rather than usable in live conversational pipelines.
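To put the pricing in concrete terms, here is a minimal arithmetic sketch. It uses only the two figures quoted above (about $0.03 per 60 seconds and "roughly 5x cheaper"); the comparison rate it derives is implied by that ratio rather than a published ElevenLabs price, and the function and constant names are made up for illustration.

```python
# Figures taken from the item above; everything else is illustrative.
GEMINI_TTS_USD_PER_MINUTE = 0.03  # "about $0.03 per 60 seconds"
CHEAPER_FACTOR = 5                # "roughly 5x cheaper"


def tts_cost(seconds: float, usd_per_minute: float) -> float:
    """Cost in USD of synthesizing `seconds` of audio at a per-minute rate."""
    return seconds / 60 * usd_per_minute


# Rate implied by the 5x claim (an inference, not a quoted price).
implied_comparison_rate = GEMINI_TTS_USD_PER_MINUTE * CHEAPER_FACTOR

one_hour_gemini = tts_cost(3600, GEMINI_TTS_USD_PER_MINUTE)
one_hour_comparison = tts_cost(3600, implied_comparison_rate)

print(f"1 hour at the quoted rate:  ${one_hour_gemini:.2f}")
print(f"1 hour at the implied rate: ${one_hour_comparison:.2f}")
```

At these rates an hour of generated speech costs under two dollars, so for batch workloads the ~3 second startup latency is likely the bigger constraint than price.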
Google Lyria 3 Pro generates full 3-minute music tracks with structural control
Google DeepMind released Lyria 3 Pro, its most advanced music model, generating full 3-minute tracks with structural control over intros, verses, choruses, and bridges, and even composing music from images. The crew generated a drum-and-bass ThursdAI opener live with spot-on instruction following; output is SynthID watermarked and royalty-free, available to Gemini subscribers and via Producer AI.
OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5
OpenAI shipped gpt-audio-1.5 and gpt-realtime-1.5, updated audio and realtime voice models available through its platform. The release was covered in the week's voice and audio roundup.
Google DeepMind launches Lyria 3 music generation in the Gemini app
Google DeepMind launched Lyria 3, its most advanced AI music generation model, now available in the Gemini app. It generates 32-second high-fidelity music tracks with creative controls and can compose music from uploaded images. Google also published a prompt guide covering vocals, lyrics, and different styles.
ACE-Step 1.5: open-source 'Suno at home' music generation under MIT
ACE-Step 1.5 is an MIT-licensed AI music generator that produces full songs in under 10 seconds on consumer GPUs and runs on a MacBook. The panel demoed it live via Pinokio, generating a ThursdAI song on the spot, and it is available for one-click install.
Kling 3.0: 15-second multi-shot video with native audio
Kuaishou's Kling 3.0 launched as an all-in-one AI video creation engine with native multimodal generation, 15-second multi-shot sequences, built-in audio, and character consistency across scenes. Alongside Grok Imagine, it marks the week native audio and lip sync became table stakes for video models.
Grok Imagine 1.0 tops video arena with native audio and lip sync
xAI launched Grok Imagine 1.0 with 10-second 720p video generation, native audio, and lip sync, taking the #1 spot on the Artificial Analysis text-to-video arena. Generation costs roughly $0.42 per 10-second clip and an API is available.
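For a rough sense of what that price means for longer footage, here is a back-of-the-envelope sketch. The $0.42 and 10-second figures come from the item above; the assumption that longer footage is stitched from whole 10-second clips, each billed in full, is hypothetical and not a statement about how xAI's API bills.

```python
import math

# Figures taken from the item above.
USD_PER_CLIP = 0.42
CLIP_SECONDS = 10


def footage_cost(total_seconds: float) -> float:
    """Estimated USD cost of `total_seconds` of footage.

    Assumes (hypothetically) that footage is assembled from whole
    10-second clips and that a partial clip is billed as a full one.
    """
    clips = math.ceil(total_seconds / CLIP_SECONDS)
    return clips * USD_PER_CLIP


print(f"Per second of video: ${USD_PER_CLIP / CLIP_SECONDS:.3f}")
print(f"One minute of footage: ${footage_cost(60):.2f}")
print(f"25 seconds of footage: ${footage_cost(25):.2f}")
```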
Lightricks open-sources LTX-2 synchronized audio-video model
Lightricks open-sourced LTX-2, billed as the first truly open audio-video generation model with synchronized audio and video output, releasing full training code alongside the weights. A distilled version is available to try on Replicate.
Veo 3: native audio video generation crosses the uncanny valley
Google's Veo 3 stunned everyone in Q2 with video generation that included native audio, which the crew credits with crossing the uncanny valley for AI video. It was a centerpiece of Google I/O 2025 and of Google's comeback year.
Meta SAM Audio brings promptable source separation to audio
Meta released SAM Audio, an audio source separation model that extends the Segment Anything concept to sound. It supports multimodal prompting via text, visual, and temporal cues to isolate sources from audio, with weights on Hugging Face and code on GitHub.
Kling VIDEO 2.6 adds first native audio generation
Kling released VIDEO 2.6, its first video model with native audio generation, producing sound directly alongside generated footage. It was one of two Kling releases this week spanning video and image generation.
LTX-2: native 4K audio+video generation engine from Lightricks
Lightricks announced LTX-2 as breaking news on the show: a video generation engine producing native 4K video (no upscaling) with synchronized audio, positioned as a fast, efficient open alternative to closed models like Sora. It is billed as open-source with weights coming this fall.
Suno rolls out v5 flagship music generation model
Suno rolled out v5, its newest flagship music generation model with cleaner audio quality and more natural vocals. The live audio demos in the show's closing segment were treated as product proof points for how fast AI music quality is climbing.
Stability AI and Arm release Stable Audio Open Small for on-device audio
Stability AI, together with Arm, released Stable Audio Open Small, a 341M-parameter open text-to-audio model built for real-world on-device deployment. The show framed it as part of a small comeback for Stability, with weights on Hugging Face and an accompanying paper.
DolphinGemma: Google's audio model for decoding dolphin communication
Google, with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma, a ~400M-parameter audio model based on the Gemma architecture using SoundStream audio tokenization. Trained on decades of recorded dolphin clicks, whistles, and pulses, it aims to decipher structure in dolphin communication and runs on a Pixel phone for field deployment.
NotaGen open symbolic music model generates classical sheet music
NotaGen is an open symbolic music generation model that produces high-quality classical sheet music rather than raw audio. The release includes code on GitHub, weights on Hugging Face, and a browser demo.
YuE 7B: open-source Suno-style music generation model
The Multimodal Art Projection (M-A-P) team released YuE, a 7B open-source music generation model dubbed the 'open Suno' on the show, capable of generating full songs with vocals from lyrics. Weights are on Hugging Face with code on GitHub and a hosted demo on fal.ai.
Riffusion launches Fuzz music generation, free for now
Riffusion (written as 'Refusion' in the show notes) launched Fuzz, a hosted AI music generation product that is free to use during its initial period. It was highlighted in the voice and audio segment alongside YuE as part of a wave of new AI music tools.