Episode Summary
This week on ThursdAI, the crew dives into a whirlwind of AI breakthroughs: GPT-5.1 finally lands with a warmer, more personable voice, Grok 4 Fast stuns with a 2 million token context window, and Baidu’s Ernie 4.5 VL shakes up visual reasoning with just 3B active parameters. Meta drops Omnilingual ASR, supporting a jaw-dropping 1600+ languages, while ElevenLabs launches Scribe V2 Real Time for blazing-fast, multilingual transcription. Plus, Dima from W&B demos LEET, a terminal UI that sparks joy for ML practitioners everywhere. It’s a jam-packed episode full of live demos, open source surprises, and breaking news you won’t want to miss.
In This Episode
📰 Introduction and Show Overview
Alex sets the stage for a jam-packed episode, previewing major open source releases, big lab news, and two live interviews. The team teases highlights like Terminal Bench v2, Baidu Ernie 4.5 VL, Grok 4 Fast’s massive context window, and demos from ElevenLabs and Weights & Biases.
- Preview of GPT-5.1, Grok 4 Fast, Baidu Ernie 4.5 VL
- Live demos from ElevenLabs and W&B LEET
🔓 Open Source AI Highlights
The crew kicks off the open source segment, covering community-driven models and benchmarks in a week where Chinese labs and open weights models continue to push the frontier.
- Terminal Bench v2 sets new bar for agentic evals
- Baidu, Qwen, and Meta all drop open source releases
🛠️ Terminal Bench Deep Dive
A deep exploration of Terminal Bench v2, the new gold standard for evaluating coding agents in realistic, terminal-based tasks. The team discusses the benchmark’s difficulty, community contributions, and why a 50% top score is more meaningful than chasing fractions on saturated benchmarks.
- Terminal Bench v2: 89 hard tasks, 1000 Discord contributors
- Warp agent hits 50%, Codex CLI close behind
- Top score of 50% is ideal for meaningful comparison (cf. MMLU at 99%)
🎨 Baidu’s Ernie 4.5 VL and Visual Reasoning
Baidu’s Ernie 4.5 VL drops as a 28B-parameter visual reasoning model with only 3B active parameters, claiming to rival much larger models like GPT-5 High on vision tasks. The team tests it live, scrutinizes the benchmarks, and discusses the GSPO training method.
- Ernie 4.5 VL: 3B active params, Apache 2.0, open weights
- Innovative image zooming, spatial grounding, and reasoning
- GSPO training from Qwen team enables strong small-model performance
🔥 Breaking News: H Company Releases Holo2
Live on air, the team reacts to the surprise open source release of Holo2 by H Company, a new multimodal agent family fine-tuned on Qwen 3 VL and boasting SOTA results on computer use and web navigation tasks. Apache 2.0, four model sizes.
- Holo2: open source, Apache 2.0
- Strong OSWorld-G scores, built on Qwen 3 VL
- 4B, 8B, and 30B model variants
🔊 ElevenLabs’ Scribe V2 Real Time Launch
Paul Asjes from ElevenLabs joins to demo Scribe V2 Real Time, a lightning-fast, multilingual speech-to-text model with 150ms latency and 90+ language support. The team sees live transcription and seamless language switching, and Paul explains how Scribe outpaces Whisper on speed and accuracy.
- Scribe V2 Real Time: 150ms latency, 90+ languages
- Live demo: seamless language auto-switching mid-stream
- Context-aware transcription handles code, initialisms, and technical terms
⚡ This Week’s Buzz: W&B LEET Demo
Dima Duev from W&B’s SDK team demos LEET—a terminal-native dashboard for tracking ML runs even fully offline. The UI brings real-time metrics, beautiful ASCII art, and interactive exploration to the terminal, sparking joy for ML engineers everywhere.
- LEET: Lightweight Experiment Exploration Tool
- Works offline—perfect for air-gapped HPC clusters
- Interactive metrics, system stats, zoomable charts in terminal
🔊 Meta’s Omnilingual ASR Release
Meta (Facebook) releases Omnilingual ASR, a speech recognition model supporting over 1600 languages, including 500 never before served by any ASR system. The team breaks down the technical leap, open source release under Apache 2.0, and the massive curated dataset behind it.
- Omnilingual ASR: 1600+ languages, 500+ new to ASR
- Character error rate <10% for 78% of supported languages
- Apache 2.0, 500k+ rows of transcribed audio
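For readers unfamiliar with the metric, character error rate (CER) is usually defined as the character-level edit distance between the model's transcript and a reference transcript, divided by the reference length. The sketch below is a minimal illustration of that standard definition, not Meta's evaluation code; the function names are made up for this example, and real evaluations typically normalize text (case, punctuation, Unicode form) first.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,                      # delete a reference character
                curr[j - 1] + 1,                  # insert a hypothesis character
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character in an 11-character reference.
    print(character_error_rate("hello world", "hallo world"))
```

Under this definition, a CER below 10% means fewer than one character-level edit for every ten reference characters.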
🏢 Big Companies and APIs
The panel covers the week’s major releases from big labs: GPT-5.1’s new warmer voice, Grok 4 Fast’s 2 million token context window, and Gemini Live voice updates. The crew discusses what these signal for the frontier model race.
- GPT-5.1: warmer voice, personality upgrades
- Grok 4 Fast: 2M token context window
- Gemini Live: updated voice capabilities
Hosts and Guests
Alex Volkov - AI Evangelist, Weights & Biases (@altryne)
Co-Hosts - @WolframRvnwlf, @yampeleg, @ldjconfirmed
Guest: Dima Duev - SDK team, W&B
Guest: Paul Asjes - ElevenLabs (@paul_asjes)
Open Source LLMs
Terminal-Bench 2.0 and Harbor launch (X, Blog, Docs, Announcement)
Baidu releases ERNIE-4.5-VL-28B-A3B-Thinking (X, HF, GitHub, Blog, Platform)
Project AELLA (OSSAS): 100K LLM-generated paper summaries (X, HF)
WeiboAI’s VibeThinker-1.5B (X, HF, Arxiv, Announcement)
Code Arena — live, agentic coding evaluations (X, Blog, Announcement)
Big CO LLMs + APIs
This Week’s Buzz
Voice & Audio
AI Art & Diffusion & 3D