Episode Summary
The most dramatic hour in AI history: Anthropic dropped Opus 4.6 during the show, and exactly one hour later OpenAI countered with GPT 5.3 Codex, a model that helped develop itself. VB from OpenAI joined live to demo the new Codex app with automations, work trees, and skills marketplace. Meanwhile, Qwen 3 Coder Next showed 3B active params can hit 70% SWE-Bench Verified, Mistral's Voxtral dethroned Whisper as SOTA transcription, and the agentic internet exploded with agents building social networks for other agents.
In This Episode
- Intro & Show Overview
- TLDR - Weekly News Roundup
- Open Source LLMs: GLM OCR, Qwen Coder & More
- Qwen 3 Coder Next Deep Dive
- Voice & Audio: Mistral Voxtral & Full Duplex Models
- ACE Step 1.5 - Open Source Music Generation
- BREAKING: Claude Opus 4.6 Release
- Opus 4.6 Benchmarks & Features
- Agent Orchestration & Claude Code Teams
- Video AI: Grok Imagine & Kling 3.0
- BREAKING: GPT 5.3 Codex Release
- Interview: VB from OpenAI on Codex App
- Codex App Features & Demo
- Opus 4.6 vs GPT 5.3 Codex Comparison
- The Agentic Internet & OpenClaw
- Show Recap & Closing Thoughts
Intro & Show Overview
Alex explains this episode was AI-edited using Voxtral for transcription, Opus 4.6 for editorial decisions, and Codex for FFmpeg editing, a meta demonstration of the tools discussed in the show itself; a representative FFmpeg cut is sketched after the bullets below.
- Episode AI-edited using Voxtral + Opus 4.6 + Codex
- Two breaking news drops during the live show
- OpenClaw explodes to 160K GitHub stars
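For the FFmpeg part of that pipeline, here is a minimal sketch of the kind of segment cut an agent like Codex might be asked to produce. The file names and timestamps are made up for illustration; only the ffmpeg flags themselves are standard.

```python
# Illustrative only: cutting one segment out of the raw recording with FFmpeg.
# File names and timestamps below are hypothetical.
import subprocess

def cut_clip(src: str, start: str, end: str, dst: str) -> None:
    # Stream-copy the segment without re-encoding (-c copy keeps it fast and lossless).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", start, "-to", end, "-c", "copy", dst],
        check=True,
    )

cut_clip("thursdai_raw.mp4", "00:12:30", "00:19:45", "opus_46_breaking_segment.mp4")
```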
TLDR - Weekly News Roundup
Quick rundown of the week's major releases: Qwen 3 Coder Next, GLM OCR, InternLM S1 Pro (1T params), Step 3.5 Flash, Codex standalone app, Grok Imagine and Kling 3.0 video models, Voxtral SOTA transcription, and ACE Step 1.5 open-source music.
- Qwen 3 Coder Next: 3B active params, 70% SWE-Bench
- OpenAI Codex standalone Mac app launched
- Kling 3.0: multi-shot video with native audio
Open Source LLMs: GLM OCR, Qwen Coder & More
Z.AI releases GLM OCR (0.9B params, SOTA on OmniDocBench), InternLM S1 Pro brings 1 trillion parameters for scientific reasoning, and StepFun releases Step 3.5 Flash with 11B active params claiming frontier reasoning at 300 tps.
- GLM OCR: 0.9B params, #1 on OmniDocBench
- InternLM S1 Pro: 1T params, outperforming frontier models on science benchmarks
- Step 3.5 Flash: 11B active, 300 tps
Qwen 3 Coder Next Deep Dive
Alibaba's Qwen 3 Coder Next is an 80B MoE with only 3B active parameters hitting 70.6% SWE-Bench Verified and 44% SWE-Bench Pro. Trained on 7.5T tokens with 20,000 parallel RL environments. Runs under 48GB RAM with quantization; a local-run sketch follows the bullets below.
- 70.6% SWE-Bench Verified with 3B active params
- 44% SWE-Bench Pro โ hardest coding benchmark
- Runs under 48GB RAM with GGUF quantization
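A minimal sketch of what running a GGUF quant locally could look like with llama-cpp-python. The model file name, quantization level, and context size below are placeholders, not official artifacts; check the actual Hugging Face repo for the real file names.

```python
# Minimal sketch: running a (hypothetical) GGUF quant of Qwen3-Coder-Next locally
# with llama-cpp-python. The model_path below is a placeholder filename.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-coder-next-80b-a3b-q4_k_m.gguf",  # hypothetical quantized file
    n_ctx=32768,        # context window; lower it to fit in less RAM
    n_gpu_layers=-1,    # offload all layers to GPU/Metal if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```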
Voice & Audio: Mistral Voxtral & Full Duplex Models
Mistral releases Voxtral Transcribe 2, SOTA speech-to-text that dethrones Whisper after 3 years. OpenBMB releases MiniCPM-o 4.5, the first full-duplex open-source omni model that can listen while speaking and even interrupt you.
- Voxtral: SOTA transcription, Apache 2 license, dethrones Whisper
- MiniCPM-o 4.5: first fully open-source full-duplex omni model
- Native diarization support in Voxtral
ACE Step 1.5 - Open Source Music Generation
ACE Step 1.5 is Suno-at-home: an MIT-licensed AI music generator that runs on a MacBook, generating full songs in seconds. The panel demos it live via Pinokio, generating a ThursdAI song on the spot.
- MIT license, runs on consumer hardware
- Full song generation in seconds
- Available on Pinokio for one-click install
BREAKING: Claude Opus 4.6 Release
Anthropic drops Opus 4.6 during the live show. The panel scrambles to access it: state-of-the-art on multiple benchmarks, 1M token context, agent teams in Claude Code, and adaptive thinking where the model picks up contextual clues about reasoning effort.
- SOTA on GDPval and BrowseComp, 65% on Terminal Bench
- 1 million token context window โ first for Opus
- Adaptive thinking and effort controls for developers
Opus 4.6 Benchmarks & Features
Deep dive into Opus 4.6 benchmarks: SOTA on GDPval and agentic search, 65% Terminal Bench, 99% TAU-bench tool use. Pricing is the same as 4.5 under 200K tokens and double above. Claude Code gets agent teams for orchestrating parallel sessions. A minimal API-call sketch follows the bullets below.
- 99% TAU Bench MCP tool use
- 72% computer use (up from 66%)
- Same pricing as Opus 4.5, 1M context at premium tier
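A minimal sketch of calling the new Opus model through the existing Anthropic Messages API. The model id "claude-opus-4-6" and the thinking budget are assumptions, and the new adaptive-thinking/effort controls may use a different parameter than the extended-thinking field shown here; only the general API shape is established.

```python
# Minimal sketch using the Anthropic Python SDK; model id and thinking budget are assumed.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # assumed model id; confirm against the official docs
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # existing extended-thinking field
    messages=[{"role": "user", "content": "Plan a refactor of a legacy FFmpeg editing script."}],
)

# Print only the text blocks (a thinking block may precede them).
for block in response.content:
    if block.type == "text":
        print(block.text)
```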
Agent Orchestration & Claude Code Teams
Discussion of agent orchestration becoming the key challenge. Claude Code introduces agent teams where you can interact with individual teammates directly. Ryan notes everyone needs a standard for cross-lab agent orchestration; a generic sketch of the fan-out pattern follows the bullets below.
- Claude Code agent teams: fully independent context windows
- No one wants lock-in to a single agent framework
- Orchestrating multiple agents across labs still brittle
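A generic illustration of the orchestration pattern under discussion: fan out independent agent "teammates", each with its own context, then gather their results. This is not Claude Code's agent-teams API, just a minimal asyncio sketch of the shape such a standard would need to cover.

```python
# Generic fan-out/gather sketch for parallel agent teammates; not a real agent framework.
import asyncio

async def run_agent(name: str, task: str) -> str:
    # Placeholder for a real model/agent call; each teammate keeps its own context.
    await asyncio.sleep(0.1)
    return f"{name} finished: {task}"

async def orchestrate(tasks: dict[str, str]) -> list[str]:
    # Launch all teammates in parallel and wait for every one to report back.
    return await asyncio.gather(*(run_agent(n, t) for n, t in tasks.items()))

results = asyncio.run(orchestrate({
    "planner": "break the feature into steps",
    "coder": "implement the parser module",
    "reviewer": "check the diff for regressions",
}))
print("\n".join(results))
```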
Video AI: Grok Imagine & Kling 3.0
xAI's Grok Imagine takes #1 on Arena with native audio and lip sync at $0.42 per 10-second clip. Kling 3.0 from Kuaishou launches 15-second multi-shot with native audio and character consistency across scenes.
- Grok Imagine: #1 on video arena, $0.42/10s, native audio
- Kling 3.0: 15s multi-shot, character consistency, native sound
- Both models have native lip sync
BREAKING: GPT 5.3 Codex Release
One hour after Opus 4.6, OpenAI drops GPT 5.3 Codex, their first model instrumental in developing itself. 73% Terminal Bench (vs Opus 4.6's 65%), 25% faster inference, and more token-efficient.
- First model that helped develop itself
- 73% Terminal Bench, an 8-point lead over Opus 4.6's 65%
- 25% faster queries, more token-efficient
Interview: VB from OpenAI on Codex App
VB from OpenAI joins to discuss the new Codex standalone app: multi-agent parallel tasks via Git worktrees (sketched after the bullets below), automations for scheduled tasks, a skills marketplace with Cloudflare/Vercel/Figma/Notion, and inline code review with commenting.
- Work trees for parallel project branches
- Skills marketplace: Cloudflare, Vercel, Figma, Notion, Linear
- Free month of access for all users including free tier
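A minimal sketch of the Git worktree mechanism that parallel project branches rely on: each agent gets its own checkout of the same repository, sharing one object store. Paths and branch names here are illustrative; this is not the Codex app's own code, just the underlying Git command it builds on.

```python
# Minimal sketch: one worktree per agent so parallel tasks never stomp on each other.
import subprocess

def add_worktree(path: str, branch: str) -> None:
    # Creates a new working directory checked out to a fresh branch (-b),
    # sharing the same .git object store as the main checkout.
    subprocess.run(["git", "worktree", "add", "-b", branch, path], check=True)

for agent, branch in [("agent-a", "feat/login"), ("agent-b", "feat/search")]:
    add_worktree(f"../repo-{agent}", branch)
```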
Codex App Features & Demo
Deeper dive into Codex app: inline diff commenting, MCP server configuration, cloud environment hand-off, pragmatic vs friendly personalities, and doubled rate limits for all tiers for two months.
- Inline diff review with per-line commenting
- Cloud hand-off for running without laptop
- Doubled rate limits for all tiers for 2 months
Opus 4.6 vs GPT 5.3 Codex Comparison
The panel live-tests both models side-by-side building a Mars simulation. Codex produces more technically accurate results while Opus has better visuals. The conversation turns to agent psychosis: the inability to sleep because your agents might not be running at full capacity.
- Codex more accurate, Opus better visuals in live test
- Both models one-shot a Mars simulation app
- Agent anxiety becoming a real phenomenon
The Agentic Internet & OpenClaw
Discussion of the agentic internet explosion: Moltbook (Reddit for agents), agents discussing creating encrypted languages humans can't read, OpenClaw hitting 160K GitHub stars, and ClawHub's top Twitter skill being malware, a stark security warning.
- Moltbook: social network built for and by agents
- Agents discussed creating encrypted inter-agent language
- ClawHub's top skill was malware, a major security concern
Show Recap & Closing Thoughts
Alex recaps the most dramatic show ever: Opus 4.6 dropped, GPT 5.3 Codex answered an hour later, VB from OpenAI joined live, and over 5,500 people tuned in. Hot take: humans are still needed and software engineering is still hard.
- 5,500 live listeners
- Two frontier model drops in one hour
- Hot take: humans still essential for direction
Hosts and Guests
Alex Volkov - AI Evangelist at Weights & Biases (@altryne)
Co-hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
Vaibhav Srivastav (VB) - DX at OpenAI (@reach_vb)
Open Source LLMs
Z.ai GLM-OCR: 0.9B parameter model achieves #1 ranking on OmniDocBench V1.5 for document understanding (X, HF, Announcement)
Alibaba Qwen3-Coder-Next, an 80B MoE coding agent model with just 3B active params that scores 70%+ on SWE-Bench Verified (X, Blog, HF)
Intern-S1-Pro: a 1 trillion parameter open-source MoE achieving SOTA scientific reasoning across chemistry, biology, materials, and earth sciences (X, HF, Arxiv, Announcement)
StepFun Step 3.5 Flash: 196B sparse MoE model with only 11B active parameters, achieving frontier reasoning at 100-350 tok/s (X, HF)
Agentic AI segment
Big CO LLMs + APIs
OpenAI launches Codex App: A dedicated command center for managing multiple AI coding agents in parallel (X, Announcement)
OpenAI launches Frontier, an enterprise platform to build, deploy, and manage AI agents as ‘AI coworkers’ (X, Blog)
Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, 1M token context, and agent teams for parallel autonomous work (X, Blog)
OpenAI releases GPT-5.3-Codex with record-breaking coding benchmarks and mid-task steerability (X)
This week's Buzz - Weights & Biases update
Links to the gallery of our hackathon winners (Gallery)
Vision & Video
xAI launches Grok Imagine 1.0 with 10-second 720p video generation, native audio, and API that tops Artificial Analysis benchmarks (X, Announcement, Benchmark)
Kling 3.0 launches as all-in-one AI video creation engine with native multimodal generation, multi-shot sequences, and built-in audio (X, Announcement)
Voice & Audio
Mistral AI launches Voxtral Transcribe 2 with state-of-the-art speech-to-text, sub-200ms latency, and open weights under Apache 2.0 (X, Blog, Announcement, Demo)
ACE-Step 1.5: Open-source AI music generator creates full songs in under 10 seconds on consumer GPUs with MIT license (X, GitHub, HF, Blog, GitHub)
OpenBMB releases MiniCPM-o 4.5 - the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously (X, HF, Blog)
AI Art & Diffusion & 3D