Episode Summary
Gemini 3.1 Pro dropped live during the show - Google's biggest model yet with 44% on Humanity's Last Exam and 77% ARC-AGI. Anthropic launched Sonnet 4.6 with 79.6% SWE-Bench Verified, Alibaba shipped Qwen 3.5 with 397B parameters, and xAI unleashed Grok 4.20 with four 500B-parameter agents collaborating. Ryan Carson laid out the Code Factory blueprint for agentic engineering, and the panel unanimously declared one-shot coding officially dead. Plus OpenClaw's creator Peter Steinberger joined OpenAI in what might be the first single-founder billion-dollar acqui-hire.
In This Episode
- 📰 Introductions & Top AI News Picks
- 🧪 Brain-Computer Interface (Thought to Text)
- 🚀 Qwen 3.5 Release
- 💰 Are We Still in the AI Bubble?
- 📰 TL;DR - Weekly AI News Roundup
- 🔥 Gemini 3.1 Pro - Breaking News
- 📢 Gemini 3.1 Pro - Benchmarks & Long Context
- 🛠️ Gemini 3.1 Pro - Live Vibe Coding Test
- 📢 Codex 5.3 vs Gemini vs Opus Discussion
- 📢 Claude Sonnet 4.6 Release
- 📢 ByteDance Seed 2.0
- 📰 Anthropic Terms of Use Controversy
- 📢 ChatGPT Personality & OpenAI Model Deprecations
- 📢 Grok 4.20 Review
- ⚡ This Week's Buzz - Terminal Bench Benchmarking Deep Dive
- 🎤 Code Factory - Agentic Engineering with Ryan Carson
- 🛠️ One-Shot is a Myth - Front End vs Backend AI Coding
- 💰 Will Software Engineers Lose Their Jobs?
- 🎙 Google Lyria 3 - AI Music Generation
- 📖 Open Source Roundup - Qwen 3.5 & Cohere
- 🧪 Zuna - Open Source Brain-Computer Interface Model
- 📰 Wrap Up & Outro
📰 Introductions & Top AI News Picks
Alex opens with breaking news - Gemini 3.1 Pro just dropped from Google. The panel shares their top picks: LDJ picks Zuna (thought-to-text BCI), Nisten picks Qwen 3.5, Wolfram picks Gemini 3.1, and Yam drops the bombshell that OpenAI acqui-hired OpenClaw creator Peter Steinberger.
- Gemini 3.1 Pro drops live as the show starts
- OpenClaw founder Peter Steinberger joins OpenAI
- ThursdAI approaching 3 years of weekly broadcasts
🧪 Brain-Computer Interface (Thought to Text)
LDJ highlights Zif's release of Zuna, a sub-billion-parameter model that translates EEG brain signals into text - what people are calling 'thought to text'. A glimpse of non-invasive brain-computer interfaces becoming accessible.
- Zuna: 380M-parameter BCI foundation model
- Translates EEG brain signals to text
- Open source and Apache licensed
🚀 Qwen 3.5 Release
Nisten picks Alibaba's Qwen 3.5 as his top news - almost 400B parameters with only 17B active. Qwen models have historically excelled at multilingual and medical performance, and this new release runs faster with fewer active parameters.
- 397B total parameters, 17B active (down from 22B in the previous version)
- Qwen excels at multilingual and medical tasks
- Runs faster for data generation workloads
💰 Are We Still in the AI Bubble?
Alex shares his experience at a Claude Code meetup where even attendees weren't running agents. Ryan reports meeting normies whose reaction to AI progress is mostly fear and dread. The panel discusses the widening gap between the AI-native bubble and everyone else.
- Even Claude Code meetup attendees barely running agents
- Ryan closing his seed round, planning to hire one 10x engineer instead of a team
- Eric S. Raymond (open source pioneer) embraces AI as 'wizard mode'
📰 TL;DR - Weekly AI News Roundup
Alex runs through the week's releases: Qwen 3.5 from Alibaba, OpenClaw joining OpenAI, Anthropic's terms controversy, ByteDance's Seed 2.0, Gemini 3.1 Pro dropping live, Grok 4.20, Google's Lyria 3 music model, and Cohere's multilingual Aya model.
- OpenClaw founder joins OpenAI - possibly the first single-founder billion-dollar deal
- Gemini 3.1 Pro: 44% HLE, 77% ARC-AGI, same price point
- Grok 4.20: 500B params × 4 agents, no evals released
🔥 Gemini 3.1 Pro - Breaking News
The panel dives into Gemini 3.1 Pro which dropped minutes before the show. Same price point with significantly better performance. Ryan insists he only cares about SWE-Bench scores, while Wolfram argues Terminal Bench is more relevant for agent use cases.
- Same price as previous Gemini, significantly better performance
- 77% ARC-AGI, 44% Humanity's Last Exam, 68% Terminal Bench
- State of the art alongside Opus 4.6 on SWE-Bench
📢 Gemini 3.1 Pro - Benchmarks & Long Context
LDJ reveals a massive discrepancy in long-context benchmarks - Opus 4.6 scores 76% on MRCR at 1M context vs Gemini 3.1's 26%. The panel debates whether Google is under-reporting competitor scores and highlights the difficulty of comparing benchmarks across different methodologies.
- Opus 4.6: 76% MRCR at 1M context vs Gemini 3.1 Pro: 26%
- Google's eval table may be under-reporting Anthropic scores
- Different measurement methodologies make direct comparison difficult
🛠️ Gemini 3.1 Pro - Live Vibe Coding Test
Nisten runs a live vibe coding test in Google AI Studio - the same Martian mass driver simulation they tested with previous models. Gemini 3.1 Pro is blazingly fast but the output doesn't match what Opus 4.6 and Codex achieved (API sketch below).
- Extremely fast generation - completed in about 20 seconds
- Created a functional simulation but less polished than Opus/Codex
- Fast but not passing the initial vibe check for agentic coding
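If you want to poke at the same prompt outside AI Studio, here's a minimal sketch using the google-genai Python SDK. The model id "gemini-3.1-pro" is our guess, not a confirmed name, so check the current model list before running.

```python
# Minimal sketch: re-run the vibe-coding prompt via the google-genai SDK
# instead of AI Studio. The model id below is an assumption, not confirmed.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

prompt = (
    "Write a single-file HTML/JS simulation of a Martian mass driver: "
    "a projectile accelerated along a rail, with adjustable launch angle "
    "and exit velocity, rendered on a <canvas>."
)

response = client.models.generate_content(
    model="gemini-3.1-pro",  # assumed id; check the current model list
    contents=prompt,
)
print(response.text)  # paste the output into an .html file to eyeball it
```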
📢 Codex 5.3 vs Gemini vs Opus Discussion
The panel debates why models perform best in their own harnesses. Ryan argues this is why agent labs are struggling - the model maker always has the natural advantage. LDJ points out that Codex in its own harness scores 77% on Terminal Bench, the true highest score.
- Codex 5.3 gets 77% in the Codex harness - true state of the art
- Model labs have natural harness advantage over third-party agents
- Claude Code's success proves the model+harness synergy
📢 Claude Sonnet 4.6 Release
Anthropic releases Sonnet 4.6 - 79.6% on SWE-Bench Verified, 1M token context window, now the default model on Claude AI. LDJ notes it feels like a smaller Opus 4.6 that may have been trained for longer. In Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time.
- 79.6% SWE-Bench Verified - very close to state of the art
- 1M token context window in beta, $3/$15 per million tokens (API sketch below)
- Users preferred Sonnet 4.6 over Opus 4.5 59% of the time in blind testing
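For trying the 1M-token beta over the API, here's a rough sketch with the Anthropic Python SDK, modeled on how earlier Sonnet long-context betas were gated. Both the model id and the beta flag are assumptions; check Anthropic's docs for the real values.

```python
# Sketch: opting into the long-context beta with the Anthropic SDK.
# Model id and beta flag are assumptions based on earlier Sonnet releases.
import anthropic

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY

message = client.beta.messages.create(
    model="claude-sonnet-4-6",        # assumed id, not confirmed
    betas=["context-1m-2025-08-07"],  # flag from the Sonnet 4 era; may differ
    max_tokens=2048,
    messages=[{"role": "user", "content": "Summarize this repo dump: ..."}],
)
print(message.content[0].text)
```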
📢 ByteDance Seed 2.0
ByteDance steps up as a leading Chinese AI provider with Seed 2.0 - a frontier multimodal LLM with video understanding that surpasses the human benchmark (77% vs 73%). Priced 84% below Opus 4.5, it's a compelling option for price-conscious developers.
- 84% cheaper than Opus 4.5 with near-comparable quality
- Video understanding surpasses human benchmark: 77% vs 73%
- Pro, Lite, Mini, and Code variants available
📰 Anthropic Terms of Use Controversy
Anthropic updated their terms of use, causing panic that Max account OAuth couldn't be used with third-party agents like OpenClaw. They partially reverted, but the situation remains unclear. Meanwhile, Chinese labs and OpenAI explicitly welcomed agent usage with their subscriptions.
- Anthropic's terms briefly banned using Max accounts with agents
- OpenAI confirmed Pro subscription works everywhere including OpenClaw
- Chinese labs explicitly host OpenClaw instances on their platforms
📢 ChatGPT Personality & OpenAI Model Deprecations
A brief transition segment - Alex acknowledges the need to move on from big lab discussions to cover Grok, open source, and evals. The panel has been discussing for nearly an hour and still hasn't touched half the topics on the docket.
- Panel acknowledges the sheer volume of news to cover
- Transition to Grok and open source coverage
📢 Grok 4.20 Review
xAI releases Grok 4.20 - four 500B-parameter agents collaborating in a multi-agent UI. No benchmarks or evals released. The panel finds it underwhelming for coding and day-to-day work, but acknowledges its strength for deep research via X's data. A $300/month Heavy tier with 16 agents exists.
- 500B params × 4 agents (or ×16 for Heavy at $300/month)
- No benchmarks or evals released - silent drop
- Grok 4.1 Fast still #8 on OpenRouter for API usage
⚡ This Week's Buzz - Terminal Bench Benchmarking Deep Dive
Wolfram presents his Terminal Bench benchmarking work for W&B. He reveals that benchmarks are far more nuanced than single scores - runtime limits, harness settings, thinking mode, and resource allocation all dramatically change results. He also shares how Weave tracing caught an inference bug that was causing GLM-5 to score only 5%.
- Terminal Bench tasks include building Linux kernels and cracking passwords - not just coding
- Qwen 3.5 scores 52.5% - third place among open source models
- Kimi K2.5 achieves 67.4% ceiling score across multiple runs
- Weave tracing caught a critical inference bug affecting GLM-5 scores (tracing sketch below)
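To make the tracing point concrete, here's a minimal sketch of the instrumentation pattern: wrap the harness's model call in a weave.op so every prompt/completion pair gets logged. The function names are illustrative, not Wolfram's actual harness code.

```python
# Sketch of Weave instrumentation for an eval harness: every call through a
# weave.op is traced, so a 5% score shows up as a wall of malformed
# completions in the trace UI instead of a mystery number.
import weave

weave.init("terminal-bench-runs")  # arbitrary project name

@weave.op()
def call_model(prompt: str) -> str:
    # A real harness would hit the inference endpoint here; if the endpoint
    # truncates or mangles output (the GLM-5 bug class), the trace shows it.
    return "ls -la\n"

@weave.op()
def run_task(task_prompt: str) -> str:
    return call_model(task_prompt)

run_task("List the files in the current directory.")
```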
🎤 Code Factory - Agentic Engineering with Ryan Carson
Ryan walks through his viral Code Factory article - a system for fully automated code generation, review, and deployment. Inspired by OpenAI's Harness Engineering article, the setup uses GitHub Actions, Greptile for code review, CI gates, and a self-healing loop where agents fix their own PR issues until all checks pass (loop sketched below).
- Code Factory: agents write, review, and ship code in a loop
- Risk classification system flags high-risk file changes for extra review
- Self-healing loop: Codex fixes PR issues until all CI checks pass
- Takes a week+ of setup but unlocks massive throughput
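The self-healing piece is easy to sketch: poll the PR's CI checks and keep handing failures back to an agent until everything is green. A minimal sketch, assuming a recent GitHub CLI with JSON output on `gh pr checks`; `run_agent` is a hypothetical stand-in for however you invoke Codex (Ryan's real setup lives in GitHub Actions, not a local script).

```python
# Sketch of the self-healing loop: poll a PR's CI checks via the GitHub CLI
# and hand failures back to an agent until all checks pass.
# `run_agent` is hypothetical; exact --json fields depend on your gh version.
import json
import subprocess
import time

def failing_checks(pr: int) -> list[str]:
    # gh pr checks exits non-zero when checks fail, so don't raise on that
    out = subprocess.run(
        ["gh", "pr", "checks", str(pr), "--json", "name,state"],
        capture_output=True, text=True,
    )
    return [c["name"] for c in json.loads(out.stdout) if c["state"] == "FAILURE"]

def heal(pr: int, max_rounds: int = 5) -> bool:
    for _ in range(max_rounds):
        failing = failing_checks(pr)
        if not failing:
            return True  # all green: ready for review/merge
        # Hypothetical agent invocation: fix the failing checks and push.
        subprocess.run(["run_agent", "--pr", str(pr), "--fix", ",".join(failing)])
        time.sleep(120)  # give CI time to re-run
    return False  # escalate to a human after max_rounds
```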
🛠️ One-Shot is a Myth - Front End vs Backend AI Coding
Alex demos the new ThursdAI website built entirely with agents, but emphasizes it took days of iteration - not one shot. The panel agrees: one-shot coding is a myth, especially for front end. Ryan recommends design systems and Instill for UI feedback loops, but notes frontend still requires human-in-the-loop driving.
- New ThursdAI website built with OpenClaw - agents extracted 160+ guests from 152 episodes
- Running agents overnight produced near-complete website rewrites daily
- Backend loops work; frontend still requires human steering
- Design systems dramatically improve agent UI output consistency
💰 Will Software Engineers Lose Their Jobs?
Yam reveals he's fired a crazy number of agents this week - models are inherently random and can accidentally wreck your entire machine. Ryan emphasizes document drift as a critical Code Factory concern. Nisten argues frontend developers are still essential to take projects to completion.
- Models are inherently random - destructive mistakes are a matter of 'when' not 'if'
- Document drift is a major Code Factory challenge
- Frontend developers needed to take things to production quality
🎙 Google Lyria 3 - AI Music Generation
Google DeepMind launches Lyria 3, their most advanced AI music generation model, available in the Gemini app. It generates 32-second high-fidelity tracks with creative controls, and can compose music from uploaded images. A prompt guide is available for vocals, lyrics, and different styles.
- 32-second high-fidelity music tracks
- Image-to-music: upload an image and generate matching music
- Prompt guide released for vocals, lyrics, and styles
📖 Open Source Roundup - Qwen 3.5 & Cohere
Deeper dive into Qwen 3.5 - Nisten reports benchmarks look good but coding is behind GLM-5. The model uses a different architecture from DeepSeek's, with 512 experts and 262K native context extendable to 1M (parameter math below). Cohere releases Aya 3.3B, a tiny multilingual model supporting 70+ languages.
- Qwen 3.5: 512 experts, 11 active, 262K native context (extendable to 1M)
- GLM-5 still ahead on coding; Qwen excels at multilingual
- Cohere Aya: 3.3B params, 70+ languages
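Quick parameter math on why the MoE setup runs fast: per-token compute tracks the active slice, not the 397B total. Figures are from the episode; the rest is simplification.

```python
# Back-of-envelope: MoE decoding cost scales with active parameters.
total_params   = 397e9
active_params  = 17e9
experts_total  = 512
experts_active = 11

print(f"active param fraction: {active_params / total_params:.1%}")    # ~4.3%
print(f"active expert fraction: {experts_active / experts_total:.1%}") # ~2.1%
# Per-token FLOPs land near a 17B dense model, which is why data-generation
# workloads get faster despite the 397B footprint.
```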
🧪 Zuna - Open Source Brain-Computer Interface Model
The panel revisits Zuna, the 380M-parameter open-source BCI model. Nisten notes it could work with $500 non-invasive EEG headsets, would likely need personalized training per user, and is small enough to run in real time on a gaming GPU (illustrative sketch below). He's considering buying a headset to experiment.
- 380M params - small enough for real-time on consumer GPUs
- Compatible with ~$500 non-invasive EEG headsets
- Needs personalized training per user but fully open source
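Purely illustrative: the real-time shape of the problem is sliding a short window over a multichannel EEG stream and decoding each window. Every name and number below is a hypothetical placeholder; Zuna's actual interface wasn't covered on the show.

```python
# Illustrative only: windowed, real-time EEG-to-text decoding at this scale.
# Channel count, rates, and ZunaDecoder are hypothetical placeholders.
import numpy as np

SAMPLE_RATE = 256  # Hz; typical for consumer EEG headsets
WINDOW_SEC = 2.0   # decode ~2 s of signal at a time

def windows(stream: np.ndarray, step_sec: float = 0.5):
    """Slide a window over a (channels, samples) EEG array."""
    win = int(SAMPLE_RATE * WINDOW_SEC)
    step = int(SAMPLE_RATE * step_sec)
    for start in range(0, stream.shape[1] - win + 1, step):
        yield stream[:, start:start + win]

# decoder = ZunaDecoder.load(...)   # hypothetical API; 380M params fits
# for w in windows(live_eeg):       # comfortably on one gaming GPU
#     print(decoder.decode(w), end=" ", flush=True)
```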
📰 Wrap Up & Outro
Alex recaps the highlights - Sonnet 4.6 and Gemini 3.1 Pro tested live, Code Factory discussion, and the one-shot myth debunked. He promotes the new ThursdAI website and reminds listeners the show is available as a newsletter and podcast everywhere. Over 1,500 listeners tuned in.
- 1,500+ live listeners
- New ThursdAI website launched at thursdai.news
- Approaching 3 years of weekly broadcasts
Hosts and Guests
Alex Volkov - AI Evangelist at Weights & Biases (@altryne)
Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
🔥 New website: thursdai.news with all our past guests and episodes
Open Source LLMs
Big CO LLMs + APIs
OpenClaw founder joins OpenAI
Google releases Gemini 3.1 Pro with 2.5x better abstract reasoning and improved coding/agentic capabilities (X, Blog, Announcement)
Anthropic launches Claude Sonnet 4.6, its most capable Sonnet model ever, with 1M token context and near-Opus intelligence at Sonnet pricing (X, Blog, Announcement)
ByteDance releases Seed 2.0 - a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing (X, blog, HF)
Anthropic changes the rules on Max use; OpenAI confirms it's 100% fine.
Grok 4.20 - finally released, a mix of 4 agents
This Week's Buzz
Wolfram deep dives into Terminal Bench
We’ve launched Kimi K2.5 on our inference service (Link)
Vision & Video
Voice & Audio
Google DeepMind launches Lyria 3, its most advanced AI music generation model, now available in the Gemini App (X, Announcement)
Tools & Agentic Coding
Ryan goes viral once again with Code Factory! (X)
Ryan uses Agentation.dev for front-end development, closing the loop on components
Dreamer launches beta: A full-stack platform for building and discovering agentic apps with no-code AI (X, Announcement)