Episode Summary

Alex goes live eight minutes early to cover breaking news: Claude Opus 4.7 drops mid-intro — 87.6% on SWE-bench Verified, 64.3% on SWE-bench Pro, and a brand-new 'extra high' reasoning effort — yet the release ends up feeling more like a Mythos teaser than a finished flagship. Qwen 3.6-35B-A3B (Apache 2.0, 73.4% SWE-bench Verified with only 3B active params) and MiniMax M2.7 open weights keep the open-source train screaming, while OpenAI drops another breaking-news bomb during the show with a massive Codex update: native macOS background computer use, 90+ plugins, memory, gpt-image-1.5, and multi-terminal SSH. Three incredible interviews: Trevor Manz on Marimo Pair, which drops coding agents into reactive Python notebooks; Kwindla on Gradient Bang, the multi-agent voice game that 'broke containment'; and Theodor Marcu on Windsurf 2.0 plus Devin's Agent Command Center. Plus Alex debuts the 'ZL Continuum' essay from AI Engineer Europe: do engineers still read code?

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Trevor Manz
Founding engineer · Marimo
@trevmanz
Theodor Marcu
Product · Cognition (Windsurf)
@theodormarcu
Kwindla Hultman Kramer
Co-CEO Daily · Pipecat maintainer
@kwindla
Yam Peleg
AI builder & founder
@yampeleg
Nisten Tahiraj
Weekly co-host · AI engineer
@nisten
LDJ
Weekly co-host · AI researcher
@ldjconfirmed
Wolfram Ravenwolf
Weekly co-host · AI Evangelist at W&B/CoreWeave
@WolframRvnwlf

By The Numbers

SWE-bench Verified
87.6%
Claude Opus 4.7 — 64.3% on SWE-bench Pro, an 11-point jump over 4.6 on the harder agentic coding eval
ScreenSpot Pro jump
+22 pts
Opus 4.7 computer-use — 57.7% → 79.5% vs Opus 4.6, pulling even with Mythos on some slices
SWE-bench Verified
73.4%
Qwen 3.6-35B-A3B — Apache 2.0, 35B MoE with just 3B active, rivals models 10x its size
active parameters
10B
MiniMax M2.7 — 230B MoE matches GPT-5.3-Codex on SWE-Pro at 56.22%, self-evolved via 100+ rounds of autonomous RL
TTS Arena Elo
1,211
Google Gemini 3.1 Flash TTS — 70+ languages, inline audio tags, ~$0.03 per 60s (≈5× cheaper than ElevenLabs)
YoY PR growth
15×
GitHub on track for 15 billion PRs in 2026 (vs 1 billion in 2025) — Vercel says 60% of their traffic is now agents
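Several of the stat cards above are simple arithmetic on numbers quoted elsewhere in the episode; a quick sanity check in Python (all figures taken directly from the cards, nothing new):

```python
# Sanity-check the headline numbers from the stat cards above.

# GitHub PRs: 1 billion (2025) -> projected 15 billion (2026).
pr_growth = 15e9 / 1e9            # 15x YoY

# Opus 4.7 on ScreenSpot Pro: 57.7% -> 79.5% vs Opus 4.6.
screenspot_jump = 79.5 - 57.7     # ~21.8 percentage points, i.e. the +22 figure

# Gemini 3.1 Flash TTS: ~$0.03 per 60s of audio, quoted as ~5x cheaper
# than ElevenLabs, which implies roughly $0.15 per 60s there.
implied_elevenlabs = 0.03 * 5
```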

🔥 Breaking During The Show

Claude Opus 4.7 drops 8 minutes before show start
Anthropic ships Opus 4.7 right as ThursdAI is about to go live. 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, new 'extra high' (xhigh) reasoning effort, 3× vision resolution, /ultrareview in Claude Code, and a new tokenizer that uses 1.0–1.35× as many tokens. Feels like a Mythos teaser (331 mentions in the system card).
OpenAI Codex gets a massive update mid-show
Native macOS background computer use (separate cursor!), 90+ plugins, gpt-image-1.5 for images, in-app browser, memory, self-scheduling automations, multi-terminal SSH. Alex calls it: Codex is becoming the super-app, not ChatGPT.
Qwen 3.6-35B-A3B open-sourced the same morning
Alibaba Qwen ships an Apache 2.0 35B MoE with 3B active, 73.4% SWE-bench Verified, 262K→1M context, natively multimodal. Confirms Qwen's open-source commitment after Junyang Lin's departure.
MiniMax M2.7 open weights released
230B MoE / 10B active, matches GPT-5.3-Codex on SWE-Pro at 56.22%, self-evolved via 100+ rounds of autonomous RL.

🔥 Pre-Show Banter & Opus 4.7 Breaking News

Alex and Yam go live 8 minutes before the official show start because Anthropic just dropped Claude Opus 4.7. No early access, no advance briefing — the crew opens the system card live alongside the audience.

  • AI breaking news from the jump: Opus 4.7 drops with no prior access
  • Anthropic ships the system card publicly, crew reads it in real time
  • 4 cohosts on set: Yam, Wolfram, LDJ (and Nisten joining shortly)
Alex Volkov
"Anthropic made us go live, eight minutes before the official show start, because we got some breaking news."

🧪 Opus 4.7 Evals & Benchmarks

The crew walks through the Anthropic evals table live. Biggest jumps: SWE-bench Pro (+11 points), ScreenSpot Pro (+22 points on computer use), plus the new xhigh effort level. MRCR long-context drops from 78% → 32%, suggesting a new pre-trained base. The system card mentions 'Mythos' 331 times — this feels like an ad for the godlike version they haven't released yet.

  • 331 mentions of 'Mythos' in the system card — 4.7 feels like the appetizer
  • SWE-bench Pro jumps 11 points, passes GPT-5.4 on agentic coding
  • MRCR 8-needle V2: 78% → 32% — LDJ thinks it's a new pre-trained base
  • New 'xhigh' (extra high) reasoning effort gets best-in-class on HLE
  • New tokenizer uses 1.0–1.35× as many tokens per prompt — likely for multimodal reasons, not a cash grab
Alex Volkov
"If you look for Mythos in the system card, there's three hundred and thirty-one mentions for Mythos. It does feel like a Mythos ad."
LDJ
"This kind of helps maybe confirm it is maybe a new pre-trained model from scratch — in MRCR it's performing much worse, but I expect in other areas it might be much better in dramatic ways."
Yam Peleg
"We're all cooked already at this point, everyone looking at the evals, who has the biggest number. If you do a good job and train a really good model, people like us are gonna look at the evals and say 'nah, this model is not good.' But probably the model is really good."

📰 TL;DR — Weekly AI News Roundup

The longest TL;DR in show history. CoreWeave signs Anthropic (multibillion), Meta ($21B expansion), and Jane Street ($6B cloud + $1B equity) — now serving 9 of the top 10 AI labs. Qwen 3.6-35B-A3B, MiniMax M2.7 open weights, Windsurf 2.0 + Devin, Warp's any-CLI-agent support, Claude Code Routines (cron-triggered agents on Anthropic's cloud), Marimo Pair, Gemma 4 live on W&B Inference, Gemini 3.1 Flash TTS, Baidu ERNIE-Image, Tencent HYWorld 2.0, NVIDIA Lyra 2.0, a Unitree humanoid breaking a 100m dash record, and Allbirds → NewBird AI.

  • CoreWeave now backs 9 of the top 10 AI labs (Anthropic, Meta $35B+, Jane Street $7B)
  • Qwen 3.6-35B-A3B: Apache 2.0, 35B MoE / 3B active, 262K→1M context, natively multimodal
  • MiniMax M2.7 open weights: 230B / 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex
  • Gemma 4 now live on W&B Inference (CoreWeave) with LoRA inference support — code 'Gem Drop' for $20 credits
  • Claude Code Routines: cron/GitHub-event/API triggered autonomous agents on Anthropic's cloud
  • Super Gemma 4 26B Uncensored v2 by @songjunkr trending on HF — 0/100 refusals, fixed tool calls
Nisten Tahiraj
"People are going ham over this. It's like we're back to two years ago."
Yam Peleg
"Open source is so back."

🧪 Opus 4.7 Live Testing — Martian Simulation

Nisten puts Opus 4.7 through the infamous Martian simulation benchmark the crew has been running for months. Early impression: incremental over 4.6 but solid — 'not as much of a jump from 4.6 to 4.7 as it was from 4.5 to 4.6.' Vision / computer use, on the other hand, feels genuinely better.

  • Nisten: 'It feels incremental this time' — not as big a jump as 4.6 over 4.5
  • Vision + ScreenSpot Pro improvements are the real story
  • New /ultrareview slash command in Claude Code
Nisten Tahiraj
"It feels incremental this time. Like, it doesn't — 4.6 was smarter. 4.7 is not as much of a jump from 4.6 as it was from 4.5 to 4.6."

🔓 Qwen 3.6 Open Source Release

Alibaba Qwen drops Qwen 3.6-35B-A3B under Apache 2.0 the same morning as Opus 4.7 — a 35B MoE with only 3B active parameters, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M. After Junyang Lin left the team there were doubts about Qwen's open-source commitment; this release puts those to rest.

  • Apache 2.0 — 35B MoE / 3B active, 73.4% SWE-bench Verified
  • Natively multimodal, 262K context extensible to 1M
  • Strongest mid-size LLM on nearly all benchmarks
  • Confirms Alibaba Qwen's continued open-source commitment post-Junyang
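Why "3B active" matters: in a mixture-of-experts model only the routed experts run on each token, so inference compute tracks active parameters, not total. A back-of-the-envelope comparison using the common ~2 FLOPs per active parameter per token rule of thumb (the 2× factor is the standard approximation, not a Qwen-published number):

```python
def fwd_flops_per_token(active_params):
    # Rough rule of thumb: a transformer forward pass costs about
    # 2 FLOPs per *active* parameter per generated token.
    return 2 * active_params

dense_35b = fwd_flops_per_token(35e9)  # hypothetical dense 35B model
qwen_a3b  = fwd_flops_per_token(3e9)   # 35B-A3B MoE, 3B active
speedup   = dense_35b / qwen_a3b       # ~11.7x less compute per token
```

This is why a 35B-A3B model can be served roughly an order of magnitude cheaper per token than a dense model of the same total size, while still drawing on the full 35B parameters for capacity.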

🔓 MiniMax M2.7 Open Weights

MiniMax releases M2.7 open weights: 230B parameter MoE with only 10B active, 56.22% on SWE-Pro (matching GPT-5.3-Codex). Self-evolved via 100+ rounds of autonomous RL. Typical recent Chinese release cadence: ship the API first, then open-source the weights.

  • 230B MoE / 10B active — matches GPT-5.3-Codex on SWE-Pro (56.22%)
  • Self-evolved via 100+ rounds of autonomous RL
  • Open weights released on HF after API launch

🧠 The ZL Continuum — Do Engineers Still Read Code?

Alex's essay from AI Engineer Europe: where are you on the Z–L spectrum? Ryan Lopopolo (OpenAI, token billionaire) says code is a liability, don't read it. Mario Zechner (creator of pyo, the harness powering OpenClaw) says slow the fuck down, read every line of critical code. Everyone else is somewhere in between. The crew + later guests weigh in: Nisten=Z, Yam=Z (with a brutal 'hidden-features accumulation' warning), Wolfram=L (mostly), LDJ=moving-to-L, Trevor=L-leaning, Kwindla=full-L ('my rule was not to read or write any code for the side project').

  • Ryan Lopopolo: 'code is a liability' · Mario Zechner: 'read every line'
  • Yam's warning: agents silently add hidden features (context truncation, etc.), accumulate, eventually no agent OR human can review the code
  • Consensus: it's not per-person, it's per-task — critical code = read, throwaway = YOLO
  • Poll opens live during the show
Yam Peleg
"You must read the code, man. Those things accumulate — the agent thinks 'oh, that's what the human wanted' so it maintains hidden features. You get to a point where the agents just can't do anything."
Wolfram Ravenwolf
"I trust the AI, I tell it what to do, I test it, and my most used command is review. I probably spend more time reviewing than actually coding."

⚡ AI Engineer Summit — Top 10 Themes

Alex's synthesized top-10 from AI Engineer Europe: (1) FMAT — Fear of Missing Agent Time, (2) the ZL Continuum, (3) everything is changing super fast (GitHub on track for 15B PRs, Vercel 60% agent traffic), (4) we're still early, (5) AGI is here and unevenly distributed, (6) 'just talk to your Clanker', (7) MCP is dead long live MCP (enterprises adopting faster than ever), (8) AI was supposed to make us work less — we work more, (9) MHC = Model / Harness / Context is the new ASL.

  • FMAT: Fear Of Missing Agent Time — universal at the conference
  • GitHub: 1B → 15B PRs YoY projected, 15× growth
  • Vercel: 60% of traffic attributed to agents
  • MHC framework — Model/Harness/Context is the new ASL for AI engineers
  • MCP isn't dying — enterprises are adopting faster, especially with code-mode
Alex Volkov
"Model Harness Context is the new ASL. When somebody tells me 'OpenClaw is stupid', I have no idea how to react until they tell me if they use OpenClaw with Opus 4.7 or MiniMax 2.7."

🎥 Pi Hard — Craziest AI Video Yet

Alex plays the Pi Hard / Neil deGrasse Tyson / SBF AI trailer live. Even Alex's fiancée (who works with AI video daily) didn't clock it as AI until midway through. Seedance 2.0 is now everywhere; the crew agrees this is the craziest AI video they've ever seen.

  • Multi-shot AI video production, Neil deGrasse Tyson deepfake
  • Seedance 2.0 fully rolled out with video support everywhere
  • Yam: 'That's the craziest AI video I've seen'
Yam Peleg
"That's the craziest AI video I've seen. That's... there's no competition."

🛠️ Interview: Trevor Manz — Marimo Pair

Trevor Manz, founding engineer at Marimo, on why reactive Python notebooks are suddenly very important for AI workflows — and on Marimo Pair, which drops Claude Code / Codex / OpenCode agents directly inside a reactive notebook. Trended on Hacker News this week. On the ZL continuum, Trevor is moving L-ward but focuses more on building verification systems around agents than on reading less code.

  • Marimo = reactive Python notebooks (dependency-graph aware vs Jupyter)
  • Marimo Pair: drop Claude Code / Codex / OpenCode agents in the notebook
  • Trended on HN this week
  • Trevor's take: shift burden of review onto better verification systems
Trevor Manz
"My job has shifted a lot to trying to build systems that I can have the AI tools verify their results and correctness of the programs. So trying to shift some of the burden of review onto just having better systems."
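The "dependency-graph aware" point is the heart of it: marimo parses each cell for the variables it defines and reads, builds a dataflow graph, and re-runs only downstream cells when something changes. A toy version of that scheduling logic in plain Python (concept sketch only, not marimo's actual API; cell defines/reads are declared by hand here instead of extracted from the cell's code):

```python
from graphlib import TopologicalSorter

# Toy "notebook": each cell declares what it defines, what it reads,
# and how to run. Marimo derives defines/reads by parsing cell source.
cells = {
    "load":  {"defines": {"df"},  "reads": set(),   "run": lambda ns: ns.update(df=[1, 2, 3])},
    "clean": {"defines": {"df2"}, "reads": {"df"},  "run": lambda ns: ns.update(df2=[x * 2 for x in ns["df"]])},
    "stat":  {"defines": {"fig"}, "reads": {"df2"}, "run": lambda ns: ns.update(fig=sum(ns["df2"]))},
}

def run_from(changed, ns):
    """Run `changed`, then every cell transitively downstream of it."""
    # Cell A depends on cell B if A reads a variable B defines.
    deps = {n: {m for m, other in cells.items() if cells[n]["reads"] & other["defines"]}
            for n in cells}
    dirty = set(cells[changed]["defines"])
    for name in TopologicalSorter(deps).static_order():
        if name == changed or cells[name]["reads"] & dirty:
            cells[name]["run"](ns)
            dirty |= cells[name]["defines"]
    return ns
```

Editing `load` re-runs `clean` and `stat`; editing `clean` leaves `load` untouched. That selective re-execution is the reactivity Jupyter lacks, and it is what makes it safe to let a coding agent mutate cells.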

⚡ This Week's Buzz — Weights & Biases / CoreWeave

Marimo Pair (interviewed above) is the W&B/CoreWeave Buzz this week, plus Gemma 4 now live on W&B Inference with LoRA inference support. Reply to the W&B announcement post with code 'Gem Drop' for $20 in inference credits.

  • Marimo Pair — the CoreWeave-family agent notebook integration
  • Gemma 4 live on wandb.ai/inference with LoRA inference support
  • Code 'Gem Drop' on X for $20 in free W&B inference credits

🔊 Interview: Kwindla Kramer — Gradient Bang & Google TTS

Kwindla Hultman Kramer (co-CEO of Daily, Pipecat maintainer) on Google's Gemini 3.1 Flash TTS (1,211 Elo, 70+ languages, fully promptable — but ~3s time to first token, so batch-only for now). Then the main event: Gradient Bang, his 'side project that broke containment' — a fully LLM-driven multiplayer voice-based space game inspired by Trade Wars. Built on a new Pipecat Sub-Agents library, uses Deepgram + GPT-4.1 voice agent + GPT-5.2 medium-thinking task agents + LLM-generated dynamic UIs. Kwindla's rule for his own side project: 'don't read or write any code.'

  • Gemini 3.1 Flash TTS is fully promptable like an LLM (not fixed tags) — but has ~3s TTFT
  • Gradient Bang: multi-agent voice space game inspired by BBS-era Trade Wars
  • Pipecat Sub-Agents: new class-based event bus, works locally + over network
  • Voice agent always runs (<1.5s response), task agents on GPT-5.2 medium thinking
  • LLM-generated dynamic UI paradigm — React frontend rendered via JSON from LLM
  • Open-sourced GB Benchmarks for evaluating agent task execution
Kwindla Hultman Kramer
"It's a side project that broke containment. I hacked together this game, we started playing it, and it became clear really quickly that a lot of things in voice AI — all these problems we were trying to solve actually are very general: how do you build AI-native software."
Kwindla Hultman Kramer
"Part of my goal for this, since it was a side project, was not to write or read any code. I've been doing that since November — and it's been painful in different ways, but also a great learning process."
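The Sub-Agents pattern Kwindla describes is a fast, always-on voice agent that answers instantly while slower task agents do the heavy thinking and publish results back over an event bus. A minimal sketch of that pattern (all class, method, and topic names here are invented for illustration; this is not the actual Pipecat Sub-Agents API):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal in-process pub/sub. A networked variant would serialize
    events over a socket; subscribers would not need to change."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    async def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            await handler(payload)

class TaskAgent:
    """Slow background worker, standing in for a 'medium thinking' model call."""
    def __init__(self, bus):
        self.bus = bus
        bus.subscribe("task.request", self.handle)

    async def handle(self, destination):
        await asyncio.sleep(0)  # placeholder for a multi-second model call
        await self.bus.publish("task.result", f"course plotted to {destination}")

class VoiceAgent:
    """Fast agent: acknowledges instantly, speaks results when they arrive."""
    def __init__(self, bus):
        self.bus = bus
        self.transcript = []
        bus.subscribe("task.result", self.speak)

    async def ask(self, destination):
        self.transcript.append(f"on it: {destination}")  # instant reply path
        await self.bus.publish("task.request", destination)

    async def speak(self, result):
        self.transcript.append(result)

async def demo():
    bus = EventBus()
    TaskAgent(bus)
    voice = VoiceAgent(bus)
    await voice.ask("Rigel-7")
    return voice.transcript
```

Keeping the bus interface identical whether handlers run in-process or over the network is what lets the same game code run locally and distributed.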

🛠️ Interview: Theodor Marcu — Windsurf 2.0 & Devin

Theodor Marcu (product, Cognition) on Windsurf 2.0 — the first big post-acquisition launch. Headline: Agent Command Center (a Kanban-board mission control for dozens of agents), Spaces for task context switching, and full Devin integration inside Windsurf. Cognition's thesis: the future is managing a team of agents, both local (pair programmer) and cloud (end-to-end). Theodor also reveals that Cognition's internal usage has doubled since launching Managed Devins + Scheduled Devins.

  • Agent Command Center = Kanban-board mission control for dozens of agents
  • Spaces — switch contexts between parallel tasks, each with local + cloud agents
  • Devin is now integrated directly inside Windsurf (Devin's desktop visible locally)
  • Plan locally with a Socratic-method agent, hand off to Devin in the cloud for execution
  • Internal Cognition usage doubled after launching Managed + Scheduled Devins
  • 'Sub-Devins' — Devins managing Devins
Theodor Marcu
"The future of software engineering is managing a team of agents — both remote and local, that can work alongside you. Some of our best engineers are working with dozens of agents at a time."
Theodor Marcu
"A lot of folks on the team cannot go to sleep without starting at least a bunch of Devin sessions, sometimes multiple per task, so they can compare them in the morning."

🔥 Breaking News: OpenAI Codex Major Update

Second breaking news of the show: OpenAI drops a massive Codex update mid-conversation. Native macOS background computer use (with a separate cursor, so you can keep working), 90+ plugins, gpt-image-1.5 image generation + editing, in-app browser, memory ('learns from experience'), proactive work suggestions, multi-terminal SSH, and thread automations. Alex's hot take: Codex is becoming the super-app, not ChatGPT. Post-show, Alex streamed another hour of live testing — the background computer use in particular is much bigger than it looks on the landing page.

  • Native macOS computer use — runs in a separate cursor, in the background
  • 90+ plugins connecting Codex to external services
  • gpt-image-1.5 image generation + editing inside Codex
  • Memory preview — 'learns from experience', remembers corrections + preferences
  • In-app browser closes the frontend feedback loop automatically
  • Multi-terminal SSH into dev boxes + thread automations
Yam Peleg
"We just talked to Devin and Windsurf. And now we're talking about Codex. It's like it's a war, man. It's a war."
Kwindla Hultman Kramer
"'Learns from experience' is just a massive unlock if they can put the pieces together. All of us who do non-trivial stuff in coding agents — we're always trying to add this to the notes file, add this to the README, write down exactly how you got here so we don't get in this loop."
Alex Volkov
"The computer use happens in the background. They have another cursor — they don't use your cursor to click things, so you can actually ask it to do something and keep working on something else. This is the only experience like this I know of."

🎨 NVIDIA Lyra 2.0 & 3D World Generation

Quick hit on the 3D-world-from-single-image race: Baidu ERNIE-Image (8B DiT, #1 GenEval among open models), Tencent HYWorld 2.0 (editable 3D Gaussian Splats, Unity/Unreal/Isaac Sim ready), NVIDIA Lyra 2.0 (Apache 2.0, single image → explorable persistent 3D worlds). Essentially the open-source equivalents of what Fei-Fei Li's World Labs is building.

  • Baidu ERNIE-Image — 8B DiT, #1 GenEval among open models
  • Tencent HYWorld 2.0 — editable 3D scenes from single image, Unity/Unreal ready
  • NVIDIA Lyra 2.0 — Apache 2.0, persistent explorable 3D worlds from one image

📰 AI for Normies — Robots & Allbirds Pivot

Unitree humanoid breaks the 100m dash world record at ~10m/s — faster than Olympic sprinters. And in the stupidest pivot of 2026, Allbirds (the shoe company) loses 99% of its value, rebrands to 'NewBird AI', raises $50M 'to buy GPUs', and the stock shoots up 600-800%. Alex: 'where are they buying those GPUs?'

  • Unitree humanoid: ~10 m/s, world-record 100m dash
  • Allbirds → NewBird AI: 600–800% stock pump after GPU-pivot announcement
  • 'The more you buy, the more you save' — the entire new business model
Yam Peleg
"They literally implemented 'the more you buy, the more you save'. You just buy a bunch of GPUs and you print money. That's their new business model."
TL;DR - ThursdAI, April 16, 2026
  • Hosts and Guests

  • Show Notes

    • Recap essay on the ZL Continuum from AI Engineer Europe (Blog): should AI engineers still read code? Ryan Lopopolo says no, Mario Zechner says yes for critical paths, everyone in between has FMAT.

    • Mario Zechner talk is finally live on AI Engineer youtube (Watch)

    • Super Gemma 4 26B Uncensored v2 by @songjunkr — trending on HF, 0/100 refusals, fixed tool calls (HF GGUF, HF MLX 4bit)

    • Gemma 4 21B REAP — 20% expert-pruned Gemma 4 26B MoE by 0xSero using Cerebras REAP (HF)

    • Parcae (Together AI + UCSD) — stable looped transformer architecture with scaling laws, matches 2x-sized transformer quality (Paper/blog)

    • Claude Desktop app — rewritten from scratch as a completely new app

    • Gemma 4 on W&B Inference — reply on the announcement post with code Gem Drop for $20 in inference credits, also supports LoRA inference via link

  • Big CO LLMs + APIs

    • Anthropic launches Claude Opus 4.7 - 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 3x vision resolution, new xhigh effort level, /ultrareview in Claude Code, same pricing as 4.6 but new tokenizer uses ~1.0-1.35x as many tokens (X, Blog)

    • OpenAI Codex major update: macOS background computer use, 90+ plugins, gpt-image-1.5 image generation, in-app browser, memory, self-scheduling automations, multi-terminal SSH (X, Blog)

    • CoreWeave signs deals with Anthropic (multibillion), Meta ($21B expansion, $35B+ total), and Jane Street ($6B cloud + $1B equity), now serves 9 of the top 10 AI providers

  • Open Source LLMs

    • Qwen 3.6-35B-A3B - Apache 2.0, 35B MoE with 3B active, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M (X, HF, Blog)

    • MiniMax M2.7 open weights - 230B MoE with 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex, self-evolved via 100+ rounds of autonomous RL (X, HF)

  • Tools & Agentic Engineering

    • Windsurf 2.0 with Agent Command Center and Devin integration - interview with Theodor Marcu (X, Blog)

    • Warp now supports any CLI agent with vertical tabs, notifications, code review, mobile remote control (X, Blog)

    • Claude Code Routines - cron, GitHub event, and API-triggered autonomous agents running on Anthropic’s cloud (Docs)

  • This Week’s Buzz - Weights & Biases / CoreWeave

    • Marimo Pair - drop Claude Code / Codex / OpenCode agents directly inside reactive Python notebooks - interview with Trevor Manz (Blog, GitHub)

    • Gemma 4 now live on W&B Inference on CoreWeave infrastructure, with LoRA inference support

  • Vision & Video

    • Craziest AI video of the year: Pi Hard / Neil deGrasse Tyson (X)

  • Voice & Audio

    • Gradient Bang - first massively multiplayer fully LLM-driven game, Pipecat sub-agents - interview with Kwindla (Play, GitHub)

    • Google Gemini 3.1 Flash TTS - 1,211 Elo on TTS Arena, inline audio tags, 70+ languages, ~$0.03/60s (Blog)

  • AI Art, Diffusion & 3D

    • Baidu ERNIE-Image - 8B DiT, #1 GenEval among open models, precise multilingual text rendering (HF)

    • Tencent HYWorld 2.0 - single image to editable 3D Gaussian Splats/meshes, Unity/Unreal/Isaac Sim ready (GitHub)

    • NVIDIA Lyra 2.0 - single image to explorable persistent 3D worlds, Apache 2.0 (Project, HF)

  • Other news

    • Unitree humanoid breaks 100m dash world record at ~10m/s (X)

    • Allbirds shoe company loses 99.5%, rebrands as “NewBird AI”, raises $50M to buy GPUs, stock up 600-800% (X)