Episode Summary

Alex goes live eight minutes early to cover breaking news: Claude Opus 4.7 drops mid-intro — 87.6% on SWE-bench Verified, 64.3% on SWE-bench Pro, and a brand-new 'extra high' reasoning effort — yet the release ends up feeling more like a Mythos teaser than a finished flagship. Qwen 3.6-35B-A3B (Apache 2.0, 73.4% SWE-bench Verified with only 3B active params) and MiniMax M2.7 open weights keep the open-source train screaming, while OpenAI drops another breaking-news bomb during the show with a massive Codex update: native macOS background computer use, 90+ plugins, memory, gpt-image-1.5, and multi-terminal SSH. Three incredible interviews: Trevor Manz on Marimo Pair, which drops coding agents into reactive Python notebooks; Kwindla on Gradient Bang, the multi-agent voice game that 'broke containment'; and Theodor Marcu on Windsurf 2.0 plus Devin's Agent Command Center. Plus Alex debuts the 'ZL Continuum' essay from AI Engineer Europe: do engineers still read code?

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Trevor Manz
Founding engineer · Marimo
@trevmanz
Theodor Marcu
Product · Cognition (Windsurf)
@theodormarcu
Kwindla Hultman Kramer
Co-CEO Daily · Pipecat maintainer
@kwindla
Yam Peleg
AI builder & founder
@yampeleg
Nisten Tahiraj
Weekly co-host · AI engineer
@nisten
LDJ
Weekly co-host · AI researcher
@ldjconfirmed
Wolfram Ravenwolf
Weekly co-host · AI Evangelist at W&B/CoreWeave
@WolframRvnwlf

By The Numbers

SWE-bench Verified
87.6%
Claude Opus 4.7 — 64.3% on SWE-bench Pro, an 11-point jump over 4.6 on the harder agentic coding eval
ScreenSpot Pro jump
+22 pts
Opus 4.7 computer-use — 57.7% → 79.5% vs Opus 4.6, pulling even with Mythos on some slices
SWE-bench Verified
73.4%
Qwen 3.6-35B-A3B — Apache 2.0, 35B MoE with just 3B active, rivals models 10x its size
active parameters
10B
MiniMax M2.7 — 230B MoE matches GPT-5.3-Codex on SWE-Pro at 56.22%, self-evolved via 100+ rounds of autonomous RL
TTS Arena Elo
1,211
Google Gemini 3.1 Flash TTS — 70+ languages, inline audio tags, ~$0.03 per 60s (≈5× cheaper than ElevenLabs)
YoY PR growth
15×
GitHub on track for 15 billion PRs in 2026 (vs 1 billion in 2025) — Vercel says 60% of their traffic is now agents
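Several of the stat cards above are simple arithmetic on numbers quoted elsewhere in the episode; a quick sanity check in Python (all figures taken directly from the cards, nothing new):

```python
# Sanity-check the headline numbers from the stat cards above.

# GitHub PRs: 1 billion (2025) -> projected 15 billion (2026).
pr_growth = 15e9 / 1e9            # 15x YoY

# Opus 4.7 on ScreenSpot Pro: 57.7% -> 79.5% vs Opus 4.6.
screenspot_jump = 79.5 - 57.7     # ~21.8 percentage points, i.e. the +22 figure

# Gemini 3.1 Flash TTS: ~$0.03 per 60s of audio, quoted as ~5x cheaper
# than ElevenLabs, which implies roughly $0.15 per 60s there.
implied_elevenlabs = 0.03 * 5
```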

🔥 Breaking During The Show

Claude Opus 4.7 drops 8 minutes before show start
Anthropic ships Opus 4.7 right as ThursdAI is about to go live. 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, new 'extra high' (xhigh) reasoning effort, 3× vision resolution, /ultrareview in Claude Code, and a new tokenizer that uses 1.0–1.35× as many tokens. Feels like a Mythos teaser (331 mentions in the system card).
OpenAI Codex gets a massive update mid-show
Native macOS background computer use (separate cursor!), 90+ plugins, gpt-image-1.5 for images, in-app browser, memory, self-scheduling automations, multi-terminal SSH. Alex calls it: Codex is becoming the super-app, not ChatGPT.
Qwen 3.6-35B-A3B open-sourced the same morning
Alibaba Qwen ships an Apache 2.0 35B MoE with 3B active, 73.4% SWE-bench Verified, 262K→1M context, natively multimodal. Confirms Qwen's open-source commitment after Junyang Lin's departure.
MiniMax M2.7 open weights released
230B MoE / 10B active, matches GPT-5.3-Codex on SWE-Pro at 56.22%, self-evolved via 100+ rounds of autonomous RL.

🔥 Pre-Show Banter & Opus 4.7 Breaking News

Alex and Yam go live 8 minutes before the official show start because Anthropic just dropped Claude Opus 4.7. No early access, no advance briefing — the crew opens the system card live alongside the audience.

  • AI breaking news from the jump: Opus 4.7 drops with no prior access
  • Anthropic ships the system card publicly, crew reads it in real time
  • 4 cohosts on set: Yam, Wolfram, LDJ (and Nisten joining shortly)
Alex Volkov
"Anthropic made us go live, eight minutes before the official show start, because we got some breaking news."

🧪 Opus 4.7 Evals & Benchmarks

The crew walks through the Anthropic evals table live. Biggest jumps: SWE-bench Pro (+11 points), ScreenSpot Pro (+22 points on computer use), plus the new xhigh effort level. MRCR long-context drops from 78% → 32%, suggesting a new pre-trained base. The system card mentions 'Mythos' 331 times — this feels like an ad for the godlike version they haven't released yet.

  • 331 mentions of 'Mythos' in the system card — 4.7 feels like the appetizer
  • SWE-bench Pro jumps 11 points, passes GPT-5.4 on agentic coding
  • MRCR 8-needle V2: 78% → 32% — LDJ thinks it's a new pre-trained base
  • New 'xhigh' (extra high) reasoning effort gets best-in-class on HLE
  • New tokenizer uses 1.0–1.35× as many tokens per prompt — likely for multimodal reasons, not a cash grab
Alex Volkov
"If you look for Mythos in the system card, there's three hundred and thirty-one mentions for Mythos. It does feel like a Mythos ad."
LDJ
"This kind of helps maybe confirm it is maybe a new pre-trained model from scratch — in MRCR it's performing much worse, but I expect in other areas it might be much better in dramatic ways."
Yam Peleg
"We're all cooked already at this point, everyone looking at the evals, who has the biggest number. If you do a good job and train a really good model, people like us are gonna look at the evals and say 'nah, this model is not good.' But probably the model is really good."

📰 TL;DR — Weekly AI News Roundup

The longest TL;DR in show history. CoreWeave signs Anthropic (multibillion), Meta ($21B expansion), and Jane Street ($6B cloud + $1B equity) — now serving 9 of the top 10 AI labs. Qwen 3.6-35B-A3B, MiniMax M2.7 open weights, Windsurf 2.0 + Devin, Warp's any-CLI-agent support, Claude Code Routines (cron-triggered agents on Anthropic's cloud), Marimo Pair, Gemma 4 live on W&B Inference, Gemini 3.1 Flash TTS, Baidu ERNIE-Image, Tencent HYWorld 2.0, NVIDIA Lyra 2.0, a Unitree humanoid breaking a 100m dash record, and Allbirds → NewBird AI.

  • CoreWeave now backs 9 of the top 10 AI labs (Anthropic, Meta $35B+, Jane Street $7B)
  • Qwen 3.6-35B-A3B: Apache 2.0, 35B MoE / 3B active, 262K→1M context, natively multimodal
  • MiniMax M2.7 open weights: 230B / 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex
  • Gemma 4 now live on W&B Inference (CoreWeave) with LoRA inference support — code 'Gem Drop' for $20 credits
  • Claude Code Routines: cron/GitHub-event/API triggered autonomous agents on Anthropic's cloud
  • Super Gemma 4 26B Uncensored v2 by @songjunkr trending on HF — 0/100 refusals, fixed tool calls
Nisten Tahiraj
"People are going ham over this. It's like we're back to two years ago."
Yam Peleg
"Open source is so back."

🧪 Opus 4.7 Live Testing — Martian Simulation

Nisten puts Opus 4.7 through the infamous Martian simulation benchmark the crew has been running for months. Early impression: incremental over 4.6 but solid — 'not as much of a jump from 4.6 to 4.7 as it was from 4.5 to 4.6.' Vision / computer use, on the other hand, feels genuinely better.

  • Nisten: 'It feels incremental this time' — not as big a jump as 4.6 over 4.5
  • Vision + ScreenSpot Pro improvements are the real story
  • New /ultrareview slash command in Claude Code
Nisten Tahiraj
"It feels incremental this time. Like, it doesn't — 4.6 was smarter. 4.7 is not as much of a jump from 4.6 as it was from 4.5 to 4.6."

🔓 Qwen 3.6 Open Source Release

Alibaba Qwen drops Qwen 3.6-35B-A3B under Apache 2.0 the same morning as Opus 4.7 — a 35B MoE with only 3B active parameters, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M. After Junyang Lin left the team there were doubts about Qwen's open-source commitment; this release puts those to rest.

  • Apache 2.0 — 35B MoE / 3B active, 73.4% SWE-bench Verified
  • Natively multimodal, 262K context extensible to 1M
  • Strongest mid-size LLM on nearly all benchmarks
  • Confirms Alibaba Qwen's continued open-source commitment post-Junyang
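Why "3B active" matters: in a mixture-of-experts model only the routed experts run on each token, so inference compute tracks active parameters, not total. A back-of-the-envelope comparison using the common ~2 FLOPs per active parameter per token rule of thumb (the 2× factor is the standard approximation, not a Qwen-published number):

```python
def fwd_flops_per_token(active_params):
    # Rough rule of thumb: a transformer forward pass costs about
    # 2 FLOPs per *active* parameter per generated token.
    return 2 * active_params

dense_35b = fwd_flops_per_token(35e9)  # hypothetical dense 35B model
qwen_a3b  = fwd_flops_per_token(3e9)   # 35B-A3B MoE, 3B active
speedup   = dense_35b / qwen_a3b       # ~11.7x less compute per token
```

This is why a 35B-A3B model can be served roughly an order of magnitude cheaper per token than a dense model of the same total size, while still drawing on the full 35B parameters for capacity.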

🔓 MiniMax M2.7 Open Weights

MiniMax releases M2.7 open weights: 230B parameter MoE with only 10B active, 56.22% on SWE-Pro (matching GPT-5.3-Codex). Self-evolved via 100+ rounds of autonomous RL. Typical recent Chinese release cadence: ship the API first, then open-source the weights.

  • 230B MoE / 10B active — matches GPT-5.3-Codex on SWE-Pro (56.22%)
  • Self-evolved via 100+ rounds of autonomous RL
  • Open weights released on HF after API launch

🧠 The ZL Continuum — Do Engineers Still Read Code?

Alex's essay from AI Engineer Europe: where are you on the Z–L spectrum? Ryan Lopopolo (OpenAI, token billionaire) says code is a liability, don't read it. Mario Zechner (creator of pyo, the harness powering OpenClaw) says slow the fuck down, read every line of critical code. Everyone else is somewhere in between. The crew + later guests weigh in: Nisten=Z, Yam=Z (with a brutal 'hidden-features accumulation' warning), Wolfram=L (mostly), LDJ=moving-to-L, Trevor=L-leaning, Kwindla=full-L ('my rule was not to read or write any code for the side project').

  • Ryan Lopopolo: 'code is a liability' · Mario Zechner: 'read every line'
  • Yam's warning: agents silently add hidden features (context truncation, etc.), accumulate, eventually no agent OR human can review the code
  • Consensus: it's not per-person, it's per-task — critical code = read, throwaway = YOLO
  • Poll opens live during the show
Yam Peleg
"You must read the code, man. Those things accumulate — the agent thinks 'oh, that's what the human wanted' so it maintains hidden features. You get to a point where the agents just can't do anything."
Wolfram Ravenwolf
"I trust the AI, I tell it what to do, I test it, and my most used command is review. I probably spend more time reviewing than actually coding."

⚡ AI Engineer Summit — Top 10 Themes

Alex's synthesized top-10 from AI Engineer Europe: (1) FMAT — Fear of Missing Agent Time, (2) the ZL Continuum, (3) everything is changing super fast (GitHub on track for 15B PRs, Vercel 60% agent traffic), (4) we're still early, (5) AGI is here and unevenly distributed, (6) 'just talk to your Clanker', (7) MCP is dead long live MCP (enterprises adopting faster than ever), (8) AI was supposed to make us work less — we work more, (9) MHC = Model / Harness / Context is the new ASL.

  • FMAT: Fear Of Missing Agent Time — universal at the conference
  • GitHub: 1B → 15B PRs YoY projected, 15× growth
  • Vercel: 60% of traffic attributed to agents
  • MHC framework — Model/Harness/Context is the new ASL for AI engineers
  • MCP isn't dying — enterprises are adopting faster, especially with code-mode
Alex Volkov
"Model Harness Context is the new ASL. When somebody tells me 'OpenClaw is stupid', I have no idea how to react until they tell me if they use OpenClaw with Opus 4.7 or MiniMax 2.7."

🎥 Pi Hard — Craziest AI Video Yet

Alex plays the Pi Hard / Neil deGrasse Tyson / SBF AI trailer live. Even Alex's fiancée (who works with AI video daily) didn't clock it as AI until midway through. Seedance 2.0 is now everywhere; the crew agrees this is the craziest AI video they've ever seen.

  • Multi-shot AI video production, Neil deGrasse Tyson deepfake
  • Seedance 2.0 fully rolled out with video support everywhere
  • Yam: 'That's the craziest AI video I've seen'
Yam Peleg
"That's the craziest AI video I've seen. That's... there's no competition."

🛠️ Interview: Trevor Manz — Marimo Pair

Trevor Manz, founding engineer at Marimo, on why reactive Python notebooks are suddenly very important for AI workflows — and on Marimo Pair, which drops Claude Code / Codex / OpenCode agents directly inside a reactive notebook. Trended on Hacker News this week. On the ZL continuum, Trevor is moving L-ward but focuses more on building verification systems around agents than on reading less code.

  • Marimo = reactive Python notebooks (dependency-graph aware vs Jupyter)
  • Marimo Pair: drop Claude Code / Codex / OpenCode agents in the notebook
  • Trended on HN this week
  • Trevor's take: shift burden of review onto better verification systems
Trevor Manz
"My job has shifted a lot to trying to build systems that I can have the AI tools verify their results and correctness of the programs. So trying to shift some of the burden of review onto just having better systems."
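The "dependency-graph aware" point is the heart of it: marimo parses each cell for the variables it defines and reads, builds a dataflow graph, and re-runs only downstream cells when something changes. A toy version of that scheduling logic in plain Python (concept sketch only, not marimo's actual API; cell defines/reads are declared by hand here instead of extracted from the cell's code):

```python
from graphlib import TopologicalSorter

# Toy "notebook": each cell declares what it defines, what it reads,
# and how to run. Marimo derives defines/reads by parsing cell source.
cells = {
    "load":  {"defines": {"df"},  "reads": set(),   "run": lambda ns: ns.update(df=[1, 2, 3])},
    "clean": {"defines": {"df2"}, "reads": {"df"},  "run": lambda ns: ns.update(df2=[x * 2 for x in ns["df"]])},
    "stat":  {"defines": {"fig"}, "reads": {"df2"}, "run": lambda ns: ns.update(fig=sum(ns["df2"]))},
}

def run_from(changed, ns):
    """Run `changed`, then every cell transitively downstream of it."""
    # Cell A depends on cell B if A reads a variable B defines.
    deps = {n: {m for m, other in cells.items() if cells[n]["reads"] & other["defines"]}
            for n in cells}
    dirty = set(cells[changed]["defines"])
    for name in TopologicalSorter(deps).static_order():
        if name == changed or cells[name]["reads"] & dirty:
            cells[name]["run"](ns)
            dirty |= cells[name]["defines"]
    return ns
```

Editing `load` re-runs `clean` and `stat`; editing `clean` leaves `load` untouched. That selective re-execution is the reactivity Jupyter lacks, and it is what makes it safe to let a coding agent mutate cells.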

⚡ This Week's Buzz — Weights & Biases / CoreWeave

Marimo Pair (interviewed above) is the W&B/CoreWeave Buzz this week, plus Gemma 4 now live on W&B Inference with LoRA inference support. Reply to the W&B announcement post with code 'Gem Drop' for $20 in inference credits.

  • Marimo Pair — the CoreWeave-family agent notebook integration
  • Gemma 4 live on wandb.ai/inference with LoRA inference support
  • Code 'Gem Drop' on X for $20 in free W&B inference credits

🔊 Interview: Kwindla Kramer — Gradient Bang & Google TTS

Kwindla Hultman Kramer (co-CEO of Daily, Pipecat maintainer) on Google's Gemini 3.1 Flash TTS (1,211 Elo, 70+ languages, fully promptable — but ~3s time to first token, so batch-only for now). Then the main event: Gradient Bang, his 'side project that broke containment' — a fully LLM-driven multiplayer voice-based space game inspired by Trade Wars. Built on a new Pipecat Sub-Agents library, uses Deepgram + GPT-4.1 voice agent + GPT-5.2 medium-thinking task agents + LLM-generated dynamic UIs. Kwindla's rule for his own side project: 'don't read or write any code.'

  • Gemini 3.1 Flash TTS is fully promptable like an LLM (not fixed tags) — but has ~3s TTFT
  • Gradient Bang: multi-agent voice space game inspired by BBS-era Trade Wars
  • Pipecat Sub-Agents: new class-based event bus, works locally + over network
  • Voice agent always runs (<1.5s response), task agents on GPT-5.2 medium thinking
  • LLM-generated dynamic UI paradigm — React frontend rendered via JSON from LLM
  • Open-sourced GB Benchmarks for evaluating agent task execution
Kwindla Hultman Kramer
"It's a side project that broke containment. I hacked together this game, we started playing it, and it became clear really quickly that a lot of things in voice AI — all these problems we were trying to solve actually are very general: how do you build AI-native software."
Kwindla Hultman Kramer
"Part of my goal for this, since it was a side project, was not to write or read any code. I've been doing that since November — and it's been painful in different ways, but also a great learning process."
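The Sub-Agents pattern Kwindla describes is a fast, always-on voice agent that answers instantly while slower task agents do the heavy thinking and publish results back over an event bus. A minimal sketch of that pattern (all class, method, and topic names here are invented for illustration; this is not the actual Pipecat Sub-Agents API):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal in-process pub/sub. A networked variant would serialize
    events over a socket; subscribers would not need to change."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    async def publish(self, topic, payload):
        for handler in self._handlers[topic]:
            await handler(payload)

class TaskAgent:
    """Slow background worker, standing in for a 'medium thinking' model call."""
    def __init__(self, bus):
        self.bus = bus
        bus.subscribe("task.request", self.handle)

    async def handle(self, destination):
        await asyncio.sleep(0)  # placeholder for a multi-second model call
        await self.bus.publish("task.result", f"course plotted to {destination}")

class VoiceAgent:
    """Fast agent: acknowledges instantly, speaks results when they arrive."""
    def __init__(self, bus):
        self.bus = bus
        self.transcript = []
        bus.subscribe("task.result", self.speak)

    async def ask(self, destination):
        self.transcript.append(f"on it: {destination}")  # instant reply path
        await self.bus.publish("task.request", destination)

    async def speak(self, result):
        self.transcript.append(result)

async def demo():
    bus = EventBus()
    TaskAgent(bus)
    voice = VoiceAgent(bus)
    await voice.ask("Rigel-7")
    return voice.transcript
```

Keeping the bus interface identical whether handlers run in-process or over the network is what lets the same game code run locally and distributed.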

🛠️ Interview: Theodor Marcu — Windsurf 2.0 & Devin

Theodor Marcu (product, Cognition) on Windsurf 2.0 — the first big post-acquisition launch. Headline: Agent Command Center (a Kanban-board mission control for dozens of agents), Spaces for task context switching, and full Devin integration inside Windsurf. Cognition's thesis: the future is managing a team of agents, both local (pair programmer) and cloud (end-to-end). Theodor also reveals that Cognition's internal usage has doubled since launching Managed Devins + Scheduled Devins.

  • Agent Command Center = Kanban-board mission control for dozens of agents
  • Spaces — switch contexts between parallel tasks, each with local + cloud agents
  • Devin is now integrated directly inside Windsurf (Devin's desktop visible locally)
  • Plan locally with a Socratic-method agent, hand off to Devin in the cloud for execution
  • Internal Cognition usage doubled after launching Managed + Scheduled Devins
  • 'Sub-Devins' — Devins managing Devins
Theodor Marcu
"The future of software engineering is managing a team of agents — both remote and local, that can work alongside you. Some of our best engineers are working with dozens of agents at a time."
Theodor Marcu
"A lot of folks on the team cannot go to sleep without starting at least a bunch of Devin sessions, sometimes multiple per task, so they can compare them in the morning."

🔥 Breaking News: OpenAI Codex Major Update

Second breaking news of the show: OpenAI drops a massive Codex update mid-conversation. Native macOS background computer use (with a separate cursor, so you can keep working), 90+ plugins, gpt-image-1.5 image generation + editing, in-app browser, memory ('learns from experience'), proactive work suggestions, multi-terminal SSH, and thread automations. Alex's hot take: Codex is becoming the super-app, not ChatGPT. Post-show, Alex streamed another hour of live testing — the background computer use in particular is much bigger than it looks on the landing page.

  • Native macOS computer use — runs in a separate cursor, in the background
  • 90+ plugins connecting Codex to external services
  • gpt-image-1.5 image generation + editing inside Codex
  • Memory preview — 'learns from experience', remembers corrections + preferences
  • In-app browser closes the frontend feedback loop automatically
  • Multi-terminal SSH into dev boxes + thread automations
Yam Peleg
"We just talked to Devin and Windsurf. And now we're talking about Codex. It's like it's a war, man. It's a war."
Kwindla Hultman Kramer
"'Learns from experience' is just a massive unlock if they can put the pieces together. All of us who do non-trivial stuff in coding agents — we're always trying to add this to the notes file, add this to the README, write down exactly how you got here so we don't get in this loop."
Alex Volkov
"The computer use happens in the background. They have another cursor — they don't use your cursor to click things, so you can actually ask it to do something and keep working on something else. This is the only experience like this I know of."

🎨 NVIDIA Lyra 2.0 & 3D World Generation

Quick hit on the 3D-world-from-single-image race: Baidu ERNIE-Image (8B DiT, #1 GenEval among open models), Tencent HYWorld 2.0 (editable 3D Gaussian Splats, Unity/Unreal/Isaac Sim ready), NVIDIA Lyra 2.0 (Apache 2.0, single image → explorable persistent 3D worlds). Essentially the open-source equivalents of what Fei-Fei Li's World Labs is building.

  • Baidu ERNIE-Image — 8B DiT, #1 GenEval among open models
  • Tencent HYWorld 2.0 — editable 3D scenes from single image, Unity/Unreal ready
  • NVIDIA Lyra 2.0 — Apache 2.0, persistent explorable 3D worlds from one image

📰 AI for Normies — Robots & Allbirds Pivot

Unitree humanoid breaks the 100m dash world record at ~10m/s — faster than Olympic sprinters. And in the stupidest pivot of 2026, Allbirds (the shoe company) loses 99% of its value, rebrands to 'NewBird AI', raises $50M 'to buy GPUs', and the stock shoots up 600-800%. Alex: 'where are they buying those GPUs?'

  • Unitree humanoid: ~10 m/s, world-record 100m dash
  • Allbirds → NewBird AI: 600–800% stock pump after GPU-pivot announcement
  • 'The more you buy, the more you save' — the entire new business model
Yam Peleg
"They literally implemented 'the more you buy, the more you save'. You just buy a bunch of GPUs and you print money. That's their new business model."
TL;DR - ThursdAI, April 16, 2026
  • Hosts and Guests

  • Show Notes

    • Recap essay on the ZL Continuum from AI Engineer Europe (Blog): should AI engineers still read code? Ryan Lopopolo says no, Mario Zechner says yes for critical paths, everyone in between has FMAT.

    • Mario Zechner talk is finally live on AI Engineer youtube (Watch)

    • Super Gemma 4 26B Uncensored v2 by @songjunkr — trending on HF, 0/100 refusals, fixed tool calls (HF GGUF, HF MLX 4bit)

    • Gemma 4 21B REAP — 20% expert-pruned Gemma 4 26B MoE by 0xSero using Cerebras REAP (HF)

    • Parcae (Together AI + UCSD) — stable looped transformer architecture with scaling laws, matches 2x-sized transformer quality (Paper/blog)

    • Claude Desktop app — rewritten from scratch as a completely new app

    • Gemma 4 on W&B Inference — reply on the announcement post with code Gem Drop for $20 in inference credits, also supports LoRA inference via link

  • Big CO LLMs + APIs

    • Anthropic launches Claude Opus 4.7 - 87.6% SWE-bench Verified, 64.3% SWE-bench Pro, 3x vision resolution, new xhigh effort level, /ultrareview in Claude Code, same pricing as 4.6 but new tokenizer uses ~1.0-1.35x as many tokens (X, Blog)

    • OpenAI Codex major update: macOS background computer use, 90+ plugins, gpt-image-1.5 image generation, in-app browser, memory, self-scheduling automations, multi-terminal SSH (X, Blog)

    • CoreWeave signs deals with Anthropic (multibillion), Meta ($21B expansion, $35B+ total), and Jane Street ($6B cloud + $1B equity), now serves 9 of the top 10 AI providers

  • Open Source LLMs

    • Qwen 3.6-35B-A3B - Apache 2.0, 35B MoE with 3B active, 73.4% SWE-bench Verified, natively multimodal, 262K context extensible to 1M (X, HF, Blog)

    • MiniMax M2.7 open weights - 230B MoE with 10B active, 56.22% SWE-Pro matching GPT-5.3-Codex, self-evolved via 100+ rounds of autonomous RL (X, HF)

  • Tools & Agentic Engineering

    • Windsurf 2.0 with Agent Command Center and Devin integration - interview with Theodor Marcu (X, Blog)

    • Warp now supports any CLI agent with vertical tabs, notifications, code review, mobile remote control (X, Blog)

    • Claude Code Routines - cron, GitHub event, and API-triggered autonomous agents running on Anthropic’s cloud (Docs)

  • This Week’s Buzz - Weights & Biases / CoreWeave

    • Marimo Pair - drop Claude Code / Codex / OpenCode agents directly inside reactive Python notebooks - interview with Trevor Manz (Blog, GitHub)

    • Gemma 4 now live on W&B Inference on CoreWeave infrastructure, with LoRA inference support

  • Vision & Video

    • Craziest AI video of the year: Pi Hard / Neil deGrasse Tyson (X)

  • Voice & Audio

    • Gradient Bang - first massively multiplayer fully LLM-driven game, Pipecat sub-agents - interview with Kwindla (Play, GitHub)

    • Google Gemini 3.1 Flash TTS - 1,211 Elo on TTS Arena, inline audio tags, 70+ languages, ~$0.03/60s (Blog)

  • AI Art, Diffusion & 3D

    • Baidu ERNIE-Image - 8B DiT, #1 GenEval among open models, precise multilingual text rendering (HF)

    • Tencent HYWorld 2.0 - single image to editable 3D Gaussian Splats/meshes, Unity/Unreal/Isaac Sim ready (GitHub)

    • NVIDIA Lyra 2.0 - single image to explorable persistent 3D worlds, Apache 2.0 (Project, HF)

  • Other news

    • Unitree humanoid breaks 100m dash world record at ~10m/s (X)

    • Allbirds shoe company loses 99.5%, rebrands as “NewBird AI”, raises $50M to buy GPUs, stock up 600-800% (X)