ThursdAI · April 23, 2026

📅 Apr 23: OpenAI's Week: GPT-5.5, GPT-Image-2, Codex CUA + Chronicle, + Claude Design, Kimi K2.6, Qwen 3.6-27B

From Weights & Biases: an intense week fully dominated by OpenAI, with a new top LLM (GPT-5.5), a new top image model (GPT-Image-2), and a pile of Codex releases, plus Claude Design and a bunch of open source.

By Alex Volkov

144 min

What happened in AI the week of April 23, 2026?

The week OpenAI went full throttle. GPT-5.5 dropped mid-show — SOTA across terminal-bench, SWE-bench, GDPval and frontier-math, using ~40% fewer tokens than 5.4. GPT-Image-2 posted the biggest Arena ELO jump ever (200+ points), generating functioning QR codes, perfect infographics, and 360° street-view images that Peter Gostev stitched into a 24-hour walkable world. Codex now has real multi-cursor computer use on macOS plus Chronicle screen-memory. On the open-source side, Kimi K2.6 became Wolfram's best-ever open model and Qwen3.6-27B dense beat Alibaba's own 400B flagship. Oh — and Claude Design shipped, dropping Figma stock 7%.

Episode Summary

The Week That Broke The Chart — Interactive Recap

Interactive infographic generated with Claude Design.

In This Episode

📰 Intro & TL;DR — Week in Review
🔓 Open Source: Kimi K2.6
🔓 Open Source: Qwen 3.6-27B
🔓 OpenAI Privacy Filter (Apache 2.0)
🎨 GPT-Image-2 — Thinking Mode for Images
🤖 Codex: Computer Use & Chronicle
🛠️ Brex CrabTrap — Agent Security
🔥 BREAKING: GPT-5.5 Drops Live
💬 Peter Gostev Joins — First Impressions
🧪 Peter's 24-Hour Babylon Street-View Experiment
🎨 Claude Design — Figma Dropped 7%
⚡ This Week's Buzz — W&B LEET TUI Workspace Mode
📰 Recap & Outro

Hosts & Guests

Alex Volkov

Host · W&B / CoreWeave

@altryne

Peter Gostev

Head of AI · Arena (formerly LMArena)

@petergostev

Wolfram Ravenwolf

AI model evaluator · r/LocalLLaMA

@WolframRvnwlf

LDJ

Nous Research

@ldjconfirmed

Nisten Tahiraj

AI operator & builder

@nisten

Ryan Carson

AI educator & founder

@ryancarson

Yam Peleg

AI builder & founder

@Yampeleg

By The Numbers

Terminal-Bench 2

82.7%

GPT-5.5 state-of-the-art, up from 75% on 5.4

GPT-Image-2 Arena jump

+200 ELO

Biggest single jump ever recorded on Arena; beat the prior top model by more than 200 points

Longest task

8.5 hrs

Peter Gostev: 'It hasn't literally finished the first one' — GPT-5.5 ran one task overnight without stopping

Qwen3.6

27B dense

Apache-2.0, beats Alibaba's own 400B flagship on every major coding benchmark

Kimi K2.6

1T MoE

32B active, SOTA open-source on SWE-Bench Pro at 58.6

Anthropic

$30B ARR

Crossed the $30B annualized revenue mark this week

🔥 Breaking During The Show

GPT-5.5 drops mid-show

OpenAI ships GPT-5.5 and GPT-5.5 Pro during the livestream. State-of-the-art on Terminal-Bench 2 (82.7%), SWE-Bench Verified (73%), GDPval (84%), Frontier Math (35%). Uses 40% fewer tokens than 5.4, which with doubled API pricing nets out to roughly 20% more expensive per task. Codex-first rollout.

📰 Intro & TL;DR — Week in Review

Alex welcomes the full cohost lineup back (Ryan from Japan, Wolfram, Yam, LDJ, Nisten) and runs through the TL;DR. OpenAI's week of dominance: GPT-Image-2 shattering Arena, a GPT-5.5 leak via base64 in Codex ('NS41'), Claude Design denting Figma stock, a SpaceX/xAI deal with Cursor structured as a $10B collaboration with a $60B acquisition clause, and two massive open-source drops from Kimi and Qwen.

Full cohost panel reunion — Ryan back from Japan, everyone live
NS41 = base64 for '5.5': OpenAI leaked its own model version in Codex
Cursor → xAI: $10B collab structure with $60B acquisition clause
Anthropic crosses $30B ARR, resets all Claude quotas, admits degradation
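
The base64 leak is easy to check by hand. The sketch below assumes the label heard on air as 'Nous 41' is the four-character string NS41 (our reading of the spoken name, not something the show notes spell out) and uses only Python's standard `base64` module. The helper names are made up for illustration.

```python
import base64


def encode_label(text: str) -> str:
    """Return the base64 encoding of a short ASCII label."""
    return base64.b64encode(text.encode("ascii")).decode("ascii")


def decode_label(token: str) -> str:
    """Decode a base64 token back into its ASCII label."""
    return base64.b64decode(token).decode("ascii")


print(encode_label("5.5"))   # NS41
print(decode_label("NS41"))  # 5.5
```

Three input bytes map to exactly four base64 characters, which is why the token carries no `=` padding.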

Wolfram Ravenwolf

"The benchmarks take time. The analysis takes time. And when you are done with one, the next one is already there. But I'm not complaining — this is the acceleration we've been waiting for."

Alex Volkov

"Welcome to livestream number five since the last show."

🔓 Open Source: Kimi K2.6

Moonshot AI drops Kimi K2.6 — 1T MoE with 32B active parameters, 256K context, modified MIT license. Claims open-source state-of-the-art on SWE-Bench Pro at 58.6. Wolfram calls it the best open-source model he's ever tested on his private wolf-bench.

1T parameters MoE, 32B active, 384 experts, MLA attention
256K context window, modified MIT license
58.6 on SWE-Bench Pro — SOTA open source
Wolfram's best open-source model ever on wolf-bench

Wolfram Ravenwolf

"Kimi 2.6 is the best model in the open source department. Both are the best."

LDJ

"Kimi seems to be the one that's less academically minded than Qwen, but kind of more creative and more poetic, more diverse in its outputs."

🔓 Open Source: Qwen 3.6-27B

Alibaba ships a dense 27B Apache-2.0 model that beats their own 400B flagship on every major coding benchmark. Plus Qwen3.6-Max-Preview on API. The dense-beats-MoE story keeps evolving.

Dense 27B, Apache 2.0 license
Beats Alibaba's own 400B flagship on coding benchmarks
Qwen3.6-Max-Preview also live on API

Yam Peleg

"Have you guys seen Qwen? The one that gives you Opus 4.5 at home."

🔓 OpenAI Privacy Filter (Apache 2.0)

OpenAI open-sources a tiny 1.5B MoE with only 50M active params — a privacy/PII filter that runs in the browser on WebGPU. Perfect companion for agent security stacks like Brex's CrabTrap.

1.5B MoE, 50M active params, Apache 2.0
Runs fully in browser via Xenova's Transformers.js
Designed to identify and remove PII in datasets

LDJ

"It's a model for helping identify and remove personally identifiable information within datasets — whether that's a company wanting to fine-tune on their own personal data or for whatever other reason."

🎨 GPT-Image-2 — Thinking Mode for Images

The biggest jump in Arena ELO history: GPT-Image-2 is 200+ points above the last top model. A thinking/reasoning image model that generates functioning QR codes, renders equirectangular 360° images, produces photo-perfect character consistency (even Dario Amodei), and 'writes code' by generating screenshots of IDEs containing SVGs that actually render. Ryan is integrating it into his weekly marketing pipeline today.

+200 ELO over prior top model on Arena (biggest jump ever)
Functioning QR codes embedded in generated images
Multi-image character consistency — can generate full manga pages
4K output, equirectangular 360° images (Peter's street-view hack)
Generates pixel-perfect screenshots of IDEs with working SVG code
New meta: GPT-Image-2 designs UI → Codex implements
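
To put the rating gap in perspective, here is a rough sketch of what a 200-point lead implies head-to-head. It assumes the classic Elo logistic curve with a 400-point scale. Arena fits its own rating model, so treat this as an approximation rather than Arena's actual method.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the classic Elo formula."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Compare a typical gap between neighbouring models with the GPT-Image-2 gap.
for gap in (50, 200):
    print(f"{gap} points -> {elo_expected_score(gap):.1%} expected win rate")
```

Under that assumption a 50-point gap is roughly a 57/43 split, while a 200-point gap means the leader is preferred in about three of every four comparisons.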

LDJ

"There's not more than a 50-point gap between any of those 50 top-ranking neighbors. The exception is GPT-Image-2 — even on medium reasoning mode, it's over 200 points above the last top place. It's insane."

Ryan Carson

"It's good for real stuff, not fancy fun play stuff. I'm already integrating this into my marketing engine."

Wolfram Ravenwolf

"It's not just an image model. We have intelligence in the images that we didn't have before. It is so mind-blowing to see what you can do now outside of just good-looking images."

🤖 Codex: Computer Use & Chronicle

Codex now has true background computer use on macOS — a second cursor that works while you work, running on its own thread. It's so good, 'any other computer use is computer useless.' Plus subagents each controlling different windows in parallel. And Chronicle: Codex takes a screenshot every 10 seconds and has total screen memory — ask 'what was I doing an hour ago?' and it knows.

Background cursor that doesn't take over your mouse — works while you work
Multi-agent: subagents click in parallel windows
Software Apps Inc. (ex-Apple Shortcuts team) acquisition paying off
Chronicle: 10-second screenshots feed into Codex context
Alex used it to auto-quote-tweet from a prompt, with verification
OpenAI Codex passes 4M users
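
The 'what was I doing an hour ago?' lookup can be pictured as a bounded, time-ordered log of snapshots. The sketch below is purely illustrative: `ScreenMemory`, `Snapshot`, and the 24-hour retention are hypothetical names and values, and nothing here reflects how Codex actually stores or queries Chronicle data. Only the 10-second cadence comes from the episode.

```python
from bisect import bisect_right
from collections import deque
from dataclasses import dataclass
from typing import Optional

CAPTURE_INTERVAL_S = 10  # cadence quoted in the episode


@dataclass(frozen=True)
class Snapshot:
    taken_at: float  # seconds on some monotonic clock
    summary: str     # stand-in for a screenshot or its description


class ScreenMemory:
    """Hypothetical bounded log of periodic snapshots (not Codex's real design)."""

    def __init__(self, retention_s: int = 24 * 3600) -> None:
        # Oldest snapshots fall off automatically once the buffer is full.
        self._snaps = deque(maxlen=retention_s // CAPTURE_INTERVAL_S)

    def record(self, taken_at: float, summary: str) -> None:
        self._snaps.append(Snapshot(taken_at, summary))

    def at_or_before(self, when: float) -> Optional[Snapshot]:
        """Return the latest snapshot taken at or before `when`, if any."""
        times = [snap.taken_at for snap in self._snaps]
        idx = bisect_right(times, when)
        return self._snaps[idx - 1] if idx else None


memory = ScreenMemory()
for step in range(0, 7201, CAPTURE_INTERVAL_S):  # two hours of captures
    memory.record(float(step), f"activity at t={step}s")

now = 7200.0
hit = memory.at_or_before(now - 3600)
print(hit.summary)  # activity at t=3600s
```

At one capture every 10 seconds, a full day is 8,640 snapshots, so even a naive in-memory buffer stays small. The hard part in practice is turning each capture into something searchable.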

Alex Volkov

"Once you try Codex computer use, any other computer use is absolutely useless. It's computer useless."

Wolfram Ravenwolf

"I've been waiting for this from the computer operating system manufacturer. Apple or Microsoft could have built this already — a multi-user system where the AI is another user working with you on its own desktop."

LDJ

"OpenAI acquired a company called Multi back in June 2024. Their goal is to make computer use an inherently multiplayer experience. Ever since then I've been waiting for this."

🛠️ Brex CrabTrap — Agent Security

Brex's CEO pair-programs with Codex and open-sources CrabTrap — an LLM-as-judge HTTP proxy that intercepts outbound agent requests, uses natural-language rules, and blocks risky activity. Wolfram changes his pick of the week on the spot.

LLM-as-judge proxy for outbound agent traffic
Natural-language rule definitions for risky behavior
OpenClaw banned at CoreWeave — this is the enterprise fix
Ryan: 'intelligence monitoring all traffic — absolutely going to happen'
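
The LLM-as-judge pattern described above can be sketched in a few lines. This is not CrabTrap's code or API: `OutboundRequest`, `Verdict`, `build_prompt`, `review`, and `fake_model` are hypothetical names, and the model call is replaced by a stub so the example runs offline. It shows natural-language rules going into a prompt and a verdict coming back, with anything unparseable treated as a block.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class OutboundRequest:
    method: str
    url: str
    body: str = ""


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


def build_prompt(rules: Sequence[str], request: OutboundRequest) -> str:
    """Fold the natural-language rules and the request into one judge prompt."""
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You review outbound HTTP requests made by an AI agent.\n"
        f"Rules:\n{rule_lines}\n"
        f"Request: {request.method} {request.url}\n"
        f"Body: {request.body}\n"
        "Answer ALLOW or BLOCK: <reason>."
    )


def review(
    request: OutboundRequest,
    rules: Sequence[str],
    complete: Callable[[str], str],
) -> Verdict:
    """Ask the judge model for a verdict; anything but ALLOW fails closed."""
    answer = complete(build_prompt(rules, request)).strip()
    if answer.upper().startswith("ALLOW"):
        return Verdict(True, "allowed by judge")
    reason = answer.partition(":")[2].strip() or "judge did not allow the request"
    return Verdict(False, reason)


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call: blocks any DELETE request."""
    if "Request: DELETE" in prompt:
        return "BLOCK: destructive method"
    return "ALLOW"


rules = ["Never delete production resources."]
print(review(OutboundRequest("GET", "https://example.com/status"), rules, fake_model))
print(review(OutboundRequest("DELETE", "https://example.com/db"), rules, fake_model))
```

Failing closed on malformed judge output is a design choice in this sketch. A real proxy would also need to decide what happens when the judge times out.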

Wolfram Ravenwolf

"I want to change my pick of the week to CrabTrap. Every week my agent is doing deep research on how to secure agents, because the more access I give them, the more concerned I am."

Ryan Carson

"Intelligence is on demand now. What company would not want intelligence monitoring all their traffic to make sure their employees are not doing bad things? Absolutely this is going to happen."

🔥 BREAKING: GPT-5.5 Drops Live

Mid-show, OpenAI ships GPT-5.5 and GPT-5.5 Pro. Terminal-Bench 2 jumps to 82.7% (from 75%), SWE-Bench Verified reaches 73%, and GDPval is state of the art, beating Opus 4.7 and Gemini 3.1. The model uses 40% fewer tokens than 5.4, but with pricing doubling to $5/$30 per million tokens, a task comes out roughly 20% more expensive, so intelligence-per-dollar dips slightly. Alex gets it live in Codex and runs a computer-use quote-tweet in real time.

82.7% Terminal-Bench 2 (SOTA), up from 75% on 5.4
73% SWE-Bench Verified, 84% GDPval — state of the art
40% fewer tokens at double the price → net ~20% more expensive per task
$5 / $30 per million tokens; Pro: $30 / $180
Live demo: computer use quote-tweeting in Chrome
Not yet in ChatGPT — Codex-first rollout
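
A back-of-the-envelope check on the pricing figures above. It rests on two simplifying assumptions of ours, not OpenAI's: cost per task is price per token times tokens used, and the 40% token reduction applies evenly to billed tokens.

```python
def relative_cost(price_ratio: float, token_ratio: float) -> float:
    """Cost per task relative to the previous model, assuming cost = price x tokens."""
    return price_ratio * token_ratio


# Doubled per-token price, 40% fewer tokens per task.
ratio = relative_cost(price_ratio=2.0, token_ratio=1 - 0.40)

print(f"{ratio:.2f}x the old cost per task")
print(f"{1 / ratio - 1:+.0%} change in work per dollar")
```

Under those assumptions, doubling the price while using 60% of the tokens gives 1.2x the old cost per task, or about 17% less work per dollar. Real bills depend on the input/output token mix and caching, which this ignores.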

Yam Peleg

"Just to be clear — across the board state of the art, right? From thinking and above, everything is state of the art."

Alex Volkov

"State of the art while using almost 50% less tokens. All right folks, let's welcome Peter Gostev from Arena."

Wolfram Ravenwolf

"If a model is thinking longer, it can actually be detrimental on the agentic benchmarks. That's probably why the score is higher now — it decides it doesn't have to think so much, but act and then correct instead of overthinking."

💬 Peter Gostev Joins — First Impressions

Peter from Arena AI (ex-LMArena) joins with early access impressions. The headline: 'This is the first time a model can actually properly do long-running tasks.' He queued up prompts overnight expecting them to finish by 3am — woke up, first one still running. 8.5 hours on a single task, then seven-and-a-half hours on another. 'Reflex loops are dead.'

First model that genuinely sustains multi-hour coherent work
Three long-running tasks going simultaneously
Better conversational feel, less abrupt than 5.2-5.4
Still needs iteration — vision reflection is lacking
Front-end design: great with a spec, poor one-shot

Peter Gostev

"The biggest thing that jumps out is that this is the first time when a model can actually properly do long-running tasks. All previous models, they kept saying you can do it for many hours, but every time I shouted, it never did it."

Peter Gostev

"I queued up thermal prompts to keep it going, and then when I woke up I thought okay, it'll be done at 3am. I woke up and it hasn't literally finished the first one. All of this queuing up was completely unnecessary."

Peter Gostev

"We are not at AGI yet. We still need to trick them a little bit, massage them, understand how they behave."

🧪 Peter's 24-Hour Babylon Street-View Experiment

Peter's overnight project with GPT-5.5 + GPT-Image-2: planning out the Hanging Gardens of Babylon and generating ~400 equirectangular 360° images that stitch into a walkable, Google-Street-View-style reconstruction of a place whose real appearance nobody knows. Started at 1am London time, still running at broadcast. 'Reflex loops are dead.'

~400 equirectangular 360° images of ancient Babylon
GPT-5.5 orchestrated planning, coordination, and code
Topaz upscaling on Replicate for 4K fill-in
Alex: 'Street view of a place that doesn't exist'
Peter: 'It did exist — we just don't know what it looks like'

Peter Gostev

"I came up with this idea at about 1am London time, and it worked the whole night. It's been running about seven and a half hours on another task. Every time I check — seven hours. Literally seven hours. I can't even update the bloody app because it keeps running."

Alex Volkov

"You basically created street view of a place that doesn't exist."

Peter Gostev

"Well, it did exist — but we don't know what it looks like."

🎨 Claude Design — Figma Dropped 7%

Anthropic ships Claude Design on Friday as a research preview on Opus 4.7. It's not a Figma replacement, but it's magical enough that Figma stock dropped 7% at the news. Alex generated a full ThursdAI brand kit (logo, tokens, the opener videos for this episode) end-to-end in Claude Design — a flow Codex then used live to produce a GPT-5.5 launch video.

Research preview on Opus 4.7, claude.ai/design
Figma stock -7% at release
New usage meter added to Claude Max settings
Alex generated ThursdAI brand kit + opener videos with it
Companion: Codex picks up the kit, generates launch video in 9 min

Nisten Tahiraj

"I am kind of blown away by this design thing."

Ryan Carson

"We have crossed a new threshold. With the entrance of Claude Design plus GPT-Image-2, we are now in a spot where you can really begin to get professional design out of AI."

⚡ This Week's Buzz — W&B LEET TUI Workspace Mode

W&B LEET, the Weights & Biases terminal UI (TUI), ships workspace mode: multi-run comparisons, GPU metrics, and images rendered right in your terminal.

Multi-run comparison in the terminal
Live GPU metrics
Images rendered directly in TUI

Alex Volkov

"Everybody's like going home about TUIs. W&B also has a TUI — it's called LEET, and it now shows GPU stats inside the TUI, which is really really good."

📰 Recap & Outro

Almost four hours live, roughly 5,000 viewers, GPT-5.5 dropped mid-show, GPT-Image-2 reshaped image gen, Codex learned to use your Mac, Claude Design knocked 7% off Figma's stock, and two new open-source SOTA models landed. 'How could we not have covered everything?'

Almost 4 hours on air
~5,000 concurrent viewers at peak
Full coverage of GPT-5.5, GPT-Image-2, Codex CUA, Claude Design, Kimi K2.6, Qwen 3.6-27B, Privacy Filter, CrabTrap

Alex Volkov

"Crazy, crazy week in AI. With almost 4 hours live and almost 5,000 of you tuning in throughout, it's been a great show. Thank you so much for joining us."

TL;DR

Hosts and Guests
- Alex Volkov - AI Evangelist, Weights & Biases (@altryne)
- Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
- Peter Gostev (@petergostev) - Arena AI
Big CO LLMs + APIs
- OpenAI launches GPT-5.5 and GPT-5.5 Pro — SOTA across the board (Blog, Livestream)
- OpenAI GPT-Image-2 — biggest Arena Elo jump ever, thinking mode for images (X, Eval site, Livestream)
- OpenAI Codex — Background Computer Use + Chronicle (screen memory), hits 4M users (Chronicle)
- GPT-5.5 pre-launch leak in Codex dropdown (X)
- Anthropic Claude Design — research preview on Opus 4.7, Figma -7% (X)
- Anthropic resets all Claude quotas, admits degradation, allows OpenClaw CLI back (X)
- Anthropic ARR crosses $30B
- Google Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro (X)
- Google Gemini Enterprise Agent Platform (X)
- ChatGPT Agents “Hermes” leak — builder/studio + Slack integration (X)
- OpenAI clinician/medical model + workspace agents released
Open Source LLMs
- Moonshot Kimi K2.6 — 1T MoE, 32B active, SOTA open source on SWE-Bench Pro (X)
- Alibaba Qwen3.6-27B — dense 27B, Apache 2.0, beats own 400B flagship (X, HF)
- Alibaba Qwen3.6-Max-Preview on API (X)
- OpenAI Privacy Filter — 1.5B MoE, 50M active, Apache 2.0, runs in browser (X)
Tools & Agentic Engineering
- Brex CrabTrap — LLM-as-judge HTTP proxy for agent security (X)
- OpenAIDevs Euphony — open-source Codex session log visualizer (X)
This week’s Buzz - Weights & Biases
- W&B LEET TUI goes workspace mode — multi-run, GPU metrics, images in terminal (X)
Voice & Audio
- StepAudio 2.5 TTS — natural-language control of emotion and delivery (X)
Deals & Industry
- SpaceX/xAI <> Cursor — $60B acquisition or $10B collaboration structure

Alex Volkov 0:45

Hello, hello, uh, welcome to ThursdAI.

0:49

This is Alex Volkov coming to you live from Denver. It's a little bit later than we usually start, but I hope, uh, some of you who joined us on the livestream saw a few of the openers that were prepared by Claude and Hyper Frames. I'm gonna tell you all about this. Today is a big day. NS41. If that means anything to anyone here, then you are too connected to X. You need to leave your house and go touch some grass. Uh, but if it means nothing to you, uh, and if you're asking in our chats, what is NS41? Everybody's saying NS41 is today. Uh, then, uh, we'll tell you all about this, plus we have a huge show. And to help me through kind of explaining everything that happened in the world of AI today, let's bring up some cohosts here. We'll get Ryan Carson, who's back, Wolfram Ravenwolf, Yam Peleg, and LDJ. What's up folks? How are you doing? Let's start with our long lost brother, Ryan Carson. Welcome back, dude. What's up? Let's go

Ryan Carson 1:46

everybody.

1:47

It's so good to be here. I was in Japan with my family and I'm back. And

Alex Volkov 1:51

you are back and, and you chose a hell of a week to be back, man.

Ryan Carson 1:55

I'm excited.