Episode Summary
The week OpenAI went full throttle. GPT-5.5 dropped mid-show: SOTA across terminal-bench, SWE-bench, GDPval and frontier-math, using ~40% fewer tokens than 5.4. GPT-Image-2 posted the biggest Arena ELO jump ever (200+ points), generating functioning QR codes, perfect infographics, and 360° street-view images that Peter Gostev stitched into a walkable world during a 24-hour experiment. Codex now has real multi-cursor computer use on macOS plus Chronicle screen-memory. On the open-source side, Kimi K2.6 became Wolfram's best-ever open model and Qwen3.6-27B dense beat Alibaba's own 400B flagship. Oh, and Claude Design shipped, dropping Figma stock 7%.
The Week That Broke The Chart - Interactive Recap
In This Episode
- Intro & TL;DR - Week in Review
- Open Source: Kimi K2.6
- Open Source: Qwen 3.6-27B
- OpenAI Privacy Filter (Apache 2.0)
- GPT-Image-2 - Thinking Mode for Images
- Codex: Computer Use & Chronicle
- Brex CrabTrap - Agent Security
- BREAKING: GPT-5.5 Drops Live
- Peter Gostev Joins - First Impressions
- Peter's 24-Hour Babylon Street-View Experiment
- Claude Design - Figma Dropped 7%
- This Week's Buzz - W&B LEET TUI Workspace Mode
- Recap & Outro
Intro & TL;DR - Week in Review
Alex welcomes the full cohost lineup back (Ryan from Japan, Wolfram, Yam, LDJ, Nisten) and runs through the TL;DR. The headlines: OpenAI's week of dominance, with GPT-Image-2 shattering Arena and a GPT-5.5 leak via base64 in Codex ('Nous 41'); Claude Design crashing Figma stock; Cursor striking a $10B collaboration with xAI that carries a $60B acquisition clause; and two massive open-source drops from Kimi and Qwen.
- Full cohost panel reunion: Ryan back from Japan, everyone live
- Nous 41 = base64 for 'GPT-5.5': OpenAI leaked their own model in Codex
- Cursor <> xAI: $10B collab structure with $60B acquisition clause
- Anthropic crosses $30B ARR, resets all Claude quotas, admits degradation
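The base64 claim in the leak bullet is easy to check by hand. A minimal sketch, assuming the label in the Codex dropdown was the four characters `NS41` (our reading of 'Nous 41'; the exact string is not given in these notes). Under that assumption, decoding with Python's standard library yields the version number rather than the full model name:

```python
import base64

# Assumption: the leaked label was "NS41", i.e. how "Nous 41" would be
# spelled as a base64 string. The notes do not give the exact characters.
leaked_label = "NS41"

decoded = base64.b64decode(leaked_label).decode("ascii")
print(decoded)  # 5.5

# Round trip: encoding the version string gives the label back.
assert base64.b64encode(b"5.5").decode("ascii") == leaked_label
```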
Open Source: Kimi K2.6
Moonshot AI drops Kimi K2.6: a 1T MoE with 32B active parameters, 256K context, and a modified MIT license. It claims open-source state-of-the-art on SWE-Bench Pro at 58.6. Wolfram calls it the best open-source model he's ever tested on his private wolf-bench.
- 1T parameters MoE, 32B active, 384 experts, MLA attention
- 256K context window, modified MIT license
- 58.6 on SWE-Bench Pro (SOTA open source)
- Wolfram's best open-source model ever on wolf-bench
Open Source: Qwen 3.6-27B
Alibaba ships a dense 27B Apache-2.0 model that beats their own 400B flagship on every major coding benchmark. Plus Qwen3.6-Max-Preview on API. The dense-beats-MoE story keeps evolving.
- Dense 27B, Apache 2.0 license
- Beats Alibaba's own 400B flagship on coding benchmarks
- Qwen3.6-Max-Preview also live on API
OpenAI Privacy Filter (Apache 2.0)
OpenAI open-sources a tiny 1.5B MoE with only 50M active params: a privacy/PII filter that runs in the browser on WebGPU. Perfect companion for agent security stacks like Brex's CrabTrap.
- 1.5B MoE, 50M active params, Apache 2.0
- Runs fully in browser via Xenova's Transformers.js
- Designed to identify and remove PII in datasets
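Both MoE releases in these notes quote a total and an active parameter count. A quick sketch of what that ratio works out to, using only the figures above (1T/32B for Kimi K2.6, 1.5B/50M for the Privacy Filter); the helper name `active_fraction` is ours, not anything from either release:

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Share of a mixture-of-experts model's weights used for each token."""
    return active_params / total_params


# Parameter counts as quoted in the show notes.
models = {
    "Kimi K2.6": (1e12, 32e9),
    "OpenAI Privacy Filter": (1.5e9, 50e6),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of weights active per token")
```

Despite a size gap of almost three orders of magnitude, both models touch roughly 3% of their weights per token, which is why the Privacy Filter's per-token compute is closer to a 50M model than a 1.5B one.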
GPT-Image-2 - Thinking Mode for Images
The biggest jump in Arena ELO history: GPT-Image-2 sits 200+ points above the previous top model. It is a thinking/reasoning image model that generates functioning QR codes, renders equirectangular 360° images, produces photo-perfect character consistency (even Dario Amodei), and 'writes code' by generating screenshots of IDEs containing SVGs that actually render. Ryan is integrating it into his weekly marketing pipeline today.
- +200 ELO over prior top model on Arena (biggest jump ever)
- Functioning QR codes embedded in generated images
- Multi-image character consistency: can generate full manga pages
- 4K output, equirectangular 360° images (Peter's street-view hack)
- Generates pixel-perfect screenshots of IDEs with working SVG code
- New meta: GPT-Image-2 designs the UI, Codex implements it
Codex: Computer Use & Chronicle
Codex now has true background computer use on macOS: a second cursor that works while you work, running on its own thread. It's so good, 'any other computer use is computer useless.' Plus subagents each controlling different windows in parallel. And Chronicle: Codex takes a screenshot every 10 seconds and has total screen memory, so you can ask 'what was I doing an hour ago?' and it knows.
- Background cursor that doesn't take over your mouse, so it works while you work
- Multi-agent: subagents click in parallel windows
- Software Apps Inc. (ex-Apple Shortcuts team) acquisition paying off
- Chronicle: 10-second screenshots feed into Codex context
- Alex used it to auto-quote-tweet from a prompt, with verification
- OpenAI Codex passes 4M users
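To make the 'what was I doing an hour ago?' idea concrete, here is a minimal sketch of a timestamped capture timeline with a nearest-earlier lookup. It is not how Chronicle is implemented (the notes only say it screenshots every 10 seconds); `Capture`, `ScreenMemory`, and the text descriptions are hypothetical stand-ins for whatever Codex actually stores:

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional

CAPTURE_INTERVAL_S = 10  # from the episode: one screenshot every 10 seconds


@dataclass
class Capture:
    timestamp: float  # seconds since the start of the session
    description: str  # hypothetical: a text summary of the screenshot


class ScreenMemory:
    """Hypothetical append-only timeline of captures, ordered by time."""

    def __init__(self) -> None:
        self._captures: list[Capture] = []

    def add(self, capture: Capture) -> None:
        # Captures are assumed to arrive in chronological order.
        self._captures.append(capture)

    def at(self, timestamp: float) -> Optional[Capture]:
        """Return the latest capture taken at or before `timestamp`."""
        # Rebuilding the key list on every call is fine for a sketch.
        times = [c.timestamp for c in self._captures]
        index = bisect_right(times, timestamp)
        return self._captures[index - 1] if index else None


memory = ScreenMemory()
for i in range(720):  # two hours of captures at 10-second intervals
    activity = "editing show notes" if i < 360 else "watching the GPT-5.5 stream"
    memory.add(Capture(timestamp=i * CAPTURE_INTERVAL_S, description=activity))

now = 719 * CAPTURE_INTERVAL_S
an_hour_ago = now - 3600
print(memory.at(an_hour_ago).description)  # editing show notes
```

At a 10-second interval that is 360 captures per hour, so an always-on timeline grows quickly; a binary search over timestamps keeps the lookup cheap however long the session runs.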
Brex CrabTrap - Agent Security
Brex's CEO pair-programs with Codex and open-sources CrabTrap, an LLM-as-judge HTTP proxy that intercepts outbound agent requests, evaluates them against natural-language rules, and blocks risky activity. Wolfram changes his pick of the week on the spot.
- LLM-as-judge proxy for outbound agent traffic
- Natural-language rule definitions for risky behavior
- OpenClaw banned at CoreWeave; this is the enterprise fix
- Ryan: 'intelligence monitoring all traffic, absolutely going to happen'
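The LLM-as-judge proxy pattern described above can be sketched in a few lines. This is an illustration of the general idea, not CrabTrap's code or API: `OutboundRequest`, `RULES`, `keyword_judge`, and `proxy` are all made-up names, and the judge is a keyword stub standing in for a real model call so the example runs offline:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OutboundRequest:
    method: str
    url: str
    body: str = ""


# Natural-language rules as a policy author might write them (made-up examples).
RULES = [
    "Never send credentials or API keys to third-party hosts.",
    "Do not delete resources in production.",
]

Judge = Callable[[list[str], OutboundRequest], bool]


def keyword_judge(rules: list[str], request: OutboundRequest) -> bool:
    """Stand-in for the LLM call: returns True when the request looks risky.

    A real judge would send `rules` and the request to a model and parse its
    verdict. This stub ignores `rules` and matches a few keywords instead.
    """
    text = f"{request.method} {request.url} {request.body}".lower()
    return "api_key" in text or (request.method == "DELETE" and "prod" in text)


def proxy(request: OutboundRequest, judge: Judge = keyword_judge) -> str:
    """Forward the request only if the judge does not flag it."""
    if judge(RULES, request):
        return "blocked"
    return "forwarded"  # a real proxy would relay the request upstream here


print(proxy(OutboundRequest("GET", "https://example.com/docs")))                 # forwarded
print(proxy(OutboundRequest("POST", "https://example.com/log", "api_key=abc")))  # blocked
print(proxy(OutboundRequest("DELETE", "https://prod.example.com/db")))           # blocked
```

Keeping the judge behind a plain callable means the keyword stub can be swapped for a model-backed judge without touching the proxy logic.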
BREAKING: GPT-5.5 Drops Live
Mid-show, OpenAI ships GPT-5.5 and GPT-5.5 Pro. Terminal-Bench 2 jumps to 82.7% (from 75%), SWE-Bench Verified to 73%, and GDPval hits state-of-the-art, beating Opus 4.7 and Gemini 3.1. It uses 40% fewer tokens than 5.4, but pricing doubles to $5/$30 per million tokens, so the net cost per task rises ~20% (0.6 × 2 = 1.2) and intelligence-per-dollar drops accordingly. Alex gets it live in Codex and runs a computer-use quote-tweet in real time.
- 82.7% Terminal-Bench 2 (SOTA), up from 75% on 5.4
- 73% SWE-Bench Verified, 84% GDPval (state of the art)
- 40% fewer tokens at double the price: net ~20% more expensive per task
- $5 / $30 per million tokens; Pro: $30 / $180
- Live demo: computer use quote-tweeting in Chrome
- Not yet in ChatGPT; Codex-first rollout
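The token-versus-price trade-off is simple enough to work through. A back-of-the-envelope sketch, assuming the same task uses exactly 40% fewer tokens, that input and output prices both exactly double, and that the input/output mix is unchanged (the function name `relative_cost` is ours):

```python
def relative_cost(token_ratio: float, price_ratio: float) -> float:
    """Cost of a task on the new model relative to the old one."""
    return token_ratio * price_ratio


# 40% fewer tokens -> 0.60x the tokens; doubled pricing -> 2.0x the price.
ratio = relative_cost(token_ratio=0.60, price_ratio=2.0)
print(f"{ratio:.2f}x")  # 1.20x

# Token savings needed for a doubled price to break even on cost.
break_even_token_ratio = 1.0 / 2.0
print(f"{1 - break_even_token_ratio:.0%} fewer tokens to break even")  # 50% fewer tokens to break even
```

Under these assumptions a task costs about 1.2x what it did on 5.4; the token savings would have to reach 50% before the doubled price came out even.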
Peter Gostev Joins - First Impressions
Peter from Arena AI (ex-LMArena) joins with early-access impressions. The headline: 'This is the first time a model can actually properly do long-running tasks.' He queued up prompts overnight expecting them to finish by 3am; when he woke up, the first one was still running. It spent 8.5 hours on a single task, then 7.5 hours on another. 'Reflex loops are dead.'
- First model that genuinely sustains multi-hour coherent work
- Three long-running tasks going simultaneously
- Better conversational feel, less abrupt than 5.2-5.4
- Still needs iteration; vision reflection is lacking
- Front-end design: great with a spec, poor one-shot
Peter's 24-Hour Babylon Street-View Experiment
Peter's overnight project with GPT-5.5 + GPT-Image-2: planning out the Hanging Gardens of Babylon and generating ~400 equirectangular 360° images that stitch into a walkable, Google-Street-View-style reconstruction of a place whose appearance nobody knows. Started at 1am London time, still running at broadcast. 'Reflex loops are dead.'
- ~400 equirectangular 360° images of ancient Babylon
- GPT-5.5 orchestrated planning, coordination, and code
- Topaz upscaling on Replicate for 4K fill-in
- Alex: 'Street view of a place that doesn't exist'
- Peter: 'It did exist, we just don't know what it looks like'
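For anyone wondering why equirectangular images work as street-view panoramas: each pixel column is a compass direction and each row is an elevation angle, so a viewer only needs a linear mapping. A minimal sketch of that standard mapping (not Peter's code; the 4096x2048 size is an assumed example with the usual 2:1 aspect ratio):

```python
def pixel_to_angles(x: float, y: float, width: int, height: int) -> tuple[float, float]:
    """Map a pixel in an equirectangular image to (yaw, pitch) in degrees.

    Yaw runs from -180 at the left edge to +180 at the right edge;
    pitch runs from +90 at the top (straight up) to -90 at the bottom.
    """
    yaw = (x / width) * 360.0 - 180.0
    pitch = 90.0 - (y / height) * 180.0
    return yaw, pitch


# Assumed example size: 4096x2048.
print(pixel_to_angles(2048, 1024, 4096, 2048))  # (0.0, 0.0): image centre looks straight ahead
print(pixel_to_angles(0, 0, 4096, 2048))        # (-180.0, 90.0): top-left corner looks straight up
```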
Claude Design - Figma Dropped 7%
Anthropic ships Claude Design on Friday as a research preview on Opus 4.7. It's not a Figma replacement, but it's magical enough that Figma stock dropped 7% on the news. Alex generated a full ThursdAI brand kit (logo, tokens, the opener videos for this episode) end-to-end in Claude Design, a flow Codex then used live to produce a GPT-5.5 launch video.
- Research preview on Opus 4.7, claude.ai/design
- Figma stock -7% at release
- New usage meter added to Claude Max settings
- Alex generated ThursdAI brand kit + opener videos with it
- Companion: Codex picks up the kit, generates launch video in 9 min
This Week's Buzz - W&B LEET TUI Workspace Mode
W&B LEET (the terminal UI that has everyone talking about TUIs) ships workspace mode: multi-run comparisons, GPU metrics, and images rendered right in your terminal.
- Multi-run comparison in the terminal
- Live GPU metrics
- Images rendered directly in TUI
Recap & Outro
Almost four hours live, 5,000 viewers, GPT-5.5 dropped mid-show, GPT-Image-2 reshaped image gen, Codex learned to use your Mac, Claude Design crashed Figma, and two new open-source SOTA models landed. 'How could we not have covered everything?'
- Almost 4 hours on air
- ~5,000 concurrent viewers at peak
- Full coverage of GPT-5.5, GPT-Image-2, Codex CUA, Claude Design, Kimi K2.6, Qwen 3.6-27B, Privacy Filter, CrabTrap
TL;DR
Hosts and Guests
Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
Peter Gostev (@petergostev) - Arena AI
Big CO LLMs + APIs
OpenAI launches GPT-5.5 and GPT-5.5 Pro: SOTA across the board (Blog, Livestream)
OpenAI GPT-Image-2: biggest Arena Elo jump ever, thinking mode for images (X, Eval site, Livestream)
OpenAI Codex: Background Computer Use + Chronicle (screen memory), hits 4M users (Chronicle)
GPT-5.5 pre-launch leak in Codex dropdown (X)
Anthropic Claude Design: research preview on Opus 4.7, Figma -7% (X)
Anthropic resets all Claude quotas, admits degradation, allows OpenClaw CLI back (X)
Anthropic ARR crosses $30B
Google Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro (X)
Google Gemini Enterprise Agent Platform (X)
ChatGPT Agents "Hermes" leak: builder/studio + Slack integration (X)
OpenAI clinician/medical model + workspace agents released
Open Source LLMs
Tools & Agentic Engineering
This week's Buzz - Weights & Biases
W&B LEET TUI goes workspace mode: multi-run, GPU metrics, images in terminal (X)
Voice & Audio
StepAudio 2.5 TTS: natural-language control of emotion and delivery (X)
Deals & Industry
SpaceX/xAI <> Cursor: $60B acquisition or $10B collaboration structure