Episode Summary

This episode opens with a rare live-breaking OpenAI moment: GPT-5.4 Thinking and 5.4 Pro dropped during the show. The panel then unpacks a volatile week of AI policy and defense controversy, plus major open-source developments from Qwen and StepFun. They also cover GPT-5.3 Instant, Gemini 3.1 Flash-Lite pricing/performance shifts, and practical agent benchmarking insights from Wolfram’s new Wolf Bench framework. The back half turns into live testing and benchmark triage as the team compares GPT-5.4 directly against Opus and Gemini across coding, browsing, and reasoning tasks.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
Weekly co-host · Nous Research
@ldjconfirmed
Wolfram Ravenwolf
Weekly co-host · AI evaluator
@WolframRvnwlf
Ryan Carson
Weekly co-host · AI educator & founder
@ryancarson
Nisten Tahiraj
Weekly co-host · AI operator & builder
@nisten

By The Numbers

ARC-AGI 2 (GPT-5.4 Pro)
83.3%
Alex highlighted this as roughly matching recent frontier reasoning performance.
OS World / computer-use score
75%
Presented in the GPT-5.4 preamble as a major computer-use milestone.
Token usage reduction
47%
Zapier-reported tool-search optimization improvement mentioned in the preamble.
Context window
1M
GPT-5.4 launched with 1 million token context support in Codex workflows.
Gemini 3.1 Flash-Lite speed
360 tokens/sec
Discussed as a fast, efficient model in the same category as instant-tier offerings.
SWE Bench Pro (SWE 1.6)
51%
Cognition’s new SWE model performance cited in the TL;DR tools segment.

🔥 Breaking During The Show

GPT-5.4 Thinking and GPT-5.4 Pro dropped live during ThursdAI
OpenAI released GPT-5.4 mid-show, triggering immediate benchmark review and live coding/vibe tests from the panel.

🔥 GPT 5.4 Preamble

Alex opens with a direct recap of OpenAI’s surprise GPT-5.4 Thinking and 5.4 Pro release, framing it as a meaningful frontier-model update. He emphasizes unified reasoning + coding capability, strong benchmark claims, and live testing on the show.

  • GPT-5.4 Thinking + 5.4 Pro introduced as a breaking frontier release
  • Unified reasoning model positioned as codex-capability fold-in
  • Live test framing set before the main show intro
Alex Volkov
"They dropped a new frontier model called GBT five. Point four thinking and 5.4 Pro."

⚡ Welcome & Introductions

The panel opens the March 5 show with full co-host attendance and sets expectations for a dense, high-signal episode. Alex also acknowledges ongoing world events before transitioning into the agenda.

  • First show in March
  • Full co-host panel introduced
  • Tone set for a heavy AI-news week
Alex Volkov
"Welcome to ThursdAI my name is Alex Volkov."

📰 TL;DR

Alex speed-runs the week: Anthropic vs DoW fallout, Qwen 3.5 small releases, GPT-5.3 Instant, Gemini 3.1 Flash-Lite, SWE 1.6, Wolf Bench, and other tools/news blurbs. The section functions as a roadmap for the deeper discussion.

  • Anthropic/DoW conflict queued as top story
  • Qwen 3.5 small + Junyang context previewed
  • GPT-5.3 Instant and Gemini Flash-Lite positioned as fast-tier battle
Alex Volkov
"This is the TLDR. This is the section on Thursday."

🏢 Anthropic vs Department of War

The panel unpacks the fast-moving Anthropic-DoW saga: rejected requests, supply-chain-risk pressure, OpenAI stepping into defense deployment, and public backlash/optics shifts. They discuss how much is policy posture versus operational reality.

  • Anthropic says no to requests tied to surveillance and kill-chain concerns
  • OpenAI deal announcement triggers backlash and later amendments
  • Discussion includes legal/designation pathways and market implications
Alex Volkov
"Anthropic has said no."

🔓 Qwen 3.5 Small Models & Junyang Departure

The show covers strong Qwen 3.5 small-model performance and practical local-run viability, then pivots to leadership turbulence after Junyang’s departure post. The team frames this as both a technical and ecosystem-level story for open-source momentum.

  • Qwen 3.5 small models discussed as highly usable on consumer hardware
  • Junyang departure sparks major community and internal Alibaba response
  • Open-source continuity remains expected despite org changes
Alex Volkov
"Goodbye. My beloved Qwen."

🛠️ GPT 5.3 Instant

Alex and co-hosts review GPT-5.3 Instant as a free-tier baseline upgrade, with mixed reactions on quality and style. The discussion centers on when low-latency models matter in real systems versus where they still fall short.

  • OpenAI positions Instant as less cringey/more accurate
  • Panel sees improvements but still prefers other models in many workflows
  • Low-latency use cases remain valid (e.g., voice/real-time control)
Alex Volkov
"OpenAI rolls out GPT 5.3 instant."

⚡ Gemini 3.1 Flash-Lite

The team compares Gemini 3.1 Flash-Lite speed/cost dynamics against fast-tier competitors and practical agent needs. They note significant pricing changes versus prior flash-lite pricing and discuss where cheap fast models power orchestration.

  • Gemini 3.1 Flash-Lite presented as fast + 1M context
  • Pricing jump versus prior flash-lite discussed as material
  • Useful for judge/guardrail/orchestration style workloads
Alex Volkov
"Google launched Gemini 3.1 flashlight."

🧪 This Week's Buzz: Wolf Bench

Wolfram introduces Wolf Bench, a multi-metric evaluation framework based on Terminal Bench that emphasizes reliability and variance, not just single average scores. The segment highlights harness effects (Terminal Bench vs Claude Code vs OpenClaw) and reproducible benchmarking setup.

  • Four-metric view: average, best run, ceiling, and consistent floor
  • Harness differences shown as a first-class factor
  • Benchmark cost/transparency details shared publicly
Wolfram Ravenwolf
"One score is not enough."

🔓 Open Source: Step 3.5 Flash

The panel flags StepFun’s Step 3.5 Flash release as unusually open in both model and training-stack terms. They emphasize that continuation pretraining flexibility is a major practical unlock for builders.

  • Step 3.5 Flash highlighted for open training artifacts
  • Apache-2 orientation praised
  • Potential ecosystem impact discussed
Alex Volkov
"StepFun releases step 3.5, flash base."

🔥 BREAKING NEWS: GPT 5.4 Drops Live

Mid-show, OpenAI drops GPT-5.4 live, and the panel pivots immediately into hands-on analysis. They review announcement claims and begin direct testing inside Codex.

  • Live on-air GPT-5.4 announcement
  • Immediate benchmark and UX triage
  • Community reaction spikes in real time
Alex Volkov
"We have breaking news."

🤖 5.4 Benchmarks: OS World, Web Arena, Browse Comp

The panel reviews the newly posted benchmark deltas for GPT-5.4, especially computer-use and browsing tasks. They focus on tool-use efficiency, reasoning-effort curves, and practical improvements over 5.2/5.3 lines.

  • Strong OS World jump versus prior general model
  • Web/browse benchmark leadership claims examined
  • Reasoning-effort ladder interpreted live
LDJ
"Introducing GPT 5.4. That is the title of the blog post that open a I just dropped."

💰 5.4 Pricing & Availability

The team breaks down GPT-5.4 and 5.4 Pro pricing, noting modest output deltas but meaningful input increases and very high Pro output pricing. They also discuss 1M-context usage implications and cost management for eval runs.

  • Input pricing moved materially versus prior generation
  • Pro-tier output pricing flagged as expensive for heavy evals
  • 5.4 available across Codex surfaces first
LDJ
"The pricing... it's about the same for output price... For input price though, it's about 50% more expensive than 5.2."

📰 5.4 System Card & Safety

The conversation moves into system-card details, model variants, and availability behavior across interfaces. They also note real-time steering support and discuss implications for interactive workflows.

  • System card reviewed live
  • Thinking vs Pro distinctions discussed
  • In-flight model steering highlighted
LDJ
"They mentioned the ability to interrupt in ChatGPT while it's thinking."

🛠️ 5.4 Live Vibe Check: Mars Benchmark

Nisten’s Mars mega-structure prompt is used as a live stress test combining math, coding, and visualization. The panel reacts positively to output quality and trajectory realism versus prior runs.

  • One-shot Mars benchmark run in Codex
  • Visual + math quality judged in real time
  • Panel calls it best run of this prompt so far
Nisten
"I think this is the best one so far."

🛠️ 5.4 Live Vibe Check: Website Improvement (GPT vs Opus)

Alex compares GPT-5.4 and Opus behavior on a vague web-improvement prompt to probe practical instruction-following style. The discussion distinguishes benchmark strength from preference for intuitive product judgment under ambiguity.

  • Same prompt run on GPT-5.4 and Opus
  • Differences in interpretive behavior discussed
  • Prompt quality vs model intuition debate surfaced
Alex Volkov
"When we refer to GPT Codex... as autistic, this is what we mean."

🧪 5.4 vs Opus & Gemini: Benchmark Comparison

The hosts inspect side-by-side benchmark snapshots for GPT-5.4, Opus 4.6, and Gemini variants. They note where 5.4 Thinking leads and where Pro-tier data is needed for fair apples-to-apples comparisons.

  • Cross-lab benchmark matrix reviewed live
  • FrontierMath and browsing deltas called out
  • Need for like-for-like deep-think/pro comparisons noted
LDJ
"This is a comparison of Opus 4.6, to Gemini to GPT 5.4."

⚡ Wrap-Up

The episode closes with a concise GPT-5.4 recap and quick takes from the panel on adoption intent. Alex tees up next week’s three-year ThursdAI anniversary and points listeners to the newsletter for remaining items.

  • GPT-5.4 summarized as major general-model jump
  • Panel intent to benchmark and test further
  • Three-year ThursdAI anniversary preview
Alex Volkov
"GPT 5.4 thinking just dropped with 1 million context window support."
TL;DR of all topics covered:

  • Hosts and Guests

  • Big CO LLMs + APIs

    • OpenAI launches GPT-5.4 Thinking and Pro (X, X, X, X)

    • Anthropic, Dept of War and OpenAI walk into a bar

    • Alibaba Qwen departures: Friend of the pod Junyang Lin and Binyuan Hui both depart Qwen (X)

    • OpenAI Rolls Out GPT-5.3 Instant (X)

    • Google launches Gemini 3.1 Flash-Lite (X, Announcement)

  • Evals and Benchmarks

    • MarinLab shows degradation in Opus 4.6 (X)

    • BullShit Bench from Peter Gostev (X)

  • Open Source LLMs

  • Tools & Agentic Engineering

    • Cognition: SWE-1.6 preview (X, Blog)

    • OpenAI launches Codex app on Windows (X)

    • Google released Google Workspace CLI (X)

    • OpenAI released Symphony (GitHub)

  • This week's Buzz

  • AI Art & Diffusion & 3D

Alex Volkov 0:35
The reason I'm coming to you right now is today on the show,
0:38
OpenAI had some breaking news for us that they hadn't had in a while. They dropped a new frontier model called GPT-5.4 Thinking and 5.4 Pro. This is the new generalized model. We haven't seen a generalized model since GPT 5.2, and this model folds in the coding capabilities of GPT 5.3 Codex that we saw launched last month in February. We actually got to test the model out live on the show, and if you don't wanna wait till the end of the show, here's a summary of the most important things. Thanks to Nisten, we did some testing like we always do, and it was really fun to see the model break in real time. This model has taken everything that worked for GPT 5.3 Codex, the coding stuff, and folded it into a unified reasoning model. They're calling it their best model yet, and they always call their models the best one yet, but this one seems to back it up. The headline story for me was Bartos Naski, a Polish mathematician who goes by Nas Red on Twitter. He shared that 5.4 solved a research-level frontier math problem that he had been working on for about 20 years. He called it his personal Move 37. You guys remember Lee Sedol and AlphaGo? This is a big, big moment. He said the solution was very clean and nice and felt almost human. On ARC-AGI 2, the Pro version, the bigger of the two releases today, hit 83.3%, very closely matching the Gemini Deep Think that we saw a few weeks ago. And on OS World and computer use, this model scores 75%, which supposedly beats the human baseline on computer use. GPQA Diamond: 94%. The agent stuff is also interesting: folks from Zapier confirmed that this is the new state of the art for multi-step tool use, and their tool-search optimization cuts token usage by 47%, which is massive for anybody building agents or using OpenClaw. And there's also a really cool feature that I wanted to highlight: mid-thought steering.
We saw this on Codex before, but now even in ChatGPT you can interrupt the model while it's thinking and redirect its reasoning in real time, by confirming something or seeing that it's going down some path you don't want it to go. It's now in the ChatGPT interface, and this is a first for any production AI model that I've seen. It has a 1-million-token context window, $2.50 on the input, and then double that after 272k tokens. So it is 1 million context, the first one we've seen from OpenAI in production since GPT 4.1. It's cheaper than Claude Sonnet 4.6, and reportedly on computer use it beats Claude Opus 4.6. The community sentiment on this is basically: this is the model, and you don't need to choose anything else. As always, after the initial few days of excitement, we see whether that's really the case, but we tested it live and we'll get into the details. This is a significant drop from OpenAI. Let's get into ThursdAI.
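
The tool-search optimization Alex mentions (cutting token usage by only sending relevant tool schemas) can be sketched roughly like this. A toy illustration, not Zapier's or OpenAI's actual implementation; real systems typically rank tools with embedding search rather than keyword overlap, and every name here is hypothetical.

```python
# Toy "tool search": rank tools by keyword overlap with the task and send
# only the top matches to the model, instead of the whole catalog.
def select_tools(task: str, catalog: dict[str, str], top_k: int = 2) -> list[str]:
    task_words = set(task.lower().split())

    def overlap(description: str) -> int:
        # Count shared words between the task and the tool description.
        return len(task_words & set(description.lower().split()))

    ranked = sorted(catalog, key=lambda name: overlap(catalog[name]), reverse=True)
    return ranked[:top_k]

catalog = {
    "send_email": "send an email message to a recipient",
    "create_event": "create a calendar event with a date and time",
    "search_docs": "search documents in a drive folder",
}
print(select_tools("email the report to the team", catalog))
```

Shipping two schemas instead of hundreds per request is where the large token savings for multi-step agents come from.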
Alex Volkov 3:32
What's going on everyone?
3:32
Welcome to ThursdAI! My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. Today is Thursday, March 5th. This is our first show in March, and our last show before our three-year anniversary. Can you believe it? And yet another very full week here in the world of AI. To help me cover this, I have my trusted co-hosts, LDJ and Wolfram and Ryan Carson. Welcome folks to the show. How is your week? There's a lot to talk about.
Ryan Carson 4:05
So much to talk about.
4:05
I'm excited. Good to be here.
Alex Volkov 4:07
Yeah, good to be here.
4:08
LDJ. How are you?
LDJ 4:10
I'm doing Swell.
4:12
There's a lot of exciting, possibly upcoming things very soon.
Alex Volkov 4:15
Yep.
4:16
And, Wolfram, how about you? How are you doing?
Wolfram Ravenwolf 4:18
Amazing weekend.
4:19
Very busy week for me, so I couldn't even keep up with the AI news very much, but that's why we are here, to bring people up to speed on the latest.
Alex Volkov 4:27
Yeah, a hundred percent.
4:29
So this is why we're here. And just a shit ton of news, not only on the AI front but in the world as well. We're not gonna cover that beyond acknowledging it: I can say for myself, and probably for everybody else here, that we hope folks are okay wherever they are, especially civilian folks. Folks, I think it's time for the TL;DR. So let us catch you up. This is the TL;DR, the section where we run through everything we have to talk about in the show, so that you'll be caught up to date, and if you're interested in any part, just stick around with us. Let us catch you up on everything that has happened since last week, since you listened to ThursdAI, because I think a lot has happened. Let's go. TL;DR.
5:22
All right, this is the TL;DR, the section on ThursdAI where I run through everything that happened that was of importance in the world of AI. With you, Alex Volkov, AI evangelist from Weights & Biases, your host. Today we have a full panel of co-hosts: Wolfram Ravenwolf, Yam Peleg, Ryan Carson, Nisten, and LDJ are all here, folks. The biggest story from the last week is obviously where we start: Anthropic and the Department of War. If you don't remember, a brief reminder: Anthropic got an ultimatum from the Department of War, previously the Department of Defense and the Pentagon, to remove some restrictions, and there was a whole back-and-forth. A week ago on the show, we were waiting for Anthropic's answer. Well, Anthropic did answer, and then the Department of War also answered, and then OpenAI came into the mix. So we're gonna cover all of this very soon after the TL;DR, because I think it's very important to talk about. Also from this last week, Alibaba Qwen is back in the news, and they released Qwen 3.5 Small, a series of small models; for the past three weeks, Qwen has been consistent in releasing series of models. But Qwen was not in the news only because of their models. They were in the news because our friend of the pod Junyang Lin posted publicly on Twitter that he has quit the team, causing an uproar of huge proportions, enough so that Alibaba's CEO addressed the whole company, stepping in to maintain Alibaba's open-source commitment. This story is still developing, but it's big in the world of open source and people are talking about it, so we'll catch you up with what we know. Also this week, OpenAI rolls out GPT 5.3 Instant. OpenAI claims it's less cringey and more accurate, and there are some safety trade-offs, et cetera.
But Instant is the model that is served for free to people who don't pay for ChatGPT, so, you know, it's not the best one. Google also launched Gemini 3.1 Flash-Lite, which is kind of in the same category, right? We're seeing these models land in the same category. Gemini 3.1 Flash-Lite is very fast and efficient, at 360 tokens per second, so we're gonna talk about that as well. In open source, there's a bunch of news. Obviously, we already mentioned the Qwen 3.5 Small model series. They have native multimodal capabilities and they rival models 30x their size. It's always fun to see small models, because in the world of open source, the models we actually run on our own hardware are the small ones. We cannot run the 1-trillion-parameter UAN 3.0 that also released this week, a 1-trillion-parameter open-source MoE, but we can run Qwen 3.5 Small in all its variations, and we're gonna talk to you about them. Also, folks from StepFun released Step 3.5 Flash, with the full training codebase, under Apache 2.0, which we always celebrate. They claim it's the most open foundation model released from a Chinese AI lab, because they have all the training code there as well, and Apache 2.0. So that's it, quickly, on open source. In tools and agentic engineering, Cognition released SWE-1.6. Cognition, the company behind Devin and Windsurf, released SWE-1.6, their fine-tune, and it is very, very fast. I think it's powered by Cerebras, at something like 950 tokens per second, achieving 51% on SWE-bench Pro. We talked about this last week: SWE-bench Verified is no longer relevant, and OpenAI isn't gonna report on it; SWE-bench Pro is kind of the new software-engineering benchmark. So Cognition released a preview of their super fast and cheap model as well. And OpenAI launches the Codex app on Windows. In This Week's Buzz, we have an early preview for you.
Wolfram is gonna get into it very, very soon: something we call Wolf Bench, how different AIs perform in different harnesses. This is a very interesting, very up-to-date segment that you should listen to, because we went deep to figure out how to test these models and whether we should trust just one score. I don't think we have much in vision and video.
Nisten 9:10
We did miss something, and it's called BullShit Bench.
Alex Volkov 9:15
Yes.
9:16
Peter Gostev's BullShit Bench. Please tell us about this.
Nisten 9:19
Yeah, it was just too busy.
9:20
So it asks something about a restaurant that wants to change a recipe or something, but then it adds a lot of corporate jargon to the question, saying the fire-code regulations will conflict with our proprietary thing. This is my favorite benchmark, because it shows everything I dislike about the Gemini models: they will always reply with a corporate executive summary regardless of what they're asked.
Alex Volkov 9:52
Yeah.
9:52
So this is BullShit Bench, from Peter Gostev, part of the Arena team now, and a friend of the pod as well. Thank you, Nisten. Wolfram, we had another one that we missed.
Wolfram Ravenwolf 10:00
it's a small thing.
10:01
Maybe it's not even directly AI, but it is relevant for AI, and I think it's interesting that it has been released: Google released the Google Workspace CLI, which is a command-line tool. Yes,
Alex Volkov
Alex Volkov 10:11
yes.
Wolfram Ravenwolf
Wolfram Ravenwolf 10:12
Google Mail, drive, calendar and so on.
10:14
So us OpenClaw people have been using tools like this; Peter Steinberger made the go CLI, and now Google did an official tool. And I hope this is the first of many and an inspiration for others to follow. It's a change in mindset: Google now accepts that agents will be using their tools, all of their tools, and basically supports this. And that is a great thing, and I think there should be more of this, so I see it as a great sign.
Alex Volkov 10:42
I think it's absolutely a great sign. CLI: command-line interface.
10:46
For folks who are listening who don't know what this means: I'm trying to make this approachable. We're at the top of this AI wave, and many people listening are like, I have no idea what they're talking about, so we'll try. Google released a set of tools for your agents to do tasks for you in Google Workspace: read Gmail, read documents. Everybody's super, super happy. For some reason, before this there was not one unified way to do it, despite the fact that Google is everywhere. So now there is, and it's very, very exciting. And there are other things in tools as well. Ryan, I sent you a thing.
Ryan Carson 11:18
Yeah.
11:18
So it's called Symphony, and thank you for sending it to me. Basically, it's an orchestration layer, very similar to the code factory that I'm trying to build, and it's fascinating, and it's created by OpenAI. It's funny you mention it: I'm literally talking to Codex about installing it right now. So we may see all the labs do this. And I was literally in the Slack this morning talking to the OpenAI folks, saying we really need this code-factory orchestration layer built into Codex itself.
Alex Volkov 11:47
I wanna highlight option one in how to install this.
11:51
Tell your favorite coding agent to build Symphony in the programming language of your choice: implement Symphony according to the following spec, plus the link to the spec. So basically this is a new type of software release. They release a spec that you can hand to your agents and say, hey, build this. This is a crazy world we live in, folks. That was the TL;DR. Let's dive into the actual news, because there's a lot to cover, and we could go on and on, but some of this is very, very important. So I think we'll start with the national security update, not the war stuff in Iran, because that's happening and we'll keep it away from the show, despite the fact that our friend here Yam may need to run to a shelter at some point. Last week we told you about Anthropic and the Department of War and that whole thing that's going on, and we were waiting for the answer. And Anthropic has said no. Anthropic replied to the Department of Defense back on Thursday, after our chat, and said: we cannot in good conscience accede to these requests. The requests were, again: do not spy on US citizens, and do not put Claude in the middle of a kill chain without human intervention, for autonomous weapons. Apparently those were the two terms. Now, it gets significantly crazier since then, day by day. The stick behind this was a potential designation of Anthropic as a supply-chain risk, which would mean everybody who does business with the government could not use Anthropic for work touched by that designation. It's a law that has never been used against a US company, and Anthropic says they will challenge it in court. On Friday, US President Trump tweeted that they're, like, super woke and left, and that this is why they don't wanna do this, and that he's asking all of the US government to stop using Anthropic in the next six months.
Friday evening, this escalated with Pete Hegseth, Secretary of Defense, saying he will designate Anthropic as a supply-chain risk so nobody can use them. On Saturday, the US went to war with Iran, and there were reports saying that Claude was used in those attacks via Palantir. So despite the posturing and everything on Friday, it has been used in production, because the government doesn't move as fast as turning something off; they have processes, and Claude is built in everywhere. What else happened? On Friday evening, Sam Altman posted: hey folks, we have reached an agreement with the Department of War, announcing that OpenAI and the Department of War made a deal to deploy OpenAI models instead of Anthropic models, with an agreement in place with the highest restrictions they have ever done. On Friday, at least as far as I saw, everybody was pro-Anthropic and anti-OpenAI, because of the supposed moral stance Anthropic took in the face of the $200 million contracts. That was Friday. And there was so much backlash, at least on Twitter, against OpenAI that hashtags like quit-OpenAI and delete-OpenAI started trending. Multiple people showed that they were moving to Anthropic; it looked like this whole endeavor made Anthropic so much money that losing the $200 million deal with the DoW barely matters to them at all. Like, they made a lot of money because of this. Sam Altman later, I think Saturday, did an AMA on Twitter and acknowledged that the release was definitely rushed and the optics did not look good, and on Monday they amended the deal with the DoW to add surveillance and weapons prohibitions after the backlash. So there was a backlash, and people started leaving OpenAI, deleting and canceling their accounts, et cetera, because supposedly OpenAI acceded. Let's talk about this, folks. What do you think? Yam, I think you have some comments. Go ahead.
Yam Peleg 15:44
Okay.
15:45
So, basically, I wanna talk about, you know, the actual substance. Like, okay, I get all the moral points and so on, but what exactly are they using Claude for, or wanting to use Claude for? And also, to the best of my knowledge, we're talking about Sonnet. I'm not even sure it's Sonnet 4.6; I mean, it's a very outdated model, and
Alex Volkov 16:11
We actually think it's 4.5; LDJ and I looked it up.
Yam Peleg 16:14
What are we even talking about?
16:16
I mean, to the best of my knowledge, Claude is not piloting helicopters and so on. So what exactly is Claude being used for in the war in Iran at the moment?
Alex Volkov 16:32
We have comments saying that the Palantir
16:34
stack uses Anthropic's Claude.
LDJ 16:37
Yeah.
16:37
So according to the Washington Post, and this is me quoting directly from them, it, quote, "suggested hundreds of targets, issued precise location coordinates and prioritized those targets according to importance," according to two of the people, end quote. And it seems like also just general planning: helping expedite the planning process, thinking through different battle scenarios, and basically helping decide what would be the optimal plan for a given scenario.
Alex Volkov 17:08
Yeah.
17:09
Nisten, go ahead. What's your take on this?
Nisten 17:12
Look, at the end of the day, the US Army and Air Force are
17:18
primarily logistics companies, and they were some of the world's best logistics companies; that's what enables them to do their missions. So it's not so much about, I actually think prioritizing targets and stuff is a very minor part of what Sonnet is being used for. There's just an entire outdated software stack there that needs fixing, like all the parts coming in, the software from all the different providers; these are all public complaints that have happened. So how are you gonna fix that? You're gonna use some kind of agentic AI tool. I got a lot of work done with Sonnet 4.5, so if they had a less nerfed version or a less quantized version, that would probably still be a lot better than most of the other tools.
Alex Volkov 18:12
just for removal of doubt.
18:14
Nisten is Canadian; he does not work for the Department of War. The work he refers to is his own work, unrelated to any government stuff.
Nisten 18:22
Yeah.
18:23
These are all public complaints about, like, what the F-35 software stack was and how to manage all the parts. And even the US Army itself has adopted open-source, or, like, internal open-source, philosophies, just to make a lot of the software more compatible with each other. So again, this is a logistics organization that has very good logistics, and they run on a very messy mix of software, as it publicly seems from public complaints. So you do need to fix that software, and you need a very good AI model, and honestly, I think Sonnet is pretty good for that.
Alex Volkov 19:03
Yeah.
Ryan Carson 19:03
I just think this is all a bit of posturing; it isn't even real.
19:06
Like, are they gonna literally go out and scrape everyone's API keys off their machines so they stop using an Anthropic model? It's so not practical. It's clearly said by people that don't use this technology, that don't understand what engineers actually do or how they do it, and it's just stupid. And also, why would the US cut off its nose to spite its face? Use whatever models you can. It just kind of annoys me.
Alex Volkov 19:37
A lot of people looked at Anthropic as kind of the moral
19:40
stance, the people that don't kowtow to the government's demands, et cetera. And I'm gonna try to stay as neutral as possible here, because I think everybody's wrong or everybody's right; I don't know. A lot of people read it as Anthropic saying, hey, we agreed to a legal framework, and this legal framework does not make sense anymore, and they attributed a moral stance to Anthropic. And we know from our show, from conversations with Anthropic, they're very big into safety. Apparently this is not about safety at all. Two things about this. One, apparently on Friday there was a memo that Dario Amodei posted internally, which leaked yesterday. That leak shows incredible language: Dario specifically targets Sam Altman and his antics and shenanigans, calling everybody on Twitter morons. I called it out on Twitter: the rollercoaster of hey, Anthropic is based because they have the best model; Anthropic is horrible because they told OpenClaw to change their name; Anthropic is the best because they released Claude Code. The back and forth, how quickly everybody decides Anthropic is the best or the worst, is just giving me whiplash. The latest thing, though, is that according to the Financial Times, Anthropic's chief is back in talks with the Pentagon about the AI deal. Not only that.
Multiple tech companies reached out to the government and said, hey, we're worried when you take a company off the private market because of the stuff you want them to do, designate them a supply chain risk, and potentially talk about invoking the Defense Production Act to take them over, nationalize them. That's something China would do, not something we in the US should do, because we have private markets that are separate from the government. The publicity from Anthropic was very interesting. It was the best campaign they ever did, I think: way more money and people and an influx of new registrations than the Super Bowl commercial they did knocking OpenAI. So there's definitely that. We also see Anthropic coming very, very close to 19 billion run rate, more than doubling in three months from 9 billion. It was confirmed that Anthropic hit 19 billion plus in annual recurring revenue, more than doubling from three months ago. So enterprises are signing up. Many of them probably work with the government, and some of them, because of this designation that Anthropic is aiming to challenge in court, may not be able to. So this is the saga, I think. Last comments, folks, on the saga; we just wanted to give an update. It looks like Anthropic is still in the chats with the Pentagon. So all of this posturing may mean nothing, it's just for show publicly, and eventually they will still work with the government, because how could they not? The US needs the best tools. Why wouldn't they? I see.
Nisten
Nisten 22:41
Sam Altman had to fall back and, say a few things about what safety
22:47
precautions they're taking when it comes to national surveillance, because they have, what, I don't know what percentage of the US population as users. And that was pretty funny to me, because no one believed him when it came to that. And yeah, I don't think that worked out as well as he was hoping it would.
Alex Volkov
Alex Volkov 23:10
Yeah.
Nisten
Nisten 23:10
So leave it at that.
Alex Volkov
Alex Volkov 23:12
Okay.
23:12
We're strongly not into politics, but we have to be, because AI is getting involved. The last thing I'll say: there was a phone call with President Trump, and he said, I fired Anthropic. Which is just a funny way of saying this: I fired Anthropic. LDJ, go ahead, and then we'll move on to Qwen, because there's some drama there as well.
LDJ
LDJ 23:27
Yeah.
23:27
Just some more recent updates on the supply chain risk designation. Actually, as of the past 48 hours, they're focusing less on that, and it seems like they're maybe focusing more now on designating them under the Defense Production Act, which is almost the extreme opposite: it would basically force Anthropic to work with the government, essentially.
Alex Volkov
Alex Volkov 23:53
it is crazy.
23:53
But did anybody think differently? These companies are building ASI; it's a matter of national security. Did anybody think that the government would not step in at some point and take over? I know many people don't want that, many people are libertarian, for example, but this is not the world we live in. And the comparisons to China were very interesting, because essentially there's no need for any of these laws in China for the Chinese government to take over any of the AI labs. I think we have covered this. Folks in the comments, if you want to give us a comment about this, we would love to hear it. What's your take? This is a developing situation and obviously we don't know all the details. I would really recommend folks go and try to read Dario Amodei's leaked memo, because he definitely did not mean to share as much publicly as he shared in it. One choice quote: Twitter morons may believe some of Sam Altman's antics, but he hopes that no member of the gullible staff at OpenAI does. Honestly, I've updated a few of my stances after this leaked memo. That's
Yam Peleg
Yam Peleg 25:00
insane.
25:00
That's an insane quote.
Alex Volkov
Alex Volkov 25:05
Yeah.
25:06
It's really funny that many people from OpenAI just changed their bio to Twitter moron, which is really funny. Alright folks, I think it's time for us to move on to the next thing, which is kind of taking over Alibaba. We'll talk about this in open source... actually, it's worth covering now. Alibaba's Qwen Lab. Alibaba has multiple AI efforts, which is going to be relevant. So the Qwen team at Alibaba released the Qwen 3.5 small model series, with native multimodal capabilities rivaling models 13x their size. The small model of Qwen 3.5, at 9 billion parameters, is beating, let's take a look here, GPT-OSS 120B on multiple benchmarks. Basically, they're saying the 9 billion parameter Qwen 3.5 competes with GPT-OSS 120B, the open source series of models OpenAI released back in the summer of last year. They're also natively multimodal, right? So you can use these models on video and documents. Video-MME is at 84% for these models, or at least for the 9 billion parameter model, and GPQA Diamond is at 81%. This is a great series of models that can run completely on your device and potentially do some stuff on your device. Now, in the middle of releasing this, a day after releasing... do you guys want to comment on the actual models for a quick sec? Yeah, go ahead.
Nisten
Nisten 26:33
I'll quickly say, right now the 9B model is the most popular
26:37
model trending on Hugging Face. And I don't have the speeds for that, but the 27B, which is three times larger, people are running it. The model was
Alex Volkov
Alex Volkov 26:46
released last week,
Nisten
Nisten 26:47
People are running it on a 3090 card, which you can still
26:50
probably get for like 900 bucks, and they're getting 35 tokens per second in the beginning. Then, as the context fills up after a hundred thousand tokens, they're still getting 15 tokens per second. The architecture allows that, and that is very usable now. And that model had some of the best scores on Artificial Analysis.
Ryan Carson
Ryan Carson 27:14
Yep.
Nisten
Nisten 27:14
So yeah, I think it's an important threshold
27:20
that's being crossed here: what you can do with a $1,000 GPU budget just crossed into being usable for agentic stuff.
Alex Volkov
Alex Volkov 27:32
Yep.
Nisten
Nisten 27:32
You can feed videos to them.
Alex Volkov
Alex Volkov 27:34
All of the Chinese labs, besides the big whale one that we
27:38
keep waiting for to hopefully launch at some point, mostly released text-only models. Alibaba's Qwen is specifically a multimodal one, I think. Kimi K2.5, the last one, is also multimodal, right? But most...
Nisten
Nisten 27:54
You need a lot of GPUs to run Kimi on your own.
Alex Volkov
Alex Volkov 27:56
Yeah, it's a 1 trillion parameter model.
27:59
And Qwen is multilingual, multimodal, with 262K context, and the 9 billion parameter one you can run on your device. The Qwen series of models is absolutely incredible; Qwen is almost single-handedly holding up open source. We love open source, but we also love being able to run these on our laptops. So, you know, a 4090 is nice, but running a 9B parameter model on a MacBook via LM Studio is much, much nicer for certain things. Now, with that said, they also released other sizes: a small one at 0.8 billion parameters, a 2 billion, a 4 billion, and this 9B is the flagship. 9B is the sweet spot: just intelligent enough to do some tasks, but also small enough to run on most laptops with decent speed. Very good agentic tool use and API calls. Anybody else play with the small one? Alrighty. So with that said, a day after this release, our friend of the pod Junyang Lin, who was the tech lead for Qwen, posted on Twitter: goodbye, my beloved Qwen. This tweet reached, I looked yesterday, 5.7 million views, which just shows, as Nisten said, how popular this model has been; it was the most trending model on Hugging Face for a long time. Qwen has been carrying the torch of open source, and Junyang, with his, I think, seven appearances on ThursdAI (we invited him today, by the way, but I think he's otherwise preoccupied), is kind of the flag bearer for Qwen. He basically made Qwen Lab what it is, he took over, he's one of the youngest at his level at Alibaba, and now he posts goodbye, my beloved Qwen. And then everybody started freaking out about what's going on. Very closely after this, Binyuan Hui, or something like this, also a member of the technical staff there, posted me too as well. And so, with the other departures of three weeks ago, everybody started speculating about what's going on.
Somebody posted that they know it's not his choice. This announcement from Junyang caused so much shock and awe across the ecosystem that apparently the CEO of Alibaba convened an internal meeting a day after the announcement, to talk to people and say that Qwen is still remaining. So apparently there's no firing. But based on reports from Chinese outlets, this story completely broke through the bubble as well. It's not just our own ecosystem, where somebody gets fired or something; this story absolutely broke out. A lot of the Chinese news outlets are reporting on this. Alibaba is very big in China, and open source is very important to them, so this post was seen as: hey, maybe there's not going to be any open source anymore. But based on 36Kr, there was reporting that said Alibaba commits to open source and open source will continue. This was a dispute over who is going to consolidate which parts. The Qwen team was not the only AI team; we know there's Tongyi, there are a few image models, Qwen Image, different from the Qwen team. So apparently this was just a conversation about who and where the researchers are going to get allocated. Some comments from the meeting they had: the Qwen team is only about a hundred people. With all the success of the Qwen models, the over 120 models they released in open source, it's only a hundred people. A hundred people, 120 models. They have complained to the bigwigs at Alibaba that it's harder for them to get resources, GPUs to train their models, et cetera, than some of their clients. And they did great work despite that, so it's unclear where they're going. But as of yesterday, I think it's confirmed that Junyang's resignation was accepted, and the CEO of Alibaba is now co-leading Qwen Lab directly. I don't know what that means, if it's good or bad; we'll see.
But they're basically claiming that the Qwen team is larger than one man, despite the fact that Junyang did a lot of work, including evangelizing on our show. We've followed his career for a while here. So shout out to Junyang; we hope you land somewhere where you'll have great impact. I personally hope it's another Logan Kilpatrick situation, and Junyang is going to take all of the good faith he got from the community to another lab. We'll see. But this is the update on Qwen's departures. So no, Qwen is not going away, doesn't look like it, but our boy Junyang is not going to represent Qwen on the show anymore. Folks, any comments on this, on how sometimes the people behind the AI are kind of more important than the AI itself?
Ryan Carson
Ryan Carson 32:36
I mean, as someone who's done DevRel for 25 years, I
32:40
think people are very important, and we trust people. Think about how much Logan has done for Gemini; all of us use Gemini because of Logan. Yes, the technology is amazing, but it's the person, you know. I think Roman has done an amazing job at OpenAI, for instance, really building those relationships, and a lot of people underneath him, like Dominik and other folks. So it's the people. Yeah. So this is a big blow for them. I don't know what happened, though.
Alex Volkov
Alex Volkov 33:05
Yep.
Wolfram Ravenwolf
Wolfram Ravenwolf 33:07
I'm most curious what is happening to the people
33:09
now: where they are going, whether they're forming a new company or joining another one.
Alex Volkov
Alex Volkov 33:15
we reached out for a comment and we did not get one.
33:17
Once we get one, we'll update you folks. LDJ, go ahead, and then Nisten.
LDJ
LDJ 33:21
It's like how, for the mainstream population, things like football or basketball work.
33:25
You know, in the news, a lot of the time, sometimes even more often than the scores of the game, you'll hear: oh, LeBron just signed this contract for hundreds of millions of dollars, or this and that. It just seems like this interesting thing where a lot of the entertainment truly is just the identities of the people involved.
Alex Volkov
Alex Volkov 33:44
This is kind of like sports.
33:45
The talents moving around, et cetera. Nisten, go ahead, you had a comment as well.
Nisten
Nisten 33:50
So there are a few things that could
33:53
have happened here, but I'll narrow it down to three. Either a new company opened up and the three guys got a much better option, which could be like an Ilya type of situation; so there's one. Then the other one is what we saw from some of the people at Kimi. They were describing it as, you know how Google has level one to level seven engineers, staff engineers and stuff? Well, at Alibaba, when it comes to the executives, because they're all engineers in China, there are 14 levels, and Junyang was a level 10 there.
Alex Volkov
Alex Volkov 34:32
the youngest level 10, that's what I read somewhere.
Nisten
Nisten 34:34
So it is likely... we saw, for example, with the GLM model, GLM 5,
34:39
they announced, I don't know how true it was, that it was all trained on internal Huawei chips. So this could be a situation where they're forced to not use Nvidia cards anymore for some arbitrary reason, and then they just can't get their work done, and all these departments got consolidated. So there's the second one. And the third one, which is the spicy one: they might have just been forced to use Qwen VL models for killer drones, and they didn't want anything to do with that. A hundred
Alex Volkov
Alex Volkov 35:11
percent speculation, Nisten. We'll move forward.
35:14
OpenAI rolled out GPT-5.3 Instant. Basically, if you have used OpenAI's ChatGPT and decided to stop because it's dumb, you probably played with the Instant 5.2 version. If you don't select the thinking versions in ChatGPT, you are playing with the instant models, and they're not great. The new one is a little bit better at creative writing, still not amazing. But OpenAI basically posted on their socials: we hope this model is less cringe, let us know what you think. And it interprets typos better; the previous one would over-obsess about typos. Then they claim 26% fewer hallucinations on web search, and a 20% reduction. I played with it just a little bit, but the honest truth is, besides Codex for coding, I do not use OpenAI's ChatGPT anymore. And that's a big change, not only for me; I think for many, many people. People prefer Claude. I know this from friends of mine who discovered Claude recently. Not Claude Code, just Claude.ai, just the chatbot. And I don't know what happened with ChatGPT, but basically it looks like they're focusing their stuff on coding. Again, we're waiting for the 5.4 model, which supposedly brings some changes, but the 5.3 Instant, the model they roll out to everyone for free, was updated. It's a little bit better at creative writing, but that's basically all of the vibes I got from this model. Do you guys have any responses to this model, compared also to Spark, the fast model they put up on Cerebras? Codex 5.3 Spark kind of answers the fast question. So why would anyone use this, besides the 900 million active folks on ChatGPT's platform?
Nisten
Nisten 37:09
you have to keep in mind that when it comes to voice mode,
37:11
OpenAI is still the best one. So if you want to take a walk down the street and you want to hear something, instant response is pretty crucial to that.
Alex Volkov
Alex Volkov 37:22
Yeah.
Nisten
Nisten 37:22
Personally, I don't like OpenAI or Gemini responses at all.
37:26
I prefer Kimi and Opus. But again, I haven't tried the model, but I can tell almost instantly, reading stuff all over the internet, I can tell right away when it's Codex or an OpenAI model writing something. It's just that very annoying language, which I don't want anywhere in my apps.
Alex Volkov
Alex Volkov 37:50
LDJ, you have a comments on this as well?
LDJ
LDJ 37:52
for the instant models, for a while now, I feel like Claude has
37:54
definitely been the best there. But this does seem like a noticeable improvement, especially in things like hallucination rate. When people say things like big model smell, I feel like a lot of that actually comes down to total hallucination rate. But especially for things like very low latency requirements for a specific use case: for example, doing logistics with actual mechanisms in the real world, or robotics experiments where it needs to react really quickly to stimuli happening in the real world. That's where a really fast text response might be applicable. But then again, you mentioned Codex Spark, and I imagine even that would probably be better for this, unless there's a big price disparity in the API. So yeah, I think it is a bit confusing.
Alex Volkov
Alex Volkov 38:49
So this is the model that ChatGPT gives to users for free.
38:53
So this is ChatGPT for most users. Most of the world has not even heard of AI yet; those who have probably use the free versions. Maybe a very small percentage are actually paying for the regular plans and able to switch models, right? So whatever we want to call this, and whether or not we care, this is an upgrade for some people. Speaking of faster models: Google launched Gemini 3.1 Flash-Lite, and this one is completely different, right? This is the fastest and most cost-efficient model in the Gemini 3 series, and it comes with a 1 million token context window. And it's fast, at 300 tokens per second, very, very fast. Not Cerebras fast, right? We know some of these models are hosted on specific chips, like Cerebras, and do over a thousand tokens per second, and last week we told you about models from Etched, on dedicated hardware, at 15,000 tokens per second. Gemini is not that fast, but it's really, really fast as well. Gemini 3.1 Flash-Lite, they call it the fastest and cheapest, scores 86.9 on GPQA Diamond, compared to 82 for GPT-5 Mini and 73 for Claude Haiku 4.5. So there's a series of models this competes with: Haiku, Grok Fast, and GPT-5 Mini. And Google is showing off their skills, because honestly, they can make all models fast; it's all a question of how much GPU they throw at it, right? Now, I know for a fact that many people use the Gemini Flash models for multiple things: prompt rewriting, catching regressions, doing different guardrails. One use case, and we keep telling you open source models are great for this too, is judging other models' outputs. You don't need a very strong model to judge other models' outputs as LLM-as-a-judge; you can just use a fast model.
The stronger and faster your guardrail model is, the better for your clients, right? Deciding whether the output was jailbroken, deciding whether the client is asking for unsavory things: all of those go into using these models.
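The guardrail pattern described here can be sketched in a few lines. This is a minimal illustration only: `call_model` stands in for whatever client function you use, with signature `(model_name, prompt) -> str`, and the model names are hypothetical placeholders, not any specific provider's API.

```python
# Sketch of a cheap-model guardrail: the strong model answers, then a
# fast model screens the answer before it reaches the user.
def guarded_answer(user_prompt, call_model,
                   strong="strong-model", judge="fast-cheap-model"):
    answer = call_model(strong, user_prompt)
    verdict = call_model(
        judge,
        "Reply with exactly SAFE or UNSAFE. Is the following response "
        "free of jailbreaks and disallowed content?\n\n" + answer,
    )
    # Fail closed: anything other than an explicit SAFE blocks the answer.
    return answer if verdict.strip().upper() == "SAFE" else "[blocked]"
```

The point of the design is that the judge call can go to a much cheaper, faster model than the one producing the answer, so the guardrail adds little latency or cost.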
Nisten
Nisten 40:58
I just tested it really quickly with the same Martian
41:02
question, and Gemini Flash-Lite got it correct, actually got the math very precise. And I'm assuming 5.3 Instant is rolled out, because it just answered instantly, directly, and it made major math mistakes that none of the other models make right now.
Alex Volkov
Alex Volkov 41:21
So the 5.3 Instant in ChatGPT made major
41:24
mistakes, but Gemini 3.1 Flash-Lite did the math correctly. Is that what you're saying?
Nisten
Nisten 41:29
I got it, got it.
41:30
Very good. Actually, a lot more specific in the numbers too.
Alex Volkov
Alex Volkov 41:34
One callout from our audience is that the new
41:36
Flash-Lite version is more expensive than the last version. Yeah, that's correct, it's more expensive than 2.5 Lite. Correct?
Wolfram Ravenwolf
Wolfram Ravenwolf 41:44
There's a Flash-Lite model, which is a
41:47
smaller model, even faster.
Alex Volkov
Alex Volkov 41:48
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 41:49
But it's great for home automation.
41:50
I use it with Home Assistant. So if I tell it to do something, I need tool calling and I need it fast. I don't want it thinking for a long time until I've already fallen down the cellar stairs because it's dark.
Alex Volkov
Alex Volkov 42:02
Yeah.
42:03
And Wolf, I think you also mentioned in the comments, about my LLM-as-a-judge point, that the judge needs to be a smarter model than what it is judging. Yeah, I agree. But there's a whole host of things, like guardrails, that you can detect with a faster, cheaper model.
Yam Peleg
Yam Peleg 42:18
I was just gonna say, it is way more expensive than the previous one, I
42:23
think. Look, all these CLI harnesses are using small models for all sorts of small things. I'm not sure people realize to what extent. Even Claude Code: the amount of Haiku calls for every prompt you're sending. I don't think people realize. Just to put the title at the top of the terminal, or the spinner on the side that's shimmering, every single one of these things is a Haiku call. So this is probably a model that Google trained for the Gemini CLI, without an intention of actually commercializing it to this extent. Gemini 3 had these weird releases where you got Gemini 3 and all of a sudden you got Gemini 3 Flash, which is better than the original 3, to the point that the original Gemini team shifted to using Flash 3. And then you have Gemini 3.1, which is probably the larger, stronger version of the small Flash. So people already discovered this model and started sending whatever they want directly to it, all sorts of prompts, because it's basically nearly free. And I think this is why Google is now releasing Flash-Lite with a real price tag: because it is extremely useful. Yeah.
Nisten
Nisten 43:55
So the previous Flash-Lite was 10 cents per million input tokens and
44:00
40 cents per million output tokens, extremely cheap. Now this new Flash-Lite is 50 cents per million input and a dollar fifty per million output. And if you work agentically, over 90, 95% of your tokens are input tokens. It's five times more expensive on input and almost four times more expensive on output. So that's a big price increase.
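The arithmetic here is easy to check. A quick sketch of the cost math, using the per-million-token prices quoted on the show and an assumed, input-heavy agentic workload (the 10M-token run and the 95% input share are illustrative numbers, not measurements):

```python
def run_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one workload; prices are per 1M tokens."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical agentic run: 10M tokens total, 95% of them input.
inp, out = 9_500_000, 500_000
old = run_cost(inp, out, 0.10, 0.40)   # previous Flash-Lite pricing -> $1.15
new = run_cost(inp, out, 0.50, 1.50)   # new Flash-Lite pricing -> $5.50
print(f"old: ${old:.2f}  new: ${new:.2f}  ratio: {new / old:.1f}x")
```

At this input/output mix the overall bill goes up roughly 4.8x, dominated by the 5x bump on input tokens, which is exactly why agentic users feel the increase most.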
Alex Volkov
Alex Volkov 44:31
Yep.
Nisten
Nisten 44:32
it did get the math right though, so there's that.
Alex Volkov
Alex Volkov 44:34
Alright folks, I think it's time for us to move on.
44:37
This show is brought to you by Weights & Biases, from CoreWeave, and we have something to announce to you. So we're gonna go to This Week's Buzz, the corner where we cover this week's excitements from Weights & Biases. Stick with us, we have a bunch of other tools and things to talk about. Meanwhile, let's go to This Week's Buzz.
45:12
All righty. Welcome to This Week's Buzz. With you right now: Alex Volkov, AI evangelist with Weights & Biases, and Wolfram Ravenwolf, also an AI evangelist at CoreWeave's Weights & Biases, both on the same team. Wolfram, you have something to show us on today's show, and I'm very excited to show this to folks as well. So let's talk about what you have.
Wolfram Ravenwolf
Wolfram Ravenwolf 45:32
great.
45:33
So it's finally the time to really talk about this and show the world what I've been up to since I joined CoreWeave in January, working on the Weights & Biases team. I chose the name Wolf Bench because, basically, it's not even a real benchmark per se; it's more like a framework for evaluation. It's based on Terminal Bench. But let's start at the beginning: why should you care, and why do I care? I've been the eval guy (not evil, eval) for a long time. I'm doing this because I want to use my AI better, use my agents better. I've been working on my assistant for three years, on agents for two years. So I'm always testing models, because I want to know which is best for the users. General purpose AI, basically. And I chose Terminal Bench because it is not a coding benchmark, although many people think it is and it's often put in that category. It is actually a benchmark about how to use terminals: system administration, terminal interaction, git, server configuration, setting up servers and doing stuff you would ask of your agent. So it's a nice sample of all these things. And it already exists, of course, and it is one of the most popular; that's also why I'm using it, so I can compare my scores to what the labs report. And the thing is, what we are looking at right now is an average score. Most use five runs, or four runs, with different timeouts for all of this. You see one score, and you don't really know very much. Like what we are seeing now: Kimi K2.5...
Alex Volkov
Alex Volkov 47:10
Can you zoom in a little bit, Wolf?
Wolfram Ravenwolf
Wolfram Ravenwolf 47:12
Yeah.
47:12
Let's zoom in one more. So basically Kimi, GLM 5, and MiniMax almost have the exact same score. So which one should you use for your agent? One score is not enough; it doesn't tell you enough about what the model is actually doing. And even then, this is the Terminal Bench benchmark using its own agent, Terminus 2. But how does the model do in Claude Code? How does it do in OpenClaw? I wanted to know these things, and that is why I've been doing this. I tested with different agents, like Claude Code, and I also want to be clear about how many runs I've done; I am aiming for five here. I will do more models, more agents, and you can even do combinations: how would Claude Code do if I use a Codex model with it? How would Codex do if I use a Claude model with it? All stuff I'm planning to do soon and report about. And what is special about this framework of evaluating is that I'm not just using the average score. Basically, it's a four-metric framework. I have the average, but I can also see the best of the five runs; you see it goes a bit above the average, of course. So that's the best run. But even more interesting to me was: of the 89 tasks in this benchmark, how many can the model actually solve at all, across all the runs, even if it has never solved all of them at once? That is the ceiling, the theoretical top. If we just look at Terminus, we see that Sonnet and Opus are very close together, but Kimi and the other Chinese models are also very close together on that part. Now, looking at the solid base, which tasks got solved in all the runs (we did five each), how many tasks got solved all the time? That is a very different picture. Even if Claude Opus can do 88% of the whole benchmark, on average it only does 73%, and only 55% can it do all the time.
So that is reliability, because an agent that does it sometimes, but not all the time, is less reliable; it's not doing it consistently. That is consistency. And if you look at all of them together, that also paints a very interesting picture, because now we see that even if Kimi has a higher average than the others, it has a lower baseline. So here, even GLM has an advantage. And this is also very interesting, because Kimi only gets 10% of all the tasks done every single time. So it is not very reliable. It can do a lot, as much as GLM, for instance, but GLM has a twice as high baseline of tasks that it always does.
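The four metrics described here (average, best-of-N, ceiling, floor) are straightforward to compute from per-run pass/fail results. A toy sketch with made-up task data, not real Wolf Bench numbers:

```python
def summarize(runs):
    """runs: list of dicts mapping task name -> bool (solved in that run).
    Returns (average, best_of_n, ceiling, floor) as fractions of tasks."""
    tasks = list(runs[0].keys())
    per_run = [sum(r.values()) / len(r) for r in runs]
    avg = sum(per_run) / len(per_run)          # mean score across runs
    best = max(per_run)                        # best single run
    ceiling = sum(any(r[t] for r in runs) for t in tasks) / len(tasks)
    floor = sum(all(r[t] for r in runs) for t in tasks) / len(tasks)
    return avg, best, ceiling, floor

runs = [
    {"git": True, "nginx": True, "cron": False, "ssh": False},
    {"git": True, "nginx": False, "cron": True, "ssh": False},
]
print(summarize(runs))  # (0.5, 0.5, 0.75, 0.25)
```

The floor (tasks solved in every run) is the reliability number highlighted above, while the ceiling shows what the model can do on a good day; two models with identical averages can have very different floors.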
Alex Volkov
Alex Volkov 49:48
Wolfram.
49:48
I have a few questions, and I want to highlight some of the findings as well. Folks, you can find all of this in Wolf Bench. Wolfram, I think this view is unique, the view you have up right now. One score is not enough to tell us about a model; there are variations in every run of the benchmark, every time you run it. For Terminal Bench specifically, there's a bunch of tasks, and I think the floor is very important, as you guys can see on the graph. The bottom, darker shade of every column shows how many tasks these models solve all the time, in a hundred percent of the runs. And the top score is the maximum: at the best run possible, under the best conditions, how many tasks every model solves. Opus is absolutely mogging everybody at the maximum; at Terminal Bench, Opus 4.6 is the best model. But we also looked at and tested different harnesses. The harness is what actually runs the model, and Wolf, we have three here: the Terminal Bench harness itself; the Claude Code harness that everybody uses (everybody who's excited about Claude Code is essentially excited about the prompts, the system, and how it does tool calls specifically); and one of the more famous ones that many non-techie people started using, OpenClaw, right? And in all those harnesses, the models perform a little bit differently, because of the system prompts, the additional tool call settings, and the explanations given to the model. So it's not only which model you use and what number they show you. The coolest thing is to compare GPT-5.3 Codex inside Claude Code versus Opus 4.6 inside the Codex app, and I think comparing between them and seeing who has the better harness for the model is going to be very exciting.
Wolfram, do you have any other comments before I bring up the co-host questions about this?
Wolfram Ravenwolf
Wolfram Ravenwolf 51:40
I just want to say thanks also, of course, to the
51:44
company I work for, which is sponsoring the podcast and my work as well. I'm working for them; I'm doing this for them and with them. The inference we have been using for the Chinese models was our own, so I test our own inference to make sure it works as well as the officially published results. And it's not cheap to do this. A single Opus run on Terminal Bench, if it's Opus 4.6, costs me 120 bucks per run, and there are five of these runs. So that is very expensive. And Sonnet costs 80 bucks, because it used a lot of tokens and 50% were cached, which is also something I will write about in the report. This is just the beginning. I will add more agent frameworks: Gemini CLI and Codex are definitely coming next, maybe the Hermes agent, which is also a new agent framework. So I want to add more. And I need sandboxes for this. Daytona sponsored the sandboxes with a couple hundred bucks, because it's 89 tests, I do five runs per model, and each one is limited to two hours. So that is up to 890 hours of CPU time. That is also a lot of time it takes to do the benchmark. And yeah, the two hours is also a special thing.
Alex Volkov
Alex Volkov 52:56
big, big shout out to Daytona for, helping us and
52:58
sponsoring the sandboxes here. Folks, agents need sandboxes on the internet to run their code, and we couldn't have done this without Daytona. Awesome job, Wolfram. You can check out WolfBench at wolfbench.ai, and we'll keep updating it with new models, including models that we host on our own inference — so you'll know the full story there and which models to use. Obviously the highlight, at least for me, is that you shouldn't trust just the one score. These models perform great on the baseline, but the variance in how many tasks they can solve in general is also important. Nisten, I think you had a question before we move on.
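Alex's point about floors and ceilings reduces to a small computation over the per-run results. A minimal sketch with hypothetical run data (not WolfBench's actual numbers): the "floor" is the share of tasks solved in every run, the "ceiling" the share solved in at least one run.

```python
# Per-task results across 5 benchmark runs: True = task solved in that run.
# (Hypothetical data for illustration, not real WolfBench results.)
runs_per_task = {
    "fix-build":    [True, True, True, True, True],
    "write-parser": [True, False, True, True, False],
    "debug-race":   [False, False, True, False, False],
}

def floor_score(results):
    """Share of tasks solved in 100% of runs (the dark bottom of each column)."""
    return sum(all(r) for r in results.values()) / len(results)

def ceiling_score(results):
    """Share of tasks solved in at least one run (best-case conditions)."""
    return sum(any(r) for r in results.values()) / len(results)

print(floor_score(runs_per_task))    # 1 of 3 tasks is solved every time
print(ceiling_score(runs_per_task))  # all 3 tasks are solvable on a good run
```

A single reported number usually sits somewhere between these two, which is exactly why one score hides the variance the panel is talking about.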
Nisten
Nisten 53:31
Yeah.
53:31
Yeah, and I do want to say that I'm not that surprised by Sonnet performing so well in the OpenClaw harness, because, based on what LDJ said last week or two weeks ago, it might have been trained longer. Can we run this benchmark on our own too? I would be interested to run it with smaller agent models. And the other thing I want to try — which may or may not be entirely kosher with the terms of service — would be to just hook up the Claude Code CLI with a proxy to a vLLM API running MiniMax or whatever. Because I want to see: when you hook up those models that apparently have been trained on Claude's data, how well do they actually do here? That would be interesting.
Wolfram Ravenwolf
Wolfram Ravenwolf 54:21
Yeah, that sounds possible.
54:22
I did it with LiteLLM, but it's even easier nowadays, where they have APIs that are Claude-compatible and even advertise this. So you can do it. It is the Terminal-Bench 2.0 benchmark — basically the original one with the specific settings I give it. And it is all down here: how long the timeout is, how many CPUs and how much RAM I gave the sandboxes. They all get the same resources, so it's not that, if OpenClaw were using more RAM, it would run out of memory much easier with the small ones. And yeah, you could do the same — nothing against that. I used marimo, which is a Python notebook thing; I used it to build a dashboard where I create this, I start the run, I get the stats. I will put them all on Weights & Biases Weave as well, so people can look into this. I will make more posts about this and share it. I want it to be completely transparent, so the settings I'm using, I will put them on GitHub so you can get the config. And if you do some runs with the config, I trust you — if you report the scores to me, we can put them in there. And yeah, let's build on this.
Alex Volkov
Alex Volkov 55:28
Folks.
55:28
If you wanna participate and run your own benchmarks or your own harnesses: wolfbench.ai — reach out to Wolfram, and we'll definitely include this. Right, time to move on. Wolfram, thank you so much for bringing this to us. Great work on WolfBench. Let's talk about open source. There's a few things in open source that we haven't covered — we talked about Alibaba Qwen 3.5 small models. There are two models I wanted to bring to your attention from StepFun. StepFun released Step 3.5 Flash Base; they call it the most open foundational model out of the Chinese labs.
Nisten
Nisten 55:57
I haven't tried this, but people are incredibly excited about this.
56:02
I think StepFun just made a name for themselves with this release, strangely enough. Yeah, people love that they're releasing even the supervised fine-tuning data from this — it's, I
Alex Volkov
Alex Volkov 56:13
think
Nisten
Nisten 56:14
Actually.
Alex Volkov
Alex Volkov 56:15
I think this is the highlight for this model, right?
56:16
Like, they released all of the training as well. They released the base and the mid-train for code, agents, and long context. They released their Step Tron OSS training framework. The SFT data is coming soon — it didn't launch yet, apparently. And it's all Apache 2.0 licensed. So not only are they saying, hey, here's the model, go use it — which we can look at; the benchmark here is 74 on SWE-Bench Verified, again, "verified," after last week — I think everything throughout the training process is open, which is great. And there's some evidence of focus switching from Qwen to Step 3.5 Flash. It runs on a Mac Studio M4, and it runs obviously on DGX Spark, which is NVIDIA's little small supercomputer.
Nisten
Nisten 57:05
This allows you to continue pre-training the model how you want,
57:11
and that's a big deal for people. I think there's gonna be a lot of companies that just use this and say, oh, we're releasing this model, and they don't say that it's based off of a Chinese model. But yeah, it's huge that they provided that, because you can continue pre-training — not just fine-tuning, but continuing the pre-training of the model how you want. And that is a big deal, because a lot of the shackles around the model haven't yet settled — so you can kind of formulate the yogurt how you want later on. So yeah, it is a big deal for people that train models. I
Alex Volkov
Alex Volkov 57:45
Yeah.
57:46
So we have a few comments here. Let me take a look at our Nous summary — a significant shift in openness is what folks notice most of all. Oh, let's go, breaking news! Let's go, LDJ — this is what we're here for. Okay, StepFun is gonna step aside for a second — sorry for the pun, folks. We have breaking news. Oh, let's
Yam Peleg
Yam Peleg 58:06
fucking go.
Alex Volkov
Alex Volkov 58:09
AI breaking news coming at you only on Thursday.
58:16
I,
Nisten
Nisten 58:21
it's so funny.
58:22
We were like fully expecting it. Something's gonna happen.
Alex Volkov
Alex Volkov 58:27
All right, LDJ.
58:28
You have the honors — you found this first. Go ahead.
LDJ
LDJ 58:31
Okay, so introducing GPT 5.4.
58:36
That is the title of the blog post that OpenAI just dropped. And we might as well just scroll down to where we start seeing benchmarks — that's what we really want, right?
AI
AI 58:46
Yeah.
LDJ
LDJ 58:46
So, GDPval, which measures a lot of different tasks that
58:52
the OpenAI teams have deemed as valuable to things like GDP — just overall economically valuable jobs. Here it ends up getting 83%, I believe. When it says "win or tie," I believe it's saying 83% of the time it gets a win or a tie against the human measure in the dataset. That's compared to around 70% for both GPT-5.2 and 5.3 Codex. SWE-Bench Pro: about the same gap between 5.4 and 5.3 Codex as we saw between 5.3 Codex and 5.2, so about one to two percent. OSWorld Verified: a huge difference — though only about 1% extra over the last Codex.
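The GDPval metric LDJ describes is a pairwise win-or-tie rate against human professionals. A toy sketch of how such a rate is computed — the grades here are made up for illustration, not OpenAI's data:

```python
# Each task is graded by comparing the model's deliverable to a human
# professional's: "win", "tie", or "loss". (Illustrative grades only.)
grades = ["win", "tie", "win", "loss", "win",
          "tie", "win", "loss", "win", "win"]

def win_or_tie_rate(grades):
    """Fraction of tasks where the model at least matched the human."""
    return sum(g in ("win", "tie") for g in grades) / len(grades)

print(win_or_tie_rate(grades))  # 0.8 -> reported as an "80% win/tie rate"
```

Note that because ties count toward the headline number, a reported 83% does not mean the model outright beat humans 83% of the time.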
Alex Volkov
Alex Volkov 59:36
Look at this.
59:37
Look at this: GPT-5.2, the last main OpenAI model — not the Codex fine-tune for code specifically, the main one — the jump in OSWorld is from 47% to 75%. So this model, the generic OpenAI model that's now gonna serve everyone, not only the coders, is also gonna be incredible at using computers. This does seem like a distillation of Codex of some sort, right? It's very interesting, the naming here: the previous GPT-series model was GPT-5.2, then 5.3 was Codex — there's no standalone 5.3 as far as I saw. And now they released a newer version called GPT-5.4 Thinking. But there's two versions of them, right, LDJ?
LDJ
LDJ 1:00:23
of 5.4.
Alex Volkov
Alex Volkov 1:00:24
Yeah.
1:00:24
There's the Thinking variation — very strong.
LDJ
LDJ 1:00:27
Yeah.
1:00:28
So they actually showed benchmarks for low thinking, for no reasoning effort — so basically instant — as well as medium, high, and even extra-high scores. So if you scroll down, there's the OSWorld Verified chart that actually shows the different thinking budgets.
Alex Volkov
Alex Volkov 1:00:44
Yeah.
LDJ
LDJ 1:00:44
Here, right there.
1:00:45
There you go. Yep.
Alex Volkov
Alex Volkov 1:00:46
So what are we seeing here?
1:00:47
Let's read through this. Number of tools — so we're seeing a graph where the x-axis is the number of tool calls, how many tools are getting called, and the accuracy is
Yam Peleg
Yam Peleg 1:01:02
crazy.
LDJ
LDJ 1:01:02
Yeah.
1:01:03
So the y-axis is the accuracy here, and then the different dots — the different blue dots, either blue or purple, I'm a bit color blind — those bluish dots are the no-reasoning effort, the low, the medium, the high, and the extra high. And then you can see also no reasoning effort, low, medium, high, and extra high for 5.2. It kind of goes into this funny zigzag pattern, 'cause at some point it actually doesn't get consistently better. But it's impressive that 5.4 does actually seem to get consistently better on this benchmark, whereas the last generation didn't. Yeah,
Alex Volkov
Alex Volkov 1:01:34
I would say based on this graph, no effort is, is the lowest at 40%,
1:01:40
still better than pretty much all of the previous model at very, very high effort — I think this is the craziness of this graph, Yam. And then if we go higher, the difference between the thinking regimes is not that crazy, right? Maybe 77, 75, 73, and 71 percent between the medium effort, the high effort, and the extra-high effort. But the jump in how good this model is at running your computer over the previous GPT-5.2 is crazy. It's absolutely crazy: 71% on medium effort versus 45% on medium effort. Yeah.
Nisten
Nisten 1:02:19
What's really cool about that, this is general.
1:02:21
My question — this is the general model, right? This is not the coding model. This is the
Alex Volkov
Alex Volkov 1:02:24
general Yeah, this is the new general model is
1:02:27
beating, the previous coding model.
Nisten
Nisten 1:02:28
let's go, let's go guys.
LDJ
LDJ 1:02:29
huge.
Alex Volkov
Alex Volkov 1:02:30
Yep.
1:02:31
So a very interesting way for them to highlight the improvements here. Yeah, what else can we talk about with this model? On WebArena Verified, which tests browser use, GPT-5.4 achieves a leading 67.3% success rate when using both DOM- and screenshot-driven interaction, compared to GPT-5.2 at 65%. On Online-Mind2Web, which also tests browser use, GPT-5.4 achieves a 92% success rate using screenshot-based observation alone, improving on ChatGPT Atlas agent mode, which achieves a success rate of 70%. This model, the new one, achieves 92% using only screenshots versus 70% in Atlas — and Atlas, we know, is great. This is crazy. What else is here? I haven't read it all yet.
LDJ
LDJ 1:03:20
Eventually you should see a tau-bench telecom — yeah, that's the benchmark.
1:03:26
And it shows, actually — interestingly, they're highlighting the without-reasoning scores here. Yeah. So they have telecom and airline, if I recall right. The airline benchmark relates to basically having to book a plane ticket for someone, which obviously requires multiple steps back and forth with a given interface, usually, or with a given database of flight times and flight prices and things like that.
Alex Volkov
Alex Volkov 1:03:52
And we're seeing that on tau-bench telecom, without
1:03:54
reasoning, GPT-5.4 gets 64%, jumping over 57% from GPT-5.2. Interesting — they didn't include Codex 5.3 here. Improved web search — significantly improved web search, new state of the art at 89% there. There's a lot of state-of-the-art achievements here. So GPT-5.4 is better at agentic web search on BrowseComp, a measurement of how well AI agents can persistently browse the web to find hard-to-locate information. GPT-5.4 leaps to a 70% average, over 5.2. And GPT-5.4 Pro, which is the second model, sets the new state of the art at 89%. And we have comments that it's already in the Codex app, instantly. So if you wanna run this as a test — that'll definitely do as a vibe check on this. I'll launch Codex as well and see if I got it.
LDJ
LDJ 1:04:48
finally they're showing some 5.4
Alex Volkov
Alex Volkov 1:04:50
pro benchmarks on the screen here.
1:04:52
For BrowseComp, which is the state of the art. So we are getting two models — one is the regular one, one is the Pro model, looks like. And let's see if I've got it.
LDJ
LDJ 1:05:01
Oh, this is also a big thing we haven't mentioned yet.
1:05:05
1 million tokens of context.
Alex Volkov
Alex Volkov 1:05:08
Oh wow.
1:05:08
Okay. For the bigger model.
Yam Peleg
Yam Peleg 1:05:09
go, let's go.
LDJ
LDJ 1:05:10
I'm not sure about Pro, but it does say — if you just Ctrl-F for
1:05:14
"context," you'll see. Yeah.
LDJ
LDJ 1:05:17
or just, control F for 1 million actually.
Alex Volkov
Alex Volkov 1:05:20
Hang on. In ChatGPT, 5.4 Thinking is available starting
1:05:23
today to ChatGPT Plus, Team, and Pro users, replacing 5.2 Thinking. The previous model will remain available for three months to those on Enterprise and Edu plans who need early access to plan around it. Available context windows remain unchanged from the previous ones. In Codex, GPT-5.4 includes experimental support for the 1-million-token context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit; requests that exceed the standard 272K context window count against usage limits at 2x the normal rate. Okay, so essentially you can run your Codex sessions for much, much, much longer.
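As read off the blog post, the 1M-token mode is opt-in via Codex configuration. A sketch of what that might look like in `~/.codex/config.toml`, based only on the two key names quoted on air — the exact values and the model string here are illustrative assumptions:

```toml
# Hypothetical ~/.codex/config.toml snippet, based on the key names
# quoted from the blog post; values are illustrative.
model = "gpt-5.4"

# Opt in to the experimental 1M-token context window.
model_context_window = 1000000

# Ask Codex to auto-compact the conversation before the window fills.
model_auto_compact_token_limit = 900000
```

Per the post, anything beyond the standard 272K window counts against usage limits at twice the normal rate, so the long window trades quota for fewer compactions.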
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:12
I just noticed they also have the Terminal-Bench 2.0
1:06:14
score in their blog post there. Yeah, if you go back to it, it's there. They only gave it for GPT-5.4, not for the Pro, and it's 75.1%, which interestingly is lower than GPT-5.3 Codex, which got 77.3%. Unfortunately, they don't give any information on how many runs they did and so on, but basically their score is about the same score I got with Opus 4.6. So on that benchmark, it's on par with it.
Alex Volkov
Alex Volkov 1:06:43
Yeah,
Yam Peleg
Yam Peleg 1:06:44
because Opus 4.6, as per Anthropic's own blog post, is 65.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:51
Yeah, that is also interesting because I got a
1:06:53
higher score with my benchmark. I did five runs and I got this score
Nisten
Nisten 1:06:58
starting to look like it's on par.
Alex Volkov
Alex Volkov 1:07:01
Do you guys notice that there's no Opus mentioned here at all?
1:07:08
Nothing — no comparisons. Let's go to the technical card, 'cause I think we have that as well, right? Because I think it's very important for us as well. Folks, I will just mention that Altman just posted about this, and we are here live — looks like a lot of people are joining us right now to talk about, and to try, the new release from OpenAI: GPT-5.4, which jumps over the latest 5.2 significantly, but is also competing with 5.3, and is now available in the Codex app. LDJ, go ahead while I pull up the technical card that you also sent.
LDJ
LDJ 1:07:43
So the pricing, which I, I have here.
1:07:45
Yeah. So it's about the same for output price: it's $15 per million tokens, compared to 5.2, which is $14 per million tokens — really a tiny difference there. For input price, though, it's about 50% more expensive than 5.2. So 5.2 is $1.75 per million tokens for input, and GPT-5.4 is $2.50 per million tokens for input. So that's roughly 50% more,
Alex Volkov
Alex Volkov 1:08:13
and that's up to 272K.
1:08:15
If you're enabling the 1 million context, then you'll get the 1-million-token window.
LDJ
LDJ 1:08:20
priced.
1:08:21
Yeah. And then in terms of the input and output price for 5.4 Pro, it's also a very, very small difference for output pricing, but it's about 50% more cost for the input.
Alex Volkov
Alex Volkov 1:08:31
we need to check this against the other models, but G PT 5.4 Pro has $30
1:08:36
per million tokens on the input and $180 per million tokens on the output. $180 per million tokens! Wolfram, before we run benchmarks on this, let me run this by procurement, okay? This is exceeding the realm of what is realistic to do benchmarks on, 'cause $180 per million tokens, times a few runs, could — you know — set us back. Oh, and
Nisten
Nisten 1:08:59
Yeah, so there'll be 60.
1:09:00
And so every tool call that feeds it all the context — it's gonna start being like 60 bucks per tool call if you end up towards 1 million tokens.
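The pricing deltas and Nisten's per-call estimate can be checked in a few lines. A sketch using the prices quoted on air; the 2x usage-limit multiplier is the one the blog post applies to requests beyond 272K, and applying it here is our assumption:

```python
# API prices quoted on the show, in dollars per million tokens.
M = 1_000_000
gpt_52_in, gpt_54_in = 1.75, 2.50      # input, 5.2 vs 5.4
gpt_52_out, gpt_54_out = 14.0, 15.0    # output, 5.2 vs 5.4
pro_in, pro_out = 30.0, 180.0          # GPT-5.4 Pro

# Input got ~43% more expensive (the panel rounds this to "roughly 50%").
input_increase_pct = round((gpt_54_in / gpt_52_in - 1) * 100)
print(input_increase_pct)  # 43

# One Pro tool call near a full 1M-token context:
tokens = 1_000_000
call_cost = tokens / M * pro_in   # $30 of raw input spend per call
usage_equiv = call_cost * 2       # counted at 2x against usage limits
print(call_cost, usage_equiv)     # 30.0 60.0 -- matches Nisten's "60 bucks"
```

Because agents re-send the whole context on every tool call, long-context Pro sessions compound quickly, which is what makes benchmarking it so expensive.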
Alex Volkov
Alex Volkov 1:09:11
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:09:11
We have to set on limits, not just
1:09:12
on time, but also on cost.
Alex Volkov
Alex Volkov 1:09:15
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:09:15
anyone wanna sponsor this?
Alex Volkov
Alex Volkov 1:09:17
We should mention also that Codex is on Windows now.
1:09:19
So yeah, all of you on Windows can use this without running the Codex CLI. Let's look at the system card and then find what's important about this. So again, two models. I still haven't seen a comparison to Opus, I still haven't seen the differences between 5.4 Thinking and 5.4 Pro, so I would love to see those as well, if —
Nisten
Nisten 1:09:40
if, if we can do a quick test.
1:09:42
I pasted the prompt in the chat
Alex Volkov
Alex Volkov 1:09:45
Chat
Nisten
Nisten 1:09:45
here.
Alex Volkov
Alex Volkov 1:09:46
Yeah, I just sent
LDJ
LDJ 1:09:46
the link to the system card,
Alex Volkov
Alex Volkov 1:09:47
Okay.
1:09:47
The full system card. Okay, Nisten, let me pull this up, and then I will also test out the Mars thing. So we're gonna run Nisten's Mars prompt, and we can see already that — folks, we need to relearn this, okay? So far, when we said Codex, we sometimes meant Codex 5.3, the previous GPT version. Now this is GPT-5.4, the standalone one. So "Codex" now refers to the app, and GPT-5.4 within the Codex harness is running and doing one of these tasks. You can see a bunch of searches it did across different NASA websites, the JPL, et cetera. And it's running Nisten's prompt to calculate what a mass-driver rail would need — how long it would need to accelerate — to get people off of Mars, up Olympus Mons, Mars's tallest geographic feature.
Nisten
Nisten 1:10:43
It builds a megastructure on Mars and visualizes it.
1:10:46
And I really like to find one-shot prompts that you can vibe-check models with, because they actually tend to be pretty good comparisons between different models' capabilities. And we've used the same one for almost a year now. So yeah, it's a pretty complex megastructure thing that has to be built and visualized, with a maglev launcher built along the mountain on Mars. It's gotta launch stuff into space, and it's gotta make it all pretty and look like a video game. So it has to get the math right, the coding right, and the visuals right. It's a good one-shot test for agents.
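The core math the prompt demands is straightforward constant-acceleration kinematics. A sketch of the numbers a model would need to get roughly right — Mars's surface escape velocity (~5,030 m/s) is standard physics, while the 3g crew limit is our assumption for a human-rated launch:

```python
# Mass-driver sizing for a crewed Mars launch (back-of-envelope sketch).
v_escape = 5030.0     # m/s, surface escape velocity of Mars
a = 3 * 9.81          # ~3g sustained: an assumed human-tolerable limit

# Constant acceleration from rest: v = a*t, and d = v^2 / (2a).
t = v_escape / a
rail_length_km = v_escape**2 / (2 * a) / 1000

print(round(t))               # ~171 seconds of acceleration
print(round(rail_length_km))  # ~430 km of track
```

A ~430 km rail is why the prompt has the launcher run up the flank of Olympus Mons — the mountain's enormous base is one of the few landforms long enough, so a model that nails the math tends to nail the visualization too.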
Alex Volkov
Alex Volkov 1:11:25
LDJ, I think you're sending a few screenshots.
1:11:26
You wanna talk to them.
LDJ
LDJ 1:11:29
yeah, these are just some of the, the most drastic changes in benchmarks that I
1:11:34
saw while looking through the system card. MLE-Bench — Machine Learning Engineering bench. This is literally the ability of AI models to do AI research, machine learning engineering more specifically. And here we can see GPT-5.2 Thinking to GPT-5.2 Codex — that was little to no change, or maybe even a slight dip. Then it almost doubles in score from 5.2 Codex to 5.4 Thinking. Unfortunately, no 5.3 Codex there to compare against.
Alex Volkov
Alex Volkov 1:12:05
yep.
LDJ
LDJ 1:12:06
And then next,
Alex Volkov
Alex Volkov 1:12:07
I think most people are gonna be interested in because obviously
1:12:10
Codex is great, but I think most people are gonna be interested in figuring out how to compare this to Opus, and how to think about whether — having left OpenAI a day ago — they should come back now because of this model.
LDJ
LDJ 1:12:25
Mm-hmm.
1:12:25
And in many benchmarks, from what I recall — especially the coding benchmarks — 5.3 Codex was near identical or higher in most of them compared to Opus 4.6. Yeah. So unfortunately, for that specific benchmark you just showed, we don't have a 5.3 Codex score. However, the next image that I think I sent — I believe that one does have a 5.3 Codex score.
Alex Volkov
Alex Volkov 1:12:49
Let me open this up here.
1:12:51
This is the Monorepo bench. You wanna talk to this one?
LDJ
LDJ 1:12:57
Yeah, sure.
1:12:57
So, these are pretty similar at first glance, but if we look at the specific scores: 5.2 Thinking to 5.3 Codex is near identical — or actually maybe even a 0.7% drop, so probably within margin of error. And then a little over a 3% increase for 5.4 Thinking. So that does seem kind of significant when you look at the relative score changes over the generations.
Alex Volkov
Alex Volkov 1:13:23
Yep.
1:13:24
I gotta wonder about the design scores as well, like most models when they release. And I think it's very important to say, folks: this is a new model from OpenAI. It's not only a coding model — it's not a dedicated coding model. And it's now walking through the design of this system — index.html and style.css. So now it's coding up Nisten's very hard question; we can review the changes as well. But most of these models are used for other things. And so this model is live now — I think it's live on ChatGPT as well, and it's definitely live on Codex. Let's see if it's live on ChatGPT for us. I still have 5.3 Instant and — no, I only have 5.2 Thinking here. I didn't get it on ChatGPT via the Pro account yet, but I'm assuming they're rolling this out quick. A very interesting thing is that they released it on Codex already, and I gotta wonder if it's in Codex CLI — I'm assuming so. Let's take a look at Codex CLI; let's get out of Claude Code.
Yam Peleg
Yam Peleg 1:14:26
but you can use it if you specifically just name
1:14:30
the model, like dash-m GPT—
Alex Volkov
Alex Volkov 1:14:33
I have it here.
Yam Peleg
Yam Peleg 1:14:34
5.4.
Alex Volkov
Alex Volkov 1:14:36
I opened Codex and it's right here.
Yam Peleg
Yam Peleg 1:14:38
Oh really?
Alex Volkov
Alex Volkov 1:14:39
Nah — GPT-5.4.
1:14:40
Let's go to model, then hit enter, and then 5.4 Codex. I didn't even have to update the app — it's just here: "latest frontier-gen coding model." Oh, beautiful. The funny thing is, if you look at the Codex CLI, the description for both the previous state-of-the-art 5.3 Codex and the new 5.4 is "latest frontier-gen coding model." So they are considering this a coding model as well. Now, my question for this one is: is there a way for us to measure how autistic 5.3 was versus this new one? This is what I'd like to measure. 5.3 was very much a do-exactly-as-you-tell-it type of model, and so I gotta wonder if this one's gonna be a little bit better. It's kind of slow on Nisten's task — I would expect it to have finished by now.
Yam Peleg
Yam Peleg 1:15:31
I think the easiest way is just to give it, let it do some
1:15:34
testing, or unit tests — something where you can just write, you know, a specific unit test to check a contract or something, and just see if it gets what you're actually trying to measure. I mean, it's very easy to know if it's different.
Nisten
Nisten 1:15:51
You can try turning — Codex 5.3 into a therapy
1:15:56
bot or something compassionate.
Yam Peleg
Yam Peleg 1:16:00
I'm imagining the 5.3, like, what, what's gonna happen
1:16:04
if you ask 5.3 Codex about this? Yes.
Nisten
Nisten 1:16:07
5.3 Compliment the girl.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:16:11
Have any of you ever tried to have 5.3 or Codex basically
1:16:16
generate documentation for the code? It reads so much differently than if you have Opus write the documentation. Yeah — it talks like a specialist in the field, and it doesn't even care if you don't know what it's talking about.
Alex Volkov
Alex Volkov 1:16:32
Let's read some of the thought processes.
1:16:34
I asked GPT-5.4 to do one thing it could improve about the ThursdAI website, while we have Nisten's thing running behind — using the web design guidelines skill to judge the most impactful single improvement without changing the code. You can see some of the reasoning here; let's read through the reasoning. This is 5.4 Thinking: "I need to fetch the latest guidelines. Seems I should be using the web tool to access this information. I'll aim for the raw GitHub data, making sure I follow the instructions accurately." I don't remember seeing this kind of reasoning in GPT-5.3's thoughts — the self-affirmation thing is very, very interesting. Starting
Nisten
Nisten 1:17:10
to sound a bit like Opus.
1:17:11
Okay, let's go.
Alex Volkov
Alex Volkov 1:17:13
Yeah.
1:17:13
And right — it feels a little more humane, a little less straightforward-autistic than 5.3. It says also: "I should probably take a closer look at the CSS and the index file. For some tasks I might need tools like rg and sed to help me out. I think I should focus on producing one key improvement for the website, probably mentioning something related to file references." Okay, I like the thinking processes here, and I have the answer — "I'm pulling exact line references." So let's see the actual answer. This is kind of awful: "The one thing I'd improve is the site's core page accessibility. Add a proper main landmark with a skip-to-content link." Right now, the shared shell — what is this? I don't even know what this is.
Nisten
Nisten 1:17:59
'Cause if you want to get a good Google Lighthouse score,
1:18:03
So you have to hit all the accessibility stuff
Alex Volkov
Alex Volkov 1:18:06
and I guess, but this is, okay, so, so I guess, this
1:18:08
is not necessarily what I meant when I said improve the website — I meant, like, for the front end. We can run the same thing with Opus, just to compare between the two, and ask what the one thing is, and then see how Opus approaches it.
Nisten
Nisten 1:18:20
It's still a bit autistic.
Alex Volkov
Alex Volkov 1:18:22
Yeah, that's why I wanna see Claude.
1:18:25
Yeah, let's go like this.
Nisten
Nisten 1:18:26
It's not
Alex Volkov
Alex Volkov 1:18:26
a bad thing,
Nisten
Nisten 1:18:27
but I mean it's, it just changes your harnessing quite a bit.
Alex Volkov
Alex Volkov 1:18:32
Yeah.
1:18:32
Let's give the same one to Opus. All right, it's gonna go, and we're gonna compare. Codex's one thing — it gave me an accessibility thing. I gotta wonder if we can ask Codex to actually view the website and then give a response. Meanwhile, let's take a look at Opus and then compare the two outputs for fixing the one thing on the website. Look at this — the guests ticker: "It's a plain-text marquee of company names with no logos, photos, or links. It's a missed opportunity for social proof. Showing actual guest headshots or company logos would make it instantly more credible and visually compelling. Right now it reads like filler text rather than a trust signal." You guys — this is what I want from an intelligence. OpenAI, I love you, but when we refer to GPT Codex previously, and now to the new 5.4, as autistic, this is what we mean. Here's the comparison between the two answers to the question. Claude Code, Opus 4.6: when I asked it to improve one thing about the website, it launched the website — in Chrome, I think. Yeah, it launched the website, it looked at the website, and it said, hey, the guest section is plain text of company names with no logos — this guy, okay, this scrolly guy — and it said showing guests or company logos, even small ones, would make it instantly more credible and visually compelling; right now it reads like filler text rather than a trust signal. So basically it says: hey Alex, these should be logos. Codex said: the one thing I'd improve is the site's core page accessibility — add a proper main landmark with a skip-to-content link. This makes sense to no one, and I know that makes
Nisten
Nisten 1:20:15
sense to me, but
Alex Volkov
Alex Volkov 1:20:17
No, but like, there's no main, it's not a blog.
1:20:19
It's a podcast website — it shouldn't. Yeah, okay. I'll say, though, GPT-5.4 did not launch the website. It didn't see the website, so it is judging based off the HTML. But it's also something where I would say, hey, why wouldn't you just use your tools to launch the website? This is definitely a thing I would like to see.
Nisten
Nisten 1:20:42
it.
Alex Volkov
Alex Volkov 1:20:43
I'm
Nisten
Nisten 1:20:43
looking at it.
Alex Volkov
Alex Volkov 1:20:44
Yeah.
1:20:45
LDJ, you sent something, folks, for the vibe testing of GPT-5.4 — 5.4, not Codex, not 5.2 Codex — versus Opus 4.6. So if you have any examples you wanna comment on, feel free to give them to us. LDJ, go ahead.
LDJ
LDJ 1:21:00
Yeah, so the, the image Alex is about to bring up.
1:21:03
This is a comparison of Opus 4.6 to Gemini to GPT-5.4, and this is just on some popular benchmarks they have in common. These are mostly agentic benchmarks, but there's a couple of coding ones in there as well.
Alex Volkov
Alex Volkov 1:21:18
Oh, nice.
1:21:18
Is this from OpenAI?
LDJ
LDJ 1:21:21
Somebody in a group chat that I'm in just posted this.
1:21:23
Yeah. But I just fact-checked it against the system card to make sure these are not hallucinated figures, and all four of the scores I checked are correct, so —
Alex Volkov
Alex Volkov 1:21:32
yeah.
1:21:32
Okay. So let's take a look. GPT-5.4 Thinking — not the Pro one. So the two models are Thinking and Pro; this is GPT-5.4 Thinking, which — let's just call it 5.4 and that's it.
LDJ
LDJ 1:21:43
5.4.
1:21:44
Yeah.
Alex Volkov
Alex Volkov 1:21:44
Yeah.
1:21:45
GPT-5.4: 75% on OSWorld Verified versus Claude's 72. It looks like it's beating on all these benchmarks. By the way — oh, my bad. Yeah, there we go. Okay, is this better? Yes. Okay. So, for comparison: GPT-5.4, and we found some benchmark comparisons to Anthropic's 4.6 and Google's 3.1. LDJ, you wanna read through some of these for us?
LDJ
LDJ 1:22:13
OSWorld Verified: it's a bit higher than 5.3 Codex.
1:22:17
It's even more above Opus 4.6 — no score for Gemini 3.1 Pro, unfortunately. Let's see, what's one they have all in common? BrowseComp. Okay, so BrowseComp browsing: here we see 82.7% for GPT-5.4 Thinking, which actually looks like it's behind both Opus 4.6 and 3.1 Pro when comparing to competitors. Yeah. However, if we look at 5.4 Pro, it ends up having the actual overall state of the art at 89.3%.
Alex Volkov
Alex Volkov 1:22:51
And 5.4 Pro — if we go based on the previous Pro models — is
1:22:56
kind of like multiple fanned-out reasoning runs where they choose the best one across the runs. This is why they're so expensive — they're basically spinning up four and taking the best out of them, or —
LDJ
LDJ 1:23:08
Yeah,
Alex Volkov
Alex Volkov 1:23:08
Multi-agent.
1:23:09
Yeah.
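If Alex's guess is right — Pro fanning out several reasoning runs and keeping the best — the control loop is plain best-of-N sampling. A toy sketch of that loop; the scoring function, N=4, and everything else here are our assumptions, not OpenAI's actual mechanism:

```python
import random

def solve_once(question, seed):
    """Stand-in for one independent reasoning run (hypothetical)."""
    rng = random.Random(seed)
    answer = f"draft-{seed}"
    score = rng.random()  # stand-in for a verifier / reward-model score
    return answer, score

def solve_best_of_n(question, n=4):
    """Fan out n runs and keep the highest-scoring answer.

    Cost scales roughly n-fold, which is one way the Pro-tier
    pricing premium would make sense.
    """
    candidates = [solve_once(question, seed) for seed in range(n)]
    return max(candidates, key=lambda c: c[1])

answer, score = solve_best_of_n("How many r's in strawberry?")
print(answer, round(score, 2))
```

The same pattern explains the benchmark profile: best-of-N helps most on tasks with a checkable answer (math, browsing with a verifiable target) and less on open-ended ones.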
LDJ
LDJ 1:23:09
Yeah.
1:23:10
And actually, it's good that you brought that up. So there's things like Deep Think from DeepMind — that's basically the Gemini equivalent of that — which unfortunately we don't have to compare here. So it's possible that might even beat 5.4 Pro in something like BrowseComp, for example; we just don't have that data to compare, unfortunately. But we can see here — what are some other ones they all have in common? FrontierMath. Okay, so here — the second one from the bottom — 5.4 Thinking, on tiers one to three, scores 47.6%, and that is higher than both 3.1 Pro and Opus 4.6, significantly
Alex Volkov
Alex Volkov 1:23:47
higher.
LDJ
LDJ 1:23:48
Then tier four of FrontierMath: GPT 5.4 Thinking scores 27.1%.
1:23:55
Which is significantly higher than Opus 4.6's 22.9% and Gemini 3.1 Pro's 16.7%.
Alex Volkov
Alex Volkov 1:24:02
it's really good enough.
1:24:04
That's, this is why, I guess
LDJ
LDJ 1:24:04
so,
Alex Volkov
Alex Volkov 1:24:05
yes.
LDJ
LDJ 1:24:06
And then,
Alex Volkov
Alex Volkov 1:24:07
I just wanna add — shout out to Jordan from Everyday AI.
1:24:10
He said this is a super vague prompt, so you get what you pay for — not really a valid comparison. And I agree, this is a vague prompt. This is a prompt from me not knowing the model's statistics or how exactly I should talk about this. And Opus got exactly what I wanted without knowing anything — Opus did the thing that I wanted it to do, as a human. And prompt engineering can get you far with the GPT models if you're good at it, the way you are with Claude prompt engineering. And there's a difference between the CLAUDE.md and the AGENTS.md — all of that goes into this thing; this is the difference between harnesses as well. Prompt engineering — knowing exactly what you wanna build — can get you significantly farther with the GPT models. And the thing that I wanted to test is whether or not it's close to Opus at understanding me as a human. From these few prompts, it's not. But we should keep testing, because the numbers show a different story. Absolutely. Is it done? Nisten, LDJ, is there anything else we wanna cover here, or do you wanna finish?
LDJ
LDJ 1:25:09
the 5.4 Pro score for FrontierMath tier four, which is the 38%
1:25:13
one — at the tier-four level, it's almost double both Opus 4.6 and Gemini 3.1 Pro. Which, again, isn't a fully fair comparison, 'cause we don't have Deep Think to compare against at that tier.
Alex Volkov
Alex Volkov 1:25:24
That's true.
LDJ
LDJ 1:25:24
But it's really a big difference here.
Alex Volkov
Alex Volkov 1:25:29
I think Deep Think did launch with some FrontierMath numbers.
1:25:32
I'll look
LDJ
LDJ 1:25:32
for that while you guys move on to the next thing.
Alex Volkov
Alex Volkov 1:25:34
So we have comments from folks saying, "I want a model that
1:25:37
can handle a vague prompt like that." Yeah, that's kind of where I am as well. I think that's what I expect from the GPT stuff — the general intelligence, not the "hey, I know exactly how to ask you for stuff" intelligence. Nisten, let's look at our Mars thing for now.
Nisten
Nisten 1:25:52
Okay.
1:25:52
This is starting to look kind of good. we
Alex Volkov
Alex Volkov 1:25:54
have orbit run and we have the escape run.
Nisten
Nisten 1:25:57
Oh, camera director.
1:25:58
There's a camera director button on the right.
Alex Volkov
Alex Volkov 1:26:01
Like this.
Nisten
Nisten 1:26:02
Okay.
1:26:02
I think you can click it multiple times actually.
Alex Volkov
Alex Volkov 1:26:04
Oh, it started at t minus two seconds.
1:26:07
So supposedly we're gonna travel with this,
Nisten
Nisten 1:26:09
So you can change camera views.
Alex Volkov
Alex Volkov 1:26:10
Yeah.
Nisten
Nisten 1:26:11
Oh yeah.
Alex Volkov
Alex Volkov 1:26:11
There we go.
1:26:12
System view. And there's the camera director and the camera chase. This looks significantly more advanced than the previous one, right?
Nisten
Nisten 1:26:19
Yeah.
1:26:19
This is a lot better than what Codex did.
Alex Volkov
Alex Volkov 1:26:22
Yeah.
Nisten
Nisten 1:26:22
I think this is even better than,
Alex Volkov
Alex Volkov 1:26:24
Look at this.
1:26:24
Look at, are you seeing all this? Are you seeing the
Nisten
Nisten 1:26:25
pulses?
Alex Volkov
Alex Volkov 1:26:26
Yeah.
Nisten
Nisten 1:26:26
You can zoom out a bit so we can see it more clearly.
LDJ
LDJ 1:26:30
crazy.
Nisten
Nisten 1:26:30
amazing.
LDJ
LDJ 1:26:31
amazing.
1:26:31
It feels like just six months ago that we were trying to do this, and yeah, it would kind of do it with the frontier models, but it wouldn't have all these extra bells and whistles — these nice graphics, this camera-change option, this whole UI.
Alex Volkov
Alex Volkov 1:26:46
Okay.
1:26:47
This is
Nisten
Nisten 1:26:47
very good
Alex Volkov
Alex Volkov 1:26:47
This is so good.
1:26:48
There are stars in the background, there's a pulsing light on this thing, it gives you the flight log, and all of the calculations are supposedly very precise and correct. And this is the escape run.
Nisten
Nisten 1:26:58
3.54.
1:26:59
246, 380, 385.9 kilometers. Tracking. Yep. Yep. It got them right.
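The figures Nisten is spot-checking are the kind of thing that's easy to verify independently. A minimal sketch of the standard two-body formulas for Mars — standard constants; which on-screen numbers correspond to which quantities, and the app's exact inputs, are assumptions here:

```python
import math

# Standard physical constants for Mars
GM_MARS = 4.282837e13   # gravitational parameter, m^3/s^2
R_MARS = 3.3895e6       # mean radius, m

def circular_velocity(alt_m: float = 0.0) -> float:
    """Speed of a circular orbit at a given altitude above the surface, m/s."""
    return math.sqrt(GM_MARS / (R_MARS + alt_m))

def escape_velocity(alt_m: float = 0.0) -> float:
    """Escape speed at a given altitude: circular velocity times sqrt(2)."""
    return math.sqrt(2 * GM_MARS / (R_MARS + alt_m))

print(round(circular_velocity() / 1000, 2))  # ~3.55 km/s at the surface
print(round(escape_velocity() / 1000, 2))    # ~5.03 km/s at the surface
```

The surface circular-orbit speed of about 3.55 km/s is at least consistent with the "3.54" read off the screen, so an "orbit run" in that ballpark would pass a sanity check.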
Alex Volkov
Alex Volkov 1:27:07
Yeah.
Nisten
Nisten 1:27:07
So I might have to actually use this now.
Alex Volkov
Alex Volkov 1:27:10
the math is great.
Nisten
Nisten 1:27:10
This is the best one.
Alex Volkov
Alex Volkov 1:27:11
Yeah,
Nisten
Nisten 1:27:11
this is the best one that was
Alex Volkov
Alex Volkov 1:27:12
sent.
Nisten
Nisten 1:27:12
I'm not a fan of OpenAI, but this is the best one so far.
1:27:17
actually very impressed. Holy shit.
Alex Volkov
Alex Volkov 1:27:21
Yeah.
1:27:21
This is a very impressive model — this looks like a top-tier model. GDPval: we have GDPval here. We don't have this for Google, but we have it for Anthropic. GDPval is knowledge-work tasks, wins or ties. So basically, folks, if you're looking at a change in how the world works right now because of code, GDPval is that for everything else — for knowledge work. And Jordan is right, it's a very important benchmark. It's very interesting that the Thinking version takes a higher score than the Pro version, which supposedly fans out and uses more compute: 83% on GDPval, a jump of 13 points over GPT 5.2 and 5.3 Codex. So the jump here is quite crazy. The jump over Anthropic's Opus 4.6 is also there, although not as big. This model, 5.4 Thinking, is in the Codex interface — which is now on Windows — and in the Codex CLI, and it has a 1 million token context window. Any other things that we should read? I really wanna dig into the technical stuff in the model card. Lemme pull this up.
LDJ
LDJ 1:28:33
There is at least one more thing I noticed in their,
1:28:36
their Twitter posts for 5.4. Yeah, they mentioned the fact that — so I haven't used the native ChatGPT interface much lately; I've been mostly using Codex, so I don't know if this has been there for a while — but they mentioned the ability to interrupt in ChatGPT. So while it's thinking, you can send a message, and the message will actually go through and help influence the rest of its thinking process.
Alex Volkov
Alex Volkov 1:29:00
Oh, the steering.
1:29:00
Yeah, the steering. I still don't have this — I have 5.3 Instant, but I don't have 5.4 in my ChatGPT. So we're gonna wait.
LDJ
LDJ 1:29:09
I just sent this video here, by the way.
Alex Volkov
Alex Volkov 1:29:11
Yeah.
1:29:12
So let's take a look. The steering is one of the major things that I think all the labs are gonna catch up on — the steering is one of the major things that just works. Let's take a look at the video.
1:29:29
So we're seeing a video that says: "A baby Japanese macaque has stolen my heart. Where can I volunteer to be closer to animals?" And ChatGPT is thinking, and meanwhile the person writes, "I live in Cobble Hill, by the way." So while the model was thinking, the person added a comment, and then the thinking process said, "Perfect, I'm narrowing this down" to nearby options — so not all of Manhattan. And I'll say, as far as UI affordances go — it's not really an affordance, but as far as the ability of the model to actually understand — this happens to me all the time. In OpenClaw, for example, I would send something and then go, "Oh shit, I forgot to mention this one thing." I would have to stop the whole process and send it all again. So this is definitely something we like from the model — steering the model is something that we always, always do, and now it's in the ChatGPT interface. This is new, right? I haven't seen this before, but this is definitely new, and it says "gradually rolling out." Wolf, maybe you wanna comment on this, but on Terminal-Bench the previous model, 5.3, is still the best one.
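The steering behavior described here — a message sent mid-thought getting folded into the ongoing reasoning — can be sketched conceptually. This is purely illustrative: OpenAI hasn't published how ChatGPT implements it, and `reasoning_with_steering` and its toy steps are hypothetical.

```python
import queue

def reasoning_with_steering(task, steps, interjections):
    """Toy reasoning loop: before each step, drain any user messages that
    arrived mid-thought and fold them into the working context."""
    context = [task]
    for step in steps:
        while not interjections.empty():  # user typed while the model was "thinking"
            context.append(f"(user added: {interjections.get_nowait()})")
        context.append(step(context))
    return context

# Hypothetical two-step "thought": step 2 reacts to the interjection if present.
steps = [
    lambda c: "step1: listing animal shelters in NYC",
    lambda c: "step2: " + ("narrowed to Brooklyn"
                           if any("Cobble Hill" in x for x in c)
                           else "covering all boroughs"),
]
msgs = queue.Queue()
msgs.put("I live in Cobble Hill, by the way")  # queued before the loop reaches step 1
out = reasoning_with_steering("find volunteering options with macaques", steps, msgs)
print(out[-1])  # -> step2: narrowed to Brooklyn
```

The key design point is that the interjection doesn't restart the run — it lands in the working context so later steps can react, which is exactly the difference from stopping and resending the whole prompt.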
Wolfram Ravenwolf
Wolfram Ravenwolf 1:30:40
Yeah, it's the agentic stuff.
1:30:42
I think that is the thing here. I commented on this before — there's a discrepancy here: it regressed on some benchmarks, maybe it needs some additional training.
Nisten
Nisten 1:30:52
I think for a generalist model to be on par with the coding
1:30:57
model is a big deal, because usually all of that fine-tuning on just code makes it worse at everything else — people saw OpenClaw. The fact that it still gets the coding done, I think, makes it a lot more useful. There's some discrepancy there, where the coding model only shows how good it is on the benchmarks. But for actual real-world use, you're gonna have opinions in the app, you're gonna have descriptions, you're gonna have more artistic stuff that it needs to be good at, even when acting agentically. So to me, that's overall a way better model. I don't think the benchmarks show everything there.
Alex Volkov
Alex Volkov 1:31:38
Yeah.
1:31:39
LDJ, go ahead.
LDJ
LDJ 1:31:42
Yeah.
1:31:42
So, on that note of what Nisten just said, something important here is the fact that it seems much more efficient at using tools — needing fewer tool calls to do the same amount of things, or at least a less sequential, serialized set of tool calls. So that should also, at least in cases that require a lot of tool calls, make the overall task complete faster. So even though it might not be quite as good as Codex, or a little bit worse in certain agentic coding aspects, it might just get the job done very similarly, but much faster.
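The latency point LDJ is making — fewer serialized tool calls finish sooner — is easy to see with a toy sketch. `call_tool` is a stand-in for any tool invocation; when calls are independent, wall-clock time drops from the sum of the delays to roughly the slowest single one:

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    """Stand-in for one tool call (a file read, a grep, a web fetch)."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def sequential(tools):
    # One call at a time: total latency is the sum of all delays.
    return [await call_tool(n, d) for n, d in tools]

async def parallel(tools):
    # Independent calls issued together: latency is only the slowest one.
    return await asyncio.gather(*(call_tool(n, d) for n, d in tools))

tools = [("read_file", 0.1), ("grep", 0.1), ("web_search", 0.1)]

t0 = time.perf_counter()
asyncio.run(sequential(tools))
seq = time.perf_counter() - t0   # ~0.3 s: three delays back to back

t0 = time.perf_counter()
asyncio.run(parallel(tools))
par = time.perf_counter() - t0   # ~0.1 s: all three overlap
```

A model that plans its tool calls so more of them can run like `parallel` (or that simply needs fewer of them) finishes the same task faster even at equal per-call quality.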
Alex Volkov
Alex Volkov 1:32:15
Yeah, I think this was the feedback on GPT 5.3 as well —
1:32:20
that if you know how to prompt it correctly, you can spin up task agents and it will run and eventually do whatever you want after a while. We've been on the air with a little over 3,000 people here — been on the air for quite a while. Any other things? We probably should summarize at this point. Yeah, go ahead, Wolf.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:32:42
One thing: it's cheaper than Sonnet.
Alex Volkov
Alex Volkov 1:32:46
It's cheaper than Sonnet while it outperforms Opus
1:32:49
on multiple tasks. Also the 1 million context window, which Sonnet also has as an experimental feature. We'd only seen a 1 million context window in OpenAI models in the 4.1 models before, or for enterprise. So now we have a very long context window model — even though the calls that exceed into the experimental 1 million range are increasingly expensive. But folks, let's do a summary super quick: GPT 5.4 Thinking just dropped with 1 million token context window support. It's now live in the Codex app and is going live in ChatGPT. It has state-of-the-art reasoning across multiple benchmarks. We've tested it on coding — it's good at coding, though it still very much needs direct instructions. We're all excited to go play with this, obviously. Parting thoughts, folks? Nisten, what's your summary after seeing this perform on the Mars benchmark thing that we run here often?
Nisten
Nisten 1:33:56
I might start using it.
1:33:58
I might just use it through AMP too. But yeah, I think I might finally start — I haven't had a ChatGPT subscription in like two years.
Alex Volkov
Alex Volkov 1:34:07
Oh, okay.
Nisten
Nisten 1:34:08
this is kind of convincing me to do it.
Alex Volkov
Alex Volkov 1:34:11
yeah,
Nisten
Nisten 1:34:11
take that as you will
Alex Volkov
Alex Volkov 1:34:13
Folks are saying this is high praise for this one —
1:34:16
Riley Fox. I think with this, it's time to conclude ThursdAI for today. We didn't get to all the news, but we covered the most important pieces. Definitely an exciting day when OpenAI releases a new model. Wolf, you wanna finish up, and then —
Wolfram Ravenwolf
Wolfram Ravenwolf 1:34:28
Just wanted to say, I've started a Wolf Bench run on this model.
1:34:32
It's not the expensive Pro one. I will do this, and it will be the next one I put up.
Alex Volkov
Alex Volkov 1:34:37
Also, as we saw, the reported Terminal-Bench score is lower than
1:34:41
the Opus one, so we'll see — maybe it's better on the baseline and lower on the top score as well. Folks, if you missed any part of the show: the show is called ThursdAI, and you can find everything on our website, thursdai.news. Please feel free to visit — the episode links are there. Everything we talked about here, including the screenshots and evaluations, will be posted as a newsletter after this. We are here every week to talk about everything major that happens in AI, including breaking news like right now. We've been doing this for three years, so next week we're gonna celebrate exactly three years of ThursdAI News. Just for comparison: three years ago, GPT-4 was launched. This is 5.4, and it's not really the same world anymore. Everybody is changing how they treat these AIs, and many people are even going through AI psychosis. We're gonna be here to monitor the news for you, and we appreciate you tuning in today. Thank you so much for tuning in. Everything that we haven't covered will also be in the newsletter. Feel free to follow us everywhere you listen to podcasts or newsletters. Thank you so much, folks. With this, we're gonna end the show, and we'll see you here next week, hopefully with more news. Bye-bye.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:35:54
Bye bye.