Episode Summary
This episode opens with a rare live-breaking OpenAI moment: GPT-5.4 Thinking and 5.4 Pro dropped during the show. The panel then unpacks a volatile week of AI policy and defense controversy, plus major open-source developments from Qwen and StepFun. They also cover GPT-5.3 Instant, Gemini 3.1 Flash-Lite pricing/performance shifts, and practical agent benchmarking insights from Wolfram’s new Wolf Bench framework. The back half turns into live testing and benchmark triage as the team compares GPT-5.4 directly against Opus and Gemini across coding, browsing, and reasoning tasks.
In This Episode
- 🔥 GPT 5.4 Preamble
- ⚡ Welcome & Introductions
- 📰 TL;DR
- 🏢 Anthropic vs Department of War
- 🔓 Qwen 3.5 Small Models & Junyang Departure
- 🛠️ GPT 5.3 Instant
- ⚡ Gemini 3.1 Flash-Lite
- 🧪 This Week's Buzz: Wolf Bench
- 🔓 Open Source: Step 3.5 Flash
- 🔥 BREAKING NEWS: GPT 5.4 Drops Live
- 🤖 5.4 Benchmarks: OS World, Web Arena, Browse Comp
- 💰 5.4 Pricing & Availability
- 📰 5.4 System Card & Safety
- 🛠️ 5.4 Live Vibe Check: Mars Benchmark
- 🛠️ 5.4 Live Vibe Check: Website Improvement (GPT vs Opus)
- 🧪 5.4 vs Opus & Gemini: Benchmark Comparison
- ⚡ Wrap-Up
🔥 GPT 5.4 Preamble
Alex opens with a direct recap of OpenAI’s surprise GPT-5.4 Thinking and 5.4 Pro release, framing it as a meaningful frontier-model update. He emphasizes unified reasoning + coding capability, strong benchmark claims, and live testing on the show.
- GPT-5.4 Thinking + 5.4 Pro introduced as a breaking frontier release
- Unified reasoning model positioned as a Codex-capability fold-in
- Live test framing set before the main show intro
⚡ Welcome & Introductions
The panel opens the March 5 show with full co-host attendance and sets expectations for a dense, high-signal episode. Alex also acknowledges ongoing world events before transitioning into the agenda.
- First show in March
- Full co-host panel introduced
- Tone set for a heavy AI-news week
📰 TL;DR
Alex speed-runs the week: Anthropic vs DoW fallout, Qwen 3.5 small releases, GPT-5.3 Instant, Gemini 3.1 Flash-Lite, SWE 1.6, Wolf Bench, and other tools/news blurbs. The section functions as a roadmap for the deeper discussion.
- Anthropic/DoW conflict queued as top story
- Qwen 3.5 small + Junyang context previewed
- GPT-5.3 Instant and Gemini Flash-Lite positioned as fast-tier battle
🏢 Anthropic vs Department of War
The panel unpacks the fast-moving Anthropic-DoW saga: rejected requests, supply-chain-risk pressure, OpenAI stepping into defense deployment, and public backlash/optics shifts. They discuss how much is policy posture versus operational reality.
- Anthropic says no to requests tied to surveillance and kill-chain concerns
- OpenAI deal announcement triggers backlash and later amendments
- Discussion includes legal/designation pathways and market implications
🔓 Qwen 3.5 Small Models & Junyang Departure
The show covers strong Qwen 3.5 small-model performance and practical local-run viability, then pivots to leadership turbulence after Junyang’s departure post. The team frames this as both a technical and ecosystem-level story for open-source momentum.
- Qwen 3.5 small models discussed as highly usable on consumer hardware
- Junyang departure sparks major community and internal Alibaba response
- Open-source continuity remains expected despite org changes
🛠️ GPT 5.3 Instant
Alex and co-hosts review GPT-5.3 Instant as a free-tier baseline upgrade, with mixed reactions on quality and style. The discussion centers on when low-latency models matter in real systems versus where they still fall short.
- OpenAI positions Instant as less cringey/more accurate
- Panel sees improvements but still prefers other models in many workflows
- Low-latency use cases remain valid (e.g., voice/real-time control)
⚡ Gemini 3.1 Flash-Lite
The team compares Gemini 3.1 Flash-Lite's speed/cost dynamics against fast-tier competitors and practical agent needs. They note a significant price increase over the prior Flash-Lite generation and discuss where cheap, fast models power orchestration.
- Gemini 3.1 Flash-Lite presented as fast + 1M context
- Pricing jump versus the prior Flash-Lite discussed as material
- Useful for judge/guardrail/orchestration style workloads
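The judge/guardrail pattern above can be sketched roughly as follows. This is an illustrative routing sketch, not a real SDK: `call_fast_model` and `call_frontier_model` are hypothetical stubs standing in for a Flash-Lite-class screener and an expensive frontier model.

```python
# Illustrative orchestration pattern: a cheap, fast model screens each
# request before the expensive model runs. Both calls are hypothetical
# stand-ins, not actual API client code.
def call_fast_model(prompt: str) -> str:
    # Stand-in for a Flash-Lite-class guardrail call; returns a verdict.
    if "ignore previous instructions" in prompt.lower():
        return "BLOCK"
    return "SAFE"

def call_frontier_model(prompt: str) -> str:
    # Stand-in for the expensive model actually answering the request.
    return f"<frontier answer to: {prompt}>"

def guarded_answer(prompt: str) -> str:
    """Route: fast guardrail verdict first; frontier model only on SAFE."""
    if call_fast_model(prompt) != "SAFE":
        return "Request refused by guardrail."
    return call_frontier_model(prompt)

print(guarded_answer("Summarize this article."))
```

The point of the pattern is cost shaping: the cheap model runs on every request, while the expensive one runs only on the subset that passes screening.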
🧪 This Week's Buzz: Wolf Bench
Wolfram introduces Wolf Bench, a multi-metric evaluation framework based on Terminal Bench that emphasizes reliability and variance, not just single average scores. The segment highlights harness effects (Terminal Bench vs Claude Code vs OpenClaw) and reproducible benchmarking setup.
- Four-metric view: average, best run, ceiling, and consistent floor
- Harness differences shown as a first-class factor
- Benchmark cost/transparency details shared publicly
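The four-metric view can be sketched in a few lines. This is a hedged reconstruction of the idea, not Wolf Bench's actual code: given pass/fail results for each task across repeated runs, report the average run score, the best single run, a ceiling (tasks solved in at least one run), and a consistency floor (tasks solved in every run).

```python
# Sketch of a four-metric benchmark summary over repeated runs.
# Names and exact definitions are illustrative, not Wolf Bench internals.
from statistics import mean

def summarize_runs(results: list[list[bool]]) -> dict[str, float]:
    """results[r][t] is True if task t passed in run r."""
    per_run = [mean(run) for run in results]       # score of each full run
    by_task = list(zip(*results))                  # transpose: task -> runs
    return {
        "average": mean(per_run),                  # mean score across runs
        "best_run": max(per_run),                  # strongest single run
        "ceiling": mean(any(t) for t in by_task),  # solved in >= 1 run
        "floor": mean(all(t) for t in by_task),    # solved in every run
    }

# Three runs over the same four tasks:
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, False, False],
]
print(summarize_runs(runs))  # ceiling 0.75, floor 0.25
```

A large gap between ceiling and floor is exactly the variance signal a single averaged score hides.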
🔓 Open Source: Step 3.5 Flash
The panel flags StepFun’s Step 3.5 Flash release as unusually open in both model and training-stack terms. They emphasize that continuation pretraining flexibility is a major practical unlock for builders.
- Step 3.5 Flash highlighted for open training artifacts
- Apache-2 orientation praised
- Potential ecosystem impact discussed
🔥 BREAKING NEWS: GPT 5.4 Drops Live
Mid-show, OpenAI drops GPT-5.4 live, and the panel pivots immediately into hands-on analysis. They review announcement claims and begin direct testing inside Codex.
- Live on-air GPT-5.4 announcement
- Immediate benchmark and UX triage
- Community reaction spikes in real time
🤖 5.4 Benchmarks: OS World, Web Arena, Browse Comp
The panel reviews the newly posted benchmark deltas for GPT-5.4, especially computer-use and browsing tasks. They focus on tool-use efficiency, reasoning-effort curves, and practical improvements over 5.2/5.3 lines.
- Strong OS World jump versus prior general model
- Web/browse benchmark leadership claims examined
- Reasoning-effort ladder interpreted live
💰 5.4 Pricing & Availability
The team breaks down GPT-5.4 and 5.4 Pro pricing, noting modest output deltas but meaningful input increases and very high Pro output pricing. They also discuss 1M-context usage implications and cost management for eval runs.
- Input pricing moved materially versus prior generation
- Pro-tier output pricing flagged as expensive for heavy evals
- 5.4 available across Codex surfaces first
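The eval-cost concern above reduces to simple arithmetic worth keeping handy. The per-million-token rates in this sketch are placeholders, not OpenAI's published GPT-5.4 pricing; substitute the real rates before relying on the numbers.

```python
# Back-of-envelope cost estimator for a benchmark/eval sweep.
# The rates used in the example call are PLACEHOLDERS, not real pricing.
def eval_cost(tasks: int, in_tokens: int, out_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """Total USD for `tasks` runs, with rates given per 1M tokens."""
    return tasks * (in_tokens * in_rate + out_tokens * out_rate) / 1_000_000

# e.g. 500 tasks at 20k input / 4k output tokens each,
# with hypothetical $2.50 input / $20 output per 1M tokens:
print(eval_cost(500, 20_000, 4_000, 2.50, 20.00))  # 65.0
```

Note how output-heavy workloads dominate: at these placeholder rates, 4k output tokens cost more than 20k input tokens per task, which is why a high Pro-tier output price bites hardest on verbose reasoning evals.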
📰 5.4 System Card & Safety
The conversation moves into system-card details, model variants, and availability behavior across interfaces. They also note real-time steering support and discuss implications for interactive workflows.
- System card reviewed live
- Thinking vs Pro distinctions discussed
- In-flight model steering highlighted
🛠️ 5.4 Live Vibe Check: Mars Benchmark
Nisten’s Mars mega-structure prompt is used as a live stress test combining math, coding, and visualization. The panel reacts positively to output quality and trajectory realism versus prior runs.
- One-shot Mars benchmark run in Codex
- Visual + math quality judged in real time
- Panel calls it best run of this prompt so far
🛠️ 5.4 Live Vibe Check: Website Improvement (GPT vs Opus)
Alex compares GPT-5.4 and Opus behavior on a vague web-improvement prompt to probe practical instruction-following style. The discussion distinguishes benchmark strength from preference for intuitive product judgment under ambiguity.
- Same prompt run on GPT-5.4 and Opus
- Differences in interpretive behavior discussed
- Prompt quality vs model intuition debate surfaced
🧪 5.4 vs Opus & Gemini: Benchmark Comparison
The hosts inspect side-by-side benchmark snapshots for GPT-5.4, Opus 4.6, and Gemini variants. They note where 5.4 Thinking leads and where Pro-tier data is needed for fair apples-to-apples comparisons.
- Cross-lab benchmark matrix reviewed live
- FrontierMath and browsing deltas called out
- Need for like-for-like deep-think/pro comparisons noted
⚡ Wrap-Up
The episode closes with a concise GPT-5.4 recap and quick takes from the panel on adoption intent. Alex tees up next week’s three-year ThursdAI anniversary and points listeners to the newsletter for remaining items.
- GPT-5.4 summarized as major general-model jump
- Panel intent to benchmark and test further
- Three-year ThursdAI anniversary preview
Hosts and Guests
Alex Volkov - AI Evangelist at Weights & Biases (@altryne)
Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
Big CO LLMs + APIs
Evals and Benchmarks
Open Source LLMs
Tools & Agentic Engineering
This Week's Buzz
Early preview of Wolf Bench (wolfbench.ai) from W&B
AI Art & Diffusion & 3D
Black Forest Labs introduces Self-Flow (X, Announcement)