Episode Summary
This episode captures the feeling that AI acceleration has crossed from hype into lived reality: benchmarks are saturating, toolchains are maturing, and solo founders are shipping at startup speed. The panel opens with Anthropic's reported Pentagon ultimatum and distillation accusations, then moves into hard evidence of capability jumps like METR's ~14.5-hour autonomy time horizon and ARC-AGI nearing saturation. Three interviews anchor the show: Ben Cera on Polsia's hypergrowth, Nader Dabit on Devin 2.2's practical leap, and Philip Kiely on why inference demand is only getting started. The thread throughout is clear: we are not just getting better models, we're getting compounding systems around them.
In This Episode
- Show Intro & Welcome
- TL;DR - Weekly News Roundup
- Anthropic vs Pentagon / War Claude
- Anthropic Distillation Attacks (DeepSeek, Minimax, ZAI)
- Opus 3 Retirement & AI Sentience Debate
- GPT 5.3 Codex Release & OpenClaw
- This Week's Buzz - Kimi 2.5 & Minimax 2.5 on W&B Inference
- Evals & Benchmarks - METR, ARC-AGI, SWE-bench
- Tools & Agentic Engineering - Claude Code, Cursor, Devin
- Interview: Ben Cera - Polsia (AI-Run Companies)
- Interview: Nader Dabit - Cognition / Devin 2.2
- Interview: Philip Kiely - Inference Engineering (Baseten)
- Open Source - Qwen 3.5 & Liquid LFM2
- Seedance 2 & Taalas 15K Tokens/Sec Demo
- Show Wrap-up
Show Intro & Welcome
Alex frames the episode around 'approaching singularity' and the sense that AI progress has entered a visibly faster phase since December. The full co-host panel assembles with a promise of three major interviews.
- Episode thesis: acceleration is now obvious to everyone, not just early adopters
- Full panel + three guest interviews announced up front
TL;DR - Weekly News Roundup
A rapid-fire pass through the week's biggest drops: Pentagon pressure on Anthropic, distillation claims, GPT 5.3 Codex API, Qwen 3.5, Liquid LFM2, METR autonomy growth, ARC-AGI saturation, and new agent tooling. Alex also announces a breaking image-model update mid-segment.
- METR, ARC-AGI, and SWE-bench all presented as major capability-shift signals
- Devin 2.2, Cursor cloud agents, and automation features framed as practical workflow unlocks
Anthropic vs Pentagon / War Claude
The panel debates reports that Anthropic was pressured to remove two military-use restrictions: no autonomous lethal decisions and no domestic mass surveillance. Discussion centers on ethics, state leverage, and whether model control is still realistic in a multi-polar AI world.
- Alleged ultimatum tied to supply-chain-risk designation and Defense Production Act threats
- Strong split between principled refusal and realpolitik cooperation
Anthropic Distillation Attacks (DeepSeek, Minimax, ZAI)
Anthropic's named allegations trigger a heated discussion on ToS abuse, model distillation norms, and the blurry legal line between scraping, training, and derivative outputs. The panel reads the numbers as both technical evidence and geopolitical signaling.
- Reported counts discussed: DeepSeek 150k, Minimax 13M, Moonshot 3.4M exchanges
- Core tension: enforcing platform rules while having trained on broad internet-scale corpora
Opus 3 Retirement & AI Sentience Debate
A short but philosophical detour: Anthropic's treatment of models as entities sparks discussion on AI personhood, anthropomorphism, and whether giving models pseudo-agency is responsible or risky.
- Opus 3 'retirement' narrative becomes a proxy for broader model-rights discourse
- Panel splits between playful framing and concern about AI psychosis dynamics
GPT 5.3 Codex Release & OpenClaw
The panel compares raw coding power versus conversational quality when Codex powers OpenClaw workflows. Consensus: Codex is elite at execution but often too literal and less human in interactive assistant contexts.
- Codex pricing and performance praised for code generation
- Personality and intent-following still seen as Anthropic's edge in assistant UX
This Week's Buzz - Kimi 2.5 & Minimax 2.5 on W&B Inference
Alex and the co-hosts break down newly hosted inference options, emphasizing price/performance and multimodal capabilities. Kimi is highlighted as unusually strong for both tool use and conversational tone.
- Minimax 2.5 presented as ~10x cheaper than premium alternatives in some tiers
- Kimi 2.5 praised for practical function calling and image-in-loop use cases
Evals & Benchmarks - METR, ARC-AGI, SWE-bench
Benchmark discourse dominates this segment: METR's steep autonomy curve, ARC-AGI near-saturation claims, and SWE-bench's shifting reliability. The panel emphasizes both signal and noise in headline benchmark leaps.
- METR's time horizon discussed as the length of task a human expert would need to complete, not the model's raw wall-clock runtime
- SWE-bench Verified de-emphasized as labs move to harder successor benchmarks
Tools & Agentic Engineering - Claude Code, Cursor, Devin
The conversation shifts from model quality to product surface area: CLIs, desktop agents, remote control, automations, and browser loops. The key takeaway is that agent harness quality is becoming a primary competitive layer.
- Labs converging on cron-like automations and remote, async workflows
- Cursor cloud agents and UI demos highlighted as important frontend-dev progress
Interview: Ben Cera - Polsia (AI-Run Companies)
Ben Cera explains Polsia's thesis: AI-native company ops where agents handle code, growth, support, and iteration while founders provide taste and direction. The segment captures a concrete example of autonomous operations already producing revenue.
- Polsia positioned as an opinionated autonomous-company stack
- Run-rate milestone crosses $700k ARR live during the interview
Interview: Nader Dabit - Cognition / Devin 2.2
Nader outlines why Devin feels different now: two years of platform maturity converging with stronger models. He emphasizes a practical organizational effect: lowering friction so non-engineers can fix many issues directly and teams can focus on higher-leverage work.
- Devin Review launch, free public workflow for PR review
- Scheduled sessions/automation and deep workflow polish highlighted
Interview: Philip Kiely - Inference Engineering (Baseten)
Philip argues that inference is becoming the durable center of AI economics, regardless of falling training costs. The discussion covers demand growth, market misconceptions, and why inference engineering is now a core discipline.
- Inference framed as a future 10x-100x larger layer than training
- Cost trends discussed as efficiency gains plus continued premium demand
Open Source - Qwen 3.5 & Liquid LFM2
Open-weight momentum remains strong with Qwen 3.5 variants and Liquid's LFM2 update. The panel focuses on architecture shifts, local viability, and the practical importance of efficient active-parameter footprints.
- Qwen 3.5 Medium discussed at 35B total / 3B active
- Liquid LFM2 highlighted for speed and strong non-coding reasoning
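To make the "active-parameter footprint" point concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the parameter counts quoted in these notes (35B total / 3B active for Qwen 3.5 Medium, 24B total / 2.3B active for LFM2-24B-A2B); the function and variable names are made up for illustration, and the ratio says nothing about memory needs, since the full set of weights still has to be loaded.

```python
# Parameter counts as quoted in these show notes, in billions: (total, active).
MODELS = {
    "Qwen 3.5 Medium": (35.0, 3.0),
    "Liquid LFM2-24B-A2B": (24.0, 2.3),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's total parameters that are active per token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b}B of {total_b}B active ({share:.1%})")
```

On these figures, both models touch a little under a tenth of their weights per token, which is the main reason the panel treats them as plausible candidates for local use.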
Seedance 2 & Taalas 15K Tokens/Sec Demo
Alex showcases Seedance 2 availability in CapCut and then pivots to a hardware demo of ultra-fast on-card inference. The segment underscores how product UX and chip-level innovation are both compressing iteration cycles.
- Seedance 2 shipping in limited product form despite API/legal delays
- Taalas demo shows 15,691 tokens/sec with baked weights
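For a sense of what 15,691 tokens/sec means in practice, the quick Python sketch below converts that rate into generation time for a few response lengths. The response lengths are hypothetical, the helper name is made up for illustration, and the estimate assumes a steady decode rate with no prompt-processing or network overhead, so treat it as a lower bound rather than a measurement.

```python
TOKENS_PER_SECOND = 15_691  # rate quoted for the Taalas demo in these notes


def generation_time_ms(num_tokens: int,
                       tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Idealized time in milliseconds to emit num_tokens at a constant rate."""
    return num_tokens / tokens_per_second * 1000.0


# Hypothetical response lengths, chosen only for illustration.
for n in (100, 1_000, 10_000):
    print(f"{n:>6} tokens -> {generation_time_ms(n):.1f} ms")
```

At that rate a 1,000-token answer would take roughly 64 ms to generate, which is why the demo reads as effectively instant.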
Show Wrap-up
The episode closes by tying the week into a larger 2026 pattern: rapid model iteration, stronger agent tools, and rising audience demand for curated signal. Alex recaps the interviews and points listeners to ThursdAI's website and feeds.
- Over 2,000 live listeners noted
- Core theme reinforced: acceleration is compounding across models, tools, and businesses
Hosts and Guests
- Alex Volkov - AI Evangelist, Weights & Biases (@altryne)
- Co-hosts: @WolframRvnwlf @yampeleg @nisten @ldjconfirmed @ryancarson
- Ben Cera (@bencera_) - Founder, Polsia
- Nader Dabit (@dabit3) - Growth at Cognition
- Philip Kiely (@philipkiely) - DevRel at Baseten, author of Inference Engineering
- ThursdAI new website: https://thursdai.news
Big CO LLMs + APIs
- Anthropic vs Chinese OSS: accuses DeepSeek, Minimax, ZAI of distillation attacks (Blog)
- Pentagon issues an ultimatum to Anthropic: give the military unfettered Claude access by Friday or face the Defense Production Act. Anthropic says NO (Blog)
- OpenAI releases GPT-5.3-Codex, their most capable agentic coding model, to all developers via the Responses API (X, Announcement)
Open Source LLMs
- Alibaba: Qwen 3.5 Medium - 35B model with only 3B active parameters outperforms their previous 235B flagship (X, HF, HF, HF, Blog)
- Liquid AI releases LFM2-24B-A2B: a 24B MoE model with only 2.3B active parameters that runs on consumer laptops (X, HF, Blog)
- Perplexity launches ppxl-embed - SOTA embedding models (Blog, HF, API)
Evals & Benchmarks
- METR Time Horizon benchmark goes vertical: Claude Opus 4.6 achieves ~14.5-hour task completion (X, Blog)
- Confluence Labs emerges from stealth with 97.9% SOTA on the ARC-AGI-2 benchmark (X, GitHub)
- OpenAI retires SWE-bench Verified (X, Blog)
- Agentica claims to have solved all of public ARC-AGI-3 (X)
Tools & Agentic Engineering
- Happy 1-year birthday, Claude Code!
- Devin AI 2.2 - autonomous agent with computer use and a browser that can self-verify and self-fix its own work (X)
- LMStudio launches LMLink - use your local models from everywhere with Tailscale! (try it)
- Claude Code introduces Remote Control (X, Docs) and memory (X)
- Claude Cowork and Codex both now have automations (cron jobs) (Cowork)
- Cursor launches cloud agents (X)
- Nous Research agent (X)
- Perplexity Computer (blog)
This Week's Buzz - W&B
- W&B adds MiniMax 2.5 and Kimi K2.5 on Inference Service (LINK)
Interviews
- Ben Cera - Polsia Dashboard: polsia.com/live
- Nader Dabit - on seeing the future (blog)
- Philip Kiely - Inference Engineering book (Book)
Vision & Video
- Seedance 2.0 finally available in CapCut in the US (X)
Voice & Audio
- OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5 models (X, Announcement)
AI Art & Diffusion & 3D
- Google DeepMind launches Nano Banana 2 (X, Announcement)
- Quiver solves SVG with Arrow 1.0 (X)
Others
- Taalas AI - 15,000 tokens per second demo (chatjimmy.ai)