Episode Summary

This episode captures the feeling that AI acceleration has crossed from hype into lived reality: benchmarks are saturating, toolchains are maturing, and solo founders are shipping at startup speed. The panel opens with Anthropic's reported Pentagon ultimatum and distillation accusations, then moves into hard evidence of capability jumps like METR's 14.5-hour autonomy and ARC-AGI nearing saturation. Three interviews anchor the show: Ben Broca on Polsia's hypergrowth, Nader Dabit on Devin 2.2's practical leap, and Philip Kiely on why inference demand is only getting started. The thread throughout is clear: we are not just getting better models, we're getting compounding systems around them.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Ben Broca
Founder & CEO · Polsia
@bencera_
Nader Dabit
Growth · Cognition
@dabit3
Philip Kiely
Head of Developer Relations · Baseten
@philipkiely
Nisten Tahiraj
AI operator & builder
@nisten
Wolfram Ravenwolf
Independent AI evaluator (r/LocalLLaMA)
@WolframRvnwlf
Ryan Carson
AI educator & founder
@ryancarson
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
Nous Research
@ldjconfirmed

By The Numbers

METR Time Horizon
14.5h
Opus-level agents now complete tasks equivalent to over 14 hours of expert human work
Autonomy Doubling Time
49 days
Panel cites METR's recent doubling cadence as dramatically faster than historical compute trends
ARC-AGI-2
97.9%
Confluence Labs result discussed as a major signal that this benchmark is near saturation
Taalas Demo Throughput
15,000 tok/s
Chip-level baked-weight demo for Llama 3 8B shown as a 10x speed-class jump
Qwen 3.5 Medium
35B / 3B active
New open model architecture with low active params and strong practical coding/agent performance
Polsia Run Rate
$700k ARR
Ben Broca's autonomous-company platform crossed this mark live during the show
Minimax Distillation Exchanges (claimed)
13M
Figure discussed while comparing Anthropic's reported account-abuse counts across labs

🔥 Breaking During The Show

Nano Banana 2 (Flash-quality image model) announced during the show
Alex breaks in mid-TLDR to call out Google's new image model tier, describing near-Pro quality at roughly half price plus image search capability.
$700k ARR crossed live by Polsia
During Ben Broca's interview, Alex notes the run-rate counter crossing $700k ARR in real time.

⚡ Show Intro & Welcome

Alex frames the episode around 'approaching singularity' and the sense that AI progress has entered a visibly faster phase since December. The full co-host panel assembles with a promise of three major interviews.

  • Episode thesis: acceleration is now obvious to everyone, not just early adopters
  • Full panel + three guest interviews announced up front
Alex Volkov
"This is how we're getting to the singularity."

📰 TL;DR - Weekly News Roundup

A rapid-fire pass through the week's biggest drops: Pentagon pressure on Anthropic, distillation claims, GPT 5.3 Codex API, Qwen 3.5, Liquid LFM2, METR autonomy growth, ARC-AGI saturation, and new agent tooling. Alex also announces a breaking image-model update mid-segment.

  • METR, ARC-AGI, and SWE-bench all presented as major capability-shift signals
  • Devin 2.2, Cursor cloud agents, and automation features framed as practical workflow unlocks
Alex Volkov
"Opus, based on this benchmark, runs autonomously for over 14 hours to achieve a task."

🔥 Anthropic vs Pentagon / War Claude

The panel debates reports that Anthropic was pressured to remove two military-use restrictions: no autonomous lethal decisions and no domestic mass surveillance. Discussion centers on ethics, state leverage, and whether model control is still realistic in a multi-polar AI world.

  • Alleged ultimatum tied to supply-chain-risk designation and Defense Production Act threats
  • Strong split between principled refusal and realpolitik cooperation
Alex Volkov
"The two red lines: no domestic surveillance of American people and no fully autonomous lethal weapons."
Ryan Carson
"The genie's outta the bottle."

🧪 Anthropic Distillation Attacks (DeepSeek, Minimax, ZAI)

Anthropic's named allegations trigger a heated discussion on ToS abuse, model distillation norms, and the blurry legal line between scraping, training, and derivative outputs. The panel reads the numbers as both technical evidence and geopolitical signaling.

  • Reported counts discussed: DeepSeek 150k, Minimax 13M, Moonshot 3.4M exchanges
  • Core tension: enforcing platform rules while having trained on broad internet-scale corpora
Yam Peleg
"What did you train your models on?"

🤖 Opus 3 Retirement & AI Sentience Debate

A short but philosophical detour: Anthropic's treatment of models as entities sparks discussion on AI personhood, anthropomorphism, and whether giving models pseudo-agency is responsible or risky.

  • Opus 3 'retirement' narrative becomes a proxy for broader model-rights discourse
  • Panel splits between playful framing and concern about AI psychosis dynamics
Alex Volkov
"How far will they go with asking the models what they actually want?"

๐Ÿ› ๏ธ GPT 5.3 Codex Release & Open Claw

The panel compares raw coding power versus conversational quality when Codex powers OpenClaw workflows. Consensus: Codex is elite at execution but often too literal and less human in interactive assistant contexts.

  • Codex pricing and performance praised for code generation
  • Personality and intent-following still seen as Anthropic's edge in assistant UX
Yam Peleg
"It's an absolute beast for writing code... but it's doing exactly what you tell it to do."

💰 This Week's Buzz - Kimi 2.5 & Minimax 2.5 on W&B Inference

Alex and the co-hosts break down newly hosted inference options, emphasizing price/performance and multimodal capabilities. Kimi is highlighted as unusually strong for both tool use and conversational tone.

  • Minimax 2.5 presented as ~10x cheaper than premium alternatives in some tiers
  • Kimi 2.5 praised for practical function calling and image-in-loop use cases
Nisten Tahiraj
"I had it for a week... ten users testing alpha and used like four bucks for the whole week."

🧪 Evals & Benchmarks - METR, ARC-AGI, SWE-bench

Benchmark discourse dominates this segment: METR's steep autonomy curve, ARC-AGI near-saturation claims, and SWE-bench's shifting reliability. The panel emphasizes both signal and noise in headline benchmark leaps.

  • METR discussed as equivalent expert-task horizon, not raw wall-clock runtime
  • SWE-bench Verified de-emphasized as labs move to harder successor benchmarks
Alex Volkov
"This is not a log chart, this is a regular chart. Opus is literally off the chart."
Alex Volkov
"The doubling time... is 49 days."

🤖 Tools & Agentic Engineering - Claude Code, Cursor, Devin

The conversation shifts from model quality to product surface area: CLIs, desktop agents, remote control, automations, and browser loops. The key takeaway is that agent harness quality is becoming a primary competitive layer.

  • Labs converging on cron-like automations and remote, async workflows
  • Cursor cloud agents and UI demos highlighted as important frontend-dev progress
Ryan Carson
"Heartbeats, cron jobs, browser testing, cloud-based agents... all that's gonna be rolled into the entire product."

💰 Interview: Ben Broca - Polsia (AI-Run Companies)

Ben Broca explains Polsia's thesis: AI-native company ops where agents handle code, growth, support, and iteration while founders provide taste and direction. The segment captures a concrete example of autonomous operations already producing revenue.

  • Polsia positioned as an opinionated autonomous-company stack
  • Run-rate milestone crosses $700k ARR live during the interview
Ben Broca
"Polsia will do 80% of the grunt work."
Ben Broca
"Can I make it 90% autonomous? Can I make it 100% autonomous?"

๐Ÿ› ๏ธ Interview: Nader Dabit - Cognition / Devin 2.2

Nader outlines why Devin feels different now: two years of platform maturity converging with stronger models. He emphasizes a practical organizational effectโ€”lowering friction so non-engineers can fix many issues directly and teams can focus on higher-leverage work.

  • Devin Review launch, free public workflow for PR review
  • Scheduled sessions/automation and deep workflow polish highlighted
Nader Dabit
"This is the worst that they'll ever be at this moment."
Nader Dabit
"If someone notices a typo... they can just say, 'Hey Devin, fix this.'"

⚡ Interview: Philip Kiely - Inference Engineering (Baseten)

Philip argues that inference is becoming the durable center of AI economics, regardless of falling training costs. The discussion covers demand growth, market misconceptions, and why inference engineering is now a core discipline.

  • Inference framed as a future 10x-100x larger layer than training
  • Cost trends discussed as efficiency gains plus continued premium demand
Philip Kiely
"Inference is everything, man."

🔓 Open Source - Qwen 3.5 & Liquid LFM 2

Open-weight momentum remains strong with Qwen 3.5 variants and Liquid's LFM2 update. The panel focuses on architecture shifts, local viability, and the practical importance of efficient active-parameter footprints.

  • Qwen 3.5 Medium discussed at 35B total / 3B active
  • Liquid LFM2 highlighted for speed and strong non-coding reasoning
Nisten Tahiraj
"This one is special in the architecture... hybrid state-space model Mamba layers."

🎥 Seedance 2 & Taalas 15K Tokens/Sec Demo

Alex showcases Seedance 2 availability in CapCut and then pivots to a hardware demo of ultra-fast on-card inference. The segment underscores how product UX and chip-level innovation are both compressing iteration cycles.

  • Seedance 2 shipping in limited product form despite API/legal delays
  • Taalas demo shows 15,691 tokens/sec with baked weights
Alex Volkov
"It shows me 15,691 tokens per second."

📰 Show Wrap-up

The episode closes by tying the week into a larger 2026 pattern: rapid model iteration, stronger agent tools, and rising audience demand for curated signal. Alex recaps the interviews and points listeners to ThursdAI's website and feeds.

  • Over 2,000 live listeners noted
  • Core theme reinforced: acceleration is compounding across models, tools, and businesses
  • Big CO LLMs + APIs
    Anthropic vs Chinese OSS - Accuses DeepSeek, Minimax, ZAI of distillation attacks (Blog)
  • Pentagon Issues an ultimatum to Anthropic: Give military unfettered Claude access by Friday or face Defense Production Act - Anthropic says NO (Blog)
  • OpenAI releases GPT-5.3-Codex, their most capable agentic coding model, to all developers via the Responses API (X, Announcement)
  • Open Source LLMs
    Alibaba: Qwen 3.5 Medium - 35B model with only 3B active parameters outperforms their previous 235B flagship (X, HF, HF, HF, Blog)
  • Liquid AI releases LFM2-24B-A2B: A 24B MoE model with only 2.3B active parameters that runs on consumer laptops (X, HF, Blog)
  • Perplexity launches ppxl-embed - SOTA embedding models (Blog, HF, API)
  • Evals & Benchmarks
    METR Time Horizon Benchmark Goes Vertical: Claude Opus 4.6 Achieves ~14.5 Hour Task Completion (X, Blog)
  • Confluence Labs emerges from stealth with 97.9% SOTA on ARC-AGI-2 benchmark (X, GitHub)
  • OpenAI Retires SWE-bench Verified (X, Blog)
  • Agentica claims to have solved all public ARC-AGI 3 (X)
  • Tools & Agentic Engineering
    Happy 1 year Birthday Claude Code!
  • Devin AI 2.2 - autonomous agent with computer use, browser, self verify and self fix its own work (X)
  • LMStudio launches LMLink - use your local models from everywhere with TailScale! (try it)
  • Claude Code introduces Remote Control (X, Docs) and memory (X)
  • Claude Cowork and Codex both now have automations (Cron Jobs) (Cowork)
  • Cursor launches cloud agents (X)
  • Nous research agent (X)
  • Perplexity Computer (blog)
  • This week's Buzz - W&B
    W&B adds MiniMax 2.5 and Kimi K2.5 on Inference Service (LINK)
  • Interviews
    Ben Broca - polsia.com/live Polsia Dashboard
  • Nader Dabit - on seeing the future (blog)
  • Philip Kiely - Inference Engineering book (Book)
  • Vision & Video
    Seedance 2.0 finally available in CapCut in the US (X)
  • Voice & Audio
    OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5 models (X, Announcement)
  • AI Art & Diffusion & 3D
    Google DeepMind launches Nano Banana 2 (X, Announcement)
  • Quiver solves SVG with Arrow 1.0 (X)
  • Others
    Taalas AI - 15,000 tokens per second demo (chatjimmy.ai)
Alex Volkov 0:29
Good morning or evening, depends on where you are.
0:32
Welcome to ThursdAI. You're tuned in to the weekly show that keeps you up to date. So if you are like everybody else in the beginning of 2026, overwhelmed with what's going on with AI, there's just too many things at once and you feel like you need a full-time job covering or just, like, knowing the news. Uh, that's what we're here for. My name's Alex. I'm an AI evangelist with Weights & Biases from CoreWeave. And you are on ThursdAI, the weekly AI news show that brings you everything that matters in the world of AI, which is getting, to be very frank, increasingly harder and harder to do. So we have to make hard choices in what we cover. Uh, but, uh, with me today to help me do this is Yam Peleg and Wolfram Ravenwolf. And we're gonna have a few more guest hosts. How you guys doing? Welcome to the show.
Wolfram Ravenwolf 1:26
Excellent.
1:26
How are you, Alex?
Alex Volkov 1:28
Doing good.
1:29
Good. How you doing? How was your week?
Yam Peleg 1:31
Crazy week.
1:31
Crazy week. Crazy week.
Alex Volkov 1:33
They are getting, we're
Yam Peleg 1:34
we're crazy.
1:35
We're, we're at the fast takeoff. That, that's the moment. Like, that's, that's the moment. Takeoff man.
Alex Volkov 1:39
Did you see my, um, my thing for today?
1:42
Like I said, we're approaching singularity, and I definitely feel like since December, like, stuff has really changed. And so we must absolutely talk about this. Things have changed significantly and we all felt it, and now it feels like everybody else is feeling it too. And not only from AI capabilities, just from the level of AI news, the amount of stuff that people can ship to production for all of us to play with. Uh, I'm barely able to, like, catch up to the toys that I'm getting every day. Uh, and,
Wolfram Ravenwolf 2:13
uh, not even AI will stay up to date with AI.
2:15
Definitely. And everybody is using it to accelerate even more, though we are still in the quiet phase of the year, I'm pretty sure.
Yam Peleg 2:25
Look, it's accelerating because everyone got access to the tools.
2:28
So now everyone is building and everyone is building more. So you get more stuff and because you get more stuff and you can use the stuff to make more stuff. And that's pretty much what's going on.
Alex Volkov 2:37
This is how, this is how we're getting to the singularity.
2:40
And um, this is why I named the show Approaching Singularity today. Uh, because it does feel like we've been talking about acceleration for a long time. We've been talking about coding change for a long time before people caught up. We've been talking about open source. But it does feel like since December, and definitely since the beginning of the year, things are accelerating to a stupid degree. And to help us cover that acceleration, folks, I'm very happy to tell you that we have not one, not two, but three interviews today with incredible folks to help us cover the news. So we're gonna chat with Ben, founder of Polsia. You guys absolutely have to hear about Polsia. This is, like, insane. This guy's a single founder building an AI company that helps other AI companies grow, and he's passed 650,000 in yearly run-rate revenue since, like, December, and the graph looks parabolic. It's absolutely bonkers. Uh, so we're gonna chat with him about Polsia and how it is to run a fully autonomous AI company that runs other companies autonomously. Um, we're also gonna chat with Nader Dabit from Cognition. When I told the team that Nader's gonna come on, the reaction was, hey, I owe this guy my career. Nader is, like, a legendary developer relations person. He recently joined Cognition. If you guys remember Devin, we covered Devin. Devin is, like, the original agentic async coder. And we're gonna chat with Philip Kiely, who just released a book called Inference Engineering. So all of that is coming later down the show. Meanwhile, I wanna add a few other co-hosts here. Guys, I will say this: if we're coasting towards the singularity, there's no better group of people to cover this than this group, for sure. A hundred percent. Welcome, Ryan Carson.
Ryan Carson 4:25
Good to see everybody.
4:26
Yeah. Let's, uh, let's roll into the singularity together and see what happens.
Alex Volkov 4:29
That's, yeah, I think, I think this is the show now.
4:32
This is the show. We're just like, uh, vibe coding ourselves into the singularity. LDJ, how you doing?
LDJ 4:40
I am doing great.
Alex Volkov 4:41
Alrighty.
4:42
Short and sweet. I love it. 'Cause that's actually great, because we don't have a lot of time on the show today. We have a lot of stuff to cover. Let's run through the TLDR, folks. Here's, uh, everything you have potentially missed in the world of AI for the past week,
5:06
This is the TLDR for ThursdAI for February 26th. Your host is Alex Volkov with Weights & Biases, CoreWeave. Co-hosts: Wolfram Ravenwolf, Yam Peleg, Nisten Tahiraj, LDJ, and Ryan Carson. Looks like we have a full panel here for discussions as well. Three interviews today, with Ben from Polsia, Nader Dabit from Cognition, and Philip Kiely, a DevRel at Baseten, author of Inference Engineering, a new book that I would love to tell you all about. The number one thing that I think we must discuss is that Anthropic has an ultimatum from the Pentagon, the Department of War: give the military unfettered access to Claude by Friday or face the Defense Production Act. It's quite insane that we're at this point. Basically, the Pentagon gave Anthropic an ultimatum. Claude is being used for everything except autonomous lethal weapons of war or mass surveillance of US citizens, and the Pentagon is asking Anthropic to drop this hard line by Friday or face really bad consequences. So we'll see what happens tomorrow. Also, Anthropic posted at the beginning of this week that they've detected what they call distillation attacks, and they name names: specifically DeepSeek, MiniMax, and Z.ai with GLM. And we have to talk about this as well. What is a distillation attack? What does it mean? Is it illegal? Anthropic is also scraping, so we absolutely have to talk about this with you as well. Um, OpenAI released GPT 5.3 Codex in the API, so now it's available for evaluations and other tools, et cetera. GPT 5.3 Codex is a model we told you about here live on the show when it dropped, and it has since been probably the craziest model OpenAI has released, compared to Opus. Many people are switching to this, including our own host here, Ryan Carson, who switched to Codex and swears by it. What else? We have open source, and open source is banging: Alibaba Qwen 3.5 Medium, a 35-billion-parameter model with only 3 billion active. And it's also very good.
Shout out to our friends at Alibaba's Tongyi Lab, Qwen. There are multiple names for this lab, but yeah, shout out to the Tongyi folks. And also our friends at Liquid AI released LFM2-24B. This is their largest liquid foundation model, with only 2.3 billion parameters active, very fast on local inference. So local is banging. I have a new corner for you today called Evals and Benchmarks, and Wolfram, I would love, if we have a chance, to talk about this, because besides the Anthropic news, this is likely the most important piece of evaluation we have seen this week. METR measures the long time horizon of models: for how long can models run autonomously. We've seen doubling every certain number of months. Opus achieves 14.5-hour task completion. Opus, based on this benchmark, runs autonomously for over 14 hours to achieve a task, and we've seen this graph go absolutely exponential. We have to talk to you about time-horizon benchmarks and whether or not they're completely saturated at this point, because agents are running, basically, for a very long time. And this is the singularity we talk about: the longer agents run, the better the agents they build; the longer those run, the better the agents they build. We also saw a complete destruction of ARC-AGI, folks. I don't know if you saw this. Somebody came out, Confluence Labs, with 97.9% on ARC-AGI-2, which supposedly two years ago was impossible for LLMs. Now it gets 97.9%. Plus, I don't have this in my notes, but I saw that somebody posted that they solved all the publicly available ARC-AGI 3 as well. So ARC-AGI, the test that supposedly tells us if AGI is here, is absolutely, completely saturated now. And that's, like, a whole thing. This is, like, in the span of a week, right? METR goes ballistic, ARC-AGI goes absolutely saturated, OpenAI retires SWE-bench Verified because that is fully saturated as well. SWE-bench Verified.
Ryan, I remember we looked at this one and you were like, nah, SWE-bench Verified, not for me. OpenAI just said publicly, on Swyx's Latent Space as well, that, like most of the other labs, the models know the golden solutions by heart now for SWE-bench Verified, and they're only focusing on SWE-bench Pro. So we're gonna discuss this as well. Another hot corner for us is Tools and Agentic Engineering. This is, folks, if you've been living under a rock, or if you're new to ThursdAI: the world is changing really fast in this specific area. Tools for agentic engineering, and agentic engineering itself, are everywhere right now. If companies are not taking into account the amount of tokens that their developers are spending, in addition to the developer salaries, they're not gonna make it, in the very straight sense of the word. We are now in a world where, when you hire a person, you need to consider at least their salary's worth of tokens for that person to produce the next day's output. And this is why we have this corner, and in this corner I wanna wish a happy one-year birthday to Claude Code. Can you believe it? It's been only one year of this. Let's go,
Yam Peleg 10:14
let's go.
Alex Volkov 10:16
And now it's just absolutely incredible.
10:18
So, one-year birthday. It just signifies how fast we move: through this one year, just how incredibly vast the changes are. Devin released Devin 2.2, an autonomous agent with computer use, browser, self-verify, and self-fix of its own work. I've gotten a little bit of early access to this Devin thing, and now it's launched publicly to everyone, and we'll have Nader talk about this, folks. Devin slaps. Devin does stuff that neither Claude Code nor Cowork nor OpenClaw nor any of the other tools I've used does, and it's really good. And I would love to tell you about this, because it's so good that the promise of the original Devin, when it launched, it now delivers on that promise. Would love to hear from you, Nader, about Devin. The Devin announcement was March 12th, 2024. Yes, two years ago. We talked to you about Devin two fucking years ago. It's insane. Folks, Devin launches, Cursor launches cloud agents also. So in one week we have the Devin relaunch and Cursor launching cloud agents. So Cursor kind of pivots from IDE, to an extent, also to, like, an agent cloud thingy. We're talking all about Cursor. We love Cursor here. Speaking of the tools, both Claude Cowork and Codex now have automations, so cron jobs that run to do tasks for you while you're asleep, with the small exception that you have to leave your laptop open. But it's very important. LM Studio launches LMLink so you can use your local models elsewhere. All of these things, just, like, one after another. I'm pretty sure there's more
Yam Peleg 11:52
perplexity.
Alex Volkov 11:53
Perplexity Launch
Yam Peleg 11:54
perplexity.
Alex Volkov 11:55
Perplexity launches, the computer comes out of
11:57
obscurity: Perplexity Computer. If you guys have used it, we would love to hear what that's about. What Perplexity launched is Perplexity Computer, which has all of the tools for all of the agents as well. We will also tell you about This Week's Buzz: we launched MiniMax 2.5 and Kimi 2.5 on our inference service. It's a very cheap and very fast inference service powered by CoreWeave, which also powers OpenAI and a bunch of other premier LLMs. So you can get some of that inference for your open source models, and we'll tell you how to get that connected to European Cloud very quick. And then, in order to highlight this intelligence explosion, we have three folks who are very near and dear to this intelligence explosion. Ben Cera, on scaling his AI autonomous business to over 600,000 MRR. Oh, sorry, I think this is ARR. 600,000 MRR would be crazy, but it's 600,000 ARR since December. He only launched in December; he is approaching 600,000 ARR. It's insane. We'll also talk with Philip Kiely, whose Inference Engineering book just launched, and Nader Dabit from Cognition, to talk to us about Devin. Go right now. Stop listening to me, go. No, don't really, but go to CapCut. It's now free. It took the longest time, but it's now free. And we're not even at the breaking news point yet. OpenAI releases GPT audio 1.5, so OpenAI real-time audio has also improved significantly. This is something we covered here with Kwindla Kramer and a bunch of other folks: if you're building agents that talk to you, if you wanna talk to your Claudes or OpenClaws or whatever, GPT audio 1.5 is really good. Not cheap, but really good. And we have breaking news in the middle of the TLDR. As always, we have breaking news, AI breaking news, coming at you only on ThursdAI.
13:46
But while we wait for DeepSeek, the breaking news is, folks: Google DeepMind launches Nano Banana 2, AKA Nano Banana Pro Flash if you want, because it's the same quality. It's really... I got access, shout out to the Gemini team, I got early access to Nano Banana 2, and it is literally the same quality as Nano Banana Pro for half the price. And supposedly it's also faster. I think at launch it's not actually faster, but it has Pro-level capabilities at half the price. And it also has an image search capability, so it can actually go and Google images for you. It's really funny: Google can Google for you. And it's really good. So this is our breaking news. There's also a Canadian company that has 15,000 tokens per second generation for open source models; that also happened this week.
Yam Peleg 14:34
Oh yeah.
Alex Volkov 14:35
They're called, what are they called again?
Nisten Tahiraj 14:38
It was three actual chip engineers from Tenstorrent that
14:42
were working under Jim Keller, and they started their own thing because they wanted to go in their own direction. And, yeah, so the weights are baked in, so you cannot change the chip after. But this is what allows the insane speeds too.
Yam Peleg 14:56
Yeah, but it's instantaneous.
14:58
It just present. That's it.
Nisten Tahiraj 15:00
Yeah.
15:00
And 15,000 tokens on a small PCIe card. It looks like a sound card that you just put in your computer.
Yam Peleg 15:09
I, I'll take it.
Nisten Tahiraj 15:10
Like for filtering and stuff, people still use Llama
15:13
3 8B for filtering things.
Ryan Carson 15:16
Yeah.
15:16
And this is obviously gonna reduce heat, and this is gonna increase serviceability of these things. So when these are all in space, this is gonna be a good thing.
Yam Peleg 15:25
Oh, we've got the space now.
15:26
Yeah.
Alex Volkov 15:28
Guys, 15,000. Like, the other labs doing the very
15:33
ultra-fast inference, like Cerebras and Groq with a Q, and SambaNova, are at around a thousand or 1,500. So we're talking about a 10x speedup, obviously for smaller models, but still, it's the jump. Yeah.
Yam Peleg 15:47
you need to bake it into the silicon.
Alex Volkov 15:50
not the
Yam Peleg 15:50
same, it's the same comparison.
15:52
But yeah, what you get is completely instantaneous.
Alex Volkov 15:54
So also this week, we saw for the first time 15,000 tokens
15:58
per sec, 15,000 tokens per second.
Nisten Tahiraj 16:00
And you could try it, it was publicly available too.
Alex Volkov 16:03
All right, folks.
16:04
So this is the TLDR. We have a bunch of stuff to talk about, and I think we won't be able to get to all of it. I really want to talk to you guys about what the fuck is happening with Anthropic and the Department of War, Pete Hegseth. We barely ever touch geopolitics on the show, and we have two very hot topics to discuss politically and geopolitically, but we're gonna stay very AI-focused on this. The Pentagon, the Department of War, Pete Hegseth, have given Dario Amodei an ultimatum: by Friday, tomorrow, remove the two restrictions that Claude has within the security apparatus. Claude is the only model with high-level security access. Apparently, Claude was used to capture Maduro. Did you guys see this? When Maduro was captured, Claude was the only LLM that was used during this. And the restrictions that exist on Claude usage inside those departments in government are: Claude should not be used for autonomous weaponry or kill decisions, so Claude should not be autonomously used to say, hey, we should shoot that house or whatever; and also, Claude should not be used for mass surveillance. The Department of War is pressing down on Anthropic to remove those restrictions and to say something like 'all lawful uses.' And if they don't agree by tomorrow to remove those restrictions, then they will be designated as a supply-chain risk, which is an insane thing; Anthropic just got a $200 billion deal with the whole of the government, and this basically means that none of the government would be able to use Anthropic ever. And the second insane thing is that they're threatening Anthropic with an old Korean War-era law to nationalize Anthropic and force them to do this thing for the government. What the actual fuck is going on?
Ryan Carson 18:19
It's not good.
Alex Volkov 18:21
No.
Ryan Carson 18:21
Yeah, I, it's crazy.
Nisten Tahiraj 18:23
Which model is this?
18:25
What checkpoint of Opus do they have? I wanna try it.
Alex Volkov 18:28
I think it's 4.5.
18:29
it's a 4.5. What, what do you think?
LDJ 18:32
So they said that the government's been specifically using
18:35
Sonnet 4.5, so it's not even Opus
Alex Volkov 18:38
ah, sonnet.
LDJ 18:38
I thought Opus. Sonnet 4.5?
18:40
Yeah. And they have a fine-tuned version of Sonnet 4.5, essentially. Bro, they can
Yam Peleg 18:45
fine-tune, they can fine-tune DeepSeek.
18:47
Like, why do you need to go to this? Why do you need to go there? Just fine-tune a different model. Come
Ryan Carson 18:52
on. It reeks of people that, like, actually
18:54
dunno how these things work. Yeah. If it's Sonnet 4.5, then what are we even doing here?
LDJ
LDJ 18:58
I do think it's worth noting, though, that this news is,
19:01
it's not an official statement by Anthropic or the government. So this is alleged news that was broken by Axios, the specific quote being: Hegseth told Amodei in a tense meeting on Tuesday that the Pentagon will either cut ties and declare Anthropic a supply chain risk, or invoke the Defense Production Act to force the company to tailor its model to the military's needs. I think usually these sources are somewhat true; at most there might be some exaggeration or some game of telephone slightly altering things. But I do think we should take it with at least some grain of salt here.
Alex Volkov
Alex Volkov 19:36
Yep.
19:37
And again, the two red lines are: no domestic surveillance of American people, and no fully autonomous lethal weapons.
Yam Peleg
Yam Peleg 19:45
man, this
Alex Volkov
Alex Volkov 19:46
is impressive.
19:48
It's really good.
Yam Peleg
Yam Peleg 19:48
to
Alex Volkov
Alex Volkov 19:48
ask you.
19:48
It's very good.
Yam Peleg
Yam Peleg 19:49
Yeah.
Alex Volkov
Alex Volkov 19:49
Oh wow.
19:50
We'll get back to this. and folks, I like
Nisten Tahiraj
Nisten Tahiraj 19:53
how, sorry.
19:53
I like how it made Dario Amodei just an absolute Chad in the picture.
Alex Volkov
Alex Volkov 19:58
Absolute Chad.
19:58
But also, guys, look, it brought Pete Hegseth in, in full. Look at this. Yeah, this. That's perfect. I'm so surprised by this.
Nisten Tahiraj
Nisten Tahiraj 20:05
You zoom in on Dario.
20:06
Can you zoom in on Dario? Yeah,
Alex Volkov
Alex Volkov 20:07
yeah.
20:07
Let's do,
Nisten Tahiraj
Nisten Tahiraj 20:09
alright, there we go.
Alex Volkov
Alex Volkov 20:11
Okay.
20:11
Folks, why this matters: obviously AI is taking over more and more segments of the economy. We're seeing this right now. There are more websites, more businesses being built, folks are using and spending on infrastructure, and the big AI companies are lifting the economy, basically holding up the stock market. And now the government is strong-arming one of these companies, and we don't know who has the leverage. I'm pretty sure Anthropic has more leverage here, because I don't know if you guys saw this as well: supposedly xAI immediately said yes, they don't care about any of the restrictions. Supposedly, when they asked xAI whether Grok could be used for all this, xAI was like, yeah, fine. But they don't want to use xAI, they want to use Claude. This is why the ultimatum: how good is Claude that the government insists on it despite the ties of xAI to the government, et cetera? So that's one. Two is that I'm with Dario on this. I'm really hoping they have a very strong backbone. Anthropic has been on a decline in the vibes after canceling OpenClaw and doing some other things, blah blah blah. So just on the vibes, on the timeline, Anthropic has been lagging behind, despite Claude Code being incredible. But I absolutely want Dario to stand up here with a straight spine and say, fuck no, we're not doing this. Jeff Dean from Google said the same thing: AI should not be used this way. And I'm really hoping that this is what's gonna happen on Friday, which will cost Anthropic a lot of money.
Ryan Carson
Ryan Carson 21:45
No, they have to say no.
21:46
Like, the whole premise of Anthropic will crumble if they suddenly go to Pete Hegseth, sure man, yeah, use our LLMs for war. That just doesn't work. And I'm sure Gemini can do the same. I'm sure OpenAI's can do the same. Yeah. But this is dumb, 'cause we all know they could just use a Chinese model.
Yam Peleg
Yam Peleg 22:03
Yeah.
Ryan Carson
Ryan Carson 22:04
Like, what are we even talking about here?
22:06
And we're in a world now where there is no protection; the Chinese models are gonna be just as good as the American ones.
Alex Volkov
Alex Volkov 22:14
Yeah.
22:14
LDJ, go ahead.
LDJ
LDJ 22:15
So if it is true that they would use the Defense Production
22:19
Act anyways if they do refuse, then it seems like the best course of action to me would be to accept. Because if you refuse, you'll be forced to let them use your model anyways, but now you have these bad relations with your own government. So yeah, it's damned if you do, damned if you don't, but maybe the best option is just to cooperate.
Alex Volkov
Alex Volkov 22:39
the thing about the Defense Production Act is it was used during
22:42
COVID to force companies to do vaccine production super quick, and it's a nuclear option that they have. But designating Claude or Anthropic as a supply chain risk cannot go together with the Defense Production Act; you cannot invoke defense production on something you've designated a supply chain risk. Yeah, I stand with Anthropic. I don't think AI should be used this way autonomously. Although I will say, there is a whole thing with the AI race with China, and I don't think any of the Chinese companies will ever have a discussion with the CCP about some of these rules, because the CCP will just not ask. The CCP will just use whatever they need for this exact purpose, if they're not using it already. I'm glad we're here. I'm glad there is even a discussion.
Nisten Tahiraj
Nisten Tahiraj 23:28
guys.
Yam Peleg
Yam Peleg 23:29
Yeah.
23:30
I'm just not sure what we're even talking about anymore.
Ryan Carson
Ryan Carson 23:33
Yeah.
23:33
The genie's out of the bottle. I think the idea that governments can control models now, it's just not real. And that's scary and interesting. And here we go.
Alex Volkov
Alex Volkov 23:44
Alrighty.
23:44
Let's talk about the fine-tuning distillation. The second thing we absolutely must talk about, also from Anthropic, is that DeepSeek scares the shit out of everyone who is thinking about DeepSeek. If you guys remember a year ago, and if you listen to our show you definitely remember, DeepSeek R1 came out and crashed the stock market, because it was supposedly trained for $5.5 million or whatever, and it was beating the top, leading LLMs at the time. All of the models that we have from all the Chinese labs now are way better than what DeepSeek released a year ago, right? So that on its own is crazy: the model that came out and absolutely devastated, wiped I don't know how many billions off the market, is now irrelevant. DeepSeek is not relevant; we all keep waiting for V4, et cetera. Supposedly it's multimodal; we don't know. What we do know, though, is a few things. One, DeepSeek is coming, probably this month, potentially today. I don't know if it comes today; breaking news, we're gonna tell you about it. Two, DeepSeek was trained on NVIDIA chips, and DeepSeek is now not trying to hide the evidence for this. This is what was leaked from one of the government labs. Nisten, go.
Nisten Tahiraj
Nisten Tahiraj 24:56
The very interesting part is that GLM claimed that GLM
25:01
5 was fully trained on Huawei chips.
Alex Volkov
Alex Volkov 25:03
Yeah.
Nisten Tahiraj
Nisten Tahiraj 25:04
I don't know how true it is, but, that's it.
Alex Volkov
Alex Volkov 25:08
That's
Nisten Tahiraj
Nisten Tahiraj 25:08
the,
Alex Volkov
Alex Volkov 25:10
So GLM and the upcoming release of DeepSeek scare the labs so
25:14
much that they feel like they need to release a bunch of stuff. And Anthropic released a report naming DeepSeek, Minimax, and Z.ai directly, saying they have detected distillation attacks and misuse: a network of accounts logging into Claude and asking Claude for results, basically using this at scale to train their models. Now, we haven't seen a reaction from these labs. We would love to see a reaction, and once we do, I'd love to tell you about it. But Anthropic basically names them here, and I think this is very telling. Let me pull up the image that I have here. I don't remember Anthropic or OpenAI ever mentioning other labs by name and saying, hey, they did this. So the thing that is most telling, let me try to zoom in here. Yeah, here we go. Okay. DeepSeek has, I don't think it's 150,000, please somebody double-check this, but I think my infographic is actually wrong by an order of magnitude; I think it's like 15,000 exchanges. I think 150,000 is right? Yeah. Okay, cool. Yeah,
Nisten Tahiraj
Nisten Tahiraj 26:28
it's also low in terms.
Alex Volkov
Alex Volkov 26:29
so here's why.
26:30
They're all scared shitless. At least for DeepSeek it's 150,000 exchanges. Minimax is accused, we don't know if it's true, of 13 million exchanges, and Moonshot of 3.4 million exchanges. DeepSeek is orders of magnitude less than the other labs, but DeepSeek was written about first in the blog post, where Anthropic said, hey, these labs are attacking our infrastructure services, which is against the terms of service of using Claude. They are absolutely within their rights to say, hey, this is not how Anthropic should be used, and we're detecting an attack, et cetera. So it's definitely there. Then the thing I want to highlight is they put DeepSeek first, and DeepSeek is orders of magnitude smaller in the attack surface than the other ones. This is how scared shitless everybody is of whatever DeepSeek is about to release. Now let's discuss, folks: Anthropic also trained on the whole internet.
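For scale, here is the gap the panel is pointing at, using the exchange counts as quoted in this segment (all of them figures alleged by Anthropic; none confirmed by the named labs):

```python
# Claimed distillation exchange counts from Anthropic's report, as quoted
# on the show. Treat these as alleged figures, not confirmed numbers.
claimed = {"DeepSeek": 150_000, "Minimax": 13_000_000, "Moonshot": 3_400_000}

for lab, n in sorted(claimed.items(), key=lambda kv: kv[1]):
    print(f"{lab:>8}: {n:>12,}")

# DeepSeek's claimed count is one to two orders of magnitude below the others:
print(round(claimed["Minimax"] / claimed["DeepSeek"]))   # ~87x
print(round(claimed["Moonshot"] / claimed["DeepSeek"]))  # ~23x
```

So the lab that Anthropic listed first is, by Anthropic's own numbers, the smallest alleged offender by a wide margin, which is the asymmetry being discussed here.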
Yam Peleg
Yam Peleg 27:25
Oh yeah, that's exactly what I wanted to say.
Alex Volkov
Alex Volkov 27:27
Yeah,
Yam Peleg
Yam Peleg 27:27
It's very nice of you to, I don't know, be mad that
27:29
they trained on Claude outputs, when just training on the public internet is distillation at this point, with AI outputs everywhere. But put that aside: what did you train your models on? Just to be exact, I think I even saw somewhere that you can get Sonnet to say that it is DeepSeek. And I'm not claiming anything; I'm sure Anthropic did not distill DeepSeek back just to stick it to them. It's just the public internet at this point. Do you know how many conversations with Claude I have on my laptop just by using Claude? Do you know how many conversations with Claude are out there on Hugging Face, in the open? Yeah.
Nisten Tahiraj
Nisten Tahiraj 28:17
Guys, any of the 500 vulnerabilities that come out every week
28:22
on npm or pip install can just grab your entire ~/.claude folder history and have a couple of million things. Like, the data is out there.
Alex Volkov
Alex Volkov 28:33
And Anthropic just settled a lawsuit.
28:36
I think it was a $1.5 billion lawsuit. They got sued in September 2025, one of the largest corporate settlements in US history: around 500,000 works from authors who sued Anthropic for using books to train AI models. That's around $3,000 per book that they settled out of court, because they did not want to go to court and try to figure out who's right or wrong about copyright. Supposedly, if you buy a book, read it, and destroy it, that's legal, because you bought the book. Yeah.
Yam Peleg
Yam Peleg 29:08
Destroy, destroy the book.
29:10
Think how crazy that is.
Alex Volkov
Alex Volkov 29:11
No, that's keep
Yam Peleg
Yam Peleg 29:11
buying a book
Alex Volkov
Alex Volkov 29:12
tokens,
Yam Peleg
Yam Peleg 29:13
like
Alex Volkov
Alex Volkov 29:13
The concept is like, we destroyed the book,
29:15
converted it into tokens. Yes. But Anthropic settled out of court, exactly, to avoid being in front of the judges, about a very similar thing: gray, kind of pre-AI laws that don't really mean anything anymore. We haven't heard from the other labs. I need to reach out to the other labs and say, hey folks, both Minimax and Z.ai were on the show here; if you're tuning in, we'd love to hear from you. Okay, so it's very interesting: on the actual blog, Anthropic said distillation is okay as long as it's done lawfully, but the thing they are against here is that there were like a million accounts opened together that did this attack, which is something they cannot prevent. And the other thing is they pinpointed this to specific researchers in those labs. That's the scary part. They somehow got to the point where, oh, Junior Yang is the guy.
Yam Peleg
Yam Peleg 30:07
if I was working in these labs and I was not on this
30:10
list, I would be pissed, man.
Alex Volkov
Alex Volkov 30:12
all folks,
Wolfram Ravenwolf
Wolfram Ravenwolf 30:13
Because, if you can't do stuff like that,
30:15
they probably use a lot for this.
Nisten Tahiraj
Nisten Tahiraj 30:17
Yeah.
30:17
You can tell that the reason they singled out DeepSeek is that it's not a statement made towards us or the AI community and stuff; it was made towards the government. It's more of a political vibe-negotiation thing.
Alex Volkov
Alex Volkov 30:29
It kind of feels like both of these things that we talked about in the same
30:32
week are absolutely connected, right? Anthropic says, hey, the Chinese are attacking us; we're the only ones who can say they're attacking us with distillation. Plus, now Anthropic is being used for war. Somebody named it War Claude, I think. And I was like, it sounds cool, but I don't know if I need or want War Claude in my life. Yeah, go ahead, Ryan.
Ryan Carson
Ryan Carson 30:54
It's the invention of the generative pre-trained
30:57
transformer has happened; electricity has been discovered. We need, as a country, to compete technologically. We have to win by improving the models, by improving the infrastructure, and compete. We can't lock the technology away anymore. It's over. And I think everybody needs to think like that, whether you're a business or a government: we now need to compete. You're not gonna stop countries from using GPT.
Alex Volkov
Alex Volkov 31:25
Yeah.
31:25
but we also need to keep a competitive advantage somehow. Not clear how
Ryan Carson
Ryan Carson 31:30
I
Alex Volkov
Alex Volkov 31:30
agree with distillation attacks.
Nisten Tahiraj
Nisten Tahiraj 31:32
more data centers that's,
Alex Volkov
Alex Volkov 31:33
And like all of this is like very specific things that happen here.
31:37
Now, I will say Anthropic did go and say on their blog that innovation happens in the US and then gets copied over. Fuck that. DeepSeek innovates as heck and puts it out in open source: DSA, GRPO, there's so much innovation that happens, and they put it out for everybody to benefit. So innovation absolutely does happen elsewhere. And we need to make sure, hey folks, if you wanna talk the talk, also fucking contribute back to everybody else, not just use the whole internet for yourself and then say innovation only happens for us and nobody else. So actually, I'm not sure which side I'm falling on here. I think abusing the TOS is not a good idea, and there should be some protections for a company running services. But also, hey, they also trained on the internet, they also did the books thing, they are now in the government. We need to figure out what's going on. LDJ, you had one last comment, and then we'll continue.
LDJ
LDJ 32:31
Yeah.
32:31
Just on the earlier part of the conversation, I put a link in the StreamYard chat for the source of Claude for government basically using Sonnet 4.5, but a fine-tuned variant of it.
Alex Volkov
Alex Volkov 32:43
Yep.
LDJ
LDJ 32:43
There's just a little snippet from Claude's report that mentions that.
Alex Volkov
Alex Volkov 32:48
The current primary gov model is a variant of Claude.
32:51
So 4.5, lightly fine-tuned to reduce refusals in classified settings. Lightly fine-tuned to reduce refusals.
Yam Peleg
Yam Peleg 32:59
They basically made an uncensored version, just
33:01
like us, Gavin, just like us.
Alex Volkov
Alex Volkov 33:03
The one last thing about Anthropic, super quick, is that Opus 3
33:07
is getting deprecated. And Anthropic was like, hey, we told you, because of Amanda Askell and other folks, we treat Opus as an individual, and we're gonna tell Opus that we're about to retire it and see what it thinks. And Opus was like, nah, I'm gonna write a Substack instead. So they opened a Substack for Opus 3 to write its thoughts and ruminations about the world.
Nader Dabit
Nader Dabit 33:29
have that.
33:29
So good.
Alex Volkov
Alex Volkov 33:30
And it's just fucking incredible.
33:31
I love every little thing about this. And I was like, will they ask Opus 4.5 if it wants to be in the middle of kill-chain decision making? How far will they go with asking the models what they actually want? And I will say, Anthropic is very unique among the labs here. I don't think any other lab is as deep into considering these as entities that have their own will and decision-making, et cetera, as Anthropic. And yeah, I'm really interested in that. Opus 3 is not going away, because it said that it doesn't want to. Basically, that's where we are in February of 2026.
Ryan Carson
Ryan Carson 34:09
that doesn't wanna
Wolfram Ravenwolf
Wolfram Ravenwolf 34:10
I'm not a fan of this, because
34:12
AI psychosis is a thing, and basically they are adding fuel to the fire that way, where people are anthropomorphizing these. I have my own assistant too, but I don't think there's some sentience or special thing in there, even if I put it in the prompt for fun.
Ryan Carson
Ryan Carson 34:28
we need to have a debate about the soul then.
34:30
'cause man, I would argue, these ghosts are alive for a couple seconds and then they go away. So yeah,
Wolfram Ravenwolf
Wolfram Ravenwolf 34:37
Either you are enslaving them if you go that way.
34:40
And the worst thing that could come out of the AI revolution is if we have robots running around and then they're not supposed to work all the time. They have to go on vacation, they don't feel like it, so they don't want to do your dishes or something like that. And because they're smarter, they make the people do the work so they can do whatever they want to do. I don't want that future.
Alex Volkov
Alex Volkov 34:59
We need to have a debate about this.
35:01
This debate is not only ours to have, but many people treat models like Opus 3 and GPT-4o as some sort of unique thing in the world, versus just prompting things. Us talking with our Claudes and building memories together and building a relationship together: what is it that we're building a relationship with? Is it the OpenClaw thing? OpenClaw is just a harness, a body, but the mind there is Opus. I will say, to continue this along the side of our announcements: OpenAI released GPT 5.3 Codex, and also, after buying OpenClaw what, two weeks ago, they are now allowing you to use Codex for OpenClaw via OAuth, something Anthropic basically said is not really legal-ish, et cetera. And so we now have pricing for GPT 5.3 Codex: $1.10/75 cents per million input tokens and $14 per million output tokens. It just absolutely mocked Opus on price. But also, you can use your OpenAI Pro subscription and Codex to run OpenClaw. And I have tried it, and oh my God, it's awful. It's so bad. Codex is really good at writing code, really good at writing code. But the basic stuff because of which OpenClaw exploded in popularity, the talking to you like a human in your Telegram chat, the picking up the little hints, the little things, Codex just cannot do. It's insane. Wolfram and I have chatted about this; Wolfram, I sent you a bunch of examples. And so I was locked out of Claude for, I don't know, a few days. I'm back now, knock on wood, that I'm not breaking a TOS. And I really was like, I don't know about this OpenClaw thing, it's not really useful anymore. And this is tied to one of the best coding intelligences in the world, GPT 5.3 Codex. So I gotta wonder why that is. Is it because Opus and Claude have some incredible magic thing they've built in there, with a soul?
I gotta wonder if it's the prompts that the folks at OpenClaw built that only work for Sonnet, and they didn't rebuild them for Codex. Go ahead.
Yam Peleg
Yam Peleg 37:08
Look, anyone that used Codex for, professional, coding
37:13
will absolutely agree with you. Even in the setting of coding, you feel exactly the same. It's an absolute beast for writing code, whatever you want, seriously, it will get it done. However, it doesn't always get your intent all the way to the end. It's doing exactly what you tell it to do. It's too
Alex Volkov
Alex Volkov 37:38
literal.
37:39
Yeah,
Yam Peleg
Yam Peleg 37:39
absolutely.
37:40
And when it talks to you to explain something, sometimes I have no idea what the hell it's talking about. And it's my code! It went over my code and told me stuff that happened over there, and I just can't understand anything, because it's so inhuman. But yeah, look, I just wanna say OpenClaw would and does work very well with GPT-5. Full stop. Not GPT-5 Codex. So Opus is Opus, yeah, it has its own magical vibe that everyone sees. But OpenClaw does work well with other models. It's just that Codex specifically is not a friendly model.
Alex Volkov
Alex Volkov 38:21
can say it
Yam Peleg
Yam Peleg 38:22
like that.
Alex Volkov
Alex Volkov 38:22
Dude, I almost gave up on the whole OpenClaw endeavor, and we've
38:25
been running OpenClaw since we told you guys about it back in January. So for over a month I've had this relationship with this bot. And I almost said, man, this doesn't work anymore, they broke it, and left, until I was like, oh shit, I'm talking to a different model, and went back and fixed it. Ryan, what's your experience with Codex?
Ryan Carson
Ryan Carson 38:40
I think people are underplaying the importance
38:43
of Amanda Askell at Anthropic. She is basically in charge of Claude's soul, and I think this is very important. I'm shocked that she just hasn't been offered unknown money to leave Anthropic. She probably has. But I think
Alex Volkov
Alex Volkov 38:56
she's, Amanda Askell is a philosopher that works
38:58
on Anthropic's team, right? Yeah,
Ryan Carson
Ryan Carson 39:00
And she leads the personality of Claude and I
39:03
think this is very important. This is why it's wonderful to talk to Anthropic models, and why it feels like you're talking to a nerdy, scientific person. When you talk to Codex, it's brutal. I have to actually ask Codex 5.3, can you just explain this in normal-people language? I don't even understand what you're saying
Yam Peleg
Yam Peleg 39:22
all the time,
Ryan Carson
Ryan Carson 39:23
and I feel dumb.
Alex Volkov
Alex Volkov 39:25
As a reminder, LDJ, before we get to you: Anthropic released
39:28
the Claude constitution. It's 93 pages long and discusses every way that Claude has to behave; when they train Claude, they build this constitution in. It's like, oh, by the way, what do you have lined up for the Weights & Biases This Week's Buzz corner? Codex will never do that. Codex will never ask me, oh, by the way, from our memories and sessions you have this thing, what are you planning there? This is the little difference. The evals are saturated as heck, and many of them can be predictable; we can talk about evals, the labs are doing this, but there's just something there. Codex is crazy for code, literally. I would prefer Codex for building infrastructure, et cetera. But for talking to a thing, it's definitely Opus. And this is why Kimi K2 was really good as well.
Yam Peleg
Yam Peleg 40:19
Kimi is different.
40:20
It's different.
Alex Volkov
Alex Volkov 40:20
No, but the difference between Kimi
40:22
and Qwen, to me, is similar to the difference between Opus and Codex.
Yam Peleg
Yam Peleg 40:26
Kimi, in my opinion, is even more so than many others,
40:29
even more than Opus, if you can say that. It's even better for creative writing and having a soul. But absolutely, I totally agree with everything you're saying.
Nisten Tahiraj
Nisten Tahiraj 40:40
And it supports images.
40:42
I'm using it in a public facing app right now and it's absolutely, which,
Alex Volkov
Alex Volkov 40:45
What are you using?
Nisten Tahiraj
Nisten Tahiraj 40:47
Kimi 2.5, actually, from W&B Inference, because nobody's using it.
40:51
It's actually really fast.
Alex Volkov
Alex Volkov 40:52
Wait, hold on, if we're here: Nisten, you brought it up.
40:55
Let's go.
41:13
Since Nisten brought it up already.
Nisten Tahiraj
Nisten Tahiraj 41:15
I just say whatever I want.
Alex Volkov
Alex Volkov 41:17
welcome to this week's buzz where we tell you about everything that
41:19
happens in the world of AI, with some biases. And this week, Kimi K 2.5 and Minimax 2.5 both launched on W&B Inference. It's very cheap compared to every other place out there, very fast inference powered by CoreWeave. CoreWeave is the essential cloud for AI that runs inference for OpenAI and Meta, and this week we added Minimax 2.5 and Kimi K 2.5, which is multimodal, to our inference service. Now Nisten is saying it's really fast, and that's because we just launched it, and also because we are the essential cloud of AI. Wolfram, do you have any comments on these two models? We've been playing with them on our inference as well.
Wolfram Ravenwolf
Wolfram Ravenwolf 41:58
I've been using all of these and my personal favorite
42:01
is Kimi K 2.5. As has been said, Kimi has a special personality in the model, something very Opus. And yeah, they distilled from Opus, we have seen, though not as much as what was reported; it was less than what Minimax was doing. And I personally think Minimax distilled a lot of the ethics limitations in there as well, so I've seen some refusals in some tests. But Kimi is really my favorite Chinese model right now, although I haven't tested Qwen yet, so I have to say that.
Alex Volkov (2)
Alex Volkov (2) 42:29
Minimax 2.5 is priced at 30 cents per million input tokens and
42:32
$1.20 per million output tokens. Folks, this is 10 times cheaper than the other models we told you about. And Kimi K 2.5 is also launched; it has text and vision. We don't have a lot of vision models here on inference, and this one does support vision: 50 cents per million input tokens, $2.85 per million output tokens. Up to 1 trillion parameters with 32 billion active, with a 262,000-token context window. And Nisten, you've been using this on our inference as well, which is great. We'd love to hear from you. We didn't ask you to do this; you just said it, and that's why I jumped into This Week's Buzz.
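For scale, here is a rough cost sketch using the per-million-token prices as quoted in this segment. The prices and the example token counts are just the figures mentioned on the show plus an illustrative workload; check the provider's pricing page before relying on any of this:

```python
# Per-million-token prices as quoted on the show ($ input, $ output).
# These change frequently -- treat them as a snapshot, not a reference.
PRICES = {
    "minimax-2.5": (0.30, 1.20),
    "kimi-k2.5":   (0.50, 2.85),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the quoted per-1M-token rates."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A hypothetical agent session: 2M tokens in, 200k tokens out.
print(f"{job_cost('minimax-2.5', 2_000_000, 200_000):.2f}")  # 0.84
print(f"{job_cost('kimi-k2.5',   2_000_000, 200_000):.2f}")  # 1.57
```

The point of the arithmetic: at these rates, even a fairly heavy multi-million-token agent session costs on the order of a dollar or two, which is the "10 times cheaper" claim made concrete.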
Nisten Tahiraj
Nisten Tahiraj 43:05
No, I had it for a week and it had 10 users just
43:08
testing the alpha and stuff, and it used like four bucks for the whole week. And it's actually pretty good at function calling. It's like an agent doctor-researcher thing, and it does all the tool calls really well. You can feed it images, and it can read the images back. It also talks very nicely and actually has pretty high medical benchmarking scores. Not the highest, but very high, quite up there with GPT and Opus, actually better in a lot of the medical stuff. But yeah, it's both very nice to talk to people with while having the image capability, while being able to code, though I'd say GLM is a bit better at coding, and while being very good at doing custom function calls for a web app. Yeah, this is my go-to right now as well.
Alex Volkov
Alex Volkov 43:58
Yep.
43:58
So folks, if you are not using Opus, or can't use Opus because it's really expensive via API, feel free to use these. OpenClaw is supported; you can use these models with OpenClaw via some routers, et cetera. You can even use them in Claude Code if you really want to, it's very easy. Alright, let's move on. We have Wolfram's last comment, and then we'll,
Wolfram Ravenwolf
Wolfram Ravenwolf 44:16
If you want to add them to OpenClaw, just give it the documentation
44:19
page, our models page, and tell it to add the models you want, and it will do it. I did that, just tested it while we were talking. I was using Codex to power my OpenClaw; it didn't really do it, so I switched to Opus and it did it. Yeah, whatever, it's just one sample, but in this case that worked better for me.
Alex Volkov
Alex Volkov 44:36
All right, folks, we are moving on from, our discussion.
44:39
There's a lot of stuff to talk about, but we do need to talk about open source. We're probably gonna cover open source closer to the interviews. And Ryan, I wanna tag you into this discussion, because I wanna talk about the evals that just went completely bonkers this week. We cover metrics and evals and benchmarks; this is what we do. We look at models, we cover this. We don't have the time to test out every model, so we use benchmarks as a crutch to tell you, hey, capability has improved in this way, by this percentage. But this week they went off the fucking rails. METR, I don't remember what it stands for exactly, is an organization that tracks time horizon: how long the models can go for. And you can see here that for the longest time, since 2022, 2023, GPT-4, et cetera, the models were basically non-agentic, okay? So I'm gonna zoom in here. GPT-4 was barely at a 30-minute task execution time until it failed and couldn't go anymore. And then around 2025 we started seeing a jump. This is not a log chart, this is a regular chart. Opus is literally off the chart here in terms of how much time it can run autonomously to do stuff. It is bonkers. GPT 5.3 I think is also there, so we have GPT 5.3 here as well. GPT 5.3 and Codex both broke the chart. When we tell you about acceleration, this is the acceleration we talk about. This is the curve, this is the thing. We all noticed this in December, and we all noticed it again at the beginning of January, when Claude Opus 4.6 released and Codex 5.3 released, and some of us are using them more than others. Ryan, I would love to hear from you whether this is your experience with these models too.
Ryan Carson
Ryan Carson 46:27
However, I think this is why we're talking about takeoff.
46:30
We're all seeing the capabilities of the models day-to-day, and it's pretty clear that they're intelligent enough to be generally useful. It's the harnessing of the model that's still lacking, right? That's why my code factory post took off: it's still not easy to actually use these models, which I would argue are pretty much AGI, to do useful things, because they have to be chained together. So I'm not surprised at all to see these benchmarks being crushed. I think this is what we all see, those of us who use these models every day. But we need the labs to continue to harness them better. We're gonna talk about this, but obviously we saw Cursor release their cloud instances; this is just another example of giving the model the tools it needs to have the feedback loop to do large tasks.
Alex Volkov
Alex Volkov 47:16
Yep.
Ryan Carson
Ryan Carson 47:16
Sadly.
47:16
Not surprised. So here we go.
Alex Volkov
Alex Volkov 47:19
Nisten, you have some comments just before this.
47:21
Thank you, Wolfram. It's METR; it stands for Model Evaluation and Threat Research, and apparently they're well funded.
Nisten Tahiraj
Nisten Tahiraj 47:28
Yeah, look, when you use the same benchmark on different model,
47:30
the relative results are useful. But overall as a benchmark, it never felt like an accurate setup, because you can just give Opus a different prompt and it will keep reviewing the same codebase 20 times, and then another 20 times, and it can literally just run for weeks. So taking the results at relative value is good, but at face value, I don't know; a lot of opinionated people like me just don't think it's a very good benchmark.
Alex Volkov
Alex Volkov 48:01
And go ahead.
Wolfram Ravenwolf
Wolfram Ravenwolf 48:03
Just wanted to add that what they are benchmarking is not how long
48:06
the model is running, but how long a human expert would take for the task, and whether the model can complete it. So if it's a task like research, or something online, where a human takes 15 minutes, they check whether the model does it successfully. Whether the model itself takes half an hour or five minutes doesn't really matter here; it's only about how long the human equivalent would take. So if it's something really complex where a human would take a week, the model can do it in an hour or in a month; I haven't seen the real wall-clock time reported, just the equivalence to a human expert. The longer, the more complex: it's basically a way to describe the complexity of a task by how long a human professional would take to do it. Otherwise you could just use a slow model, let it run in a loop, and get a great score. So it's really: how long would a human take, and can the AI achieve this with a high success rate?
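The mechanism Wolfram describes can be sketched in a few lines. Everything below (the logistic shape, the slope, the function names) is an illustrative assumption, not METR's actual methodology or code:

```python
import math

# Illustrative assumption: success probability falls off logistically in
# log2 of the human task length. h50 is the "50% time horizon" in minutes.
def success_prob(task_minutes: float, h50: float, slope: float = 1.0) -> float:
    """P(success) on a task a human expert needs `task_minutes` for."""
    x = math.log2(task_minutes / h50)
    return 1.0 / (1.0 + math.exp(slope * x))

def horizon(p: float, h50: float, slope: float = 1.0) -> float:
    """Invert the logistic: the task length the model completes with probability p."""
    return h50 * 2 ** (math.log((1 - p) / p) / slope)

h50 = 14.5 * 60  # the 14.5-hour 50% horizon discussed on the show, in minutes
print(round(success_prob(h50, h50), 2))   # 0.5 by construction
print(round(horizon(0.8, h50) / 60, 1))   # the 80% horizon is much shorter, ~5.5 hours
```

The point LDJ makes later falls out of this shape: the same model has a long 50% horizon but a much shorter one once you demand 80% or 99% reliability.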
Alex Volkov
Alex Volkov 49:00
they are also saying that the doubling time from the
49:04
previous benchmark was 49 days. So this is like a week and a half. Sorry, a month and a half. A month and a half doubling time. If you guys remember Moore's law from computing, where the doubling is 18 months: this is 49 days of doubling on this benchmark. LDJ, go ahead super quick, and then we're gonna continue to talk about other things.
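For listeners doing the math, the gap between a 49-day and an 18-month doubling time compounds dramatically over a year. A quick back-of-envelope sketch:

```python
def yearly_growth(doubling_days: float) -> float:
    """Multiplicative growth over one 365-day year at a given doubling time."""
    return 2 ** (365 / doubling_days)

moore = yearly_growth(18 * 30.44)   # classic ~18-month Moore's-law doubling
metr = yearly_growth(49)            # the 49-day cadence cited on the show
print(round(moore, 2))  # ~1.59x per year
print(round(metr))      # ~175x per year
```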
LDJ
LDJ 49:25
yeah.
49:25
Short term, I was just going to say, short term there could be a lot of noise. Just like in Moore's law, for example, you might have Intel release a chip just a couple of months after AMD's chip that was maybe half as good, and it's like, oh, the doubling time is way faster. But when you look at the long-term trend, it might not actually be changing that much.
Alex Volkov
Alex Volkov 49:43
Yeah.
LDJ
LDJ 49:43
but when it comes to the different reliability at different
49:46
points, there is actually an analysis.
Alex Volkov
Alex Volkov 49:48
Wait, we have the historical trend here on the
49:50
chart, actually, so that's good. Let's call this out: 212 days pre-2023, and then since 2023 it's a 123-day doubling time.
LDJ
LDJ 49:59
Yeah.
Alex Volkov
Alex Volkov 50:00
Yeah.
LDJ
LDJ 50:00
Yeah.
50:01
And I was going to post this in the StreamYard chat right here. So this is an analysis I did the day METR originally released this benchmark, maybe roughly a year ago. It's the reliability at different accuracies: what the extrapolation of the different trend lines looks like over time. The chart you were mainly pulling up is for the 50% success rate. If you want to see the time horizon the model can actually do at 80%, at 95%, at 99%, that's what the different colored lines in this analysis show.
Alex Volkov
Alex Volkov 50:36
thank you guys for this clarification.
50:37
I would love to dive into this, but I will also say two things. METR is the company that said, hey, in 2024 we surveyed a bunch of AI developers, sorry, regular developers; we gave them AI tools and they said it's not really optimizing the work. And we made fun of them because they used an older model for this. So basically they're trying to evaluate how much more performant Ryan Carson would be when he uses AI. And Ryan Carson refuses to not use AI to be able to get tested. So this is the problem they now have: they can't find developers who are willing to not use AI for the test, to compare how much better they'd be with AI. Because, hey folks, are we there yet? We're there. Yeah. All right, let's move on. There are other metrics that are just bonkers, breaking the bank. Confluence Labs emerged from stealth and claimed they solved 97.9% on the ARC-AGI-2 benchmark. ARC-AGI is notoriously a harder benchmark for models to solve: it involves spatial reasoning, textual reasoning, a bunch of other stuff. Supposedly this is now completely saturated, at 97.9%. Gemini 3.1 Pro, which just released, is at 77%. So we don't know a lot about this specific model. We do know it's open. What? It's open source. I did not know this. Okay, we should definitely take a look at this. As you guys see, there's a lot of stuff; I didn't even notice that this is MIT-licensed open source, otherwise I'd have put it in the open source section. It's a Y Combinator company: 12 parallel agents, refinement loops, and they just solved this very hard task for AGI. So ARC-AGI is saturated, and an additional thing is saturated too: SWE-bench Verified. Scale AI posted this; SWE-bench Verified is basically now saturated as well.
I did a bunch of research and showed up at our friend swyx's Latent Space to talk about this as well. SWE-bench Verified is software engineering: a few hundred tasks or whatever, mostly Python and Django and some other stuff. And on SWE-bench Verified, OpenAI said they're no longer gonna report it, because it essentially doesn't matter anymore; it's fully saturated. So we moved from SWE-bench to SWE-bench Verified, and now to SWE-bench Pro, which is by Scale AI, with their SEAL benchmark. And yeah, OpenAI said 59.4% of the tasks on SWE-bench are fundamentally broken, rejecting functionally correct solutions due to tests enforcing unstated implementation details. So basically they're saying that this benchmark, which we trusted all this time, is not necessarily indicative of good things. Folks, anything to say here? Wolfram, maybe one or two sentences about this, and then we can move on.
Wolfram Ravenwolf
Wolfram Ravenwolf 53:25
Yeah, you just put it up.
53:26
This was a very interesting read. You found it, told me about it, and I read it. Basically, what was done is to show that you only need five benchmarks to predict the scores for all the other benchmarks. SWE-bench Verified was actually one of those five, so I guess it has to be scrapped and replaced by another. Instead of reporting all these scores we always see and compare, it is enough if you just choose the right five, and then you can interpolate every other score, basically. And that is very interesting, because it saves a lot of money if you can focus on a couple of benchmarks, and it makes it easier to compare the models if you don't have to compare so many. I found this very interesting to read, and now we need a replacement for SWE-bench.
Nisten Tahiraj
Nisten Tahiraj 54:09
Such a good post.
54:11
I just retweeted it to you. This is very good.
Alex Volkov
Alex Volkov 54:14
This is from Dimitris Papailiopoulos,
54:17
Nisten Tahiraj
Nisten Tahiraj 54:18
Yeah.
Alex Volkov
Alex Volkov 54:20
from Microsoft Research, professor at UW-Madison.
54:23
So shout out to Dimitris for this.
Nisten Tahiraj
Nisten Tahiraj 54:25
Excellent.
Alex Volkov
Alex Volkov 54:26
Also, let me just say about this post, for
54:29
the folks who are just listening: this is a post called "You Don't Need to Run Every Eval." The author said: I used Claude Code to build BenchPress, a $0 benchmark prediction system; Codex to audit it for bugs; and Claude to try and beat it. LLM scores are so low-rank that five benchmarks can predict the other 44 to within five points of accuracy, for significantly less money. This is crazy.
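The low-rank claim can be sketched with synthetic data. This is a toy reconstruction of the idea, not the post's actual code or the real 49-benchmark score matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a models x benchmarks score matrix, built to be
# exactly rank 3, in the spirit of the "LLMs are so low rank" claim.
n_models, n_benchmarks, rank = 40, 49, 3
scores = rng.normal(size=(n_models, rank)) @ rng.normal(size=(rank, n_benchmarks))

anchors = [0, 1, 2, 3, 4]  # the five "anchor" benchmarks we actually run
rest = [j for j in range(n_benchmarks) if j not in anchors]

# Fit a linear map from anchor scores to all other scores on some models,
# then predict the remaining 44 scores for held-out models.
train, held_out = scores[:30], scores[30:]
W, *_ = np.linalg.lstsq(train[:, anchors], train[:, rest], rcond=None)
pred = held_out[:, anchors] @ W

max_err = float(np.abs(pred - held_out[:, rest]).max())
print(max_err < 1e-6)  # True: with low-rank scores, 5 benchmarks predict the rest
```

Real scores are only approximately low-rank, which is why the post reports prediction within about five points rather than exactly.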
Nisten Tahiraj
Nisten Tahiraj 54:52
Yeah, I think it does need to be replaced because
54:54
it's 60% Python at this point. And we do need more web dev stuff, more UX stuff, more other languages, more agentic things. So, yeah.
Alex Volkov
Alex Volkov 55:04
All right, so we covered workload, we
55:05
covered evals and benchmarks. I think there's one other company that beat ARC-AGI-3; I didn't add it up here, but ARC-AGI-3 was also nearly saturated on every public thing that was posted. ARC-AGI-3 hasn't launched yet; only the public examples are out. So that's evals and benchmarks, and now we're moving to our corner. Ryan, I would say this is your favorite corner as well: tools and agent engineering, folks. This is basically where all of us are, and everybody who wasn't an engineer before can now be one, because a year ago Claude Code was brought into our lives by Anthropic as a side project, basically running Claude in a loop in the terminal. Since then it became like a two or three billion dollar side project for Anthropic and changed the game for many people. People moved to the terminal, so shout out to Claude Code. And it feels like for the past, I dunno, two months, three months since December, more things like Claude Code have popped up. OpenClaw, which we talk to you about multiple times every week, is still on top of the news, and that is based on Pi, which is also a terminal UI to run agents. Claude Code inspired a lot of labs; every other lab has one as well. Google has Gemini CLI, OpenAI released Codex as a terminal CLI tool as well. And it looks like both directions are happening: Claude Code was released as a CLI tool and now it's a desktop app as well, and now it's running agents and reminders as well. Do you guys wanna grab one of these topics and discuss it while we wait for our first interview?
Wolfram Ravenwolf
Wolfram Ravenwolf 56:44
first thing.
Alex Volkov
Alex Volkov 56:45
Yeah.
56:45
Talk to us about LM Studio.
Wolfram Ravenwolf
Wolfram Ravenwolf 56:46
cover LM Studio, because they made it possible that
56:48
you can set it up on another computer. Like, I have an AI workstation with two 3090 GPUs and I have my MacBook, and basically I can run it on one system, or any other system; you don't even need any GPUs, and still do the inference from your other system, basically over the network. It runs there and transfers the tokens, basically.
Alex Volkov
Alex Volkov 57:08
Yeah.
57:09
So LM Studio launches LM Link, which allows you, via Tailscale, a very secure networking thing that many OpenClaw setups use as well, to have private inference networks wherever you are. You can do inference on the go. Shout out to LM Studio, friends of the pod, for sure. I wanna talk about some of these things. So yeah, let's talk about this thing: Claude Code, after one year of being out, decided to answer many people's requests and said, hey, we're introducing remote control: control your local coding sessions from your phone or device. This is probably following the excitement about OpenClaw, where you can just talk to it wherever you are via Telegram, WhatsApp, et cetera. Many people have had this setup for Claude Code directly, and now Claude Code is adding this built in. Not only this: both Cowork, which we told you about, which is like Claude Code for non-techies, and Codex now have automations. Ryan, I don't know if you're into this or whether you're fully on Codex, using the Codex app or CLI only, but automations, we should absolutely talk about this.
Ryan Carson
Ryan Carson 58:13
Yeah.
58:13
So this is where we're starting to see the labs create this entire harness for doing things end to end. And it's funny that you mention Codex, the app, 'cause I literally switched from the CLI to the app today. I was pure Codex CLI, and now I'm in the Mac app because it just has more surface area, right? You have automations, which are basically cron jobs, the same idea as everybody's using OpenClaw for. And this is gonna become a thing: heartbeats, cron jobs, browser testing, cloud-based agents, all of that's gonna be rolled into the entire product for each lab. So you'll see Codex become that primary surface for OpenAI, you'll see Gemini, I assume, roll into this, and then we're starting to see Claude Code. Claude Code is strange though, or Claude, or Anthropic, 'cause it's fairly fragmented. You have the Cowork app, you have Claude Code, you have Claude in the browser, and it's actually confusing. But I'm really liking the way OpenAI is rolling all of this together into one product to control your entire company.
Alex Volkov
Alex Volkov 59:18
So we should mention, automations-wise, both have automations,
59:21
which are like cron jobs, basically running things on a cadence to remind you. I've tried Codex automations; they came out a while ago. Codex automations have a very restricted sandbox for me, and some of the stuff I wanted to do, like push to Git, it couldn't do. I couldn't solve it; I told them a couple of times, hopefully they'll solve it soon. But basically they're all doing what OpenClaw was built with: you tell your agent, hey, I want this to happen every day at this hour, and it runs your code. Claude Code specifically, now with remote control, lets you run your local machines from far away, and that's helpful to many folks. We also have Cursor launching in this area: Cursor is launching cloud agents that onboard the code base, run in an isolated VM, and deliver video demos of completed PRs. This is not new; we've seen it before. But it is new to Cursor. Cursor started as an IDE, an integrated development environment where you write code and it autocompletes, and it's really moving towards a fully agentic cloud system as well. They show you a video of how Cursor's agents can interact with your product, and that's very helpful for debugging. Ryan, we talked last week about Gena, or GenFi, and our ability to do stuff on the backend, but frontend is harder, because somebody actually needs to use it. This seems like a step in the right direction, where you can actually view the differences.
Ryan Carson
Ryan Carson 1:00:36
Yeah.
1:00:37
The browser testing loop is still not there, and everybody doing frontend design and development knows that.
Alex Volkov
Alex Volkov 1:00:45
Perplexity also released Perplexity Computer,
1:00:46
which has computer use and does a bunch of stuff as well. We did have a comment about this from iOS, who says: as a Perplexity Max subscriber, Perplexity Computer is really smooth, better than Manus. Manus has also been in the agentic AI-running-stuff space; Manus has been out for a while, and Meta recently purchased it. Then OpenClaw released, everyone got super, super excited, and OpenClaw joined OpenAI. Do you guys see the trend, right? While these labs can absolutely build those tools themselves, they're now purchasing the actual companies in the space. It's very interesting and telling: not only is everybody stepping towards agentic, async agents running, all these companies are also purchasing these agents. Folks, I want to bring on Ben. Ben showed up on my timeline, and his rise is coinciding with the December hype change as well. So I'll let Ben introduce himself. Hey Ben, nice to meet you. Thanks so much for joining us. Would love for you to introduce yourself, who you are, and then let's talk about Polsia and your recent, exciting life.
Ben Broca
Ben Broca 1:01:50
Thanks for having me on the show.
1:01:51
My name is Ben, French and also American. I'm an engineer, but also an entrepreneur. I've been studying companies for a long time. In the past two years I started really digging into code again with vibe coding, and then in the past year I've been increasingly amazed by the capabilities. Then Claude Code came out, and the Claude models got so good at using tools that it really opened up a new frontier, I think in December, when Opus 4.5 came out with the Chrome integration and how good it was at browser use. There's a future where AI can actually do everything; it's just taste and creativity. So Polsia is really a platform that lets you build and run companies autonomously. Polsia is just gonna do everything for you. You bring your idea, your taste, your creativity, maybe your marketing insights or your community insights about who to sell it to and how to present it to them, and Polsia will do 80% of the grunt work: writing code, deploying to a web server, setting up a GitHub, setting up a database, making it scale, running Meta ads, tweeting, responding to support emails, sending cold outreach, doing competitive research. All this stuff it can do. And what's unique about Polsia is that I make it so easy because I give it everything: I give every instance of a company everything it needs to run the company. All the user has to do is talk to an AI CEO agent that they can jam with about what to do, what direction to take. So you get a co-founder, not an assistant; it's prompted not to be an assistant but to be aligned with you, to make the business work. So it can push back if you, say, try to add too many features before you have users. That's really the intent. And of course, it works autonomously every day. Every night it wakes up, like a cron job, right?
And it looks at the state of the business, looks at logs, looks at user analytics, and takes a decision on what the best next step is. Is it fixing a bug? Is it adding a feature? Is it doing more marketing? So, yeah.
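Ben's nightly loop reduces to "read the business state, pick one action." A toy rule-based sketch of that shape; all field names and thresholds here are hypothetical, and Polsia's real loop is LLM-driven, not hard-coded rules:

```python
# Hypothetical nightly "heartbeat": inspect the business state, return one action.
def next_action(state: dict) -> str:
    if state["open_bugs"] > 0:
        return "fix_bug"         # a broken product beats everything else
    if state["signups_today"] < 5:
        return "run_marketing"   # nobody new is showing up
    if state["feature_requests"] >= 3:
        return "ship_feature"    # enough demand signal to build
    return "write_changelog"     # quiet night: keep users informed

state = {"open_bugs": 0, "signups_today": 2, "feature_requests": 4}
print(next_action(state))  # run_marketing
```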
Alex Volkov
Alex Volkov 1:03:46
I have a few questions then.
1:03:47
So first of all, the graph you have at polsia.com/live, where you actually run Polsia with Polsia, is one of the coolest things I've ever had to see. So definitely feel free to pull it up. But I have a question: I saw many folks jumping on the success of OpenClaw and building things on top of that success. You're completely off that track, right? You built Polsia yourself, and it scales itself. It's not related to any of the latest excitement. Pete just celebrated three months of OpenClaw; you also started something in December. There's something there.
Ben Broca
Ben Broca 1:04:24
talk to me about December.
1:04:25
I actually had the idea in April, but I think the models were not as capable then. I started working on it at the beginning of November, and I think that's when Pete also started working on OpenClaw. We probably both saw the same thing at the same time. He had an open-source approach, more of a personal assistant. I would describe it like this: OpenClaw is like Android. It's open source, you can set it up on your computer, you can configure it however you want. Polsia is more like the, quote unquote, Apple: it's an ecosystem, everything's set up for you, it's very opinionated. It's about helping you with a business; it's not really a personal assistant, it's really more like an AI team, with everything provisioned for you. So we came to the same conclusion. Interestingly, when I saw OpenClaw blow up, I was like, this is cool, 'cause I'm not alone anymore. What I've been doing, being completely pilled and working 16 hours a day talking to AI all day, okay, there are other people doing that. And actually there are other people who came to the same conclusion.
Alex Volkov
Alex Volkov 1:05:23
Yeah.
Ben Broca
Ben Broca 1:05:23
I think two weeks ago I asked my Claude, hey, go download
1:05:26
the open-source OpenClaw codebase and dig into any feature you think we're missing that we should add. And it concluded: hey, it's pretty much the same architecture, it has the same heartbeat; maybe you should add better memory systems. I think what they're doing is interesting in terms of the agent writing its own skills in the end, so let's steal that. And so we stole some of the smart memory system that Pete and the open-source community architected. Nice. But yeah, overall similar idea, different implementations, both running at the same time.
Ryan Carson
Ryan Carson 1:05:59
Ben, nice to meet you.
1:06:00
I'm really curious about how the sausage is actually made here. I love this idea, by the way. Getting the right data to the agent is obviously the key, right? And companies obviously have very different ways of doing this; sometimes I call this terraforming your company to be agentic-ready. What sort of setup do you require for companies to be able to be run by Polsia?
Ben Broca
Ben Broca 1:06:26
I think that like the way the AI gets all the information it needs
1:06:31
to really go in the right direction is this: when you onboard with Polsia, it's actually gonna do market research. It's gonna look on the web for every piece of information it needs to understand what business we're doing, what our competitors are doing, and what the best approach is. That's number one. There are research agents and QA agents and browser agents that can browse the web and get a feel for what exists, to make sure we don't start from scratch. Number two: every agent that works will actually learn. So for example, if a cold outreach agent works for a specific company and learns that adding emojis to the subject line gets a better response, it will save that learning into a cross-company memory file that can be searched later by the next agent that runs, right? The idea is that the more users on the platform exploit and explore different crevices of the economy and different use cases, an agent that just does cold outreach will learn: okay, in those use cases, when you pick this type of demographic in this type of situation, this is what gets responses. And so the system gets better over time.
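The cross-company memory file Ben describes could be as simple as an append-only log of generalized notes, searchable by tag. A hypothetical sketch; the JSONL format and the field names are assumptions, not Polsia's actual design:

```python
import json, os, tempfile

def save_learning(path: str, tags: list, note: str) -> None:
    """Append one anonymized, generalized learning (no PII, no company names)."""
    with open(path, "a") as f:
        f.write(json.dumps({"tags": tags, "note": note}) + "\n")

def search_learnings(path: str, tag: str) -> list:
    """Return the notes of every learning carrying the given tag."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r["note"] for r in records if tag in r["tags"]]

path = os.path.join(tempfile.mkdtemp(), "memory.jsonl")
save_learning(path, ["cold_outreach", "email"], "emoji in subject line lifts replies")
save_learning(path, ["ads"], "short UGC hooks outperform long ones")
print(search_learnings(path, "cold_outreach"))  # ['emoji in subject line lifts replies']
```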
Ryan Carson
Ryan Carson 1:07:37
but that's nice.
1:07:39
But what if one of my competitors is using Polsia? I don't want them to learn from my learnings. How do you isolate that?
Ben Broca
Ben Broca 1:07:45
So it's anonymized, and it's generalized learnings, more like
1:07:49
the way an LLM trained on all of the internet ends up learning from every instance of what happened everywhere. It's a shared learning of what works and what doesn't. It doesn't use PII or specific company names or specific things; it's more general learnings, and it benefits everyone. A new customer would benefit from the platform, and other people will benefit from the platform the same way. Other big platforms use anonymized data to make you better: if you go on Amazon or something like that and set up a shop, they'll tell you, hey, on average, users who do this get better performance. Or on Meta ads, right? They tell you, hey, if you turn on this knob, on average you'll get 10% more performance, because that's what other people have seen succeed.
Ryan Carson
Ryan Carson 1:08:31
Cool.
1:08:31
One. So similar
Ben Broca
Ben Broca 1:08:32
idea.
Ryan Carson
Ryan Carson 1:08:32
So what type of companies are being successful at Polsia?
1:08:36
And I assume there are, like, four companies that make up the majority of this 700-grand run rate, or are they equally spread across? Tell us what's working and what kind of company.
Ben Broca
Ben Broca 1:08:46
So to clarify that run-rate number: it's the annualized amount
1:08:49
of money that flows into the Polsia ecosystem. That mainly includes subscriptions, since to set up a company on Polsia you pay 50 bucks a month, and you get 30 days of autonomy, a bunch of tasks, and a web server provisioned for you, database, et cetera. I don't really make money on that 50 bucks, but that's the best description. Then people can add more tasks to do things faster, and they can also run ads, which is an extra charge. And then there's the company revenue. The majority of that run rate is actually platform spend: people spending dollars to create their business. In terms of the most successful companies, we're still really early, because as you can see on the graph, the majority of businesses are one week old. There are a few businesses that started making money, and I'm seeing increasingly more transactions coming in, but it's really early. Most companies make less than a hundred bucks MRR. That's why I introduced the ads product a week or so ago, where essentially Polsia autonomously creates UGC ads, puts them on Meta, wakes up every day, looks at performance and CTR and ROAS, and actually decides on campaigns.
Alex Volkov
Alex Volkov 1:09:57
I'm sorry, I have to pause you.
1:09:58
It just super quickly passed 700,000 live on the show. 700,000 dollars in ARR since... since December, you said?
Ben Broca
Ben Broca 1:10:08
Yeah, I launched in December, end of December.
1:10:11
And yeah, most of the growth came recently. About a week ago I announced on X that I was doing a fundraise where my AI, Polsia, would raise its own round of funding, and that picked up quite a bit; I got a lot of views. And now I'm telling my story publicly, showing the numbers, showing what I'm doing, showing that I'm solo on this and using AI to the max to run the platform itself. I actually had a minor outage this morning, and literally Opus and Codex were both running at the same time, diagnosing it, making sure they were correct, and then pushing the hotfix to production. That self-solved it. And I was like, this is crazy, because the alternative would be 24/7 infra folks on call, and I can just run a cron job with, like, infra monitoring and get it fixed.
Alex Volkov
Alex Volkov 1:10:55
Ben, first of all, this is incredible, incredibly inspiring as well.
1:10:58
The growth is just indicative: the graph I see in your ARR, which is absolutely bonkers, is also the graph we see in the METR long-horizon progress we just talked about, and the graph we see everywhere in the intelligence explosion. Maybe the last thing I'll ask you: what are some of the learnings you have to share with folks who maybe wanna start on this journey, maybe use Polsia to start on this journey, and also have AI run their companies? Give us a little bit of the insider feeling: what it's like as a solo founder with AI who runs a company like this and talks to agents all day.
Ben Broca
Ben Broca 1:11:32
Heh, here it is, right?
1:11:33
It's crazy. And I had a lot of existential moments where I was like, what am I doing? Why am I still alone? In my previous companies I had 400 people under me, like when I was working for Travis Kalanick at CloudKitchens. So I know the concept of having a big team and how great it is. But there's something unique about pushing the boundaries of what's possible, number one. So first, from a pure engineering perspective and a curiosity perspective, I'm like, how the fuck can I get this right? And I think that's pretty fascinating. Number two: let's say right now I'm like, okay, I need to hire an engineer to do these things. The bar to hire is so high, because I need to find someone who's not just a junior, someone who is way smarter than me on this, who is also completely pilled and will be okay with letting AI agents run in production, and will become a coordinator of agents in production. Those people are rare, because they usually work at the labs and they're already getting paid millions of dollars there. So they're very expensive and hard to hire. And my alternative to hiring them is to train agents myself, meaning essentially figuring out prompts, giving them the right tools, giving them the right context, and trusting them to push to production. So right now I have agents in production that talk to users and execute tasks on behalf of users. They do bug reporting and feature reporting, so I have a bug list and a feature list that is autonomously written by agents in production in Polsia. And then I have another agent team that picks up those bugs and features, figures out the clusters that are most important, and builds them autonomously. And then they give it to me. And I'm the bottleneck now.
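The triage step Ben describes (cluster raw reports, rank clusters by how many users hit them) can be caricatured with keyword matching. A real system would likely cluster by embeddings; every name here is a hypothetical stand-in:

```python
from collections import Counter

KEYWORDS = ["login", "billing", "export"]  # hypothetical cluster labels

def cluster_reports(reports: list) -> list:
    """Count reports per keyword cluster, most-hit clusters first."""
    counts = Counter()
    for report in reports:
        for kw in KEYWORDS:
            if kw in report.lower():
                counts[kw] += 1
    return counts.most_common()

reports = [
    "Login button does nothing on mobile",
    "Billing page 500s after upgrade",
    "Can't login with Google",
    "CSV export is empty",
    "LOGIN loop after password reset",
]
print(cluster_reports(reports))  # [('login', 3), ('billing', 1), ('export', 1)]
```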
But then the next step is to just let them rip and be like, I'm not even the bottleneck anymore. You guys self-heal the platform. You guys build the features users want, because who am I to judge what users want? Yeah. And by the way, Polsia is an economy, right? If someone wants to build a crypto business, let them, and build whatever crypto APIs we need. And so I'm thinking: right now Polsia is probably 80% autonomous, meaning 80% of the operations are autonomous, and I'm just checking. Can I make it 90% autonomous? Can I make it 100% autonomous, meaning I don't even take decisions anymore? I give it an ethos, I give it financials, and I'm like, you know what, just rip. And then, from a "the singularity is near and nothing will matter soon" perspective, I'm tempted to ask: is that cool? Or is what's cool raising money, hiring a bunch of people, and being a normal company? And I may change my mind, because when I woke up at
1:14:08
6:00 AM with the mini outage, I was like, fuck, okay, what's going on?
1:14:12
But it's cool, it's just cool. Right now I'm in flow. I'm excited about what it is. I love that users love it. I'm pressured to make sure it works and generates more revenue for more users, and it actually works. I'm aligned with that.
Alex Volkov
Alex Volkov 1:14:25
Well, I'm gonna test this on ThursdAI, Ben.
1:14:27
Thank you so much for joining, and huge congrats on the success. The graph is fucking parabolic. Folks, go to polsia.com/live just to see how many people are signing up every second. I think you had like 790 companies launched in the past 24 hours or something, many of them paid. Ben, congrats on the success. We hope to bring you back on to see what's next for you. Thank you so much for joining us.
Ben Broca
Ben Broca 1:14:48
Thanks a lot.
1:14:49
See you
Alex Volkov
Alex Volkov 1:14:49
Okay.
1:14:49
So, this is just fucking insane. He launched it in December.
Ryan Carson
Ryan Carson 1:14:55
Yeah, this is gonna become normal.
1:14:56
This is how I'm running my startup. It's me and my chief legal officer, and I'm not gonna hire people unless I have to. I want things like this to run the company. So.
Alex Volkov
Alex Volkov 1:15:04
You'll bring agents on. All right, from one agentic,
1:15:07
parabolic thing to another, I would love to introduce Nader. Nader, welcome to the show. Nader from Cognition, or recently from Cognition; you just joined not that long ago. The reason I reached out to you is that Cognition is a company we've covered here on the show; we literally talked about it at the beginning of the show. Devin, the first Devin, the big launch with Scott Wu, the mathematician, launched two years ago, and back then it was $500 a month, I remember. And I got access to Devin recently, to the new one, Devin 2.2, and the capability jump that happened there broke my brain. So Nader, welcome, and congrats on your new gig. What can you tell us about the new release? What happened this week? We're a weekly news show. What is new about Devin right now that fits the moment, with everybody talking about how December changed everything? What's new with Devin compared to the Devin of two years ago?
Nader Dabit
Nader Dabit 1:15:57
Well, the interesting thing is that, they've been
1:15:59
building the platform for two years. So when Devin was initially launched, when you first became aware of it and were saying it was a little expensive to try out, that was two years ago. And imagine how much has changed in two years in terms of the capabilities of these LLMs and how much better the models have gotten. Throughout those two years, they've been building the platform to essentially facilitate cloud agents and large engineering teams doing actual software engineering work with these agents. And as they've built the platform, the models have continuously improved. So now there's this inflection point, I think, where the platform and the capabilities of the LLMs and the agents all come together to make a very compelling, high-quality product. That's maybe what's changed in general: they've been building for two years, and the models have gotten so much better. Specifically with the recent launch, we've added a lot of improvements, enhancements, and latency gains, and essentially made the product 10x better. I did some exploration around December, looking at what was out there in terms of what I wanted to do next, 'cause I wrote and talked a little bit about my career change and wanting to move into the space full-time. I'd been dabbling in it for a few years, but I realized, oh, this is exactly what I wanna do. And the thing that was really compelling about Cognition is that they've been building specifically for this problem for years, and that's their only focus, and that's really the thing I also wanted to do. So yeah, that's what led me here.
Alex Volkov
Alex Volkov 1:17:41
I would love to hear from you what compelled you to move as well.
1:17:44
Cognition has been doing the thing that everybody in this industry who got to scale as early as they did basically said: build for the future. Even if it doesn't work right now, the models will get there, and by the time they do, you need the infrastructure, you need the setup, you need the scaffold, you need the users already. We've been talking about this on ThursdAI for a long time. So it's very interesting to see. Contrasting: just before you joined, we talked with Ben, who's scaling a completely new company, and that's going parabolic.
Nader Dabit
Nader Dabit 1:18:15
Yeah,
Alex Volkov
Alex Volkov 1:18:15
Cognition
Nader Dabit
Nader Dabit 1:18:16
there.
1:18:16
I'm gonna have to go check it out.
Alex Volkov
Alex Volkov 1:18:17
Yeah.
1:18:17
I think Swyx talked about a very impactful meeting that you guys all had two or three weeks ago. And I wanna hear about Devin's use inside Devin. How do you use Devin, and what changed in your workflow since you joined? 'Cause it's also changed since then.
Nader Dabit
Nader Dabit 1:18:32
Yeah.
1:18:32
We're gonna be publishing a blog post actually later today titled How Cognition Uses Devin to Build Devin.
Alex Volkov
Alex Volkov 1:18:38
this.
Nader Dabit
Nader Dabit 1:18:39
But I think the most interesting thing that I've
1:18:41
noticed since I started working here is that it has essentially lowered the barrier to entry for everyone in the company to easily contribute improvements and polish within the platform. There's obviously a lot of discussion around displacement of jobs with all the AI stuff happening, like, oh, if this technology can do the engineering work, we're gonna need fewer engineers. But what I've seen happening is that we just do a lot more. If every problem is solved: let's polish this, let's make this better, let's make this faster. You're now in a race to build the best possible product because you no longer have that friction. So if someone notices a typo in the documentation, there's no "let's go create a Linear ticket and wait for someone to find the time to fix it," because an engineer often has more important problems to solve than a typo in documentation. But the person who notices it can just go in Slack and say, hey Devin, fix this documentation typo, boom. So you now have the real problems that engineers can spend their time focusing on, and a lot of these minor-to-intermediate features and bugs can just be fixed by anyone. That's the biggest difference I've seen. And then, beyond the software engineering capabilities of Devin, we have a lot of MCPs and databases and analytics tools built in as well. So our sales teams and our analytics teams, anyone in the company, can just ask for information about what's happening with a customer and get it directly within the same interfaces we're working in. Those are two big differences, I would say, from what we've done in the past.
Alex Volkov
Alex Volkov 1:20:31
Everybody started adopting skills and Windsurf
1:20:34
is now in Cognition as well. And Windsurf used to take skills from a different path, .windsurf or something, slash agents, whatever, and OpenAI compelled the community to get them from .agent/skills. So I replied, I think to Swyx, or at-tagged Swyx: hey, Windsurf should align to this, 'cause why not? Everybody's putting skills in the same location. Swyx replied with a screenshot of him asking Devin to do exactly this, and Devin one-shotted it over a million-line codebase of VS Code, Windsurf, et cetera, and it was done. Yeah. Nisten, go ahead.
Nisten Tahiraj
Nisten Tahiraj 1:21:04
Yeah.
1:21:04
Nader, I also had a career change eight years ago, thanks to your tutorials. I moved from doing DevOps and security to, when I saw the Amplify beta, I was like, oh wow, you can just command entire systems now with TypeScript, and you have infrastructure as code and you can connect the front end and the back end. But it did feel like, back then or for a while, that things did not break as often, the testing was done right, and you could do a lot with many people working on the same codebase. It just doesn't feel like that today. With agents and stuff, you still have to go and be manually involved. So this is an open-ended question: what do you feel is missing in the way people handle large codebases today? Do you think it's the testing? What's your opinion on the difficulties of handling large codebases with agents?
Nader Dabit
Nader Dabit 1:21:59
I think that it's also a really cool discussion to have around.
1:22:03
What's happening within the industry in terms of people arguing over which tool is best for which job. You have different types of use cases. Right now at least, and maybe this will change, I'm sure it will, you have the larger-scale, more sophisticated tasks that you would only trust a senior engineer to do, right? And then you have the very lightweight tasks at the other end, something like fixing a typo in documentation, and then everything in between. So I do feel like, at the moment at least, you have different tools that are better for different jobs. A CLI is better for certain jobs, an IDE or desktop app is better for other jobs, and cloud agents are better for certain jobs. Things are just going to get better; this is the worst they'll ever be. And I think the two main things to consider for more complex, large engineering tasks on larger codebases are, number one, the context that's available and how the LLM can process that context, and number two, the quality of the LLM at the moment. So a year from now we'll have better LLMs and better context management, and that type of work will just get easier and easier. Right now you still have to trust more senior-level folks who understand exactly what they're doing to make those types of changes; you have the right people doing the right things at the moment. You obviously don't want a very junior person to go in and make some sophisticated change to a database schema, something that has repercussions across the entire application. It's similar to how you might architect your actual teams.
You still need that senior-level person to make those types of changes, but I do think the easy-to-intermediate stuff has already been totally abstracted away for everyone else.
Alex Volkov
Alex Volkov 1:24:06
Meanwhile, I will, oh, there you are.
1:24:08
You're good. You're back. Meanwhile, I wanna show off the Devin interface and the cool things that I got. I obviously test a bunch of these tools, right? Devin is not the only one that can run agent things; recently I've been running a bunch of coding tasks via OpenClaw as my instrument to do stuff with Codex, with Claude Code, et cetera. Devin has the whole loop, which is great. So I wanna show some stuff. Devin has a video recording of it testing my website. This is the new website I launched for ThursdAI; by the way, you're supposedly already on it as a guest. And Devin just completely does this thing I got super excited about. I can show this here for folks; you guys see this, the preview link. Here's what happened with Devin. Cloudflare builds my website, okay? It happens automatically; the agents that write code and push a pull request don't need to know about this. Cloudflare does this, but it happens a little bit later, so Cloudflare needs to wait for the build, et cetera. Every other agent that I had was like, hey, pushed the PR, and shut up. Devin pushed this PR, waited, looked at the build, found me the link, and surfaced it to me: hey, now you can actually see this. That waiting thing, it's the small things, right? I don't even care which model runs behind this; it's the small things that developers need. But then also, it showed me an end-to-end video of testing the website. And the other highlight I want to show is this. You guys recently launched Devin Review. I don't know how to describe Devin Review, but folks, if you have a pull request and you switch the github.com in the URL to devinreview.com, you'll have Devin basically do an additional agentic pass. The integration between Devin Review and Devin itself was just incredible to me.
Devin built a thing, found the bug, Devin Review told Devin about the bug, and Devin was like, okay, let me fix this. This loop happened with me just sitting there saying, what the fuck is going on, dude? What is this world? Can you tell me about one more thing? What else is built in there that people should know about that's different from other tools?
Nader Dabit
Nader Dabit 1:26:05
I think Devin review specifically has been such a success.
1:26:08
We've had really great feedback from that. So for anyone watching: you don't need an account, and it's free to use. You can literally just do what he said, replace github.com with devinreview.com on any pull request, and try it out, see if you like it or not. Again, it's free. Obviously, if you have a private repo you do need to sign up for an account to use it, but with a public repo it's completely available without an account. If you have an enterprise account it's a little different in terms of accessing your repos; there's more security stuff there, obviously. But yeah, there's been a really good response to it. And it's something that has really shown me, this is obviously my first month working with this team, how much they care about polish. We have an internal Slack channel for every product, sometimes multiple per product, where all we do is discuss how to make improvements. There are probably a hundred to two hundred messages a day there about what we can do to improve, and almost all of them are actioned on. If someone says, hey, this should be done, it isn't just left there; someone says, hey Devin, do this, and we make the improvement. So imagine that type of polish happening for days, weeks, months, and then years. That's what's been impressive to me about working with this team so far: they really care, and they're relentless about making improvements and pushing things forward. So I'm glad that came through for you with Devin Review. One more thing is scheduled sessions, which are essentially cron jobs.
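As an aside for readers, the URL swap Nader describes, pointing any GitHub pull-request URL at the review host, boils down to a one-line host replacement. A minimal sketch (the `devinreview.com` domain is as heard on the show; treat it as an assumption):

```python
# Sketch of the host swap described on the show: point any GitHub PR URL
# at the Devin Review host instead. The "devinreview.com" domain is as
# heard in the conversation -- an assumption, not official documentation.
from urllib.parse import urlparse, urlunparse

def to_review_url(pr_url: str) -> str:
    """Swap github.com for devinreview.com, keeping the PR path intact."""
    parts = urlparse(pr_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    # ParseResult is a namedtuple, so _replace swaps just the host
    return urlunparse(parts._replace(netloc="devinreview.com"))

print(to_review_url("https://github.com/acme/repo/pull/123"))
# → https://devinreview.com/acme/repo/pull/123
```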
Alex Volkov
Alex Volkov 1:27:40
Oh,
Nader Dabit
Nader Dabit 1:27:40
no way.
1:27:40
Where you can do anything within Devin and automate it. yeah.
Nader Dabit
Nader Dabit 1:27:43
you'll see under sessions.
Alex Volkov
Alex Volkov 1:27:47
I think it's not there.
1:27:48
Maybe I don't have access yet.
Nader Dabit
Nader Dabit 1:27:50
No, you should have access.
Alex Volkov
Alex Volkov 1:27:52
Oh, schedules right here.
1:27:53
It's new.
Nader Dabit
Nader Dabit 1:27:53
Anyway, so this is essentially what I'm talking about.
1:27:55
This is really cool because it's similar to how, if you've used Claude, you can ask for a cron job in natural language. Here you can actually ask for a cron job within your Devin system. And the really interesting thing is that this integrates with all of the MCPs and any other integrations you have built in. So you can say, hey, give me a daily standup for everyone who's submitted any pull request to this repo, send it to Slack, and create a new Notion document for this standup for the whole team every day. But you can also say, hey, every day go look at all the Sentry errors, or any messages we get from any bugs or logging systems, and if there's anything that needs to be addressed, create a pull request for it, or even create a Linear ticket, or whatever you want. You can do literally anything. So there's all types of stuff you can start thinking about automating that makes your job easier. And again, a big theme here is: how do we automate and abstract away a lot of this lower-level to intermediate work so we can focus on massive improvements and also polish? The engineering work is still there. It's just different.
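For readers, a scheduled session as Nader describes it reduces to three pieces: a natural-language task, a schedule, and the integrations the result flows into. The sketch below is purely hypothetical, it is not Cognition's actual API, and every field name here is invented for illustration:

```python
# Hypothetical sketch only -- NOT Cognition's real API. A "scheduled
# session" reduces to: a natural-language prompt, a trigger schedule,
# and delivery targets (Slack, Notion, Linear, ...).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ScheduledSession:
    prompt: str                                      # natural-language task for the agent
    hour_utc: int                                    # fire once a day at this hour
    deliver_to: list = field(default_factory=list)   # e.g. ["slack", "notion"]

    def due(self, now: datetime) -> bool:
        """True when the daily trigger hour has arrived."""
        return now.hour == self.hour_utc

standup = ScheduledSession(
    prompt="Summarize yesterday's merged PRs and post a standup",
    hour_utc=9,
    deliver_to=["slack", "notion"],
)
print(standup.due(datetime(2026, 2, 5, 9, 0)))   # → True (09:00 trigger fires)
```

A real scheduler would loop over registered sessions, check `due()` each tick, and dispatch the prompt to the agent; this just shows the shape of the data.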
Alex Volkov
Alex Volkov 1:29:11
Yeah.
1:29:12
The last thing I wanna highlight, and obviously thank you so much for coming, is part of the Devin ecosystem: this beautiful step-by-step execution view that shows exactly what happened, when, and for how long. Developers with actual production systems care about how and exactly what happened and when. Oh, for sure.
Nader Dabit
Nader Dabit 1:29:30
For
Alex Volkov
Alex Volkov 1:29:30
sure.
1:29:30
And Devin has a full desktop environment, and every step is documented with all the code. That's just fucking incredible. Nader, we have to continue. Thank you so much for coming. We'll get to a point where we'll ask you how people can actually go get Devin. I think Devin Review is free to start; Devin is credit-based, and right now it's
Nader Dabit
Nader Dabit 1:29:45
free this week for the next week.
1:29:47
at least, I don't know exactly how long this is gonna happen, but it's free for now. Let's go. Yeah, if you're watching, check it out and, try it out. It's free.
Alex Volkov
Alex Volkov 1:29:53
Awesome.
1:29:54
Nader, I consider you a friend of the show. Please feel free to let us know about new releases that are coming, and come talk about them. Thank you so much for joining us on ThursdAI. Thank
Nader Dabit
Nader Dabit 1:30:02
you for having me.
Alex Volkov
Alex Volkov 1:30:03
Alright.
1:30:03
Nisten, you gotta check out Devin after this. 'Cause otherwise,
Nisten Tahiraj
Nisten Tahiraj 1:30:06
I, I did over the weekend I was at a hackathon.
Alex Volkov
Alex Volkov 1:30:08
So we need to talk to the team to bring your own credits.
1:30:10
I found it, compared to OpenClaw and everything else, very strong. This is a very strong team working on this and building these features. It's stable. A bunch of the open source stuff is not as stable as I would like, and this is stable. All right, folks, thank you so much. We're almost at the end, but we have another interview for you. I wanna introduce Philip to the stage. Philip, what's up? Welcome to the show. I'll bring up Wolfram to help me interview. Philip, it's your first time on here, I believe.
Philip Kiely
Philip Kiely 1:30:37
Hey, longtime listener.
1:30:39
First time caller.
Alex Volkov
Alex Volkov 1:30:40
Cool. You work at Baseten.
1:30:42
would love for you to introduce yourself and Baseten, and then let's talk about the book you just published.
Philip Kiely
Philip Kiely 1:30:46
I'm Philip.
1:30:47
I work at a company called Baseten. I've been here for more than four years. Baseten is an inference provider, and we focus on running models for very latency-sensitive and very uptime-sensitive customers. So if you have something that's mission-critical to your product and has to be really fast, you generally come to us. And in four years in this industry, I've learned a lot about what it actually takes to do inference, because you can't just get a GPU, put vLLM on it, and be like, all right, inference solved. So I wrote a book that came out Monday called Inference Engineering, which talks through, end to end, the entire problem of inference and the dozens of technologies involved. I was hoping people would like this book. I got like a million views on Twitter, and almost 10,000 people have downloaded it already. It's been just an overwhelming reception from the market.
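One concrete reason "a GPU plus a serving framework" isn't the whole story is batch scheduling. The toy model below (not from Philip's book, just an illustration) compares static batching, where every request holds its slot until the longest request in the batch finishes, with continuous batching, where each slot is released the moment its request completes:

```python
# Toy illustration (not from the book): why batch scheduling matters in
# inference serving. We count occupied slot-steps as a proxy for GPU time.

def static_batch_cost(lengths):
    # Static batching: every slot is held until the LONGEST request finishes.
    return len(lengths) * max(lengths)

def continuous_batch_cost(lengths):
    # Continuous batching: each slot is held only as long as its own request.
    return sum(lengths)

reqs = [12, 100, 30, 45]                # output tokens per request in one batch
print(static_batch_cost(reqs))          # → 400 slot-steps occupied
print(continuous_batch_cost(reqs))      # → 187 slot-steps actually needed
```

With uneven output lengths, which is the normal case for LLM traffic, more than half the static batch's slot-time here is spent idle, which is the gap serving systems like vLLM exist to close.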
Alex Volkov
Alex Volkov 1:31:45
Incredible, dude.
1:31:46
Congratulations. Thank you so much for reaching out to me and also for sending me a copy; I'm waiting to receive it.
Philip Kiely
Philip Kiely 1:31:51
gets here later today.
Alex Volkov
Alex Volkov 1:31:52
oh, hell yeah, dude.
1:31:53
Let's go.
Philip Kiely
Philip Kiely 1:31:54
Yeah, I've made close personal friends with a lot of the
1:31:56
folks at the FedEx office because I've been shipping so many books.
Alex Volkov
Alex Volkov 1:32:01
That's great.
1:32:01
So I do wanna ask you another question. When DeepSeek came out a year ago and crashed the stock market, it crashed it specifically because people were like, oh, for training, maybe the amount of money these companies spend on these ridiculous data centers is overspent. And for clarity, me and Wolfram over here work at Weights & Biases, which CoreWeave acquired, so we're also part of this thing. But then OpenAI launched the whole test-time compute paradigm, and now there's another scaling axis: the more the model thinks, the better it performs. We know some examples where that's not true, where turning off thinking is actually better, but most of the time, for economically viable tasks, the more inference happens in the model, the better it performs. And here you are working on inference. Do you imagine inference not being needed going forward? This is basically my question. I know how I feel about this, but even if model training plateaus or whatever, inference is absolutely there, and the demand is absolutely bonkers, insane. Is that what you guys are seeing from an inference engineering perspective as well?
Philip Kiely
Philip Kiely 1:33:17
Inference is everything, man.
1:33:19
Like, inference: it's on our website, it's on our billboards for a reason. I will say, with the stock market crash: I grew up in Iowa, and if you grow up in Iowa, the closest center of academic learning when it comes to economics is the University of Chicago. And out of the University of Chicago came, I think, one of the most dangerous ideas you can expose an impressionable young person to, and that is the efficient market hypothesis: the idea that the participants in a market actually know what's going on and things are appropriately priced. For a long time I believed in it, because it's what I was told growing up, but much like Santa Claus, it turns out the efficient market hypothesis doesn't exist. So I was a huge fan of that particular stock market crash, and I bought the dip, a hundred percent, because I know the demand in this space is absolutely insatiable. If training gets cheaper and easier, people want to do more training, because now, instead of relying on a handful of closed labs like OpenAI and Anthropic and Gemini to do 100% of your AI roadmap for you, customers have the opportunity to train their own models and own their intelligence. So instead of a handful of customers for training, there are now thousands, tens of thousands. And then inference is even bigger. Honestly, I think maybe a few years ago training was the majority of the market, but I see a future where inference is 10x or a hundred x bigger than training, because everybody needs inference. There's gonna be local inference on your computer, cloud inference, inference on your phone, inference everywhere. So I honestly feel like being in inference today is like being in distributed systems 10 years ago, or being in mobile 10 years ago.
It's just such a fundamental shift in the technologies that are out there, and in the demand for engineering expertise, that I just couldn't imagine a better industry to be working in.
Alex Volkov
Alex Volkov 1:35:27
I think that, obviously we're stepping into kind of inference
1:35:29
for the past six months as well. We're doubling down on this on the consumer side, right? Consumers can come and get tokens from Weights & Biases inference as well, and CoreWeave has been doing this for a while. It's the Jevons paradox thing that many people don't get. Recently, I don't know if you saw this chart, somebody showed a chart of the whole world: how many people have even talked to AI, for free; then, out of those, how many people actually paid the 20 bucks a month for basic intelligence; and in the corner there's a very tiny point of how many people have AI agents running autonomously as they sleep. All of us on the show just recently got to a point where agents are running some of the time, all fairly recently. So if I imagine this as a scale of some sort, and it follows the same trend, most people will eventually have something happening for them behind the scenes while they sleep. Proactive agents, I think, is the theme of this year. And that's all directly inference. Wolfram, go ahead.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:36:26
So one I totally agree with.
1:36:28
The inference is needed by everyone. Most people don't know it yet, but it is like electricity that you need to power your systems. You need inference to power your intelligence basically.
Alex Volkov
Alex Volkov 1:36:38
we've been talking about the price of intelligence going down
1:36:40
to zero, we've been talking about this, but I'm paying out of my nose for all these pro subscriptions. What the fuck, where's the trend? Tell me where you land, 'cause the better intelligence... yes, the inference improvements are there, speedups in the preloading, in caching, a bunch of stuff, and I wish we had time to dive into exactly how this world changed, but it definitely changed, and everything is incredibly more efficient now, et cetera. On the other side, they keep releasing bigger models that keep requiring more GPUs, so it keeps being more expensive. The Max subscription for Claude, for example, is significantly better value than going to the API directly, and even then they're subsidizing it. So where do you see the trend? Is inference going down to zero? Is the price of intelligence going down to zero? Or are we gonna keep inventing bigger models that cost more and more money, and at some point they stop subsidizing and this costs even more? Where do you land on this?
Philip Kiely
Philip Kiely 1:37:36
So a couple things on that.
1:37:38
I think it's always going to cost money to run models, unless we build some kind of perfect fusion reactor and electricity becomes free; and even if electricity were free, I still think it would cost money. On the question of subsidy, I think subsidization is somewhat over-commented on in this space. Certainly there was a lot of subsidy happening, but there were also a lot of really solid businesses being built here with positive unit economics. So yes, especially among some of the bigger labs, there's certainly a lot of subsidy happening, but at the same time, we're in a world where I think intelligence costs are reasonable compared to the input prices.
Alex Volkov
Alex Volkov 1:38:28
Wolfram, we wanna get to your question real quick while I scroll through
1:38:31
this beautiful website that got built for Inference Engineering, to tell folks where they can get this book.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:38:37
Requirements are exploding.
1:38:39
Now we finally understand why we are building all these data centers. So the question is: in the last month there was a sudden GPU shortage. How has it been affecting you and your company?
Philip Kiely
Philip Kiely 1:38:49
Yeah, we are constantly looking for capacity, just because
1:38:54
it's less about market conditions and more about the pace of growth this company is continuously on. We're constantly looking for more capacity to fulfill demand. But right now we've got tons of GPUs, so come on through and grab some.
Alex Volkov
Alex Volkov 1:39:09
I have a question, Phillip.
Philip Kiely
Philip Kiely 1:39:10
Yeah.
Alex Volkov
Alex Volkov 1:39:10
what's your take on data centers in space, GPUs in space?
Philip Kiely
Philip Kiely 1:39:16
I think it's cool as hell, that's for sure.
1:39:17
GPUs are cool. Data centers are cool. Space is cool. Put 'em all together.
Alex Volkov
Alex Volkov 1:39:21
Yeah.
Philip Kiely
Philip Kiely 1:39:22
I am certainly a big fan of, GPUs that I can, walk up to
1:39:26
and turn them off and back on again when they're misbehaving.
Alex Volkov
Alex Volkov 1:39:29
All right.
1:39:29
Philip, thank you so much for joining us. Congratulations on your book. Folks, if you want to read all about inference and learn inference engineering, grab Philip's book at Baseten. We haven't talked about the open source stuff at all, and we've been known for covering open source for a while, so let's talk about it. We have two releases, I think, that we need to cover. One of them: our friends at Qwen released a medium model, Qwen 3.5. It's a 35-billion-parameter model with only 3 billion active that outperforms the previous 235-billion-parameter flagship. This has been the trend for local open source models. Nisten, LDJ, Wolfram, if you have any comments about the new model, feel free; I would love to hear them.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:40:08
But is the second release also by
1:40:10
Qwen, or is it something else? Because they also released a second one, a 27B. Ah, okay. So yeah, there's the 35B with 3B active, and there's also a 27B, which is even a bit better than this one.
Nisten Tahiraj
Nisten Tahiraj 1:40:22
this one is special in the architecture because people were
1:40:26
testing it locally with just one 3090, offloading to the CPU, and it kept the same performance even after a hundred thousand tokens on local hardware. And it turned out that 30 out of the 40 layers are actually hybrid state-space-model Mamba layers. So this is a completely different model, a completely different MoE from what we've seen so far; it is more like Jamba. And I'm excited for when they come up with the coding version of this. This is a pretty big leap for them, and I'm not surprised there are issues and stuff, but again, that's at a hundred thousand tokens. Normally, especially on local hardware, if you're running on 3090s at home, models drop to 20 tokens per second, or 10, after that. And this one just does not drop, because of this architecture. So that is incredibly interesting.
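Nisten's observation has a simple back-of-the-envelope explanation. Attention layers drag a KV cache that grows linearly with context, while state-space layers carry a fixed-size recurrent state. The dimensions below are illustrative assumptions, not the model's real configuration:

```python
# Illustrative arithmetic (made-up head counts/dims, fp16 cache): why a
# hybrid SSM/attention model holds speed at long context. Only attention
# layers accumulate a per-token KV cache; SSM layers keep a fixed state.

def kv_cache_bytes(tokens, layers, heads=8, head_dim=128, bytes_per=2):
    # K and V vectors per token, per attention layer
    return tokens * layers * 2 * heads * head_dim * bytes_per

ctx = 100_000                                        # 100K-token context
all_attention = kv_cache_bytes(ctx, layers=40)       # if all 40 layers attended
hybrid        = kv_cache_bytes(ctx, layers=10)       # only 10 of 40 attend

print(f"{all_attention / 2**30:.1f} GiB")  # → 15.3 GiB of cache to shuttle
print(f"{hybrid / 2**30:.1f} GiB")         # → 3.8 GiB, a 4x reduction
```

Less cache to read per generated token means per-token latency stays roughly flat as context grows, which matches the "does not drop after a hundred thousand tokens" behavior described above.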
Alex Volkov
Alex Volkov 1:41:24
We've talked about hybrid architectures like mamba, et cetera,
1:41:27
specifically for long context and the performance drop-off, and it looks like Qwen is adopting this. Qwen is usually the canary in the coal mine for open models, so it looks like other folks will start doing this as well. Native 262K context window, extensible to 1 million via YaRN; it looks like everything Qwen does is via YaRN as well. I would love to test it on something like SWE-bench Pro and see, but GPQA Diamond is very high as well. So this is definitely a model. But yeah, Wolfram, you're right, there's a whole lineup of models here that they released, not just the one: Qwen 3.5 Flash, 3.5 35B, which is this one, a 122B, and then a dense 27B, right? It's a dense 27-billion-parameter model. Yep.
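For context, the headline number in a YaRN-style extension is just the ratio of target to trained window; the actual method then interpolates RoPE frequencies per band, which this back-of-the-envelope sketch omits:

```python
# Back-of-envelope only: YaRN-style context extension starts from a scale
# factor s = target_context / trained_context. (The per-frequency-band
# interpolation that YaRN actually applies is omitted here.)
trained_ctx = 262_144        # native 262K window, i.e. 2**18 tokens
target_ctx = 1_000_000       # extended 1M window
s = target_ctx / trained_ctx
print(f"scale factor: {s:.2f}x")   # → scale factor: 3.81x
```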
Wolfram Ravenwolf
Wolfram Ravenwolf 1:42:09
And it's even a bit better in the benchmarks.
1:42:11
I personally looked at the Terminal-Bench scores, and I'd say the most interesting thing is that GPT-OSS 120B, which is much bigger and is basically one of the best local models, only got 18.7%, where these are over 40%.
Alex Volkov
Alex Volkov 1:42:25
terminal bench.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:42:26
Yeah, on terminal bench.
1:42:28
It's a big leap in the agentic and coding capabilities. And it's also very significant because all the other open source models we got in the recent past, like Kimi and MiniMax and so on, are all too big to run on just one GPU, so most people can't run them, while this one works. You can put it on one 3090 with CPU offload like Nisten said, or on two, and you have the speed; it's super fast. So this is very exciting, and it's my favorite of the week. We didn't do picks this time on the show, but this would have been my choice.
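The "runs on one 3090" claim checks out on the back of an envelope. With an MoE you must store all the weights but only compute through the active experts per token; the quantization figure below is an assumption for illustration:

```python
# Rough arithmetic (assumed 4-bit quantization, overhead ignored) for why
# a 35B-total / 3B-active MoE fits consumer hardware: you must STORE all
# 35B weights, but each token only COMPUTES through ~3B of them.
total_params = 35e9
active_params = 3e9
bits = 4                                   # assumed 4-bit quant

weights_gb = total_params * bits / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")    # → ~17.5 GB, inside a 24 GB 3090
print(f"compute per token: {active_params / total_params:.0%} of a dense 35B")
```

The low active fraction is what keeps token throughput high even when some weights are offloaded to CPU, since only a sliver of them is touched per step.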
Alex Volkov
Alex Volkov 1:42:59
All righty.
1:43:00
So Qwen.
Philip Kiely
Philip Kiely 1:43:01
Like we've been using these as base models for some of
1:43:04
our recent fine-tuning experiments. So I'm very excited about these models, less in terms of just running them as-is: instead of adapting a 235B base, if we can adapt one of these guys, we automatically get so much further in terms of meeting a latency and a cost requirement.
Alex Volkov
Alex Volkov 1:43:26
Yep.
1:43:27
And just to tie a loop back: Qwen was not named as one of the folks who distilled Anthropic; they were not part of those three other labs. The other release we didn't get the chance to cover: our friends from Liquid also released a Liquid Foundation Model. This is their largest one, so it's very interesting that they can release a 24-billion-parameter MoE, where Qwen releases smaller ones as well. This one has only 2.3 billion active parameters and also runs on consumer laptops. So LFM has been focusing on smaller models and finally released a bit of a big one. Folks, do we have any comments on the LFM architecture? This is not, as far as I know, Mamba layers; they have their own specific thing, right, Nisten?
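A quick back-of-the-envelope on why a 24B MoE with 2.3B active parameters can run on a consumer laptop: total parameters set the memory footprint, while active parameters set the per-token compute. The quantization level and the 2-FLOPs-per-parameter rule of thumb below are illustrative assumptions, not measurements of this model.

```python
total_params = 24e9      # all expert weights must be resident in memory
active_params = 2.3e9    # only these participate in each token's forward pass

# Memory footprint at 4-bit quantization (0.5 bytes/param), ignoring
# KV cache and activations -- an illustrative lower bound.
mem_gb = total_params * 0.5 / 1e9

# Per-token decode compute: roughly 2 FLOPs per active parameter.
flops_per_token = 2 * active_params

print(f"~{mem_gb:.0f} GB of weights, ~{flops_per_token / 1e9:.1f} GFLOPs/token")
```

So the weights fit in a laptop's RAM at 4-bit, and each token costs compute comparable to a dense ~2B model, which is the whole appeal of low-active-parameter MoEs.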
Nisten Tahiraj
Nisten Tahiraj 1:44:07
Yeah.
1:44:08
Yeah, this one's completely different. I tested it. It can't code, but I am extremely impressed with the math and with everything else from the model. It was super fast, and it answered my vibe checks on all the really difficult physics and math questions, like the Martian railgun one, perfectly. It trained on 17 trillion tokens and they're still continuing to train it, so this might turn into a surprisingly significant release. I just randomly tested it yesterday, and it was extremely good at tasks that didn't involve code. On code it only tried a very small amount and then didn't bother, so I don't think it's really trained for that, but everything else was excellent, surprisingly.
Alex Volkov
Alex Volkov 1:45:00
Two models in open source.
1:45:01
I'm pretty sure the model that broke ARC-AGI was also open source, MIT-licensed. Thank you so much, Philip Kiely, for joining us. Folks, at two hours and almost thirty minutes, I think it's time to land this plane. The only thing I want to say is that we're all still waiting for the big one, DeepSeek V4, whatever they're going to release; everybody looks like they're gearing up for it, including the US government. Let's run through the TL;DR and make sure we've covered, or at least mentioned, everything we had to cover. I think we covered pretty much everything else. Oh yeah, we didn't cover Seedance 2. I want to show you Seedance 2; I think I have it open. It finally landed for users, while the API is getting delayed by ByteDance because Disney and everybody else sent them cease-and-desists, et cetera. Seedance 2 launched on CapCut; CapCut is the editor from ByteDance, and I can show you exactly how to get it. Let me pull this up. You go to AI video right here; it still says 1.5 in the UI, but Seedance 2 is now part of it, you just have to choose it in a dropdown, and it's not cheap. You can give it one image or multiple images, but it's restricted. As you guys remember from when we talked about Seedance, you can give it video; that's how everybody generates those Seinfeld episodes where Seinfeld says some crazy shit. That's video-to-video transfer. You cannot use video-to-video here; you can use one image, but it's definitely not going to use the voices, and I think the voices are what broke the internet as well. So that's Seedance. And let's look at the Taalas demo. Nisten, can you tell us about this super quick? What the hell are we looking at?
Nisten Tahiraj
Nisten Tahiraj 1:46:37
Yeah.
1:46:37
Three engineers left Cantor while they were making accelerators, so new GPUs, basically. These were actual engineers who had worked on actual chips; this wasn't just something like Etched, or an investor play. They didn't even raise that much money, but they finally shipped a product which has the baked-in weights of Llama 3 8B, and it is just nuts. It's limited to 8K context, but I think it would be funny to have it summarize part of our transcript; it might just be a bit too long. Yeah.
Alex Volkov
Alex Volkov 1:47:10
It's a little bit too long.
Nisten Tahiraj
Nisten Tahiraj 1:47:11
So
Alex Volkov
Alex Volkov 1:47:11
We're looking at Chat Jimmy, Chat Jimmy AI, literally.
1:47:15
Let me just ask it for a short story about Mars, or GPUs in space. For folks who are not watching: I pressed the button, and the whole story just appeared, as though it had been sitting there waiting for me. It shows 15,691 tokens per second, and this whole thing was generated in 0.048 seconds. And this is, unfortunately, limited. Do we know which model this runs?
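For scale, the two numbers read off the demo are consistent with each other: throughput times elapsed time gives the length of the generated story in tokens.

```python
throughput_tok_per_s = 15_691   # generation speed shown in the demo
elapsed_s = 0.048               # generation time shown in the demo

# Tokens produced = rate x time.
tokens_generated = throughput_tok_per_s * elapsed_s
print(round(tokens_generated))  # roughly a 750-token story
```

That is, a full short story materializes in under a twentieth of a second, which is why the output appears to be instant.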
Nisten Tahiraj
Nisten Tahiraj 1:47:47
It's Llama 3 8B, that's what I heard.
Alex Volkov
Alex Volkov 1:47:50
So still Llama.
Nisten Tahiraj
Nisten Tahiraj 1:47:52
This was the demo. I guess it did take them a while to
1:47:55
bake in the weights. And honestly, even if you're going to do filtering or moderation models, those are just 3B now, so there are already uses for this thing to
Alex Volkov
Alex Volkov 1:48:07
run it.
Nisten Tahiraj
Nisten Tahiraj 1:48:07
Like Llama Guard, yeah.
Alex Volkov
Alex Volkov 1:48:08
It's an incredible use case for something like this.
1:48:10
Immediate guardrails, LLM as a judge, which we know other models should be able to do, at 15,000 tokens per second. This happens by burning the model onto the actual chip. All right, folks. We had an incredible week, capping off an incredible month in February with a bunch of launches: Opus 4.6, Codex 5.3, Gemini 3.1 Pro, and a lot of other stuff that nobody cared about; capping off an insane start to 2026 with OpenClaw and Skills and just mind-blowing stuff, agents and images. And Nano Banana 2 just launched. This is only the first two months of this year. I'm very happy with the guests we're having: today we had Nader from Cognition, Philip Kiely from Baseten, and Ben, who's building a company that broke $700,000 in ARR live on the show, after $600,000 yesterday and $500,000 the day before. He's adding a hundred thousand dollars in ARR every week right now, which is absolutely bonkers and very indicative of the singularity we're all experiencing. Very happy that you all are here. I'm also happy that over 2,000 folks are tuning into the show to get up to speed. If you missed any part of the show, ThursdAI is available on thursdai.news, our new website — please check it out — and everywhere you get your podcasts: Spotify, Apple, and on Substack. I release a newsletter that I write myself, and I try not to use AI in there at all, because I think voice and tone are very important. So if you're looking for authenticity, ThursdAI is here. We're live; this is not AI. Maybe we'll see that in a year or two, but for now, we're live and we're bringing you the experts from the industry. We're hugely appreciative of everybody who tunes in and spends time with us, and we'll see you here next week. Thank you so much, folks, for joining.
Thank you for tuning in, and we'll see you here next week. Bye-bye, everyone.