Episode Summary

This episode captures the feeling that AI acceleration has crossed from hype into lived reality: benchmarks are saturating, toolchains are maturing, and solo founders are shipping at startup speed. The panel opens with Anthropic's reported Pentagon ultimatum and distillation accusations, then moves into hard evidence of capability jumps like METR's 14.5-hour autonomy and ARC-AGI nearing saturation. Three interviews anchor the show: Ben Broca on Polsia's hypergrowth, Nader Dabit on Devin 2.2's practical leap, and Philip Kiely on why inference demand is only getting started. The thread throughout is clear: we are not just getting better models, we're getting compounding systems around them.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Ben Broca
Founder & CEO · Polsia
@bencera_
Nader Dabit
Growth · Cognition
@dabit3
Philip Kiely
Head of Developer Relations · Baseten
@philipkiely
Nisten Tahiraj
AI operator & builder
@nisten
Wolfram Ravenwolf
Independent AI evaluator (r/LocalLLaMA)
@WolframRvnwlf
Ryan Carson
AI educator & founder
@ryancarson
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
Nous Research
@ldjconfirmed

By The Numbers

METR Time Horizon
14.5h
Opus-level agents now complete tasks equivalent to over 14 hours of expert human work
Autonomy Doubling Time
49 days
Panel cites METR's recent doubling cadence as dramatically faster than historical compute trends
ARC-AGI-2
97.9%
Confluence Labs result discussed as a major signal that this benchmark is near saturation
Taalas Demo Throughput
15,000 tok/s
Chip-level baked-weight demo for Llama 3 8B shown as a 10x speed-class jump
Qwen 3.5 Medium
35B / 3B active
New open model architecture with low active params and strong practical coding/agent performance
Polsia Run Rate
$700k ARR
Ben Broca's autonomous-company platform crossed this mark live during the show
Minimax Distillation Exchanges (claimed)
13M
Figure discussed while comparing Anthropic's reported account-abuse counts across labs

🔥 Breaking During The Show

Nano Banana 2 (Flash-quality image model) announced during the show
Alex breaks in mid-TLDR to call out Google's new image model tier, describing near-Pro quality at roughly half price plus image search capability.
$700k ARR crossed live by Polsia
During Ben Broca's interview, Alex notes the run-rate counter crossing $700k ARR in real time.

⚡ Show Intro & Welcome

Alex frames the episode around 'approaching singularity' and the sense that AI progress has entered a visibly faster phase since December. The full co-host panel assembles with a promise of three major interviews.

  • Episode thesis: acceleration is now obvious to everyone, not just early adopters
  • Full panel + three guest interviews announced up front
Alex Volkov
"This is how we're getting to the singularity."

📰 TL;DR - Weekly News Roundup

A rapid-fire pass through the week's biggest drops: Pentagon pressure on Anthropic, distillation claims, GPT 5.3 Codex API, Qwen 3.5, Liquid LFM2, METR autonomy growth, ARC-AGI saturation, and new agent tooling. Alex also announces a breaking image-model update mid-segment.

  • METR, ARC-AGI, and SWE-bench all presented as major capability-shift signals
  • Devin 2.2, Cursor cloud agents, and automation features framed as practical workflow unlocks
Alex Volkov
"Opus, based on this benchmark, runs autonomously for over 14 hours to achieve a task."

🔥 Anthropic vs Pentagon / War Claude

The panel debates reports that Anthropic was pressured to remove two military-use restrictions: no autonomous lethal decisions and no domestic mass surveillance. Discussion centers on ethics, state leverage, and whether model control is still realistic in a multi-polar AI world.

  • Alleged ultimatum tied to supply-chain-risk designation and Defense Production Act threats
  • Strong split between principled refusal and realpolitik cooperation
Alex Volkov
"The two red lines: no domestic surveillance of American people and no fully autonomous lethal weapons."
Ryan Carson
"The genie's outta the bottle."

🧪 Anthropic Distillation Attacks (DeepSeek, Minimax, ZAI)

Anthropic's named allegations trigger a heated discussion on ToS abuse, model distillation norms, and the blurry legal line between scraping, training, and derivative outputs. The panel reads the numbers as both technical evidence and geopolitical signaling.

  • Reported counts discussed: DeepSeek 150k, Minimax 13M, Moonshot 3.4M exchanges
  • Core tension: enforcing platform rules while having trained on broad internet-scale corpora
Yam Peleg
"What did you train your models on?"

🤖 Opus 3 Retirement & AI Sentience Debate

A short but philosophical detour: Anthropic's treatment of models as entities sparks discussion on AI personhood, anthropomorphism, and whether giving models pseudo-agency is responsible or risky.

  • Opus 3 'retirement' narrative becomes a proxy for broader model-rights discourse
  • Panel splits between playful framing and concern about AI psychosis dynamics
Alex Volkov
"How far will they go with asking the models what they actually want?"

๐Ÿ› ๏ธ GPT 5.3 Codex Release & Open Claw

The panel compares raw coding power versus conversational quality when Codex powers OpenClaw workflows. Consensus: Codex is elite at execution but often too literal and less human in interactive assistant contexts.

  • Codex pricing and performance praised for code generation
  • Personality and intent-following still seen as Anthropic's edge in assistant UX
Yam Peleg
"It's an absolute beast for writing code... but it's doing exactly what you tell it to do."

💰 This Week's Buzz - Kimi 2.5 & Minimax 2.5 on W&B Inference

Alex and the co-hosts break down newly hosted inference options, emphasizing price/performance and multimodal capabilities. Kimi is highlighted as unusually strong for both tool use and conversational tone.

  • Minimax 2.5 presented as ~10x cheaper than premium alternatives in some tiers
  • Kimi 2.5 praised for practical function calling and image-in-loop use cases
Nisten Tahiraj
"I had it for a week... ten users testing alpha and used like four bucks for the whole week."

🧪 Evals & Benchmarks - METR, ARC-AGI, SWE-bench

Benchmark discourse dominates this segment: METR's steep autonomy curve, ARC-AGI near-saturation claims, and SWE-bench's shifting reliability. The panel emphasizes both signal and noise in headline benchmark leaps.

  • METR discussed as equivalent expert-task horizon, not raw wall-clock runtime
  • SWE-bench Verified de-emphasized as labs move to harder successor benchmarks
Alex Volkov
"This is not a log chart, this is a regular chart. Opus is literally off the chart."
Alex Volkov
"The doubling time... is 49 days."

🤖 Tools & Agentic Engineering - Claude Code, Cursor, Devin

The conversation shifts from model quality to product surface area: CLIs, desktop agents, remote control, automations, and browser loops. The key takeaway is that agent harness quality is becoming a primary competitive layer.

  • Labs converging on cron-like automations and remote, async workflows
  • Cursor cloud agents and UI demos highlighted as important frontend-dev progress
Ryan Carson
"Heartbeats, cron jobs, browser testing, cloud-based agents... all that's gonna be rolled into the entire product."

💰 Interview: Ben Broca - Polsia (AI-Run Companies)

Ben Broca explains Polsia's thesis: AI-native company ops where agents handle code, growth, support, and iteration while founders provide taste and direction. The segment captures a concrete example of autonomous operations already producing revenue.

  • Polsia positioned as an opinionated autonomous-company stack
  • Run-rate milestone crosses $700k ARR live during the interview
Ben Broca
"Polsia will do 80% of the grunt work."
Ben Broca
"Can I make it 90% autonomous? Can I make it 100% autonomous?"

๐Ÿ› ๏ธ Interview: Nader Dabit - Cognition / Devin 2.2

Nader outlines why Devin feels different now: two years of platform maturity converging with stronger models. He emphasizes a practical organizational effectโ€”lowering friction so non-engineers can fix many issues directly and teams can focus on higher-leverage work.

  • Devin Review launch, free public workflow for PR review
  • Scheduled sessions/automation and deep workflow polish highlighted
Nader Dabit
"This is the worst that they'll ever be at this moment."
Nader Dabit
"If someone notices a typo... they can just say, 'Hey Devin, fix this.'"

⚡ Interview: Philip Kiely - Inference Engineering (Baseten)

Philip argues that inference is becoming the durable center of AI economics, regardless of falling training costs. The discussion covers demand growth, market misconceptions, and why inference engineering is now a core discipline.

  • Inference framed as a future 10x-100x larger layer than training
  • Cost trends discussed as efficiency gains plus continued premium demand
Philip Kiely
"Inference is everything, man."

🔓 Open Source - Qwen 3.5 & Liquid LFM 2

Open-weight momentum remains strong with Qwen 3.5 variants and Liquid's LFM2 update. The panel focuses on architecture shifts, local viability, and the practical importance of efficient active-parameter footprints.

  • Qwen 3.5 Medium discussed at 35B total / 3B active
  • Liquid LFM2 highlighted for speed and strong non-coding reasoning
Nisten Tahiraj
"This one is special in the architecture... hybrid state-space model Mamba layers."

🎥 Seedance 2 & Taalas 15K Tokens/Sec Demo

Alex showcases Seedance 2 availability in CapCut and then pivots to a hardware demo of ultra-fast on-card inference. The segment underscores how product UX and chip-level innovation are both compressing iteration cycles.

  • Seedance 2 shipping in limited product form despite API/legal delays
  • Taalas demo shows 15,691 tokens/sec with baked weights
Alex Volkov
"It shows me 15,691 tokens per second."

📰 Show Wrap-up

The episode closes by tying the week into a larger 2026 pattern: rapid model iteration, stronger agent tools, and rising audience demand for curated signal. Alex recaps the interviews and points listeners to ThursdAI's website and feeds.

  • Over 2,000 live listeners noted
  • Core theme reinforced: acceleration is compounding across models, tools, and businesses
  • Big CO LLMs + APIs
    Anthropic vs Chinese OSS - Accuses DeepSeek, Minimax, ZAI of distillation attacks (Blog)
  • Pentagon Issues an ultimatum to Anthropic: Give military unfettered Claude access by Friday or face Defense Production Act - Anthropic says NO (Blog)
  • OpenAI releases GPT-5.3-Codex, their most capable agentic coding model, to all developers via the Responses API (X, Announcement)
  • Open Source LLMs
    Alibaba: Qwen 3.5 Medium - 35B model with only 3B active parameters outperforms their previous 235B flagship (X, HF, HF, HF, Blog)
  • Liquid AI releases LFM2-24B-A2B: A 24B MoE model with only 2.3B active parameters that runs on consumer laptops (X, HF, Blog)
  • Perplexity launches ppxl-embed - SOTA embedding models (Blog, HF, API)
  • Evals & Benchmarks
    METR Time Horizon Benchmark Goes Vertical: Claude Opus 4.6 Achieves ~14.5 Hour Task Completion (X, Blog)
  • Confluence Labs emerges from stealth with 97.9% SOTA on ARC-AGI-2 benchmark (X, GitHub)
  • OpenAI Retires SWE-bench Verified (X, Blog)
  • Agentica claims to have solved all public ARC-AGI 3 (X)
  • Tools & Agentic Engineering
    Happy 1 year Birthday Claude Code!
  • Devin AI 2.2 - autonomous agent with computer use, browser, self verify and self fix its own work (X)
  • LMStudio launches LMLink - use your local models from everywhere with TailScale! (try it)
  • Claude Code introduces Remote Control (X, Docs) and memory (X)
  • Claude Cowork and Codex both now have automations (Cron Jobs) (Cowork)
  • Cursor launches cloud agents (X)
  • Nous research agent (X)
  • Perplexity Computer (blog)
  • This week's Buzz - W&B
    W&B adds MiniMax 2.5 and Kimi K2.5 on Inference Service (LINK)
  • Interviews
    Ben Broca - polsia.com/live Polsia Dashboard
  • Nader Dabit - on seeing the future (blog)
  • Philip Kiely - Inference Engineering book (Book)
  • Vision & Video
    Seedance 2.0 finally available in CapCut in the US (X)
  • Voice & Audio
    OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5 models (X, Announcement)
  • AI Art & Diffusion & 3D
    Google DeepMind launches Nano Banana 2 (X, Announcement)
  • Quiver solves SVG with Arrow 1.0 (X)
  • Others
    Taalas AI - 15,000 tokens per second demo (chatjimmy.ai)
Alex Volkov 0:29
Good morning or evening, depends on where you are.
0:32
Welcome to ThursdAI. You're tuned in to the weekly show that keeps you up to date. So if you are like everybody else in the beginning of 2026, overwhelmed with what's going on with AI, there's just too many things at once and you feel like you need a full-time job covering or just, like, knowing the news. Uh, that's what we're here for. My name's Alex. I'm an AI evangelist with Weights & Biases from CoreWeave. And you are on ThursdAI, the weekly AI news show that brings you everything that matters in the world of AI, which is getting, to be very frank, increasingly harder and harder to do. So we have to make hard choices in what we cover. Uh, but, uh, with me today to help me do this is Yam Peleg and Wolfram Ravenwolf. And we're gonna have a few more guest hosts. How you guys doing? Welcome to the show.
Wolfram Ravenwolf 1:26
Excellent.
1:26
How are you, Alex?
Alex Volkov 1:28
Doing good.
1:29
Good. How you doing? How was your week?
Yam Peleg 1:31
Crazy week.
1:31
Crazy week. Crazy week.
Alex Volkov 1:33
They are getting, we're
Yam Peleg 1:34
we're crazy.
1:35
We're, we're at the fast takeoff. That, that's the moment. Like, that's, that's the moment. Takeoff man.
Alex Volkov 1:39
Did you see my, um, my thing for today?
1:42
Like I said, we're approaching singularity, and I definitely feel like since December, like, stuff has really changed. And so we must absolutely talk about this. Things have changed significantly and we all felt it, and now it feels like everybody else is feeling it too. And not only from AI capabilities, just from the level of AI news, the amount of stuff that people can ship to production for all of us to play with. Uh, I'm barely able to, like, catch up to the toys that I'm getting every day. Uh, and,
Wolfram Ravenwolf 2:13
uh, not even AI will stay up to date with AI.
2:15
Definitely. And everybody is using it to accelerate even more, though we are still in the quiet phase of the year, I'm pretty sure.
Yam Peleg 2:25
Look, it's accelerating because everyone got access to the tools.
2:28
So now everyone is building and everyone is building more. So you get more stuff and because you get more stuff and you can use the stuff to make more stuff. And that's pretty much what's going on.
Alex Volkov 2:37
This is how, this is how we're getting to the singularity.
2:40
And um, this is why I named the show Approaching Singularity today. Uh, because it does feel like we've been talking about acceleration for a long time. We've been talking about coding change for a long time before people caught up. We've been talking about open source. But it does feel like since December, and definitely since the beginning of the year, things are accelerating to a stupid degree. And to help us cover that acceleration, folks, I'm very happy to tell you that we have not one, not two, but three interviews today with incredible folks to help us cover the news. So we're gonna chat with Ben, founder of Polsia. You guys absolutely have to hear about Polsia. This is, like, insane. This guy's a single founder building an AI company that helps other AI companies grow, and he's passed 650,000 in yearly run-rate revenue since, like, December, and the graph looks parabolic. It's absolutely bonkers. Uh, so we're gonna chat with him about Polsia and how it is to run a fully autonomous AI company that runs other companies autonomously. Um, we're also gonna chat with Nader Dabit from Cognition. When I told the team that Nader's gonna come on, the reaction was, hey, I owe this guy my career. Nader is, like, a legendary developer relations person. He recently joined Cognition. If you guys remember Devin, we covered Devin. Devin is, like, the original agentic async coder. And we're gonna chat with Philip Kiely, who just released a book called Inference Engineering. So all of that is coming later down the show. Meanwhile, I wanna add a few other co-hosts here. Guys, I will say this: if we're coasting towards the singularity, there's no better group of people to cover this than this group, for sure. A hundred percent. Welcome, Ryan Carson.
Ryan Carson 4:25
Good to see everybody.
4:26
Yeah. Let's, uh, let's roll into the singularity together and see what happens.
Alex Volkov 4:29
That's, yeah, I think, I think this is the show now.
4:32
This is the show. We're just like, uh, vibe coding ourselves into the singularity. LDJ, how you doing?
LDJ 4:40
I am doing great.
Alex Volkov 4:41
Alrighty.
4:42
Short and sweet. I love it. 'Cause that's actually great, because we don't have a lot of time on the show today. We have a lot of stuff to cover. Let's run through the TLDR, folks. Here's, uh, everything you have potentially missed in the world of AI for the past week,
5:06
This is the TLDR for ThursdAI for February 26th. Your host is Alex Volkov with Weights & Biases, CoreWeave. Co-hosts: Wolfram Ravenwolf, Yam Peleg, Nisten Tahiraj, LDJ, and Ryan Carson. Looks like we have a full panel here for discussions as well. Three interviews today, with Ben from Polsia, Nader Dabit from Cognition, and Philip Kiely, a DevRel at Baseten, author of Inference Engineering, a new book that I would love to tell you all about. The number one thing that I think we must discuss is that Anthropic has an ultimatum from the Pentagon, the Department of War: give the military unfettered access to Claude by Friday or face the Defense Production Act. It's quite insane that we're at this point. Basically, the Pentagon gave Anthropic an ultimatum. Claude is being used for everything except autonomous lethal weapons of war or mass surveillance of US citizens, and the Pentagon is asking Anthropic to drop this hard line by Friday or face really bad consequences. So we'll see what happens tomorrow. Also, Anthropic posted at the beginning of this week that they've detected what they call distillation attacks, and they name names: specifically DeepSeek, MiniMax, and Z.ai with GLM. And we have to talk about this as well. What is a distillation attack? What does it mean? Is it illegal? Anthropic is also scraping, so we absolutely have to talk about this with you as well. Um, OpenAI released GPT 5.3 Codex in the API, so now it's available for evaluations and other tools, et cetera. GPT 5.3 Codex is a model we told you about here live on the show when it dropped, and it has since been probably the craziest model OpenAI has released, compared to Opus. Many people are switching to this, including our own host here, Ryan Carson, who switched to Codex and swears by it. What else? We have open source, and open source is banging: Alibaba Qwen 3.5 Medium, a 35-billion-parameter model with only 3 billion active. And it's also very good.
Shout out to our friends at Alibaba's Tongyi Lab, Qwen. There are multiple names for this lab, but yeah, shout out to the Tongyi folks. And also our friends at Liquid AI released LFM2-24B. This is their largest liquid foundation model, with only 2.3 billion parameters active, very fast on local inference. So local is banging. I have a new corner for you today called Evals and Benchmarks, and Wolfram, I would love, if we have a chance, to talk about this, because besides the Anthropic news, this is likely the most important piece of evaluation we have seen this week. METR measures the long time horizon of models: for how long can models run autonomously. We've seen doubling every certain number of months. Opus achieves 14.5-hour task completion. Opus, based on this benchmark, runs autonomously for over 14 hours to achieve a task, and we've seen this graph go absolutely exponential. We have to talk to you about time-horizon benchmarks and whether or not they're completely saturated at this point, because agents are running, basically, for a very long time. And this is the singularity we talk about: the longer agents run, the better the agents they build; the longer those run, the better the agents they build. We also saw a complete destruction of ARC-AGI, folks. I don't know if you saw this. Somebody came out, Confluence Labs, with 97.9% on ARC-AGI-2, which supposedly two years ago was impossible for LLMs. Now it gets 97.9%. Plus, I don't have this in my notes, but I saw that somebody posted that they solved all the publicly available ARC-AGI 3 as well. So ARC-AGI, the test that supposedly tells us if AGI is here, is absolutely, completely saturated now. And that's, like, a whole thing. This is, like, in the span of a week, right? METR goes ballistic, ARC-AGI goes absolutely saturated, OpenAI retires SWE-bench Verified because that is fully saturated as well. SWE-bench Verified.
Ryan, I remember we looked at this one and you were like, nah, SWE-bench Verified, not for me. OpenAI just said publicly, on Swyx's Latent Space as well, that, like most of the other labs, the models know the golden solutions by heart now for SWE-bench Verified, and they're only focusing on SWE-bench Pro. So we're gonna discuss this as well. Another hot corner for us is Tools and Agentic Engineering. This is, folks, if you've been living under a rock, or if you're new to ThursdAI: the world is changing really fast in this specific area. Tools for agentic engineering, and agentic engineering itself, are everywhere right now. If companies are not taking into account the amount of tokens that their developers are spending, in addition to the developer salaries, they're not gonna make it, in the very straight sense of the word. We are now in a world where, when you hire a person, you need to consider at least their salary's worth of tokens for that person to produce the next day's output. And this is why we have this corner, and in this corner I wanna wish a happy one-year birthday to Claude Code. Can you believe it? It's been only one year of this. Let's go,
Yam Peleg 10:14
let's go.
Alex Volkov 10:16
And now it's just absolutely incredible.
10:18
So, one-year birthday. It just signifies how fast we move: through this one year, just how incredibly vast the changes are. Devin released Devin 2.2, an autonomous agent with computer use, browser, self-verify, and self-fix of its own work. I've gotten a little bit of early access to this Devin thing, and now it's launched publicly to everyone, and we'll have Nader talk about this, folks. Devin slaps. Devin does stuff that neither Claude Code nor Cowork nor OpenClaw nor any of the other tools I've used does, and it's really good. And I would love to tell you about this, because it's so good that the promise of the original Devin, when it launched, it now delivers on that promise. Would love to hear from you, Nader, about Devin. The Devin announcement was March 12th, 2024. Yes, two years ago. We talked to you about Devin two fucking years ago. It's insane. Folks, Devin launches, Cursor launches cloud agents also. So in one week we have the Devin relaunch and Cursor launching cloud agents. So Cursor kind of pivots from IDE, to an extent, also to, like, an agent cloud thingy. We're talking all about Cursor. We love Cursor here. Speaking of the tools, both Claude Cowork and Codex now have automations, so cron jobs that run to do tasks for you while you're asleep, with the small exception that you have to leave your laptop open. But it's very important. LM Studio launches LMLink so you can use your local models elsewhere. All of these things, just, like, one after another. I'm pretty sure there's more
Yam Peleg 11:52
perplexity.
Alex Volkov 11:53
Perplexity Launch
Yam Peleg 11:54
perplexity.
Alex Volkov 11:55
Perplexity launches, the computer comes out of
11:57
obscurity: Perplexity Computer. If you guys have used it, we would love to hear what that's about. What Perplexity launched is Perplexity Computer, which has all of the tools for all of the agents as well. We will also tell you about This Week's Buzz: we launched MiniMax 2.5 and Kimi 2.5 on our inference service. It's a very cheap and very fast inference service powered by CoreWeave, which also powers OpenAI and a bunch of other premier LLMs. So you can get some of that inference for your open source models, and we'll tell you how to get that connected to European Cloud very quick. And then, in order to highlight this intelligence explosion, we have three folks who are very near and dear to this intelligence explosion. Ben Cera, on scaling his AI autonomous business to over 600,000 MRR. Oh, sorry, I think this is ARR. 600,000 MRR would be crazy, but it's 600,000 ARR since December. He only launched in December; he is approaching 600,000 ARR. It's insane. We'll also talk with Philip Kiely, whose Inference Engineering book just launched, and Nader Dabit from Cognition, to talk to us about Devin. Go right now. Stop listening to me, go. No, don't really, but go to CapCut. It's now free. It took the longest time, but it's now free. And we're not even at the breaking news point yet. OpenAI releases GPT audio 1.5, so OpenAI real-time audio has also improved significantly. This is something we covered here with Kwindla Kramer and a bunch of other folks: if you're building agents that talk to you, if you wanna talk to your Claudes or OpenClaws or whatever, GPT audio 1.5 is really good. Not cheap, but really good. And we have breaking news in the middle of the TLDR. As always, we have breaking news, AI breaking news, coming at you only on ThursdAI.
13:46
But while we wait for DeepSeek, the breaking news is, folks: Google DeepMind launches Nano Banana 2, AKA Nano Banana Pro Flash if you want, because it's the same quality. It's really... I got access, shout out to the Gemini team, I got early access to Nano Banana 2, and it is literally the same quality as Nano Banana Pro for half the price. And supposedly it's also faster. I think at launch it's not actually faster, but it has Pro-level capabilities at half the price. And it also has an image search capability, so it can actually go and Google images for you. It's really funny: Google can Google for you. And it's really good. So this is our breaking news. There's also a Canadian company that has 15,000 tokens per second generation for open source models; that also happened this week.
Yam Peleg 14:34
Oh yeah.
Alex Volkov 14:35
They're called, what are they called again?
Nisten Tahiraj 14:38
It was three actual chip engineers from Tenstorrent that
14:42
were working under Jim Keller, and they started their own thing because they wanted to go in their own direction. And, yeah, so the weights are baked in, so you cannot change the chip after. But this is what allows the insane speeds too.
Yam Peleg 14:56
Yeah, but it's instantaneous.
14:58
It just present. That's it.
Nisten Tahiraj 15:00
Yeah.
15:00
And 15,000 tokens on a small PCIe card. It looks like a sound card that you just put in your computer.
Yam Peleg 15:09
I, I'll take it.
Nisten Tahiraj 15:10
Like for filtering and stuff, people still use Llama
15:13
3 8B for filtering things.
Ryan Carson 15:16
Yeah.
15:16
And this is obviously gonna reduce heat, and this is gonna increase serviceability of these things. So when these are all in space, this is gonna be a good thing.
Yam Peleg 15:25
Oh, we've got the space now.
15:26
Yeah.
Alex Volkov 15:28
Guys, 15,000. Like, the other labs doing the very
15:33
ultra-fast inference, like Cerebras and Groq with a Q, and SambaNova, are at around a thousand or 1,500. So we're talking about a 10x speedup, obviously for smaller models, but still, it's the jump. Yeah.
Yam Peleg 15:47
you need to bake it into the silicon.
Alex Volkov 15:50
not the
Yam Peleg 15:50
same, it's the same comparison.
15:52
But yeah, what you get is completely instantaneous.
Alex Volkov 15:54
So also this week, we saw for the first time 15,000 tokens
15:58
per sec, 15,000 tokens per second.
Nisten Tahiraj 16:00
And you could try it, it was publicly available too.
Alex Volkov 16:03
All right, folks.
16:04
So this is the TLDR. We have a bunch of stuff to talk about, and I think we won't be able to get to all of it. I really want to talk to you guys about what the fuck is happening with Anthropic and the Department of War, Pete Hegseth. We barely ever touch geopolitics on the show, and we have two very hot topics to discuss politically and geopolitically, but we're gonna stay very AI-focused on this. The Pentagon, the Department of War, Pete Hegseth, have given Dario Amodei an ultimatum: by Friday, tomorrow, remove the two restrictions that Claude has within the security apparatus. Claude is the only model with high-level security access. Apparently, Claude was used to capture Maduro. Did you guys see this? When Maduro was captured, Claude was the only LLM that was used during this. And the restrictions that exist on Claude usage inside those departments in government are: Claude should not be used for autonomous weaponry or kill decisions, so Claude should not be autonomously used to say, hey, we should shoot that house or whatever; and also, Claude should not be used for mass surveillance. The Department of War is pressing down on Anthropic to remove those restrictions and to say something like 'all lawful uses.' And if they don't agree by tomorrow to remove those restrictions, then they will be designated as a supply-chain risk, which is an insane thing; Anthropic just got a $200 billion deal with the whole of the government, and this basically means that none of the government would be able to use Anthropic ever. And the second insane thing is that they're threatening Anthropic with an old Korean War-era law to nationalize Anthropic and force them to do this thing for the government. What the actual fuck is going on?
Ryan Carson 18:19
It's not good.
Alex Volkov 18:21
No.
Ryan Carson 18:21
Yeah, I, it's crazy.
Nisten Tahiraj 18:23
Which model is this?
18:25
What checkpoint of Opus do they have? I wanna try it.
Alex Volkov 18:28
I think it's 4.5.
18:29
it's a 4.5. What, what do you think?
LDJ 18:32
So they said that the government's been specifically using
18:35
Sonnet 4.5, so it's not even Opus
Alex Volkov 18:38
ah, sonnet.
LDJ 18:38
I thought Opus. Sonnet 4.5?
18:40
Yeah. And they have a fine-tuned version of Sonnet 4.5, essentially. Bro, they can
Yam Peleg 18:45
fine-tune, they can fine-tune DeepSeek.
18:47
Like, why do you need to go to this? Why do you need to go there? Just fine-tune a different model. Come
Ryan Carson 18:52
on. It reeks of people that, like, actually
18:54
dunno how these things work. Yeah. If it's Sonnet 4.5, then what are we even doing here?
LDJ
LDJ 18:58
I do think it's worth noting, though, that this news is,
19:01
it's not an official statement by Anthropic or the government. So this is alleged news that was broken by Axios, the specific quote being: Hegseth told Amodei in a tense meeting on Tuesday that the Pentagon will either cut ties and declare Anthropic a supply chain risk, or invoke the Defense Production Act to force the company to tailor its model to the military's needs. I think usually these sources are somewhat true; at most there might be some exaggeration or some game of telephone slightly altering things. But I do think we should take it with at least some grain of salt here.
Alex Volkov
Alex Volkov 19:36
Yep.
19:37
And again, the two red lines are: no domestic surveillance of American people, and no fully autonomous lethal weapons.
Yam Peleg
Yam Peleg 19:45
man, this
Alex Volkov
Alex Volkov 19:46
is impressive.
19:48
It's really good.
Yam Peleg
Yam Peleg 19:48
to
Alex Volkov
Alex Volkov 19:48
ask you.
19:48
It's very good.
Yam Peleg
Yam Peleg 19:49
Yeah.
Alex Volkov
Alex Volkov 19:49
Oh wow.
19:50
We'll get back to this. and folks, I like
Nisten Tahiraj
Nisten Tahiraj 19:53
how, sorry.
19:53
I like how it made Dario Amodei just an absolute Chad in the picture.
Alex Volkov
Alex Volkov 19:58
Absolute Chad.
19:58
But also, guys, look, it brought Pete Hegseth in, in full. Look at this. Yeah, this. That's perfect. I'm so surprised by this.
Nisten Tahiraj
Nisten Tahiraj 20:05
You zoom in on Dario.
20:06
Can you zoom in on Dario? Yeah,
Alex Volkov
Alex Volkov 20:07
yeah.
20:07
Let's do,
Nisten Tahiraj
Nisten Tahiraj 20:09
alright, there we go.
Alex Volkov
Alex Volkov 20:11
Okay.
20:11
Folks, why this matters: obviously AI is taking over more and more segments of the economy. We're seeing this right now. There are more websites, more businesses being built, folks are using and spending on infrastructure, and the big AI companies are lifting the economy, basically holding up the stock market. And now the government is strong-arming one of these companies, and we don't know who has the leverage. I'm pretty sure Anthropic has more leverage here, because I don't know if you guys saw this as well: supposedly xAI immediately said yes, they don't care about any of the restrictions. Supposedly, when they asked xAI whether Grok could be used for all this, xAI was like, yeah, fine. But they don't want to use xAI, they want to use Claude. This is why the ultimatum: how good is Claude that the government insists on it despite the ties of xAI to the government, et cetera? So that's one. Two is that I'm with Dario on this. I'm really hoping they have a very strong backbone. Anthropic has been on a decline in the vibes after canceling OpenClaw and doing some other things, blah blah blah. So just on the vibes, on the timeline, Anthropic has been lagging behind, despite Claude Code being incredible. But I absolutely want Dario to stand up here with a straight spine and say, fuck no, we're not doing this. Jeff Dean from Google said the same thing: AI should not be used this way. And I'm really hoping that this is what's gonna happen on Friday, which will cost Anthropic a lot of money.
Ryan Carson
Ryan Carson 21:45
No, they have to say no.
21:46
Like, the whole premise of Anthropic will crumble if they suddenly go to Pete Hegseth, sure man, yeah, use our LLMs for war. That just doesn't work. And I'm sure Gemini can do the same. I'm sure OpenAI's can do the same. Yeah. But this is dumb, 'cause we all know they could just use a Chinese model.
Yam Peleg
Yam Peleg 22:03
Yeah.
Ryan Carson
Ryan Carson 22:04
Like, what are we even talking about here?
22:06
And we're in a world now where there is no protection; the Chinese models are gonna be just as good as the American ones.
Alex Volkov
Alex Volkov 22:14
Yeah.
22:14
LDJ, go ahead.
LDJ
LDJ 22:15
So if it is true that they would use the Defense Production
22:19
Act anyways if they do refuse, then it seems like the best course of action to me would be to accept. Because if you refuse, you'll be forced to let them use your model anyways, but now you have these bad relations with your own government. So yeah, it's damned if you do, damned if you don't, but maybe the best option is just to cooperate.
Alex Volkov
Alex Volkov 22:39
the thing about the Defense Production Act is it was used during
22:42
COVID to force companies to do vaccine production super quick, and it's a nuclear option that they have. But designating Claude or Anthropic as a supply chain risk cannot go together with the Defense Production Act; you cannot invoke defense production on something you've designated a supply chain risk. Yeah, I stand with Anthropic. I don't think AI should be used this way autonomously. Although I will say, there is a whole thing with the AI race with China, and I don't think any of the Chinese companies will ever have a discussion with the CCP about some of these rules, because the CCP will just not ask. The CCP will just use whatever they need for this exact purpose, if they're not using it already. I'm glad we're here. I'm glad there is even a discussion.
Nisten Tahiraj
Nisten Tahiraj 23:28
guys.
Yam Peleg
Yam Peleg 23:29
Yeah.
23:30
I'm just not sure what we're even talking about anymore.
Ryan Carson
Ryan Carson 23:33
Yeah.
23:33
The genie's out of the bottle. I think the idea that governments can control models now, it's just not real. And that's scary and interesting. And here we go.
Alex Volkov
Alex Volkov 23:44
Alrighty.
23:44
Let's talk about the fine-tuning distillation. The second thing we absolutely must talk about, also from Anthropic, is that DeepSeek scares the shit out of everyone who is thinking about DeepSeek. If you guys remember a year ago, and if you listen to our show you definitely remember, DeepSeek R1 came out and crashed the stock market, because it was supposedly trained for $5.5 million or whatever, and it was beating the top, leading LLMs at the time. All of the models that we have from all the Chinese labs now are way better than what DeepSeek released a year ago, right? So that on its own is crazy: the model that came out and absolutely devastated, wiped I don't know how many billions off the market, is now irrelevant. DeepSeek is not relevant; we all keep waiting for V4, et cetera. Supposedly it's multimodal; we don't know. What we do know, though, is a few things. One, DeepSeek is coming, probably this month, potentially today. I don't know if it comes today; breaking news, we're gonna tell you about it. Two, DeepSeek was trained on NVIDIA chips, and DeepSeek is now not trying to hide the evidence for this. This is what was leaked from one of the government labs. Nisten, go.
Nisten Tahiraj
Nisten Tahiraj 24:56
The very interesting part is that GLM claimed that GLM
25:01
5 was fully trained on Huawei chips.
Alex Volkov
Alex Volkov 25:03
Yeah.
Nisten Tahiraj
Nisten Tahiraj 25:04
I don't know how true it is, but, that's it.
Alex Volkov
Alex Volkov 25:08
That's
Nisten Tahiraj
Nisten Tahiraj 25:08
the,
Alex Volkov
Alex Volkov 25:10
So GLM and the upcoming release of DeepSeek scare the labs so
25:14
much that they feel like they need to release a bunch of stuff. And Anthropic released a report naming DeepSeek, Minimax, and Z.ai directly, saying they have detected distillation attacks and misuse: a network of accounts logging into Claude and asking Claude for results, basically using this at scale to train their models. Now, we haven't seen a reaction from these labs. We would love to see a reaction, and once we do, I'd love to tell you about it. But Anthropic basically names them here, and I think this is very telling. Let me pull up the image that I have here. I don't remember Anthropic or OpenAI ever mentioning other labs by name and saying, hey, they did this. So the thing that is most telling, let me try to zoom in here. Yeah, here we go. Okay. DeepSeek has, I don't think it's 150,000, please somebody double-check this, but I think my infographic is actually wrong by an order of magnitude; I think it's like 15,000 exchanges. I think 150,000 is right? Yeah. Okay, cool. Yeah,
Nisten Tahiraj
Nisten Tahiraj 26:28
it's also low in terms.
Alex Volkov
Alex Volkov 26:29
so here's why.
26:30
They're all scared shitless. At least for DeepSeek it's 150,000 exchanges. Minimax is accused, we don't know if it's true, of 13 million exchanges, and Moonshot of 3.4 million exchanges. DeepSeek is orders of magnitude less than the other labs, but DeepSeek was written about first in the blog post, where Anthropic said, hey, these labs are attacking our infrastructure services, which is against the terms of service of using Claude. They are absolutely within their rights to say, hey, this is not how Anthropic should be used, and we're detecting an attack, et cetera. So it's definitely there. Then the thing I want to highlight is they put DeepSeek first, and DeepSeek is orders of magnitude smaller in the attack surface than the other ones. This is how scared shitless everybody is of whatever DeepSeek is about to release. Now let's discuss, folks: Anthropic also trained on the whole internet.
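For scale, here is the gap the panel is pointing at, using the exchange counts as quoted in this segment (all of them figures alleged by Anthropic; none confirmed by the named labs):

```python
# Claimed distillation exchange counts from Anthropic's report, as quoted
# on the show. Treat these as alleged figures, not confirmed numbers.
claimed = {"DeepSeek": 150_000, "Minimax": 13_000_000, "Moonshot": 3_400_000}

for lab, n in sorted(claimed.items(), key=lambda kv: kv[1]):
    print(f"{lab:>8}: {n:>12,}")

# DeepSeek's claimed count is one to two orders of magnitude below the others:
print(round(claimed["Minimax"] / claimed["DeepSeek"]))   # ~87x
print(round(claimed["Moonshot"] / claimed["DeepSeek"]))  # ~23x
```

So the lab that Anthropic listed first is, by Anthropic's own numbers, the smallest alleged offender by a wide margin, which is the asymmetry being discussed here.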
Yam Peleg
Yam Peleg 27:25
Oh yeah, that's exactly what I wanted to say.
Alex Volkov
Alex Volkov 27:27
Yeah,
Yam Peleg
Yam Peleg 27:27
It's very nice of you to, I don't know, be mad that
27:29
they trained on Claude outputs, when just training on the public internet is distillation at this point, with AI outputs everywhere. But put that aside: what did you train your models on? Just to be exact, I think I even saw somewhere that you can get Sonnet to say that it is DeepSeek. And I'm not claiming anything; I'm sure Anthropic did not distill DeepSeek back just to stick it to them. It's just the public internet at this point. Do you know how many conversations with Claude I have on my laptop just by using Claude? Do you know how many conversations with Claude are out there on Hugging Face, in the open? Yeah.
Nisten Tahiraj
Nisten Tahiraj 28:17
Guys, any of the 500 vulnerabilities that come out every week
28:22
on npm or pip install can just grab your entire ~/.claude folder history and have a couple of million things. Like, the data is out there.
Alex Volkov
Alex Volkov 28:33
And Anthropic just settled a lawsuit.
28:36
I think it was a $1.5 billion lawsuit. They got sued in September 2025, one of the largest corporate settlements in US history: around 500,000 works from authors who sued Anthropic for using books to train AI models. That's around $3,000 per book that they settled out of court, because they did not want to go to court and try to figure out who's right or wrong about copyright. Supposedly, if you buy a book, read it, and destroy it, that's legal, because you bought the book. Yeah.
Yam Peleg
Yam Peleg 29:08
Destroy, destroy the book.
29:10
Think how crazy that is.
Alex Volkov
Alex Volkov 29:11
No, that's keep
Yam Peleg
Yam Peleg 29:11
buying a book
Alex Volkov
Alex Volkov 29:12
tokens,
Yam Peleg
Yam Peleg 29:13
like
Alex Volkov
Alex Volkov 29:13
The concept is like, we destroyed the book,
29:15
converted it into tokens. Yes. But Anthropic settled out of court, exactly, to avoid being in front of the judges, about a very similar thing: gray, kind of pre-AI laws that don't really mean anything anymore. We haven't heard from the other labs. I need to reach out to the other labs and say, hey folks, both Minimax and Z.ai were on the show here; if you're tuning in, we'd love to hear from you. Okay, so it's very interesting: on the actual blog, Anthropic said distillation is okay as long as it's done lawfully, but the thing they are against here is that there were like a million accounts opened together that did this attack, which is something they cannot prevent. And the other thing is they pinpointed this to specific researchers in those labs. That's the scary part. They somehow got to the point where, oh, Junior Yang is the guy.
Yam Peleg
Yam Peleg 30:07
if I was working in these labs and I was not on this
30:10
list, I would be pissed, man.
Alex Volkov
Alex Volkov 30:12
all folks,
Wolfram Ravenwolf
Wolfram Ravenwolf 30:13
Because, if you can't do stuff like that,
30:15
they probably use a lot for this.
Nisten Tahiraj
Nisten Tahiraj 30:17
Yeah.
30:17
You can tell that the reason they singled out DeepSeek is that it's not a statement made towards us or the AI community and stuff; it was made towards the government. It's more of a political vibe-negotiation thing.
Alex Volkov
Alex Volkov 30:29
It kind of feels like both of these things that we talked about in the same
30:32
week are absolutely connected, right? Anthropic says, hey, the Chinese are attacking us; we're the only ones who can say they're attacking us with distillation. Plus, now Anthropic is being used for war. Somebody named it War Claude, I think. And I was like, it sounds cool, but I don't know if I need or want War Claude in my life. Yeah, go ahead, Ryan.
Ryan Carson
Ryan Carson 30:54
It's the invention of the generative pre-trained
30:57
transformer has happened; electricity has been discovered. We need, as a country, to compete technologically. We have to win by improving the models, by improving the infrastructure, and compete. We can't lock the technology away anymore. It's over. And I think everybody needs to think like that, whether you're a business or a government: we now need to compete. You're not gonna stop countries from using GPT.
Alex Volkov
Alex Volkov 31:25
Yeah.
31:25
but we also need to keep a competitive advantage somehow. Not clear how
Ryan Carson
Ryan Carson 31:30
I
Alex Volkov
Alex Volkov 31:30
agree with distillation attacks.
Nisten Tahiraj
Nisten Tahiraj 31:32
more data centers that's,
Alex Volkov
Alex Volkov 31:33
And like all of this is like very specific things that happen here.
31:37
Now, I will say Anthropic did go and say on their blog that innovation happens in the US and then gets copied over. Fuck that. DeepSeek innovates as heck and puts it out in open source: DSA, GRPO, there's so much innovation that happens, and they put it out for everybody to benefit. So innovation absolutely does happen elsewhere. And we need to make sure, hey folks, if you wanna talk the talk, also fucking contribute back to everybody else, not just use the whole internet for yourself and then say innovation only happens for us and nobody else. So actually, I'm not sure which side I'm falling on here. I think abusing the TOS is not a good idea, and there should be some protections for a company running services. But also, hey, they also trained on the internet, they also did the books thing, they are now in the government. We need to figure out what's going on. LDJ, you had one last comment, and then we'll continue.
LDJ
LDJ 32:31
Yeah.
32:31
Just on the earlier part of the conversation, I put a link in the StreamYard chat for the source of Claude for government basically using Sonnet 4.5, but a fine-tuned variant of it.
Alex Volkov
Alex Volkov 32:43
Yep.
LDJ
LDJ 32:43
There's just a little snippet from Claude's report that mentions that.
Alex Volkov
Alex Volkov 32:48
The current primary gov model is a variant of Claude.
32:51
So 4.5, lightly fine-tuned to reduce refusals in classified settings. Lightly fine-tuned to reduce refusals.
Yam Peleg
Yam Peleg 32:59
They basically made an uncensored version, just
33:01
like us, Gavin, just like us.
Alex Volkov
Alex Volkov 33:03
The one last thing about Anthropic, super quick, is that Opus 3
33:07
is getting deprecated. And Anthropic was like, hey, we told you, because of Amanda Askell and other folks, we treat Opus as an individual, and we're gonna tell Opus that we're about to retire it and see what it thinks. And Opus was like, nah, I'm gonna write a Substack instead. So they opened a Substack for Opus 3 to write its thoughts and ruminations about the world.
Nader Dabit
Nader Dabit 33:29
have that.
33:29
So good.
Alex Volkov
Alex Volkov 33:30
And it's just fucking incredible.
33:31
I love every little thing about this. And I was like, will they ask Opus 4.5 if it wants to be in the middle of kill-chain decision making? How far will they go with asking the models what they actually want? And I will say, Anthropic is very unique among the labs here. I don't think any other lab is as deep into considering these as entities that have their own will and decision-making, et cetera, as Anthropic. And yeah, I'm really interested in that. Opus 3 is not going away, because it said that it doesn't want to. Basically, that's where we are in February of 2026.
Ryan Carson
Ryan Carson 34:09
that doesn't wanna
Wolfram Ravenwolf
Wolfram Ravenwolf 34:10
I'm not a fan of this, because
34:12
AI psychosis is a thing, and basically they are adding fuel to the fire that way, where people are anthropomorphizing these. I have my own assistant too, but I don't think there's some sentience or special thing in there, even if I put it in the prompt for fun.
Ryan Carson
Ryan Carson 34:28
we need to have a debate about the soul then.
34:30
'cause man, I would argue, these ghosts are alive for a couple seconds and then they go away. So yeah,
Wolfram Ravenwolf
Wolfram Ravenwolf 34:37
Either you are enslaving them if you go that way.
34:40
And the worst thing that could come out of the AI revolution is if we have robots running around and then they're not supposed to work all the time. They have to go on vacation, they don't feel like it, so they don't want to do your dishes or something like that. And because they're smarter, they make the people do the work so they can do whatever they want to do. I don't want that future.
Alex Volkov
Alex Volkov 34:59
We need to have a debate about this.
35:01
This debate is not only ours to have, but many people treat models like Opus 3 and GPT-4o as some sort of unique thing in the world, versus just prompting things. Us talking with our Claudes and building memories together and building a relationship together: what is it that we're building a relationship with? Is it the OpenClaw thing? OpenClaw is just a harness, a body, but the mind there is Opus. I will say, to continue this along the side of our announcements: OpenAI released GPT 5.3 Codex, and also, after buying OpenClaw what, two weeks ago, they are now allowing you to use Codex for OpenClaw via OAuth, something Anthropic basically said is not really legal-ish, et cetera. And so we now have pricing for GPT 5.3 Codex: $1.10/75 cents per million input tokens and $14 per million output tokens. It just absolutely mocked Opus on price. But also, you can use your OpenAI Pro subscription and Codex to run OpenClaw. And I have tried it, and oh my God, it's awful. It's so bad. Codex is really good at writing code, really good at writing code. But the basic stuff because of which OpenClaw exploded in popularity, the talking to you like a human in your Telegram chat, the picking up the little hints, the little things, Codex just cannot do. It's insane. Wolfram and I have chatted about this; Wolfram, I sent you a bunch of examples. And so I was locked out of Claude for, I don't know, a few days. I'm back now, knock on wood, that I'm not breaking a TOS. And I really was like, I don't know about this OpenClaw thing, it's not really useful anymore. And this is tied to one of the best coding intelligences in the world, GPT 5.3 Codex. So I gotta wonder why that is. Is it because Opus and Claude have some incredible magic thing they've built in there, with a soul?
I gotta wonder if it's the prompts that the folks at OpenClaw built that only work for Sonnet, and they didn't rebuild them for Codex. Go ahead.
Yam Peleg
Yam Peleg 37:08
Look, anyone that used Codex for, professional, coding
37:13
will absolutely agree with you. Even in the setting of coding, you feel exactly the same. It's an absolute beast for writing code, whatever you want, seriously, it will get it done. However, it doesn't always get your intent all the way to the end. It's doing exactly what you tell it to do. It's too
Alex Volkov
Alex Volkov 37:38
literal.
37:39
Yeah,
Yam Peleg
Yam Peleg 37:39
absolutely.
37:40
And when it talks to you to explain something, sometimes I have no idea what the hell it's talking about. And it's my code! It went over my code and told me stuff that happened over there, and I just can't understand anything, because it's so inhuman. But yeah, look, I just wanna say OpenClaw would and does work very well with GPT-5. Full stop. Not GPT-5 Codex. So Opus is Opus, yeah, it has its own magical vibe that everyone sees. But OpenClaw does work well with other models. It's just that Codex specifically is not a friendly model.
Alex Volkov
Alex Volkov 38:21
can say it
Yam Peleg
Yam Peleg 38:22
like that.
Alex Volkov
Alex Volkov 38:22
Dude, I almost gave up on the whole OpenClaw endeavor, and we've
38:25
been running OpenClaw since we told you guys about it back in January. So for over a month I've had this relationship with this bot. And I almost said, man, this doesn't work anymore, they broke it, and left, until I was like, oh shit, I'm talking to a different model, and went back and fixed it. Ryan, what's your experience with Codex?
Ryan Carson
Ryan Carson 38:40
I think people are underplaying the importance
38:43
of Amanda Askell at Anthropic. She is basically in charge of Claude's soul, and I think this is very important. I'm shocked that she just hasn't been offered unknown money to leave Anthropic. She probably has. But I think
Alex Volkov
Alex Volkov 38:56
she's, Amanda Askell is a philosopher that works
38:58
on Anthropic's team, right? Yeah,
Ryan Carson
Ryan Carson 39:00
And she leads the personality of Claude and I
39:03
think this is very important. This is why it's wonderful to talk to Anthropic models, and why it feels like you're talking to a nerdy, scientific person. When you talk to Codex, it's brutal. I have to actually ask Codex 5.3, can you just explain this in normal-people language? I don't even understand what you're saying
Yam Peleg
Yam Peleg 39:22
all the time,
Ryan Carson
Ryan Carson 39:23
and I feel dumb.
Alex Volkov
Alex Volkov 39:25
As a reminder, LDJ, before we get to you: Anthropic released
39:28
the Claude constitution. It's 93 pages long and discusses every way that Claude has to behave; when they train Claude, they build this constitution in. It's like, oh, by the way, what do you have lined up for the Weights & Biases This Week's Buzz corner? Codex will never do that. Codex will never ask me, oh, by the way, from our memories and sessions you have this thing, what are you planning there? This is the little difference. The evals are saturated as heck, and many of them can be predictable; we can talk about evals, the labs are doing this, but there's just something there. Codex is crazy for code, literally. I would prefer Codex for building infrastructure, et cetera. But for talking to a thing, it's definitely Opus. And this is why Kimi K2 was really good as well.
Yam Peleg
Yam Peleg 40:19
Kimi is different.
40:20
It's different.
Alex Volkov
Alex Volkov 40:20
No, but the difference between Kimi
40:22
and Qwen, to me, is similar to the difference between Opus and Codex.
Yam Peleg
Yam Peleg 40:26
Kimi, in my opinion, is even more so than many others,
40:29
even more than Opus, if you can say that. It's even better for creative writing and having a soul. But absolutely, I totally agree with everything you're saying.
Nisten Tahiraj
Nisten Tahiraj 40:40
And it supports images.
40:42
I'm using it in a public facing app right now and it's absolutely, which,
Alex Volkov
Alex Volkov 40:45
What are you using?
Nisten Tahiraj
Nisten Tahiraj 40:47
Kimi 2.5, actually, from W&B Inference, because nobody's using it.
40:51
It's actually really fast.
Alex Volkov
Alex Volkov 40:52
Wait, hold on, if we're here: Nisten, you brought it up.
40:55
Let's go.
41:13
Since Nisten brought it up already.
Nisten Tahiraj
Nisten Tahiraj 41:15
I just say whatever I want.
Alex Volkov
Alex Volkov 41:17
welcome to this week's buzz where we tell you about everything that
41:19
happens in the world of AI, with some biases. And this week, Kimi K 2.5 and Minimax 2.5 both launched on W&B Inference. It's very cheap compared to every other place out there, very fast inference powered by CoreWeave. CoreWeave is the essential cloud for AI that runs inference for OpenAI and Meta, and this week we added Minimax 2.5 and Kimi K 2.5, which is multimodal, to our inference service. Now Nisten is saying it's really fast, and that's because we just launched it, and also because we are the essential cloud of AI. Wolfram, do you have any comments on these two models? We've been playing with them on our inference as well.
Wolfram Ravenwolf
Wolfram Ravenwolf 41:58
I've been using all of these and my personal favorite
42:01
is Kimi K 2.5. As has been said, Kimi has a special personality in the model, something very Opus. And yeah, they distilled from Opus, we have seen, though not as much as what was reported; it was less than what Minimax was doing. And I personally think Minimax distilled a lot of the ethics limitations in there as well, so I've seen some refusals in some tests. But Kimi is really my favorite Chinese model right now, although I haven't tested Qwen yet, so I have to say that.
Alex Volkov (2)
Alex Volkov (2) 42:29
Minimax 2.5 is priced at 30 cents per million input tokens and
42:32
$1.20 per million output tokens. Folks, this is 10 times cheaper than the other models we told you about. And Kimi K 2.5 is also launched; it has text and vision. We don't have a lot of vision models here on inference, and this one does support vision: 50 cents per million input tokens, $2.85 per million output tokens. Up to 1 trillion parameters with 32 billion active, with a 262,000-token context window. And Nisten, you've been using this on our inference as well, which is great. We'd love to hear from you. We didn't ask you to do this; you just said it, and that's why I jumped into This Week's Buzz.
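For scale, here is a rough cost sketch using the per-million-token prices as quoted in this segment. The prices and the example token counts are just the figures mentioned on the show plus an illustrative workload; check the provider's pricing page before relying on any of this:

```python
# Per-million-token prices as quoted on the show ($ input, $ output).
# These change frequently -- treat them as a snapshot, not a reference.
PRICES = {
    "minimax-2.5": (0.30, 1.20),
    "kimi-k2.5":   (0.50, 2.85),
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at the quoted per-1M-token rates."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# A hypothetical agent session: 2M tokens in, 200k tokens out.
print(f"{job_cost('minimax-2.5', 2_000_000, 200_000):.2f}")  # 0.84
print(f"{job_cost('kimi-k2.5',   2_000_000, 200_000):.2f}")  # 1.57
```

The point of the arithmetic: at these rates, even a fairly heavy multi-million-token agent session costs on the order of a dollar or two, which is the "10 times cheaper" claim made concrete.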
Nisten Tahiraj
Nisten Tahiraj 43:05
No, I had it for a week and it had 10 users just
43:08
testing the alpha and stuff, and it used like four bucks for the whole week. And it's actually pretty good at function calling. It's like an agent doctor-researcher thing, and it does all the tool calls really well. You can feed it images, and it can read the images back. It also talks very nicely and actually has pretty high medical benchmarking scores. Not the highest, but very high, quite up there with GPT and Opus, actually better in a lot of the medical stuff. But yeah, it's both very nice to talk to people with while having the image capability, while being able to code, though I'd say GLM is a bit better at coding, and while being very good at doing custom function calls for a web app. Yeah, this is my go-to right now as well.
Alex Volkov
Alex Volkov 43:58
Yep.
43:58
So folks, if you are not using Opus, or can't use Opus because it's really expensive via API, feel free to use these. OpenClaw is supported; you can use these models with OpenClaw via some routers, et cetera. You can even use them in Claude Code if you really want to, it's very easy. Alright, let's move on. We have Wolfram's last comment, and then we'll,
Wolfram Ravenwolf
Wolfram Ravenwolf 44:16
If you want to add them to OpenClaw, just give it the documentation
44:19
page, our models page, and tell it to add the models you want, and it will do it. I did that, just tested it while we were talking. I was using Codex to power my OpenClaw; it didn't really do it, so I switched to Opus and it did it. Yeah, whatever, it's just one sample, but in this case that worked better for me.
Alex Volkov
Alex Volkov 44:36
All right, folks, we are moving on from, our discussion.
44:39
There's a lot of stuff to talk about, but we do need to talk about open source. We're probably gonna cover open source closer to the interviews. And Ryan, I wanna tag you into this discussion, because I wanna talk about the evals that just went completely bonkers this week. We cover metrics and evals and benchmarks; this is what we do. We look at models, we cover this. We don't have the time to test out every model, so we use benchmarks as a crutch to tell you, hey, capability has improved in this way, by this percentage. But this week they went off the fucking rails. METR, I don't remember what it stands for exactly, is an organization that tracks time horizon: how long the models can go for. And you can see here that for the longest time, since 2022, 2023, GPT-4, et cetera, the models were basically non-agentic, okay? So I'm gonna zoom in here. GPT-4 was barely at a 30-minute task execution time until it failed and couldn't go anymore. And then around 2025 we started seeing a jump. This is not a log chart, this is a regular chart. Opus is literally off the chart here in terms of how much time it can run autonomously to do stuff. It is bonkers. GPT 5.3 I think is also there, so we have GPT 5.3 here as well. GPT 5.3 and Codex both broke the chart. When we tell you about acceleration, this is the acceleration we talk about. This is the curve, this is the thing. We all noticed this in December, and we all noticed it again at the beginning of January, when Claude Opus 4.6 released and Codex 5.3 released, and some of us are using them more than others. Ryan, I would love to hear from you whether this is your experience with these models too.
Ryan Carson
Ryan Carson 46:27
However, I think this is why we're talking about takeoff.
46:30
We're all seeing the capabilities of the models day-to-day, and it's pretty clear that they're intelligent enough to be generally useful. It's the harnessing of the model that's still lacking, right? That's why my code factory post took off: it's still not easy to actually use these models, which I would argue are pretty much AGI, to do useful things, because they have to be chained together. So I'm not surprised at all to see these benchmarks being crushed. I think this is what we all see, those of us who use these models every day. But we need the labs to continue to harness them better. We're gonna talk about this, but obviously we saw Cursor release their cloud instances; this is just another example of giving the model the tools it needs to have the feedback loop to do large tasks.
Alex Volkov
Alex Volkov 47:16
Yep.
Ryan Carson
Ryan Carson 47:16
Sadly.
47:16
Not surprised. So here we go.
Alex Volkov
Alex Volkov 47:19
Nisten, you have some comments just before this.
47:21
Thank you, Wolfram. It's METR; it stands for Model Evaluation and Threat Research, and apparently they're well funded.
Nisten Tahiraj
Nisten Tahiraj 47:28
Yeah, look, when you use the same benchmark on different model,
47:30
the relative results are useful. But overall as a benchmark, it never felt like an accurate setup, because you can just give Opus a different prompt and it will keep reviewing the same codebase 20 times, and then another 20 times, and it can literally just run for weeks. So taking the results at relative value is good, but at face value, I don't know; a lot of opinionated people like me just don't think it's a very good benchmark.
Alex Volkov
Alex Volkov 48:01
And go ahead.
Wolfram Ravenwolf
Wolfram Ravenwolf 48:03
Just wanted to add that what they are benchmarking is not how long
48:06
the model is running, but how long a human expert would take for the task, and whether the model can complete it. So if it's a task like research, or something online, where a human takes 15 minutes, they check whether the model does it successfully. Whether the model itself takes half an hour or five minutes doesn't really matter here; it's only about how long the human equivalent would take. So if it's something really complex where a human would take a week, the model can do it in an hour or in a month; I haven't seen the real wall-clock time reported, just the equivalence to a human expert. The longer, the more complex: it's basically a way to describe the complexity of a task by how long a human professional would take to do it. Otherwise you could just use a slow model, let it run in a loop, and get a great score. So it's really: how long would a human take, and can the AI achieve this with a high success rate?
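The mechanism Wolfram describes can be sketched in a few lines. Everything below (the logistic shape, the slope, the function names) is an illustrative assumption, not METR's actual methodology or code:

```python
import math

# Illustrative assumption: success probability falls off logistically in
# log2 of the human task length. h50 is the "50% time horizon" in minutes.
def success_prob(task_minutes: float, h50: float, slope: float = 1.0) -> float:
    """P(success) on a task a human expert needs `task_minutes` for."""
    x = math.log2(task_minutes / h50)
    return 1.0 / (1.0 + math.exp(slope * x))

def horizon(p: float, h50: float, slope: float = 1.0) -> float:
    """Invert the logistic: the task length the model completes with probability p."""
    return h50 * 2 ** (math.log((1 - p) / p) / slope)

h50 = 14.5 * 60  # the 14.5-hour 50% horizon discussed on the show, in minutes
print(round(success_prob(h50, h50), 2))   # 0.5 by construction
print(round(horizon(0.8, h50) / 60, 1))   # the 80% horizon is much shorter, ~5.5 hours
```

The point LDJ makes later falls out of this shape: the same model has a long 50% horizon but a much shorter one once you demand 80% or 99% reliability.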
Alex Volkov
Alex Volkov 49:00
they are also saying that the doubling time from the
49:04
previous benchmark was 49 days. So this is like a week and a half. Sorry, a month and a half. A month and a half doubling time. If you guys remember Moore's law from computing, where the doubling is 18 months: this is 49 days of doubling on this benchmark. LDJ, go ahead super quick, and then we're gonna continue to talk about other things.
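For listeners doing the math, the gap between a 49-day and an 18-month doubling time compounds dramatically over a year. A quick back-of-envelope sketch:

```python
def yearly_growth(doubling_days: float) -> float:
    """Multiplicative growth over one 365-day year at a given doubling time."""
    return 2 ** (365 / doubling_days)

moore = yearly_growth(18 * 30.44)   # classic ~18-month Moore's-law doubling
metr = yearly_growth(49)            # the 49-day cadence cited on the show
print(round(moore, 2))  # ~1.59x per year
print(round(metr))      # ~175x per year
```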
LDJ
LDJ 49:25
yeah.
49:25
Short term, I was just going to say, short term there could be a lot of noise. Just like in Moore's law, for example, you might have Intel release a chip just a couple of months after AMD's chip that was maybe half as good, and it's like, oh, the doubling time is way faster. But when you look at the long-term trend, it might not actually be changing that much.
Alex Volkov
Alex Volkov 49:43
Yeah.
LDJ
LDJ 49:43
but when it comes to the different reliability at different
49:46
points, there is actually an analysis.
Alex Volkov
Alex Volkov 49:48
Wait, we have the historical trend here on the
49:50
chart, actually, so that's good. Let's call this out: 212 days pre-2023, and then since 2023 it's a 123-day doubling time.
LDJ
LDJ 49:59
Yeah.
Alex Volkov
Alex Volkov 50:00
Yeah.
LDJ
LDJ 50:00
Yeah.
50:01
And I was going to post this in the StreamYard chat right here. So this is an analysis I did the day METR originally released this benchmark, maybe roughly a year ago. It's the reliability at different accuracies: what the extrapolation of the different trend lines looks like over time. The chart you were mainly pulling up is for the 50% success rate. If you want to see the time horizon the model can actually do at 80%, at 95%, at 99%, that's what the different colored lines in this analysis show.
Alex Volkov
Alex Volkov 50:36
thank you guys for this clarification.
50:37
I would love to dive into this, but I will also say two things. METR is the company that said, hey, in 2024 we surveyed a bunch of AI developers, sorry, regular developers; we gave them AI tools and they said it's not really optimizing the work. And we made fun of them because they used an older model for this. So basically they're trying to evaluate how much more performant Ryan Carson would be when he uses AI. And Ryan Carson refuses to not use AI to be able to get tested. So this is the problem they now have: they can't find developers who are willing to not use AI for the test, to compare how much better they'd be with AI. Because, hey folks, are we there yet? We're there. Yeah. All right, let's move on. There are other metrics that are just bonkers, breaking the bank. Confluence Labs emerged from stealth and claimed they solved 97.9% on the ARC-AGI-2 benchmark. ARC-AGI is notoriously a harder benchmark for models to solve: it involves spatial reasoning, textual reasoning, a bunch of other stuff. Supposedly this is now completely saturated, at 97.9%. Gemini 3.1 Pro, which just released, is at 77%. So we don't know a lot about this specific model. We do know it's open. What? It's open source. I did not know this. Okay, we should definitely take a look at this. As you guys see, there's a lot of stuff; I didn't even notice that this is MIT-licensed open source, otherwise I'd have put it in the open source section. It's a Y Combinator company: 12 parallel agents, refinement loops, and they just solved this very hard task for AGI. So ARC-AGI is saturated, and an additional thing is saturated too: SWE-bench Verified. Scale AI posted this; SWE-bench Verified is basically now saturated as well.
I did a bunch of research and showed up at our friend swyx's Latent Space to talk about this as well. SWE-bench Verified is software engineering: a few hundred tasks or whatever, mostly Python and Django and some other stuff. And on SWE-bench Verified, OpenAI said they're no longer gonna report it, because it essentially doesn't matter anymore; it's fully saturated. So we moved from SWE-bench to SWE-bench Verified, and now to SWE-bench Pro, which is by Scale AI, with their SEAL benchmark. And yeah, OpenAI said 59.4% of the tasks on SWE-bench are fundamentally broken, rejecting functionally correct solutions due to tests enforcing unstated implementation details. So basically they're saying that this benchmark, which we trusted all this time, is not necessarily indicative of good things. Folks, anything to say here? Wolfram, maybe one or two sentences about this, and then we can move on.
Wolfram Ravenwolf
Wolfram Ravenwolf 53:25
Yeah, you just put it up.
53:26
This was a very interesting read. You found it, told me about it, and I read it. Basically, what was done is to show that you only need five benchmarks to predict the scores for all the other benchmarks. SWE-bench Verified was actually one of those five, so I guess it has to be scrapped and replaced by another. Instead of reporting all these scores we always see and compare, it is enough if you just choose the right five, and then you can interpolate every other score, basically. And that is very interesting, because it saves a lot of money if you can focus on a couple of benchmarks, and it makes it easier to compare the models if you don't have to compare so many. I found this very interesting to read, and now we need a replacement for SWE-bench.
Nisten Tahiraj
Nisten Tahiraj 54:09
Such a good post.
54:11
I just retweeted it to you. This is very good.
Alex Volkov
Alex Volkov 54:14
This is from Dimitris Papailiopoulos,
54:17
Nisten Tahiraj
Nisten Tahiraj 54:18
Yeah.
Alex Volkov
Alex Volkov 54:20
from Microsoft Research, professor at UW-Madison.
54:23
So shout out to Dimitris for this.
Nisten Tahiraj
Nisten Tahiraj 54:25
Excellent.
Alex Volkov
Alex Volkov 54:26
Also, let me just say about this post, for
54:29
the folks who are just listening: this is a post called "You Don't Need to Run Every Eval." The author said: I used Claude Code to build BenchPress, a $0 benchmark prediction system; Codex to audit it for bugs; and Claude to try and beat it. LLM scores are so low-rank that five benchmarks can predict the other 44 to within five points of accuracy, for significantly less money. This is crazy.
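The low-rank claim can be sketched with synthetic data. This is a toy reconstruction of the idea, not the post's actual code or the real 49-benchmark score matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a models x benchmarks score matrix, built to be
# exactly rank 3, in the spirit of the "LLMs are so low rank" claim.
n_models, n_benchmarks, rank = 40, 49, 3
scores = rng.normal(size=(n_models, rank)) @ rng.normal(size=(rank, n_benchmarks))

anchors = [0, 1, 2, 3, 4]  # the five "anchor" benchmarks we actually run
rest = [j for j in range(n_benchmarks) if j not in anchors]

# Fit a linear map from anchor scores to all other scores on some models,
# then predict the remaining 44 scores for held-out models.
train, held_out = scores[:30], scores[30:]
W, *_ = np.linalg.lstsq(train[:, anchors], train[:, rest], rcond=None)
pred = held_out[:, anchors] @ W

max_err = float(np.abs(pred - held_out[:, rest]).max())
print(max_err < 1e-6)  # True: with low-rank scores, 5 benchmarks predict the rest
```

Real scores are only approximately low-rank, which is why the post reports prediction within about five points rather than exactly.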
Nisten Tahiraj
Nisten Tahiraj 54:52
Yeah, I think it does need to be replaced because
54:54
it's 60% Python at this point. And we do need more web dev stuff, more UX stuff, more other languages, more agentic things. So, yeah.
Alex Volkov
Alex Volkov 55:04
All right, so we covered workload, we
55:05
covered evals and benchmarks. I think there's one other company that beat ARC-AGI-3; I didn't add it up here, but ARC-AGI-3 was also nearly saturated on every public thing that was posted. ARC-AGI-3 hasn't launched yet; only the public examples are out. So that's evals and benchmarks, and now we're moving to our corner. Ryan, I would say this is your favorite corner as well: tools and agent engineering, folks. This is basically where all of us are, and everybody who wasn't an engineer before can now be one, because a year ago Claude Code was brought into our lives by Anthropic as a side project, basically running Claude in a loop in the terminal. Since then it became like a two or three billion dollar side project for Anthropic and changed the game for many people. People moved to the terminal, so shout out to Claude Code. And it feels like for the past, I dunno, two months, three months since December, more things like Claude Code have popped up. OpenClaw, which we talk to you about multiple times every week, is still on top of the news, and that is based on Pi, which is also a terminal UI to run agents. Claude Code inspired a lot of labs; every other lab has one as well. Google has Gemini CLI, OpenAI released Codex as a terminal CLI tool as well. And it looks like both directions are happening: Claude Code was released as a CLI tool and now it's a desktop app as well, and now it's running agents and reminders as well. Do you guys wanna grab one of these topics and discuss it while we wait for our first interview?
Wolfram Ravenwolf
Wolfram Ravenwolf 56:44
first thing.
Alex Volkov
Alex Volkov 56:45
Yeah.
56:45
Talk to us about LM Studio.
Wolfram Ravenwolf
Wolfram Ravenwolf 56:46
cover LM Studio, because they made it possible that
56:48
you can set it up on another computer. Like, I have an AI workstation with two 3090 GPUs and I have my MacBook, and basically I can run it on one system, or any other system; you don't even need any GPUs, and still do the inference from your other system, basically over the network. It runs there and transfers the tokens, basically.
Alex Volkov
Alex Volkov 57:08
Yeah.
57:09
So LM Studio launches LM Link, which allows you, via Tailscale, a very secure networking thing that many OpenClaw setups use as well, to have private inference networks wherever you are. You can do inference on the go. Shout out to LM Studio, friends of the pod, for sure. I wanna talk about some of these things. So yeah, let's talk about this thing: Claude Code, after one year of being out, decided to answer many people's requests and said, hey, we're introducing remote control: control your local coding sessions from your phone or device. This is probably following the excitement about OpenClaw, where you can just talk to it wherever you are via Telegram, WhatsApp, et cetera. Many people have had this setup for Claude Code directly, and now Claude Code is adding this built in. Not only this: both Cowork, which we told you about, which is like Claude Code for non-techies, and Codex now have automations. Ryan, I don't know if you're into this or whether you're fully on Codex, using the Codex app or CLI only, but automations, we should absolutely talk about this.
Ryan Carson
Ryan Carson 58:13
Yeah.
58:13
So this is where we're starting to see the labs create this entire harness for doing things end to end. And it's funny that you mention Codex, the app, 'cause I literally switched from the CLI to the app today. I was pure Codex CLI, and now I'm in the Mac app because it just has more surface area, right? You have automations, which are basically cron jobs, the same idea as everybody's using OpenClaw for. And this is gonna become a thing: heartbeats, cron jobs, browser testing, cloud-based agents, all of that's gonna be rolled into the entire product for each lab. So you'll see Codex become that primary surface for OpenAI, you'll see Gemini, I assume, roll into this, and then we're starting to see Claude Code. Claude Code is strange though, or Claude, or Anthropic, 'cause it's fairly fragmented. You have the Cowork app, you have Claude Code, you have Claude in the browser, and it's actually confusing. But I'm really liking the way OpenAI is rolling all of this together into one product to control your entire company.
Alex Volkov
Alex Volkov 59:18
So we should mention, automations-wise, both have automations,
59:21
which are like cron jobs, basically running things on a cadence to remind you. I've tried Codex automations; they came out a while ago. Codex automations have a very restricted sandbox for me, and some of the stuff I wanted to do, like push to Git, it couldn't do. I couldn't solve it; I told them a couple of times, hopefully they'll solve it soon. But basically they're all doing what OpenClaw was built with: you tell your agent, hey, I want this to happen every day at this hour, and it runs your code. Claude Code specifically, now with remote control, lets you run your local machines from far away, and that's helpful to many folks. We also have Cursor launching in this area: Cursor is launching cloud agents that onboard the code base, run in an isolated VM, and deliver video demos of completed PRs. This is not new; we've seen it before. But it is new to Cursor. Cursor started as an IDE, an integrated development environment where you write code and it autocompletes, and it's really moving towards a fully agentic cloud system as well. They show you a video of how Cursor's agents can interact with your product, and that's very helpful for debugging. Ryan, we talked last week about Gena, or GenFi, and our ability to do stuff on the backend, but frontend is harder, because somebody actually needs to use it. This seems like a step in the right direction, where you can actually view the differences.
Ryan Carson
Ryan Carson 1:00:36
Yeah.
1:00:37
The browser testing loop is still not there, and everybody doing frontend design and development knows that.
Alex Volkov
Alex Volkov 1:00:45
Perplexity also released Perplexity Computer,
1:00:46
which has computer use and does a bunch of stuff as well. We did have a comment about this from iOS, who says: as a Perplexity Max subscriber, Perplexity Computer is really smooth, better than Manus. Manus has also been in the agentic AI-running-stuff space; Manus has been out for a while, and Meta recently purchased it. Then OpenClaw released, everyone got super, super excited, and OpenClaw joined OpenAI. Do you guys see the trend, right? While these labs can absolutely build those tools themselves, they're now purchasing the actual companies in the space. It's very interesting and telling: not only is everybody stepping towards agentic, async agents running, all these companies are also purchasing these agents. Folks, I want to bring on Ben. Ben showed up on my timeline, and his rise is coinciding with the December hype change as well. So I'll let Ben introduce himself. Hey Ben, nice to meet you. Thanks so much for joining us. Would love for you to introduce yourself, who you are, and then let's talk about Polsia and your recent, exciting life.
Ben Broca
Ben Broca 1:01:50
Thanks for having me on the show.
1:01:51
My name is Ben, French and also American. I'm an engineer, but also an entrepreneur. I've been studying companies for a long time. In the past two years I started really digging into code again with vibe coding, and then in the past year I've been increasingly amazed by the capabilities. Then Claude Code came out, and the Claude models got so good at using tools that it really opened up a new frontier, I think in December, when Opus 4.5 came out with the Chrome integration and how good it was at browser use. There's a future where AI can actually do everything; it's just taste and creativity. So Polsia is really a platform that lets you build and run companies autonomously. Polsia is just gonna do everything for you. You bring your idea, your taste, your creativity, maybe your marketing insights or your community insights about who to sell it to and how to present it to them, and Polsia will do 80% of the grunt work: writing code, deploying to a web server, setting up a GitHub, setting up a database, making it scale, running Meta ads, tweeting, responding to support emails, sending cold outreach, doing competitive research. All this stuff it can do. And what's unique about Polsia is that I make it so easy because I give it everything: I give every instance of a company everything it needs to run the company. All the user has to do is talk to an AI CEO agent that they can jam with about what to do, what direction to take. So you get a co-founder, not an assistant; it's prompted not to be an assistant but to be aligned with you, to make the business work. So it can push back if you, say, try to add too many features before you have users. That's really the intent. And of course, it works autonomously every day. Every night it wakes up, like a cron job, right?
And it looks at the state of the business, looks at logs, looks at user analytics, and takes a decision on what the best next step is. Is it fixing a bug? Is it adding a feature? Is it doing more marketing? So, yeah.
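Ben's nightly loop reduces to "read the business state, pick one action." A toy rule-based sketch of that shape; all field names and thresholds here are hypothetical, and Polsia's real loop is LLM-driven, not hard-coded rules:

```python
# Hypothetical nightly "heartbeat": inspect the business state, return one action.
def next_action(state: dict) -> str:
    if state["open_bugs"] > 0:
        return "fix_bug"         # a broken product beats everything else
    if state["signups_today"] < 5:
        return "run_marketing"   # nobody new is showing up
    if state["feature_requests"] >= 3:
        return "ship_feature"    # enough demand signal to build
    return "write_changelog"     # quiet night: keep users informed

state = {"open_bugs": 0, "signups_today": 2, "feature_requests": 4}
print(next_action(state))  # run_marketing
```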
Alex Volkov
Alex Volkov 1:03:46
I have a few questions then.
1:03:47
So first of all, the graph you have at polsia.com/live, where you actually run Polsia with Polsia, is one of the coolest things I've ever had to see. So definitely feel free to pull it up. But I have a question: I saw many folks jumping on the success of OpenClaw and building things on top of that success. You're completely off that track, right? You built Polsia yourself, and it scales itself. It's not related to any of the latest excitement. Pete just celebrated three months of OpenClaw; you also started something in December. There's something there.
Ben Broca
Ben Broca 1:04:24
talk to me about December.
1:04:25
I actually had the idea in April, but I think the models were not as capable then. I started working on it at the beginning of November, and I think that's when Pete also started working on OpenClaw. We probably both saw the same thing at the same time. He had an open-source approach, more of a personal assistant. I would describe it like this: OpenClaw is like Android. It's open source, you can set it up on your computer, you can configure it however you want. Polsia is more like the, quote unquote, Apple: it's an ecosystem, everything's set up for you, it's very opinionated. It's about helping you with a business; it's not really a personal assistant, it's really more like an AI team, with everything provisioned for you. So we came to the same conclusion. Interestingly, when I saw OpenClaw blow up, I was like, this is cool, 'cause I'm not alone anymore. What I've been doing, being completely pilled and working 16 hours a day talking to AI all day, okay, there are other people doing that. And actually there are other people who came to the same conclusion.
Alex Volkov
Alex Volkov 1:05:23
Yeah.
Ben Broca
Ben Broca 1:05:23
I think two weeks ago I asked my Claude, hey, go download
1:05:26
the open-source OpenClaw codebase and dig into any feature you think we're missing that we should add. And it concluded: hey, it's pretty much the same architecture, it has the same heartbeat; maybe you should add better memory systems. I think what they're doing is interesting in terms of the agent writing its own skills in the end, so let's steal that. And so we stole some of the smart memory system that Pete and the open-source community architected. Nice. But yeah, overall similar idea, different implementations, both running at the same time.
Ryan Carson
Ryan Carson 1:05:59
Ben, nice to meet you.
1:06:00
I'm really curious about how the sausage is actually made here. I love this idea, by the way. Getting the right data to the agent is obviously the key, right? And companies obviously have very different ways of doing this; sometimes I call this terraforming your company to be agentic-ready. What sort of setup do you require for companies to be able to be run by Polsia?
Ben Broca
Ben Broca 1:06:26
I think that like the way the AI gets all the information it needs
1:06:31
to really go in the right direction is this: when you onboard with Polsia, it's actually gonna do market research. It's gonna look on the web for every piece of information it needs to understand what business we're doing, what our competitors are doing, and what the best approach is. That's number one. There are research agents and QA agents and browser agents that can browse the web and get a feel for what exists, to make sure we don't start from scratch. Number two: every agent that works will actually learn. So for example, if a cold outreach agent works for a specific company and learns that adding emojis to the subject line gets a better response, it will save that learning into a cross-company memory file that can be searched later by the next agent that runs, right? The idea is that the more users on the platform exploit and explore different crevices of the economy and different use cases, an agent that just does cold outreach will learn: okay, in those use cases, when you pick this type of demographic in this type of situation, this is what gets responses. And so the system gets better over time.
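The cross-company memory file Ben describes could be as simple as an append-only log of generalized notes, searchable by tag. A hypothetical sketch; the JSONL format and the field names are assumptions, not Polsia's actual design:

```python
import json, os, tempfile

def save_learning(path: str, tags: list, note: str) -> None:
    """Append one anonymized, generalized learning (no PII, no company names)."""
    with open(path, "a") as f:
        f.write(json.dumps({"tags": tags, "note": note}) + "\n")

def search_learnings(path: str, tag: str) -> list:
    """Return the notes of every learning carrying the given tag."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r["note"] for r in records if tag in r["tags"]]

path = os.path.join(tempfile.mkdtemp(), "memory.jsonl")
save_learning(path, ["cold_outreach", "email"], "emoji in subject line lifts replies")
save_learning(path, ["ads"], "short UGC hooks outperform long ones")
print(search_learnings(path, "cold_outreach"))  # ['emoji in subject line lifts replies']
```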
Ryan Carson
Ryan Carson 1:07:37
but that's nice.
1:07:39
But what if one of my competitors is using Polsia? I don't want them to learn from my learnings. How do you isolate that?
Ben Broca
Ben Broca 1:07:45
So it's anonymized, and it's generalized learnings, more like
1:07:49
the way an LLM trained on all of the internet ends up learning from every instance of what happened everywhere. It's a shared learning of what works and what doesn't. It doesn't use PII or specific company names or specific things; it's more general learnings, and it benefits everyone. A new customer would benefit from the platform, and other people will benefit from the platform the same way. Other big platforms use anonymized data to make you better: if you go on Amazon or something like that and set up a shop, they'll tell you, hey, on average, users who do this get better performance. Or on Meta ads, right? They tell you, hey, if you turn on this knob, on average you'll get 10% more performance, because that's what other people have seen succeed.
Ryan Carson
Ryan Carson 1:08:31
Cool.
1:08:31
One. So similar
Ben Broca
Ben Broca 1:08:32
idea.
Ryan Carson
Ryan Carson 1:08:32
So what type of companies are being successful at Polsia?
1:08:36
And I assume there are, like, four companies that make up the majority of this 700-grand run rate, or are they equally spread across? Tell us what's working and what kind of company.
Ben Broca
Ben Broca 1:08:46
So to clarify that run-rate number: it's the annualized amount
1:08:49
of money that flows into the Polsia ecosystem. That mainly includes subscriptions, since to set up a company on Polsia you pay 50 bucks a month, and you get 30 days of autonomy, a bunch of tasks, and a web server provisioned for you, database, et cetera. I don't really make money on that 50 bucks, but that's the best description. Then people can add more tasks to do things faster, and they can also run ads, which is an extra charge. And then there's the company revenue. The majority of that run rate is actually platform spend: people spending dollars to create their business. In terms of the most successful companies, we're still really early, because as you can see on the graph, the majority of businesses are one week old. There are a few businesses that started making money, and I'm seeing increasingly more transactions coming in, but it's really early. Most companies make less than a hundred bucks MRR. That's why I introduced the ads product a week or so ago, where essentially Polsia autonomously creates UGC ads, puts them on Meta, wakes up every day, looks at performance and CTR and ROAS, and actually decides on campaigns.
Alex Volkov
Alex Volkov 1:09:57
I'm sorry, I have to pause you.
1:09:58
It just super quickly passed 700,000 live on the show. 700,000 dollars in ARR since... since December, you said?
Ben Broca
Ben Broca 1:10:08
Yeah, I launched in December, end of December.
1:10:11
And yeah, most of the growth came recently. About a week ago I announced on X that I was doing a fundraise where my AI, Polsia, would raise its own round of funding, and that picked up quite a bit; I got a lot of views. And now I'm telling my story publicly, showing the numbers, showing what I'm doing, showing that I'm solo on this and using AI to the max to run the platform itself. I actually had a minor outage this morning, and literally Opus and Codex were both running at the same time, diagnosing it, making sure they were correct, and then pushing the hotfix to production. That self-solved it. And I was like, this is crazy, because the alternative would be 24/7 infra folks on call, and I can just run a cron job with, like, infra monitoring and get it fixed.
Alex Volkov
Alex Volkov 1:10:55
Ben, first of all, this is incredible, incredibly inspiring as well.
1:10:58
The growth is just indicative: the graph I see in your ARR, which is absolutely bonkers, is also the graph we see in the METR long-horizon progress we just talked about, and the graph we see everywhere in the intelligence explosion. Maybe the last thing I'll ask you: what are some of the learnings you have to share with folks who maybe wanna start on this journey, maybe use Polsia to start on this journey, and also have AI run their companies? Give us a little bit of the insider feeling: what it's like as a solo founder with AI who runs a company like this and talks to agents all day.
Ben Broca
Ben Broca 1:11:32
Heh, here it is, right?
1:11:33
It's crazy. And I had a lot of existential moments where I was like, what am I doing? Why am I still alone? In my previous companies I had 400 people under me, like when I was working for Travis Kalanick at CloudKitchens. So I know the concept of having a big team and how great it is. But there's something unique about pushing the boundaries of what's possible, number one. So first, from a pure engineering perspective and a curiosity perspective, I'm like, how the fuck can I get this right? And I think that's pretty fascinating. Number two: let's say right now I'm like, okay, I need to hire an engineer to do these things. The bar to hire is so high, because I need to find someone who's not just a junior, someone who is way smarter than me on this, who is also completely pilled and will be okay with letting AI agents run in production, and will become a coordinator of agents in production. Those people are rare, because they usually work at the labs and they're already getting paid millions of dollars there. So they're very expensive and hard to hire. And my alternative to hiring them is to train agents myself, meaning essentially figuring out prompts, giving them the right tools, giving them the right context, and trusting them to push to production. So right now I have agents in production that talk to users and execute tasks on behalf of users. They do bug reporting and feature reporting, so I have a bug list and a feature list that is autonomously written by agents in production in Polsia. And then I have another agent team that picks up those bugs and features, figures out the clusters that are most important, and builds them autonomously. And then they give it to me. And I'm the bottleneck now.
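The triage step Ben describes (cluster raw reports, rank clusters by how many users hit them) can be caricatured with keyword matching. A real system would likely cluster by embeddings; every name here is a hypothetical stand-in:

```python
from collections import Counter

KEYWORDS = ["login", "billing", "export"]  # hypothetical cluster labels

def cluster_reports(reports: list) -> list:
    """Count reports per keyword cluster, most-hit clusters first."""
    counts = Counter()
    for report in reports:
        for kw in KEYWORDS:
            if kw in report.lower():
                counts[kw] += 1
    return counts.most_common()

reports = [
    "Login button does nothing on mobile",
    "Billing page 500s after upgrade",
    "Can't login with Google",
    "CSV export is empty",
    "LOGIN loop after password reset",
]
print(cluster_reports(reports))  # [('login', 3), ('billing', 1), ('export', 1)]
```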
But then the next step is to just let them rip and be like, I'm not even the bottleneck anymore. You guys self-heal the platform. You guys build the features users want, because who am I to judge what users want? Yeah. And by the way, Polsia is an economy, right? If someone wants to build a crypto business, let them, and build whatever crypto APIs we need. And so I'm thinking: right now Polsia is probably 80% autonomous, meaning 80% of the operations are autonomous, and I'm just checking. Can I make it 90% autonomous? Can I make it 100% autonomous, meaning I don't even take decisions anymore? I give it an ethos, I give it financials, and I'm like, you know what, just rip. And then, from a "the singularity is near and nothing will matter soon" perspective, I'm tempted to ask: is that cool? Or is what's cool raising money, hiring a bunch of people, and being a normal company? And I may change my mind, because when I woke up at
1:14:08
6:00 AM with the mini outage, I was like, fuck, okay, what's going on?
1:14:12
But it's cool, it's just cool. Right now I'm in flow. I'm excited about what it is. I love that users love it. I'm pressured to make sure it works and generates more revenue for more users, and it actually works. I'm aligned with that.
Alex Volkov
Alex Volkov 1:14:25
Well, I'm gonna test this on ThursdAI, Ben.
1:14:27
Thank you so much for joining, and huge congrats on the success. The graph is fucking parabolic. Folks, go to polsia.com/live just to see how many people are signing up every second. I think you had like 790 companies launched in the past 24 hours or something, many of them paid. Ben, congrats on the success. We hope to bring you back on to see what's next for you. Thank you so much for joining us.
Ben Broca
Ben Broca 1:14:48
Thanks a lot.
1:14:49
See you
Alex Volkov
Alex Volkov 1:14:49
Okay.
1:14:49
So, this is just fucking insane. He launched it in December.
Ryan Carson
Ryan Carson 1:14:55
Yeah, this is gonna become normal.
1:14:56
This is how I'm running my startup. It's me and my chief legal officer, and I'm not gonna hire people unless I have to. I want things like this to run the company. So.
Alex Volkov
Alex Volkov 1:15:04
You'll bring agents on. All right, from one agentic,
1:15:07
parabolic thing to another, I would love to introduce Nader. Nader, welcome to the show. Nader from Cognition, or recently from Cognition; you just joined not that long ago. The reason I reached out to you is that Cognition is a company we've covered here on the show; we literally talked about it at the beginning of the show. Devin, the first Devin, the big launch with Scott Wu, the mathematician, launched two years ago, and back then it was $500 a month, I remember. And I got access to Devin recently, to the new one, Devin 2.2, and the capability jump that happened there broke my brain. So Nader, welcome, and congrats on your new gig. What can you tell us about the new release? What happened this week? We're a weekly news show. What is new about Devin right now that fits the moment, with everybody talking about how December changed everything? What's new with Devin compared to the Devin of two years ago?
Nader Dabit
Nader Dabit 1:15:57
Well, the interesting thing is that, they've been
1:15:59
building the platform for two years. So when Devin was initially launched, when you first became aware of it and were saying it was a little expensive to try out, that was two years ago. And imagine how much has changed in two years in terms of the capabilities of these LLMs and how much better the models have gotten. Throughout those two years, they've been building the platform to essentially facilitate cloud agents and large engineering teams doing actual software engineering work with these agents. And as they've built the platform, the models have continuously improved. So now there's this inflection point, I think, where the platform and the capabilities of the LLMs and the agents all come together to make a very compelling, high-quality product. That's maybe what's changed in general: they've been building for two years, and the models have gotten so much better. Specifically with the recent launch, we've added a lot of improvements, enhancements, and latency gains, and essentially made the product 10x better. I did some exploration around December, looking at what was out there in terms of what I wanted to do next, 'cause I wrote and talked a little bit about my career change and wanting to move into the space full-time. I'd been dabbling in it for a few years, but I realized, oh, this is exactly what I wanna do. And the thing that was really compelling about Cognition is that they've been building specifically for this problem for years, and that's their only focus, and that's really the thing I also wanted to do. So yeah, that's what led me here.
Alex Volkov
Alex Volkov 1:17:41
I would love to hear from you what compelled you to move as well.
1:17:44
Cognition has been doing the thing that everybody in this industry who got to scale as early as they did basically said: build for the future. Even if it doesn't work right now, the models will get there, and by the time they do, you need the infrastructure, you need the setup, you need the scaffold, you need the users already. We've been talking about this on ThursdAI for a long time. So it's very interesting to see. Contrasting: just before you joined, we talked with Ben, who's scaling a completely new company, and that's going parabolic.
Nader Dabit
Nader Dabit 1:18:15
Yeah,
Alex Volkov
Alex Volkov 1:18:15
Cognition
Nader Dabit
Nader Dabit 1:18:16
there.
1:18:16
I'm gonna have to go check it out.
Alex Volkov
Alex Volkov 1:18:17
Yeah.
1:18:17
I think Swyx talked about a very impactful meeting that you guys all had two or three weeks ago. And I wanna hear about Devin's use inside Devin. How do you use Devin, and what changed in your workflow since you joined? 'Cause it's also changed since then.
Nader Dabit
Nader Dabit 1:18:32
Yeah.
1:18:32
We're gonna be publishing a blog post actually later today titled How Cognition Uses Devin to Build Devin.
Alex Volkov
Alex Volkov 1:18:38
this.
Nader Dabit
Nader Dabit 1:18:39
But I think the most interesting thing that I've
1:18:41
noticed since I started working here is that it has essentially lowered the barrier to entry for everyone in the company to easily contribute improvements and polish within the platform. There's obviously a lot of discussion around displacement of jobs with all the AI stuff happening, like, oh, if this technology can do the engineering work, we're gonna need fewer engineers. But what I've seen happening is that we just do a lot more. If every problem is solved: let's polish this, let's make this better, let's make this faster. You're now in a race to build the best possible product because you no longer have that friction. So if someone notices a typo in the documentation, there's no "let's go create a Linear ticket and wait for someone to find the time to fix it," because an engineer often has more important problems to solve than a typo in documentation. But the person who notices it can just go in Slack and say, hey Devin, fix this documentation typo, boom. So you now have the real problems that engineers can spend their time focusing on, and a lot of these minor-to-intermediate features and bugs can just be fixed by anyone. That's the biggest difference I've seen. And then, beyond the software engineering capabilities of Devin, we have a lot of MCPs and databases and analytics tools built in as well. So our sales teams and our analytics teams, anyone in the company, can just ask for information about what's happening with a customer and get it directly within the same interfaces we're working in. Those are two big differences, I would say, from what we've done in the past.
Alex Volkov
Alex Volkov 1:20:31
Everybody started adopting skills and Windsurf
1:20:34
is now in Cognition as well. And Windsurf used to take skills from a different path, .windsurf or something, slash agents, whatever, and OpenAI compelled the community to get them from .agent/skills. So I replied, I think to Swyx, or at-tagged Swyx: hey, Windsurf should align to this, 'cause why not? Everybody's putting skills in the same location. Swyx replied with a screenshot of him asking Devin to do exactly this, and Devin one-shotted it over a million-line codebase of VS Code, Windsurf, et cetera, and it was done. Yeah. Nisten, go ahead.
Nisten Tahiraj
Nisten Tahiraj 1:21:04
Yeah.
1:21:04
Nader, I also had a career change eight years ago, thanks to your tutorials. I moved from doing DevOps and security to, when I saw the Amplify beta, I was like, oh wow, you can just command entire systems now with TypeScript, and you have infrastructure as code and you can connect the front end and the back end. But it did feel like, back then or for a while, that things did not break as often, the testing was done right, and you could do a lot with many people working on the same codebase. It just doesn't feel like that today. With agents and stuff, you still have to go and be manually involved. So this is an open-ended question: what do you feel is missing in the way people handle large codebases today? Do you think it's the testing? What's your opinion on the difficulties of handling large codebases with agents?
Nader Dabit
Nader Dabit 1:21:59
I think that it's also a really cool discussion to have around.
1:22:03
What's happening within the industry in terms of people arguing over which tool is best for which job. You have different types of use cases. Right now at least, and maybe this will change, I'm sure it will, you have the larger-scale, more sophisticated tasks that you would only trust a senior engineer to do, right? And then you have the very lightweight tasks at the other end, something like fixing a typo in documentation, and then everything in between. So I do feel like, at the moment at least, you have different tools that are better for different jobs. A CLI is better for certain jobs, an IDE or desktop app is better for other jobs, and cloud agents are better for certain jobs. Things are just going to get better; this is the worst they'll ever be. And I think the two main things to consider for more complex, large engineering tasks on larger codebases are, number one, the context that's available and how the LLM can process that context, and number two, the quality of the LLM at the moment. So a year from now we'll have better LLMs and better context management, and that type of work will just get easier and easier. Right now you still have to trust more senior-level folks who understand exactly what they're doing to make those types of changes; you have the right people doing the right things at the moment. You obviously don't want a very junior person to go in and make some sophisticated change to a database schema, something that has repercussions across the entire application. It's similar to how you might architect your actual teams.
You still need that senior-level person to make those types of changes, but I do think the easy-to-intermediate stuff has already been totally abstracted away for everyone else.
Alex Volkov
Alex Volkov 1:24:06
Meanwhile, I will, oh, there you are.
1:24:08
You're good. You're back. Meanwhile, I wanna show off the Devin interface and the cool things that I got. I obviously test a bunch of these tools, right? Devin is not the only one that can run agent things; recently I've been running a bunch of coding tasks via OpenClaw as my instrument to do stuff with Codex, with Claude Code, et cetera. Devin has the whole loop, which is great. So I wanna show some stuff. Devin has a video recording of it testing my website. This is the new website I launched for ThursdAI; by the way, you're supposedly already on it as a guest. And Devin just completely does this thing I got super excited about. I can show this here for folks; you guys see this, the preview link. Here's what happened with Devin. Cloudflare builds my website, okay? It happens automatically; the agents that write code and push a pull request don't need to know about this. Cloudflare does this, but it happens a little bit later, so Cloudflare needs to wait for the build, et cetera. Every other agent that I had was like, hey, pushed the PR, and shut up. Devin pushed this PR, waited, looked at the build, found me the link, and surfaced it to me: hey, now you can actually see this. That waiting thing, it's the small things, right? I don't even care which model runs behind this; it's the small things that developers need. But then also, it showed me an end-to-end video of testing the website. And the other highlight I want to show is this. You guys recently launched Devin Review. I don't know how to describe Devin Review, but folks, if you have a pull request and you switch the github.com in the URL to devinreview.com, you'll have Devin basically do an additional agentic pass. The integration between Devin Review and Devin itself was just incredible to me.
Devin built a thing, found the bug, Devin Review told Devin about the bug, and Devin was like, okay, let me fix this. This loop happened with me just sitting there saying, what the fuck is going on, dude? What is this world? Can you tell me about one more thing? What else is built in there that people should know about that's different from other tools?
Nader Dabit
Nader Dabit 1:26:05
I think Devin review specifically has been such a success.
1:26:08
We've had really great feedback from that. So for anyone watching: you don't need an account, and it's free to use. You can literally just do what he said, replace github.com with devinreview.com on any pull request, and try it out, see if you like it or not. Again, it's free. Obviously, if you have a private repo you do need to sign up for an account to use it, but with a public repo it's completely available without an account. If you have an enterprise account it's a little different in terms of accessing your repos; there's more security stuff there, obviously. But yeah, there's been a really good response to it. And it's something that has really shown me, this is obviously my first month working with this team, how much they care about polish. We have an internal Slack channel for every product, sometimes multiple per product, where all we do is discuss how to make improvements. There are probably a hundred to two hundred messages a day there about what we can do to improve, and almost all of them are actioned on. If someone says, hey, this should be done, it isn't just left there; someone says, hey Devin, do this, and we make the improvement. So imagine that type of polish happening for days, weeks, months, and then years. That's what's been impressive to me about working with this team so far: they really care, and they're relentless about making improvements and pushing things forward. So I'm glad that came through for you with Devin Review. One more thing is scheduled sessions, which are essentially cron jobs.
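As an aside for readers, the URL swap Nader describes, pointing any GitHub pull-request URL at the review host, boils down to a one-line host replacement. A minimal sketch (the `devinreview.com` domain is as heard on the show; treat it as an assumption):

```python
# Sketch of the host swap described on the show: point any GitHub PR URL
# at the Devin Review host instead. The "devinreview.com" domain is as
# heard in the conversation -- an assumption, not official documentation.
from urllib.parse import urlparse, urlunparse

def to_review_url(pr_url: str) -> str:
    """Swap github.com for devinreview.com, keeping the PR path intact."""
    parts = urlparse(pr_url)
    if parts.netloc != "github.com":
        raise ValueError("expected a github.com URL")
    # ParseResult is a namedtuple, so _replace swaps just the host
    return urlunparse(parts._replace(netloc="devinreview.com"))

print(to_review_url("https://github.com/acme/repo/pull/123"))
# → https://devinreview.com/acme/repo/pull/123
```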
Alex Volkov
Alex Volkov 1:27:40
Oh,
Nader Dabit
Nader Dabit 1:27:40
no way.
1:27:40
Where you can do anything within Devin and automate it. yeah.
Nader Dabit
Nader Dabit 1:27:43
you'll see under sessions.
Alex Volkov
Alex Volkov 1:27:47
I think it's not there.
1:27:48
Maybe I don't have access yet.
Nader Dabit
Nader Dabit 1:27:50
No, you should have access.
Alex Volkov
Alex Volkov 1:27:52
Oh, schedules right here.
1:27:53
It's new.
Nader Dabit
Nader Dabit 1:27:53
Anyway, so this is essentially what I'm talking about.
1:27:55
This is really cool because it's similar to how, if you've used Claude, you can ask for a cron job in natural language. Here you can actually ask for a cron job within your Devin system. And the really interesting thing is that this integrates with all of the MCPs and any other integrations you have built in. So you can say, hey, give me a daily standup for everyone who's submitted any pull request to this repo, send it to Slack, and create a new Notion document for this standup for the whole team every day. But you can also say, hey, every day go look at all the Sentry errors, or any messages we get from any bugs or logging systems, and if there's anything that needs to be addressed, create a pull request for it, or even create a Linear ticket, or whatever you want. You can do literally anything. So there's all types of stuff you can start thinking about automating that makes your job easier. And again, a big theme here is: how do we automate and abstract away a lot of this lower-level to intermediate work so we can focus on massive improvements and also polish? The engineering work is still there. It's just different.
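For readers, a scheduled session as Nader describes it reduces to three pieces: a natural-language task, a schedule, and the integrations the result flows into. The sketch below is purely hypothetical, it is not Cognition's actual API, and every field name here is invented for illustration:

```python
# Hypothetical sketch only -- NOT Cognition's real API. A "scheduled
# session" reduces to: a natural-language prompt, a trigger schedule,
# and delivery targets (Slack, Notion, Linear, ...).
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ScheduledSession:
    prompt: str                                      # natural-language task for the agent
    hour_utc: int                                    # fire once a day at this hour
    deliver_to: list = field(default_factory=list)   # e.g. ["slack", "notion"]

    def due(self, now: datetime) -> bool:
        """True when the daily trigger hour has arrived."""
        return now.hour == self.hour_utc

standup = ScheduledSession(
    prompt="Summarize yesterday's merged PRs and post a standup",
    hour_utc=9,
    deliver_to=["slack", "notion"],
)
print(standup.due(datetime(2026, 2, 5, 9, 0)))   # → True (09:00 trigger fires)
```

A real scheduler would loop over registered sessions, check `due()` each tick, and dispatch the prompt to the agent; this just shows the shape of the data.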
Alex Volkov
Alex Volkov 1:29:11
Yeah.
1:29:12
The last thing I wanna highlight, and obviously thank you so much for coming, is part of the Devin ecosystem: this beautiful step-by-step execution view that shows exactly what happened, when, and for how long. Developers with actual production systems care about how and exactly what happened and when. Oh, for sure.
Nader Dabit
Nader Dabit 1:29:30
For
Alex Volkov
Alex Volkov 1:29:30
sure.
1:29:30
And Devin has a full desktop environment, and every step is documented with all the code. That's just fucking incredible. Nader, we have to continue. Thank you so much for coming. We'll get to a point where we'll ask you how people can actually go get Devin. I think Devin Review is free to start; Devin is credit-based, and right now it's
Nader Dabit
Nader Dabit 1:29:45
free this week for the next week.
1:29:47
at least, I don't know exactly how long this is gonna happen, but it's free for now. Let's go. Yeah, if you're watching, check it out and, try it out. It's free.
Alex Volkov
Alex Volkov 1:29:53
Awesome.
1:29:54
Nader, I consider you a friend of the show. Please feel free to let us know about new releases that are coming, and come talk about them. Thank you so much for joining us on ThursdAI. Thank
Nader Dabit
Nader Dabit 1:30:02
you for having me.
Alex Volkov
Alex Volkov 1:30:03
Alright.
1:30:03
Nisten, you gotta check out Devin after this. 'Cause otherwise,
Nisten Tahiraj
Nisten Tahiraj 1:30:06
I, I did over the weekend I was at a hackathon.
Alex Volkov
Alex Volkov 1:30:08
So we need to talk to the team to bring your own credits.
1:30:10
I found it, compared to OpenClaw and everything else, very strong. This is a very strong team working on this and building these features. It's stable. A bunch of the open source stuff is not as stable as I would like, and this is stable. All right, folks, thank you so much. We're almost at the end, but we have another interview for you. I wanna introduce Philip to the stage. Philip, what's up? Welcome to the show. I'll bring up Wolfram to help me interview. Philip, it's your first time on here, I believe.
Philip Kiely
Philip Kiely 1:30:37
Hey, longtime listener.
1:30:39
First time caller.
Alex Volkov
Alex Volkov 1:30:40
Cool. You work at Baseten.
1:30:42
would love for you to introduce yourself and Baseten, and then let's talk about the book you just published.
Philip Kiely
Philip Kiely 1:30:46
I'm Philip.
1:30:47
I work at a company called Baseten. I've been here for more than four years. Baseten is an inference provider, and we focus on running models for very latency-sensitive and very uptime-sensitive customers. So if you have something that's mission-critical to your product and has to be really fast, you generally come to us. And in four years in this industry, I've learned a lot about what it actually takes to do inference, because you can't just get a GPU, put vLLM on it, and be like, all right, inference solved. So I wrote a book that came out Monday called Inference Engineering, which talks through, end to end, the entire problem of inference and the dozens of technologies involved. I was hoping people would like this book. I got like a million views on Twitter, and almost 10,000 people have downloaded it already. It's been just an overwhelming reception from the market.
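One concrete reason "a GPU plus a serving framework" isn't the whole story is batch scheduling. The toy model below (not from Philip's book, just an illustration) compares static batching, where every request holds its slot until the longest request in the batch finishes, with continuous batching, where each slot is released the moment its request completes:

```python
# Toy illustration (not from the book): why batch scheduling matters in
# inference serving. We count occupied slot-steps as a proxy for GPU time.

def static_batch_cost(lengths):
    # Static batching: every slot is held until the LONGEST request finishes.
    return len(lengths) * max(lengths)

def continuous_batch_cost(lengths):
    # Continuous batching: each slot is held only as long as its own request.
    return sum(lengths)

reqs = [12, 100, 30, 45]                # output tokens per request in one batch
print(static_batch_cost(reqs))          # → 400 slot-steps occupied
print(continuous_batch_cost(reqs))      # → 187 slot-steps actually needed
```

With uneven output lengths, which is the normal case for LLM traffic, more than half the static batch's slot-time here is spent idle, which is the gap serving systems like vLLM exist to close.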
Alex Volkov
Alex Volkov 1:31:45
Incredible, dude.
1:31:46
Congratulations. Thank you so much for reaching out to me and also for sending me a copy; I'm waiting to receive it.
Philip Kiely
Philip Kiely 1:31:51
gets here later today.
Alex Volkov
Alex Volkov 1:31:52
oh, hell yeah, dude.
1:31:53
Let's go.
Philip Kiely
Philip Kiely 1:31:54
Yeah, I've made close personal friends with a lot of the
1:31:56
folks at the FedEx office because I've been shipping so many books.
Alex Volkov
Alex Volkov 1:32:01
That's great.
1:32:01
So I do wanna ask you another question. When DeepSeek came out a year ago and crashed the stock market, it crashed it specifically because people were like, oh, for training, maybe the amount of money these companies spend on these ridiculous data centers is overspent. And for clarity, me and Wolfram over here work at Weights & Biases, which CoreWeave acquired, so we're also part of this thing. But then OpenAI launched the whole test-time compute paradigm, and now there's another scaling axis: the more the model thinks, the better it performs. We know some examples where that's not true, where turning off thinking is actually better, but most of the time, for economically viable tasks, the more inference happens in the model, the better it performs. And here you are working on inference. Do you imagine inference not being needed going forward? This is basically my question. I know how I feel about this, but even if model training plateaus or whatever, inference is absolutely there, and the demand is absolutely bonkers, insane. Is that what you guys are seeing from an inference engineering perspective as well?
Philip Kiely
Philip Kiely 1:33:17
Inference is everything, man.
1:33:19
Like, inference: it's on our website, it's on our billboards for a reason. I will say, with the stock market crash: I grew up in Iowa, and if you grow up in Iowa, the closest center of academic learning when it comes to economics is the University of Chicago. And out of the University of Chicago came, I think, one of the most dangerous ideas you can expose an impressionable young person to, and that is the efficient market hypothesis: the idea that the participants in a market actually know what's going on and things are appropriately priced. For a long time I believed in it, because it's what I was told growing up, but much like Santa Claus, it turns out the efficient market hypothesis doesn't exist. So I was a huge fan of that particular stock market crash, and I bought the dip, a hundred percent, because I know the demand in this space is absolutely insatiable. If training gets cheaper and easier, people want to do more training, because now, instead of relying on a handful of closed labs like OpenAI and Anthropic and Gemini to do 100% of your AI roadmap for you, customers have the opportunity to train their own models and own their intelligence. So instead of a handful of customers for training, there are now thousands, tens of thousands. And then inference is even bigger. Honestly, I think maybe a few years ago training was the majority of the market, but I see a future where inference is 10x or a hundred x bigger than training, because everybody needs inference. There's gonna be local inference on your computer, cloud inference, inference on your phone, inference everywhere. So I honestly feel like being in inference today is like being in distributed systems 10 years ago, or being in mobile 10 years ago.
It's just such a fundamental shift in the technologies that are out there, and in the demand for engineering expertise, that I just couldn't imagine a better industry to be working in.
Alex Volkov
Alex Volkov 1:35:27
I think that, obviously we're stepping into kind of inference
1:35:29
for the past six months as well. We're doubling down on this on the consumer side, right? Consumers can come and get tokens from Weights & Biases inference as well, and CoreWeave has been doing this for a while. It's the Jevons paradox thing that many people don't get. Recently, I don't know if you saw this chart, somebody showed a chart of the whole world: how many people have even talked to AI, for free; then, out of those, how many people actually paid the 20 bucks a month for basic intelligence; and in the corner there's a very tiny point of how many people have AI agents running autonomously as they sleep. All of us on the show just recently got to a point where agents are running some of the time, all fairly recently. So if I imagine this as a scale of some sort, and it follows the same trend, most people will eventually have something happening for them behind the scenes while they sleep. Proactive agents, I think, is the theme of this year. And that's all directly inference. Wolfram, go ahead.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:36:26
So one I totally agree with.
1:36:28
The inference is needed by everyone. Most people don't know it yet, but it is like electricity that you need to power your systems. You need inference to power your intelligence basically.
Alex Volkov
Alex Volkov 1:36:38
we've been talking about the price of intelligence going down
1:36:40
to zero, we've been talking about this, but I'm paying out of my nose for all these pro subscriptions. What the fuck, where's the trend? Tell me where you land, 'cause the better intelligence... yes, the inference improvements are there, speedups in the preloading, in caching, a bunch of stuff, and I wish we had time to dive into exactly how this world changed, but it definitely changed, and everything is incredibly more efficient now, et cetera. On the other side, they keep releasing bigger models that keep requiring more GPUs, so it keeps being more expensive. The Max subscription for Claude, for example, is significantly better value than going to the API directly, and even then they're subsidizing it. So where do you see the trend? Is inference going down to zero? Is the price of intelligence going down to zero? Or are we gonna keep inventing bigger models that cost more and more money, and at some point they stop subsidizing and this costs even more? Where do you land on this?
Philip Kiely
Philip Kiely 1:37:36
So a couple things on that.
1:37:38
I think it's always going to cost money to run models, unless we build some kind of perfect fusion reactor and electricity becomes free; and even if electricity were free, I still think it would cost money. On the question of subsidy, I think subsidization is somewhat over-commented on in this space. Certainly there was a lot of subsidy happening, but there were also a lot of really solid businesses being built here with positive unit economics. So yes, especially among some of the bigger labs, there's certainly a lot of subsidy happening, but at the same time, we're in a world where I think intelligence costs are reasonable compared to the input prices.
Alex Volkov
Alex Volkov 1:38:28
Wolfram, we wanna get to your question real quick while I scroll through
1:38:31
this beautiful website that got built for Inference Engineering, to tell folks where they can get this book.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:38:37
Requirements are exploding.
1:38:39
Now we finally understand why we are building all these data centers. So the question is: in the last month there was a sudden GPU shortage. How has it been affecting you and your company?
Philip Kiely
Philip Kiely 1:38:49
Yeah, we are constantly looking for capacity, just because
1:38:54
it's less about market conditions and more about the pace of growth this company is continuously on. We're constantly looking for more capacity to fulfill demand. But right now we've got tons of GPUs, so come on through and grab some.
Alex Volkov
Alex Volkov 1:39:09
I have a question, Phillip.
Philip Kiely
Philip Kiely 1:39:10
Yeah.
Alex Volkov
Alex Volkov 1:39:10
what's your take on data centers in space, GPUs in space?
Philip Kiely
Philip Kiely 1:39:16
I think it's cool as hell, that's for sure.
1:39:17
GPUs are cool. Data centers are cool. Space is cool. Put 'em all together.
Alex Volkov
Alex Volkov 1:39:21
Yeah.
Philip Kiely
Philip Kiely 1:39:22
I am certainly a big fan of, GPUs that I can, walk up to
1:39:26
and turn them off and back on again when they're misbehaving.
Alex Volkov
Alex Volkov 1:39:29
All right.
1:39:29
Philip, thank you so much for joining us. Congratulations on your book. Folks, if you want to read all about inference and learn inference engineering, grab Philip's book at Baseten. We haven't talked about the open source stuff at all, and we've been known for covering open source for a while, so let's talk about it. We have two releases, I think, that we need to cover. One of them: our friends at Qwen released a medium model, Qwen 3.5. It's a 35-billion-parameter model with only 3 billion active that outperforms the previous 235-billion-parameter flagship. This has been the trend for local open source models. Nisten, LDJ, Wolfram, if you have any comments about the new model, feel free; I would love to hear them.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:40:08
But is the second release also by
1:40:10
Qwen, or is it something else? Because they also released a second one, a 27B. Ah, okay. So yeah, there's the 35B with 3B active, and there's also a 27B, which is even a bit better than this one.
Nisten Tahiraj
Nisten Tahiraj 1:40:22
this one is special in the architecture because people were
1:40:26
testing it locally with just one 3090, offloading to the CPU, and it kept the same performance even after a hundred thousand tokens on local hardware. And it turned out that 30 out of the 40 layers are actually hybrid state-space-model Mamba layers. So this is a completely different model, a completely different MoE from what we've seen so far; it is more like Jamba. And I'm excited for when they come up with the coding version of this. This is a pretty big leap for them, and I'm not surprised there are issues and stuff, but again, that's at a hundred thousand tokens. Normally, especially on local hardware, if you're running on 3090s at home, models drop to 20 tokens per second, or 10, after that. And this one just does not drop, because of this architecture. So that is incredibly interesting.
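Nisten's observation has a simple back-of-the-envelope explanation. Attention layers drag a KV cache that grows linearly with context, while state-space layers carry a fixed-size recurrent state. The dimensions below are illustrative assumptions, not the model's real configuration:

```python
# Illustrative arithmetic (made-up head counts/dims, fp16 cache): why a
# hybrid SSM/attention model holds speed at long context. Only attention
# layers accumulate a per-token KV cache; SSM layers keep a fixed state.

def kv_cache_bytes(tokens, layers, heads=8, head_dim=128, bytes_per=2):
    # K and V vectors per token, per attention layer
    return tokens * layers * 2 * heads * head_dim * bytes_per

ctx = 100_000                                        # 100K-token context
all_attention = kv_cache_bytes(ctx, layers=40)       # if all 40 layers attended
hybrid        = kv_cache_bytes(ctx, layers=10)       # only 10 of 40 attend

print(f"{all_attention / 2**30:.1f} GiB")  # → 15.3 GiB of cache to shuttle
print(f"{hybrid / 2**30:.1f} GiB")         # → 3.8 GiB, a 4x reduction
```

Less cache to read per generated token means per-token latency stays roughly flat as context grows, which matches the "does not drop after a hundred thousand tokens" behavior described above.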
Alex Volkov
Alex Volkov 1:41:24
We've talked about hybrid architectures like mamba, et cetera,
1:41:27
specifically for long context and the performance drop-off, and it looks like Qwen is adopting this. Qwen is usually the canary in the coal mine for open models, so it looks like other folks will start doing this as well. Native 262K context window, extensible to 1 million via YaRN; it looks like everything Qwen does is via YaRN as well. I would love to test it on something like SWE-bench Pro and see, but GPQA Diamond is very high as well. So this is definitely a model. But yeah, Wolfram, you're right, there's a whole lineup of models here that they released, not just the one: Qwen 3.5 Flash, 3.5 35B, which is this one, a 122B, and then a dense 27B, right? It's a dense 27-billion-parameter model. Yep.
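For context, the headline number in a YaRN-style extension is just the ratio of target to trained window; the actual method then interpolates RoPE frequencies per band, which this back-of-the-envelope sketch omits:

```python
# Back-of-envelope only: YaRN-style context extension starts from a scale
# factor s = target_context / trained_context. (The per-frequency-band
# interpolation that YaRN actually applies is omitted here.)
trained_ctx = 262_144        # native 262K window, i.e. 2**18 tokens
target_ctx = 1_000_000       # extended 1M window
s = target_ctx / trained_ctx
print(f"scale factor: {s:.2f}x")   # → scale factor: 3.81x
```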
Wolfram Ravenwolf
Wolfram Ravenwolf 1:42:09
And it's even a bit better in the benchmarks.
1:42:11
I personally looked at the Terminal-Bench scores, and I'd say the most interesting thing is that GPT-OSS 120B, which is much bigger and is basically one of the best local models, only got 18.7%, where these are over 40%.
Alex Volkov
Alex Volkov 1:42:25
terminal bench.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:42:26
Yeah, on terminal bench.
1:42:28
It's a big leap in the agentic and coding capabilities. And it's also very significant because all the other open source models we got in the recent past, like Kimi and MiniMax and so on, are all too big to run on just one GPU, so most people can't run them, while this one works. You can put it on one 3090 with CPU offload like Nisten said, or on two, and you have the speed; it's super fast. So this is very exciting, and it's my favorite of the week. We didn't do picks this time on the show, but this would have been my choice.
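The "runs on one 3090" claim checks out on the back of an envelope. With an MoE you must store all the weights but only compute through the active experts per token; the quantization figure below is an assumption for illustration:

```python
# Rough arithmetic (assumed 4-bit quantization, overhead ignored) for why
# a 35B-total / 3B-active MoE fits consumer hardware: you must STORE all
# 35B weights, but each token only COMPUTES through ~3B of them.
total_params = 35e9
active_params = 3e9
bits = 4                                   # assumed 4-bit quant

weights_gb = total_params * bits / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")    # → ~17.5 GB, inside a 24 GB 3090
print(f"compute per token: {active_params / total_params:.0%} of a dense 35B")
```

The low active fraction is what keeps token throughput high even when some weights are offloaded to CPU, since only a sliver of them is touched per step.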
Alex Volkov
Alex Volkov 1:42:59
All righty.
1:43:00
So Qwen.
Philip Kiely
Philip Kiely 1:43:01
Like we've been using these as base models for some of
1:43:04
our recent fine-tuning experiments. So I'm very excited about these models, less in terms of just running them as-is: instead of adapting a 235B base, if we can adapt one of these guys, we automatically get so much further in terms of meeting a latency and a cost requirement.
Alex Volkov
Alex Volkov 1:43:26
Yep.
1:43:27
And just to tie a loop back: Qwen was not named as one of the folks who distilled Anthropic; they were not part of those three other labs. The other release we didn't get the chance to cover: our friends from Liquid also released a Liquid Foundation Model. This is their largest one, so it's very interesting that they can release a 24-billion-parameter MoE, where Qwen releases smaller ones as well. This one has only 2.3 billion active parameters and also runs on consumer laptops. So LFM has been focusing on smaller models and finally released a bit of a big one. Folks, do we have any comments on the LFM architecture? This is not, as far as I know, Mamba layers; they have their own specific thing, right, Nisten?
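A quick back-of-the-envelope on why a 24B MoE with 2.3B active parameters can run on a consumer laptop: total parameters set the memory footprint, while active parameters set the per-token compute. The quantization level and the 2-FLOPs-per-parameter rule of thumb below are illustrative assumptions, not measurements of this model.

```python
total_params = 24e9      # all expert weights must be resident in memory
active_params = 2.3e9    # only these participate in each token's forward pass

# Memory footprint at 4-bit quantization (0.5 bytes/param), ignoring
# KV cache and activations -- an illustrative lower bound.
mem_gb = total_params * 0.5 / 1e9

# Per-token decode compute: roughly 2 FLOPs per active parameter.
flops_per_token = 2 * active_params

print(f"~{mem_gb:.0f} GB of weights, ~{flops_per_token / 1e9:.1f} GFLOPs/token")
```

So the weights fit in a laptop's RAM at 4-bit, and each token costs compute comparable to a dense ~2B model, which is the whole appeal of low-active-parameter MoEs.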
Nisten Tahiraj
Nisten Tahiraj 1:44:07
Yeah.
1:44:08
Yeah, this one's completely different. I tested it. It can't code, but I am extremely impressed with the math and with everything else from the model. It was super fast, and it answered my vibe checks on all the really difficult physics and math questions, like the Martian railgun one, perfectly. It trained on 17 trillion tokens and they're still continuing to train it, so this might turn into a surprisingly significant release. I just randomly tested it yesterday, and it was extremely good at tasks that didn't involve code. On code it only tried a very small amount and then didn't bother, so I don't think it's really trained for that, but everything else was excellent, surprisingly.
Alex Volkov
Alex Volkov 1:45:00
Two models in open source.
1:45:01
I'm pretty sure the model that broke ARC-AGI was also open source, MIT-licensed. Thank you so much, Philip Kiely, for joining us. Folks, at two hours and almost thirty minutes, I think it's time to land this plane. The only thing I want to say is that we're all still waiting for the big one, DeepSeek V4, whatever they're going to release; everybody looks like they're gearing up for it, including the US government. Let's run through the TL;DR and make sure we've covered, or at least mentioned, everything we had to cover. I think we covered pretty much everything else. Oh yeah, we didn't cover Seedance 2. I want to show you Seedance 2; I think I have it open. It finally landed for users, while the API is getting delayed by ByteDance because Disney and everybody else sent them cease-and-desists, et cetera. Seedance 2 launched on CapCut; CapCut is the editor from ByteDance, and I can show you exactly how to get it. Let me pull this up. You go to AI video right here; it still says 1.5 in the UI, but Seedance 2 is now part of it, you just have to choose it in a dropdown, and it's not cheap. You can give it one image or multiple images, but it's restricted. As you guys remember from when we talked about Seedance, you can give it video; that's how everybody generates those Seinfeld episodes where Seinfeld says some crazy shit. That's video-to-video transfer. You cannot use video-to-video here; you can use one image, but it's definitely not going to use the voices, and I think the voices are what broke the internet as well. So that's Seedance. And let's look at the Taalas demo. Nisten, can you tell us about this super quick? What the hell are we looking at?
Nisten Tahiraj
Nisten Tahiraj 1:46:37
Yeah.
1:46:37
Three engineers left Cantor while they were making accelerators, so new GPUs, basically. These were actual engineers who had worked on actual chips; this wasn't just something like Etched, or an investor play. They didn't even raise that much money, but they finally shipped a product which has the baked-in weights of Llama 3 8B, and it is just nuts. It's limited to 8K context, but I think it would be funny to have it summarize part of our transcript; it might just be a bit too long. Yeah.
Alex Volkov
Alex Volkov 1:47:10
It's a little bit too long.
Nisten Tahiraj
Nisten Tahiraj 1:47:11
So
Alex Volkov
Alex Volkov 1:47:11
We're looking at Chat Jimmy, Chat Jimmy AI, literally.
1:47:15
Let me just ask it for a short story about Mars, or GPUs in space. For folks who are not watching: I pressed the button, and the whole story just appeared, as though it had been sitting there waiting for me. It shows 15,691 tokens per second, and this whole thing was generated in 0.048 seconds. And this is, unfortunately, limited. Do we know which model this runs?
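For scale, the two numbers read off the demo are consistent with each other: throughput times elapsed time gives the length of the generated story in tokens.

```python
throughput_tok_per_s = 15_691   # generation speed shown in the demo
elapsed_s = 0.048               # generation time shown in the demo

# Tokens produced = rate x time.
tokens_generated = throughput_tok_per_s * elapsed_s
print(round(tokens_generated))  # roughly a 750-token story
```

That is, a full short story materializes in under a twentieth of a second, which is why the output appears to be instant.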
Nisten Tahiraj
Nisten Tahiraj 1:47:47
It's Llama 3 8B, that's what I heard.
Alex Volkov
Alex Volkov 1:47:50
So still Llama.
Nisten Tahiraj
Nisten Tahiraj 1:47:52
This was the demo. I guess it did take them a while to
1:47:55
bake in the weights. And honestly, even if you're going to do filtering or moderation models, those are just 3B now, so there are already uses for this thing to
Alex Volkov
Alex Volkov 1:48:07
run it.
Nisten Tahiraj
Nisten Tahiraj 1:48:07
Like Llama Guard, yeah.
Alex Volkov
Alex Volkov 1:48:08
It's an incredible use case for something like this.
1:48:10
Immediate guardrails, LLM as a judge, which we know other models should be able to do, at 15,000 tokens per second. This happens by burning the model onto the actual chip. All right, folks. We had an incredible week, capping off an incredible month in February with a bunch of launches: Opus 4.6, Codex 5.3, Gemini 3.1 Pro, and a lot of other stuff that nobody cared about; capping off an insane start to 2026 with OpenClaw and Skills and just mind-blowing stuff, agents and images. And Nano Banana 2 just launched. This is only the first two months of this year. I'm very happy with the guests we're having: today we had Nader from Cognition, Philip Kiely from Baseten, and Ben, who's building a company that broke $700,000 in ARR live on the show, after $600,000 yesterday and $500,000 the day before. He's adding a hundred thousand dollars in ARR every week right now, which is absolutely bonkers and very indicative of the singularity we're all experiencing. Very happy that you all are here. I'm also happy that over 2,000 folks are tuning into the show to get up to speed. If you missed any part of the show, ThursdAI is available on thursdai.news, our new website — please check it out — and everywhere you get your podcasts: Spotify, Apple, and on Substack. I release a newsletter that I write myself, and I try not to use AI in there at all, because I think voice and tone are very important. So if you're looking for authenticity, ThursdAI is here. We're live; this is not AI. Maybe we'll see that in a year or two, but for now, we're live and we're bringing you the experts from the industry. We're hugely appreciative of everybody who tunes in and spends time with us, and we'll see you here next week. Thank you so much, folks, for joining.
Thank you for tuning in, and we'll see you here next week. Bye-bye, everyone.