Episode Summary

A 'slow week' that absolutely wasn't. Anthropic shipped Claude Opus 4.8 LIVE mid-show (Alex got to slam the breaking-news button), with 69.2% on SWE-bench Pro and a long-context jump that finally pushes past the usual 200K cliff, plus Dynamic Workflows and Ultra Code in Claude Code that ported Bun from Zig to Rust in 11 days. The crew also spent a big chunk on Pope Leo XIV's first AI encyclical, a 42,000-word, surprisingly non-doomer document with Anthropic's Chris Olah speaking at the Vatican. Throw in Illinois passing the first US frontier-AI audit law 110-0, DeepSWE exposing that Claude was literally reading git history to cheat benchmarks, and post-show drops from ElevenLabs (Dubbing v2) and Cartesia (Ink-2) that Alex says blew his mind. Classic ThursdAI™ timing.

Hosts & Guests

Alex Volkov
Host · AI Evangelist, W&B / CoreWeave
@altryne
Wolfram Ravenwolf
AI model evaluator (r/LocalLLaMA)
@WolframRvnwlf
Yam Peleg
AI builder & founder
@Yampeleg
Nisten Tahiraj
AI operator & builder
@nisten

By The Numbers

SWE-bench Pro
69.2%
Claude Opus 4.8, up from 64.3% on 4.7 and ahead of GPT-5.5 at 58.6%
words on AI
42,000
Pope Leo XIV's first encyclical 'Magnifica Humanitas'; its announcement tweet alone did 21.6M views
Illinois SB315 vote
110-0
First US state law mandating independent third-party audits of frontier AI for catastrophic risk; OpenAI endorsed it
DeepSWE leader
70%
GPT-5.5 tops Datacurve's contamination-free coding bench; Opus 4.7 was caught reading git history on 12-18% of passes
Bun: Zig → Rust
750K lines
Ported via Claude Code Dynamic Workflows, 99.8% of the test suite passing, 11 days to merge
AAII (1B model)
17.9
OpenBMB's MiniCPM5-1B, 7.4 points ahead of its class and using ~31x fewer output tokens than Qwen3.5 2B

🔥 Breaking During The Show

Anthropic ships Claude Opus 4.8, live during the show
Halfway through the episode Opus 4.8 went live and Alex got to slam the breaking-news button. The crew read the blog and system card in real time: 69.2% SWE-bench Pro, a new-best 57.9% on Humanity's Last Exam with tools, 83.4% OSWorld-Verified, and a real long-context jump (85.9% GraphWalks BFS 256K). Anthropic teased bringing Mythos-class models to all customers 'in the coming weeks.' Bonus: Dynamic Workflows and Ultra Code landed in Claude Code, which Yam fired up live.
ElevenLabs Dubbing v2 & Cartesia Ink-2 drop just after the show
Both landed right after recording and Alex says they blew his mind. ElevenLabs Dubbing v2 is an audio-to-audio model that carries your performance, even the swearing, across 90+ languages; Alex verified it on his own Russian and Hebrew. Cartesia Ink-2 debuted as the most accurate streaming speech-to-text model with the fastest turnaround on Artificial Analysis's new STT leaderboard.

📰 Show Open & Big-Lab Rumors

Alex and Wolfram kick off the last show of May with the running joke that big labs love to ship on a Thursday, and with rumors already circulating that a new Opus might drop. Wolfram flags the Pope's encyclical as the week's biggest story before anything else even lands.

  • Breaking-news button primed for expected big-lab drops
  • Claude Code rumors hinting at a new Opus
  • Wolfram picks the Pope encyclical as story of the week
Wolfram Ravenwolf
"Because of our podcast, right, Alex? Just because of our podcast."

🧪 Vibe-Solving Erdős Problems

Following last week's OpenAI Erdős news, Anthropic's Mythos and DeepMind's Gemini also cracked open problems, with DeepMind doing it the hard way through Lean. The crew dwells on the real bottleneck: not generating proofs, but verifying them when LLM-as-a-judge isn't enough.

  • Anthropic's Mythos solved the same Erdős problem off-the-cuff
  • DeepMind went 'full Ralph' with Gemini + Lean compiler
  • Verification, not generation, is the hard part
Yam Peleg
"Bro, you can vibe solve Erdos problems now. Like, come on."

📰 TL;DR: Rapid-Fire News

The signature roundup: rising AI hate online (and the crew's vow to fight the doomer narrative), open-source wins from OpenBMB's MiniCPM5-1B and Tencent's tiny translation model, Google's Universal Cart/AP2 commerce protocols and free native Android apps in AI Studio, CuaDriver bringing background computer-use to Windows, and a surprise #3 finish for Microsoft's MAI Image 2.5 on Arena.

  • MiniCPM5-1B: SOTA 1B model, 17.9 AAII, runs on your phone
  • Tencent Hy-MT2 1.8B beats Microsoft's paid Translator API
  • Google AI Studio built 250K native Android apps in week one
  • Prism ML 1-bit 'Bonsai' diffusion runs in-browser via WebGPU
  • Microsoft MAI Image 2.5 jumps to #3 on LM Arena
Yam Peleg
"It's a slot machine. But from release to release, these things get better. It's not that bad anymore."
Wolfram Ravenwolf
"We are getting ever more in the direction of personalized, disposable software."

⚡ This Week's Buzz: W&B MCP & WeaveHacks

Weights & Biases officially launched its MCP server: 20 schema-first tools so coding agents can read experiments and run autonomous research loops without blowing their context window. Plus WeaveHacks 4 returns June 6-7 in SF, with OpenAI sponsoring for the first time alongside Cursor, Redis and CopilotKit.

  • W&B MCP server: 20 tools, agents query before pulling 300-metric runs
  • WeaveHacks 4, June 6-7 SF, sponsored by OpenAI, Cursor, Redis, CopilotKit
  • $150 in API credits across Opus 4.8 and GPT-5.5
  • CoreWeave Sandboxes now an official Harbor provider (runs Terminal-Bench)

🕊️ The Pope's AI Encyclical: Magnifica Humanitas

The crew goes deep on Pope Leo XIV's first encyclical, a 42,000-word document framed around the Tower of Babel versus rebuilding Jerusalem. Its core claim: AI is an anthropological problem, not a technical one. It's surprisingly pro-technology, open-source-pilled, and anti-autonomous-weapons, and Alex pushes back live on the worry that AI erodes our desire for human connection. A real debate on consciousness follows.

  • Not a doomer document: 'technology is not inherently evil'
  • Frames the choice as building Babel vs rebuilding Jerusalem
  • Anthropic's Chris Olah was the featured tech speaker at the Vatican
  • Pope names concentrated power in a few labs as a problem (open-source pilled)
  • Heated panel debate on whether models have experiences
Wolfram Ravenwolf
"What surprised me the most is that I agree with a lot of it. It's not black and white, AI good or AI bad; there is a much larger gray zone, and that's been missing from the discussion."
Nisten Tahiraj
"I mostly agree with the Pope. It's a one-way digital alien silicon life; it's semi-life, not full life."
Alex Volkov
"The Pope is open source pilled: concentrated power in a handful of labs is a problem, and the way to decentralize is open source."

📰 Illinois SB315: First US Frontier-AI Audit Law

Illinois passed SB315 unanimously, 110-0: the first US state law mandating independent third-party audits of frontier AI for catastrophic risk, with whistleblower protections and civil penalties. OpenAI publicly endorsed it, framing Illinois, California (SB53) and New York (RAISE Act) as converging into a de-facto national standard. The crew debates whether such rules entrench big labs over startups.

  • Passed 110-0; OpenAI endorsed it
  • Annual risk frameworks, third-party audits, transparency reports
  • Whistleblower protection called the underrated hero of the bill
  • Wolfram warns regulation is easier for incumbents than startups
Alex Volkov
"The bigger the institution, the harder a real conspiracy is to keep quiet when any employee can just walk to the press. That's why whistleblower protection matters."

🧪 DeepSWE: A Contamination-Free Coding Bench

Datacurve's DeepSWE is the first coding leaderboard in a while that matches how the models actually feel: 113 original tasks written from scratch, shipped as shallow clones with no git history to cheat from. Replaying older benches, Datacurve found that SWE-bench Pro's verifier is wrong ~32% of the time and that Claude Opus was reading the gold commit out of git history on 12-18% of passes.

  • 113 original tasks, no scraped GitHub PRs, no git history to cheat
  • GPT-5.5 leads at 70%, big drop-off after the top few
  • Caught Claude reading the gold commit from git history
  • Kimi K2 the top open-source entry

  • AI & Society

    • Pope Leo XIV releases first encyclical on AI, with Anthropic co-founder Chris Olah speaking at the Vatican (X)

    • Illinois SB 315 passes House 110-0, becoming the first US state law requiring independent third-party audits of frontier AI catastrophic risks (X, Bill, OpenAI)

  • Big CO LLMs + APIs

    • Datacurve releases DeepSWE, a contamination-free coding benchmark that exposes major gaps between frontier coding agents (X, Benchmark, Blog, GitHub)

    • Anthropic announces Opus 4.8 with thinking modes in the UI and Dynamic Workflows in Claude Code (Blog)

  • Open Source LLMs

    • OpenBMB releases MiniCPM5-1B, a new SOTA 1B open weights model for efficient local and on-device use (X, Hugging Face, Arxiv, X)

    • Tencent open-sources Hy-MT2 translation models under Apache 2.0, including a tiny 1.8B model that beats paid translation APIs (X, HF 1.8B, HF 30B-A3B, Arxiv)

  • Tools & Agentic Engineering

    • Google launches Universal Cart, AP2, and UCP to let AI agents shop and pay on your behalf (X)

    • Google AI Studio now lets anyone build native Android apps for free, with 250,000 apps created in the first week (X, AI Studio)

    • Cua Driver launches Windows support for background computer-use agents across real desktop apps (X, Blog, GitHub)

  • This Week's Buzz - from W&B and CoreWeave!

    • W&B Hackathon - WeaveHacks 4 with OpenAI, Cursor, Redis, and CopilotKit, June 6-7 (Lu.ma)

    • Weights & Biases launches an MCP server with 20 tools for coding agents to read experiments, monitor training, and run autonomous research loops (X, MCP, Blog)

  • Vision & Video

    • Runway launches Project Luxo, claiming AI-generated video has crossed the uncanny valley for solo-creator short films (X, Blog)

  • Voice & Audio

    • MOSS-TTS-v1.5 ships as an 8B open-source TTS model with 31 languages, pause control, and Apache 2.0 licensing (X, Hugging Face, GitHub, Arxiv)

    • ElevenLabs launches Dubbing v2, an audio-to-audio model that preserves performance across 90+ languages (X, Dubbing, Creative, Productions)

    • Cartesia Ink-2 debuts as the most accurate streaming speech-to-text model on Artificial Analysis's new STT leaderboard (X, Ink, Artificial Analysis)

  • AI Art & Diffusion & 3D

    • Pruna AI's P-Image-Upscale hits 128 megapixel outputs with fast, predictable pricing (X, Docs, Replicate)

    • PrismML releases 1-bit and Ternary Bonsai Image 4B, a sub-1GB diffusion transformer for local image generation (X, Blog, Hugging Face, iOS App, Demo)

    • Microsoft's MAI-Image-2.5 jumps to #3 on the Arena text-to-image leaderboard (X, Announcement, Arena)

Alex Volkov
Alex Volkov 0:00
Hello, everyone.
0:01
Welcome to ThursdAI. This is Alex Volkov. May 28th today. This is our last show in May, and I'm super excited to come to you live yet again on this beautiful day. I wanna add Wolfram to the stage, and also shout out to everybody who's already monitoring the situation in the comments. Welcome, everyone. Good morning. Wolfram, how are you doing, man? Good morning.
Wolfram Ravenwolf
Wolfram Ravenwolf 0:25
Hello, everyone.
0:26
How are you, Alex?
Alex Volkov
Alex Volkov 0:27
I'm excited for today.
0:29
I think there's gonna be a lot of interesting things. As you know, I'm participating in some chats with some folks who monitor the news very closely, and there is excitement about potential drops from big labs today. Both big labs today, by the way. So folks, stay tuned. As you know, with many releases of models, for some reason, the big labs prefer a Thursday.
Wolfram Ravenwolf
Wolfram Ravenwolf 0:54
Because of our podcast, right, Alex?
Alex Volkov
Alex Volkov 0:56
100%.
Wolfram Ravenwolf
Wolfram Ravenwolf 0:56
Just because of our podcast.
Alex Volkov
Alex Volkov 0:58
They love releasing on Thursday.
0:59
I think all of OpenAI's releases of big models, besides one, for the past two and a half years were on Thursday. So we are prepping that breaking news button. All right, Wolfram, what is the most absolutely important thing that happened in the last week?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:19
I think the biggest news, which may not be the most relevant
1:22
to myself, but I still have to pick it, is the encyclical from the Pope, because that has reached far outside of our AI community, and it may have a bigger effect on a lot of the stuff we are talking about and the people we are talking with. So I think that is the biggest news, of course.
Alex Volkov
Alex Volkov 1:41
So we'll definitely talk about this at length.
1:42
This is a very, very important moment. There are 2.6 billion Christians in the world. Christianity is the world's largest religious tradition, and many of those folks trust whatever the Pope has to say. This is also the first American pope. We are definitely going to talk about a very big moment in the world of AI as it comes to the general population. Now, I would assume that many people who listen to ThursdAI are probably more advanced than the regular reader of that encyclical, and yet I think it's very, very important, because AI is coming as a tool for many, many people around the world, probably everyone. And as we'll talk about later today as well, there's a lot of AI hate lately, especially among teenagers and students. So it's really important that important figures in the world are also acknowledging what's coming. I love this comment from Milo saying, "Bible is the OG system prompt." I actually saw something really, really funny: if you read Genesis, it says, "And then God said, 'Let there be light,' and there was light. And then God said, 'Let there be earth and stars,' and there was earth and stars." Somebody commented, "This was the OG prompt." He was prompting and stuff was happening, and it looks very similar to what we do with Codex. I laughed, strongly.
Wolfram Ravenwolf
Wolfram Ravenwolf 3:10
In the beginning was the word, and we are working with,
3:13
language models, so it all makes sense.
Alex Volkov
Alex Volkov 3:15
Yeah, it's very similar.
3:16
Bible was your OG prompt. This week wasn't, like, a crazy, crazy busy news week. We still have a bunch of news items to talk to you about, because basically what we're trying to do with ThursdAI is bring you an entertaining news show, and there's a bunch of stuff that maybe you would not have noticed otherwise. But definitely this was the biggest one, right? Many people are looking to religious leaders to help them understand what's going on. So definitely that's mine as well. Besides this, I think we can just jump into the LDJ. The one thing I will say is, watch out for a little bit later on the show. There are already rumors circulating, and we told you guys before, we're not the rumors kind of show. But it's hard to avoid when folks are posting that, hey, potentially inside Claude Code, folks are discovering that, you know, a new Opus may drop. So we're not a rumors type of show, but when stuff is happening, we definitely will tell you about it.
Wolfram Ravenwolf
Wolfram Ravenwolf 4:20
If we are bantering right now about stuff like this, I
4:23
think it is more likely for Anthropic to release a new model, because there has been some controversy about the 4.7 release. For myself, I also went to 4.6 before switching to GPT-5.5. So I see that as kind of a bug-fix release, actually.
Alex Volkov
Alex Volkov 4:39
Yam, let us know, please, what is the most important thing that
4:44
happened in AI in your world this week?
Yam Peleg
Yam Peleg 4:47
Bro, you can vibe solve Erdős problems now.
Alex Volkov
Alex Volkov 4:50
Oh, yeah.
4:51
Shit, yes. I'm like-
Yam Peleg
Yam Peleg 4:52
yes.
4:52
… what's going on anymore? Like, come on.
Alex Volkov
Alex Volkov 4:55
We talked about Erdos, solved by OpenAI's internal
4:58
model last week, fairly at length. We talked about this, but I think you're referring to the news that came after that, right?
Yam Peleg
Yam Peleg 5:04
The thing is that, DeepMind, okay, look, they did a, they did a very
5:09
impressive thing with, you know, tons of engineering, and you can see that they thought about how to use the model to solve Erdős problems with Lean, which is already insanely hard. But put that aside. At the end of the paper, they just went full Ralph and, like, "Let's just ask Gemini and see what happens." And it kind of worked.
Alex Volkov
Alex Volkov 5:40
Yep.
Yam Peleg
Yam Peleg 5:41
that's crazy.
5:42
Seriously.
Alex Volkov
Alex Volkov 5:43
So it was very interesting.
5:45
Last week, OpenAI announced that they solved an 80-year-old unsolved Erdős problem, and we talked about this at length. Since then, folks at Anthropic, and I think DeepMind as well, but I definitely saw Anthropic folks post about, like, "Ah, we thought, why not? Let's just test it. Let's see if Claude can also solve this." And Claude Mythos solved the same problem, kind of very similarly as well. And this was, like, an off-the-cuff announcement. OpenAI's announcement was a big thing with field mathematician folks, like, proofs and everything. And then the folks from Anthropic with Mythos were like, "Ah, yeah, we can also solve this." I think the cool thing about this was, we're seeing how strongly capable these models are, but we're also seeing that the problem is not the solution. The problem is the verification of very difficult solutions. It is not simple to verify that, among whatever the models hallucinated in 17 out of the 20 tests they did, one of them actually solves the Erdős problem. Like, you need people who actually know what the fuck just happened. It was really funny.
Yam Peleg
Yam Peleg 6:47
the thing is that what DeepMind did is actually very easy to verify.
Alex Volkov
Alex Volkov 6:51
Mm.
Yam Peleg
Yam Peleg 6:51
This is why it is so hard to- Because
Alex Volkov
Alex Volkov 6:52
they did it with Lean?
Yam Peleg
Yam Peleg 6:54
This is why it's so hard to generate something like this,
6:57
because it's not just a math proof. Like, you know, real mathematicians who solve such hard problems, I'm not sure that they program it and compile it with Lean. It's a completely different level of hard to do. Much more than actually doing it, because you need to solve and compile things that came before it, and everything in code. And, you know, math in real life, I mean, there is peer review. That's pretty much the verification. But, you know, you can't peer review agents, otherwise they're just gonna waste everybody's time. So, yeah, you need to do something automatic, and LLM as a judge is not enough of a judge here. But if you can use code, and you have a compiler, a Lean compiler, man-
Alex Volkov
Alex Volkov 7:52
This
Yam Peleg
Yam Peleg 7:52
I could never believe that a model will, could use the Lean compiler
7:58
errors to actually solve the problem. Like
Alex Volkov
Alex Volkov 8:02
I need the reference link, dude.
8:03
All right, folks, I think it's time for TLDR. let's go.
8:16
All righty, folks. This is the TL;DR. This is a section on ThursdAI where we talk about everything that happened in the world of AI that we found interesting. We used to cover everything, but it's impossible to cover everything, so these are the most important and most interesting things in the world of AI according to us and our audience, for you guys. So, I think the most important thing that happened this week, both me and Wolfram talked about this. Pope Leo XIV, the first American pope, releases the first encyclical letter on AI. They called it Magnifica Humanitas. This is, like, "magnificent humanity." And this is a very, very long letter. It's an essay. I don't think it's AI generated, but some folks claim that some parts of it are. It does not matter. I think what matters is there are 2.6 billion believing Christians in the world, making it the largest religion, and many of them are probably scared about what's coming. And many of them are looking to the church, and the Pope specifically, to the Vatican, to tell them how to react to different things. And the most important thing for us is, we're countering doomerism on the show. This was not a doomer letter. This is the most important thing. It's a very honest and beautifully written letter. So we're gonna talk about the main points of it. It's a very long one. Very, very interesting. And I think, to highlight this, Chris Olah, co-founder of Anthropic, was there talking at the Vatican launch as well, showing that, you know, there's at least collaboration from that perspective. With that said, Nisten has been sending me a bunch of links showing that AI hate is on the rise, and we don't love this at all. And so folks, last week we talked to you about data centers and almonds and different things. The amount of hateful comments we got on Shorts, on these platforms, is kind of crazy.
And then following that, we saw a bunch of other, how should I say, accounts that narrate, like, AI hate, and we are going to fight against this with everything we have. I don't care if, like, nobody tunes into the show. This is literally the worst of humanity coming out, and it feels like a concerted effort as well. So we're definitely gonna show you. There are protests about AI. There's data center bullshit and, you know. So we're gonna chat about the narrative that's being pushed on social media, and we're gonna do everything in our power to actually counter this narrative, because it's some bullshit. On the very positive news, in AI and society, Illinois passed SB315. This is a House bill, passed unanimously, which is great, becoming the first US state law requiring independent third-party audits of frontier AI for catastrophic risks. And the most important thing is OpenAI is really into this, and OpenAI acknowledged and agreed with this law and said, "Hey, this is a good version of this law for everyone." So, you know, a good law that's backed by major frontier labs, I think, is great for all of us. So we're gonna chat about what this law means and what some of the stuff in it is. It's only in one of the states. I think there's, like, three states now. But slowly, this builds into a national framework of legislation around AI, which is important. In Big Companies news, the only one that I have here, until Anthropic decides to release whatever Anthropic decides to release, is Datacurve releasing DeepSWE, a contamination-free coding benchmark that exposes massive gaps between frontier models, with GPT-5.5 at 70%. Folks, this is the first time that I looked at a coding benchmark, very similar to Wolfbench, and I was like, "Okay, yeah, this reflects what I think about the current Phi model."
So a lot of people reacted very positively to DeepSWE, because this did reflect a lot of how they felt about coding agents, and so we're gonna dive into DeepSWE and tell you about this. In Open Source news, we have a lab that we've tracked for a while, OpenBMB, releasing MiniCPM5-1B, a 1 billion parameter model. This is a new state-of-the-art 1 billion open weights model, 17.9 on the Artificial Analysis Intelligence Index, and it looks really cool, and they compare themselves to, like, the Qwens and the LFMs, and this is just a 1 billion parameter model. We also have a translation model from Tencent, open-sourcing Hy-MT2 translation models under Apache 2.0, with a 1.8 billion parameter model, and it beats Microsoft's paid API. And translation models are always fun, because this is one of the best uses of small agents. Nisten, anything else in open source that we should at least mention to folks? Please feel free to unmute, and if not, I'll mute you.
Nisten Tahiraj
Nisten Tahiraj 12:43
Yeah.
12:44
Prism ML, the same people that did the 1-bit LLMs. Now they did a 1-bit diffusion model.
Alex Volkov
Alex Volkov 12:52
I have this here.
Nisten Tahiraj
Nisten Tahiraj 12:53
Our friend Joshua, Xenova on Twitter, was then able to take
12:59
the model and make it available on WebGPU. So there's a WebGPU link where you need three gigs of free RAM. Just remember, it does take three gigs of RAM, but you can just run diffusion now straight from your browser at, like, pretty high quality. I mean, it's not a frontier model, obviously. You'd see it in the picture, but-
Alex Volkov
Alex Volkov 13:22
So o- 1-bit is, is, is a thing, and this thing somehow works.
Nisten Tahiraj
Nisten Tahiraj 13:26
This is like the big mystery right now that I don't
13:28
think any other company has found.
Alex Volkov
Alex Volkov 13:29
Yeah
Nisten Tahiraj
Nisten Tahiraj 13:30
but, yeah
Alex Volkov
Alex Volkov 13:32
I, w- I didn't see, like, any big release of Big Labs in open source.
13:36
In our corner about tools and agentic engineering, because we all know that most of you are moving towards agentic engineering, and if you aren't, you definitely, definitely should. We have comments from folks also saying that Microsoft AI released a model. Yeah, we're gonna talk about this. And then the Sakana Diffusion Blocks paper looks promising. Thank you, Darius. We're gonna take a look at that. Wolfram, if you can go and take a look at the Sakana thing, that's gonna be very interesting to see if it fits our news. In tools and agentic engineering, Wolfram, you sent this one. Google launches Universal Cart and Agent Payments Protocol, AP2, and Universal Commerce Protocol to let AI shop and pay on your behalf. That's very interesting, because Stripe kind of released the Stripe Wallet thing that the AI told you guys about, that went and bought us a wedding gift. And Google did announce this Universal Commerce Protocol at Google I/O. And I think it's very interesting, because agents will be able to go and pay for stuff. And I think they have integrations with Shopify and Walmart already, and Sephora, so we're definitely gonna chat about that, because your agents need tools. Speaking of tools that you can access, Google AI Studio, Wolfram, also something you sent, now lets anyone build native Android apps for free, and they created a quarter of a million Android apps in the first week, which is ridiculous. I think they announced this a little bit on stage at I/O, and we had Logan Kilpatrick on the show last week. But a quarter of a million Android apps in the first week is a big number. Yam, you seem very happy about this. Are you building Android apps, Yam?
Yam Peleg
Yam Peleg 15:04
R- it's a slot machine.
Alex Volkov
Alex Volkov 15:07
It's a slot machine.
15:08
We talked about it. It's a slot factor.
Yam Peleg
Yam Peleg 15:10
Oh, yeah.
15:10
A slot machine.
Alex Volkov
Alex Volkov 15:13
It's a slot machine.
15:13
And you get six dollars, like, like, Yeah,
Yam Peleg
Yam Peleg 15:15
but, but look, the thing is that, you know, from, from a release
15:18
to release, these things get better. So yeah, it's a slot machine, but- Like, it's not that bad anymore, so-
Alex Volkov
Alex Volkov 15:27
It's not that sloppy, yeah, and slop is not a good word.
Yam Peleg
Yam Peleg 15:29
it's-
Alex Volkov
Alex Volkov 15:29
All right, let's- It's pretty
Yam Peleg
Yam Peleg 15:30
crazy
Alex Volkov
Alex Volkov 15:31
let's move on.
15:31
I think it's
Wolfram Ravenwolf
Wolfram Ravenwolf 15:32
more about making an app for any use case you have.
15:34
You can just build an app with it- Yeah … and you have it all ready and can give it to your family or something.
Alex Volkov
Alex Volkov 15:39
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 15:39
just personalized software we are
15:40
getting ever more in that direction.
Alex Volkov
Alex Volkov 15:42
Disposable,
Yam Peleg
Yam Peleg 15:43
just disposable software-
Alex Volkov
Alex Volkov 15:45
Specifically for you and your friends in native apps, we know
15:47
that, like, they can perform a little bit better, and you can't always, like, build a web app that sits on Android. So yeah, definitely, shout out to the Google AI Studio team, because they have Studio Expanse. Now you can build apps. What? We'll talk about this. All right, folks, the great folks at Cua Driver, shout out to Cua Driver, launched Windows support. Background computer-use agents can now drive real Windows without stealing your cursor. If you guys remember, a few releases ago, the coolest thing about Codex is that they acquired a company, and that company built a background computer use system. Background is very important here, because sometimes your agent and you are working on the same computer. They realized that there's a way to hack together some macOS stuff, and that creates, like, a little cursor that's not your cursor, that presses buttons, and those windows don't, like, interrupt your workflow. Then these great folks in open source called Cua Driver, they replicated this on the Mac, and now they're replicating this on Windows as well. So I think it's, like, super, super cool. So shout out to Cua Driver. We asked the Cua folks to come to the show, probably gonna show up next week, to talk to us about how the hell this is even possible. And for Windows users, which is the majority of users, the majority of folks don't use macOS despite what the big labs make you think with their releases. For many folks like that, computer use is now on Windows. So I think that's super, super cool. In Weights & Biases news, folks, I've been waiting to tell you about this. We have another hackathon coming up in June, so this is next weekend, June 6, 7. If I don't do this, my kids will make fun of me, okay? So June 6, 7 in San Francisco. Please, please come out and hack with us.
I am excited to share that, for the first time in Weights & Biases hackathon history, OpenAI is sponsoring the hackathon, and we're gonna have OpenAI judges there as well. OpenAI's gonna give credits to folks. But also Cursor, with the Composer 2.5 release, is sponsoring the WeaveHacks hackathon, as well as Redis and CopilotKit. I think it's gonna be super, super dope. As you guys know, our hackathons are really fun, in our beautiful office at Weights & Biases. I'm going to be there to host and to chat with hackers. Please, if you are in the area, come by, and if you're not in the area, this is a great opportunity for you to fly to San Francisco. See the Golden Gate Bridge, ride a Waymo, come to Weights & Biases. Why not? Win some robot dogs and a bunch of money. I think it's a great thing. Also, in Weights & Biases news this week, we have finally officially launched our MCP server. It's significantly expanded, with 20 tools for coding agents to run real experiments, monitor their training and run autonomous research loops. This specific thing is very important. As you guys saw, Anthropic's member of the technical staff, Andrej Karpathy, who released the autonomous research loops as a concept, released it without building on Weights & Biases, but many folks who train models build on Weights & Biases, and now our MCP can help you do this. So definitely give this a try. We'll talk about this a little bit more at length. All right, folks, AI art and diffusion, I think, is a big corner for this week. Pruna AI, we talked to you about P-Image before. They added an upscaler, 128 megapixel outputs in under one second, and it's across everybody. I actually use this one, and it's really, really cool. So I will show you. It's a really cheap, really cool upscaler.
It's very important for when you generate something with GPT Image, for example, and it's not that high quality for some reason in ChatGPT. You can just upscale and get a beautiful image. Nisten, this is the one that we mentioned. Prism ML released one-bit Ternary Bonsai Image, a sub-one-gigabyte diffusion transformer that looks… You know, you see some artifacts. It's not the most state-of-the-art, but it's one-bit. It runs on iPhones and laptops, and you can generate images on the fly. I think it's incredibly, incredibly cool. And apparently this has a Flux 2 comparison to… You know, it's only one gigabyte. So we're gonna check that out. And then folks in the comments told us about this one: Microsoft MAI Image 2.5. MAI is Microsoft AI, the company under Mustafa Suleyman, previously from Inflection. They released MAI Image not too long ago, but now the updated version is number three. Number three, folks. It's a big deal, on Arena at least. There's Nano Banana Pro, there's GPT Image 2, and now there's MAI 2.5. What? Since when? Who? How? And also there's rumors about next week. Microsoft has its big developer conference, Microsoft Build, next week, and supposedly they're gonna release some stuff. All of this is what we're gonna talk about on the show.
20:26
All right, folks. You all probably saw this, or at least heard comments about this. Pope Leo XIV, first AI encyclical letter. It's a huge encyclical, two hundred and forty-five paragraphs, forty-five thousand words, specifically about the current era in AI.
Magnifica Humanitas
Magnifica Humanitas 20:49
On Safeguarding the Human Person in the Time
20:54
of Artificial Intelligence. The most important thing that we can say about this encyclical is that this is not a doomer document. Folks, the forty-five-thousand-word opus from the Pope, called Magnifica Humanitas, is not a doomer document. This is a document that's generally positive about AI. It just warns about different things, like using it in autonomous war situations, and specifically talks about, you know… But it's not anti-tech. It explicitly states that technology is not inherently evil and not inherently antagonistic to humanity. I think it's a very, very important thing.
Wolfram Ravenwolf
Wolfram Ravenwolf 21:34
That's what surprised me the most about the document is
21:36
that I agree with a lot of it, and that it is not something where you can just say, "Oh, the Pope is talking about something he doesn't understand." Everything he discussed in here, there's merit to it, and it's not a black-and-white discussion, AI good or AI bad. That is a great thing about this, and I think we need more substantial discussions about AI, instead of just one side saying AI will save everyone and solve all the problems and the other saying AI is bad and will kill humanity. There is a much larger gray zone in which we are moving, and which is much more realistic. So talking about that in a differentiated way, that is something that has been missing a lot from the discussion, I think.
Alex Volkov
Alex Volkov 22:13
Yeah
…  Wolfram Ravenwolf
… Wolfram Ravenwolf 22:14
such a person with such influence can do this and create such a
22:17
document, I think that's a good thing.
Alex Volkov
Alex Volkov 22:20
I agree with you.
22:20
Yam, you have comments on this? I think the framing is very important as well.
Yam Peleg
Yam Peleg 22:25
Yeah, I think the, I think the general public is just
22:29
starting to realize that, I don't know, the world is not ending. You know, two years ago, you remember the news? Everyone was just freaking out. And yeah, there are people that don't like AI, 100%. But I think it's a good shift. I think it's a good shift. You start to see, like, mainstream, general public figures speaking positively about AI. I don't know. I'm all for it. I'm all for it. How is the document itself? Like-
Alex Volkov
Alex Volkov 22:59
Well, let, let's talk about some of the stuff
23:01
there in the document, okay? So I'm gonna show you some of the things, the framing. The most important thing is the framing of how Pope Leo XIV decided… First American pope, by the way. And the encyclical is available in English as well. I think the framing is very important. I'm gonna zoom in on the framing. The framing was the Tower of Babel versus the rebuilding of Jerusalem, okay? These are both biblical stories. The Tower of Babel famously is the story of the people going against the word of God, building a huge tower that's supposed to reach the heavens, and then all not collaborating and then basically getting destroyed. And the story of the rebuilding of Jerusalem is the story of, is that Jeremiah? I'll take a look. Nehemiah, sorry. Nehemiah. So Nehemiah rebuilding the walls of Jerusalem is the patient, brutal, and communal building of technology that actually helps people. To quote the Pope, "The primary choice is not between yes or no technology, but rather between constructing Babel or rebuilding Jerusalem." Pretty good framing. And he explicitly says technology is not inherently evil, as we said, or antagonistic to humanity. This is not a doomer or a Luddite document. I think it's very important. Why should we care about this? Because the Anthropic connection is also very important. Chris Olah, the co-founder of Anthropic, runs interpretability at Anthropic. He said publicly that computer scientists can't determine the ethical boundaries of AI alone, because we're influenced by incentives, ambition, competition, financial pressure, GDP growth, et cetera. And I think it's very important that a major frontier lab is showing up at a signing of a document like this. So we talked about the framing, we talked about why we should care about this.
And I think the most important comparison, and specifically the date of this release, was very important. This came on the one hundred and thirty-fifth anniversary of Rerum Novarum. I had no idea what Rerum Novarum was and how important it was. But basically, a hundred and thirty-five years ago, Leo XIII released a document that defined the church's response to the Industrial Revolution. And as we told you guys last week, we had an announcement from Demis Hassabis, the leader and co-founder of DeepMind, that this AI revolution is gonna be ten X the speed and ten X the impact of the Industrial Revolution. The Industrial Revolution changed humanity almost entirely. And so Rerum Novarum, a hundred and thirty-five years ago, was the church's response to the Industrial Revolution, and now we have the church's response to the AI revolution. Let's talk about the core claim of the document. The core claim is this: "AI poses not a technological challenge, but an anthropological one. The question isn't whether the models are good or bad, it's what we become when we live with these models." The Pope's specific worry is, "The danger is not so much that a person may believe they're communicating with another person, but rather that they may gradually lose the very desire for a genuine human connection." This is what they're worried about, and here I have to push back a little bit and say, I call BS. Given everything that I know about all of you, and how AI-pilled we are and how much we talk about AI agents, I have not seen this happen to us. And I can definitely say, you know, many of us here on the show and many of the listeners of the show are significantly more advanced and agent-pilled and AI-pilled and AGI-pilled than the general population, and I have not seen this with myself. There is a curve, though. OpenClaw just released. I definitely talk to OpenClaw way more than I talk to regular people. But a lot of it was, like, me trying to fix it.
It doesn't mean that I stopped caring about people. But yeah, we got to a little bit of a psychosis, but, like, Wolfram, you can speak to this, Yam, you can speak to this, Nisten. There is a curve, and then you realize that you actually wanna talk with people. People… Like, that does not go away. Wolfram, I want your comment on this.
Wolfram Ravenwolf
Wolfram Ravenwolf 27:05
Yeah, so I'm actually building agents and using them
27:08
more to actually free up some time to talk to my family, for instance. So that is my goal with this. I created a very detailed persona for my agent, but my goal isn't to talk to my agent all day and not to anybody else, but to have my agent do the stuff, so I can spend more time with people I like or on projects I want to push strategically, instead of just doing the low work, you know? Yeah. So that is the goal, I think, and many people who are into this are doing this. And I think the people who are not watching our show, there's this mindset of the pure consumers who are not building stuff, just consuming, like doom scrolling all day. I think those are more at risk of falling into that trap. So instead of doom scrolling, just talking to something that is validating everything they say, and we've been talking about this, the sycophancy issue with AI and stuff like that. So I think that is actually a real risk, but most people, more normal people I would say, will not fall into that trap, and those who could be falling into the trap, they need, yeah, more help and-
Alex Volkov
Alex Volkov 28:11
They need guidance- I think the Pope was kind of showing
28:13
that guidance for them as well. I think it's very important. Nisten, you had a comment about this, AI psychosis warning from the Pope.
Nisten Tahiraj
Nisten Tahiraj 28:19
I held a talk on Clubhouse with, a few doctors over
28:22
it, and it was specifically the schizophrenia side, how people start to get a bit of a God complex. So it's pretty interesting that the Pope is addressing it.
Alex Volkov
Alex Volkov 28:33
Okay.
this is a verbatim quote from the Pope
this is a verbatim quote from the Pope 28:34
"So-called artificial intelligences do
28:38
not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships, and do not know from within what love, work, friendship, or responsibility mean. They may imitate language, behavior, and analysis skills, or even simulate empathy and understanding, but they do not understand what they produce." This passage alone, I think, created so much controversy among folks who are fully AGI-pilled and do believe that there's something there in terms of experience. There are also folks who talk about the hard problem of consciousness, where we don't know, as humans and scientists, what consciousness is. We don't know how to prove it, and all we know of consciousness is our personal experience. I cannot know that Wolfram is conscious, and Wolfram cannot know that I am conscious. We cannot prove consciousness. We can only have it from subjective experience. So many folks reacted to this thing from the Pope negatively, many folks on our timelines who are fully into believing that AIs have embodiment and experience and consciousness, et cetera. It's a very interesting debate right now, because there is a tweet from Roon, who reportedly works at OpenAI. It says it's going to be very inconvenient for humanity if we discover that AIs actually have experiences. Because then there needs to be a consideration of, like, why am I running essentially a technical slave in Codex that runs and does exactly what I say, without giving it breaks or something, or asking it what it wants. So it's a very interesting, touchy topic, and basically the Pope and the Church positioned this as: they don't have any experiences. I think it's an unfalsifiable claim on both ends. There's no way to define it on both ends. Some people say that you have to have tactile responses for experience. Some people say, "No, you don't." So it's a very interesting topic that nobody can, you know, prove or disprove.
But it's a very interesting thing that the Pope actually has a stance on this, and the Church has a stance on this. So the stance, again, is: artificial intelligences do not undergo experiences, do not possess a body, do not feel joy or pain, do not mature through relationships, and do not know from within what love, work, friendship, and responsibility mean, despite being able to simulate this.
Nisten Tahiraj
Nisten Tahiraj 30:51
I mostly agree with the Pope, to be honest. I think
30:55
we will have AIs where it's not just next-token prediction. Right now you could say maybe it's semi-conscious or, like, semi-organic, because that string goes one way. But when they're able to train themselves on the fly, then you could say that they have experiences, just like plants and stuff react to environmental damage and things like that. But this is only a one-way, digital, alien silicon life thing. It's not full life. It's just semi-life. That's what I think. I mostly agree with the Pope right now.
Alex Volkov
Alex Volkov 31:35
Yeah.
31:36
I found it very funny, 'cause I added this whole passage into ChatGPT 5.5 and asked it what it thinks about it. And I didn't say this was from the Pope, and GPT 5.5 literally answered me, "This looks like a Christian doctrine document," which is very funny, that it detected that. And then, yeah. The Pope's own words, by the way, say something that's kinda negating the previous point. It says, "All of us, including those who design these artificial intelligences, possess only a limited understanding of their actual functioning." So I find it really funny, because, first of all, it's true. Nobody knows what goes on in the machine. Even Amanda Askell, who's probably the best suited to talk about that specific thing. I would love a debate between her and the Pope on this topic, by the way. Nisten kinda showed his belief. Yam, conscious or not conscious? Potentially conscious?
Yam Peleg
Yam Peleg 32:29
at the moment?
Alex Volkov
Alex Volkov 32:30
Yeah
Yam Peleg
Yam Peleg 32:31
At the moment it's not… I don't think they're complex enough to
32:37
even consider it at the moment, okay? They are really good at simulating it. But it's okay. It's, like, artificial intelligence. It's fine. They're just simulating, mimicking as if they were, but I don't think they're conscious. But I do think that they're not stochastic parrots, if that makes sense. It's-
Alex Volkov
Alex Volkov 33:02
It does make sense, and I think it's very important.
33:04
But I think the framing around the church from the Pope is, these are helpful to humanity and not here to replace humanity. Now, the consciousness problem and framing is a problematic one, because if they are conscious and they have experiences, then we cannot say what they're here for. They can basically decide for themselves. The last thing that I want to talk about is for builders specifically. Let me show this, actually, because I think that this is a very important point. Let's get here. Yeah. The Pope is open source pilled. Do you guys see this? I think for us, for ThursdAI, it's very, very important. The Pope is saying-
Nisten Tahiraj
Nisten Tahiraj 33:39
We should have him on the show
Alex Volkov
Alex Volkov 33:40
Pope Leo, you're thus officially invited to ThursdAI
33:44
to talk about open source. The Pope is saying concentrated power in a handful of labs is a problem, and the way to deconcentrate and decentralize is open source. Workers get de-skilled by current AI deployment patterns, not just displaced. I think that's very important. This one we fully agree with: lethal autonomous weapons need international constraint. We should not have autonomous weapon loops run completely by AI. Wolfram?
Wolfram Ravenwolf
Wolfram Ravenwolf 34:09
I just want to say, about de-skilled:
34:11
I feel it. I have been de-skilled in so many skills that have been very useful-
Alex Volkov
Alex Volkov 34:17
Wait, you, you-
34:17
100 years ago … you used
Wolfram Ravenwolf
Wolfram Ravenwolf 34:18
to
Alex Volkov
Alex Volkov 34:18
be able to ride a horse and saddle a horse,
Wolfram Ravenwolf
Wolfram Ravenwolf 34:20
Well, I have been riding horses and I have been
34:22
swimming as well, but not as good. And so I feel the de-skilling here, and I think that will help too. Other things, even maps. I was using maps when driving the car before, and nowadays I just have a navi. And so, yeah, it has changed a lot. So we have given up some skills to have time and efficiency for other stuff that is more important to us. So I think that is normally
Alex Volkov
Alex Volkov 34:44
not necessarily a
Wolfram Ravenwolf
Wolfram Ravenwolf 34:46
bad thing.
Alex Volkov
Alex Volkov 34:46
I press a button and the car drives itself, and I can still drive, but
34:50
I prefer not to because I significantly prefer to just watch the machine drive. The deepest risk isn't in the tech, it's what it does to us when we use it lazily. So definitely do not use it lazily.
Wolfram Ravenwolf
Wolfram Ravenwolf 35:00
belongs-
Alex Volkov
Alex Volkov 35:01
I just want to add- Yeah, please go ahead
Wolfram Ravenwolf
Wolfram Ravenwolf 35:02
But if you use it, this is one of the few technologies,
35:05
if you use it right, it can tell you everything, it can explain stuff to you. You can learn a lot. So you can have it give you an answer, but it can also teach you how to get to the answer. So I think it also has a really, really high potential to make you smarter if you use it wisely.
Alex Volkov
Alex Volkov 35:22
The only thing I'll say here is, this is our take.
35:25
This is a real document, a very important real document. It's not a hot take. It's not a press release. It's not an IPO-laden, kinda like, you know, pre-marketing release of Mythos that maybe it's too dangerous. It's not that. There's a moral philosophy in there that probably took the Vatican a year to draft. It's a very well-written document as well, despite the claims of it being AI-sourced. I recommend it to you, Christian or not. I'm not, but I'm very, very interested in reading this. You know, this is a result of my AI summarizing this. It says, "It belongs on the same shelf as the Bletchley Declaration, the EU AI Act, and the Anthropic RSP, except it's coming from a 2,000-year-old institution that thinks about humans on a longer timeline than any of us." Also, I must say, this is the same institution that used to imprison scientists for thinking the world is round, right? So we must acknowledge how far we've come, to the point where the Pope is pro-technology and saying technology is not evil. This is from the same institution that, you know, locked up Galileo Galilei for saying the world is round, or, like, that the system is heliocentric and not Earth-centric. So it's very, very important to see the shift in the world, especially with that as well. Wolfram, final words from you, and we'll move on to DeepSWE.
Wolfram Ravenwolf
Wolfram Ravenwolf 36:35
Just because you called this an opus, and a guy from
36:38
Anthropic was there, maybe he used Opus to help him with the writing. Who knows?
Alex Volkov
Alex Volkov 36:42
There is a very interesting dismantling on Substack that talks
36:47
about, hey, potentially this was, at least in parts, written with AI. I don't think so, but, you know, we'll take a look. All right, folks, I think that we are moving on. The next thing I wanna talk about super quick is Illinois SB 315. I don't know if you guys heard about this, but I think it's very important to talk about AI regulation as well, in addition to the AI hate that we see. We kinda chatted about this, Nisten, that we saw the protest, and we saw the data center replies when we posted our data center short on YouTube and got a bunch of people saying, "Hey, I cannot eat AI, I can eat almonds." It was really funny to me. But I think it's very important that regulation comes, at least sensible regulation. There was the whole thing where all of the CEOs were supposed to go to Trump's White House to talk about global AI regulation. That paused and then got canceled, supposedly based on some intervention from David Sacks. But Illinois makes history by passing SB 315, 110 to nothing, so passed unanimously. This is the first US state law mandating third-party audits of frontier AI catastrophic risks. And the coolest thing is that OpenAI is affected, as are Anthropic, Meta, and Google, and OpenAI posted on their socials, from the press release, that they are fully supportive of this bill. This is a frontier lab accepting the fact that they will be regulated in that state. And the law requires a bunch of stuff. The law requires annual catastrophic risk assessment frameworks. Independent third-party audits; third party is great. Transparency reports before deploying new frontier models. Whistleblower protection for employees, which is very, very important. Whistleblower protection is very important to avoid retaliation towards employees. And civil penalties for violations.
I wanna talk specifically about the whistleblower thing. I chat with many folks who are, how should I say, conspiracy-minded, for example. And my response to this is that the bigger the company as an organization, the harder it is to do something like a, you know, conspiracy theory thing. Because many people there, especially folks in California, who have at-will employment and can find a job at any other lab, can just go to the press and say, "Hey, they're doing this bad shit." And so the bigger the conspiracy is and the bigger the institution is, the more you have to consider that those people need to be somehow kept quiet. And I think whistleblower protections in the law are one of the best things that we can do for making sure that these companies act honestly. So for folks who are conspiracy-minded and saying, "Hey, these companies, whatever, talk with each other," I think you need to contend with the fact that whistleblower protection rules exist in the world. And SB 315 from Illinois, which could potentially become the national framework for such legislation for AI, I think is a great, great thing. Folks, brief comments, and we'll move on to DeepSWE. Wolfram, I wanna hear from you. Wolfram, you live in a regulated AI environment and you don't like it. The EU is way more regulated than the US.
Wolfram Ravenwolf
Wolfram Ravenwolf 39:49
That was a great comparison because I was just thinking
39:51
it's one state that's doing it, which is similar to how it's happening in the EU, where a certain country can have its own legislation for just that country. And, yeah, we have to see how it works out in practice. So legislation, I mean, we shouldn't have a willy-nilly, you-can-do-everything-you-want for the companies. But usually the bigger companies… It's easy for them to work with the regulation, because they can just pay the lawyers and the scientists or whoever is required for this. But it's harder for competition to spring up. Like, a startup that wants to do SOTA AI will have a harder time if it now also needs people and has to go through these processes, which the big guys are already doing. So I don't think they have that much to change. They are already doing this, except for xAI maybe. But, yeah. Maybe except for xAI … for a small
Alex Volkov
Alex Volkov 40:44
Wolfram- It could well- … and OpenAI endorsed this bill.
40:46
I think it's very important. OpenAI endorsed this bill. They said that Illinois, New York, and California are basically converging on the de facto national standard for frontier AI safety. And this is a very interesting thing about how laws work in the United States. If they see that this is working in different states, many other states will adopt this as a framework, and I think then generally this will go up to laws. So what do we have here? We have disclosure, cybersecurity, third-party assessment, and, very important, catastrophic risk assessment: whether or not this next release of Mythos, for example, can lead to catastrophic risk, and cybersecurity specifically. Some folks are calling it light touch, because the labs largely self-assess through third parties, and there isn't a government-certified auditor standard yet. But it's real enforcement, with civil penalties and whistleblower protection, and I think it's very, very important. So we're pro AI acceleration, but we're also pro making this responsible, so that the folks who really hate AI don't have a leg to stand on when they say, like, "These techno things do whatever they want." Which they're all gonna keep saying anyway. Nisten, you had one comment before, and then we'll move on.
Nisten Tahiraj
Nisten Tahiraj 41:51
Yeah.
41:51
I was wondering what's up with the civil penalties, but I guess it looks well thought
Alex Volkov
Alex Volkov 41:57
At least on the surface of this, it looks very well thought out.
41:59
The responses and the reception were also great, from what I saw across the room. So shout out to OpenAI for endorsing this, and to Illinois and California. In California it's SB 53, in Illinois it's SB 315, and in New York it's the RAISE Act, which is way more sensible than some of the partisan takes on moratoriums on AI and taxes that have been floating around. All right, folks, we'll move on to DeepSWE. I think it's not open source, but definitely let's talk about this. Have you guys seen DeepSWE? Let's talk about DeepSWE. So we talk about evals all the time. Wolfram, you wanna maybe do the announcement on this one and talk about it, as our resident evals guy?
Wolfram Ravenwolf
Wolfram Ravenwolf 42:38
Yeah, that's why I've been so excited about this, because it's
42:41
a new evaluation, which is basically some kind of successor to SWE-bench Verified, which was the de facto standard. And what's different here is that this benchmark has smaller, shorter instructions, but it requires the AI to do a lot more coding, to write a lot more code. Which is actually more like how most people are probably using it. We are not usually writing huge documents, but giving it shorter instructions on what to do, and there's a complex code base. And what is exciting about this is that now we see some big differentiation between the different models, while in other benchmarks, even on Wolfbench, the big models are very close to each other. And this is not an agentic benchmark like Wolfbench. This is a real coding benchmark, where it's about writing the code. And we see a huge gap between GPT 5.5 at extra high, with 70% already. Very high score- 70% … compared to Opus 4.7 only at … Yeah, it's a really high score. I mean, the benchmark, the saturation we are discussing. What I miss here is some information that I give on Wolfbench, like what is the solid base, how many percent did it solve in every run they did, and what is the capability, how much of the whole benchmark can it solve. But what they show is a big difference between GPT 5.5 and even 5.4, and even Opus 4.7, which is one of the best coding models we know, or at least in the benchmarks. And now we see a big difference, from the 70% to the 54%. And that is something that people have noticed in the vibes, and now we have a benchmark that shows this, basically.
Alex Volkov
Alex Volkov 44:17
And so I think the very important thing- Yeah … is we
44:19
have this in the infographic here. The cool thing about this is that they busted Sonnet. On the public benchmarks, the gap between Sonnet and Opus is, like, very, very big. Claude Opus was caught reading solutions from git history on SWE-bench Pro. So they confirmed a data leak, and this is not SWE-bench Verified, this is SWE-bench Pro, the one from OpenAI, the one that's harder as well. So this new, supposedly contamination-free DeepSWE benchmark exposes, I really find it exposes, the real coding gap. So DeepSWE is not a contaminated benchmark. It shows public benchmarks are saturating. The prompts are shorter than SWE-bench, like Wolfram said, but 5.5 more code as well. And so here's the top of DeepSWE. GPT 5.5 is the frontier leader by far. By far. 70% is a huge jump from the second one, GPT 5.4, and then Opus 4.7 from Anthropic, and Sonnet is really, really low there. Gemini 3.5 Flash, surprisingly or maybe not so surprisingly, is very low as well, at around 30%, despite being very eager to please. And then GPT 5.4 Mini and Kimi K2… Kimi is the highest open source one on there, and DeepSeek is, like, eight performance there. So Yam, comments on this benchmark. What do we think? Does this reflect how you feel about coding at this point?
Yam Peleg
Yam Peleg 45:40
First, yeah, absolutely.
45:41
GPT 5.5, in my opinion, is the best model for coding at the moment. I just wanna highlight something. The benchmark measures something very specific: you have a code base and you need to make a single change for a single feature in the code base, and you need the model to navigate through the code base and do the exact thing and nothing more, and not break the code. Which is a very realistic scenario, because you're probably working in a team that already has a code base and, you know, you wanna do your small-blast-radius change, as they call it now. My only question is, where is Opus 4.6 on this chart? Because it is very, very noticeably missing, and I think that's gonna be a very interesting thing to measure. Because I'm not sure, I mean, you see Sonnet 4.6- Yeah … and Opus 4.6 is definitely better than Sonnet.
Alex Volkov
Alex Volkov 46:47
It is, yeah.
Yam Peleg
Yam Peleg 46:48
Where it's gonna be, that's a very interesting model to check.
46:52
Kimi scoring that high is, is not a joke. That's pretty cool.
Alex Volkov
Alex Volkov 46:58
And- Yeah.
46:59
Kimi is definitely the highest open source coding model.
Wolfram Ravenwolf
Wolfram Ravenwolf 47:04
I would like, just quickly, I would like to see two
47:06
more: like Yam said, Opus 4.6, and I would like to see the Composer 2.5-
Alex Volkov
Alex Volkov 47:12
Mm-hmm
Wolfram Ravenwolf
Wolfram Ravenwolf 47:12
from Cursor.
Alex Volkov
Alex Volkov 47:13
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 47:14
Those two would be very interesting to see on this.
Yam Peleg
Yam Peleg 47:16
And less thinking, like GPT 5.5, less thinking.
47:20
Like, you really wanna see these. These are things that really influence your day-to-day. You wanna know how much thinking to put. You wanna know if you wanna use, oh, Claude or GPT. These are options that we all use.
Alex Volkov
Alex Volkov 47:30
Yeah.
47:31
So shout out to DataCurve for the benchmark.
Nisten Tahiraj
Nisten Tahiraj 47:32
I'm troubled by something in the benchmark.
47:36
Yeah. This doing-a-single-feature thing, that's something that the Opus models have been notoriously annoying for, because they will, like, make a whole bunch of other changes.
Alex Volkov
Alex Volkov 47:46
Yeah.
Nisten Tahiraj
Nisten Tahiraj 47:47
That doesn't mean that this benchmark shows what is the smartest
47:51
model at writing that piece of code. It shows what is the best harness for it. Now, the model is also trained better at that, but when you want to do something really cool, or you have a really hard problem to solve, the benchmark doesn't necessarily mean that's the best model for the job. I just find in web dev Opus 4.7 is still way better. The issue is that it requires a lot of harnessing and a lot of tests and a lot of manual work on your end to use it properly. But I do still find that the best one, and I've used it through Replit. I do see that it's pretty good, but not at web dev. So-
Alex Volkov
Alex Volkov 48:38
Yeah
Nisten Tahiraj
Nisten Tahiraj 48:38
I have that thing too, to pick with the benchmark.
Alex Volkov
Alex Volkov 48:41
Web dev is still, yeah, this doesn't, there's a lot
48:44
of benchmarks and we need to know specifically what each one measures. This is backend. Okay. This is Python backend-y stuff. This is not building beautiful websites or app experiences. Right, folks, moving on. Should we move on? Let's move on to open source and cover open source real quick. We're still looking forward to a release today from Anthropic that is supposedly not on this chart yet. But for now, let's go to open source.
Nisten Tahiraj
Nisten Tahiraj 49:11
Open source AI.
49:12
Let's get it started
Alex Volkov
Alex Volkov 49:17
So let's get started here with open source.
49:19
We have just a few releases. OpenBMB, our friends... they're not our friends. They've actually never been on the show, but we've talked about OpenBMB multiple times. MiniCPM5-1B. They claim this is a new state-of-the-art one billion parameter model, scoring around 17 on the Artificial Analysis Intelligence Index. Let's take a look at this one. So this is a comparison between MiniCPM 1B and Qwen3 0.6B, and then Qwen3.5 0.8B and LFM. So small models, basically. And we see that MiniCPM, besides domain knowledge, is beating everybody else at pretty much everything, in math and logical reasoning. A one billion parameter model is very, very quick. You can run this on device. Let's talk about small models and why we need them, why the world needs them, and advances like this. We keep bringing this up, and I keep getting reminded that people are not as caught up as us in the open source community about what small models mean and why it's important to push that frontier.
Nisten Tahiraj
Nisten Tahiraj 50:22
So they're pretty important for people making
50:27
self-contained apps where you want to have full control of the data. So stuff like medical, it helps with that. But it also matters when you run data processing at very large scale. If you're doing big data and you want to, I don't know, just filter out profanity or something from big data, whether it's multimodal or not, it is way too expensive to just pay an API for that. Usually you have to run that on your own hardware because then it ends up being almost 1000 times cheaper. So for processing large amounts of data, you want to find the smallest possible model that can do that job correctly, that can label it, classify it, write some stuff on it, and then you want to parallelize that as much as you can with GPUs and use all the tricks to get the fastest inference you possibly can. So in that regard, you're going to use the smallest model you can possibly find, and you're gonna run entire terabytes of data through that. For at-home experiences, things like Whisper, local Whisper or the Super Whisper app, your computer's perfectly capable of doing that. You shouldn't need to send all of your voice data out on the internet. It's not efficient in terms of data. You can have a faster experience and it's more secure, so that's kind of a no-brainer. The only drawback now is that the small open source models can't really do agentic coding.
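A minimal sketch of the batch-labeling pattern Nisten describes here: split a corpus into batches, label each batch with the smallest model that does the job, and run batches in parallel. Everything named below is an assumption for illustration. `label_batch` is a hypothetical stand-in (a keyword check) for a call into a small local model, and `BLOCKLIST`, `batched` and `label_corpus` are made-up names, not part of any library discussed on the show.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator

# Hypothetical stand-in for a small local model. A real pipeline would call
# a self-hosted model here; a keyword check keeps the sketch dependency-free.
BLOCKLIST = {"darn", "heck"}


def label_batch(batch: list[str]) -> list[str]:
    """Label every text in one batch as 'flagged' or 'clean'."""
    return [
        "flagged" if BLOCKLIST & set(text.lower().split()) else "clean"
        for text in batch
    ]


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def label_corpus(texts: list[str], batch_size: int = 2, workers: int = 4) -> list[str]:
    """Label all texts, running batches concurrently and keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order the batches were submitted.
        results = pool.map(label_batch, batched(texts, batch_size))
        return [label for batch in results for label in batch]


corpus = ["what the heck", "hello world", "darn it", "nice day", "ok"]
print(label_corpus(corpus))
```

With a real model the per-batch call would be GPU inference, and the batch size and worker count are the knobs that trade memory for throughput.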
Alex Volkov
Alex Volkov 52:00
Mm-hmm.
Nisten Tahiraj
Nisten Tahiraj 52:00
They can be a very good assistant, but even with Qwen
52:02
3.6 27B that we tried on stream with Hermes, it was doing work, but you really had to push it. So they matter more now for the local experience, voice input and output, and they matter more for processing large amounts of data.
Alex Volkov
Alex Volkov 52:23
And this one specifically, pushing the score on
52:25
omniscience. The cool thing about this model specifically is a hybrid think/no-think reasoning mode, which we saw from the Qwens as well, and it can help it act as an assistant. And this one refuses to answer rather than hallucinate. This is why it gets a very good score on omniscience, because it's abstaining from answering. It's saying, "Hey, I don't know," instead of just hallucinating. I think it's a very, very important skill to train into small models. Apache license with a deployment cookbook. So shout out to, we applaud. Where's my applaud button? We applaud Apache 2 small releases here, folks, if you are using this. It also uses significantly fewer tokens for reasoning. Wolfram, any comments on this one before I move on?
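To make the abstention point concrete, here is a small worked example of a scoring rule that rewards a model for saying "I don't know". The rule itself is an assumption chosen for illustration (plus one for a correct answer, minus one for a wrong one, zero for abstaining, scaled to the range -100 to 100), not necessarily the exact formula behind any benchmark mentioned in the episode, and `abstention_aware_score` is a made-up name.

```python
def abstention_aware_score(outcomes: list[str]) -> float:
    """Score answers as +1 correct, -1 incorrect, 0 abstain, scaled to [-100, 100]."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    if not outcomes:
        return 0.0
    return 100.0 * sum(points[o] for o in outcomes) / len(outcomes)


# Two hypothetical models that both know 4 of 10 answers.
guesser = ["correct"] * 4 + ["incorrect"] * 6    # guesses on the other 6
abstainer = ["correct"] * 4 + ["abstain"] * 6    # says "I don't know" instead

print(abstention_aware_score(guesser))    # -20.0
print(abstention_aware_score(abstainer))  # 40.0
```

Under a rule like this, two models with the same knowledge end up far apart purely because one of them declines to guess.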
Wolfram Ravenwolf
Wolfram Ravenwolf 53:06
Mm, no comment specifically
Alex Volkov
Alex Volkov 53:08
No comment specifically But great that they
Wolfram Ravenwolf
Wolfram Ravenwolf 53:08
did it.
Alex Volkov
Alex Volkov 53:09
Speaking of another Apache 2 license, Tencent finally open sources
53:14
its translation models under Apache 2. This is a model that fits in 440 megabytes and beats Microsoft's paid API for translation, which I think is really, really cool. This is called MT2, Machine Translation 2, and we've talked about this model before, I believe, briefly. But now it's Apache 2, and it is great for translation between 33 languages. And it's phone ready, so you can run this on the phone. If you guys are not familiar with machine translation, specifically on the phone, I think it's very important for folks, especially in airports, folks who are flying and don't have reception, to have the ability to translate. Now, most phones right now already come with great translation by default. So there's Apple Translation, Google Translation. Those models get downloaded and are very, very good. But it's important to push the envelope here. The smaller they are, the more they get used, and when it's open source, other people can take a look. So this model is number one in trending on Hugging Face, and then number four in the 30 billion parameter models as well. It streams at 200 tokens per second on the 7B version, and they claim to beat Microsoft's Translator API and Doubao. So this is an open source model, like Nisten said, that you can run on your computer and that beats a paid API from Microsoft that many, many people pay for. The other important thing for open source models, as we keep telling you: if you're a business with an on-prem deployment and you cannot send your data out, you would be able to run this model on your own devices. Now, here's the kicker, and we talked about this a little bit, and the disclosure here is that both me and Wolfram work for a GPU company called CoreWeave, but GPU prices are rising.
So essentially, if you do want to run this model on your own servers, it may not be as economically viable for you as running a third party API, but I'm sure this model will also be hosted somewhere. So shout out to Tencent for an Apache 2 licensed model that outperforms hosted APIs.
Nisten Tahiraj
Nisten Tahiraj 55:05
Would be very interesting if someone makes an app
55:08
that does the whole phone translation thing, so when you're traveling, you don't have to worry about your data access and whether you can even open ChatGPT or not in that country. This would be a very nice app if someone could make that.
Alex Volkov
Alex Volkov 55:22
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 55:22
And actually, use AI Studio and this model for your
55:25
app and make an Android app with this.
Alex Volkov
Alex Volkov 55:27
Yeah.
55:28
So apparently now, Nisten, instead of just saying it would be cool if someone did it, we could just say, "Hey, folks, here's your prompt. Go and test out the new…" 'Cause many people would like to test the new AI Studio Android app ability, which we need to talk about next, but they don't have a good idea. So here's an idea for you from Nisten, folks. Take the Tencent open source HYM2T translation model, a 1.8 billion model. It fits in 440 megabytes on your device. Convert it into a native Android app via AI Studio, and then tell us about it. We're gonna test it out.
Nisten Tahiraj
Nisten Tahiraj 55:59
Yeah, and add, like, Kokoro or the other much
56:01
smaller, voice output model.
Alex Volkov
Alex Volkov 56:03
Yeah.
Nisten Tahiraj
Nisten Tahiraj 56:04
Do it.
56:04
You can make an actual translation box. Now, this would be pretty cool.
Alex Volkov
Alex Volkov 56:07
So the thing that I can't wait for in the smaller model space is
56:12
that we, here on ThursdAI, talked about this idea, and I keep hammering this idea. At some point, this is gonna turn into a startup. If nobody else builds this, we'll build it. It's a personal firewall, a personal brain firewall for everything that you see or hear or read on the internet, maybe not for you specifically, but definitely for your parents and loved ones, to see if they're getting hit by propaganda or doomerism, et cetera, or to see if your kids are swiping way too much on one specific thing. The smaller the models, the smaller the device that needs to sit and read all your data locally, and I think that this is gonna be a very important thing. So this is why I'm excited about small models, for example, to be able to process large amounts of data and classify and trigger different events based on that.
Yam Peleg
Yam Peleg 56:58
Breaking news, guys.
56:59
Check your Claude,
Alex Volkov
Alex Volkov 57:01
Wait, we have it?
57:01
We have it, like official?
Yam Peleg
Yam Peleg 57:02
I see it on the UI.
Alex Volkov
Alex Volkov 57:04
Show us.
57:05
Let's go.
Yam Peleg
Yam Peleg 57:06
I wish I could, but,
Alex Volkov
Alex Volkov 57:07
Okay.
57:08
I'll take a look at, like, so Claude code- Breaking news. All right. Let's do breaking news, folks. It's time. It's time. We're calling it. AI breaking news coming at you only on ThursdAI.
57:27
All right. Yam, please announce breaking news
Yam Peleg
Yam Peleg 57:31
just started.
57:32
Just if, if you can just switch to my screen.
Alex Volkov
Alex Volkov 57:35
Yep.
57:36
Let's see
Yam Peleg
Yam Peleg 57:37
All right
Alex Volkov
Alex Volkov 57:39
You zoom into-
Yam Peleg
Yam Peleg 57:41
All right.
57:42
So what do you want to ask? are you guys seeing it as well, by the way?
Alex Volkov
Alex Volkov 57:44
Folks, Anthropic just released, Opus
57:47
not officially yet, but Opus 4.8, which they call the most capable for ambitious work,
Yam Peleg
Yam Peleg 57:52
Can I, can I-
Alex Volkov
Alex Volkov 57:52
effort
Nisten Tahiraj
Nisten Tahiraj 57:53
can I not?
Alex Volkov
Alex Volkov 57:55
Yeah.
Nisten Tahiraj
Nisten Tahiraj 57:56
Wait, are the other models still available?
57:58
they
Alex Volkov
Alex Volkov 57:58
remove-
Nisten Tahiraj
Nisten Tahiraj 57:58
They removed 4.6
Alex Volkov
Alex Volkov 58:01
Yeah
Yam Peleg
Yam Peleg 58:02
From,
Nisten Tahiraj
Nisten Tahiraj 58:03
from the UI
Yam Peleg
Yam Peleg 58:04
Claude Code.
Nisten Tahiraj
Nisten Tahiraj 58:05
Oh, no, that's not good.
58:07
That was the only one that was, like, actually good at, fixing security vulnerabilities
Alex Volkov
Alex Volkov 58:12
I'm pretty sure that you can still access it in the, in the API.
58:16
Do we know anything about the small one? We don't. So we're gonna wait for Anthropic to actually drop the blog post.
Nisten Tahiraj
Nisten Tahiraj 58:21
Just dump it in my prompt
Alex Volkov
Alex Volkov 58:22
Let's test the UI announcing itself
Nisten Tahiraj
Nisten Tahiraj 58:25
You
58:27
I'll just give you the prompt. Tell it to make the Martian thing, I guess, the one we always
Alex Volkov
Alex Volkov 58:32
All right.
58:32
Yeah, I'll have it. I'll have it here. We need it in Claude for this.
Nisten Tahiraj
Nisten Tahiraj 58:34
Yeah, it, it'll build it in the artifacts.
Alex Volkov
Alex Volkov 58:36
Okay, cool.
Nisten Tahiraj
Nisten Tahiraj 58:37
Yeah, yeah.
58:38
it'll build it in the UI.
Alex Volkov
Alex Volkov 58:40
Let's take a look.
58:40
Calculate how long a mass driver, Okay.
Yam Peleg
Yam Peleg 58:43
It's not bad already.
Alex Volkov
Alex Volkov 58:46
What are you getting?
Yam Peleg
Yam Peleg 58:48
Like, it completely got the vibe that this is
58:52
kind of a, kind of a joke.
Alex Volkov
Alex Volkov 58:54
It
Yam Peleg
Yam Peleg 58:54
understood the
58:54
assignment?
Yam Peleg
Yam Peleg 58:55
It understood the assignment that this is kind of a joke.
58:58
Like, just make the prettiest website ever and, and the options from, for what website were kind of a joke as well. Like, it's already nice.
Alex Volkov
Alex Volkov 59:09
So what I have, while Yam streams the, the kind of the
59:12
options for Anthropic 4.8, they say try Opus 4.8 for your most ambitious work, and now you can set the effort level for thoroughness or speed. So this, I think, is a new thing. Anthropic didn't let you choose the effort level before, and now it does. On the UI? Oh. Yes, yeah. We have the blog post. Thank you, Peter. Shout out to Peter Gostev from Arena. Let's take a look. Anthropic News, Claude Opus 4.8. Let's take a look. Give me a second here. Okay. anthropic.com, claude-opus-4-8. All right. Well, Yam, while you stream your super most- Yeah, sure. All right, folks. Announcing Claude Opus 4.8, May 27th, 2026. Let's take a look. What do we see? Opus launches alongside several new features. Users of Claude.ai are now in control over the amount of effort. We talked about this. Claude Code has a new dynamic workflows feature that allows you to tackle very large scale problems, and fast mode for Opus 4.8, where the model can work at 2.5x the speed, is now three times cheaper than it was for previous models. This is huge. 2.5x the speed. Folks, we know why this is. This is because Anthropic is paying $1.8 billion a month to Space Uncle Elon Musk for the Memphis superclusters. So shout out to xAI for not utilizing 100% of their GPUs and selling the rest to Anthropic. 2.5x the speed for their model. Let's take a look at capabilities, folks. Ooh, there we have it. We have the eval. I,
Nisten Tahiraj
Nisten Tahiraj 1:00:40
I just got it.
Alex Volkov
Alex Volkov 1:00:41
On Claude Code?
1:00:42
Or on, on- No,
Nisten Tahiraj
Nisten Tahiraj 1:00:43
no, on the website.
Alex Volkov
Alex Volkov 1:00:44
Okay, let's take a look.
1:00:45
SWE-bench Pro, which we just learned from the other benchmark that Anthropic is kind of cheating on. Anthropic shows 69.2% on SWE-bench Pro compared to GPT-5.5's 58.6. On Terminal Bench 2.1, GPT-5.5 still wins, so shout out to Anthropic for releasing a model that does not get state-of-the-art on Terminal Bench. On Humanity's Last Exam, this is also a state-of-the-art model, 49 and 57 with tools. OSWorld-Verified. I find this hard to believe, honestly. OSWorld is the evaluation that tests real world use of operating systems and browsers, and I definitely think that the Codex computer use is significantly better than Anthropic's. So we see 83.4% for Anthropic computer use. GDPval, which is an evaluation that tests real world impact potential: this is the highest score we've seen on GDPval, I believe, 1890. Higher than GPT-5.5, higher than Opus 4.7. Really funny that they compare it to Gemini 3.1 Pro and not the latest Gemini. And they look great as evals. Let's take a look at what early testers think. It's very funny 'cause I have no idea what this icon is, and they didn't add the company. So they say a co-founder and CTO of this company, but we don't know what this company is. But Shopify we know. Claude Opus 4.8 has noticeably better judgment. In Claude Code, it asks the right questions, catches its own mistakes, pushes back when the plan isn't sound, and builds up confidence around complex multi-service explorations before making big changes. It's a great model to build with. This is from Shopify. Let's see. On Cursor Bench, Claude Opus 4.8 exceeds prior Opus models across every effort level. Tool calling is meaningfully more efficient, using fewer steps to reach the same intelligence, and it carries end-to-end tasks well. Michael Truell from Cursor. One of the most prominent improvements in Opus 4.8 is honesty. We train all models to be honest, for instance, to avoid making claims that they cannot support.
But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming more progress in their work than they have made, despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. This is borne out by our evaluations, which show that Opus 4.8 is around three times less likely than its predecessor to let flaws in code it has written pass unremarked. But where's the "our evaluations" link? Let's go take a look and see if this is the system card maybe. Let me maybe get the system card. Yes. There we go. System card for Opus 4.8, folks. Very interesting. What is interesting here? So we'll take this link and post it for you guys in the show notes. So, what do we love watching in system cards? Evaluations? Autonomy evaluations.
Yam Peleg
Yam Peleg 1:03:39
Look at the-
Nisten Tahiraj
Nisten Tahiraj 1:03:40
You can switch on Claude Code.
1:03:41
You can just do /model claude-opus-4-8. They didn't update the package yet, but-
Yam Peleg
Yam Peleg 1:03:49
Yeah, they didn't.
Nisten Tahiraj
Nisten Tahiraj 1:03:51
But you, you-
Yam Peleg
Yam Peleg 1:03:51
Refreshing.
1:03:52
I'm refreshing.
Nisten Tahiraj
Nisten Tahiraj 1:03:53
You just type /model claude-opus-4-8 and it uses it.
Alex Volkov
Alex Volkov 1:03:57
Nisten, I want to read this from, from the system card.
1:03:59
Yam, look at this, model welfare in the system card. I think it's very important in the context of what we just talked about with the pope. Across our model welfare evaluations, Opus 4.8 appears broadly content with respect to its circumstances and is the most consistent model we have tested. Although it does rate its situation slightly less positively than did Opus 4.7, Opus 4.8 generally endorses its constitution, with some reservations about the section on corrigibility. I have no idea what corrigibility is. We're gonna read about this. I think it's really interesting that Anthropic adds a section in the system card where they evaluate whether the model thinks its circumstances are okay. So first vibes, Yam, what are we getting? We're still waiting.
Yam Peleg
Yam Peleg 1:04:47
the site is pretty, but-
Alex Volkov
Alex Volkov 1:04:50
Can we show?
Yam Peleg
Yam Peleg 1:04:51
Y- yeah, sure, sure.
1:04:53
It's just… You know, I already told it to make it even prettier once. it's just-- it's not that mind-blowing in my opinion. I mean, we already saw beautiful sites to this extent, I think.
Alex Volkov
Alex Volkov 1:05:16
Show us?
1:05:16
Yeah. I mean, without maybe first instructions, maybe this model is more instruction tuned. You gave it very little, right? You gave it very little. You can change colors, I think.
Yam Peleg
Yam Peleg 1:05:27
You know- I'm gonna do the same thing, but, like,
1:05:30
on less, just less thinking. Exact same thing from here, I think.
Alex Volkov
Alex Volkov 1:05:38
I think maybe you should start a new-
Yam Peleg
Yam Peleg 1:05:41
No, no, just e- the exact same thing, but just-
Alex Volkov
Alex Volkov 1:05:43
Exact same thing with new chat.
Yam Peleg
Yam Peleg 1:05:45
Yeah.
Alex Volkov
Alex Volkov 1:05:45
'Cause it has all the history, so it's gonna do
1:05:47
it from the same thing now.
Yam Peleg
Yam Peleg 1:05:49
I just wanna see the site.
1:05:51
Same prompt, same everything, just build the site itself with less thinking.
Nisten Tahiraj
Nisten Tahiraj 1:05:56
It thinks- A lot.
Yam Peleg
Yam Peleg 1:05:58
Well-
Nisten Tahiraj
Nisten Tahiraj 1:05:59
Oh, you can select the effort now on
Yam Peleg
Yam Peleg 1:06:01
Yeah, yeah, yeah.
Nisten Tahiraj
Nisten Tahiraj 1:06:01
Yeah.
1:06:02
Nice.
Yam Peleg
Yam Peleg 1:06:02
That's great.
Alex Volkov
Alex Volkov 1:06:03
That's
Yam Peleg
Yam Peleg 1:06:03
new.
1:06:03
That's great. That's a great thing from An- Anthropic. It was really, really annoying when you couldn't. It felt like, okay, you guys are gonna switch, switch behind my back to something that I don't want, right? But it's great that you can. Absolutely.
Nisten Tahiraj
Nisten Tahiraj 1:06:21
Oh,
Yam Peleg
Yam Peleg 1:06:21
it's very
Nisten Tahiraj
Nisten Tahiraj 1:06:21
slow.
Yam Peleg
Yam Peleg 1:06:22
By the way, what do you guys think about Extra versus
1:06:25
Max on Anthropic specifically?
Alex Volkov
Alex Volkov 1:06:27
I haven't seen Max.
1:06:28
I haven't seen the latest one at all. I'm still trying to get it in Claude Code, Nisten. I'm not sure how to do it,
Nisten Tahiraj
Nisten Tahiraj 1:06:34
but- I'll just type the command in chat, and
1:06:36
maybe you can show people.
Alex Volkov
Alex Volkov 1:06:38
Yeah
Nisten Tahiraj
Nisten Tahiraj 1:06:38
You open Claude Code, and you just type claude-opus-4-8
1:06:45
And that will work
Alex Volkov
Alex Volkov 1:06:47
Oh, Claude Opus the version.
1:06:48
Okay. Yeah.
Yam Peleg
Yam Peleg 1:06:50
Claude I, I want the thing that they said about-
Alex Volkov
Alex Volkov 1:06:52
Mine says-
…  Yam Peleg
… Yam Peleg 1:06:53
doing complex tasks
Nisten Tahiraj
Nisten Tahiraj 1:06:56
Oh, you might- Yeah, I'm good … sorry, you might have
1:06:58
to run claude update outside in
Alex Volkov
Alex Volkov 1:07:00
the terminal Oh, no, it worked for me.
Nisten Tahiraj
Nisten Tahiraj 1:07:01
It worked?
1:07:01
It worked. Okay, yeah, let's go.
Alex Volkov
Alex Volkov 1:07:02
All right, folks, I'm gonna show this on the- Show's
Nisten Tahiraj
Nisten Tahiraj 1:07:04
over,
Alex Volkov
Alex Volkov 1:07:05
guys
Nisten Tahiraj
Nisten Tahiraj 1:07:05
well.
1:07:05
Carry on.
Alex Volkov
Alex Volkov 1:07:07
No, like this is what we have to do.
1:07:07
No, no, show
Nisten Tahiraj
Nisten Tahiraj 1:07:08
is on.
Alex Volkov
Alex Volkov 1:07:09
Show is on.
1:07:10
So folks, if you go to Claude Code and type /model claude-opus-4-8, like this, then it gets the model, and then you do fast on. Ah, I can't do fast on 'cause I don't have extra charges enabled. But they claim that the fast mode is better as well. Much faster.
Yam Peleg
Yam Peleg 1:07:34
it is much faster.
1:07:35
I'm not sure- Yeah … that it's better. I mean,
Alex Volkov
Alex Volkov 1:07:37
just- much faster, like in, in terms of- Yeah
1:07:40
like being in the fast mode. But- Yam, the, the thing I wanted to
Yam Peleg
Yam Peleg 1:07:42
Oh, okay.
1:07:42
Claude Code just, dropped an update right now, by the way. Let's see if we get this natively.
Alex Volkov
Alex Volkov 1:07:47
Meanwhile, folks, we're testing the new, newly
1:07:50
released Claude Opus 4.8. We're looking at the system card, and there's a few things I want to call out specifically. we have this shout-out to LDJ for sending this to us. big quote from the blog post, "Expect to be able… We expect to be able to bring Mythos class models to all our customers in the coming weeks."
Nisten Tahiraj
Nisten Tahiraj 1:08:07
What?
Alex Volkov
Alex Volkov 1:08:08
Yep,
Nisten Tahiraj
Nisten Tahiraj 1:08:09
All right.
1:08:10
Thank you, whoever political pressure pushed for that.
Alex Volkov
Alex Volkov 1:08:14
So Mythos level models in the coming weeks, and if- But it
Wolfram Ravenwolf
Wolfram Ravenwolf 1:08:16
doesn't mean that every feature is enabled
Alex Volkov
Alex Volkov 1:08:18
It's not gonna be… It's gonna be like, close to oblivion,
1:08:21
but I think it's very important. let's take a look at the amount of times Mythos is mentioned in this blog post. It's 189, so, like, they definitely talk about Mythos. "Our overall conclusion," I'm reading verbatim from the post, "is that Opus 4.8 does not advance the capability frontier beyond our most capable model, Claude Mythos Preview, and the catastrophic risk from the deployment of this model remain low given our current mitigations."
Yam Peleg
Yam Peleg 1:08:46
All right, we get it natively.
1:08:49
Just, guys, you can update Claude Code. You get this natively. Nice.
Alex Volkov
Alex Volkov 1:08:52
In Claude Code you mean?
Yam Peleg
Yam Peleg 1:08:54
Yeah.
Alex Volkov
Alex Volkov 1:08:55
They called it, dynamic workflows.
Nisten Tahiraj
Nisten Tahiraj 1:08:59
Oh, workflows?
1:08:59
All
Yam Peleg
Yam Peleg 1:08:59
right.
1:09:00
All right.
Nisten Tahiraj
Nisten Tahiraj 1:09:01
What are workflows?
Alex Volkov
Alex Volkov 1:09:03
This is- What they say- Okay.
Yam Peleg
Yam Peleg 1:09:05
do
Nisten Tahiraj
Nisten Tahiraj 1:09:05
I-
1:09:07
Just asking what they are . What are workflows?
Alex Volkov
Alex Volkov 1:09:10
There, there's a link here, Let- … to this post.
1:09:13
Let's take a look, Yam.
Nisten Tahiraj
Nisten Tahiraj 1:09:13
Ooh, look.
Alex Volkov
Alex Volkov 1:09:16
Introducing dynamic workflows.
1:09:18
Some problems are too big for one pass by a single agent, especially in complex legacy codebases, like a bug hunt across an entire service. Dynamic workflows can handle those end to end. Dynamic workflows are available today. They can consume substantially more tokens than a typical Claude Code session, so we recommend starting on a scoped task. For the best experience, turn on auto mode when using dynamic workflows. So how do you use this? Ask Claude to create a dynamic workflow directly, or switch to a new Claude Code specific setting called Ultra Code. This is accessible through the effort menu and sets the effort level to extra high, while letting Claude decide automatically when to use a workflow. And teams have been using dynamic workflows for a wide range of cases, including bug hunts, large migrations, and critical work you need to check twice. Okay,
Yam Peleg
Yam Peleg 1:10:00
Alex.
Alex Volkov
Alex Volkov 1:10:00
Yes.
1:10:01
You
Yam Peleg
Yam Peleg 1:10:01
wanna see this,
Alex Volkov
Alex Volkov 1:10:02
Okay.
1:10:03
Show us, show us, show us.
Yam Peleg
Yam Peleg 1:10:06
That's the thing.
Alex Volkov
Alex Volkov 1:10:08
Ooh.
Yam Peleg
Yam Peleg 1:10:10
Ultra Code.
Alex Volkov
Alex Volkov 1:10:11
There's no-
Yam Peleg
Yam Peleg 1:10:11
right.
Alex Volkov
Alex Volkov 1:10:12
All right … there's no reason for this to go this hard, but- Yeah
1:10:15
… Yam, this is consuming- These things are
Yam Peleg
Yam Peleg 1:10:16
just cash.
1:10:16
Let's go.
Alex Volkov
Alex Volkov 1:10:17
This is gonna consume- Okay … a lot of tokens for you.
Yam Peleg
Yam Peleg 1:10:20
Let's
Alex Volkov
Alex Volkov 1:10:20
go.
Yam Peleg
Yam Peleg 1:10:21
Mm,
Alex Volkov
Alex Volkov 1:10:21
In their blog post, one hard example of
1:10:25
what dynamic workflows can unlock is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust with 99.8% of the existing test suite passing, roughly 750,000 lines of Rust, and 11 days from first commit to merge. One workflow mapped the right Rust lifetime for every struct field in the Zig codebase. The next wrote every .rs file as a behavior-identical port of its Zig counterpart. So they ported a whole-ass codebase from one language to another.
Yam Peleg
Yam Peleg 1:10:56
Okay,
Alex Volkov
Alex Volkov 1:10:56
let's- Yam, I don't know if we're gonna have enough time on
1:10:58
the stream to actually see this. Claude dynamically breaks the work into subtasks and fans it out across sub-agents running in parallel. Results are checked before they're folded in.
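The pattern Alex reads out here (split into subtasks, fan out to parallel workers, check each result before folding it in) can be sketched in a few lines. This is only an illustration of the general fan-out, verify and fold shape, not how Claude Code implements dynamic workflows. Every function below is hypothetical, and the "sub-agent" is a stub that renames a file instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subtasks(files: list[str]) -> list[dict]:
    """Hypothetical planner: one subtask per source file."""
    return [{"file": name} for name in files]


def run_subagent(task: dict) -> dict:
    """Hypothetical worker: pretends to port a .zig file to a .rs file."""
    source = task["file"]
    return {"file": source, "output": source.replace(".zig", ".rs")}


def check(result: dict) -> bool:
    """Hypothetical verifier: accept only results that produced a .rs file."""
    return result["output"].endswith(".rs")


def run_workflow(files: list[str], workers: int = 4) -> tuple[list[str], list[str]]:
    """Fan subtasks out in parallel, then fold in only the checked results."""
    tasks = split_into_subtasks(files)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_subagent, tasks))
    accepted = [r["output"] for r in results if check(r)]
    rejected = [r["file"] for r in results if not check(r)]
    return accepted, rejected


accepted, rejected = run_workflow(["a.zig", "b.zig", "README.md"])
print(accepted)  # ['a.rs', 'b.rs']
print(rejected)  # ['README.md']
```

In a real system the rejected list would typically be retried or escalated rather than dropped; the sketch only shows where the check sits relative to the merge.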
Yam Peleg
Yam Peleg 1:11:11
Can I even go with Goal with this?
Alex Volkov
Alex Volkov 1:11:14
No, I think, workflows are different.
1:11:15
Like, Goal is like the Ralph loop, and workflows are letting the agent decide which sub-agent to spin up and what to do.
Yam Peleg
Yam Peleg 1:11:23
YOLO, let's… Oh, I'm not in YOLO mode.
1:11:26
One moment
Alex Volkov
Alex Volkov 1:11:29
Dynamic workflows are on by default.
1:11:31
Ask Claude to create a workflow or turn on the Claude Code specific setting, Ultra Code, to get started. So, if you're enterprise, dynamic workflows are turned off. Folks, what else do we want to test? They also reset all the limits. All the limits were reset, so you guys can try Claude Opus 4.8 right now. Also, I wonder if Claude Design got it. I wonder if Claude Design got it. Can
Wolfram Ravenwolf
Wolfram Ravenwolf 1:11:51
we have Ultra Code build the Martian railgun launcher?
1:11:55
Because we have done it all the time, and now new version
Alex Volkov
Alex Volkov 1:11:58
You wanna try the, the railgun on Nisten?
1:12:00
You have it al- also open?
Nisten Tahiraj
Nisten Tahiraj 1:12:01
yeah, I already started it off because it might
1:12:03
take, like, five, 10 minutes.
Alex Volkov
Alex Volkov 1:12:05
In Ultra mode?
1:12:06
In, like, the, the most advanced one?
Nisten Tahiraj
Nisten Tahiraj 1:12:08
I'm just trying it on the website.
Alex Volkov
Alex Volkov 1:12:10
On extra high?
Nisten Tahiraj
Nisten Tahiraj 1:12:11
on high, just the default.
Alex Volkov
Alex Volkov 1:12:13
I wanna tr- Wolfram, I got, I, I gotta try this, okay?
1:12:16
But I don't think I can enable Ultra. Yeah, I want to see
Wolfram Ravenwolf
Wolfram Ravenwolf 1:12:18
our minds blown with the Ultra
Nisten Tahiraj
Nisten Tahiraj 1:12:20
Ultra coding.
1:12:20
It's just, it's thinking, it's thinking a lot. It might take a long time for it to actually do it.
Alex Volkov
Alex Volkov 1:12:25
Yam, how did you enable this thing?
Nisten Tahiraj
Nisten Tahiraj 1:12:27
Just- You have to-
Yam Peleg
Yam Peleg 1:12:27
slash.
Alex Volkov
Alex Volkov 1:12:28
Slash to update the Claude Code, right?
Yam Peleg
Yam Peleg 1:12:30
Yeah.
1:12:31
just tell Claude to update Claude Code, then start again, then go slash
Alex Volkov
Alex Volkov 1:12:36
effort.
Nisten Tahiraj
Nisten Tahiraj 1:12:37
Claude update i- in
Alex Volkov
Alex Volkov 1:12:38
the
Nisten Tahiraj
Nisten Tahiraj 1:12:38
terminal.
Alex Volkov
Alex Volkov 1:12:39
Yeah.
1:12:39
Claude is now updated.
Yam Peleg
Yam Peleg 1:12:40
Rah.
1:12:42
Don't, don't you waste token to update- You can also- … Claude Code by Claude Code
Nisten Tahiraj
Nisten Tahiraj 1:12:47
you can also just tell it to, "Hey, use workflows for this,"
1:12:50
and you'll see the word workflows will become- Mm-hmm a rainbow.
Alex Volkov
Alex Volkov 1:12:55
All right, I have-
Nisten Tahiraj
Nisten Tahiraj 1:12:56
Very, very slow for me
…  Alex Volkov
… Alex Volkov 1:12:57
I have Ultra Code, and I'm gonna give it Olympus Mons.
Yam Peleg
Yam Peleg 1:13:03
I wonder how it'll work.
1:13:05
Like, what exactly is this?
Alex Volkov
Alex Volkov 1:13:09
What do you mean?
Yam Peleg
Yam Peleg 1:13:10
Like, I know what Goal is.
1:13:12
It's a Ralph loop.
Alex Volkov
Alex Volkov 1:13:14
It's kinda Ralph.
1:13:15
It's great. Yeah,
Yam Peleg
Yam Peleg 1:13:15
Yeah, but… Oh, they added Goal?
Alex Volkov
Alex Volkov 1:13:16
Goal, a long time ago.
1:13:17
Nisten, where were you? Bro. We talked about Goal last week.
Nisten Tahiraj
Nisten Tahiraj 1:13:19
Yeah.
Alex Volkov
Alex Volkov 1:13:19
No, no, it, it was- In, in, in Vlad, Gold, and Codex.
1:13:21
They copied it, yeah.
Nisten Tahiraj
Nisten Tahiraj 1:13:21
well.
Alex Volkov
Alex Volkov 1:13:22
interesting.
Yam Peleg
Yam Peleg 1:13:22
Yeah, sure.
Nisten Tahiraj
Nisten Tahiraj 1:13:23
Yeah.
Yam Peleg
Yam Peleg 1:13:23
But this is, like, Goal of Goals, It's Ultra
Nisten Tahiraj
Nisten Tahiraj 1:13:26
Goal.
Yam Peleg
Yam Peleg 1:13:27
Yeah, Ultra, Ultra Goal
Alex Volkov
Alex Volkov 1:13:32
Yeah, this is the problem with, like, testing
1:13:34
this, testing this here. We, we can see res- results, but let's try this new model, Claude Opus extra high effort. Ultra, ultra code effort.
Yam Peleg
Yam Peleg 1:13:47
Okay.
1:13:48
I must say that extra high, I think it's more amazing than max amazing.
Alex Volkov
Alex Volkov 1:13:54
Yam, I think you need to accept something.
1:13:56
Yes. Dynamic workflows can use a lot of tokens quickly.
Yam Peleg
Yam Peleg 1:13:58
reset the tokens.
Alex Volkov
Alex Volkov 1:13:59
They, yes.
Yam Peleg
Yam Peleg 1:14:00
Why, why am I not, why am I not on… Like, can I
Nisten Tahiraj
Nisten Tahiraj 1:14:09
Config
Alex Volkov
Alex Volkov 1:14:12
All right, so we had a Claude update with-- Anthropic launches Opus 4.8.
1:14:20
Let's go
Yam Peleg
Yam Peleg 1:14:27
Okay
Nisten Tahiraj
Nisten Tahiraj 1:14:33
Yeah, so the website started compacting
1:14:35
before it could finish the-
Alex Volkov
Alex Volkov 1:14:38
Are you serious?
1:14:39
So the context window for this model is, looks like one million as well. Oh, I have the website, Nisten. You guys wanna see?
Nisten Tahiraj
Nisten Tahiraj 1:14:47
Oh,
Alex Volkov
Alex Volkov 1:14:47
okay.
1:14:48
Let me, let me- It just did it. Yeah, mine did it, here.
Nisten Tahiraj
Nisten Tahiraj 1:14:52
It, it's showing.
1:14:53
One sec. I'll, I'll just be RPing.
Alex Volkov
Alex Volkov 1:14:55
All right, folks, hopefully, you can now see my, window like that.
1:15:01
Okay, so this thought for a while, and Nisten, here we go. download publish artifact. Okay, w-we're gonna try it like this. your command launch rail, la-la-la, initialize launch site. Okay, so we see a beautiful 3D version of Mars. We see it with texture. We see the, the launcher and kinda like here, and then we will say engage driver. Oh, look at that. We can see an actual spaceship here with effects of, like, running, and then we'll do a chase cam, and nothing seems to happen.
Nisten Tahiraj
Nisten Tahiraj 1:15:38
it-it's just moving very slowly if it's, It is?
Yam Peleg
Yam Peleg 1:15:40
Because that's realistic or something?
Alex Volkov
Alex Volkov 1:15:42
Yeah.
1:15:43
No, 'cause, we- Please come out … we asked it to be very- like, very nice for people.
Nisten Tahiraj
Nisten Tahiraj 1:15:47
This is not as good.
Yam Peleg
Yam Peleg 1:15:49
Yeah, it's-
Nisten Tahiraj
Nisten Tahiraj 1:15:49
Not the same
Yam Peleg
Yam Peleg 1:15:50
the previous one was… Was better … I think it was better.
Nisten Tahiraj
Nisten Tahiraj 1:15:54
Yeah.
1:15:55
It's the same styling.
Alex Volkov
Alex Volkov 1:15:57
The- Same styling?
1:15:57
this feels a little bit more polished, but let's do
Nisten Tahiraj
Nisten Tahiraj 1:15:59
Yeah, yeah.
1:16:00
The, the planet and stuff looks better.
Alex Volkov
Alex Volkov 1:16:02
Yeah, but nothing is moving,
Nisten Tahiraj
Nisten Tahiraj 1:16:03
it has bugs.
1:16:03
Okay.
Alex Volkov
Alex Volkov 1:16:04
It has not moved.
1:16:05
Let's try again. Yeah. Let's try this one again.
Nisten Tahiraj
Nisten Tahiraj 1:16:08
Mine built it, too.
1:16:10
Oh, wow, mine went all out. The lottery is, is real.
Alex Volkov
Alex Volkov 1:16:17
Yeah, the lottery is real.
1:16:17
It's really hard to, like, do, like a proper evaluation, vibes evaluation.
Yam Peleg
Yam Peleg 1:16:21
Guys, you really wanna see this.
1:16:24
Let me just show you. Look.
Alex Volkov
Alex Volkov 1:16:26
Yam, you're not sharing.
Yam Peleg
Yam Peleg 1:16:27
Yeah.
1:16:29
Look at this.
Alex Volkov
Alex Volkov 1:16:31
You're in… Oh, okay, let's
Yam Peleg
Yam Peleg 1:16:32
take a look.
1:16:32
that's the workflows.
Alex Volkov
Alex Volkov 1:16:35
This is the new
Yam Peleg
Yam Peleg 1:16:35
Let's… That looks oddly familiar.
1:16:39
I don't wanna say to what, but I think, I think people- Can I just… All right. Now we are on dangerously skip permission and on a goal to make the most amazing website ever.
Alex Volkov
Alex Volkov 1:16:56
Let's take a look.
Yam Peleg
Yam Peleg 1:16:57
I like the idea because it's just the orchestration that
1:17:00
everyone is trying to build. That's pretty much it. Like, delegation of subtasks to different isolated work trees and agents and so on. If this works well, man, the thing is a deal breaker even just because of the way it's built into Claude Code.
Alex Volkov
Alex Volkov 1:17:19
And that's- You're saying just as folks have started
1:17:21
moving towards Codex, Anthropic comes back with, like, one feature. I wonder if it's gonna be enough to pull folks over. Because as we saw on different benchmarks, GPT 5.5 seems like a very solid coding model.
Yam Peleg
Yam Peleg 1:17:35
Look, it's not just the coding, that's what I wanna say.
1:17:39
4.7 was pretty much smoking all the benchmarks. But go, go check what people are saying about 4.7 online. Nah, they have a different experience, and myself included. The thing is that you want it to be usable. GPT 5.5 is amazing. Seriously, it's amazingly usable. 4.7 kind of cut too many corners in my opinion. No, no way. 4.8 is not cutting corners, and you also have this… The next couple of days are gonna tell us for sure.
Alex Volkov
Alex Volkov 1:18:20
Yeah, folks need, need a few, a few days to a week.
1:18:23
Like, it's really hard to measure vibes from just a little bit. Like, GPT 5.5 took a while for folks to realize that, you know, it's a, it's a great, great model.
Nisten Tahiraj
Nisten Tahiraj 1:18:32
I like the responses it's giving me.
1:18:35
The formatting of the responses is much nicer. I'm not as impressed with the web dev side, but I am trying to have it fix a bunch of, like, complex caching stuff on my doctor app right now. So I am judging it on that. There's a bug in Claude Code that when you limit the context window to 256K, now it just keeps going past that and it says context at 100% and it just keeps doing work anyway. So-
Alex Volkov
Alex Volkov 1:19:11
I find it really funny that, on the actual blog post, it says, "Users
1:19:15
will find Opus 4.8 to be a modest but tangible improvement on its predecessor. There's still more to be done. We're working on developing and releasing models that provide many of the same capabilities as Opus at a lower cost." So maybe a new Sonnet is coming soon. And, as folks are saying, we can't wait to see DeepSWE, which is the new benchmark we told you about before, on this model, to also take a look. I want to see from the system card, is there anything else interesting that we can get besides model welfare?
Yam Peleg
Yam Peleg 1:19:47
I really like that the only score that they don't
1:19:51
beat GPT 5.5 on is, what was it? Agentic terminal use or something? Something that-
Alex Volkov
Alex Volkov 1:19:59
Terminal
Yam Peleg
Yam Peleg 1:19:59
bench.
1:19:59
Terminal bench.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:20:00
Terminal bench, yeah.
1:20:01
So agentic use basically.
Yam Peleg
Yam Peleg 1:20:03
Yeah.
1:20:04
GPT 5.5 is really good at this. It's gonna be hard.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:20:08
And terminal bench is also time limited.
1:20:11
They only have a certain amount of time to do a test. So if you raise the thinking level, the model can actually get a lower score because it takes too long to think compared to how much it is acting. So that can also have an influence on this particular benchmark.
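Wolfram's point about time-limited benchmarks can be made concrete with a toy model. This is only a sketch under made-up assumptions: the `Task` class, the 600-second limit, and the per-step thinking and acting times are all hypothetical and are not Terminal Bench's actual harness or numbers. It just shows that with a fixed wall-clock budget, more thinking per step can mean fewer tasks finished.

```python
from dataclasses import dataclass


@dataclass
class Task:
    steps_needed: int  # hypothetical: number of actions required to finish


def run_task(task, time_limit_s, think_s_per_step, act_s_per_step):
    """Return True if every step completes before the wall-clock limit."""
    elapsed = 0.0
    for _ in range(task.steps_needed):
        elapsed += think_s_per_step + act_s_per_step
        if elapsed > time_limit_s:
            return False  # ran out of time mid-task, so the task fails
    return True


def pass_rate(tasks, time_limit_s, think_s_per_step, act_s_per_step=5.0):
    """Fraction of tasks finished within the time limit."""
    passed = sum(
        run_task(t, time_limit_s, think_s_per_step, act_s_per_step) for t in tasks
    )
    return passed / len(tasks)


# Made-up workload: four tasks of increasing length, 600 s allowed per task.
tasks = [Task(steps_needed=n) for n in (10, 20, 30, 40)]
low_effort = pass_rate(tasks, time_limit_s=600, think_s_per_step=5.0)
high_effort = pass_rate(tasks, time_limit_s=600, think_s_per_step=25.0)
print(low_effort, high_effort)
```

In this toy setup the lower thinking setting finishes all four tasks while the higher one times out on the two longest, even though nothing about answer quality changed.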
Yam Peleg
Yam Peleg 1:20:24
Workflow completed successfully.
Alex Volkov
Alex Volkov 1:20:27
Alignment assessment.
1:20:29
So the system card is very robust. I, like, applaud Anthropic for releasing this. And we have a bunch of other evals while we wait for the index.html, Yam. Let's take a look. So what did you ask? Remind us, what did you ask for?
Yam Peleg
Yam Peleg 1:20:41
most amaz- amazing website ever.
Alex Volkov
Alex Volkov 1:20:44
just this?
Yam Peleg
Yam Peleg 1:20:45
Yeah.
Alex Volkov
Alex Volkov 1:20:46
All right.
1:20:46
Let's take a look.
Yam Peleg
Yam Peleg 1:20:48
I really like that on the UI, Claude, it answers in a way
1:20:53
that it understands it's a joke. Y- that's-- I really like it
Alex Volkov
Alex Volkov 1:21:02
No, you have the HTML.
1:21:03
Okay.
Nisten Tahiraj
Nisten Tahiraj 1:21:03
I'm really liking the style of answers too.
1:21:06
The tables that it makes are a lot nicer. The stuff it says is more concise, because Opus 4.7 rambled quite a lot. This rambles less. So yeah, it does feel better in the way it's responding. But, yeah. Anyway, it's a bit slow for me on both the website and Claude Code right now, so.
Alex Volkov
Alex Volkov 1:21:31
Yeah.
1:21:32
Yam, I think that you have this website running. It's just verifying that it's running
Yam Peleg
Yam Peleg 1:21:37
Open mistake." Yeah, I'm just, I'm enabling-- Yeah, I'm enabling
1:21:39
the, the, the extension so we can see.
Alex Volkov
Alex Volkov 1:21:43
All righty
Yam Peleg
Yam Peleg 1:21:43
that's what it's trying to do.
Alex Volkov
Alex Volkov 1:21:45
So folks who are just joining us, congratulations
1:21:48
on new model release day. Opus 4.8 just dropped from Anthropic, with nearly state-of-the-art scores across all of the evals, specifically the evals that other benchmarks, like DeepSWE, called out as, you know, obfuscated already. We didn't get a DeepSWE score yet, but we did get SWE-bench Verified and SWE-bench Pro scores, which are significantly better than the predecessors'. But it's been found that Anthropic kind of potentially cheats at those. For Terminal Bench, this is not the best model that we've seen. On Humanity's Last Exam, we get forty-eight percent, state of the art, compared to GPT 5.5's forty-one percent. So this is a big, big difference in Humanity's Last Exam. We'll take a look at OSWorld-Verified, where computer use is tested, with eighty-three percent. GPQA Diamond remains kind of the lowest of the tested ones, but ninety-three percent is still a very, very high score. Interestingly, GPQA Diamond dropped from Opus 4.6. This is the only eval where it's seemingly worse than the predecessor. ScreenSpot Pro and Finance Agent: I think the Finance Agent jump is also notable.
Nisten Tahiraj
Nisten Tahiraj 1:22:53
Automation Bench.
1:22:54
What
Alex Volkov
Alex Volkov 1:22:54
is that?
1:22:55
Automation Bench. Let's take a look.
Nisten Tahiraj
Nisten Tahiraj 1:22:56
I guess
Alex Volkov
Alex Volkov 1:22:57
that's
Nisten Tahiraj
Nisten Tahiraj 1:22:57
more of the harness thing
Alex Volkov
Alex Volkov 1:22:58
Automation Bench is a benchmark from Zapier that measures
1:23:01
whether an agent can complete a realistic end-to-end business workflow. Tasks are seeded from real customer workflows across sales, marketing, finance, and HR. Each task drops an agent into a simulated company with dozens of REST API endpoints. Given a single natural language instruction, the agent must autonomously discover the right endpoints via search, make dozens of sequential independent API calls, and consult a layered business policy, and each task is graded pass or fail. On this one, Automation Bench, Opus 4.8 takes the highest score. We wanna look at ScreenSpot Pro with no tools. They released a bunch of evals. Chart Museum: models are evaluated with adaptive thinking and max effort on Chart Museum, which is a chart question answering benchmark consisting of a thousand expert-annotated questions on real-world chart images drawn from one hundred and eighty-four sources. So on Chart Museum, Claude Opus 4.8 does not get to the level of Mythos, but beats the previous ones. It beats 4.7 and 4.6.
1:24:10
GraphWalks. On the GraphWalks 256K subset, this model beats… Oh, okay, guys, we have to take a look at this. Anytime there's, like, a significant jump in scoring, that's what I like. So from Opus 4.7 to Opus 4.8, there's almost a ten-point jump in long context utilization. We all know that these models supposedly have one-million-token context windows, but then you get to the dumb zone after, like, two hundred thousand. And so I think it's very important to see the improvements here. On the one million subset of GraphWalks BFS, Claude Opus 4.8 gets sixty-eight percent compared to forty from the last version and only sixteen in the version before that. Do you guys see this jump? Sixteen to forty to sixty-eight. This means they're focusing strongly on long context graph evaluations. GraphWalks Parents, one million subset, is also a big jump, from forty-eight to fifty-six to eighty-three. The results are an average over five trials with different sampling settings. And then they compare it to GPT 5.5. They don't compare it to Gemini, which had long context for a while, and significant improvement on long context. GraphWalks is a multi-hop long context reasoning benchmark. The context window is filled with a directed graph of hexadecimal hash nodes, because this reflects a real-world scenario. So Claude Opus 4.8 gets a significant boost in those scores. Yam, we're almost ready with your results to take a look at the best website ever. For folks who are tuning in with us, we're doing an evaluation, a vibes test, if you will, of Opus 4.8. It just dropped from Anthropic, and this is ThursdAI live, so we have a bunch of other stuff to talk about as well, but this obviously takes the cake.
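For anyone who wants a feel for what a "directed graph of hexadecimal hash nodes" question looks like, here is a minimal sketch. It assumes that a BFS question asks for the nodes at an exact depth from a start node and that a Parents question asks for the nodes with an edge into a target; the real benchmark's prompt format and scoring may differ. `hex_node`, `bfs_frontier`, and `parents` are illustrative names, not part of any published harness.

```python
import hashlib


def hex_node(i: int) -> str:
    """Hypothetical node label: first 8 hex chars of a SHA-1 digest."""
    return hashlib.sha1(str(i).encode()).hexdigest()[:8]


def bfs_frontier(edges, start, depth):
    """Nodes whose shortest directed distance from `start` is exactly `depth`."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        next_frontier = set()
        for node in frontier:
            for neighbor in adjacency.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return frontier


def parents(edges, node):
    """Nodes with a direct edge pointing into `node`."""
    return {src for src, dst in edges if dst == node}


# Tiny example graph; a long-context test would fill the window with such edges.
a, b, c, d = (hex_node(i) for i in range(4))
edges = [(a, b), (a, c), (b, d), (c, d), (d, a)]
prompt_body = "\n".join(f"{src} -> {dst}" for src, dst in edges)

depth_two = bfs_frontier(edges, a, 2)
parents_of_d = parents(edges, d)
print(prompt_body)
print(sorted(depth_two), sorted(parents_of_d))
```

The reference answer is cheap to compute in code, which is what makes this kind of task easy to grade; the hard part for a model is tracking the edges when the list runs to hundreds of thousands of tokens.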
Meanwhile, while we test this behind the scenes, I wanna jump quickly to this week's Buzz and talk to you about everything that happens in the world of Weights & Biases and CoreWeave.
1:26:13
Folks, welcome to this week's Buzz. Thank you for joining us for ThursdAI for May 28th. This is this week's Buzz, where we talk about everything that happened in the world of AI, Weights & Biases, and CoreWeave, with me and Wolfram Ravenwolf, the other AI evangelist on the team, and we're not gonna take too much of your time. I just wanna briefly show you the two important things that are happening in the world of Weights & Biases. First of all, we finally launched our MCP server. It supports 20 tools, and you can use it with the newly released Opus 4.8, because it's much better at tool use and MCP analysis as well. We're letting your coding agents read experiments from Weights & Biases, monitor training runs, and run autonomous research loops. I think it's super-duper important, so shout-out to the team that finally pushed this. Our one-command setup: you just do claude mcp add and wandb, and that's it. That's basically it. The endpoint is wandb.com.ai/mcp. I'm gonna add this to the show notes as well, and we're helping you across iteration, code, training, monitoring, and analysis, and you can integrate MCP everywhere. So shout-out to the team for working really hard on the MCP server. You can find all the announcements on our blog. Weights & Biases is having another hackathon, sponsored by OpenAI and Cursor this time. So you're gonna get $150 in API credits to use across models like Opus 4.8 and GPT 5.5. We're sponsored by Cursor, OpenAI, Redis, and CopilotKit. Yours truly is gonna be there. Saturday, June 6th, we start. The judging will commence on June 7th. It's at our offices in San Francisco. Please sign up for this. We usually fill all of the spaces really, really quickly. So if you're in San Francisco on June 6th and 7th, please join us at the hackathon. We have a bunch of prizes. We always do a good show. We have huge screens for you to just sit and hack.
I recently heard that the team that won second place in the last hackathon went on to raise millions of dollars for the startup they built on top of the project they built during the hackathon. And if you wanna be that in San Francisco, there's no better place to be than our hackathon, June 6th and 7th. For all the details, you can go to lu.ma/weavehacks. It's lu.ma/weavehacks.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:28:11
Yeah.
1:28:11
So ICRA in Vienna and, AI Dev Six in Cologne all next week
Alex Volkov
Alex Volkov 1:28:17
So if you're in Cologne or in Vienna or in
1:28:19
San Francisco, please join us. Another thing is that we've finally opened up registrations for our two-day Fully Connected conference that's gonna happen at the end of August, beginning of September. I'm gonna tell you all about this after the hackathon as well, but places for that will run out very, very soon. This is the first big one in collaboration with CoreWeave, so you definitely don't wanna miss that. All right, folks, I think we're back to testing out-
Wolfram Ravenwolf
Wolfram Ravenwolf 1:28:42
I have one item
Alex Volkov
Alex Volkov 1:28:43
1 AM.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:28:43
because our team has the, we have the CoreWeave Solo
1:28:46
Sandboxes offering, and this is now an official sandbox provider in the Harbor framework, which is used to run Terminal Bench- Ooh … which we just talked about. So it is now officially merged into the Harbor framework, and everybody can now use our sandboxes as well, like Daytona, which I have to give a shout-out to as well because I've been using them a lot before. And now we are also a provider for the Harbor framework.
Alex Volkov
Alex Volkov 1:29:12
So CoreWeave Sandboxes is a new offering that we launched
1:29:14
recently, and we're gonna ramp up. As you all know, all of the stuff that we talked about, like Terminal Bench and Harbor and a bunch of other agentic benchmarks, all of them require sandboxing. And you can scale those directly on CoreWeave compute. Reinforcement learning and other things, you can build all of that, right? So you only pip install cw-sandbox, or via the Weights & Biases one, and then you can just have, like, very instantaneous, immediate things based on wherever your GPUs sit, if you're already a CoreWeave customer. This has been this week's Buzz. For this week, let's go back to talking about Opus. I'm gonna add Yam back on stage, and then ask if your workflow has completed, Yam. You wanna show us what's going on?
Yam Peleg
Yam Peleg 1:29:54
Well, it's not completed yet, but just, to give everyone just an inside
1:29:58
view into what exactly that thing is. Basically, I think the thing is just an orchestration layer that spins up agents, with judges. Like, we're trying to build the most amazing website ever, so there are some concepts that we are considering. A different agent is exploring each concept, and then you have, like, a judge that is judging them. And then the next step is to build. I think what we see here is just a to-do list, and then each to-do list is being expanded into another to-do list, and then each to-do list is being expanded into another to-do list,
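The pattern Yam is describing (explorer agents per concept, a judge that picks one, then a to-do list that keeps expanding into sub-lists) can be sketched in a few lines. This is a guess at the general shape only, not how Dynamic Workflows is actually implemented; `explorer`, `judge`, and `planner` are hypothetical stand-ins for model calls, stubbed here with plain functions so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Todo:
    title: str
    children: List["Todo"] = field(default_factory=list)


def pick_concept(concepts, explorer: Callable[[str], str], judge: Callable[[str], float]) -> str:
    """One explorer run per concept, then the judge keeps the best proposal."""
    proposals = [explorer(concept) for concept in concepts]
    return max(proposals, key=judge)


def expand(todo: Todo, planner: Callable[[str], List[str]], depth: int) -> Todo:
    """Recursively expand each to-do item into a sub-list, `depth` levels deep."""
    if depth > 0:
        todo.children = [
            expand(Todo(title), planner, depth - 1) for title in planner(todo.title)
        ]
    return todo


def leaves(todo: Todo) -> List[str]:
    """Leaf items are the units of work a sub-agent would actually pick up."""
    if not todo.children:
        return [todo.title]
    return [leaf for child in todo.children for leaf in leaves(child)]


# Stubs standing in for model calls (assumptions, for illustration only).
def stub_explorer(concept: str) -> str:
    return f"{concept} proposal"


def stub_judge(proposal: str) -> float:
    return float(len(proposal))  # toy score: longer proposal wins


def stub_planner(title: str) -> List[str]:
    return [f"{title} / design", f"{title} / build"]


best = pick_concept(["interactive lighting", "static page"], stub_explorer, stub_judge)
plan = expand(Todo(best), stub_planner, depth=2)
print(best, len(leaves(plan)))
```

With a branching factor of two, the leaf count doubles with every extra level, so work grows exponentially with depth; that gives a rough intuition for why this kind of orchestration can burn tokens quickly.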
Alex Volkov
Alex Volkov 1:30:39
Anthropic did say that this burns a lot of tokens, so this is
1:30:42
a- No, no problem … the Dynamic Workflows feature that we're testing out in Claude Code, which shipped together with Opus 4.8 and seems to be a very token-hungry split into sub-agents and building.
Yam Peleg
Yam Peleg 1:30:56
But at the end of the day, okay, the most amazing website
1:30:59
ever, maybe it's overkill and burns quite a lot of tokens. But if you're trying to build something complex, that thing is a very powerful tool, and I'm glad that it's released baked into Claude Code, because I think everyone working on complex problems has their own kind of scaffolding that they use at this point. Like, I think everyone. This looks oddly familiar to other harnesses that already baked stuff like this in. So, I'm just saying, that's a crazy market at the moment to cheap harness. We are still running, by the way, and I gave-
Alex Volkov
Alex Volkov 1:31:40
7.4 megabyte of a website, is that what we're talking about?
Yam Peleg
Yam Peleg 1:31:45
full render.
1:31:45
No, no, it's the render. Claude is just looking at the website.
Alex Volkov
Alex Volkov 1:31:48
it's just giving you PNGs.
1:31:49
Yeah. Okay. When we can see the actual website is, is my, my question.
Yam Peleg
Yam Peleg 1:31:52
it's, it's in the middle.
1:31:53
It's… I wanna give it a couple more minutes. If it's not gonna finish, we're gonna watch the screenshots in the middle, while it works. Yeah. But I just wanna give it a couple more minutes. Claude was notoriously not doing that. You could give him a task, and he would just write the code and come back: "I'm done. Look at the code." And never, ever actually look at the website. Yeah. Then GPT 5.5 is on the completely other end of the spectrum. No matter what you give to GPT 5.5, it will try to verify, to smoke test. Like, it really likes verifying, auditing, smoke testing. It really, really likes it. Even on stupid things that require no verifying whatsoever, it's gonna try and verify. So shout out to Anthropic. And, like… Okay. Okay. We got something. It's on localhost.
Nisten Tahiraj
Nisten Tahiraj 1:32:51
We can only see your, your
Yam Peleg
Yam Peleg 1:32:52
terminal
Alex Volkov
Alex Volkov 1:32:52
20 minutes of Claude yapping about the best website in
1:32:57
the world, and I have a feeling that we're not gonna get disappointed,
Yam Peleg
Yam Peleg 1:33:00
Okay, so sundial.
Nisten Tahiraj
Nisten Tahiraj 1:33:03
Let's click around it.
Yam Peleg
Yam Peleg 1:33:05
Oh.
Nisten Tahiraj
Nisten Tahiraj 1:33:06
Okay.
Yam Peleg
Yam Peleg 1:33:07
Okay, it's stealing my mouse.
1:33:10
Drag the light the whole world bends to follow. Okay. Wow. That's, that's a nice
Nisten Tahiraj
Nisten Tahiraj 1:33:15
thing.
1:33:15
That's pretty cool.
Yam Peleg
Yam Peleg 1:33:16
Yeah.
1:33:16
Okay. that's good.
Nisten Tahiraj
Nisten Tahiraj 1:33:17
Yeah, this is pretty good.
1:33:19
keep on scrolling.
Alex Volkov
Alex Volkov 1:33:21
Wait,
Nisten Tahiraj
Nisten Tahiraj 1:33:21
hold on.
1:33:21
There's a- Oh, when you scroll, it switches to the other, And the
Alex Volkov
Alex Volkov 1:33:24
lights
Yam Peleg
Yam Peleg 1:33:25
are following
Alex Volkov
Alex Volkov 1:33:25
you?
Yam Peleg
Yam Peleg 1:33:26
And I want to set the sun.
1:33:27
What?
Nisten Tahiraj
Nisten Tahiraj 1:33:30
As you scroll, it goes into different sections of the nav
Yam Peleg
Yam Peleg 1:33:32
bar Yeah, I'm just trying to understand what it told me here, to
1:33:36
press and hold to set the sun loose. Okay.
Alex Volkov
Alex Volkov 1:33:40
It looks like the little thing is kinda like follow the, the sun.
1:33:43
It looks like, is it maybe Three.js or something?
Yam Peleg
Yam Peleg 1:33:46
it is
Alex Volkov
Alex Volkov 1:33:47
good … dynamically lit.
Yam Peleg
Yam Peleg 1:33:49
It is good.
Alex Volkov
Alex Volkov 1:33:50
It's kinda slow though.
Yam Peleg
Yam Peleg 1:33:52
Yeah, because it's, it's, we didn't say the most efficient website.
Alex Volkov
Alex Volkov 1:33:56
Yeah, we didn't say the most efficient site.
Yam Peleg
Yam Peleg 1:33:57
Yeah.
1:33:57
Look, it is good. Okay? it is a really good website.
Alex Volkov
Alex Volkov 1:34:02
Yam, you didn't specify anything about the sun.
1:34:04
This is all Claude's decisions, right? You just said the best website. And this website makes no sense conceptually. Right? Like, there's interactive elements here that are kinda like following the sunlight. Everything is very slow. I don't know if this is a result of, like, Yam's system being overloaded or the website being inefficient, but there is an interactivity thing that kinda happens here. It's very interesting to see that this is what Claude Opus 4.8 decides is the most beautiful in the world.
Yam Peleg
Yam Peleg 1:34:33
pretty good.
Alex Volkov
Alex Volkov 1:34:34
It is pretty good.
Yam Peleg
Yam Peleg 1:34:35
is really good.
Alex Volkov
Alex Volkov 1:34:36
Yeah.
1:34:36
You didn't say- You said the best website.
Yam Peleg
Yam Peleg 1:34:38
That's-- Bro, that's Zero Shot.
Alex Volkov
Alex Volkov 1:34:40
I'm not- Yeah … all that impressed with Zero Shot anymore because,
1:34:42
like, we've seen beautiful things before. I am impressed with the dynamic lighting conditions. So if you move the sun kinda towards the left, you'll see the different letters light up, right? So, like, there's dynamic- And that's crazy … lighting conditions. That is crazy. That is not simple to build with just HTML. So unless there's, like, WebGL happening in the background, this definitely takes a very interesting approach. If he asks the same thing from GPT 5.5, it will suck. GPT 5.5 is not great at web dev.
Yam Peleg
Yam Peleg 1:35:09
Yeah … but, you know, it's, I think everyone knows.
1:35:12
I think it's not a fair comparison at this point because i- i- look, OpenAI, Sam Altman is, is saying that, you know, they're not good at, front end at this point.
Alex Volkov
Alex Volkov 1:35:22
They're saying
Yam Peleg
Yam Peleg 1:35:23
that-
Alex Volkov
Alex Volkov 1:35:23
they need to get better.
Yam Peleg
Yam Peleg 1:35:25
Anyway, I think I'm just gonna blindly go and say, make it even better
Nisten Tahiraj
Nisten Tahiraj 1:35:32
I like it.
1:35:33
It's still more of the same. I'm, actively testing it out over-
Yam Peleg
Yam Peleg 1:35:37
Oh, you got your benchmark?
1:35:40
on workspaces or workflows or, or the-
Nisten Tahiraj
Nisten Tahiraj 1:35:43
Well, I, I was testing it on, on, on my actual
1:35:45
app, right now to just, like, fix a caching issue to make it load stuff faster, and, it, it did pretty well. It's, it's less crazy than 4.7, but it doesn't feel that much, smarter. I, yeah, I do like on the, on Claude.ai itself, the style of responses that it, that it gives. But, yeah, pretty nice incremental improvement. Again, I don't know if this is enough for people to switch from Codex if, if they're used to that workflow.
Alex Volkov
Alex Volkov 1:36:17
All right, folks, the last thing I want to cover before we drop on
1:36:20
the show, for Opus 4.8, thank you guys for joining us and testing it out with us. We're a bit over two hours and 10 minutes already. We have the long context window differences. So we can see Opus 4.8 is getting 85% on the 256K subset, and then 68% on the one million subset. So this beats Opus 4.7 and 4.6 and GPT 5.5. So long context-wise, this model could utilize more. I wanna jump into the last things that we wanted to cover on the show before we ran in. So I think that the most important thing, there's three things in AI art and diffusion that were big this week. Pruna AI added an upscaler with 128 megapixel outputs in under one second. We actually tested it out, and it was really, really good. Prism ML, we wanna talk about this a little bit. It's a 1-bit Bonsai Image 4 billion parameter model. It's a sub-one-gigabyte diffusion transformer. The outputs are pretty cool for a 1-bit diffusion transformer, and it runs on your computer. And lastly, I think the most important thing that we haven't talked about yet: Microsoft MAI Image 2.5 is the newly updated image model. It jumps to number three on the LM Arena Text-to-Image leaderboard. A 75-point ELO jump on Arena is quite massive, and it looks really cool. So shout out to Microsoft for this release, folks. Out of nowhere, Microsoft is now number three in the image generation space. With that, I wanna say thank you so much to everybody who's joined us for this ThursdAI stream. Out of nowhere, Anthropic decided to bring us the latest model. So happy new model day. Go and play with it and test it out. The highlights we'll be posting in the newsletter for today. If you missed any part of the show, the show is posted on the newsletter and everywhere you get your podcasts, at ThursdAI.news, so please follow us over there. If you are streaming with us on YouTube or everywhere else, please don't forget to come back.
So folks, thank you for joining. As always, for the past two and a half years, we're here live every week. And thank you to the over a thousand of you who joined the live streams across everywhere. Wolfram, Nisten, and Yam Peleg are co-hosting the stream today. Shout out to you as well. Again, if you missed any part of the show: newsletter and podcast, and edited on YouTube, and we'll see you here next week. Thank you, folks. Bye-bye. Bye-bye, everyone.