Episode Summary

This episode opens with a rare live-breaking OpenAI moment: GPT-5.4 Thinking and 5.4 Pro dropped during the show. The panel then unpacks a volatile week of AI policy and defense controversy, plus major open-source developments from Qwen and StepFun. They also cover GPT-5.3 Instant, Gemini 3.1 Flash-Lite pricing/performance shifts, and practical agent benchmarking insights from Wolfram’s new Wolf Bench framework. The back half turns into live testing and benchmark triage as the team compares GPT-5.4 directly against Opus and Gemini across coding, browsing, and reasoning tasks.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
Weekly co-host · Nous Research
@ldjconfirmed
Wolfram Ravenwolf
Weekly co-host · AI evaluator
@WolframRvnwlf
Ryan Carson
Weekly co-host · AI educator & founder
@ryancarson
Nisten Tahiraj
Weekly co-host · AI operator & builder
@nisten

By The Numbers

ARC-AGI 2 (GPT-5.4 Pro)
83.3%
Alex highlighted this as roughly matching recent frontier reasoning performance.
OS World / computer-use score
75%
Presented in the GPT-5.4 preamble as a major computer-use milestone.
Token usage reduction
47%
Zapier-reported tool-search optimization improvement mentioned in the preamble.
Context window
1M
GPT-5.4 launched with 1 million token context support in Codex workflows.
Gemini 3.1 Flash-Lite speed
360 tokens/sec
Discussed as a fast, efficient model in the same category as instant-tier offerings.
SWE Bench Pro (SWE 1.6)
51%
Cognition’s new SWE model performance cited in the TL;DR tools segment.

🔥 Breaking During The Show

GPT-5.4 Thinking and GPT-5.4 Pro dropped live during ThursdAI
OpenAI released GPT-5.4 mid-show, triggering immediate benchmark review and live coding/vibe tests from the panel.

🔥 GPT 5.4 Preamble

Alex opens with a direct recap of OpenAI’s surprise GPT-5.4 Thinking and 5.4 Pro release, framing it as a meaningful frontier-model update. He emphasizes unified reasoning + coding capability, strong benchmark claims, and live testing on the show.

  • GPT-5.4 Thinking + 5.4 Pro introduced as a breaking frontier release
  • Unified reasoning model positioned as codex-capability fold-in
  • Live test framing set before the main show intro
Alex Volkov
"They dropped a new frontier model called GBT five. Point four thinking and 5.4 Pro."

⚡ Welcome & Introductions

The panel opens the March 5 show with full co-host attendance and sets expectations for a dense, high-signal episode. Alex also acknowledges ongoing world events before transitioning into the agenda.

  • First show in March
  • Full co-host panel introduced
  • Tone set for a heavy AI-news week
Alex Volkov
"Welcome to ThursdAI my name is Alex Volkov."

📰 TL;DR

Alex speed-runs the week: Anthropic vs DoW fallout, Qwen 3.5 small releases, GPT-5.3 Instant, Gemini 3.1 Flash-Lite, SWE 1.6, Wolf Bench, and other tools/news blurbs. The section functions as a roadmap for the deeper discussion.

  • Anthropic/DoW conflict queued as top story
  • Qwen 3.5 small + Junyang context previewed
  • GPT-5.3 Instant and Gemini Flash-Lite positioned as fast-tier battle
Alex Volkov
"This is the TLDR. This is the section on Thursday."

🏢 Anthropic vs Department of War

The panel unpacks the fast-moving Anthropic-DoW saga: rejected requests, supply-chain-risk pressure, OpenAI stepping into defense deployment, and public backlash/optics shifts. They discuss how much is policy posture versus operational reality.

  • Anthropic says no to requests tied to surveillance and kill-chain concerns
  • OpenAI deal announcement triggers backlash and later amendments
  • Discussion includes legal/designation pathways and market implications
Alex Volkov
"Anthropic has said no."

🔓 Qwen 3.5 Small Models & Junyang Departure

The show covers strong Qwen 3.5 small-model performance and practical local-run viability, then pivots to leadership turbulence after Junyang’s departure post. The team frames this as both a technical and ecosystem-level story for open-source momentum.

  • Qwen 3.5 small models discussed as highly usable on consumer hardware
  • Junyang departure sparks major community and internal Alibaba response
  • Open-source continuity remains expected despite org changes
Alex Volkov
"Goodbye. My beloved Qwen."

🛠️ GPT 5.3 Instant

Alex and co-hosts review GPT-5.3 Instant as a free-tier baseline upgrade, with mixed reactions on quality and style. The discussion centers on when low-latency models matter in real systems versus where they still fall short.

  • OpenAI positions Instant as less cringey/more accurate
  • Panel sees improvements but still prefers other models in many workflows
  • Low-latency use cases remain valid (e.g., voice/real-time control)
Alex Volkov
"OpenAI rolls out GPT 5.3 instant."

⚡ Gemini 3.1 Flash-Lite

The team compares Gemini 3.1 Flash-Lite speed/cost dynamics against fast-tier competitors and practical agent needs. They note significant pricing changes versus prior flash-lite pricing and discuss where cheap fast models power orchestration.

  • Gemini 3.1 Flash-Lite presented as fast + 1M context
  • Pricing jump versus prior flash-lite discussed as material
  • Useful for judge/guardrail/orchestration style workloads
Alex Volkov
"Google launched Gemini 3.1 flashlight."

🧪 This Week's Buzz: Wolf Bench

Wolfram introduces Wolf Bench, a multi-metric evaluation framework based on Terminal Bench that emphasizes reliability and variance, not just single average scores. The segment highlights harness effects (Terminal Bench vs Claude Code vs OpenClaw) and reproducible benchmarking setup.

  • Four-metric view: average, best run, ceiling, and consistent floor
  • Harness differences shown as a first-class factor
  • Benchmark cost/transparency details shared publicly
Wolfram Ravenwolf
"One score is not enough."

🔓 Open Source: Step 3.5 Flash

The panel flags StepFun’s Step 3.5 Flash release as unusually open in both model and training-stack terms. They emphasize that continuation pretraining flexibility is a major practical unlock for builders.

  • Step 3.5 Flash highlighted for open training artifacts
  • Apache-2 orientation praised
  • Potential ecosystem impact discussed
Alex Volkov
"StepFun releases step 3.5, flash base."

🔥 BREAKING NEWS: GPT 5.4 Drops Live

Mid-show, OpenAI drops GPT-5.4 live, and the panel pivots immediately into hands-on analysis. They review announcement claims and begin direct testing inside Codex.

  • Live on-air GPT-5.4 announcement
  • Immediate benchmark and UX triage
  • Community reaction spikes in real time
Alex Volkov
"We have breaking news."

🤖 5.4 Benchmarks: OS World, Web Arena, Browse Comp

The panel reviews the newly posted benchmark deltas for GPT-5.4, especially computer-use and browsing tasks. They focus on tool-use efficiency, reasoning-effort curves, and practical improvements over 5.2/5.3 lines.

  • Strong OS World jump versus prior general model
  • Web/browse benchmark leadership claims examined
  • Reasoning-effort ladder interpreted live
LDJ
"Introducing GPT 5.4. That is the title of the blog post that open a I just dropped."

💰 5.4 Pricing & Availability

The team breaks down GPT-5.4 and 5.4 Pro pricing, noting modest output deltas but meaningful input increases and very high Pro output pricing. They also discuss 1M-context usage implications and cost management for eval runs.

  • Input pricing moved materially versus prior generation
  • Pro-tier output pricing flagged as expensive for heavy evals
  • 5.4 available across Codex surfaces first
LDJ
"The pricing... it's about the same for output price... For input price though, it's about 50% more expensive than 5.2."

📰 5.4 System Card & Safety

The conversation moves into system-card details, model variants, and availability behavior across interfaces. They also note real-time steering support and discuss implications for interactive workflows.

  • System card reviewed live
  • Thinking vs Pro distinctions discussed
  • In-flight model steering highlighted
LDJ
"They mentioned the ability to interrupt in ChatGPT while it's thinking."

🛠️ 5.4 Live Vibe Check: Mars Benchmark

Nisten’s Mars mega-structure prompt is used as a live stress test combining math, coding, and visualization. The panel reacts positively to output quality and trajectory realism versus prior runs.

  • One-shot Mars benchmark run in Codex
  • Visual + math quality judged in real time
  • Panel calls it best run of this prompt so far
Nisten
"I think this is the best one so far."

🛠️ 5.4 Live Vibe Check: Website Improvement (GPT vs Opus)

Alex compares GPT-5.4 and Opus behavior on a vague web-improvement prompt to probe practical instruction-following style. The discussion distinguishes benchmark strength from preference for intuitive product judgment under ambiguity.

  • Same prompt run on GPT-5.4 and Opus
  • Differences in interpretive behavior discussed
  • Prompt quality vs model intuition debate surfaced
Alex Volkov
"When we refer to GPT Codex... as autistic, this is what we mean."

🧪 5.4 vs Opus & Gemini: Benchmark Comparison

The hosts inspect side-by-side benchmark snapshots for GPT-5.4, Opus 4.6, and Gemini variants. They note where 5.4 Thinking leads and where Pro-tier data is needed for fair apples-to-apples comparisons.

  • Cross-lab benchmark matrix reviewed live
  • FrontierMath and browsing deltas called out
  • Need for like-for-like deep-think/pro comparisons noted
LDJ
"This is a comparison of Opus 4.6, to Gemini to GPT 5.4."

⚡ Wrap-Up

The episode closes with a concise GPT-5.4 recap and quick takes from the panel on adoption intent. Alex tees up next week’s three-year ThursdAI anniversary and points listeners to the newsletter for remaining items.

  • GPT-5.4 summarized as major general-model jump
  • Panel intent to benchmark and test further
  • Three-year ThursdAI anniversary preview
Alex Volkov
"GPT 5.4 thinking just dropped with 1 million context window support."
TL;DR of all topics covered:

  • Hosts and Guests

  • Big CO LLMs + APIs

    • OpenAI launches GPT-5.4 Thinking and Pro (X, X, X, X)

    • Anthropic, Dept of War and OpenAI walk into a bar

    • Alibaba Qwen departures: Friend of the pod Junyang Lin and Binyuan Hui both depart Qwen (X)

    • OpenAI Rolls Out GPT-5.3 Instant (X)

    • Google launches Gemini 3.1 Flash-Lite (X, Announcement)

  • Evals and Benchmarks

    • MarinLab shows degradation in Opus 4.6 (X)

    • BullShit Bench from Peter Gostev (X)

  • Open Source LLMs

  • Tools & Agentic Engineering

    • Cognition: SWE-1.6 preview (X, Blog)

    • OpenAI launches Codex app on Windows (X)

    • Google released Google Workspace CLI (X)

    • OpenAI released Symphony (GitHub)

  • This week's Buzz

  • AI Art & Diffusion & 3D

Alex Volkov 0:35
The reason I'm coming to you right now is today on the show,
0:38
OpenAI had some breaking news for us that they hadn't had in a while. They dropped a new frontier model called GPT-5.4 Thinking and 5.4 Pro. This is the new generalized model. We haven't seen a generalized model since GPT 5.2, and this model folds in the coding capabilities of GPT 5.3 Codex that we saw launched last month in February. We actually got to test the model out live on the show, and if you don't wanna wait till the end of the show, here's a summary of the most important things. Thanks to Nisten, we did some testing like we always do, and it was really fun to see the model break in real time. This model has taken everything that worked for GPT 5.3 Codex, the coding stuff, and folded it into a unified reasoning model. They're calling it their best model yet, and they always call their models the best one yet, but this one seems to back it up. The headline story for me was Bartos Naski, a Polish mathematician who goes by Nas Red on Twitter. He shared that 5.4 solved a research-level frontier math problem that he had been working on for about 20 years. He called it his personal Move 37. You guys remember Lee Sedol and AlphaGo? This is a big, big moment. He said the solution was very clean and nice and felt almost human. On ARC-AGI 2, the Pro version, the bigger of the two releases today, hit 83.3%, very closely matching the Gemini Deep Think that we saw a few weeks ago. And on OS World and computer use, this model scores 75%, which supposedly beats the human baseline on computer use. GPQA Diamond: 94%. The agent stuff is also interesting: folks from Zapier confirmed that this is the new state of the art for multi-step tool use, and their tool-search optimization cuts token usage by 47%, which is massive for anybody building agents or using OpenClaw. And there's also a really cool feature that I wanted to highlight: mid-thought steering.
We saw this on Codex before, but now even in ChatGPT you can interrupt the model while it's thinking and redirect its reasoning in real time, by confirming something or seeing that it's going down some path you don't want it to go. It's now in the ChatGPT interface, and this is a first for any production AI model that I've seen. It has a 1-million-token context window, $2.50 on the input, and then double that after 272k tokens. So it is 1 million context, the first one we've seen from OpenAI in production since GPT 4.1. It's cheaper than Claude Sonnet 4.6, and reportedly on computer use it beats Claude Opus 4.6. The community sentiment on this is basically: this is the model, and you don't need to choose anything else. As always, after the initial few days of excitement, we see whether that's really the case, but we tested it live and we'll get into the details. This is a significant drop from OpenAI. Let's get into ThursdAI.
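
The tool-search optimization Alex mentions (cutting token usage by only sending relevant tool schemas) can be sketched roughly like this. A toy illustration, not Zapier's or OpenAI's actual implementation; real systems typically rank tools with embedding search rather than keyword overlap, and every name here is hypothetical.

```python
# Toy "tool search": rank tools by keyword overlap with the task and send
# only the top matches to the model, instead of the whole catalog.
def select_tools(task: str, catalog: dict[str, str], top_k: int = 2) -> list[str]:
    task_words = set(task.lower().split())

    def overlap(description: str) -> int:
        # Count shared words between the task and the tool description.
        return len(task_words & set(description.lower().split()))

    ranked = sorted(catalog, key=lambda name: overlap(catalog[name]), reverse=True)
    return ranked[:top_k]

catalog = {
    "send_email": "send an email message to a recipient",
    "create_event": "create a calendar event with a date and time",
    "search_docs": "search documents in a drive folder",
}
print(select_tools("email the report to the team", catalog))
```

Shipping two schemas instead of hundreds per request is where the large token savings for multi-step agents come from.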
Alex Volkov 3:32
What's going on everyone?
3:32
Welcome to ThursdAI! My name is Alex Volkov. I'm an AI evangelist with Weights & Biases. Today is Thursday, March 5th. This is our first show in March, and our last show before our three-year anniversary. Can you believe it? And yet another very full week here in the world of AI. To help me cover this, I have my trusted co-hosts, LDJ and Wolfram and Ryan Carson. Welcome folks to the show. How is your week? There's a lot to talk about.
Ryan Carson 4:05
So much to talk about.
4:05
I'm excited. Good to be here.
Alex Volkov 4:07
Yeah, good to be here.
4:08
LDJ. How are you?
LDJ 4:10
I'm doing Swell.
4:12
There's a lot of exciting, possibly upcoming things very soon.
Alex Volkov 4:15
Yep.
4:16
And, Wolfram, how about you? How are you doing?
Wolfram Ravenwolf 4:18
Amazing weekend.
4:19
Very busy week for me, so I couldn't even keep up with the AI news very much, but that's why we are here, to bring people up to speed on the latest.
Alex Volkov 4:27
Yeah, a hundred percent.
4:29
So this is why we're here. And just a shit ton of news, not only on the AI front but in the world as well. We're not gonna cover that beyond acknowledging it: I can say for myself, and probably for everybody else here, that we hope folks are okay wherever they are, especially civilian folks. Folks, I think it's time for the TL;DR. So let us catch you up. This is the TL;DR, the section where we run through everything we have to talk about in the show, so that you'll be caught up to date, and if you're interested in any part, just stick around with us. Let us catch you up on everything that has happened since last week, since you listened to ThursdAI, because I think a lot has happened. Let's go. TL;DR.
5:22
All right, this is the TL;DR, the section on ThursdAI where I run through everything that happened that was of importance in the world of AI. With you, Alex Volkov, AI evangelist from Weights & Biases, your host. Today we have a full panel of co-hosts: Wolfram Ravenwolf, Yam Peleg, Ryan Carson, Nisten, and LDJ are all here, folks. The biggest story from the last week is obviously where we start: Anthropic and the Department of War. If you don't remember, a brief reminder: Anthropic got an ultimatum from the Department of War, previously the Department of Defense and the Pentagon, to remove some restrictions, and there was a whole back-and-forth. A week ago on the show, we were waiting for Anthropic's answer. Well, Anthropic did answer, and then the Department of War also answered, and then OpenAI came into the mix. So we're gonna cover all of this very soon after the TL;DR, because I think it's very important to talk about. Also from this last week, Alibaba Qwen is back in the news, and they released Qwen 3.5 Small, a series of small models; for the past three weeks, Qwen has been consistent in releasing series of models. But Qwen was not in the news only because of their models. They were in the news because our friend of the pod Junyang Lin posted publicly on Twitter that he has quit the team, causing an uproar of huge proportions, enough so that Alibaba's CEO addressed the whole company, stepping in to maintain Alibaba's open-source commitment. This story is still developing, but it's big in the world of open source and people are talking about it, so we'll catch you up with what we know. Also this week, OpenAI rolls out GPT 5.3 Instant. OpenAI claims it's less cringey and more accurate, and there are some safety trade-offs, et cetera.
But Instant is the model that is served for free to people who don't pay for ChatGPT, so, you know, it's not the best one. Google also launched Gemini 3.1 Flash-Lite, which is kind of in the same category, right? We're seeing these models land in the same category. Gemini 3.1 Flash-Lite is very fast and efficient, at 360 tokens per second, so we're gonna talk about that as well. In open source, there's a bunch of news. Obviously, we already mentioned the Qwen 3.5 Small model series. They have native multimodal capabilities and they rival models 30x their size. It's always fun to see small models, because in the world of open source, the models we actually run on our own hardware are the small ones. We cannot run the 1-trillion-parameter UAN 3.0 that also released this week, a 1-trillion-parameter open-source MoE, but we can run Qwen 3.5 Small in all its variations, and we're gonna talk to you about them. Also, folks from StepFun released Step 3.5 Flash, with the full training codebase, under Apache 2.0, which we always celebrate. They claim it's the most open foundation model released from a Chinese AI lab, because they have all the training code there as well, and Apache 2.0. So that's it, quickly, on open source. In tools and agentic engineering, Cognition released SWE-1.6. Cognition, the company behind Devin and Windsurf, released SWE-1.6, their fine-tune, and it is very, very fast. I think it's powered by Cerebras, at something like 950 tokens per second, achieving 51% on SWE-bench Pro. We talked about this last week: SWE-bench Verified is no longer relevant, and OpenAI isn't gonna report on it; SWE-bench Pro is kind of the new software-engineering benchmark. So Cognition released a preview of their super fast and cheap model as well. And OpenAI launches the Codex app on Windows. In This Week's Buzz, we have an early preview for you.
Wolfram is gonna get into it very, very soon: something we call Wolf Bench, how different AIs perform in different harnesses. This is a very interesting, very up-to-date segment that you should listen to, because we went deep to figure out how to test these models and whether we should trust just one score. I don't think we have much in vision and video.
Nisten 9:10
We did miss something, and it's called BullShit Bench.
Alex Volkov 9:15
Yes.
9:16
Peter Gostev's BullShit Bench. Please tell us about this.
Nisten 9:19
Yeah, it was just too busy.
9:20
So it asks something about a restaurant that wants to change a recipe or something, but then it adds a lot of corporate jargon to the question, saying the fire-code regulations will conflict with our proprietary thing. This is my favorite benchmark, because it shows everything I dislike about the Gemini models: they will always reply with a corporate executive summary regardless of what they're asked.
Alex Volkov 9:52
Yeah.
9:52
So this is BullShit Bench, from Peter Gostev, part of the Arena team now, and a friend of the pod as well. Thank you, Nisten. Wolfram, we had another one that we missed.
Wolfram Ravenwolf 10:00
it's a small thing.
10:01
Maybe it's not even directly AI, but it is relevant for AI, and I think it's interesting that it has been released: Google released the Google Workspace CLI, which is a command-line tool. Yes,
Alex Volkov
Alex Volkov 10:11
yes.
Wolfram Ravenwolf
Wolfram Ravenwolf 10:12
Google Mail, drive, calendar and so on.
10:14
So us OpenClaw people have been using tools like this; Peter Steinberger made the go CLI, and now Google did an official tool. And I hope this is the first of many and an inspiration for others to follow. It's a change in mindset: Google now accepts that agents will be using their tools, all of their tools, and basically supports this. And that is a great thing, and I think there should be more of this, so I see it as a great sign.
Alex Volkov 10:42
I think it's absolutely a great sign. CLI: command-line interface.
10:46
For folks who are listening who don't know what this means: I'm trying to make this approachable. We're at the top of this AI wave, and many people listening are like, I have no idea what they're talking about, so we'll try. Google released a set of tools for your agents to do tasks for you in Google Workspace: read Gmail, read documents. Everybody's super, super happy. For some reason, before this there was not one unified way to do it, despite the fact that Google is everywhere. So now there is, and it's very, very exciting. And there are other things in tools as well. Ryan, I sent you a thing.
Ryan Carson 11:18
Yeah.
11:18
So it's called Symphony, and thank you for sending it to me. Basically, it's an orchestration layer, very similar to the code factory that I'm trying to build, and it's fascinating, and it's created by OpenAI. It's funny you mention it: I'm literally talking to Codex about installing it right now. So we may see all the labs do this. And I was literally in the Slack this morning talking to the OpenAI folks, saying we really need this code-factory orchestration layer built into Codex itself.
Alex Volkov 11:47
I wanna highlight option one in how to install this.
11:51
Tell your favorite coding agent to build Symphony in the programming language of your choice: implement Symphony according to the following spec, plus the link to the spec. So basically this is a new type of software release. They release a spec that you can hand to your agents and say, hey, build this. This is a crazy world we live in, folks. That was the TL;DR. Let's dive into the actual news, because there's a lot to cover, and we could go on and on, but some of this is very, very important. So I think we'll start with the national security update, not the war stuff in Iran, because that's happening and we'll keep it away from the show, despite the fact that our friend here Yam may need to run to a shelter at some point. Last week we told you about Anthropic and the Department of War and that whole thing that's going on, and we were waiting for the answer. And Anthropic has said no. Anthropic replied to the Department of Defense back on Thursday, after our chat, and said: we cannot in good conscience accede to these requests. The requests were, again: do not spy on US citizens, and do not put Claude in the middle of a kill chain without human intervention, for autonomous weapons. Apparently those were the two terms. Now, it gets significantly crazier since then, day by day. The stick behind this was a potential designation of Anthropic as a supply-chain risk, which would mean everybody who does business with the government could not use Anthropic for work touched by that designation. It's a law that has never been used against a US company, and Anthropic says they will challenge it in court. On Friday, US President Trump tweeted that they're, like, super woke and left, and that this is why they don't wanna do this, and that he's asking all of the US government to stop using Anthropic in the next six months.
Friday evening, this escalated with Pete Hegseth, Secretary of Defense, saying he will designate Anthropic as a supply-chain risk so nobody can use them. On Saturday, the US went to war with Iran, and there were reports saying that Claude was used in those attacks via Palantir. So despite the posturing and everything on Friday, it has been used in production, because the government doesn't move as fast as turning something off; they have processes, and Claude is built in everywhere. What else happened? On Friday evening, Sam Altman posted: hey folks, we have reached an agreement with the Department of War, announcing that OpenAI and the Department of War made a deal to deploy OpenAI models instead of Anthropic models, with an agreement in place with the highest restrictions they have ever done. On Friday, at least as far as I saw, everybody was pro-Anthropic and anti-OpenAI, because of the supposed moral stance Anthropic took in the face of the $200 million contracts. That was Friday. And there was so much backlash, at least on Twitter, against OpenAI that hashtags like quit-OpenAI and delete-OpenAI started trending. Multiple people showed that they were moving to Anthropic; it looked like this whole endeavor made Anthropic so much money that losing the $200 million deal with the DoW barely matters to them at all. Like, they made a lot of money because of this. Sam Altman later, I think Saturday, did an AMA on Twitter and acknowledged that the release was definitely rushed and the optics did not look good, and on Monday they amended the deal with the DoW to add surveillance and weapons prohibitions after the backlash. So there was a backlash, and people started leaving OpenAI, deleting and canceling their accounts, et cetera, because supposedly OpenAI acceded. Let's talk about this, folks. What do you think? Yam, I think you have some comments. Go ahead.
Yam Peleg 15:44
Okay.
15:45
So, basically, I wanna talk about, you know, the actual substance. Like, okay, I get all the moral points and so on, but what exactly are they using Claude for, or wanting to use Claude for? And also, to the best of my knowledge, we're talking about Sonnet. I'm not even sure it's Sonnet 4.6; I mean, it's a very outdated model, and
Alex Volkov 16:11
We actually think it's 4.5; LDJ and I looked it up.
Yam Peleg 16:14
What are we even talking about?
16:16
I mean, to the best of my knowledge, Claude is not piloting helicopters and so on. So what exactly is Claude being used for in the war in Iran at the moment?
Alex Volkov 16:32
We have comments saying that the Palantir
16:34
stack uses Anthropic's Claude.
LDJ 16:37
Yeah.
16:37
So according to the Washington Post, and this is me quoting directly from them, it, quote, "suggested hundreds of targets, issued precise location coordinates and prioritized those targets according to importance," according to two of the people, end quote. And it seems like also just general planning: helping expedite the planning process, thinking through different battle scenarios, and basically helping decide what would be the optimal plan for a given scenario.
Alex Volkov 17:08
Yeah.
17:09
Nisten, go ahead. What's your take on this?
Nisten 17:12
Look, at the end of the day, the US Army and Air Force are
17:18
primarily logistics companies, and they were some of the world's best logistics companies; that's what enables them to do their missions. So it's not so much about, I actually think prioritizing targets and stuff is a very minor part of what Sonnet is being used for. There's just an entire outdated software stack there that needs fixing, like all the parts coming in, the software from all the different providers; these are all public complaints that have happened. So how are you gonna fix that? You're gonna use some kind of agentic AI tool. I got a lot of work done with Sonnet 4.5, so if they had a less nerfed version or a less quantized version, that would probably still be a lot better than most of the other tools.
Alex Volkov 18:12
just for removal of doubt.
18:14
Nisten is Canadian; he does not work for the Department of War. The work he refers to is his own work, unrelated to any government stuff.
Nisten 18:22
Yeah.
18:23
These are all public complaints about, like, what the F-35 software stack was and how to manage all the parts. And even the US Army itself has adopted open-source, or, like, internal open-source, philosophies, just to make a lot of the software more compatible with each other. So again, this is a logistics organization that has very good logistics, and they run on a very messy mix of software, as it publicly seems from public complaints. So you do need to fix that software, and you need a very good AI model, and honestly, I think Sonnet is pretty good for that.
Alex Volkov 19:03
Yeah.
Ryan Carson 19:03
I just think this is all a bit of posturing; it isn't even real.
19:06
Like, are they gonna literally go out and scrape everyone's API keys off their machines so they stop using an Anthropic model? It's so not practical. It's clearly said by people that don't use this technology, that don't understand what engineers actually do or how they do it, and it's just stupid. And also, why would the US cut off its nose to spite its face? Use whatever models you can. It just kind of annoys me.
Alex Volkov 19:37
A lot of people looked at Anthropic as kind of the moral
19:40
stance, the people that don't kowtow to the government's demands, et cetera. And I'm gonna try to stay as neutral as possible here, because I think everybody's wrong or everybody's right; I don't know. A lot of people read it as Anthropic saying, hey, we agreed to a legal framework, and this legal framework does not make sense anymore, and they attributed a moral stance to Anthropic. And we know from our show, from conversations with Anthropic, they're very big into safety. Apparently this is not about safety at all. Two things about this. One, apparently on Friday there was a memo that Dario Amodei posted internally, which leaked yesterday. That leak shows incredible language: Dario specifically targets Sam Altman and his antics and shenanigans, calling everybody on Twitter morons. I called it out on Twitter: the rollercoaster of hey, Anthropic is based because they have the best model; Anthropic is horrible because they told OpenClaw to change their name; Anthropic is the best because they released Claude Code. The back and forth, how quickly everybody decides Anthropic is the best or the worst, is just giving me whiplash. The latest thing, though, is that according to the Financial Times, Anthropic's chief is back in talks with the Pentagon about the AI deal. Not only that.
Multiple tech companies reached out to the government and said, hey, we're worried when you take a company off the private market because of the stuff you want them to do, designate them a supply chain risk, and potentially talk about invoking the Defense Production Act to take them over, nationalize them. That's something China would do, not something we in the US should do, because we have private markets that are separate from the government. The publicity from Anthropic was very interesting. It was the best campaign they ever did, I think: way more money and people and an influx of new registrations than the Super Bowl commercial they did knocking OpenAI. So there's definitely that. We also see Anthropic coming very, very close to 19 billion run rate, more than doubling in three months from 9 billion. It was confirmed that Anthropic hit 19 billion plus in annual recurring revenue, more than doubling from three months ago. So enterprises are signing up. Many of them probably work with the government, and some of them, because of this designation that Anthropic is aiming to challenge in court, may not be able to. So this is the saga, I think. Last comments, folks, on the saga; we just wanted to give an update. It looks like Anthropic is still in the chats with the Pentagon. So all of this posturing may mean nothing, it's just for show publicly, and eventually they will still work with the government, because how could they not? The US needs the best tools. Why wouldn't they? I see.
Nisten
Nisten 22:41
Sam Altman had to fall back and, say a few things about what safety
22:47
precautions they're taking when it comes to national surveillance, because they have, what, I don't know what percentage of the US population as users. And that was pretty funny to me, because no one believed him when it came to that. And yeah, I don't think that worked out as well as he was hoping it would.
Alex Volkov
Alex Volkov 23:10
Yeah.
Nisten
Nisten 23:10
So leave it at that.
Alex Volkov
Alex Volkov 23:12
Okay.
23:12
We're strongly not into politics, but we have to be, because AI is getting involved. The last thing I'll say: there was a phone call with President Trump, and he said, I fired Anthropic. Which is just a funny way of saying this: I fired Anthropic. LDJ, go ahead, and then we'll move on to Qwen, because there's some drama there as well.
LDJ
LDJ 23:27
Yeah.
23:27
Just some more recent updates on the supply chain risk designation. Actually, as of the past 48 hours, they're focusing less on that, and it seems like they're maybe focusing more now on designating them under the Defense Production Act, which is almost the extreme opposite: it would basically force Anthropic to work with the government, essentially.
Alex Volkov
Alex Volkov 23:53
it is crazy.
23:53
But did anybody think differently? These companies are building ASI; it's a matter of national security. Did anybody think that the government would not step in at some point and take over? I know many people don't want that, many people are libertarian, for example, but this is not the world we live in. And the comparisons to China were very interesting, because essentially there's no need for any of these laws in China for the Chinese government to take over any of the AI labs. I think we have covered this. Folks in the comments, if you want to give us a comment about this, we would love to hear it. What's your take? This is a developing situation and obviously we don't know all the details. I would really recommend folks go and try to read Dario Amodei's leaked memo, because he definitely did not mean to share as much publicly as he shared in it. One choice quote: Twitter morons may believe some of Sam Altman's antics, but he hopes that no member of the gullible staff at OpenAI does. Honestly, I've updated a few of my stances after this leaked memo. That's
Yam Peleg
Yam Peleg 25:00
insane.
25:00
That's an insane quote.
Alex Volkov
Alex Volkov 25:05
Yeah.
25:06
It's really funny that many people from OpenAI just changed their bio to Twitter moron, which is really funny. Alright folks, I think it's time for us to move on to the next thing, which is kind of taking over Alibaba. We'll talk about this in open source... actually, it's worth covering now. Alibaba's Qwen Lab. Alibaba has multiple AI efforts, which is going to be relevant. So the Qwen team at Alibaba released the Qwen 3.5 small model series, with native multimodal capabilities rivaling models 13x their size. The small model of Qwen 3.5, at 9 billion parameters, is beating, let's take a look here, GPT-OSS 120B on multiple benchmarks. Basically, they're saying the 9 billion parameter Qwen 3.5 competes with GPT-OSS 120B, the open source series of models OpenAI released back in the summer of last year. They're also natively multimodal, right? So you can use these models on video and documents. Video-MME is at 84% for these models, or at least for the 9 billion parameter model, and GPQA Diamond is at 81%. This is a great series of models that can run completely on your device and potentially do some stuff on your device. Now, in the middle of releasing this, a day after releasing... do you guys want to comment on the actual models for a quick sec? Yeah, go ahead.
Nisten
Nisten 26:33
I'll quickly say, right now the 9B model is the most popular
26:37
model trending on Hugging Face. And I don't have the speeds for that, but the 27B, which is three times larger, people are running it. The model was
Alex Volkov
Alex Volkov 26:46
released last week,
Nisten
Nisten 26:47
People are running it on a 3090 card, which you can still
26:50
probably get for like 900 bucks, and they're getting 35 tokens per second in the beginning. Then, as the context fills up after a hundred thousand tokens, they're still getting 15 tokens per second. The architecture allows that, and that is very usable now. And that model had some of the best scores on Artificial Analysis.
Ryan Carson
Ryan Carson 27:14
Yep.
Nisten
Nisten 27:14
So yeah, I think it's an important threshold
27:20
that's being crossed here: what you can do with a $1,000 GPU budget just crossed into being usable for agentic stuff.
Alex Volkov
Alex Volkov 27:32
Yep.
Nisten
Nisten 27:32
You can feed videos to them.
Alex Volkov
Alex Volkov 27:34
All of the Chinese labs, besides the big whale one that we
27:38
keep waiting for to hopefully launch at some point, mostly released text-only models. Alibaba's Qwen is specifically a multimodal one, I think. Kimi K2.5, the last one, is also multimodal, right? But most...
Nisten
Nisten 27:54
You need a lot of GPUs to run Kimi on your own.
Alex Volkov
Alex Volkov 27:56
Yeah, it's a 1 trillion parameter model.
27:59
And Qwen is multilingual, multimodal, with 262K context, and the 9 billion parameter one you can run on your device. The Qwen series of models is absolutely incredible; Qwen is almost single-handedly holding up open source. We love open source, but we also love being able to run these on our laptops. So, you know, a 4090 is nice, but running a 9B parameter model on a MacBook via LM Studio is much, much nicer for certain things. Now, with that said, they also released other sizes: a small one at 0.8 billion parameters, a 2 billion, a 4 billion, and this 9B is the flagship. 9B is the sweet spot: just intelligent enough to do some tasks, but also small enough to run on most laptops with decent speed. Very good agentic tool use and API calls. Anybody else play with the small one? Alrighty. So with that said, a day after this release, our friend of the pod Junyang Lin, who was the tech lead for Qwen, posted on Twitter: goodbye, my beloved Qwen. This tweet reached, I looked yesterday, 5.7 million views, which just shows, as Nisten said, how popular this model has been; it was the most trending model on Hugging Face for a long time. Qwen has been carrying the torch of open source, and Junyang, with his, I think, seven appearances on ThursdAI (we invited him today, by the way, but I think he's otherwise preoccupied), is kind of the flag bearer for Qwen. He basically made Qwen Lab what it is, he took over, he's one of the youngest at his level at Alibaba, and now he posts goodbye, my beloved Qwen. And then everybody started freaking out about what's going on. Very closely after this, Binyuan Hui, or something like this, also a member of the technical staff there, posted me too as well. And so, with the other departures of three weeks ago, everybody started speculating about what's going on.
Somebody posted that they know it's not his choice. This announcement from Junyang caused so much shock and awe across the ecosystem that apparently the CEO of Alibaba convened an internal meeting a day after the announcement, to talk to people and say that Qwen is still remaining. So apparently there's no firing. But based on reports from Chinese outlets, this story completely broke through the bubble as well. It's not just our own ecosystem, where somebody gets fired or something; this story absolutely broke out. A lot of the Chinese news outlets are reporting on this. Alibaba is very big in China, and open source is very important to them, so this post was seen as: hey, maybe there's not going to be any open source anymore. But based on 36Kr, there was reporting that said Alibaba commits to open source and open source will continue. This was a dispute over who is going to consolidate which parts. The Qwen team was not the only AI team; we know there's Tongyi, there are a few image models, Qwen Image, different from the Qwen team. So apparently this was just a conversation about who and where the researchers are going to get allocated. Some comments from the meeting they had: the Qwen team is only about a hundred people. With all the success of the Qwen models, the over 120 models they released in open source, it's only a hundred people. A hundred people, 120 models. They have complained to the bigwigs at Alibaba that it's harder for them to get resources, GPUs to train their models, et cetera, than some of their clients. And they did great work despite that, so it's unclear where they're going. But as of yesterday, I think it's confirmed that Junyang's resignation was accepted, and the CEO of Alibaba is now co-leading Qwen Lab directly. I don't know what that means, if it's good or bad; we'll see.
But they're basically claiming that the Qwen team is larger than one man, despite the fact that Junyang did a lot of work, including evangelizing on our show. We've followed his career for a while here. So shout out to Junyang; we hope you land somewhere where you'll have great impact. I personally hope it's another Logan Kilpatrick situation, and Junyang is going to take all of the good faith he got from the community to another lab. We'll see. But this is the update on Qwen's departures. So no, Qwen is not going away, doesn't look like it, but our boy Junyang is not going to represent Qwen on the show anymore. Folks, any comments on this, on how sometimes the people behind the AI are kind of more important than the AI itself?
Ryan Carson
Ryan Carson 32:36
I mean, as someone who's done DevRel for 25 years, I
32:40
think people are very important, and we trust people. Think about how much Logan has done for Gemini; all of us use Gemini because of Logan. Yes, the technology is amazing, but it's the person, you know. I think Roman has done an amazing job at OpenAI, for instance, really building those relationships, and a lot of people underneath him, like Dominik and other folks. So it's the people. Yeah. So this is a big blow for them. I don't know what happened, though.
Alex Volkov
Alex Volkov 33:05
Yep.
Wolfram Ravenwolf
Wolfram Ravenwolf 33:07
I'm most curious what is happening to the people
33:09
now: where they are going, whether they're forming a new company or joining another one.
Alex Volkov
Alex Volkov 33:15
we reached out for a comment and we did not get one.
33:17
Once we get one, we'll update you folks. LDJ, go ahead, and then Nisten.
LDJ
LDJ 33:21
It's like how, for the mainstream population, things like football or basketball work.
33:25
You know, in the news, a lot of the time, sometimes even more often than the scores of the game, you'll hear: oh, LeBron just signed this contract for hundreds of millions of dollars, or this and that. It just seems like this interesting thing where a lot of the entertainment truly is just the identities of the people involved.
Alex Volkov
Alex Volkov 33:44
This is kind of like sports.
33:45
The talents moving around, et cetera. Nisten, go ahead, you had a comment as well.
Nisten
Nisten 33:50
So there are a few things that could
33:53
have happened here, but I'll narrow it down to three. Either a new company opened up and the three guys got a much better option, which could be like an Ilya type of situation; so there's one. Then the other one is what we saw from some of the people at Kimi. They were describing it as, you know how Google has level one to level seven engineers, staff engineers and stuff? Well, at Alibaba, when it comes to the executives, because they're all engineers in China, there are 14 levels, and Junyang was a level 10 there.
Alex Volkov
Alex Volkov 34:32
the youngest level 10, that's what I read somewhere.
Nisten
Nisten 34:34
So it is likely... we saw, for example, with the GLM model, GLM 5,
34:39
they announced, I don't know how true it was, that it was all trained on internal Huawei chips. So this could be a situation where they're forced to not use Nvidia cards anymore for some arbitrary reason, and then they just can't get their work done, and all these departments got consolidated. So there's the second one. And the third one, which is the spicy one: they might have just been forced to use Qwen VL models for killer drones, and they didn't want anything to do with that. A hundred
Alex Volkov
Alex Volkov 35:11
percent speculation, Nisten. We'll move forward.
35:14
OpenAI rolled out GPT-5.3 Instant. Basically, if you have used OpenAI's ChatGPT and decided to stop because it's dumb, you probably played with the Instant 5.2 version. If you don't select the thinking versions in ChatGPT, you are playing with the instant models, and they're not great. The new one is a little bit better at creative writing, still not amazing. But OpenAI basically posted on their socials: we hope this model is less cringe, let us know what you think. And it interprets typos better; the previous one would over-obsess about typos. Then they claim 26% fewer hallucinations on web search, and a 20% reduction. I played with it just a little bit, but the honest truth is, besides Codex for coding, I do not use OpenAI's ChatGPT anymore. And that's a big change, not only for me; I think for many, many people. People prefer Claude. I know this from friends of mine who discovered Claude recently. Not Claude Code, just Claude.ai, just the chatbot. And I don't know what happened with ChatGPT, but basically it looks like they're focusing their stuff on coding. Again, we're waiting for the 5.4 model, which supposedly brings some changes, but the 5.3 Instant, the model they roll out to everyone for free, was updated. It's a little bit better at creative writing, but that's basically all of the vibes I got from this model. Do you guys have any responses to this model, compared also to Spark, the fast model they put up on Cerebras? Codex 5.3 Spark kind of answers the fast question. So why would anyone use this, besides the 900 million active folks on ChatGPT's platform?
Nisten
Nisten 37:09
you have to keep in mind that when it comes to voice mode,
37:11
OpenAI is still the best one. So if you want to take a walk down the street and you want to hear something, instant response is pretty crucial to that.
Alex Volkov
Alex Volkov 37:22
Yeah.
Nisten
Nisten 37:22
Personally, I don't like OpenAI or Gemini responses at all.
37:26
I prefer Kimi and Opus. But again, I haven't tried the model, but I can tell almost instantly, reading stuff all over the internet, I can tell right away when it's Codex or an OpenAI model writing something. It's just that very annoying language, which I don't want anywhere in my apps.
Alex Volkov
Alex Volkov 37:50
LDJ, you have a comments on this as well?
LDJ
LDJ 37:52
for the instant models, for a while now, I feel like Claude has
37:54
definitely been the best there. But this does seem like a noticeable improvement, especially in things like hallucination rate. When people say things like big model smell, I feel like a lot of that actually comes down to total hallucination rate. But especially for things like very low latency requirements for a specific use case: for example, doing logistics with actual mechanisms in the real world, or robotics experiments where it needs to react really quickly to stimuli happening in the real world. That's where a really fast text response might be applicable. But then again, you mentioned Codex Spark, and I imagine even that would probably be better for this, unless there's a big price disparity in the API. So yeah, I think it is a bit confusing.
Alex Volkov
Alex Volkov 38:49
So this is the model that ChatGPT gives to users for free.
38:53
So this is ChatGPT for most users. Most of the world has not even heard of AI yet; those who have probably use the free versions. Maybe a very small percentage are actually paying for the regular plans and able to switch models, right? So whatever we want to call this, and whether or not we care, this is an upgrade for some people. Speaking of faster models: Google launched Gemini 3.1 Flash-Lite, and this one is completely different, right? This is the fastest and most cost-efficient model in the Gemini 3 series, and it comes with a 1 million token context window. And it's fast, at 300 tokens per second, very, very fast. Not Cerebras fast, right? We know some of these models are hosted on specific chips, like Cerebras, and do over a thousand tokens per second, and last week we told you about models from Etched, on dedicated hardware, at 15,000 tokens per second. Gemini is not that fast, but it's really, really fast as well. Gemini 3.1 Flash-Lite, they call it the fastest and cheapest, scores 86.9 on GPQA Diamond, compared to 82 for GPT-5 Mini and 73 for Claude Haiku 4.5. So there's a series of models this competes with: Haiku, Grok Fast, and GPT-5 Mini. And Google is showing off their skills, because honestly, they can make all models fast; it's all a question of how much GPU they throw at it, right? Now, I know for a fact that many people use the Gemini Flash models for multiple things: prompt rewriting, catching regressions, doing different guardrails. One use case, and we keep telling you open source models are great for this too, is judging other models' outputs. You don't need a very strong model to judge other models' outputs as LLM-as-a-judge; you can just use a fast model.
The stronger and faster your guardrail model is, the better for your clients, right? Deciding whether the output was jailbroken, deciding whether the client is asking for unsavory things: all of those go into using these models.
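The guardrail pattern described here can be sketched in a few lines. This is a minimal illustration only: `call_model` stands in for whatever client function you use, with signature `(model_name, prompt) -> str`, and the model names are hypothetical placeholders, not any specific provider's API.

```python
# Sketch of a cheap-model guardrail: the strong model answers, then a
# fast model screens the answer before it reaches the user.
def guarded_answer(user_prompt, call_model,
                   strong="strong-model", judge="fast-cheap-model"):
    answer = call_model(strong, user_prompt)
    verdict = call_model(
        judge,
        "Reply with exactly SAFE or UNSAFE. Is the following response "
        "free of jailbreaks and disallowed content?\n\n" + answer,
    )
    # Fail closed: anything other than an explicit SAFE blocks the answer.
    return answer if verdict.strip().upper() == "SAFE" else "[blocked]"
```

The point of the design is that the judge call can go to a much cheaper, faster model than the one producing the answer, so the guardrail adds little latency or cost.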
Nisten
Nisten 40:58
I just tested it really quickly with the same Martian
41:02
question, and Gemini Flash-Lite got it correct, actually got the math very precise. And I'm assuming 5.3 Instant is rolled out, because it just answered instantly, directly, and it made major math mistakes that none of the other models make right now.
Alex Volkov
Alex Volkov 41:21
So the 5.3 Instant in ChatGPT made major
41:24
mistakes, but Gemini 3.1 Flash-Lite did the math correctly. Is that what you're saying?
Nisten
Nisten 41:29
I got it, got it.
41:30
Very good. Actually, a lot more specific in the numbers too.
Alex Volkov
Alex Volkov 41:34
One callout from our audience is that the new
41:36
Flash-Lite version is more expensive than the last version. Yeah, that's correct, it's more expensive than 2.5 Lite. Correct?
Wolfram Ravenwolf
Wolfram Ravenwolf 41:44
There's a Flash-Lite model, which is a
41:47
smaller model, even faster.
Alex Volkov
Alex Volkov 41:48
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 41:49
But it's great for home automation.
41:50
I use it with Home Assistant. So if I tell it to do something, I need tool calling and I need it fast. I don't want it thinking for a long time until I've already fallen down the cellar stairs because it's dark.
Alex Volkov
Alex Volkov 42:02
Yeah.
42:03
And Wolf, I think you also mentioned in the comments, about my LLM-as-a-judge point, that the judge needs to be a smarter model than what it is judging. Yeah, I agree. But there's a whole host of things, like guardrails, that you can detect with a faster, cheaper model.
Yam Peleg
Yam Peleg 42:18
I was just gonna say, it is way more expensive than the previous one, I
42:23
think. Look, all these CLI harnesses are using small models for all sorts of small things. I'm not sure people realize to what extent. Even Claude Code: the amount of Haiku calls for every prompt you're sending. I don't think people realize. Just to put the title at the top of the terminal, or the spinner on the side that's shimmering, every single one of these things is a Haiku call. So this is probably a model that Google trained for the Gemini CLI, without an intention of actually commercializing it to this extent. Gemini 3 had these weird releases where you got Gemini 3 and all of a sudden you got Gemini 3 Flash, which is better than the original 3, to the point that the original Gemini team shifted to using Flash 3. And then you have Gemini 3.1, which is probably the larger, stronger version of the small Flash. So people already discovered this model and started sending whatever they want directly to it, all sorts of prompts, because it's basically nearly free. And I think this is why Google is now releasing Flash-Lite with a real price tag: because it is extremely useful. Yeah.
Nisten
Nisten 43:55
So the previous Flash-Lite was 10 cents per million input tokens and
44:00
40 cents per million output tokens, extremely cheap. Now this new Flash-Lite is 50 cents per million input and a dollar fifty per million output. And if you work agentically, over 90, 95% of your tokens are input tokens. It's five times more expensive on input and almost four times more expensive on output. So that's a big price increase.
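The arithmetic here is easy to check. A quick sketch of the cost math, using the per-million-token prices quoted on the show and an assumed, input-heavy agentic workload (the 10M-token run and the 95% input share are illustrative numbers, not measurements):

```python
def run_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one workload; prices are per 1M tokens."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Hypothetical agentic run: 10M tokens total, 95% of them input.
inp, out = 9_500_000, 500_000
old = run_cost(inp, out, 0.10, 0.40)   # previous Flash-Lite pricing -> $1.15
new = run_cost(inp, out, 0.50, 1.50)   # new Flash-Lite pricing -> $5.50
print(f"old: ${old:.2f}  new: ${new:.2f}  ratio: {new / old:.1f}x")
```

At this input/output mix the overall bill goes up roughly 4.8x, dominated by the 5x bump on input tokens, which is exactly why agentic users feel the increase most.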
Alex Volkov
Alex Volkov 44:31
Yep.
Nisten
Nisten 44:32
it did get the math right though, so there's that.
Alex Volkov
Alex Volkov 44:34
Alright folks, I think it's time for us to move on.
44:37
This show is brought to you by Weights & Biases, from CoreWeave, and we have something to announce to you. So we're gonna go to This Week's Buzz, the corner where we cover this week's excitements from Weights & Biases. Stick with us, we have a bunch of other tools and things to talk about. Meanwhile, let's go to This Week's Buzz.
45:12
All righty. Welcome to This Week's Buzz. With you right now: Alex Volkov, AI evangelist with Weights & Biases, and Wolfram Ravenwolf, also an AI evangelist at CoreWeave's Weights & Biases, both on the same team. Wolfram, you have something to show us on today's show, and I'm very excited to show this to folks as well. So let's talk about what you have.
Wolfram Ravenwolf
Wolfram Ravenwolf 45:32
great.
45:33
So it's finally the time to really talk about this and show the world what I've been up to since I joined CoreWeave in January, working on the Weights & Biases team. I chose the name Wolf Bench because, basically, it's not even a real benchmark per se; it's more like a framework for evaluation. It's based on Terminal Bench. But let's start at the beginning: why should you care, and why do I care? I've been the eval guy (not evil, eval) for a long time. I'm doing this because I want to use my AI better, use my agents better. I've been working on my assistant for three years, on agents for two years. So I'm always testing models, because I want to know which is best for the users. General purpose AI, basically. And I chose Terminal Bench because it is not a coding benchmark, although many people think it is and it's often put in that category. It is actually a benchmark about how to use terminals: system administration, terminal interaction, git, server configuration, setting up servers and doing stuff you would ask of your agent. So it's a nice sample of all these things. And it already exists, of course, and it is one of the most popular; that's also why I'm using it, so I can compare my scores to what the labs report. And the thing is, what we are looking at right now is an average score. Most use five runs, or four runs, with different timeouts for all of this. You see one score, and you don't really know very much. Like what we are seeing now: Kimi K2.5...
Alex Volkov
Alex Volkov 47:10
Can you zoom in a little bit, Wolf?
Wolfram Ravenwolf
Wolfram Ravenwolf 47:12
Yeah.
47:12
Let's zoom in one more. So basically Kimi, GLM 5, and MiniMax almost have the exact same score. So which one should you use for your agent? One score is not enough; it doesn't tell you enough about what the model is actually doing. And even then, this is the Terminal Bench benchmark using its own agent, Terminus 2. But how does the model do in Claude Code? How does it do in OpenClaw? I wanted to know these things, and that is why I've been doing this. I tested with different agents, like Claude Code, and I also want to be clear about how many runs I've done; I am aiming for five here. I will do more models, more agents, and you can even do combinations: how would Claude Code do if I use a Codex model with it? How would Codex do if I use a Claude model with it? All stuff I'm planning to do soon and report about. And what is special about this framework of evaluating is that I'm not just using the average score. Basically, it's a four-metric framework. I have the average, but I can also see the best of the five runs; you see it goes a bit above the average, of course. So that's the best run. But even more interesting to me was: of the 89 tasks in this benchmark, how many can the model actually solve at all, across all the runs, even if it has never solved all of them at once? That is the ceiling, the theoretical top. If we just look at Terminus, we see that Sonnet and Opus are very close together, but Kimi and the other Chinese models are also very close together on that part. Now, looking at the solid base, which tasks got solved in all the runs (we did five each), how many tasks got solved all the time? That is a very different picture. Even if Claude Opus can do 88% of the whole benchmark, on average it only does 73%, and only 55% can it do all the time.
So that is reliability, because an agent that does it sometimes, but not all the time, is less reliable; it's not doing it consistently. That is consistency. And if you look at all of them together, that also paints a very interesting picture, because now we see that even if Kimi has a higher average than the others, it has a lower baseline. So here, even GLM has an advantage. And this is also very interesting, because Kimi only gets 10% of all the tasks done every single time. So it is not very reliable. It can do a lot, as much as GLM, for instance, but GLM has a twice as high baseline of tasks that it always does.
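The four metrics described here (average, best-of-N, ceiling, floor) are straightforward to compute from per-run pass/fail results. A toy sketch with made-up task data, not real Wolf Bench numbers:

```python
def summarize(runs):
    """runs: list of dicts mapping task name -> bool (solved in that run).
    Returns (average, best_of_n, ceiling, floor) as fractions of tasks."""
    tasks = list(runs[0].keys())
    per_run = [sum(r.values()) / len(r) for r in runs]
    avg = sum(per_run) / len(per_run)          # mean score across runs
    best = max(per_run)                        # best single run
    ceiling = sum(any(r[t] for r in runs) for t in tasks) / len(tasks)
    floor = sum(all(r[t] for r in runs) for t in tasks) / len(tasks)
    return avg, best, ceiling, floor

runs = [
    {"git": True, "nginx": True, "cron": False, "ssh": False},
    {"git": True, "nginx": False, "cron": True, "ssh": False},
]
print(summarize(runs))  # (0.5, 0.5, 0.75, 0.25)
```

The floor (tasks solved in every run) is the reliability number highlighted above, while the ceiling shows what the model can do on a good day; two models with identical averages can have very different floors.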
Alex Volkov
Alex Volkov 49:48
Wolfram.
49:48
I have a few questions, and I want to highlight some of the findings as well. Folks, you can find all of this in Wolf Bench. Wolfram, I think this view is unique, the view you have up right now. One score is not enough to tell us about a model; there are variations in every run of the benchmark, every time you run it. For Terminal Bench specifically, there's a bunch of tasks, and I think the floor is very important, as you guys can see on the graph. The bottom, darker shade of every column shows how many tasks these models solve all the time, in a hundred percent of the runs. And the top score is the maximum: at the best run possible, under the best conditions, how many tasks every model solves. Opus is absolutely mogging everybody at the maximum; at Terminal Bench, Opus 4.6 is the best model. But we also looked at and tested different harnesses. The harness is what actually runs the model, and Wolf, we have three here: the Terminal Bench harness itself; the Claude Code harness that everybody uses (everybody who's excited about Claude Code is essentially excited about the prompts, the system, and how it does tool calls specifically); and one of the more famous ones that many non-techie people started using, OpenClaw, right? And in all those harnesses, the models perform a little bit differently, because of the system prompts, the additional tool call settings, and the explanations given to the model. So it's not only which model you use and what number they show you. The coolest thing is to compare GPT-5.3 Codex inside Claude Code versus Opus 4.6 inside the Codex app, and I think comparing between them and seeing who has the better harness for the model is going to be very exciting.
Wolfram, do you have any other comments before I bring up the co-host questions about this?
Wolfram Ravenwolf
Wolfram Ravenwolf 51:40
I just want to say thanks also, of course, to the
51:44
company I work for, which is sponsoring the podcast and my work as well. I'm working for them; I'm doing this for them and with them. The inference we have been using for the Chinese models was our own, so I test our own inference to make sure it works as well as the officially published results. And it's not cheap to do this. A single Opus run on Terminal Bench, if it's Opus 4.6, costs me 120 bucks per run, and there are five of these runs. So that is very expensive. And Sonnet costs 80 bucks, because it used a lot of tokens and 50% were cached, which is also something I will write about in the report. This is just the beginning. I will add more agent frameworks: Gemini CLI and Codex are definitely coming next, maybe the Hermes agent, which is also a new agent framework. So I want to add more. And I need sandboxes for this. Daytona sponsored the sandboxes with a couple hundred bucks, because it's 89 tests, I do five runs per model, and each one is limited to two hours. So that is up to 890 hours of CPU time. That is also a lot of time it takes to do the benchmark. And yeah, the two hours is also a special thing.
Alex Volkov
Alex Volkov 52:56
big, big shout out to Daytona for, helping us and
52:58
sponsoring the sandboxes here. Folks, agents need sandboxes on the internet to run their code, and we couldn't have done this without Daytona. Awesome job, Wolfram. You can check out WolfBench at wolfbench.ai, and we'll keep updating it with new models, including models that we host on our own inference — so you'll know the full story there and which models to use. Obviously the highlight, at least for me, is that you shouldn't trust just the one score. These models perform great on the baseline, but the variance in how many tasks they can solve in general is also important. Nisten, I think you had a question before we move on.
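Alex's point about floors and ceilings reduces to a small computation over the per-run results. A minimal sketch with hypothetical run data (not WolfBench's actual numbers): the "floor" is the share of tasks solved in every run, the "ceiling" the share solved in at least one run.

```python
# Per-task results across 5 benchmark runs: True = task solved in that run.
# (Hypothetical data for illustration, not real WolfBench results.)
runs_per_task = {
    "fix-build":    [True, True, True, True, True],
    "write-parser": [True, False, True, True, False],
    "debug-race":   [False, False, True, False, False],
}

def floor_score(results):
    """Share of tasks solved in 100% of runs (the dark bottom of each column)."""
    return sum(all(r) for r in results.values()) / len(results)

def ceiling_score(results):
    """Share of tasks solved in at least one run (best-case conditions)."""
    return sum(any(r) for r in results.values()) / len(results)

print(floor_score(runs_per_task))    # 1 of 3 tasks is solved every time
print(ceiling_score(runs_per_task))  # all 3 tasks are solvable on a good run
```

A single reported number usually sits somewhere between these two, which is exactly why one score hides the variance the panel is talking about.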
Nisten
Nisten 53:31
Yeah.
53:31
Yeah, and I do want to say that I'm not that surprised by Sonnet performing so well in the OpenClaw harness, because, based on what LDJ said last week or two weeks ago, it might have been trained longer. Can we run this benchmark on our own too? I would be interested to run it with smaller agent models. And the other thing I want to try — which may or may not be entirely kosher with the terms of service — would be to just hook up the Claude Code CLI with a proxy to a vLLM API running MiniMax or whatever. Because I want to see: when you hook up those models that apparently have been trained on Claude's data, how well do they actually do here? That would be interesting.
Wolfram Ravenwolf
Wolfram Ravenwolf 54:21
Yeah, that sounds possible.
54:22
I did it with LiteLLM, but it's even easier nowadays, where they have APIs that are Claude-compatible and even advertise this. So you can do it. It is the Terminal-Bench 2.0 benchmark — basically the original one with the specific settings I give it. And it is all down here: how long the timeout is, how many CPUs and how much RAM I gave the sandboxes. They all get the same resources, so it's not that, if OpenClaw were using more RAM, it would run out of memory much easier with the small ones. And yeah, you could do the same — nothing against that. I used marimo, which is a Python notebook thing; I used it to build a dashboard where I create this, I start the run, I get the stats. I will put them all on Weights & Biases Weave as well, so people can look into this. I will make more posts about this and share it. I want it to be completely transparent, so the settings I'm using, I will put them on GitHub so you can get the config. And if you do some runs with the config, I trust you — if you report the scores to me, we can put them in there. And yeah, let's build on this.
Alex Volkov
Alex Volkov 55:28
Folks.
55:28
If you wanna participate and run your own benchmarks or your own harnesses: wolfbench.ai — reach out to Wolfram, and we'll definitely include this. Right, time to move on. Wolfram, thank you so much for bringing this to us. Great work on WolfBench. Let's talk about open source. There's a few things in open source that we haven't covered — we talked about Alibaba Qwen 3.5 small models. There are two models I wanted to bring to your attention from StepFun. StepFun released Step 3.5 Flash Base; they call it the most open foundational model out of the Chinese labs.
Nisten
Nisten 55:57
I haven't tried this, but people are incredibly excited about this.
56:02
I think StepFun just made a name for themselves with this release, strangely enough. Yeah, people love that they're releasing even the supervised fine-tuning data from this — it's, I
Alex Volkov
Alex Volkov 56:13
think
Nisten
Nisten 56:14
Actually.
Alex Volkov
Alex Volkov 56:15
I think this is the highlight for this model, right?
56:16
Like, they released all of the training as well. They released the base and the mid-train for code, agents, and long context. They released their Step Tron OSS training framework. The SFT data is coming soon — it didn't launch yet, apparently. And it's all Apache 2.0 licensed. So not only are they saying, hey, here's the model, go use it — which we can look at; the benchmark here is 74 on SWE-Bench Verified, again, "verified," after last week — I think everything throughout the training process is open, which is great. And there's some evidence of focus switching from Qwen to Step 3.5 Flash. It runs on a Mac Studio M4, and it runs obviously on DGX Spark, which is NVIDIA's little small supercomputer.
Nisten
Nisten 57:05
This allows you to continue pre-training the model how you want,
57:11
and that's a big deal for people. I think there's gonna be a lot of companies that just use this and say, oh, we're releasing this model, and they don't say that it's based off of a Chinese model. But yeah, it's huge that they provided that, because you can continue pre-training — not just fine-tuning, but continuing the pre-training of the model how you want. And that is a big deal, because a lot of the shackles around the model haven't yet settled — so you can kind of formulate the yogurt how you want later on. So yeah, it is a big deal for people that train models. I
Alex Volkov
Alex Volkov 57:45
Yeah.
57:46
So we have a few comments here. Let me take a look at our Nous summary — a significant shift in openness is what folks notice most of all. Oh, let's go, breaking news! Let's go, LDJ — this is what we're here for. Okay, StepFun is gonna step aside for a second — sorry for the pun, folks. We have breaking news. Oh, let's
Yam Peleg
Yam Peleg 58:06
fucking go.
Alex Volkov
Alex Volkov 58:09
AI breaking news coming at you only on Thursday.
58:16
I,
Nisten
Nisten 58:21
it's so funny.
58:22
We were like fully expecting it. Something's gonna happen.
Alex Volkov
Alex Volkov 58:27
All right, LDJ.
58:28
You have the honors — you found this first. Go ahead.
LDJ
LDJ 58:31
Okay, so introducing GPT 5.4.
58:36
That is the title of the blog post that OpenAI just dropped. And we might as well just scroll down to where we start seeing benchmarks — that's what we really want, right?
AI
AI 58:46
Yeah.
LDJ
LDJ 58:46
So, GDPval, which measures a lot of different tasks that
58:52
the OpenAI teams have deemed as valuable to things like GDP — just overall economically valuable jobs. Here it ends up getting 83%, I believe. When it says "win or tie," I believe it's saying 83% of the time it gets a win or a tie against the human measure in the dataset. That's compared to around 70% for both GPT-5.2 and 5.3 Codex. SWE-Bench Pro: about the same gap between 5.4 and 5.3 Codex as we saw between 5.3 Codex and 5.2, so about one to two percent. OSWorld Verified: a huge difference — though only about 1% extra over the last Codex.
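The GDPval metric LDJ describes is a pairwise win-or-tie rate against human professionals. A toy sketch of how such a rate is computed — the grades here are made up for illustration, not OpenAI's data:

```python
# Each task is graded by comparing the model's deliverable to a human
# professional's: "win", "tie", or "loss". (Illustrative grades only.)
grades = ["win", "tie", "win", "loss", "win",
          "tie", "win", "loss", "win", "win"]

def win_or_tie_rate(grades):
    """Fraction of tasks where the model at least matched the human."""
    return sum(g in ("win", "tie") for g in grades) / len(grades)

print(win_or_tie_rate(grades))  # 0.8 -> reported as an "80% win/tie rate"
```

Note that because ties count toward the headline number, a reported 83% does not mean the model outright beat humans 83% of the time.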
Alex Volkov
Alex Volkov 59:36
Look at this.
59:37
Look at this: GPT-5.2, the last main OpenAI model — not the Codex fine-tune for code specifically, the main one — the jump in OSWorld is from 47% to 75%. So this model, the generic OpenAI model that's now gonna serve everyone, not only the coders, is also gonna be incredible at using computers. This does seem like a distillation of Codex of some sort, right? It's very interesting, the naming here: the previous GPT-series model was GPT-5.2, then 5.3 was Codex — there's no standalone 5.3 as far as I saw. And now they released a newer version called GPT-5.4 Thinking. But there's two versions of them, right, LDJ?
LDJ
LDJ 1:00:23
of 5.4.
Alex Volkov
Alex Volkov 1:00:24
Yeah.
1:00:24
There's the Thinking variation — very strong.
LDJ
LDJ 1:00:27
Yeah.
1:00:28
So they actually showed benchmarks for low thinking, for no reasoning effort — so basically instant — as well as medium, high, and even extra-high scores. So if you scroll down, there's the OSWorld Verified chart that actually shows the different thinking budgets.
Alex Volkov
Alex Volkov 1:00:44
Yeah.
LDJ
LDJ 1:00:44
Here, right there.
1:00:45
There you go. Yep.
Alex Volkov
Alex Volkov 1:00:46
So what are we seeing here?
1:00:47
Let's read through this. Number of tools — so we're seeing a graph where the x-axis is the number of tool calls, how many tools are getting called, and the accuracy is
Yam Peleg
Yam Peleg 1:01:02
crazy.
LDJ
LDJ 1:01:02
Yeah.
1:01:03
So the y-axis is the accuracy here, and then the different dots — the different blue dots, either blue or purple, I'm a bit color blind — those bluish dots are the no-reasoning effort, the low, the medium, the high, and the extra high. And then you can see also no reasoning effort, low, medium, high, and extra high for 5.2. It kind of goes into this funny zigzag pattern, 'cause at some point it actually doesn't get consistently better. But it's impressive that 5.4 does actually seem to get consistently better on this benchmark, whereas the last generation didn't. Yeah,
Alex Volkov
Alex Volkov 1:01:34
I would say based on this graph, no effort is, is the lowest at 40%,
1:01:40
still better than pretty much all of the previous model at very, very high effort — I think this is the craziness of this graph, Yam. And then if we go higher, the difference between the thinking regimes is not that crazy, right? Maybe 77, 75, 73, and 71 percent between the medium effort, the high effort, and the extra-high effort. But the jump in how good this model is at running your computer over the previous GPT-5.2 is crazy. It's absolutely crazy: 71% on medium effort versus 45% on medium effort. Yeah.
Nisten
Nisten 1:02:19
What's really cool about that, this is general.
1:02:21
My question — this is the general model, right? This is not the coding model. This is the
Alex Volkov
Alex Volkov 1:02:24
general Yeah, this is the new general model is
1:02:27
beating, the previous coding model.
Nisten
Nisten 1:02:28
let's go, let's go guys.
LDJ
LDJ 1:02:29
huge.
Alex Volkov
Alex Volkov 1:02:30
Yep.
1:02:31
So a very interesting way for them to highlight the improvements here. Yeah, what else can we talk about with this model? On WebArena Verified, which tests browser use, GPT-5.4 achieves a leading 67.3% success rate when using both DOM- and screenshot-driven interaction, compared to GPT-5.2 at 65%. On Online-Mind2Web, which also tests browser use, GPT-5.4 achieves a 92% success rate using screenshot-based observation alone, improving on ChatGPT Atlas agent mode, which achieves a success rate of 70%. This model, the new one, achieves 92% using only screenshots versus 70% in Atlas — and Atlas, we know, is great. This is crazy. What else is here? I haven't read it all yet.
LDJ
LDJ 1:03:20
Eventually you should see a tau-bench telecom — yeah, that's the benchmark.
1:03:26
And it shows, actually — interestingly, they're highlighting the without-reasoning scores here. Yeah. So they have telecom and airline, if I recall right. The airline benchmark relates to basically having to book a plane ticket for someone, which obviously requires multiple steps back and forth with a given interface, usually, or with a given database of flight times and flight prices and things like that.
Alex Volkov
Alex Volkov 1:03:52
And we're seeing that on tau-bench telecom, without
1:03:54
reasoning, GPT-5.4 gets 64%, jumping over 57% from GPT-5.2. Interesting — they didn't include Codex 5.3 here. Improved web search — significantly improved web search, new state of the art at 89% there. There's a lot of state-of-the-art achievements here. So GPT-5.4 is better at agentic web search on BrowseComp, a measurement of how well AI agents can persistently browse the web to find hard-to-locate information. GPT-5.4 leaps to a 70% average, over 5.2. And GPT-5.4 Pro, which is the second model, sets the new state of the art at 89%. And we have comments that it's already in the Codex app, instantly. So if you wanna run this as a test — that'll definitely do as a vibe check on this. I'll launch Codex as well and see if I got it.
LDJ
LDJ 1:04:48
finally they're showing some 5.4
Alex Volkov
Alex Volkov 1:04:50
pro benchmarks on the screen here.
1:04:52
For BrowseComp, which is the state of the art. So we are getting two models — one is the regular one, one is the Pro model, looks like. And let's see if I've got it.
LDJ
LDJ 1:05:01
Oh, this is also a big thing we haven't mentioned yet.
1:05:05
1 million tokens of context.
Alex Volkov
Alex Volkov 1:05:08
Oh wow.
1:05:08
Okay. For the bigger model.
Yam Peleg
Yam Peleg 1:05:09
go, let's go.
LDJ
LDJ 1:05:10
I'm not sure about Pro, but it does say — if you just Ctrl-F for
1:05:14
"context," you'll see. Yeah.
LDJ
LDJ 1:05:17
or just, control F for 1 million actually.
Alex Volkov
Alex Volkov 1:05:20
Hang on. In ChatGPT, 5.4 Thinking is available starting
1:05:23
today to ChatGPT Plus, Team, and Pro users, replacing 5.2 Thinking. The previous model will remain available for three months to those on Enterprise and Edu plans who need early access to plan around it. Available context windows remain unchanged from the previous ones. In Codex, GPT-5.4 includes experimental support for the 1-million-token context window. Developers can try this by configuring model_context_window and model_auto_compact_token_limit; requests that exceed the standard 272K context window count against usage limits at 2x the normal rate. Okay, so essentially you can run your Codex sessions for much, much, much longer.
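As read off the blog post, the 1M-token mode is opt-in via Codex configuration. A sketch of what that might look like in `~/.codex/config.toml`, based only on the two key names quoted on air — the exact values and the model string here are illustrative assumptions:

```toml
# Hypothetical ~/.codex/config.toml snippet, based on the key names
# quoted from the blog post; values are illustrative.
model = "gpt-5.4"

# Opt in to the experimental 1M-token context window.
model_context_window = 1000000

# Ask Codex to auto-compact the conversation before the window fills.
model_auto_compact_token_limit = 900000
```

Per the post, anything beyond the standard 272K window counts against usage limits at twice the normal rate, so the long window trades quota for fewer compactions.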
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:12
I just noticed they also have the Terminal-Bench 2.0
1:06:14
score in their blog post there. Yeah, if you go back to it, it's there. They only gave it for GPT-5.4, not for the Pro, and it's 75.1%, which interestingly is lower than GPT-5.3 Codex, which got 77.3%. Unfortunately, they don't give any information on how many runs they did and so on, but basically their score is about the same score I got with Opus 4.6. So on that benchmark, it's on par with it.
Alex Volkov
Alex Volkov 1:06:43
Yeah,
Yam Peleg
Yam Peleg 1:06:44
because Opus 4.6, as per Anthropic's own blog post, is 65.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:51
Yeah, that is also interesting because I got a
1:06:53
higher score with my benchmark. I did five runs and I got this score
Nisten
Nisten 1:06:58
starting to look like it's on par.
Alex Volkov
Alex Volkov 1:07:01
Do you guys notice that there's no Opus mentioned here at all?
1:07:08
Nothing — no comparisons. Let's go to the technical card, 'cause I think we have that as well, right? Because I think it's very important for us as well. Folks, I will just mention that Altman just posted about this, and we are here live — looks like a lot of people are joining us right now to talk about, and to try, the new release from OpenAI: GPT-5.4, which jumps over the latest 5.2 significantly, but is also competing with 5.3, and is now available in the Codex app. LDJ, go ahead while I pull up the technical card that you also sent.
LDJ
LDJ 1:07:43
So the pricing, which I, I have here.
1:07:45
Yeah. So it's about the same for output price: it's $15 per million tokens, compared to 5.2, which is $14 per million tokens — really a tiny difference there. For input price, though, it's about 50% more expensive than 5.2. So 5.2 is $1.75 per million tokens for input, and GPT-5.4 is $2.50 per million tokens for input. So that's roughly 50% more,
Alex Volkov
Alex Volkov 1:08:13
and that's up to 272K.
1:08:15
If you're enabling the 1 million context, then you'll get the 1-million-token window.
LDJ
LDJ 1:08:20
priced.
1:08:21
Yeah. And then in terms of the input and output price for 5.4 Pro, it's also a very, very small difference for output pricing, but it's about 50% more cost for the input.
Alex Volkov
Alex Volkov 1:08:31
we need to check this against the other models, but G PT 5.4 Pro has $30
1:08:36
per million tokens on the input and $180 per million tokens on the output. $180 per million tokens! Wolfram, before we run benchmarks on this, let me run this by procurement, okay? This is exceeding the realm of what is realistic to do benchmarks on, 'cause $180 per million tokens, times a few runs, could — you know — set us back. Oh, and
Nisten
Nisten 1:08:59
Yeah, so there'll be 60.
1:09:00
And so every tool call that feeds it all the context — it's gonna start being like 60 bucks per tool call if you end up towards 1 million tokens.
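The pricing deltas and Nisten's per-call estimate can be checked in a few lines. A sketch using the prices quoted on air; the 2x usage-limit multiplier is the one the blog post applies to requests beyond 272K, and applying it here is our assumption:

```python
# API prices quoted on the show, in dollars per million tokens.
M = 1_000_000
gpt_52_in, gpt_54_in = 1.75, 2.50      # input, 5.2 vs 5.4
gpt_52_out, gpt_54_out = 14.0, 15.0    # output, 5.2 vs 5.4
pro_in, pro_out = 30.0, 180.0          # GPT-5.4 Pro

# Input got ~43% more expensive (the panel rounds this to "roughly 50%").
input_increase_pct = round((gpt_54_in / gpt_52_in - 1) * 100)
print(input_increase_pct)  # 43

# One Pro tool call near a full 1M-token context:
tokens = 1_000_000
call_cost = tokens / M * pro_in   # $30 of raw input spend per call
usage_equiv = call_cost * 2       # counted at 2x against usage limits
print(call_cost, usage_equiv)     # 30.0 60.0 -- matches Nisten's "60 bucks"
```

Because agents re-send the whole context on every tool call, long-context Pro sessions compound quickly, which is what makes benchmarking it so expensive.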
Alex Volkov
Alex Volkov 1:09:11
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:09:11
We have to set on limits, not just
1:09:12
on time, but also on cost.
Alex Volkov
Alex Volkov 1:09:15
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:09:15
anyone wanna sponsor this?
Alex Volkov
Alex Volkov 1:09:17
We should mention also that Codex is on Windows now.
1:09:19
So yeah, all of you on Windows can use this without running the Codex CLI. Let's look at the system card and then find what's important about this. So again, two models. I still haven't seen a comparison to Opus, I still haven't seen the differences between 5.4 Thinking and 5.4 Pro, so I would love to see those as well, if —
Nisten
Nisten 1:09:40
if, if we can do a quick test.
1:09:42
I pasted the prompt in the chat
Alex Volkov
Alex Volkov 1:09:45
Chat
Nisten
Nisten 1:09:45
here.
Alex Volkov
Alex Volkov 1:09:46
Yeah, I just sent
LDJ
LDJ 1:09:46
the link to the system card,
Alex Volkov
Alex Volkov 1:09:47
Okay.
1:09:47
The full system card. Okay, Nisten, let me pull this up, and then I will also test out the Mars thing. So we're gonna run Nisten's Mars prompt, and we can see already that — folks, we need to relearn this, okay? So far, when we said Codex, we sometimes meant Codex 5.3, the previous GPT version. Now this is GPT-5.4, the standalone one. So "Codex" now refers to the app, and GPT-5.4 within the Codex harness is running and doing one of these tasks. You can see a bunch of searches it did across different NASA websites, the JPL, et cetera. And it's running Nisten's prompt to calculate what a mass-driver rail would need — how long it would need to accelerate — to get people off of Mars, up Olympus Mons, Mars's tallest geographic feature.
Nisten
Nisten 1:10:43
It builds a megastructure on Mars and visualizes it.
1:10:46
And I really like to find one-shot prompts that you can vibe-check models with, because they actually tend to be pretty good comparisons between different models' capabilities. And we've used the same one for almost a year now. So yeah, it's a pretty complex megastructure thing that has to be built and visualized, with a maglev launcher built along the mountain on Mars. It's gotta launch stuff into space, and it's gotta make it all pretty and look like a video game. So it has to get the math right, the coding right, and the visuals right. It's a good one-shot test for agents.
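The core math the prompt demands is straightforward constant-acceleration kinematics. A sketch of the numbers a model would need to get roughly right — Mars's surface escape velocity (~5,030 m/s) is standard physics, while the 3g crew limit is our assumption for a human-rated launch:

```python
# Mass-driver sizing for a crewed Mars launch (back-of-envelope sketch).
v_escape = 5030.0     # m/s, surface escape velocity of Mars
a = 3 * 9.81          # ~3g sustained: an assumed human-tolerable limit

# Constant acceleration from rest: v = a*t, and d = v^2 / (2a).
t = v_escape / a
rail_length_km = v_escape**2 / (2 * a) / 1000

print(round(t))               # ~171 seconds of acceleration
print(round(rail_length_km))  # ~430 km of track
```

A ~430 km rail is why the prompt has the launcher run up the flank of Olympus Mons — the mountain's enormous base is one of the few landforms long enough, so a model that nails the math tends to nail the visualization too.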
Alex Volkov
Alex Volkov 1:11:25
LDJ, I think you're sending a few screenshots.
1:11:26
You wanna talk to them.
LDJ
LDJ 1:11:29
yeah, these are just some of the, the most drastic changes in benchmarks that I
1:11:34
saw while looking through the system card. MLE-Bench — Machine Learning Engineering bench. This is literally the ability of AI models to do AI research, machine learning engineering more specifically. And here we can see GPT-5.2 Thinking to GPT-5.2 Codex — that was little to no change, or maybe even a slight dip. Then it almost doubles in score from 5.2 Codex to 5.4 Thinking. Unfortunately, no 5.3 Codex there to compare against.
Alex Volkov
Alex Volkov 1:12:05
yep.
LDJ
LDJ 1:12:06
And then next,
Alex Volkov
Alex Volkov 1:12:07
I think most people are gonna be interested in because obviously
1:12:10
Codex is great, but I think most people are gonna be interested in figuring out how to compare this to Opus, and how to think about whether — having left OpenAI a day ago — they should come back now because of this model.
LDJ
LDJ 1:12:25
Mm-hmm.
1:12:25
And in many benchmarks, from what I recall — especially the coding benchmarks — 5.3 Codex was near identical or higher in most of them compared to Opus 4.6. Yeah. So unfortunately, for that specific benchmark you just showed, we don't have a 5.3 Codex score. However, the next image that I think I sent — I believe that one does have a 5.3 Codex score.
Alex Volkov
Alex Volkov 1:12:49
Let me open this up here.
1:12:51
This is the Monorepo bench. You wanna talk to this one?
LDJ
LDJ 1:12:57
Yeah, sure.
1:12:57
So, these are pretty similar at first glance, but if we look at the specific scores: 5.2 Thinking to 5.3 Codex is near identical — or actually maybe even a 0.7% drop, so probably within margin of error. And then a little over a 3% increase for 5.4 Thinking. So that does seem kind of significant when you look at the relative score changes over the generations.
Alex Volkov
Alex Volkov 1:13:23
Yep.
1:13:24
I gotta wonder about the design scores as well, like most models when they release. And I think it's very important to say, folks: this is a new model from OpenAI. It's not only a coding model — it's not a dedicated coding model. And it's now walking through the design of this system — index.html and style.css. So now it's coding up Nisten's very hard question; we can review the changes as well. But most of these models are used for other things. And so this model is live now — I think it's live on ChatGPT as well, and it's definitely live on Codex. Let's see if it's live on ChatGPT for us. I still have 5.3 Instant and — no, I only have 5.2 Thinking here. I didn't get it on ChatGPT via the Pro account yet, but I'm assuming they're rolling this out quick. A very interesting thing is that they released it on Codex already, and I gotta wonder if it's in Codex CLI — I'm assuming so. Let's take a look at Codex CLI; let's get out of Claude Code.
Yam Peleg
Yam Peleg 1:14:26
but you can use it if you specifically just name
1:14:30
the model, like dash-m GPT—
Alex Volkov
Alex Volkov 1:14:33
I have it here.
Yam Peleg
Yam Peleg 1:14:34
5.4.
Alex Volkov
Alex Volkov 1:14:36
I opened Codex and it's right here.
Yam Peleg
Yam Peleg 1:14:38
Oh really?
Alex Volkov
Alex Volkov 1:14:39
Nah — GPT-5.4.
1:14:40
Let's go to model, then hit enter, and then 5.4 Codex. I didn't even have to update the app — it's just here: "latest frontier-gen coding model." Oh, beautiful. The funny thing is, if you look at the Codex CLI, the description for both the previous state-of-the-art 5.3 Codex and the new 5.4 is "latest frontier-gen coding model." So they are considering this a coding model as well. Now, my question for this one is: is there a way for us to measure how autistic 5.3 was versus this new one? This is what I'd like to measure. 5.3 was very much a do-exactly-as-you-tell-it type of model, and so I gotta wonder if this one's gonna be a little bit better. It's kind of slow on Nisten's task — I would expect it to have finished by now.
Yam Peleg
Yam Peleg 1:15:31
I think the easiest way is just to give it, let it do some
1:15:34
testing, or unit tests — something where you can just write, you know, a specific unit test to check a contract or something, and just see if it gets what you're actually trying to measure. I mean, it's very easy to know if it's different.
Nisten
Nisten 1:15:51
You can try turning — Codex 5.3 into a therapy
1:15:56
bot or something compassionate.
Yam Peleg
Yam Peleg 1:16:00
I'm imagining the 5.3, like, what, what's gonna happen
1:16:04
if you ask 5.3 Codex about this? Yes.
Nisten
Nisten 1:16:07
5.3 Compliment the girl.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:16:11
Have any of you ever tried to have 5.3 or Codex basically
1:16:16
generate documentation for the code? It reads so much differently than if you have Opus write the documentation. Yeah — it talks like a specialist in the field, and it doesn't even care if you don't know what it's talking about.
Alex Volkov
Alex Volkov 1:16:32
Let's read some of the thought processes.
1:16:34
I asked GPT-5.4 to do one thing it could improve about the ThursdAI website, while we have Nisten's thing running behind — using the web design guidelines skill to judge the most impactful single improvement without changing the code. You can see some of the reasoning here; let's read through the reasoning. This is 5.4 Thinking: "I need to fetch the latest guidelines. Seems I should be using the web tool to access this information. I'll aim for the raw GitHub data, making sure I follow the instructions accurately." I don't remember seeing this kind of reasoning in GPT-5.3's thoughts — the self-affirmation thing is very, very interesting. Starting
Nisten
Nisten 1:17:10
to sound a bit like Opus.
1:17:11
Okay, let's go.
Alex Volkov
Alex Volkov 1:17:13
Yeah.
1:17:13
And right — it feels a little more humane, a little less straightforward-autistic than 5.3. It says also: "I should probably take a closer look at the CSS and the index file. For some tasks I might need tools like rg and sed to help me out. I think I should focus on producing one key improvement for the website, probably mentioning something related to file references." Okay, I like the thinking processes here, and I have the answer — "I'm pulling exact line references." So let's see the actual answer. This is kind of awful: "The one thing I'd improve is the site's core page accessibility. Add a proper main landmark with a skip-to-content link." Right now, the shared shell — what is this? I don't even know what this is.
Nisten
Nisten 1:17:59
'Cause if you want to get a good Google Lighthouse score,
1:18:03
So you have to hit all the accessibility stuff
Alex Volkov
Alex Volkov 1:18:06
and I guess, but this is, okay, so, so I guess, this
1:18:08
is not necessarily what I meant when I said improve the website — I meant, like, for the front end. We can run the same thing with Opus, just to compare between the two, and ask what the one thing is, and then see how Opus approaches it.
Nisten
Nisten 1:18:20
It's still a bit autistic.
Alex Volkov
Alex Volkov 1:18:22
Yeah, that's why I wanna see Claude.
1:18:25
Yeah, let's go like this.
Nisten
Nisten 1:18:26
It's not
Alex Volkov
Alex Volkov 1:18:26
a bad thing,
Nisten
Nisten 1:18:27
but I mean it's, it just changes your harnessing quite a bit.
Alex Volkov
Alex Volkov 1:18:32
Yeah.
1:18:32
Let's give the same one to Opus. All right, it's gonna go, and we're gonna compare. Codex's one thing — it gave me an accessibility thing. I gotta wonder if we can ask Codex to actually view the website and then give a response. Meanwhile, let's take a look at Opus and then compare the two outputs for fixing the one thing on the website. Look at this — the guests ticker: "It's a plain-text marquee of company names with no logos, photos, or links. It's a missed opportunity for social proof. Showing actual guest headshots or company logos would make it instantly more credible and visually compelling. Right now it reads like filler text rather than a trust signal." You guys — this is what I want from an intelligence. OpenAI, I love you, but when we refer to GPT Codex previously, and now to the new 5.4, as autistic, this is what we mean. Here's the comparison between the two answers to the question. Claude Code, Opus 4.6: when I asked it to improve one thing about the website, it launched the website — in Chrome, I think. Yeah, it launched the website, it looked at the website, and it said, hey, the guest section is plain text of company names with no logos — this guy, okay, this scrolly guy — and it said showing guests or company logos, even small ones, would make it instantly more credible and visually compelling; right now it reads like filler text rather than a trust signal. So basically it says: hey Alex, these should be logos. Codex said: the one thing I'd improve is the site's core page accessibility — add a proper main landmark with a skip-to-content link. This makes sense to no one, and I know that makes
Nisten
Nisten 1:20:15
sense to me, but
Alex Volkov
Alex Volkov 1:20:17
No, but like, there's no main, it's not a blog.
1:20:19
It's a podcast website — it shouldn't. Yeah, okay. I'll say, though, GPT-5.4 did not launch the website. It didn't see the website, so it is judging based off the HTML. But it's also something where I would say, hey, why wouldn't you just use your tools to launch the website? This is definitely a thing I would like to see.
Nisten
Nisten 1:20:42
it.
Alex Volkov
Alex Volkov 1:20:43
I'm
Nisten
Nisten 1:20:43
looking at it.
Alex Volkov
Alex Volkov 1:20:44
Yeah.
1:20:45
LDJ, you sent something, folks, for the vibe testing of GPT-5.4 — 5.4, not Codex, not 5.2 Codex — versus Opus 4.6. So if you have any examples you wanna comment on, feel free to give them to us. LDJ, go ahead.
LDJ
LDJ 1:21:00
Yeah, so the, the image Alex is about to bring up.
1:21:03
This is a comparison of Opus 4.6 to Gemini to GPT-5.4, and this is just on some popular benchmarks they have in common. These are mostly agentic benchmarks, but there's a couple of coding ones in there as well.
Alex Volkov
Alex Volkov 1:21:18
Oh, nice.
1:21:18
Is this from OpenAI?
LDJ
LDJ 1:21:21
Somebody in a group chat that I'm in just posted this.
1:21:23
Yeah. But I just fact-checked it against the system card to make sure these are not hallucinated figures, and all four of the scores I checked are correct, so —
Alex Volkov
Alex Volkov 1:21:32
yeah.
1:21:32
Okay. So let's take a look. GPT-5.4 Thinking — not the Pro one. So the two models are Thinking and Pro; this is GPT-5.4 Thinking, which — let's just call it 5.4 and that's it.
LDJ
LDJ 1:21:43
5.4.
1:21:44
Yeah.
Alex Volkov
Alex Volkov 1:21:44
Yeah.
1:21:45
GPT-5.4: 75% on OSWorld Verified versus Claude's 72. It looks like it's beating on all these benchmarks. By the way — oh, my bad. Yeah, there we go. Okay, is this better? Yes. Okay. So, for comparison: GPT-5.4, and we found some benchmark comparisons to Anthropic's 4.6 and Google's 3.1. LDJ, you wanna read through some of these for us?
LDJ
LDJ 1:22:13
OSWorld Verified: it's a bit higher than 5.3 Codex.
1:22:17
It's even more above Opus 4.6 — no score for Gemini 3.1 Pro, unfortunately. Let's see, what's one they have all in common? BrowseComp. Okay, so BrowseComp browsing: here we see 82.7% for GPT-5.4 Thinking, which actually looks like it's behind both Opus 4.6 and 3.1 Pro when comparing to competitors. Yeah. However, if we look at 5.4 Pro, it ends up having the actual overall state of the art at 89.3%.
Alex Volkov
Alex Volkov 1:22:51
And 5.4 Pro — if we go based on the previous Pro models — is
1:22:56
kind of like multiple fanned-out reasoning runs where they choose the best one across the runs. This is why they're so expensive — they're basically spinning up four and taking the best out of them, or —
LDJ
LDJ 1:23:08
Yeah,
Alex Volkov
Alex Volkov 1:23:08
Multi-agent.
1:23:09
Yeah.
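If Alex's guess is right — Pro fanning out several reasoning runs and keeping the best — the control loop is plain best-of-N sampling. A toy sketch of that loop; the scoring function, N=4, and everything else here are our assumptions, not OpenAI's actual mechanism:

```python
import random

def solve_once(question, seed):
    """Stand-in for one independent reasoning run (hypothetical)."""
    rng = random.Random(seed)
    answer = f"draft-{seed}"
    score = rng.random()  # stand-in for a verifier / reward-model score
    return answer, score

def solve_best_of_n(question, n=4):
    """Fan out n runs and keep the highest-scoring answer.

    Cost scales roughly n-fold, which is one way the Pro-tier
    pricing premium would make sense.
    """
    candidates = [solve_once(question, seed) for seed in range(n)]
    return max(candidates, key=lambda c: c[1])

answer, score = solve_best_of_n("How many r's in strawberry?")
print(answer, round(score, 2))
```

The same pattern explains the benchmark profile: best-of-N helps most on tasks with a checkable answer (math, browsing with a verifiable target) and less on open-ended ones.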
LDJ
LDJ 1:23:09
Yeah.
1:23:10
And actually, it's good that you brought that up. So there's things like Deep Think from DeepMind — that's basically the Gemini equivalent of that — which unfortunately we don't have to compare here. So it's possible that might even beat 5.4 Pro in something like BrowseComp, for example; we just don't have that data to compare, unfortunately. But we can see here — what are some other ones they all have in common? FrontierMath. Okay, so here — the second one from the bottom — 5.4 Thinking, on tiers one to three, scores 47.6%, and that is higher than both 3.1 Pro and Opus 4.6, significantly
Alex Volkov
Alex Volkov 1:23:47
higher.
LDJ
LDJ 1:23:48
Then tier four of FrontierMath: GPT 5.4 Thinking scores 27.1%.
1:23:55
Which is significantly higher than Opus 4.6's 22.9% and Gemini 3.1 Pro's 16.7%.
Alex Volkov
Alex Volkov 1:24:02
it's really good enough.
1:24:04
That's, this is why, I guess
LDJ
LDJ 1:24:04
so,
Alex Volkov
Alex Volkov 1:24:05
yes.
LDJ
LDJ 1:24:06
And then,
Alex Volkov
Alex Volkov 1:24:07
I just wanna add — shout out to Jordan from Everyday AI.
1:24:10
He said this is a super vague prompt, so you get what you pay for — not really a valid comparison. And I agree, this is a vague prompt. This is a prompt from me not knowing the model's statistics or how exactly I should talk about this. And Opus got exactly what I wanted without knowing anything — Opus did the thing that I wanted it to do, as a human. And prompt engineering can get you far with the GPT models if you're good at it, the way you are with Claude prompt engineering. And there's a difference between the CLAUDE.md and the AGENTS.md — all of that goes into this thing; this is the difference between harnesses as well. Prompt engineering — knowing exactly what you wanna build — can get you significantly farther with the GPT models. And the thing that I wanted to test is whether or not it's close to Opus at understanding me as a human. From these few prompts, it's not. But we should keep testing, because the numbers show a different story. Absolutely. Is it done? Nisten, LDJ, is there anything else we wanna cover here, or do you wanna finish?
LDJ
LDJ 1:25:09
the 5.4 Pro score for FrontierMath tier four, which is the 38%
1:25:13
one — at the tier-four level, it's almost double both Opus 4.6 and Gemini 3.1 Pro. Which, again, isn't a fully fair comparison, 'cause we don't have Deep Think to compare against at that tier.
Alex Volkov
Alex Volkov 1:25:24
That's true.
LDJ
LDJ 1:25:24
But it's really a big difference here.
Alex Volkov
Alex Volkov 1:25:29
I think Deep Think did launch with some FrontierMath numbers.
1:25:32
I'll look
LDJ
LDJ 1:25:32
for that while you guys move on to the next thing.
Alex Volkov
Alex Volkov 1:25:34
So we have comments from folks saying, "I want a model that
1:25:37
can handle a vague prompt like that." Yeah, that's kind of where I am as well. I think that's what I expect from the GPT stuff — the general intelligence, not the "hey, I know exactly how to ask you for stuff" intelligence. Nisten, let's look at our Mars thing for now.
Nisten
Nisten 1:25:52
Okay.
1:25:52
This is starting to look kind of good. we
Alex Volkov
Alex Volkov 1:25:54
have orbit run and we have the escape run.
Nisten
Nisten 1:25:57
Oh, camera director.
1:25:58
There's a camera director button on the right.
Alex Volkov
Alex Volkov 1:26:01
Like this.
Nisten
Nisten 1:26:02
Okay.
1:26:02
I think you can click it multiple times actually.
Alex Volkov
Alex Volkov 1:26:04
Oh, it started at t minus two seconds.
1:26:07
So supposedly we're gonna travel with this,
Nisten
Nisten 1:26:09
So you can change camera views.
Alex Volkov
Alex Volkov 1:26:10
Yeah.
Nisten
Nisten 1:26:11
Oh yeah.
Alex Volkov
Alex Volkov 1:26:11
There we go.
1:26:12
System view. And there's the camera director and the camera chase. This looks significantly more advanced than the previous one, right?
Nisten
Nisten 1:26:19
Yeah.
1:26:19
This is a lot better than what Codex did.
Alex Volkov
Alex Volkov 1:26:22
Yeah.
Nisten
Nisten 1:26:22
I think this is even better than,
Alex Volkov
Alex Volkov 1:26:24
Look at this.
1:26:24
Look at, are you seeing all this? Are you seeing the
Nisten
Nisten 1:26:25
pulses?
Alex Volkov
Alex Volkov 1:26:26
Yeah.
Nisten
Nisten 1:26:26
You can zoom out a bit so we can see it more clearly.
LDJ
LDJ 1:26:30
crazy.
Nisten
Nisten 1:26:30
amazing.
LDJ
LDJ 1:26:31
amazing.
1:26:31
It feels like just six months ago that we were trying to do this, and yeah, it would kind of do it with the frontier models, but it wouldn't have all these extra bells and whistles — these nice graphics, this camera-change option, this whole UI.
Alex Volkov
Alex Volkov 1:26:46
Okay.
1:26:47
This is
Nisten
Nisten 1:26:47
very good
Alex Volkov
Alex Volkov 1:26:47
This is so good.
1:26:48
There are stars in the background, there's a pulsing light on this thing, it gives you the flight log, and all of the calculations are supposedly very precise and correct. And this is the escape run.
Nisten
Nisten 1:26:58
3.54.
1:26:59
246, 380, 385.9 kilometers. Tracking. Yep. Yep. It got them right.
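The figures Nisten is spot-checking are the kind of thing that's easy to verify independently. A minimal sketch of the standard two-body formulas for Mars — standard constants; which on-screen numbers correspond to which quantities, and the app's exact inputs, are assumptions here:

```python
import math

# Standard physical constants for Mars
GM_MARS = 4.282837e13   # gravitational parameter, m^3/s^2
R_MARS = 3.3895e6       # mean radius, m

def circular_velocity(alt_m: float = 0.0) -> float:
    """Speed of a circular orbit at a given altitude above the surface, m/s."""
    return math.sqrt(GM_MARS / (R_MARS + alt_m))

def escape_velocity(alt_m: float = 0.0) -> float:
    """Escape speed at a given altitude: circular velocity times sqrt(2)."""
    return math.sqrt(2 * GM_MARS / (R_MARS + alt_m))

print(round(circular_velocity() / 1000, 2))  # ~3.55 km/s at the surface
print(round(escape_velocity() / 1000, 2))    # ~5.03 km/s at the surface
```

The surface circular-orbit speed of about 3.55 km/s is at least consistent with the "3.54" read off the screen, so an "orbit run" in that ballpark would pass a sanity check.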
Alex Volkov
Alex Volkov 1:27:07
Yeah.
Nisten
Nisten 1:27:07
So I might have to actually use this now.
Alex Volkov
Alex Volkov 1:27:10
the math is great.
Nisten
Nisten 1:27:10
This is the best one.
Alex Volkov
Alex Volkov 1:27:11
Yeah,
Nisten
Nisten 1:27:11
this is the best one that was
Alex Volkov
Alex Volkov 1:27:12
sent.
Nisten
Nisten 1:27:12
I'm not a fan of OpenAI, but this is the best one so far.
1:27:17
actually very impressed. Holy shit.
Alex Volkov
Alex Volkov 1:27:21
Yeah.
1:27:21
This is a very impressive model — this looks like a top-tier model. GDPval: we have GDPval here. We don't have this for Google, but we have it for Anthropic. GDPval is knowledge-work tasks, wins or ties. So basically, folks, if you're looking at a change in how the world works right now because of code, GDPval is that for everything else — for knowledge work. And Jordan is right, it's a very important benchmark. It's very interesting that the Thinking version takes a higher score than the Pro version, which supposedly fans out and uses more compute: 83% on GDPval, a jump of 13 points over GPT 5.2 and 5.3 Codex. So the jump here is quite crazy. The jump over Anthropic's Opus 4.6 is also there, although not as big. This model, 5.4 Thinking, is in the Codex interface — which is now on Windows — and in the Codex CLI, and it has a 1 million token context window. Any other things that we should read? I really wanna dig into the technical stuff in the model card. Lemme pull this up.
LDJ
LDJ 1:28:33
There is at least one more thing I noticed in their,
1:28:36
their Twitter posts for 5.4. Yeah, they mentioned the fact that — so I haven't used the native ChatGPT interface much lately; I've been mostly using Codex, so I don't know if this has been there for a while — but they mentioned the ability to interrupt in ChatGPT. So while it's thinking, you can send a message, and the message will actually go through and help influence the rest of its thinking process.
Alex Volkov
Alex Volkov 1:29:00
Oh, the steering.
1:29:00
Yeah, the steering. I still don't have this — I have 5.3 Instant, but I don't have 5.4 in my ChatGPT. So we're gonna wait.
LDJ
LDJ 1:29:09
I just sent this video here, by the way.
Alex Volkov
Alex Volkov 1:29:11
Yeah.
1:29:12
So let's take a look. The steering is one of the major things that I think all the labs are gonna catch up on — the steering is one of the major things that just works. Let's take a look at the video.
1:29:29
So we're seeing a video that says: "A baby Japanese macaque has stolen my heart. Where can I volunteer to be closer to animals?" And ChatGPT is thinking, and meanwhile the person writes, "I live in Cobble Hill, by the way." So while the model was thinking, the person added a comment, and then the thinking process said, "Perfect, I'm narrowing this down" to nearby options — so not all of Manhattan. And I'll say, as far as UI affordances go — it's not really an affordance, but as far as the ability of the model to actually understand — this happens to me all the time. In OpenClaw, for example, I would send something and then go, "Oh shit, I forgot to mention this one thing." I would have to stop the whole process and send it all again. So this is definitely something we like from the model — steering the model is something that we always, always do, and now it's in the ChatGPT interface. This is new, right? I haven't seen this before, but this is definitely new, and it says "gradually rolling out." Wolf, maybe you wanna comment on this, but on Terminal-Bench the previous model, 5.3, is still the best one.
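The steering behavior described here — a message sent mid-thought getting folded into the ongoing reasoning — can be sketched conceptually. This is purely illustrative: OpenAI hasn't published how ChatGPT implements it, and `reasoning_with_steering` and its toy steps are hypothetical.

```python
import queue

def reasoning_with_steering(task, steps, interjections):
    """Toy reasoning loop: before each step, drain any user messages that
    arrived mid-thought and fold them into the working context."""
    context = [task]
    for step in steps:
        while not interjections.empty():  # user typed while the model was "thinking"
            context.append(f"(user added: {interjections.get_nowait()})")
        context.append(step(context))
    return context

# Hypothetical two-step "thought": step 2 reacts to the interjection if present.
steps = [
    lambda c: "step1: listing animal shelters in NYC",
    lambda c: "step2: " + ("narrowed to Brooklyn"
                           if any("Cobble Hill" in x for x in c)
                           else "covering all boroughs"),
]
msgs = queue.Queue()
msgs.put("I live in Cobble Hill, by the way")  # queued before the loop reaches step 1
out = reasoning_with_steering("find volunteering options with macaques", steps, msgs)
print(out[-1])  # -> step2: narrowed to Brooklyn
```

The key design point is that the interjection doesn't restart the run — it lands in the working context so later steps can react, which is exactly the difference from stopping and resending the whole prompt.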
Wolfram Ravenwolf
Wolfram Ravenwolf 1:30:40
Yeah, it's the agentic stuff.
1:30:42
I think that is the thing here. I commented on this before — there's a discrepancy here: it regressed on some benchmarks, maybe it needs some additional training.
Nisten
Nisten 1:30:52
I think for a generalist model to be on par with the coding
1:30:57
model is a big deal, because usually all of that fine-tuning on just code makes it worse at everything else — people saw OpenClaw. The fact that it still gets the coding done, I think, makes it a lot more useful. There's some discrepancy there, where the coding model only shows how good it is on the benchmarks. But for actual real-world use, you're gonna have opinions in the app, you're gonna have descriptions, you're gonna have more artistic stuff that it needs to be good at, even when acting agentically. So to me, that's overall a way better model. I don't think the benchmarks show everything there.
Alex Volkov
Alex Volkov 1:31:38
Yeah.
1:31:39
LDJ, go ahead.
LDJ
LDJ 1:31:42
Yeah.
1:31:42
So, on that note of what Nisten just said, something important here is the fact that it seems much more efficient at using tools — needing fewer tool calls to do the same amount of things, or at least a less sequential, serialized set of tool calls. So that should also, at least in cases that require a lot of tool calls, make the overall task complete faster. So even though it might not be quite as good as Codex, or a little bit worse in certain agentic coding aspects, it might just get the job done very similarly, but much faster.
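The latency point LDJ is making — fewer serialized tool calls finish sooner — is easy to see with a toy sketch. `call_tool` is a stand-in for any tool invocation; when calls are independent, wall-clock time drops from the sum of the delays to roughly the slowest single one:

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    """Stand-in for one tool call (a file read, a grep, a web fetch)."""
    await asyncio.sleep(delay)
    return f"{name}: done"

async def sequential(tools):
    # One call at a time: total latency is the sum of all delays.
    return [await call_tool(n, d) for n, d in tools]

async def parallel(tools):
    # Independent calls issued together: latency is only the slowest one.
    return await asyncio.gather(*(call_tool(n, d) for n, d in tools))

tools = [("read_file", 0.1), ("grep", 0.1), ("web_search", 0.1)]

t0 = time.perf_counter()
asyncio.run(sequential(tools))
seq = time.perf_counter() - t0   # ~0.3 s: three delays back to back

t0 = time.perf_counter()
asyncio.run(parallel(tools))
par = time.perf_counter() - t0   # ~0.1 s: all three overlap
```

A model that plans its tool calls so more of them can run like `parallel` (or that simply needs fewer of them) finishes the same task faster even at equal per-call quality.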
Alex Volkov
Alex Volkov 1:32:15
Yeah, I think this was the feedback on GPT 5.3 as well —
1:32:20
that if you know how to prompt it correctly, you can spin up task agents and it will run and eventually do whatever you want after a while. We've been on the air with a little over 3,000 people here — been on the air for quite a while. Any other things? We probably should summarize at this point. Yeah, go ahead, Wolf.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:32:42
One thing: it's cheaper than Sonnet.
Alex Volkov
Alex Volkov 1:32:46
It's cheaper than Sonnet while it outperforms Opus
1:32:49
on multiple tasks. Also the 1 million context window, which Sonnet also has as an experimental feature. We'd only seen a 1 million context window in OpenAI models in the 4.1 models before, or for enterprise. So now we have a very long context window model — even though the calls that exceed into the experimental 1 million range are increasingly expensive. But folks, let's do a summary super quick: GPT 5.4 Thinking just dropped with 1 million token context window support. It's now live in the Codex app and is going live in ChatGPT. It has state-of-the-art reasoning across multiple benchmarks. We've tested it on coding — it's good at coding, though it still very much needs direct instructions. We're all excited to go play with this, obviously. Parting thoughts, folks? Nisten, what's your summary after seeing this perform on the Mars benchmark thing that we run here often?
Nisten
Nisten 1:33:56
I might start using it.
1:33:58
I might just use it through AMP too. But yeah, I think I might finally start — I haven't had a ChatGPT subscription in like two years.
Alex Volkov
Alex Volkov 1:34:07
Oh, okay.
Nisten
Nisten 1:34:08
this is kind of convincing me to do it.
Alex Volkov
Alex Volkov 1:34:11
yeah,
Nisten
Nisten 1:34:11
take that as you will
Alex Volkov
Alex Volkov 1:34:13
Folks are saying this is high praise for this one —
1:34:16
Riley Fox. I think with this, it's time to conclude ThursdAI for today. We didn't get to all the news, but we covered the most important pieces. Definitely an exciting day when OpenAI releases a new model. Wolf, you wanna finish up, and then —
Wolfram Ravenwolf
Wolfram Ravenwolf 1:34:28
Just wanted to say, I've started a Wolf Bench run on this model.
1:34:32
It's not the expensive Pro one. I will do this, and it will be the next one I put up.
Alex Volkov
Alex Volkov 1:34:37
Also, as we saw, the reported Terminal-Bench score is lower than
1:34:41
the Opus one, so we'll see — maybe it's better on the baseline and lower on the top score as well. Folks, if you missed any part of the show: the show is called ThursdAI, and you can find everything on our website, thursdai.news. Please feel free to visit — the episode links are there. Everything we talked about here, including the screenshots and evaluations, will be posted as a newsletter after this. We are here every week to talk about everything major that happens in AI, including breaking news like right now. We've been doing this for three years, so next week we're gonna celebrate exactly three years of ThursdAI News. Just for comparison: three years ago, GPT-4 was launched. This is 5.4, and it's not really the same world anymore. Everybody is changing how they treat these AIs, and many people are even going through AI psychosis. We're gonna be here to monitor the news for you, and we appreciate you tuning in today. Thank you so much for tuning in. Everything that we haven't covered will also be in the newsletter. Feel free to follow us everywhere you listen to podcasts or newsletters. Thank you so much, folks. With this, we're gonna end the show, and we'll see you here next week, hopefully with more news. Bye-bye.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:35:54
Bye bye.