Episode Summary

Two open-source labs sent representatives to the show in the same episode: Lou from Z.AI debuted GLM-5 (744B params, open-weights coding crown) and Olive Song from MiniMax revealed M-2.5 (80.2% SWE-Bench Verified with only 10B active params, at 1/20th the cost of Opus). Then Google dropped Gemini 3 Deep Think with an 84% ARC-AGI 2 score, the biggest single-week jump ever on that benchmark, and OpenAI answered with GPT 5.3 Codex Spark on Cerebras for real-time coding speeds. Oh, and ByteDance's Seedance 2 shattered video generation reality with 15-second multi-shot clips that feel like stepping into the future.

Hosts & Guests

Alex Volkov
Alex Volkov
Host · W&B / CoreWeave
@altryne
Lou
Lou
Z.AI - Head of DevRel
@louszbd
Olive Song
Olive Song
MiniMax AI - Senior Researcher
@olive_jy_song
Ryan Carson
Ryan Carson
AI educator & founder
@ryancarson
Nisten Tahiraj
Nisten Tahiraj
AI operator & builder
@nisten
Yam Peleg
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
LDJ
Nous Research
@ldjconfirmed
Wolfram Ravenwolf
Wolfram Ravenwolf
Weekly co-host, AI model evaluator
@WolframRvnwlf

By The Numbers

ARC-AGI 2
84%
Gemini 3 Deep Think - biggest single jump on this benchmark ever, up from Opus 4.6's 68%
SWE-Bench Verified
80.2%
MiniMax M-2.5 with only 10B active parameters, approaching Opus 4.6 levels
GLM-5 Parameters
744B
Z.AI's open-weights model with 40B active params, trained on Huawei chips
Cost per task
15ยข
MiniMax M-2.5 vs Opus 4.6 at ~$2.50 - 57% win rate at a fraction of the price
Training tokens
28.5T
GLM-5 trained on 28.5 trillion tokens, scaled up massively from previous version
Codex Spark speed
1,000+ tps
GPT 5.3 Codex Spark on Cerebras - real-time coding inference

🔥 Breaking During The Show

MiniMax M-2.5 โ€” 80.2% SWE-Bench Verified
Dropped 30 minutes before the show. 10B active parameters competing with Opus 4.6 at a fraction of the cost. Olive Song joined live to discuss.
Gemini 3 Deep Think โ€” 84% ARC-AGI 2
Dropped during the show. Biggest single-week jump in ARC-AGI history, from Opus 4.6's 68% to 84%. Also 48.4% on Humanity's Last Exam without tools.
GPT 5.3 Codex Spark on Cerebras
Ryan spotted it on X during the show. OpenAI's first model on Cerebras hardware, designed for real-time coding at extreme speeds.

📰 Intro & Highlights of the Week

Alex opens with the biggest open-source week in memory: GLM-5 and MiniMax 2.5 both dropped with representatives joining live. The panel shares their highlights: Wolfram picks GLM-5, Alex picks Seedance 2, and Yam is funding Anthropic's snack budget.

  • Both Z.AI and MiniMax sent reps to the show for live interviews
  • Open source competing directly with Opus 4.6 on benchmarks
  • Seedance 2 from ByteDance breaking everyone's brains
Wolfram Ravenwolf
Wolfram Ravenwolf
"What a week. So much cool stuff from China."

📰 TLDR - This Week's AI News Rundown

Alex runs through all the week's releases: GLM-5 and MiniMax 2.5 competing with Opus, XAI restructuring after SpaceX acquisition, Anthropic's sabotage risk report, OpenAI's deep research upgrade, and ByteDance's Seedance 2 shattering video generation.

  • GLM-5: 744B params, open-weights coding crown
  • MiniMax 2.5: 80.2% SWE-Bench with 10B active
  • Seedance 2: 15-second multi-shot video with sound

🔓 Interview: Lou from Z.AI on GLM-5

Lou from Z.AI joins at 1 AM Shanghai time to discuss GLM-5's architecture, the new SLIM reinforcement learning framework, and adoption of DeepSeek's sparse attention mechanism. She summarizes the model in four words: bigger, faster, better, and cheaper.

  • SLIM: new asynchronous RL framework for post-training
  • DeepSeek sparse attention for reduced deployment cost
  • GLM-5 trained on Huawei chips, not NVIDIA
Lou
Lou
"If I had to sum it up in four words, I would say bigger, faster, better, and cheaper."

🔓 Panel Discussion: GLM-5 Reactions

The panel reacts to GLM-5: Nisten notes it uses the DeepSeek architecture, Ryan highlights the dream of running open-source models locally for OpenClaw, and Yam emphasizes it's a model that can run general computer use at close to free.

  • Trained on Huawei chips, restricted GPU serving capacity
  • 50% on Humanity's Last Exam with tools, beating Opus 4.5 and Gemini 3 Pro
  • 34% on the AA hallucination benchmark, the lowest hallucination rate
Yam Peleg
Yam Peleg
"It's a model that can run general computer use at close to being free. Like that. That's crazy."
Ryan Carson
Ryan Carson
"I love seeing the competition because what we want is a really good open source model that can rival an Opus or a Codex for people to run Open Claw locally."

🔥 BREAKING: MiniMax M-2.5 Drops Live

Breaking news during the show: MiniMax releases M-2.5 just 30 minutes before airtime. Alex brings on Olive Song from MiniMax to announce the model live.

  • 80.2% SWE-Bench Verified
  • 10B active parameters, 200B total (see the parameter-count sketch below)
  • Dropped live during the show
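
For folks newer to the MoE jargon, here is a rough back-of-envelope sketch of why a mixture-of-experts model can total 200B parameters while only ~10B are "active" per token: a router picks a few experts for each token, so per-token compute tracks the active count, not the total. The expert counts and sizes below are our own illustrative numbers chosen to land near the quoted shape, not MiniMax's disclosed configuration.

```python
# Illustrative only: made-up expert counts/sizes, not MiniMax's real config.
def moe_param_counts(num_experts=64, experts_per_token=4,
                     params_per_expert=3.0e9, shared_params=1.0e9):
    """Return (total, active-per-token) parameter counts for a toy MoE."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active

total, active = moe_param_counts()
print(f"total: {total / 1e9:.0f}B params, active per token: {active / 1e9:.0f}B")
# -> total: 193B params, active per token: 13B
```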

🔓 Interview: Olive Song from MiniMax on M-2.5

Olive Song discusses their Forge RL framework, how they trained efficiency into the model (fewer tool calls, fewer thinking tokens), and reveals the model is actually still training; they cut a checkpoint to release because developers were asking.

  • Forge: decoupled RL framework training diverse tasks without interference (see the rollout sketch below)
  • Model optimized for end-to-end task time, not just benchmark scores
  • Still training - cut a checkpoint for early release
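
As a rough illustration of the "decoupled" idea Olive describes (Forge itself isn't public; everything below, including the task names, is a hypothetical sketch): rollout workers for different task types run asynchronously and feed a shared queue, and the trainer consumes trajectories as they arrive rather than waiting on one synchronized batch, so slow environments don't block fast ones.

```python
# Hypothetical sketch of decoupled async RL rollouts; not Forge's actual code.
import queue
import random
import threading
import time

trajectory_queue = queue.Queue()

def rollout_worker(task_type):
    # Each worker owns its environment and agent strategy end to end.
    for step in range(3):
        time.sleep(random.random() * 0.01)  # stand-in for env/tool interaction
        reward = random.random()            # stand-in for a verifier score
        trajectory_queue.put((task_type, step, reward))

tasks = ["bug_fix", "repo_from_scratch", "complex_tool_use"]
workers = [threading.Thread(target=rollout_worker, args=(t,)) for t in tasks]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Trainer side: consume whatever finished, whenever it finished.
while not trajectory_queue.empty():
    task, step, reward = trajectory_queue.get()
    print(f"update from {task} step {step}: reward={reward:.2f}")
```

In a setup like this the diverse tasks "don't interfere" in the scheduling sense; the gradient-level isolation Olive mentions would live in how the trainer batches and weights these samples.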
Olive Song
Olive Song
"A funny story about this release is that as we are talking right now, the model is actually still training and then the accuracy is still scaling."

🔓 Panel Discussion: MiniMax & Open Source Momentum

The panel discusses the jaw-dropping pace of open-source progress. Nisten notes benchmarking concerns but acknowledges the model's real utility for multi-agent orchestration. LDJ highlights the cost-per-intelligence advantage.

  • MiniMax 2.5 beats Gemini 3 Pro on SWE-Bench
  • Can run on a Mac Studio M3 Ultra at 80+ tps
  • Open source now one week behind frontier on benchmarks
Nisten Tahiraj
Nisten Tahiraj
"You can buy something for $8,000 like an M3 Ultra, and I think it does like very good speeds, like over 80 tokens per second."

💰 This Week's Buzz - W&B Inference

Alex announces day-zero GLM-5 support on the W&B Inference service powered by CoreWeave, alongside the newly added Kimi K2.5, with MiniMax 2.5 coming soon. Free credits available for testing.

  • GLM-5 live on W&B Inference day zero
  • Free credits for testing via @wandb on X

๐Ÿข XAI Restructuring & SpaceX Acquisition

Multiple XAI co-founders departed after SpaceX acquired XAI. The company restructured into four buckets, including LLM/Voice, Coding, and Macro Hard (data centers). Grok 4.2 is nowhere to be found, and they're talking about putting GPUs in space.

  • 300,000 GPU Memphis training cluster - largest in the world
  • Jimmy Ba (co-author of Adam) left, said recursive self-improvement coming this year
  • Restructured into 4 divisions including Macro Hard
Alex Volkov
Alex Volkov
"I use Grok for research, specifically X research. Grok itself has API access to X better, faster than you."

📰 Matt Shumer's Viral AI Article & The Acceleration

The panel discusses Matt Shumer's viral article (74M views) about the speed of AI progress and the gap between AI-native people and everyone else, and Ryan shares a real-world case study of end-to-end AI engineering.

  • 74 million views on Matt Shumer's article
  • Feb 5 models made everything before feel like a different era
  • Harness Engineering case study on Codex in production
Ryan Carson
Ryan Carson
"People are beginning to actually, from end to end having zero humans involved in the writing or reading or reviewing or shipping of code. It's starting to happen."

🔥 BREAKING: Gemini 3 Deep Think - 84% on ARC-AGI-2

Breaking news mid-show: Google drops Gemini 3 Deep Think with 84% on ARC-AGI 2 (up from Opus 4.6's 68% just one week prior) and 48.4% on Humanity's Last Exam without tools. The biggest single jump in ARC-AGI history.

  • 84% ARC-AGI 2 - up from 68% (Opus 4.6) one week ago
  • 48.4% Humanity's Last Exam without tools
  • Biggest single-week jump in benchmark history
Yam Peleg
Yam Peleg
"Google drops Gemini three deep thinking, significant upgrade to deep thinking. Basically state-of-the-art on ARC AGI 2, to the best of my knowledge."
Alex Volkov
Alex Volkov
"The jump in ARC-AGI. What the fuck just happened?"

🔥 BREAKING: GPT 5.3 Codex Spark on Cerebras

Another breaking news: OpenAI releases GPT 5.3 Codex Spark, a smaller version of Codex designed for real-time coding, in partnership with Cerebras for insane inference speeds. Available to ChatGPT Pro users.

  • First OpenAI model on Cerebras hardware
  • Designed for real-time coding at 1,000+ tokens/sec
  • Available in Codex app, CLI, and IDE extension
Ryan Carson
Ryan Carson
"GPT 5.3 Codex Spark, what? Last two minutes ago. So I'll read a little bit from that."

🎥 Seedance 2 - ByteDance's Mind-Bending Video Model

Alex demos ByteDance's Seedance 2, a video generation model that accepts 9 images + 3 videos + 3 audio clips as reference. The multi-shot consistency, native audio, and physics are at a level that makes the original Sora feel like a different era.

  • 15-second high-quality multi-shot with native stereo audio
  • 9 images + 3 videos + 3 audio clips as input references
  • 45-second internal test mode available
Alex Volkov
Alex Volkov
"These videos are generated with Seedance 2. It feels like the jump from when we were before Sora and then we saw Sora for the first time."

🤖 Agent Psychosis & The Sleep Problem

The panel gets real about the mental health impact of running AI agents 24/7. Multiple panelists report sleep disruption, FOMO about underutilizing their agents, and the paradox that tools meant to reduce work are creating more anxiety.

  • Ryan wakes up at 2 AM regularly worried about agents
  • Wolfram worries about shutting down agents for security
  • The primitives for managing agent teams don't exist yet
Ryan Carson
Ryan Carson
"No one's running agents 24/7 and actually doing productive work. They may be running small teams of agents to build real apps, but we're just not there yet."
Wolfram Ravenwolf
Wolfram Ravenwolf
"Every moment an agent is not running, you think you are losing time. You know you are wasting time because it could be doing something for you."

🎥 ByteDance Seedance 2.0 - Shattering Reality

Continued deeper dive into Seedance 2 demos, showing multi-shot character consistency, anime-style generation, and native audio with environmental sounds. Available on the BytePlus platform.

  • Character consistency across multi-shot sequences
  • Anime and realistic style modes
  • Available on BytePlus platform

📰 Wrap-Up & Goodbye

Alex recaps an insane show: two open-source lab interviews, two breaking news drops (Gemini 3 Deep Think and GPT 5.3 Codex Spark), and Seedance 2 demos. Over 2000 listeners tuned in.

  • 2000+ live listeners
  • 4 breaking events in one episode
  • Coming up on 3 years of ThursdAI
TL;DR of all topics covered:

  • Hosts and Guests

  • Open Source LLMs

    • Z.ai launches GLM-5: 744B parameter MoE model achieving #1 open-source ranking for agentic coding with 77.8% SWE-bench Verified (X, HF, Wandb)

    • MiniMax M2.5 drops official benchmarks showing SOTA coding performance at 20x cheaper than competitors (X)

  • Big CO LLMs + APIs

    • XAI co-founders quit/let go after the SpaceX acquisition and restructuring (X, TechCrunch)

    • Anthropic releases Claude Opus 4.6 sabotage risk report, preemptively meeting ASL-4 safety standards for autonomous AI R&D (X, Blog)

    • OpenAI upgrades Deep Research to GPT-5.2 with app integrations, site-specific searches, and real-time collaboration (X, Blog)

    • Gemini 3 Deep Think SOTA on ARC-AGI 2, HLE (X)

    • OpenAI releases GPT 5.3 Codex Spark, backed by Cerebras with over 1,000 tok/sec (X)

  • This week's Buzz

    • W&B Inference launch of Kimi K2.5 and GLM-5 🔥 (X, Inference)

    • Get $50 of credits to our inference service HERE (X)

  • Vision & Video

    • ByteDance Seedance 2.0 launches with unified multimodal audio-video generation supporting 9 images, 3 videos, 3 audio clips simultaneously (X, Blog, Announcement)

  • AI Art & Diffusion & 3D

    • Alibaba launches Qwen-Image-2.0: A 7B parameter image generation model with native 2K resolution and superior text rendering (X, Announcement)

  • Tools & Links

    • Entire raises $60M seed to build open-source developer platform for AI agent workflows with first OSS release ‘Checkpoints’ (X, GitHub, Blog)

    • Chrome 146 introduces WebMCP: A native browser API enabling AI agents to directly interact with web services (X)

    • Ryan Carson's AntFarm - Agent Coordination (X)

    • Steve Yegge’s “The AI Vampire” (X)

    • Matt Shumer’s “something big is happening” (X)

Alex Volkov
Alex Volkov 0:30
Welcome, everyone.
0:31
Welcome to ThursdAI for February 12th. My name is Alex Volkov. I'm an AI Evangelist with Weights & Biases from CoreWeave. ThursdAI is brought to you by Weights & Biases, and I am really excited about today's show. We have folks tuning in on YouTube, and we have our own Wolfram Ravenwolf over here, and Ryan Carson. What's up guys? How are you guys doing?
Ryan Carson
Ryan Carson 0:54
Good to see everybody.
Alex Volkov
Alex Volkov 0:56
Good to see you guys.
0:57
Hi.
Wolfram Ravenwolf
Wolfram Ravenwolf 0:57
What a week.
0:58
So much cool stuff from China.
Alex Volkov
Alex Volkov 1:01
Open source is back, open source is back with a vengeance,
1:05
and we have so much new stuff to talk about, including the breaking news from MiniMax: MiniMax 2.5 just dropped, literally 30 minutes before the show. And the thing that I will start the show with is that both companies will have their representatives on the show today for us to interview live. So we're gonna welcome Lou from Z.ai, formerly known as Zhipu AI; we have Lou to talk to us about their release, and then we're gonna have the lead researcher on the MiniMax team, Olive Song, join us also to talk about the drop that they had. Both models are competing with the previous Opus on benchmarks. This is insane, folks. Open source is catching up super quick, open weights. These models are now live, and I'm very, very excited. So with that, I wanna say hi to everybody who tuned in. I see a bunch of folks on Twitter Spaces and LinkedIn, and we have a bunch of folks on YouTube as well, and obviously you guys. So how has your week been? Let's talk about an AI thing that happened, and then we can dive into the show while people kind of pick up.
Ryan Carson
Ryan Carson 2:19
I'll go first.
2:20
Wolfram, good to see you. Alex, good to see you as always. So I shipped an open-source project called AntFarm, just to kind of orchestrate agents on top of OpenClaw. People seem to like it. I'm going on TWiST tomorrow to talk about it. That was kind of what I was thinking a lot about: agent orchestration. Do you do it on top of OpenClaw? Do you do it somewhere else? Everything's changing so fast.
Alex Volkov
Alex Volkov 2:41
That's awesome.
2:41
Wolfram, how are you doing?
Wolfram Ravenwolf
Wolfram Ravenwolf 2:43
Oh, this is the week I've been, yeah,
2:45
you know what I've been up to? Doing evaluations left and right as new models come out. And I'm so excited because I find so many interesting things, and soon we will be able to talk about these, so I'm looking forward to that. This week, my highlight: actually, when it came out as Pony Alpha, I tested this model in German and I was a bit disappointed, because in German it was very weak. But now that it is released and I tested it in English, I have to say, wow, amazing. I'm really impressed. I'm looking forward to talking more about how we can use it with some of the agentic tools that we have, but that will be a topic for another time. For now, my highlight of the week is GLM-5, and there are more highlights; it's hard to say, but right now, if I have to pick one, I will pick this.
Alex Volkov
Alex Volkov 3:26
I heard that we're gonna have two breaking news releases today.
3:33
I didn't know about the open source models. I will just say, folks, we have our own Weights & Biases CoreWeave inference, and we have day-zero support for GLM-5, and we're gonna work really, really hard to get you MiniMax 2.5 as well. If you want a little bit of credits, I will point you to our main account, and I'm probably gonna link this in the show notes below so that you'd be able to claim some credits, because top-tier intelligence doesn't come cheap, but we're gonna give it to you cheap. So, for me, the absolute excitement of the week was, you know what, I'm not gonna go with LLMs. You guys went with LLMs. I'm gonna go with fucking Seedance, which broke my brain this week. Just absolutely broke my brain. If your feed is not full of Seedance videos yet, it will be very, very soon. ByteDance released an announcement today that Seedance is official, but they also pre-released Seedance on a bunch of their platforms. ByteDance are obviously the folks who build TikTok, and they've trained it on everything. I'm pretty sure that our faces are in there, Ryan, Wolfram, like, our faces are in there, but also every Hollywood movie, every Hollywood actor. There is a story about a Chinese influencer where all he needed to do was put in his face, and it generated him saying something, and it was almost perfectly his voice, because they literally trained on everything. It's really crazy. So I will say that Seedance is mine for sure. And it's
Wolfram Ravenwolf
Wolfram Ravenwolf 5:01
my second choice.
Alex Volkov
Alex Volkov 5:02
Yeah.
5:02
It was your second choice.
Wolfram Ravenwolf
Wolfram Ravenwolf 5:04
Yeah, definitely.
Alex Volkov
Alex Volkov 5:05
alright.
5:05
Right folks. So we're gonna start with a quick TLDR, and then we're gonna have a chance to chat with Lou from Z.AI about GLM, and then we're gonna continue with the show. With this, I wanna just say Yam is with us. What's up man? How are you doing? Coming through a little bit pixelated, yeah, but we'll figure it out.
Yam Peleg
Yam Peleg 5:22
usual.
5:23
crazy week. Crazy
Alex Volkov
Alex Volkov 5:24
week.
5:24
tell us about one thing in the world of AI that you are absolutely not gonna miss.
Yam Peleg
Yam Peleg 5:30
My API bill has grown just a little bit this week.
5:35
I don't know, twice, three times, but, man, I'm funding the snacks at Anthropic's office, like, real good. I'm really funding their snacks, man. Seriously. I hope you get good food over there, guys, at Anthropic. Seriously? Oh, yeah.
Alex Volkov
Alex Volkov 5:49
one news item, it must be... Oh,
Yam Peleg
Yam Peleg 5:50
GLM.
5:51
It must be GLM, it must be, there's no doubt. GLM, for sure.
Alex Volkov
Alex Volkov 5:54
Alright.
5:55
and just because she's smiling, Lou is here. Hey, Lou. Welcome. We're gonna do the TLDR, but hi, guys, to Lou from Z.ai, and we're gonna chat with her very, very soon. But first we're gonna run through everything that we have to talk about on the show for this week, in the corner called TLDR, or TLDW, too long, didn't watch, or didn't listen. So we're gonna do the TLDR super quick, because there's a bunch of stuff to talk about, and then we're gonna be chatting with Lou.
Lou
Lou 6:23
Sure.
Alex Volkov
Alex Volkov 6:33
All righty.
6:33
This is it. This is the TLDR. This is everything that happened in the world of AI that we're going to cover on this week's show. Starting with the open-source models, open-source LLMs: open-source intelligence has been absolutely blowing up this week, following the crazy last week where we saw, live on the show, drops from OpenAI's new Codex model that's absolutely bonkers. We're gonna continue today on the show. Your host is Alex Volkov, an AI Evangelist with Weights & Biases. Of course, our co-hosts: Wolfram Ravenwolf, Yam Peleg is here, Ryan Carson; Nisten is gonna come back, LDJ at some point. Our guests today are Lou from Z.ai, and we have the MiniMax folks as well, Olive Song from MiniMax. And of course, open source is absolutely dominating this week. Just domination across the board, folks. If last week was the week of the big labs, this week we're going to cover two major releases, one of them released just yesterday, one of them today. So GLM-5 is the breakaway open-source model of the week. Just released yesterday, already up on Weights & Biases Inference, by the way, and we're gonna tell you all about how to get it. GLM claims the open-source coding crown with incredible performance; we're gonna talk about all of this. The highlight there is Vending-Bench, a very interesting benchmark about real-world tasks, but there are a bunch of other ones: 744 billion parameters with only 40 billion active. We're gonna mention everything, including the fact that they significantly increased the amount of training tokens they used to train this model. Obviously, we have representatives of Z.AI here on the show to talk to us about this. Following this, just 30 minutes ago, MiniMax released their updated MiniMax 2.5, which they claim is state-of-the-art at coding. We're gonna compare between them. SWE-Bench Verified at 80.2%; SWE-Bench Verified is a very difficult benchmark for coding performance. And with 200,000 tokens in the context window, this model is absolutely a beast, and we're gonna chat about that one as well very soon. Following this, we're gonna talk about big companies and LLMs and APIs. There's a big departure of folks from XAI, Elon's company, following the acquisition by SpaceX. SpaceX acquired XAI; XAI previously acquired X. So now if you're tweeting, you're basically tweeting on, you know, a space company's dime. But a bunch of co-founders of XAI announced their departure. It wasn't clear, and then Elon went on a whole spiel talking about the future of XAI in space, plus restructuring, so we're gonna talk about the restructuring there. Then Anthropic released a very interesting document for Claude Opus 4.6, a sabotage risk report, and presumably this model is, like, much riskier, so we're gonna talk about this a little bit as well. OpenAI upgraded their Deep Research finally to the 5-series; the previous one was based on the 4-series. What the heck, Advanced Voice Mode is still 4-series, but Deep Research now is 5-series, with app integrations. Definitely worth checking out. In This Week's Buzz, we have an announcement that we have put GLM-5 day-zero support on our inference service, and plus we also updated it with Kimi K2.5, which is also multimodal, so you'd be able to do all kinds of stuff with inference, including running it in Claude Code, OpenClaw, et cetera. And I think the big one.
And I wish I was able to use this more than the two times I did. Seedance, from ByteDance. Seedance 2 is, let me put it this way: Seedance 2 shatters reality limits the way the first Sora shattered reality limits when we saw it for the first time. Seedance 2 is a new video model and it's just absolutely mind-bending in the character consistency and the physics and the sound. If you folks are tuning in on Twitter Spaces, you should tune in on the video channel when we get to the Seedance part, because it is just gonna break your mind. The downside is, unless you have a Chinese phone number, it's really hard to use this model, because they put it up and they took it away. But it was announced today: it supports nine images as reference, plus three videos, plus three audio clips. Just an absolutely bonkers model that shatters reality for us. So we're gonna chat about Seedance a bunch as well. In AI Art & Diffusion, folks from Qwen launched Qwen-Image 2.0. It's a 7-billion-parameter image generation model, it's near SOTA, and it's really, really cool. It's hard to launch image models now, with Nano Banana Pro having been around for such a long time, but they did it anyway, and it's absolutely a great model at 7 billion parameters. In the tools section, we're gonna announce two things, and we're gonna probably talk about AntFarm as well from our own Ryan Carson. But in the tools area: Entire, a new company, raised a $60 million seed to build an open-source developer platform for AI agents, backed by the previous GitHub CEO, Thomas Dohmke. And then Chrome launched something very interesting, WebMCP. I don't know if you guys heard about this, but it's definitely worth talking about on the show. This is the TLDR, folks. I know for a fact there's gonna be more breaking news today. GLM-5 is finally here, folks, and I would love to introduce, first time on the show, Lou from Z.AI. Lou, welcome to the show. It's great to see you.
Lou
Lou 11:45
Good to see you, Alex.
Alex Volkov
Alex Volkov 11:47
It's very, very late for you, right?
11:48
So, please introduce yourself. We're not gonna take too much of your time, but please introduce yourself to the folks and then we're gonna dive into the release that you guys have.
Lou
Lou 11:55
Okay?
11:56
So hi, thanks to Alex for giving GLM this opportunity. I appreciate it. And I'm Lou, head of DevRel at Z.AI. So GLM-5, you know, was just released yesterday, and today I'm here to step away from the coding angle and try to talk about what agents can actually do for us right now, and along the way we'll get a feel for what's new and improved in GLM-5.
Alex Volkov
Alex Volkov 12:24
All righty.
12:25
first of all, thank you so much for joining. I know it's late and you guys have been through a release, and it's bonkers; everybody's using this model. It's great at coding. I had a chance to test it out in OpenClaw and it is absolutely great. So I have a few questions for you. We're gonna obviously talk about some evals, et cetera, but I have a few questions for you. We've talked about GLM since before you were Z.AI, I think since the Zhipu days, and I keep updating folks about the new name, but I think it's time to just say Z.AI. GLM 4.5 was definitely a great coding model; 4.7 beat that. What makes this one a whole major release versus just a 4.8 or 4.9? What makes this a five?
Lou
Lou 13:06
So,
13:06
Like, first, the scale has increased a lot, 'cause we pushed the model's general intelligence to a higher level with much larger pre-training compute. Yeah. And then we introduced a brand-new asynchronous reinforcement learning framework called SLIM; it largely improves the efficiency of post-training with RL. And plus, we adopted DeepSeek's sparse attention mechanism; it preserves long-context performance while dramatically reducing deployment cost. So if I had to sum it up in four words, I would say bigger, faster, better, and cheaper,
Alex Volkov
Alex Volkov 13:48
bigger, faster, better and cheaper.
13:50
All of them together. This is great. I wanna talk about some of the evals and benchmarks that you guys are most proud of. So, Humanity's Last Exam: you guys compare yourselves to the previous GLM here. For folks who are not watching, we're looking at the evals chart here from Z.AI themselves, and they have a bunch of comparisons to Claude Opus 4.5 and Gemini 3 Pro. There are a few benchmarks in which you guys are beating Gemini 3 Pro. Gemini 3 Pro is a huge model from Google, and yet there's an open-weights model now with, what, 744 billion parameters (we can talk about the scale as well) that you guys built in open source. Tell me about how the team works out which benchmarks you're gonna try to hit. Is this continual training of the model, where you watch performance at some point and say, okay, this is enough? Tell me about what makes this model special.
Lou
Lou 14:39
Okay, Alex, so, you know, you invited me here, but I'm not gonna spend too much
14:43
time talking about coding benchmarks. Oh yeah. But without carefully designed, high-difficulty benchmarks, it's getting harder and harder to tell the upper limits of a model just from simple test cases. So, you know, in everyday scenarios the differences aren't always obvious. For us, we are the coding SOTA, I would say. Yeah, we really put a lot of effort into coding capabilities.
Alex Volkov
Alex Volkov 15:12
So
15:12
You guys mentioned the switch from vibe coding to agentic engineering. I think Peter from OpenClaw also talks about agentic engineering versus vibe coding. What in this model is focused on agents, you know, multiple back-and-forth conversations? Could you talk about this? We would love to hear from you, from an actual lab that releases these models, how the world shifting to agentic engineering has impacted you, and how you approached building this model with that in mind.
Lou
Lou 15:36
So I literally think we have stepped into the era of agentic engineering, and
15:42
from my own hands-on testing, GLM-5 on agents really holds up. It handles multi-step decomposition well, calls tools when needed, pushes tasks forward on its own. The overall task completion quality feels like it's getting real, and I've seen a lot of developers already shipping pretty complex coding workflows with it. So you define the process once, almost like an SOP, and the agent can execute it. With memory in interactions with agents, the AI starts to build a real understanding of your habits, and over time it feels less like a tool you prompt and more like a system that actually knows how you work.
Alex Volkov
Alex Volkov 16:26
I
16:26
So, a few follow-ups and we'll let you go, 'cause we're mindful of your time at night. But I wanna ask you about two things. Vending-Bench, for folks, just a refresher: Vending-Bench is from Andon Labs, I believe. They basically have this benchmark where it's a real-world task; they give a model the chance to run a vending machine, with access to restocking, et cetera. Let's talk about availability. Okay. So, first of all, I can say we worked with you guys; GLM-5 is now available on Weights & Biases. But also, you guys have had your own platform for a while and have been serving multiple things, and plus there's a coding tier as well.
Lou
Lou 17:02
Okay.
17:02
So, you know, GLM-5 was just launched yesterday. We're still rolling it out, you know, for Coding Plan users, from Max to Pro; a few hours ago we just opened it for Coding Plan Pro users. But GLM actually performs really well there too, which shocked me a little bit, because I often thought that was our shortcoming, but it performs well. And I use the slides feature a lot.
Alex Volkov
Alex Volkov 17:36
Awesome.
17:36
So, well, yeah. Thank you, Lou, thank you for coming and talking about the model as well. Folks, you can absolutely test out the model. It's available on OpenRouter from day zero and on their own platform; there's a coding tier as well, $1 input and $3.20 output, and on Weights & Biases, folks, it is like almost nothing. I'll tell you guys how to get this for free. I'm gonna bring back the co-hosts here, and Lou, if you are able to stay for one more question, I think Wolfram had a follow-up; otherwise we are gonna definitely do some discussions about this model.
Wolfram Ravenwolf
Wolfram Ravenwolf 18:05
Yeah.
18:06
I found it very interesting that you not only released a model, you also released a Terminal-Bench 2 Verified. So you went through the benchmark and improved it. Do you think that is the new one to use, or do you have any information about this? Because I found it notable.
Lou
Lou 18:22
Yeah, yeah. Because for these benchmarks, some of my colleagues
18:27
from my team came up with the idea. They went through some research and thought, oh, we should do that, we should definitely do that and see how we could evaluate the model further. So that's why we tested it, and it went amazingly well. Actually, a few months ago we started this kind of thing, and now we just see how the models perform on this bench. So we keep going, and, you know, when we train a model, we just put more effort, more attention, on this.
Alex Volkov
Alex Volkov 19:07
Hi Lou.
19:07
Thank you so much for joining us. Thank you for taking the questions; Vending-Bench is very, very interesting, and we'll continue talking about the model. We know it's fairly late for you there, so congratulations on the launch. Regards from our teams; we always applaud open source here on the show, so thank you so much for releasing the model for everybody to be able to use. Again, 744 billion parameters. Not something that people can just use at their house, but still, the fact that open source is catching up to the frontier labs in this space is astonishing to see. And it's great to have you here on the show. Feel free to come back when you guys release more models as well. Meanwhile, we're gonna say goodbye to Lou. Thanks, thank you for joining us. Folks, panel, what do we think? First of all, huge shout-out to Lou for joining.
19:49
It's 1:00 AM in Shanghai, if you can believe it,
19:52
and she's joining despite everything. I don't know if I would be able to livestream like this and take questions from Alex at 1:00 AM with that energy, but holy shit, GLM-5.
20:02
Nisten, would love to hear from you. You're back, your microphone works. Tell us, what do you think about GLM?
Nisten Tahiraj
Nisten Tahiraj 20:08
I made a fun animation of all the weights and stuff, and, I
20:11
find it pretty interesting that they're now using the DeepSeek architecture.
Alex Volkov
Alex Volkov 20:16
DSA stuff, right?
Nisten Tahiraj
Nisten Tahiraj 20:17
Yeah, but also overall, like the way that they share the experts
20:21
and the way the gates switch between shared and non-shared experts in the pool, it's mostly still the DeepSeek architecture. So this was the change that I found pretty, pretty interesting. I don't find it quite at Opus 4.5 level still, but it is close enough and it's reliable enough that now you might actually want to consider it for daily agent-use stuff. If repeatability is a more important factor for you, if that can harm your work, you might actually want to consider this at this point. So it is getting pretty crazy how fast the catch-up is happening.
Alex Volkov
Alex Volkov 21:05
And we have a bunch of offerings; folks are posting in the chat.
21:08
Like, we have Kimi K2.5, and then GLM and MiniMax 2.5 that just came out. Open source is catching up super quick, especially for agentic stuff. Now the kicker is, when we talk about comparison to Opus on price-performance, these models absolutely mog Opus in every possible way. Maybe not on speed, but definitely on price. All of these companies now come up with their own Max plans. I don't know if you saw it (we're gonna talk about MiniMax as well), but they have a Max plan or coding plan, and it's significantly cheaper, like $300 a year or something versus $200 a month. So that's very interesting. The thing that I didn't get a chance to talk to Lou about is that they trained GLM on Huawei. So GLM is not trained on NVIDIA chips anymore; it's trained on Huawei chips. And the GLM-5 series of models is not up on their coding plan yet, because they're restricted in their ability to serve it with GPUs. So despite the fact that they trained the model at twice the size (if you actually walk through the numbers), they're still stressed for GPU power. I would love to hear from any of you folks who used this model or read about it, or comments specifically about the benchmarks.
Ryan Carson
Ryan Carson 22:10
Well, real quick, I just wanna say I love seeing the competition
22:15
because what we want to happen here is a really good open-source model that can rival an Opus or a Codex, for people to run OpenClaw locally, right? Because I'll say folks are starting to run some pretty heavy workflows with AntFarm, and it can get expensive, right? Yeah. It's on a cron job. So let's go GLM-5. Like, I'm excited to see this.
Alex Volkov
Alex Volkov 22:37
Yeah, we should mention though, locally is a,
22:39
you know, a bit of a misnomer here. Usually when we cover open source, we say, you know, there's this Qwen 3 model, 7 billion parameters, Mistral, whatever; these models are approaching what the big labs used to release, what, a year ago, in size. With 744 billion parameters, you need some chunky hardware to run this locally, right? But I think that, Ryan, you're absolutely right. People want to run multiple agentic processes going while they sleep, a lot of stuff happening kind of behind the scenes and looping, and folks at some point are like, okay, maybe I don't need 4.6's intelligence for every task that I have. Maybe smaller tasks like documentation: why would you need super-deep intelligence for documentation, for example? And maybe I don't wanna send all my credentials to Anthropic. So there are definitely now choices in open source for this. So we're gonna run super, super quick through some of the evals, and then we're gonna chat with Olive from MiniMax about the upgraded, breaking-news MiniMax 2.5. But I do wanna run through some of the evals, because I think they're important here. Humanity's Last Exam, I think, is the biggest one, the biggest jump here: 50% on Humanity's Last Exam with tools, beating Anthropic's 4.5 and Gemini 3 Pro and GPT 5.2. They didn't measure against 5.3; it just came out. Just differences across the board. The hallucination one, I'm very interested to see for other open-source models: getting 34%, the lowest score on the AA hallucination rate, is definitely impressive. Yam, would love to hear from you, anything from you on this model, about the size, about the training scaling up from 23 trillion to 28.5 trillion tokens. What?
Yam Peleg
Yam Peleg 24:25
Well, it's a model that can run general computer
24:29
use at close to being free. Like, that's crazy. You know, we used to talk about Terminal-Bench and performance on coding and so on, but look what's going on on GitHub. The vast majority of usage for these models at this point is probably not necessarily just code, with everyone that is using them. So when you have a model that can run general computer use, that's basically the thing with OpenClaw and so on. Most of what people do with OpenClaw, except for the extreme people on X, is general computer-use day-to-day: check my calendar, remind me, what's going on over there. That's very useful. Everyone understands that's very useful on its own, and you don't need such a powerful model to do that. And I'm not saying that you got something that you can run locally; it depends who you are. Some people might, but for most people, no. But the price, even for not running it locally, man, that's a complete game-changer at this point. And yeah, I definitely did run it for code. It's an absolutely great coding model. It's not just benchmarks; I completely agree that we can talk about the benchmarks, but it's not just the benchmarks. It's just a good model for coding in real life. So, yep, absolutely great release. Seriously.
Alex Volkov
Alex Volkov 25:57
So I would love to introduce Olive from MiniMax.
26:00
Let's see if Olive is with us on the show. Welcome, Olive. How are you doing?
Olive Song
Olive Song 26:05
Hi Alex.
26:05
How are you?
Alex Volkov
Alex Volkov 26:06
Good.
26:07
Nice to see you again. We met, I think, in New York at AI Engineer, right? You gave a talk there?
Olive Song
Olive Song 26:13
Yeah, yeah.
26:14
I think two months or three months ago. Yeah.
Alex Volkov
Alex Volkov 26:15
Yeah,
Olive Song
Olive Song 26:16
three months ago.
26:17
I've been training this model since then, so I'm so excited about sharing the,
Alex Volkov
Alex Volkov 26:20
ah, let's go.
26:21
All right, I'm excited to talk to you, Olive. We have a special thing that we do when breaking news happens during the show, and since the model just launched, we're gonna do breaking news. AI breaking news, coming at you only on ThursdAI. All right, the announcement that we have, and I'm gonna let Olive do the announcement herself, is that a new model has just dropped, and I think you are taking state of the art in open source on multiple benchmarks. Olive, how about you tell us about what just dropped?
Olive Song
Olive Song 26:50
Yeah, definitely.
26:51
So we just launched MiniMax M2.5, right? It's the new version of our model. We have a lot of SOTA scores on the benchmarks, and as you can see on the blog, we're improving very fast.
Alex Volkov
Alex Volkov 27:04
I'm looking at SWE-Bench Verified, and this is
27:06
absolutely the most breaking thing. SWE-Bench Verified is a difficult benchmark; I would love for you to talk about what it takes to train a model that beats it. But 80.2% on SWE-Bench Verified, which is the harder version of SWE-Bench, versus even Opus 4.6... is that Opus 4.6? You guys are very, very close to,
Olive Song
Olive Song 27:23
very close.
27:25
we're still scaling,
Alex Volkov
Alex Volkov 27:26
Yeah.
27:26
I think one of the notable things about MiniMax is that you guys are an all-around multimodal lab, right? It's not only language models. You guys have been great at voice; I think for the longest time on the show I've talked about MiniMax voices being among the top voices that sound the most natural. And then MiniMax the LLM we had not featured much on the show, but I remember at some point you guys started scaling up super quick. So maybe can you start with the scale? Could you talk to us about the scale of this model and the training size, et cetera, and then we're gonna talk about some benchmarks more.
Olive Song
Olive Song 27:57
Sure, sure, definitely.
27:59
So for this version of the model, we achieved higher scores in coding benchmarks, in a lot of benchmarks: searching and general tool use. And our model also excels in several professional, workspace settings; it can use Excel very well, and stuff like that. And then, if we go into how we scaled it, I would say reinforcement learning, and that's the part that I spent most of my time on as well. So what we noticed was that with reinforcement learning, we can actually scale the model's capability to very high performance, even with a smaller model, right? Because you can see that our model is not that large. It is only 10B active parameters, and it's very fast, and so people would doubt its intelligence for a very long time. But what we realized is that a small model like this still has a lot of potential if we train reinforcement learning on it, if we train it on very large amounts of environments and agents, right? But it's not a very easy thing to do, 'cause scaling up the compute, scaling up the algorithms, scaling up the environments, scaling up the agents, are all not very easy. So that's what we spent a lot of time on. On the blog we do talk a little bit about our RL framework, which is called Forge. What we designed was, I would say, very decoupled and disentangled, so that you can actually train every sample with a different agent, a different training strategy, and they don't interfere with each other during training, right? So we have diverse tasks. It's not just, for example, bug fixing; there's, like, from zero to one writing up a repo, or the model working in very complex environments with a lot of tools and different agents. What we achieved during our training, with our design and with our algorithm, is that the samples don't interfere with each other and we could train very smoothly. We did a lot of work on that, and that's how we scaled the performance across many, many agents. 'Cause we could see that it's not only good in, for example, the very mainstream agent products; it's also good in multiple other environments. And that's what we care about, 'cause people want to use the model in multiple products, in multiple environments, doing multiple different tasks. And that's how we achieved it.
Alex Volkov
Alex Volkov 30:22
Yeah.
30:23
This is great. I have a few questions for you about the specific training and size as well. I think the highlight that comes to mind when you guys release a model like this: you get around a 57% win rate, I think you guys called it, on M2.5, at 15 cents per task, versus Opus 4.6, which was almost $3 (two and a half dollars), so over 15x cheaper. And I would love to understand: is this just the size, or what's the architecture that gets to this performance? I'm reading about a hundred tokens per second for multiple things. Could you tell me about the architecture that lets you get this performance from this model?
Olive Song
Olive Song 30:59
Model?
30:59
Yeah, definitely. So our architecture didn't change from our M2 and M2.1, right? We share the same base model, including M2 Heart. So it's still, like, 10B active parameters with 200B parameters in total, so it's a very small model itself. And then a thing we cared a lot about is, just as you said, the speed, right? The efficiency of the model, 'cause that's what people care about in coding scenarios, or in developer scenarios, or, like, co-working scenarios, right? So during our task optimization, during our training, we took that as one of the top priorities. We didn't only care about the scores and stuff, right? What we took in as our signals was the end-to-end time of performing the task, 'cause you can see in 2.5 that the total time decreases from 2.1, right? And significantly, in multiple tasks, including coding and search and stuff. It uses fewer tools, for example; it thinks less, for example. And that matters a lot to users, because that means less money and less time waiting, I would say. So you can spend less time and less money on the task, and then it performs well.
Alex Volkov
Alex Volkov 32:11
Yeah.
Olive Song
Olive Song 32:12
And then we put that as one of the top priorities in
32:15
reinforcement learning training.
Alex Volkov
Alex Volkov 32:17
Olive, one thing that we noticed on the
32:19
show, and I would love to talk to you about it, 'cause you're training these models: you're thinking about the verifier algorithms and how long it thinks for, right? So around a year and a half ago, the whole reasoning paradigm (test-time compute, et cetera) came about, and everybody caught up super quick. And at some point we noticed just, like, bloating of tokens in the thinking process to achieve the same goal. And now it looks like we're going through the process of, hey, we want to get to the same goals while thinking less. Is this something that you guys are working on in training? Is this part of the verifier score, for example, that you guys posted as well?
Olive Song
Olive Song 32:52
Definitely, because we look at the traces, right?
32:54
The model sometimes is not very efficient in its thinking or tool calling. We will share more details in a blog that we will post later on our reinforcement learning training, but we noticed that the model itself wasn't that efficient, and we can still make it more efficient. It's not only the thinking tokens, it's also the tool-calling behavior. For example, the model now knows how to use a spec; then it knows how to plan and achieve the task more efficiently.
Alex Volkov
Alex Volkov 33:23
I got you.
33:23
Let me see, the follow-up question that I had, about the improvement rate. You guys have boasted about this, and this graph is definitely worth shouting out: in the three and a half months since October, you went M1, M2, M2.1, M2.5, and you're improving very, very fast. Can you attribute this to a single factor, or is it a mixture of all? Is this just RL? Is this scaling up compute? Is it some combination of the three? Where would you put the weight on how this rate of improvement is happening? Because you guys are scaling up super quick, and it's incredible to see that you're also doing open source while you're doing this.
Olive Song
Olive Song 33:59
Yes.
33:59
It's definitely a mixture of all, 'cause we all know that data is very, very important, and then training is very, very important; training speed is important, right? And then one thing that people might forget sometimes, but that's super important, is evaluation, or task definition. You can notice that in our development of the model we release several benchmarks each time; that defines how we think of the problem, how we actually evaluate the model's performance, and how we plan to proceed in the future. So it's a mixture of all, right? I would say equal weight to all of it, across all the people on our team. So it is a mixture of all, definitely.
Alex Volkov
Alex Volkov 34:38
I wanna follow up on the commitment to open source.
34:40
Can you talk about this? This is very incredible, because, and we just talked with Lou from the GLM folks as well, I love the camaraderie. I love the fact that their Twitter account went like this when you guys just released yours. I love the fact that the folks in open source are collaborating and talking about their methods, so collectively we get better intelligence. Love that. So shout out to you and the team for open-sourcing as well. A question specifically about the commitment to open source: could you talk about how you think about open source, why open source, and whether or not you're gonna continue releasing these models with open weights?
Olive Song
Olive Song 35:10
So as a researcher myself, I think open source for me is a very good
35:14
way of communicating research with the community and bridging the gap between research (how we imagine things or how we develop things) and how people actually use the model. With open source, there's a very large community where people share their different ideas, share how they think of the future of artificial intelligence, how they think it will power productivity, for example. And then with open source, we can actually take all the advice and put that into our model, and we can iterate very fast and let everyone be able to use the model, reflect back on the model, and improve the model for everyone.
Alex Volkov
Alex Volkov 35:51
I also noticed there's an M2.5 Lightning, right?
35:56
Is that something... I'm looking at the blog and I wanted to ask you about the two models and the differences, 'cause the main model is fast enough. I think you have a very high speed, tokens per second. Would you talk about the Lightning version?
Olive Song
Olive Song 36:07
I think that is part of how we serve it; that's one type of API, I
36:11
think. So on my end, it's just one set of model weights, but I think there are different APIs. Yeah. Comment under MiniMax.io and then ask the questions.
Alex Volkov
Alex Volkov 36:20
Yeah.
36:20
Many folks from different areas work on this model. Any work or findings from when you trained the model that come to mind that you wanna share with us? I know that a while ago there was this aha moment where people saw the model think for the first time, and the researchers were like, okay, cool. You are now scaling; this is not a new model, this is continued training, I'm assuming, and then at some point you guys said, okay, this is good enough for a release. When I chatted with Lou from GLM, they released a new version, like, a five, right? You're releasing 2.5, which is not, like, a three. What in your mind makes a major release versus, like, a dot-five or dot-three update? Because you jumped from 2 to 2.1 to 2.5, not quite yet at 3. What would it take for a model to be 3 in your eyes?
Olive Song
Olive Song 37:05
Okay, so internally, we usually update the first number of the
37:09
version number just when the pre-trained base model is changed. For example, our M3 model that will be released in the future will be a different base model. And then 2.1 and 2.5 are all sharing the same M2 base model. And what makes a major release is when we see a very large increase in the performance of the model and we think that developers are going to benefit from it. And a funny story about this release is that, as we are talking right now, the model is actually still training, and the accuracy is still scaling. But we did decide to cut it now and release a version so that people can use it, 'cause a lot of people have been asking us when we'd release the model. So as long as we see a major improvement in a lot of areas, and we see that developers will be able to benefit from it, we'll release it.
Alex Volkov
Alex Volkov 38:02
That's great.
38:03
I wanna add Nisten here. You had a question about smaller models, and then we're gonna let you go to continue training the model; we appreciate your time with us. Nisten, go ahead.
Nisten Tahiraj
Nisten Tahiraj 38:11
Yeah, so we saw you guys started with M1, which was
38:14
about, like, 450 billion parameters, and you've now switched to a size that a lot of people find much more practical. I just wanted to know, because the open-source community gets very excited about even tinier models: are there any plans to make distillations of this, or release some other smaller ones, now that you guys have a very good one which is widely used by people?
Olive Song
Olive Song 38:42
Okay.
38:42
Very good question. I'm not sure about the answer. What I know now is that we will have experimental models internally, for our experiments, and those are gonna be smaller, but I'm not confident whether we're gonna release them or not. That's all future stuff. Right.
Olive Song
Olive Song 39:00
Okay, yeah.
Alex Volkov
Alex Volkov 39:02
Olive, you mentioned RL has a lot to do with how great the
39:07
performance of the model is compared to its size. I would love to give you the stage to talk more about your efforts, the stuff that you're releasing, what you're seeing. I know you mentioned you're gonna release it in the blog, but any tidbits, any insight? Because RL is improving incredibly fast, like, small models with less and less training data. Would love to hear your current thoughts about the paradigm of reinforcement learning generally, and the advancements and learnings you guys had while training this model.
Olive Song
Olive Song 39:31
Right.
39:32
So for our RL, the open-source community might have seen the CIPO algorithm that we released last year while we were releasing M1. At that time we solved stable training, right? As a reinforcement learning model trains, it may collapse, and what we resolved, from the algorithm side, was making the model training stable in LLM training, right? That's what we solved last year. And then once we did some improvements on that side, what we realized was that what's important in RL is to train fast. So that's what we worked on this year: making the RL framework train very fast. And it has a lot of parts in it; you can see the details in the later blog. But we did a lot of things to speed it up so that we can train a lot of steps, and then it iterates fast and stays stable with the algorithm.
Alex Volkov
Alex Volkov 40:25
So basically, scaling up the rollouts with RL is something
40:28
that you're saying is very important. How important is the speed of inference to training with RL? Is that a factor as well? Like, when you wanna do a lot of rollouts, is the speed of inference very important to this process?
Olive Song
Olive Song 40:40
it is definitely very important.
40:41
And then I think in the agent era, right, inference is not the only part of the rollout; there's, you know, environment interaction and tool-calling speed and things like that. So one great thing about our framework is that it balances it all very, very well: there's inference speed, there's tool calling and stuff like that, and it can balance them so that it trains fast. It trains very fast.
Alex Volkov
Alex Volkov 41:03
Olive, there is one question, final question and
41:05
then we'll let you go, I promise. The final question: I was always wondering. We have a bunch of listeners, and some of them are deeply into RL and training and understand the concepts we're talking about. Many folks who come to the show just wanna understand what the hell it is we're talking about, because the world of AI changes so fast. They just wanna learn, and they want the layman's version. Many are struggling with becoming AI native: what does it mean to become AI native, and how much AI do you use in your daily work? So I gotta wonder, and ask you with as much confidence as you can talk about it: how AI native is a lab like yours? You guys are releasing models; you have music models and a bunch of other models you're working on. In your day-to-day, not you specifically but the team's day-to-day, how AI native are you? Are you using MiniMax 2.1 to build MiniMax 2.5? Tell me what's going on behind the scenes at an AI lab being an AI-native lab.
Olive Song
Olive Song 41:55
Okay.
41:55
Definitely. So, one of our colleagues: his daily job is to collaborate with multiple agents, collect the feedback of those agents as they work, judge their work while treating them as interns, and check their delivered work. He puts new things into their skills so that they get better, and then uses all the agents' powers to do more agent stuff. Very, very AI native. That's my answer. Yeah. I still think that the current models are not perfect at everything, and I'm not talking about only M 2.5, I'm talking about all models in the wild.
Alex Volkov
Alex Volkov 42:31
Hmm.
Olive Song
Olive Song 42:31
There are certain areas that they are still not good at.
42:34
And for those teams, they're still not that AI native, but I'm pretty sure that will change in the future.
Alex Volkov
Alex Volkov 42:41
So tell me about maybe one such thing.
42:43
What do you think the current crop, the current class of models, is not good at yet? What do we need to improve?
Olive Song
Olive Song 42:50
One very easy example: our RL framework that I just
42:54
talked about. The AI-native percentage there is not that high. And you can see in multiple system cards and such, where people test the current models' performance at doing reinforcement learning work or development, that they're still not that good. Yeah, that would be one very simple example.
Alex Volkov
Alex Volkov 43:11
Nice.
43:11
So the model is still not good at replacing the creators of the model yet, but we're getting there. That's what I'm hearing from you.
Olive Song
Olive Song 43:17
Yes.
Alex Volkov
Alex Volkov 43:19
Olive.
43:19
So first of all, folks, if you wanna hear more from Olive, she had a talk at AI Engineer. The other thing I wanna mention: Swyx, who invited you to AI Engineer, just launched a leaderboard with Windsurf. I don't know if they've added MiniMax 2.5 yet, since you just released it, but basically they're focusing on the fast-but-good-enough crop of models. And fast but good enough is great, because this is the kind of stuff you're talking about, right? It's not necessarily frontier-level 4.6, et cetera, but it's still very impressive how close you come to the last iteration, to the Opus of a week ago; you come very, very close, which is very impressive for open source. This model is also smaller than the Kimi one-trillion-parameter crop of models, and that makes it fast, and for many people speed is what's very important. So that benchmark is probably gonna be owned by you guys at some point; looking forward to seeing MiniMax 2.5 over there. Olive, anything else that I forgot to ask you that you wanna mention? Shout out people on the team, feel free, this is your time, and then we're gonna continue with the show.
Olive Song
Olive Song 44:18
Okay.
44:18
shout out to the team of Minimax. And shout out to you all.
Alex Volkov
Alex Volkov 44:21
Awesome.
44:21
Olive, thank you so much for joining us. We appreciate your time. And when you train new models, please come back; we would love to hear what breakthroughs you make in RL as well. Cheers.
Olive Song
Olive Song 44:29
Okay.
Alex Volkov
Alex Volkov 44:31
All right folks.
44:32
How about that? Back-to-back interviews with open-source foundational-model training folks from inside the teams, on ThursdAI. It's as if we had both OpenAI and Anthropic folks on the show, yeah, like last week, but no, we're focusing on open source. Incredible chat with Olive. What did you guys think?
Nisten Tahiraj
Nisten Tahiraj 44:57
we could have probably gotten Crystal
44:59
from the Kimi K2 team as well.
Alex Volkov
Alex Volkov 45:02
Yeah.
Nisten Tahiraj
Nisten Tahiraj 45:02
And then that would've been the whole trinity,
Alex Volkov
Alex Volkov 45:04
the whole trinity of the Chinese open source right now with the
Nisten Tahiraj
Nisten Tahiraj 45:06
Chinese open-source models.
Alex Volkov
Alex Volkov 45:07
Except, except DeepSeek does not do interviews, no matter how much we tried.
Nisten Tahiraj
Nisten Tahiraj 45:12
and Qwen.
45:13
Yeah. But we've had Qwen quite a lot.
Alex Volkov
Alex Volkov 45:15
Yeah.
45:16
Yeah, shout out to Qwen. We're gonna mention Qwen in a bit on the show, folks. MiniMax 2.5: compared to its size, I think the evals are absolutely bonkers. Wolfram, you wanna talk a little bit about what they're posting there? Because I think we absolutely need to talk about this.
Wolfram Ravenwolf
Wolfram Ravenwolf 45:30
Yeah, we got SOTA with GLM-5, open-source SOTA, and
45:33
now there's basically already a new model. Like last week, when Opus 4.6 came out and an hour later we had GPT 5.3 Codex: similar stuff is happening here. It's happening so fast. You get a new model, you think "wow," you are still running the evals, and then the next model comes out. So it's amazing. The price is also very interesting: MiniMax is even cheaper than GLM-5, but exceeded it on SWE-Bench, for instance. So this is very, very interesting. And it's the best timing for these kinds of models, now that OpenClaw has happened and not only coders are interested in these models anymore; with the agent stuff, basically everyone can now make use of them. So this is the best timing, and the cost of intelligence has fallen once more. You get this intelligence level at this price level: that is the acceleration, you can feel it here.
Alex Volkov
Alex Volkov 46:28
I don't see the size in my recap here, but it's 200 something, right?
46:32
235,
Nisten Tahiraj
Nisten Tahiraj 46:35
yeah.
46:35
235B, 10B active,
Alex Volkov
Alex Volkov 46:37
10 billion parameters active, and it gets 80.2% on SWE-
46:42
Bench Verified. Just absolutely bonkers. Ryan, I know you don't use open-source models and you use the frontier ones. What are your thoughts on this? Because this model approaches, like, Mac Studio level; with some quantization you can probably run it on one or two of those, completely on your own home hardware, with just the upfront investment and electricity.
Ryan Carson
Ryan Carson 47:06
Right.
47:06
And I think that's interesting, and that's probably where it'll matter for OpenClaw, where people really do wanna run unlimited inference locally and own the agent. I do think we're in a new world where people are thinking about owning versus renting, and all of us are clearly more attracted to owning. But if you're still using, you know, a SOTA model from OpenAI or Anthropic, you're not really owning anything, are you? So I'm kind of waiting. But meanwhile, like a lot of people, I'm just too damn busy; the model that I use needs to work now. Yeah. So I'm happy to pay for the extra inference.
Alex Volkov
Alex Volkov 47:42
LDJ, welcome to the show.
47:43
You've listened to the interview, and you've seen this model release. Would love to hear your takes on how fast open source is catching up to the frontier, because we're literally a step behind. If 4.6 hadn't been released last week, these models would beat some of the 4.5 results, like, significantly.
LDJ
LDJ 48:02
Yeah, it's definitely going really fast right now.
48:04
I think they're doing amazing things, and not to put it down at all, but I do recall, at least for Z.AI, some interviews where they talked about how quickly they put models out after training runs finish, which I think is usually a pretty good thing for open source. But I think that might explain at least part of the closeness of their abilities compared to the frontier models, because, as we know, a lot of the time the Western labs sit on models a lot longer after training before they actually release them, due to safety training, which they do a lot more of. So when it comes to the customer perspective, the people actually using the models like us, what really matters is what's actually publicly available at a given time, what we can use. And right now, at least on a cost-per-intelligence basis, it's definitely competitive for a lot of use cases. We already saw Kimi K2.5 topping the leaderboards as the most-used OpenClaw model, if you go to OpenRouter and look at the stats. And it's going to be really interesting how that dynamic continues and which things end up using which models.
Alex Volkov
Alex Volkov 49:13
Yep.
49:14
I wanna highlight this graph that they added, specifically about the timing of the releases. This is a SWE-Bench Verified score graph across Anthropic's Opus, OpenAI, MiniMax, and Google's Gemini. MiniMax 2.5 beats Gemini 3 Pro on these fairly hard agentic coding tasks. Gemini 3 Pro! We celebrated Gemini 3 Pro not that long ago; there may be breaking news about that type of model, we'll see. But you can see an open-weights model beating the current state-of-the-art biggest model from Google. Obviously benchmarks aren't everything, they don't tell you the full picture, and there's benchmaxing, but the speed of improvement with these folks is just absolutely incredible. LDJ, I think this speaks to what you were saying: the Western foundational labs don't release models as fast, they sit on releases for a while, but it does look like there's an advantage for the open-source community folks in just shipping the models. Yeah, for sure. Nisten, you wanna talk about some criticisms and drawbacks of the models?
Nisten Tahiraj
Nisten Tahiraj 50:21
Yeah.
50:22
So, from what I've heard from the community and stuff... and I'm on week three of not using Claude Code or any agent. Nice.
Alex Volkov
Alex Volkov 50:28
Are you feeling withdrawal, or are you feeling better?
50:30
You look better.
Ryan Carson
Ryan Carson 50:30
are you
Nisten Tahiraj
Nisten Tahiraj 50:31
crying in
Ryan Carson
Ryan Carson 50:31
the corner or what
Alex Volkov
Alex Volkov 50:32
No, it looks healthier.
Nisten Tahiraj
Nisten Tahiraj 50:34
What
50:36
I find from people is that they do accuse MiniMax of a bit of benchmaxing. What that means is: because SWE-Bench is mostly Python, I think it's like 60% Python, you can train the model to be very agentic in general and then create a lot more similar problems, problems around the same types the benchmark likes to check for, and it can do pretty well. It's not cheating, because it is still a very hard agentic task and the model is genuinely good at those types of questions; normally when you train, you try to generate a lot of variety around those subjects, and you can get pretty good results from that. But at the end of the day, when I test these, the way I test them now is I just give them a very hard problem. I have a long distributed script for data stuff, and there are usually pretty hard, small issues in it that take a lot of nuance to understand. And I do still find Opus 4.6 to be the smartest at that. When there are really difficult problems where you need to understand the OS and the scheduling, and why is this slow and what's going on, you still want to use the best model. On the other hand, what people also need to realize is that we're starting to use multiple agents simultaneously. And now these are the praises for this model, well, for the 2.1 version: people found it very good at being able to call other agents and handle the work. It is smart enough to handle that type of work and offload it to other agents when you need to. And then you can also own your data this way; you can choose what to share with the big labs and what not to. I think that's a very important step, which we're moving towards slowly, because it is gonna be multi-model, multi-agent stuff; that's what people are gonna do next. And this one is quite excellent for that, for the size and the speed, because you can realistically even run it on your own. You can buy something for $8,000, like an M3 Ultra, and I think it does very good speeds, like over 80 tokens per second. Yeah. So this is one that can actually do agentic work on a small, low-power, but expensive, device.
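Nisten's M3 Ultra numbers check out on a napkin. A rough sketch, assuming roughly 235B total / 10B active parameters and about 800 GB/s of memory bandwidth for an M3 Ultra (Apple's published figure is around 819 GB/s; real decode speed lands well below the theoretical ceiling):

```python
# Back-of-envelope only, not a benchmark.
TOTAL_PARAMS = 235e9      # assumed total parameter count
ACTIVE_PARAMS = 10e9      # parameters touched per decoded token (MoE)
BANDWIDTH_GBS = 800       # rough M3 Ultra memory bandwidth

for name, bits in [("FP8", 8), ("Q4", 4)]:
    weights_gb = TOTAL_PARAMS * bits / 8 / 1e9   # memory to hold all weights
    active_gb = ACTIVE_PARAMS * bits / 8 / 1e9   # bytes read per token
    ceiling_tps = BANDWIDTH_GBS / active_gb      # bandwidth-bound upper bound
    print(f"{name}: ~{weights_gb:.0f} GB of weights, ~{ceiling_tps:.0f} tok/s ceiling")

# Q4 yields ~118 GB of weights and a ~160 tok/s ceiling, so a 512 GB
# Mac Studio fits the model, and the observed ~80 tok/s is about half the
# theoretical bound, a typical real-world fraction.
```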
Alex Volkov
Alex Volkov 53:15
Yeah,
Nisten Tahiraj
Nisten Tahiraj 53:15
so there's
Alex Volkov
Alex Volkov 53:15
still expensive, but definitely owning your own stack.
53:18
Or, for companies like us that can host this, we can now offer, you know, not quite Opus 4.6, but definitely Opus-4-beating levels of intelligence at very cheap inference prices and fast inference speeds, which I think is very important to many folks as well. Folks, we've talked about open source a lot today; this week is definitely the open-source highlight of the world of LLMs. I do want to switch topics a little bit, but before that, because we're more than an hour into the show, we're gonna do this week's buzz super quick, and then we're gonna move on and talk about big companies and exciting things and star bases on the moon and whatever. Be right back after this week's
Lou
Lou 54:04
Buzz!
Alex Volkov
Alex Volkov 54:20
All right folks, for this week's buzz, a corner where I
54:22
talk about everything that happens in the world of Weights & Biases. I want to tell you that the new model, GLM-5, that just launched from Z.AI, the 744-billion-parameter model that you cannot run on your computer unless you have seven H200s or whatever, is now live on the W&B Inference service. W&B Inference is a service powered by, you know, the best cloud around; it serves companies as big as Microsoft and OpenAI, et cetera, so you can have some of that as well. If you go to wandb.ai/inference, you can see that GLM-5 is already there. We're working on adding Kimi K2.5 and the newly released MiniMax 2.5 as well. With that said, if you want a little bit of inference: on our main X account, x.com/wandb, which is very easy to remember by the way, we post about this model. You go there and reply with a charging pony, and we'll give you a little bit of credits. Maybe not a little bit, maybe a lot. These models are fairly cheap, so you can get a lot for your inference credits. Try out these models, try out our inference service. You can plug this into Claude Code, you can plug this into OpenClaw, you can plug this pretty much everywhere. So if you're running multiple-agent scenarios and they're running overnight, and you don't wanna spend that much money at the different foundational labs, definitely check out our inference service at wandb.ai/inference. This has been this week's buzz. And now I want to talk to you guys about some big, big company drama, specifically xAI, because it's already Thursday, February 12th, and Grok 4.2, or 4.20 or whatever, is nowhere to be found. We've been waiting for that model for a bit; apparently it's a super, super coding agent. And then suddenly, like flies, multiple folks, including co-founders of xAI, decided to post on X saying, hey, it's been a great time, thank you Elon, blah blah blah, but I'm no longer there. And then xAI released an all-hands with Elon and a bunch of heads of the new company, which they've restructured into multiple buckets. This restructuring happened after the crazy scale-up that they had, obviously, and the acquisition by SpaceX. So it's kind of like Elon Musk goes to Elon Musk and approves the acquisition, yeah. So now xAI is part of SpaceX, and a bunch of folks are no longer there. But the details that they shared in that all-hands are just fucking mind-blowing. They have the biggest training cluster in the world in Memphis, 300,000 GPUs; it's just insane, insane numbers. They're talking about putting GPUs in space. What are you guys hearing? What are your thoughts on this? What the fuck is happening with xAI is basically my question to the panel.
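If you want to try what Alex just described, W&B Inference speaks the OpenAI-compatible chat completions protocol, so any standard client works with a base-URL swap. A minimal sketch; the base URL follows W&B's docs at the time of writing, but the GLM-5 model id here is an assumption, so copy the exact string from the wandb.ai/inference catalog:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inference.wandb.ai/v1",  # W&B Inference endpoint
    api_key="<your-wandb-api-key>",
)

resp = client.chat.completions.create(
    model="zai-org/GLM-5",  # hypothetical id; check the catalog for the real one
    messages=[{"role": "user", "content": "Summarize this week's open-source AI releases."}],
)
print(resp.choices[0].message.content)
```

The same base-URL swap is how you would point Claude Code, OpenClaw, or any other OpenAI-compatible agent harness at the service.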
LDJ
LDJ 57:09
So it was confirmed that a lot of management restructuring and
57:13
management changes ended up happening with the SpaceX acquisition of xAI. It seems like that might have played a big role in the decisions of some of the co-founders and team members to leave. But when it comes to the more technical and roadmap aspects of what they announced in the all-hands, what really stuck out to me as the most interesting thing, I don't know if you guys saw, is that they were talking about having coding models, and this research direction of just having the models directly develop and generate binary for the computers and the chips themselves, as opposed to having all these layers of abstraction, whether that's Python or C or the kernel level. And yeah, I don't know how realistic that will be in the very short term, but it does seem like a natural endpoint over a really long time horizon.
Alex Volkov
Alex Volkov 58:06
Yep.
58:08
Ryan, any comments on xAI and the team structures, and whether or not you're using any of the models?
Ryan Carson
Ryan Carson 58:14
I still don't use Grok for any production stuff.
58:18
But interestingly enough, X did just release pay-per-use on their API.
Alex Volkov
Alex Volkov 58:22
Yes.
Ryan Carson
Ryan Carson 58:22
And I know, and it's interesting because I'm seeing a lot of
58:25
people, including myself, who are happy to do pay-per-use now on the X API, 'cause there are a lot of interesting things to do. But now you have X, SpaceX, xAI; you know, it's all the same thing. And then Elon goes on the Cheeky Pint and talks about how he's launching data centers in space, and all of it makes sense. You're like, I get it: go where it's always sunny, go where you don't need electricity infrastructure to power the GPUs, go where it's really cold so you don't have to worry about heat as much. These things all make sense. But then, none of us... I don't know anyone who's using Grok for prod. Okay, I know one guy.
Alex Volkov
Alex Volkov 59:08
So here's my take, and some folks in the comments wanted real-
59:10
life advice on what models to use. I use Grok for research, specifically X research. You mentioned the X API; the X API is now pay-per-use or whatever, and it's still really expensive to do a lot of research through it. But if you go through Grok, and I've mentioned this on the show multiple times, Grok itself has API access to X. It's confusing, I know, but X and xAI are still two different companies, and the X API and the xAI API are different things. The xAI API is OpenAI-style LLM completions, right? With tool use. That tool use has access to X, better, faster, and more performant than you, and can perform research on X like nobody's business. Most of the show that I bring to you guys every week, the detailed notes that I send behind the scenes, all of that is researched on X via that API. I've been using this for a while; it's an incredible unlock to know everything that's going on. It can find the smallest comments from the researchers at different labs, very cheap, not something you'd be able to do with the X API. So I do use it. I don't use it for intelligence necessarily, but it is intelligent research, right? So it does work in those cases; just do not use it for code. And I think the reason they didn't launch Grok 4.20 is because it's nowhere near the state-of-the-art foundational models. They promised to bring it, but they didn't. They've now restructured the company into four buckets. One of them is LLM and voice, another one is coding, so they have a specific part of the company now for coding, and then they have another one called Macrohard, which is the data centers, which is really funny. I think the space stuff is, you know, just propping things up before the SpaceX IPO, I don't know, but that's how it seems to me.
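A sketch of the research pattern Alex describes, assuming xAI's OpenAI-style chat completions endpoint and its Live Search options. The `search_parameters` field names follow xAI's docs as best I recall, and the model id is a placeholder; verify both against the current API reference before relying on them:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.x.ai/v1", api_key="<xai-api-key>")

resp = client.chat.completions.create(
    model="grok-4",  # placeholder; use whichever Grok model your key can access
    messages=[{
        "role": "user",
        "content": "What did AI-lab researchers post on X today about new model releases?",
    }],
    # Live Search config rides along as an extra body field; limiting the
    # sources to X gives the "research on X like nobody's business" behavior.
    extra_body={"search_parameters": {"mode": "auto", "sources": [{"type": "x"}]}},
)
print(resp.choices[0].message.content)
```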
Ryan Carson
Ryan Carson 1:00:49
And I'll say: did anybody notice that
1:00:51
SpaceX changed their mission? So now it's not Mars, it's the moon. Yes. Which is kind of cool. It's like, okay, we'll see. Faster iteration, hopefully. So it's a fun time to be alive. I think the thing that I wanna talk about, the meta around all of this, is related to Matt Schumer's post. Oh yeah. Around, you know, what the acceleration feels like, because it's just a very strange time.
Alex Volkov
Alex Volkov 1:01:17
Can we shout out Matt Schumer super quick?
1:01:19
Can you tell people which post? Because there are, believe it or not, people who haven't seen it.
Ryan Carson
Ryan Carson 1:01:25
So I started seeing this in the feed.
1:01:27
I'm like, oh yeah, Matt, cool. You know, we all know him; the AI world is still pretty small on X and you kind of know everybody. And I saw it and I'm like, holy shit, it got a bunch of views. And then the next time I looked at it, it was at 20 million, and now it's at 74 million, which is probably the largest article of all time
Alex Volkov
Alex Volkov 1:01:43
probably.
Ryan Carson
Ryan Carson 1:01:43
but then as soon as this happened, I think this all happened:
1:01:46
we got our normie friends texting us about this article. So I had this conversation with an attorney friend of mine, and he might be listening; I'm not gonna say your name 'cause you're a buddy. He's super smart, but he represents people outside of our sphere. He's like, hey, I went into ChatGPT and I told it to make me an app, and it didn't work. And it was interesting to hear the disconnect, because, okay, well, the reality is you should have used Codex; you can't tell ChatGPT to make you an app. I mean, you can, but... So there's this weird disconnect we're all feeling. And on top of that, you kind of have to pick which horse you're gonna ride. I can't always stay with one foot in the Anthropic camp, one in Gemini, and another hand over here on OpenAI. At some point you gotta pick your stack.
Alex Volkov
Alex Volkov 1:02:39
Yep.
Ryan Carson
Ryan Carson 1:02:40
And then orchestrate that stack.
1:02:42
And literally as of today, I might be moving to Claude Code, because it's got agent teams, you know, native infrastructure now. But then it's like, all right, I gotta build all that orchestration, and it's kind of exhausting.
Alex Volkov
Alex Volkov 1:02:58
It is exhausting.
1:02:59
So I wanna talk about this. Shout out to Matt, by the way: an incredible ability to communicate what everybody who listens to ThursdAI probably already knows. The speed with which we're moving towards something incredible is scary, and for many people the speed itself is scary. I wanna read aloud a small piece of the article, which is very long: "In 2022, AI couldn't do basic arithmetic reliably. It would confidently tell you that seven times eight is 54. By 2023, it could pass the bar exam. By 2024, it could write working software and explain graduate-level science. By 2025, some of the best engineers in the world said they had handed over most of their coding work to AI. And on February 5th, 2026, new models arrived that made everything before them feel like a different era." February 5th was a week ago. That was the show where we talked to you about Opus 4.6 and then GPT 5.3 Codex, and both of these releases absolutely shatter how we think about software. We've been AI native for a while, and we've been telling you to be AI native for a while, right? Paired with AI assistants like OpenClaw and different agentic things, people who had no business writing or creating software, because it wasn't possible for them, right now absolutely can do so. Ryan's point is absolutely correct: many people go to ChatGPT because that's what they know; they type a prompt and nothing happens for them, and they go, oh, okay. But then we're seeing plumbers operating their business with OpenClaw now, and we're seeing agentic software-making that has zero bugs, in one shot, after like two minutes, and you get a full, incredible thing. Plus, the guy who left xAI, Jimmy Ba, one of the co-founders, and I think, you guys correct me if I'm wrong, a co-author of Adam and one of the most highly cited people at xAI, said recursive self-improvement is coming this year. Recursive self-improvement is when the models use the models to build the models. We told you about this with Codex 5.3, and we told you about this with Claude Code, where everybody at Anthropic probably has a way faster Opus, and a way cheaper, free Opus, 'cause they work there, and probably significantly longer context; they all use this software to build the software. We're looking at this loop that's coming, and many normies, many people who don't follow us, folks who are not listening to the show (you can send them our show, you can send them Matt's article), have no idea what's coming. And because they have no idea what's coming, it's easy for them to brush it off: eh, you know, okay, fine, AI still hallucinates, blah blah blah. Coding is the way to generalize everything. This is what we've been feeling since forever, but definitely since December and since the last February models; this clicked for many of us. Coding is the way towards generality for many of these agents: if it can code, it can do stuff in your browser with code, et cetera. Some thoughts there on Matt's article. Ryan, go ahead.
Ryan Carson
Ryan Carson 1:05:58
So I put a link in the show notes, if you
1:06:00
could pull it up really quick. Yeah. I think this article is important, and everybody watching or listening to this should read it. This was a post called "Harness Engineering: Leveraging Codex in an Agent-First World," written by Ryan Lopopolo. It's a case study about how people are actually using an agentic engineering setup end to end in the real world, right? The TLDR is that it's really hard. Some of this is still hype; like, "yeah, I set up OpenClaw and I have a company that runs itself." No, you don't. That's not happening in the real world for real people.
Alex Volkov
Alex Volkov 1:06:35
Yeah.
Ryan Carson
Ryan Carson 1:06:35
But people are beginning to actually go end to end with
1:06:40
zero humans involved in the writing, reading, reviewing, or shipping of code. It's starting to happen, and this article is a good summary of how that actually works and how hard it is to get it working. This will become the playbook for every company that's writing code in the next 12 months.
Alex Volkov
Alex Volkov 1:07:01
Yep.
1:07:02
All right, time to switch gears. Definitely read Matt's article; I'll add it to the show notes. Send it to your friends who don't feel the acceleration like you do, and then send them a link to ThursdAI. Because one of the things that I loved the most, which Matt talked about at the end, was: what can we do until the singularity comes? Knowing what the tools are and using them gives you a competitive advantage. This is why we're here on the show every week, folks. We're coming up on three years; March 14th, 2023 is when we started ThursdAI, so we're coming up on three years in two shows or something, which is gonna be crazy. And the amount of acceleration we've seen since then is just fucking mind-blowing to me. Let's talk about the breaking news that we have. Let's go: AI breaking news, coming at you only on ThursdAI.
Yam Peleg
Yam Peleg 1:07:55
All right. So, Google drops Gemini 3 Deep Think, a
1:07:59
significant upgrade to Deep Think. Basically state of the art on ARC-AGI 2, to the best of my knowledge, and 48.4% on Humanity's Last Exam. For real this time; I'm joking, I'm joking. 48.4%, without any tools.
Alex Volkov
Alex Volkov 1:08:24
Oh, wow.
1:08:26
With no tools. This is crazy.
Yam Peleg
Yam Peleg 1:08:28
I wanna say there are a couple of different classes of models.
1:08:34
The way I see it, Gemini 3 Deep Think, as well as GPT 5.2 Pro, which I think is the one that we get access to, are in a class of their own. They're pretty niche; they take a lot of time when they do their stuff. Also Grok, the super heavy one, I think it's called Grok Heavy.
Alex Volkov
Alex Volkov 1:08:57
Yeah, grok Heavy.
Yam Peleg
Yam Peleg 1:08:58
Yeah.
1:08:59
Those take quite a little while when you ask them stuff. They are for extremely hard tasks that are not intuitive. You can basically send and forget, although, look, you can steer Pro while it works. But generally this is a very special type of model. I use them all the time, both Deep Think and Pro, and they are extremely, extremely useful. And it's amazing to see the frontier being pushed again. You know, every week we get several state-of-the-art models on something; at this point, every week we get multiple models, some of them open source, some of them closed source, and everything is just accelerating.
Alex Volkov
Alex Volkov 1:09:42
yeah.
1:09:42
I wanna pause you for one sec, because ARC-AGI 2 at 84 is insane. The highest score before this was 68, from Opus 4.6, which you know is an incredible model. The new GPT is still not tested on this, but 5.2 is at 52. Let me just put this up on the screen, folks. The jump in ARC-AGI: what the fuck just happened? Woo.
Yam Peleg
Yam Peleg 1:10:01
can I show you something?
Alex Volkov
Alex Volkov 1:10:02
Yeah,
Yam Peleg
Yam Peleg 1:10:02
let me just show you this.
Alex Volkov
Alex Volkov 1:10:03
so now this, this is the new one. I'm getting tired.
1:10:06
Is this the "you are here" chart? Basically, yeah. You are
Ryan Carson
Ryan Carson 1:10:10
here?
Alex Volkov
Alex Volkov 1:10:10
Yes.
Ryan Carson
Ryan Carson 1:10:12
We are tired.
LDJ
LDJ 1:10:13
So it was just recently. I don't know, when was it?
1:10:17
I don't know if it was three days ago or three weeks ago that Opus 4.6 dropped.
Alex Volkov
Alex Volkov 1:10:22
A week,
LDJ
LDJ 1:10:22
One week ago.
Alex Volkov
Alex Volkov 1:10:23
It was one week.
LDJ
LDJ 1:10:23
It was one week ago.
LDJ
LDJ 1:10:24
So one week ago, Opus 4.6 dropped and that was already a significant jump
1:10:30
in the state of the art for ARC-AGI 2. And hopefully they listed it on this graph. Yes, it's the red ones in the middle.
Alex Volkov
Alex Volkov 1:10:37
Yeah.
LDJ
LDJ 1:10:38
but now Gemini 3 Deep Think is another significant jump
1:10:41
even beyond that, just a week later. Yeah, this is huge and exciting progress. And maybe ARC-AGI 3 is gonna be saturated by the time it comes out.
Yam Peleg
Yam Peleg 1:10:51
and we need to have Humanity's
1:10:54
Last Exam for real. Humanity's Last Exam.
Nisten Tahiraj
Nisten Tahiraj 1:10:57
gotta
1:10:57
copy.
Nisten Tahiraj
Nisten Tahiraj 1:10:58
Stop letting Alex Wang
1:10:59
think.
Alex Volkov
Alex Volkov 1:10:59
At some point, the billions of dollars that Zuck is spending
1:11:03
on that lab needs to do something. I think they're preparing; I think it's called Avocado or something, eh? Yes. But yeah, we're waiting for this. And while we're waiting, in the middle of breaking news... all right folks, this one is for real. Ryan, you go. This is you, buddy.
Ryan Carson
Ryan Carson 1:11:19
Y'all, I just happened to be browsing X as we're doing the show.
1:11:22
And what do you know, we've got GPT 5.3 Codex Spark. What? Like two minutes ago. So I'll read a little bit from that.
Yam Peleg
Yam Peleg 1:11:31
oh,
Ryan Carson
Ryan Carson 1:11:31
Let's go, a release!
1:11:32
Yeah. It says: "Today we're releasing a research preview of GPT 5.3 Codex Spark, a smaller version of GPT 5.3 Codex, and our first model designed for real-time coding. Codex Spark marks the first milestone in our partnership with Cerebras." So get ready for fast. Oh, shit. Here we go.
Alex Volkov
Alex Volkov 1:11:53
We talked to you about this man.
Ryan Carson
Ryan Carson 1:11:54
Let's go.
1:11:55
It says it's a research preview for ChatGPT Pro users. Have fun, y'all, and let us know what you think.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:12:01
I have to renew my subscription.
Alex Volkov
Alex Volkov 1:12:03
But no, we're getting the Cerebras thing, and apparently this one
1:12:08
is supposed to be live for Pro users in the Codex app, CLI, and IDE extension.
Ryan Carson
Ryan Carson 1:12:13
I think so.
1:12:13
Trying.
Alex Volkov
Alex Volkov 1:12:14
Well, I have an update on the Codex app, so
1:12:16
I'm hitting the update button; I wanna see. Cerebras is a wafer-scale chip company; we've talked about Cerebras multiple times. They've connected with OpenAI to deliver this intelligence at incredible speed. Yeah, I don't have it yet. I only have 5.3 Codex.
Ryan Carson
Ryan Carson 1:12:31
Yeah, I've just got 5.3 Codex.
Alex Volkov
Alex Volkov 1:12:33
Come on guys.
1:12:34
I want it, but we don't have it, so we're gonna continue with the show. Meanwhile, we're gonna try behind the scenes to see if we get the update, and then we're gonna test it, because the insane speed of Cerebras is definitely something. Meanwhile, speaking of insane speed, folks asked us for our opinions: do you guys remember Voxtral, when we talked about Mistral's transcription service? I switched OpenClaw to Voxtral, and the responses... I can show this off. I think this is the demo that I can show; hopefully my little bot will not expose some personal detail. I think
Nisten Tahiraj
Nisten Tahiraj 1:13:03
they also run on Cerebras, by the way,
Alex Volkov
Alex Volkov 1:13:05
Hey, bot, you're live on ThursdAI and folks are looking
1:13:08
at how fast you can transcribe things, so I'm gonna talk really fast. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases; with me are Wolfram Ravenwolf, Ryan Carson, Yam Peleg, LDJ, and Nisten, and we're live on ThursdAI. With all that said, I want you to transcribe everything that I say and put it in the chat. So my bot just saw it. Okay, and it's already started typing, so the transcription I think went through, and now it's just Opus 4.6 doing its work. Do you guys see this? It's already responded; it transcribed and responded. This was, like, insane. "You said you're live on ThursdAI"... it mangled the names a bit, sorry LDJ. "You asked it to transcribe everything you say." Yes. When you talk to your bots, Voxtral is absolutely amazing. Shout out; we told you about Voxtral, we tested it out, but now it's actually built in, and I did not do anything to build it in. All right, it's time to continue with the show. What else do we have? We just mentioned Deep Think... guys, I'm still hung up on this, on both of the releases. In the first part of the show we had two incredible open-source models. Now we're having a spike in intelligence with 84 on ARC-AGI (Yam, your glasses are absolutely correct right now), and then OpenAI releases a version of 5.3 Codex that's stupid fast. And all I can think about is: stop talking, Alex, go try to find out whether or not you have this model. I don't have it yet.
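For anyone who wants to wire up the same trick, Mistral exposes Voxtral through a transcription endpoint. A minimal sketch, assuming the `voxtral-mini-latest` model id and the `audio.transcriptions` call from Mistral's Python SDK docs; double-check the exact field names before relying on it:

```python
from mistralai import Mistral

client = Mistral(api_key="<mistral-api-key>")

# Transcribe a local clip, then hand the text to your agent as a normal prompt.
with open("clip.mp3", "rb") as f:
    result = client.audio.transcriptions.complete(
        model="voxtral-mini-latest",
        file={"content": f, "file_name": "clip.mp3"},
    )
print(result.text)
```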
Yam Peleg
Yam Peleg 1:14:26
Meter.
1:14:26
Meter still?
Alex Volkov
Alex Volkov 1:14:28
No.
Nisten Tahiraj
Nisten Tahiraj 1:14:28
Isn't it
1:14:30
crazy? I don't think people outside our bubble realize just how nuts it is that we get much higher intelligence and then we get much higher speed, and we want more. We can use more; we can just use a lot more.
Yam Peleg
Yam Peleg 1:14:41
bro.
1:14:42
I can definitely use more
Wolfram Ravenwolf
Wolfram Ravenwolf 1:14:43
cheaper.
Yam Peleg
Yam Peleg 1:14:43
definitely use more.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:14:45
I think maybe there is more to run it all the time.
Nisten Tahiraj
Nisten Tahiraj 1:14:47
Demand for this.
1:14:49
Even if you do what I have to do for work with open-source models: I need more GPUs. I have to use every single skill from hacking and from Linux and from networking just to get the hardware and then run it, and it's just not enough. You could use two orders of magnitude more.
Alex Volkov
Alex Volkov 1:15:09
Let's talk about this for a little bit, just super quick, because
1:15:13
I got very viral with one post of mine that I didn't expect to go viral, which talks about folks not being able to sleep. There's also an article from Steve Yegge of Gas Town; we haven't covered Gas Town yet, but there's a whole crop of insanity happening over there, a whole crop of AI, like, psychosis cases. We did talk to you about AI psychosis, but now what happens is that many folks, including myself (I can report from my own personal experience, and I meditate every day, folks, and still this happens to me), find it hard to go to sleep, because I don't know if my agents are being used to their maximum potential. I usually have one or two agents running, doing things overnight in a loop; I have a task board, whatever; I have an AI employee basically doing work overnight for me. It's hard for me to go to sleep because I'm like: if I don't tell this thing exactly what I need and how it should look in the morning, it's not gonna work, it's gonna fail waiting on me, right? Is this your experience as well? Is it hard to fall asleep? What's going on?
Yam Peleg
Yam Peleg 1:16:14
agents and cost is growing hard on me.
Alex Volkov
Alex Volkov 1:16:17
Yeah.
Yam Peleg
Yam Peleg 1:16:17
Really, really growing on me.
1:16:19
Absolutely. Yeah. I think everyone in the field has been working as hard as they possibly can over the past couple of weeks.
Alex Volkov
Alex Volkov 1:16:25
Isn't the whole point of this to work
1:16:26
less? Everyone, tell me: what the fuck are we doing? Isn't the whole point of this so that I can chill on my couch, smoke hookah (by the way, I stopped smoking hookah, I'm not on the hookah anymore), and just do something else while the agents work? Isn't this the whole point?
LDJ
LDJ 1:16:39
I think the point of it is to enable the choice and unlock those
1:16:43
new possibilities of, hey, now you could get the same amount of work done while working less if you wanted to, but there's also this other option of you can still continue working the same amount of hours per day, but now get way more insanely more amount of things done in the day.
Yam Peleg
Yam Peleg 1:16:59
I think that everyone in the field, because it's not
1:17:02
consolidated and there is no one choice and everyone is still exploring... I think at this point everyone is optimizing their own framework, and everyone is trying to get their own framework off the ground as hard as possible, to get as much automation as possible. It's not necessarily that you use all of it; you use it for whatever you want. But a lot of the effort is still in the looping and in making the thing better, so you can use it for other things. And I think this is what most people are doing at the moment, because everyone sees the potential. Exactly.
Alex Volkov
Alex Volkov 1:17:42
Ryan, Tell me.
Ryan Carson
Ryan Carson 1:17:43
So yeah, we're all feeling like
Alex Volkov
Alex Volkov 1:17:44
how's your sleep?
Ryan Carson
Ryan Carson 1:17:46
Not good.
1:17:46
Honestly, seriously, not good. I typically wake up at 2:00 AM now, and it's not good. So we're all dealing with it. I think what Yam said is very true: the primitives don't exist yet to manage teams of agents. You know, we have kludgy auth, we have environment rot, we have orchestration problems. It's like we have a tool, but there's no toolbox. I think that's why everyone's stressed out, because you could run agents 24/7, 365, but it's still kind of hard. That's why I built Ant Farm, and it's open source and free, it's not something I'm trying to make money on, but the truth is, it doesn't solve it either. There are still all these problems with orchestration. And this is what FOMO is; this is why FOMO is one of the most powerful human emotions. Everyone is afraid of missing out: oh, someone else is running agents 24/7 and I'm not, therefore I'm missing out, right? And so you can't sit on the couch and smoke a hookah, and that's a very human thing. Well, it'll be interesting to see where it goes. This is why I was trying to point out that we should all try to calm ourselves down and realize: okay, no one's running agents 24/7 and actually doing productive work. They may be running small teams of agents to build real apps, but we're just not there yet. But it doesn't help; I'm sure I'm gonna wake up at 2:00 AM tomorrow.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:19:18
It's all of the acceleration.
1:19:19
We always talk about accelerate, accelerate, accelerate. But now, every moment an agent is not running, you think you are losing time; you know you are wasting time, because it could be doing something for you. Yeah. And personally, I also think about what my agent is doing while I'm sleeping, and whether it's maybe doing something it shouldn't, especially with something like OpenClaw, which has so many access surfaces and so much information. I also think: should I shut it down when I'm not giving it a task, to make sure it's not doing anything it shouldn't be doing, or getting any input it shouldn't? If it's doing research to get me my morning brief, for instance, it could get prompt-injected by doing a web search. That is one reason why you should use one of the best models and not try to save money by using a cheaper model for something like that. It's just web search, but it is an attack surface as well. So sometimes I think about shutting it down and turning it on only when I can see what it's doing.
Alex Volkov
Alex Volkov 1:20:13
All right folks.
1:20:14
I think we all feel this, and I think some of the listeners of the show feel this as well. I think that prioritizing sleep is very, very important. To me, this whole endeavor is about still working, but working less; that is the whole point. But, you know, we shall see. I wanna move on, because there are a few things that we haven't covered yet, and the highlight for me this week was something else entirely, something that also brings up, like, what the fuck is happening in this world. And I think I just got access, finally. So we're gonna try this live. We're moving on towards,
Nisten Tahiraj
Nisten Tahiraj 1:20:45
I asked to help, solve your sleeping problem on
Alex Volkov
Alex Volkov 1:20:48
No, no, I didn't get access to 5.3.
1:20:50
I got access to something else. Okay: 5.3 the Turbo, Nitro, whatever they call it, Spark. I didn't get access to Spark. But folks, earlier this week our whole feeds (I think we talked to you about this before), our whole feeds were full of a new type of AI-generated video. These videos are generated with Seedance 2. Seedance 2 is a new version of Seedance: 15-second, high-quality, multi-shot output. We had multi-shot before, but there are multiple things that make Seedance absolutely different. It can take nine images as reference, it can take three video clips as reference, it can take three audio clips as reference, plus instructions. So you can direct this model with audio and video, with director-level control. And some of the videos being posted feel like the jump from before Sora to seeing Sora for the first time, and we didn't get that for like nine months. Just an absolute mind-bending level of realism in this app. Also, ByteDance absolutely scraped all of Hollywood, all of Netflix, all of YouTube, all of it; the amount of policy violations in this thing is wild. So I'm gonna show a clip that I did back when I needed to use a VPN. Seedance 2 just launched, and it's available here on the BytePlus platform. This is a video that I generated. I need to reshare this, because you guys need to hear this: multi-channel stereo sound is also included in this model. We barely got to sound this year with Veo 3.1, and now this model probably absolutely beats Veo; it does full sound. This is a very short clip, my first generation here: I asked for Spider-Man doing backflips, and this clip looks exactly like the video game Spider-Man, with music and everything. I wanna show you the actual videos that they have here, and folks who are just listening to the show, this is the place where you should tune in, because I don't know how else to describe this. Yeah, "almost taking off the glasses too early," I think, is how I would describe this. Let me see. Yeah. Okay. Let's take a look at some of these videos, folks. Let's take a look together,
1:23:13
multi-shot, consistent videos with physics, just a mind-blowing quality of physics. You guys... I like
Yam Peleg
Yam Peleg 1:23:20
balance
Alex Volkov
Alex Volkov 1:23:21
and conversation.
1:23:22
Lip sync and talking and music. Jumps, synchronized motion. There's just so much here that's new.
Ryan Carson
Ryan Carson 1:23:35
It's amazing
Alex Volkov
Alex Volkov 1:23:35
that it's hard to describe what is so amazing about this.
1:23:39
The things that they show here, like the audio-visual experiences and the multi-shot and the camera movements. This is a shot from outside the window of a pizza shop, with a dude preparing pizza. Everything is there: the sound of him taking the spatula from the oven, the pizza, the sound of him patting the pizza box.
LDJ
LDJ 1:24:01
really good, but, I'm a bit disappointed that the pizza guy didn't cut
1:24:04
the slices before he put it in the box,
Alex Volkov
Alex Volkov 1:24:06
bro.
1:24:08
Some people want the full pizza, what do you mean? Yeah, I've seen some examples. So here's an example of a video reference, right? They take pictures of some video game characters, and they take a video reference (I actually don't know if it's actual people fighting, or if they generated those as well; it looks like a green-screen reference), and they make a video game. They literally just make a video game. The video reference part, I think, is absolutely wild. Look at this beautiful anime style.
LDJ
LDJ 1:24:38
Yeah, this is insane.
Alex Volkov
Alex Volkov 1:24:45
The thing about video models before this, even good ones
1:24:49
like Veo 3.1, is that the face changes from one frame to another, so multi-shot is very easy to tell. Seedance, apparently (and I saw this only today), has an internal test mode that we can get access to: 45 seconds in one shot.
Yam Peleg
Yam Peleg 1:25:09
can we play with this Alex?
Alex Volkov
Alex Volkov 1:25:10
have
Yam Peleg
Yam Peleg 1:25:10
access?
Alex Volkov
Alex Volkov 1:25:11
Yes.
1:25:11
Alright.
Yam Peleg
Yam Peleg 1:25:11
Alright.
Alex Volkov
Alex Volkov 1:25:12
So on BytePlus, they just uploaded Seedance 2.
1:25:16
But we should mention a few more things from the TLDR. Right folks, I think it's time for us to land this plane, two hours in. Let me just recap super quick. We had an insane week, and it looks insane still, because Thursday was the moment for all the major labs to drop their things. We had both Lou from Z.AI and Olive from MiniMax here to talk about how the open-source labs are doing it with RL and scaling; both of them launched incredible models that you can get access to on W&B Inference today. Then we were sitting here talking about xAI when breaking news dropped from Google: the new Deep Think version of their model is now state of the art on ARC-AGI, with an insane jump to 84% or some crazy number like that, and on Humanity's Last Exam with no tools it's absolutely beating everything else. Meanwhile, OpenAI decided not to be left behind and released GPT 5.3 Codex Spark. It's taking a long time to even say the names of these models, but the Spark model is supposedly a different version of 5.3 that runs on Cerebras with super, super fast inference, and we're scrambling to actually vibe-test this model for you. We know for a fact that speed is very, very important. And then we end the show with Seedance. These weeks, they're not getting smaller; the acceleration is absolutely here. So with this being said: everybody who tuned into ThursdAI (we crossed 2,000 folks!), welcome to the world of acceleration. If you're not new here, you've been with us for a while, and you know it's accelerating. But we're here to test it out, to talk about this, to give you tips on how to use this in your real life, like with Voxtral. And we're here every week, and if you missed any part of the show, the show is getting turned into a newsletter and a podcast. I really try to write the newsletter myself with no AI slop, so it takes a while, but through the newsletter and the podcast you'll be able to get all the links from the show, everything we've talked about, the opinions, and the testing of some stuff. So with that, thank you so much. Thank you, Ryan Carson. Thank you, Yam Peleg. Thank you, LDJ, Wolfram, and Nisten, and Lou and Olive who joined us, and everybody here in the audience who helped bring us breaking news and gave us feedback. I appreciate all of you. Let's go be AI native. Get some sleep, folks; it's really important for your health. The agents aren't gonna go anywhere; if anything, they're gonna get better, so you'll have to learn less as we progress. Everything's gonna be fine. With that, thank you so much, folks. We'll see you here next week. Bye-bye. Go play with the new models.
Ryan Carson
Ryan Carson 1:27:45
See you everybody.
Alex Volkov
Alex Volkov 1:27:45
Bye-bye.