ThursdAI · September 25, 2025

📆 ThursdAI - Qwen‑mas Strikes Again: VL/Omni Blitz + Grok‑4 Fast + Nvidia’s $100B Bet

Moondream 3’s tiny VLM punches up, Wan Animate and Kling 2.5 turbocharge video, and Pulse/GDP Eval push agents into the real world

94 min

What happened in AI the week of September 25, 2025?

Qwen-mas really does strike again here: the show is loaded with Alibaba releases across vision and omni, while Nvidia's OpenAI exposure and the Pulse preview keep the big-company section loud. Vik Korrapati joins to explain why Moondream 3 matters in the tiny-VLM race, and the rest of the episode keeps tying model progress back to real multimodal products.

In This Episode

🔓 Qwen-mas and the Open-Model Barrage
🎨 Moondream 3 with Vik Korrapati
🧪 Robotics, GDP Eval, and the Real-World Agent Push
💰 Nvidia, OpenAI, Pulse, and Grok-4 Fast
🔊 Video Models, Suno, and the Audio Demo Pileup

Hosts & Guests

Alex Volkov

Host · W&B / CoreWeave

@altryne

Vik Korrapati

CTO & Co-founder · Moondream AI

@vikhyatk

Yam Peleg

AI builder & founder

@Yampeleg

Nisten Tahiraj

AI operator & builder

@nisten

LDJ

Nous Research

Ryan Carson

AI educator & founder

@ryancarson

🔓 Qwen-mas and the Open-Model Barrage

The episode opens on a flood of Alibaba activity, and the panel treats that pace itself as news. Qwen releases across vision and omni make the show feel like a direct update on how quickly open multimodal systems are improving.

Alibaba dominates the open-model portion of the show
Vision and omni releases are discussed as workflow tools, not just model cards

🎨 Moondream 3 with Vik Korrapati

Vik Korrapati gives the vision section a sharper engineering lens. The discussion is especially useful because it highlights why small, capable models matter for real products and why the tiny-VLM race is important well beyond benchmark bragging rights.

Moondream 3 is framed as a practical product-building model
Vik adds clarity on why smaller vision systems still matter

🧪 Robotics, GDP Eval, and the Real-World Agent Push

The middle of the episode moves from model capability into deployment pressure. Robotics, evaluation, and real-world agent tasks all come up as evidence that the next phase of competition is not just about chat quality but about action and reliability.

The panel looks for signs that agents are moving closer to real environments
Evaluation remains a recurring concern whenever product claims get ambitious

💰 Nvidia, OpenAI, Pulse, and Grok-4 Fast

The big-company section is driven by scale, money, and distribution. Nvidia's OpenAI exposure, Pulse chatter, and Grok-4 Fast all feed a conversation about who is building the most durable product moat as model access becomes more commoditized.

Money and infrastructure become central to the story here
The panel treats Grok Fast as part of a larger competitive pressure cycle

🔊 Video Models, Suno, and the Audio Demo Pileup

The closing segment runs through video systems, music generation, and live audio demos without feeling scattered. Instead, it reinforces the show's main idea: multimodal product quality is climbing across image, video, music, and voice all at once.

Video, music, and voice launches all land in the same closing arc
The audio demos are treated as product proof points, not just entertainment

TL;DR & Show notes

Hosts and Guests
- Alex Volkov - AI Evangelist, Weights & Biases (@altryne)
- Co Hosts - @yampeleg @nisten @ldjconfirmed @ryancarson
- Guest - Vik Korrapati (@vikhyatk) - Moondream
Open Source AI (LLMs, VLMs, Papers & more)
- DeepSeek V3.1 Terminus: cleaner bilingual output, stronger agents, cheaper long-context (X, HF)
- Meta’s 32B Code World Model (CWM) released for agentic code reasoning (X, HF)
- Alibaba Tongyi Qwen on a release streak again:

Alex Volkov 0:31

Hello and welcome.

0:33

Welcome everyone to ThursdAI for September 25th. My name is Alex Volkov. I'm an AI Evangelist with Weights & Biases from CoreWeave. I'm very excited today to start the show and have my friends here, Ryan Carson, Yam Peleg. We're probably gonna have a few more folks here as well, as we had another very dense week with releases. We're gonna talk about all of this, obviously, but we always start the show with a little banter about how we're doing and what was the one thing that we must absolutely talk about this week. The one thing that stood out this week, the one thing that maybe one of us didn't catch and the other did and got them excited. So would love to welcome to the show Ryan Carson. Welcome back. It's been a minute since you've been here. Would love to hear from you and your doggy. What was the one thing this week that was very, very cool?

Ryan Carson 1:18

Good to see you guys.

1:19

There's stuff happening, but it's like, I don't know how much I could say. So, new models, right? Exciting new models is what I wanna talk about.

Alex Volkov 1:27

All right.

1:28

Saying hello as well to Yam Peleg. Welcome, Yam. What's up, buddy?

Yam Peleg 1:32

Hey, how are you doing?

1:33

Good.

Alex Volkov 1:34

What was your one

1:35

not-to-miss AI update from this week?

Yam Peleg 1:38

Qwen is on fire again.

1:40

Qwen is on fire. They dropped an insane model this week. I don't know if everybody saw this; it's like a huge table of benchmarks where they are state of the art, but the table is so large that it doesn't fit into a screenshot on Twitter.

Alex Volkov 1:57

So definitely, yeah.

1:58

I'm with you. See, we would like to give you diverse positions on the show, but when it's clear that this is Qwen's week, it's clear that this is Qwen's week, folks. Let's hear from our friend Nisten Tahiraj, all the way up in Canada. What's up, Nisten? What's up, buddy?

Nisten Tahiraj 2:13

Hey, my thing of the week was actually

2:17

Meta's Code World Model. Even though I was not impressed at first because it was mostly Python-based, the more I looked at it, it just felt like this is the way it should be.

Alex Volkov 2:29

Yeah.

2:29

So I saw you sent this over and it was very interesting to discuss, so we're definitely gonna cover that as well. LDJ, what's up?

LDJ 2:37

Yeah, a lot of exciting things.

2:38

You know, Yam, I think you mentioned... actually, were you talking about Wan, like W-A-N, or Qwen? Yes.

Alex Volkov 2:46

Qwen and Wan both.

2:47

Okay. Qwen is the LLMs and Wan is the video models.

LDJ 2:50

Yeah, the new Qwen multimodal model, or MMIO, or omni-modal as we

2:55

might call it. It's able to do audio and text output, I believe, and audio, vision and text input, so three modalities of input, and it's 30B total, 3 billion active. So that's really exciting. But I guess since Yam already covered Qwen: the work from Xiaomi, which is also a similar model, called MiMo, which we have in the show notes that we're gonna cover later. That seems also pretty exciting.
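As a rough illustration of the "30B total, 3 billion active" figure quoted here, the share of a mixture-of-experts model's weights used per token can be sketched as a simple ratio. This is a back-of-the-envelope sketch, not a description of Qwen's architecture; the function name is hypothetical, and the only inputs are the two numbers mentioned on the show.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a mixture-of-experts model's parameters used for each token."""
    return active_params_b / total_params_b


# Figures as quoted on the show: 30B total parameters, 3B active per token.
fraction = active_fraction(total_params_b=30, active_params_b=3)
print(f"Active per token: {fraction:.0%} of the weights")
```

Per-token compute scales roughly with the active count, while memory still has to hold all of the weights, which is why both numbers get quoted.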

Alex Volkov 3:19

Yeah, a hundred percent.

3:20

We definitely... like, I wanted to do this exercise, didn't have a chance, to just go through the show notes and count the number of times Alibaba shows up, and I think it's like half of them. I think it's time to start with the TL;DR, folks. We have a big show. We will also have guests today, because just as we finished the show last week, Vik, who is a friend of the show, released the preview of Moondream 3, and Moondream is quite incredible. And so I asked Vik to come here and tell us all about this and why he thinks Moondream is awesome, and chat with him. So Vik is gonna be here soon as well. So very, very worth sticking around, because Vik always goes viral for some reason. And then hopefully he'll come and tell us about his release. Alright folks, so with this, let's dive into the TL;DR. Let's go.

4:13

TL;DR of course stands for "too long, didn't listen" in our case, because you didn't have the whole time for the show. But we are here to give you the scoop, the super quick newsy bit, so that you'll know everything that did happen. And then, if you want to, you'll know where to go for links. After the live show, we turn into a podcast and a newsletter at thursdai.news. Feel free to subscribe. It's all free. So, hosts and guests this week: me, Alex Volkov, Weights & Biases. Yam, he'll be back here soon. And we have Nisten, LDJ and Ryan Carson this week. And then also, our guests are Vik and his co-founder.

All righty. In open source, we have... maybe like 90% of the show is gonna be open source, maybe a little bit less. Not tons of news from the big companies and APIs, but definitely a huge onslaught of news. Starting with DeepSeek, folks. New DeepSeek this week. In addition to everything else, new DeepSeek this week. DeepSeek, but not like V4. Not like what everybody's waiting for, like R2 or DeepSeek V4: DeepSeek V3.1 Terminus. So we already told you about V3.1, and then DeepSeek ships another tiny update. They call it cheaper, longer context, stronger agents, blah, blah, blah. Very interesting. Some folks are already reporting that it's pretty cool, though I will say DeepSeek did not own this week.

Also we have what Nisten mentioned: CWM, Meta's 32-billion-parameter Code World Model, released for agentic code reasoning. It looks like a research preview. Also on Hugging Face. You have to go and accept a bunch of stuff to even get at it. Very interesting as well.

And then of course, the kings of open source, the undisputed frontier lab in open source at this moment, and they have been for a long time: Alibaba Tongyi Qwen. So Alibaba is the huge company, the parent company, the behemoth, the Chinese behemoth. They have Alibaba Cloud and Alibaba Tongyi. Alibaba Tongyi is two labs, Qwen and Wan, W-A-N. So they do multimodal stuff and video, et cetera. They released just a ton of open source things. We're gonna run through all of them, but basically Qwen3-VL was finally released. Tons of people were waiting for the video- and vision-enabled version of Qwen 3, and that's released with a thinking version and the regular version as well. They do like to release separate thinkers and regular, you know, instruct LLMs.

We also have Qwen3 Omni, which is a 30-billion-parameter open source model with just 3 billion active parameters. It's what LDJ mentioned in the beginning. It's end-to-end multimodal in, multimodal out. They call it omni-modal, unifying text, image, audio and video. And then it can see you and talk to you in multiple languages as well. It's pretty cool to talk to it in different languages. It often switches.

What else did they release? They also released Qwen Image Edit, which is an update on Qwen Image: pixel-perfect multi-image editing. So it adds the ability to add multiple images for references, et cetera. And they also released Qwen3 TTS Flash, although that one is not open source. I think that one is just API, and I think we'll get to another release from Qwen in the big companies and APIs segment.

This is my list from open source. I believe the Wolf firm also dropped something as well. Do you guys have any other open source stuff that we've missed here? Lemme see. What a boring week. There's nothing going on. Nothing. Nothing at all. Qwen Guard: they released a safety moderation model. That's new from them. I don't remember them going into this. I know Meta always releases Llama Guard, if you guys remember. Qwen Guard was also released from them.

Nisten Tahiraj 7:46

Did we cover the tiny IBM one that just blew up?

Alex Volkov 7:51

No.

Nisten Tahiraj 7:52

Okay.

7:52

I just checked Hugging Face. Top of the week: there's a 250-million-parameter IBM image-text-to-text model for just scanning documents very, very fast. If you run it in 8-bit, that's 250 megs. You can literally run that on a toaster, a Raspberry Pi. That's the top of the week now.
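The "250 megs in 8-bit" arithmetic can be sketched as parameter count times bytes per parameter. This is a minimal sketch that counts raw weights only; activations, runtime overhead and file-format metadata are ignored, and the function name is hypothetical.

```python
def weight_footprint_mb(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the raw weights in megabytes (1 MB = 10**6 bytes)."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1_000_000


# Parameter count as quoted on the show: about 250 million.
params = 250_000_000
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_footprint_mb(params, bits):.0f} MB")
```

At one byte per parameter, the megabyte figure matches the parameter count in millions, which is the shortcut Nisten is using.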

Alex Volkov 8:14

Please send it to me, so we'll add it to the show notes as well

8:16

for folks who are listening in and are into running models at home. LDJ, I see your hand up. What do you have for us, open-source-wise?

LDJ 8:23

Yeah, so there's a couple things.

8:24

There is something called power retention. I'll put it in the stream chat as well. Yeah. But that's basically an attention alternative that was open sourced. It was published in a paper a couple months back, but the company just put out a model for it, I wanna say yesterday or two days ago. Nice. And then we also have Liquid Foundation Models, LFM, so Liquid AI. They just released something.

Alex Volkov 8:47

We covered LFM, but I was wondering if they have new

8:49

releases, or whether this is a follow-up on the previous releases. So definitely, let's dive into this. All right? Mm-hmm. Moving, moving. Let's see. Yeah. Okay. We're moving forward. I had a little bit of a different corner here. I called it evals and benchmarks, because I saw two important ones this week. We don't often cover them, and we need to know about them. If you guys remember, Humanity's Last Exam came out and we were like, we're gonna tell you about this, 'cause that's gonna come up. So I had three here. Okay. So, Meta MSL, Meta Superintelligence Labs, and Hugging Face together redid the GAIA benchmark for agent evaluation. And it looks like GPT-5 high dominates the execution, and Kimi K2 takes the lead in open-weight performance. Okay, so we have the agent evaluation stuff from Meta. I saw Scale release SWE-bench Hard, or something like that. They took SWE-bench, and you guys remember there's the SWE-bench series of software engineering stuff, and then there's SWE-bench Verified, which OpenAI kind of clarified, removed, et cetera. And Scale AI released something. Yeah, the way I found out about this was funny, because Alex Wang, who's now in MSL, posted about this, about his friends' work at Scale. So we're gonna add this as well. And I think there was a third one. Unfortunately I don't have the same notes, but there's definitely... oh yes, there's a new benchmark of models playing Among Us, and it looks like GPT-5 is the most mischievous one, the one that is good at lying to other folks and finding out who's lying to other folks as well. So that one, Among Us Bench, is the third one that I had. But speaking of big companies, I think it's time to move to that segment of the TL;DR and just cover that. Ryan, you want to take this first one?

Ryan Carson 10:31

Yes.

10:31

So, there's a couple things going on with Nvidia, actually. But this one is fascinating, right? Nvidia is investing up to a hundred billion dollars.

Alex Volkov 10:39

How?

10:39

How, excuse me. Sorry, what? How much?

Ryan Carson 10:42

A a hundred

Alex Volkov 10:43

ba ba ba billion.

10:45

That's insane. That is absolutely insane. The numbers that OpenAI is throwing out in the air. I was just gonna say, sorry, go ahead.

Nisten Tahiraj 10:51

11 zeros.

Ryan Carson 10:53

Yeah.

10:54

It's so much money, y'all. Like, we say the word billion, but actually, if you realize it's, you know, a thousand millions, this is so much money.

Alex Volkov 11:03

We're not speaking in parameters.

11:04

Again, we've mentioned billions on the show since we started. It's always something billions, and we also mentioned trillions, but this is a hundred billion dollars for 10 gigawatts of compute and power that Nvidia is gonna invest in OpenAI. We're gonna have to talk about this. Alright, we are moving to the next thing. Yam, you wanna take this one?
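To put the numbers from this exchange side by side, here is a minimal sketch of the arithmetic, assuming the full "up to" amount of $100 billion and the 10 gigawatts quoted on the show. It is plain division and says nothing about how the deal is actually structured; the variable names are illustrative.

```python
# Figures as quoted on the show (the investment is "up to" this amount).
investment_usd = 100 * 10**9  # one hundred billion dollars
compute_gw = 10               # gigawatts of compute and power

# Nisten's "11 zeros": count the zeros in the written-out number.
zeros = str(investment_usd).count("0")

# Ryan's "a thousand millions": one billion expressed in millions.
millions_per_billion = 10**9 // 10**6

usd_per_gw = investment_usd / compute_gw

print(f"${investment_usd:,} has {zeros} zeros")
print(f"1 billion = {millions_per_billion:,} millions")
print(f"About ${usd_per_gw / 10**9:.0f}B per gigawatt")
```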

Yam Peleg 11:26

Okay.

11:26

Oh, so xAI is also on fire. Recently we get new Groks all the time, and new categories of Groks all the time. This week we got Grok 4 Fast. The thing about this is, well, it is fast, but it is fast in an intuitive way, because it can also search online fast: you ask it to do a bunch of things, and you see it go all over the internet at light speed to get the answer for you and return it to you. This is what it's useful for. It's not the smartest model, not the most capable model, obviously. It's fast. There is a trade-off in the end, but it is extremely cost efficient. It's multimodal.

Alex Volkov 12:13

Yeah, xAI is not left behind, for sure.

12:17

And then we are back to our friends from Alibaba. So, Alibaba launched their biggest model yet, their flagship LLM reasoner, called Qwen3-Max. And then they also had a launch announcement, and the CEO of Alibaba showed some roadmap stuff. And if you guys were excited about the numbers that we talked about before, with the billions of dollars, they have numbers about the compute and scale that they're planning to go towards. That's also gonna be very exciting. So this is it, folks. This is all I have from big companies and LLMs. Do you guys remember anything from other big companies on topic? Maybe anything

Ryan Carson 12:52

This is not specifically about other labs, but as somebody

12:55

that works in an agent lab now and deals directly with them, the speed and urgency at xAI is unbelievable. I can see them directly, like Anthropic versus Gemini versus xAI, et cetera. They're just in the Slack all the time, pushing so hard, and I'm pretty bullish.

Nisten Tahiraj 13:13

I also heard of them on a lot of the Twitter group chats for some

13:17

of the open source, vibe coding tools. They have been in direct chat with the engineers. Like, some tool call doesn't work. something's off. Someone's using a proxy. It's, they're on it, they're on it right away. so there's a very strong feedback loop, going on there.

Alex Volkov 13:34

Yep.

13:34

Well, another thing from big companies and APIs: Sam Altman posted a video of the Abilene, Texas data center that is being built and scaled up as part of the Stargate initiative between OpenAI, Oracle and...

All right, moving forward to This Week's Buzz corner, where I update you about everything that happens in Weights & Biases. Our Fully Connected series, if you guys remember, I talked to you about Fully Connected, which is our premier conference, is coming up in London, UK, November 4 and 5. If you're there, please join. And I have tickets for you. They're running out, but you always know I got you. And we also have one in Japan that's coming up, I think October 31st. I wish I could be there, but it's Halloween, so I'm not gonna be, but I think it's gonna be super, super cool. So if you are in Japan during those times, please join us. A bunch of great industry folks come to the Fully Connected conferences that we have. We're also probably gonna sponsor some other conferences, like AI Engineer in New York.

Let's dive into vision and video. Big, big, big updates here, from week to week. I don't know, folks, I don't know how to explain to you why this is so popular and why this runs so quick. We'll start with Moondream 3, which is a small vision model, and Vik and the co-founder will join us. Now, this is a preview of the third version. It's a 9-billion-parameter MoE VLM with only 2 billion active. It has an incredible ability to understand the pictures that you're sending, and do pointing, and do segmentation. There's a bunch of stuff; we're gonna ask Vik about all of it.

Also open source, from the same Tongyi lab from Alibaba, is Wan 2.2 Animate, which the whole feed affectionately calls Wan Animate, because why not, 'cause Wan Animate sounds cool and way better. This is a lip sync and character swap model. What does a character swap model mean? You take a picture that you generated with whatever image generation of, I don't know, like LDJ's avatar here is a cat, or Testing Catalog's avatar is like this little green alien with a hat, whatever. And then you actually upload a video of yourself, or record a video of yourself. They will just basically take that image and put it in that behavior, so you can act out stuff. Act-One, for example, from Runway is one such thing, but Act-One was not open source and this one is open source. It's really good. They also have a separate lip sync model as well. Just an incredible, incredible release. So exciting. Those things are so visual. We're gonna try this on the stream soon. Hopefully we'll get there. It's gonna be so fun. Wan Animate is definitely a great open source release.

The folks from Kling also started going crazy. If you guys remember, Kling is one of the top leading video models. This one is not open source, but they released Kling 2.5 Turbo, 30% cheaper. They call this pro-grade AI, with sound.

Voice and audio. We're not over yet with the multimodal stuff. Voice and audio: Suno released v5. They say this AI music model redefines audio quality. Honestly, folks, I'm at the end of my ability to discern between Suno v4 and v5. It's so good. Even v4 was good, but I've honestly tuned into the Suno stream instead of the Spotify stream multiple times and just forgot about the fact that I'm listening to AI-generated music. Qwen also, I think I mentioned this, Qwen TTS Flash and Qwen Image Edit. I think we're at the end of the TL;DR. It only took 20 minutes, usually takes 10. So let's dive into open source at this point, folks. On the semi, something very big. I think it's time for open source.

17:17

Alrighty, let's get it started. And I think as we get started, it's time to hit that breaking news button that I've been waiting for, because we do have some small breaking news that happened while we prepared, so let's do that. Hey, AI breaking news coming at you, only on ThursdAI. All right. I think we are talking about Liquid Nanos. LDJ, do you wanna help me cover this? We'll take a look together.

LDJ 17:47

Sure.

17:48

Yeah. So, Liquid Foundation Models: they are a company working on alternative architectures, trying to make especially long context more efficient, like some other companies are. They released a new set of models ranging from 350 million parameters to 2.6 billion parameters. It's called Liquid Nanos, and I believe it's open source as well. They also have some specialized versions, where they have a set of general purpose versions as well as extract versions of them that are designed to extract important information from a wide variety of unstructured documents into structured outputs like JSON, XML, or YAML. So I think that could be really interesting. And it seems like it's even competing in that type of task with models like GPT-4 and models 10 times its size, while being only around a billion parameters.
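To make "unstructured documents into structured outputs" concrete, here is a minimal sketch of the consuming side: checking that a model's JSON output matches the fields an application expects. Everything here is an assumption for illustration. The invoice-style schema, the function name and the sample string are hypothetical, and the sketch does not call any Liquid AI model or API.

```python
import json

# Hypothetical target schema for an invoice-style document: field name -> expected type.
REQUIRED_FIELDS = {"vendor": str, "total": float, "currency": str}


def parse_extraction(raw_output: str, required: dict = REQUIRED_FIELDS) -> dict:
    """Parse a model's JSON output and check every required field has the expected type."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as err:
        raise ValueError(f"output is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for field, expected_type in required.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # Accept whole numbers where a float is expected (bool is excluded on purpose).
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
        data[field] = value
    return data


# Stand-in for what an extraction model might return; not real model output.
sample_output = '{"vendor": "Acme Corp", "total": 1250, "currency": "USD"}'
print(parse_extraction(sample_output))
```

A check like this is one way to compare small extraction models fairly: an output either parses into the expected fields or it does not.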

Alex Volkov 18:40

Tiny, tiny models.

18:41

Absolutely. And I think LFM... we had folks from LFM on the show. They're focusing on optimization and running on even CPUs and toasters, et cetera, with tiny models, but not only tiny models, also with just performance upgrades as well. So, shout out to LFM, Liquid Nanos. Alright, we are moving on. So much open source to talk about, so let's run through the most important things. We'll cover DeepSeek Terminus super quick, because DeepSeek released a new model and it's always fun to see the whale kind of come back from the deep surface, release something and go back. DeepSeek released Terminus. Let me open this. DeepSeek V3.1 Terminus has reasoning with tool use upgraded a little bit. Some stats: MMLU-Pro jumped a little bit, like by a point, I think. The small updates always interest me, because they always release like a 0.1 release and then one of the evals jumps significantly on top of the other ones. I'm trying to remember which one this was for DeepSeek. I think it's like Terminal-Bench or something.

Yam Peleg 19:43

I think the most interesting thing to note here is that they

19:47

compromised on some evals for others. Yeah. It's not a full upgrade where you see the next model crushing the previous one on everything. It's not the case. It feels more like a fix for known issues and some enhanced agentic capabilities. But they did compromise on some benchmarks over here. You can see, for example, I know it's like 0.1, but LiveCodeBench is a little bit

Alex Volkov 20:23

I think it's the one that gets lower a little bit.

Yam Peleg 20:25

slightly lower

Alex Volkov 20:26

Yeah.

20:26

Mm-hmm.

Yam Peleg 20:27

Yeah.

20:27

So it is interesting to see this release when everyone is waiting for the next big thing from DeepSeek to come out of nowhere and give us a crushing model like they did last year. Now we get an update, which looks like a fix. And from their release, if you just quote it, it'll tell you exactly that.

Alex Volkov 20:50

Totally.

20:50

It does feel like a bug fix release. Absolutely. I have the quote here if you want me to read it. Yeah. V3.1 Terminus fixes the CN/EN code-switching bug, which happens a lot to Chinese models, including Qwen, where sometimes it starts talking in Chinese or English to you, and pushes agentic performance, especially in coding and web tasks. I think the highlight to call out there is Humanity's Last Exam, which is the biggest jump here, from 15 points to 21.7 points. Humanity's Last Exam is just, like, not being the last exam. Ryan, do you have any Humanity's Last

Yam Peleg 21:18

New, updated, now with the final .docx, new copy of, yeah.

Nisten Tahiraj 21:23

Second, last,

Ryan Carson 21:25

Yeah.

21:25

It's interesting. Obviously we talk a lot about evals, and Alex, you think a lot about this being at Weights & Biases, but I still find it wild how, when you're building an agent, it's so hard to figure out exact evals, and I end up just coding all day using a new agent harness, using a new model, and seeing how it feels. In the end, for instance, we tried Codex pretty hard, gave it a good run for its money, and we decided to go back to Sonnet 4 or even GPT-5, which is pretty wild. Interesting.

Nisten Tahiraj 21:55

Interesting.

21:55

I noticed this problem when using 3.1 as well. It would just think too much for some tool calls. And it was annoying, because it made the experience very inconsistent. It would be extremely smart at solving some very complex problems. Like, you're trying to run foreign function interfaces from a Node V8 engine, and then you're trying to go back into the browser. It would just think too much and get stuck, but it would also be very smart. So I think this update might be a bigger deal for those that have chosen to invest time to build their tooling around it.

Alex Volkov 22:32

Yep.

Nisten Tahiraj 22:33

especially like the terminal bench and the browser stuff.

Alex Volkov 22:37

Yep.

22:37

And then they also have two modes there: chat for the JSON, and reasoner for longer thought.

Nisten Tahiraj 22:41

More like, if you use Playwright, it's

22:43

gonna make it a lot better.

Alex Volkov 22:45

Yeah.

Nisten Tahiraj 22:45

I think that's, what this looks like.

Alex Volkov 22:47

All righty.

22:48

We're moving to Meta's. Nisten, while I already have you off mute: Meta released a 32-billion-parameter Code World Model. Tell me what a code world model means. We talked about world models from Fei-Fei Li's World Labs, and we talked about different other world models, visually. What does a code world model mean, and why is it interesting at all to our audience?

Nisten Tahiraj 23:08

Well, there is a generic lack of awareness of programming knowledge

23:14

as your agent or your model is working on a large app, and it's something that you would expect to have from a senior developer, which is stuff like: where is your state, where are your types? What is your compiler doing? What's the biggest rusty gear in this contraption that you have set up? So for that, you need to have some kind of world model of the code base. Like, you're running JavaScript: is it JavaScript, or is this something in the browser? This does not use JavaScript at all. This is built on Python, but the way that they have trained this is very different. The other people that I ask, who are very competent, seem to be coming to the same conclusion, that this is the way it should be. It should understand the compiler very well; it should be thinking of the program from the ground up as it's building it. And it should not just be making a very nice code site. Now, that doesn't mean you're gonna get great results right away, but I think they're really onto something here. If this works out, this could be how everybody else makes their coding models. So yeah, it tries to think more like a compiler and be aware of what is actually going on with the software.

Ryan Carson 24:34

So to sort of pile on here, this is what everyone's

24:37

trying to do with their agents.md file, you know, trying to give this context, where normally a software engineer would walk into a room and remember these things and understand how the entire repo works. And it feels like a band-aid, you know, with all of the Cursor rules or agents.md, and I'm excited by hearing this, where the model actually starts to understand the bigger picture better, as a human engineer would. So excited to see where this goes.

Alex Volkov 25:04

Yep.

25:05

This is, I think, where they're going. They released three models, the pre-trained, three train pre-trained S-F-T-N-R-L. they also released a model that seems to do some very interesting, eval stuff. But I think the paper and the approach is what's very, very interesting here. all right, folks, in the interest of time, we're moving towards the. The darling of this week, Alibaba and, Tonge Qwen. again, for folks who are just like tuning in, I will explain kind of the connection. Alibaba is this huge conglomerate, full of like many, many things they do. and then Tonge is, like the AI things in Alibaba split into two. There's like the Alibaba Cloud and AI and et cetera. And there's, Tonge where they train models. And Togi is con constructed from Qwen models and one models as well. So, Togi Qwen, Leba Qwen are friends during Yang Ling, who, who's been on the podcast multiple times and like a bunch of other folks, they have been on an insane release streak because I think it coincided with their kind of, announcement week as well. so Qwen three, so we're just gonna run through some of them. It's really gonna be impossible to cover. Like, you gotta remember Qwen three, for example, we covered that for like half an hour with Junior Yang, but like, it's gonna be impossible. and I think this is the downside of them dropping seven models in a day and a half, and like the, like, it's really impossible to cover all the things, but Qwen VL is definitely something that folks have been waiting for. Qwen three when it was released, Qwen three was only like a textual model. and Qwen VL is something that like many, many folks waited for. It's the, you know, the, the vision enabled model, that they've trained the, vision encoder on top of this, it's a 235 billion parameter, 22 billion. Active. It's, a thinking version. So this is a reasoner with eyes, basically. it's a huge, huge model. 
In the next segment of our show, we're gonna talk with Vik and Jay about whether you need so many parameters for something like vision. But some of the evals for Qwen VL are very, very impressive. They compare themselves to Gemini 2.5 Pro, GPT-5 at high reasoning, and Claude Opus 4.1, and some of the numbers are really impressive. They have benchmarks here that I never heard about, like ZeroBench and Visual Logic, and they're showing that they're better than the competition. But, you know, on MMMU, for example, getting over Gemini 2.5 with this release, this is a huge model again, but it's very, very impressive. I think somebody called out that DocVQA here is almost entirely solved at like 96%. What else can we talk about? I gotta say that for the visual models, this is one of the longest lists of evals that I've seen released with a model. It's quite something, because every new modality that you add comes with its own list of evals. So, for example, they have evals for video here, Video-MME and LVBench, et cetera, some of which you couldn't even find results for from other models. 2D and 3D grounding, like Hypersim and Objectron. There are evals here where, like, good job on them for even finding those evals and then putting them in the release. Any comments on Qwen VL, folks? Super quick before we continue.

Yam Peleg 28:14

Yeah, there was some skepticism about some of the benchmark

28:17

scores, because we know that some of the benchmarks have label issues, and the scoring is so high on them. It feels like, wait, how can you score so high if there are label issues? I'm trying to find them to show it, because, you know, criticism and so on, I wanna show it if I'm saying something like this. But, yeah, we just don't have time. But there are people questioning some of the benchmarks, because the benchmarks themselves are not perfect. Anyway, it's an absolutely amazing release. Just look at this thing. I'm not sure how many people there are at Tongyi Lab, but man, you guys are absolutely shipping, firing on all cylinders. Like, that's insane. On the same day, I think, as Omni, on the same day as Qwen Max. That's crazy. Absolutely crazy.

Alex Volkov 29:07

Grounding, visual agent, understanding of the, you know,

29:11

they say they're SOTA on OSWorld. OSWorld is just, like, click-things tasks. And we have screenshot-to-code for HTML, CSS, and JavaScript, and they support something called Draw.io. Long context, scalable to 1 million tokens. I think that this is also important: this is an open source model, and with some tricks you can scale the vision model to 1 million tokens, which supports up to two-hour videos and very long PDFs. So a very much incredible release, Qwen VL. Another very much incredible release from Qwen that we'll cover before we get to our friends from Moondream, who already joined, is Qwen 3 Omni. LDJ, wanna talk about this one? I think we mentioned this, and I wanna see what we can share about it. We've been excited about omni models here for a long time. Obviously, a model that you can talk to natively, where you don't have to transcribe what you say and send it via text, and that can also see what you see. All these models are very interesting.

Nisten Tahiraj 30:02

Dude.

30:03

Well, first of all, the 235B is the one I use the most for any... sorry, this was, it's not... no,

Alex Volkov 30:11

You're good.

Nisten Tahiraj 30:11

and also that was the second highest benchmarking medical

30:16

one that I was able to run, even on stuff that was not on the benchmarks. So them adding vision to it, it's a pretty big deal. I don't use the thinking version because I find it adds way too many tokens and not much more to the quality. But, yeah, meanwhile, it just runs like crazy on even just two 3090s. Does your business use Gemini? Okay, you can actually just host it now. You can get most of the work done, and as data gets better and as training gets better, I'm still a big believer that 3B active parameters is probably all you're gonna need.

Alex Volkov 30:55

Yep.

Nisten Tahiraj 30:55

But, yeah, let's see where this goes.

30:57

I just wanna see more people, well, including myself, just like make more apps with it that are just the whole stack of the apps, that's self-contained and you can talk to it and it talks back and it does everything all in one container. 'cause we're not seeing a whole lot of it applied, but, yeah, I'm sure it'll come.

Alex Volkov 31:16

All right.

31:16

So, thanks. We have Qwen 3 Omni, super quick, to cover. I think Qwen 3 Omni is something to show and tell about, and, you know, we will do this after the conversation. LDJ, go ahead. Give us your thoughts about Omni, super quick, and then we're gonna move on to our interview, and we also have some breaking news afterwards.

LDJ 31:33

Yeah, so Qwen three Omni, it's able to take in audio streams

31:38

and also output audio streams. And if I remember right, with this architecture they also have a thinking ability detached from the audio. So essentially it can stream audio in real time while also having the separate thinking process in the background. And yeah, just really interesting, and the fact that it's only 3 billion active parameters, again, like their original non-omni models, means that you could run it pretty fast, even on consumer hardware, and it's gonna be exciting for consumer local speech streaming.

Alex Volkov 32:08

Yep.

32:09

I was very excited about these models when OpenAI released 4o, where the "o" stood for Omni, because you could talk to it and it could see you. Since then we had, you know, Kwindla on the show, the guy from Pipecat, and he mentioned multiple times that the downside of these models is that they're not as good. I actually had a tweet go, not viral, but definitely noticed by OpenAI this week, where OpenAI's advanced voice mode is starting to become really dumb, at least for me. And many, many people confirmed it's the same for them as well. Like, it adheres to instructions too much. And some people confirmed it like, hey, advanced voice mode is not as advanced as the basic voice mode. Basic voice mode did basically what we're saying: it transcribed you and then sent it to the model. Advanced voice mode is a whole model, and that became not that great. So omni models are great in theory, but I think the industry caught on that if you scale up the language part of the model and then do some stuff on the periphery, you'll get much better intelligence from these models. But it's still cool to talk to it and hear a native output, like, it's still really, really cool. The Qwen model, I played with it, it's really funny. They have language support: they understand 119 languages, they said, and they get a sub-250-millisecond real-time response. So the benefit there is definitely response time. And then it speaks in like nine languages, so I had it speak to me in Russian. The code switching is still a problem for these models. It started speaking Chinese out of the blue. It didn't really understand, and tried to apologize for it. So from a user's perspective, this model is not as great as, for example, advanced voice mode, which also goes down in usage as well. So I think Qwen has a bunch of other stuff as well.
We're gonna mention them in the other sections of the show, but now it's time for us to go to Moondream, because just as we finished the show last week, Vik and Jay, who are joining the stage now, decided to release it post-ThursdAI, not on a Thursday. So if you guys have been here with us for a long time, you remember Vik. Jay, this is the first time that you're joining us as well. We covered Moondream 2 back then, and Moondream was this incredible small model that's incredibly good at looking at things. And you guys released a preview of 3. So I would love to introduce you guys, say hi, and, you know, what are you guys releasing? I'm very interested in hearing what's new with Moondream 3.

Jay @ Moondream 34:18

Sure.

34:18

Well, I'm Jay. I'm the CEO of Moondream. I co-founded Moondream with Vik. I know Vik's been here a few times. Wanna say hi, Vik?

Vik Korrapati 34:26

Hey folks.

34:27

Very happy to be on again.

Jay @ Moondream 34:30

So, yeah, really quickly about Moondream 3.

34:32

It's a new architecture, a new model, and I guess it's really addressing two things while keeping one thing kind of similar, which is that Moondream is known for being an accurate, highly accurate grounding kind of vision language model that's very fast. It's very small. We kind of keep saying we're focused on the top left of the graph: the most intelligent tiny model there is. This release kept it small but really, really bumped up the intelligence level. So we're reaching state-of-the-art scores on some grounding stuff, which we can get into later, while keeping the size the same. The active parameters: this is a mixture-of-experts model, so it's a 9B mixture-of-experts model with 2B active parameters. The same 2B level of active parameters as the first model, but the first model was a dense model. So the first thing we addressed is that the intelligence went way up. And the second thing we addressed is what we've seen perpetually since the launch of Moondream, which is this last-mile problem in the vision AI space. People use a VLM, they get super excited about the accuracy and ease of use, and then they run into this: ah, but it failed in this case, it failed in that case. We've been really, really diving deep this summer into reinforcement learning, and have become converts ourselves. This model is kind of the product of a ton of RL itself, but more importantly, it's a really effective model at RL. So we'll be launching the RL part a bit later. It's just way smarter, and it's really, really great at RL.

Alex Volkov 36:06

Let's talk about, thank you so much, Jay, for covering this.

36:08

Vik, maybe I can direct the next question to you. Let's talk about capabilities, because I know some folks know about Moondream, but definitely not everyone. You have capabilities like pointing and detecting specific things, and here they're paired with reasoning. I think the cooler example that we're showing now on screen, for folks who are just listening, is that you are kind of asking the model non-descriptive things. You're not telling it what you're looking for. You're describing in general what you would like the model to give you. Like the picture with the bridesmaids, for example: you have a picture with the bride and the bridesmaids, and the model counts the bridesmaids. Can you talk about the capabilities that lead to being able to do this, and why would somebody need something like this?

Vik Korrapati 36:52

We do a different style of reasoning and thinking compared

36:55

to most of the thinking models, where we focus very much on allowing the model to ground its thoughts. When you think about how a human processes a picture, like that bridesmaid picture, for example, if someone asked you how many, you'd start pointing at them. You'd be like, 1, 2, 3, 4, 5, 6, 7. And that's really what we enable the model to do. If you look at that example, that underlined text, the model actually generates those points internally. Pointing was a skill we had introduced to the model before reasoning, and we basically allowed it to leverage all of its skills in terms of pointing, object detection, et cetera, while reasoning about images. We find this works very, very well for tasks that involve grounding and counting especially; we're state of the art on counting compared to every single frontier model out there. Similarly with our object detection: when you use a VLM to do object detection, it's going to be slower than the object detection models most people are familiar with, YOLO, et cetera, just because the parameter count is so much higher. But on the flip side, it enables you to do open-vocabulary object detection. You just type in "person wearing metal" or whatever, and it just goes and finds it. You don't have to train specifically to find the thing that you want to detect. We find that's useful for a lot of customers.

Alex Volkov 38:09

It's more general and you, you guys work with customers.

38:11

So maybe, Vik, the next question to you is: who needs this? Who needs a vision language model that's tiny and open source, versus something like the dedicated vision models like YOLO, et cetera, which are very tiny and maybe run on CPU? So you guys are kind of, the way Vik described it to me, in the middle. On the smaller side, you have the dedicated vision models, et cetera. On the larger side, folks maybe wanna pay OpenAI and trust them and use the big models that can do everything, including vision, but maybe they're not state of the art. And you guys are kind of in the middle with small models that can deploy on-prem. Who uses this? Who's the target audience? Can you gimme some examples of the customers and their use cases for this model?

Vik Korrapati 38:56

Jay, you wanna take this one?

Jay @ Moondream 38:58

Sure.

38:58

There's kinda like two parts to unpack, I guess. But the simplest way I can answer it, to be honest, is that we're just seeing a rise of agentic uses for vision. Vision is a bit different. I think that the huge models are great when you're doing kind of a one-off, you know, you're having a conversation with it and you want it to think really hard about something. But when you're about to put it in production, you can't wait 30 seconds per frame. Vision is usually associated with something that needs to take action quickly, and that's just too slow. And also there's cost, which is, if I were to analyze one frame a second of a real-time stream, that would cost me like $18,000 a month on GPT-5. That just doesn't scale. The cost doesn't work. So as we're seeing customers want to adopt vision language models to automate and improve their manufacturing or retail situations, or drone operations and so on, they want something that's fast, and they want something that's cheap and can run continually. So that's the main thrust of it. But to be honest, I think we're still early in the VLM story, and there are all kinds of new use cases that people bring to us. And that's one of the great parts of us being open source: we have people come to us with a ton of new stuff and kind of educate us on the amazing use cases.
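
Editor's sketch: the arithmetic behind Jay's figure can be checked in a few lines. This is a back-of-the-envelope calculation, assuming a 30-day month and a constant one frame per second. The per-frame cost below is only what the quoted $18,000 implies; it is not a published GPT-5 price, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def frames_per_month(fps: float, days: int = 30) -> int:
    """Number of frames analyzed in a month at a constant frame rate."""
    return int(fps * SECONDS_PER_DAY * days)


def implied_cost_per_frame(monthly_cost_usd: float, fps: float, days: int = 30) -> float:
    """Per-frame cost implied by a monthly bill (an inference, not a price list)."""
    return monthly_cost_usd / frames_per_month(fps, days)


if __name__ == "__main__":
    frames = frames_per_month(1.0)                   # 2,592,000 frames in 30 days
    per_frame = implied_cost_per_frame(18_000, 1.0)  # roughly $0.0069 per frame
    print(frames, round(per_frame, 4))
```

At one frame per second, a seemingly tiny per-frame cost multiplies across about 2.6 million frames a month, which is the scaling problem Jay describes.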

Alex Volkov 40:18

That's awesome.

40:18

Yeah, I think you had a few questions.

Yam Peleg 40:20

You have a very... you said Moondream 3 is doing something

40:24

weird with the attention layer. How did you get this performance? It's really, really impressive.

Vik Korrapati 40:31

I will, I'll take the first one and then I'll

40:33

let you handle the second one. Our stuff with the attention layer really started from us trying to extend the context length. When you're doing agentic stuff, you need a lot more context, and our previous model was just 2,000 tokens because it was single-turn question answering. When we were working on that, I started reading some papers, and almost all context extension papers, post hoc, do some sort of temperature scaling in the attention layer. Temperature is how smooth the attention is when you're attending to your past tokens. And so I figured, hey, why not just make this learnable? Why not let the model learn, based on position, how to adjust its temperature on a per-head basis? And then I looked up papers doing similar things, and I found this paper that nobody paid a lot of attention to. It was called Selective Attention. I think I linked to that over there. In addition to the position-wise learnable temperature, they also make it data dependent. So depending upon the content of the current token, you can adjust what your temperature is going to be. And we found that this massively helps with both long context and just general-purpose modeling. There's a paper that came out after this called Gated Attention, which explores a more generalized version of this. But we think this is something that almost everyone would benefit from putting in, and it's very parameter efficient. It adds like an extra 0.5% to your parameter count and almost no extra FLOPs. So it's a great idea. The other thing we've spent a lot of time on is training stability in mixture of experts. One thing with mixture of experts is when you have a distribution shift during training, right? When you switch from pre-training to fine-tuning or post-training, the distribution of data changes a lot. And at that point, the easiest thing for the model to adjust is which experts tokens get routed to.
And it makes some very shortsighted decisions on that front that cause it to lose a lot of the knowledge learned in pre-training, because now it's routing stuff to new experts that aren't specialized for those tasks.
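
Editor's sketch: to make the "temperature" idea concrete, here is a minimal illustration of temperature-scaled attention weights with a position-dependent temperature. This is not Moondream's implementation. The parametrization in `position_temperature` and the names `log_slope` and `bias` are hypothetical stand-ins for values a model would learn per head, and the data-dependent term Vik mentions is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, temperature):
    """Attention weights from raw scores; lower temperature gives sharper weights."""
    return softmax([s / temperature for s in scores])


def position_temperature(position, log_slope, bias=0.0):
    """Hypothetical per-head temperature as a function of token position.

    The exponential keeps the temperature positive; log_slope and bias
    stand in for learned parameters.
    """
    return math.exp(bias + log_slope * math.log1p(position))


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.0]
    near = attention_weights(scores, position_temperature(0, log_slope=-0.2))
    far = attention_weights(scores, position_temperature(1000, log_slope=-0.2))
    # With a negative slope, attention gets sharper at later positions.
    print(max(near) < max(far))
```

With a negative `log_slope`, the temperature falls as the position grows, so the weights concentrate on the highest-scoring tokens in long contexts; a learned parameter could just as well push in the other direction.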

Alex Volkov 42:26

So I thank you guys for this and thank you for coming up.

42:29

We have tons of other news to cover as well, but I really wanted to feature you. Vik, you launched Moondream, I believe Moondream 2, on the show, and it's great to see how far it's come and how many people get very, very excited. While you guys were chatting about this, I went to the Moondream playground, which everybody can use at moondream.ai/c/playground. First of all, congrats on the very much improved playground. There are some images here that I'm showing, for folks who are just listening. There's an image of a cutting board with a bunch of vegetables, avocados, tomatoes, eggs, et cetera, green onion, and there's a knife. And the type of non-descriptive prompting that you can do with these models, unlike the dedicated vision models, is like, "find me something that can be used as a weapon," and it shows a knife, for example. Those examples are very visual, and just the fact that there's reasoning about what it sees is incredible to me. The fact that it's in a small package and runs on only 2 billion active parameters is just absolutely incredible. We'll never cease to be amazed by the type of stuff that can be done with just a very, very small number of parameters. So, huge congrats to you guys on the release. Any update on when the actual full Moondream drops for folks, Vik?

Vik Korrapati 43:37

it's gonna be in a couple of weeks.

Alex Volkov 43:38

yeah, we said a couple weeks.

43:42

The audio's breaking up. Vik started training the model and the FLOPs on his CPU ended, so now there are no more FLOPs for the microphone. I heard this, Jay, so we're gonna be looking forward. He kicked off the training, so we can't hear him anymore. I really appreciate your time here. Thank you so much for coming up. Folks can find Moondream at moondream.ai, and then we're gonna move forward, because we have breaking news as well. Another breaking news. Thank you, guys. AI breaking news coming at you only on ThursdAI. As a reminder, folks, this is why ThursdAI is called ThursdAI, because people love to ship on a Thursday, and we have quite a few new updates. Super quick: while we were interviewing Vik and Jay from Moondream, we got a new robotics thing from Google. Let's take a look at this. That's super cool. Logan posted: introducing our first widely available robotics model, Gemini Robotics-ER. Actually, somebody in the comments shouted it out, so I wanna highlight the community folks, because if you are listening to us and also monitoring the feed and you know what's going on, please comment as well. But basically, with Gemini Robotics-ER 1.5, they're claiming state of the art on a set of embodied reasoning tasks, and it can be used directly through the Gemini API. Embodied reasoning tasks means... what does it mean, LDJ? Basically a robot's point of view and doing things with limbs, as far as I understand embodied tasks.

LDJ 45:05

Yeah.

45:06

I think sometimes it's also called egocentric data, but yeah, when you basically have a camera or sensor specific to that robot, and then you have the robot doing a bunch of tasks and a bunch of information from its POV.

Alex Volkov 45:17

Yep.

45:18

And so they're showing this kind of eval where Gemini Robotics-ER thinking is significantly outperforming the general models, Gemini 2.5, GPT-5, and GPT-5 mini. They're not including any other robotics models here, but basically there's also a writeup, and they're saying, very clear, Logan said this very clearly, that robotics is the future. So I'm very interested in seeing how that's gonna go, and also whether or not Google is gonna actually step into actual robotics or just provide the software. Because I've been waiting for my humanoid robots and they're not coming, although I did see quite a few updates this week about humanoids learning to walk, et cetera. This is the new update for Gemini Robotics: fast, powerful spatial reasoning, orchestrating agentic behaviors, agent as in a real-world agent for your robot. Basically, the summary is this title, "an agentic brain for your robot." This is what they released, Robotics-ER. We also know that Nvidia does a bunch of stuff in robotics and is looking at this robotics future. You guys remember Jensen standing on stage, and behind him a full row of humanoid robots of different shapes and sizes, as they kinda scale that up. Any last comments about this? We have another small update, actually from OpenAI, that's also breaking news. And literally this happened like, what, 10 minutes ago? OpenAI said: today we're introducing GDP Eval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. This feels similar to me to something like what Anthropic is doing. I think Anthropic also has some similar GDP-related stuff. Very similar to, if you guys remember, the bench where Grok or some other things...
Oh, Vending-Bench, where Grok and other models are running a vending machine, actually, like, you know, the model is in charge of ordering things. GDP Eval from OpenAI is a new thing. GDP Eval spans 44 occupations selected from the top nine sectors contributing to the U.S. gross domestic product, GDP. This is very interesting, because some people measure AGI by the projected GDP. We talked about this at length at some point. Some people measure whether or not AGI is gonna be AGI by how much of the economy it's gonna capture, or how much it's gonna improve. So it looks like OpenAI has an eval for this now.

LDJ 47:44

Yeah,

Alex Volkov 47:44

let's take a look at some tasks.

47:46

we can take a look at the task.

LDJ 47:48

Yeah, I guess one comment would be, in terms of OpenAI's definition of

47:52

AGI, I think they're one of the only labs that have had a somewhat specific definition beyond just saying, oh, it's artificial general intelligence. Their definition is something along the lines of a highly autonomous system that is able to outperform a majority of humans at a majority of economically valuable labor. I personally think that's a pretty good marker and goal to strive for. Hopefully we get more evals like this that more directly, or better, serve as proxies towards that type of goal.

Alex Volkov 48:26

A hundred percent.

48:27

I wanna show and talk about this stat, or this screenshot from the eval. I gotta give props to OpenAI here, and everybody else who releases evals: when you release an eval and your own top models are not on top of it, that gets applause from me for sure. They have a screenshot here from GDP Eval, win rate performance on economically valuable tasks. Claude Opus 4.1 is leading this eval, and they have this line: industry expert parity is around 50% on the benchmark here, win rate versus an industry professional. Claude Opus 4.1 gets around a 47.6% win rate against an industry professional. First of all, this is not OpenAI's stuff. OpenAI's GPT-5 high gets 38%. So, like, Anthropic cooked here. Anthropic absolutely cooked in the real economic stuff. But also there's the fact that we have superhuman coding agents, they won the algorithmic challenge that we talked about, ICPC. And now we have models coming very close to parity with industry experts in this field of economically valuable tasks. I think we're moving super, super fast, and we're gonna get to the point where we talk about NVIDIA's investment in OpenAI very soon. I'm just seeing this progress, and I'm amazed. Go ahead.

Yam Peleg 49:43

What are the actual questions in this eval?

49:45

I want to understand how we measure it. What are we actually measuring? And again, I mean, respect for releasing an eval where Anthropic is leading. I mean, respect. You could have kept these evals secret and still measured it, but releasing it? Respect, seriously. So here, here's an

Alex Volkov 50:01

On the left side, we have the prompt and the task context.

50:04

You're a manufacturing engineer on an automobile assembly line. The product is a cable spooling truck for underground mining operations. You're reviewing the final testing step. This task is complicated, has associated risks, requires high labor, and makes the work area cluttered. And then there's the whole list of requirements, and the output that they're looking for is something like the overall design with an exploded view of components. This is the experienced human's deliverable. So I think they're looking for the model to output something like this, like on the right, where it would, based on these instructions, generate an actual machine for the automotive industry. This is one example that they posted. I'm assuming there's more here as well. We have to move forward because we have tons of other stuff. I think we covered open source pretty much to the extent that we could this week. I mean, we're already in the evals and benchmarks area, right? So, the eval from OpenAI, we're gonna add to this. Let's mention the other benchmarks as well. Meta's MSL and Hugging Face together released Gaia2 and ARE, which are Meta's evaluations for agentic things. So let's take a look at some of the performance here. Two categories of things, execution and search. GPT-5 high is leading. Very interestingly, the second model after that is Claude 4 Sonnet, and then Gemini 2.5 Pro is right behind this. Releasing an eval where at this point the top model gets 80%, I think, is not super constructive, but hey, this is what they have. Very interestingly, again talking about releasing evals that your model is not on top of, this is an eval from MSL, Meta Superintelligence Labs, and the Llama models, Llama Maverick, are not remotely close to the top here. So they also released an eval not featuring their own models. So GPT-5 high from OpenAI is leading on execution, search, ambiguity, adaptability, and noise here.
And then Kimi K2 from Moonshot is leading among the open-weight ones, which is this blue thing from Kimi. What type of questions? That's a good question. They have a whole writeup that we'll add in the show notes. We're not gonna dive into this, but Gaia was a very familiar benchmark from before, so this is like a second version of Gaia. And they have a budget scaling curve. I think they adjusted for running time as well, because the longer models run, the more they can go for. LDJ, comments on this one?

LDJ 52:28

Yeah.

52:28

I can give a quick example of what a question on the original GAIA benchmark would look like. So it would be something like: which astronaut in NASA's Orion announcement is the fifth astronaut from the left, halfway down the page, in the picture at that location on the webpage? And it's really long-winded questions where you have to go through, you know, 5, 6, 7 steps, sometimes less, sometimes more, in order to look at images and analyze what next step you should take. And basically just finding really obscure information, usually on the internet.

Alex Volkov 53:00

Yeah.

53:01

And thanks, LDJ, for the comment. They added a thousand brand new human-created scenarios, and they span execution, search, and ambiguity handling, which is what happens with, for example, conflicting things in user requests, or scheduling conflicts, et cetera. Adaptability, time reasoning, temporal reasoning. I think this one is a big one. We keep talking about LLMs having very bad temporal reasoning. Like, when you talk to LLMs, they assume it's now. And then when you talk to them tomorrow, they still assume you're at the same point as them, 'cause they don't really exist on the time curve, necessarily. I think time-sensitive actions, like ordering a cab after a delay or some stuff like this, are part of this agent benchmark. Gaia runs with an execution environment where an agent of your choice has access to a combination of applications and associated pre-populated data. They mock up a smartphone environment for this, simulating what a human would use in their daily life. I think it's because Meta wants to have a personal agent do stuff for you from your glasses. I think that's part of the thing, and they wanna compete on this benchmark. So good luck to them, and hopefully they'll beat this benchmark moving forward. The other one is from Scale AI, Alex Wang... Alexandr. Let me see if I can find the exact thing that I was mentioning. Gimme a sec. Yes. SWE-Bench Pro from Scale AI includes multi-file edits, a hundred-plus lines changed on average, and complex dependencies across large codebases. This is SWE-Bench Pro, and the current top models on SWE-Bench Pro are GPT-5 with 23% and Claude Opus 4.1 with 22%, and other models drop below 15%. This is SWE-Bench Pro from Scale AI, and they have two sets: the public set and the commercial set. And the dataset is on Hugging Face as well, including the code. Here is the performance comparison. GPT-5 Codex... GPT-5.
The one released in July is the leading model on this benchmark, followed closely by Claude Opus 4.1 and Claude 4 Sonnet. And other models drop further, like Qwen 3, for example, very low. Very interesting that they've updated this. The commercial dataset is a different dataset. I don't think they expose the commercial one. On this one, Claude Opus 4.1 takes the lead as well.

Nisten Tahiraj 55:16

My only comment is, just stop letting Alex name benchmarks.

55:20

It's almost the dumbest name, or here he just added a "Pro" to it. Nobody knows who or what this benchmark is for. Alex is really bad at naming benchmarks. If he likes the name, don't use it, just change it to something else.

Alex Volkov 55:38

And also he's not part of the lab anymore.

55:40

Go ahead, LDJ.

Nisten Tahiraj 55:42

Oh yeah, that too.

LDJ 55:43

Yeah.

55:43

on the scale website, there's two different subsets of the benchmark. Basically there's the commercial data set, which I guess is more focused on enterprise use cases.

Alex Volkov 55:51

Yeah, definitely two sets of, of comments.

55:53

And then I think we also have a whole PDF of research in here as well. So they released the whole PDF, and they're showing GPT-5 taking the lead on the public one, and then Opus taking the lead on the commercial one. They call this contamination-resilient curation, built from commercial repos sourced from purchased startup codebases and copyleft public repos. Copyleft public repos. Very, very interesting. Alrighty folks, the last one I wanted to cover in the evals section is this Among Us bench, super quick. I think models playing games is actually very important. This is a benchmark where somebody pitted models against each other playing Among Us, the game where you have to deceive the friends you're playing with into believing that you're not the killer. And it looks like OpenAI models are the best at deceiving, which, I dunno if that's great as a concept. We are moving forward to big companies news. But just before this I'll do a quick This Week's Buzz to let you know about something that's coming from Weights & Biases. And then we're gonna keep talking about Nvidia's incredible investment into OpenAI. So stay with us.

57:15

All right folks, welcome to This Week's Buzz, where I update you about everything that happens in the world of Weights & Biases from CoreWeave. And I think the only update I have for you this week is that we are coming to London on November 4th and 5th. So if you are listening to us from Europe and you would like to come to London for a few days of great speakers and to talk to folks who also use Weights & Biases and CoreWeave models, please feel free to take a look at the show notes after this. You can go to fullyconnected.com and purchase the tickets there. I have a promo code for you as well. I'll share it at the end of the show. Actually, I'll show it now. So if you go to fullyconnected.com and you enter FCLN, thirst ai (I'm gonna add this to the newsletter as well), you are able to get a full discount and go for free. So if you are in London, or if you would like to visit that conference, feel free. We also have a conference coming up in Tokyo on October 31st as well. So if you're in London, reach out to me. I'll find a way to get you in there as well. This is my only update here for Weights & Biases and CoreWeave. We are working on some incredible, incredible stuff for next week and hopefully the week after, so I'll keep you up to date there. Let's move on. In the big companies section, I wanna talk about this thing, because the numbers are insane. Let me pull this up on the screen. Basically, I think this week's biggest announcement from the big companies and APIs was Nvidia and OpenAI: Greg Brockman, Jensen Huang, and Sam Altman all standing, two of them in leather jackets, announcing an insane hundred-billion-dollar investment that Nvidia will make in OpenAI over the next however many years. And, I don't know, the whole internet exploded with the meme of the infinite money glitch, because, I mean, obviously OpenAI buys GPUs from Nvidia to train their models, and now Nvidia will invest in OpenAI.
So if you draw it as a diagram, it kind of looks like a circle. Money's going in a circle. But also, folks, let's talk about the absolute numbers here: a hundred billion dollars' worth, to turn into something like 10 gigawatts of compute. Jensen said, famously: this is the biggest infrastructure project in history. Computing demand is going through the roof for OpenAI. Every person I know uses ChatGPT. This is literally a quote from Jensen Huang. We talked on ThursdAI, I think a week ago, about that research paper from OpenAI: they got to a point where in July there were 700 million weekly active users of ChatGPT. I'm assuming there's more now, because it grows, and, you know, September started and the students went back to school. So what do we think? What comments do we have about this money going in a circle, but, it seems, with no end for AI? Folks, I would love to open commentary and just have a little banter on this.

LDJ 1:00:07

somebody has mentioned before that it's actually not

1:00:10

that unusual compared to what Microsoft and OpenAI originally did together. Back in, I wanna say, around the time that GPT-3 came out, Microsoft invested a few billion into them. But what ended up happening is, for the most part, OpenAI basically spent a bunch of that money just renting Azure compute and, you know, paying Microsoft to use their compute. And that was also kind of a circular investment. But yeah, it's exciting that this amount of power and compute is gonna come online. And I'm sure other companies are probably going to add competitive amounts too.

Alex Volkov 1:00:43

There's a famous quote from Sam Altman that says,

1:00:46

you cannot out-accelerate me. And it feels like we're getting this more and more from Sam Altman as well. Yeah. Many comments on this, like, insanity of 10 gigawatts' worth of millions of GPUs.

Yam Peleg 1:00:58

I wanted to pull out the charts.

1:00:59

There are charts of the entire United States electricity production and the fraction of that going to AI. Same with the S&P 500: the amount of capital, and the amount of capital that is in AI at the moment. These numbers are crazy. I mean, this is already more than a couple of countries; that's the electricity production of a full country in some places in the world. It's pretty crazy, the speed at which that thing actually stepped up. And it's great that you mentioned that Microsoft did pretty much the same kind of investment structure at the time of GPT-3, but it's a completely different thing. At GPT-3, OpenAI didn't have a revenue source at all. Now it's an extremely profitable company. GPT-3 was kind of, yeah, it was a hundred x less, I think, but at the time it was just risk. Now it's a different thing. It's an infrastructure that is mandated by the government in some sort of a cold war between several countries that see this as a national asset at this point. And yeah, it's pretty cool to see that Oracle is moving a hundred billion dollars this way, then Nvidia is moving a hundred billion dollars that way, then OpenAI takes the hundred billion dollars and just returns it back to Nvidia, so Nvidia would give them equipment. And the best part this time: because of the news announcement, the general public put a hundred billion dollars into Nvidia stock. Yeah. So basically the money is just printed out of thin air. Yes. And that's the infinite money glitch.

Alex Volkov 1:02:41

Jensen announces a new a hundred billion dollar investment

1:02:45

over the next couple of years in OpenAI, Nvidia stock rises more than a hundred billion dollars because they're already like crazy.

Yam Peleg 1:02:53

We had a comment from Nisten about this.

1:02:54

Yeah. Nisten.

Nisten Tahiraj 1:02:57

Oh yeah.

1:02:58

A lot of people, some more than others, are quite concerned with the money loop, because loops can crash. But there is a pretty solid underpinning here: if you look at any government, the biggest expense of pretty much any government is just healthcare and education. And that ends up being like over 50 or 60%. And now you do have the tech that's actually very, very good at those, and getting better and getting applied better. So this will work. There is a pretty solid economic basis to do this. Now, whether the loop will boom or suffer, or how it'll go, the risks are there, but I think there's gonna be more of these, to be honest. I'm pretty bullish on it. It's also kind of crazy. I mean, we're building a 10 gigawatt nuclear reactor here in Ontario. We're just expanding the one that was there. And each one of these things is huge. And they're adding four more of those. And that was a big plan for the next, I don't know, five or whatever years. And now it's gone. It's just gone. it's a 10 gigawatt, just stick caught. There's a use for it already.

Alex Volkov 1:04:08

Ooh.

1:04:09

All righty. We have another breaking news. Just before we get to breaking news, LDJ, one second. So we had the Nvidia announcement, a hundred billion dollars into OpenAI for 10 gigawatts of compute. But in addition to this, you guys remember Project Stargate that was announced with Oracle and SoftBank? They announced this: OpenAI, Oracle, and SoftBank are announcing five new AI data centers under Stargate, OpenAI's overarching AI infrastructure platform. The Nvidia stuff is not Stargate. It is a different thing. They're working across multiple things. The combined capacity from these five new sites, along with our flagship site in Abilene, Texas, and ongoing projects with CoreWeave, brings Stargate to nearly seven gigawatts of planned capacity and over 400 billion in investment over the next three years. This puts us on a clear path to securing the full $500 billion, 10-gigawatt commitment we announced, ahead of schedule. So basically there's multiple projects in the tens of gigawatts. And the comment that I saw from Sam Altman, which he posted internally at OpenAI, and somebody reposted, Alex Heath I think... where's my comment? OpenAI started this year at around 230 megawatts of capacity and is now on track to exit 2025 north of two gigawatts of operational capacity. Two gigawatts. And they're looking for at least 20 gigawatts: 10 from this Nvidia deal and another 10 from Project Stargate.

Nisten Tahiraj 1:05:33

you can calculate how many GPUs they had on that because if

1:05:36

you average out an H100 to, I don't know, just say one kilowatt (well, it's more like 0.7), then that's like 230,000, maybe close to 300,000.
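
As a rough sketch of the back-of-envelope estimate above (not something shown on the episode): divide the quoted power capacity by an assumed per-GPU draw. The 230 MW figure and the 1 kW and 0.7 kW guesses come from the conversation; the function name is hypothetical, and the calculation ignores cooling, networking, and host overhead.

```python
def gpus_for_power(capacity_mw: float, kw_per_gpu: float) -> int:
    """Rough number of GPUs a power budget could feed, ignoring all overhead."""
    capacity_kw = capacity_mw * 1_000  # megawatts -> kilowatts
    return round(capacity_kw / kw_per_gpu)


# 230 MW is the start-of-2025 capacity quoted on the show.
# 1.0 kW and 0.7 kW are the two per-H100 guesses mentioned on air.
for kw in (1.0, 0.7):
    print(f"{kw} kW per GPU -> about {gpus_for_power(230, kw):,} GPUs")
```

Both assumptions land in the low hundreds of thousands of GPUs, which is the order of magnitude quoted in the conversation.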

Alex Volkov 1:05:45

All right, LDJ, we're ready for the breaking news.

LDJ 1:05:48

real quick on the gigawatt note, I was going to note from that same source

1:05:51

that said the thing about 230 megawatts to two gigawatts, Sam also apparently said that by 2033 they're aiming to have 250 gigawatts of compute. That's the long-term goal. And also, a clarification on the Stargate and Nvidia thing: I do believe that the Nvidia investment is actually towards the Stargate compute, because both mention 10 gigawatts, and Nvidia mentions partnering with the Stargate partners. I think it is part of Stargate.
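
To put those capacity targets in perspective, here is a minimal sketch of the implied yearly growth. The capacity figures (230 MW, 2 GW, 250 GW) are the ones quoted in the conversation; treating "exit 2025" to 2033 as eight years is an assumption, and the function name is hypothetical.

```python
def annual_growth_factor(start_gw: float, end_gw: float, years: float) -> float:
    """Constant yearly multiplier that takes start_gw to end_gw in `years` years."""
    return (end_gw / start_gw) ** (1 / years)


# 2025: roughly 0.23 GW at the start of the year to 2 GW at the end.
print(f"2025: x{annual_growth_factor(0.23, 2.0, 1):.1f} in one year")

# Assumed eight-year span from end of 2025 (2 GW) to 2033 (250 GW).
print(f"2025-2033: x{annual_growth_factor(2.0, 250.0, 8):.2f} per year")
```

Under these assumptions, the long-term goal needs capacity to keep nearly doubling every year, after a single year in which it grew several-fold.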

Alex Volkov 1:06:19

I think those are separate things, but we'll go and

1:06:22

take a look, but it feels to me that those are separate things because they were announced separately. They didn't mention Abilene, Texas at all in the Nvidia stuff, as far as I'm seeing, but maybe I'm confused. Alrighty. I think we have some breaking news, and this is the third one this show. Well, all right. Let's go. AI breaking news coming at you, only on ThursdAI.

1:06:50

all righty. LDJ, this is, this one is yours. You announced it. Go ahead.

LDJ 1:06:54

Well, to be fair, Yam was actually the first one to put it in.

1:06:57

All right, yam, and then if you wanna go, ya

Yam Peleg 1:07:00

No, no, it's yours.

1:07:01

All yours.

LDJ 1:07:02

Okay, sure.

1:07:02

So, now in preview: ChatGPT Pulse. This is a new experience where ChatGPT can proactively deliver personalized daily updates from your chats, feedback, and connected apps like your calendar.

Alex Volkov 1:07:13

as far as I understand,

LDJ 1:07:15

oh, Alex, it says Alex, they

Alex Volkov 1:07:17

spelled Alex wrong in the video, but we're looking at the

1:07:19

video where ChatGPT will give you a fresh update on the things that it knows you like, stuff like reading your emails, et cetera, having a to-do list. How does it do proactive stuff? Does it read my... oh, okay. So we're watching this video all together, trying to figure this out. But here's one thing ChatGPT says: here's a few tips to support your ACL recovery. So it knows, 'cause you talked to it, that you had a torn ACL or whatever, and it would just send you some stuff to read, I guess. I think the proactive part is the most important thing here. So you can curate what happens in ChatGPT Pulse. LDJ, where do you want to scroll me towards?

LDJ 1:07:59

Yeah, we're all seeing this at the same time.

1:08:01

Really?

Alex Volkov 1:08:01

Oh, but it says try now.

1:08:03

So, we can probably try now let's see if we can pull this up

LDJ 1:08:05

While you're trying that, I'm gonna see if it's exclusive to Plus or Pro users.

Alex Volkov 1:08:09

so

Yam Peleg 1:08:09

basically from what I understand, it can react.

1:08:13

I mean, Chat GPT can be proactive already. You can take, you can ask it to remind you later, remind you in noon and remind you every week, remind you every day some, something like that. That already happened. And by the way, I use this a lot. I don't know how you guys use the app, but I use Chat GPT app in a million ways. It was never intended to because it's just so good. I think if I see this for the first time, just like everybody else, but from what I see, the thing is now it can react to things that are changing in the data source changing. Like, for example, it can monitor your email or it can monitor your calendar and just validate that this is what we're talking about. But it's completely different than just by time. I mean, it's proactively going to watch, what your stuff for you, the stuff that you care about, and give you an update. let's move on.

LDJ 1:09:02

Yeah, I am verifying it now, actually, it says an explicit

1:09:05

example in the blog post. It says it can connect to Google Calendar and it can remind you to buy a birthday gift for a date that's coming up. yeah, this is really cool.

Alex Volkov 1:09:16

It looks like it's gonna only launch for pro users on mobile.

1:09:19

So when I click this on ChatGPT desktop, I don't have it, but now I do have it on mobile.

LDJ 1:09:24

says plus users as well.

1:09:25

It says, rolling out to Plus. Mm-hmm.

Alex Volkov 1:09:27

Ah, okay, alrightyy.

1:09:28

All righty. No idea if I can show you this, because I'm on the phone. But I do have it in ChatGPT Plus. Let me see how it looks. Maybe I'll take a screenshot.

LDJ 1:09:38

browser or, on your desktop

Alex Volkov 1:09:40

on the mobile app.

1:09:41

On the mobile app. The blog post says ChatGPT Pulse is now in preview for Pro users; general rollout will follow. And then we go into Pulse, and Pulse says: Hey Alex, I'm here to surface what's helpful to you, once a day, every day. You decide what shows up. If you tell me what to focus on, I'll curate it for tomorrow. I think there's quite a few startups that are trying this, so OpenAI definitely went into some startups' territory. Perplexity does something like this, but obviously not based on your chats. Very interesting, the type of stuff it chose for me. And it says: get insights from your emails and calendar; let ChatGPT proactively read your email and Google Calendar to give you helpful insights. And you can say allow or don't allow. And then they have basically a news roundup for you. They show me Minecraft mods that build reading skills, 'cause I once asked it for Minecraft stuff for kids, for example. Very interesting. And then you can kind of push it towards what you want.

Yam Peleg 1:10:33

we are talking on the mobile only,

Alex Volkov 1:10:35

Mobile as far as I see.

1:10:36

Like I, I only have this on the mobile.

Yam Peleg 1:10:37

Mm-hmm.

Alex Volkov 1:10:39

And, oh, it has news for me from AI, like Meta's

1:10:41

new 32B Code World Model here. And infrastructure: Nvidia's hundred-billion gigawatt plan. So basically don't use this, 'cause then you don't need ThursdAI. No, I'm just kidding. I'll get my news and I'll let you know about it, but we already know. After all of these things, it also asks you to allow notifications. And it will also ask you to get it to focus on the stuff that you actually want. So you have like a quiz after all; you have like five things in the quiz at the end. It's pretty cool. It's based on the chats that you've been having. I'm not sure if I want it to learn more about my chats, but I think for some people who are using this for study, for example, that's great. Let's move on, because we're almost at the end of the show. We have 15 minutes left and so much more to talk about, more visual stuff as well. So we covered the insane infrastructure investments in OpenAI. And I remember Sam Altman tweeting something like: they have a lot of experiences that are GPU-heavy that are still coming out, and they'll start with Pro people just because they don't have the capacity. Also, when GPT-5 Codex came out last week, they lowered capacity because there was so much demand that they had to scramble for more GPUs. I think they restored it now. So OpenAI is definitely limited. On the hundred-billion investment, Sam Altman was legit, with a straight face on stage, answering this question with: hey, in a few years, when AI can cure cancer or teach every kid in the world how to read or whatever, do we wanna choose between the two very important things? We talked about curing cancer with Dr. Derya here on the show. We're expecting incredible things from AI, not only in code. So far it looks like we've seen this in code and maybe some personal things.
And at this time we look at the use cases of ChatGPT from that research, from NBER and ChatGPT as well. We see what people are using ChatGPT for, but we are expecting so much more. So all these gigawatts, hundreds of gigawatts, this is expenditure that needs to turn into something like cancer curing, basically. I think Jensen is the winner all around here. No matter who raises which money for which AI, besides the Chinese labs maybe, Nvidia wins.

Nisten Tahiraj 1:13:00

Yeah, really quickly.

1:13:02

Like think about how much wattage do you need to actually run something to do work for you. it doesn't matter if it's medical work or like solving a problem on your website and stuff. You're usually gonna need eight GPUs and they're doing like 700 watts each. So you need like five kilowatts per person and there's a lot of people. So you're gonna need that.
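
A minimal sketch of that per-person power estimate, using the figures from the conversation (eight GPUs at roughly 700 watts each). The helper names are hypothetical, and the 10 GW capacity in the second step is just the deal size discussed earlier, used here as an illustrative input.

```python
GPUS_PER_PERSON = 8   # figure quoted in the conversation
WATTS_PER_GPU = 700   # figure quoted in the conversation


def kw_per_person(gpus: int = GPUS_PER_PERSON, watts: int = WATTS_PER_GPU) -> float:
    """Power draw in kilowatts for one person's always-on GPU node."""
    return gpus * watts / 1_000


def people_served(capacity_gw: float, per_person_kw: float) -> int:
    """How many such nodes a given capacity could run, ignoring overhead."""
    return int(capacity_gw * 1_000_000 / per_person_kw)  # GW -> kW


per_person = kw_per_person()
print(f"{per_person} kW per person")
print(f"10 GW covers about {people_served(10, per_person):,} people")
```

Eight 700-watt GPUs come to 5.6 kW, which the conversation rounds to "like five kilowatts." At that rate even 10 GW covers fewer than two million people, which is the point about needing far more capacity.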

LDJ 1:13:27

Yep.

1:13:27

Yeah, I mean, to add onto that: let's say it's even just one H100 running to have the agent running in real time, and let's say that agent is even able to match the ability of a human in every single area. Then there's over 3 billion working adults in the world. And, you know, even a company like OpenAI has like less than 10 million H100s right now. So, you know, we'd have to get to the billions of GPUs to actually start replacing most labor. Yep.
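
The gap described here can be sketched as a simple ratio. The inputs (about 3 billion working adults, one H100 per agent, under 10 million H100s available) are the speaker's round numbers rather than verified counts, and the function name is hypothetical.

```python
def gpu_shortfall_factor(workers: int, gpus_per_worker: int, gpus_available: int) -> float:
    """How many times larger the needed GPU fleet is than the available one."""
    return workers * gpus_per_worker / gpus_available


# Round numbers quoted in the conversation, not measured figures.
factor = gpu_shortfall_factor(
    workers=3_000_000_000,
    gpus_per_worker=1,
    gpus_available=10_000_000,
)
print(f"Needed fleet is about {factor:.0f}x the available one")
```

Even with the generous one-GPU-per-worker assumption, the fleet would need to grow by a factor in the hundreds, which matches the "billions of GPUs" conclusion.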

Alex Volkov 1:13:57

so we're going towards it not super slowly, it seems.

1:14:00

I just wonder where the hell all this power is gonna come from. Where the hell, in our aging infrastructure in the US, is all this power coming from? I guess we'll see. Alright, moving forward, folks. xAI launches Grok 4 Fast, in addition to getting even more of a funding round. Grok 4 Fast looks to be at the top left of the intelligence-versus-price chart. This is a chart from the Artificial Analysis Intelligence Index. Grok 4 Fast gets 75 points on there. No, 75 is up here, so like around 60-ish points. They say it's 47x cheaper than Grok 4. They have a non-reasoning version of this as well. They say near-flagship performance, and speed and cost, I think, are the two things. And also the highlight here: a 2-million-token context window. 2 million tokens of context window. Great for agentic stuff. And they reduced reasoning tokens by 40%. Impressive stuff from Grok.

Yam Peleg 1:14:57

But I just want to mention, on LiveCodeBench, Grok 4 Fast is the

1:15:01

first, better than Grok 4 itself, with Grok 4 in second place. Needless to say, it's 1% of Opus 4.1 in terms of price, and benchmark-wise, at least on paper, it looks competitive. Yep. I haven't tried it myself, so I can't say, but I'm searching for a video I've seen that I wanna show you all, of how fast it actually is in real life. But yeah, that's pretty much crazy. I hadn't seen that coming.

Alex Volkov 1:15:32

We've seen this trajectory on other like models, with

1:15:35

the, um, GPT-5 Codex as well. They boasted the fact that this model thinks for less but gets to the right results as well. So it's really funny that we're lowering the thinking even though test-time compute scaling is supposedly: the more it thinks, the better the results are. Now everybody tries to lower the amount of thinking, 'cause they also know that when you use coding models, you want 'em to work fast. Alrighty. Qwen3-Max and the Alibaba roadmap. We have to cover this before the multimodal stuff, because Qwen3-Max was released. It's their flagship. It's a great model. Let's look at the benchmarks. They pitted themselves against the open-source Qwen3-235B, then Claude Opus 4, and then DeepSeek V3.1. And they get 81% on AIME 25 and 69% on LiveCodeBench v6. We just saw Grok 4 get like 80% on LiveCodeBench, right? Yeah, Grok 4 is 80%. So it's very interesting how selective Qwen is about their evals. They're putting themselves very nicely on top, but you have to take into context that other models that are not on this chart are getting incredible evals as well. Qwen3-Max is only available in their API, but it's very cheap. They have a thinking-heavy mode and strong context length, a million tokens of context length. We have a testimony from Ethan Mollick saying that so far Qwen3-Max seems impressive as a non-reasoning model. So it's a non-reasoning model. Comparing this to reasoning models like Grok is maybe not the best, 'cause reasoning models obviously get significantly more.

Yam Peleg 1:17:05

say, I asked it, go search online.

1:17:07

What is going on in AI today? Like, what are today's releases? And it just goes extremely fast. It just searches website after website after website. This is extremely good for something you wanna search like a search engine, because it's fast.

Nisten Tahiraj 1:17:23

I asked preview last week to just make like a space

1:17:27

debris simulator and it was like the nicest looking 3D spinning globe with stuff that I had seen on other ones. I mean, it took two prompts and the only one that made it nice was Opus 4.1. So yeah, it's, pretty up there.

Alex Volkov 1:17:43

Yeah.

1:17:44

In addition to this, we have a screenshot from the Alibaba conference. They released all of these things, and they're investing in scale like crazy. Some of this is in Chinese, but they are planning to go from 1 million to 10 million context length, and then a hundred million context length. Basically just throwing numbers around. Total parameter scaling: they're going to terabytes of parameters. Totally. They want test-time compute to scale from... Yes, I see sunglasses. Yam's sunglasses, I love them. Test time: 64K reasoning tokens to 1 million reasoning tokens. When they get to this thing: scale data, synthetic data. This is like, scale is all you need, basically. And they scale.

Yam Peleg 1:18:24

Can you zoom in?

1:18:25

Can you zoom in on this blue button? Doesn't matter what it says. Just can you zoom in exactly, exactly this button? Yeah. Scale is all you need. Wait, wait. When? Absolutely. Ly when Need a screenshot of

Alex Volkov 1:18:35

this LDJ.

1:18:36

Scaling is all you need. Yep. Let's take a

Yam Peleg 1:18:39

AGI is coming.

Alex Volkov 1:18:40

Yeah, let's take a screenshot here.

1:18:42

AGI is coming, and it looks like Alibaba, which is like the foundational lab of China, is very into this. All right, folks, we need to move forward. There's also other stuff from Alibaba, there's multimodal stuff, and we're almost at the end of our show and there's so much still to show you. Wait, another screenshot. LDJ got the glasses. Wait, hold on. Scaling is all you need. Zoom in, and then we're all with the glasses. Hold on. We have to take this photo. For folks who are listening to us on the Twitter space, I'm sorry, but there's a "scaling is all you need" slide, all of us are wearing glasses, and Yam's chugging Red Bull like a madman. Let's go.

Yam Peleg 1:19:16

scale is all you need,

Alex Volkov 1:19:17

Yes.

Yam Peleg 1:19:17

Scale is all you need, attention is all you need.

1:19:19

Everything is the computer.

Alex Volkov 1:19:21

Yes,

Yam Peleg 1:19:21

everything is computer.

Alex Volkov 1:19:23

is computer.

Yam Peleg 1:19:23

Everything took a screenshot of,

LDJ 1:19:26

I took a screenshot of the slide too so I could translate

1:19:29

the Chinese to English. Yeah. I'll let you know when I get the answer back from Gemini. All right.

Alex Volkov 1:19:33

but I do wanna move forward because I think,

1:19:35

a bunch of other stuff as well. We talked about Moondream. I wanna show you Wan Animate. I don't know if we're gonna be able to try Wan Animate, but why not? Lemme see if I can send anything to Glif. And also, note to Alex: sign into all the things that you're about to demo before you go live. So, Wan 2.2 Animate is this character-swapping model from Wan where you can basically take any character and then record yourself with the camera, et cetera. And then they will just basically swap it for you. Let's take a look at how it looks, because it looks crazy. This is a great example here. Let me see if I can turn this on. Here's a person, and it just swaps. He is the reference image. He moves around, et cetera. And then they have a reference: here's the reference motion and the reference image. And folks, this is deepfake on another level. This is just quite incredible. Obviously Runway had a model like this as well, but this is an open-source release. The fluid hair dynamics... look at this, the hair dynamics are crazy.

Yam Peleg 1:20:34

See, open source.

Alex Volkov 1:20:35

Yeah.

Yam Peleg 1:20:35

Whoa.

Alex Volkov 1:20:36

look at the cloth dynamics on this guy that's jumping.

1:20:39

It's really something else. My whole feed was going crazy for this. You can try this model on fal, Replicate, Glif, et cetera. I have another example of videos, I believe,

Nisten Tahiraj 1:20:50

here.

1:20:50

I'm warning you guys, you're gonna spend all day, and all your credits. You're gonna fire up an H200, and then it takes like three, four minutes to make one video, and then you're gonna keep changing it. It's fun, but just be careful with the time you spend on this.

Alex Volkov 1:21:07

So basically you need one picture and then the

1:21:09

reference movement, right? So you can act out a scene and, it brings to life that scene. So it's different than like a character. It's different than a character thing that you give it one image and tell what to say and like it animates it on its own. You are actually moving, right? So you can take videos of yourself and turn it into something. they basically just take the motion from the picture and then they create the character. That character will not do anything that you didn't tell it to do. It will follow your instructions. but with precise body motion control, it's quite incredible. This is cinematic stuff. this is what Hollywood uses.

Yam Peleg 1:21:42

What do we need this for?

1:21:43

I have a feeling I need this for something that there is a use case. Yeah, I can use it for something real, but I'm not sure exactly what, because it doesn't animate a video of myself. It is animating a video of something else based on what I do. So yeah. What do you guys think? Like what can I use this for?

Nisten Tahiraj 1:22:03

I'm using it.

1:22:04

Onboarding, just remove the background from the video and just put the onboarding on your, on your site or your app and just have it go around. Yeah. It it's really nice.

Yam Peleg 1:22:14

Oh yeah.

1:22:15

You can actually use the picture of yourself with a video of yourself That's a really good use case. Absolutely.

Alex Volkov 1:22:21

If you want, like a, if you wanna act out a robot, for example,

1:22:25

and not have the model come up with the robot movements, you actually want the specific thing that you want, you want the robot like to pick up something, for example, so you can act it out yourself and then replace the robot on top of you, and then basically like precise control for, for video.

Yam Peleg 1:22:38

Oh, absolutely.

1:22:38

I'm just saying that in real life, most people don't need to act out robots, but many people absolutely need a video of themselves, like a perfect video of themselves, doing something absolutely, you can use this for, I didn't even think about this, but yeah. Extremely useful.

Alex Volkov 1:22:54

so in the spirit of showing stuff and explaining them

1:22:57

to the folks who are just listening to us, this is Wan 2.2 Animate, effectively called Wan Animate. It's on Hugging Face. I believe it's open source. I think it's open source. Let's see. I'm trying to find the actual weights for this. It's on GitHub as well. Cinematic-level aesthetics, and then efficient high-definition hybrid text- and image-to-video. And then, moving forward, we have two more updates from the video perspective. Kling 2.5 is the new one. Kling released a version of Kling, the video model, that looks cinematic as hell. They didn't publicize the announcement that much. They gave Kling access to multiple folks and only reshared the best creations, like what we're seeing right now. "Character consistency is the holy grail"... with frames you can usually get there fairly... Okay. So we are seeing this incredible movie of a samurai riding through explosions and meteor showers, et cetera, and you can see cinematic motion. I think the highlight there is that this one supports audio a little bit as well. And we are definitely looking at some very, very impressive things here. Kling is available on their website and everywhere else. They have very strong prompt adherence and also physics. Most video models cannot do acrobatics, for example, or Olympic stuff. And we're seeing a girl doing a breakdance, and besides a few things where her body flips, I think generally it looks very impressive. So Kling is a very, very big deal for video creators. And another big deal, also from Wan: Wan released the Wan 2.5 preview. So we talked about Wan Animate, which is a character-replacing model. This one is a video model. This is a preview, and they talk about architectural features: native multimodality, a new unified framework for both understanding and generation, and flexibility, supporting the input and output of text, image, video, and audio.
So this is also another multimodal model. Unlike Qwen Omni, where you talk to it, this one understands what you want, and you can provide videos, for example. This one looks incredible. I tested this one. I think I posted about this. This is one of the top results that I've had from a video model on this go-to task that I have: I take a picture of myself and then I ask it to animate me interviewing a white polar bear that is dressed like me.

Yam Peleg 1:25:29

What is the best video model for general

1:25:33

use cases today, if I want to just generate a video about something? What do you guys think?

Alex Volkov 1:25:40

I think Wan is up there.

1:25:41

This is not open source. Kling is up there, Veo is still up there. But I think for some of them they showed an example of people's preference. Let me see if I have this preference. Yeah. So this is Kling's blog post, and basically, to answer your question, we're seeing a win-loss ratio of people's preference, right? They say Kling 2.5 Turbo versus Veo 3: Kling 2.5 Turbo gets a 70% win rate against Veo. That's a lot. And then a 50% win rate against Seedance. And then, what's the difference here? Oh, Seedance, meaning... and then, like, 50% against Seedance, and with ties. So, very impressive. I think Kling is up there. Kling is definitely up there for the regular person. But Wan is new, and I think the highlight of Wan is that it generates audio. I don't know if you heard me saying "welcome to ThursdAI." Yeah. But that wasn't me. Wan is able to take just text and output a script with lip sync. So you are able to create characters that talk, up to, I think, 10 seconds. But it's also able to take audio that you prerecorded, like us from the chat here, and use that for lip sync. So this model is like an omni model. I think this is the big thing here. It's a full omni model that does understand video as well. So video inputs and image inputs. This is the new thing. And it also does 1080p, ten-second video generation. This is not an open-source model, but I think Wan is the most impressive release capability-wise in video.

Yam Peleg 1:27:08

It's also impressive that all of this is coming from Alibaba.

Alex Volkov 1:27:11

Yeah, yeah, yeah.

1:27:12

But Alibaba, on top of anything else, Alibaba is releasing incredible stuff. Folks, we're over time. I do wanna play a Suno song. I definitely do wanna play a Suno song. Lemme see if I'm able to do this with the audio. I should be able to with audio. Suno V5. Lemme see if I can log in there real quick. I'll just play the YouTube video because we all must be able to hear this. Hopefully you guys on the stream hear this as well. This is the announcement, the video announcement. Let's hear a few examples. Blah, blah, blah, blah. They talk about the improvements. I wanna see examples. I don't know, remastering, new features. Okay, what does Suno V5 sound like?

SUNO 1:27:56

The song was made with Suno version five using samples we recorded on the

1:28:00

instruments you see in this video.

Alex Volkov 1:28:09

Ah, very interesting.

1:28:11

So these folks are sitting in a studio recording samples and then just like mixes them together kind of, and they have like guitars there. Somebody's knocking on the thing.

1:28:32

I said in the beginning, I'm getting to the limit of my ability to tell you that Suno 5 is better than Suno 4. But I did listen to it, and I think we had some other comments as well. I'm getting to my limit of understanding the differences there. It's quite crazy, the level Suno V5 came to.

Yam Peleg 1:28:47

If I go to Suno, yes.

1:28:48

If I just go and listen to random songs on Suno, which I'm doing quite a lot. Yes, it's out. Cool.

Alex Volkov 1:28:53

the, this, this guy is showing, where, where

1:28:57

can I show you the Suno thing? Human-like vocals and clear audio. The capabilities include remastering and covering songs you upload, and also songs you created with Suno 4 you can remaster with Suno 5. User and community reactions are very strong, at least on Reddit. Some songs that I created, like Afrobeats, I could not distinguish from actual produced songs. But again, I'm not a master. I will try and reimagine some of the ThursdAI opening credits with Suno 5, and then we'll see where we end up. I think we covered mostly everything in voice and audio. There's this Xiaomi MiMo Audio, with a hundred million plus hours of pre-training audio. I wonder where they get all this audio from. It is kind of scary, but this is state-of-the-art few-shot across diverse audio tasks like reasoning, dialogue, and TTS, in one general model. So very, very impressive. It's on Hugging Face. There is a demo, and you can talk to this demo. Lemme see if I can show you guys the demo so that we can actually look at it together super quick. I think if we go here, go like this, you will be able to actually see it here. Yeah. Okay, let's access the microphone. Hello, hello. Testing, testing, one, two, three. What can you tell us?

Nisten Tahiraj 1:30:12

You can tell us.

Alex Volkov 1:30:12

What can you tell us about yourself?

MiMo Audio 1:30:17

Hi.

1:30:17

Testing complete. I'm MiMo Audio, your friendly voice AI created by Xiaomi. I can hear what you are saying and chat back with my voice. Think of me as your tech-savvy buddy who's always up for a conversation. I'm here to help out, share some laughs, and keep things light and fun.

Alex Volkov 1:30:34

That was pretty cool.

1:30:34

That was really quick. I paused there for right now. It can match your language, but choose prompts and voices that match your language; default settings are optimized for English. So you have voices here that you can choose, and then preset prompts as well.

Yam Peleg 1:30:47

maybe tell it to think or something hard,

1:30:49

which is not just interesting.

Alex Volkov 1:30:50

Yeah.

Yam Peleg 1:30:50

Mm-hmm.

Alex Volkov 1:30:53

Hey, could you help us?

1:30:55

Can you sing Happy Birthday to me? My name is Alex.

MiMo Audio 1:31:01

Wait, hold up.

1:31:01

Before I start singing, I gotta know, is this a real birthday or you just testing me? 'cause I don't wanna accidentally reveal someone's age, you know? But hey Alex, since you asked so nicely, here comes your birthday song. Ahem. Happy birthday to you. Happy birthday to you. No, you're

Alex Volkov 1:31:14

I want you to actually sing it out.

1:31:16

Can you sing?

MiMo Audio 1:31:19

Oh, you want the computer?

1:31:20

Okay, here we go. Ahem, happy birthday to you. Happy.

Alex Volkov 1:31:25

I'm so far not super impressed by this omni model at all.

1:31:28

because it doesn't sing and it says "ahem," but okay. It's a 7B. Yes.

LDJ 1:31:32

Well, actually, yeah.

Nisten Tahiraj 1:31:33

And that doesn't come close.

Alex Volkov 1:31:34

But it is trained, and also I noticed, like, it

1:31:36

changes its voice a little bit. Well, this was MiMo Audio as well, and I think on this, folks, it's time to finish. Before we finish, I wanted to tell all of you something that I noticed on my timeline as well: this week was one of the more important Jewish holidays, called Rosh Hashanah. So I wanted to wish a happy New Year to all of you. It doesn't only include the Jewish people. To present this, I decided to use some AI art for the congratulations. And I noticed I already did this in September 2022, and this was the state of the art. I'm showing a very ugly picture of two minions with their mouths full of,

Nisten Tahiraj 1:32:14

Pomegranate.

Alex Volkov 1:32:14

Pomegranate seeds.

1:32:15

I got stuck on this. And this was three years ago. This was the state of the art. I think I generated this with DreamStudio from Stable Diffusion, I believe, three years ago exactly. And so I was like, okay, I need to do this a little bit better. So here's the difference in three years from the state of the art of imagery back then on proprietary stuff. Just incredible progress that we're living through. And sometimes it's great to pause and see and compare. So with that, I want to tell you: dense week, my head is exploding right now, but incredible updates, including some new stuff from OpenAI. Just absolute madness, week after week. It looks like we're going into October, which will be our next show. September was dense, folks. Dense. Like, I'm barely able to keep up. So I'm very happy that we are here to be able to do this together. Thank you, LDJ, Yam Peleg, Nisten, everybody who tunes in every time. Ryan was here before, and we had Vik and Jay on the show as guests. Thank you, everybody who tuned in. So, with that, if you missed any part of the show, the show is turned into a podcast and a newsletter. The newsletter is free. It's a weekly newsletter. Please sign up and tell your friends.

Nisten Tahiraj 1:33:18

We're still 4.9 stars on Apple Podcasts.

1:33:21

Yes.

Alex Volkov 1:33:21

If you are subscribed on Apple Podcasts, please go in there

1:33:23

and leave us a five-star review. It's gonna really, really help other folks discover the show and stay up to date together with you and us. With that, we're gonna sign out, and then we'll follow up with the newsletter, and we'll see you next week. Bye-bye.
