What happened in AI the week of January 23, 2025?

From Weights & Biases - the craziest week of AI: R1 beats o1 but with an MIT license, a $500B investment into AI with SoftBank, OpenAI Operator agents, a White House AI executive order, and more AI news. This episode covers Major AI Investments and Updates, ByteDance's UI-TARS and Other Open Source News, Open Source AI: DeepSeek R1, Introducing Operator: AI Agents in Action, and Humanity's Last Exam Benchmark.

Jan 23, 2025 - 🔥 DeepSeek R1 is HERE, OpenAI Operator Agent

Q: Major AI Investments and Updates: what should I know?

**Alex Volkov:** Like, as a small, tiny announcement, half a trillion dollars of investment upcoming in AI from OpenAI, Masayoshi Son from Vision Fund, and Larry Ellison from Oracle.

Q: ByteDance's UI-TARS and Other Open Source News: what should I know?

**Alex Volkov:** Also in the open source LLMs, kind of LLMs, ByteDance dropped UI-TARS. UI-TARS is ByteDance's computer use model, 7 billion parameters and 72 billion parameters, that they claim controls your Mac or PC. They have an app for both, and they beat GPT-4o.

Q: Open Source AI: DeepSeek R1: what should I know?

**Alex Volkov:** All right, folks. Open source AI has never been as hot as this week.

Q: Introducing Operator: AI Agents in Action: what should I know?

**Sam Altman:** AI agents are AI systems that can do work for you. You give them a task and they go off and do it.

Q: Humanity's Last Exam Benchmark: what should I know?

**Alex Volkov:** So, shout out. We're not gonna use the breaking news button because it's gonna happen before the show, but it's okay. It's called Humanity's Last Exam, and this is a very unsaturated benchmark. As you guys know, we talk about benchmarks all the time, MMLU, MATH, all those things, and they are always getting close to saturated. Like, MATH is at 98% saturated, I believe; at 99 percent, MMLU is saturated. And we talked about FrontierMath, which is an attempt to have very, very hard math problems.

Alex Volkov 0:29

Alrighty, welcome everyone to ThursdAI

0:33

for January 23rd. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases, and you are on ThursdAI. ThursdAI is the weekly show that keeps you up to date on everything that happened in the world of AI from week to week. And we're here to talk about maybe one of the biggest weeks, definitely one of the biggest weeks since the beginning of the year, but maybe one of the biggest weeks in open source since Llama was released. And so we're going to have quite a conversation today with a few folks who are going to join soon, about R1 specifically, from DeepSeek. We already covered R1 on the show when it was announced and was in preview, but since then we got it in open source. And so now R1 actually runs on my Mac, and I'll show it to you here, and it runs on everybody else's as well. It's going to be quite a show.

Also, a few other things happened. Let me just, like, very gently: a few other things happened. Like, as a small, tiny announcement, half a trillion dollars of investment upcoming in AI from OpenAI, Masayoshi Son from Vision Fund, and Larry Ellison from Oracle. So just a tiny, tiny investment in the world of AI that was announced on, like, the president's third day or something. And it's been huge across the news. And there's a bunch of other open source releases as well, so we're going to cover those. One of the things that I want to highlight is that there's also a new benchmark that just dropped, called Humanity's Last Exam, that I definitely would like to talk about. And, oh yeah, of course, ByteDance is all over the place. ByteDance dropped a computer-controlling agent with, like, 7 billion parameters that beats Anthropic's Claude Computer Use. So, all of that and more, I think. And I've got some very upsetting updates from LM Arena, and you guys know we've talked about LM Arena for a long time.
Just now, just a second before I jumped into the space, there was some very upsetting news about LM Arena, which we should also talk about. I don't know how verified it is. It's just a post on Reddit, but it kind of explains a few things. So it's definitely worth chatting about that as well.

All right. I think with this intro, what else is there to be said? Let's dive into R1. I think we'll do a TLDR, because there's a bunch of stuff, and then we'll dive in, and meanwhile I'll see that my co-hosts join as well. Definitely, let's do a TLDR. I think we're good on folks. I see a bunch of other folks in the audience. Yeah, let's do a TLDR of open source and then we'll dive in.

All right, folks, here's the TLDR: everything we're going to cover in the next two hours, including hopefully some breaking news from OpenAI. This week started very strong with DeepSeek, a.k.a. the Chinese whale bros, the quant trading firm from China that became the number one open source AI company in the world right now. They released R1, the reasoning model, under an MIT license. And we're going to cover all of this. This is probably still the biggest news. They didn't only release two models; they released a bunch of other quants and distilled versions on top of Qwen and Llama. And the vibes are insane. This is an o1-level model, very close to o1, if not beating o1 in multiple places, at home.

Also in the open source LLMs, kind of LLMs, ByteDance dropped UI-TARS. UI-TARS is ByteDance's computer use model, 7 billion parameters and 72 billion parameters, that they claim controls your Mac or PC. They have an app for both, and they beat GPT-4o and Claude while running, kind of, in open source. Running those models locally did not work for me. I haven't been able to run this.
So maybe some folks in the audience can tell us what their experience was, but the metrics are fairly ridiculous. And the examples they show are also ridiculous. And given that OpenAI is going to launch something, this is very interesting in comparison as well.

In other open source LLM news, breaking news just from today: there's a new benchmark, a new eval, that's not saturated at all. The top model on this eval is getting around 10%, which is, by the way, DeepSeek, which is great. It's called Humanity's Last Exam, HLE. So get used to saying HLE, Humanity's Last Exam. It's 3,000 questions that were written by nearly a thousand subject matter experts from around the world. This just launched from a bunch of folks, and we're definitely going to cover what type of questions there are. I looked at a few. There's no way I'm answering any of those myself; I need an AI to help me. But the best thing about this is that it's not saturated at all.

This is most of the stuff that I have in open source LLMs. I see LDJ just joined. LDJ, welcome.

LDJ 5:43

Hey.

Alex Volkov 5:44

So we covered the open source LLMs.

5:48

We're still on the TLDR, and we're moving to the next thing that we're going to talk about: big companies and APIs, an area of ThursdAI that we cover. And this one is a big one, because I think the main piece of news is this one: SoftBank's Masayoshi Son, Oracle, and OpenAI all launched together a thing called the Stargate Project. It's an insane investment commitment that talks about 500 billion dollars, a Manhattan Project for AI infrastructure of sorts, and they announced and signed it in front of the president. There's been some speculation and debate about whether or not they have the funds, but generally we should talk about what this means, because this is just absolutely huge. If you guys remember the 7 trillion that at some point Sam Altman talked about maybe raising, this is half a trillion dollars, basically, for creating, I don't know, a hundred thousand jobs. Definitely a huge announcement in the big companies area.

Another thing in big companies: Google Gemini Flash Thinking. If you guys remember, Gemini has its own reasoning model. They launched Gemini Flash Thinking, and they updated it for this year with significant updates, plus 1 million context, and also code use as well. So that's great. It's really, really good. I don't know if it's R1-level good, maybe, but there's the fact that it's 1 million context for that thinker, and it's also super fast because it's Gemini Flash. So they did update this. They jumped in a few evals as well. That was great.

At some point also this week, well, yeah, this week was kind of crazy, but at some point this week there was also the whole thing with Sam Altman saying the Twitter hype is out of control: we're not going to deploy AGI next month, nor have we built AGI. This after all of them talking about, hey, you know, we're moving our sights to ASI. That's what we talked about last week.
So it's really fun to see Sam Altman kind of throwing a wet towel on the hype on Twitter and then going and announcing half a trillion dollars of investments over the next four years. It's really funny over there. The small potatoes in the big companies as well: Perplexity announced a new search API thing.

So we're moving on to vision and video. Nvidia released Eagle 2, which is a series of VLMs. They were very efficient, and then they yanked the weights. But also Hugging Face released a small VLM today, the tiniest VLM, 256 million parameters, and that's million, not billion. They released it today, kind of breaking news. This tiny model, which runs in like one gigabyte of RAM or some crazy thing, beats IDEFICS from almost two years ago. If you guys remember, IDEFICS is Hugging Face's vision model. It was 80 billion parameters. Two years ago, it was 80 billion, and now it's 256 million. I found it incredible. Shout out to the Hugging Face folks for this launch.

Folks, in This Week's Buzz, which is the category where I talk about everything that happened at Weights & Biases this week: usually I cover our news, I cover the events we're doing. We did two workshops in Seattle, I did, and some folks from ThursdAI; we covered a bunch of stuff. This week, we have state-of-the-art updates from Weights & Biases. Wait, I have a thing for this. We have a state of the art. Let me say this slowly again, just to make sure that you folks understand what I'm talking about. From Weights & Biases this week, the This Week's Buzz corner is going to cover a state-of-the-art update that we did. Not me personally; I wasn't involved. I'm just the hype man. We broke the state of the art on SWE-bench Verified with an in-house
agent that writes code and uses o1 from OpenAI. And now, if you go to the SWE-bench website and you go to the Verified thing, the topmost result is Weights & Biases' W&B Programmer from Shawn Lewis, our CTO, sitting at 64.6%. 64.6 percent of SWE-bench Verified is now solved by an o1-based programmer agent from Weights & Biases. I found it incredible. I was very, very happy to see that in This Week's Buzz it's not only about our products; I can also update you about some crazy shit that's being built within Weights & Biases. So, very, very happy about this. We should cover this when we get to it.

Last but not least, we have AI Art and Diffusion. So, Hunyuan: if you guys remember Hunyuan Video, the open source video model that was released, we now have a 3D model from them as well, which is also super cool. You just write some text and it generates a 3D model, and it's state of the art. It looks incredible. Not only do they generate shapes, they also have another model that colors them. And so going from an idea or one image to a 3D model is now just so good. I've seen all these models before, from Stability; OpenAI had Shap-E a while ago. This one is looking very, very good. We're going to show and play around with this as well.

And I think maybe the last thing, besides the breaking news and everything, is going to be ByteDance. It's all over the place. By the way, ByteDance is releasing crazy shit. I don't know if ByteDance was about to say, hey, we're going to get TikTok banned, so we're just going to release some stuff. I have, like, two updates about ByteDance this week, three last week. It's crazy. ByteDance dropped Trae, which is a Cursor competitor, and it's free. So if you don't want to pay for Cursor, you can download Trae, and it will import your Cursor configs and it will run the, I don't know, they will pay for...
They will pay for the Claude tokens for you. And if you're okay with sending your code to a server in China, then you have a free Cursor competitor. It's pretty good. It looks nice. So, very unexpected.

Folks, I think that this is it. Besides the fact that, I need to add this here: OpenAI Operator in the Pro tier is about to launch. I think that this is a hundred percent true right now. I've seen screenshots already. There are screenshots of the pricing. So we're going to chat about this as well.

With that, I want to say hi to some folks who joined us to talk about R1 as we're sliding into open source. And we also have Pietro Schirano with us in the Twitter space. Welcome as well.

Nisten 12:30

Hey, what's up everybody?

Alex Volkov 12:32

And I think that all of you joined just in time, because

12:35

now we want to talk about R1. And in order to talk about R1, we need to skip to our open source corner. Let's start with open source,

12:57

open source AI. Let's get it started. All right, folks. Open source AI has never been as hot as this week. I think that this is very clear. Open source AI is the corner where we celebrate everything that happens in open source. Rarely do we get a model that beats frontier models that you can just run at home. Rarely; it's happened maybe twice so far. And we always talk about how open source probably lags behind some closed source models in this way or that way.

This week, DeepSeek, the whale bro firm from China, the quant firm, released R1. They released not only R1, they released multiple things with R1. They released two actual full models, which are huge, huge models that you cannot run, and they also released six distillations on top of Qwen. So they released DeepSeek R1 and DeepSeek R1-Zero. And they also released DeepSeek R1 on top of Qwen, and also Llama-based 8 billion parameter and 72 billion parameter DeepSeek R1 distills, right? So they basically trained a huge thinking model, and then they also released a bunch of distillations as well.

Not only that, they released it under an MIT license, which is just: do whatever. Take it, distill it, convert it, train on it, do whatever. They basically just said, here's this, take it.

We talked about R1 when it launched a couple of months ago. It launched only as a preview in their UI. It did not launch as an official product back then. There was no pricing, I believe; I think it was just a toggle in their UI. But now you can download those weights. You can use them in production. You can do whatever. The MIT license is probably, like, I don't know, Nisten, if you're with me, but I think MIT is more freeing than Apache. I think MIT is just literally: do whatever you want.

Nisten 14:49

MIT is like.

14:51

A jailbreak to the whole legal system, pretty much. That's what most people don't realize. It's like, this is not my problem. Your problem now.

Alex Volkov 15:02

Yeah.

Nisten 15:03

You can do whatever you want.