Episode Summary

Alex here, celebrating an absolutely crazy (to me) milestone: episode #100 of ThursdAI 👏 100 episodes in the year and a half since I started publishing - 100 episodes that documented INCREDIBLE AI progress. As we mention on the show today, we used to be excited by context windows jumping from 4K to 16K!

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne

Liad Yosef
Software Engineer · Shopify
@liadyosef

Michael Luo
PhD Student · UC Berkeley (Sky Computing Lab)
@michaelzluo

Ido Salomon
AI Lead / Co-creator · Monday.com (GitMCP)
@idosal1

Nisten Tahiraj
Weekly co-host of ThursdAI · AI operator & builder
@nisten

Wolfram Ravenwolf
Weekly co-host, AI model evaluator · Independent AI evaluator (r/LocalLLaMA)
@WolframRvnwlf

Yam Peleg
Weekly co-host of ThursdAI · AI builder & founder
@Yampeleg

LDJ
Weekly co-host of ThursdAI · Nous Research
@ldjconfirmed

By The Numbers

Meta dropped the long-awaited Llama 4 models (what are you doing, Zuck?), huge ones this time:

  • Llama 4 Scout: 17B active parameters out of ~109B total (16 experts).
  • Llama 4 Maverick: 17B active parameters out of a whopping ~400B total (128 experts).
  • Unreleased: Behemoth - 288B active parameters with 2 trillion total. Chonker!
  • Both base and instruct finetuned models were released. These new models are all multimodal, multilingual MoE (mixture of experts) architecture, trained in FP8 for significantly more tokens (around 30 trillion!), with interleaved attention (iRoPE) and a refined SFT > RL > DPO post-training pipeline.
  • The biggest highlight is the stated context windows: 10M for Scout and 1M for Maverick, which is insane (and honestly, I haven't yet seen a provider that is even remotely able to support anything of this length, nor do I have the tokens to verify it).

🔓 Open Source AI & LLMs: Llama 4 Takes Center Stage (Amidst Some Drama)

This was by far the biggest news of this last week, and it dropped... on a Saturday? (I was on the mountain ⛷️!) Meta dropped the long-awaited Llama 4 models, huge ones this time: Llama 4 Scout, with 17B active parameters out of ~109B total (16 experts), and Llama 4 Maverick, with 17B active parameters out of a whopping ~400B total (128 experts).

📰 The messy release - Big Oof from Big Zuck

Not only did Meta release on a Saturday, messing up people's weekends, they apparently announced a high LMArena score, but the model they provided to LMArena was... not the model they released!? It caused LMArena to release the 2,000-chat dataset, and truly, some examples are quite damning and show just how unreliable LMArena can be as a vibe eval. We've chatted on the show about how this may be due to some vLLM issues, and speculated about other potential reasons for it.

📰 Too big for its own good (and us?)

One of the main criticisms the OSS community had about these releases is that, for many of us, the reason for celebrating open-source AI is the ability to run models without a network, privately, on our own devices. Llama 3 was released in 8B and 70B versions, and that was incredible for us local AI enthusiasts! Why didn't Meta release those sizes this time? Was it due to an inability to beat Qwen/DeepSeek by enough?
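To make the "too big to run locally" complaint concrete, here is a rough, illustrative back-of-envelope estimate (my own arithmetic, not Meta's numbers) of how much memory just the weights need at different precisions - with MoE you still have to hold all the total parameters, even though only 17B are active per token:

```python
# Rough, illustrative weight-size estimate for MoE checkpoints
# (weights only: no KV cache, no activations, no runtime overhead).
def approx_weights_gb(total_params_billions: float, bits_per_param: int) -> float:
    """Approximate size in GB of the raw weights at a given precision."""
    bytes_per_param = bits_per_param / 8
    return total_params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

models = {
    "Llama 4 Scout (~109B total, 17B active)": 109,
    "Llama 4 Maverick (~400B total, 17B active)": 400,
}

for name, total_b in models.items():
    sizes = ", ".join(
        f"~{approx_weights_gb(total_b, bits):.0f} GB @ {bits}-bit" for bits in (16, 8, 4)
    )
    print(f"{name}: {sizes}")
```

Even at 4-bit, Scout's weights alone come out around ~55 GB - feasible on a 64-96 GB Mac, as discussed later in the show - while Maverick at roughly 200 GB is out of reach for consumer hardware.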

📰 My Take

Despite the absolutely chaotic rollout, this is still a monumental effort from Meta. They spent _millions_ on compute and salaries to give this to the community. Yes, no papers yet, the LM Arena thing was weird, and the inference wasn't ready.


🤖 Together AI & Agentica (Berkeley) finetuned DeepCoder-14B with reasoning (X, Blog)

Amidst the Llama noise, we got another stellar open-source release! We were thrilled to have Michael Luo from Agentica/UC Berkeley join us to talk about DeepCoder-14B-Preview, which beats DeepSeek R1 and even o3-mini on several coding benchmarks. The stated purpose of the project is to democratize RL, and they have open-sourced the model (HF), the dataset (HF), the Weights & Biases logs, and even the eval logs.
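As Michael explains on the show, the RL signal here is essentially binary: the generated solution either passes all of its coding tests or it doesn't. A minimal sketch of that kind of pass/fail reward (not the actual DeepCoder training code; the solution/test plumbing here is a simplified assumption) could look like this:

```python
# Minimal sketch of a binary pass/fail reward for RL on coding tasks:
# run the candidate solution together with its unit tests in a subprocess
# and reward 1.0 only if every test passes (exit code 0).
import subprocess
import sys
import tempfile


def coding_reward(solution_code: str, test_code: str, timeout_s: int = 10) -> float:
    """Return 1.0 if the candidate solution passes all tests, else 0.0."""
    program = solution_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

In a GRPO-style setup, a reward like this is computed for a group of sampled completions per problem and used to advantage-weight the policy update.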

📰 NVIDIA Nemotron Ultra is finally here, a 253B pruned Llama 3-405B (HF)

While Llama 4 was wrapped in mystery, NVIDIA dropped their pruned and distilled finetune of the previous Llama chonker, the 405B model, coming in at just about half the parameters. And they were able to include the Llama 4 benchmarks in their release, showing that the older Llama, finetuned, can absolutely beat the new ones at AIME, GPQA and more. Nemotron Ultra supports 128K context and fits on a single 8xH100 node for inference.

🎥 Vision & Video: Kimi Drops Tiny But Mighty VLMs

Kimi released Kimi-VL and Kimi-VL-Thinking, MoE vision-language models with only ~3B active parameters that punch well above their weight.

The most impressive long-form AI video paper also dropped, showing that it's possible to create a 1-minute-long video with incredible character and scene consistency.

TK: Video comparison of Tom & Jerry scene

This paper adds Test-Time Training (TTT) layers to a pre-trained transformer, allowing it to one-shot generate these incredibly consistent long scenes.
TL;DR and Show Notes

  • Hosts and Guests

  • Open Source LLMs

    • Meta drops Llama 4 (Scout 109B/17BA & Maverick 400B/17BA) - (Blog, HF, Try It)

    • Together AI and Agentica (UC Berkeley) announce DeepCoder-14B (X, Blog)

    • NVIDIA Nemotron Ultra is here! 253B pruned Llama 3-405B (X, HF)

    • Jina Reranker M0 - SOTA multimodal reranker model (Blog, HF)

    • DeepCogito - SOTA models 3-70B - beating DeepSeek 70B - (Blog, HF)

    • ByteDance new release - Seed-Thinking-v1.5

  • Big CO LLMs + APIs

    • Google announces TONS of new things 🙌 (Blog)

    • Google launches Firebase Studio (website)

    • Google is announcing official support for MCP (X)

    • Google announces A2A protocol - agent-to-agent communication (Blog, Spec, W&B Blog)

    • Cloudflare - new Agents SDK (Website)

    • Anthropic MAX - $200/mo with more quota

    • Grok 3 finally launches API tier (API)

    • OpenAI adds enhanced memory to ChatGPT - can remember all your chats (X)

  • This week's Buzz - MCP and A2A

    • W&B launches the observable.tools initiative & invites people to comment on the MCP RFC

    • W&B is the launch partner for Google's A2A (Blog)

  • Vision & Video

    • Kimi-VL and Kimi-VL-Thinking - A3B vision models (X, HF)

    • One-Minute Video Generation with Test-Time Training (Blog, Paper)

  • Voice & Audio

    • Amazon - Nova Sonic - speech2speech foundational model (Blog)

  • AI Art & Diffusion & 3D

    • HiDream-I1-Dev - 17B, MIT license, new leading open-weights image gen that surpasses Flux 1.1 [pro]! (HF)

  • Tools

    • GitMCP - turn any github repo into an MCP server (try it)

Alex Volkov
Alex Volkov 0:16
Welcome everyone to a celebratory ThursdAI
0:21
Today is April 10th, 2025. Can you believe it? It's two years and something after we started recording these on a Thursday when GPT-4 was released, and today is our hundredth episode. I'm so excited. Let's go. I, for once, am gonna use these reactions things from Apple. I'm gonna do some confetti. Let's see if this works. Yeah, let's go. Thank you everyone for being here for so long. I'm very, very excited to open this episode. Joining me, Wolfram. We're gonna have a few more guests down the show. Wolfram, welcome. How are you, man? Looking good.
Wolfram Ravenwolf
Wolfram Ravenwolf 0:59
man.
1:00
Oh, I'm so excited. So much interesting stuff to talk about, especially, you will mention it, but I have some favorites.
Alex Volkov
Alex Volkov 1:07
Yeah, we'll definitely discuss some very
1:09
interesting research that you did. I also see, let me just introduce myself first for folks who are here, new folks, although I see a lot of familiar faces in the audience. My name is Alex Volkov. I'm an AI evangelist with Weights & Biases, and I've been doing this for a while. With me, Wolfram, and I see your title, also AI Evangelist, but also AI Evaluator. How do you pronounce this? I love this.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:31
Yeah, AI evaluator, or doing the evaluations has been my main
1:35
thing, and now I joined a company where I can do it even more and even bigger. So that is a big thing for me, and it fits the Llama 4 stuff I did as well.
Alex Volkov
Alex Volkov 1:44
And also what is the company name?
1:45
Give them a shoutout.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:47
It's amine from germany.com and yeah, I just started
1:52
this month and I'm really excited here. And AI evaluations, that is our thing and it's been my thing. So it's really aligned and the valuation. In the sense of giving AI value, that is also very important as an evangelist. So we want to show the good things AI can do, and I'm more excited whatever, to be doing that,
Alex Volkov
Alex Volkov 2:13
a hundred percent.
2:14
And so I work with advisor, you work at LMI, and, folks maybe don't know that. we've talked about your evaluations on lama sub Reddit for a while before you joined. And since then you've been basically a co-host and a great friend. So really thank you for joining on this journey. I'm really, really happy that you're here, because for weeks like this week, your expertise in evaluation is absolutely crucial for us to give the audience the complete picture for folks who are just joining. you're joining on ThursdAI, we're a weekly AI podcast, live show, newsletter, community. There's so much happening in ThursdAI, for the past two years that've been doing this, that I'm very, very excited to celebrate the a hundredth episode. this week, a hundred episode is a lot of episodes and I believe we've missed maybe three Thursdays for the past two years. so with that self congratulations, I think it's time to actually talk to folks about what happened this week. we're gonna have a full show. I've been chatting with a person that's very dear to me, about the show and I always, the day before the show, I was like, there's so much to talk about. I don't know how we're gonna make it. and she was like, well, you always say this and you always somehow make it. So folks, let's focus on how to actually deliver all this great news. And, I will just say we're gonna have folks joining us in a second, guests that worked on some cool models. As one of the coolest things we get to do on the show is talk with the authors of the cool things we actually talk about. And so that's gonna happen here as well. please stay tuned. We have a couple of guests and then also a couple of guests later down the show as well. So in the beginning and later down the show as well, as we start with the TLDR. Everything that I'm gonna talk about is going to get in the newsletter. You can get this in the newsletter and you can subscribe to it on our substack Thursday I News. And, the reason for the TLDR is that if any part grabs you, you'll know pretty much approximately where it is going to be in the show. So with this, we'll welcome Niton as well, LDJ Niton, Wolf from Reverend Wolf. Let's do another celebration for two for a hundredth episode as go. And we're off. I love the sound effects. and then we're off to talk about our TLDR folks. it.
4:40
All right. This is the TLDR. Here's everything that we're going to talk about on ThursdAI and, we'll start with open source of course. And the biggest news in open source this week, of course is meta drops. LAMA four. LAMA four, has been, we've been waiting for it for a while. Scout and Maverick, both models have dropped. Scout is, 109 billion parameters. Maverick is at 400 billion parameters. Both of them are MOE models, with significantly less active parameters. We're gonna cover that very messy release. And we'll do some evaluation sharing from Wolfram and some general vibe collections as you know us. That's how we get to know which models are great. we'll also have. Friends from, together AI and the Genca, announcing deep coder 14 B, which is now almost a state of the art, if not state of the art in multiple reasoning and non reasoning, places. So we will chat with Michael and, I will find the name of the other author, and the third author, our friend Alai, said he cannot make it. So this is like a boo for because we've been waiting for Alpi for a minute. but other folks who've been training this model will chat with us. I see Michael already, almost with us here. so we're gonna chat about, deep code 14 B, which came outta nowhere and it's like really, really, really good. Nvidia finally gave us nron. You folks remember we talked about Nvidia Nron? Their distilled and pruned versions of llama distillation, is, a way to cut down the model size and pruning is also another way to cut down the model size. And they're done both, and they're probably like the best at the world at this. And they've distilled a LAMA 4 0 5 B into 253 billion parameters, and it performs much better, which was quite incredible. So shout out to the folks at, Nvidia for this effort. And we're gonna cover Nvidia Nitron Ultra as well. We've been waiting for this. The other two smaller versions came out and they've been great. Gina, our friends of Gina, released a re Ranker M Zero, which is state-of-the-art multimodal reran. We just mentioned it briefly because if you are into, this rag type thing for multimodality, that's absolutely a thing for you to know about. And then, we also had Deep Cogito, which is a new company, And they released models all the way from 3 billion parameters, 70 billion parameters. they're beating deep seek, trained, wan I believe. And yeah, we're gonna talk about Deep Giro as well. there's also this one that I wanted to mention and they gonna get buried in the news as well. Kimmi released a vision language model and VL thinking model. both these models are only 3 billion parameters active also MOE, and they beat Quin and they beat like a bunch of other models, for a very significant portion of, of activated parameters. Very small, very great releases from Kimi. Great lab and always get lost, in the noise. that's in open source. folks we move to big companies. And API updates, which I will say until an hour ago, Google absolutely controlled this category. This week Google has their Google next. Conference, huge conference. Shout out to wait and biases team, if you're on Google next right now, go and and give them a high five and talk to them about Weave. but Google has this Google next. So they had a bunch of announcements queued up, like literally a ton of things. Gemini 2.5 flash, is coming, which is great, and soon gonna be available in Vertex. So far we only had 2.5 pro. So 2.5 flash is as new model is coming, VO two editing capabilities, you'll now be able to edit videos with VO two. 
One cool thing that they released is a competitor to all these vibe coding places, which is called Firebase Studio, which is basically, they used to have it called Project IDX, and now they rebranded it and they changed some stuff. And so Firebase Studio is absolutely something that you guys should check out, and we're gonna talk about this as well. The main cool thing that I liked is that they announced official support for MCP. Let's go Google joins as a huge company. Google joins the MCP effort, and Google is announcing official support for MCP coming to their SDKs. We already knew this based on Sundar saying, Hey to MCP or not to MCP. That is the question. CEOs of big companies don't know about the thing unless the company already is involved in it. but Guru is announcing official support from MCP joining Microsoft that announced last week, AWS that announced last week. All these great companies, basically no one left anymore besides maybe Elon with Rock, but, who cares? I'm just kidding. Google also announced, in addition to supporting MCP, something new called A two A Protocol, which is agent to agent communication Protocol, which weights and biases is a proud launch partner with this. So we'll definitely let you know about a two, A protocol because I think it's very, very important. and, yeah, very interesting. our friends at CloudFlare, They announced a new agents is decay, NPMI agents. And that agent is decay is very simple and they could use like a bunch of stuff. They're also all in into the MCP high train as well. small thing on Tropic announced on Tropic Max at 200 bucks a month, tier, basically. And that tier is giving you more quota, there's no new models, which is, I don't know why people will pay for this. And lastly, grok three finally launched their API tier. So now you can access Grok three, the API GR three, fast, and then also GR three Mini and GR three Mini Fast. Now, these big company updates were all that I had up until last night. And then this morning, Alman said something, basically high, high, high, high, high, high hype. So something today from opening, I folks, as always, we'll be on it. So I think that this is it on big company's APIs. Folks, comments Wolf from LDJ Niton? Anything that I missed from the big companies? Very important stuff.
LDJ
LDJ 10:07
They did say that o3 Pro actually will be coming
10:11
in the next couple of weeks. I don't know if we'll also see that today or not, but that's something exciting to look forward to.
Alex Volkov
Alex Volkov 10:17
Yeah, so open AI breaking news.
10:19
That's what we'll write down here, and then we'll get it. Folks. This week's buzz is going to be very, very full I'm very, very proud of. And I wanna tell you all about this. I've mentioned this briefly as a teaser. In the last show, we finally, launched Observable Tools. Observable Tools is an initiative to invite people to basically create the future of MCP Observability. I've published the RFC on the GitHub spec of MCP, the official MCPI would love, for everybody who listens here to go and upload that. observe all that tools while I do the TLDR, but also definitely we'll chat about what this means very, very soon. and then, we also announcement that we are the launch partner for Google's A to A. So what this means, we're officially in the launch and we're gonna be supporting the A to a protocol, for Google. So I'll definitely mention both these things, although I would love to chat about a to a separately, voice and audio super quick. We're almost there folks. We're almost at the end of our, TLDR, we have Amazon launching Nova. a speech to speech foundational model that's like state of the art. Amazon has been flexing lately. And then we have a high dream, which is a new diffusion model that's not happening every day. If you guys remember so far, flux two and GBT four were the main models, that kind of were leading the charts and now we have some new, very previously unknown, image diffusion model, 17 billion parameter called High Dream, So high Dream is like the state of the art image generation model We have this one thing that we can't wait to talk about our friends, IDO and Liad have built Git MCP, which is a way to turn any GitHub into an MCP server, anyone. And so we're gonna have them talk about, that effort with us as well here. I think that's mostly it, All right, folks. This is it for the TLDR, and I'm looking at folks, comments to tell me if this is everything that we've covered today. and do you know what would
Wolfram Ravenwolf
Wolfram Ravenwolf 12:09
be cool today?
Alex Volkov
Alex Volkov 12:10
the hundredth
Wolfram Ravenwolf
Wolfram Ravenwolf 12:11
episode, a thousand followers on YouTube.
12:13
Come on, people.
Alex Volkov
Alex Volkov 12:15
Uh, we're very close.
12:16
I think we're almost there. We're almost there. So we're gonna check out three, three people, folks, three people left to join us on a thousand people on YouTube. Thanks Wolf for the shoutout. we'll definitely get there by the end of this. by the end of the show. I wanna pin the video to you guys on the stream so you'd be able to see. But you guys know where to go. there we go. This is the video. I'm gonna pin this to the top of the space, and I think with the quick check that everybody's here, Niton and Yam, welcome folks. And I think it's time LDJ was here. I think it's time for us to get started with the actual open source stuff. How do you guys think? Anything that I missed? No, let's start with open source.
13:12
Open Source ai. Let's get it started. the main thing that we should talk about in open source is obviously LAMA four, but until we get there, I wanna chat about an additional release that was released this week, and I wanna chat with Michael There we go. Michael, welcome.
Michael Lou
Michael Lou 13:31
Nice to meet you.
Alex Volkov
Alex Volkov 13:32
Nice.
13:33
thank you for coming up. we are gonna wait for, a friend of yours, but if they don't join that we're gonna have interview you about this release that you guys just released. I wanna just, I, I mentioned this briefly in the TLDR, but, together AI together with agenta, which is you guys, right? announced deep code of 14 billion parameters. And I love having the folks who worked on the model that actually talk to us about the models. first of all, welcome to the show. Please introduce yourself and the effort. And then let's talk about some of the updates, numbers and, and what it took to get this model here.
Michael Lou
Michael Lou 14:02
Sure.
14:02
So I'm currently Michael, a fourth-year PhD at UC Berkeley. And this model, DeepCoder, is part of the Agentica project, and the Agentica project's main goal is basically to democratize reinforcement learning for LLMs. And our goal here is to build great training systems for LLMs as well as to discover training recipes that everyone can use, right? That's part of the democratization process, but that is pretty much the main goal.
Alex Volkov
Alex Volkov 14:25
Awesome.
14:25
So, uh, Agentica is a Berkeley effort. It is a
Michael Lou
Michael Lou 14:29
Berkeley effort, from the Berkeley Sky Computing Lab, as well as
14:31
Berkeley AI research, and we're a bunch of researchers from there.
Alex Volkov
Alex Volkov 14:35
Yeah.
14:35
Alrightyy. So let's talk about the thing that you released. I haven't seen releases from you guys before. At least it wasn't on my radar. and out of almost, I wouldn't say nowhere, but definitely as a surprise to me because we've seen some stuff from together. Ai, obviously together, has like great folks in there, but almost outta nowhere together with you guys. You guys released one of the top coding models. That is a fine tune. So let's talk about, fine tune of what and how you guys achieved it, and then we're gonna talk about a few more details.
Michael Lou
Michael Lou 15:03
Mm-hmm.
15:05
Cool. Cool. Yeah. So the fine-tuning process is using DeepSeek's GRPO, or a slightly improved version of GRPO. The main idea here is really different than regular fine-tuning, right? In regular fine-tuning, you have a bunch of, say, OpenAI chat-completion data, and then you just fine-tune with respect to that. But in this case, really, if you want a small model to become as good as o3-mini, you need to have reinforcement learning. You have to teach a model to know what is right or wrong, right? So essentially, what is right is I pass all my coding tests, and what is wrong is I fail my coding tests. And a simple signal such as this, and letting training go on for two and a half weeks, you're able to get a 14B model to relatively decent performance, right? So there's actually nothing new in the algorithm. There's nothing new in the systems. But really, the novel part here is just being patient and waiting for the model to train.
Alex Volkov
Alex Volkov 15:54
Yep.
15:54
Could you talk about the collaboration with Together ai please? How did that happen and what did you guys do?
Michael Lou
Michael Lou 15:58
Yeah, because we're researchers from academia,
16:00
we're compute bound, right? So actually together AI was actually, we're fortunate enough to be together AI who offered us a decent set of GPUs, 64 GPUs for us to, to train our models on, right? in return for GPUs, we got better outreach and also, a trainable model at 14 B because we could have never trained that without their compute.
Alex Volkov
Alex Volkov 16:18
Yeah, that makes sense.
16:19
So let's talk about some numbers: 60.6% on LiveCodeBench, which is an 8% improvement on the base model, which is quite impressive, because the base model is DeepSeek R1 Distill Qwen 14B. So I know it's a mouthful, but this is a Qwen distillation fine-tuned on DeepSeek R1 outputs - we've covered both these labs and they're great at what they do anyway. And you guys achieve an 8% improvement, all while being academics and with a few GPUs from Together, let's say not the army of GPUs that OpenAI and DeepSeek have. And, talk to me about the inference stuff. I would love to hear some more about this.
Michael Lou
Michael Lou 16:55
Sure.
16:55
Yeah. So basically, part of this RL process requires training for very long context. Essentially you have, let's say, a vLLM sampling engine; it will sample stuff up to 32K context, and basically we've trained our model up to 32K max. But the issue here is that existing models do not really generalize beyond their trained context, right? So for example, if I took DeepSeek R1, or even DeepSeek, and I train it on 32K context and generalize to 64K context, the performance remains the same, because it doesn't exhibit these properties. But because we train our algorithm with a different version of GRPO, and using filtering from DAPO, we're able to show these generalization properties, where you can get good out-of-domain performance gains, where your context can scale up to 64K without even training on 64K context. And I think that's the beautiful part of reinforcement learning. It teaches your model how to generalize better.
Alex Volkov
Alex Volkov 17:51
Absolutely.
17:51
one last thing I wanted to mention, as a question to you before, NISTA also has a question, talk to us about the dataset. and also I would shout out a hundred percent every time that somebody comes here and releases a link to its and biases. So this is great. Folks can go into the one B logs of this training, and just take a look at everything. Michael, this is you, right? One B ai slash m Lua. So this is like literally your project one B. So shout out, for releasing this. You guys also released the eval logs, which is absolutely great. let's talk about the dataset because you guys didn't just release the model. You also released multiple things, including the training logs, including the dataset as well. Let's talk about the data. What'd you get it? How'd you create it? And what's cool about this, and thank you so much for releasing this because like we always applaud open source releases end-to-end.
Michael Lou
Michael Lou 18:36
Yeah, great.
18:36
Great question. Yeah, so this kind of align with our go to, democratize rl. That's why we open source, everything regarding dataset. It was very difficult to procure the right dataset because, currently everyone does math for rl and there are a lot of easy, verifiable questions for math. And when we did it for math for deep scaler, basically we got a 1.5 B model to 43%. it was very easy to find a dataset, right? it took less than a day, but for this one, it took actually a couple weeks. And the reason behind this is most of the datasets on the open source are very, very dirty and unverifiable, and most of them are a bit too easy, right? And because of that, we had to do many iterations over the right, combination of data sets.
Alex Volkov
Alex Volkov 19:17
All righty, folks.
19:18
Michael, thank you so much for coming. Shout out to you and the team, at, Berkeley Agenta and together for joining this effort. incredible kudos for the very, very detailed blog posts, very detailed evals, metrics, logs and relation logs and dataset, everything. Incredible kudos. We're gonna always applaud this. Thank you so much for also coming up and talking to us. as I said, one of the coolest things we get to do here is to talk with the researchers that released the models, so folks can follow you and we'll add your links to the show notes as well. thank you so much for coming as we move on to LAMA four, Michael, any super quick shout outs to the team before we move on?
Michael Lou
Michael Lou 19:52
Yeah.
19:52
Shout out to Together AI. Shout out to my advisors.
Alex Volkov
Alex Volkov 19:54
Yeah.
Nisten
Nisten 19:55
Shout out to Alpe too.
19:56
'cause he also helped us a lot last year.
Alex Volkov
Alex Volkov 19:59
Yep.
20:00
Alpe is one of our good friends who's never on the show. He's like a Yeti. I've been trying to get Alpe on the show for two years. may he may never come. Michael, thank you so much. You consider now the friend of the part as well come back to us. feel free to stick around in the kind of the ex extreme as well while we move on to Lama four, because those are huge, huge news. Thanks Michael. now folks, it's time. First of all, all five of us are on stage LDJ without his face on, maybe a hundred in one episode is what it takes for a person to show his face on stream. but besides this folks this week, Lama four, what the hell is happening? Let's talk about this. from where are we with this effort? what did they release? What's going on? let's cover LAMA four. I think folks need us to tell them what's going on.
Wolfram Ravenwolf
Wolfram Ravenwolf 20:40
You said Meta dropped Llama 4, in a way.
20:42
They dropped it on a Saturday. They really dropped it there. And, yeah, it's been long and waited. Then we got it earlier than expected. I was more expecting it, later this month. So that was a surprise. And, the model architecture was completely different.
Alex Volkov
Alex Volkov 20:58
this felt like it was a surprise for a few meta folks as well.
21:01
Within meta this felt on a Saturday release, we definitely all got like a little bit. I was on the mountains somewhere. we all got what is going on Saturday? yeah,
Wolfram Ravenwolf
Wolfram Ravenwolf 21:09
a lot of controversy because of that and the inference
21:12
providers didn't have much chance. What I read about is they got one day in advance so they could optimize vLLM, for example. And we may still be seeing issues with that. And the architecture is completely different than what we were used to. So there was a rumor that Meta scrapped everything and redid everything after the DeepSeek moment. What we are seeing here: the 70B and the smaller ones are all gone now. We got the big models, MoE for the first time, multimodal, which is great, but the license says not in Europe, which is not great.
Alex Volkov
Alex Volkov 21:42
So here's how, I joined the thing that you mentioned,
21:44
like just sizes wise, right? We are used to running LAMA models on our max. there's the whole community of open source from big companies like together AI and CEREBRA is now running LAMA for a thousand tokens per second. That's all great. But to us in the open source, what open source means is local, to us, implicitly, open source means local. This is why we go to open source. If we need to run this in the cloud somewhere, we might as well run 4.0 or Gemini 2.5, which is incredible. Gemini 2.5 flash, which is gonna come out today is also incredible, right? So for us, one of the coolest things about open source is the ability to run this on my device, Lama and some other folks were the models that we ran In the seven B and 72 B, you could quantize them down. Lama ForeScout, which is the smaller one that they released, they released two models and promised us behemoth, which is absolutely insane. We should talk about behemoth Lama Forc is, they lead with 17 billion parameters, but those are 17 billion active parameters. There's an MOE that you need to download, I don't know, 200 gigabytes. It's 109 billion total parameters, and 16 experts over 17 billion active parameters. Active parameters is great when you run it on the cloud It's not great. when I need to download and actually load this whole model into my Mac, right? Lama four Maverick, the largest brother is a 400 billion parameter model. There's not a MacBook in the world that can run this no matter how much you quantize, right? Or maybe you can say differently, but basically this is why the community also bifurcated because hey, folks, we're used to small models for lama, where are small models? And it seems like maybe they couldn't achieve. Some of the performance boosts on top of, I don't know, the distilled versions of R one, for example, or Quinn is doing an incredible job enough to justify the release of small models because I'm sure they trained it, like I'm sure they distilled this behemoth thing into the 70 billion parameter. Maybe they didn't see enough of a difference to release this. which a bummer for us. on the open source side. Wolf go ahead. There's more things there with the messy release.
Wolfram Ravenwolf
Wolfram Ravenwolf 23:41
Yeah, and there was a version on Lan Marina, which
23:44
was an experimental version though. They were saying, oh, it's very good, but it turned out that is not the model you can download.
Alex Volkov
Alex Volkov 23:50
what the hell was that
Wolfram Ravenwolf
Wolfram Ravenwolf 23:51
not a good thing, not a fair play in a way.
23:55
So that is very weird. What does meta doing? Why? so we all want this model.
Alex Volkov
Alex Volkov 23:59
Yeah.
23:59
Just to clarify, El El Marina is where for folks who are still new, there's every time there's new folks here, there's folks selecting between two different models. They've been running the secretly against other models as well, and when they announced, they announced like, Hey, this is like the numbers in M Marina. It turns out that they provided them a different API. So they released Lama Scout, lamo Maverick in both instruct versions and base versions, which is great. Kudos to them. Base versions always great for fine tuning. But there was another version that we did not get that was only the API right, that El Marina got, which is like WTF mate. what, what's going on? LDJ, you had a comment about this.
LDJ
LDJ 24:37
Yeah.
24:37
on the original point of why they released on Saturday, there was somebody that asked Zuck on threads why he did it, and he actually responded and he said, because that's when it was ready. So there's that. I'm not sure if I exactly believe that. I feel like maybe they were trying to rush something out because of something else. But
Alex Volkov
Alex Volkov 24:53
shout out to Big Zuck with my whole love about the
24:57
open source love and everything. That is some bullshit. That is a, great bullshit because we know how these models work. You can keep training them. You can just sit and watch the loss and weights and biases or whatever, framework that you do. But obviously use weights and biases. 'cause Thursday, and then you can just sit and watch the and then arbitrarily decide it's already here or it's already here. Like that statement makes no sense. I'm sorry, this is like a different thing. also, Google next is started Monday. So like Saturday was a release just before Google is gonna announce some, some crazy new shit. there's always these releases are not always like about the model as well. But yeah. LJ thanks for this. Yeah.
LDJ
LDJ 25:29
on VLLM as well, we, while we were just talking about could we,
Alex Volkov
Alex Volkov 25:32
could we ask one of you to clarify what VLLM is?
25:35
'cause folks who are listening to us may not know. I think it's important for the next part that we're gonna talk about.
LDJ
LDJ 25:40
Yeah, I think, well from, or Nten, you guys have the most
25:43
recent experience with it, probably
Alex Volkov
Alex Volkov 25:45
Nisten.
25:45
You wanna give us, like, a one-minute on vLLM?
Nisten
Nisten 25:48
Ignore the V, the L and, lm, it's not an LLM, it's not a V, it is, you could
25:54
just call it orange for all you want. It's a runtime. It's what you need to use when you're running the model on multiple GPUs. And vLLM is the fastest because it does something which, for example, llama.cpp doesn't do, or LM Studio does not do, and that is tensor parallelism. So if you're gonna run this model on multiple GPUs and you need to split it, or you have a small model and you need multiple copies of it to run really fast, you use vLLM, because it has implemented tensor parallelism, and that makes it a lot faster when you're hosting it commercially, usually, or when you have multiple GPUs. That's what you use it for. I would say more than half, probably two thirds, of all commercial inference is based on vLLM. The rest would be on NVIDIA's TensorRT or Google's own runtime. And you have llama.cpp and stuff. But yeah, most commercial inference running these models is on vLLM.
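For listeners who want to see what Nisten means in practice, here is an illustrative snippet (not from the show) of serving a model with vLLM's tensor parallelism; the model id and GPU count are placeholders you would swap for your own setup:

```python
# Illustrative only: vLLM splits the model's weights across GPUs when
# tensor_parallel_size > 1, which is what makes multi-GPU hosting fast.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # placeholder model id
    tensor_parallel_size=4,                             # split weights across 4 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```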
Alex Volkov
Alex Volkov 26:57
and it's an open source project, Yeah, it's completely open
Nisten
Nisten 27:01
So yeah, there has been issues in the past with VLLM as well and I
27:05
encountered almost exact same, garbling. So after you talk to the model for a while, it just starts to output a lot of gibberish, a lot of random characters. It's still coherent, but it just starts to garble out. And I've noticed this with deep Seek V three as well, which is a similar, architecture. So yeah, this is just one of the issues
Alex Volkov
Alex Volkov 27:27
let's talk about the reason why we're basically
27:29
mentioning VLLM as well. So not only did folks see a difference in quality of the model compared to Ella Marina, there was also a bunch of providers that, like we said, got like one day, Hey, ta-da, here's like our, please put them on. previously the reason why Zach stated that LAMA four is getting released this openly is because to get all these providers like together AI that we just talked about and, and, Cerebra and core, like all of these, like folks who kind of like put models on inference, to, to align on this open source standard called lama, which, famously, previously saved meta billions of dollars when they did this for, for the data center stuff, right? This is like the stated reason for that to releasing Lama's open source, this time they gave him a day. In advance. So basically the decision to send it to ship, it was, it came on Friday probably, and on Saturday, like they released the model and, and some folks worked very hard weekends in these, providers to put Lama for Maverick and Scout on their inference, and then they put it on open router and then folks started seeing Wolf firm. Like differences between even the providers. So not only from the, the special version that El Marina got via API, but also like the actual provider implementations. Part of it maybe is due to VLLM Wolf. You wanna talk to us? Yeah. through some of this.
Wolfram Ravenwolf
Wolfram Ravenwolf 28:41
So Meta released it when it was ready for them,
28:43
but they didn't give the inference providers a chance to of course we were curious, we wanted to use them. it's actually possible to run the scout model locally if you use a very small quant, almost, two bit, for example. And I was running it on my own system with, 30 tokens per second, which was quite fast. And oh, it did pretty well. I posted the benchmark results and then I decided I want to see, how does, the really good, big original model do. So I thought, okay, use together ai because they have the not quantized version. that's a benchmark there. And it was worse in the full position compared to what I was getting locally though. I was what the fact, that's impossible.
Alex Volkov
Alex Volkov 29:23
Wait a second.
29:24
Hold, hold, hold on. What you are just saying is crazy. So it bears repeating: your very, very much quantized version. I think you used Unsloth or something like this, right?
Wolfram Ravenwolf
Wolfram Ravenwolf 29:34
Yeah.
29:35
Unsloth dynamic,
Alex Volkov
Alex Volkov 29:37
very quantized version, basically like low, low, low precision,
29:40
three bits, outperformed the hosted version, which is hopefully full precision, because that's what we expect from hosted versions, on your benchmarks.
Wolfram Ravenwolf
Wolfram Ravenwolf 29:49
Yeah, it did.
29:50
It's the MMLU-Pro benchmark, so it's not something specific. Everybody can reproduce it, and others have reproduced it as well. The results I was getting, they were consistent. I was not doing one benchmark run, I was doing four. And I did four with the Together API, both with the recommended settings and the default settings, which is temperature zero, for example.
Alex Volkov
Alex Volkov 30:08
Just to summarize for folks who listen, This is
30:10
a very low precision model. You quantize it to significantly lower precision in order to run it on your hardware, because these models are huge, and you're able to get it to run on your hardware. But the supposed downside is that it's worse than the full-precision model, which is trained in FP8, by the way. But you're getting better results. How much on MMLU-Pro did you get?
Wolfram Ravenwolf
Wolfram Ravenwolf 30:29
I did get with a local model, just a small model.
30:32
I get 73% For scout? For scout, yeah. Locally. and when I did it together, I was getting only 63 to 67%. So let's say 65. So that was a big difference between the original model, 65%, compared to the 72% I was getting with a tiny version of that, that was running locally. So that is normally not possible, which could mean two things actually. Either they overfit on the training data. So the model wasn't very, smart, but it had it in its memory in a way, or which is more likely that the big model, has inference issues. So that is what I personally expect, that VLLM has issues, and that, as far as I know, has even been confirmed already. So we are not getting the full quality yet.
Alex Volkov
Alex Volkov 31:21
It's fixed today.
31:22
I've asked around. And also, let's shout out Jon Durbin and Chutes AI, that also provided you an inference endpoint as well. And you did some other retesting on other providers, not only Together AI, so we're not pointing the finger here at Together specifically. And you got some decent results for some of them, but still not as much as the Maverick.
Wolfram Ravenwolf
Wolfram Ravenwolf 31:41
Maverick results.
31:42
It was better than the OTT P three, but it was still not as good in this test. Of course, always talking about just what I tested. But Llama 4 Maverick on OpenRouter and on Chutes, both of them, they got about 82%. So that was also much better than the Scout versions.
Alex Volkov
Alex Volkov 31:59
It's also higher than what Meta released in their eval as well.
32:02
Meta's release for Maverick: 80.5. So you're getting 82? Yeah. So that was a full
Wolfram Ravenwolf
Wolfram Ravenwolf 32:07
benchmark.
32:07
I was just doing an excerpt, the computer science part, so I could run it in a reasonable time. So it's not exactly the same score, but it is part of it, and it's very close. So I think both scores from Meta were right. And the big takeaway here is that it is the best Llama model we have, the Maverick, and it's also the biggest, which you just can't run locally easily. But at least it's there. And if there is an inference issue, that also means that if the issue is fixed - and the things I tested were with the issue - then the scores may rise even more. So maybe it does beat DeepSeek if it is fixed completely. That is something I can't tell you, because we have to wait for the fixes and benchmark it again to see.
LDJ
LDJ 32:49
Yep.
32:49
So yeah. According to Meta's benchmarks reported for Maverick, it actually trades blows with even DeepSeek V3, and for the Behemoth, it trades blows with Claude 3.7. And so, for that - this is also my first theory when I see people showing benchmarks where Maverick is actually really bad - my first thought is, okay, this is a new architecture, or it's using MoE, which previous models usually didn't. And I think, yeah, it's very possible that the inference frameworks are just not properly running the model. And we have seen these types of things happen before at smaller scales, but I think that's probably the best explanation for this.
Alex Volkov
Alex Volkov 33:33
So let's talk about some of the other stuff because, okay,
33:35
the release was messy from multiple angles because of this Saturday, the release, it's unprecedented. Maybe folks are chilling with their kids, for example, don't have the time. the inference providers didn't get enough time to maybe adjust and test it properly and compare this to the release metrics from meta, and Lama, decision was made, about when to release this fine, but also there's a few new things mentioned, let's run through them super quick. This is the first MOE from LAMA. So far. They released dense model. The one of the biggest dense models that we've got was 4 0 5 B previously, LAMA three, 1.1405 B. So first MOE released was like significantly less active parameters, only 17 billion active parameters great on CPU, great on distributed stuff. we also have multimodal native capabilities there, images plus text. They use meta clip for vision encoding something new, context windows folks. Let's talk about context windows for a second. met Lama Scout is released with 1 million, sorry, yeah. 1 million. contact window and metal lama Maverick released with an absolutely staggering. 10 million tokens in the context window. at least based on the announcement, not about how it performs based on the announcement, this absolutely deserves air horn because two years ago we were celebrating 4K to six K to eight k jump, and then rope moved us to 30 2K and was like, oh, when am I getting access to 30 2K? We're talking about 10 million tokens in the context window, at least conceptually supported, whether or not it actually works like this or not. We will discuss next, but significantly longer context windows, which is insane. I think Google announced theoretical support for, for general the correction.
Wolfram Ravenwolf
Wolfram Ravenwolf 35:05
Yeah, the correction.
35:06
Scout, the smaller model, has 10 million, and Maverick has 1 million. Okay.
Alex Volkov
Alex Volkov 35:10
Yeah.
35:10
Thank you for that. Scout, the smaller one, has 10 million, and then Maverick has 1 million, correct? Interesting choice.
Nisten
Nisten 35:16
I think I'm the only one that actually liked the model, surprisingly.
Alex Volkov
Alex Volkov 35:21
your expertise.
35:22
I just wanna run down the other new things that we got super quick, and I would love to hear from you, because I would love to know vibes from the community. It trained on 3x more tokens compared to Llama 3, a significant jump - I think 40 trillion tokens for Scout and 22 trillion tokens for Maverick, which is absolutely insane. Improved multilingual capabilities as well, 10x more multilingual tokens, and iRoPE, which LDJ also mentioned. Nisten, you use these models, you said you use them almost daily. How do you use them? Which version? Whose inference? And what's your experience like?
Nisten
Nisten 35:57
I've seen them with the open web UI and I've been using the together
36:01
and the fireworks, APIs for them. They're very fast. They're pretty good at, at tool use 'cause I have a whole bunch of, of my own tools set up. And so I usually start, the thing about open Web UI is that it lets you start off with a model and then you can switch to another model as, as you're talking to it. So I'm almost always starting off with Lama form Maverick. And, it is very fast. it does the job initially. It can't really vibe code on its own. this is what, I noticed. I am still waiting for better. Improved inference. And I have not tested the Lama CPP yet, but, yeah, I noticed that I had to switch to R one eventually. Even V three doesn't really do it at some point if you need to solve a problem. the thing about this is that one they mentioned, they trained it all in eight bit. So this is all eight bit trained. there are some issues with VLLM, with the runtime again that, llama CPP avoids, and that is it. Quantize also very, very tiny sensitive parts of, of the model, which are called the layer norms. And you're not supposed to, quantize those normalization, linear layers because they can induce much bigger, errors further out in the model. So if someone ran it at, B Float 16 or FP 16, I think that's probably what they hosted the inference on. Then they would get much better performance. But again, you need 800 gigs of GPU, to run the Lama form Maverick. So that's an issue there. the architecture is very interesting. I think Lama four Scout should be able to do pretty well on a 64 gig or 96 gig MacBook. It's probably the fastest model that you can run on that thing. And that's still very intelligent. So that's a good architecture for that. For Maverick, I am not as convinced because, for example, deep Seek has 37 billion active parameters. So it's 36 gigs of Ram is always active. 17 b might be too small for how wide the model is, the expert weights tend to store, it's all very abstract, but they tend to store more of the knowledge of what the model knows and the main weights, the KQ and vs. tend to store more of, the actual logic and, its ability to think it's not exactly that, but it's roughly that. And, also the previous deep decoder V two had 23 billion or 27 billion active parameters. there's that as well. Also interesting to note was that when Juang came with the 1 million Quin, he mentioned that it was an MOE of their 14 B model, which is, very interesting, in this case because that would make it a very similar architecture to what the Lama ForeScout is now, from a few months ago. yeah, I like the model. It can't really vibe code. I'm still holding out judgment on it, and I tend to be the first to overreact. So I'm not quite convinced that issue is fully solved yet before making the call. It might also just be that the first initial task, it might be very well optimized for one question, one answer type of thing. So if you give a really hard question, it can do the answer and, it answered all my very hard questions. So that's why I was very impressed at first, but in Multiterm it wasn't quite doing it. So yeah, that's what I.
Alex Volkov
Alex Volkov 39:26
Yam.
39:26
what is your take on this whole release and models? give us a little bit of your,
Yam Peleg
Yam Peleg 39:31
I don't think there is anything in bad faith.
39:33
I think it was just rushed, and many of the quality issues we see are like... these things are hard. When you come up with a custom architecture, people need to understand that when we say vLLM is faster, it's not just for quality of life. When we say faster, it is pretty much impossible to run it normally, as people imagine, without appropriate infrastructure like vLLM. You can run it naively, but it's unusable. Chances are you're not even going to be able to run it on your hardware, even if your hardware theoretically allows it. So when people use vLLM, it's not just for quality of life. It's pretty much the only way that you can do this. If you drop on people a custom architecture they've never seen before, and they have a couple of hours to implement it, mistakes can happen. This is why I think we see all this discrepancy between LMArena and the evals and so on. I just think it was rushed, and we all see what happened this week. So we all understand pretty much why it was rushed this way. But overall it's huge. Like, I didn't expect anything like this from even Llama 4. It's more than what I expected. I was just shocked by how far they're taking this. And just like you said, Alex, you can just train them more. They can just drop another one next week and pretty much solve many of the issues. And yeah, theoretically - I don't know if they're going to do it, but,
Alex Volkov
Alex Volkov 41:03
well we at least know that there's at least one
41:05
more model they're about to drop, which is Lamo four behemoth. And let's cover that a little bit. So they gave us two models, but they announced three models. scout and Maverick had the 109 billion parameters and then 400 billion parameters. MOE. previously they had LAMA at even one, we called it the chunk or lama, which four or five B billion parameters. they announced LAMA for behemoth, at 2 trillion. Parameters, 2 trillion parameters, like it is getting ridiculous folks. and then out of those 288 billion active parameters, still 16 experts, which is very interesting because other Moes, bigger sizes, Moes got two crazy level of number of experts, like 200 or something. and then they basically mentioned that this model is a teacher model for distillation. It's not a model that you run inference of because it wouldn't make sense, but basically you would distill these models like the behemoth models to make the smaller models better. Like we saw with R one when they release the deep one and dips three, et cetera. And then they use like deep secret one to distill into smaller models as well. So that model is coming at some point. It probably keeps training as well, so we're definitely gonna get something else from them. One thing that I wanted to cover super quick is that the idea that they've trained on benchmarks because they, they, they kinda release some stuff and then the ization made them
Yam Peleg
Yam Peleg 42:19
Oh.
Alex Volkov
Alex Volkov 42:19
they came out.
Yam Peleg
Yam Peleg 42:20
I must ask, how do you even overfit LMSYS?
42:24
even if you want, how would you do this? and I get that they're doing it, but they just don't understand why. even if I wanted to do this,
Nisten
Nisten 42:31
It's still a lot more alpaca data sets with just one question, one answer.
LDJ
LDJ 42:36
I think it would be a lot of, it's not really cheating.
Nisten
Nisten 42:39
yeah, I don't think that that would even count as cheating
42:41
because it's, so there was,
Alex Volkov
Alex Volkov 42:42
there was quite a few kind of things coming out in
42:44
Reddits, and some anonymous people who supposedly quit, et cetera. I just wanna highlight that Ahmad Al-Dahle, who leads the Llama effort in Meta, came out very clearly against this and said there's no such thing as doing this, there's no decision at which point we decided we'll quietly do this to appease Zuck or whatever. While that checks out, those are probably the reasons for the differences in the benchmarks that we see. And we did see this as well between the different providers; we did start seeing differences. So folks will catch up and will release better support, because it's hard. But absolutely a huge shout out to everyone in the Llama team for this incredible release. We've been seeing China taking over open source significantly - Chinese folks and great labs, super, super great labs like Alibaba's Qwen, who we know and love on the show (shout out to Junyang Lin), like DeepSeek that basically broke the stock market. Until this break that we're in right now, until yesterday, that was the biggest break from the beginning of the year. Like, great labs controlling open source and just absolutely crushing it. It's great to see that. We're waiting for OpenAI's open-source release, maybe today, I don't know. But until then, it's great. And shout out to the whole Llama team for this huge effort.
Nisten
Nisten 43:55
I think instead of saying state of the art from now on, they should
43:57
say a stock market breaking model.
Alex Volkov
Alex Volkov 44:00
Yeah.
Nisten
Nisten 44:00
that's why
Wolfram Ravenwolf
Wolfram Ravenwolf 44:02
Alrighty.
44:03
So I just wanna... and you know what, yeah, go ahead. It shows how used we are to such big models, such big context lengths of a million tokens. I remember, a year ago, I thought I was happy if we could get more than 8K. So this is an amazing advancement, and we get used to it. So we have great models. Even in my benchmarks, Maverick was pretty much on par with Claude 3.7 Sonnet, in just this one benchmark of course, but it's a local Sonnet in a way as well. So we really have to celebrate that a bit more. I think we absolutely
Alex Volkov
Alex Volkov 44:34
have to celebrate and I wanna just join this celebration effort
44:37
and to mention, folks, there are millions of dollars that Meta spent on giving you these open source models, millions of dollars in compute, and maybe hundreds of millions of dollars in paychecks, in salaries, to all the execs and the ML researchers and everybody. Meta is giving us an incredible release. Yes, they didn't release the papers this time; they will probably follow up with some papers as well. Meta is absolutely standing in for the US, for the western world, in saying hey, we can open source, and we are open sourcing. This is absolutely to be celebrated. So what if the release was messy? I absolutely think we should celebrate this incredible effort from Meta, and shout out to everybody there who worked on it.
Wolfram Ravenwolf
Wolfram Ravenwolf 45:16
Just one little point.
45:17
And the big thing, whether it is good or not, will show when the community does fine-tunes and distillations, and we see what comes out of it.
Alex Volkov
Alex Volkov 45:24
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 45:25
That, there's a lot of potential and we will see when the
45:27
Hermes comes, and the Dolphins and all the other models; then we will see if it was really useful and good for the community.
Alex Volkov
Alex Volkov 45:34
and also another shout out is because of this, they gave us the base
45:38
models, not only the instruct models, so the fine-tuning folks can take the base model, start comparing it to the fine-tuned model, and start trying to beat it. As we talked about when Llama 3 was released, Llama 3 Instruct was really hard to beat with a fine-tune. They did such a good job back then that even the greatest folks, like Nous Research, like Dolphin, like Jon Durbin, all of these folks with the datasets, had a really hard time, and so they all did some kind of a mix. We'll probably see some of that here as well, but the community will absolutely pick it up. Absolutely great release. I wanna smoothly move over to the next topic because I think it's relevant, and even the eval that I'm gonna show in a second is relevant. So Nvidia released Llama Nemotron Ultra, which they announced multiple times on stage. Nisten, you remember, we talked about Jensen at one event at the beginning of the year; he showed something, and in the group chat I was like, did they release this or did they just announce it? Then at Nvidia GTC they released Nemotron, and we actually chatted about that as well. We tried to bring Nvidia folks here; Nvidia's legal department did not agree to let them talk about it. Fine, fine, we're gonna talk about it anyway by ourselves. But our friend Chris was on the actual release post for Nemotron. They already released the two smaller Nemotron distills, one of them from Llama 70B, and now Ultra is released. And this is a little bit of the evals that they show, and I love it. They released it a day after, I believe on Monday, and they already included Llama 4 Behemoth benchmarks within this Llama Nemotron Ultra release. So again, Llama Nemotron Ultra is a distillation of Llama 3.1 405B, the previous version of Llama; supposedly the newer versions of Llama should be better. And what we see is a very interesting thing, where the distilled version of 405B, which totals 253 billion parameters, beats Behemoth. Not Scout or Maverick: it beats Behemoth, the one that Llama just announced and didn't release. So this distilled version beats Behemoth on GPQA, which we know is a set of very complex science questions. It beats it on complex math. It beats Llama 3.1 405B on tool calling. Generally, they obviously pick the benchmarks; some of them have all the models, some have only a few of the models. Did they add MMLU here? No, they didn't add it here. So basically they took a previous version of the model, they distilled it, pruned it, and fine-tuned it on top, and now it gets very, very good on multiple things. Absolutely a shout out to Nvidia for this effort. Anything folks would like to add? We've covered Nvidia's pruning and distilling before, for Nemotron, but if you wanna add anything, this is a good time. Looks like we're all good here. So I think the highlight is: we're used to distilled models that run at lower quants on our Macs; they're also adding pruning, which is a different way of reducing not necessarily the precision but the size and the weights of the model. And so they pruned this 405B into 253 billion parameters, which is great, and you can find those on NIM and other places. Another thing that we wanna mention is that this model has reasoning built in, with reasoning on and reasoning off.
So within the system message, you can decide, for the Nemotron model, to turn reasoning on. One thing that comes to mind is that Llama did not release any reasoning models. Folks, the Llama 4 releases came out and those were MoE, FP8, a bunch of new stuff, multilingual, multimodal, but not reasoning. So we didn't yet get a reasoning model from Llama, despite the rumors that there were fire drills inside Meta and that they scrapped it multiple times. I'm assuming we'll get one at some point, because the world is moving towards test-time compute.
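A hedged sketch of that reasoning on/off toggle, using an OpenAI-compatible client: the exact system-prompt wording and the model id are assumptions based on how Nemotron-style models are usually documented, so check the model card for your deployment.

```python
# Toggling reasoning via the system message for a Nemotron-style model.
# The system string, endpoint, and model id below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    system = "detailed thinking on" if reasoning else "detailed thinking off"
    resp = client.chat.completions.create(
        model="nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # assumed model id
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", reasoning=True))   # long thinking trace expected
print(ask("What is 17 * 24?", reasoning=False))  # direct answer expected
```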
Nisten
Nisten 49:24
Shout out to Nvidia for releasing the open code reasoning data sets.
49:28
I don't necessarily like the Nvidia models; I've had plenty of criticism. But they've been pushing the most open datasets out of anyone, and those are super, super useful. And I think those are gonna be valuable for a long time, even after the model is gone.
Alex Volkov
Alex Volkov 49:46
let's move to the big companies and APIs. Look...
49:49
No, before that, there's still a bit more open source we have to cover. Yeah, I know, this is getting long; we're exactly an hour and ten minutes in. We also need to mention Kimi-VL and Kimi-VL-Thinking. Let's talk about those, because it feels like Kimi is not getting the spotlight it deserves; every time they release something, somebody else takes over. Kimi released an MoE reasoning VLM, and we love reasoning VLMs. We've talked about QVQ from Qwen a couple of times, and most of the major labs' big models are also reasoners with vision. So Gemini 2.5 is a VLM reasoner: it can see and reason. R1, famously the great reasoning model that came out in open source, is only text and not multimodal. And here we are getting a reasoning VLM, which I wanna say is a fine-tune; I'm not entirely sure, but let me take a look. I think it's a fine-tune, maybe not. So yeah, this is a fine-tune of their base model, Moonlight. So Kimi-VL, from Moonlight, has 3 billion activated parameters, so it's very, very small and can run on local devices. And they show a very great MathVision test where they're in the top left, with a very high score relative to activated parameters. Gemma 2, 27 billion parameters: they're beating that on the score with only around 3 billion activated parameters, 33 or so on MathVision, compared to 10x larger models. Now, it's an interesting choice to compare your active parameters to a full model, but still, a very impressive result from Kimi. And there's a bunch of other key tasks where they're basically leading the pack, even against Qwen 2.5 VL, which we know is a great vision model as well. Shout out to Kimi on this release. I believe... go ahead.
Nisten
Nisten 51:28
I've been following them for some time; I think they are pretty legit.
51:32
I'm still waiting for llama.cpp support to run this. I wasn't able to run it, and there are no quantizations of it yet because of that. But I am really waiting for this one, because I think they were one of the first to use DeepSeek's GRPO technique on a small model, and they got their math model to perform as well as GPT-4o, and that's another model at like 3B. This team has been doing some very interesting stuff, and I think it's very much being slept on right now as a model, and it's MIT licensed too.
Alex Volkov
Alex Volkov 52:03
Alrighty, so that covers the open source section of ThursdAI.
52:07
There have been great releases, and we've chatted with Michael from DeepCoder; we covered Llama 4 at length. Now, because we're an hour and ten minutes into the show, I wanna take a break and remind folks: you are on ThursdAI, the weekly AI news show that brings you the latest and greatest in AI. This show is sponsored by Weights & Biases, and we have a special segment here on the show called This Week's Buzz, where I talk about everything new in Weights & Biases. And this week there's something that I'm very, very proud of, because I'm leading it internally, and I wanna chat with you about it. Usually folks on stage get a break; you can go and smoke if you want to. But I would love for you to enjoy me telling you what I'm up to at Weights & Biases. So let's go, let's get it. This is This Week's Buzz.
53:02
Welcome to This Week's Buzz, folks, the corner of ThursdAI where I talk about everything that happens within Weights & Biases. And this week I'm very, very excited to announce a few initiatives. One of them is something that I'm leading personally, which is why I'm very, very excited. You know that we've been talking about MCP on the show multiple weeks in a row. We started MCP-pilling everyone with the March 6th show with the folks from Cloudflare, about MCP, just a 101. If you don't know what MCP is, definitely go and check that episode out; you'll definitely get as excited as us. I just wanna mention that while MCP is heating up everywhere, it's also very important to Weights & Biases to make sure that we're not missing this thing. And MCP has these very interesting benefits, but also drawbacks. One of the benefits is the standardization of tools. So if you have an agent or a chatbot, et cetera, that is an MCP client, then the more you use MCPs as tools, the more you trust other servers to run those tools, and they don't run in your context. What this means is it's almost impossible to observe, end to end, what your agent does or what your chatbot does. Weights & Biases has a tool called Weave for LLM observability and evaluation, and I'm looking at this and saying: oh, it's great that this standard exists, but also, the more MCP tools an agent uses, the less your whole stack becomes observable, because you outsource the calls to your tools via this protocol. And we started thinking hard about this and what it means for developers and developer experience across AI agents. And we are very, very proud to launch an initiative that I call Observable Tools. That's an actual URL that you can go to, Observable Tools. If you go there, first of all you'll see this great video that we did with Hedra Labs, and for folks who don't know the reference, maybe I'll tell you at the end of my segment. But basically you'll see a great manifesto. This is a manifesto of ours that talks about observable tools: let's make MCP tools transparent and observable. There's a discussion about peering into the MCP black box, and our vision is full stack agent observability, so that you as a developer can see every step of execution in the agent chain. You'd be able to observe it and know what went wrong and when it went wrong, and you'd be able to do evaluations, et cetera. So for all those things, first of all, I invite you to go and check out the manifesto. I wrote it together with Gemini 2.5, but mostly I wrote it; Gemini 2.5 definitely helped, it's a great model for creative writing and helping you think through things. We think that observability is essential, and we embrace open standards. MCP is an open standard, a vendor-neutral open standard, so we are embracing OpenTelemetry as well, which is another vendor-neutral open standard that should be able to tie into this for full observability across the execution. And we talk about the path forward, and in the path forward, what I wanna highlight specifically is that because MCP is an open standard, everybody can suggest changes to the protocol and the specification. Not everybody can actually affect the specification, and it's not that easy; the MCP folks are absolutely swarmed with suggestions and different proposals and ideas. And so we sent a proposal. Yeah, this is my submission, the proposal.
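To make the observability gap concrete, here is a minimal sketch using Weave's Python SDK: the agent's own steps get traced, but the MCP tool call shows up as a single opaque span, because the real work happens on the remote server. The project and tool names are made up for illustration.

```python
# A minimal sketch of the problem described above, assuming W&B Weave's
# Python SDK (weave.init + @weave.op). Names are illustrative.
import weave

weave.init("thursdai-agent-demo")  # logs traces to this W&B project

@weave.op()
def call_mcp_tool(tool_name: str, arguments: dict) -> str:
    # In a real agent this goes over the MCP transport to a remote server;
    # from the tracer's point of view it is one black-box call either way.
    return f"<result of {tool_name} with {arguments}>"

@weave.op()
def agent_step(user_query: str) -> str:
    docs = call_mcp_tool("search_docs", {"query": user_query})
    return f"Answer based on: {docs}"

agent_step("How do I configure interleaved attention?")
```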
It's been getting a lot of attention already, but I would love for you to go in there as well: proposal 269 on the MCP spec, adding OpenTelemetry trace support to MCP. We've outlined a lot of the work and the way we think it should happen. I'll highlight a few places where we depart from the regular OpenTelemetry model: we think the tools need to report telemetry back to the client, and not to a separate location, for ease of use, so that the tools don't have to carry their own observability configuration. If this means nothing to you, that's fine; I have a diagram there that you can go and check out. We outlined specifically how we think this would work: the transmission mechanism, the standard, the notification types, and the rationale for those notification types. We went super, super deep on this. We researched the protocol, including feeding, via GitHub, all of the code for the MCP spec into Gemini, to review what we thought makes sense or doesn't. So we actually came up with a very detailed spec of how to integrate OpenTelemetry within the MCP protocol. We talk about alternatives as well, and we have a bunch of schema changes. So basically a very detailed, very deep suggestion. For folks who are just listening, what I'm showing on the screen right now is, on the left, you have an agent, the agent does some stuff, and at some point it gets to an MCP call. And when that MCP call happens, our observability tools, and not only ours, all of our competitors as well, friends in the observability industry like LangSmith and Braintrust and Arize, all of these folks whose names were not to be mentioned on ThursdAI until today, because we want collaboration, all of these folks, if you trace your agent, when you get to an MCP tool call, all you see is the MCP tool call and the time the whole call took. What we want is something like the right side, where, with full observability within the tools, you can open up the tool call and see the several execution steps inside it separately. This helps you, as an agent developer, get to the point where you can see the full execution trace, in one tree, of everything that happened in your agent, regardless of which tools ran; obviously you need the tools to support this. This is the observability we're trying to enable with Observable Tools. So this is our proposal. I invite everybody, if this interests you, to go read it and support it. I invite our friends from the industry; again, shout out to Samuel from Logfire, who came in and gave a lot of comments, and I invite the folks from Galileo and Braintrust, all of the folks. Here's my call to action: please join RFC 269 on the MCP spec, give your comments, give your thoughts. Collaborate with the OpenTelemetry folks who are already there in the discussion, debating the right way to do this versus the wrong way. As developers who will build servers, please go in there and say: okay, for me as a developer to add observability, this will be harder, this will be easier. We want you to participate in this, not only the labs that will end up benefiting from it; we want developers in there talking about how the developer experience will work. So this is our effort, and this is a call to action to you. Please go to Observable Tools.
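As a rough illustration of the idea in proposal 269, and not the ratified spec, here is what a tool-side trace notification could look like if a server sent OpenTelemetry-style spans back to the client after a tool call; the method name and field names are assumptions.

```python
# Sketch of a JSON-RPC notification carrying an OTel-style span tree for a
# finished tool call. Everything here is illustrative, not a spec.
import json, time, uuid

def make_span(name: str, parent_id, start: float, end: float) -> dict:
    return {
        "spanId": uuid.uuid4().hex[:16],
        "parentSpanId": parent_id,
        "name": name,
        "startTimeUnixNano": int(start * 1e9),
        "endTimeUnixNano": int(end * 1e9),
    }

now = time.time()
root = make_span("tool:search_docs", None, now - 1.2, now)
children = [
    make_span("fetch llms.txt", root["spanId"], now - 1.1, now - 0.6),
    make_span("rank sections", root["spanId"], now - 0.6, now - 0.1),
]

notification = {
    "jsonrpc": "2.0",
    "method": "notifications/telemetry/trace",  # assumed method name
    "params": {"traceId": uuid.uuid4().hex, "spans": [root, *children]},
}
print(json.dumps(notification, indent=2))
```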
Please check out the video, which I was very happy to build, and get excited about the AI video. But also, please, please, please participate and upvote this. Basically, my goal is to make this impossible to ignore for the MCP spec authors, because we think it's very important, and the only way to make it impossible to ignore is to galvanize the community to go in there and give thoughts and comments and upvotes. So this is my ask from you: if you wanna give me a present for the hundredth episode of ThursdAI, I'm very, very excited about this effort. This is my first foray into open source and specifications and RFCs, and I'm very, very proud of the work that our team did in the preparation of all this; you'll see the very detailed specification that we put in there. I would love for you to join this effort with us as well. The second thing that I want to announce in This Week's Buzz is that Google has released the Agent-to-Agent (A2A) protocol, and Weights & Biases is a proud, proud contributor from day one. I'm gonna add the link here as well. Basically, Google released this new thing called the Agent-to-Agent protocol, and I think we'll cover it in the next section. I just wanted to highlight that out of the hundred or so companies that Google got to collaborate with them on this, Weights & Biases is one of them, and I got the benefit of writing the W&B blog about our support for it. I am very excited about this protocol, and I would like to talk to you about the differences between that protocol and MCP, because they're not competing; they're basically complementary standards. So this has been This Week's Buzz. I think for the first time, maybe, I'll invite comments in This Week's Buzz; for the next two minutes I would love to hear comments from you guys on this as well.
Yam Peleg
Yam Peleg 1:00:47
Unbelievably important.
1:00:49
Seriously, seriously, so important. I think that... I just don't understand. Okay, I do understand, but I don't understand how other people don't see it. Let's start with tools. MCP at the moment has four different concepts, but everyone is focused on tools, because it's very easy to understand what tools are: they let your model do things, like search GitHub or something. But there are no specs of tools; no one is specifying what tools your model actually sees, and it changes everything. Just tell me what my model is going to see: this is a search tool for GitHub, for example. It's very different if you generate a search tool for a specific repo than if you have a general search tool for GitHub. It's extremely different, even though you can do the same thing with both; it's extremely different for the model to actually see that. This is why I'm saying that what you guys at Weights & Biases are proposing is absolutely important, absolute transparency. You really want to see what your model sees and have a trace of every step of the way, because it's just very unintuitive today. And moreover, MCP has four different concepts. It has roots, which basically point your model to what data it needs to know about, and sampling, which is basically the other way around: instead of the model using tools when you query it, you can use it the other way around and request predictions from the model through the API, for example, which allows you to do loops and very crazy things that we aren't seeing people do at the moment. And reusable prompts: at the moment you just see them thrown around, everyone stores them however they want, but one standard for reusable prompts, probably also with versioning and improvements and so on, would help. The whole movement is extremely important and absolutely a great move, and it's great to see that it's being adopted all around the market. Absolutely.
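To illustrate Yam's point about what the model actually sees, here are two hypothetical tool definitions in the general MCP listing shape: the same underlying capability reads very differently as a general GitHub search versus a repo-scoped one. These definitions are made up for illustration.

```python
# Two hypothetical tool definitions following the general MCP tool-listing
# shape (name, description, input schema). The model conditions on these
# descriptions when deciding how and when to call the tool.
general_tool = {
    "name": "search_github",
    "description": "Search code and documentation across all of GitHub.",
    "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
}

repo_scoped_tool = {
    "name": "search_langchain_ai_langgraph_docs",
    "description": "Search the documentation of the langchain-ai/langgraph "
                   "repository (README, llms.txt, docs folder).",
    "inputSchema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
```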
Nisten
Nisten 1:02:58
And just really quickly, I'd even go a step further and ask for Open
1:03:02
API: just follow the OpenAPI standard too, because if it can advertise the tools correctly, and it can advertise all of its calls, then you can just take that and put it into Swagger, just like you do with any other backend. And then any company can just work with it with their existing setup. Even if they're like a dinosaur insurance company or something, they can still use it. So yeah.
Alex Volkov
Alex Volkov 1:03:24
Alright.
1:03:25
So folks, I appreciate the support. For folks who are listening, please go to the wandb.me short link for the MCP spec proposal; it will be in the show notes, and it's a direct link to the spec. I would love participation and upvoting. Let's make this impossible to ignore, so the MCP folks will look at it and say: oh, the community wants transparency across the whole stack of execution within MCP. Speaking of MCP, for folks who are listening and can't see the stream, we have two other folks here, friends of mine, or at least one friend of mine and one new friend, who joined us and did a crazy thing this week. I really wanted to highlight this because, again, it's built on MCP, but it's also a very, very interesting effort from you guys. So I would love for you to introduce yourselves super briefly, and let's spend ten minutes discussing the thing that you guys built. I would also love to hear from you, Ido and Liad, because I know you've been deep into this, your feedback on the effort that we're leading and just announced. So feel free to unmute. Liad, maybe we start with you, and Ido after, about who you are and why you're here.
Liad Yosef
Liad Yosef 1:04:20
Yeah.
1:04:20
Hi! First of all, it's super fun to be here on the 100th episode; I'm a huge fan of Alex and the podcast. So I'm Liad, I'm currently an engineer at Shopify. I've known Ido from my days at a previous company. I'm usually a frontend architect, and recently I've been working a lot on MCP and AI infrastructure. I'll let Ido introduce himself in a second. I'll just say that Alex asked us to be on this podcast because of something super crazy that happened this week with an open source project that we started.
Ido Salomon
Ido Salomon 1:04:48
Awesome.
1:04:49
So thanks, Alex, really excited to be here, and congratulations again; I also want confetti, if you can. Basically, I've been working on this with Liad; like he said, we worked together at a previous company. Right now I'm a cloud architect and AI lead at Palo Alto Networks.
Alex Volkov
Alex Volkov 1:05:04
In Palo Alto Networks.
1:05:05
I would just say, Liad, the company that you mentioned, that's how we met as well. Yeah, I remember you interviewed me a long time ago, and I was like, nah, this is not for me. But we stayed friends, and then we followed each other. So I'm very, very excited that, through our frontend stuff and backend stuff and different things, we all ended up in this new world of AI. I'm very excited to have you here as well, because this week you guys built something super cool, working nights and weekends, and it basically exploded. So let's walk through, first of all, what you built, and then a little bit of the story of how.
Liad Yosef
Liad Yosef 1:05:32
Yeah.
1:05:32
Okay, everything started... so I'd been posting things about MCP for a while, and then everything started when Mr.doob of Three.js, which is very popular now with the vibe coding, wrote that he had created an llms-full.txt file, and it was three megabytes, so you can't feed it into an LLM's context. And I posted: okay, this sounds really good for an MCP, who wants to join me to build an MCP as a side project? And he replied: that's a weird way to send me a DM. So we paired together, and we decided to build a general-purpose MCP for every GitHub repo's documentation. We worked on it for, I think, three nights; we have full-time jobs, so we worked on it as a side project at night. And we built this MCP server, yeah, this one, which is really complex behind the scenes because it's only one server, SSE, server-sent events, so it's a remote server. It's very different from all the other MCP servers that are being built, and it's meant to be general purpose for every GitHub repo. So we built it, we really hacked through it last week, and we released it on Thursday. We got amazing traction.
Alex Volkov
Alex Volkov 1:06:36
So just to clarify what I get as a developer by using GitMCP:
1:06:40
so I work on a new repo, for example. There are a few ways for me to give my LLM the context, right? There are tools that work on the whole GitHub repo, for example, where you drop the repo in and it just gives you all of the text files in one huge chunk, and then for the bigger context models, like the 1 million context Gemini, et cetera, you just dump the whole repo. But that's not necessarily always better, because we also need to be concerned about the context length and about the understanding of the context. What you guys built is different. What you guys built is an MCP integration into every GitHub repo, basically, so I'd be able to add this as a tool. And then what does it give my LLM as context?
Liad Yosef
Liad Yosef 1:07:16
So actually it really benefits the people that develop using
1:07:19
your repo, because they can access all the documentation in your repo, be it the readme or llms.txt, or even the documentation in the code. They can access it from every MCP client. So for example, if I'm working with LangGraph and I have questions about LangGraph, I just plug in this MCP and my IDE, Cursor, can answer questions about LangGraph's documentation. And we provide the search tools, semantic search tools, and the fetching, and we support all sorts of formats. So really, this is an ad hoc, instant MCP server for every GitHub repo out there.
Alex Volkov
Alex Volkov 1:07:54
A question for Ido, maybe, to put to you.
1:07:56
llms.txt, which Liad mentioned, is an effort from Answer.AI and Jeremy Howard, a great effort. Could you tell us why you chose llms.txt specifically? Because not everybody supports it yet, but maybe they will. I think Jeremy Howard also shouted you guys out because of this effort. Ido, maybe take us through the llms.txt thing and why you guys chose it. In addition, I think you have a fallback to the repo as well. Mm-hmm.
Ido Salomon
Ido Salomon 1:08:14
I think llms.txt is a classic use case for this, because it
1:08:18
is targeted at LLMs, so it fits right in. In there, the format is: you have a link, and then where you can find the relevant documentation. And it's not something humans really access, obviously, and it's not even something that most agents can interact with well. So the real power of a tool like this is to say: okay, first tool, bring me the llms.txt with all the relevant links; another tool fetches the relevant link's content. So we really take something like llms.txt, which can be very difficult for people to consume, and give them an easy way to get all the data they need in a single place.
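Here is a small sketch of that two-tool flow, assuming a plain llms.txt made of markdown links; in GitMCP this happens behind MCP tools rather than as local Python functions, and the URL below is a placeholder.

```python
# Tool 1 fetches a repo's llms.txt index; tool 2 fetches one of the linked
# documents the model actually needs. Illustrative only.
import re
import urllib.request

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def list_doc_links(llms_txt: str) -> list:
    # llms.txt is markdown-ish: titles plus [name](url) links to the real docs
    return re.findall(r"\((https?://[^)]+)\)", llms_txt)

index = fetch("https://example.com/llms.txt")   # placeholder URL: the index
links = list_doc_links(index)
if links:
    page = fetch(links[0])                       # one specific document
    print(page[:500])
```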
Alex Volkov
Alex Volkov 1:08:54
Alrighty.
1:08:54
Thank you for breaking down llms.txt. Definitely, folks should adopt this; I've seen a few tools that actually generate it based on the stuff that you have. And I would love to invite folks to go and check out llms.txt. I think Jeremy Howard doesn't do stuff just because; I've seen some pushback from Pieter Levels and some web devs, but it's definitely, absolutely necessary, and there's a reason why you guys chose it. Liad, walk me through a little bit of the virality that you guys got, maybe expectedly or unexpectedly, and I would love to also hear about the Cloudflare conversion.
Liad Yosef
Liad Yosef 1:09:21
Yeah, so that was really crazy.
1:09:23
We released on Thursday night and got amazing traction and traffic. We initially built on Vercel functions, so a serverless environment, but with the crazy amount of traffic that we got, we really pushed the limit to the edge, pun intended. We really stretched this infrastructure with our SSE server; an SSE server requires a lot of memory to hold connections. And at some point Ido was telling me: hey, we're probably gonna have to start paying dozens or hundreds of dollars every hour from now on. So we had to decide what to do, and all of a sudden I got a DM from Guillermo, the CEO of Vercel, telling me: hey, do you want to be on the early release of our long-running compute, something that could solve our problems? And we started to talk about it. And then a few hours later, I think it was 4:00 AM my time, I'm getting a DM... we were still working, so we were awake... I'm getting a DM from the CTO of Cloudflare saying: hey, why are you on Vercel? The reason to move, the reason...
Alex Volkov
Alex Volkov 1:10:19
The CTO, Dane, just got this job like last week because the previous CTO just moved on,
1:10:23
so it's recent. Dane is a great dude.
Liad Yosef
Liad Yosef 1:10:25
Yeah, yeah.
1:10:26
I was actually getting three DMs from three people at Cloudflare at the same time. One of them was the CTO; the director of product was another. They were saying: yeah, do you want to try to move to Cloudflare Workers, because there's a big agents release that they did this week. He said: yeah, we are willing to sponsor everything, so all the costs will be on us. And that's not something you can easily say no to. So we started another blitz of coding.
1:10:46
It was 5:00 AM for us, and we wrote the Cloudflare team:
1:10:50
Okay, we are just gonna sleep for two hours, and then we are gonna continue working on that. And really, shout out to the Cloudflare team, which worked with us on a Saturday; we worked on a Saturday, they worked on a Saturday, just to get us migrated. It wasn't an easy migration. This story got a lot of virality; we got to the front page of the leading tech magazine in Israel, and GitMCP is generating so much traction right now. We have Jeremy Howard featuring us as the proposed solution for GitHub documentation, and we have a lot of packages, repositories, and libraries saying: hey, just use GitMCP if you want an MCP server for your documentation. So we're getting a lot of traction; I think we really hit a niche that needed to be solved here. Absolutely.
Alex Volkov
Alex Volkov 1:11:30
yeah.
1:11:31
Yam, comments on this? I know you wanted to come back with a question about GitMCP. Oh,
Yam Peleg
Yam Peleg 1:11:34
oh, yeah, yeah.
1:11:35
Look, I can speak about this for hours. I just want to emphasize what is so novel about this. There are MCPs for GitHub, GitHub tools and so on, but the thing is, what you get here is that by going to this URL, your model gets a different tool, specifically for a single GitHub repository. And this is what this tool is doing. The model can then use it in a single step: it always sees the name of the repo, exactly as it is written in the code, and each time it invokes this tool, it gets the documentation. You basically generate a different, customized tool just by requesting a slightly different URL. It's extremely novel. I haven't seen any MCP do this before, especially not at this scale; no one had ever done this at this scale, and it makes a difference.
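A tiny sketch of the URL-parameterized idea: the repo identity lives in the server URL itself, so every URL yields a tool scoped to one repository. The gitmcp.io host pattern is how I understand the project works; treat it as an assumption and check the project's README.

```python
# Mapping a GitHub repo URL to a per-repo MCP server URL (assumed host pattern).
def gitmcp_url(github_url: str) -> str:
    # e.g. https://github.com/langchain-ai/langgraph -> https://gitmcp.io/langchain-ai/langgraph
    owner_repo = github_url.split("github.com/")[1].rstrip("/")
    return f"https://gitmcp.io/{owner_repo}"

print(gitmcp_url("https://github.com/langchain-ai/langgraph"))
```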
Alex Volkov
Alex Volkov 1:12:27
Congrats, folks, on this release.
1:12:28
Very great work. I really wanted to highlight not only the crazy story that you had, but also the fact that folks can now use this in basically every chat that they have. ChatGPT is gonna come out with MCP support, Google has announced MCP support, Pichai famously, and even Demis; all these folks are saying: we support MCP. It's gonna get supported in the SDKs as well; they will definitely add this to AI Studio. We looked at Qwen, for example; Qwen is about to support MCP. Everybody is jumping in on this protocol because it makes sense, and the more people jump in on it, the more people will build on it; and the more people build, the more they will need to use different repos for tools. And I think you guys did a great one. Two things I wanted to ask you. One of them is future plans, what you have in store now that it works on Cloudflare, shout out to Cloudflare as well. And second, I would love to hear your opinion about the observability proposal that we just announced. So maybe we'll start with Ido and then Liad.
Ido Salomon
Ido Salomon 1:13:20
Yeah, I think, future plans: maybe expanding beyond GitHub
1:13:23
and going to other git hosts. Maybe it's Hugging Face; I mean, the CEO of Hugging Face also asked us, maybe we can do that for them. So there's definitely stuff on the horizon for the open source project, and I think we'll see where that takes us.
Alex Volkov
Alex Volkov 1:13:37
Awesome.
1:13:38
Yeah, we're looking forward to it; please keep us posted and we'll keep the community posted as well. Liad?
Liad Yosef
Liad Yosef 1:13:43
Yeah.
1:13:44
So yeah, like Ido said, the plan ahead is actually clear; we have our work cut out for us, and we're still doing it as a side project, working at night. I think it helps that Ido has jet lag, since he just came back from the United States, so he's been up late. And about the observability proposal: I read it, you sent it to me a few days ago. I'm a little bit involved in the protocol itself, the protocol definition itself; I'm reading a lot and giving opinions in all sorts of forums, and I think that's something that was missing. It's missing, and the proposal is really good. Like Yam said, I don't know how it didn't happen till now. We ran into that problem as well, of trying to understand what exactly is going on in terms of observability. So I really support this proposal. Great job.
Alex Volkov
Alex Volkov 1:14:22
And hopefully once it lands, if it lands, and when
1:14:25
it lands, let's be positive here, you guys will be able to implement this into GitMCP as well. Liad, there's one more thing that I wanted to talk about, though we're not technically allowed to have you comment on it. I will just say that one of the cool things we mentioned briefly is that you work at Shopify. Shopify released their own MCP, which we can talk about; you guys released your own MCP for documentation previously, and that's been public. And Shopify has done a complete pivot into AI. Look at this, I found this to be incredible. So there was a memo, and then Tobi, the CEO of Shopify, and some other folks posted the full memo on LinkedIn. I found a few things there very interesting for our audience as well. One of them is that you have to justify headcount, and this is all from the public release, right, no private information here or anything like that, but the justifying headcount thing broke my mind completely. The whole thing from Tobi's executive letter was: you have to be reflexive about AI usage, you have to immediately think about AI first before everything else, and that's how he wants the company to run. It's been very clear that you guys have been supporting the MCP spec from the start as well, within your own MCP. But also, when you as a manager ask for more headcount, you have to justify why AI couldn't take care of that extra load before getting more folks; and performance reviews will include whether or not you are using AI, which I find absolutely incredible just as an idea. So I think you work in a very AI-progressive place that fits very well with how we work here and how we think about AI on ThursdAI. Folks, with that, thank you so much for coming. You're considered friends of the pod now; GitMCP is absolutely incredible, folks, definitely give it a try. We're looking forward to more of the stuff that you guys do, although you already got shout-outs from pretty much everyone in the industry; just a Jeremy Howard high five is worth a lot. So folks in the audience, definitely give GitMCP a try, and we'll keep moving because we have so much to talk about. We are moving to big companies and APIs. Let's go. All right folks, welcome back. Our expert co-hosts are back on stage because we have a bunch to cover, and we haven't gotten to, I think, half of it. But now, AI breaking news coming at you, only on ThursdAI. Folks, we got breaking news from OpenAI. Sam Altman just posted a hype tweet about a feature that he can't sleep at night because of, and there have only been like three features like this previously. LDJ, would you walk us through the breaking news announcement, please?
LDJ
LDJ 1:16:50
So the tweet says: starting today, memory in ChatGPT can now reference
1:16:55
all of your past chats to provide more personalized responses, drawing on your preferences and interests to make it even more helpful for writing, getting advice, learning, and beyond. There is no storage limit for what ChatGPT can reference when reference chat history is turned on.
Alex Volkov
Alex Volkov 1:17:13
So unlike the memory stuff, where ChatGPT selectively looks into
1:17:20
your chats and decides, oh, this piece is very important for the user, what they just announced is a complete review of all of my chats, and I have thousands of chats. How are they doing this? Can we take a guess?
Nisten
Nisten 1:17:34
It's probably some LangGraph stuff, something going on.
1:17:37
I don't think it's a RAG. Yeah.
LDJ
LDJ 1:17:39
Yeah.
1:17:39
I think it's maybe at least some fancy vector embedding or something happening in the background, something like that. It might be also,
Nisten
Nisten 1:17:48
yeah, might be a tiny model.
LDJ
LDJ 1:17:50
Yeah.
1:17:50
I think it's still cool and still a step forward, but it's probably not all in the context. And maybe in certain ways it might even be better than if it was all in the context; it depends what we're comparing. But yeah, this is interesting. Personally, I think I've had this feature on my ChatGPT for at least a couple of weeks now. To be honest, it's not a huge difference, but I suspect this is one of those things where, as the models get better and reasoning gets better and models keep scaling, et cetera, over the next year or so the models will be able to take advantage of this feature even more, with higher resolution and even better understanding of your past chats.
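Purely as speculation on the mechanism being guessed at here, this is what an embed-and-retrieve approach over past chats could look like; nothing in it reflects how OpenAI actually built the feature, and the embedding model name is just an example.

```python
# Embed past conversations once, then at question time retrieve the nearest
# ones and inject only those into the model's context. Speculative sketch.
import numpy as np
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment
past_chats = [
    "User plans a ski trip to Colorado in April.",
    "User asks about fine-tuning a 7B model on a single GPU.",
    "User complains their podcast follower count is out of date.",
]

def embed(texts: list) -> np.ndarray:
    out = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in out.data])

chat_vecs = embed(past_chats)                         # done once, stored
query_vec = embed(["What did I say about my followers?"])[0]

scores = chat_vecs @ query_vec / (
    np.linalg.norm(chat_vecs, axis=1) * np.linalg.norm(query_vec)
)
top = [past_chats[i] for i in np.argsort(scores)[::-1][:2]]
print(top)  # these snippets, not the full history, would become context
```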
Alex Volkov
Alex Volkov 1:18:31
We'll have to figure out if this has launched for us as well.
1:18:34
So I think it's a great addition. I was voting for this, honestly, because I keep getting annoyed by the fact that it remembered, let's say I talk about ThursdAI, and it remembered from, I don't know, a year ago how many followers I have. And now every time I chat about this, it's like: as a person on X with 24,000 followers... bro, we have more now. Which, speaking of which, folks, I think we did it: for the hundredth episode, we're now past a thousand followers on YouTube, which means we're clearly in the creator space on YouTube. So shout out and thank you, everybody, for joining. Thank you, folks, I'm very, very excited. I invite all of you to subscribe and turn on notifications on YouTube so you'll always know when we go live and you don't have to go to X, et cetera. So shout out and thank you everybody for subscribing, thank you to the YouTube crowd. YouTube crowd, please throw a 100 in the chat to celebrate a hundred episodes; let's do this in the chat as well. Alright folks, we're moving on.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:19:25
One thing about this release, though, because I
1:19:27
don't know if I really want every one of my chats to be in the memory.
Alex Volkov
Alex Volkov 1:19:31
I don't know if you're gonna get it, because Europe
1:19:33
is always... yeah, maybe I don't know.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:19:35
I know Sam is listening, of course; I'm sure he watches.
1:19:38
So, feature request: let us flag chats for exclusion or inclusion or something. Make it configurable. Thanks, Sam.
Alex Volkov
Alex Volkov 1:19:44
Yes, absolutely.
1:19:45
There are a few things that I've asked where I'm like, hmm, I don't know if I want a complete memory of all of this. Folks, let's talk about big companies and APIs. I'm very, very excited about this. There's a whole conversation we need to have about the A2A protocol; maybe we'll do a deep dive in a future episode, because I don't think we have enough time to cover it fully, and I'm really, really excited about it. So I'm actually thinking we'll do a full deep dive; I'll get some folks from Google to come and talk to us about it as well. But Google announced a bunch of new things; let's walk through some of them. One of them is their new TPU, the seventh-generation TPU. That's exciting news, maybe only for Google; I don't know how many folks get excited about TPUs. Nisten, you're saying no? What's exciting about this one for you?
Nisten
Nisten 1:20:26
If you look at the uptime for models, if you look at Claude's
1:20:29
uptime, which OpenRouter tells you, Google's is always perfect. If you look at Anthropic's own hosting or the AWS one, sometimes it's 60%, sometimes it's 80% uptime for the whole day, which is pretty terrible for an enterprise product. Google's is always perfect. Even if they don't have the best models, they might just win on infrastructure alone.
Alex Volkov
Alex Volkov 1:20:50
Gemini 2.5 is kind of fire.
1:20:53
Another thing that they announced is Gemini 2.5 Flash, which is a new model we haven't heard of before; there was Gemini 2.0 Flash, and now Gemini 2.5 Flash, a workhorse model with low latency and cost efficiency, will soon be available in Vertex AI. This is a semi-announcement; we don't love covering these "hey, something is coming" announcements, but I'll still mention it. I should also mention that OpenRouter has two new, let's say, special secret models that you can try. Both of them have 1 million context, and if you ask them, they say they're trained by Google. Yeah, I think you asked it as well and it said "I'm trained by Google," but we don't know if it's actually trained by Google; it could be something else, and we don't know who this is. I think we have a confirmation from Junyang that it's not Qwen; I think I saw something like that pass by. Google also announced Veo 2 editing capabilities. If folks remember, in the very vast space of video generation models, Veo 2 from Google is absolutely up there, beating Sora; we had folks here to talk about it compared to Sora and the other ones, Kling and Hunyuan Video, and Minimax; there are a bunch of video models out, Runway released Gen-4, and Veo 2 is still one of the top ones, significantly. And now they added editing capabilities to it, and to Imagen 3 as well. Another thing that they announced is Deep Research. Did you guys see this from Gemini? Okay folks, here's an announcement for you: Deep Research with 2.5 Pro. They integrated the best thinking model that they have into Deep Research.
Nisten
Nisten 1:22:16
I used it quite a bit.
1:22:17
I did run quite a few queries, and it was good. This is just like OpenAI, or at least how OpenAI's deep research was a few weeks ago, when it was just giving you 30-page reports; this still cannot do that. I am still waiting for something that just dumps a whole report. I'm also very disappointed that they don't list the sources when you copy-paste it or when you export it to a Google Doc. I really want those damn sources, and instead you have to go through the UI and try to find them one at a time, or make an agent for it, which is pretty stupid. Just release the sources you're already showing on the website; why do I have to make a separate tool? So, a feature request for them, for sure.
Alex Volkov
Alex Volkov 1:22:58
the comparison that they released now, not
1:23:00
the previous deep research, but the 2.5 experimental deep research: a comparative evaluation of Gemini versus OpenAI deep research, by human raters. Obviously this was released by Google, so take it with a grain of salt, but overall, 69.9% of people preferred the Gemini 2.5 deep research results over OpenAI's deep research results, versus around 30%. On instruction following, 60% of people say it follows instructions better. And again, deep research from OpenAI was, to me, another ChatGPT moment. Remember we talked about Dr. Unutmaz here? He said it gave him full-on scientific research, a patent search, prior art, a bunch of stuff; incredible, incredible work. I keep using deep research as well. Reminder that OpenAI recently gave deep research to the ChatGPT $20-a-month tier. This one is free for folks right now on the free tier, but basically, for 20 bucks a month, you'll get unlimited use. On comprehensiveness, 76% preferred it over OpenAI's deep research; very comprehensive. If there's one thing you would use to describe OpenAI's deep research product, it's comprehensiveness; it sometimes gives you way too much to read. So I find those metrics incredible. I haven't tried it fully yet, because until just now I didn't get the 2.5 version, so I'll definitely give it a try and let you guys know. Also on the docket of Google releases: another thing that they launched is Firebase Studio. Did you guys see this, Firebase Studio? So basically we know all of these apps that started to show up, all of these tools where you can just... and why did it drop me into the documentation of Firebase Studio? Let me find this super quick. Interesting, Firebase Studio. Yeah. So, all of these companies where you can just drop a whole spec of the website you want to build and they'll build it online for you; unlike the IDEs, like Windsurf and Cursor, these are complete vibe-coding, let's say, websites and apps. Google has basically said: hey, we have a cloud-based agentic development environment designed to accelerate how you build, test, deploy, and run production-quality applications, all in one place. And basically this is a reframe and rebuild of something they had, Project IDX, which they announced at Google I/O before.
Nisten
Nisten 1:25:02
firebase.studio.
1:25:04
That's the URL.
Alex Volkov
Alex Volkov 1:25:05
Oh, that's easy.
1:25:06
Yeah, firebase.studio. Let's go: a full-stack AI workspace, and you can try it and build full-stack things in there, and it looks incredible. So shout out to the Google folks for the launch, and I would love to invite the community to tell us how your experience with it goes. I think we have more breaking news, folks. We got more breaking news, let's go. Yeah, I think it's around ByteDance. Let's do it. I love breaking news. AI breaking news, coming at you only on ThursdAI. Alrighty folks, we have ByteDance coming with Seed-Thinking v1.5, a 200 billion parameter reasoning model with only 20 billion active parameters that beats DeepSeek R1 across domains. Let's take a look at the chart here; we have Seed-Thinking. Interestingly, they did not compare to Llama 4 Maverick or Scout, but they compared to DeepSeek R1 and Gemini 2.5 Pro and o3-mini-high, and it looks like on some comparisons they are beating DeepSeek R1, but not quite getting up to Gemini 2.5 Pro or o3-mini-high. But this is an open model, correct? I think it's an open model... I'm not sure, I'll check. And they released... let's see if this person has a link here. So we got a GitHub repo. Actually, no, I don't see the model itself; I see some results. We have a technical report from them, we have the scores, 74 on AIME, which is, yeah, impressive. It beat DeepSeek R1 on multiple things, and it's a smaller model. And they say results are from an internal sandbox, which may differ from reported results. But yeah, I don't see the model itself, so it's just an announcement for now. Still in breaking news: ByteDance has been killing it, with OmniHuman that was released last week, which we talked about, and just general lip-syncing stuff. Shout out to ByteDance for this Seed-Thinking release, and hopefully they'll drop the weights, because they've been dropping open source models as well. Yeah, we mentioned before the official support for MCP; I just wanna highlight that I'm very, very excited when Demis and Sundar Pichai basically both say: MCP is a good protocol, rapidly becoming an open standard for the agentic era; we're excited to announce we'll be supporting it for Gemini models and the SDK, and we look forward to developing it further with the MCP team and others in the industry. This is great, this is absolutely incredible as support, and it just solidifies MCP as the one protocol that tools will use. And folks, I think we're getting to the end of the show; let's run through the rest super quickly. Let's see if we missed anything. Cloudflare's new agents SDK, okay. Grok 3 gets an API tier; finally you are able to test the claims that Grok is the best model. OpenAI adds memory; we chatted about this. I think I'm just gonna add HiDream-I1 Dev. Wolfram, have you seen this? Anybody, have you guys seen this?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:28:03
I saw that image,
1:28:04
what you are showing now, but I haven't looked into it in more detail.
Alex Volkov
Alex Volkov 1:28:08
It looks like a better Flux.
1:28:09
Oh, okay. Yeah. This beats Flux, and beats Recraft, and beats Reve, all of these. Behind the scenes there's a whole race of image models trying to reach a certain level; right now OpenAI's GPT-4o, because you can just talk to it, is basically the best one, and we've all Ghibli-fied ourselves multiple times now and done the Muppets and everything. But there are diffusion models too, and there's a new one coming out from a company called HiDream. HiDream-I1 Dev, on the Artificial Analysis image arena leaderboard, beats Flux and beats Imagen and beats Ideogram and beats all of these folks, at around an 1100 ELO score, and comes very close to OpenAI. So if you haven't tried HiDream, give it a try; the instruction following is quite incredible. I think that is it for us. Here are some comparisons to Flux, and the instruction following looks very, very good as well, and it's great at text too. One thing that I don't know yet is whether or not this model is fine-tunable, but it is open source. With Flux, only the dev version is open source, the pro version is not, and this beats the pro version of Flux, so maybe it's a call to arms to the Flux team to release the pro weights. I think that we've covered most of it. We didn't mention Nova Sonic from Amazon, a voice model; maybe we can play a sample later, but basically it's a speech-to-speech model that sounds super cool, and I know, Wolfram, we generally geek out with you on this. But I think we're coming up on the end of the episode, exactly two hours, and I wanna close this episode with the fact that multiple things happened during this hundredth episode of ThursdAI. First of all, folks, a hundred episodes is a fucking lot of time that we've spent here with all of you in the community. Every time I finish the show, I'm very full of gratitude to everybody who listens. We had more than 1,300 folks tune in across different streams and also on X; a bunch of folks reshared our content and enjoyed it, and many more folks follow along and get the news from us. So first of all, shout out to you guys for showing up and clearly explaining the news. I love the fact that we have guests from the community who are actually working on the stuff that we announce, and that we have the ability to spotlight them, even if, compared to an OpenAI release, it's not necessarily going to change the news as much. We wanna highlight the folks who work on open source and do a lot selflessly, like the folks who built something in their own time to support and really help tons of people. So I love to highlight great work from the community. Obviously it's great to also cover the huge companies, and if we're involved, like with the A2A protocol, that's even better. But generally, we want to let you know what is the latest and greatest in the world of AI, and it's been absolutely my pleasure, and I think I speak for all of us, it's been our pleasure, to deliver a hundred episodes of ThursdAI to you guys and spend them with the community. It's been growing like crazy; we just passed a thousand followers on YouTube, and the next target is 10,000. So tell your friends and your loved ones to subscribe too, and if you have multiple YouTube accounts, definitely do that as well. With that, I think we'll conclude the show for today.
If you missed any part of ThursdAI, don't worry; just subscribe to the ThursdAI newsletter and you'll get everything: all of the links and mentions and people, everything that we mentioned you'll get in the show notes, including a breakdown and summary of the conversation that we had. This has been the hundredth episode. Anything to add, folks, before we conclude?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:31:31
To the next 100!
Alex Volkov
Alex Volkov 1:31:32
To the next 100.
1:31:36
Let's go. All right folks, with this we're gonna end the stream for today. Thank you for joining ThursdAI, have a great Thursday, and we'll be back with you here next week. Bye bye.