Episode Summary

Welcome back to ThursdAI! And folks, what an _absolutely insane_ week it's been in the world of AI. Seriously, as I mentioned on the show, we don't often get weeks _this_ packed with game-changing releases.

Hosts & Guests

Alex Volkov - Host · W&B / CoreWeave (@altryne)
Tulsee Doshi - Senior Director & Head of Product, Gemini Models · Google DeepMind (@tulseedoshi)
Morgan McQuire - Engineer · Weights & Biases (@morgymcg)
Prince Canuma - ML Developer & OSS Contributor · MLX Community (@Prince_Canuma)
Nisten Tahiraj - Weekly co-host of ThursdAI · AI operator & builder (@nisten)
Yam Peleg - Weekly co-host of ThursdAI · AI builder & founder (@Yampeleg)
Wolfram Ravenwolf - Weekly co-host, AI model evaluator · Independent AI evaluator (r/LocalLLaMA) (@WolframRvnwlf)

By The Numbers

Big CO LLMs + APIs

  • 2.5 – Google came out swinging this week, dropping Gemini 2.5 Pro and taking back the crown for the best all-around LLM currently available.
  • 20 – We saw massive jumps on benchmarks like AIME (up nearly 20 points!) and GPQA.
  • 13 – My own testing on reasoning tasks confirms this – the latency is surprisingly low for such a powerful model (around 13 seconds on my hard reasoning questions compared to 45+ for others), and the accuracy is the highest I've seen yet at 66% on that specific challenging set.
  • 1M – It also inherits the strengths of previous Gemini models – native multimodality and that massive long context window (up to 1M tokens!).
  • 120 – The performance on long context tasks, like the needle-in-a-haystack test shown on Live Bench, is truly impressive, maintaining high accuracy even at 120k+ tokens where other models often falter significantly.

🔓 Big CO LLMs + APIs

Okay, let's start with the big news. Google came out swinging this week, dropping Gemini 2.5 Pro and, based on the benchmarks and our initial impressions, taking back the crown for the best all-around LLM currently available. Check out the X announcement and the official blog post, and seriously, go try it yourself at ai.dev. We were super lucky to have Tulsee Doshi, who leads the product team for Gemini modeling efforts at Google, join us on the show to give us the inside scoop.

📰 GPT-4o got another update (as I'm writing these words!) tied for #1 on LMArena, beating 4.5

How much does Sam want to win over Google? So much he's letting it ALL out. Just now, we saw an update from LMArena and Sam about a NEW GPT-4o (2025-03-26), which jumps over GPT-4.5 to tie for #1 on the leaderboard.

🔓 Open Source LLMs

The open-source community wasn't sleeping this week either, with some major drops! The Whale Bros at DeepSeek silently dropped an update to their V3 model, DeepSeek-V3-0324. This isn't R1 (their reasoning model), but the powerful base model that R1 was built upon (and presumably R2, when it comes out).

🎨 AI Art & Diffusion & Auto-regression

This was arguably where the biggest "mainstream" buzz happened this week, thanks mainly to OpenAI. This felt like a direct response to Gemini 2.5's launch, almost like OpenAI saying, "Oh yeah? Watch this!" They _finally_ enabled the native image generation capabilities within GPT-4o (Blog, Examples).


🤖 This Week's Buzz + MCP (X, Github)

Bringing it back to Weights & Biases for a moment. We had Morgan McQuire on the show, who heads up our AI Applied team, to talk about something we're really excited about internally – integrating MCP with Weave, our LLM observability and evaluation tool. Morgan showed a demo, and the team has shipped the MCP server, which you can try right now! Coming soon is the integration with wandb models, which will allow ML folks around the world to build agents that monitor loss curves for them!
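The wandb models integration isn't out yet, so here's just a hedged sketch of the core check a loss-monitoring agent might run. Everything below is a hypothetical illustration (the function name and thresholds are made up, and it deliberately uses no W&B APIs – a real agent would pull the loss history from the run instead of hard-coded lists):

```python
def loss_is_diverging(losses, window=5, tolerance=0.0):
    """Flag a run whose recent loss trend is rising.

    Compares the mean of the last `window` losses against the mean of
    the `window` before it; a higher recent mean suggests divergence.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge yet
    recent = sum(losses[-window:]) / window
    previous = sum(losses[-2 * window:-window]) / window
    return recent > previous + tolerance

# A healthy curve keeps falling; a diverging one turns back up.
healthy = [2.0, 1.6, 1.3, 1.1, 0.9, 0.8, 0.7, 0.65, 0.6, 0.58]
diverging = [2.0, 1.6, 1.3, 1.1, 0.9, 1.0, 1.3, 1.7, 2.2, 2.9]

print(loss_is_diverging(healthy))    # → False
print(loss_is_diverging(diverging))  # → True
```

An agent wired up through MCP could run a check like this on a schedule and ping you only when the trend flips.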

🤖 Agents, Tools & MCP

And speaking of MCP... This was HUGE news, maybe slightly overshadowed by the image generation, but potentially far more impactful long-term, as Wolfram pointed out right at the start of the show. OpenAI officially announced support for the Model Context Protocol (MCP) (docs here). Why is this massive? Because instead of every vendor pushing its own tool-connection format, the industry can converge on one standard (HD DVD – standards wars suck!).
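For a sense of what "supporting MCP" means on the wire: the protocol is built on JSON-RPC 2.0, where a client opens a session with `initialize`, discovers tools with `tools/list`, and invokes one with `tools/call`. Here's a minimal sketch of those message shapes (the tool name `query_evals` and its arguments are invented for illustration; check the MCP spec for the full schemas):

```python
import json

def make_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request of the kind MCP clients send."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# 1. Open the session and negotiate capabilities.
init = make_request(1, "initialize", {
    "protocolVersion": "2024-11-05",  # a published MCP protocol revision
    "capabilities": {},
    "clientInfo": {"name": "demo-client", "version": "0.1"},
})

# 2. Ask the server what tools it exposes.
list_tools = make_request(2, "tools/list")

# 3. Call one of them by name with JSON arguments.
call_tool = make_request(3, "tools/call", {
    "name": "query_evals",            # hypothetical tool name
    "arguments": {"project": "my-project"},
})

print(call_tool)
```

Because every MCP server speaks this same framing, a tool written once works with any client that adopts the protocol – which is exactly why OpenAI signing on matters.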

🔊 Voice & Audio

Just one more quick update on the audio front: alongside the image generation, OpenAI also quietly updated the advanced voice mode in ChatGPT (YT announcement). This should lead to a much more natural conversation flow.

🔊 MLX-Audio

And speaking (heh) of audio and speech, we had the awesome Prince Canuma on the show – you probably know Prince. He's the MLX King, the creator and maintainer of essential libraries like MLX-VLM (for vision models), FastMLX, MLX Embeddings, and now MLX-Audio. Seriously, huge props to Prince and the folks in the MLX community for making these powerful open-source models accessible on Mac hardware.
TL;DR and Show Notes:
  • Guests and Cohosts

    • Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
      Co Hosts - Wolfram Ravenwolf (@WolframRvnwlf), Nisten Tahiraj (@nisten), Yam Peleg (@yampeleg)

    • Tulsee Doshi - Head of Product, Gemini Models at Google DeepMind (@tulseedoshi)

    • Morgan McQuire - Head of AI Applied Team at Weights & Biases (@morgymcg)

    • Prince Canuma - ML Research Engineer, Creator of MLX Libraries (@Prince_Canuma)

  • Big CO LLMs + APIs

    • 🔥 Google reclaims #1 position with Gemini 2.5 Pro (thinking) - (X, Blog, Try it)

    • ARC-AGI 2 benchmark revealed - Base LLMs score 0%, thinking models 4%.

  • Open Source LLMs

    • Deepseek updates DeepSeek-V3-0324 685B params (X, HF) - MIT License!

    • Qwen launches an Omni 7B model - perceives text, image, audio, video & generates text and speech (HF)

  • AI Art & Diffusion & Auto-regression

  • This week's Buzz + MCP

    • Weights & Biases Weave official MCP server tool - talk to your evals! (X, Github)

  • Agents, Tools & MCP

    • OpenAI has added support for MCP - MCP WON! (Docs)

  • Voice & Audio

    • OpenAI updates advanced voice mode with semantic VAD for more natural conversations (YT announcement).

    • MLX-Audio v0.0.3 released by Prince Canuma (Github)

  • Show Notes and other Links

    • Catch the show live & subscribe to the newsletter/YouTube: thursdai.news/yt

    • Try Gemini 2.5 Pro: AI.dev

    • Learn more about MCP from our previous episode (March 6th).