Episode Summary

Wow. Just… wow. What a week, folks. Seriously, this has been one for the books.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne

Todd Segal
Product Manager · Google

Yam Peleg
Weekly co-host of ThursdAI · AI builder & founder
@Yampeleg

Nisten Tahiraj
Weekly co-host of ThursdAI · AI operator & builder
@nisten

Wolfram Ravenwolf
Weekly co-host, AI model evaluator · Independent AI evaluator (r/LocalLLaMA)
@WolframRvnwlf

By The Numbers

  • o3 is not only SOTA on nearly all possible logic, math, and code benchmarks, which is to be expected from the top reasoning model; it is also, I think for the first time, able to use tools during its reasoning process.
  • o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more.
  • $65k: score on the Freelancer eval (simulating earning money on Upwork), compared to o1's $28k.
  • 99.5%: score on AIME (math problems) when allowed to use the Python interpreter; it also beats the older o3-mini on general tasks.
  • 200k: current context window (unlike 4.1's 1M), but performance within that window is unparalleled.

🎨 OpenAI o3 & o4‑mini: SOTA Reasoning Meets Tool‑Use (Blog, Watch Party)

The long-awaited o3 models (promised to us in the last days of x-mas) are finally here, and they did NOT disappoint; well... they even surprised!

  • The long-awaited o3 models are finally here, and they did NOT disappoint.
  • They use tools like searching the web, Python coding, and image gen to get to incredible responses faster.
  • o3 can zoom and rotate and crop images during reasoning (it's nuts).

🎨 Thinking visually with images

This one also blew my mind: this model is SOTA on multimodality tasks, and a reason for this is that these models can manipulate and think about the images they receive. Think cropping, zooming, rotating.

  • This one also blew my mind: the model is SOTA on multimodality tasks because it can manipulate and think about the images it receives.
  • The models can now apply all these operations to multimodal requests from users.
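To make the idea concrete, here's a toy sketch (my own illustration, not OpenAI's implementation) of the kinds of transforms being described, crop, zoom, and rotate, applied to a tiny grid of pixel values standing in for an image:

```python
# Toy sketch of the image transforms o3 reportedly applies while reasoning.
# A "pixel grid" (list of lists) stands in for a real image.

def crop(img, top, left, height, width):
    """Return the sub-grid starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

def zoom(img, factor):
    """Nearest-neighbour upscale by an integer factor."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img for _ in range(factor)
    ]

def rotate90(img):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

img = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
print(crop(img, 0, 1, 2, 2))  # [[2, 3], [5, 6]]
print(rotate90(img))          # [[7, 4, 1], [8, 5, 2], [9, 6, 3]]
```

The point is that these are cheap, deterministic operations; letting the model invoke them mid-reasoning is what makes "zooming into" a detail of a photo possible.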

📰 Benchmark Dominance: As expected, these models crush existing benchmarks.

o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k! o4-mini is no slouch either.

  • o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more.
  • It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k!
  • It hits 99.5% on AIME (math problems) when allowed to use its Python interpreter and beats the older o3-mini on general tasks.

🔓 OpenAI open sources MRCR eval and Codex (MRCR HF, Codex GitHub)

Let's face it, this isn't the open-source OpenAI coverage I was hoping for. Sam promised us an open-source model, and they are about to drop it, I'd assume close to Google I/O (May 20th) to steal some thunder. But OpenAI did make open-source waves this week in addition to the above huge stories.

  • But OpenAI did make open-source waves this week in addition to the above huge stories.
  • MRCR is a way to evaluate complex long-context tasks; OpenAI took this Gemini research and open-sourced a dataset for the eval.
  • The best part about the Codex CLI is its hardened security: it uses Apple Seatbelt, which limits execution to the current directory + temp files (on a Mac at least).
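For intuition, here's a toy sketch of an MRCR-style task (my own illustration; the actual openai/mrcr dataset format differs): the conversation buries several near-identical requests among distractors, and the model must reproduce the answer to one specific occurrence.

```python
import random

def make_mrcr_example(n_needles=4, target=2, seed=0):
    """Build a toy MRCR-style conversation: `n_needles` identical requests,
    each with a unique answer; the eval asks for answer number `target`."""
    rng = random.Random(seed)
    messages, answers = [], []
    for i in range(n_needles):
        # Distractor turns padding out the context.
        messages.append({"role": "user", "content": f"Tell me random fact #{rng.randint(100, 999)}."})
        messages.append({"role": "assistant", "content": "Sure, here is a fact."})
        # The repeated "needle" request with a unique, checkable answer.
        answer = f"poem-{i}-{rng.randint(1000, 9999)}"
        messages.append({"role": "user", "content": "Write a short poem about otters."})
        messages.append({"role": "assistant", "content": answer})
        answers.append(answer)
    question = f"Repeat, word for word, otter poem number {target}."
    return messages, question, answers[target - 1]

messages, question, expected = make_mrcr_example()
# A model is scored on whether its reply to `question`, given `messages`
# as context, matches `expected` exactly.
```

Scale the distractor turns up to hundreds of thousands of tokens and this gets hard fast, which is exactly what the eval probes.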

⚡ This Week's Buzz: Playground Updates & A Deep Dive into A2A

On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape. [TK: Alex video] With all these new models dropping, how do you actually _choose_ which one is best for _your_ application? You need to evaluate!

  • On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape.
  • With all these new models dropping, how do you actually _choose_ which one is best for _your_ application?
  • Our W&B Weave Playground now has full support for the new GPT-4.1 family and the o3/o4-mini models.
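Since the A2A deep dive with Todd comes up later in the show, here's a minimal sketch of the protocol's central artifact, the "Agent Card" an agent publishes so other agents can discover it. The field names follow my reading of the public A2A spec draft; the agent itself and all values are hypothetical:

```python
import json

# Hypothetical A2A "Agent Card" for a toy agent. Per the public A2A spec
# draft, agents serve a JSON document like this (typically at
# /.well-known/agent.json) so peers can discover their endpoint,
# capabilities, and skills.
agent_card = {
    "name": "currency-helper",         # hypothetical agent name
    "description": "Toy agent that converts currencies.",
    "url": "https://example.com/a2a",  # placeholder endpoint
    "version": "0.1.0",
    "capabilities": {"streaming": True},
    "skills": [
        {
            "id": "convert",
            "name": "Convert currency",
            "description": "Convert an amount between two currencies.",
        }
    ],
}

print(json.dumps(agent_card, indent=2))
```

Discovery via a static JSON card is what lets agents from different vendors find and call each other without prior coordination.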

🔊 Voice & Audio: Talking to Dolphins? 🐬

In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma. It's a ~400M parameter audio model based on the Gemma architecture (using SoundStream for audio tokenization) trained specifically on decades of recorded dolphin clicks, whistles, and pulses. The goal?

  • In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma.
  • To decipher the potential syntax and structure within dolphin communication and eventually enable rudimentary two-way interaction using underwater communication devices.
  • It runs on a Pixel phone for field deployment.

🎨 AI Art & Diffusion & 3D: Seedream Challenges the Champs

ByteDance wasn't just busy with video; their Seed team announced Seedream 3.0, a powerful bilingual text-to-image model. Highlights:

  • ByteDance wasn't just busy with video; their Seed team announced Seedream 3.0, a powerful bilingual text-to-image model.
  • Generates native 2048x2048 images.
  • Fast inference (~3 seconds for 1Kx1K on an A100).

TL;DR and Show Notes

Everything we covered today in bite-sized pieces with links!

  1. Hosts and Guests

    1. Alex Volkov - AI Evangelist & Weights & Biases (@altryne)

    2. Co-hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed

    3. Todd Segal - Principal Software Engineer @ Google - Working on A2A Protocol

  2. Big CO LLMs + APIs

    1. 👑 OpenAI launches o3 and o4-mini in ChatGPT & API (Blog, Our Coverage, o3 and o4-mini announcement)

    2. OpenAI launches GPT 4.1, 4.1-mini and 4.1-nano in API (Our Coverage, Prompting guide)

    3. 🚨 Google launches Gemini 2.5 Flash with controllable thinking budgets (Blog Post - Placeholder Link, API Docs)

    4. Mistral Classifier Factory

    5. Claude does research + workspace integration (Blog)

    6. Cohere Embed‑4 — Multimodal embeddings for enterprise search (Blog, Docs Changelog, X)

  3. Open Source LLMs

    1. OpenAI open sources MRCR Long‑Context Benchmark (Hugging Face)

    2. Microsoft BitNet b1.58 (HF)

    3. INTELLECT‑2 — Prime Intellect's 32B "globally‑distributed RL" experiment (Blog, X)

    4. Z.ai (previously ChatGLM) + GLM‑4‑0414 open‑source family (X, HF Collection, GitHub)

  4. This week's Buzz + MCP/A2A

    1. Weave playground support for GPT 4.1 and o3/o4-mini models (X)

    2. Chat with Todd Segal - A2A Protocol (GitHub Spec)

  5. Vision & Video

    1. Veo‑2 Video Generation in GA, Gemini App (Dev Blog)

    2. Kling 2.0 Creative Suite (X, Blog)

    3. ByteDance publishes Seaweed-7B, a video generation foundation model (seaweed.video)

  6. Voice & Audio

    1. DolphinGemma β€” Google AI tackles dolphin communication (Blog)

  7. AI Art & Diffusion & 3D

    1. Seedream 3.0 bilingual image diffusion – ByteDance (Tech post, arXiv, AIbase news)

  8. Tools

    1. OpenAI debuts Codex CLI, an open source coding tool for terminals (Github)

    2. Use o3 with Windsurf (which OpenAI is rumored to be buying for $3B) via the Mac app integration + write-back + multiple files