Episode Summary
Wow. Just… wow. What a week, folks. Seriously, this has been one for the books.
In This Episode
- OpenAI o3 & o4-mini: SOTA Reasoning Meets Tool-Use (Blog, Watch Party)
- Thinking visually with images
- Benchmark Dominance: As expected, these models crush existing benchmarks.
- OpenAI open sources MRCR eval and Codex (MRCR: HF, Codex: GitHub)
- This Week's Buzz: Playground Updates & A Deep Dive into A2A
- Voice & Audio: Talking to Dolphins?
- AI Art & Diffusion & 3D: Seedream Challenges the Champs
OpenAI o3 & o4-mini: SOTA Reasoning Meets Tool-Use (Blog, Watch Party)
The long-awaited o3 models (promised to us in the last days of X-mas) are finally here, and they did NOT disappoint and, well... even surprised!
- Tools like searching the web, Python coding, image gen (which it...
- can zoom and rotate and crop images, it's nuts) to get to incredible responses faster.
Thinking visually with images
This one also blew my mind. This model is SOTA on multimodal tasks, and one reason is that these models can manipulate and think about the images they receive. Think... cropping, zooming, rotating.
- The models can now perform all these operations in response to multimodal requests from users.
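To make "cropping, zooming, rotating" concrete, here is a toy sketch of those three operations on a tiny grayscale image stored as a list of rows. It only illustrates the kind of image manipulation described above. It is not how o3 or o4-mini do it internally, and the helper names (`crop`, `zoom`, `rotate_cw`) are made up for this example.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values (hypothetical toy format)


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def zoom(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscale: repeat every pixel `factor` times on both axes."""
    out: Image = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def rotate_cw(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# "Look closer" at the top-right corner, then enlarge it, then turn it.
detail = crop(image, top=0, left=1, height=2, width=2)
enlarged = zoom(detail, 2)
turned = rotate_cw(detail)

print(detail)    # [[2, 3], [5, 6]]
print(enlarged)  # [[2, 2, 3, 3], [2, 2, 3, 3], [5, 5, 6, 6], [5, 5, 6, 6]]
print(turned)    # [[5, 2], [6, 3]]
```

A real pipeline would use an imaging library on actual pixels; the point here is just that each step is a small, composable transformation the model can chain while reasoning.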
Benchmark Dominance: As expected, these models crush existing benchmarks.
o3 sets new State-of-the-Art (SOTA) records on Codeforces (coding competitions), SWE-bench (software engineering), MMMU (multimodal understanding), and more. It scored a staggering $65k on the Freelancer eval (simulating earning money on Upwork) compared to o1's $28k! o4-mini is no slouch either.
- o4-mini hits 99.5% on AIME (math problems) when allowed to use its Python interpreter, and beats the older o3-mini on general tasks.
OpenAI open sources MRCR eval and Codex (MRCR: HF, Codex: GitHub)
Let's face it, this isn't the open source OpenAI coverage I was hoping for. Sam promised us an open source model, and they are about to drop it, I'd assume close to Google I/O (May 20th) to steal the thunder. But OpenAI did make open source waves this week, in addition to the huge stories above.
- MRCR is a way to evaluate complex long-context tasks; OpenAI took this Gemini research and open sourced a dataset for the eval.
- The best part about the Codex CLI is its hardened security: it uses Apple Seatbelt, which limits its execution to the current directory + temp files (on a Mac at least).
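The "current directory + temp files" restriction can be pictured as a path allowlist. The sketch below is a plain Python illustration of that policy only. It is not how Apple Seatbelt or Codex CLI work internally (Seatbelt enforces its rules at the OS level), and `is_write_allowed` is a hypothetical helper invented for this example.

```python
import tempfile
from pathlib import Path
from typing import Optional


def is_write_allowed(target: str, workdir: str, tmpdir: Optional[str] = None) -> bool:
    """Return True if `target` resolves inside `workdir` or the temp directory.

    Relative targets are interpreted relative to `workdir`. Paths are resolved
    first, so "../" tricks cannot escape the allowed roots.
    """
    roots = [
        Path(workdir).resolve(),
        Path(tmpdir or tempfile.gettempdir()).resolve(),
    ]
    resolved = (Path(workdir) / target).resolve()
    return any(resolved == root or root in resolved.parents for root in roots)


# Two throwaway directories stand in for the project dir and the temp dir.
with tempfile.TemporaryDirectory() as project, tempfile.TemporaryDirectory() as scratch:
    print(is_write_allowed("src/main.py", project, scratch))              # inside project
    print(is_write_allowed("../outside.txt", project, scratch))           # escapes project
    print(is_write_allowed(str(Path(scratch) / "a.log"), project, scratch))  # inside temp
```

The design choice worth noticing is "resolve, then compare": checking the raw string prefix instead would let `project/../elsewhere` slip through.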
This Week's Buzz: Playground Updates & A Deep Dive into A2A
On the Weights & Biases front, it's all about enabling developers to navigate this new model landscape. With all these new models dropping, how do you actually _choose_ which one is best for _your_ application? You need to evaluate!
- Our W&B Weave Playground now has full support for the new GPT-4.1 family and the o3/o4-mini models.
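A minimal sketch of what "you need to evaluate" can look like in practice: run every candidate model over the same small dataset with the same scorer, then compare the averages. Everything here is a stand-in. The two "models" are hard-coded stub functions rather than real API calls, the dataset is invented, and this deliberately does not use the Weave API.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (prompt, expected answer)


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the answers match, ignoring case and surrounding spaces."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def evaluate(model: Callable[[str], str], dataset: List[Example]) -> float:
    """Average exact-match score of `model` over `dataset`."""
    scores = [exact_match(model(prompt), expected) for prompt, expected in dataset]
    return sum(scores) / len(scores)


# Hypothetical stand-ins for real model calls.
def stub_model_a(prompt: str) -> str:
    answers = {"2 + 2 = ?": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "I don't know")


def stub_model_b(prompt: str) -> str:
    answers = {"2 + 2 = ?": "4", "Capital of France?": "paris", "Opposite of hot?": "cold"}
    return answers.get(prompt, "I don't know")


dataset: List[Example] = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]

candidates: Dict[str, Callable[[str], str]] = {
    "stub_model_a": stub_model_a,
    "stub_model_b": stub_model_b,
}

results = {name: evaluate(fn, dataset) for name, fn in candidates.items()}
best = max(results, key=results.get)

for name, score in results.items():
    print(f"{name}: {score:.2f}")
print("best:", best)
```

The important part is that the dataset and the scorer come from _your_ application, not from a public benchmark; swapping the stubs for real model calls leaves the rest of the loop unchanged.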
Voice & Audio: Talking to Dolphins?
In perhaps the most delightful news of the week, Google, in collaboration with Georgia Tech and the Wild Dolphin Project, announced DolphinGemma. It's a ~400M parameter audio model based on the Gemma architecture (using SoundStream for audio tokenization), trained specifically on decades of recorded dolphin clicks, whistles, and pulses. The goal?
- To decipher the potential syntax and structure within dolphin communication and eventually enable rudimentary two-way interaction using underwater communication devices.
- It runs on a Pixel phone for field deployment.
AI Art & Diffusion & 3D: Seedream Challenges the Champs
ByteDance wasn't just busy with video; their Seed team announced Seedream 3.0, a powerful bilingual text-to-image model. Highlights:
- Generates native 2048x2048 images.
- Fast inference (~3 seconds for 1Kx1K on an A100).
Everything we covered today in bite-sized pieces with links!
Hosts and Guests
Alex Volkov - AI Evangelist @ Weights & Biases (@altryne)
Co-Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed
Todd Segal - Principal Software Engineer @ Google - Working on A2A Protocol
Big CO LLMs + APIs
OpenAI launches o3 and o4-mini in ChatGPT & API (Blog, Our Coverage, o3 and o4-mini announcement)
OpenAI launches GPT 4.1, 4.1-mini and 4.1-nano in API (Our Coverage, Prompting guide)
Google launches Gemini 2.5 Flash with controllable thinking budgets (Blog Post, API Docs)
Mistral Classifier Factory
Claude does research + workspace integration (Blog)
Cohere Embed-4 - Multimodal embeddings for enterprise search (Blog, Docs Changelog, X)
Open Source LLMs
OpenAI open sources MRCR Long-Context Benchmark (Hugging Face)
Microsoft BitNet v1.5 (HF)
INTELLECT-2 - Prime Intellect's 32B "globally-distributed RL" experiment (Blog, X)
Z.ai (previously chatGLM) + GLM-4-0414 open-source family (X, HF Collection, GitHub)
This Week's Buzz + MCP/A2A
Weave playground support for GPT 4.1 and o3/o4-mini models (X)
Chat with Todd Segal - A2A Protocol (GitHub Spec)
Vision & Video
Veo-2 Video Generation in GA, Gemini App (Dev Blog)
ByteDance publishes Seaweed-7B, a video generation foundation model (seaweed.video)
Voice & Audio
DolphinGemma - Google AI tackles dolphin communication (Blog)
AI Art & Diffusion & 3D
Seedream 3.0 bilingual image diffusion - ByteDance (Tech post, arXiv, AIbase news)
Tools
OpenAI debuts Codex CLI, an open source coding tool for terminals (GitHub)
Use o3 with Windsurf (which OpenAI is rumored to buy at $3B) via the Mac app integration + write back + multiple files