Episode Summary

This is one of those wonderfully overloaded ThursdAI episodes where the news week refuses to fit inside the runtime. NVIDIA drops Nemotron 3 Ultra live with Chris Alexiuk on hand to explain the 550B open model, Arena launches Agent Arena with Peter Gostev, and Karan joins to talk Hermes Agent turning into a real community tool. Around that spine, Alex and the crew hit Microsoft MAI, Gemma 4, MiniMax M3, Ideogram 4, Reve V2, RTX Spark laptops, ElevenLabs dubbing, Cartesia audio, and Wolfram’s token-usage view for WolfBench.

Hosts & Guests

  • Alex Volkov (@altryne): Host · AI Evangelist, W&B / CoreWeave
  • Peter Gostev (@petergostev): Head of AI, Arena
  • Chris Alexiuk (@llm_wizard): Product Research Engineer, NVIDIA
  • Karan (@karan4d): Co-founder, Nous Research
  • Wolfram Ravenwolf (@WolframRvnwlf): AI model evaluator (r/LocalLLaMA)
  • Yam Peleg (@Yampeleg): AI builder & founder
  • Nisten Tahiraj (@nisten): AI operator & builder
  • LDJ (@ldjconfirmed): Nous Research

By The Numbers

  • Nemotron 3 Ultra parameters: 550B. NVIDIA open sparse model discussed with Chris Alexiuk; 55B active parameters.
  • Active parameters: 55B. Nemotron 3 Ultra active parameter count for the sparse MoE model.
  • MAI Thinking 1 total parameters: 1T. Microsoft MAI Thinking 1 described as a 1T total, 35B active MoE trained from scratch.
  • MAI training tokens: 33T. Microsoft MAI Thinking 1 was discussed as trained on 33T tokens without distillation.
  • Ideogram 4 parameters: 9.3B. Open-weight text-to-image model focused on text rendering and layout control.
  • Nemotron ASR throughput: 17x. Alex highlights Nemotron 3.5 ASR as 17x faster than Parakeet-style baselines at half the size.

🔥 Breaking During The Show

NVIDIA Nemotron 3 Ultra drops the day of the show
Alex opens with NVIDIA’s new 550B open sparse model as breaking news, then Chris Alexiuk joins from NVIDIA HQ to explain the model, data, recipes, NVFP4 checkpoint, and agentic-harness focus.
Arena launches Agent Arena during the episode
The crew sees Arena’s new real-world agentic evaluation launch live, and Peter Gostev joins to explain why long-running agent tasks require a different benchmark than one-turn chatbot preference battles.

📰 Show Open & Welcome

Alex opens a packed June 4 show with Nemotron 3 Ultra breaking from NVIDIA, a fresh wave of image models, Microsoft declaring a serious frontier-model push, and the usual promise to compress an unreasonable news week into one live show. Chris Alexiuk, Karan, and Peter Gostev are set up as the main guest voices for the episode.

  • NVIDIA Nemotron 3 Ultra dropped the same day as the show
  • Microsoft MAI, open image models, and agent benchmarks set the agenda
  • Chris Alexiuk, Karan, and Peter Gostev join as guests
Alex Volkov
"We have been training for weeks such as these because this week was absolutely stacked with AI news."

⚑ ThursdAI TL;DR - Jun 4, 2026

The fast run-through frames the week: NVIDIA RTX Spark, Microsoft MAI models, MiniMax M3, Gemma 4, Agent Arena, image-model leaderboard chaos, ElevenLabs dubbing, and CoreWeave/W&B hackathon notes. It is the table of contents for a show that keeps getting interrupted by real launches.

  • Chris Alexiuk joins for Nemotron 3 Ultra
  • Karan joins for Hermes Agent and Nous Research
  • Peter Gostev joins to explain Agent Arena

🔓 Open Source AI

The open-source block starts with the usual ThursdAI bias toward models people can inspect, run, and build on. Alex sets aside NVIDIA for the later Chris interview and opens with smaller but important releases from Google, JetBrains, and MiniMax.

  • Open-source model coverage split into multiple segments
  • NVIDIA saved for a deeper guest interview
  • Gemma, Mellum, and MiniMax lead the first pass

🔓 Google Gemma 4 12B (Encoder-Free Multimodal)

Gemma 4 12B gets the first technical dive because its encoder-free multimodal design matters: instead of bolting a separate vision/audio encoder onto a language model, Google is pushing toward one unified network. LDJ and Yam explain why this can make smaller multimodal models cheaper, cleaner, and easier to run locally.

  • 12B parameter encoder-free multimodal model
  • Apache 2.0 license and 16GB VRAM target
  • LDJ explains why unified multimodal training matters
LDJ
"Encoder-free gets rid of this, and you actually have it more cohesive, more like you would ideally think maybe the human brain works."

🛠️ JetBrains Mellum 2

JetBrains ships Mellum 2, a 12B mixture-of-experts coding model with only 2.5B active parameters. The panel treats it as another sign that IDE companies are trying to turn years of developer workflow context into model advantage.

  • 12B MoE coding model with 2.5B active parameters
  • Trained with a three-stage curriculum over 10T tokens
  • Available on CoreWeave Inference

🔓 MiniMax M3

MiniMax M3 brings a one-million-token sparse attention context claim and strong coding/agentic benchmark numbers, but the panel keeps the hype measured because weights and licensing details still matter. The practical thread is that MiniMax models already have a following for cheap agentic tool calling even when pure coding quality is debated.

  • Open-weights frontier coding model announcement
  • One-million-token (1M) sparse attention context claim
  • Reported 59 on SWE-bench Pro and 66 on an internal benchmark

🤖 Agent Arena from LMArena

Agent Arena launches during the live show and gets the breaking-news treatment. Peter Gostev explains why chatbot A/B preference battles are no longer enough and how Arena is moving toward real agent workflows with web search, files, terminals, user corrections, and objective recovery signals.

  • Arena launches real-world agentic evals at scale
  • Models are judged on longer workflows, not one-turn chat only
  • Peter explains the move from battle mode to agent mode
Peter Gostev
"There is something that definitely was missing about this, and we heard a lot about this from the community: longer term, more difficult tasks that can go on for many minutes and hours."

🏒 Microsoft MAI Thinking & Code Models

Microsoft uses Build 2026 to show seven MAI models across thinking, code, image, transcription, and voice. The panel focuses on MAI Thinking 1 and MAI Code 1 Flash as signs that Microsoft AI is becoming a model lab in its own right rather than only an OpenAI distribution channel.

  • Seven Microsoft AI models announced at Build 2026
  • MAI Thinking 1 is a 1T total, 35B active MoE
  • MAI Code 1 Flash ships into GitHub Copilot

🎨 Microsoft MAI Image 2.5

MAI Image 2.5 gets attention because it jumps high on Arena image leaderboards surprisingly quickly. Alex and Peter discuss its strengths in editing, cleanup, diagrams, and documents, while also testing the public playground path for people who want to try it outside heavier Microsoft developer surfaces.

  • Number two on Arena image-to-image at the time of discussion
  • Strong image cleanup, background, document, and diagram results
  • Available through playground.microsoft.ai

🎨 Ideogram 4 (Open Weights)

Ideogram 4 is the rare image-model release that is both strong at text/layout and open weights, even if under a non-commercial license. The panel digs into its 9.3B parameter size, design-arena showing, bounding-box prompting, and the tradeoff between precise structured prompting and casual generation.

  • 9.3B parameter open-weight text-to-image model
  • Strong design and text-rendering results
  • Supports bounding-box/layout-style prompting

🎨 Reve V2 (Layout-Based Image Model)

Reve V2 climbs near the top of text-to-image Arena, but Alex’s live testing shows both the promise and weirdness of the model. The interesting part is not perfect portraits; it is the layout engine and editing flow that make precise graphic/image iteration feel different from normal prompt-only generation.

  • Reve V2 reaches an Elo of about 1200 on image Arena
  • Alex tests portrait generation and finds inconsistent identity quality
  • The layout-first editor is the real differentiator

🔓 Interview: Chris Alexiuk (NVIDIA) - Nemotron 3 Ultra

Chris Alexiuk joins from NVIDIA HQ to unpack Nemotron 3 Ultra, a 550B sparse open model with 55B active parameters designed around agentic harnesses. The conversation covers NVIDIA’s open data, recipes, reward model, GenRM, NVFP4 checkpoint, hybrid Mamba/Transformer architecture, and why speed matters more as agents run longer contexts.

  • 550B total parameters with 55B active
  • Built for agentic harnesses like OpenCode, Hermes, and OpenClaw
  • Open weights, data, recipes, reward model, and training details released
Chris Alexiuk
"Nemotron-3 Ultra is a 550 billion parameter, sparse MoE model with 55 billion active parameters."
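
The total-versus-active split quoted for this week's sparse MoE releases is easier to compare as a ratio. The sketch below uses only the parameter counts mentioned in this episode; the helper name `active_fraction` is made up for illustration, and the share of weights active per token is only a rough proxy for per-token compute, not a full cost model.

```python
# Total and active parameter counts (in billions) as quoted in this episode.
MODELS = {
    "Nemotron 3 Ultra": (550.0, 55.0),
    "MAI Thinking 1": (1000.0, 35.0),
    "Mellum 2": (12.0, 2.5),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a sparse MoE model's parameters that are active per token."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {active_b:g}B of {total_b:g}B active ({share:.1%})")
```

Read this way, Nemotron 3 Ultra activates about a tenth of its weights per token, while MAI Thinking 1 is much sparser relative to its total size.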

🔊 NVIDIA Nemotron 3.5 ASR

The NVIDIA segment continues into speech with Nemotron 3.5 ASR, a tiny but fast streaming transcription model. Chris credits NVIDIA’s speech research team while Alex highlights the 600M parameter size, 40-language support, and throughput jump that pushes the latency/accuracy frontier.

  • 600M parameter streaming ASR model
  • Supports 40 languages
  • Reported 17x more throughput than Parakeet with half the size
Alex Volkov
"It is 600 million parameters. Basically nothing. Runs for 40 languages, which is quite incredible."

💻 NVIDIA RTX Spark & Computex Laptops

The Computex discussion shifts from cloud-scale NVIDIA to local AI PCs. RTX Spark and the new laptop wave put RTX 5070-class GPUs, 128GB of memory, and roughly one petaflop of local AI compute into thin machines, which raises the practical question of what agents should run locally versus remotely.

  • RTX Spark brings NVIDIA further into AI PCs
  • 128GB of memory and roughly 1 petaflop of local AI compute headline the announcement
  • Chris notes Nemotron Ultra is too large for local laptops, but smaller models are improving fast
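
A back-of-envelope check makes the "too large for local laptops" point concrete. This is a rough sketch, not a sizing guide: it assumes about 4 bits per weight (a simplification of NVFP4 that ignores scaling-factor overhead) and counts weights only, with no KV cache or activations. The function name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weights-only size in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


LAPTOP_MEMORY_GB = 128  # memory figure quoted for the RTX Spark announcement

# Parameter counts come from the episode; 4 bits per weight is an assumption.
for label, params_b in [("Nemotron 3 Ultra (550B)", 550), ("120B-class model", 120)]:
    size_gb = weight_footprint_gb(params_b, bits_per_weight=4)
    verdict = "fits" if size_gb <= LAPTOP_MEMORY_GB else "does not fit"
    print(f"{label}: ~{size_gb:.0f} GB of weights, {verdict} in {LAPTOP_MEMORY_GB} GB")
```

Under these assumptions the 550B model needs roughly 275 GB for weights alone, while a 120B-class model comes in around 60 GB, which is consistent with the local-inference class mentioned for RTX Spark.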

🤖 Interview: Karan (Nous Research) - Hermes Agent

Karan joins to talk about Nous Research’s surreal moment at Computex and Hermes Agent’s unexpected community adoption. He frames Hermes as a tool originally built for RL rollouts that escaped into real user workflows, with the community and the agent itself becoming major contributors to its growth.

  • Jensen Huang showed Nous Research on stage at Computex
  • Hermes Agent has grown into a widely used open agent harness
  • Karan says Hermes was built for RL rollouts before the community ran with it
Karan
"We made it to do RL rollouts on, right? We made it for the same reason Codex or Claude Code were created by those labs."

🛠️ Hermes Harness Engineering & Security

The Hermes discussion turns into a useful taxonomy of harness engineering: prompts, simulated terminals, permissioning, tool environments, and the security boundary around letting agents act. Karan connects today’s agent harnesses back to WorldSim and the older prompt-engineering lineage that made terminal-style agents possible.

  • Harness engineering is treated as a new craft layer above prompt engineering
  • WorldSim and simulated-terminal work are framed as precursors
  • The panel discusses permissions, security, and local control

🖥️ Hermes Desktop

Karan previews Hermes Desktop as a more accessible UI for the same agent power: chat, permissions, tool visibility, admin controls, and local app-style usage. Alex compares the shape of it to Codex-level local harnesses rather than a simple chatbot wrapper.

  • Hermes Desktop packages Hermes Agent into a desktop UI
  • Admin controls target small teams, startups, and personal agent fleets
  • Users can inspect tool calls, reasoning traces, and permissions

🔊 Voice & Audio - ElevenLabs Dubbing V2

The audio section is the live-demo brain-melter: Alex plays ElevenLabs Dubbing V2 translating voices while preserving cadence, expression, intonation, and even stutters. The multilingual demos feature Alex, Nisten, and Alex’s daughter Emma, who appears only as a private dubbing example rather than as a show participant.

  • ElevenLabs Dubbing V2 preserves cadence and expression across languages
  • Alex demos Nisten in Hebrew and his own voice in multiple languages
  • Emma is a dubbing-demo voice only, not included as a guest

🔊 Cartesia Ink2 Streaming ASR Demo

The show squeezes in Cartesia Ink2 and related audio/transcription notes near the end while Alex starts summarizing the huge episode. It also becomes a bridge into W&B/CoreWeave and WeaveHacks reminders, including hackathon credits and practical builder calls to action.

  • Cartesia Ink2 streaming ASR gets a short mention/demo slot
  • Alex recaps the major guests and topics before moving to community notes
  • WeaveHacks is promoted for San Francisco builders

🧪 WolfBench - Token Usage Visualization

Wolfram shows a WolfBench feature that visualizes not just benchmark score, but token usage. The important point is that two models can look close on a leaderboard while one burns dramatically more tokens, which changes the real cost and latency story.

  • WolfBench adds a 3D token-usage visualization
  • Gemini 3.5 Flash and GPT 5.5 are compared through score plus token depth
  • Wolfram argues cost/time calculations need token usage, not only benchmark bars
Wolfram Ravenwolf
"Something these bars never show is how many tokens did it use to get that score."
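
Wolfram's point can be illustrated with simple arithmetic: two runs with nearly the same score can differ several times over in cost once token usage is counted. Everything in the sketch below is hypothetical (model names, scores, token counts, and the price per million tokens); none of it is WolfBench data or a real price list.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    score: float            # benchmark score (hypothetical)
    output_tokens: int      # tokens generated across the whole run (hypothetical)
    usd_per_million: float  # price per million output tokens (hypothetical)

    def cost_usd(self) -> float:
        """Total cost of the benchmark run in USD."""
        return self.output_tokens / 1_000_000 * self.usd_per_million

    def cost_per_point(self) -> float:
        """USD spent per benchmark point, a crude efficiency measure."""
        return self.cost_usd() / self.score


runs = [
    Run("model-a", score=70.0, output_tokens=20_000_000, usd_per_million=10.0),
    Run("model-b", score=68.0, output_tokens=5_000_000, usd_per_million=10.0),
]

for run in runs:
    print(
        f"{run.name}: score={run.score:.0f}, "
        f"cost=${run.cost_usd():.2f}, $/point={run.cost_per_point():.2f}"
    )
```

In this made-up case the two models sit two points apart on a bar chart, yet one costs four times as much to run, which is the gap a token-depth view is meant to expose.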

📰 Show Wrap-up

Alex closes a two-and-a-half-hour show that still somehow did not cover everything. The final beat thanks the live audience, points listeners to podcast and YouTube versions, and notes that the rumored OpenAI drop did not arrive, which may have been a mercy given how full the episode already was.

  • Show runs more than two and a half hours
  • Alex thanks listeners and points to podcast/YouTube versions
  • The expected OpenAI update did not land during the show

TL;DR and Show Notes - June 4, 2026
  • Show Notes & Guests

  • Open Source LLMs

    • NVIDIA released Nemotron 3 Ultra, a 550B / 55B-active open-weight MoE built for long-running agents, with weights, data, recipes, GenRM, and training assets released (X, Tech Report, Announcement, HF).

    • NVIDIA also shipped Nemotron 3.5 ASR, a 600M open multilingual streaming STT model for voice agents (X, HF, Benchmark, Voice Agent Repo).

    • Google dropped Gemma 4 12B, an encoder-free multimodal model that runs locally under Apache 2.0 (X, HF).

    • MiniMax announced M3, a natively multimodal, 1M-context coding and agentic model with open weights coming soon (X, API, Code).

    • JetBrains released Mellum2, a 12B MoE with 2.5B active params trained from scratch by a small team (X, Blog, HF).

    • H Company launched Holo 3.1, local computer-use agents from 0.8B to 35B with new quantized checkpoints (X, Blog).

  • Big CO LLMs + APIs

    • NVIDIA announced RTX Spark, its new Arm + Blackwell PC platform for local AI agents and 120B-class local inference (coverage).

    • Microsoft AI launched seven new MAI models, including MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2 (Blog, Tech Report).

  • AI Art & Diffusion & 3D

    • MAI-Image-2.5 landed near the top of Arena image leaderboards, though hands-on tests were mixed (X, Try it).

    • Ideogram 4.0 became the top open-weight text-to-image model with strong typography and layout control (X, Blog, HF).

    • Reve 2.0 jumped to #2 on Text-to-Image Arena with native 4K, code-like layout control, and precise editing (X, Blog, Try it).

    • xAI released Grok Imagine Video 1.5 Preview for image-to-video with synced audio (xAI).

  • Tools & Agentic Engineering

    • Arena launched Agent Arena, a new leaderboard for real agent workflows instead of one-shot chatbot prompts (Arena).

    • Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with ACP support (X, Announcement).

    • Nous Research launched Hermes Desktop, bringing Hermes Agent into a native desktop app for Mac, Windows, and Linux (X, Site).

  • This Week’s Buzz

    • WeaveHacks 4 is this weekend in SF with OpenAI, Cursor, DeepMind, and more joining (lu.ma/weavehacks).

    • Nemotron 3 Ultra is live on CoreWeave Inference through W&B at full NVFP4 precision (Try it).

    • WolfBench added 3D token-depth bars, making model efficiency much easier to see (wolfbench.ai).

  • Voice & Audio

    • ElevenLabs launched Dubbing v2, an audio-to-audio dubbing model that preserves performance across 90+ languages (X, Dubbing).

    • Cartesia launched Ink-2, a fast streaming STT model built for voice agents (X, Ink, AA).

    • NVIDIA’s Nemotron 3.5 ASR looks like a major open-source voice-agent infrastructure drop (HF).

  • AI in Society

    • Bernie Sanders proposed the American AI Sovereign Wealth Fund Act, calling for public equity stakes in major AI companies (coverage).

    • Anthropic published When AI Builds Itself, laying out scenarios for AI-driven AI R&D and recursive self-improvement (Anthropic).

    • AI leaders urged Congress to mandate synthetic DNA/RNA screening and recordkeeping (WIRED).

    • Anthropic confidentially filed for an IPO, adding another frontier-lab public-market storyline to watch (Axios).