Episode Summary
Recorded live from the AI Engineer Summit in New York, this might be the most packed ThursdAI episode ever: in a single week, Google dropped Gemini 3 Pro (45% on ARC-AGI-2!), xAI shipped Grok 4.1 then Grok 4.1 Fast with a full Agent Tools API, OpenAI answered with GPT-5.1-Codex-Max capable of 24-hour+ coding runs, and Meta segmented the universe with SAM 3 and SAM 3D. Oh, and Google capped Thursday itself with Nano Banana Pro generating flawless 4K infographics while Alex was still live on air. Three incredible guests joined: Swyx from Cognition/Latent Space who organized the summit, Thor Schaeff from Google DeepMind (on day three of his new job!), and Dominik Kundel from OpenAI breaking down Codex's native compaction magic. The future didn't just arrive; it showed up with luggage.
In This Episode
- Live from AI Engineer: The Craziest Week in AI
- AI Engineer Summit: Coding Agents Take Center Stage
- Gemini 3 Pro: Google's AI Comeback is Complete
- Antigravity: Google's Free Agentic IDE That Feels Like the Future
- GPT-5.1-Codex-Max: 24-Hour Agent Runs and Native Compaction
- Grok 4.1 Fast & Agent Tools API: xAI's Developer Moment
- Nano Banana Pro: 4K Image Generation with Perfect Text
- Meta SAM 3 & SAM 3D, OLMo 3, and Open Source News
Hosts & Guests
By The Numbers
Breaking During The Show
Live from AI Engineer: The Craziest Week in AI
Alex kicks off the show live from the AI Engineer Summit in New York, joined by co-host Ryan Carson and surprise guest Swyx. The panel does a lightning-round 'pick one release from the week': Ryan goes Gemini 3, Swyx agrees it's underrated, and Alex cheats by picking Antigravity (which includes Gemini 3). The TLDR is staggering: every major AI lab shipped something massive in the same five-day window.
- Recorded live at AI Engineer Summit in New York with a professional podcast studio on the expo floor
- Ryan Carson (Amp): first time they've ever switched their default model; Gemini 3 Pro is now default at Amp
- Swyx calls Gemini 3 'still underrated despite all the attention it's already got'
AI Engineer Summit: Coding Agents Take Center Stage
Swyx walks Alex and Ryan through the summit's theme, coding agents, and explains why every major lab converging on agentic workflows makes it the right bet for 2025. From Cursor to Jules to CodeRabbit to Anthropic and Google Labs, the agent lab ecosystem is maturing fast. This year's summit also targets enterprise for the first time, with Fortune 500 attendees from Capital One, Bloomberg, and Atlassian.
- 23 applicants for every speaker slot; Swyx curated an all-star lineup from every lab
- First summit focused on enterprise digital transformation alongside the developer community
- Swyx: 'If you take vertical AI seriously enough, you eventually end up building an agent lab'
Gemini 3 Pro: Google's AI Comeback is Complete
Thor Schaeff (Google DeepMind, day three on the job!) joins the panel to celebrate Gemini 3 Pro's launch. The numbers are genuinely wild: 45.14% on ARC-AGI-2 with Deep Think mode, 81% on MMLU-Pro, and major gains in coding. Ryan confirms Amp switched to it as their default model the day it launched, the first time they've ever switched defaults. The segment also explains Deep Think mode and covers Gemini landing across Gmail, Calendar, and AI Mode in Search.
- ARC-AGI-2: 31.11% standard, 45.14% with Deep Think (biggest ever jump on this benchmark)
- Ryan Carson: Amp switched to Gemini 3 Pro as default on launch day, something they had never done before
- AI Mode rolling out in Google Search powered by Gemini 3 Pro
Antigravity: Google's Free Agentic IDE That Feels Like the Future
Alex's personal pick of the week, Antigravity is a free VS Code fork reimagined for agent-first coding. The killer feature: an Agent Manager that acts like an inbox for your coding agents, letting you run multiple agents in parallel, each working on different parts of your codebase simultaneously. Browser integration lets agents take screenshots and videos of your running app, then debug and iterate. Gemini 3 Pro handles the heavy coding; Nano Banana handles images.
- Agent Manager: inbox-style interface to coordinate multiple parallel coding agents
- Browser integration: agents can control Chrome, take screenshots, and self-debug
- Free tier powered by Gemini 3 Pro, the only model offered alongside the open-source GPT-OSS 120B
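The inbox idea behind the Agent Manager can be pictured with a small sketch. This is not Antigravity's implementation or API; `run_agent`, `manage`, and the task names are hypothetical, and the "agents" here only simulate work. It shows several agents running concurrently and reporting back to one shared inbox:

```python
import asyncio
from typing import List, Tuple


async def run_agent(name: str, task: str, inbox: asyncio.Queue) -> None:
    """Hypothetical agent: pretends to work, then posts a status message to the inbox."""
    await asyncio.sleep(0)  # stand-in for real work (model calls, edits, test runs)
    await inbox.put((name, f"done: {task}"))


async def manage(tasks: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Start one agent per task concurrently and collect their inbox messages."""
    inbox: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(*(run_agent(name, task, inbox) for name, task in tasks))
    messages = []
    while not inbox.empty():
        messages.append(inbox.get_nowait())
    return sorted(messages)  # sorted by agent name for a stable display order


if __name__ == "__main__":
    results = asyncio.run(
        manage([("agent-a", "fix login form"), ("agent-b", "add unit tests")])
    )
    for name, status in results:
        print(name, status)
```

A real agent manager would also need cancellation, progress updates, and review steps; the sketch only covers the fan-out and collect pattern.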
GPT-5.1-Codex-Max: 24-Hour Agent Runs and Native Compaction
Dominik Kundel from OpenAI joins live to break down GPT-5.1-Codex-Max, the newest frontier coding model designed for long-horizon software tasks. The headline: native compaction training lets it run for 24+ hours on a single task (an internal run reportedly went a full week). Dominik explains how compaction differs from just starting a new thread, efficiency gains (30% fewer thinking tokens), Windows/PowerShell improvements, and the new extra-high reasoning level.
- Native compaction: model trained to intelligently summarize prior context and run indefinitely
- 30% fewer thinking tokens at median compared to predecessors: faster and smarter
- 58% on TerminalBench 2, a new SOTA; also leads SWE-Bench and SWE-Lancer vs. predecessors
- Windows PowerShell support significantly improved; experimental Windows sandbox launched
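To make the general idea of compaction concrete, here is a rough sketch of an external compaction loop. It is not how GPT-5.1-Codex-Max works internally (the episode describes compaction as something the model is trained to do natively); `count_tokens`, `compact`, and `naive_summary` are hypothetical helpers, and the word-count "tokenizer" is a deliberate simplification:

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def compact(history: List[str], budget: int, keep_recent: int,
            summarize: Callable[[List[str]], str]) -> List[str]:
    """Fold older turns into one summary turn when the history exceeds the budget."""
    total = sum(count_tokens(turn) for turn in history)
    if total <= budget or len(history) <= keep_recent:
        return history  # still within budget, or nothing old enough to fold
    cut = len(history) - keep_recent
    older, recent = history[:cut], history[cut:]
    return [summarize(older)] + recent


def naive_summary(turns: List[str]) -> str:
    """Placeholder summarizer: keeps only the first five words of each turn."""
    return "SUMMARY: " + " | ".join(" ".join(t.split()[:5]) for t in turns)


if __name__ == "__main__":
    history = [f"step {i}: edited file_{i}.py and ran the test suite again"
               for i in range(10)]
    compacted = compact(history, budget=50, keep_recent=3, summarize=naive_summary)
    print(len(history), "turns ->", len(compacted), "turns")
```

In a real system the summarizer would be a model call, and the contrast Dominik draws is that a natively trained model decides what to keep instead of relying on a fixed rule like this one.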
Grok 4.1 Fast & Agent Tools API: xAI's Developer Moment
xAI had a huge week: Grok 4.1 briefly topped LM Arena (1483 Elo), then Grok 4.1 Fast landed with a 2M token context, native X search, Reddit search, web browsing, and code execution. The Agent Tools API benchmarks are jaw-dropping: 93-100% on τ²-Bench and 72% on Berkeley Function Calling v4, at $0.20/$0.50 per million tokens. Yam confirms the X and Reddit search is real and working. Alex shares his experience using both models in his N8N research agent.
- Grok 4.1 topped LM Arena at 1483 Elo before Gemini 3 eclipsed it
- Grok 4.1 Fast: $0.20 input / $0.50 output per million tokens; free for 2 weeks on xAI API and OpenRouter
- Agent Tools: native X + Reddit search that other models refuse to do
- 72% on Berkeley Function Calling v4: top of the leaderboard, 10× cheaper than Gemini 3 Pro
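For a sense of what the quoted prices mean in practice, here is a small cost calculation. The two prices come from the episode; the function name `run_cost` and the example workload (a full 2M-token context plus a 10K-token answer) are illustrative assumptions:

```python
# Prices quoted in the episode for Grok 4.1 Fast (USD per million tokens).
INPUT_PRICE_PER_M = 0.20
OUTPUT_PRICE_PER_M = 0.50


def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted per-million prices."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: fill the whole 2M context, get a 10K-token answer.
    print(f"${run_cost(2_000_000, 10_000):.3f}")
```

Under these assumptions, a request that fills the entire context window comes to roughly 40 cents, almost all of it input cost.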
Nano Banana Pro: 4K Image Generation with Perfect Text
Breaking news mid-show: Google releases Nano Banana Pro, upgraded with thinking traces, 4K resolution, and SynthID watermarking. Alex demos it live by generating an 8MB infographic about the week's AI news: the text is perfect across the entire image, logos are pixel-accurate, and the composition is impressive. Wolfram demos generative UIs in Gemini, with Gemini building an interactive news dashboard with real-time market data on demand.
- Breaking news during the live show; Alex demos it instantly with an AI news infographic
- Perfect text rendering across 4K images: no garbled letters, accurate logos
- Thinking traces visible before generation: Gemini 3 plans, Nano Banana executes
- SynthID watermarking and C2PA metadata for provenance on every image
- Generative UIs: Gemini builds interactive dashboards with real data on the fly
Meta SAM 3 & SAM 3D, OLMo 3, and Open Source News
Meta joins the party with SAM 3 (open-vocabulary video segmentation with text and exemplar prompts) and SAM 3D for turning single photos into 3D objects and human body reconstructions. The panel demos it live on dog videos. LDJ and Nisten highlight OLMo 3 from Allen AI as a fully open 32B model (full dataset, training recipe, hyperparameters); the contrast to open-weights-only releases from Qwen and DeepSeek is stark.
- SAM 3: click or text-prompt to segment and track any object across video, with a live demo on golden retrievers
- SAM 3D: single image to 3D object or full human body reconstruction
- OLMo 3: Allen AI's fully open 32B dense model, with dataset, recipe, and hyperparameters all public
- Marimo Python notebooks: new VS Code and Cursor extension with reactive notebooks and UV integration
If you only skim one section, make it this one:
Google
- Gemini 3 Pro: 1M-token multimodal model, huge reasoning gains — new LLM king; ARC-AGI-2: 31.11% (Pro), 45.14% (Deep Think) — enormous jumps
- Antigravity IDE: free, Gemini-powered VS Code fork with agents, plans, walkthroughs, and browser control
- Nano Banana Pro: 4K image generation with perfect text + SynthID provenance; dynamic generative UIs in Gemini
xAI
- Grok 4.1: big post-training upgrade — #1 on human-preference leaderboards, much better EQ & creative writing, fewer hallucinations
- Grok 4.1 Fast + Agent Tools API: 2M context, SOTA tool-calling & agent benchmarks (Berkeley FC, τ²-Bench, research evals), aggressive pricing and tight X + web integration
OpenAI
- GPT-5.1-Codex-Max: frontier agentic coding model built for 24h+ software tasks with native compaction for million-token sessions; big gains on SWE-Bench, SWE-Lancer, TerminalBench 2
- GPT-5.1 Pro: new research-grade ChatGPT mode that will happily think for minutes on a single query
Meta
- SAM 3: open-vocabulary segmentation + tracking across images and video (with text & exemplar prompts)
- SAM 3D: single-image to 3D objects & human bodies; surprisingly high-quality 3D from one photo
Robotics
- Sunday Robotics — ACT-1 & Memo: home robot foundation model trained from a $200 skill glove instead of $20K teleop rigs; long-horizon household tasks with solid zero-shot generalization
Recorded live at the AI Engineer Summit in New York. Three incredible guests: Swyx (Cognition/Latent Space), Thor Schaeff (Google DeepMind, day 3!), and Dominik Kundel (OpenAI).