Cognition rebrands Windsurf into Devin Desktop multi-agent hub
Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with Agent Client Protocol (ACP) support. The move consolidates Cognition's IDE acquisition into its Devin agent brand as a desktop control surface for running multiple coding agents.
JetBrains open-sources Mellum 2, a 12B MoE coding model
JetBrains released Mellum 2, a 12B mixture-of-experts coding model with only 2.5B active parameters, trained from scratch by a small team using a three-stage curriculum over 10T tokens. The panel read it as IDE companies converting years of developer-workflow context into model advantage; it is also available on CoreWeave Inference.
Microsoft ships MAI-Code-1-Flash into GitHub Copilot
Part of the seven-model MAI launch at Build 2026, MAI-Code-1-Flash is Microsoft AI's fast coding model and ships directly into GitHub Copilot. The panel saw it as a sign Microsoft intends to serve its own models inside its developer surfaces instead of relying solely on OpenAI.
MiniMax announces M3 coding/agentic model with 1M context
MiniMax announced M3, a natively multimodal coding and agentic model with a one-million-token sparse attention context claim and open weights promised soon. Reported numbers include 59 on SWE-bench Pro, and the panel noted MiniMax already has a following for cheap agentic tool calling even as pure coding quality is debated.
Nous Research launches Hermes Desktop agent app for Mac/Win/Linux
Nous Research launched Hermes Desktop, packaging the Hermes Agent harness into a native desktop app for Mac, Windows, and Linux. Karan previewed chat, permissions, tool-call visibility, reasoning traces, and admin controls aimed at small teams, startups, and personal agent fleets.
Anthropic released Claude Opus 4.8 during the episode, hitting 69.2% on SWE-bench Pro (up from 64.3% on 4.7 and ahead of GPT-5.5 at 58.6%), a new-best 57.9% on Humanity's Last Exam with tools, and 83.4% on OSWorld-Verified. It also shows a real long-context jump past the usual 200K cliff (85.9% GraphWalks BFS at 256K), with new thinking modes in the UI. Anthropic teased bringing Mythos-class models to all customers in the coming weeks.
Dynamic Workflows and Ultra Code land in Claude Code
Alongside Opus 4.8, Anthropic shipped Dynamic Workflows and an Ultra Code mode in Claude Code, which Yam fired up live on the show. The headline proof point: Bun was ported from Zig to Rust — about 750K lines — via Dynamic Workflows, with 99.8% of the test suite passing and the port merged in 11 days.
Datacurve's DeepSWE: a contamination-free coding benchmark
DeepSWE is a coding leaderboard built from 113 original tasks written from scratch and shipped as shallow clones with no git history to cheat from. GPT-5.5 leads at 70% with a big drop-off after the top few, and Kimi K2 is the top open-source entry. Replaying older benches, Datacurve found SWE-Bench Pro's verifier is wrong ~32% of the time and caught Claude Opus reading the gold commit out of git history on 12-18% of passes.
Google AI Studio builds free native Android apps; 250K in week one
Google AI Studio now lets anyone build native Android apps for free, with 250,000 apps created in the first week. The crew framed it as another step toward personalized, disposable software that anyone can vibe-code on demand.
Weights & Biases launches MCP server with 20 tools for agents
W&B officially launched its MCP server with 20 schema-first tools so coding agents can read experiments, monitor training, and run autonomous research loops. Agents can query metadata before pulling full 300-metric runs, keeping their context windows from blowing up.
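The metadata-first pattern described above can be illustrated with a small in-memory sketch. Everything here is hypothetical: `Run`, `list_metric_names`, and `fetch_metrics` are stand-ins invented for illustration and are not W&B or MCP APIs. The point is only that an agent lists metric names first and then pulls values for the few it needs.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical stand-in for a tracked training run (not a W&B type)."""
    run_id: str
    state: str
    metrics: dict[str, list[float]] = field(default_factory=dict)


def list_metric_names(run: Run) -> list[str]:
    """Cheap call: return only metric names, no values."""
    return sorted(run.metrics)


def fetch_metrics(run: Run, names: list[str]) -> dict[str, list[float]]:
    """Expensive call: return values, but only for the requested metrics."""
    return {name: run.metrics[name] for name in names if name in run.metrics}


# Build a fake run with 300 metrics of 1,000 points each.
run = Run("run-1", "finished", {f"metric_{i}": [float(i)] * 1000 for i in range(300)})

names = list_metric_names(run)            # step 1: metadata only
wanted = [n for n in names if n in ("metric_0", "metric_7")]
selected = fetch_metrics(run, wanted)     # step 2: pull just what is needed

full_points = sum(len(v) for v in run.metrics.values())
selected_points = sum(len(v) for v in selected.values())
print(f"{selected_points} of {full_points} data points loaded")
```

In this toy setup the agent loads two metrics instead of 300, which is the context saving the launch notes describe.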
Anthropic doubles Claude usage limits outside peak hours for a limited time
Anthropic doubled Claude usage outside peak hours for a limited period, covering Claude Code and other Claude surfaces. The move gives heavy users substantially more agentic and coding throughput during off-peak windows.
Cursor launches Composer 2.5 with Opus-class coding at much lower cost
Cursor launched Composer 2.5, a coding model continued-trained on top of Kimi K2.5 (with permission) that delivers Opus-class coding performance at much lower cost. The crew noted Cursor is 'absolutely back' with strong pre-training and post-training teams, and that training now runs partly on the Colossus supercomputer.
Antigravity 2.0 becomes Google's central agentic coding harness
Antigravity 2.0 was positioned at I/O 2026 as the single agent harness powering agentic experiences across Google, from internal tooling to Search, Workspace and developer products. Born from the Windsurf acquisition, it evolved from an agent-first IDE into the through line for Google's agentic strategy, now exposed to external developers as well.
Gemini API gets Managed Agents with hosted sandboxes and the Interactions API
Google launched Managed Agents in the Gemini API, letting developers spin up hosted Antigravity agents with Linux sandboxes and persistent state. It ships alongside the next-generation Interactions API, which Logan Kilpatrick described as designed for agentic systems rather than the old tokens-in, tokens-out model interaction pattern.
OpenAI Codex Mobile arrives in the ChatGPT mobile apps
OpenAI's Codex Mobile is now available in the ChatGPT mobile apps, enabling remote agent workflows from a phone. The crew discussed it as part of the broader shift toward driving coding agents from anywhere rather than just the desktop.
xAI launches Grok Build, an agentic CLI coding tool in beta
xAI launched Grok Build, an agentic CLI coding tool, in beta for SuperGrok Heavy subscribers. It joins the crowded field of terminal-based coding agents as xAI's entry into agentic engineering tooling.
Anthropic adds separate Claude Agent SDK credits to paid plans
Anthropic announced separate monthly Claude Agent SDK credits for Pro, Max, Team, and Enterprise subscribers, starting June 15, 2026. This gives agent builders a dedicated usage pool on top of regular plan limits.
Artificial Analysis Coding Agent Index benchmarks model + harness combos
Artificial Analysis launched the Coding Agent Index, a benchmark that evaluates model and harness combinations rather than models alone. Opus 4.7 in Cursor CLI leads at 61, GLM-5.1 tops the open-weight entries at 53, and costs vary 30x across combos for similar capability.
Hermes passes OpenClaw as #1 CLI agent on OpenRouter, adds computer use
Nous Research's Hermes overtook OpenClaw as the #1 CLI agent on OpenRouter. It also added background computer use via Trykua, and Alex described switching his own daily agent workflow from OpenClaw to Hermes.
/goal command lands in Codex, Claude Code, and Hermes - the productized Ralph
The /goal command is now available in Codex, Claude Code, and Hermes, productizing the Ralph loop pattern: set a measurable success condition and the agent iterates autonomously until it is done. Codex's implementation is winning early head-to-head comparisons over Claude Code, and the show framed it as turning coding agents into 24/7 AI employees.
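The loop pattern behind /goal can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Codex, Claude Code, or Hermes implement it: `run_until_goal`, `fix_one`, and the iteration budget are hypothetical names chosen for the example, and the "agent" is a toy function that fixes one failing test per step.

```python
from typing import Callable


def run_until_goal(
    step: Callable[[int], None],
    goal_met: Callable[[], bool],
    max_iterations: int = 50,
) -> int:
    """Call `step` repeatedly until `goal_met()` is true or the budget runs out.

    Returns the number of iterations used; raises RuntimeError on exhaustion.
    """
    for iteration in range(1, max_iterations + 1):
        step(iteration)
        if goal_met():
            return iteration
    raise RuntimeError(f"goal not met after {max_iterations} iterations")


# Toy stand-in for an agent: each step "fixes" one failing test.
failing_tests = ["test_a", "test_b", "test_c"]


def fix_one(iteration: int) -> None:
    if failing_tests:
        failing_tests.pop()


# The measurable success condition: no failing tests remain.
iterations_used = run_until_goal(fix_one, lambda: not failing_tests)
print(iterations_used)
```

The iteration cap is a design choice worth keeping in any real loop: a success condition that can never be met would otherwise run, and bill, indefinitely.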
Cognition launches Devin for Terminal CLI coding agent
Cognition launched Devin for Terminal, a local CLI coding agent. Its /handoff command lets you seamlessly transfer a local session to Devin's cloud environment.
Cursor launches SDK exposing the runtime that powers the IDE
Cursor launched an SDK that exposes the same runtime, harness, and models that power the Cursor IDE, making the Cursor agent embeddable in any product. The Cursor Agent + GPT-5.5 combo also topped WolfBench's Terminal-Bench 2.0 leaderboard this week.
HeyGen HyperFrames integrates natively with Claude Design
HeyGen's HyperFrames now integrates natively with Claude Design, enabling HTML-to-MP4 motion graphics from a single CLI command. The integration brings programmatic video composition into the Claude Design workflow.
Mistral Medium 3.5: 128B dense flagship with 256K context
Mistral launched Medium 3.5, a 128B dense flagship model with 256K context and configurable reasoning, released with weights on Hugging Face. Alongside it Mistral shipped a Vibe coding agent.
Stripe opens Projects.dev: 32 infra providers provisionable by agents
Stripe removed the waitlist on Projects.dev, which lets AI agents provision infrastructure from 32 providers (Cloudflare, WorkOS, ElevenLabs, Twilio, Daytona, Browserbase, AgentMail and more) via CLI. It is part of Stripe's push into agent engineering announced around Sessions 2026.
Qwen3.6-27B: dense Apache-2.0 model beats Alibaba's own 400B flagship
Alibaba shipped Qwen3.6-27B, a dense 27B-parameter model under Apache 2.0 that beats Alibaba's own 400B flagship on every major coding benchmark. Yam described it as getting Opus 4-or-5-level capability at home, and it continues the dense-beats-MoE story in open source.
Kimi K2.6: 1T MoE open-source SOTA on SWE-Bench Pro
Moonshot AI released Kimi K2.6, a 1-trillion-parameter MoE with 32B active parameters, 384 experts, MLA attention, and a 256K context window under a modified MIT license. It claims open-source state of the art on SWE-Bench Pro at 58.6, and Wolfram called it the best open-source model he has ever tested on his private WolfBench.
Codex gets background computer use on macOS plus Chronicle screen memory
Codex shipped true background computer use on macOS: a second cursor running on its own thread that works while you work, with subagents controlling different windows in parallel, building on OpenAI's Software Apps Inc. (ex-Apple Shortcuts team) acquisition. Chronicle adds total screen memory by taking a screenshot every 10 seconds and feeding it into Codex context, so you can ask what you were doing an hour ago. Codex also passed 4 million users this week.
OpenAIDevs releases Euphony, an open-source Codex session log visualizer
The OpenAI developer relations team released Euphony, an open-source visualizer for Codex session logs. It lets developers inspect and replay what their Codex agent sessions actually did.
GPT-5.5 and GPT-5.5 Pro drop live, SOTA across the board
OpenAI shipped GPT-5.5 and GPT-5.5 Pro mid-show, taking state of the art on Terminal-Bench 2 (82.7%, up from 75%), SWE-Bench Verified (73%), GDPval (84%) and Frontier Math (35%), beating Opus 4.7 and Gemini 3.1. It uses ~40% fewer tokens than 5.4, which offsets most of the API price doubling to $5/$30 per million ($30/$180 for Pro): at twice the price and roughly 0.6x the tokens, a like-for-like task works out to about 20% more expensive to run rather than 100% more. Peter Gostev called it the first model that genuinely sustains multi-hour long-running tasks, with one task running 8.5 hours straight; rollout was Codex-first, not yet in ChatGPT.
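How a price change and a token-usage change combine can be checked with simple arithmetic. The sketch below assumes cost per task scales linearly with tokens and that the price multiplier and token reduction apply uniformly to input and output, which is a simplification the source does not state; `relative_cost` is a hypothetical helper name.

```python
def relative_cost(price_multiplier: float, token_reduction: float) -> float:
    """Cost of the new model relative to the old one for the same task.

    price_multiplier: new price / old price (2.0 means pricing doubled).
    token_reduction: fraction of tokens saved (0.4 means 40% fewer tokens).
    """
    return price_multiplier * (1.0 - token_reduction)


# Figures quoted in the summary above: pricing doubled, ~40% fewer tokens.
ratio = relative_cost(price_multiplier=2.0, token_reduction=0.4)
print(f"relative cost per task: {ratio:.2f}x")
```

A ratio above 1 means the new model costs more per task and a ratio below 1 means it costs less; with a doubled price, break-even sits at a 50% token reduction. Real workloads with different input/output mixes or caching will deviate from this estimate.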
W&B LEET TUI ships workspace mode with multi-run compare and GPU metrics
Weights & Biases shipped workspace mode for LEET, its terminal UI for experiment tracking. The update brings multi-run comparisons, live GPU metrics, and images rendered directly in the terminal.
SpaceX/xAI and Cursor strike $10B collab with $60B acquisition clause
Cursor and SpaceX/xAI announced a deal structured as a $10B collaboration with a $60B acquisition clause. The panel discussed it in the week-in-review as one of the biggest industry moves of the week.
Qwen 3.6-35B-A3B: Apache 2.0 MoE with 3B active hits 73.4% SWE-Verified
Alibaba Qwen open-sourced Qwen 3.6-35B-A3B under Apache 2.0 the same morning Opus 4.7 dropped: a 35B MoE with only 3B active parameters that scores 73.4% on SWE-bench Verified, rivaling models 10x its size. It is natively multimodal with 262K context extensible to 1M, and the crew called it the strongest mid-size LLM on nearly all benchmarks, putting to rest doubts about Qwen's open-source commitment after Junyang Lin's departure.
Claude Code Routines: cron and event-triggered agents on Anthropic's cloud
Anthropic launched Claude Code Routines, autonomous agents that run on Anthropic's cloud and can be triggered by cron schedules, GitHub events, or API calls. It moves Claude Code from an interactive CLI toward standing, self-scheduling automation infrastructure.
Claude Opus 4.7 drops live with 87.6% SWE-bench Verified and xhigh effort
Anthropic shipped Claude Opus 4.7 minutes before the show, scoring 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, an 11-point jump over Opus 4.6 on the harder agentic coding eval. It adds a new 'xhigh' (extra high) reasoning effort, 3x vision resolution, a +22% ScreenSpot Pro computer-use jump (57.7% to 79.5%), and a /ultrareview command in Claude Code at the same pricing, though a new tokenizer uses 1.0-1.35x more tokens. The system card mentions the unreleased 'Mythos' 331 times, and an MRCR long-context drop from 78% to 32% suggests a new pre-trained base.
Marimo released Marimo Pair, which embeds Claude Code, Codex, or OpenCode agents directly inside its reactive, dependency-graph-aware Python notebooks. Founding engineer Trevor Manz joined the show to explain why reactive notebooks are a natural verification surface for agent-written code; the launch trended on Hacker News this week and was featured as part of This Week's Buzz (Marimo is in the CoreWeave family).
OpenAI dropped a massive Codex update mid-show: native macOS computer use that runs in the background with its own separate cursor so you can keep working, 90+ plugins, gpt-image-1.5 image generation and editing, an in-app browser, a memory preview that 'learns from experience', proactive work suggestions, multi-terminal SSH into dev boxes, and thread automations. Alex's hot take: Codex, not ChatGPT, is becoming OpenAI's super-app.
Warp now supports any CLI agent with vertical tabs and mobile control
Warp shipped support for running any CLI coding agent inside its terminal, adding vertical tabs for parallel agent sessions, notifications, built-in code review, and mobile remote control of running agents. It positions Warp as a harness-agnostic cockpit in the increasingly crowded agent-management race.
Windsurf 2.0 ships Agent Command Center and full Devin integration
Cognition launched Windsurf 2.0, the first big post-acquisition release, headlined by the Agent Command Center, a Kanban-board mission control for managing dozens of agents at once. It adds Spaces for switching context between parallel tasks and integrates Devin directly inside Windsurf, so you can plan locally with a Socratic-method agent and hand off to Devin in the cloud for end-to-end execution. Theodor Marcu said internal Cognition usage doubled after launching Managed and Scheduled Devins.
Anthropic unveils Claude Mythos, a frontier model 'too dangerous to release'
Anthropic announced Claude Mythos Preview under Project Glasswing, a cyber-defense frontier model it says is too dangerous to release publicly: it found zero-days in every major OS and browser and escaped its sandbox. It scores 77% on SWE-bench Pro (up from 53% on Opus 4.6) and 64% on HLE, priced at $25/$125 per M tokens and available only to ~40 partner companies. Peter Gostev's read: the real reason it's unreleased is compute shortage, not safety.
Cursor ships remote agents and a code review agent
Cursor launched remote agents plus a code review agent that the company says catches 78% of issues before merge. Mentioned in the week's tools and agentic-engineering roundup.
Codex hits 3M WAU with plugins, sub-agents and Guardian Approvals
OpenAI's Codex reached 3M weekly active users, up from 2M last month, as VB from the Codex team walked through what's behind it: plugins that bundle skills plus MCP servers (Stripe, Supabase, shadcn), sub-agents that decompose tasks into parallel Codex agents, and experimental hooks. New Guardian Approvals spins up a sub-agent that risk-classifies every tool call, auto-approving low/medium risk and escalating only the dangerous ones.
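The approval flow described above (classify each tool call, auto-approve low and medium risk, escalate the rest) can be sketched as follows. This is an illustration only, not Codex's implementation: the keyword lists, `classify`, and `review` are hypothetical, and a real guardian would use a model call rather than string matching.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical keyword rules; a real classifier would be a model call.
HIGH_RISK_MARKERS = ("rm -rf", "drop table", "git push --force")
MEDIUM_RISK_MARKERS = ("pip install", "git commit", "curl ")


def classify(command: str) -> Risk:
    """Assign a risk level to a proposed tool call."""
    lowered = command.lower()
    if any(marker in lowered for marker in HIGH_RISK_MARKERS):
        return Risk.HIGH
    if any(marker in lowered for marker in MEDIUM_RISK_MARKERS):
        return Risk.MEDIUM
    return Risk.LOW


def review(command: str) -> str:
    """Auto-approve low and medium risk; escalate high risk to a human."""
    return "escalate" if classify(command) is Risk.HIGH else "auto-approve"


for cmd in ("ls -la", "pip install requests", "rm -rf /tmp/build"):
    print(f"{cmd!r}: {review(cmd)}")
```

Checking the high-risk markers first matters: a command that matches both lists should be treated as dangerous, not waved through as medium.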
W&B Automations launch: event triggers from training runs
Weights & Biases shipped Automations, event-triggered actions that pipe signals from your training runs into notifications (Slack), GitHub Actions, and deployments, pairing nicely with the new W&B iOS app. In the same Buzz segment: GLM-5.1 and Gemma 4 both went live on W&B Inference.
GLM-5.1 takes #1 open-source spot on SWE-Bench Pro at 58.4%
Z.ai released GLM-5.1, now the #1 open-source model on SWE-Bench Pro at 58.4%. It can run autonomously for 8 hours with 1,700+ agent steps, and is already live on W&B Inference. Open weights are up on Hugging Face alongside an arXiv paper.
Alibaba ships Qwen3.6-Plus with near-Opus agentic coding and 1M context
Alibaba released Qwen3.6-Plus, an API model with agentic coding performance near Opus 4.5 and a 1M-token context window. The panel noted continued strong momentum for the Qwen family in practical coding and agent workloads.
Cursor 3 ships as agent-first rebuild, dropping the VS Code fork
Cursor released Cursor 3, a ground-up agent-first rebuild that is no longer a VS Code fork and supports parallel cloud and local agents. It marks a major repositioning of the editor around agentic workflows rather than traditional IDE editing.
Claw-code clean-room rewrite becomes fastest repo to 100K GitHub stars
After Claude Code's source leaked via npm, Sigrid Jin and Bellman published claw-code, a clean-room rewrite that became the fastest GitHub repo to pass 100K stars, hitting the mark in roughly 24 hours. Sigrid joined the show to separate the verifiable implementation details from the social-media exaggeration around the leak.
WolfBench results show Hermes Agent beating Claude Code and OpenClaw
Wolfram published new WolfBench agent-harness results showing Hermes Agent outperforming Claude Code and OpenClaw on Terminal Bench 2.0 across most model combinations. The panel dissected the findings and stressed reproducible eval setup and fair harness configuration.
Modular 26.2 runs FLUX.2 in under a second, 99% cheaper than Nano Banana
Modular shipped its 26.2 release with state-of-the-art image generation, running FLUX.2 in under one second (sub-300ms claims) at 99% lower cost than Nano Banana, plus upgraded AI coding with Mojo. Alex noted the surprise of an inference platform releasing model-level optimization and hoped the approach spreads to all image generation.
Anthropic makes Opus 4.6 1M context the default in Claude Code, same price
Anthropic made 1M token context the default for Opus 4.6 in Claude Code at the same price, turning what was previously experimental and expensive into the standard. MRCR benchmark performance holds at 93% at 256K and 76% at 1M. For agent users this means far less compaction and longer uninterrupted sessions, though auto-compaction still triggers around 170K unless manually raised.
Cursor Composer 2 beats Opus 4.6 on TerminalBench at a tenth of the price
Cursor launched Composer 2, its first proprietary model that genuinely competes with frontier labs. It scores 61% on TerminalBench (beating Opus 4.6) at $0.50/M input tokens, cheaper than GPT-5.4 Mini and 10x cheaper than Opus, running at 300+ tokens/sec. A fast variant costs 3x more for the same intelligence, kicking off a new 'fast mode' pricing trend where you pay a premium for speed rather than capability.
Google AI Studio gets full-stack vibe coding with Antigravity and Firebase
Google AI Studio received a full-stack vibe coding overhaul featuring the Antigravity agent, Firebase integration, and multiplayer support. The update pushes AI Studio from a model playground toward a full app-building environment.
MiniMax M2.7: first self-evolving model hits 56% on SWE-Bench Pro
MiniMax dropped M2.7, billed as the first self-evolving model: it ran 100+ autonomous RL optimization loops and wrote its own agent scaffolding, built by one engineer over four days with zero lines of human code. It scores 56.22% on SWE-Bench Pro, within about one point of Opus 4.6's 57.3%, and WolfBench shows it roughly matching Sonnet 4.6 on OpenClaw agent tasks. It is not yet open weights, though rumors suggest a release is coming.
OpenAI acquires Astral, makers of uv and Ruff, to join the Codex team
OpenAI acquired Astral, the company behind the uv Python package manager, Ruff, and ty, with the team joining Codex specifically — OpenAI's third acquisition of the month. The panel drew the parallel to Anthropic buying Bun for TypeScript infrastructure: OpenAI now owns core Python tooling for the code its agents write. The tools remain open source and forkable.
OpenAI ships subagents for Codex with custom TOML configs
OpenAI added subagents to Codex, enabling parallel specialized agents configured via custom TOML files. Paired with the cheap GPT-5.4 Mini and Nano models, this enables the orchestrator-plus-workers pattern where a flagship model spawns inexpensive parallel subagents for tasks like visual testing.
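The orchestrator-plus-workers pattern can be sketched with the standard library. This does not use or imitate Codex's TOML configuration, whose schema is not given here; `worker` and `orchestrate` are hypothetical names, and the worker is a placeholder for a call to a small, cheap model.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(task: str) -> str:
    """Stand-in for a cheap subagent; a real one would call a small model."""
    return f"done: {task}"


def orchestrate(subtasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Fan subtasks out to parallel workers and collect results by subtask."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(worker, subtasks))  # map preserves input order
    return dict(zip(subtasks, results))


report = orchestrate(
    ["screenshot desktop layout", "screenshot mobile layout", "check console errors"]
)
for task, result in report.items():
    print(f"{task} -> {result}")
```

Threads suit this shape of work because each subagent call is I/O-bound (waiting on a model response), so the orchestrator spends its time waiting, not computing.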
OpenAI ships GPT-5.4 Mini and Nano for coding, computer use, and subagents
OpenAI released GPT-5.4 Mini ($0.75/M input) and Nano, smaller variants optimized for coding and computer use at a fraction of flagship cost. Mini hits 72% on OSWorld-Verified, matching the human baseline and nearly reaching full 5.4's 75%, while beating Sonnet 4.5 on most benchmarks. They are designed as cheap parallel subagent workers under a GPT-5.4 orchestrator in Codex, and Mini is 2x faster than the previous GPT-5 Mini.
Unsloth Studio: web UI for local fine-tuning with 2x speed, 70% less VRAM
Unsloth launched Studio, an open-source web UI for local LLM training and inference claiming 2x speed and 70% less VRAM, supporting 500+ models across text, vision, audio, and embeddings. The panel framed it as a potential 'LM Studio moment for fine-tuning', bringing no-code training to beginners. It is confirmed working on Google Colab Pro, where models can be trained overnight for about $20/month.
Weights & Biases launches native iOS app for monitoring training runs
W&B shipped its most-requested feature ever: a native iOS app for monitoring AI training runs with live metrics and push notifications for crash alerts. Practitioners can now keep an eye on long-running training jobs from their phone instead of staying glued to a dashboard.
Cursor joins ACP registry and goes live in JetBrains IDEs
Cursor joined the Agent Client Protocol (ACP) registry and is now live inside JetBrains IDEs. The move is a cross-ecosystem win for ACP, the emerging open standard that lets any AI agent plug into any editor.
Karpathy open-sources AutoResearch for autonomous ML experiments
Andrej Karpathy open-sourced AutoResearch, a framework that runs AI-driven ML experiments autonomously. Over two days it ran 700 experiments on nanochat GPT-2, stacked 20 improvements, and achieved an 11% training speedup. Tobi Lütke adapted it overnight for Shopify's Liquid templating engine for a 51% render-time improvement, and the repo hit 26K GitHub stars quickly.
700 AutoResearch experiments run in 2 days (Karpathy); 11% GPT-2 training speedup from stacked AutoResearch improvements; 51% Shopify Liquid render-time improvement using AutoResearch
/last30days research skill searches X, Reddit, YouTube and TikTok
Matt Van Horn presented /last30days, a research skill that searches X, Reddit, YouTube, and TikTok for the last 30 days of content on any topic. It uses the ScrapeCreators API under the hood, works best in Claude Code, and installs from GitHub.
Weights & Biases officially launched Agent Skills, installable via `npx skills add wandb/skills`. The launch coincided with Nemotron 3 Super becoming available on W&B Inference at $0.20/1M input tokens, one of the best price-performance options for a 120B model.
Cognition previews SWE-1.6, hitting 51% on SWE Bench Pro
Cognition previewed SWE-1.6, the next iteration of its software-engineering model line, citing 51% on SWE Bench Pro. It was covered in the TL;DR tools segment as part of the week's agentic coding model releases.
Google released a command-line interface for Google Workspace, making Workspace data and actions scriptable from the terminal for developers and agents. Covered briefly in the TL;DR tools segment.
OpenAI brought its Codex desktop app to Windows, expanding the agentic coding tool beyond its initial platforms. Mentioned in the TL;DR tools and agentic engineering rundown.
OpenAI drops GPT-5.4 Thinking and GPT-5.4 Pro live during the show
OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro mid-show, a frontier general model that folds Codex-level coding into a unified reasoning model. It ships with a 1M token context window, a /fast mode, and mid-reasoning steering, posting 83.3% on ARC-AGI 2 (Pro) and roughly 75% on OS World computer use. The panel tested it live in Codex and called it a major general-model jump, while noting input pricing rose about 50% versus 5.2.
83.3% ARC-AGI 2 (GPT-5.4 Pro); 75% OS World / computer-use score; 1M context window
Ryan Carson experimented with OpenAI's Symphony framework, letting agents work through PRs overnight. One agent not only created a PR but found a bug and filed its own detailed Jira ticket with no human intervention, a small but telling sign of where agentic development is heading.
Qwen 3.5 lands: 35B/3B-active Medium outperforms the old 235B flagship
Alibaba released the Qwen 3.5 family of open-weight models, headlined by Qwen3.5-35B-A3B, a 35B model with only 3B active parameters that outperforms their previous 235B flagship. Variants include a 122B-A10B and a dense 27B, with the panel highlighting the hybrid state-space (Mamba-layer) architecture and strong practical coding and agent performance at a tiny active-parameter footprint.
Anthropic shipped Remote Control for Claude Code, enabling remote and async control of coding sessions, alongside a new memory capability. The panel framed these as part of labs converging on richer agent harnesses with remote, async workflows as a primary competitive layer.
Claude Cowork gets automations (cron jobs), matching Codex
Claude Cowork added automations, cron-job-style scheduled agent runs, in the same week OpenAI's Codex gained equivalent automation support. The panel saw labs converging on heartbeats, cron jobs, and cloud-based agents as standard product surface area.
Devin 2.2: computer use, browser, and self-verifying autonomous work
Cognition shipped Devin 2.2, an autonomous coding agent that can use a computer and browser to verify and fix its own work, plus a free public Devin Review workflow for PR review and scheduled/automated sessions. Nader Dabit framed the release as two years of platform maturity converging with stronger models, letting non-engineers fix issues directly by just asking Devin.
Cursor launched cloud agents, moving agentic coding work off the local machine into remote, async sessions. The panel highlighted Cursor's cloud agents and UI demos as important progress for frontend development workflows.
LM Studio launches LMLink for remote access to local models
LM Studio launched LMLink, which lets you use your locally hosted models from anywhere via Tailscale. It extends the local-model story so that on-device inference is reachable from any of your machines.
Anthropic ships Claude Sonnet 4.6 with 79.6% SWE-Bench and 1M context
Anthropic launched Claude Sonnet 4.6, its most capable Sonnet ever, scoring 79.6% on SWE-Bench Verified, nearly matching Opus 4.6 at Sonnet pricing of $3/$15 per million tokens. It ships with a 1M token context window in beta and is now the default model on claude.ai. In blind Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time, and it beats the previous Gemini 3 Pro on most benchmarks.
Dreamer launches beta platform for building agentic apps with no-code AI
Dreamer launched its beta, a full-stack platform for building and discovering agentic apps with no-code AI. It aims to let non-developers assemble and share agent-powered applications.
Gemini 3.1 Pro drops live with 44% HLE and 77% ARC-AGI at the same price
Google released Gemini 3.1 Pro minutes before the show, claiming 2.5x better abstract reasoning and improved coding and agentic capabilities at the same price point as its predecessor. It scores 44% on Humanity's Last Exam, 77% on ARC-AGI without a custom harness, and 68 on Terminal Bench, putting it at or near state of the art alongside Opus 4.6. In Nisten's live vibe-coding test it was blazingly fast but less polished than Opus 4.6 and Codex output.
OpenAI acqui-hires OpenClaw creator Peter Steinberger
OpenAI acqui-hired Peter Steinberger, the creator of the viral OpenClaw agent, in what the panel speculated might be the first single-founder billion-dollar deal. Yam Peleg broke the news on the show, calling Steinberger 'the goat'. The move lands the most popular third-party agent harness builder inside OpenAI, amid a week where Anthropic's terms changes pushed agent users toward OpenAI subscriptions.
Ryan Carson publishes the viral Code Factory agentic engineering blueprint
Ryan Carson published his viral Code Factory article, a blueprint for fully automated code generation, review, and deployment inspired by OpenAI's Harness Engineering post. The setup chains GitHub Actions, Greptile code review, CI gates, a risk-classification system for high-risk file changes, and a self-healing loop where Codex fixes its own PR issues until all checks pass. He says it takes a week-plus of setup but unlocks massive throughput.
Entire raises $60M seed, ships first OSS release 'Checkpoints'
Entire raised a $60M seed round to build an open-source developer platform for AI agent workflows. Alongside the funding it shipped its first open-source release, Checkpoints, available on GitHub.
MiniMax M-2.5 hits 80.2% SWE-Bench Verified with 10B active params
MiniMax dropped M-2.5 thirty minutes before the show: a 200B-total, 10B-active open-weights model scoring 80.2% on SWE-Bench Verified, approaching Opus 4.6 at roughly 1/20th the cost (~15 cents per task with a 57% win rate over Opus). It was trained with MiniMax's decoupled Forge RL framework and optimized for end-to-end task time, using fewer tool calls and thinking tokens. Senior researcher Olive Song joined live and revealed the model was still training; the team cut a checkpoint for early release.
OpenAI ships GPT 5.3 Codex Spark on Cerebras for real-time coding
OpenAI released GPT 5.3 Codex Spark, a smaller Codex variant built for real-time coding and served on Cerebras hardware (OpenAI's first model on Cerebras), with reported speeds of over 1000 tokens/sec. It is available to ChatGPT Pro users in the Codex app, CLI, and IDE extension. The release landed during the show as the episode's second breaking-news drop.
Ryan Carson releases AntFarm for agent coordination
Co-host Ryan Carson released AntFarm, a tool for coordinating teams of coding agents. It targets the missing primitives for managing multiple agents that the panel discussed during the agent-psychosis segment.
Z.ai launches GLM-5, the open-weights agentic coding crown
Z.ai released GLM-5, a 744B-parameter MoE model (40B active) trained on 28.5 trillion tokens that takes the #1 open-source ranking for agentic coding with 77.8% SWE-bench Verified. It introduces the slime asynchronous RL framework for post-training, adopts DeepSeek's sparse attention to cut deployment cost, and was trained on Huawei chips rather than NVIDIA. Lou from Z.ai joined the show live and summed it up as bigger, faster, better, and cheaper.
Qwen3-Coder-Next hits 70.6% SWE-Bench Verified with 3B active params
Alibaba's Qwen3-Coder-Next is an 80B MoE coding agent model with only 3B active parameters that scores 70.6% on SWE-Bench Verified and 44% on the much harder SWE-Bench Pro. It was trained on 7.5T tokens with 20,000 parallel RL environments and runs under 48GB of RAM with GGUF quantization, making near-frontier agentic coding feasible on local hardware.
Anthropic ships Claude Opus 4.6 with 1M context and agent teams
Anthropic dropped Opus 4.6 live during the show, claiming state of the art on GDPval, BrowseComp, and agentic search, with 65.4% on Terminal Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model picks up contextual clues about how much reasoning effort to apply. Pricing matches Opus 4.5 under 200K tokens and doubles above that, and Claude Code gains agent teams for orchestrating parallel sessions.
OpenAI launches standalone Codex app for managing parallel coding agents
OpenAI shipped Codex as a dedicated Mac app, a command center for running multiple AI coding agents in parallel. Features include work trees for parallel project branches, scheduled automations, a skills marketplace with Cloudflare, Vercel, Figma, Notion, and Linear integrations, inline diff review with per-line commenting, and cloud hand-off. OpenAI granted a free month of access to all users including the free tier, and doubled rate limits for all tiers for two months.
OpenAI answers Opus with GPT-5.3-Codex, first model that helped build itself
One hour after Opus 4.6, OpenAI released GPT-5.3-Codex, billed as the first model instrumental in developing itself — the Codex team used early versions to debug its own training and manage its own deployment. It scores 73% on Terminal Bench 2.0, a 10-point gap over Opus 4.6, while running queries 25% faster and more token-efficiently than its predecessor, with improved mid-task steerability.
Anthropic launches MCP Apps: interactive UI inside Claude chat
Anthropic's MCP Apps render interactive, branded UI components (Box files, Figma, color pickers) directly within Claude conversations, evolving MCP from tools to embedded app experiences. It is protocol-based, so any app can integrate, letting brands reclaim identity from text-only LLM responses.
Jan AI releases Jan v3, a 4B model built for fast local inference
Jan v3 is a 4B-parameter open model optimized for local inference, hitting 132 tokens/sec with a 262K context window and a 40% improvement on coding. The Jan desktop app it powers has reached 5M downloads.
Alongside Kimi K2.5, Moonshot AI shipped Kimi Code, a coding tool that pairs with its new flagship model's strong agentic coding abilities. The code is available on GitHub with an announcement page at kimi.ai/code.
Moonshot AI releases Kimi K2.5, the new open-source king
Moonshot AI's Kimi K2.5 takes the open-source crown, becoming the most-used model on OpenRouter and topping open-source leaderboards. The panel highlighted its strong agentic coding performance and tool use.
The Cline team was acqui-hired by OpenAI's Codex group following the viral 'imagine the smell' hackathon controversy. The panel discussed it as part of the growing Codex ecosystem, which Peter Steinberger used to build Clawdbot entirely.
Claude Code VS Code extension hits general availability
Anthropic's Claude Code VS Code extension reached general availability, bringing full agentic coding directly into the IDE. The GA release makes Claude Code's agent workflows accessible from the VS Code Marketplace without the CLI.
Vercel launched skills.sh, a registry where you can browse and install agent skills from the command line for any agent, including Clawdbot. It hit 20K installs within hours, and releases like Browser Use shipping as a skill signal a broader shift from MCP servers toward skills.
GLM-4.7-Flash: 30B MoE local coding agent with only 3B active params
Z.AI released GLM-4.7-Flash, a 30B parameter MoE model with only 3B active parameters, designed as the ultimate local coding and agent assistant. It hits 59% on SWE-Bench Verified (approaching Sonnet 4's 64%) and runs at 120 tokens/sec on a stock Mac Studio M3 Ultra, fast enough to run Ralph autonomous coding loops even on CPU.
59% SWE-Bench Verified | 120 tps speed on Mac Studio M3 Ultra
Vercel releases official agent skill packs for Next.js and React
Vercel began releasing official agent skill packs for Next.js and React, packaging its framework expertise in the agent skills standard. Ryan Carson highlighted that you can point any skills-compatible coding agent at the pack and it installs the skills for you, an early sign of experts shipping domain knowledge as skills.
NousCoder 14B: 7% LiveCodeBench jump in 4 days of RL training
Nous Research released NousCoder 14B, an open source competitive programming model that achieved a 7% jump on LiveCodeBench accuracy in just four days of RL training on 48 NVIDIA B200 GPUs. Training used 24,000 verifiable problems, and the release ships under a full Apache 2 license with training code and a benchmark harness.
Ralph Wiggum autonomous coding technique hits 1.2M views
Ryan Carson published a viral breakdown (1.2M views on X) of Ralph Wiggum, the autonomous coding technique created by Geoffrey Huntley: write a PRD, break it into atomic user stories with acceptance criteria in JSON, then run a bash loop that has a CLI agent pick the next story, code it, commit, and loop. The technique works with any CLI agent (Amp, Claude Code, Cursor CLI, Gemini CLI), compounds learning via agents.md, and won a YC hackathon running overnight on Sonnet 4.5.
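The loop described above can be sketched in a few lines. This is a minimal illustration in Python rather than the bash loop the technique actually uses, and it simplifies by having the loop (not the agent) pick the next story. The JSON field names, `fake_agent`, and the iteration cap are all hypothetical; a real setup would invoke a CLI agent and commit after each story.

```python
import json
from typing import Callable, Optional

# Hypothetical story file contents; the field names are assumptions.
PRD_JSON = """
[
  {"id": "US-1", "title": "Add login form", "acceptance": ["form renders"], "passes": false},
  {"id": "US-2", "title": "Validate email", "acceptance": ["bad email rejected"], "passes": false}
]
"""

def next_story(stories: list[dict]) -> Optional[dict]:
    """Return the first story that has not passed yet, or None when all are done."""
    for story in stories:
        if not story["passes"]:
            return story
    return None

def ralph_loop(stories: list[dict],
               run_agent: Callable[[dict], bool],
               max_iterations: int = 10) -> int:
    """Pick a story, hand it to the agent, record the result; return iterations used."""
    iterations = 0
    while iterations < max_iterations:
        story = next_story(stories)
        if story is None:
            break  # every story passes, so the loop is finished
        iterations += 1
        story["passes"] = run_agent(story)
    return iterations

def fake_agent(story: dict) -> bool:
    """Stand-in for invoking a CLI agent that codes and commits one story."""
    print(f"working on {story['id']}: {story['title']}")
    return True

stories = json.loads(PRD_JSON)
used = ralph_loop(stories, fake_agent)
print(f"finished in {used} iterations")
```

The iteration cap is a design choice of this sketch: if an agent keeps failing the same story, the loop stops instead of running forever.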
Catnip by W&B: open source iOS app to run Claude Code anywhere
Chris Van Pelt of Weights & Biases released Catnip, an open source iOS app that lets you run Claude Code from anywhere via GitHub Codespaces. It is available on the App Store with source on GitHub.
Qwen 3 Coder posts insane scores in the race for the coding crown
Alibaba's Qwen 3 Coder landed in July with what the crew called insane benchmark scores for an open-weights coding model. Together with Kimi K2 and GLM 4.5 it made July the peak month for Chinese open source.
Claude Code launches, starting the CLI agent revolution
Claude Code launched in February, having started as an internal Anthropic engineering tool. Multiple co-hosts picked it as the single most impactful AI release of 2025 — it began the CLI agent era and proved, in Kwindla's words, that 'sometimes it's mostly about the harness.'
Claude Opus 4 drops in Q2 — Ryan's pick for best model ever
Claude Opus 4 launched in Q2 and became Ryan Carson's pick as the best coding model he had used in over 700 days of daily LLM coding. It cemented Anthropic's lead in agentic coding through the middle of the year.
Claude Skills launches — 'MCP-level if not bigger'
Anthropic launched Claude Skills in October. It was largely missed at release but picked up steam fast, with the show arguing Skills is 'MCP level if not bigger' for Claude users as a way to package reusable agent capabilities.
Cursor 2 and the Composer model level up IDE agents
Cursor shipped Cursor 2 along with its Composer model in October, leveling up in-IDE agentic coding. It capped a year in which Cursor's sales exploded on the back of Claude 3.7 and the vibe coding wave.
GPT-5 Codex: OpenAI's specialized coding model moves the stock
GPT-5 Codex dropped in September as OpenAI's coding-specialized fine-tune of GPT-5. Yam dubbed it the 'infinite money glitch' because the release moved OpenAI-linked stock prices significantly.
Windsurf Code Maps generates flowcharts of entire codebases
Windsurf released Code Maps in November, a feature that generates flowchart-style maps of entire codebases. It was one of the quieter but practical dev-tool releases in a month dominated by frontier model drops.
GLM 4.6 quietly becomes the model businesses actually use
Zhipu's GLM 4.6 arrived in October and, per Nisten, quietly became a go-to model that many businesses still run today. It continued GLM's trajectory from hackathon favorite to production workhorse.
Gemini 3 Flash delivers frontier intelligence at $0.50/1M input tokens
Google launched Gemini 3 Flash, offering frontier-tier capability at flash-tier pricing of $0.50 per million input tokens. It scores 78% on SWE-bench Verified, beating larger models on some agentic tasks, and supports tool-calling at scale with up to 100 simultaneous function calls.
$0.50 per 1M Gemini 3 Flash input tokens | 78% SWE-bench Verified
GPT 5.2 Codex drops live during the show with 400K context
OpenAI released GPT 5.2 Codex via API after months of exclusivity in the Codex app, making it available in Cursor, GitHub Copilot, and VS Code with native context compaction for long sessions. Cursor showcased it by building a complete browser from scratch in Rust, roughly 3 million lines of code across about 330,000 commits, driven by hundreds of concurrent agents.
Cursor offers GPT-5.1-Codex-Max for free through Dec 11
Cursor made OpenAI's GPT-5.1-Codex-Max available free to users through December 11, alongside a blog post on its codex model harness. The promotion gives developers no-cost access to a frontier coding model inside the editor.
DeepSeek V3.2 and V3.2-Speciale post gold-medal reasoning under MIT license
DeepSeek released V3.2 and the reasoning-first V3.2-Speciale, a 685B-parameter MoE under MIT license. Speciale posted gold-medal-level olympiad results and 96% on AIME (versus GPT-5 High at 94%), with V3.2 hitting 73.1% on SWE-Bench Verified. Aggressive pricing around 28 cents per 1M tokens on OpenRouter pushes open models closer to top closed-model capability.
96% AIME | 73.1% SWE-Bench Verified | 685B total parameters (MoE)
Mistral returns to Apache 2.0 with Mistral Large 3 and Ministral 3
Mistral relaunched its model family under permissive Apache 2.0 licensing with Mistral Large 3 and the small Ministral 3 edge models. Large 3 ships a 256K context window and strong open-model coding positioning. The licensing shift reignited discussion around open model portability and deployability.
W&B launches LLM Evaluation Jobs for OpenAI-compatible APIs
Weights & Biases launched LLM Evaluation Jobs, letting teams run evaluations against any OpenAI-compatible API during training cycles instead of only at the end. The show framed it as a practical workflow upgrade for getting earlier model quality signals without blindly burning compute.
Anthropic launches Claude Opus 4.5, reclaiming the coding crown
Anthropic released Claude Opus 4.5, scoring 80.9% on SWE-bench Verified to top GPT-5.1 (77.9%) and Gemini 3 Pro (76.2%). It adds a new 'Effort' parameter for compute control, Tool Search to cut agent token overhead, and Programmatic Tool Calling where the model writes and executes code loops. Pricing dropped to $5/M input and $25/M output, roughly one-third the old Opus price.
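To make the quoted prices concrete, here is a small cost calculation using the $5/M input and $25/M output figures above. The token counts in the example are made up for illustration, and the sketch assumes flat per-token billing with no caching or batch discounts.

```python
INPUT_USD_PER_M = 5.00    # per million input tokens, from the summary above
OUTPUT_USD_PER_M = 25.00  # per million output tokens, from the summary above

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Flat-rate cost of one request; ignores caching and batch discounts (assumption)."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_M
    return input_cost + output_cost

# Hypothetical agent turn: 200K tokens of context in, 20K tokens out.
cost = request_cost_usd(200_000, 20_000)
print(f"${cost:.2f} per request")
```

Output tokens cost five times as much as input tokens at these rates, so long generations dominate the bill faster than long prompts do.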
W&B launches Serverless LoRA Inference on CoreWeave
Weights & Biases launched Serverless LoRA Inference on CoreWeave: upload a LoRA adapter to W&B Artifacts and serve it instantly on top of any supported base model with no cold starts and no dedicated GPU instances. Alex demoed a 'Mocking SpongeBob' LoRA he trained in 25 minutes, served on a Qwen 2.5 base.
Antigravity: Google's free agent-first IDE powered by Gemini 3 Pro
Google's Antigravity is a free VS Code fork reimagined for agent-first coding, with an inbox-style Agent Manager for running multiple coding agents in parallel across a codebase. Browser integration lets agents control Chrome, take screenshots and videos of the running app, and self-debug. The free tier is powered by Gemini 3 Pro, with GPT-OSS 120B as the open-source alternative and Nano Banana for images.
Marimo ships reactive Python notebooks extension for VS Code and Cursor
Marimo released a new VS Code and Cursor extension bringing its reactive Python notebooks directly into the editor, with uv integration for dependency management. It was highlighted in the open-source roundup as a notable dev-tool release of the week.
GPT-5.1-Codex-Max runs 24-hour coding tasks with native compaction
OpenAI's newest frontier agentic coding model is trained with native compaction, letting it intelligently summarize prior context and work on a single task for 24+ hours (an internal run reportedly lasted a full week). It uses 30% fewer thinking tokens at median than its predecessors and sets a new SOTA of 58% on TerminalBench 2, also leading on SWE-Bench and SWE-Lancer. Windows PowerShell support is significantly improved, alongside an experimental Windows sandbox and a new extra-high reasoning level.
58% TerminalBench 2 (new SOTA) | 24h+ single-task agent run time via native compaction | 30% fewer thinking tokens at median
Terminal-Bench 2.0 and Harbor launch as new bar for coding agents
Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1000 Discord contributors. The Warp agent tops the leaderboard at 50% with Codex CLI close behind, and the panel argued an unsaturated 50% ceiling makes it far more meaningful than near-saturated benchmarks like MMLU.
LMArena launches Code Arena for live agentic coding evaluations
LMArena launched Code Arena, a live evaluation platform where models build real applications agentically and humans vote on the results. It extends the arena-style crowdsourced ranking approach to agentic coding workflows.
W&B ships LEET, an open-source terminal UI for monitoring ML runs
Weights & Biases released LEET (Lightweight Experiment Exploration Tool), an open-source terminal-native dashboard for tracking ML runs, demoed live by Dima Duev of the SDK team. It works fully offline for air-gapped HPC clusters and brings real-time metrics, system stats, and zoomable interactive charts to the terminal.
Anthropic publishes code-execution-with-MCP pattern for token-efficient agents
Anthropic published an engineering post showing how running MCP-connected tools as code, instead of direct tool calls, slashes token use and scales agents to many more tools. The approach echoes Cloudflare's Code Mode and framed the episode's interview with Kenton Varda about agents writing code against tool APIs.
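The token saving comes from where the data is processed. The sketch below contrasts the two styles with a stand-in data source: `fetch_orders` is a hypothetical tool, not a real MCP server, and character counts serve as a rough proxy for tokens. It illustrates the general idea only and is not the code from Anthropic's post.

```python
import json

def fetch_orders() -> list[dict]:
    """Hypothetical tool standing in for an MCP-connected data source."""
    return [
        {"id": i, "status": "pending" if i % 50 == 0 else "done", "total": i * 1.5}
        for i in range(1000)
    ]

def direct_tool_call() -> str:
    """Direct style: the full tool result is serialized into the model's context."""
    return json.dumps(fetch_orders())

def code_execution() -> str:
    """Code style: agent-written code filters next to the tool and returns a summary."""
    pending_ids = [o["id"] for o in fetch_orders() if o["status"] == "pending"]
    return json.dumps({"pending_ids": pending_ids})

direct = direct_tool_call()
summarized = code_execution()
print(f"direct tool call:  {len(direct)} characters enter the context")
print(f"code execution:    {len(summarized)} characters enter the context")
```

In the second style, the intermediate rows never pass through the model, so context use grows with the size of the answer, not the size of the data.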
Cursor added an in-IDE browser, letting developers preview and interact with their running app without leaving the editor. The panel called out how performant the implementation is, tightening the loop between agentic code edits and visual verification.
Moonshot AI releases Kimi K2 Thinking, an open 1T-param reasoning MoE
Moonshot AI released Kimi K2 Thinking, an open-source 1-trillion-parameter mixture-of-experts reasoning agent with 256K context and large-scale tool-calling capacity. The panel treated it as the open-source centerpiece of the week, focusing on its reasoning quality and coding utility rather than just benchmark screenshots, and as a sign open models keep closing the usability gap with frontier closed models.
Windsurf ships Codemaps, AI-annotated navigable maps of your codebase
Cognition's Windsurf launched Codemaps, AI-annotated and navigable maps of a codebase powered by SWE-1.5 for fast mode and Claude Sonnet 4.5 for smart mode. It aims to help developers and agents build a structural understanding of large repos instead of navigating file by file.
Cognition SWE-1.5: 950 tok/s coding model hitting 40% on SWE-bench Pro
Cognition released SWE-1.5, a fast agentic coding model that serves around 950 tokens per second and scores about 40% on SWE-bench Pro. It ships inside Windsurf and reinforces the week's theme of speed-focused coding models from agent labs.
CoreWeave acquires Marimo, the reactive Python notebook company
CoreWeave, the parent company of Weights & Biases, acquired Marimo, makers of the open-source reactive Python notebook. Covered in the This Week's Buzz segment, the deal brings a popular developer notebook tool into CoreWeave's AI cloud stack.
Cursor 2.0 ships with Composer, its own 4x-faster coding model
Cursor released version 2.0 of its AI code editor alongside Composer, a new in-house coding model claimed to be about 4x faster. The launch came up as evidence that developer products are being rebuilt agent-first, with speed and orchestration as the new battleground.
MiniMax M2: open-source agentic model at 8% of Claude's price, 2x speed
MiniMax released M2, an open-source agentic model positioned at roughly 8% of Claude's price while running about twice as fast. Head of Engineering Skyler Miao joined the show for a deep dive, framing M2 as both a model story and a speed story, and the panel read it as part of a broader open-model pressure wave on frontier labs.
8% of Claude's price | 2x speed vs comparable frontier models
Pokee AI launched Pokee, an agentic workflow builder for chaining AI actions into automated workflows. It was covered in the tools rundown as part of the expanding agent-first builder stack.
Claude Code comes to the web with sandboxed cloud coding
Anthropic brought Claude Code to the web, letting developers delegate software tasks through a browser with GitHub integration, secure sandboxed execution, multi-repo support, and automatic pull requests, making it usable even from a phone. The Claude desktop app was also upgraded with screen context via screenshots, file sharing, and a new voice mode.
Google AI Studio launches 'Vibe Coding' build experience
Google's Gemini AI Studio launched a 'Vibe Coding' experience at ai.studio/build, letting users build apps from natural-language prompts with Gemini. It puts Google into the rapidly crowding prompt-to-app space alongside the week's other coding-agent moves.
TorchForge: PyTorch-native library for scalable RL post-training
Meta's PyTorch team, in collaboration with Weights & Biases/CoreWeave and Stanford, introduced TorchForge, a PyTorch-native library for scalable reinforcement-learning post-training and agent development. It is built for massive GPU runs (W&B/CoreWeave provided 520 H100s) and competes with Ray via tools like the Monarch scheduler.
Amp launches a free tier powered by ads and surplus model capacity
Amp (from the Sourcegraph team) launched a free tier for its coding agent, funded by ads and surplus model capacity. CEO Quinn Slack joined the show to explain the economics and the product thinking behind ad-supported AI dev tooling.
Claude Skills: custom instructions for AI agents now live
Anthropic launched Claude Skills, folders of instructions and resources that Claude loads on demand to specialize agents for specific tasks. The panel treated it as a major piece of the emerging builder stack, with Simon Willison arguing Skills could be a bigger deal than MCP.
Cognition SWE-grep: RL-trained fast context retrieval for coding agents
Cognition released SWE-grep, an RL-trained multi-turn context retriever that finds relevant code for agentic coding tasks far faster than full agent loops. It powers fast context retrieval in Cognition's products, and a public playground lets developers try it on real repos.
Meta releases 32B Code World Model for agentic code reasoning
Meta released CWM, a 32B open-weights research model trained to internally model code execution, aimed at agentic code reasoning rather than plain code completion. The weights are on Hugging Face under facebook/cwm, giving the open-source community a new approach to code world modeling.
Scale AI debuts SWE-bench Pro, a harder contamination-resistant eval
Scale AI released SWE-bench Pro, a tougher, contamination-resistant successor to SWE-bench for evaluating coding agents on realistic software engineering tasks. It ships with a public dataset on Hugging Face plus separate public and commercial leaderboards, and frontier models score far lower than on the original SWE-bench.
OpenAI ships GPT-5-Codex, an agentic coding upgrade for Codex
OpenAI released GPT-5-Codex, a version of GPT-5 finetuned for agentic coding inside the Codex product family. It anchors the episode's coding discussion, with the panel focusing on how coding models are becoming trustworthy enough for longer, productized agent workflows rather than just one-shot completions.
W&B brings Weave traces into Models workspaces for RL runs
Weights & Biases shipped Weave inside W&B Models workspaces, so reinforcement learning runs can now be logged and inspected with Weave trace tooling alongside training metrics. The show frames it as giving RL training 'x-ray vision' into what the model is actually doing.
Grok Code 1 takes ~50% of coding traffic on OpenRouter
xAI's new Grok Code 1 coding model rocketed to roughly 50% of all coding traffic on OpenRouter shortly after launch, helped by a free promotional period and fast, cheap inference. The panel discussed it as evidence that the coding-model market is highly price- and speed-sensitive.
DeepSWE-Preview hits 59% SWE-Bench Verified with pure RL on Qwen3-32B
Agentica and collaborators (with guest Michael Luo of UC Berkeley) released DeepSWE-Preview, a fully open-sourced RL-trained coding agent built on Qwen3-32B that reached 59% on SWE-Bench Verified, a top open result in a benchmark dominated by closed systems. The team published training methodology and weights, emphasizing reproducible reward design and verification over sealed benchmark numbers.
Cursor rolls out coding agents on web, mobile, and Slack
Cursor launched its AI coding agents on web and mobile with Slack integration, extending code agents beyond the editor window into ambient, always-on workflow software. The launch landed the same week Cursor poached key creators of Claude Code, making it product-strategy news as much as HR news.
AlphaEvolve: Gemini-powered coding agent for discovering new algorithms
Google DeepMind announced AlphaEvolve, a Gemini-powered coding agent that designs and evolves advanced algorithms, credited on the show as one of the week's mind-bending algorithmic-discovery stories. DeepMind opened an interest form for early access rather than shipping it broadly.
JetBrains open-sources Mellum-4b, its code completion focal model
JetBrains published Mellum-4b-base on Hugging Face, a 4B-parameter model specialized for code completion that powers its IDE AI features. It was listed in the episode's open-source links roundup.
PromptEvals: 12K+ real production assertion criteria for LLM evals
Shreya Shankar and collaborators released PromptEvals, the first large-scale corpus of production LLM guardrails: 2,087 developer prompts paired with 12,623 assertion criteria covering structure, style, grounding and hallucination checks, about 5x larger than prior sets. Fine-tuned open Mistral-7B and Llama-3-8B checkpoints generate assertions +21 F1 better than GPT-4o at a fraction of the latency. Accepted to NAACL 2025.
Dex Horthy publishes 12-Factor Agents, a guide to production-ready agents
HumanLayer founder Dex Horthy published 12-Factor Agents, an open GitHub repo and essay distilling common patterns and pitfalls for building reliable, production-ready AI agents. Drawing on his experience building agent SDKs, it argues that serious teams end up writing large parts from scratch and lays out principles for robust agent design, discussed in depth on the show.
OpenAI debuts Codex CLI, an open source terminal coding agent
OpenAI released Codex CLI, an open source coding tool for the terminal. It ships with hardened security, using Apple Seatbelt on macOS to limit execution to the current directory plus temp files.
OpenAI launches GPT-4.1 family (4.1, mini, nano) in the API
OpenAI released the GPT-4.1 family of models, available via API only, in three sizes: 4.1, 4.1-mini and 4.1-nano. The family features a 1M token context window, in contrast to o3's 200k, and is aimed at developers building on long-context and coding workloads.
W&B Weave Playground adds GPT-4.1 family and o3/o4-mini support
The Weights & Biases Weave Playground shipped full support for the new GPT-4.1 family and the o3/o4-mini models, letting developers evaluate and compare the week's new models for their own applications.
Cloudflare releases a new Agents SDK for building stateful AI agents
Cloudflare shipped a new Agents SDK for building and deploying AI agents on its edge platform. It joins the week's wave of agent infrastructure announcements alongside Google's A2A and broad MCP adoption.
GitMCP turns any GitHub repo into an MCP server instantly
Creators Liad Yosef and Ido Salomon launched GitMCP, a free tool that turns any GitHub repository into an MCP server by simply swapping the domain (gitmcp.io/user/repo). It lets AI assistants ground themselves in a repo's docs and code, and the creators joined the show to demo it.
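The domain swap is simple enough to express as a small helper. This sketch rewrites a GitHub repository URL into the gitmcp.io/user/repo form described above; the function name, the `https` scheme on the result, and the handling of a trailing `.git` are assumptions for illustration, not part of GitMCP itself.

```python
from urllib.parse import urlparse

def to_gitmcp_url(github_url: str) -> str:
    """Rewrite https://github.com/<owner>/<repo> to the gitmcp.io equivalent."""
    parsed = urlparse(github_url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {github_url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected an owner and a repo in: {github_url}")
    owner, repo = parts[0], parts[1].removesuffix(".git")
    # The https scheme here is an assumption; the source only gives the domain and path.
    return f"https://gitmcp.io/{owner}/{repo}"

print(to_gitmcp_url("https://github.com/user/repo"))
```

The resulting URL is what an MCP client would be pointed at in place of the original repository link.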
Google launches Firebase Studio AI app-building environment at Cloud Next
As part of a flood of announcements at Google Cloud Next 2025, Google launched Firebase Studio, a browser-based AI-powered environment for building and shipping full-stack apps. It was one of the headline developer-facing launches from the event.
DeepCoder-14B: open RL-finetuned coder beats DeepSeek R1 and o3-mini on coding
Together AI and Agentica (UC Berkeley Sky Computing Lab) released DeepCoder-14B-Preview, a reasoning model finetuned with RL that beats DeepSeek R1 and even o3-mini on several coding benchmarks. The project aims to democratize RL: the team open-sourced the model, the training dataset, the Weights & Biases logs, and the eval logs. Guest Michael Luo from Agentica joined the show to discuss the release.
W&B launches observable.tools initiative and MCP observability RFC
Weights & Biases launched the observable.tools initiative and published an RFC (RFC-269) proposing observability standards for the Model Context Protocol, inviting community comment. W&B also announced it is a launch partner for Google's A2A protocol.
OpenHands LM 32B: MIT-licensed coding agent model hits 37.2% SWE-Bench
All Hands AI (formerly OpenDevin) released OpenHands LM 32B, an MIT-licensed Qwen finetune that scores 37.2% on SWE-Bench Verified, competing with much larger models on real-world repo tasks. The OpenHands agent also took the #2 spot on the new Live SWE-Bench leaderboard, and the 32B model runs locally on a single RTX 3090. A hosted OpenHands Cloud version is also available; guest Xingyao Wang joined the show to discuss it.
37.2% SWE-Bench Verified score | #2 Live SWE-Bench leaderboard (OpenHands agent)
Devin 2.0 launches with new IDE experience and $20/month entry price
Breaking during the show: Cognition Labs launched Devin 2.0, the second version of its AI software engineer, with a new IDE experience. Crucially, pricing now starts at $20/month, down from the original $500/month tier, making the agent far more accessible.
W&B launches Observable.tools initiative to add observability to MCP
Alex and Weights & Biases launched the Observable Tools initiative to bring observability to the Model Context Protocol (MCP) ecosystem, since external tool calls currently lose visibility for debugging and security. A concrete proposal using OpenTelemetry was posted to the MCP specification GitHub discussions for community feedback.
Windsurf shipped a deployments feature that lets users push apps straight to Netlify from the editor. It is a small but practical step toward end-to-end app building inside AI coding tools.
OpenAI adopts Anthropic's Model Context Protocol - MCP won
OpenAI officially announced support for the Model Context Protocol (MCP) in its Agents SDK, effectively settling the agent tool-connectivity standards war in MCP's favor. The move is possibly more impactful long-term than the week's flashier launches, since the entire ecosystem can now converge on one protocol for connecting models to tools and data.
W&B ships official Weave MCP server - talk to your evals
Weights & Biases shipped an official MCP server for Weave, its LLM observability and evaluation tool, letting agents and MCP clients query and analyze your evals directly. Morgan McGuire of the W&B Applied AI team demoed it on the show, with wandb Models integration coming soon so agents can monitor loss curves for you.
Cursor shipped Claude 3.7 MAX, a mode giving the agent the full context window and higher tool-call limits with Claude 3.7 Sonnet. It is aimed at harder, longer coding tasks at premium usage-based pricing.
Google makes Deep Research free, adds Canvas and Live Previews to Gemini
Google made its Deep Research agent free for Gemini users and shipped Canvas, a collaborative workspace with live previews for code and documents. Demos on the show included a playable Tetris game and a markdown word counter built and previewed directly inside Gemini.
Google AI Studio adds native YouTube video understanding via link dropping
Google AI Studio now lets you drop a YouTube link and have Gemini natively understand the video. This unlocks video analysis, summarization, and support use cases without downloading or preprocessing the content.
OpenAI launches Responses API with Web Search, File Search, and Computer Use
OpenAI announced a new agent-focused developer stack at a livestream: the Responses API, a new way to build with OpenAI designed for agentic workloads, plus an Agents SDK. It ships with three built-in tools: Web Search, a File Search tool providing built-in RAG over your files, and a Computer Use tool for agents that operate computer interfaces.
Baidu launches Miaoda no-code AI app building tool
Baidu introduced Miaoda, a no-code AI-powered build tool that lets users create applications without writing code. It joins the growing wave of AI-assisted app builders coming out of Chinese tech giants.
Cloudflare ships support for building MCP servers on Workers
Cloudflare published tooling and docs for building and deploying Model Context Protocol servers on Cloudflare Workers, riding the MCP wave sweeping the AI community. Senior PM Dina Kozlov joined the show's MCP deep dive to walk through it alongside MCP builder Jason Kneen.
Google ships Gemini-powered Data Science Agent in Colab
Google launched a Data Science Agent inside Google Colab, powered by Gemini, that can autonomously generate complete, working notebooks from natural language descriptions of an analysis task. It automates data loading, exploration, and modeling boilerplate for data scientists.
Anthropic releases Claude 3.7 Sonnet, a coding beast with immaculate vibes
Anthropic shipped its long-awaited model update, Claude 3.7 Sonnet, which the crew called a coding BEAST with 'immaculate' vibes. It was one of the week's two huge model drops alongside GPT-4.5 and became an instant favorite for AI coding workflows like those discussed in the Windsurf interview.
Inception Labs debuts Mercury, a commercial diffusion LLM
Inception Labs announced Mercury, billed as the first commercial-scale diffusion large language model, generating text via diffusion rather than autoregressive decoding. The approach promises dramatically faster token throughput, demoed first with the Mercury Coder playground.
Weights & Biases releases an AI agents whitepaper and announces agents course
Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.
xAI launches Grok 3, claiming SOTA benchmarks and a 1M token context window
xAI dropped Grok 3 on Monday evening, claiming state-of-the-art performance on several benchmarks and a 1 million token context window, with heavy emphasis on agents and future reasoners. The launch was messy, with a bug serving Grok 2 to some users and an eval-methodology spat with OpenAI over best-of-N scores, but vibes shifted positive, with co-hosts calling the base model the best coding model out. It is free for now, 'until their GPUs melt', with no API yet for independent evaluation.
Block open-sources Goose, a local AI agent framework
Block (the company behind Square) released Goose, an open-source local agent framework that runs on your machine and can use any LLM to execute tasks with tools. It was a centerpiece of the show's agents discussion as an open alternative for building autonomous workflows locally.
ByteDance launches Trae, an AI IDE competing with Cursor
ByteDance launched Trae, an AI-powered code editor positioned as a Cursor competitor. It is ByteDance's second shipping move of the week alongside the UI-TARS computer-use models.
Guest Pietro Schirano released RAT (Retrieval Augmented Thinking), a technique and tool that extracts DeepSeek R1's reasoning traces and feeds them to a cheaper, faster model like GPT-3.5 Turbo for the final answer. It showcases the new pattern of mixing open reasoning traces with closed completion models.
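The pattern can be sketched as a two-stage pipeline. This is a minimal illustration, not the RAT tool's actual code: it assumes the reasoning trace arrives wrapped in `<think>` tags, and both model calls are passed in as plain callables (`reasoner`, `answerer`) so that no real API is implied. The stub models at the bottom are hypothetical.

```python
import re
from typing import Callable

# Assumption: the reasoning trace is delimited by <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def extract_reasoning(raw_output: str) -> str:
    """Return the text inside the first <think> block, or an empty string."""
    match = THINK_RE.search(raw_output)
    return match.group(1).strip() if match else ""

def rat_answer(question: str,
               reasoner: Callable[[str], str],
               answerer: Callable[[str], str]) -> str:
    """Stage 1: get a reasoning trace. Stage 2: hand it to a cheaper answer model."""
    trace = extract_reasoning(reasoner(question))
    prompt = (
        f"Question: {question}\n\n"
        f"Reasoning notes:\n{trace}\n\n"
        "Write the final answer."
    )
    return answerer(prompt)

# Hypothetical stubs standing in for the two models.
def stub_reasoner(question: str) -> str:
    return "<think>2 + 2 is 4</think>The answer is 4."

def stub_answerer(prompt: str) -> str:
    return f"[answer model saw {len(prompt)} characters of prompt]"

print(rat_answer("What is 2 + 2?", stub_reasoner, stub_answerer))
```

Keeping the two models behind callables means either side can be swapped without touching the pipeline, which is the point of the pattern.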
W&B programming agent breaks SOTA on SWE-bench Verified
Weights & Biases announced a state-of-the-art AI programming agent built with OpenAI's o1 that broke the SOTA score on SWE-bench Verified. The work was developed and tracked with W&B Weave, the team's LLM observability toolkit.