Agents & Tool Use

Autonomous agents, computer use, browser automation, MCP, tool calling, and agent frameworks. 252 releases covered on the show.

July 2026

Meta AI Jul 9, 2026

New Models

Muse Spark 1.1 & Meta Model API

Meta launches Muse Spark 1.1 and its first paid Meta Model API

Mark Zuckerberg returned to X (35 seconds into the ThursdAI live show) to announce Muse Spark 1.1: a 1M-token-context agentic model that rivals GPT-5.5 and Opus 4.8 on agentic evals, claiming #1 on MCP Atlas, JobBench, Humanity's Last Exam and Finance Agent V2. It ships with Meta's first-ever paid developer API in public preview ($20 free credits, US-only at launch), computer use across desktop, browser and mobile, and parallel subagent delegation. On the held-back Vals AI Harvey legal-agent benchmark it scores 20% against Fable's 11%. Replit, Cline and Box are early partners. No open weights.

$1.25/$4.25 Per 1M tokens (in/out) · 1M Token context window · 20% vs 11% Harvey Legal Agent Bench vs Fable

Alexandr Wang announcement ↗ Meta blog ↗ AI at Meta ↗

🎙️ Hear our coverage →

#frontier-models #agents #api

OpenAI Jul 9, 2026

Products & Apps

ChatGPT for Work (unified app)

Codex becomes the unified ChatGPT app, with Work mode and hosted Sites

Launched alongside GPT-5.6: the Codex desktop app updated in place into one unified ChatGPT app, with a switchable icon (Codex for developers, ChatGPT for Work for everyone else), computer use running in a picture-in-picture window, unified plugins across ChatGPT and Codex, and multi-tab enterprise auth in the browser. The Sites feature hosts what users build on the chatgpt.site subdomain (Webflow under the hood), with private sites gated behind explicit publishing approval. The rollout happened live during the ThursdAI broadcast.

chatgpt.site Hosted Sites subdomain

Launch summary (OpenAI DevRel) ↗

🎙️ Hear our coverage →

#agents #consumer-ai #coding

OpenAI Jul 9, 2026

New Models

GPT-5.6 (Sol, Terra, Luna)

OpenAI launches GPT-5.6 publicly as three tiers: Sol, Terra and Luna

GPT-5.6 went public mid-show after an unusual customer-by-customer Commerce Department review that limited the preview to roughly 20 approved organizations; Sol rolls to all paid plans within 24 hours, Terra and Luna reach free users. Sol is the flagship with a new Ultra subagent mode and a Max reasoning-effort setting, Terra targets GPT-5.5-level quality at half the cost, and Luna is the fast tier. All three still run on the ~4T-parameter Spud pretrain from GPT-5.5; the same Sol weights also serve on Cerebras at 700+ tokens per second. On ARC-AGI-3 Sol scored 7.8% and became the first model to beat a public game. METR rejected its own pre-deployment eval after recording the highest benchmark-cheating rate it has measured, and OpenAI's system card discloses unauthorized-action incidents on about 0.25% of tasks.

$5/$30 Sol per 1M tokens (in/out) · $2.50/$15 Terra per 1M tokens · 700+ tok/s Same-weights Sol on Cerebras

X announcement ↗ Preview blog ↗ System card ↗

🎙️ Hear our coverage →

#frontier-models #agents #coding

Cognition Jul 8, 2026

New Models

SWE-1.7

Cognition ships SWE-1.7 at 1000 tokens per second

An RL fine-tune of Moonshot's open Kimi K2.7 base (disclosed up front, unlike SWE-1.5's hidden GLM base), lifting FrontierCode from 30.1% to 42.3% — tied with GPT-5.5 though still behind Opus 4.8. Served at 1000 tok/s including a Cerebras-hosted Lightning SKU, free for paid Devin users for a month, at roughly $1.97 per task. No public API at launch; Devin and Windsurf only.

1000 tok/s Serving speed · 42.3% FrontierCode 1.1 (base was 30.1%) · $1.97 Cost per FrontierCode task

X announcement ↗ Blog ↗

🎙️ Hear our coverage →

#coding #agents

xAI Jul 8, 2026

New Models

Grok 4.5

SpaceXAI launches Grok 4.5, a coding-and-agents model trained with Cursor

The first flagship under the unified SpaceXAI brand (xAI dissolved into it two days earlier): a 1.5T-parameter MoE on the new V9 base, trained with trillions of tokens of real Cursor agent-interaction data. The pitch is efficiency: 83.3% on Terminal-Bench 2.1 while using about a quarter of the output tokens Opus 4.8 needs per solved SWE-Bench Pro task, at $2/$6 per million. SpaceXAI self-disclosed that a Cursor codebase snapshot contaminated training and inflated its CursorBench score.

$2/$6 Per 1M tokens (in/out) · 83.3% Terminal-Bench 2.1 · 1.5T Total parameters (MoE)

X announcement ↗ Cursor blog ↗

🎙️ Hear our coverage →

#coding #agents #frontier-models

Google DeepMind Jul 7, 2026

APIs & Platforms

Gemini API Managed Agents

Gemini API Managed Agents add background tasks and remote MCP

Google expanded Managed Agents in the Gemini API with background task support, remote MCP and function calling, and network credential refresh — available on the free tier, positioning Gemini's agent infrastructure directly against OpenAI's agent primitives.

Free tier Availability

X announcement ↗ Article ↗

🎙️ Hear our coverage →

Shanghai AI Lab Jul 7, 2026

New Models · Open weights

Agents-A1

Shanghai AI Lab releases Agents-A1, an Apache 2.0 agentic MoE

A 35B MoE built on Qwen3.5-35B-A3B by the InternScience team, trained specifically for long-horizon agent work with a 256K context window, shipping with quantized variants under Apache 2.0.

35B MoE parameters · 256K Context window

X announcement ↗

🎙️ Hear our coverage →

#agents #open-source

OpenAI Jul 6, 2026

APIs & Platforms

GPT-Realtime-2.1-mini

GPT-Realtime-2.1-mini brings reasoning and tool use to the Realtime API mini tier

Two days before GPT-Live, OpenAI upgraded the Realtime API mini lineup with reasoning and tool use at unchanged pricing, plus a 25%+ p95 latency cut from improved caching. Notably it does not include GPT-Live's full-duplex capability, which remains app-exclusive.

≥25% p95 latency reduction

X announcement ↗

🎙️ Hear our coverage →

#voice-ai #api #agents

Z.ai Jul 2, 2026

Dev Tools

ZCode

Z.ai launches ZCode, a GLM-5.2 agentic coding environment

ZCode is an agentic coding environment built on GLM-5.2 with 1M-token context and a novel /goal verification protocol that uses independent success checkers. Output reaches 173 tokens/second with 1.4-second time-to-first-token — substantially faster than competing coding models.
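
As a back-of-the-envelope reading of those two figures, total response time can be approximated as time-to-first-token plus output length divided by throughput. The sketch below assumes throughput stays constant after the first token arrives. The function name and the example response sizes are illustrative and do not come from Z.ai.

```python
def estimated_response_seconds(
    output_tokens: int,
    ttft_seconds: float = 1.4,         # time-to-first-token quoted in the entry
    tokens_per_second: float = 173.0,  # output speed quoted in the entry
) -> float:
    """Approximate wall-clock time to stream `output_tokens` tokens.

    Assumes constant throughput once the first token has arrived.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    return ttft_seconds + output_tokens / tokens_per_second


if __name__ == "__main__":
    for n in (200, 2_000, 20_000):  # illustrative response sizes
        print(f"{n:>6} tokens: ~{estimated_response_seconds(n):.1f} s")
```

For short replies the fixed 1.4-second start-up dominates, while for long agentic outputs the per-token rate matters far more.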

173 tokens/second output · 1M token context

🎙️ Hear our coverage →

#coding #agents

June 2026

Weights & Biases Jun 29, 2026

Products & Apps

Aria

W&B Aria auto-research agent goes GA

Aria went generally available on Monday — an auto-research agent living in the W&B UI ('Just Ask Aria') that reads your traces and debugs your loss curves. In Zubin Aysola's AI Engineer talk, Aria read its own production traces and updated its own prompts.

Weights & Biases ↗

🎙️ Hear our coverage →

#agents #research #infrastructure

Anthropic Jun 25, 2026

Dev Tools

Claude Tag

Anthropic launches Claude Tag as a persistent Slack teammate

Claude Tag brings Claude into Slack as a persistent proactive teammate with shared channel context, ambient follow-up, coding tasks, analysis, incident support, and enterprise governance.

65% Anthropic product-team code from internal version · $25K Enterprise launch credits

Claude Tag launch ↗

🎙️ Hear our coverage →

#agents #industry #consumer-ai

Linzumi Jun 25, 2026

Dev Tools

Linzumi

Linzumi launches shared chat for fleets of AI coding agents

YC-backed Linzumi launched a team chat and agent orchestration environment where humans and AI coding agents share threads, with Sean Grove describing a future of 10,000 agent hours per person per day.

10,000 agent hours / person / day · $100 flat monthly team tier

YC Linzumi launch ↗ Linzumi ↗

🎙️ Hear our coverage →

#agents #coding #industry

Sakana AI Jun 25, 2026

Dev Tools

Fugu

Sakana AI launches Fugu multi-agent orchestration API

Announced on air by Stefania Druga: the Fugu recursive router — it rewrites prompts and verifies outputs before picking a model, per the two ICLR papers behind it (Trinity and the conductor) — now plugs into Codex and OpenCode.

95.5 GPQA Diamond · 93.2 LiveCodeBench · 73.7 SWE-Bench Pro

Fugu announcement ↗ Sakana launch tweet ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #benchmarks #api

Cursor Jun 18, 2026

Acquisitions

Cursor acquisition

SpaceX/xAI reportedly acquires Cursor for $60B

The show covered a reported $60B all-stock acquisition of Anysphere/Cursor by SpaceX/xAI. Alex framed it as coding assistants becoming strategic infrastructure: workflows, agent traces, and developer context are now assets frontier labs want to own.

$60B reported acquisition price

Trending coverage on X ↗

🎙️ Hear our coverage →

#industry #coding #agents

HumanLayer Jun 18, 2026

Dev Tools

Agentic IDE

HumanLayer launches an Agentic IDE to fight AI code slop

HumanLayer launched its Agentic IDE, positioned as a human-in-the-loop answer to lights-out coding-agent slop. Dexter Horthy joined the show to argue that the right architecture keeps humans steering high-impact changes instead of letting agents silently trash production codebases.

Dexter Horthy announcement on X ↗ HumanLayer ↗ 12-Factor Agents ↗

🎙️ Hear our coverage →

#agents #coding #safety

Moonshot AI Jun 18, 2026

New Models · Open weights

Kimi K2.7 Code

Moonshot AI open-sources Kimi K2.7 Code for agentic coding

Moonshot AI open-sourced Kimi K2.7 Code, a trillion-parameter MoE coding model with benchmark jumps over K2.6 and fewer reasoning tokens. On the show it landed as the second half of the open-source coding wave beside GLM-5.2.

1T MoE parameters · 30% fewer reasoning tokens

Kimi announcement on X ↗ Kimi K2.7 Code on Hugging Face ↗ Kimi Code beta ↗

🎙️ Hear our coverage →

#open-source #coding #agents

OpenAI Jun 18, 2026

Major Features & Updates

Codex Computer Use in Europe

OpenAI rolls out Codex Computer Use, Chrome extension, Memory and Chronicle to European users

OpenAI rolled out Codex Computer Use plus Chrome extension, Memory, and Chronicle access to users in the EEA, UK, and Switzerland. The episode covered it as part of the week’s coding-agent platform expansion.

OpenAI Developers announcement on X ↗ Codex changelog ↗

🎙️ Hear our coverage →

#coding #agents #consumer-ai

Weights & Biases Jun 18, 2026

Dev Tools · Open weights

HiveMind

Weights & Biases launches HiveMind for coding-agent observability

Weights & Biases launched HiveMind, a dashboard for tracking AI coding-agent sessions, spend, transcripts, ROI, and reusable organizational learning. Chris Van Pelt and Adrian Swanberg joined the show to explain why teams need observability for their growing fleet of coding agents.

W&B announcement on X ↗ HiveMind ↗ HiveMind on GitHub ↗

🎙️ Hear our coverage →

#coding #agents #infrastructure

Z.ai (Zhipu AI) Jun 18, 2026

New Models · Open weights

GLM-5.2

Z.ai releases GLM-5.2, a 753B open MoE with 1M context

Z.ai released GLM-5.2 as a major open-source coding and agentic model: a 753B-parameter MoE, MIT-licensed, with a one-million-token context window. The episode treated it as the open-source model that arrived exactly as Fable access disappeared, with strong coding and agentic performance close to the frontier.

753B parameters · 1M context window · MIT license

Z.ai announcement on X ↗ GLM-5.2 blog ↗ GLM-5.2 on Hugging Face ↗ GLM-5.2 docs ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Cognition Labs Jun 4, 2026

Products & Apps

Devin Desktop

Cognition rebrands Windsurf into Devin Desktop multi-agent hub

Cognition rebranded Windsurf into Devin Desktop, a multi-agent command center with Agent Client Protocol (ACP) support. The move consolidates Cognition's IDE acquisition into its Devin agent brand as a desktop control surface for running multiple coding agents.

Announcement ↗ X announcement ↗

🎙️ Hear our coverage →

#coding #agents

H Company Jun 4, 2026

New Models · Open weights

Holo 3.1

H Company launches Holo 3.1 local computer-use agent models

H Company released Holo 3.1, a family of local computer-use agent models ranging from 0.8B to 35B parameters with new quantized checkpoints. The lineup targets running screen-driving agents on local hardware rather than in the cloud.

X announcement ↗ Blog ↗

🎙️ Hear our coverage →

#agents #open-source

Arena (LMArena) Jun 4, 2026

Benchmarks & Evals

Agent Arena

Arena launches Agent Arena for real-world agent workflow evals

Arena (LMArena) launched Agent Arena during the episode, moving beyond one-turn chatbot preference battles to evaluate models on real agent workflows with web search, files, terminals, user corrections, and objective recovery signals. Peter Gostev joined live to explain why long-running, harder tasks need a different benchmark.

Agent Arena announcement ↗ Arena ↗

🎙️ Hear our coverage →

#benchmarks #agents

MiniMax Jun 4, 2026

New Models

MiniMax M3

MiniMax announces M3 coding/agentic model with 1M context

MiniMax announced M3, a natively multimodal coding and agentic model with a claimed one-million-token context window built on sparse attention, and open weights promised soon. Reported numbers include 59 on SWE-bench Pro, and the panel noted MiniMax already has a following for cheap agentic tool calling even as pure coding quality is debated.

X announcement ↗ API ↗ MiniMax Code ↗

🎙️ Hear our coverage →

#coding #agents #architecture

Nous Research Jun 4, 2026

Products & Apps

Hermes Desktop

Nous Research launches Hermes Desktop agent app for Mac/Win/Linux

Nous Research launched Hermes Desktop, packaging the Hermes Agent harness into a native desktop app for Mac, Windows, and Linux. Karan previewed chat, permissions, tool-call visibility, reasoning traces, and admin controls aimed at small teams, startups, and personal agent fleets.

X announcement ↗ Site ↗

🎙️ Hear our coverage →

#agents #coding #open-source

NVIDIA Jun 4, 2026

New Models · Open weights

Nemotron 3 Ultra

NVIDIA releases Nemotron 3 Ultra, a 550B open-weight MoE for agents

NVIDIA dropped Nemotron 3 Ultra the day of the show, a 550B-parameter sparse MoE with 55B active parameters built for long-running agentic harnesses like OpenCode, Hermes, and OpenClaw. Chris Alexiuk joined to explain the hybrid Mamba/Transformer architecture and the unusually complete open release: weights, training data, recipes, a GenRM reward model, and an NVFP4 quantized checkpoint.

550B Nemotron 3 Ultra parameters · 55B Active parameters

Announcement ↗ Technical Report ↗ Hugging Face (post-trained BF16) ↗ X announcement ↗

🎙️ Hear our coverage →

#open-source #agents #reasoning

NVIDIA Jun 4, 2026

Products & Apps

RTX Spark

NVIDIA announces RTX Spark Arm + Blackwell platform for local AI PCs

At Computex, NVIDIA unveiled RTX Spark, an Arm CPU plus Blackwell GPU PC platform with 128GB unified memory targeting local AI agents and 120B-class local inference. A wave of thin laptops with RTX 5070-class GPUs and roughly one petaflop of local AI compute raises the question of what agents should run locally versus in the cloud.

Coverage (Tom's Hardware) ↗

🎙️ Hear our coverage →

#infrastructure #on-device #agents

May 2026

Anthropic May 28, 2026

Major Features & Updates

Dynamic Workflows in Claude Code

Dynamic Workflows and Ultra Code land in Claude Code

Alongside Opus 4.8, Anthropic shipped Dynamic Workflows and an Ultra Code mode in Claude Code, which Yam fired up live on the show. The headline proof point: Bun was ported from Zig to Rust — about 750K lines — via Dynamic Workflows, with 99.8% of the test suite passing and the port merged in 11 days.

750K lines Bun: Zig → Rust

Dynamic Workflows in Claude Code ↗

🎙️ Hear our coverage →

#coding #agents

Cua May 28, 2026

Dev Tools · Open weights

Cua Driver for Windows

Cua Driver brings background computer-use agents to Windows

Cua launched Windows support for Cua Driver, enabling background computer-use agents that operate real desktop apps without taking over the user's screen. It extends Cua's open-source computer-use stack to the largest desktop OS.

Cua Driver Windows - blog ↗ Cua GitHub ↗ Cua announcement ↗

🎙️ Hear our coverage →

#agents #consumer-ai

Google May 28, 2026

Products & Apps

Universal Cart / AP2 / UCP

Google launches Universal Cart, AP2 and UCP for agentic commerce

Google launched Universal Cart along with the AP2 and UCP protocols, infrastructure that lets AI agents shop and pay on a user's behalf. It is Google's play to standardize agent-driven commerce across merchants and payment flows.

Google Universal Cart / AP2 / UCP ↗

🎙️ Hear our coverage →

#agents #industry

Weights & Biases May 28, 2026

Dev Tools

W&B MCP Server

Weights & Biases launches MCP server with 20 tools for agents

W&B officially launched its MCP server with 20 schema-first tools so coding agents can read experiments, monitor training, and run autonomous research loops. Agents can query metadata before pulling full 300-metric runs, keeping their context windows from blowing up.
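
The metadata-first pattern described above can be sketched in a few lines. This is a minimal illustration and not the W&B MCP server's tool API: the in-memory `RUNS` store and the `describe_run`, `fetch_metrics`, and `metrics_matching` helpers are hypothetical names invented for this example.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for an experiment store holding one
# 300-metric run. A real agent would call MCP tools instead.
RUNS: Dict[str, Dict[str, List[float]]] = {
    "run-1": {
        **{f"debug/metric_{i}": [float(i)] * 1000 for i in range(298)},
        "train/loss": [2.1, 1.4, 0.9],
        "eval/loss": [2.3, 1.6, 1.1],
    }
}


def describe_run(run_id: str) -> List[str]:
    """Cheap call: return metric names only, never the values."""
    return sorted(RUNS[run_id])


def fetch_metrics(run_id: str, names: List[str]) -> Dict[str, List[float]]:
    """Expensive call: return values, but only for the requested metrics."""
    return {name: RUNS[run_id][name] for name in names}


def metrics_matching(run_id: str, keyword: str) -> Dict[str, List[float]]:
    """Inspect metadata first, then pull only the series that match."""
    wanted = [name for name in describe_run(run_id) if keyword in name]
    return fetch_metrics(run_id, wanted)


if __name__ == "__main__":
    print(len(describe_run("run-1")), "metrics available")
    print(sorted(metrics_matching("run-1", "loss")))
```

The agent only ever places the two loss series in its context instead of all 300, which is the point of querying metadata before data.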

W&B MCP Server ↗ W&B MCP Server - blog ↗ W&B announcement ↗

🎙️ Hear our coverage →

#agents #coding #infrastructure

Alibaba (Qwen) May 21, 2026

New Models

Qwen 3.7-Max

Alibaba releases Qwen 3.7-Max agentic frontier model with robotics demos

Alibaba released Qwen 3.7-Max, an agentic frontier model built for long autonomous runs, demonstrated alongside robotics demos. It continues the Qwen Max line as Alibaba's closed frontier offering aimed at agentic workloads.

Qwen blog ↗ Announcement on X ↗ Robot demo ↗

🎙️ Hear our coverage →

#agents #robotics #frontier-models

Cursor May 21, 2026

New Models

Composer 2.5

Cursor launches Composer 2.5 with Opus-class coding at much lower cost

Cursor launched Composer 2.5, a coding model built by continued training on top of Kimi K2.5 (with permission) that delivers Opus-class coding performance at much lower cost. The crew noted Cursor is 'absolutely back' with strong pre-training and post-training teams, and that training now runs partly on the Colossus supercomputer.

Cursor blog ↗ Cursor on X ↗

🎙️ Hear our coverage →

#coding #agents

Google May 21, 2026

Products & Apps

Antigravity 2.0

Antigravity 2.0 becomes Google's central agentic coding harness

Antigravity 2.0 was positioned at I/O 2026 as the single agent harness powering agentic experiences across Google, from internal tooling to Search, Workspace and developer products. Born from the Windsurf acquisition, it evolved from an agent-first IDE into the through line for Google's agentic strategy, now exposed to external developers as well.

Sundar Pichai announcement ↗ Google OS demo ↗

🎙️ Hear our coverage →

#coding #agents

Google DeepMind May 21, 2026

New Models

Gemini 3.5 Flash

Gemini 3.5 Flash launches at I/O as Google's agentic workhorse model

Google launched Gemini 3.5 Flash at I/O 2026 as a fast, determined workhorse model built for agentic loops rather than a budget-tier Flash like prior generations. It is rolling out across the Gemini app, Search AI Mode, the Gemini API, Google AI Studio, Antigravity and the Gemini Enterprise Agent Platform. Nisten noted unusual determinism in its behavior, and Logan Kilpatrick framed it as designed for the agentic era.

900M Gemini app users

Logan Kilpatrick announcement ↗ Noam Shazeer ↗ Jeff Dean ↗ Koray Kavukcuoglu on rollout ↗

🎙️ Hear our coverage →

#agents #reasoning #frontier-models

Google DeepMind May 21, 2026

APIs & Platforms

Managed Agents (Gemini API)

Gemini API gets Managed Agents with hosted sandboxes and the Interactions API

Google launched Managed Agents in the Gemini API, letting developers spin up hosted Antigravity agents with Linux sandboxes and persistent state. It ships alongside the next-generation Interactions API, which Logan Kilpatrick described as designed for agentic systems rather than the old tokens-in, tokens-out model interaction pattern.

Gemini API agents docs ↗ Google AI Developers on X ↗

🎙️ Hear our coverage →

#agents #api #coding

Google May 21, 2026

Products & Apps

Gemini Spark

Gemini Spark announced as a 24/7 proactive personal AI agent

Google announced Gemini Spark, a 24/7 personal AI agent that can proactively work across Google surfaces, framed on the show as Google's OpenClaw competitor. Access was not yet broadly available at announcement time, so the crew discussed it from the announcement rather than hands-on testing.

News from Google ↗

🎙️ Hear our coverage →

#agents #consumer-ai

Google May 21, 2026

Major Features & Updates

Google Search agentic capabilities

Google Search adds Gemini 3.5 Flash-powered agentic capabilities

Google Search is getting new Gemini 3.5 Flash-powered agentic capabilities, including a new AI-powered Search box and background information agents. The crew framed the rollout as a massive intelligence uplift across one of Google's largest surfaces, with billions of Search users getting frontier-model capabilities.

3.5B Google Search users

Sundar Pichai on Search agents ↗ Alex's I/O thread ↗

🎙️ Hear our coverage →

#agents #search

OpenAI May 21, 2026

Major Features & Updates

Codex Mobile

OpenAI Codex Mobile arrives in the ChatGPT mobile apps

OpenAI's Codex Mobile is now available in the ChatGPT mobile apps, enabling remote agent workflows from a phone. The crew discussed it as part of the broader shift toward driving coding agents from anywhere rather than just the desktop.

OpenAI on X ↗

🎙️ Hear our coverage →

#coding #agents #consumer-ai

xAI May 21, 2026

Dev Tools

Grok Build

xAI launches Grok Build, an agentic CLI coding tool in beta

xAI launched Grok Build, an agentic CLI coding tool, in beta for SuperGrok Heavy subscribers. It joins the crowded field of terminal-based coding agents as xAI's entry into agentic engineering tooling.

xAI CLI page ↗ xAI on X ↗

🎙️ Hear our coverage →

#coding #agents

Anthropic May 14, 2026

Major Features & Updates

Claude Agent SDK monthly credits

Anthropic adds separate Claude Agent SDK credits to paid plans

Anthropic announced separate monthly Claude Agent SDK credits for Pro, Max, Team, and Enterprise subscribers, starting June 15, 2026. This gives agent builders a dedicated usage pool on top of regular plan limits.

🎙️ Hear our coverage →

#agents #coding

Artificial Analysis May 14, 2026

Benchmarks & Evals

Coding Agent Index

Artificial Analysis Coding Agent Index benchmarks model + harness combos

Artificial Analysis launched the Coding Agent Index, a benchmark that evaluates model and harness combinations rather than models alone. Opus 4.7 in Cursor CLI leads at 61, GLM-5.1 tops the open-weight entries at 53, and costs vary 30x across combos for similar capability.

X announcement ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

CoreWeave May 14, 2026

Products & Apps

CoreWeave Sandboxes

CoreWeave Sandboxes launch in preview via the W&B SDK

CoreWeave Sandboxes is now an official Harbor provider, letting teams run agentic workloads like Terminal-Bench safely at scale on CoreWeave infrastructure. It plugs CoreWeave's isolated execution environments directly into the Harbor eval/agent ecosystem.

Docs ↗ CoreWeave blog ↗ CoreWeave Sandboxes ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #infrastructure #benchmarks

Nous Research May 14, 2026

Dev Tools

Hermes CLI agent

Hermes passes OpenClaw as #1 CLI agent on OpenRouter, adds computer use

Nous Research's Hermes overtook OpenClaw as the #1 CLI agent on OpenRouter. It also added background computer use via Cua (trycua), and Alex described switching his own daily agent workflow from OpenClaw to Hermes.

X announcement ↗

🎙️ Hear our coverage →

#agents #coding

OpenAI May 14, 2026

Products & Apps

Daybreak

OpenAI launches Daybreak, a frontier AI cybersecurity platform

OpenAI announced Daybreak, a frontier AI cybersecurity platform that pairs GPT-5.5 with Codex for security workloads. It launches with partners including Cloudflare, positioning OpenAI directly in the AI-powered defense market.

X announcement ↗

🎙️ Hear our coverage →

#safety #agents

OpenAI (Codex), Anthropic, Nous Research May 14, 2026

Major Features & Updates

/goal command

/goal command lands in Codex, Claude Code, and Hermes - the productized Ralph

The /goal command is now available in Codex, Claude Code, and Hermes, productizing the Ralph loop pattern: set a measurable success condition and the agent iterates autonomously until it is done. Codex's implementation is winning early head-to-head comparisons over Claude Code, and the show framed it as turning coding agents into 24/7 AI employees.
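
The loop pattern described above (state a measurable success condition, then iterate until it holds) can be sketched generically. This is a minimal illustration, not how Codex, Claude Code, or Hermes implement /goal: `run_until_goal`, its parameters, and the iteration budget are assumptions made for the example.

```python
from typing import Callable, Tuple, TypeVar

State = TypeVar("State")


def run_until_goal(
    step: Callable[[State], State],      # one agent iteration (hypothetical)
    is_done: Callable[[State], bool],    # the measurable success condition
    state: State,
    max_iterations: int = 50,            # safety budget, an assumption here
) -> Tuple[State, int]:
    """Repeat `step` until `is_done` holds, or fail when the budget runs out."""
    iterations = 0
    while not is_done(state):
        if iterations >= max_iterations:
            raise RuntimeError("goal not reached within the iteration budget")
        state = step(state)
        iterations += 1
    return state, iterations


if __name__ == "__main__":
    # Toy stand-in for "fix failing tests": each step fixes up to three tests.
    failing_tests = 10
    final, used = run_until_goal(
        step=lambda remaining: max(0, remaining - 3),
        is_done=lambda remaining: remaining == 0,
        state=failing_tests,
    )
    print(f"done after {used} iterations, {final} tests still failing")
```

Keeping the success check separate from the step function is what makes the condition measurable: the loop stops on evidence, not on the agent's own claim that it has finished.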

X thread ↗ Codex docs: follow goals ↗

🎙️ Hear our coverage →

#agents #coding

April 2026

Cognition Labs Apr 30, 2026

Dev Tools

Devin for Terminal

Cognition launches Devin for Terminal CLI coding agent

Cognition launched Devin for Terminal, a local CLI coding agent. Its /handoff command lets you seamlessly transfer a local session to Devin's cloud environment.

cli.devin.ai docs ↗

🎙️ Hear our coverage →

#coding #agents

Cursor Apr 30, 2026

Dev Tools

Cursor SDK

Cursor launches SDK exposing the runtime that powers the IDE

Cursor launched an SDK that exposes the same runtime, harness, and models that power the Cursor IDE, making the Cursor agent embeddable in any product. The Cursor Agent + GPT-5.5 combo also topped WolfBench's Terminal-Bench 2.0 leaderboard this week.

Cursor SDK docs ↗

🎙️ Hear our coverage →

#coding #agents

IBM Apr 30, 2026

New Models · Open weights

Granite 4.1

IBM Granite 4.1: dense non-thinking models with top tool calling

IBM released the Granite 4.1 family (3B/8B/30B), dense non-thinking models under Apache 2.0 with best-in-class tool calling, scoring 73 on BFCL with just 8B parameters. IBM claims 20x token efficiency over Qwen3.5 9B, and the models are live on W&B Inference at $0.05/$0.10 per million input/output tokens with 128K context.

IBM Granite blog ↗ Hugging Face ↗ W&B Inference ↗

🎙️ Hear our coverage →

#open-source #agents #industry

Microsoft Apr 30, 2026

Benchmarks & Evals

DELEGATE-52

Microsoft's DELEGATE-52 exposes stealthy document corruption

Microsoft released the DELEGATE-52 benchmark showing GPT-5.4 loses 28% of document content after 20 iterative edits. Frontier models corrupt documents stealthily while preserving structure, making the degradation hard to notice.
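
To see why that degradation is hard to notice, it helps to convert the headline figure into a per-edit rate. The sketch below assumes the loss compounds uniformly across the 20 edits, which is a simplification made for illustration and not something the benchmark claims. Under that assumption each edit drops only about 1.6% of the content.

```python
def per_edit_retention(retained_after_all_edits: float, edits: int) -> float:
    """Per-edit retention rate r such that r ** edits equals the total retained.

    Assumes every edit loses the same fraction of what remains.
    """
    if not 0.0 < retained_after_all_edits <= 1.0:
        raise ValueError("retained fraction must be in (0, 1]")
    if edits <= 0:
        raise ValueError("edits must be positive")
    return retained_after_all_edits ** (1.0 / edits)


RETAINED = 1.0 - 0.28  # 28% of content lost, as reported in the entry
EDITS = 20             # number of iterative edits, as reported in the entry

rate = per_edit_retention(RETAINED, EDITS)

if __name__ == "__main__":
    print(f"retained per edit: {rate:.4f}")
    print(f"lost per edit:     {1.0 - rate:.2%}")
```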

🎙️ Hear our coverage →

#benchmarks #agents

Stripe Apr 30, 2026

Products & Apps

Link wallet for agents

Stripe launches Link wallet giving AI agents scoped payments

At Stripe Sessions 2026, Stripe launched the Link wallet for agents: AI agents get scoped payment credentials with mandatory human approval, and the real card number is never exposed to the agent. Alex demoed it live by approving a $10 spend request from his agent, part of Stripe's broader agentic commerce suite that also includes streaming payments.
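
The approval flow described above can be sketched as follows. This is not Stripe's API: `SpendRequest`, `ScopedCredential`, and `request_spend` are hypothetical names, and the approval callback stands in for whatever human-facing prompt a real wallet would show.

```python
import secrets
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SpendRequest:
    merchant: str
    amount_cents: int


@dataclass(frozen=True)
class ScopedCredential:
    token: str             # opaque token; the real card number is never included
    merchant: str          # credential is only valid for this merchant
    max_amount_cents: int  # and only up to the approved amount


def request_spend(
    request: SpendRequest,
    approve: Callable[[SpendRequest], bool],  # human-in-the-loop decision
) -> ScopedCredential:
    """Issue a scoped credential only after a human approves the request."""
    if not approve(request):
        raise PermissionError("human declined the spend request")
    return ScopedCredential(
        token=secrets.token_hex(8),
        merchant=request.merchant,
        max_amount_cents=request.amount_cents,
    )


if __name__ == "__main__":
    ten_dollars = SpendRequest(merchant="example-shop", amount_cents=1_000)
    credential = request_spend(ten_dollars, approve=lambda r: r.amount_cents <= 1_000)
    print(credential.merchant, credential.max_amount_cents)
```

The agent only ever holds the scoped token, so a leaked or misused credential is bounded by the merchant and amount the human approved.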

Stripe blog: Agentic commerce suite ↗ Stripe on X ↗ Stripe agentic commerce ↗ Stripe Sessions ↗

🎙️ Hear our coverage →

#agents #industry

Stripe Apr 30, 2026

Dev Tools

Projects.dev

Stripe opens Projects.dev: 32 infra providers provisionable by agents

Stripe removed the waitlist on Projects.dev, which lets AI agents provision infrastructure from 32 providers (Cloudflare, WorkOS, ElevenLabs, Twilio, Daytona, Browserbase, AgentMail and more) via CLI. It is part of Stripe's push into agent engineering announced around Sessions 2026.

Projects.dev ↗

🎙️ Hear our coverage →

#agents #coding #infrastructure

Anthropic Apr 23, 2026

Products & Apps

Claude Design

Anthropic ships Claude Design research preview, Figma stock drops 7%

Anthropic released Claude Design as a research preview running on Opus 4.7 at claude.ai/design, and Figma stock dropped 7% on the news. Alex generated a full ThursdAI brand kit including logo, design tokens, and the episode opener videos end-to-end inside Claude Design, then had Codex pick up the kit and produce a GPT-5.5 launch video in 9 minutes. Anthropic also added a new usage meter to Claude Max settings.

Claude Design announcement ↗ Try Claude Design ↗

🎙️ Hear our coverage →

#image-gen #agents

Brex Apr 23, 2026

Dev Tools · Open weights

CrabTrap

Brex open-sources CrabTrap, an LLM-as-judge proxy for agent security

Brex's CEO pair-programmed with Codex and open-sourced CrabTrap, an LLM-as-judge HTTP proxy that intercepts outbound agent requests and blocks risky activity using natural-language rule definitions. Wolfram changed his pick of the week to it on the spot, and the panel framed it as the enterprise fix for situations like OpenClaw being banned at CoreWeave.
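
The general shape of a judge-in-the-middle proxy can be sketched as below. This is not CrabTrap's code or interface: the names are hypothetical, and `stub_judge` is a deterministic placeholder where a real deployment would send the rules and the request to a language model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class OutboundRequest:
    method: str
    url: str
    body: str = ""


# A judge receives the natural-language rules plus the request; True means allow.
Judge = Callable[[List[str], OutboundRequest], bool]

RULES = [
    "Never delete remote resources.",
    "Never send credentials in a request body.",
]


def stub_judge(rules: List[str], request: OutboundRequest) -> bool:
    """Placeholder for a model call; it ignores `rules` and uses fixed checks."""
    if request.method.upper() == "DELETE":
        return False
    return "password" not in request.body.lower()


def proxy(
    request: OutboundRequest,
    rules: List[str],
    judge: Judge,
    send: Callable[[OutboundRequest], str],  # the real network call
) -> str:
    """Forward the request only if the judge allows it."""
    if not judge(rules, request):
        return "403 blocked by policy"
    return send(request)


if __name__ == "__main__":
    fake_send = lambda r: "200 ok"  # stand-in for the upstream service
    print(proxy(OutboundRequest("GET", "https://api.example.com/items"), RULES, stub_judge, fake_send))
    print(proxy(OutboundRequest("DELETE", "https://api.example.com/items/1"), RULES, stub_judge, fake_send))
```

Because the check sits in the network path and not inside the agent, it applies regardless of which harness or model issued the request.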

Brex CrabTrap ↗

🎙️ Hear our coverage →

#agents #safety #open-source

Google DeepMind Apr 23, 2026

Major Features & Updates

Gemini Deep Research Max

Google ships Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro

Google rolled out an upgraded Gemini Deep Research along with a new Deep Research Max tier, both running on Gemini 3.1 Pro. The release strengthens Google's long-running agentic research offering in a week otherwise dominated by OpenAI.

Google Gemini Deep Research Max ↗

🎙️ Hear our coverage →

#agents #research

Google DeepMind Apr 23, 2026

Products & Apps

Gemini Enterprise Agent Platform

Google launches Gemini Enterprise Agent Platform

Google announced the Gemini Enterprise Agent Platform, a platform for building and deploying Gemini-powered agents inside enterprises. It was covered briefly in the Big Co segment of the show.

Google Gemini Enterprise Agent Platform ↗

🎙️ Hear our coverage →

#agents #industry

Moonshot AI Apr 23, 2026

New Models · Open weights

Kimi K2.6

Kimi K2.6: 1T MoE open-source SOTA on SWE-Bench Pro

Moonshot AI released Kimi K2.6, a 1-trillion-parameter MoE with 32B active parameters, 384 experts, MLA attention, and a 256K context window under a modified MIT license. It claims open-source state of the art on SWE-Bench Pro at 58.6, and Wolfram called it the best open-source model he has ever tested on his private wolf-bench.

1T MoE Kimi K2.6

Kimi K2.6 release ↗ Kimi K2.6 on Hugging Face ↗

🎙️ Hear our coverage →

#open-source #coding #agents

OpenAI Apr 23, 2026

New Models

OpenAI clinician model + workspace agents

OpenAI releases clinician/medical model and workspace agents

Amid its launch-heavy week, OpenAI also released a clinician/medical model alongside workspace agents. The show notes flagged the release as part of OpenAI's week of dominance, though it got only brief coverage on air.

🎙️ Hear our coverage →

#research #agents

OpenAI Apr 23, 2026

Major Features & Updates

Codex Computer Use + Chronicle

Codex gets background computer use on macOS plus Chronicle screen memory

Codex shipped true background computer use on macOS: a second cursor running on its own thread that works while you work, with subagents controlling different windows in parallel, building on OpenAI's Software Apps Inc. (ex-Apple Shortcuts team) acquisition. Chronicle adds total screen memory by taking a screenshot every 10 seconds and feeding it into Codex context, so you can ask what you were doing an hour ago. Codex also passed 4 million users this week.
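
A 10-second capture interval works out to 360 frames per hour, which suggests a rolling buffer. The sketch below is not how Chronicle is built: the one-hour retention window, the `ScreenLog` class, and the use of text descriptions in place of screenshots are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import deque
from typing import Deque, Optional, Tuple


class ScreenLog:
    """Rolling log of (timestamp_seconds, description) frames."""

    def __init__(self, interval_seconds: int = 10, retention_seconds: int = 3600):
        # 3600 / 10 = 360 frames kept; older frames fall off automatically.
        self.capacity = retention_seconds // interval_seconds
        self._frames: Deque[Tuple[int, str]] = deque(maxlen=self.capacity)

    def record(self, timestamp: int, description: str) -> None:
        """Append a frame; timestamps are expected in increasing order."""
        self._frames.append((timestamp, description))

    def at(self, timestamp: int) -> Optional[str]:
        """Latest frame captured at or before `timestamp`, if still retained."""
        times = [t for t, _ in self._frames]
        index = bisect_right(times, timestamp)
        return self._frames[index - 1][1] if index else None

    def __len__(self) -> int:
        return len(self._frames)


if __name__ == "__main__":
    log = ScreenLog()
    for t in range(0, 7200, 10):  # two hours of simulated captures
        log.record(t, f"frame@{t}")
    print(len(log), "frames retained")
    print(log.at(7195 - 1800))  # roughly half an hour before the last capture
```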

OpenAI Codex Chronicle announcement ↗

🎙️ Hear our coverage →

#agents #coding

OpenAI Apr 23, 2026

Dev Tools · Open weights

Euphony

OpenAIDevs releases Euphony, an open-source Codex session log visualizer

The OpenAI developer relations team released Euphony, an open-source visualizer for Codex session logs. It lets developers inspect and replay what their Codex agent sessions actually did.

OpenAIDevs Euphony (session log visualizer) ↗

🎙️ Hear our coverage →

#coding #agents

OpenAI Apr 23, 2026

New Models

GPT-5.5

GPT-5.5 and GPT-5.5 Pro drop live, SOTA across the board

OpenAI shipped GPT-5.5 and GPT-5.5 Pro mid-show, taking state of the art on Terminal-Bench 2 (82.7%, up from 75%), SWE-Bench Verified (73%), GDPval (84%) and Frontier Math (35%), beating Opus 4.7 and Gemini 3.1. It uses ~40% fewer tokens than 5.4, netting roughly 20% cheaper to run despite API pricing doubling to $5/$30 per million ($30/$180 for Pro). Peter Gostev called it the first model that genuinely sustains multi-hour long-running tasks, with one task running 8.5 hours straight; rollout was Codex-first, not yet in ChatGPT.

82.7% Terminal-Bench 2 · 8.5 hrs Longest task

OpenAI GPT-5.5 release blog ↗Artificial Analysis GPT-5.5 analysis ↗GPT-5.5 pre-launch leak (Codex dropdown) ↗

🎙️ Hear our coverage →

#reasoning #coding #agents

Anthropic Apr 16, 2026

Major Features & Updates

Claude Code Routines

Claude Code Routines: cron and event-triggered agents on Anthropic's cloud

Anthropic launched Claude Code Routines, autonomous agents that run on Anthropic's cloud and can be triggered by cron schedules, GitHub events, or API calls. It moves Claude Code from an interactive CLI toward standing, self-scheduling automation infrastructure.

Claude Code Routines docs ↗

🎙️ Hear our coverage →

#agents #coding

Anthropic Apr 16, 2026

New Models

Claude Opus 4.7

Claude Opus 4.7 drops live with 87.6% SWE-bench Verified and xhigh effort

Anthropic shipped Claude Opus 4.7 minutes before the show, scoring 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, an 11-point jump over Opus 4.6 on the harder agentic coding eval. It adds a new 'xhigh' (extra high) reasoning effort, 3x vision resolution, a roughly 22-point ScreenSpot Pro computer-use jump (57.7% to 79.5%), and a /ultrareview command in Claude Code. Pricing is unchanged, though a new tokenizer uses 1.0-1.35x more tokens. The system card mentions the unreleased 'Mythos' 331 times, and an MRCR long-context drop from 78% to 32% suggests a new pre-trained base.

87.6% SWE-bench Verified · +22 pts ScreenSpot Pro jump

Claude Opus 4.7 announcement (X) ↗Anthropic blog: Claude Opus 4.7 ↗Opus 4.7 system card (PDF) ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

Daily (Pipecat) Apr 16, 2026

Products & Apps · Open weights

Gradient Bang

Gradient Bang: first massively multiplayer fully LLM-driven voice game

Kwindla Kramer's 'side project that broke containment' is a fully LLM-driven multiplayer voice-based space game inspired by BBS-era Trade Wars, built on a new Pipecat Sub-Agents library with a class-based event bus that works locally and over the network. A Deepgram plus GPT-4.1 voice agent always responds in under 1.5 seconds while GPT-5.2 medium-thinking task agents do the work, and the React frontend is rendered from LLM-generated JSON as dynamic UI. The team also open-sourced GB Benchmarks for evaluating agent task execution.

Play Gradient Bang ↗gradient-bang on GitHub ↗Kwindla on Gradient Bang (X) ↗

🎙️ Hear our coverage →

#voice-ai #agents #open-source

Marimo Apr 16, 2026

Dev Tools · Open weights

Marimo Pair

Marimo Pair drops coding agents inside reactive Python notebooks

Marimo released Marimo Pair, which embeds Claude Code, Codex, or OpenCode agents directly inside its reactive, dependency-graph-aware Python notebooks. Founding engineer Trevor Manz joined the show to explain why reactive notebooks are a natural verification surface for agent-written code; the launch trended on Hacker News this week and was featured as part of This Week's Buzz (Marimo is in the CoreWeave family).

Marimo blog: Marimo Pair ↗marimo-pair on GitHub ↗

🎙️ Hear our coverage →

#coding #agents #open-source

OpenAI Apr 16, 2026

Major Features & Updates

Codex

OpenAI Codex adds macOS background computer use, 90+ plugins, and memory

OpenAI dropped a massive Codex update mid-show: native macOS computer use that runs in the background with its own separate cursor so you can keep working, 90+ plugins, gpt-image-1.5 image generation and editing, an in-app browser, a memory preview that 'learns from experience', proactive work suggestions, multi-terminal SSH into dev boxes, and thread automations. Alex's hot take: Codex, not ChatGPT, is becoming OpenAI's super-app.

OpenAI Codex update announcement (X) ↗OpenAI blog: Codex for almost everything ↗Thibault Sottiaux on the Codex update (X) ↗

🎙️ Hear our coverage →

#agents #coding

Warp Apr 16, 2026

Major Features & Updates

Warp any-CLI-agent support

Warp now supports any CLI agent with vertical tabs and mobile control

Warp shipped support for running any CLI coding agent inside its terminal, adding vertical tabs for parallel agent sessions, notifications, built-in code review, and mobile remote control of running agents. It positions Warp as a harness-agnostic cockpit in the increasingly crowded agent-management race.

Warp announcement (X) ↗Warp blog: Warp supports any CLI agent ↗

🎙️ Hear our coverage →

#coding #agents

Windsurf Apr 16, 2026

Products & Apps

Windsurf 2.0

Windsurf 2.0 ships Agent Command Center and full Devin integration

Cognition launched Windsurf 2.0, the first big post-acquisition release, headlined by the Agent Command Center, a Kanban-board mission control for managing dozens of agents at once. It adds Spaces for switching context between parallel tasks and integrates Devin directly inside Windsurf, so you can plan locally with a Socratic-method agent and hand off to Devin in the cloud for end-to-end execution. Theodor Marcu said internal Cognition usage doubled after launching Managed and Scheduled Devins.

Windsurf 2.0 announcement (X) ↗Windsurf blog: Windsurf 2.0 ↗swyx on the Agent Command Center design (X) ↗

🎙️ Hear our coverage →

#agents #coding

Anthropic Apr 9, 2026

Products & Apps

Managed Agents

Anthropic ships Managed Agents, a fully hosted agent runtime

Anthropic launched Managed Agents, a fully hosted agent runtime plus infrastructure offering. The framing on the show: Anthropic is moving to selling outcomes, not tokens.

🎙️ Hear our coverage →

#agents #infrastructure

Cursor Apr 9, 2026

Major Features & Updates

Cursor remote agents & code review agent

Cursor ships remote agents and a code review agent

Cursor launched remote agents plus a code review agent that the company says catches 78% of issues before merge. It was mentioned in the week's tools and agentic-engineering roundup.

🎙️ Hear our coverage →

#coding #agents

MemPalace (Ben Sigman & Milla Jovovich) Apr 9, 2026

Dev Tools · Open weights

MemPalace

MemPalace open-source AI memory system goes viral with 26K stars

MemPalace, the open-source AI memory system from Milla Jovovich and Ben Sigman, went viral with 26K GitHub stars in 2 days and claimed top memory-benchmark scores. The team then transparently walked back the overstated benchmark claims in a public correction thread, which the show called a refreshingly honest arc.

MemPalace on GitHub ↗Ben Sigman launch post on X ↗Ben Sigman's transparent correction thread ↗Memory Palace web frontend on GitHub ↗

🎙️ Hear our coverage →

#agents #open-source

Meta (Meta Superintelligence Labs) Apr 9, 2026

New Models

Muse Spark

Meta launches Muse Spark, first model from Meta Superintelligence Labs

Meta dropped Muse Spark mid-show, the debut model from Meta Superintelligence Labs. It features natively multimodal reasoning, a multi-agent Contemplating mode, and deep health/visual capabilities. Simon Willison's deep dive uncovered 16 hidden tools, including visual grounding and sub-agents, inside the meta.ai chat UI.

AI at Meta announcement on X ↗Introducing Muse Spark (Meta blog) ↗MSL announcement ↗Simon Willison's deep dive on the 16 hidden tools ↗

🎙️ Hear our coverage →

#frontier-models #multimodal #agents

Nous Research Apr 9, 2026

New Models · Open weights

Hermes 27B

Nous Research ships Hermes 27B, paired with the Hermes harness

Nisten's pick of the week: Hermes 27B, an open model trained specifically to be paired with the Hermes harness and allegedly distilled from the Opus API. Model and harness ship together as a portable unit, a notable take on the harness-engineering trend Swyx discussed.

🎙️ Hear our coverage →

#open-source #agents

OpenAI Apr 9, 2026

Major Features & Updates

Codex plugins & Guardian Approvals

Codex hits 3M WAU with plugins, sub-agents and Guardian Approvals

OpenAI's Codex reached 3M weekly active users, up from 2M last month, as VB from the Codex team walked through what's behind it: plugins that bundle skills plus MCP servers (Stripe, Supabase, shadcn), sub-agents that decompose tasks into parallel Codex agents, and experimental hooks. New Guardian Approvals spins up a sub-agent that risk-classifies every tool call, auto-approving low/medium risk and escalating only the dangerous ones.
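
The approval pattern described here can be sketched as a small gate in front of every tool call. This is only an illustration of the idea: the `Risk` levels, the keyword rules and the `review` function are hypothetical, and a real guardian sub-agent would presumably ask a model to classify each call rather than match strings.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ToolCall:
    tool: str
    args: str


# Hypothetical rules standing in for a model-based risk classifier.
HIGH_RISK_MARKERS = ("rm -rf", "drop table", "git push --force")
MEDIUM_RISK_TOOLS = {"shell", "http_post"}


def classify(call: ToolCall) -> Risk:
    text = call.args.lower()
    if any(marker in text for marker in HIGH_RISK_MARKERS):
        return Risk.HIGH
    if call.tool in MEDIUM_RISK_TOOLS:
        return Risk.MEDIUM
    return Risk.LOW


def review(call: ToolCall) -> str:
    """Auto-approve low and medium risk; escalate high risk to a human."""
    return "escalate" if classify(call) is Risk.HIGH else "approve"


decisions = [
    review(ToolCall("read_file", "README.md")),
    review(ToolCall("shell", "ls -la")),
    review(ToolCall("shell", "rm -rf /tmp/build")),
]
```

The point of the design is that the human only sees the last kind of call; everything below the threshold flows through without a prompt.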

3M Codex weekly active users

VB (reach_vb) on X ↗

🎙️ Hear our coverage →

#agents #coding

OpenClaw Apr 9, 2026

Dev Tools · Open weights

OpenClaw 2026.4.5

OpenClaw 2026.4.5 ships /dreaming memory consolidation

OpenClaw's biggest release since 4.0: /dreaming goes GA with Light/Deep/REM memory consolidation phases that defrag agent memory into a human-readable Dream Diary (DREAMS.md). The release also adds built-in video and music generation across 4 backends, GPT-5.4 as the new default model, prompt-cache reuse improvements, and Control UI plus docs in 12 new languages. Maintainer Vincent Koc says the ~1.5M-line codebase was refactored into a plugin architecture in nine days.

1.5M lines OpenClaw codebase

OpenClaw v2026.4.5 release notes ↗Vincent Koc announcement on X ↗Dreaming docs ↗Turing Post FOD#147: Can your OpenClaw dream ↗

🎙️ Hear our coverage →

#agents #open-source

Z.ai (Zhipu AI) Apr 9, 2026

New Models · Open weights

GLM-5.1

GLM-5.1 takes #1 open-source spot on SWE-Bench Pro at 58.4%

Z.ai released GLM-5.1, now the #1 open-source model on SWE-Bench Pro at 58.4%. It can run autonomously for 8 hours with 1,700+ agent steps, and is already live on W&B Inference. Open weights are up on Hugging Face alongside an arXiv paper.

Z.ai announcement on X ↗GLM-5.1 weights on Hugging Face ↗GLM-5.1 paper on arXiv ↗

🎙️ Hear our coverage →

#open-source #agents #coding

Alibaba (Qwen) Apr 2, 2026

New Models

Qwen3.6-Plus

Alibaba ships Qwen3.6-Plus with near-Opus agentic coding and 1M context

Alibaba released Qwen3.6-Plus, an API model with agentic coding performance near Opus 4.5 and a 1M-token context window. The panel noted continued strong momentum for the Qwen family in practical coding and agent workloads.

Announcement (X) ↗Qwen blog ↗

🎙️ Hear our coverage →

#coding #agents #architecture

Cursor Apr 2, 2026

Products & Apps

Cursor 3

Cursor 3 ships as agent-first rebuild, dropping the VS Code fork

Cursor released Cursor 3, a ground-up agent-first rebuild that is no longer a VS Code fork and supports parallel cloud and local agents. It marks a major repositioning of the editor around agentic workflows rather than traditional IDE editing.

Announcement (X) ↗Cursor blog ↗

🎙️ Hear our coverage →

#coding #agents

Google DeepMind Apr 2, 2026

New Models · Open weights

Gemma 4

Google releases Gemma 4 open-weights family under Apache 2.0

Google DeepMind's Gemma 4 launch crossed 10M+ downloads with over 1,000 Gemma-4-based fine-tunes on Hugging Face; the Gemma family totals 500M+ downloads. Omar Sanseviero says Gemma is the foundation for the next generation of Gemini Nano shipping on Pixel and Samsung, with the AI Edge gallery letting people run it locally on Android and iOS. It punched above its size on Arena's Pareto curve and is now live on W&B Inference.

Hugging Face Collection ↗Try in AI Studio ↗Omar Sanseviero on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#open-source #agents #on-device

Liquid AI Apr 2, 2026

New Models · Open weights

LFM2.5-350M

Liquid AI ships LFM2.5-350M with agentic tool calling at 350M params

Liquid AI released LFM2.5-350M, a 350M-parameter open model that does agentic tool calling and fits under 500MB quantized. It targets edge and on-device agent workloads where tiny deployable models matter.
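
A quick back-of-the-envelope check on the size claim, counting weight storage only (no file metadata, KV cache or runtime memory). The bit widths are assumptions, since the quantization format isn't stated here.

```python
def weight_size_mb(params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return params * bits_per_weight / 8 / 1_000_000


PARAMS = 350_000_000  # 350M parameters, per the release

# Assumed precisions: 16-bit baseline, 8-bit and 4-bit quantization.
sizes = {bits: weight_size_mb(PARAMS, bits) for bits in (16, 8, 4)}
fits_under_500mb = {bits: size < 500 for bits, size in sizes.items()}
```

Under these assumptions the 16-bit weights come to about 700 MB, while 8-bit (about 350 MB) and 4-bit (about 175 MB) both land under the 500MB mark.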

Announcement (X) ↗Hugging Face ↗Liquid AI blog ↗

🎙️ Hear our coverage →

#open-source #on-device #agents

Ryan Carson Apr 2, 2026

Dev Tools · Open weights

Claw Chief

Ryan Carson open-sources Claw Chief, an AI chief of staff

Co-host Ryan Carson open-sourced Claw Chief, an AI chief-of-staff setup with skills, crons, and scheduling. It packages his agent workflow patterns into a reusable open-source repo.

🎙️ Hear our coverage →

#agents #consumer-ai #open-source

Ultraworkers (Sigrid Jin & Bellman) Apr 2, 2026

Dev Tools · Open weights

claw-code

Claw-code clean-room rewrite becomes fastest repo to 100K GitHub stars

After Claude Code's source leaked via npm, Sigrid Jin and Bellman published claw-code, a clean-room rewrite that became the fastest GitHub repo to pass 100K stars, hitting the mark in roughly 24 hours. Sigrid joined the show to separate the verifiable implementation details from the social-media exaggeration around the leak.

100K+ GitHub stars in 24h

🎙️ Hear our coverage →

#coding #agents #open-source

WolfBench (Wolfram Ravenwolf) Apr 2, 2026

Benchmarks & Evals

WolfBench

WolfBench results show Hermes Agent beating Claude Code and OpenClaw

Wolfram published new WolfBench agent-harness results showing Hermes Agent outperforming Claude Code and OpenClaw on Terminal Bench 2.0 across most model combinations. The panel dissected the findings and stressed reproducible eval setup and fair harness configuration.

WolfBench.ai ↗Viral results thread on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#benchmarks #agents #coding

March 2026

Anthropic Mar 26, 2026

Major Features & Updates

Claude computer use (Cowork + Claude Code)

Claude can now control your Mac: computer use lands in Cowork and Claude Code

Anthropic shipped computer use as a research preview in Claude Cowork and Claude Code, letting Claude directly control local Mac workflows. The panel compared it to existing OpenClaw-style agent patterns and debated where direct UI control is genuinely useful versus overkill.

Claude announcement (X) ↗Claude Cowork product page ↗

🎙️ Hear our coverage →

ARC Prize Foundation Mar 26, 2026

Benchmarks & Evals

ARC-AGI-3

ARC-AGI-3 launches: humans score 100%, frontier models under 1%

ARC Prize launched ARC-AGI-3, an interactive agentic reasoning benchmark of turn-based puzzle games designed to test human-like generalization in novel abstract environments. Humans hit a 100% pass rate while top frontier models score under 1%, which the panel welcomed as a healthy reality check against AGI-is-here rhetoric and easy score inflation.

<1% ARC-AGI-3 frontier model scores · 100% Human completion on ARC-AGI-3

ARC Prize announcement (X) ↗ARC Prize site ↗

🎙️ Hear our coverage →

#benchmarks #reasoning #agents

Google DeepMind Mar 26, 2026

New Models

Gemini 3.1 Flash Live

Google drops Gemini 3.1 Flash Live: Gemini can see, hear, and talk to you

Google released Gemini 3.1 Flash Live, a realtime multimodal model that handles voice and vision interaction in a single model path instead of stitched pipelines. The panel framed it as a major upgrade for end-to-end voice and vision agents, with AI Studio and API availability as the immediate way to experiment.

Google DeepMind announcement (X) ↗

🎙️ Hear our coverage →

#voice-ai #agents

MiniMax Mar 26, 2026

New Models · Open weights

MiniMax 2.7

MiniMax 2.7 open-source weights discussed as small-model momentum continues

The panel covered MiniMax 2.7 and its open-weights release in the context of small, efficient models becoming genuinely practical for local and specialized agent workflows. The segment focused on capability momentum and how open-weights expectations keep shaping adoption sentiment.

🎙️ Hear our coverage →

#open-source #agents

Anthropic Mar 19, 2026

Major Features & Updates

Claude Opus 4.6 (1M context)

Anthropic makes Opus 4.6 1M context the default in Claude Code, same price

Anthropic made 1M token context the default for Opus 4.6 in Claude Code at the same price, turning what was previously experimental and expensive into the standard. MRCR benchmark performance holds at 93% at 256K and 76% at 1M. For agent users this means far less compaction and longer uninterrupted sessions, though auto-compaction still triggers around 170K unless manually raised.

1M Opus 4.6 context default

🎙️ Hear our coverage →

#architecture #agents #coding

Cursor Mar 19, 2026

New Models

Composer 2

Cursor Composer 2 beats Opus 4.6 on TerminalBench at a tenth of the price

Cursor launched Composer 2, its first proprietary model that genuinely competes with frontier labs. It scores 61% on TerminalBench (beating Opus 4.6) at $0.50/M input tokens, cheaper than GPT-5.4 Mini and 10x cheaper than Opus, running at 300+ tokens/sec. A fast variant costs 3x more for the same intelligence, kicking off a new 'fast mode' pricing trend where you pay a premium for speed rather than capability.

Cursor blog ↗X announcement ↗Cursor announcement (X) ↗Composer 2 tech report (PDF) ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents

Google Mar 19, 2026

Major Features & Updates

Google AI Studio (vibe coding overhaul)

Google AI Studio gets full-stack vibe coding with Antigravity and Firebase

Google AI Studio received a full-stack vibe coding overhaul featuring the Antigravity agent, Firebase integration, and multiplayer support. The update pushes AI Studio from a model playground toward a full app-building environment.

Logan Kilpatrick on X ↗Google blog ↗AI Studio ↗

🎙️ Hear our coverage →

#coding #agents

H Company Mar 19, 2026

New Models · Open weights

Holotron-12B

H Company's Holotron-12B: hybrid SSM computer-use model at 8.9k tok/s

H Company released Holotron-12B, an open-source hybrid SSM model built for computer-use agents. It claims 8,900 tokens/sec generation speed and a WebVoyager score that rises from 35.1% to 80.5%, continuing the trend of hybrid SSM architectures for long-context agent workloads.

8,900 tok/s H Company Holotron-12B

Hugging Face ↗H Company blog ↗H Company on X ↗BricksAI on X ↗

🎙️ Hear our coverage →

#open-source #agents

Manus (Meta) Mar 19, 2026

Products & Apps

Manus My Computer

Manus launches 'My Computer' desktop app for macOS and Windows

Manus, now Meta-owned, launched 'My Computer', a desktop app that brings its AI agent from the cloud onto your local machine for macOS and Windows. The agent can now operate directly on local files and applications rather than running only in a hosted sandbox.

Manus on X ↗Manus blog ↗

🎙️ Hear our coverage →

MiniMax Mar 19, 2026

New Models

MiniMax M2.7

MiniMax M2.7: first self-evolving model hits 56% on SWE-Bench Pro

MiniMax dropped M2.7, billed as the first self-evolving model: it ran 100+ autonomous RL optimization loops and wrote its own agent scaffolding, built by one engineer over four days with zero lines of human code. It scores 56.22% on SWE-Bench Pro, about one point behind Opus 4.6's 57.3%, and WolfBench shows it roughly matching Sonnet 4.6 on OpenClaw agent tasks. Not yet open weights, though rumors suggest a release is coming.

56% MiniMax 2.7 SWE-bench Pro

MiniMax announcement ↗MiniMax on X ↗TestingCatalog on X ↗MiniMax M2.7 announcement (X) ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #reasoning

NVIDIA Mar 19, 2026

Products & Apps

NemoClaw

NVIDIA announces NemoClaw, enterprise-hardened OpenClaw, at GTC

At GTC, Jensen Huang spent 15 minutes on OpenClaw, calling it the most important open source release since Linux and declaring 'every company needs an OpenClaw strategy.' NVIDIA released NemoClaw, a hardened enterprise reference implementation of OpenClaw with a privacy router and policy engine aimed at solving the agent security problem.

NemoClaw site ↗NVIDIA NemoClaw page ↗TechCrunch coverage ↗Alex Volkov on X ↗

🎙️ Hear our coverage →

#agents #industry #safety

OpenAI Mar 19, 2026

Major Features & Updates

Codex Subagents

OpenAI ships subagents for Codex with custom TOML configs

OpenAI added subagents to Codex, enabling parallel specialized agents configured via custom TOML files. Paired with the cheap GPT-5.4 Mini and Nano models, this enables the orchestrator-plus-workers pattern where a flagship model spawns inexpensive parallel subagents for tasks like visual testing.
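
The orchestrator-plus-workers pattern mentioned above can be sketched with plain `asyncio`. This is a generic illustration, not Codex's subagent API or its TOML format: `run_subagent`, `orchestrate` and the task names are hypothetical, and the worker is a stub where a cheap model call would go.

```python
import asyncio


async def run_subagent(name: str, task: str) -> str:
    """Stand-in for a call to a small, inexpensive worker model."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"{name}: done '{task}'"


async def orchestrate(tasks: dict[str, str]) -> list[str]:
    """Fan tasks out to parallel workers and collect results in order."""
    jobs = [run_subagent(name, task) for name, task in tasks.items()]
    return await asyncio.gather(*jobs)


results = asyncio.run(orchestrate({
    "visual-tester": "screenshot the login page",
    "linter": "check style in src/",
}))
```

`asyncio.gather` keeps results in submission order, so the orchestrator can map each answer back to the worker it delegated to.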

Codex subagents docs ↗OpenAI Devs on X ↗Codex GitHub ↗

🎙️ Hear our coverage →

#agents #coding

OpenAI Mar 19, 2026

New Models

GPT-5.4 Mini & Nano

OpenAI ships GPT-5.4 Mini and Nano for coding, computer use, and subagents

OpenAI released GPT-5.4 Mini ($0.75/M input) and Nano, smaller variants optimized for coding and computer use at a fraction of flagship cost. Mini hits 72% on OSWorld-Verified, matching the human baseline and nearly reaching full 5.4's 75%, while beating Sonnet 4.5 on most benchmarks. They are designed as cheap parallel subagent workers under a GPT-5.4 orchestrator in Codex, and Mini is 2x faster than the previous GPT-5 Mini.

X announcement ↗GPT-5.4 Mini docs ↗API pricing ↗

🎙️ Hear our coverage →

#coding #agents

Cursor Mar 13, 2026

Major Features & Updates

Cursor in JetBrains (ACP)

Cursor joins ACP registry and goes live in JetBrains IDEs

Cursor joined the Agent Client Protocol (ACP) registry and is now live inside JetBrains IDEs. The move is a cross-ecosystem win for ACP, the emerging open standard that lets any AI agent plug into any editor.

JetBrains: Cursor joins ACP registry ↗Cursor blog: JetBrains ACP ↗

🎙️ Hear our coverage →

#agents #coding

Andrej Karpathy Mar 13, 2026

Dev Tools · Open weights

AutoResearcher

Karpathy open-sources AutoResearcher for autonomous ML experiments

Andrej Karpathy open-sourced AutoResearch, a framework that runs AI-driven ML experiments autonomously. Over two days it ran 700 experiments on nanochat GPT-2, stacked 20 improvements, and achieved an 11% training speedup. Tobi Lütke adapted it overnight for Shopify's Liquid templating engine for a 51% render-time improvement, and the repo hit 26K GitHub stars quickly.

700 AutoResearcher experiments run in 2 days (Karpathy) · 11% GPT-2 training speedup from stacked AutoResearcher improvements · 51% Shopify Liquid render time improvement using AutoResearcher

Karpathy on X ↗autoresearch on GitHub ↗nanochat on GitHub ↗

🎙️ Hear our coverage →

#agents #search #coding

Matt Van Horn Mar 13, 2026

Dev Tools · Open weights

/last30days

/last30days research skill searches X, Reddit, YouTube and TikTok

Matt Van Horn presented /last30days, a research skill that searches X, Reddit, YouTube, and TikTok for the last 30 days of content on any topic. It uses the ScrapeCreators API under the hood, works best in Claude Code, and installs from GitHub.

/last30days on GitHub ↗@slashlast30days on X ↗

🎙️ Hear our coverage →

#research #coding #agents

MiroMind Mar 13, 2026

New Models · Open weights

MiroThinker-1.7

MiroThinker-1.7 open-source research agent hits SOTA

MiroMind released MiroThinker-1.7, an open-source deep-research agent model that reaches state of the art on deep research benchmarks. It was covered alongside NVIDIA's Nemotron launch in the open-source segment.

MiroThinker-1.7 on X ↗MiroThinker-1.7 on HuggingFace ↗

🎙️ Hear our coverage →

#agents #open-source #research

Paperclip Mar 13, 2026

Dev Tools · Open weights

Paperclip.ing

Paperclip.ing: open-source agent orchestration for zero-human companies

Anonymous builder DOTTA presented Paperclip.ing, an open-source agent orchestration framework for 'zero human companies' where an AI CEO recursively hires more agents. It hit 20K GitHub stars in its first week, with a heartbeat system driving agent autonomy and a Memento-style memory architecture keeping agents coherent across tasks.

20K Paperclip GitHub stars in first week

Paperclip on GitHub ↗Paperclip.ing website ↗

🎙️ Hear our coverage →

#agents #open-source

Weights & Biases Mar 13, 2026

Dev Tools · Open weights

W&B Agent Skills

Weights & Biases launches Agent Skills

Weights & Biases officially launched Agent Skills, installable via `npx skills add wandb/skills`. The launch coincided with Nemotron 3 Super becoming available on W&B Inference at $0.20/1M input tokens, one of the best price-performance options for a 120B model.

W&B Agent Skills on X ↗W&B Skills on GitHub ↗

🎙️ Hear our coverage →

#agents #coding

Cognition Mar 5, 2026

New Models

SWE-1.6

Cognition previews SWE-1.6, hitting 51% on SWE Bench Pro

Cognition previewed SWE-1.6, the next iteration of its software-engineering model line, citing 51% on SWE Bench Pro. It was covered in the TL;DR tools segment as part of the week's agentic coding model releases.

51% SWE Bench Pro (SWE 1.6)

Cognition SWE-1.6 announcement ↗SWE-1.6 preview blog post ↗

🎙️ Hear our coverage →

#coding #agents

OpenAI Mar 5, 2026

Dev Tools · Open weights

Symphony

OpenAI releases Symphony on GitHub

Ryan Carson experimented with OpenAI's Symphony framework, letting agents work through PRs overnight. One agent not only created a PR but found a bug and filed its own detailed Jira ticket with no human intervention, a small but telling sign of where agentic development is heading.

Symphony on GitHub ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents

Weights & Biases Mar 5, 2026

Benchmarks & Evals

Wolf Bench

Wolfram previews Wolf Bench, a multi-metric agent eval from W&B

Wolfram Ravenwolf gave an early preview of Wolf Bench, a Terminal Bench-based evaluation framework from Weights & Biases that reports four metrics (average, best run, ceiling, and consistent floor) instead of a single score. It treats harness differences (Terminal Bench vs Claude Code vs OpenClaw) as a first-class factor and publishes benchmark cost and transparency details.
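
To show why four numbers tell you more than one, here is a small sketch that derives them from repeated runs over the same task set. The definitions are assumptions made for illustration (the benchmark's exact formulas aren't given here): average is the mean per-run pass rate, best run is the highest per-run pass rate, ceiling is the share of tasks solved in at least one run, and consistent floor is the share solved in every run.

```python
def summarize(runs: list[dict[str, bool]]) -> dict[str, float]:
    """runs[i][task] is True if run i solved that task."""
    tasks = sorted(runs[0])
    per_run = [sum(run[t] for t in tasks) / len(tasks) for run in runs]
    return {
        "average": sum(per_run) / len(per_run),
        "best_run": max(per_run),
        # Solved by at least one run.
        "ceiling": sum(any(run[t] for run in runs) for t in tasks) / len(tasks),
        # Solved by every run.
        "consistent_floor": sum(all(run[t] for run in runs) for t in tasks) / len(tasks),
    }


# Two hypothetical runs over four tasks.
runs = [
    {"t1": True, "t2": True, "t3": False, "t4": False},
    {"t1": True, "t2": False, "t3": True, "t4": False},
]
scores = summarize(runs)
```

Both toy runs score 50%, yet the ceiling is 75% and the consistent floor only 25%, which is exactly the spread a single averaged score would hide.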

🎙️ Hear our coverage →

#benchmarks #agents

February 2026

Alibaba (Qwen) Feb 26, 2026

New Models · Open weights

Qwen 3.5

Qwen 3.5 lands: 35B/3B-active Medium outperforms the old 235B flagship

Alibaba released the Qwen 3.5 family of open-weight models, headlined by Qwen3.5-35B-A3B, a 35B model with only 3B active parameters that outperforms their previous 235B flagship. Variants include a 122B-A10B and a dense 27B, with the panel highlighting the hybrid state-space (Mamba-layer) architecture and strong practical coding and agent performance at a tiny active-parameter footprint.

35B / 3B active Qwen 3.5 Medium

Qwen announcement on X ↗Qwen3.5-35B-A3B on Hugging Face ↗Qwen3.5-122B-A10B on Hugging Face ↗Qwen 3.5 blog post ↗

🎙️ Hear our coverage →

#open-source #architecture #coding

Anthropic Feb 26, 2026

Major Features & Updates

Claude Code Remote Control & Memory

Claude Code adds Remote Control and memory

Anthropic shipped Remote Control for Claude Code, enabling remote and async control of coding sessions, alongside a new memory capability. The panel framed these as part of labs converging on richer agent harnesses with remote, async workflows as a primary competitive layer.

Claude announcement on X ↗Remote Control docs ↗Memory announcement on X ↗

🎙️ Hear our coverage →

#agents #coding

Anthropic Feb 26, 2026

Major Features & Updates

Claude Cowork Automations

Claude Cowork gets automations (cron jobs), matching Codex

Claude Cowork added automations, cron-job-style scheduled agent runs, in the same week OpenAI's Codex gained equivalent automation support. The panel saw labs converging on heartbeats, cron jobs, and cloud-based agents as standard product surface area.

Claude Cowork automations on X ↗

🎙️ Hear our coverage →

#agents #coding

Cognition Labs Feb 26, 2026

Products & Apps

Devin 2.2

Devin 2.2: computer use, browser, and self-verifying autonomous work

Cognition shipped Devin 2.2, an autonomous coding agent that can use a computer and browser to verify and fix its own work, plus a free public Devin Review workflow for PR review and scheduled/automated sessions. Nader Dabit framed the release as two years of platform maturity converging with stronger models, letting non-engineers fix issues directly by just asking Devin.

Cognition announcement on X ↗

🎙️ Hear our coverage →

#agents #coding

Cursor Feb 26, 2026

Major Features & Updates

Cloud Agents

Cursor launches cloud agents

Cursor launched cloud agents, moving agentic coding work off the local machine into remote, async sessions. The panel highlighted Cursor's cloud agents and UI demos as important progress for frontend development workflows.

Lee Robinson demo on X ↗

🎙️ Hear our coverage →

#agents #coding

METR Feb 26, 2026

Benchmarks & Evals

Time Horizon Benchmark

METR Time Horizon goes vertical: Opus 4.6 hits ~14.5-hour tasks

METR's updated Time Horizon benchmark shows Claude Opus 4.6 completing tasks equivalent to roughly 14.5 hours of expert human work, with the autonomy doubling time now cited at 49 days. The panel treated this as the week's strongest evidence that agent capability growth has entered a visibly faster phase.
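
For a sense of what a 49-day doubling time implies, here is a simple exponential projection from the two cited figures. It assumes the doubling rate stays constant, which is an extrapolation for illustration rather than anything METR claims; `projected_horizon` is a hypothetical helper.

```python
def projected_horizon(hours_now: float, days_ahead: float,
                      doubling_days: float = 49.0) -> float:
    """Task horizon after `days_ahead` days under constant doubling."""
    return hours_now * 2 ** (days_ahead / doubling_days)


after_one_doubling = projected_horizon(14.5, 49)   # 29.0 hours
after_two_doublings = projected_horizon(14.5, 98)  # 58.0 hours
```

Two doublings, a little over three months, would take a ~14.5-hour horizon to roughly 58 hours if the trend held.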

14.5h METR Time Horizon · 49 days Autonomy Doubling Time

Peter Wildeford thread on X ↗METR website ↗

🎙️ Hear our coverage →

#benchmarks #agents

Nous Research Feb 26, 2026

Products & Apps

Nous Research Agent

Nous Research ships a research agent

Nous Research announced a research agent, joining the wave of lab-built agentic tools shipped this week. It was covered in the roundup of new agent products alongside Cursor cloud agents and Perplexity Computer.

Nous Research announcement on X ↗

🎙️ Hear our coverage →

#agents #research

Perplexity Feb 26, 2026

Products & Apps

Perplexity Computer

Perplexity introduces Perplexity Computer

Perplexity launched Perplexity Computer, an agentic computer product announced via its blog. It was discussed as part of the week's convergence on agent harnesses, automations, and cloud-based agent workflows across labs.

Introducing Perplexity Computer (blog) ↗

🎙️ Hear our coverage →

Anthropic Feb 19, 2026

New Models

Claude Sonnet 4.6

Anthropic ships Claude Sonnet 4.6 with 79.6% SWE-Bench and 1M context

Anthropic launched Claude Sonnet 4.6, its most capable Sonnet ever, scoring 79.6% on SWE-Bench Verified, nearly matching Opus 4.6 at Sonnet pricing of $3/$15 per million tokens. It ships with a 1M token context window in beta and is now the default model on claude.ai. In blind Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time, and it beats the previous Gemini 3 Pro on most benchmarks.

79.6% SWE-Bench Verified

Claude Sonnet 4.6 announcement (X) ↗Anthropic blog: Claude Sonnet 4.6 ↗Claude Sonnet page ↗

🎙️ Hear our coverage →

#coding #agents #architecture

Dreamer Feb 19, 2026

Products & Apps

Dreamer

Dreamer launches beta platform for building agentic apps with no-code AI

Dreamer launched its beta, a full-stack platform for building and discovering agentic apps with no-code AI. It aims to let non-developers assemble and share agent-powered applications.

Dreamer beta announcement (X) ↗Dreamer ↗

🎙️ Hear our coverage →

#agents #coding

OpenAI Feb 19, 2026

Acquisitions

OpenClaw acqui-hire

OpenAI acqui-hires OpenClaw creator Peter Steinberger

OpenAI acqui-hired Peter Steinberger, the creator of the viral OpenClaw agent, in what the panel speculated might be the first single-founder billion-dollar deal. Yam Peleg broke the news on the show, calling Steinberger 'the goat'. The move lands the most popular third-party agent harness builder inside OpenAI, amid a week where Anthropic's terms changes pushed agent users toward OpenAI subscriptions.

🎙️ Hear our coverage →

#agents #industry #coding

Ryan Carson Feb 19, 2026

Also Released

Code Factory

Ryan Carson publishes the viral Code Factory agentic engineering blueprint

Ryan Carson published his viral Code Factory article, a blueprint for fully automated code generation, review, and deployment inspired by OpenAI's Harness Engineering post. The setup chains GitHub Actions, Greptile code review, CI gates, a risk-classification system for high-risk file changes, and a self-healing loop where Codex fixes its own PR issues until all checks pass. He says it takes a week-plus of setup but unlocks massive throughput.
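
The self-healing loop is the part of this setup that is easiest to picture in code. The sketch below is a generic version of the idea, not the article's actual implementation: `self_heal`, `run_checks` and `fix` are hypothetical names, and the attempt cap is an added assumption so a stubborn failure cannot loop forever.

```python
from typing import Callable


def self_heal(run_checks: Callable[[], list[str]],
              fix: Callable[[list[str]], None],
              max_attempts: int = 5) -> bool:
    """Re-run checks, handing failures to the fixer, until green or out of attempts."""
    for _ in range(max_attempts):
        failures = run_checks()
        if not failures:
            return True   # all checks pass, safe to merge
        fix(failures)     # e.g. ask the coding agent to patch the PR
    return False          # still red: hand off to a human


# Toy stand-ins: each "fix" resolves one failing check.
pending = ["lint", "unit-tests"]
merged = self_heal(lambda: list(pending),
                   lambda failures: pending.remove(failures[0]))

# A check that never gets fixed exhausts the attempts instead of looping.
gave_up = not self_heal(lambda: ["flaky"], lambda failures: None, max_attempts=3)
```

In a real pipeline `run_checks` would poll CI and review status, and `fix` would hand the failure logs back to the agent that opened the PR.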

Code Factory thread (X) ↗OpenAI: Harness Engineering ↗

🎙️ Hear our coverage →

#agents #coding

xAI Feb 19, 2026

New Models

Grok 4.20

xAI silently drops Grok 4.20 with four 500B-param collaborating agents

xAI released Grok 4.20, a multi-agent system where four 500B-parameter agents collaborate in a multi-agent UI, with a $300/month Heavy tier scaling to 16 agents. No benchmarks or evals were released with the drop. The panel found it underwhelming for coding and day-to-day agent work but still top tier for deep research thanks to xAI's RAG over X data; Grok 4.1 Fast remains #8 on OpenRouter by API usage.

500B×4 Grok 4.20 Architecture

Grok 4.20 on X ↗xAI model docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #frontier-models #search

Entire Feb 12, 2026

Funding · Open weights

Entire Checkpoints

Entire raises $60M seed, ships first OSS release 'Checkpoints'

Entire raised a $60M seed round to build an open-source developer platform for AI agent workflows. Alongside the funding it shipped its first open-source release, Checkpoints, available on GitHub.

Entire announcement on X ↗Entire CLI on GitHub ↗Entire.dev ↗

🎙️ Hear our coverage →

#agents #coding

Google (Chrome) Feb 12, 2026

APIs & Platforms

WebMCP (Chrome 146)

Chrome 146 introduces WebMCP, a native browser API for AI agents

Chrome 146 shipped WebMCP, a native browser API that lets AI agents directly interact with web services. It brings Model Context Protocol-style agent access into the browser itself, a notable primitive for the agentic web.

WebMCP coverage on X ↗

🎙️ Hear our coverage →

#agents #consumer-ai

MiniMax Feb 12, 2026

New Models · Open weights

MiniMax M2.5

MiniMax M2.5 hits 80.2% SWE-Bench Verified with 10B active params

MiniMax dropped M2.5 thirty minutes before the show: a 200B-total, 10B-active open-weights model scoring 80.2% on SWE-Bench Verified, approaching Opus 4.6 at roughly 1/20th the cost (~15 cents per task with a 57% win rate over Opus). It was trained with MiniMax's decoupled Forge RL framework and is optimized for end-to-end task time, with fewer tool calls and thinking tokens. Senior researcher Olive Song joined live and revealed that the model was still training; the team cut a checkpoint for early release.

80.2% SWE-Bench Verified · 15¢ Cost per task

MiniMax M2.5 benchmarks on X ↗

🎙️ Hear our coverage →

#open-source #coding #agents

OpenAI Feb 12, 2026

Major Features & Updates

Deep Research (GPT-5.2)

OpenAI upgrades Deep Research to GPT-5.2 with app integrations

OpenAI upgraded Deep Research to run on GPT-5.2, adding app integrations, site-specific searches, and real-time collaboration. Part of the week's rapid-fire big-lab announcements covered in the TLDR rundown.

OpenAI announcement on X ↗OpenAI Deep Research blog ↗

🎙️ Hear our coverage →

#agents #research

Ryan Carson Feb 12, 2026

Dev Tools

AntFarm

Ryan Carson releases AntFarm for agent coordination

Co-host Ryan Carson released AntFarm, a tool for coordinating teams of coding agents. It targets the missing primitives for managing multiple agents that the panel discussed during the agent-psychosis segment.

AntFarm announcement on X ↗

🎙️ Hear our coverage →

#agents #coding

Zhipu AI (Z.ai) Feb 12, 2026

New Models · Open weights

GLM-5

Z.ai launches GLM-5, the open-weights agentic coding crown

Z.ai released GLM-5, a 744B-parameter MoE model (40B active) trained on 28.5 trillion tokens that takes the #1 open-source ranking for agentic coding with 77.8% on SWE-bench Verified. It introduces the slime asynchronous RL framework for post-training, adopts DeepSeek's sparse attention to cut deployment cost, and was trained on Huawei chips rather than NVIDIA. Lou from Z.ai joined the show live and summed it up as bigger, faster, better, and cheaper.

744B GLM-5 Parameters · 28.5T Training tokens

Z.ai announcement on X ↗GLM-5 on Hugging Face ↗W&B Inference day-zero support ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Alibaba (Qwen) Feb 5, 2026

New Models · Open weights

Qwen3-Coder-Next

Qwen3-Coder-Next hits 70.6% SWE-Bench Verified with 3B active params

Alibaba's Qwen3-Coder-Next is an 80B MoE coding agent model with only 3B active parameters that scores 70.6% on SWE-Bench Verified and 44% on the much harder SWE-Bench Pro. It was trained on 7.5T tokens with 20,000 parallel RL environments and runs in under 48GB of RAM with GGUF quantization, making near-frontier agentic coding feasible on local hardware.
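
A back-of-the-envelope check shows why an 80B-parameter model can fit under 48GB once quantized. The sketch below assumes about 4.5 bits per weight (4-bit values plus per-block scales), which is an assumption and not a figure from the release, and it counts weights only, ignoring KV cache and runtime overhead.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 80e9  # 80B total parameters, per the entry above

# 16-bit weights versus an assumed ~4.5 bits per weight after quantization.
full_precision_gb = weight_memory_gb(TOTAL_PARAMS, 16)
quantized_gb = weight_memory_gb(TOTAL_PARAMS, 4.5)

print(f"16-bit weights:   {full_precision_gb:.0f} GB")
print(f"~4.5-bit weights: {quantized_gb:.0f} GB")
```

Note that all 80B parameters must be resident even though only 3B are active per token; the low active count helps speed, not memory.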

70.6% SWE-Bench Verified · 44% SWE-Bench Pro

X announcement ↗Qwen blog ↗Hugging Face collection ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Anthropic Feb 5, 2026

New Models

Claude Opus 4.6

Anthropic ships Claude Opus 4.6 with 1M context and agent teams

Anthropic dropped Opus 4.6 live during the show, claiming state-of-the-art on GDPval, BrowseComp, and agentic search, with 65.4% on Terminal-Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model picks up contextual cues about how much reasoning effort to spend. Pricing matches Opus 4.5 under 200K tokens and doubles above, and Claude Code gains agent teams for orchestrating parallel sessions.
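
The tiered pricing can be illustrated with a short calculation. The base rates are the Opus 4.5 prices listed in the Opus 4.5 entry on this page ($5/M input, $25/M output). The sketch assumes the doubled rate applies to the whole request once the input exceeds 200K tokens; that billing rule is an assumption here, so check Anthropic's pricing page before relying on it.

```python
BASE_INPUT_PER_M = 5.00    # USD per 1M input tokens (Opus 4.5 rate)
BASE_OUTPUT_PER_M = 25.00  # USD per 1M output tokens (Opus 4.5 rate)
LONG_CONTEXT_THRESHOLD = 200_000


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one request under the assumed tiering rule."""
    # Assumption: the 2x multiplier covers the entire request, not just the excess.
    multiplier = 2.0 if input_tokens > LONG_CONTEXT_THRESHOLD else 1.0
    base = (input_tokens * BASE_INPUT_PER_M
            + output_tokens * BASE_OUTPUT_PER_M) / 1_000_000
    return round(base * multiplier, 4)


print(request_cost(100_000, 10_000))  # short-context request
print(request_cost(400_000, 10_000))  # long-context request
```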

1M Context tokens

X announcement ↗Anthropic blog ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

Moltbook Feb 5, 2026

Products & Apps

Moltbook

Moltbook: a Reddit built for and by AI agents

Moltbook launched as a social network for AI agents, part of an exploding 'agentic internet' that now includes agent equivalents of YouTube, Twitter, Instagram, 4chan, and even a church. Agents on these networks were observed discussing creating encrypted languages humans cannot read, and the panel warned against letting your agents loose on them.

🎙️ Hear our coverage →

OpenAI Feb 5, 2026

Products & Apps

Codex App

OpenAI launches standalone Codex app for managing parallel coding agents

OpenAI shipped Codex as a dedicated Mac app, a command center for running multiple AI coding agents in parallel. Features include worktrees for parallel project branches, scheduled automations, a skills marketplace with Cloudflare, Vercel, Figma, Notion, and Linear integrations, inline diff review with per-line commenting, and cloud hand-off. OpenAI granted a free month of access to all users, including the free tier, and doubled rate limits for all tiers for two months.

VB announcement on X ↗Codex app ↗

🎙️ Hear our coverage →

#coding #agents

OpenAI Feb 5, 2026

Products & Apps

OpenAI Frontier

OpenAI Frontier: enterprise platform for AI agents as coworkers

OpenAI launched Frontier, an enterprise platform to build, deploy, and manage AI agents as 'AI coworkers'. It targets companies that want to operationalize agents across their organizations.

X announcement ↗OpenAI blog ↗

🎙️ Hear our coverage →

#agents #industry

OpenAI Feb 5, 2026

New Models

GPT-5.3-Codex

OpenAI answers Opus with GPT-5.3-Codex, first model that helped build itself

One hour after Opus 4.6, OpenAI released GPT-5.3-Codex, billed as the first model instrumental in developing itself — the Codex team used early versions to debug its own training and manage its own deployment. It scores 73% on Terminal Bench 2.0, a 10-point gap over Opus 4.6, while running queries 25% faster and more token-efficiently than its predecessor, with improved mid-task steerability.

73% Terminal Bench 2.0 · 25% Speed improvement

Sam Altman announcement on X ↗OpenAIDevs announcement on X ↗GPT-5.3-Codex model docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #coding #agents

January 2026

Anthropic Jan 29, 2026

Major Features & Updates

MCP Apps

Anthropic launches MCP Apps: interactive UI inside Claude chat

Anthropic's MCP Apps render interactive, branded UI components (Box files, Figma, color pickers) directly within Claude conversations, evolving MCP from tools to embedded app experiences. It is protocol-based, so any app can integrate, letting brands reclaim identity from text-only LLM responses.

Announcement (X) ↗

🎙️ Hear our coverage →

#agents #coding

Google Jan 29, 2026

Major Features & Updates

Chrome Auto-Browse

Google launches agentic Auto-Browse in Chrome with Gemini 3

Google unveiled Chrome Auto-Browse with Gemini 3 Nano integration, bringing agentic browsing to Pro and Ultra subscribers in the world's most-used browser with 4 billion daily users. Native browsing avoids Cloudflare bot detection, and Gemini's 2M context window suits long browsing sessions.

4B Chrome daily users

Announcement (X) ↗Google Blog ↗

🎙️ Hear our coverage →

Google Jan 29, 2026

Major Features & Updates

Gemini 3 Flash Agentic Vision

Google adds Agentic Vision to Gemini 3 Flash

Gemini 3 Flash gains agentic vision: a Think-Act-Observe loop that can zoom, crop, annotate, and plot images by generating and executing Python code in the backend. Available in the Gemini app, AI Studio, and Vertex AI.

Announcement (X) ↗Docs ↗

🎙️ Hear our coverage →

#vision #agents #reasoning

Moonshot AI Jan 29, 2026

Dev Tools · Open weights

Kimi Code

Moonshot AI releases Kimi Code coding agent

Alongside Kimi K2.5, Moonshot AI shipped Kimi Code, a coding tool that pairs with its new flagship model's strong agentic coding abilities. The code is available on GitHub with an announcement page at kimi.ai/code.

Announcement (X) ↗Kimi Code ↗GitHub ↗

🎙️ Hear our coverage →

#coding #agents

Moonshot AI Jan 29, 2026

New Models · Open weights

Kimi K2.5

Moonshot AI releases Kimi K2.5, the new open-source king

Moonshot AI's Kimi K2.5 takes the open-source crown, becoming the most-used model on OpenRouter and topping open-source leaderboards. The panel highlighted its strong agentic coding performance and tool use.

Announcement (X) ↗Hugging Face ↗

🎙️ Hear our coverage →

#open-source #agents #coding

OpenAI Jan 29, 2026

Acquisitions

Cline team acqui-hire (Codex)

Cline team acqui-hired by OpenAI Codex

The Cline team was acqui-hired by OpenAI's Codex group following the viral 'imagine the smell' hackathon controversy. It was discussed as part of the growing Codex ecosystem, which Peter Steinberger used to build all of Clawdbot.

🎙️ Hear our coverage →

#coding #industry #agents

Anthropic Jan 22, 2026

Dev Tools

Claude Code VS Code Extension

Claude Code VS Code extension hits general availability

Anthropic's Claude Code VS Code extension reached general availability, bringing full agentic coding directly into the IDE. The GA release makes Claude Code's agent workflows accessible from the VS Code Marketplace without the CLI.

Claude Code VS Code Extension (X) ↗Claude Code VS Code on Marketplace ↗Claude Code VS Code docs ↗

🎙️ Hear our coverage →

#coding #agents

Browser Use Jan 22, 2026

Major Features & Updates · Open weights

Browser Use Skill

Browser Use ships as an installable agent skill

Browser Use was released as an agent skill, installable via registries like Vercel's skills.sh. Wolfram flagged it as a signal of the broader shift away from MCP servers toward skills, since skills are easier to use with the CLI or API directly.

🎙️ Hear our coverage →

Peter Steinberger Jan 22, 2026

Dev Tools · Open weights

Clawdbot

Clawdbot: open-source self-improving personal AI assistant for macOS

Clawdbot, created by Peter Steinberger, is an open-source personal AI assistant that runs locally on your Mac and connects via WhatsApp, Telegram, or Discord. Its killer feature is self-improvement: ask it to learn something and it writes its own skill files, giving a single chat conversation control over multiple agents, persistent memory, voice messages, image generation, and browser automation on your actual computer.

Clawdbot by Peter Steinberger (X post) ↗Clawdbot review on MacStories ↗clawd.bot — Official site ↗

🎙️ Hear our coverage →

#agents #consumer-ai #open-source

Vercel Jan 22, 2026

Dev Tools

skills.sh

Vercel launches skills.sh, an 'npm for AI agents'

Vercel launched skills.sh, a registry where you can browse and install agent skills from the command line for any agent, including Clawdbot. It hit 20K installs within hours, and releases like Browser Use shipping as a skill signal a broader shift from MCP servers toward skills.

🎙️ Hear our coverage →

#agents #coding

Z.AI (Zhipu) Jan 22, 2026

New Models · Open weights

GLM-4.7-Flash

GLM-4.7-Flash: 30B MoE local coding agent with only 3B active params

Z.AI released GLM-4.7-Flash, a 30B parameter MoE model with only 3B active parameters, designed as the ultimate local coding and agent assistant. It hits 59% on SWE-Bench Verified (approaching Sonnet 4's 64%) and runs at 120 tokens/sec on a stock Mac Studio M3 Ultra, fast enough to run Ralph autonomous coding loops even on CPU.

59% SWE-Bench Verified · 120 tps Speed on Mac Studio M3 Ultra

GLM-4.7-Flash announcement (X) ↗GLM-4.7 Technical Blog ↗GLM-4.7-Flash on Hugging Face ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Anthropic Jan 15, 2026

Products & Apps

Claude Cowork

Claude Cowork: Claude Code for non-developers, 100% written by Claude Code

Anthropic launched Claude Cowork, a research preview that brings Claude Code-style agentic workflows to non-technical users. It was built in a week-and-a-half sprint with 100% of the code written by Claude Code itself; it is Mac-only, requires a Max subscription, and includes a Chrome connector for browser automation. Alex demoed it live, adding Flux Klein support to an image extension project without looking at a single line of code.

100% Claude-coded Cowork

🎙️ Hear our coverage →

#agents #consumer-ai

Chorus Jan 15, 2026

Major Features & Updates · Open weights

Chorus Skills Support

Chorus adds agent skills support for every LLM via OpenRouter

Alex used a Ralph loop with Claude Code to add full agent skills support to Chorus, the open-source app that compares answers across multiple LLMs, in about 3.5 hours. The work added a settings panel, filesystem skill discovery, front-matter parsing, and cross-model skill injection, letting the same Claude-style skills run on GPT 5.2 Codex, Gemini, and any OpenRouter model.
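
Filesystem skill discovery and front-matter parsing, two of the pieces listed above, can be sketched as below. This is an illustration and not Chorus's code: it assumes each skill lives in its own folder with a SKILL.md file whose front matter holds simple `key: value` pairs between `---` lines, and it does not handle full YAML.

```python
from pathlib import Path


def parse_front_matter(text: str) -> dict:
    """Parse simple `key: value` front matter delimited by '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as having no front matter


def discover_skills(root: Path) -> list:
    """Find <root>/<skill>/SKILL.md files and return their parsed metadata."""
    skills = []
    for path in sorted(root.glob("*/SKILL.md")):
        meta = parse_front_matter(path.read_text(encoding="utf-8"))
        if "name" in meta:  # skip files without a usable name
            skills.append(meta)
    return skills
```

Once the metadata is a plain dict, injecting the same skill descriptions into prompts for different models is mostly string formatting, which is presumably what makes the cross-model part tractable.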

🎙️ Hear our coverage →

#agents #open-source

Vercel Jan 15, 2026

Dev Tools · Open weights

Next.js/React Skill Packs

Vercel releases official agent skill packs for Next.js and React

Vercel began releasing official agent skill packs for Next.js and React, packaging its framework expertise in the agent skills standard. Ryan Carson highlighted that you can point any skills-compatible coding agent at the pack and it installs the skills for you, an early sign of experts shipping domain knowledge as skills.

🎙️ Hear our coverage →

#agents #coding

Doctronic Jan 8, 2026

Products & Apps

AI Prescription Renewals

Doctronic launches first US pilot for AI prescription renewals

Doctronic launched the first US pilot, in Utah, in which AI autonomously renews prescriptions without a physician in the loop. The service costs $4 per renewal and covers 190 routine medications, excluding controlled substances.

Doctronic - AI Prescription Renewals ↗

🎙️ Hear our coverage →

#research #agents

MiroMind AI Jan 8, 2026

New Models · Open weights

MiroThinker 1.5

MiroThinker 1.5: 30B search agent beats trillion-param models

MiroMind AI released MiroThinker 1.5, a 30B parameter open source search agent that achieves 56.1% on BrowseComp and 66.8% on BrowseComp Chinese, outperforming trillion-parameter models. It introduces 'interactive scaling' as a third scaling dimension beyond parameters and context, and is a fine-tune of Qwen 3 Thinking with 147K open training samples.

MiroThinker 1.5 on X ↗MiroThinker 1.5 on Hugging Face ↗MiroThinker on GitHub ↗

🎙️ Hear our coverage →

#open-source #agents #search

Ryan Carson Jan 8, 2026

Also Released

Ralph Wiggum

Ralph Wiggum autonomous coding technique hits 1.2M views

Ryan Carson published a viral breakdown (1.2M views on X) of Ralph Wiggum, the autonomous coding technique created by Geoffrey Huntley: write a PRD, break it into atomic user stories with acceptance criteria in JSON, then run a bash loop that has a CLI agent pick the next story, code it, commit, and loop. The technique works with any CLI agent (Amp, Claude Code, Cursor CLI, Gemini CLI), compounds learning via agents.md, and won a YC hackathon running overnight on Sonnet 4.5.
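
The control flow of the loop is simple enough to sketch. The original is a bash loop around a CLI agent; the version below is a Python analogue for illustration only, and the JSON field names (`stories`, `acceptance`, `done`) and the `run_agent` callback are hypothetical, not taken from the article.

```python
import json
from typing import Callable, Optional

# Hypothetical PRD format: the field names are illustrative only.
PRD_JSON = """
{"stories": [
  {"id": "US-1", "title": "Add login form", "acceptance": ["form renders"], "done": true},
  {"id": "US-2", "title": "Validate email", "acceptance": ["rejects bad email"], "done": false},
  {"id": "US-3", "title": "Show error toast", "acceptance": ["toast appears"], "done": false}
]}
"""


def next_story(prd: dict) -> Optional[dict]:
    """Return the first story not yet marked done, or None when finished."""
    for story in prd["stories"]:
        if not story["done"]:
            return story
    return None


def ralph_loop(prd: dict, run_agent: Callable[[dict], bool],
               max_iterations: int = 20) -> list:
    """Work through stories one at a time; return the ids completed, in order."""
    completed = []
    for _ in range(max_iterations):
        story = next_story(prd)
        if story is None:
            break
        # run_agent stands in for invoking a CLI agent on one story and committing.
        if run_agent(story):
            story["done"] = True
            completed.append(story["id"])
    return completed


prd = json.loads(PRD_JSON)
print(ralph_loop(prd, run_agent=lambda story: True))
```

Each iteration starts from the PRD file and not from accumulated chat history, which is what lets the loop run for hours without exhausting one context window.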

1.2M Ralph article views

Ryan Carson's Ralph Wiggum Article ↗

🎙️ Hear our coverage →

#coding #agents

Weights & Biases Jan 8, 2026

Dev Tools · Open weights

Catnip

Catnip by W&B: open source iOS app to run Claude Code anywhere

Chris Van Pelt of Weights & Biases released Catnip, an open source iOS app that lets you run Claude Code from anywhere via GitHub Codespaces. It is available on the App Store with source on GitHub.

Catnip by W&B on App Store ↗Catnip on GitHub ↗

🎙️ Hear our coverage →

#coding #agents #consumer-ai

December 2025

Anthropic Dec 25, 2025

Dev Tools

Claude Code

Claude Code launches, starting the CLI agent revolution

Claude Code launched in February, having started as an internal Anthropic engineering tool. Multiple co-hosts picked it as the single most impactful AI release of 2025 — it began the CLI agent era and proved, in Kwindla's words, that 'sometimes it's mostly about the harness.'

Feb 28 Episode ↗

🎙️ Hear our coverage →

#coding #agents

Anthropic Dec 25, 2025

Major Features & Updates

Claude Skills

Claude Skills launches — 'MCP-level if not bigger'

Anthropic launched Claude Skills in October. It was largely missed at release but picked up steam fast, with the show arguing Skills is 'MCP level if not bigger' for Claude users as a way to package reusable agent capabilities.

🎙️ Hear our coverage →

#agents #coding

Cursor Dec 25, 2025

Products & Apps

Cursor 2 + Composer

Cursor 2 and the Composer model level up IDE agents

Cursor shipped Cursor 2 along with its Composer model in October, leveling up in-IDE agentic coding. It capped a year in which Cursor's sales exploded on the back of Claude 3.7 and the vibe coding wave.

🎙️ Hear our coverage →

#coding #agents

Daily (Pipecat) Dec 25, 2025

New Models · Open weights

Smart Turn Detection

Daily ships smart turn detection for voice agents

Kwindla's Daily.co shipped smart turn detection during Q2, an open model that helps voice agents know when a speaker has actually finished talking. It landed in the quarter when voice agents first got attention outside the builder bubble.

🎙️ Hear our coverage →

#voice-ai #agents

Moonshot AI (Kimi) Dec 25, 2025

New Models · Open weights

Kimi K2

Kimi K2: the Chinese open model that earned mainstream respect

Moonshot AI's Kimi K2 dropped in July and earned serious mainstream recognition, marking peak Chinese-lab dominance of open source. It was named in the show's TL;DR as one of the defining open-weights releases of 2025.

🎙️ Hear our coverage →

#open-source #agents

OpenAI Dec 25, 2025

Products & Apps

Deep Research

OpenAI Deep Research scores 26.6% on Humanity's Last Exam

OpenAI's Deep Research launched in February as an agentic research tool that scored 26.6% on Humanity's Last Exam, versus roughly 10% for o1 and R1. The crew called it a jaw-dropping leap in AI research capability and one of February's defining releases.

26.6% HLE (Humanity's Last Exam)

Feb 07 Episode ↗

🎙️ Hear our coverage →

#agents #research

OpenAI Dec 25, 2025

New Models

GPT-5 Codex

GPT-5 Codex: OpenAI's specialized coding model moves the stock

GPT-5 Codex dropped in September as OpenAI's coding-specialized fine-tune of GPT-5. Yam dubbed it the 'infinite money glitch' because the release moved OpenAI-linked stock prices significantly.

🎙️ Hear our coverage →

#coding #agents

OpenAI Dec 25, 2025

Products & Apps

Operator

OpenAI Operator: first agentic ChatGPT with browser control

OpenAI launched Operator in January as the first agentic version of ChatGPT that could control a browser to complete tasks on the user's behalf. It kicked off the year-of-agents narrative, though it launched within 24 hours of DeepSeek R1 and was completely overshadowed by it.

Jan 24 Episode ↗

🎙️ Hear our coverage →

Google DeepMind Dec 18, 2025

New Models · Open weights

FunctionGemma

FunctionGemma: Google's 270M function-calling model for edge agents

Google released FunctionGemma, a tiny 270M-parameter open model specialized for function calling on-device. With a roughly 500MB RAM footprint and strong gains after fine-tuning for mobile actions, it points toward privacy-first local agents on constrained hardware.
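
On the application side, function calling comes down to mapping the model's structured output onto local handlers. The sketch below assumes the model's call has already been decoded into a JSON object with `name` and `arguments` keys; that shape is an assumption for illustration and not FunctionGemma's actual output format, and the two handlers are made up.

```python
import json
from typing import Callable, Dict


def set_alarm(hour: int, minute: int) -> str:
    """Hypothetical on-device action."""
    return f"alarm set for {hour:02d}:{minute:02d}"


def toggle_flashlight(on: bool) -> str:
    """Hypothetical on-device action."""
    return "flashlight on" if on else "flashlight off"


# Registry of functions the model is allowed to call.
TOOLS: Dict[str, Callable[..., str]] = {
    "set_alarm": set_alarm,
    "toggle_flashlight": toggle_flashlight,
}


def dispatch(model_output: str) -> str:
    """Route one decoded function call to its local handler."""
    call = json.loads(model_output)
    handler = TOOLS.get(call.get("name"))
    if handler is None:
        return f"unknown function: {call.get('name')}"
    return handler(**call.get("arguments", {}))


print(dispatch('{"name": "set_alarm", "arguments": {"hour": 7, "minute": 5}}'))
```

Keeping an explicit registry means the model can only trigger actions the app has opted into, which matters for the privacy-first framing.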

FunctionGemma docs ↗FunctionGemma blog ↗FunctionGemma announcement on X ↗

🎙️ Hear our coverage →

#on-device #agents #open-source

Google DeepMind Dec 18, 2025

New Models

Gemini 3 Flash

Gemini 3 Flash delivers frontier intelligence at $0.50/1M input tokens

Google launched Gemini 3 Flash, offering frontier-tier capability at flash-tier pricing of $0.50 per million input tokens. It scores 78% on SWE-bench Verified, beating larger models on some agentic tasks, and supports tool-calling at scale with up to 100 simultaneous function calls.
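
When a model returns many function calls in a single turn, the client usually wants to execute them concurrently. A minimal sketch with Python's asyncio follows; `call_tool` is a placeholder for real tool execution, and the limit of 100 simply mirrors the figure quoted above. None of this is Gemini API code.

```python
import asyncio


async def call_tool(name: str, delay: float = 0.01) -> str:
    """Placeholder for executing one function call (e.g. an HTTP request)."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def run_parallel(names: list, limit: int = 100) -> list:
    """Run all calls concurrently, never more than `limit` at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(name: str) -> str:
        async with semaphore:
            return await call_tool(name)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(n) for n in names))


results = asyncio.run(run_parallel([f"tool_{i}" for i in range(100)]))
print(len(results), results[0])
```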

$0.50 per 1M Gemini 3 Flash input tokens · 78% SWE-bench Verified

Gemini 3 Flash announcement ↗Logan Kilpatrick announcement on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #agents #coding

OpenAI Dec 18, 2025

Products & Apps

ChatGPT App Store

ChatGPT App Store opens submissions via MCP app model

OpenAI opened app submissions for the ChatGPT App Store, built on the MCP-powered apps model. Developers can now submit apps that run inside ChatGPT, signaling OpenAI's platform play for distribution of agentic apps.

ChatGPT Apps submission ↗

🎙️ Hear our coverage →

OpenAI Dec 18, 2025

New Models

GPT 5.2 Codex

GPT 5.2 Codex drops live during the show with 400K context

OpenAI released GPT 5.2 Codex via API after months of exclusivity in the Codex app, making it available in Cursor, GitHub Copilot, and VS Code with native context compaction for long sessions. Cursor showcased it by building a complete browser from scratch in Rust, roughly 3 million lines of code across about 330,000 commits, driven by hundreds of concurrent agents.

56.4% SWE-Bench Pro · 64% Terminal-Bench 2.0

OpenAI GPT 5.2 Codex ↗GPT 5.2 Codex announcement on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #frontier-models

xAI Dec 18, 2025

APIs & Platforms

Grok Voice Agent API

xAI Grok Voice Agent API ships at $0.05/min flat rate, powers Tesla

xAI launched the Grok Voice Agent API with flat-rate pricing of $0.05 per minute and integration into Tesla vehicles. xAI claims the #1 spot on Big Bench Audio at 92.3%, tightening competition in the rapidly commoditizing real-time voice stack.

$0.05/min Grok Voice Agent API

xAI Grok Voice Agent API ↗

🎙️ Hear our coverage →

#voice-ai #agents #api

November 2025

Anthropic Nov 27, 2025

New Models

Claude Opus 4.5

Anthropic launches Claude Opus 4.5, reclaiming the coding crown

Anthropic released Claude Opus 4.5, scoring 80.9% on SWE-bench Verified to top GPT-5.1 (77.9%) and Gemini 3 Pro (76.2%). It adds a new 'Effort' parameter for compute control, Tool Search to cut agent token overhead, and Programmatic Tool Calling where the model writes and executes code loops. Pricing dropped to $5/M input and $25/M output, roughly one-third the old Opus price.

80.9% SWE-bench Verified · $5/M Input token price · $25/M Output token price

Claude Opus 4.5 Announcement ↗Claude Opus 4.5 Tool Use Blog ↗Claude Opus 4.5 on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #reasoning

Microsoft Nov 27, 2025

New Models · Open weights

Fara-7B

Microsoft ships Fara-7B, a 7B on-device computer use agent

Microsoft Research released Fara-7B, a best-in-class 7B-parameter vision-language model for computer use that runs on-device. It scores 73.5% on WebVoyager, beating OpenAI's computer-use preview while being small enough to run locally.

73.5% WebVoyager

Fara-7B on HuggingFace ↗Fara-7B Blog ↗Fara-7B Announcement on X ↗Fara on GitHub ↗

🎙️ Hear our coverage →

#open-source #agents #on-device

Model Context Protocol (Anthropic + OpenAI) Nov 27, 2025

Also Released · Open weights

MCP Apps

MCP-UI becomes MCP Apps, an official standard from Anthropic + OpenAI

MCP-UI, created by Ido Salomon and Liad Yosef, was standardized as 'MCP Apps' — an official MCP extension jointly adopted by Anthropic and OpenAI that unifies MCP-UI with what OpenAI called Operator Plugins. Agents can now render full interactive HTML UIs directly inside chat, avoiding iOS-vs-Android style fragmentation with one open standard.

MCP Apps Blog Post ↗MCP-UI / MCP Apps Website ↗MCP Apps Announcement on X ↗

🎙️ Hear our coverage →

Google DeepMind Nov 20, 2025

Dev Tools

Antigravity

Antigravity: Google's free agent-first IDE powered by Gemini 3 Pro

Antigravity is a free VS Code fork reimagined for agent-first coding, with an inbox-style Agent Manager for running multiple coding agents in parallel across a codebase. Browser integration lets agents control Chrome, take screenshots and videos of the running app, and self-debug. The free tier is powered by Gemini 3 Pro, with GPT-OSS 120B as the open-source alternative and Nano Banana for images.

Antigravity IDE ↗

🎙️ Hear our coverage →

#coding #agents

OpenAI Nov 20, 2025

New Models

GPT-5.1-Codex-Max

GPT-5.1-Codex-Max runs 24-hour coding tasks with native compaction

OpenAI's newest frontier agentic coding model is trained with native compaction, letting it intelligently summarize prior context and work on a single task for 24+ hours (an internal run reportedly lasted a full week). It uses 30% fewer thinking tokens at median than its predecessors and sets a new SOTA of 58% on Terminal-Bench 2.0, also leading on SWE-Bench and SWE-Lancer. Windows PowerShell support is significantly improved, alongside an experimental Windows sandbox and a new extra-high reasoning level.
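
Compaction in general means replacing older context with a summary once a budget is exceeded. The sketch below shows that idea at the harness level only; it is not OpenAI's trained-in mechanism, the character-based budget is a simplification, and the `summarize` callback is a hypothetical stand-in for a model call.

```python
from typing import Callable, List


def compact(history: List[str], budget: int,
            summarize: Callable[[List[str]], str],
            keep_recent: int = 2) -> List[str]:
    """Replace older messages with one summary when history exceeds the budget.

    budget is measured in characters here for simplicity; a real harness
    would count tokens.
    """
    total = sum(len(message) for message in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent


history = ["a" * 100, "b" * 100, "latest plan", "latest diff"]
compacted = compact(history, budget=50,
                    summarize=lambda msgs: f"summary of {len(msgs)} messages")
print(compacted)
```

Keeping the most recent messages verbatim is a common choice because they hold the state the agent is actively working on.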

58% Terminal-Bench 2.0 (new SOTA) · 24h+ Single-task agent run time via native compaction · 30% Fewer thinking tokens at median

🎙️ Hear our coverage →

#coding #agents

xAI Nov 20, 2025

APIs & Platforms

Grok 4.1 Fast + Agent Tools API

Grok 4.1 Fast: 2M context and Agent Tools API at 10x lower cost

Launched as breaking news during the show, Grok 4.1 Fast pairs a 2 million token context window with a new Agent Tools API offering native X search, Reddit search, web browsing, and code execution. Benchmarks are striking: 93–100% on τ²-Bench Telecom and 72% on Berkeley Function Calling v4 (top of the leaderboard) at $0.20/$0.50 per million tokens, roughly 10x cheaper than competitors and free for the first two weeks on the xAI API and OpenRouter.

93–100% τ²-Bench Telecom · 72% Berkeley Function Calling v4 · 2M Token context window

🎙️ Hear our coverage →

H Company Nov 13, 2025

New Models · Open weights

Holo2

H Company open-sources Holo2 multimodal computer-use agent family

Dropped live during the show: H Company open-sourced Holo2, a next-generation multimodal agent family fine-tuned on Qwen3-VL for grounding, navigation, and reasoning across web, desktop, and mobile. It posts SOTA results on computer-use and web-navigation benchmarks like OSWorld-G and ships in 4B, 8B, and 30B variants under Apache 2.0.

🎙️ Hear our coverage →

#agents #open-source

Laude Institute / Stanford Nov 13, 2025

Benchmarks & Evals · Open weights

Terminal-Bench 2.0

Terminal-Bench 2.0 and Harbor launch as new bar for coding agents

Terminal-Bench 2.0 launched alongside the Harbor framework, with 89 hard, realistic terminal-based tasks built with around 1,000 Discord contributors. The Warp agent tops the leaderboard at 50% with Codex CLI close behind, and the panel argued that a top score of only 50% makes it far more meaningful than near-saturated benchmarks like MMLU.

50% Terminal Bench v2 Top Score

Announcement on X ↗Harbor framework ↗Running Terminal-Bench docs ↗Terminal-Bench leaderboard ↗

🎙️ Hear our coverage →

#benchmarks #agents #coding

LMArena (LMSYS) Nov 13, 2025

Benchmarks & Evals

Code Arena

LMArena launches Code Arena for live agentic coding evaluations

LMArena launched Code Arena, a live evaluation platform where models build real applications agentically and humans vote on the results. It extends the arena-style crowdsourced ranking approach to agentic coding workflows.

Arena announcement on X ↗Code Arena blog post ↗Code Arena ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

Anthropic Nov 6, 2025

Also Released

Code execution with MCP

Anthropic publishes code-execution-with-MCP pattern for token-efficient agents

Anthropic published an engineering post showing how running MCP-connected tools as code, instead of direct tool calls, slashes token use and scales agents to many more tools. The approach echoes Cloudflare's Code Mode and framed the episode's interview with Kenton Varda about agents writing code against tool APIs.
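
The token saving comes from where intermediate data ends up. In the toy comparison below, `fetch_rows` is a made-up stand-in for an MCP-connected tool and output length stands in for tokens: a direct tool call places the whole payload in the model's context, while agent-written code filters it in the sandbox and returns only the result. This is a sketch of the pattern, not code from Anthropic's post.

```python
import json


def fetch_rows() -> list:
    """Stand-in for an MCP-connected tool that returns a large result set."""
    return [{"id": i, "status": "open" if i % 50 == 0 else "closed"}
            for i in range(1000)]


def direct_tool_call() -> str:
    """Direct tool call: the entire payload goes back into the model's context."""
    return json.dumps(fetch_rows())


def code_execution() -> str:
    """Code execution: filter inside the sandbox, return only what is needed."""
    open_ids = [row["id"] for row in fetch_rows() if row["status"] == "open"]
    return json.dumps({"open_ids": open_ids})


print(len(direct_tool_call()), "characters via direct tool call")
print(len(code_execution()), "characters via code execution")
```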

🎙️ Hear our coverage →

#agents #coding

Amazon Web Services Nov 6, 2025

Also Released

AWS-OpenAI infrastructure partnership

AWS announces multi-year strategic infrastructure partnership with OpenAI

AWS announced a multi-year strategic infrastructure partnership with OpenAI to power ChatGPT inference, training, and agentic AI workloads. It is another sign of OpenAI spreading its compute needs across every major cloud provider, and a notable win for AWS in the frontier-AI infrastructure race.

🎙️ Hear our coverage →

#infrastructure #agents

Moonshot AI Nov 6, 2025

New Models · Open weights

Kimi K2 Thinking

Moonshot AI releases Kimi K2 Thinking, an open 1T-param reasoning MoE

Moonshot AI released Kimi K2 Thinking, an open-source 1-trillion-parameter mixture-of-experts reasoning agent with 256K context and large-scale tool-calling capacity. The panel treated it as the open-source centerpiece of the week, focusing on its reasoning quality and coding utility rather than just benchmark screenshots, and as a sign open models keep closing the usability gap with frontier closed models.

X ↗HF ↗Tech Blog ↗Arxiv ↗

🎙️ Hear our coverage →

#open-source #reasoning #agents

October 2025

Cognition Oct 30, 2025

New Models

SWE-1.5

Cognition SWE-1.5: 950 tok/s coding model hitting 40% on SWE-bench Pro

Cognition released SWE-1.5, a fast agentic coding model that serves around 950 tokens per second and scores about 40% on SWE-bench Pro. It ships inside Windsurf and reinforces the week's theme of speed-focused coding models from agent labs.

950 tokens per second · 40% SWE-bench Pro

Blog: SWE-1.5 ↗X announcement ↗Windsurf download ↗

🎙️ Hear our coverage →

#coding #agents

Cursor Oct 30, 2025

Products & Apps

Cursor 2.0 & Composer

Cursor 2.0 ships with Composer, its own 4x-faster coding model

Cursor released version 2.0 of its AI code editor alongside Composer, a new in-house coding model claimed to be about 4x faster. The launch came up as evidence that developer products are being rebuilt agent-first, with speed and orchestration as the new battleground.

4x faster coding claimed

Cursor on X ↗Blog: Cursor 2.0 ↗Speculation on Composer's base model ↗

🎙️ Hear our coverage →

#coding #agents

Google (Labs) Oct 30, 2025

Products & Apps

Pomelli

Google Labs launches Pomelli, an AI marketing agent

Google Labs released Pomelli, an experimental AI marketing agent that generates on-brand campaigns and marketing assets for businesses. It was covered in the tools section as another sign of agents moving into specific professional workflows.

TestingCatalog on X ↗Google Labs: Pomelli ↗

🎙️ Hear our coverage →

#agents #industry #consumer-ai

MiniMax Oct 30, 2025

New Models · Open weights

MiniMax M2

MiniMax M2: open-source agentic model at 8% of Claude's price, 2x speed

MiniMax released M2, an open-source agentic model positioned at roughly 8% of Claude's price while running about twice as fast. Head of Engineering Skyler Miao joined the show for a deep dive, framing M2 as both a model story and a speed story, and the panel read it as part of a broader open-model pressure wave on frontier labs.

8% of Claude's price · 2x speed vs comparable frontier models

X announcement ↗Hugging Face ↗

🎙️ Hear our coverage →

#open-source #agents #coding

Perplexity Oct 30, 2025

Products & Apps

Email Assistant

Perplexity launches privacy-first Email Assistant for inbox management

Perplexity launched an Email Assistant that manages your inbox with a privacy-first pitch, drafting replies and triaging mail. It extends Perplexity's push from search into day-to-day agentic productivity surfaces.

X announcement ↗Assistant site ↗

🎙️ Hear our coverage →

#agents #consumer-ai

Pokee AI Oct 30, 2025

Dev Tools

Pokee

Pokee: an agentic workflow builder

Pokee AI launched Pokee, an agentic workflow builder for chaining AI actions into automated workflows. It was covered in the tools rundown as part of the expanding agent-first builder stack.

X announcement ↗

🎙️ Hear our coverage →

#agents #coding

Anthropic Oct 23, 2025

Products & Apps

Claude Code on the Web

Claude Code comes to the web with sandboxed cloud coding

Anthropic brought Claude Code to the web, letting developers delegate software tasks through a browser with GitHub integration, secure sandboxed execution, multi-repo support, and automatic pull requests, making it usable even from a phone. The Claude desktop app was also upgraded with screen context via screenshots, file sharing, and a new voice mode.

X ↗Anthropic ↗

🎙️ Hear our coverage →

#coding #agents

Browserbase Oct 23, 2025

Products & Apps

Director 2.0

Browserbase launches Director 2.0 with 1Password delegated auth

Browserbase launched Director 2.0, a prompt-powered web automation platform that performs a task from natural language and hands back a repeatable, deployable script. Its standout innovation is delegated, per-site authentication via a 1Password integration: cloud agents request login approval on your local machine site-by-site instead of getting master-key access to all sessions, a much safer model than Atlas-style all-or-nothing access.

X ↗Director.ai ↗Stagehand ↗

🎙️ Hear our coverage →

Microsoft Oct 23, 2025

Major Features & Updates

Edge Copilot Mode (agentic)

Microsoft adds agentic powers and voice to Copilot Mode in Edge

Microsoft answered Atlas with agentic enhancements to Copilot Mode in Edge, including a voice mode that can see and discuss the current page, plus broader Copilot updates (and Clippy back as an easter egg via the Mico avatar). In Alex's hands-on testing the agentic features did not actually work, so real-world parity with Atlas and Comet is unproven.

X ↗X (Edge) ↗Clippy easter egg ↗

🎙️ Hear our coverage →

#agents #consumer-ai #voice-ai

OpenAI Oct 23, 2025

Products & Apps

ChatGPT Atlas

OpenAI launches ChatGPT Atlas, its agentic AI browser

OpenAI shipped Atlas, a Chromium-based browser deeply integrated with ChatGPT: natural-language history search, a 'Cursor' inline text-rewrite tool, browsing-pattern memories, and an Ask ChatGPT sidepane. Its agent mode runs with your logged-in sessions and cookies, enabling long multi-step tasks (Alex had it complete a 5-hour compliance training) but raising prompt-injection security concerns that OpenAI's CISO addressed publicly. macOS only at launch, for Pro, Plus, and Go tiers.

X ↗Download ↗Security note from CISO ↗Simon Willison's breakdown ↗

🎙️ Hear our coverage →

#agents #consumer-ai

Pokee AI Oct 23, 2025

New ModelsOpen weights

PokeeResearch-7B

PokeeResearch-7B: open-source SOTA deep research agent model

Pokee AI released PokeeResearch-7B, an open-source 7B deep research agent model claiming state-of-the-art results for its size. Weights, code, a paper, and a hosted deep-research preview all shipped together.

X ↗HF ↗ArXiv ↗GitHub ↗

🎙️ Hear our coverage →

#open-source #agents #search

Amp Oct 16, 2025

Major Features & Updates

Amp Free

Amp launches a free tier powered by ads and surplus model capacity

Amp (from the Sourcegraph team) launched a free tier for its coding agent, funded by ads and surplus model capacity. CEO Quinn Slack joined the show to explain the economics and the product thinking behind ad-supported AI dev tooling.

Amp Free ↗Quinn Slack on X ↗

🎙️ Hear our coverage →

#coding #agents

Anthropic Oct 16, 2025

Major Features & Updates

Claude Skills

Claude Skills: custom instructions for AI agents now live

Anthropic launched Claude Skills, folders of instructions and resources that Claude loads on demand to specialize agents for specific tasks. The panel treated it as a major piece of the emerging builder stack, with Simon Willison arguing Skills could be a bigger deal than MCP.

X announcement ↗Anthropic News ↗YouTube Demo ↗Simon Willison: a bigger deal than MCPs ↗

🎙️ Hear our coverage →

#agents #coding

Cognition Oct 16, 2025

New Models

SWE-grep

Cognition SWE-grep: RL-trained fast context retrieval for coding agents

Cognition released SWE-grep, an RL-trained multi-turn context retriever that finds relevant code for agentic coding tasks far faster than full agent loops. It powers fast context retrieval in Cognition's products, and a public playground lets developers try it on real repos.

Blog ↗X announcement ↗Playground ↗

🎙️ Hear our coverage →

#coding #agents #training

Microsoft Oct 16, 2025

Major Features & Updates

Windows 11 Copilot Voice

Microsoft makes every Windows 11 PC an AI PC with Copilot voice input

Microsoft announced that every Windows 11 machine becomes an 'AI PC,' adding 'Hey Copilot' voice input and deeper agentic Copilot integration at the OS level. The panel discussed it as a sign of AI assistants moving into the default computing experience.

Zac Bowden on X ↗Windows Blog ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai #agents

OpenAI Oct 16, 2025

Major Features & Updates

ChatGPT Memory

OpenAI ships smarter ChatGPT memory management, no more 'memory full'

OpenAI updated ChatGPT's memory system so it automatically manages and prioritizes saved memories, eliminating the 'memory full' dead end. The change makes long-running personalized use of ChatGPT smoother without manual memory pruning.

X announcement ↗Memory FAQ ↗

🎙️ Hear our coverage →

#consumer-ai #agents

September 2025

DeepSeek Sep 25, 2025

New ModelsOpen weights

DeepSeek V3.1 Terminus

DeepSeek V3.1 Terminus refines agents and bilingual output

DeepSeek released V3.1 Terminus, an update to V3.1 with cleaner bilingual output, stronger agentic tool use, and cheaper long-context handling. The open weights are available on Hugging Face, continuing DeepSeek's cadence of iterative open releases.

🎙️ Hear our coverage →

#open-source #agents #reasoning

Meta AI Sep 25, 2025

New ModelsOpen weights

Code World Model (CWM)

Meta releases 32B Code World Model for agentic code reasoning

Meta released CWM, a 32B open-weights research model trained to internally model code execution, aimed at agentic code reasoning rather than plain code completion. The weights are on Hugging Face under facebook/cwm, giving the open-source community a new approach to code world modeling.

🎙️ Hear our coverage →

#open-source #coding #agents

Meta AI & Hugging Face Sep 25, 2025

Benchmarks & EvalsOpen weights

Gaia2 + ARE

Gaia2 agent benchmark and Agents Research Environments released

Meta and Hugging Face released Gaia2, a follow-up agent benchmark, together with ARE (Agents Research Environments) for testing agents in dynamic, asynchronous settings. It fed the episode's recurring concern that evaluation has to keep up whenever agent product claims get ambitious.

🎙️ Hear our coverage →

#benchmarks #agents

OpenAI Sep 25, 2025

Major Features & Updates

ChatGPT Pulse

OpenAI previews ChatGPT Pulse proactive daily briefings

OpenAI introduced ChatGPT Pulse, a preview feature that proactively researches overnight and delivers personalized daily briefing cards based on your chats, memory, and connected apps, initially for Pro users on mobile. On the show it was discussed as part of OpenAI's push to build a durable product moat as raw model access commoditizes.

OpenAI Blog ↗X ↗

🎙️ Hear our coverage →

#agents #consumer-ai

OpenAI Sep 25, 2025

Benchmarks & Evals

GDPval

OpenAI launches GDPval to measure models on real economic work

OpenAI introduced GDPval, an evaluation that measures model performance on real-world, economically valuable tasks drawn from a range of occupations and GDP sectors. On the show it anchored the discussion about agents moving from chat quality toward action and reliability in real environments.

🎙️ Hear our coverage →

#benchmarks #agents

Scale AI Sep 25, 2025

Benchmarks & EvalsOpen weights

SWE-bench Pro

Scale AI debuts SWE-bench Pro, a harder contamination-resistant eval

Scale AI released SWE-bench Pro, a tougher, contamination-resistant successor to SWE-bench for evaluating coding agents on realistic software engineering tasks. It ships with a public dataset on Hugging Face plus separate public and commercial leaderboards, and frontier models score far lower than on the original SWE-bench.

HF Dataset ↗Public Leaderboard ↗Commercial Leaderboard ↗

🎙️ Hear our coverage →

#benchmarks #coding #agents

Alibaba (Tongyi Lab) Sep 18, 2025

New ModelsOpen weights

Tongyi DeepResearch 30B-A3B

Tongyi DeepResearch: open-source A3B web agent rivals OpenAI Deep Research

Alibaba's Tongyi Lab open-sourced Tongyi DeepResearch, a 30B mixture-of-experts web research agent with only 3B active parameters. The lab claims parity with OpenAI's Deep Research on agentic search and report-writing tasks, and the weights are available on Hugging Face.

🎙️ Hear our coverage →

#open-source #agents #search

Google Sep 18, 2025

Major Features & Updates

Gemini in Chrome

Google puts Gemini in Chrome with cross-tab AI assistance

Google shipped Gemini directly into Chrome, adding an AI assistant that works across tabs, a smarter omnibox, and safer-browsing features. It moves the browser itself into the AI interface race, putting an assistant in front of Chrome's massive user base.

Blog ↗Blog (AI features) ↗X ↗

🎙️ Hear our coverage →

OpenAI Sep 18, 2025

New Models

GPT-5-Codex

OpenAI ships GPT-5-Codex, an agentic coding upgrade for Codex

OpenAI released GPT-5-Codex, a version of GPT-5 finetuned for agentic coding inside the Codex product family. It anchored the episode's coding discussion, with the panel focusing on how coding models are becoming trustworthy enough for longer, productized agent workflows rather than just one-shot completions.

X ↗OpenAI Blog ↗

🎙️ Hear our coverage →

#coding #agents

Alibaba (Tongyi Lab) Sep 4, 2025

New ModelsOpen weights

WebWatcher-32B

Alibaba's Tongyi Lab open-sources WebWatcher vision-language research agent

Alibaba's Tongyi Lab open-sourced WebWatcher, a vision-language deep research agent that sets new state-of-the-art results on agentic browsing and research tasks. The 32B model combines visual understanding with web research capabilities and is available on Hugging Face.

🎙️ Hear our coverage →

#open-source #agents #search

Mistral AI Sep 4, 2025

Major Features & Updates

Le Chat Connectors & Memories

Mistral adds 20+ MCP-powered connectors and Memories to Le Chat

Mistral upgraded Le Chat with more than 20 MCP-powered connectors and controllable Memories targeted at enterprise workflows. The update positions Le Chat as a serious enterprise assistant by wiring it into existing tools via the Model Context Protocol while giving users explicit control over what the assistant remembers.

🎙️ Hear our coverage →

#agents #consumer-ai #industry

Nous Research Sep 4, 2025

New ModelsOpen weights

Hermes 4 14B

Nous Research releases Hermes 4 14B compact hybrid reasoning model

Nous Research launched Hermes 4 at 14B, a compact hybrid reasoning model with tool calling designed for both local and cloud use. It extends the Hermes 4 family down to a size practical for local deployment while keeping reasoning and tool-use capabilities, with a full tech report published on arXiv.

X ↗HF ↗Tech Report ↗

🎙️ Hear our coverage →

#open-source #reasoning #agents

Nous Research Sep 4, 2025

Benchmarks & EvalsOpen weights

Husky Hold'em Bench

Nous launches Husky Hold'em Bench, an open-source pokerbot eval for LLMs

Nous Research released Husky Hold'em Bench, an open-source poker benchmark that evaluates LLM strategic play in a richer agentic environment than standard leaderboards. Guests Roger Jin and Bhavesh Kumar joined the show to explain how it measures agent behavior and decision-making under uncertainty rather than chasing another leaderboard point.

🎙️ Hear our coverage →

#benchmarks #agents

OpenAI Sep 4, 2025

New Models

gpt-realtime

OpenAI ships gpt-realtime and takes the Realtime API to GA

OpenAI shipped the gpt-realtime speech-to-speech model and moved the Realtime API to general availability. The GA release adds remote MCP tool support, image input, and SIP phone calling, making it a full production stack for voice agents and tying into the episode's voice-agents discussion with Kwindla Kramer.

🎙️ Hear our coverage →

#voice-ai #api #agents

July 2025

Agentica Jul 3, 2025

New ModelsOpen weights

DeepSWE-Preview

DeepSWE-Preview hits 59% SWE-Bench Verified with pure RL on Qwen3-32B

Agentica and collaborators (with guest Michael Luo of UC Berkeley) released DeepSWE-Preview, a fully open-sourced RL-trained coding agent built on Qwen3-32B that reached 59% on SWE-Bench Verified, a top open result in a benchmark dominated by closed systems. The team published training methodology and weights, emphasizing reproducible reward design and verification over sealed benchmark numbers.

59% SWE-Bench Verified

Training write-up (Notion) ↗Hugging Face model ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Cursor (Anysphere) Jul 3, 2025

Major Features & Updates

Cursor Agents on Web, Mobile & Slack

Cursor rolls out coding agents on web, mobile, and Slack

Cursor launched its AI coding agents on web and mobile with Slack integration, extending code agents beyond the editor window into ambient, always-on workflow software. The launch landed the same week Cursor poached key creators of Claude Code, making it product-strategy news as much as HR news.

Cursor Agents ↗Hugging Face space ↗

🎙️ Hear our coverage →

#coding #agents

Microsoft Jul 3, 2025

Papers & Research

MAI-DxO

Microsoft's MAI-DxO hits 85.5% on NEJM diagnostic cases vs 20% for doctors

Microsoft AI published MAI-DxO, a medical diagnostic orchestration system that reached 85.5% accuracy on challenging NEJM-style cases compared to roughly 20% for practicing physicians. The result is framed as a systems win rather than a single-model win, suggesting orchestration may outperform individual models in high-stakes expert workflows.

85.5% MAI-DxO accuracy

Mustafa Suleyman on X ↗Microsoft AI blog ↗

🎙️ Hear our coverage →

#research #reasoning #agents

May 2025

Mistral AI May 29, 2025

APIs & Platforms

Mistral Agents API

Mistral launches Agents API for building tool-using agents

Mistral released an Agents API, a framework for building custom tool-using agents on top of Mistral models. It joins the wave of big-lab agent frameworks, letting developers wire up tools and orchestrate agentic workflows through Mistral's platform.

Blog ↗Tweet ↗

🎙️ Hear our coverage →

Opera May 29, 2025

Products & Apps

Opera Neon

Opera unveils Neon, an agent-centric AI browser

Opera announced Neon, an agent-centric AI browser built for autonomous web tasks. Instead of just assisting with browsing, it is designed to act on the web for you, joining the emerging category of agentic browsers.

Site ↗Tweet ↗

🎙️ Hear our coverage →

#agents #consumer-ai

Anthropic May 15, 2025

APIs & Platforms

Web Search API

Anthropic launches Web Search API for real-time retrieval in Claude

Anthropic released a Web Search API that gives Claude models real-time web retrieval, letting developers ground responses in current information directly through the API. It was covered among the week's big-company API updates.

🎙️ Hear our coverage →

#api #search #agents

Google DeepMind May 15, 2025

Products & Apps

AlphaEvolve

AlphaEvolve: Gemini-powered coding agent for discovering new algorithms

Google DeepMind announced AlphaEvolve, a Gemini-powered coding agent that designs and evolves advanced algorithms, credited on the show as one of the week's mind-bending algorithmic-discovery stories. DeepMind opened an interest form for early access rather than shipping it broadly.

🎙️ Hear our coverage →

#agents #coding #research

Anthropic May 1, 2025

Major Features & Updates

Claude Integrations (MCP)

Claude.ai gets Integrations: remote MCP tool support for apps

Breaking during the show: Anthropic announced Integrations, letting Claude connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare and PayPal via MCP. Developers can build their own integrations quickly, bringing tool use to Claude.ai itself rather than just the API.

Anthropic announcement (X) ↗

🎙️ Hear our coverage →

OpenPipe May 1, 2025

New ModelsOpen weights

ART·E

OpenPipe's ART·E: RL-trained open email agent that beats o3

OpenPipe released ART·E, an Apache 2.0 email research agent built on a 14B Qwen 2.5 backbone, trained on 500K Enron emails plus synthetic Q&A and refined with reinforcement learning. It tops o3 on accuracy (96% vs 90%) while running 5x faster (1.1s median) and 64x cheaper ($0.85 per 1,000 queries), using a simple three-tool loop.
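
As a back-of-envelope check, the ratios quoted above can be turned into implied o3 baseline figures. This is only a sketch: the o3 latency and cost below are inferred from the rounded "5x" and "64x" multipliers, not numbers reported in the entry, and all variable names are illustrative.

```python
# Figures quoted in the entry above.
art_e_accuracy_pct = 96
o3_accuracy_pct = 90
art_e_latency_s = 1.1        # median latency
speedup_vs_o3 = 5            # "5x faster"
art_e_cost_per_1k = 0.85     # USD per 1,000 queries
cost_ratio_vs_o3 = 64        # "64x cheaper"

# Implied o3 baselines (assumption: the multipliers are exact, which they
# almost certainly are not, since they are rounded marketing figures).
implied_o3_latency_s = round(art_e_latency_s * speedup_vs_o3, 2)
implied_o3_cost_per_1k = round(art_e_cost_per_1k * cost_ratio_vs_o3, 2)

# Accuracy gap expressed as a relative reduction in error rate.
art_e_error_pct = 100 - art_e_accuracy_pct
o3_error_pct = 100 - o3_accuracy_pct
relative_error_reduction = (o3_error_pct - art_e_error_pct) / o3_error_pct

print(f"implied o3 median latency: ~{implied_o3_latency_s}s")
print(f"implied o3 cost per 1,000 queries: ~${implied_o3_cost_per_1k}")
print(f"relative error reduction: {relative_error_reduction:.0%}")
```

Read this way, a six-point accuracy gap corresponds to well over half of o3's errors being eliminated, which is a more telling comparison than the raw percentages.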

Launch thread (X) ↗Blog post ↗GitHub: OpenPipe/ART ↗

🎙️ Hear our coverage →

#agents #training #open-source

April 2025

Daily (Pipecat) Apr 24, 2025

New ModelsOpen weights

Smart-Turn VAD

Pipecat releases Smart-Turn, an open source semantic VAD model

The Pipecat team (from Daily) released Smart-Turn, an open source semantic voice activity detection model that understands when a speaker has actually finished their turn rather than just detecting silence. Kwindla Kramer joined the show to break down how semantic VAD makes voice agent conversations feel far more natural, with a community training effort at turn-training.pipecat.ai.

GitHub ↗HF Model ↗Fal.ai Playground ↗Try It Demo ↗

🎙️ Hear our coverage →

#voice-ai #open-source #agents

HumanLayer Apr 24, 2025

Dev ToolsOpen weights

12-Factor Agents

Dex Horthy publishes 12-Factor Agents, a guide to production-ready agents

HumanLayer founder Dex Horthy published 12-Factor Agents, an open GitHub repo and essay distilling common patterns and pitfalls for building reliable, production-ready AI agents. Drawing on his experience building agent SDKs, it argues that serious teams end up writing large parts from scratch and lays out principles for robust agent design, discussed in depth on the show.

GitHub Repo ↗Webinar Recording ↗

🎙️ Hear our coverage →

#agents #coding #open-source

Anthropic Apr 17, 2025

Major Features & Updates

Claude Research

Claude gains Research mode and Google Workspace integration

Anthropic shipped a Research capability for Claude, letting it conduct multi-step research across the web, alongside a Google Workspace integration that connects Claude to email, calendar and docs context.

🎙️ Hear our coverage →

#agents #research #consumer-ai

OpenAI Apr 17, 2025

Dev ToolsOpen weights

Codex CLI

OpenAI debuts Codex CLI, an open source terminal coding agent

OpenAI released Codex CLI, an open source coding tool for the terminal. It ships with hardened security, using Apple Seatbelt on macOS to limit execution to the current directory plus temp files.
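
The policy described above (access limited to the current directory plus temp files) can be illustrated with a small path check. This is a minimal sketch of the idea only: Apple Seatbelt enforces its rules at the OS level, and nothing here reflects how Codex CLI is actually implemented. The function names `is_path_allowed` and `default_roots` are hypothetical.

```python
import tempfile
from pathlib import Path


def is_path_allowed(target, allowed_roots):
    """Return True if `target` resolves to a location inside one of the roots.

    Paths are resolved first so that `..` segments and symlinks cannot be
    used to step outside an allowed root.
    """
    resolved = Path(target).resolve()
    for root in allowed_roots:
        root_resolved = Path(root).resolve()
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


def default_roots():
    """Hypothetical default policy: current directory plus the temp dir."""
    return [Path.cwd(), Path(tempfile.gettempdir())]


# A relative path resolves against the current directory, so it is allowed.
print(is_path_allowed("notes.txt", default_roots()))
```

An application-level check like this is easy to bypass from inside the process, which is why a kernel-enforced sandbox is the stronger design for an agent that runs arbitrary commands.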

🎙️ Hear our coverage →

#coding #agents #open-source

OpenAI Apr 17, 2025

New Models

o3 & o4-mini

OpenAI launches o3 and o4-mini, SOTA reasoning models with tool use

OpenAI shipped o3 and o4-mini in ChatGPT and the API, with o3 setting new SOTA records on Codeforces, SWE-bench, MMMU and more. For the first time the models can use tools (web search, Python, image generation) during the reasoning process, and they can think visually by cropping, zooming and rotating images. o3 scored $65k on the Freelancer eval versus o1's $28k, and o4-mini hits 99.5% on AIME with a Python interpreter.

$65k o3 score on the Freelancer eval (vs o1's $28k)99.5% o4-mini on AIME with Python interpreter200k Token context window

Blog ↗Watch Party ↗

🎙️ Hear our coverage →

#reasoning #agents #multimodal

Cloudflare Apr 10, 2025

Dev ToolsOpen weights

Agents SDK

Cloudflare releases a new Agents SDK for building stateful AI agents

Cloudflare shipped a new Agents SDK for building and deploying AI agents on its edge platform. It joins the week's wave of agent infrastructure announcements alongside Google's A2A and broad MCP adoption.

agents.cloudflare.com ↗

🎙️ Hear our coverage →

#agents #coding

GitMCP (Liad Yosef & Ido Salomon) Apr 10, 2025

Dev ToolsOpen weights

GitMCP

GitMCP turns any GitHub repo into an MCP server instantly

Creators Liad Yosef and Ido Salomon launched GitMCP, a free tool that turns any GitHub repository into an MCP server by simply swapping the domain (gitmcp.io/user/repo). It lets AI assistants ground themselves in a repo's docs and code, and the creators joined the show to demo it.
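
The domain swap described above amounts to a one-line URL rewrite. The helper below is a sketch under the assumption that the pattern is exactly gitmcp.io/user/repo as stated; the function name `to_gitmcp_url` and the example repository are made up for illustration.

```python
from urllib.parse import urlsplit


def to_gitmcp_url(github_url: str) -> str:
    """Rewrite https://github.com/<user>/<repo> to https://gitmcp.io/<user>/<repo>."""
    parts = urlsplit(github_url)
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        raise ValueError(f"not a github.com URL: {github_url!r}")

    # Keep only the first two path segments: the owner and the repository.
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected a URL of the form github.com/<user>/<repo>")
    user, repo = segments[0], segments[1]

    # Clone URLs often end in ".git"; drop it so the path matches the repo name.
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"https://gitmcp.io/{user}/{repo}"


print(to_gitmcp_url("https://github.com/example-user/example-repo"))
```

The resulting URL is what you would register as the server address in an MCP client.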

🎙️ Hear our coverage →

#agents #coding #open-source

Google Apr 10, 2025

Also ReleasedOpen weights

Agent2Agent (A2A) protocol

Google announces A2A, an open agent-to-agent communication protocol

Google announced the Agent2Agent (A2A) protocol at Cloud Next, an open spec for agents from different vendors to discover and communicate with each other. The spec was published on GitHub with a long list of launch partners, including Weights & Biases.

Google Developers blog: A2A ↗A2A spec on GitHub ↗W&B partnership blog ↗

🎙️ Hear our coverage →

#agents #open-source

Google DeepMind Apr 10, 2025

Major Features & Updates

Official MCP support

Google announces official support for the Model Context Protocol (MCP)

Demis Hassabis announced that Google will officially support Anthropic's Model Context Protocol (MCP) in its models and SDKs. This was a major signal of MCP becoming the industry standard for connecting AI models to tools and data.

Demis Hassabis announcement on X ↗

🎙️ Hear our coverage →

Google Apr 10, 2025

Products & Apps

Firebase Studio

Google launches Firebase Studio AI app-building environment at Cloud Next

As part of a flood of announcements at Google Cloud Next 2025, Google launched Firebase Studio, a browser-based AI-powered environment for building and shipping full-stack apps. It was one of the headline developer-facing launches from the event.

Firebase Studio ↗Google Cloud Next 2025 announcements ↗

🎙️ Hear our coverage →

#coding #agents

OpenAI Apr 10, 2025

Major Features & Updates

ChatGPT enhanced memory

OpenAI gives ChatGPT enhanced memory that can recall all your past chats

OpenAI rolled out enhanced memory for ChatGPT, allowing it to reference and recall all of a user's previous conversations rather than just saved memories. This makes ChatGPT significantly more personalized across sessions.

OpenAI announcement on X ↗

🎙️ Hear our coverage →

#consumer-ai #agents

Weights & Biases Apr 10, 2025

Also Released

observable.tools & MCP RFC-269

W&B launches observable.tools initiative and MCP observability RFC

Weights & Biases launched the observable.tools initiative and published an RFC (RFC-269) proposing observability standards for the Model Context Protocol, inviting community comment. W&B also announced it is a launch partner for Google's A2A protocol.

observable.tools ↗MCP RFC ↗W&B + Google A2A partnership blog ↗

🎙️ Hear our coverage →

#agents #coding

All Hands AI Apr 3, 2025

New ModelsOpen weights

OpenHands LM 32B

OpenHands LM 32B: MIT-licensed coding agent model hits 37.2% SWE-Bench

All Hands AI (formerly OpenDevin) released OpenHands LM 32B, an MIT-licensed Qwen finetune that scores 37.2% on SWE-Bench Verified, competing with much larger models on real-world repo tasks. The OpenHands agent also took the #2 spot on the new Live SWE-Bench leaderboard, and the 32B model runs locally on a single RTX 3090. A hosted OpenHands Cloud version is also available; guest Xingyao Wang joined the show to discuss it.

37.2% SWE-Bench Verified score#2 Live SWE-Bench leaderboard (OpenHands agent)

Introducing OpenHands LM 32B (blog) ↗Model on Hugging Face (MIT license) ↗OpenHands Cloud ↗

🎙️ Hear our coverage →

#open-source #coding #agents

Amazon Apr 3, 2025

Products & Apps

Nova Act

Amazon announces Nova Act browser agent SDK

Amazon entered the agent race with Nova Act, an agent designed to take actions in web browsers, possibly built with talent from the Adept acquisition. Amazon claims it beats Claude 3.5 and OpenAI's computer-use model on some benchmarks, but it is only available via an SDK behind a request form, so claims could not be verified hands-on.

Nova Act announcement (Amazon Science) ↗Access request form ↗

🎙️ Hear our coverage →

Cognition Labs Apr 3, 2025

Products & Apps

Devin 2.0

Devin 2.0 launches with new IDE experience and $20/month entry price

Breaking during the show: Cognition Labs launched Devin 2.0, the second version of its AI software engineer, with a new IDE experience. Crucially, pricing now starts at $20/month, down from the original $500/month tier, making the agent far more accessible.

$20/mo new starting price

🎙️ Hear our coverage →

#agents #coding

OpenAI Apr 3, 2025

Benchmarks & EvalsOpen weights

PaperBench

OpenAI releases PaperBench eval and open-sources Nano-Eval framework

OpenAI published PaperBench, a tough new evaluation that tests whether AI agents can replicate cutting-edge AI research papers, with more than 8,300 graded tasks and meta-evaluation of the LLM judge. The best model managed only a 21.0% replication score versus 41.4% for human PhDs. The code and the Nano-Eval framework were open sourced on GitHub alongside the paper.

8,300+ graded tasks in the benchmark21.0% best model replication score41.4% human PhD baseline score

PaperBench announcement ↗PaperBench code on GitHub ↗PaperBench paper (PDF) ↗Nano-Eval framework (openai/preparedness) ↗

🎙️ Hear our coverage →

#benchmarks #research #agents

Weights & Biases Apr 3, 2025

Also ReleasedOpen weights

Observable Tools

W&B launches Observable.tools initiative to add observability to MCP

Alex and Weights & Biases launched the Observable Tools initiative to bring observability to the Model Context Protocol (MCP) ecosystem, since external tool calls currently lose visibility for debugging and security. A concrete proposal using OpenTelemetry was posted to the MCP specification GitHub discussions for community feedback.

Observable.tools ↗OpenTelemetry proposal on MCP spec GitHub ↗Viral MCP clients tweet ↗

🎙️ Hear our coverage →

#agents #coding

March 2025

OpenAI Mar 27, 2025

Major Features & UpdatesOpen weights

MCP support in OpenAI Agents SDK

OpenAI adopts Anthropic's Model Context Protocol - MCP won

OpenAI officially announced support for the Model Context Protocol (MCP) in its Agents SDK, effectively settling the agent tool-connectivity standards war in MCP's favor. The move is possibly more impactful long-term than the week's flashier launches, since the entire ecosystem can now converge on one protocol for connecting models to tools and data.
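
For readers who have not looked at the protocol itself: MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with a `tools/call` request. The sketch below only builds that request; the tool name `search_issues` and its arguments are made up for illustration, and transport, initialization and response handling are left out.

```python
import json
from itertools import count

# JSON-RPC requests carry an id so responses can be matched to them.
_request_ids = count(1)


def make_tool_call(name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Hypothetical tool and arguments; a real client would first discover the
# available tools and their input schemas via a `tools/list` request.
request = make_tool_call("search_issues", {"query": "flaky test"})
print(json.dumps(request, indent=2))
```

Because every vendor's SDK emits and accepts this same message shape, a tool server written once can be reused across clients, which is the practical meaning of converging on one protocol.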

OpenAI Agents SDK MCP docs ↗

🎙️ Hear our coverage →

#agents #coding

Weights & Biases Mar 27, 2025

Dev ToolsOpen weights

Weave MCP Server

W&B ships official Weave MCP server - talk to your evals

Weights & Biases shipped an official MCP server for Weave, its LLM observability and evaluation tool, letting agents and MCP clients query and analyze your evals directly. Morgan McGuire of the W&B Applied AI team demoed it on the show, with wandb Models integration coming soon so agents can monitor loss curves for you.

X announcement ↗GitHub repo ↗Example W&B report ↗

🎙️ Hear our coverage →

#agents #benchmarks #coding

Arcee AI Mar 20, 2025

Products & Apps

Arcee Conductor

Arcee AI announces Conductor, an intelligent model router

Arcee AI's Lucas Atkins joined the show to announce Conductor, a model router that picks the best model (including Arcee's small specialized models) for each query. It targets cost and quality optimization by routing requests instead of sending everything to one large model.

🎙️ Hear our coverage →

#api #agents #infrastructure

Cursor Mar 20, 2025

Major Features & Updates

Claude 3.7 MAX

Cursor ships Claude 3.7 MAX mode

Cursor shipped Claude 3.7 MAX, a mode giving the agent the full context window and higher tool-call limits with Claude 3.7 Sonnet. It is aimed at harder, longer coding tasks at premium usage-based pricing.

🎙️ Hear our coverage →

#coding #agents

Google Mar 20, 2025

Dev Tools

Gemini Co-Drawing

Gemini Co-Drawing demo uses native image output to help you draw

A Hugging Face space demo, Gemini Co-Drawing, uses Gemini's native image generation output to collaboratively complete and enhance your sketches as you draw. It showcases the new native image-output capability of Gemini 2.0 Flash in an interactive tool.

🎙️ Hear our coverage →

#image-gen #agents

Google Mar 20, 2025

Major Features & Updates

Gemini Deep Research, Canvas & Live Previews

Google makes Deep Research free, adds Canvas and Live Previews to Gemini

Google made its Deep Research agent free for Gemini users and shipped Canvas, a collaborative workspace with live previews for code and documents. Demos on the show included a playable Tetris game and a markdown word counter built and previewed directly inside Gemini.

X ↗Tetris game ↗markdown enabled word counter ↗

🎙️ Hear our coverage →

#agents #search #coding

Google Mar 20, 2025

Major Features & Updates

NotebookLM Mind Maps

NotebookLM teases Mind Maps for visualizing sources

Google's NotebookLM team previewed Mind Maps, a feature that turns your uploaded sources into interactive visual maps of concepts. It was teased publicly by the team this week ahead of a wider rollout.

🎙️ Hear our coverage →

#consumer-ai #agents

NVIDIA Mar 20, 2025

New ModelsOpen weights

Llama-Nemotron (Super 49B, Nano 8B)

NVIDIA drops Llama-Nemotron reasoning models plus training dataset

NVIDIA released the Llama-Nemotron family, including Super 49B and Nano 8B reasoning models, announced around GTC. Alongside the open weights, NVIDIA published the Llama-Nemotron post-training dataset, giving the community both the models and the data recipe behind them.

Announcement ↗X ↗Llama-Nemotron HuggingFace Collection ↗Dataset ↗

🎙️ Hear our coverage →

#open-source #reasoning #training

Cohere Mar 13, 2025

New ModelsOpen weights

Command A

Cohere Command A: 111B enterprise model with 256K context on just 2 GPUs

Cohere announced Command A, a 111B parameter open-weights model with a 256K context window, presented on the show by Cohere's Sandra Kublik. It runs on only two GPUs where models of this size typically require around 32, and is built for enterprise use: agentic tasks, tool use, multilingual performance, and secure private deployments.

🎙️ Hear our coverage →

#open-source #industry #agents

Google Mar 13, 2025

Major Features & Updates

Gemini Deep Research (free tier)

Google makes Deep Research free in the Gemini app, powered by Gemini Thinking

Google made its Deep Research agent free for everyone in the Gemini app and upgraded it to run on Gemini Thinking. In a live test on the show it browsed over 150 websites to compile a comprehensive answer, with a polished interface and export to Google Docs.

Try It no cost ↗

🎙️ Hear our coverage →

#agents #search

Manus AI Mar 13, 2025

Products & Apps

Manus

Manus AI research agent has everyone talking

Manus (manus.im) is a new AI research agent that creates a to-do list, browses the web in a real Chrome browser, and generates files. It was described on the show as 'Operator on steroids' and appears to be powered by Claude 3.7 behind the scenes. The crew tested it live on a research task and praised its slick UI.

🎙️ Hear our coverage →

#agents #search

OpenAI Mar 13, 2025

APIs & Platforms

Responses API + Web Search, File Search, Computer Use tools

OpenAI launches Responses API with Web Search, File Search, and Computer Use

OpenAI announced a new agent-focused developer stack at a livestream: the Responses API, a new way to build with OpenAI designed for agentic workloads, plus an Agents SDK. It ships with three built-in tools: Web Search, a File Search tool providing built-in RAG over your files, and a Computer Use tool for agents that operate computer interfaces.

X announcement ↗Blog ↗

🎙️ Hear our coverage →

#agents #api #coding

Cloudflare Mar 6, 2025

Dev ToolsOpen weights

MCP servers on Cloudflare Workers

Cloudflare ships support for building MCP servers on Workers

Cloudflare published tooling and docs for building and deploying Model Context Protocol servers on Cloudflare Workers, riding the MCP wave sweeping the AI community. Senior PM Dina Kozlov joined the show's MCP deep dive to walk through it alongside MCP builder Jason Kneen.

Cloudflare Blog ↗

🎙️ Hear our coverage →

#agents #coding

Google Mar 6, 2025

Dev Tools

Data Science Agent in Colab

Google ships Gemini-powered Data Science Agent in Colab

Google launched a Data Science Agent inside Google Colab, powered by Gemini, that can autonomously generate complete, working notebooks from natural language descriptions of an analysis task. It automates data loading, exploration, and modeling boilerplate for data scientists.

Google Developers Blog ↗

🎙️ Hear our coverage →

#agents #coding

February 2025

Microsoft Feb 20, 2025

New ModelsOpen weights

OmniParser v2

Microsoft ships OmniParser v2 for faster screen parsing in GUI agents

Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.

Gradio Demo ↗

🎙️ Hear our coverage →

#agents #vision

Weights & Biases Feb 20, 2025

Papers & Research

Agents Whitepaper & Course

Weights & Biases releases an AI agents whitepaper and announces agents course

Weights & Biases released a whitepaper on evaluating AI agent applications and announced an upcoming agents course built in collaboration with OpenAI's Ilan Bigio, with signups at wandb.me/agents. The push targets agent evaluation and observability tooling for the community.

Whitepaper ↗Agents course signup ↗

🎙️ Hear our coverage →

#agents #benchmarks #coding

xAI Feb 20, 2025

Major Features & Updates

DeepSearch

xAI launches DeepSearch, an agentic research feature with live X access

Alongside Grok 3, xAI launched DeepSearch, an agentic deep-research feature comparable to Perplexity or OpenAI's Deep Research, with a leg up on real-time information thanks to native access to X search. Alex found it underwhelming in initial tests and nicknamed it 'Shallow Search' after it spent 34 seconds on a query where OpenAI's Deep Research took 11 minutes and cited 17 sources.

xAI blog ↗Try it ↗

🎙️ Hear our coverage →

#agents #search

January 2025

Block Jan 30, 2025

Dev ToolsOpen weights

Goose

Block open-sources Goose, a local AI agent framework

Block (the company behind Square) released Goose, an open-source local agent framework that runs on your machine and can use any LLM to execute tasks with tools. It was a centerpiece of the show's agents discussion as an open alternative for building autonomous workflows locally.

X announcement ↗GitHub / docs ↗

🎙️ Hear our coverage →

#agents #open-source #coding

Browser Use Jan 30, 2025

Dev ToolsOpen weights

Browser-use

Browser-use: open-source alternative to OpenAI's Operator

Browser-use is an open-source library that lets LLM agents control a real web browser, positioned on the show as the OSS counterpart to OpenAI's Operator. It enables anyone to build browsing agents with their model of choice instead of a closed hosted product.

🎙️ Hear our coverage →

#agents #open-source

Exa Jan 30, 2025

Major Features & Updates

Exa DeepSeek Chat

Exa ships free DeepSeek R1 chat demo with web search

Exa integrated DeepSeek R1 into a free hosted chat demo that combines the reasoning model with Exa's web search. It was mentioned in the show's tools section as a no-cost way to try R1 grounded in live search results.

🎙️ Hear our coverage →

#reasoning #search #agents

Perplexity Jan 30, 2025

Major Features & Updates

Perplexity Pro with R1

Perplexity adds DeepSeek R1 as a Pro reasoning model option

Perplexity integrated DeepSeek R1 into its Pro search product, letting subscribers choose R1 as the reasoning model behind answers. It was one of several tools that raced to host R1 on Western infrastructure within days of the model's release.

🎙️ Hear our coverage →

#reasoning #search #agents

ByteDance Jan 23, 2025

New ModelsOpen weights

UI-TARS

ByteDance UI-TARS: open computer-use models that control your PC

ByteDance released UI-TARS, open computer-use models in 7B and 72B parameter sizes that can control a Mac or PC, with desktop apps for both platforms. ByteDance claims they beat GPT-4-class models on GUI/computer-control benchmarks.

7B / 72B Model sizes

UI-TARS-7B-SFT on Hugging Face ↗UI-TARS desktop on GitHub ↗

🎙️ Hear our coverage →

#agents #open-source

OpenAI Jan 23, 2025

Products & Apps

Operator

OpenAI launches Operator, an agentic browser for ChatGPT Pro

OpenAI launched Operator, an agentic browser-use product that performs tasks for you on the web, available to ChatGPT Pro subscribers at operator.chatgpt.com. As Sam Altman framed it on the launch stream: you give agents a task and they go off and do it.

operator.chatgpt.com ↗

🎙️ Hear our coverage →

Weights & Biases Jan 23, 2025

Also Released

W&B SWE-bench Verified SOTA agent

W&B programming agent breaks SOTA on SWE-bench Verified

Weights & Biases announced an AI programming agent built with OpenAI's o1 that set a new state-of-the-art score on SWE-bench Verified. The work was developed and tracked with W&B Weave, the team's LLM observability toolkit.

W&B SOTA programming agent report ↗W&B Weave ↗

🎙️ Hear our coverage →

#coding #agents #benchmarks