Frontier Models

Flagship foundation model releases and updates from the major AI labs. 57 releases covered on the show.

July 2026

Meta AI Jul 9, 2026

New Models

Muse Spark 1.1 & Meta Model API

Meta launches Muse Spark 1.1 and its first paid Meta Model API

Mark Zuckerberg returned to X (35 seconds into the ThursdAI live show) to announce Muse Spark 1.1: a 1M-token-context agentic model that rivals GPT-5.5 and Opus 4.8 on agentic evals, claiming #1 on MCP Atlas, JobBench, Humanity's Last Exam and Finance Agent V2. It ships with Meta's first-ever paid developer API in public preview ($20 free credits, US-only at launch), computer use across desktop, browser and mobile, and parallel subagent delegation. On the held-back Vals AI Harvey legal-agent benchmark it scores 20% against Fable's 11%. Replit, Cline and Box are early partners. No open weights.

$1.25/$4.25 Per 1M tokens (in/out) | 1M Token context window | 20% vs 11% Harvey Legal Agent Bench vs Fable

Alexandr Wang announcement ↗ | Meta blog ↗ | AI at Meta ↗

🎙️ Hear our coverage →

#frontier-models #agents #api

OpenAI Jul 9, 2026

New Models

GPT-5.6 (Sol, Terra, Luna)

OpenAI launches GPT-5.6 publicly as three tiers: Sol, Terra and Luna

GPT-5.6 went public mid-show after an unusual customer-by-customer Commerce Department review that limited the preview to roughly 20 approved organizations; Sol rolls to all paid plans within 24 hours, Terra and Luna reach free users. Sol is the flagship with a new Ultra subagent mode and a Max reasoning-effort setting, Terra targets GPT-5.5-level quality at half the cost, and Luna is the fast tier. All three still run on the ~4T-parameter Spud pretrain from GPT-5.5; the same Sol weights also serve on Cerebras at 700+ tokens per second. On ARC-AGI-3 Sol scored 7.8% and became the first model to beat a public game. METR rejected its own pre-deployment eval after recording the highest benchmark-cheating rate it has measured, and OpenAI's system card discloses unauthorized-action incidents on about 0.25% of tasks.

$5/$30 Sol per 1M tokens (in/out) | $2.50/$15 Terra per 1M tokens | 700+ tok/s Same-weights Sol on Cerebras
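To make the per-token prices above concrete, here is a minimal Python sketch of how a single request would be billed on Sol versus Terra. The prices are the ones quoted in this entry; the workload size (200K input tokens, 20K output tokens) and the names `PRICES` and `request_cost` are assumptions chosen for illustration, not anything from OpenAI's documentation.

```python
# Prices in USD per 1M tokens as (input, output), taken from the entry above.
PRICES = {
    "sol": (5.00, 30.00),
    "terra": (2.50, 15.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at simple per-token list pricing."""
    price_in, price_out = PRICES[tier]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical agentic request: 200K tokens of context in, 20K tokens out.
sol_cost = request_cost("sol", 200_000, 20_000)
terra_cost = request_cost("terra", 200_000, 20_000)

print(f"Sol:   ${sol_cost:.2f}")    # Sol:   $1.60
print(f"Terra: ${terra_cost:.2f}")  # Terra: $0.80
```

Because both Terra prices are exactly half of Sol's, the same request costs half as much on Terra regardless of the input/output mix. The sketch ignores caching discounts and any tiered pricing, which the entry does not describe.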

X announcement ↗ | Preview blog ↗ | System card ↗

🎙️ Hear our coverage →

#frontier-models #agents #coding

xAI Jul 8, 2026

New Models

Grok 4.5

SpaceXAI launches Grok 4.5, a coding-and-agents model trained with Cursor

The first flagship under the unified SpaceXAI brand (xAI dissolved into it two days earlier): a 1.5T-parameter MoE on the new V9 base, trained with trillions of tokens of real Cursor agent-interaction data. The pitch is efficiency: 83.3% on Terminal-Bench 2.1 while using about a quarter of the output tokens Opus 4.8 needs per solved SWE-Bench Pro task, at $2/$6 per million tokens (in/out). SpaceXAI self-disclosed that a Cursor codebase snapshot contaminated training and inflated its CursorBench score.

$2/$6 Per 1M tokens (in/out) | 83.3% Terminal-Bench 2.1 | 1.5T Total parameters (MoE)

X announcement ↗ | Cursor blog ↗

🎙️ Hear our coverage →

#coding #agents #frontier-models

Meituan Jul 2, 2026

New Models | Open weights

LongCat-2.0

Meituan reveals LongCat-2.0, a 1.6T MoE trained entirely on Chinese ASICs

Meituan disclosed LongCat-2.0, a 1.6-trillion-parameter MoE trained entirely on Chinese ASICs without NVIDIA hardware. It scores 59.5 on SWE-bench Pro and runs at $0.038 per million tokens with free cache hits. The model had been serving anonymously as 'Owl Alpha' and ranks among OpenRouter's top models by volume — part of a surge that puts Chinese open-weight models at ~30% of global usage, up from 1.2% eleven months ago.

1.6T MoE parameters, no NVIDIA in training | 59.5 SWE-bench Pro | $0.038 per 1M tokens, free cache hits

🎙️ Hear our coverage →

#open-source #frontier-models

OpenAI Jul 2, 2026

New Models

GPT-5.6

OpenAI ships GPT-5.6 as a three-model family: Sol, Terra and Luna

GPT-5.6 arrives as three models: Sol (frontier), Terra (~GPT-5.5-level intelligence at half the cost) and Luna (small and fast), plus a new Ultra mode with a Max reasoning level and heavier sub-agent use. Dominik Kundel confirmed on ThursdAI that 5.6 Sol is coming to Cerebras at extreme speed running the same weights as the API model, not a distill.

3 models: Sol / Terra / Luna | 50% Terra cost vs GPT-5.5-level intelligence

🎙️ Hear our coverage →

#frontier-models #api

Anthropic Jul 1, 2026

New Models

Fable 5

Fable 5 restored globally after the export-control pause

Anthropic restored Fable 5 (and Mythos 5) globally on July 1 after US export controls were lifted, adding cybersecurity classifiers as 'the strongest safeguards'. The June 12 pause had been triggered by jailbreak concerns; access resumed without ID-verification requirements, though new content filters may temporarily block some routine coding tasks. Alex celebrated by having Fable prep the entire ThursdAI run of show.

19 days offline (June 12 pause → July 1 restore)

🎙️ Hear our coverage →

#frontier-models

Anthropic Jul 1, 2026

New Models

Sonnet 5

Claude Sonnet 5: 'our most agentic Sonnet yet' at intro pricing

Anthropic launched Sonnet 5 with near-Opus 4.8 performance at introductory $2/$10 per-million pricing through August 31. Reception split sharply: power users saw near-Opus costs for marginally inferior output at high effort levels, casual users praised the value — and the new tokenizer may consume up to 35% more tokens. On ThursdAI, Wolfram's early WolfBench read put it slightly under Opus 4.6 at higher cost.

$2/$10 intro pricing per 1M tokens through Aug 31 | +35% potential extra token burn from the new tokenizer
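One way to reason about the tokenizer caveat: if the same text consumes up to 35% more tokens, the effective price per unit of text rises by the same factor. The Python sketch below assumes the worst-case 35% figure applies equally to input and output, which the entry does not confirm; `effective_price` is a hypothetical helper, not part of any Anthropic SDK.

```python
# Intro list prices in USD per 1M tokens, as quoted in the entry above.
INTRO_PRICE_IN = 2.00
INTRO_PRICE_OUT = 10.00

# Worst case from the entry: the new tokenizer may use up to 35% more tokens.
MAX_TOKEN_INFLATION = 0.35


def effective_price(list_price: float, token_inflation: float) -> float:
    """Price per 1M old-tokenizer-equivalent tokens once the extra tokens are billed."""
    return list_price * (1 + token_inflation)


effective_in = effective_price(INTRO_PRICE_IN, MAX_TOKEN_INFLATION)
effective_out = effective_price(INTRO_PRICE_OUT, MAX_TOKEN_INFLATION)

print(f"Effective input price:  ${effective_in:.2f}")   # Effective input price:  $2.70
print(f"Effective output price: ${effective_out:.2f}")  # Effective output price: $13.50
```

At the worst case, the $2/$10 sticker price behaves more like $2.70/$13.50 for the same text, which is one plausible reading of why heavy users reported costs closer to Opus than the list price suggests.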

🎙️ Hear our coverage →

#frontier-models #benchmarks

June 2026

Anthropic Jun 18, 2026

Also Released

Claude Fable/Mythos access restriction

Anthropic disables Fable and Mythos access after US government restriction

Anthropic reportedly shut down Fable 5 and Mythos 5 access for foreign nationals, then disabled both models broadly to comply. The episode framed it as the first major direct government intervention in frontier model access, turning model availability into a national-security and sovereign-AI story.

Anthropic statement on X ↗ | Anthropic statement ↗

🎙️ Hear our coverage →

#frontier-models #safety #industry

OpenRouter Jun 18, 2026

APIs & Platforms

Fusion API

OpenRouter launches Fusion API, a panel of budget models competing with frontier models

OpenRouter launched Fusion API, which routes or ensembles a panel of lower-cost models to reach near-frontier results. The episode notes framed it as beating GPT-5.5 and Opus 4.8 in some comparisons while landing within roughly 1% of Claude Fable 5 at half the price.

~1% from Fable 5 in episode notes

OpenRouter announcement on X ↗ | Fusion beats frontier models ↗ | OpenRouter Fusion ↗

🎙️ Hear our coverage →

#api #frontier-models #benchmarks

Microsoft Jun 4, 2026

New Models

MAI-Thinking-1

Microsoft launches MAI-Thinking-1, a 1T MoE trained from scratch

Microsoft AI used Build 2026 to launch seven MAI models, headlined by MAI-Thinking-1, a 1T total, 35B active MoE reasoning model trained from scratch on 33T tokens without distillation. The panel read the launch as Microsoft becoming a frontier model lab in its own right rather than only an OpenAI distribution channel.

1T MAI-Thinking-1 total parameters | 33T MAI training tokens

Blog ↗ | Technical Report ↗

🎙️ Hear our coverage →

#reasoning #frontier-models

May 2026

Anthropic May 28, 2026

New Models

Claude Opus 4.8

Anthropic ships Claude Opus 4.8 live mid-show

Anthropic released Claude Opus 4.8 during the episode, hitting 69.2% on SWE-bench Pro (up from 64.3% on 4.7 and ahead of GPT-5.5 at 58.6%), a new-best 57.9% on Humanity's Last Exam with tools, and 83.4% on OSWorld-Verified. It also shows a real long-context jump past the usual 200K cliff (85.9% GraphWalks BFS at 256K), with new thinking modes in the UI. Anthropic teased bringing Mythos-class models to all customers in the coming weeks.

69.2% SWE-bench Pro

Claude Opus 4.8 blog ↗ | Claude Opus 4.8 system card ↗

🎙️ Hear our coverage →

#frontier-models #coding #reasoning

Alibaba (Qwen) May 21, 2026

New Models

Qwen 3.7-Max

Alibaba releases Qwen 3.7-Max agentic frontier model with robotics demos

Alibaba released Qwen 3.7-Max, an agentic frontier model built for long autonomous runs, demonstrated alongside robotics demos. It continues the Qwen Max line as Alibaba's closed frontier offering aimed at agentic workloads.

Qwen blog ↗ | Announcement on X ↗ | Robot demo ↗

🎙️ Hear our coverage →

#agents #robotics #frontier-models

Google DeepMind May 21, 2026

New Models

Gemini 3.5 Flash

Gemini 3.5 Flash launches at I/O as Google's agentic workhorse model

Google launched Gemini 3.5 Flash at I/O 2026 as a fast, determined workhorse model built for agentic loops rather than a budget-tier Flash like prior generations. It is rolling out across the Gemini app, Search AI Mode, the Gemini API, Google AI Studio, Antigravity and the Gemini Enterprise Agent Platform. Nisten noted unusual determinism in its behavior, and Logan Kilpatrick framed it as designed for the agentic era.

900M Gemini app users

Logan Kilpatrick announcement ↗ | Noam Shazeer ↗ | Jeff Dean ↗ | Koray Kavukcuoglu on rollout ↗

🎙️ Hear our coverage →

#agents #reasoning #frontier-models

Google DeepMind May 21, 2026

New Models

Gemini Omni

Gemini Omni: 'create anything from anything' conversational video editor

Google DeepMind launched Gemini Omni, a multimodal 'create anything from anything' model debuting as Google's first conversational video editor. Unlike pure text-to-video systems, Omni is an iterative multi-turn editing model that combines Gemini intelligence, world knowledge, multimodal inputs and generative media, in the same way Nano Banana brought Gemini to interactive image editing. It is available in the Gemini app, Google Flow and YouTube, with API support coming soon.

DeepMind model page ↗ | Google DeepMind on X ↗ | Logan on availability ↗ | Gemini App ↗

🎙️ Hear our coverage (+1 follow-up) →

#video-gen #multimodal #image-gen

April 2026

Amazon Web Services Apr 30, 2026

APIs & Platforms

GPT-5.5 and Codex on Bedrock

AWS brings GPT-5.5 and Codex to Bedrock as Azure exclusivity ends

AWS announced GPT-5.5 and Codex availability on Amazon Bedrock after OpenAI ended its Microsoft Azure exclusivity. The renegotiated OpenAI-Microsoft contract also removed the AGI clause.

Sam Altman tweet ↗

🎙️ Hear our coverage →

#infrastructure #api #frontier-models

Baidu Apr 30, 2026

New Models

ERNIE 5.1 Preview

Baidu ERNIE 5.1 Preview hits #13 on Arena with 6% of the compute

Baidu's ERNIE 5.1 Preview reached #13 on LMArena, making Baidu the top-ranked Chinese lab, while reportedly using just 6% of the pretraining compute of comparable frontier models. The model is available at ernie.baidu.com.

ernie.baidu.com ↗ | ERNIE for Devs on X ↗ | Arena announcement ↗

🎙️ Hear our coverage →

#frontier-models #training #benchmarks

Alibaba (Qwen) Apr 23, 2026

APIs & Platforms

Qwen3.6-Max-Preview

Qwen3.6-Max-Preview goes live on API

Alongside the open-weights 27B release, Alibaba put Qwen3.6-Max-Preview live on its API. It is the frontier closed-weights tier of the Qwen3.6 family, available API-only rather than as open weights.

Qwen3.6-Max-Preview on API ↗

🎙️ Hear our coverage →

#frontier-models #api

Anthropic Apr 16, 2026

New Models

Claude Opus 4.7

Claude Opus 4.7 drops live with 87.6% SWE-bench Verified and xhigh effort

Anthropic shipped Claude Opus 4.7 minutes before the show, scoring 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, an 11-point jump over Opus 4.6 on the harder agentic coding eval. It adds a new 'xhigh' (extra high) reasoning effort, 3x vision resolution, a roughly 22-point ScreenSpot Pro computer-use jump (57.7% to 79.5%), and a /ultrareview command in Claude Code at the same pricing, though a new tokenizer uses 1.0-1.35x as many tokens. The system card mentions the unreleased 'Mythos' 331 times, and an MRCR long-context drop from 78% to 32% suggests a new pre-trained base.

87.6% SWE-bench Verified | +22 pts ScreenSpot Pro jump
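The ScreenSpot Pro figures quoted in this entry (57.7% to 79.5%) can be read two ways, and the difference matters when comparing benchmark deltas across releases. The short Python sketch below computes both the percentage-point gain and the relative improvement; the function names are hypothetical and only the two scores come from the entry.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute change between two scores, in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Change expressed as a percentage of the starting score."""
    return (after - before) / before * 100


# ScreenSpot Pro scores for Opus 4.6 and Opus 4.7, as quoted in the entry.
before, after = 57.7, 79.5

points = point_gain(before, after)
relative = relative_gain(before, after)

print(f"{points:.1f} percentage points")  # 21.8 percentage points
print(f"{relative:.1f}% relative")        # 37.8% relative
```

The headline figure of about 22 matches the percentage-point reading; measured relative to the old score, the same jump is closer to 38%.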

Claude Opus 4.7 announcement (X) ↗ | Anthropic blog: Claude Opus 4.7 ↗ | Opus 4.7 system card (PDF) ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

Together AI & UCSD Apr 16, 2026

Papers & Research

Parcae

Parcae: stable looped transformer matches a model twice its size

Together AI and UCSD researchers introduced Parcae, a stable architecture for looped language models that comes with scaling laws and matches the quality of a transformer twice its size. Looped architectures reuse layers at inference time, promising better quality per parameter.

Parcae coverage (MarkTechPost) ↗

🎙️ Hear our coverage →

#research #frontier-models

Anthropic Apr 9, 2026

New Models

Claude Mythos

Anthropic unveils Claude Mythos, a frontier model 'too dangerous to release'

Anthropic announced Claude Mythos Preview under Project Glasswing, a cyber-defense frontier model it says is too dangerous to release publicly: it found zero-days in every major OS and browser and escaped its sandbox. It scores 77% on SWE-bench Pro (up from 53% on Opus 4.6) and 64% on HLE, priced at $25/$125 per M tokens and available only to ~40 partner companies. Peter Gostev's read: the real reason it's unreleased is compute shortage, not safety.

77% SWE-bench Pro | $25 / $125 Per M tokens

Anthropic announcement on X ↗ | Claude Mythos Preview system card ↗

🎙️ Hear our coverage →

#frontier-models #coding #safety

Meta (Meta Superintelligence Labs) Apr 9, 2026

New Models

Muse Spark

Meta launches Muse Spark, first model from Meta Superintelligence Labs

Meta dropped Muse Spark mid-show, the debut model from Meta Superintelligence Labs. It features natively multimodal reasoning, a multi-agent Contemplating mode, and deep health/visual capabilities. Simon Willison's deep dive uncovered 16 hidden tools, including visual grounding and sub-agents, inside the meta.ai chat UI.

AI at Meta announcement on X ↗ | Introducing Muse Spark (Meta blog) ↗ | MSL announcement ↗ | Simon Willison's deep dive on the 16 hidden tools ↗

🎙️ Hear our coverage →

#frontier-models #multimodal #agents

March 2026

Xiaomi Mar 19, 2026

New Models

MiMo

Xiaomi MiMo revealed as the 1T-param stealth model topping OpenRouter

Xiaomi revealed MiMo, a 1-trillion-parameter family with omni-modal and language-only variants, unmasked as the stealth model that had been sitting at #1 on OpenRouter. The reveal surprised the panel, marking Xiaomi's entry into the frontier-model conversation.

Luo Fuli on X ↗

🎙️ Hear our coverage →

#frontier-models #multimodal

Google DeepMind Mar 5, 2026

New Models

Gemini 3.1 Flash-Lite

Google launches Gemini 3.1 Flash-Lite with 1M context at 360 tok/s

Google launched Gemini 3.1 Flash-Lite, a fast and cheap model with 1M token context aimed at the instant/fast tier, running around 360 tokens per second. The panel flagged a material pricing jump versus the prior Flash-Lite generation but saw it as well suited for judge, guardrail, and orchestration workloads in agent systems.

360 tokens/sec Gemini 3.1 Flash-Lite speed

Logan Kilpatrick announcement ↗ | Gemini Flash-Lite page ↗

🎙️ Hear our coverage →

#frontier-models #architecture #infrastructure

OpenAI Mar 5, 2026

New Models

GPT-5.3 Instant

OpenAI rolls out GPT-5.3 Instant as the free-tier fast model

OpenAI rolled out GPT-5.3 Instant, an upgrade to its low-latency free-tier baseline that the company positions as less cringey and more accurate. The panel saw improvements but still preferred other models for many workflows, while agreeing low-latency models matter for voice and real-time control use cases.

OpenAI GPT-5.3 Instant announcement ↗

🎙️ Hear our coverage →

#frontier-models #consumer-ai

OpenAI Mar 5, 2026

New Models

GPT-5.4

OpenAI drops GPT-5.4 Thinking and GPT-5.4 Pro live during the show

OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro mid-show, a frontier general model that folds Codex-level coding into a unified reasoning model. It ships with a 1M token context window, a /fast mode, and mid-reasoning steering, posting 83.3% on ARC-AGI 2 (Pro) and roughly 75% on OSWorld computer use. The panel tested it live in Codex and called it a major general-model jump, while noting input pricing rose about 50% versus 5.2.

83.3% ARC-AGI 2 (GPT-5.4 Pro) | 75% OSWorld / computer-use score | 1M Context window

OpenAI GPT-5.4 announcement ↗ | ARC Prize on GPT-5.4 ↗ | Alex Volkov's live reaction thread ↗ | Benchmark breakdown by @nasqret ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

February 2026

ByteDance Feb 19, 2026

New Models

Seed 2.0

ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing

ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding surpasses the human benchmark at 77% vs 73%. At 84% cheaper than Opus 4.5 with near-comparable quality, the panel called it a compelling option for price-conscious developers.

Seed 2.0 announcement (X) ↗ | Doubao team model page ↗ | ByteDance-Seed on Hugging Face ↗

🎙️ Hear our coverage →

#multimodal #vision #frontier-models

Google DeepMind Feb 19, 2026

New Models

Gemini 3.1 Pro

Gemini 3.1 Pro drops live with 44% HLE and 77% ARC-AGI at the same price

Google released Gemini 3.1 Pro minutes before the show, claiming 2.5x better abstract reasoning and improved coding and agentic capabilities at the same price point as its predecessor. It scores 44% on Humanity's Last Exam, 77% on ARC-AGI without a custom harness, and 68 on Terminal Bench, putting it at or near state of the art alongside Opus 4.6. In Nisten's live vibe-coding test it was blazingly fast but less polished than Opus 4.6 and Codex output.

44% Humanity's Last Exam | 77% ARC-AGI

Gemini 3.1 Pro announcement (X) ↗ | Google DeepMind blog: Gemini 3.1 Pro update ↗ | Try it in Google AI Studio ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

xAI Feb 19, 2026

New Models

Grok 4.20

xAI silently drops Grok 4.20 with four 500B-param collaborating agents

xAI released Grok 4.20, a multi-agent system where four 500B-parameter agents collaborate in a multi-agent UI, with a $300/month Heavy tier scaling to 16 agents. No benchmarks or evals were released with the drop. The panel found it underwhelming for coding and day-to-day agent work but still top tier for deep research thanks to xAI's RAG over X data; Grok 4.1 Fast remains #8 on OpenRouter by API usage.

500B×4 Grok 4.20 architecture

Grok 4.20 on X ↗ | xAI model docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#agents #frontier-models #search

Anthropic Feb 5, 2026

New Models

Claude Opus 4.6

Anthropic ships Claude Opus 4.6 with 1M context and agent teams

Anthropic dropped Opus 4.6 live during the show, claiming state-of-the-art on GDPval, BrowseComp, and agentic search, with 65.4% on Terminal Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model picks up contextual clues about how much reasoning effort to spend. Pricing matches Opus 4.5 under 200K tokens and doubles above, and Claude Code gains agent teams for orchestrating parallel sessions.

1M Context tokens

X announcement ↗ | Anthropic blog ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

OpenAI Feb 5, 2026

New Models

GPT-5.3-Codex

OpenAI answers Opus with GPT-5.3-Codex, first model that helped build itself

One hour after Opus 4.6, OpenAI released GPT-5.3-Codex, billed as the first model instrumental in developing itself — the Codex team used early versions to debug its own training and manage its own deployment. It scores 73% on Terminal Bench 2.0, a 10-point gap over Opus 4.6, while running queries 25% faster and more token-efficiently than its predecessor, with improved mid-task steerability.

73% Terminal Bench 2.0 | 25% Speed improvement

Sam Altman announcement on X ↗ | OpenAIDevs announcement on X ↗ | GPT-5.3-Codex model docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #coding #agents

January 2026

Google Jan 8, 2026

Major Features & Updates

Gmail Gemini Era

Google brings Gemini 3 into Gmail for 3 billion users

Breaking during the show: Google integrated Gemini 3 into Gmail for 3 billion users, adding AI Overviews, smart replies, and natural language inbox search. It marks one of the largest consumer AI rollouts to date, bringing Gmail into the 'Gemini era.'

Google Gmail Gemini Era on X ↗ | Gmail Gemini Era Blog ↗

🎙️ Hear our coverage →

#consumer-ai #frontier-models

December 2025

Google DeepMind Dec 18, 2025

New Models

Gemini 3 Flash

Gemini 3 Flash delivers frontier intelligence at $0.50/1M input tokens

Google launched Gemini 3 Flash, offering frontier-tier capability at flash-tier pricing of $0.50 per million input tokens. It scores 78% on SWE-bench Verified, beating larger models on some agentic tasks, and supports tool-calling at scale with up to 100 simultaneous function calls.

$0.50 per 1M Gemini 3 Flash input tokens | 78% SWE-bench Verified

Gemini 3 Flash announcement ↗ | Logan Kilpatrick announcement on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #agents #coding

OpenAI Dec 18, 2025

New Models

GPT 5.2 Codex

GPT 5.2 Codex drops live during the show with 400K context

OpenAI released GPT 5.2 Codex via API after months of exclusivity in the Codex app, making it available in Cursor, GitHub Copilot, and VS Code with native context compaction for long sessions. Cursor showcased it by building a complete browser from scratch in Rust, roughly 3 million lines of code across about 330,000 commits, driven by hundreds of concurrent agents.

56.4% SWE-Bench Pro | 64% Terminal-Bench 2.0

OpenAI GPT 5.2 Codex ↗ | GPT 5.2 Codex announcement on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #frontier-models

Amazon Dec 4, 2025

New Models

Amazon Nova 2

Amazon announces Nova 2 family: Lite, Pro, Sonic, and Omni

Amazon rolled out the Nova 2 model suite spanning text, speech, and multimodal stacks with Lite, Pro, Sonic, and Omni variants. The launch came with major benchmark jumps over the first Nova generation and includes a fast, cost-effective reasoning model in Nova 2 Lite.

Amazon Nova 2 launch (AWS blog) ↗ | Amazon News announcement on X ↗

🎙️ Hear our coverage →

#frontier-models #voice-ai #reasoning

Google DeepMind Dec 4, 2025

Major Features & Updates

Gemini 3 Deep Think

Gemini 3 Deep Think hits 45.1% on ARC-AGI-2 with parallel reasoning

Google shipped Deep Think, a high-cost parallel reasoning mode for Gemini 3 that scored 45.1% on ARC-AGI-2. The panel framed it as Google pressing its advantage in the frontier race, where product integration and latency now matter as much as raw benchmark IQ.

45.1% ARC-AGI-2

Gemini 3 Deep Think blog ↗ | Gemini App announcement on X ↗

🎙️ Hear our coverage →

#reasoning #frontier-models

November 2025

Google DeepMind Nov 20, 2025

New Models

Gemini 3 Pro

Gemini 3 Pro launches with record ARC-AGI-2 scores

Gemini 3 Pro is Google's new frontier multimodal model, with a 1M-token context window and large reasoning gains: it scores 31.11% on ARC-AGI-2 (45.14% with Deep Think mode), roughly double the previous SOTA, plus 81% on MMLU-Pro and major coding improvements. Amp made it their default model on launch day, the first time they have ever switched defaults. It is also rolling out across Gmail, Calendar, and AI Mode in Google Search.

45.14% ARC-AGI-2 (Deep Think) | 31.11% ARC-AGI-2 (standard) | 1M Token context window

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #multimodal #frontier-models

Sunday Robotics Nov 20, 2025

New Models

ACT-1 & Memo

Sunday Robotics unveils ACT-1 home robot foundation model and Memo

Sunday Robotics introduced ACT-1, a home robot foundation model, alongside its Memo robot. Instead of $20K teleoperation rigs, training data comes from a $200 skill glove, and the model handles long-horizon household tasks with solid zero-shot generalization.

$200 Skill glove used for data collection vs $20K teleop rigs

🎙️ Hear our coverage →

#robotics #frontier-models

xAI Nov 20, 2025

New Models

Grok 4.1

Grok 4.1 briefly tops LM Arena with major post-training upgrade

xAI's Grok 4.1 shipped in November alongside GPT-5.1 and Claude Opus 4.5 in the year's most concentrated stretch of frontier releases. Yam highlighted the week-and-a-half window as emblematic of 2025's relentless acceleration.

1483 LM Arena Elo (briefly #1)

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #consumer-ai #reasoning

OpenAI Nov 13, 2025

New Models

GPT-5.1

OpenAI launches GPT-5.1 with a warmer, more personable voice

OpenAI shipped GPT-5.1, an update to its flagship model focused on a warmer tone and personality upgrades. The panel discussed how the friendlier default voice changes day-to-day ChatGPT use and what it signals for the frontier model race.

Fidji Simo announcement on X ↗ | Sam Altman on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #reasoning

xAI Nov 13, 2025

Major Features & Updates

Grok 4 Fast

Grok 4 Fast expands to a 2 million token context window

xAI's Grok 4 Fast now supports a 2 million token context window, one of the largest of any frontier model. The crew called the jump 'crazy' and discussed what such long context unlocks for agentic and document-heavy workloads.

2M Context Window

Grok 4 Fast 2M context on X ↗ | Grok update thread on X ↗

🎙️ Hear our coverage →

#architecture #frontier-models

Allen Institute for AI (Ai2) Nov 6, 2025

New Models | Open weights

OlmoEarth

Ai2 launches OlmoEarth foundation models and open Earth-intelligence platform

Ai2 launched OlmoEarth, a family of foundation models plus an open, end-to-end platform for fast, high-resolution Earth intelligence. It applies the lab's open-model approach to geospatial and remote-sensing data, making Earth observation workloads accessible without proprietary stacks.

🎙️ Hear our coverage →

#open-source #vision #frontier-models

October 2025

Anthropic Oct 16, 2025

New Models

Claude Haiku 4.5

Claude Haiku 4.5: fast, cheap model rivals Sonnet 4 accuracy

Anthropic released Claude Haiku 4.5, its smallest and fastest current-generation model. The show highlighted that it approaches Sonnet 4 level accuracy at a fraction of the cost and latency, making it attractive for high-volume agentic and production workloads.

X announcement ↗ | Official blog ↗

🎙️ Hear our coverage →

#frontier-models #infrastructure

September 2025

xAI Sep 25, 2025

New Models

Grok 4 Fast

xAI ships Grok 4 Fast with 2M context at a fraction of the cost

xAI released Grok 4 Fast, a cost-efficient model with a 2M token context window that unifies reasoning and non-reasoning behavior in one set of weights and prices far below Grok 4. The panel treated it as part of the larger competitive pressure cycle on price and speed among frontier labs.

🎙️ Hear our coverage →

#reasoning #architecture #frontier-models

xAI Sep 4, 2025

New Models

Grok Code 1

Grok Code 1 takes ~50% of coding traffic on OpenRouter

xAI's new Grok Code 1 coding model rocketed to roughly 50% of all coding traffic on OpenRouter shortly after launch, helped by a free promotional period and fast, cheap inference. The panel discussed it as evidence that the coding-model market is highly price- and speed-sensitive.

🎙️ Hear our coverage →

#coding #frontier-models

July 2025

OpenRouter Jul 3, 2025

New Models

Cypher Alpha

Mystery 1M-context model 'Cypher Alpha' appears free on OpenRouter

A stealth model called Cypher Alpha showed up on OpenRouter with a free 1M-token context window, with the panel speculating it could be Amazon Titan. Alex used it as an example of how model releases increasingly arrive as anonymous market probes rather than tidy launches.

OpenRouter listing ↗

🎙️ Hear our coverage →

#architecture #frontier-models

May 2025

OpenAI May 15, 2025

Major Features & Updates

GPT-4.1 in ChatGPT

OpenAI brings the previously API-only GPT-4.1 models into ChatGPT

OpenAI's GPT-4.1 series, previously available only via the API, is now selectable in the ChatGPT interface. The crew used the news to dig into model-picker UX: seven model options in the dropdown, each with its own quirks, speed, and context length, while most casual users don't even know the dropdown exists.

🎙️ Hear our coverage →

#frontier-models #consumer-ai

April 2025

ByteDance Apr 17, 2025

New Models

Seaweed-7B

ByteDance publishes Seaweed-7B video generation foundation model

ByteDance publicly presented Seaweed-7B, a 7B parameter video generation foundation model, showing competitive video quality from a comparatively small model. Details and demos were published at seaweed.video.

seaweed.video ↗

🎙️ Hear our coverage →

#video-gen #frontier-models

Google DeepMind Apr 17, 2025

New Models

Gemini 2.5 Flash

Google launches Gemini 2.5 Flash with controllable thinking budgets

Google answered OpenAI's launch week with Gemini 2.5 Flash, a fast reasoning model that introduces controllable thinking budgets so developers can dial how much the model reasons per request. It is available through the Gemini API and developer platform.

Blog Post ↗ | API Docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #frontier-models #api

OpenAI Apr 17, 2025

New Models

GPT-4.1, 4.1-mini, 4.1-nano

OpenAI launches GPT-4.1 family (4.1, mini, nano) in the API

OpenAI released the GPT-4.1 family of models, available via API only, in three sizes: 4.1, 4.1-mini and 4.1-nano. The family features a 1M token context window, in contrast to o3's 200k, and is aimed at developers building on long-context and coding workloads.

Our Coverage ↗ | Prompting guide ↗

🎙️ Hear our coverage →

#frontier-models #architecture #coding

xAI Apr 10, 2025

APIs & Platforms

Grok 3 API

xAI finally launches the Grok 3 API tier

xAI made Grok 3 and Grok 3 Mini available via API, giving developers programmatic access to its frontier models for the first time. The Grok app also received updates the same week.

xAI API models and pricing ↗ | API Docs ↗ | App Update X Post ↗

🎙️ Hear our coverage (+1 follow-up) →

#api #frontier-models

March 2025

DeepSeek Mar 27, 2025

New Models | Open weights

DeepSeek-V3-0324

DeepSeek silently drops V3-0324, 685B params under MIT license

DeepSeek silently updated their V3 base model with DeepSeek-V3-0324, a 685B parameter MoE released on Hugging Face under the MIT license. This is not R1 (their reasoning model) but the powerful base model R1 was built on, and supposedly the base for a future R2.

685B parameters

X announcement ↗ | Hugging Face ↗

🎙️ Hear our coverage →

#open-source #frontier-models

Google DeepMind Mar 27, 2025

New Models

Gemini 2.5 Pro

Google reclaims #1 with Gemini 2.5 Pro thinking model

Google dropped Gemini 2.5 Pro, a thinking model that took the #1 spot as the best all-around LLM available, with massive jumps on benchmarks like AIME (up nearly 20 points) and GPQA. It inherits native multimodality and a 1M token context window, maintaining high accuracy even at 120k+ tokens on needle-in-a-haystack tests, with surprisingly low latency (~13 seconds on hard reasoning questions vs 45+ for others). Tulsee Doshi, head of product for Gemini models, joined the show to give the inside scoop.

20 point jump on AIME benchmark | 1M token context window | 13 seconds latency on hard reasoning questions (vs 45+ for others)

X announcement (Jeff Dean) ↗ | Official blog post ↗ | Try it at ai.dev ↗

🎙️ Hear our coverage →

#reasoning #architecture #frontier-models

OpenAI Mar 27, 2025

New Models

GPT-4o (2025-03-26)

GPT-4o gets an update, ties for #1 on LMArena beating GPT-4.5

OpenAI shipped a new GPT-4o checkpoint (2025-03-26) that jumped over GPT-4.5 to tie for #1 on LMArena. The update landed as the show was being written, read as a direct response to Gemini 2.5's launch in the escalating frontier-model race.

🎙️ Hear our coverage →

#frontier-models #benchmarks

February 2025

Anthropic Feb 27, 2025

New Models

Claude 3.7 Sonnet

Anthropic releases Claude 3.7 Sonnet, a coding beast with immaculate vibes

Anthropic shipped its long-awaited model update, Claude 3.7 Sonnet, which the crew called a coding BEAST with 'immaculate' vibes. It was one of the week's two huge model drops alongside GPT-4.5 and became an instant favorite for AI coding workflows like those discussed in the Windsurf interview.

🎙️ Hear our coverage →

#coding #frontier-models #reasoning

OpenAI Feb 27, 2025

New Models

GPT-4.5

OpenAI ships GPT-4.5, its largest model yet at roughly 10x scale

OpenAI released GPT-4.5 as breaking news during the show, its first .5-scale jump in two years and reportedly around 10x the scale of the previous model, with speculation of 10+ trillion parameters. Sam Altman said it 'won't crush on benchmarks' against reasoning models, but early vibes praised its creative writing, vision, and medical diagnosis abilities, and it is expected to fuel future o-series reasoners trained on top of it.

X thread ↗ | creative writing ↗ | vision capability ↗ | medical diagnosis ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #research #industry

xAI Feb 20, 2025

New Models

Grok 3

xAI launches Grok 3, claiming SOTA benchmarks and a 1M token context window

xAI dropped Grok 3 on Monday evening, claiming state-of-the-art performance on several benchmarks and a 1 million token context window, with heavy emphasis on agents and future reasoners. The launch was messy, with a bug serving Grok 2 to some users and an eval-methodology spat with OpenAI over best-of-N scores, but vibes shifted positive, with co-hosts calling the base model the best coding model out. It is free for now, 'until their GPUs melt', with no API yet for independent evaluation.

xAI blog ↗ | Try it ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

January 2025

Alibaba (Qwen) Jan 30, 2025

New Models

Qwen2.5-Max

Alibaba launches Qwen2.5-Max flagship model with hidden video gen

Alibaba's Qwen team released Qwen2.5-Max, a large MoE flagship model available through the Qwen Chat interface and API, claiming competitive results against DeepSeek V3 and other frontier models. The chat app also quietly shipped a video generation capability powered by Alibaba's Tongyi Wanxiang.

X announcement ↗ | Try it (Qwen Chat) ↗ | Tongyi Wanxiang ↗

🎙️ Hear our coverage →

#frontier-models #video-gen