Reasoning & Math

Reasoning models, chain-of-thought, math performance, and test-time compute. — 91 releases covered on the show.

July 2026

Liquid AI Jul 7, 2026

Papers & ResearchOpen weights

Antidoom

Liquid AI open-sources Antidoom, removing the reasoning doom-loop

An open method that suppresses the failure mode where reasoning models spiral into repetitive degenerate output: doom-loop rates dropped from 22.9% to 1% on Qwen3.5-4B and from 10.2% to 1.4% on an LFM2.5 checkpoint, with eval scores improving across the board.

22.9%→1% Doom-loop rate, Qwen3.5-4B

X announcement ↗

🎙️ Hear our coverage →

#reasoning #open-source #training

June 2026

Microsoft Jun 4, 2026

New Models

MAI-Thinking-1

Microsoft launches MAI-Thinking-1, a 1T MoE trained from scratch

Microsoft AI used Build 2026 to launch seven MAI models, headlined by MAI-Thinking-1, a 1T total, 35B active MoE reasoning model trained from scratch on 33T tokens without distillation. The panel read the launch as Microsoft becoming a frontier model lab in its own right rather than only an OpenAI distribution channel.

1T MAI Thinking 1 total parameters33T MAI training tokens

Blog ↗Technical Report ↗

🎙️ Hear our coverage →

#reasoning #frontier-models

NVIDIA Jun 4, 2026

New ModelsOpen weights

Nemotron 3 Ultra

NVIDIA releases Nemotron 3 Ultra, a 550B open-weight MoE for agents

NVIDIA dropped Nemotron 3 Ultra the day of the show, a 550B-parameter sparse MoE with 55B active parameters built for long-running agentic harnesses like OpenCode, Hermes, and OpenClaw. Chris Alexiuk joined to explain the hybrid Mamba/Transformer architecture and the unusually complete open release: weights, training data, recipes, a GenRM reward model, and an NVFP4 quantized checkpoint.

550B Nemotron 3 Ultra parameters55B Active parameters

Announcement ↗Technical Report ↗Hugging Face (post-trained BF16) ↗X announcement ↗

🎙️ Hear our coverage →

#open-source #agents #reasoning

May 2026

Anthropic May 28, 2026

New Models

Claude Opus 4.8

Anthropic ships Claude Opus 4.8 live mid-show

Anthropic released Claude Opus 4.8 during the episode, hitting 69.2% on SWE-bench Pro (up from 64.3% on 4.7 and ahead of GPT-5.5 at 58.6%), a new-best 57.9% on Humanity's Last Exam with tools, and 83.4% on OSWorld-Verified. It also shows a real long-context jump past the usual 200K cliff (85.9% GraphWalks BFS at 256K), with new thinking modes in the UI. Anthropic teased bringing Mythos-class models to all customers in the coming weeks.

69.2% SWE-bench Pro

Claude Opus 4.8 — blog ↗Claude Opus 4.8 — system card ↗

🎙️ Hear our coverage →

#frontier-models #coding #reasoning

Google DeepMind May 21, 2026

New Models

Gemini 3.5 Flash

Gemini 3.5 Flash launches at I/O as Google's agentic workhorse model

Google launched Gemini 3.5 Flash at I/O 2026 as a fast, determined workhorse model built for agentic loops rather than a budget-tier Flash like prior generations. It is rolling out across the Gemini app, Search AI Mode, the Gemini API, Google AI Studio, Antigravity and the Gemini Enterprise Agent Platform. Nisten noted unusual determinism in its behavior, and Logan Kilpatrick framed it as designed for the agentic era.

900M Gemini app users

Logan Kilpatrick announcement ↗Noam Shazeer ↗Jeff Dean ↗Koray Kavukcuoglu on rollout ↗

🎙️ Hear our coverage →

#agents #reasoning #frontier-models

OpenAI May 21, 2026

Papers & Research

Erdős planar unit distance result

OpenAI model makes progress on 80-year-old Erdős planar unit distance problem

OpenAI announced that a general-purpose reasoning model made progress on the Erdős planar unit distance problem, challenging an 80-year-old mathematical belief. The panel called it the most important news of the week outside Google I/O, as a sign that frontier reasoning models are starting to contribute to genuinely open mathematics.

80-year Erdos math problem

OpenAI blog post ↗OpenAI on X ↗

🎙️ Hear our coverage →

#reasoning #research

April 2026

Mistral AI Apr 30, 2026

New ModelsOpen weights

Mistral Medium 3.5

Mistral Medium 3.5: 128B dense flagship with 256K context

Mistral launched Medium 3.5, a 128B dense flagship model with 256K context and configurable reasoning, released with weights on Hugging Face. Alongside it Mistral shipped a Vibe coding agent.

Mistral blog ↗Hugging Face ↗Mistral Vibe on X ↗

🎙️ Hear our coverage →

#open-source #reasoning #coding

OpenAI Apr 23, 2026

New Models

GPT-5.5

GPT-5.5 and GPT-5.5 Pro drop live, SOTA across the board

OpenAI shipped GPT-5.5 and GPT-5.5 Pro mid-show, taking state of the art on Terminal-Bench 2 (82.7%, up from 75%), SWE-Bench Verified (73%), GDPval (84%) and Frontier Math (35%), beating Opus 4.7 and Gemini 3.1. It uses ~40% fewer tokens than 5.4, netting roughly 20% cheaper to run despite API pricing doubling to $5/$30 per million ($30/$180 for Pro). Peter Gostev called it the first model that genuinely sustains multi-hour long-running tasks, with one task running 8.5 hours straight; rollout was Codex-first, not yet in ChatGPT.

82.7% Terminal-Bench 28.5 hrs Longest task

OpenAI GPT-5.5 release blog ↗Artificial Analysis GPT-5.5 analysis ↗GPT-5.5 pre-launch leak (Codex dropdown) ↗

🎙️ Hear our coverage →

#reasoning #coding #agents

Anthropic Apr 16, 2026

New Models

Claude Opus 4.7

Claude Opus 4.7 drops live with 87.6% SWE-bench Verified and xhigh effort

Anthropic shipped Claude Opus 4.7 minutes before the show, scoring 87.6% on SWE-bench Verified and 64.3% on SWE-bench Pro, an 11-point jump over Opus 4.6 on the harder agentic coding eval. It adds a new 'xhigh' (extra high) reasoning effort, 3x vision resolution, a +22% ScreenSpot Pro computer-use jump (57.7% to 79.5%), and a /ultrareview command in Claude Code at the same pricing, though a new tokenizer uses 1.0-1.35x more tokens. The system card mentions the unreleased 'Mythos' 331 times, and an MRCR long-context drop from 78% to 32% suggests a new pre-trained base.

87.6% SWE-bench Verified+22% ScreenSpot Pro jump

Claude Opus 4.7 announcement (X) ↗Anthropic blog: Claude Opus 4.7 ↗Opus 4.7 system card (PDF) ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

OpenAI Apr 9, 2026

New Models

GPT-Image-2

OpenAI's GPT-Image-2 leaks on LM Arena under three codenames

OpenAI's GPT-Image-2 posted the biggest single jump ever recorded on Arena, sitting 200+ ELO points above the previous top image model even on medium reasoning. The thinking/reasoning image model generates functioning QR codes, pixel-perfect infographics, 4K output, multi-image character consistency, and equirectangular 360-degree images that Peter Gostev stitched into a walkable street-view reconstruction of ancient Babylon. It even produces screenshots of IDEs containing SVG code that actually renders, enabling a new design-then-implement meta with Codex.

levelsio on X ↗RituWithAI on X ↗DataChaz on X ↗GPT-Image-2 announcement ↗

🎙️ Hear our coverage (+1 follow-up) →

#image-gen #reasoning

March 2026

ARC Prize Foundation Mar 26, 2026

Benchmarks & Evals

ARC-AGI-3

ARC-AGI-3 launches: humans score 100%, frontier models under 1%

ARC Prize launched ARC-AGI-3, an interactive agentic reasoning benchmark of turn-based puzzle games designed to test human-like generalization in novel abstract environments. Humans hit a 100% pass rate while top frontier models score under 1%, which the panel welcomed as a healthy reality check against AGI-is-here rhetoric and easy score inflation.

<1% ARC-AGI-3 frontier model scores100% Human completion on ARC-AGI-3

ARC Prize announcement (X) ↗ARC Prize site ↗

🎙️ Hear our coverage →

#benchmarks #reasoning #agents

MiniMax Mar 19, 2026

New Models

MiniMax M2.7

MiniMax M2.7: first self-evolving model hits 56% on SWE-Bench Pro

MiniMax dropped M2.7, billed as the first self-evolving model: it ran 100+ autonomous RL optimization loops and wrote its own agent scaffolding, built by one engineer over four days with zero lines of human code. It scores 56.22% on SWE-Bench Pro, within one point of Opus 4.6's 57.3%, and WolfBench shows it roughly matching Sonnet 4.6 on OpenClaw agent tasks. Not yet open weights, though rumors suggest a release is coming.

56% MiniMax 2.7 SWE-bench Pro

MiniMax announcement ↗MiniMax on X ↗TestingCatalog on X ↗MiniMax M2.7 announcement (X) ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #reasoning

NVIDIA Mar 13, 2026

New ModelsOpen weights

Nemotron 3 Super 120B

NVIDIA releases Nemotron 3 Super 120B with $26B open-source bet

NVIDIA launched Nemotron 3 Super, a 120B Hybrid Mamba-Transformer MoE model with 12B active parameters, a 1M-token context window, and 450 tok/s throughput. It shipped with BF16/FP8/NVFP4 weights, a base checkpoint, SFT and pre-training data, and the full training recipe, alongside a $26B 5-year open-source commitment. It is available on W&B Inference at $0.20/M input and $0.80/M output.

120B Nemotron 3 Super total parameters12B Nemotron 3 Super active parameters (MoE)1M Nemotron 3 Super context window (tokens)

NVIDIA on X ↗Nemotron 3 Super blog post ↗Nemotron 3 Super on HuggingFace ↗W&B Inference (Nemotron) ↗

🎙️ Hear our coverage →

#open-source #architecture #reasoning

OpenAI Mar 5, 2026

New Models

GPT-5.4

OpenAI drops GPT-5.4 Thinking and GPT-5.4 Pro live during the show

OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro mid-show, a frontier general model that folds Codex-level coding into a unified reasoning model. It ships with a 1M token context window, a /fast mode, and mid-reasoning steering, posting 83.3% on ARC-AGI 2 (Pro) and roughly 75% on OS World computer use. The panel tested it live in Codex and called it a major general-model jump, while noting input pricing rose about 50% versus 5.2.

83.3% ARC-AGI 2 (GPT-5.4 Pro)75% OS World / computer-use score1M Context window

OpenAI GPT-5.4 announcement ↗ARC Prize on GPT-5.4 ↗Alex Volkov's live reaction thread ↗Benchmark breakdown by @nasqret ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

February 2026

Agentica Feb 26, 2026

Benchmarks & Evals

ARC-AGI-3 public set result

Agentica claims to solve all public ARC-AGI-3 tasks

Agentica published a claim of solving all public ARC-AGI-3 tasks, adding to the week's theme of benchmark saturation. The panel discussed it alongside METR and ARC-AGI-2 results as part of weighing signal versus noise in headline benchmark leaps.

Agentica claim on X ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

C Confluence Labs Feb 26, 2026

Benchmarks & Evals

ARC-AGI-2 SOTA result

Confluence Labs exits stealth with 97.9% SOTA on ARC-AGI-2

Confluence Labs emerged from stealth with a 97.9% state-of-the-art result on the ARC-AGI-2 benchmark, publishing code on GitHub. The panel read it as a major signal that ARC-AGI-2 is near saturation, part of a broader pattern of benchmarks getting solved faster than expected.

97.9% ARC-AGI-2

Y Combinator post on X ↗Confluence Labs ARC-AGI-2 GitHub repo ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

Google DeepMind Feb 19, 2026

New Models

Gemini 3.1 Pro

Gemini 3.1 Pro drops live with 44% HLE and 77% ARC-AGI at the same price

Google released Gemini 3.1 Pro minutes before the show, claiming 2.5x better abstract reasoning and improved coding and agentic capabilities at the same price point as its predecessor. It scores 44% on Humanity's Last Exam, 77% on ARC-AGI without a custom harness, and 68 on Terminal Bench, putting it at or near state of the art alongside Opus 4.6. In Nisten's live vibe-coding test it was blazingly fast but less polished than Opus 4.6 and Codex output.

44% Humanities Last Exam77% ARC-AGI

Gemini 3.1 Pro announcement (X) ↗Google DeepMind blog: Gemini 3.1 Pro update ↗Try it in Google AI Studio ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

Google DeepMind Feb 12, 2026

New Models

Gemini 3 Deep Think

Gemini 3 Deep Think scores 84% on ARC-AGI 2

Google dropped an upgraded Gemini 3 Deep Think mid-show, hitting 84% on ARC-AGI 2 — the biggest single jump in the benchmark's history, up from Opus 4.6's 68% set just one week earlier. It also scored 48.4% on Humanity's Last Exam without tools, taking state of the art on both.

84% ARC-AGI 2

Sundar Pichai announcement on X ↗

🎙️ Hear our coverage →

#reasoning #benchmarks

Anthropic Feb 5, 2026

New Models

Claude Opus 4.6

Anthropic ships Claude Opus 4.6 with 1M context and agent teams

Anthropic dropped Opus 4.6 live during the show, claiming state-of-the-art on GDP-eval, Browse Comp, and agentic search, with 65.4% on Terminal Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model picks up contextual clues about reasoning effort. Pricing matches Opus 4.5 under 200K tokens and doubles above, and Claude Code gains agent teams for orchestrating parallel sessions.

1M Context tokens

X announcement ↗Anthropic blog ↗

🎙️ Hear our coverage →

#frontier-models #coding #agents

InternLM (Shanghai AI Lab) Feb 5, 2026

New ModelsOpen weights

Intern-S1-Pro

Intern-S1-Pro: 1 trillion parameter open MoE for scientific reasoning

InternLM released Intern-S1-Pro, a 1 trillion parameter open-source MoE model targeting SOTA scientific reasoning across chemistry, biology, materials, and earth sciences. The panel noted it beats frontier models on science benchmarks, a massive compute investment for an open release.

X announcement ↗Hugging Face ↗Arxiv ↗ModelScope ↗

🎙️ Hear our coverage →

#open-source #reasoning #research

StepFun Feb 5, 2026

New ModelsOpen weights

Step 3.5 Flash

StepFun Step 3.5 Flash: frontier reasoning claims at 11B active params

StepFun released Step 3.5 Flash, a 196B sparse MoE model with only 11B active parameters, claiming frontier-level reasoning while generating at 100-350 tokens per second. It continues the trend of sparse Chinese MoE models delivering high speed at low active parameter counts.

X announcement ↗Hugging Face ↗

🎙️ Hear our coverage →

#open-source #reasoning

January 2026

Google Jan 29, 2026

Major Features & Updates

Gemini 3 Flash Agentic Vision

Google adds Agentic Vision to Gemini 3 Flash

Gemini 3 Flash gains agentic vision: a Think-Act-Observe loop that can zoom, crop, annotate, and plot images by generating and executing Python code in the backend. Available in the Gemini app, AI Studio, and Vertex AI.

Announcement (X) ↗Docs ↗

🎙️ Hear our coverage →

#vision #agents #reasoning

Liquid AI Jan 22, 2026

New ModelsOpen weights

LFM2.5-1.2B-Thinking

Liquid AI's LFM2.5-1.2B-Thinking: on-device reasoning under 900MB

Liquid AI released LFM2.5-1.2B-Thinking, a 1.2B parameter reasoning model that runs entirely on-device with under 900MB of memory. Its hybrid architecture with gated convolutions delivers 239 tokens/sec on an AMD CPU and 82 tokens/sec on a mobile NPU, making it practical for edge devices, Raspberry Pi, and older iPhones.

1.2B Parameters, under 900MB memory

LFM2.5-1.2B-Thinking announcement (X) ↗LFM2.5-1.2B-Thinking on Hugging Face ↗LFM2.5-1.2B-Thinking on Liquid LEAP ↗

🎙️ Hear our coverage →

#open-source #reasoning #on-device

Meituan (LongCat) Jan 15, 2026

New ModelsOpen weights

LongCat Flash Thinking

Meituan's LongCat Flash Thinking: 560B MoE with 27B active, MIT licensed

Meituan released LongCat Flash Thinking, an open-source reasoning MoE with 560B total parameters and only 27B active, under an MIT license. It continued the run of large sparse Chinese open-weights models offering frontier-style reasoning at low active-parameter cost.

560B/27B LongCat Flash

🎙️ Hear our coverage →

#open-source #reasoning

NVIDIA Jan 8, 2026

New ModelsOpen weights

Alpha Mayo

NVIDIA Alpha Mayo: open source reasoning self-driving models

NVIDIA announced Alpha Mayo at CES, a family of open source reasoning-based self-driving AI models. The models perform end-to-end autonomous driving with explicit reasoning steps, like identifying jaywalkers and stopping accordingly, demoed in a Mercedes-Benz.

NVIDIA CES 2026 News ↗

🎙️ Hear our coverage →

#robotics #reasoning #open-source

December 2025

Anthropic Dec 25, 2025

New Models

Claude Opus 4

Claude Opus 4 drops in Q2 — Ryan's pick for best model ever

Claude Opus 4 launched in Q2 and became Ryan Carson's pick as the best coding model he had used in over 700 days of daily LLM coding. It cemented Anthropic's lead in agentic coding through the middle of the year.

🎙️ Hear our coverage →

#coding #reasoning

DeepSeek Dec 25, 2025

New ModelsOpen weights

DeepSeek R1

DeepSeek R1: the open reasoning model that crashed NVIDIA's stock

DeepSeek's open-weights reasoning model dropped January 23rd and matched OpenAI's o1 at roughly 50x cheaper pricing, with an alleged training cost of just $5.5M. It crashed NVIDIA stock 17% — a $560B single-day loss, the largest single-company monetary loss in history — and made Chinese AI a household topic. The crew named it the earthquake that shattered assumptions about who leads AI.

$560B NVIDIA stock loss$5.5M DeepSeek R1 training cost

Jan 24 Episode ↗Jan 30 Episode ↗

🎙️ Hear our coverage →

#open-source #reasoning

DeepSeek Dec 25, 2025

New ModelsOpen weights

DeepSeek V3.1 Terminus

DeepSeek V3.1 Terminus lands amid September's relentless pace

DeepSeek resurfaced in September with V3.1 Terminus, another strong open-weights release that arrived just as the crew was barely keeping up with the weekly firehose. Nisten noted that missing a single week in this period left you completely lost.

🎙️ Hear our coverage →

#open-source #reasoning

Google DeepMind Dec 25, 2025

New Models

Gemini 2.5

Gemini 2.5 takes the #1 benchmark spot in March

Gemini 2.5 briefly claimed the top benchmark position in March, the moment Wolfram identified as the pivotal point where OpenAI stopped being the undisputed leader. It foreshadowed Google's full comeback later in the year.

Mar 27 Episode ↗

🎙️ Hear our coverage →

#reasoning #benchmarks

Google DeepMind Dec 18, 2025

New Models

Gemini 3 Flash

Gemini 3 Flash delivers frontier intelligence at $0.50/1M input tokens

Google launched Gemini 3 Flash, offering frontier-tier capability at flash-tier pricing of $0.50 per million input tokens. It scores 78% on SWE-bench Verified, beating larger models on some agentic tasks, and supports tool-calling at scale with up to 100 simultaneous function calls.

$0.50 per 1M Gemini 3 Flash input tokens78% SWE-bench Verified

Gemini 3 Flash announcement ↗Logan Kilpatrick announcement on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #agents #coding

Amazon Dec 4, 2025

New Models

Amazon Nova 2

Amazon announces Nova 2 family: Lite, Pro, Sonic, and Omni

Amazon rolled out the Nova 2 model suite spanning text, speech, and multimodal stacks with Lite, Pro, Sonic, and Omni variants. The launch came with major benchmark jumps over the first Nova generation and includes a fast, cost-effective reasoning model in Nova 2 Lite.

Amazon Nova 2 launch (AWS blog) ↗Amazon News announcement on X ↗

🎙️ Hear our coverage →

#frontier-models #voice-ai #reasoning

DeepSeek Dec 4, 2025

New ModelsOpen weights

DeepSeek V3.2 / V3.2-Speciale

DeepSeek V3.2 and V3.2-Speciale post gold-medal reasoning under MIT license

DeepSeek released V3.2 and the reasoning-first V3.2-Speciale, a 685B-parameter MoE under MIT license. Speciale posted gold-medal-level olympiad results and 96% on AIME (versus GPT-5 High at 94%), with V3.2 hitting 73.1% on SWE-Bench Verified. Aggressive pricing around 28 cents per 1M tokens on OpenRouter pushes open models closer to top closed-model capability.

96% AIME73.1% SWE-Bench Verified685B Total parameters (MoE)

DeepSeek V3.2 (Hugging Face) ↗DeepSeek V3.2-Speciale (Hugging Face) ↗DeepSeek V3.2 announcement ↗DeepSeek announcement on X ↗

🎙️ Hear our coverage →

#open-source #reasoning #coding

Google DeepMind Dec 4, 2025

Major Features & Updates

Gemini 3 Deep Think

Gemini 3 Deep Think hits 45.1% on ARC-AGI-2 with parallel reasoning

Google shipped Deep Think, a high-cost parallel reasoning mode for Gemini 3 that scored 45.1% on ARC-AGI-2. The panel framed it as Google pressing its advantage in the frontier race, where product integration and latency now matter as much as raw benchmark IQ.

45.1% ARC-AGI-2

Gemini 3 Deep Think blog ↗Gemini App announcement on X ↗

🎙️ Hear our coverage →

#reasoning #frontier-models

November 2025

Anthropic Nov 27, 2025

New Models

Claude Opus 4.5

Anthropic launches Claude Opus 4.5, reclaiming the coding crown

Anthropic released Claude Opus 4.5, scoring 80.9% on SWE-bench Verified to top GPT-5.1 (77.9%) and Gemini 3 Pro (76.2%). It adds a new 'Effort' parameter for compute control, Tool Search to cut agent token overhead, and Programmatic Tool Calling where the model writes and executes code loops. Pricing dropped to $5/M input and $25/M output, roughly one-third the old Opus price.

80.9% SWE-bench Verified$5/M Input token price$25/M Output token price

Claude Opus 4.5 Announcement ↗Claude Opus 4.5 Tool Use Blog ↗Claude Opus 4.5 on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#coding #agents #reasoning

DeepSeek Nov 27, 2025

New ModelsOpen weights

DeepSeek Math V2

DeepSeek Math V2: 685B open-weights model with IMO gold-level math

DeepSeek surfaced DeepSeek Math V2, a 685B-parameter Apache-2.0 model that reaches IMO gold-level math reasoning. It is the first open-weights math champion at this level, dropped quietly on HuggingFace during the week.

685B Parameters

DeepSeek Math V2 on HuggingFace ↗

🎙️ Hear our coverage →

#open-source #reasoning

Prime Intellect Nov 27, 2025

New ModelsOpen weights

INTELLECT-3

Prime Intellect releases INTELLECT-3, a 106B open MoE model

Prime Intellect released INTELLECT-3, a 106B-parameter mixture-of-experts model with 12B active parameters that scores 90% on AIME 2024/2025. The lab fully open-sourced the training stack alongside the weights, showing a small lab can train frontier-scale models.

106B Total parameters (12B active)90% AIME 2024/2025

INTELLECT-3 on HuggingFace ↗INTELLECT-3 Blog ↗INTELLECT-3 Announcement on X ↗Try INTELLECT-3 ↗

🎙️ Hear our coverage →

#open-source #reasoning #architecture

Google DeepMind Nov 20, 2025

New Models

Gemini 3 Pro

Gemini 3 Pro launches with record ARC-AGI-2 scores

Google's new frontier multimodal model with a 1M-token context window and huge reasoning gains, scoring 31.11% on ARC-AGI-2 (45.14% with Deep Think mode) — roughly double the previous SOTA — plus 81% on MMLU-Pro and major coding improvements. Amp switched to it as their default model on launch day, the first time they have ever switched defaults. Also rolling out across Gmail, Calendar, and AI Mode in Google Search.

45.14% ARC-AGI-2 (Deep Think)31.11% ARC-AGI-2 (standard)1M Token context window

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #multimodal #frontier-models

OpenAI Nov 20, 2025

Major Features & Updates

GPT-5.1 Pro

GPT-5.1 Pro: research-grade deep-thinking mode in ChatGPT

OpenAI also shipped GPT-5.1 Pro, a new research-grade ChatGPT mode that will happily think for minutes on a single query. It targets hard research-style questions where extended deliberation pays off, rounding out OpenAI's big week alongside Codex-Max.

🎙️ Hear our coverage →

xAI Nov 20, 2025

New Models

Grok 4.1

Grok 4.1 briefly tops LM Arena with major post-training upgrade

xAI's Grok 4.1 shipped in November alongside GPT-5.1 and Claude Opus 4.5 in the year's most concentrated stretch of frontier releases. Yam highlighted the week-and-a-half window as emblematic of 2025's relentless acceleration.

1483 LM Arena Elo (briefly #1)

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #consumer-ai #reasoning

Baidu Nov 13, 2025

New ModelsOpen weights

ERNIE-4.5-VL-28B-A3B-Thinking

Baidu open-sources ERNIE-4.5-VL-28B-A3B-Thinking visual reasoning model

Baidu released ERNIE-4.5-VL-28B-A3B-Thinking, an Apache 2.0 open-weights visual reasoning MoE with only 3B active parameters that claims to rival much larger models like GPT-5 High on vision tasks. It features image zooming, spatial grounding, and reasoning, with strong small-model performance attributed to GSPO training from the Qwen team.

3B Active Parameters

Baidu announcement on X ↗Hugging Face model page ↗GitHub repo ↗Ernie blog post ↗

🎙️ Hear our coverage →

#open-source #vision #reasoning

OpenAI Nov 13, 2025

New Models

GPT-5.1

OpenAI launches GPT-5.1 with a warmer, more personable voice

OpenAI shipped GPT-5.1, an update to its flagship model focused on a warmer tone and personality upgrades. The panel discussed how the friendlier default voice changes day-to-day ChatGPT use and what it signals for the frontier model race.

Fidji Simo announcement on X ↗Sam Altman on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#frontier-models #reasoning

W WeiboAI Nov 13, 2025

New ModelsOpen weights

VibeThinker-1.5B

WeiboAI releases VibeThinker-1.5B open reasoning model

Weibo's AI team open-sourced VibeThinker-1.5B, a tiny reasoning model that reportedly outperforms much larger models like DeepSeek R1 on select reasoning benchmarks. Part of a week where small open-weights models from Chinese labs kept punching above their weight.

WeiboLLM announcement on X ↗Hugging Face model page ↗Arxiv paper ↗VentureBeat coverage ↗

🎙️ Hear our coverage →

#open-source #reasoning #on-device

Moonshot AI Nov 6, 2025

New ModelsOpen weights

Kimi K2 Thinking

Moonshot AI releases Kimi K2 Thinking, an open 1T-param reasoning MoE

Moonshot AI released Kimi K2 Thinking, an open-source 1-trillion-parameter mixture-of-experts reasoning agent with 256K context and large-scale tool-calling capacity. The panel treated it as the open-source centerpiece of the week, focusing on its reasoning quality and coding utility rather than just benchmark screenshots, and as a sign open models keep closing the usability gap with frontier closed models.

X ↗HF ↗Tech Blog ↗Arxiv ↗

🎙️ Hear our coverage →

#open-source #reasoning #agents

October 2025

OpenAI Oct 30, 2025

New ModelsOpen weights

GPT-OSS-Safeguard

OpenAI ships GPT-OSS-Safeguard, first open-weight safety reasoning models

OpenAI released GPT-OSS-Safeguard, its first open-weight safety reasoning models, built on the GPT-OSS family. The models let developers apply custom safety policies via reasoning rather than fixed classifiers, extending OpenAI's open-weights push into the trust-and-safety layer.

X announcement ↗Hugging Face collection ↗

🎙️ Hear our coverage →

#open-source #safety #reasoning

September 2025

DeepSeek Sep 25, 2025

New ModelsOpen weights

DeepSeek V3.1 Terminus

DeepSeek V3.1 Terminus refines agents and bilingual output

DeepSeek released V3.1 Terminus, an update to V3.1 with cleaner bilingual output, stronger agentic tool use, and cheaper long-context handling. The open weights are available on Hugging Face, continuing DeepSeek's cadence of iterative open releases.

🎙️ Hear our coverage →

#open-source #agents #reasoning

xAI Sep 25, 2025

New Models

Grok 4 Fast

xAI ships Grok 4 Fast with 2M context at a fraction of the cost

xAI released Grok 4 Fast, a cost-efficient model with a 2M token context window that unifies reasoning and non-reasoning behavior in one set of weights and prices far below Grok 4. The panel treated it as part of the larger competitive pressure cycle on price and speed among frontier labs.

🎙️ Hear our coverage →

#reasoning #architecture #frontier-models

J Jeremy Berman & Eric Pang Sep 18, 2025

Papers & Research

ARC-AGI SOTA method

Jeremy Berman and Eric Pang set new ARC-AGI SOTA using Grok-4

Independent researchers Jeremy Berman and Eric Pang published a new state-of-the-art result on ARC-AGI, built on Grok-4 with heavy test-time compute and iterative program synthesis. Berman joins the show to walk through the method, its limitations, and why iteration matters more than leaderboard narratives; the approach is documented in a detailed write-up.

🎙️ Hear our coverage →

#reasoning #benchmarks

Luma AI Sep 18, 2025

New Models

Ray3

Luma's Ray3: a 'reasoning' video model with native HDR

Luma AI launched Ray3, a video generation model it bills as a 'reasoning' video model, with native HDR output, a fast Draft Mode, and Hi-Fi mastering. It is available in Luma's Dream Machine and feeds the episode's closing theme of a next wave of video models.

X ↗Try It ↗

🎙️ Hear our coverage →

#video-gen #reasoning

Mistral AI Sep 18, 2025

New ModelsOpen weights

Magistral-Small-2509

Mistral updates its open reasoning model with Magistral-Small-2509

Mistral published Magistral-Small-2509, an updated checkpoint of its small open-weights reasoning model. The refresh keeps Mistral's open reasoning line current as the open-model competitive baseline moves quickly.

🎙️ Hear our coverage →

#open-source #reasoning

OpenAI Sep 18, 2025

Major Features & Updates

ChatGPT thinking budgets

OpenAI adds thinking budgets to the ChatGPT app

OpenAI rolled out thinking budgets in the ChatGPT app, letting users control how much reasoning effort the model spends on a request. It is a small but notable product lever for tuning the cost-versus-quality tradeoff of reasoning models.

🎙️ Hear our coverage →

#reasoning #consumer-ai

Nous Research Sep 4, 2025

New ModelsOpen weights

Hermes 4 14B

Nous Research releases Hermes 4 14B compact hybrid reasoning model

Nous Research launched Hermes 4 at 14B, a compact hybrid reasoning model with tool calling designed for both local and cloud use. It extends the Hermes 4 family down to a size practical for local deployment while keeping reasoning and tool-use capabilities, with a full tech report published on arXiv.

X ↗HF ↗Tech Report ↗

🎙️ Hear our coverage →

#open-source #reasoning #agents

July 2025

Microsoft Jul 3, 2025

Papers & Research

MAI-DxO

Microsoft's MAI-DxO hits 85.5% on NEJM diagnostic cases vs 20% for doctors

Microsoft AI published MAI-DxO, a medical diagnostic orchestration system that reached 85.5% accuracy on challenging NEJM-style cases compared to roughly 20% for practicing physicians. The result is framed as a systems win rather than a single-model win, suggesting orchestration may outperform individual models in high-stakes expert workflows.

85.5% MAI-DxO accuracy

Mustafa Suleyman on X ↗Microsoft AI blog ↗

🎙️ Hear our coverage →

#research #reasoning #agents

Tencent Jul 3, 2025

New ModelsOpen weights

Hunyuan-A13B-Instruct

Tencent ships Hunyuan-A13B: 80B MoE with only 13B active params

Tencent released Hunyuan-A13B-Instruct, an 80B-parameter MoE that activates only 13B parameters at inference while keeping a 256K context window. Built by the team with WizardLM lineage, it posts strong reasoning benchmarks and feels unusually practical for its class, though the panel flagged its license limits.

13B Hunyuan active params

X announcement ↗Hugging Face ↗Try it ↗

🎙️ Hear our coverage →

#open-source #architecture #reasoning

May 2025

DeepSeek May 29, 2025

New ModelsOpen weights

DeepSeek-R1-0528

DeepSeek drops R1-0528, an updated open reasoning model with big gains

DeepSeek released R1-0528 out of nowhere, an update to their open-weights reasoning model with serious performance jumps: AIME 91, LiveCodeBench 73, and SWE-bench Verified 57.6. They also shipped an 8B distilled version based on Qwen3 that can run on a laptop, keeping it among the best open-weight models available.

91 AIME score, beating previous R1 by a mile8B Distilled Qwen3-based version runnable on a laptop

🎙️ Hear our coverage →

#open-source #reasoning

UC Berkeley May 29, 2025

Papers & Research

Intuitor (Learning to Reason Without External Rewards)

Paper: models can learn to reason without external rewards

A mind-bending paper showing that reinforcement learning with internal or even random rewards can improve reasoning models. Intuitor matched or exceeded some GRPO results (the external-reward framework DeepSeek popularized with R1) when finetuning Qwen2.5 3B, questioning how much of RL's gains come from the reward signal itself.

3B Qwen2.5 model size where Intuitor matched or exceeded GRPO results

X announcement ↗

🎙️ Hear our coverage →

#reasoning #training #research

A A-M Team May 15, 2025

New ModelsOpen weights

AM-Thinking v1

AM-Thinking v1: 32B dense reasoning model beats bigger MoEs at math and code

A 32B dense open-weights reasoning LLM from a new Chinese team that takes on much larger mixture-of-experts models and comes out on top for math and code, hitting 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard. It supports a /think reasoning toggle, ships with a permissive license, is tooled for vLLM, LM Studio, and Ollama, and runs at 25 tokens/sec on a single 80GB GPU with INT4 quantization. A multilingual RLHF pass and 128k context window are in the works.

32B dense parameters85.3% AIME 202425 tokens/sec on a single 80GB GPU with INT4

Hugging Face ↗Paper ↗Project page ↗

🎙️ Hear our coverage →

#open-source #reasoning

ByteDance May 15, 2025

New Models

Seed1.5-VL

ByteDance publishes Seed1.5-VL, a 20B vision-language thinking model

ByteDance's Seed team published the technical report for Seed1.5-VL, a 20B-parameter vision-language model with thinking capabilities. It was covered among the big-company releases of the week, with the tech report shared on GitHub.

Technical report ↗

🎙️ Hear our coverage →

#vision #multimodal #reasoning

Alibaba (Qwen) May 1, 2025

New ModelsOpen weights

Qwen 3

Alibaba open-weights the full Qwen 3 family under Apache 2.0

Alibaba released the entire Qwen 3 stack: two MoE models (235B total/22B active and 30B/3B active) plus six dense siblings from 32B down to 0.6B, all Apache 2.0 with day-one support in LM Studio, Ollama, vLLM, MLX and llama.cpp. The headline feature is a runtime hybrid 'thinking' toggle (/think and /no_think) that trades latency for reasoning depth. Trained on ~36T tokens with 128K context and 119-language coverage, the 235B MoE rivals DeepSeek-R1, o1, o3-mini and Gemini 2.5 Pro on coding and math.

235 B Flagship MoE total parameters (22B active)30 B Qwen3-30B-A3B hit 57 tok/s on a Mac with speculative decoding36 Trillions of pre-training tokens (2x Qwen 2.5)

Qwen 3 blog post ↗GitHub ↗Hugging Face collection ↗HF demo ↗

🎙️ Hear our coverage →

#open-source #reasoning #architecture

Microsoft May 1, 2025

New ModelsOpen weights

Phi-4-reasoning

Microsoft ships Phi-4-reasoning and Phi-4-reasoning-plus (14B, MIT)

Microsoft fine-tuned the 14B Phi-4 on 1.4M curated chain-of-thought traces (SFT) and added a small RL stage (Plus variant) to create two MIT-licensed reasoning models. They punch far above their weight: Phi-4-reasoning-plus outperforms DeepSeek-R1-Distill-70B on AIME 25 (78% vs 51%) and sits within a few points of the full 671B DeepSeek-R1, while running on a single GPU with explicit <think> scaffolding.

ArXiv paper ↗Tech report ↗Hugging Face: Phi-4-reasoning ↗Suriya's thread ↗

🎙️ Hear our coverage →

#open-source #reasoning #on-device

Xiaomi May 1, 2025

New ModelsOpen weights

MiMo-7B

Xiaomi enters open weights with MiMo-7B, MIT-licensed reasoning family

Xiaomi's first open-weights release is a 7B dense family (Base, SFT, RL, RL-Zero) trained from scratch on 25T tokens with a multi-token-prediction objective and rule-verifiable reinforcement learning. The RL variant matches OpenAI o1-mini on benchmark suites despite being far smaller, scoring 55.4% on AIME 2025 and 49.3% on LiveCodeBench v6, all under an MIT license with vLLM-ready weights.

Hugging Face model hub ↗

🎙️ Hear our coverage →

#open-source #reasoning #training

April 2025

Google DeepMind Apr 17, 2025

New Models

Gemini 2.5 Flash

Google launches Gemini 2.5 Flash with controllable thinking budgets

Google answered OpenAI's launch week with Gemini 2.5 Flash, a fast reasoning model that introduces controllable thinking budgets so developers can dial how much the model reasons per request. It is available through the Gemini API and developer platform.

Blog Post ↗API Docs ↗

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #frontier-models #api

OpenAI Apr 17, 2025

New Models

o3 & o4-mini

OpenAI launches o3 and o4-mini, SOTA reasoning models with tool use

OpenAI shipped o3 and o4-mini in ChatGPT and the API, with o3 setting new SOTA records on Codeforces, SWE-bench, MMMU and more. For the first time the models can use tools (web search, Python, image generation) during the reasoning process, and they can think visually by cropping, zooming and rotating images. o3 scored $65k on the Freelancer eval versus o1's $28k, and o4-mini hits 99.5% on AIME with a Python interpreter.

$65 o3 score on the Freelancer eval ($65k vs o1's $28k)99.5% o4-mini on AIME with Python interpreter200 context window (200k tokens)

Blog ↗Watch Party ↗

🎙️ Hear our coverage →

#reasoning #agents #multimodal

Prime Intellect Apr 17, 2025

New ModelsOpen weights

INTELLECT-2

Prime Intellect launches INTELLECT-2, a 32B globally-distributed RL run

Prime Intellect released INTELLECT-2, a 32B reasoning model trained with globally decentralized reinforcement learning, a follow-up to the INTELLECT-1 decentralized pretraining run covered on the show in December. The release includes open weights on Hugging Face, a tech report, and the PRIME-RL training code.

Blog ↗X ↗Blog ↗Tech report ↗

🎙️ Hear our coverage (+1 follow-up) →

#open-source #training #reasoning

Zhipu AI (Z.ai) Apr 17, 2025

New ModelsOpen weights

GLM-4-0414

Z.ai (formerly chatGLM) releases the GLM-4-0414 open-source family

Z.ai, the rebranded Zhipu AI / chatGLM team, released the GLM-4-0414 family of open-source models. The drop includes base, reasoning and rumination variants published on Hugging Face and GitHub.

X ↗HF Collection ↗GitHub ↗

🎙️ Hear our coverage →

#open-source #reasoning

ByteDance Apr 10, 2025

Papers & Research

Seed-Thinking-v1.5

ByteDance publishes Seed-Thinking-v1.5 reasoning model tech report

ByteDance's Seed team published Seed-Thinking-v1.5, a new reasoning model announced via a technical report on GitHub. It was mentioned among the week's open-source LLM news, though weights were not released at the time.

GitHub: Seed-Thinking-v1.5 ↗

🎙️ Hear our coverage →

#reasoning #research

D Deep Cogito Apr 10, 2025

New ModelsOpen weights

Cogito v1 Preview (3B-70B)

Deep Cogito debuts Cogito v1 Preview models from 3B to 70B, beating DeepSeek 70B

New lab Deep Cogito released the Cogito v1 Preview family of open models ranging from 3B to 70B parameters, claiming SOTA results at each size and beating DeepSeek's 70B distill. The models are available on Hugging Face, giving local AI enthusiasts the small-to-mid sizes Llama 4 skipped.

3B-70B Model size range

Deep Cogito research blog: Cogito v1 Preview ↗Hugging Face: cogito-v1-preview-llama-70B ↗

🎙️ Hear our coverage →

#open-source #reasoning

Moonshot AI (Kimi) Apr 10, 2025

New ModelsOpen weights

Kimi-VL & Kimi-VL-Thinking

Moonshot drops Kimi-VL and Kimi-VL-Thinking, tiny A3B open vision models

Moonshot AI released Kimi-VL and Kimi-VL-Thinking, compact vision-language models with only ~3B active parameters (A3B MoE). The thinking variant adds reasoning to a tiny VLM, and both are available openly on Hugging Face.

A3B ~3B active parameters (MoE)

Hugging Face collection: Kimi-VL-A3B ↗

🎙️ Hear our coverage →

#open-source #vision #reasoning

NVIDIA Apr 10, 2025

New ModelsOpen weights

Llama-3.1-Nemotron-Ultra-253B

NVIDIA ships Nemotron Ultra, a 253B pruned and distilled Llama 3.1-405B

NVIDIA released Nemotron Ultra, a pruned and distilled finetune of Llama 3.1-405B at roughly half the parameters (253B). Its benchmarks even included Llama 4 comparisons, showing the older finetuned Llama beating the new models on AIME, GPQA and more. It supports 128K context and fits on a single 8xH100 node for inference.

253B Parameters (pruned from Llama 3.1-405B)128K Context window

Hugging Face: Llama-3_1-Nemotron-Ultra-253B-v1 ↗Announcement on X ↗

🎙️ Hear our coverage →

#open-source #training #reasoning

Together AI & Agentica (UC Berkeley) Apr 10, 2025

New ModelsOpen weights

DeepCoder-14B-Preview

DeepCoder-14B: open RL-finetuned coder beats DeepSeek R1 and o3-mini on coding

Together AI and Agentica (UC Berkeley Sky Computing Lab) released DeepCoder-14B-Preview, a reasoning model finetuned with RL that beats DeepSeek R1 and even o3-mini on several coding benchmarks. The project aims to democratize RL: the team open-sourced the model, the training dataset, the Weights & Biases logs, and the eval logs. Guest Michael Luo from Agentica joined the show to discuss the release.

14B Model parameters

Together AI blog: DeepCoder ↗Announcement on X ↗Hugging Face: DeepCoder-14B-Preview ↗Hugging Face dataset: DeepCoder-Preview-Dataset ↗

🎙️ Hear our coverage →

#open-source #coding #reasoning

Google DeepMind Apr 3, 2025

Benchmarks & Evals

Gemini 2.5 Pro USAMO results

Gemini 2.5 Pro scores 24.4% on USAMO olympiad math, crushing the field

New evaluation results published this week showed Gemini 2.5 Pro scoring 24.4% on the USA Math Olympiad (USAMO), problems so hard that most top models score under 5%. The result showcases a step change in frontier reasoning ability on competition mathematics.

24.4% Gemini 2.5 Pro USAMO score<5% typical score for other top models

🎙️ Hear our coverage →

#reasoning #benchmarks

H HKU NLP (University of Hong Kong) Apr 3, 2025

New Models

Dream 7B

Dream 7B: a diffusion language model challenger unveiled

Researchers unveiled Dream 7B, a diffusion-based language model that posts strong benchmark results, notably on planning-style tasks like Sudoku, possibly because parallel generation handles global constraints better than autoregression. It hints at viable alternative LLM architectures, but the weights were not yet released at show time, so results could not be independently verified.

Dream 7B blog post ↗Benchmark results thread (Sudoku) ↗

🎙️ Hear our coverage →

#architecture #research #reasoning

March 2025

ARC Prize Foundation Mar 27, 2025

Benchmarks & Evals

ARC-AGI 2

ARC-AGI 2 benchmark revealed, thinking models score just 4%

The ARC Prize Foundation revealed ARC-AGI 2, the next iteration of the abstract reasoning benchmark. Base LLMs score 0% and even thinking models only reach about 4%, showing how far current frontier models remain from human-level fluid intelligence.

0% base LLM score on ARC-AGI 24% thinking model score on ARC-AGI 2

X announcement ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

Google DeepMind Mar 27, 2025

New Models

Gemini 2.5 Pro

Google reclaims #1 with Gemini 2.5 Pro thinking model

Google dropped Gemini 2.5 Pro, a thinking model that took the #1 spot as the best all-around LLM available, with massive jumps on benchmarks like AIME (up nearly 20 points) and GPQA. It inherits native multimodality and a 1M token context window, maintaining high accuracy even at 120k+ tokens on needle-in-a-haystack tests, with surprisingly low latency (~13 seconds on hard reasoning questions vs 45+ for others). Tulsee Doshi, head of product for Gemini models, joined the show to give the inside scoop.

20 point jump on AIME benchmark1M token context window13 seconds latency on hard reasoning questions (vs 45+ for others)

X announcement (Jeff Dean) ↗Official blog post ↗Try it at ai.dev ↗

🎙️ Hear our coverage →

#reasoning #architecture #frontier-models

ByteDance Mar 20, 2025

Papers & ResearchOpen weights

DAPO

ByteDance releases DAPO, an RL method that beats GRPO

ByteDance published DAPO, a reinforcement learning method for LLM post-training presented as an improvement over GRPO. The paper ships with an open GitHub implementation, making the technique reproducible for the open-source RL community.

X thread ↗Github ↗Paper ↗

🎙️ Hear our coverage →

#training #reasoning #research

LG AI Research Mar 20, 2025

New ModelsOpen weights

EXAONE Deep 32B

LG open sources EXAONE and EXAONE Deep 32B reasoning model

LG AI Research open sourced its EXAONE family, headlined by EXAONE Deep 32B, a thinking/reasoning model. The release puts a large Korean lab's reasoning model in open weights on Hugging Face, and Alex published a live reaction video to the launch.

LG Blog ↗HuggingFace page ↗Alex Reaction Video ↗

🎙️ Hear our coverage →

#open-source #reasoning

NVIDIA Mar 20, 2025

New ModelsOpen weights

Llama-Nemotron (Super 49B, Nano 8B)

NVIDIA drops Llama-Nemotron reasoning models plus training dataset

NVIDIA released the Llama-Nemotron family, including Super 49B and Nano 8B reasoning models, announced around GTC. Alongside the open weights, NVIDIA published the Llama-Nemotron post-training dataset, giving the community both the models and the data recipe behind them.

Announcement ↗X ↗Llama-Nemotron HuggingFace Collection ↗Dataset ↗

🎙️ Hear our coverage →

#open-source #reasoning #training

OpenAI Mar 20, 2025

APIs & Platforms

o1-pro API

OpenAI makes o1-pro available via API at $600 per 1M output tokens

OpenAI exposed its o1-pro reasoning model through the API for the first time, priced at $600 per million output tokens. The show jokingly framed the pricing as 'for oligarchs', but it makes OpenAI's highest-compute reasoning tier programmatically accessible.

🎙️ Hear our coverage →

#reasoning #api

Nous Research Mar 13, 2025

New ModelsOpen weights

DeepHermes 3 (24B / 3B)

Nous Research releases DeepHermes 24B and 3B hybrid reasoning models

Nous Research released DeepHermes hybrid reasoners at 24B (Mistral-based) and 3B sizes, models that can toggle between standard chat responses and long chain-of-thought reasoning. The 24B preview is available on Hugging Face as part of the week's wave of open-source reasoning model releases.

X announcement ↗Hugging Face ↗

🎙️ Hear our coverage →

#open-source #reasoning

Reka AI Mar 13, 2025

New ModelsOpen weights

Reka Flash 3

Reka Flash 3: 21B open-source reasoning model under Apache 2.0

Reka AI open sourced Reka Flash 3, a 21B parameter reasoning model released under an Apache 2.0 license and trained with the REINFORCE Leave One-Out (RLOO) reinforcement learning technique. It excels at chat, coding, instruction following, and function calling, with Nisten calling it possibly one of the best ~20B models available.

Blog ↗Hugging Face ↗X announcement ↗

🎙️ Hear our coverage →

#open-source #reasoning

Alibaba (Qwen) Mar 6, 2025

New ModelsOpen weights

QwQ-32B

Qwen releases QwQ-32B reasoning model that matches R1 on some evals

Alibaba's Qwen team released QwQ-32B, an open-weights reasoning model that matches DeepSeek R1 on several evals despite being roughly 20x smaller at 32B parameters. Qwen tech lead Junyang Lin joined the show to announce it, and the episode dubbed it Alibaba's 'R1 killer' for bringing strong reasoning to a size that runs on consumer hardware.

Announcement (X) ↗Blog ↗Hugging Face ↗Chat Demo ↗

🎙️ Hear our coverage →

#open-source #reasoning

February 2025

Anthropic Feb 27, 2025

New Models

Claude 3.7 Sonnet

Anthropic releases Claude 3.7 Sonnet, a coding beast with immaculate vibes

Anthropic shipped its long-awaited model update, Claude 3.7 Sonnet, which the crew called a coding BEAST with 'immaculate' vibes. It was one of the week's two huge model drops alongside GPT-4.5 and became an instant favorite for AI coding workflows like those discussed in the Windsurf interview.

🎙️ Hear our coverage →

#coding #frontier-models #reasoning

Perplexity Feb 20, 2025

New ModelsOpen weights

R1-1776

Perplexity releases R1-1776, a censorship-free DeepSeek R1 fine-tune

Perplexity open-sourced R1-1776, a fine-tuned version of DeepSeek R1 designed to remove Chinese government censorship on topics like Tiananmen Square and Taiwanese independence. They used human experts to identify around 300 sensitive topics and built a censorship classifier to train the bias out, claiming no significant impact on standard eval performance. The name 1776 is a nod to American independence.

Hugging Face ↗Blog post ↗

🎙️ Hear our coverage →

#open-source #reasoning #safety

xAI Feb 20, 2025

New Models

Grok 3

xAI launches Grok 3, claiming SOTA benchmarks and a 1M token context window

xAI dropped Grok 3 on Monday evening, claiming state-of-the-art performance on several benchmarks and a 1 million token context window, with heavy emphasis on agents and future reasoners. The launch was messy, with a bug serving Grok 2 to some users and an eval-methodology spat with OpenAI over best-of-N scores, but vibes shifted positive, with co-hosts calling the base model the best coding model out. It is free for now, 'until their GPUs melt', with no API yet for independent evaluation.

xAI blog ↗Try it ↗

🎙️ Hear our coverage →

#frontier-models #reasoning #coding

January 2025

Exa Jan 30, 2025

Major Features & Updates

Exa DeepSeek Chat

Exa ships free DeepSeek R1 chat demo with web search

Exa integrated DeepSeek R1 into a free hosted chat demo that combines the reasoning model with Exa's web search. Mentioned in the tools section as a no-cost way to try R1 grounded with live search results.

🎙️ Hear our coverage →

#reasoning #search #agents

O Open Thoughts Jan 30, 2025

DatasetsOpen weights

OpenThoughts-114k

Open Thoughts releases OpenThoughts-114k reasoning dataset

An open reasoning dataset with 114k examples released by the Open Thoughts project to fuel open replication of reasoning models like DeepSeek R1. It gives the open-source community high-quality chain-of-thought training data for distilling and fine-tuning reasoning LLMs.

X announcement ↗Hugging Face ↗

🎙️ Hear our coverage →

#open-source #reasoning #training

Perplexity Jan 30, 2025

Major Features & Updates

Perplexity Pro with R1

Perplexity adds DeepSeek R1 as a Pro reasoning model option

Perplexity integrated DeepSeek R1 into its Pro search product, letting subscribers choose R1 as the reasoning model behind answers. It was one of several tools that raced to host R1 on Western infrastructure within days of the model's release.

🎙️ Hear our coverage →

#reasoning #search #agents

UC Berkeley Jan 30, 2025

Papers & ResearchOpen weights

TinyZero & RAGEN

Berkeley TinyZero and RAGEN replicate DeepSeek R1-Zero

Berkeley researchers released TinyZero and RAGEN, open replications of DeepSeek's R1-Zero reinforcement-learning recipe on small models. The projects showed that R1-style emergent reasoning behavior can be reproduced cheaply, with training runs logged publicly on Weights & Biases.

GitHub ↗W&B logs ↗

🎙️ Hear our coverage →

#reasoning #training #open-source

Center for AI Safety & Scale AI Jan 23, 2025

Benchmarks & Evals

Humanity's Last Exam (HLE)

Humanity's Last Exam: a deliberately unsaturated frontier benchmark

Humanity's Last Exam (HLE) launched as a new, very hard benchmark designed to stay unsaturated as models max out MMLU and math evals. It crowdsourced expert-level questions to measure frontier model capability where existing benchmarks are at 98-99% saturation.

Humanity's Last Exam website ↗

🎙️ Hear our coverage →

#benchmarks #reasoning

DeepSeek Jan 23, 2025

New ModelsOpen weights

DeepSeek R1

DeepSeek R1: MIT-licensed open source reasoning model rivals o1

DeepSeek released R1, a state-of-the-art open source reasoning model under a permissive MIT license. It matches or beats OpenAI's o1 on key reasoning benchmarks while being fully open weights, and DeepSeek also shipped a family of distilled smaller models. The show called this the hottest week open source AI has ever had.

DeepSeek on Hugging Face ↗Combine DeepSeek R1 reasoning with GPT-3.5 Turbo (egghead) ↗Run DeepSeek with more thinking (Gist) ↗

🎙️ Hear our coverage →

#open-source #reasoning

Google DeepMind Jan 23, 2025

New Models

Gemini 2.0 Flash Thinking 01-21

Google ships updated Gemini Flash Thinking with 1M context

Google released an updated Gemini Flash Thinking model (01-21) with a 1 million token context window, built-in code execution, and improved evals over the previous Thinking release. It pushes Google's reasoning-model line forward in the same week DeepSeek R1 landed.

1M Context window (tokens)

Noam Shazeer announcement on X ↗

🎙️ Hear our coverage →

#reasoning #architecture

P Pietro Schirano Jan 23, 2025

Dev ToolsOpen weights

RAT (Retrieval Augmented Thinking)

RAT: pipe DeepSeek R1 reasoning into other models

Guest Pietro Schirano released RAT (Retrieval Augmented Thinking), a technique and tool that extracts DeepSeek R1's reasoning traces and feeds them to a cheaper, faster model like GPT-3.5 Turbo for the final answer. It showcases the new pattern of mixing open reasoning traces with closed completion models.

RAT announcement on X ↗Combine DeepSeek R1 reasoning with GPT-3.5 Turbo (egghead) ↗

🎙️ Hear our coverage →

#reasoning #coding