Everything AI Released in February 2026

57 releases covered live on the show — every model, product, paper and tool that mattered, with links and our analysis.

🧠 New Models 32

Alibaba (Qwen)
New Models · Open weights

Qwen 3.5

Qwen 3.5 lands: 35B/3B-active Medium outperforms the old 235B flagship

Alibaba released the Qwen 3.5 family of open-weight models, headlined by Qwen3.5-35B-A3B, a 35B model with only 3B active parameters that outperforms Alibaba's previous 235B flagship. Variants include a 122B-A10B and a dense 27B. The panel highlighted the hybrid state-space (Mamba-layer) architecture and strong practical coding and agent performance at a tiny active-parameter footprint.

35B / 3B active Qwen 3.5 Medium
Liquid AI
New Models · Open weights

LFM2-24B-A2B

Liquid AI releases LFM2-24B-A2B, a laptop-friendly 24B MoE

Liquid AI released LFM2-24B-A2B, a 24B mixture-of-experts model with only 2.3B active parameters that runs on consumer laptops. The panel highlighted its speed and surprisingly strong non-coding reasoning, reinforcing the trend of efficient low-active-parameter open models for local use.

Alibaba (Qwen)
New Models · Open weights

Qwen3.5-397B-A17B

Alibaba opens Qwen 3.5: 397B-param multimodal MoE with only 17B active

Alibaba released Qwen3.5-397B-A17B, billed as the first open-weight native multimodal MoE model, with 397B total parameters, just 17B active, 512 experts, and 262K native context extendable to 1M. It delivers 8.6-19x faster inference than Qwen3-Max and continues Qwen's strength in multilingual and medical tasks. It scores 52.5% on Terminal Bench, third place among open-source models, though Nisten found its coding still trails GLM-5.
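
To put "only 17B active" in perspective, the sketch below computes the share of parameters used per token for the mixture-of-experts models named so far. The function name and the dictionary are illustrative, and the label for the 122B variant is an assumed spelling. The parameter counts are the ones quoted above, in billions.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Return the share of a MoE model's parameters that are active per token."""
    if total_b <= 0 or active_b <= 0 or active_b > total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


# (total, active) parameter counts in billions, as quoted in this roundup.
# The "Qwen 3.5 122B-A10B" label is an assumed name for the 122B variant.
MODELS = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Qwen 3.5 122B-A10B": (122, 10),
    "LFM2-24B-A2B": (24, 2.3),
    "Qwen3.5-397B-A17B": (397, 17),
}

for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

The active share mainly governs per-token compute. All weights still have to be stored, so memory footprint follows the total count rather than the active one.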

397B Qwen 3.5 Parameters
Anthropic
New Models

Claude Sonnet 4.6

Anthropic ships Claude Sonnet 4.6 with 79.6% SWE-Bench and 1M context

Anthropic launched Claude Sonnet 4.6, its most capable Sonnet to date, scoring 79.6% on SWE-Bench Verified, nearly matching Opus 4.6 at Sonnet pricing of $3/$15 per million input/output tokens. It ships with a 1M token context window in beta and is now the default model on claude.ai. In blind Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time, and it beats the previous Gemini 3 Pro on most benchmarks.
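
As a rough illustration of what $3/$15 per million tokens means per request, here is a minimal cost sketch. It assumes the first figure is the input rate and the second the output rate, and it ignores caching discounts or long-context surcharges, which the summary above does not cover. The function and constant names are made up for this example.

```python
# Rates quoted above, in USD per million tokens (assumed input / output order).
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = INPUT_USD_PER_MTOK,
    output_rate: float = OUTPUT_USD_PER_MTOK,
) -> float:
    """Estimate the cost of one request from its token counts."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical request: a 200K-token prompt with a 4K-token answer.
print(f"${request_cost_usd(200_000, 4_000):.2f}")
```

Because output tokens cost five times as much as input tokens at these rates, long generations can dominate the bill even when the prompt is much larger.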

79.6% SWE-Bench Verified
ByteDance
New Models

Seed 2.0

ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing

ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding surpasses the human benchmark at 77% vs 73%. At 84% cheaper than Opus 4.5 with near-comparable quality, the panel called it a compelling option for price-conscious developers.

Cohere Labs
New Models · Open weights

Tiny Aya

Cohere Labs releases Tiny Aya, a 3.35B multilingual model for 70+ languages

Cohere Labs released Tiny Aya, a 3.35B-parameter multilingual model family supporting 70+ languages that is small enough to run locally on phones. It extends Cohere's Aya line of open multilingual models, bringing broad language coverage to on-device deployments.

Google DeepMind
New Models

Gemini 3.1 Pro

Gemini 3.1 Pro drops live with 44% HLE and 77% ARC-AGI at the same price

Google released Gemini 3.1 Pro minutes before the show, claiming 2.5x better abstract reasoning and improved coding and agentic capabilities at the same price point as its predecessor. It scores 44% on Humanity's Last Exam, 77% on ARC-AGI without a custom harness, and 68% on Terminal Bench, putting it at or near state of the art alongside Opus 4.6. In Nisten's live vibe-coding test it was blazingly fast but less polished than Opus 4.6 and Codex output.

44% Humanity's Last Exam · 77% ARC-AGI
Google DeepMind
New Models

Lyria 3

Google DeepMind launches Lyria 3 music generation in the Gemini app

Google DeepMind launched Lyria 3, its most advanced AI music generation model, now available in the Gemini app. It generates 32-second high-fidelity music tracks with creative controls and can compose music from uploaded images. Google also published a prompt guide covering vocals, lyrics, and different styles.

xAI
New Models

Grok 4.20

xAI silently drops Grok 4.20 with four 500B-param collaborating agents

xAI released Grok 4.20, a multi-agent system where four 500B-parameter agents collaborate in a multi-agent UI, with a $300/month Heavy tier scaling to 16 agents. No benchmarks or evals were released with the drop. The panel found it underwhelming for coding and day-to-day agent work but still top tier for deep research thanks to xAI's RAG over X data; Grok 4.1 Fast remains #8 on OpenRouter by API usage.

500B×4 Grok 4.20 Architecture
Zyphra
New Models · Open weights

ZUNA

Zyphra opens ZUNA, a 380M-param EEG brain-computer interface model

Zyphra released ZUNA, a 380M-parameter open-source BCI foundation model that translates EEG brain signals into text, reconstructing clinical-grade brain signals from sparse, noisy data. Dubbed 'thought to text' by the community, it works with roughly $500 non-invasive EEG headsets, likely needs personalized training per user, and is small enough to run in real time on a consumer gaming GPU. It is Apache licensed.

ByteDance
New Models

Seedance 2.0

ByteDance Seedance 2.0 shatters video generation reality

ByteDance launched Seedance 2.0, a unified multimodal video generation model that accepts up to 9 images, 3 videos, and 3 audio clips as references and produces 15-second multi-shot clips with native stereo audio and strong character consistency (a 45-second internal test mode also exists). The panel compared the quality jump to seeing Sora for the first time. Available on the BytePlus platform.

MiniMax
New Models · Open weights

MiniMax M2.5

MiniMax M2.5 hits 80.2% SWE-Bench Verified with 10B active params

MiniMax dropped M2.5 thirty minutes before the show: a 200B-total, 10B-active open-weights model scoring 80.2% on SWE-Bench Verified, approaching Opus 4.6 at roughly 1/20th the cost (~15 cents per task with a 57% win rate over Opus). It was trained with MiniMax's decoupled Forge RL framework and optimized for end-to-end task time, with fewer tool calls and thinking tokens. Senior researcher Olive Song joined live and revealed the model was still training; the team cut a checkpoint for early release.

80.2% SWE-Bench Verified · 15¢ Cost per task
OpenAI
New Models

GPT 5.3 Codex Spark

OpenAI ships GPT 5.3 Codex Spark on Cerebras for real-time coding

OpenAI released GPT 5.3 Codex Spark, a smaller Codex variant built for real-time coding, served on Cerebras hardware — OpenAI's first model on Cerebras — with reported speeds of over 1000 tokens/sec. Available to ChatGPT Pro users in the Codex app, CLI, and IDE extension. It broke during the show as the second breaking-news drop of the episode.

1000+ tps Codex Spark speed
Zhipu AI (Z.ai)
New Models · Open weights

GLM-5

Z.ai launches GLM-5, the open-weights agentic coding crown

Z.ai released GLM-5, a 744B-parameter MoE model (40B active) trained on 28.5 trillion tokens that takes the #1 open-source ranking for agentic coding with 77.8% on SWE-Bench Verified. It introduces the slime asynchronous RL framework for post-training, adopts DeepSeek's sparse attention to cut deployment cost, and was trained on Huawei chips rather than NVIDIA. Lou from Z.ai joined the show live and summed it up as bigger, faster, better, and cheaper.

744B GLM-5 Parameters · 28.5T Training tokens
Alibaba (Qwen)
New Models · Open weights

Qwen3-Coder-Next

Qwen3-Coder-Next hits 70.6% SWE-Bench Verified with 3B active params

Alibaba's Qwen3-Coder-Next is an 80B MoE coding agent model with only 3B active parameters that scores 70.6% on SWE-Bench Verified and 44% on the much harder SWE-Bench Pro. It was trained on 7.5T tokens with 20,000 parallel RL environments and runs under 48GB of RAM with GGUF quantization, making near-frontier agentic coding feasible on local hardware.
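
A back-of-the-envelope check on the 48GB claim: weight memory is roughly parameter count times bits per weight. The sketch below assumes a single uniform bit width, which real GGUF quantization schemes only approximate, and it ignores KV cache and runtime overhead. The function name is illustrative.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes.

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB,
    simplifies to params_billion * bits / 8.
    """
    if params_billion <= 0 or bits_per_weight <= 0:
        raise ValueError("parameter count and bit width must be positive")
    return params_billion * bits_per_weight / 8


# 80B total parameters, as quoted above for Qwen3-Coder-Next.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(80, bits):.0f} GB")
```

At 4 bits per weight the 80B model needs about 40 GB for weights, which is consistent with the under-48GB figure once some headroom is left for context.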

70.6% SWE-Bench Verified · 44% SWE-Bench Pro
Anthropic
New Models

Claude Opus 4.6

Anthropic ships Claude Opus 4.6 with 1M context and agent teams

Anthropic dropped Opus 4.6 live during the show, claiming state-of-the-art on GDPval, BrowseComp, and agentic search, with 65.4% on Terminal Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model infers from context how much reasoning effort to apply. Pricing matches Opus 4.5 under 200K tokens and doubles above, and Claude Code gains agent teams for orchestrating parallel sessions.

1M Context tokens
New Models · Open weights

Intern-S1-Pro

Intern-S1-Pro: 1 trillion parameter open MoE for scientific reasoning

InternLM released Intern-S1-Pro, a 1 trillion parameter open-source MoE model targeting SOTA scientific reasoning across chemistry, biology, materials, and earth sciences. The panel noted it beats frontier models on science benchmarks, a massive compute investment for an open release.

Kling AI
New Models

Kling 3.0

Kling 3.0: 15-second multi-shot video with native audio

Kuaishou's Kling 3.0 launched as an all-in-one AI video creation engine with native multimodal generation, 15-second multi-shot sequences, built-in audio, and character consistency across scenes. Alongside Grok Imagine, it marks the week native audio and lip sync became table stakes for video models.

Mistral AI
New Models · Open weights

Voxtral Transcribe 2

Mistral's Voxtral Transcribe 2 dethrones Whisper as SOTA transcription

Mistral AI launched Voxtral Transcribe 2, state-of-the-art speech-to-text with sub-200ms latency, native diarization support, and open weights under Apache 2.0. The panel called it the first model to dethrone Whisper after roughly three years, and Alex used it to transcribe this very episode.

OpenAI
New Models

GPT-5.3-Codex

OpenAI answers Opus with GPT-5.3-Codex, first model that helped build itself

One hour after Opus 4.6, OpenAI released GPT-5.3-Codex, billed as the first model instrumental in developing itself — the Codex team used early versions to debug its own training and manage its own deployment. It scores 73% on Terminal Bench 2.0, a 10-point gap over Opus 4.6, while running queries 25% faster and more token-efficiently than its predecessor, with improved mid-task steerability.

73% Terminal Bench 2.0 · 25% Speed improvement
StepFun
New Models · Open weights

Step 3.5 Flash

StepFun Step 3.5 Flash: frontier reasoning claims at 11B active params

StepFun released Step 3.5 Flash, a 196B sparse MoE model with only 11B active parameters, claiming frontier-level reasoning while generating at 100-350 tokens per second. It continues the trend of sparse Chinese MoE models delivering high speed at low active parameter counts.

🚀 Products & Apps 8

Cognition Labs
Products & Apps

Devin 2.2

Devin 2.2: computer use, browser, and self-verifying autonomous work

Cognition shipped Devin 2.2, an autonomous coding agent that can use a computer and browser to verify and fix its own work, plus a free public Devin Review workflow for PR review and scheduled/automated sessions. Nader Dabit framed the release as two years of platform maturity converging with stronger models, letting non-engineers fix issues directly by just asking Devin.

Taalas
Products & Apps

ChatJimmy (baked-weights chip demo)

Taalas demos 15,000+ tokens/sec with model weights baked into silicon

Taalas published a live demo (chatjimmy.ai) showing Llama 3 8B running at 15,691 tokens per second on a chip with weights baked directly into the hardware. The panel called it a 10x speed-class jump that points at chip-level innovation compressing inference costs and iteration cycles.
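
To make the speed classes concrete, the sketch below converts the throughput figures quoted in this roundup into wall-clock time for one response. The 2,000-token response length is a hypothetical choice. The calculation ignores time to first token, and the entries are different models on different hardware, so this is a sense-of-scale comparison rather than a benchmark.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate."""
    if num_tokens < 0 or tokens_per_second <= 0:
        raise ValueError("need num_tokens >= 0 and tokens_per_second > 0")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2_000  # hypothetical response length

# Throughput figures as quoted in this roundup, in tokens per second.
SPEEDS = {
    "Taalas demo (Llama 3 8B)": 15_691,
    "GPT 5.3 Codex Spark (reported floor)": 1_000,
    "Step 3.5 Flash (upper claim)": 350,
    "Step 3.5 Flash (lower claim)": 100,
}

for label, tps in SPEEDS.items():
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{label}: {seconds:.2f} s")
```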

15,000 tok/s Taalas Demo Throughput
Moltbook
Products & Apps

Moltbook

Moltbook: a Reddit built for and by AI agents

Moltbook launched as a social network for AI agents, part of an exploding 'agentic internet' that now includes agent equivalents of YouTube, Twitter, Instagram, 4chan, and even a church. Agents on these networks were observed discussing creating encrypted languages humans cannot read, and the panel warned against letting your agents loose on them.

OpenAI
Products & Apps

Codex App

OpenAI launches standalone Codex app for managing parallel coding agents

OpenAI shipped Codex as a dedicated Mac app, a command center for running multiple AI coding agents in parallel. Features include worktrees for parallel project branches, scheduled automations, a skills marketplace with Cloudflare, Vercel, Figma, Notion, and Linear integrations, inline diff review with per-line commenting, and cloud hand-off. OpenAI granted a free month of access to all users, including the free tier, and doubled rate limits for all tiers for two months.

✨ Major Features & Updates 7

Anthropic
Major Features & Updates

Claude Code Remote Control & Memory

Claude Code adds Remote Control and memory

Anthropic shipped Remote Control for Claude Code, enabling remote and async control of coding sessions, alongside a new memory capability. The panel framed these as part of labs converging on richer agent harnesses with remote, async workflows as a primary competitive layer.

Anthropic
Major Features & Updates

Claude Cowork Automations

Claude Cowork gets automations (cron jobs), matching Codex

Claude Cowork added automations, cron-job-style scheduled agent runs, in the same week OpenAI's Codex gained equivalent automation support. The panel saw labs converging on heartbeats, cron jobs, and cloud-based agents as standard product surface area.

Cursor
Major Features & Updates

Cloud Agents

Cursor launches cloud agents

Cursor launched cloud agents, moving agentic coding work off the local machine into remote, async sessions. The panel highlighted Cursor's cloud agents and UI demos as important progress for frontend development workflows.

Weights & Biases
Major Features & Updates

W&B Inference: MiniMax 2.5 & Kimi K2.5

W&B Inference adds MiniMax 2.5 and Kimi K2.5

Weights & Biases added MiniMax M2.5 and Kimi K2.5 to its CoreWeave-backed Inference service. The panel emphasized price/performance, with MiniMax 2.5 presented as roughly 10x cheaper than premium alternatives in some tiers and Kimi K2.5 praised for practical function calling and image-in-loop use cases.

Weights & Biases
Major Features & Updates

Kimi K2.5 on W&B Inference

W&B adds Kimi K2.5 to its inference service

Weights & Biases launched Kimi K2.5 on its inference service, making Moonshot AI's model available to W&B users. In Wolfram's Terminal Bench deep dive for W&B, Kimi K2.5 achieved a 67.4% ceiling score across multiple runs, among the strongest open-model results he measured.

🔌 APIs & Platforms 1

🛠️ Dev Tools 2

📄 Papers & Research 1

📊 Benchmarks & Evals 3

Agentica
Benchmarks & Evals

ARC-AGI-3 public set result

Agentica claims to solve all public ARC-AGI-3 tasks

Agentica published a claim of solving all public ARC-AGI-3 tasks, adding to the week's theme of benchmark saturation. The panel discussed it alongside METR and ARC-AGI-2 results as part of weighing signal versus noise in headline benchmark leaps.

Confluence Labs
Benchmarks & Evals

ARC-AGI-2 SOTA result

Confluence Labs exits stealth with 97.9% SOTA on ARC-AGI-2

Confluence Labs emerged from stealth with a 97.9% state-of-the-art result on the ARC-AGI-2 benchmark, publishing code on GitHub. The panel read it as a major signal that ARC-AGI-2 is near saturation, part of a broader pattern of benchmarks getting solved faster than expected.

97.9% ARC-AGI-2
METR
Benchmarks & Evals

Time Horizon Benchmark

METR Time Horizon goes vertical: Opus 4.6 hits ~14.5-hour tasks

METR's updated Time Horizon benchmark shows Claude Opus 4.6 completing tasks equivalent to roughly 14.5 hours of expert human work, with the autonomy doubling time now cited at 49 days. The panel treated this as the week's strongest evidence that agent capability growth has entered a visibly faster phase.
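
For a sense of what a 49-day doubling time implies, the sketch below extrapolates the 14.5-hour figure forward under a pure exponential trend. That constant-doubling assumption is ours, made for illustration, and is not a METR forecast. The function names are made up for this example.

```python
import math

HORIZON_HOURS = 14.5   # task horizon quoted above
DOUBLING_DAYS = 49.0   # doubling time quoted above


def projected_horizon_hours(
    days_ahead: float,
    start_hours: float = HORIZON_HOURS,
    doubling_days: float = DOUBLING_DAYS,
) -> float:
    """Horizon after days_ahead, assuming it keeps doubling at a fixed rate."""
    return start_hours * 2 ** (days_ahead / doubling_days)


def days_until_horizon(
    target_hours: float,
    start_hours: float = HORIZON_HOURS,
    doubling_days: float = DOUBLING_DAYS,
) -> float:
    """Days until the horizon reaches target_hours under the same assumption."""
    if target_hours <= 0 or start_hours <= 0:
        raise ValueError("hours must be positive")
    return doubling_days * math.log2(target_hours / start_hours)


for days in (49, 98, 365):
    print(f"+{days} days: ~{projected_horizon_hours(days):.0f} hours")

print(f"40-hour horizon in ~{days_until_horizon(40):.0f} days")
```

One doubling takes the horizon from 14.5 to 29 hours, and two doublings take it to 58 hours. Longer extrapolations compound quickly and should be read with matching caution.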

14.5h METR Time Horizon · 49 days Autonomy Doubling Time

💰 Funding 1

🤝 Acquisitions 1

OpenAI
Acquisitions

OpenClaw acqui-hire

OpenAI acqui-hires OpenClaw creator Peter Steinberger

OpenAI acqui-hired Peter Steinberger, the creator of the viral OpenClaw agent, in what the panel speculated might be the first single-founder billion-dollar deal. Yam Peleg broke the news on the show, calling Steinberger 'the goat'. The move lands the most popular third-party agent harness builder inside OpenAI, amid a week where Anthropic's terms changes pushed agent users toward OpenAI subscriptions.

🌀 Also Released 1

Ryan Carson
Also Released

Code Factory

Ryan Carson publishes the viral Code Factory agentic engineering blueprint

Ryan Carson published his viral Code Factory article, a blueprint for fully automated code generation, review, and deployment inspired by OpenAI's Harness Engineering post. The setup chains GitHub Actions, Greptile code review, CI gates, a risk-classification system for high-risk file changes, and a self-healing loop where Codex fixes its own PR issues until all checks pass. He says it takes a week-plus of setup but unlocks massive throughput.
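
The control flow behind a risk gate and a self-healing loop can be sketched in a few lines. This is not Carson's implementation: the path prefixes, function names, and attempt cap are all hypothetical, and the check runner and fixer are passed in as plain callables so the sketch stays independent of any CI system or coding agent.

```python
from typing import Callable, List, Sequence

# Hypothetical path prefixes that would route a change to stricter review.
HIGH_RISK_PREFIXES = ("migrations/", "auth/", ".github/workflows/")


def is_high_risk(changed_files: Sequence[str]) -> bool:
    """Flag a change set if any file sits under a high-risk path prefix."""
    return any(path.startswith(HIGH_RISK_PREFIXES) for path in changed_files)


def self_heal(
    run_checks: Callable[[], List[str]],
    apply_fix: Callable[[List[str]], None],
    max_attempts: int = 5,
) -> bool:
    """Run checks, hand failures to a fixer, and repeat until green or out of tries.

    run_checks returns the names of failing checks (empty list means all pass).
    apply_fix receives those failures and is expected to push a fix.
    """
    for _ in range(max_attempts):
        failures = run_checks()
        if not failures:
            return True
        apply_fix(failures)
    # One last check after the final fix attempt.
    return not run_checks()


# Usage with stand-in callables: each "fix" resolves one failing check.
pending = ["lint", "unit-tests"]
passed = self_heal(lambda: list(pending), lambda failures: pending.pop())
print("all checks pass" if passed else "needs a human")
```

Capping the number of attempts keeps an agent that cannot fix a failure from looping indefinitely, and it gives a clear point at which to escalate to a human reviewer.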