Qwen 3.5 lands: 35B/3B-active Medium outperforms the old 235B flagship
Alibaba released the Qwen 3.5 family of open-weight models, headlined by Qwen3.5-35B-A3B, a 35B model with only 3B active parameters that outperforms their previous 235B flagship. Variants include a 122B-A10B and a dense 27B, with the panel highlighting the hybrid state-space (Mamba-layer) architecture and strong practical coding and agent performance at a tiny active-parameter footprint.
Google DeepMind launches Nano Banana 2 image model mid-show
Google DeepMind announced Nano Banana 2 during the show, a Flash-quality tier of its image model line. Alex broke in mid-TLDR to describe near-Pro image quality at roughly half the price, plus a new image search capability.
Liquid AI releases LFM2-24B-A2B, a laptop-friendly 24B MoE
Liquid AI released LFM2-24B-A2B, a 24B mixture-of-experts model with only 2.3B active parameters that runs on consumer laptops. The panel highlighted its speed and surprisingly strong non-coding reasoning, reinforcing the trend of efficient low-active-parameter open models for local use.
OpenAI releases gpt-audio-1.5 and gpt-realtime-1.5
OpenAI shipped gpt-audio-1.5 and gpt-realtime-1.5, updated audio and realtime voice models available through its platform. The release was covered in the week's voice and audio roundup.
Perplexity launches pplx-embed SOTA embedding models
Perplexity released pplx-embed, a family of state-of-the-art embedding models built for web-scale retrieval. The models are available on Hugging Face and through Perplexity's API with quickstart docs.
Quiver released Arrow 1.0, pitched as solving SVG generation. It was included in the week's AI art and diffusion roundup as a notable niche release for vector graphics.
Alibaba opens Qwen 3.5: 397B-param multimodal MoE with only 17B active
Alibaba released Qwen3.5-397B-A17B, billed as the first open-weight native multimodal MoE model, with 397B total parameters, just 17B active, 512 experts, and 262K native context extendable to 1M. It delivers 8.6-19x faster inference than Qwen3-Max and continues Qwen's strength in multilingual and medical tasks, scoring 52.5% on Terminal Bench, third place among open-source models. Nisten found coding still trails GLM-5.
Anthropic ships Claude Sonnet 4.6 with 79.6% SWE-Bench and 1M context
Anthropic launched Claude Sonnet 4.6, its most capable Sonnet ever, scoring 79.6% on SWE-Bench Verified, nearly matching Opus 4.6 at Sonnet pricing of $3/$15 per million tokens. It ships with a 1M token context window in beta and is now the default model on claude.ai. In blind Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time, and it beats the previous Gemini 3 Pro on most benchmarks.
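To make the $3/$15 per million tokens figure concrete, here is a minimal sketch of the arithmetic. It assumes the usual convention that the first number is the input price and the second is the output price, which the summary above does not spell out. The function name and the sample token counts are illustrative only.

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_price_per_m=3.0, output_price_per_m=15.0):
    """Estimate the cost of one request in USD.

    Prices are per million tokens. The defaults mirror the $3/$15 figure
    quoted above, read as input/output prices (an assumption).
    """
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical request: a 200K-token prompt with a 10K-token answer.
cost = request_cost_usd(200_000, 10_000)
print(f"${cost:.2f}")
```

Under these assumptions the prompt accounts for most of the bill, which is why long-context use is usually priced with care.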
ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing
ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding surpasses the human benchmark at 77% vs 73%. At 84% cheaper than Opus 4.5 with near-comparable quality, the panel called it a compelling option for price-conscious developers.
Cohere Labs releases Tiny Aya, a 3.35B multilingual model for 70+ languages
Cohere Labs released Tiny Aya, a 3.35B-parameter multilingual model family supporting 70+ languages that is small enough to run locally on phones. It extends Cohere's Aya line of open multilingual models, bringing broad language coverage to on-device deployments.
Gemini 3.1 Pro drops live with 44% HLE and 77% ARC-AGI at the same price
Google released Gemini 3.1 Pro minutes before the show, claiming 2.5x better abstract reasoning and improved coding and agentic capabilities at the same price point as its predecessor. It scores 44% on Humanity's Last Exam, 77% on ARC-AGI without a custom harness, and 68 on Terminal Bench, putting it at or near state of the art alongside Opus 4.6. In Nisten's live vibe-coding test it was blazingly fast but less polished than Opus 4.6 and Codex output.
Google DeepMind launches Lyria 3 music generation in the Gemini app
Google DeepMind launched Lyria 3, its most advanced AI music generation model, now available in the Gemini app. It generates 32-second high-fidelity music tracks with creative controls and can compose music from uploaded images. Google also published a prompt guide covering vocals, lyrics, and different styles.
xAI silently drops Grok 4.20 with four 500B-param collaborating agents
xAI released Grok 4.20, a multi-agent system where four 500B-parameter agents collaborate in a multi-agent UI, with a $300/month Heavy tier scaling to 16 agents. No benchmarks or evals were released with the drop. The panel found it underwhelming for coding and day-to-day agent work but still top tier for deep research thanks to xAI's RAG over X data; Grok 4.1 Fast remains #8 on OpenRouter by API usage.
Zyphra opens ZUNA, a 380M-param EEG brain-computer interface model
Zyphra released ZUNA, a 380M-parameter open-source BCI foundation model that translates EEG brain signals into text, reconstructing clinical-grade brain signals from sparse, noisy data. Dubbed 'thought to text' by the community, it works with roughly $500 non-invasive EEG headsets, likely needs personalized training per user, and is small enough to run in real time on a consumer gaming GPU. It is Apache licensed.
Alibaba launches Qwen-Image-2.0 with native 2K resolution
Alibaba's Qwen team launched Qwen-Image-2.0, a 7B-parameter image generation model with native 2K resolution output and superior text rendering. It is available to try on chat.qwen.ai.
ByteDance Seedance 2.0 shatters video generation reality
ByteDance launched Seedance 2.0, a unified multimodal video generation model that accepts up to 9 images, 3 videos, and 3 audio clips as references and produces 15-second multi-shot clips with native stereo audio and strong character consistency (a 45-second internal test mode also exists). The panel compared the quality jump to seeing Sora for the first time. It is available on the BytePlus platform.
Google dropped an upgraded Gemini 3 Deep Think mid-show, hitting 84% on ARC-AGI 2 — the biggest single jump in the benchmark's history, up from Opus 4.6's 68% set just one week earlier. It also scored 48.4% on Humanity's Last Exam without tools, taking state of the art on both.
MiniMax M-2.5 hits 80.2% SWE-Bench Verified with 10B active params
MiniMax dropped M-2.5 thirty minutes before the show: a 200B-total, 10B-active open-weights model scoring 80.2% on SWE-Bench Verified, approaching Opus 4.6 at roughly 1/20th the cost (~15 cents per task with a 57% win rate over Opus). It was trained with MiniMax's decoupled Forge RL framework and optimized for end-to-end task time, with fewer tool calls and thinking tokens. Senior researcher Olive Song joined live and revealed that the model was still training; the team cut a checkpoint for early release.
OpenAI ships GPT 5.3 Codex Spark on Cerebras for real-time coding
OpenAI released GPT 5.3 Codex Spark, a smaller Codex variant built for real-time coding and served on Cerebras hardware (OpenAI's first model on Cerebras), with reported speeds of over 1000 tokens/sec. It is available to ChatGPT Pro users in the Codex app, CLI, and IDE extension. The release landed during the show as the episode's second breaking-news drop.
Z.ai launches GLM-5, the open-weights agentic coding crown
Z.ai released GLM-5, a 744B-parameter MoE model (40B active) trained on 28.5 trillion tokens that takes the #1 open-source ranking for agentic coding with 77.8% SWE-bench Verified. It introduces the slime asynchronous RL framework for post-training, adopts DeepSeek's sparse attention to cut deployment cost, and was trained on Huawei chips rather than NVIDIA. Lou from Z.ai joined the show live and summed it up as bigger, faster, better, and cheaper.
ACE-Step 1.5: open-source 'Suno at home' music generation under MIT
ACE-Step 1.5 is an MIT-licensed AI music generator that produces full songs in under 10 seconds on consumer GPUs and runs on a MacBook. The panel demoed it live via Pinokio, generating a ThursdAI song on the spot, and it is available for one-click install.
Qwen3-Coder-Next hits 70.6% SWE-Bench Verified with 3B active params
Alibaba's Qwen3-Coder-Next is an 80B MoE coding agent model with only 3B active parameters that scores 70.6% on SWE-Bench Verified and 44% on the much harder SWE-Bench Pro. It was trained on 7.5T tokens with 20,000 parallel RL environments and runs under 48GB of RAM with GGUF quantization, making near-frontier agentic coding feasible on local hardware.
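The "under 48GB of RAM" figure follows from simple arithmetic on weight storage. The sketch below is a back-of-the-envelope estimate only: the bits-per-weight values are illustrative assumptions rather than the specific GGUF quantization used, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 80e9  # 80B total parameters, as quoted above

# Illustrative precisions: 16-bit baseline and two assumed quantization levels.
for bits in (16, 8, 4):
    print(f"{bits:>2} bits/weight: {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Note that all 80B parameters have to be resident even though only 3B are active per token. The small active count helps speed, not memory.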
LingBot-World: open-source world model challenges Google Genie 3
Ant Group released LingBot-World, an open-source world model that generates 10-minute playable environments at 16fps. It positions open weights as a direct challenger to Google's closed Genie 3 in interactive world generation.
Anthropic ships Claude Opus 4.6 with 1M context and agent teams
Anthropic dropped Opus 4.6 live during the show, claiming state-of-the-art on GDPval, BrowseComp, and agentic search, with 65.4% on Terminal Bench and 99% on TAU Bench MCP tool use. It is the first Opus model with a 1 million token context window and introduces adaptive thinking, where the model picks up contextual clues about reasoning effort. Pricing matches Opus 4.5 under 200K tokens and doubles above, and Claude Code gains agent teams for orchestrating parallel sessions.
Intern-S1-Pro: 1 trillion parameter open MoE for scientific reasoning
InternLM released Intern-S1-Pro, a 1 trillion parameter open-source MoE model targeting SOTA scientific reasoning across chemistry, biology, materials, and earth sciences. The panel noted it beats frontier models on science benchmarks, a massive compute investment for an open release.
Kling 3.0: 15-second multi-shot video with native audio
Kuaishou's Kling 3.0 launched as an all-in-one AI video creation engine with native multimodal generation, 15-second multi-shot sequences, built-in audio, and character consistency across scenes. Alongside Grok Imagine, it marks the week native audio and lip sync became table stakes for video models.
Mistral's Voxtral Transcribe 2 dethrones Whisper as SOTA transcription
Mistral AI launched Voxtral Transcribe 2, state-of-the-art speech-to-text with sub-200ms latency, native diarization support, and open weights under Apache 2.0. The panel called it the first model to dethrone Whisper after roughly three years, and Alex used it to transcribe this very episode.
OpenAI answers Opus with GPT-5.3-Codex, first model that helped build itself
One hour after Opus 4.6, OpenAI released GPT-5.3-Codex, billed as the first model instrumental in developing itself — the Codex team used early versions to debug its own training and manage its own deployment. It scores 73% on Terminal Bench 2.0, a 10-point gap over Opus 4.6, while running queries 25% faster and more token-efficiently than its predecessor, with improved mid-task steerability.
MiniCPM-o 4.5: first open-source full-duplex omni model
OpenBMB released MiniCPM-o 4.5, the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously. It can listen while speaking and even interrupt the user, bringing real-time conversational behavior to open weights.
StepFun Step 3.5 Flash: frontier reasoning claims at 11B active params
StepFun released Step 3.5 Flash, a 196B sparse MoE model with only 11B active parameters, claiming frontier-level reasoning while generating at 100-350 tokens per second. It continues the trend of sparse Chinese MoE models delivering high speed at low active parameter counts.
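The low-active-parameter trend is easy to quantify from the totals quoted in this digest. The sketch below computes the share of parameters active per token for the MoE models mentioned above. The figures are copied from the summaries, and the helper name is illustrative.

```python
def active_fraction(total_b, active_b):
    """Share of parameters active per token (both arguments in billions)."""
    return active_b / total_b


# (model, total params in B, active params in B), as quoted in this digest.
MODELS = [
    ("Qwen3.5-35B-A3B", 35, 3),
    ("Qwen3.5-397B-A17B", 397, 17),
    ("LFM2-24B-A2B", 24, 2.3),
    ("MiniMax M-2.5", 200, 10),
    ("GLM-5", 744, 40),
    ("Qwen3-Coder-Next", 80, 3),
    ("Step 3.5 Flash", 196, 11),
]

for name, total_b, active_b in MODELS:
    print(f"{name:<20} {active_fraction(total_b, active_b):6.1%} active")
```

Every model in the list activates well under a tenth of its weights per token, which is where the speed claims come from.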
Grok Imagine 1.0 tops video arena with native audio and lip sync
xAI launched Grok Imagine 1.0 with 10-second 720p video generation, native audio, and lip sync, taking the #1 spot on the Artificial Analysis text-to-video arena. Generation costs roughly $0.42 per 10-second clip and an API is available.
Z.ai released GLM-OCR, a tiny 0.9B parameter document understanding model that achieves the #1 ranking on OmniDocBench V1.5. It shows that strong OCR and document parsing no longer require large models.
Devin 2.2: computer use, browser, and self-verifying autonomous work
Cognition shipped Devin 2.2, an autonomous coding agent that can use a computer and browser to verify and fix its own work, plus a free public Devin Review workflow for PR review and scheduled/automated sessions. Nader Dabit framed the release as two years of platform maturity converging with stronger models, letting non-engineers fix issues directly by just asking Devin.
Nous Research announced a research agent, joining the wave of lab-built agentic tools shipped this week. It was covered in the roundup of new agent products alongside Cursor cloud agents and Perplexity Computer.
Perplexity launched Perplexity Computer, an agentic computer product announced via its blog. It was discussed as part of the week's convergence on agent harnesses, automations, and cloud-based agent workflows across labs.
Taalas demos 15,000+ tokens/sec with model weights baked into silicon
Taalas published a live demo (chatjimmy.ai) showing Llama 3 8B running at 15,691 tokens per second on a chip with weights baked directly into the hardware. The panel called it a 10x speed-class jump that points at chip-level innovation compressing inference costs and iteration cycles.
Dreamer launches beta platform for building agentic apps with no-code AI
Dreamer launched its beta, a full-stack platform for building and discovering agentic apps with no-code AI. It aims to let non-developers assemble and share agent-powered applications.
Moltbook launched as a social network for AI agents, part of an exploding 'agentic internet' that now includes agent equivalents of YouTube, Twitter, Instagram, 4chan, and even a church. Agents on these networks were observed discussing creating encrypted languages humans cannot read, and the panel warned against letting your agents loose on them.
OpenAI launches standalone Codex app for managing parallel coding agents
OpenAI shipped Codex as a dedicated Mac app, a command center for running multiple AI coding agents in parallel. Features include worktrees for parallel project branches, scheduled automations, a skills marketplace with Cloudflare, Vercel, Figma, Notion, and Linear integrations, inline diff review with per-line commenting, and cloud hand-off. OpenAI granted a free month of access to all users including the free tier, and doubled rate limits for all tiers for two months.
OpenAI Frontier: enterprise platform for AI agents as coworkers
OpenAI launched Frontier, an enterprise platform to build, deploy, and manage AI agents as 'AI coworkers'. It targets companies that want to operationalize agents across their organizations.
Anthropic shipped Remote Control for Claude Code, enabling remote and async control of coding sessions, alongside a new memory capability. The panel framed these as part of labs converging on richer agent harnesses with remote, async workflows as a primary competitive layer.
Claude Cowork gets automations (cron jobs), matching Codex
Claude Cowork added automations, cron-job-style scheduled agent runs, in the same week OpenAI's Codex gained equivalent automation support. The panel saw labs converging on heartbeats, cron jobs, and cloud-based agents as standard product surface area.
Cursor launched cloud agents, moving agentic coding work off the local machine into remote, async sessions. The panel highlighted Cursor's cloud agents and UI demos as important progress for frontend development workflows.
Weights & Biases added MiniMax M2.5 and Kimi K2.5 to its CoreWeave-backed Inference service. The panel emphasized price/performance, with MiniMax 2.5 presented as roughly 10x cheaper than premium alternatives in some tiers and Kimi K2.5 praised for practical function calling and image-in-loop use cases.
Weights & Biases launched Kimi K2.5 on its inference service, making Moonshot AI's model available to W&B users. In Wolfram's Terminal Bench deep dive for W&B, Kimi K2.5 achieved a 67.4% ceiling score across multiple runs, among the strongest open-model results he measured.
OpenAI upgrades Deep Research to GPT-5.2 with app integrations
OpenAI upgraded Deep Research to run on GPT-5.2, adding app integrations, site-specific searches, and real-time collaboration. It was part of the week's rapid-fire big-lab announcements covered in the TLDR rundown.
W&B Inference adds day-zero GLM-5 and Kimi K2.5 support
Weights & Biases launched day-zero GLM-5 support on its CoreWeave-powered W&B Inference service, alongside Kimi K2.5, with MiniMax 2.5 coming soon. Alex announced $50 in free credits for listeners to test the new open-weights models.
Chrome 146 introduces WebMCP, a native browser API for AI agents
Chrome 146 shipped WebMCP, a native browser API that lets AI agents directly interact with web services. It brings Model Context Protocol-style agent access into the browser itself, a notable primitive for the agentic web.
LM Studio launches LMLink for remote access to local models
LM Studio launched LMLink, which lets you use your locally hosted models from anywhere via Tailscale. It extends the local-model story so that on-device inference is reachable from any of your machines.
Ryan Carson releases AntFarm for agent coordination
Co-host Ryan Carson released AntFarm, a tool for coordinating teams of coding agents. It targets the missing primitives for managing multiple agents that the panel discussed during the agent-psychosis segment.
Anthropic publishes Opus 4.6 sabotage risk report, meeting ASL-4
Anthropic released a sabotage risk report for Claude Opus 4.6, preemptively meeting ASL-4 safety standards for autonomous AI R&D. The report evaluates the model's potential for sabotage-style behaviors as capabilities scale.
Agentica claims to solve all public ARC-AGI-3 tasks
Agentica published a claim of solving all public ARC-AGI-3 tasks, adding to the week's theme of benchmark saturation. The panel discussed it alongside METR and ARC-AGI-2 results as part of weighing signal versus noise in headline benchmark leaps.
Confluence Labs exits stealth with 97.9% SOTA on ARC-AGI-2
Confluence Labs emerged from stealth with a 97.9% state-of-the-art result on the ARC-AGI-2 benchmark, publishing code on GitHub. The panel read it as a major signal that ARC-AGI-2 is near saturation, part of a broader pattern of benchmarks getting solved faster than expected.
METR Time Horizon goes vertical: Opus 4.6 hits ~14.5-hour tasks
METR's updated Time Horizon benchmark shows Claude Opus 4.6 completing tasks equivalent to roughly 14.5 hours of expert human work, with the autonomy doubling time now cited at 49 days. The panel treated this as the week's strongest evidence that agent capability growth has entered a visibly faster phase.
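To get a feel for what a 49-day doubling time implies, here is a minimal sketch that extrapolates from the 14.5-hour figure. It assumes clean exponential growth at a constant doubling time, which is a simplification made here for illustration and not a forecast from METR. The function names are hypothetical.

```python
import math

BASE_HOURS = 14.5      # time horizon quoted above
DOUBLING_DAYS = 49.0   # doubling time quoted above


def projected_horizon_hours(days_from_now):
    """Horizon after `days_from_now` days, assuming constant doubling time."""
    return BASE_HOURS * 2 ** (days_from_now / DOUBLING_DAYS)


def days_to_reach(target_hours):
    """Days until the horizon reaches `target_hours` under the same assumption."""
    return DOUBLING_DAYS * math.log2(target_hours / BASE_HOURS)


print(projected_horizon_hours(49))   # one doubling
print(days_to_reach(40))             # a 40-hour work week of expert effort
```

Small changes in the doubling time shift these projections a lot, so the output is best read as an order-of-magnitude guide.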
Entire raises $60M seed, ships first OSS release 'Checkpoints'
Entire raised a $60M seed round to build an open-source developer platform for AI agent workflows. Alongside the funding it shipped its first open-source release, Checkpoints, available on GitHub.
OpenAI acqui-hires OpenClaw creator Peter Steinberger
OpenAI acqui-hired Peter Steinberger, the creator of the viral OpenClaw agent, in what the panel speculated might be the first single-founder billion-dollar deal. Yam Peleg broke the news on the show, calling Steinberger 'the goat'. The move lands the most popular third-party agent harness builder inside OpenAI, amid a week where Anthropic's terms changes pushed agent users toward OpenAI subscriptions.
Ryan Carson publishes the viral Code Factory agentic engineering blueprint
Ryan Carson published his viral Code Factory article, a blueprint for fully automated code generation, review, and deployment inspired by OpenAI's Harness Engineering post. The setup chains GitHub Actions, Greptile code review, CI gates, a risk-classification system for high-risk file changes, and a self-healing loop where Codex fixes its own PR issues until all checks pass. He says it takes a week-plus of setup but unlocks massive throughput.
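Two of the ideas above, risk classification for changed files and a fix-until-green loop, can be sketched in a few lines. This is a rough illustration and not the article's actual implementation: the path patterns, function names, and the attempt cap are all hypothetical, and the check and fix steps are passed in as plain callables standing in for CI and the coding agent.

```python
from fnmatch import fnmatch

# Hypothetical patterns for files that should trigger stricter review.
HIGH_RISK_PATTERNS = ["migrations/*", "auth/*", "*.env", ".github/workflows/*"]


def classify_risk(changed_files, patterns=HIGH_RISK_PATTERNS):
    """Return "high" if any changed file matches a high-risk pattern."""
    for path in changed_files:
        if any(fnmatch(path, pattern) for pattern in patterns):
            return "high"
    return "low"


def self_heal(run_checks, apply_fix, max_attempts=5):
    """Re-run checks and apply fixes until they pass or attempts run out.

    run_checks() returns a list of failures (empty means all green).
    apply_fix(failures) asks the agent to address them.
    Returns the attempt number on which checks passed, or None.
    """
    for attempt in range(1, max_attempts + 1):
        failures = run_checks()
        if not failures:
            return attempt
        apply_fix(failures)
    return None


# Toy usage: each "fix" resolves one outstanding failure.
pending = ["lint", "tests"]
passed_on = self_heal(lambda: list(pending), lambda failures: pending.pop())
print(classify_risk(["auth/login.py"]), passed_on)
```

Capping the number of attempts matters in practice, since an agent that cannot fix a failure would otherwise loop indefinitely.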