Episode Summary

ThursdAI’s first December episode was a full firehose: DeepSeek V3.2 dropped with gold-medal-level reasoning results, Mistral returned to Apache 2.0 with new large and edge models, and Arcee’s Lukas Atkins joined to talk about building US-trained MoEs from scratch. The panel unpacked what these releases mean for open-source momentum, inference cost, and real enterprise adoption constraints. On the closed-model side, OpenAI reportedly hit a β€œcode red” response to Gemini 3 pressure while Amazon rolled out Nova 2 across text, speech, and multimodal stacks. The show closed with rapid updates across eval tooling, video generation, realtime voice, and low-cost image diffusion.

Hosts & Guests

  • Alex Volkov - Host Β· W&B / CoreWeave (@altryne)
  • Lukas Atkins - CTO, Arcee AI (@latkins)
  • Wolfram Ravenwolf - Weekly co-host, AI model evaluator (@WolframRvnwlf)
  • Yam Peleg - AI builder & founder (@Yampeleg)
  • Nisten Tahiraj - AI operator & builder (@nisten)
  • LDJ - Weekly co-host of ThursdAI (@ldjconfirmed)

By The Numbers

  • AIME: 96% - DeepSeek V3.2-Speciale’s reported score, versus GPT-5 High at 94%
  • SWE-Bench Verified: 73.1% - DeepSeek V3.2 agentic coding benchmark result
  • Total parameters: 685B - DeepSeek V3.2-Speciale, MIT-licensed MoE
  • Pricing: ~28Β’ per 1M tokens - approximate OpenRouter pricing cited for DeepSeek V3.2
  • Context window: 256K tokens - Mistral Large 3
  • ARC-AGI-2: 45.1% - Gemini 3 Deep Think score discussed in the episode

πŸ”₯ Breaking During The Show

DeepSeek V3.2-Speciale posts gold-level olympiad results
DeepSeek’s latest reasoning-first release landed with standout olympiad and coding numbers plus aggressive pricing, pushing open models closer to top closed-model capability.
Mistral returns to Apache 2.0 with Mistral 3 family
Mistral relaunched large and small multimodal models under permissive licensing, reigniting discussion around open model portability and deployability.
OpenAI Code Red and Gemini pressure
The episode covered reports that OpenAI shifted priorities in response to Gemini momentum while the broader API race accelerated across Google, Amazon, and Cursor integrations.

πŸ”“ Open Source LLMs

The panel went deep on DeepSeek V3.2, Mistral 3, Arcee Trinity, and Hermes 4.3 as proof that open models are moving fast on both reasoning and coding utility. They discussed benchmark context, licensing shifts back to Apache 2.0, and why MoE architecture plus efficient post-training is changing the economics of open AI.

  • DeepSeek V3.2-Speciale posted gold-level olympiad and AIME results with MIT license
  • Mistral Large 3 and Ministral 3 relaunched under Apache 2.0 with strong open-model coding positioning
  • Arcee Trinity introduced US-trained open MoEs and previewed Trinity-Large for January 2026
  • Hermes 4.3 highlighted decentralized training and RefusalBench performance

🏒 Big CO LLMs + APIs

Coverage shifted to the frontier API race: OpenAI’s reported internal β€œcode red,” Amazon’s Nova 2 suite, Gemini 3 Deep Think, and Cursor’s temporary free access to GPT-5.1-Codex-Max. The discussion emphasized that product integration and latency matter as much as raw benchmark IQ.

  • OpenAI reportedly paused side projects to focus on intelligence and speed
  • Amazon Nova 2 announced Lite, Pro, Sonic, and Omni with major benchmark jumps
  • Gemini 3 Deep Think introduced high-cost parallel reasoning with ARC-AGI-2 gains
  • Cursor offered GPT-5.1-Codex-Max free access through Dec 11

⚑ This Week’s Buzz

Weights & Biases launched LLM Evaluation Jobs to run evaluations against OpenAI-compatible APIs during training cycles, not just at the end. The segment framed this as a practical workflow upgrade for teams trying to move faster without blindly burning compute.

  • W&B launched LLM Evaluation Jobs
  • Supports evaluating OpenAI-compatible endpoints
  • Focus on earlier model quality signals during development
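To make the workflow concrete, here is a minimal, stdlib-only sketch of what β€œevaluating an OpenAI-compatible API” looks like in practice: post a prompt to a `/chat/completions` route, score the reply against a reference answer, and average the scores. This is illustrative only, not W&B’s Evaluation Jobs API; the `chat`, `evaluate`, and `exact_match` names, the `localhost` endpoint, and the exact-match metric are all assumptions for the sketch.

```python
import json
import urllib.request
from typing import Callable

def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 when the normalized answers agree, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def chat(base_url: str, model: str, prompt: str, api_key: str = "unused") -> str:
    """POST one chat completion to an OpenAI-compatible /chat/completions route."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def evaluate(ask: Callable[[str], str], dataset: list) -> float:
    """Average exact-match accuracy; `ask` maps a prompt to a model reply."""
    scores = [exact_match(ask(ex["prompt"]), ex["answer"]) for ex in dataset]
    return sum(scores) / len(scores)
```

Because `evaluate` takes any prompt-to-reply callable, the same harness can point at a local vLLM server, a hosted checkpoint, or a mock, e.g. `evaluate(lambda p: chat("http://localhost:8000/v1", "my-checkpoint", p), dataset)`, which is what makes mid-training evaluation cheap to wire up.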

πŸŽ₯ Vision & Video

Video model updates included Runway Gen-4.5’s leaderboard gains and two Kling releases: Kling VIDEO 2.6 with native audio generation and the Kling O1 image model. The updates continued the theme that video quality and multimodal consistency are improving week over week.

  • Runway Gen-4.5 reached top text-to-video leaderboard position
  • Kling VIDEO 2.6 introduced native audio generation
  • Kling O1 Image expanded image generation capabilities

πŸ”Š Voice & Audio

The show highlighted Microsoft VibeVoice-Realtime-0.5B and its low-latency realtime TTS profile. The segment focused on how sub-second audio response is becoming table stakes for production voice agents.

  • Microsoft VibeVoice-Realtime-0.5B shared with ~300ms latency claims
  • Voice model availability on Hugging Face
  • Realtime speech UX increasingly central to agent products

🎨 AI Art & Diffusion

Image-generation updates centered on speed and cost efficiency, with Pruna P-Image claiming sub-second generation at very low per-image pricing and SeeDream 4.5 adding stronger text rendering and multi-reference fusion.

  • Pruna P-Image promoted sub-second image generation at low cost
  • SeeDream 4.5 emphasized multi-reference fusion
  • Text rendering quality remained a key differentiator

TL;DR and Show Notes

Hosts and Guests

  • Alex Volkov (@altryne), Lukas Atkins (@latkins), Wolfram Ravenwolf (@WolframRvnwlf), Yam Peleg (@Yampeleg), Nisten Tahiraj (@nisten), LDJ (@ldjconfirmed)

Open Source LLMs

  • DeepSeek V3.2 & V3.2-Speciale - 685B MIT-licensed MoE, gold-level olympiad and AIME results, ~28Β’ per 1M tokens
  • Mistral Large 3 & Ministral 3 - Apache 2.0 return, 256K context
  • Arcee Trinity - US-trained open MoEs, Trinity-Large previewed for January 2026
  • Hermes 4.3 - Decentralized training, RefusalBench performance

Big CO LLMs + APIs

  • OpenAI Code Red - ChatGPT 3rd birthday, Garlic model in development (The Information)

  • Amazon Nova 2 - Lite, Pro, Sonic, and Omni models (X, Blog)

  • Gemini 3 Deep Think - 45.1% ARC-AGI-2 (X, Blog)

  • Cursor + GPT-5.1-Codex-Max - Free until Dec 11 (X, Blog)

This Week’s Buzz

  • WandB LLM Evaluation Jobs - Evaluate any OpenAI-compatible API (X, Announcement)

Vision & Video

  • Runway Gen-4.5 - #1 on text-to-video leaderboard, 1,247 Elo (X)

  • Kling VIDEO 2.6 - First native audio generation (X)

  • Kling O1 Image - Image generation (X)

Voice & Audio

  • Microsoft VibeVoice-Realtime-0.5B - 300ms latency TTS (X, HF)

AI Art & Diffusion

  • Pruna P-Image - Sub-second generation at $0.005 (X, Blog, Demo)

  • SeeDream 4.5 - Multi-reference fusion, text rendering (X)