Infrastructure & Inference

Compute, GPUs, hardware, serving, quantization, and efficiency for running models at scale. — 63 releases covered on the show.

June 2026

NVIDIA
Products & Apps

RTX Spark

NVIDIA announces RTX Spark Arm + Blackwell platform for local AI PCs

At Computex, NVIDIA unveiled RTX Spark, an Arm CPU plus Blackwell GPU PC platform with 128GB unified memory targeting local AI agents and 120B-class local inference. A wave of thin laptops with RTX 5070-class GPUs and roughly one petaflop of local AI compute raises the question of what agents should run locally versus in the cloud.

May 2026

PrismML
New ModelsOpen weights

Bonsai Image 4B

PrismML's 1-bit Bonsai Image 4B runs local image gen under 1GB

PrismML released 1-bit and ternary versions of Bonsai Image 4B, a sub-1GB diffusion transformer for local image generation. The quantized model even runs in-browser via WebGPU and ships with an iOS app and a Hugging Face demo.

Weights & Biases
Dev Tools

W&B MCP Server

Weights & Biases launches MCP server with 20 tools for agents

W&B officially launched its MCP server with 20 schema-first tools so coding agents can read experiments, monitor training, and run autonomous research loops. Agents can query metadata before pulling full 300-metric runs, keeping their context windows from blowing up.

Anthropic
Also Released

Colossus compute deal

SpaceX IPO filing reveals Anthropic pays $1.25B/month for Colossus compute

The SpaceX IPO filing revealed Anthropic is paying $1.25 billion per month for AI compute at the Memphis Colossus facility. The crew called it a bombastic deal that lets Anthropic serve far more inference at scale and feel less compute-constrained.

$1.25B monthly AI compute spend

April 2026

Stripe
Dev Tools

Projects.dev

Stripe opens Projects.dev: 32 infra providers provisionable by agents

Stripe removed the waitlist on Projects.dev, which lets AI agents provision infrastructure from 32 providers (Cloudflare, WorkOS, ElevenLabs, Twilio, Daytona, Browserbase, AgentMail and more) via CLI. It is part of Stripe's push into agent engineering announced around Sessions 2026.

CoreWeave
Also Released

Anthropic, Meta & Jane Street deals

CoreWeave signs Anthropic, Meta ($21B), and Jane Street ($6B + $1B)

CoreWeave announced a multibillion-dollar deal with Anthropic, a $21B expansion with Meta (taking the relationship past $35B total), and a Jane Street deal worth $6B in cloud plus $1B in equity. CoreWeave now serves 9 of the top 10 AI labs, cementing its position as the neocloud backbone of frontier AI.

Anthropic
Products & Apps

Managed Agents

Anthropic ships Managed Agents, a fully hosted agent runtime

Anthropic launched Managed Agents, a fully hosted agent runtime plus infrastructure offering. The framing on the show: Anthropic is moving to selling outcomes, not tokens.

Weights & Biases
Major Features & Updates

W&B Automations

W&B Automations launch: event triggers from training runs

Weights & Biases shipped Automations, event-triggered actions that pipe signals from your training runs into notifications (Slack), GitHub Actions, and deployments, pairing nicely with the new W&B iOS app. In the same Buzz segment: GLM-5.1 and Gemma 4 both went live on W&B Inference.

OpenAI
Funding

$122B funding round

OpenAI closes $122B funding round at $852B valuation

OpenAI closed a reported $122 billion funding round, described as the largest in history, at an $852B valuation with an IPO said to be incoming. The panel discussed what that scale of capital implies for AI infrastructure spending, product velocity, and competitive pressure across the market.

$122B OpenAI funding round
PrismML
New ModelsOpen weights

Bonsai

PrismML releases Bonsai 1-bit models, an 8B model in 1.15 GB

PrismML released Bonsai, a family of 1-bit quantized open models fitting an 8B model into 1.15 GB and claiming 10x intelligence density, built on decades of compression research. The panel discussed one-bit quantization as a cost/performance lever for cheap local inference.

March 2026

Google Research
Papers & Research

TurboQuant

Google TurboQuant claims 6x KV-cache compression and 8x faster inference

Google Research published TurboQuant, a KV-cache quantization technique claiming 6x compression and 8x inference speedup with near-zero accuracy loss. The panel framed it as a potential unlock for LLM inference economics, while calling stock-market panic over the result premature without broader production validation.

TurboQuant KV-cache compression TurboQuant speedup claim
Modular
Products & Apps

Modular 26.2

Modular 26.2 runs FLUX.2 in under a second, 99% cheaper than Nano Banana

Modular shipped its 26.2 release with state-of-the-art image generation, running FLUX.2 in under one second (sub-300ms claims) at 99% lower cost than Nano Banana, plus upgraded AI coding with Mojo. Alex noted the surprise of an inference platform releasing model-level optimization and hoped the approach spreads to all image generation.

NVIDIA
Also Released

GR LPX (Rubin NVL72 + Groq 3)

NVIDIA GTC: GR LPX pairs Rubin NVL72 servers with the new Groq 3 chip

NVIDIA's GTC hardware reveal integrates the new Groq 3 chip (gen 2 was never publicly seen) into Rubin NVL72 servers via the GR LPX system. Claims include 3x tokens-per-watt efficiency at baseline, up to 30x at higher throughput, and 1000+ tokens/sec on a 2T-parameter frontier model with 400K context — performance the current Blackwell generation can't reach at any price.

Google DeepMind
New Models

Gemini 3.1 Flash-Lite

Google launches Gemini 3.1 Flash-Lite with 1M context at 360 tok/s

Google launched Gemini 3.1 Flash-Lite, a fast and cheap model with 1M token context aimed at the instant/fast tier, running around 360 tokens per second. The panel flagged a material pricing jump versus the prior Flash-Lite generation but saw it as well suited for judge, guardrail, and orchestration workloads in agent systems.

360 tokens/sec Gemini 3.1 Flash-Lite speed

February 2026

Taalas
Products & Apps

ChatJimmy (baked-weights chip demo)

Taalas demos 15,000+ tokens/sec with model weights baked into silicon

Taalas published a live demo (chatjimmy.ai) showing Llama 3 8B running at 15,691 tokens per second on a chip with weights baked directly into the hardware. The panel called it a 10x speed-class jump that points at chip-level innovation compressing inference costs and iteration cycles.

15,000 tok/s Taalas Demo Throughput
Weights & Biases
Major Features & Updates

W&B Inference: MiniMax 2.5 & Kimi K2.5

W&B Inference adds MiniMax 2.5 and Kimi K2.5

Weights & Biases added MiniMax M2.5 and Kimi K2.5 to its CoreWeave-backed Inference service. The panel emphasized price/performance, with MiniMax 2.5 presented as roughly 10x cheaper than premium alternatives in some tiers and Kimi K2.5 praised for practical function calling and image-in-loop use cases.

Weights & Biases
Major Features & Updates

Kimi K2.5 on W&B Inference

W&B adds Kimi K2.5 to its inference service

Weights & Biases launched Kimi K2.5 on its inference service, making Moonshot AI's model available to W&B users. In Wolfram's Terminal Bench deep dive for W&B, Kimi K2.5 achieved a 67.4% ceiling score across multiple runs, among the strongest open-model results he measured.

OpenAI
New Models

GPT 5.3 Codex Spark

OpenAI ships GPT 5.3 Codex Spark on Cerebras for real-time coding

OpenAI released GPT 5.3 Codex Spark, a smaller Codex variant built for real-time coding, served on Cerebras hardware — OpenAI's first model on Cerebras — with reported speeds of over 1000 tokens/sec. Available to ChatGPT Pro users in the Codex app, CLI, and IDE extension. It broke during the show as the second breaking-news drop of the episode.

100 tps Codex Spark speed

January 2026

OpenAI
Also Released

OpenAI x Cerebras Partnership

OpenAI inks $10B deal with Cerebras for 750MW of high-speed compute

OpenAI announced a $10 billion partnership with Cerebras for 750 megawatts of high-speed inference compute, with capacity starting in 2028. It extends OpenAI's pattern of locking in massive compute supply deals beyond its existing cloud partners.

$10B OpenAI × Cerebras
NVIDIA
Acquisitions

Groq acquisition

NVIDIA acquires Groq team and licenses its tech for ~$20B

NVIDIA entered an exclusive licensing deal with Groq and acquired most of its team for approximately $20B. Groq's inference-optimized chips, created by former Google TPU lead Jonathan Ross, complement NVIDIA's training dominance as inference demand grows exponentially across AI use cases.

NVIDIA
Products & Apps

Vera Rubin

NVIDIA Vera Rubin platform: 5x Blackwell inference at CES 2026

Jensen Huang unveiled the Vera Rubin platform at CES 2026, NVIDIA's next-gen AI computer delivering 50 PFLOPS and 5x inference performance over Blackwell while adding only ~200W of power draw. It needs 75% fewer GPUs for 10 trillion parameter MoE training, packs 72 GPUs per rack with 20.7TB memory and 13 TB/s bandwidth, is 100% liquid cooled, and entered full production just four months after the B300.

5x Vera Rubin vs Blackwell75% Fewer GPUs needed

December 2025

NVIDIA
Products & Apps

Project Digits

NVIDIA Project Digits: $3,000 desktop that runs 200B-param models

NVIDIA announced Project Digits in January, a $3,000 desktop supercomputer capable of running 200B parameter models locally. It brought serious local-inference hardware to individual developers and was one of January's standout hardware stories.

Zhipu AI (GLM)
New ModelsOpen weights

GLM 4.5

GLM 4.5 runs on Cerebras fast enough to win hackathons

Zhipu's GLM 4.5 came out in July and was the first open model that ran on Cerebras hardware fast enough that hackathon competitors were winning with it. It set up GLM's quiet rise as a business workhorse later in the year.

Google DeepMind
New Models

Gemini 3 Flash

Gemini 3 Flash delivers frontier intelligence at $0.50/1M input tokens

Google launched Gemini 3 Flash, offering frontier-tier capability at flash-tier pricing of $0.50 per million input tokens. It scores 78% on SWE-bench Verified, beating larger models on some agentic tasks, and supports tool-calling at scale with up to 100 simultaneous function calls.

$0.50 per 1M Gemini 3 Flash input tokens78% SWE-bench Verified
NVIDIA
New ModelsOpen weights

Nemotron 3 Nano

NVIDIA ships Nemotron 3 Nano, a 30B hybrid Mamba-MoE with full recipes

NVIDIA released Nemotron 3 Nano, a 30B-parameter hybrid Mamba-MoE model with only 3B active parameters for efficient inference. The panel called it the most consequential open release of the week because NVIDIA shipped not just weights but technical reports, training recipes, and details on the 25T-token training data.

30B (3B active) Nemotron 3 Nano parameters
Weights & Biases
Products & Apps

LLM Evaluation Jobs

W&B launches LLM Evaluation Jobs for OpenAI-compatible APIs

Weights & Biases launched LLM Evaluation Jobs, letting teams run evaluations against any OpenAI-compatible API during training cycles instead of only at the end. The show framed it as a practical workflow upgrade for getting earlier model quality signals without blindly burning compute.

November 2025

Weights & Biases
Products & Apps

Serverless LoRA Inference

W&B launches Serverless LoRA Inference on CoreWeave

Weights & Biases launched Serverless LoRA Inference on CoreWeave: upload a LoRA adapter to W&B Artifacts and serve it instantly on top of any supported base model with no cold starts and no dedicated GPU instances. Alex demoed a 'Mocking SpongeBob' LoRA he trained in 25 minutes, served on a Qwen 2.5 base.

Weights & Biases
Dev ToolsOpen weights

W&B LEET

W&B ships LEET, an open-source terminal UI for monitoring ML runs

Weights & Biases released LEET (Lightweight Experiment Exploration Tool), an open-source terminal-native dashboard for tracking ML runs, demoed live by Dima Duev of the SDK team. It works fully offline for air-gapped HPC clusters and brings real-time metrics, system stats, and zoomable interactive charts to the terminal.

Amazon Web Services
Also Released

AWS-OpenAI infrastructure partnership

AWS announces multi-year strategic infrastructure partnership with OpenAI

AWS announced a multi-year strategic infrastructure partnership with OpenAI to power ChatGPT inference, training, and agentic AI workloads. It is another sign of OpenAI spreading its compute needs across every major cloud provider, and a notable win for AWS in the frontier-AI infrastructure race.

Sandbar
Products & Apps

Stream / Stream Ring

Sandbar launches Stream voice assistant and Stream Ring wearable

Sandbar launched Stream, a voice-first personal assistant, alongside Stream Ring, a wearable described as a 'mouse for voice' that is now available for preorder. The pairing pushes always-available voice interaction into dedicated hardware rather than the phone.

October 2025

Apple
Products & Apps

M5 chip

Apple announces M5 chip with double the AI performance

Apple unveiled the M5 chip, claiming roughly double the AI performance of the previous generation for Apple Silicon. For local-model enthusiasts on the show, it means more on-device headroom for running and fine-tuning models on Macs.

OpenAI
Also Released

OpenAI x Broadcom custom accelerators

OpenAI and Broadcom to deploy 10 gigawatts of custom AI accelerators

OpenAI announced a strategic collaboration with Broadcom to co-develop and deploy 10 gigawatts of custom AI accelerators. It is another massive compute commitment in OpenAI's infrastructure buildout, this time with chips designed in-house.

September 2025

NVIDIA
Funding

NVIDIA-OpenAI $100B partnership

Nvidia commits up to $100B to OpenAI for 10GW of compute

Nvidia and OpenAI announced a letter of intent under which Nvidia would invest up to $100 billion in OpenAI as the two deploy at least 10 gigawatts of Nvidia systems for OpenAI's next-generation infrastructure. The episode's big-company segment centered on this deal as evidence that money and infrastructure, not just models, now drive the AI race.

Meta AI
Products & Apps

Meta AI Glasses with Display

Meta Connect: new AI glasses with a display and neural control interface

At Meta Connect, Meta unveiled new AI glasses featuring a built-in display, a neural wristband control interface, and a new AI mode. The panel treats the glasses as an interface milestone, arguing the product surface for AI is shifting from apps to display-equipped wearables.

Weights & Biases
Major Features & Updates

Weave in W&B Workspaces

W&B brings Weave traces into Models workspaces for RL runs

Weights & Biases shipped Weave inside W&B Models workspaces, so reinforcement learning runs can now be logged and inspected with Weave trace tooling alongside training metrics. The show frames it as giving RL training 'x-ray vision' into what the model is actually doing.

July 2025

Cloudflare
Major Features & Updates

One-Click AI Bot Blocking

Cloudflare launches one-click AI bot blocking for the web

Cloudflare announced a one-click feature letting site owners block AI scraping bots, a direct response to the economics of perpetual web scraping by AI labs. The move puts a default-off switch in front of a large share of the internet and highlights the tension between open research norms and commercial scraping.

Huawei
New ModelsOpen weights

Pangu Pro MoE

Huawei's Pangu Pro MoE: 72B model trained entirely on Ascend NPUs

Huawei released Pangu Pro, a 72B-parameter MoE trained on its own Ascend NPUs rather than Nvidia or AMD hardware, hitting 1,528 tokens/sec and pretrained on 13T tokens. The panel framed it as the geopolitical open-model story of the week, showing how far Chinese compute stacks have advanced under sanctions.

May 2025

Nous Research
Products & AppsOpen weights

Psyche

Nous Research launches Psyche, a decentralized cooperative-training network

Psyche is Nous Research's decentralized cooperative-training network that lets distributed participants jointly train large models over the internet. The launch includes open code on GitHub and a live dashboard tracking the first run, a 40B model called Consilience. COO Dillon Rolnick joined the show to explain the decentralized training push.

New ModelsOpen weights

Falcon-Edge

Falcon-Edge: ternary BitNet LLMs for edge deployment under 1GB VRAM

TII's Falcon-Edge project releases ternary BitNet LLMs (1B and 3B base models) that slash memory and compute requirements, enabling inference on less than 1GB of VRAM. Fine-tuners get pre-quantized checkpoints and a clear path to 1-bit LLMs.

April 2025

Google DeepMind
New ModelsOpen weights

Gemma 3 QAT

Google ships Quantization-Aware Trained Gemma 3 models for consumer GPUs

Google released Quantization-Aware Training (QAT) versions of the Gemma 3 family, dramatically cutting memory requirements while preserving quality. The 27B model drops from a hefty 54GB to just 14.1GB, and even the 1B model goes from 2GB to about half a gig, making state-of-the-art open models runnable on consumer GPUs. Wolfram took the 4B QAT model for a spin in LM Studio on the show.

27B Gemma 3 27B QAT: 54GB down to 14.1GB1B Gemma 3 1B QAT: 2GB down to ~0.5GB4B 4B QAT model tested in LM Studio
Microsoft
New ModelsOpen weights

BitNet b1.58

Microsoft releases BitNet 1.58-bit model weights on Hugging Face

Microsoft published BitNet (listed in the show notes as BitNet v1.5), its native 1.58-bit quantized LLM, as open weights on Hugging Face. The ternary-weight approach targets extremely efficient CPU inference at a fraction of the memory of standard models.

CoreWeave
Benchmarks & Evals

CoreWeave GB200 inference benchmark

CoreWeave hits 800 tok/s on Llama 405B with NVIDIA GB200 Blackwell

CoreWeave announced record-breaking AI inference benchmarks using NVIDIA's new GB200 Grace Blackwell superchips: 800 tokens/sec on Llama 3.1 405B, plus 33,000 tokens/sec on Llama 2 70B with H200s. It is a marker of how fast inference hardware is accelerating.

800 tok/s Llama 3.1 405B on GB20033,000 tok/s Llama 2 70B on H200

March 2025

Arcee AI
Products & Apps

Arcee Conductor

Arcee AI announces Conductor, an intelligent model router

Arcee AI's Lucas Atkins joined the show to announce Conductor, a model router that picks the best model (including Arcee's small specialized models) for each query. It targets cost and quality optimization by routing requests instead of sending everything to one large model.

Nous Research
APIs & Platforms

Portal

Nous Research opens Portal, an inference API for Hermes models

Nous Research launched Portal, its new inference API service offering access to models like Hermes 3 Llama 70B and DeepHermes 3 8B directly via API. It marks another open-source lab standing up hosted API access to make its models more accessible.

February 2025

DeepSeek
Dev ToolsOpen weights

Open Source Week infra releases

DeepSeek open-sources its infra stack during Open Source Week

DeepSeek ran its Open Source Week, releasing a series of production infrastructure repos (including FlashMLA, DeepEP, and DeepGEMM) that power its training and inference stack. The drops gave the open-source community a rare look at the low-level kernels and communication libraries behind DeepSeek's efficient frontier models.

Inception Labs
New Models

Mercury

Inception Labs debuts Mercury, a commercial diffusion LLM

Inception Labs announced Mercury, billed as the first commercial-scale diffusion large language model, generating text via diffusion rather than autoregressive decoding. The approach promises dramatically faster token throughput, demoed first with the Mercury Coder playground.

Hao AI Lab
Dev ToolsOpen weights

FastVideo

Hao AI Lab's FastVideo makes HunyuanVideo 3x faster with no extra training

Hao AI Lab released FastVideo, a method that makes HunyuanVideo (HY-Video) three times faster with no additional training, using a technique called Sliding Tile Attention that outperforms even flash attention for this workload. Faster inference makes open-source video models far more practical, and it supports HY-Video LoRAs for fine-tuned applications.

Microsoft
Products & Apps

Majorana 1

Microsoft unveils Majorana 1 quantum chip and a new state of matter

Microsoft announced the Majorana 1 quantum chip alongside a claimed new state of matter called topological superconductivity, carving a new path for quantum computing. Alex called the announcement 'absolutely mind blowing' as a potential big deal for the future of computing.

January 2025

Funding

Stargate Project

Stargate Project: $500B AI infrastructure investment announced

OpenAI, SoftBank (Masayoshi Son's Vision Fund), and Oracle (Larry Ellison) announced the Stargate Project, a planned $500 billion investment in US AI infrastructure. The announcement, made alongside the White House, was framed on the show as an AI 'Manhattan Project'-scale buildout of datacenters and compute.

$500B Planned investment