Vision & Perception

Visual understanding: VLMs, OCR, detection, segmentation, and document and video understanding. 40 releases covered on the show.

June 2026

Midjourney Jun 18, 2026

Products & Apps

Midjourney Medical scanner

Midjourney announces Midjourney Medical, a full-body ultrasonic scanner concept

Midjourney announced Midjourney Medical, a full-body ultrasound scanner concept that the episode described as capturing 806TB per scan in under 60 seconds. The panel treated it as a striking sign that AI-native companies are moving beyond chatbots into hardware, imaging, and healthcare infrastructure.

806TB scan payload · <60s scan time
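
To put the two quoted figures in perspective, here is a rough sketch of the sustained data rate they imply. It assumes decimal units (1 TB = 10^12 bytes) and treats 60 seconds as the slowest allowed scan; the function and constant names are illustrative and not from the announcement.

```python
# Back-of-the-envelope: average data rate implied by the quoted scan figures.
# Assumption: decimal terabytes, and the full 60 s upper bound is used.

SCAN_PAYLOAD_TB = 806    # payload per scan, as quoted on the show
MAX_SCAN_SECONDS = 60    # "under 60 seconds", so this is the slowest case


def min_throughput_tb_per_s(payload_tb: float, max_seconds: float) -> float:
    """Lower bound on the average data rate needed to finish in max_seconds."""
    return payload_tb / max_seconds


rate = min_throughput_tb_per_s(SCAN_PAYLOAD_TB, MAX_SCAN_SECONDS)
print(f"at least {rate:.1f} TB/s sustained")
```

Because the scan is described as taking less than 60 seconds, the real rate would have to be higher than this lower bound.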

Alex Volkov coverage on X ↗ · Nick St. Pierre coverage on X ↗ · Midjourney scanner announcement ↗

#research #vision #industry

May 2026

Meta AI May 14, 2026

New Models · Open weights

Sapiens2

Meta Sapiens2: family of 6 human-centric vision models (0.1B-5B)

Meta released Sapiens2, a family of six ViT models ranging from 0.1B to 5B parameters trained on 1 billion human images. The models set SOTA on human-centric vision tasks including pose estimation, segmentation, surface normals, and pointmaps, with weights on Hugging Face.

X announcement ↗ · Hugging Face collection ↗

#vision #open-source

Perceptron AI May 14, 2026

New Models

Perceptron Mk1

Perceptron Mk1: frontier video + embodied reasoning at 1/10th the price

Perceptron released Mk1, a frontier video and embodied reasoning model priced at roughly a tenth of comparable models. It scores 88.5 on VSI-Bench and 72.4 on RefSpatialBench (versus 9.0 for GPT-5m on the latter) and is live on OpenRouter.

X announcement ↗ · Site ↗

#video-gen #robotics #vision

March 2026

Reka AI Mar 26, 2026

New Models · Open weights

Reka Edge

Reka AI ships Edge, a 7B multimodal VLM for sub-second on-device inference

Reka AI launched Reka Edge, a 7B-parameter multimodal vision-language model built for sub-second latency on edge devices. Weights are on Hugging Face and the model is available through OpenRouter, with the panel highlighting it as a notable efficient multimodal release for real-world deployment.

Reka AI announcement (X) ↗ · Reka Edge on Hugging Face ↗ · Reka Edge on OpenRouter ↗ · Reka AI blog ↗

#open-source #vision #on-device

February 2026

ByteDance Feb 19, 2026

New Models

Seed 2.0

ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing

ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding score of 77% surpasses the 73% human baseline. With near-comparable quality at a price 84% below Opus 4.5, the panel called it a compelling option for price-conscious developers.

Seed 2.0 announcement (X) ↗ · Doubao team model page ↗ · ByteDance-Seed on Hugging Face ↗

#multimodal #vision #frontier-models

Zhipu AI (Z.ai) Feb 5, 2026

New Models · Open weights

GLM-OCR

Z.ai GLM-OCR: 0.9B model takes #1 on OmniDocBench

Z.ai released GLM-OCR, a tiny 0.9B parameter document understanding model that achieves the #1 ranking on OmniDocBench V1.5. It shows that strong OCR and document parsing no longer require large models.

X announcement ↗ · Hugging Face ↗ · Announcement ↗

#open-source #vision

January 2026

Google Jan 29, 2026

Major Features & Updates

Gemini 3 Flash Agentic Vision

Google adds Agentic Vision to Gemini 3 Flash

Gemini 3 Flash gains agentic vision: a Think-Act-Observe loop that can zoom, crop, annotate, and plot images by generating and executing Python code in the backend. Available in the Gemini app, AI Studio, and Vertex AI.
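
The Think-Act-Observe loop described above can be illustrated with a toy sketch. Nothing here reflects Google's implementation: the image is a plain grid of intensities, `think` is a hypothetical stand-in policy that picks the brightest quadrant, and the loop assumes a square image whose side is a power of two.

```python
# Toy Think-Act-Observe loop: repeatedly crop toward the brightest region.
# Assumption: square grayscale image with a power-of-two side length.

Image = list[list[int]]  # rows of pixel intensities


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in img[top:top + height]]


def think(img: Image) -> tuple[int, int, int, int]:
    """Hypothetical policy: choose the quadrant with the highest total intensity."""
    h, w = len(img) // 2, len(img[0]) // 2
    quadrants = [(0, 0, h, w), (0, w, h, w), (h, 0, h, w), (h, w, h, w)]
    return max(quadrants, key=lambda q: sum(map(sum, crop(img, *q))))


def zoom_loop(img: Image, min_size: int = 1):
    """Zoom until the view is min_size pixels wide, recording each step."""
    offset_top, offset_left, steps = 0, 0, []
    while len(img) > min_size and len(img[0]) > min_size:
        top, left, h, w = think(img)          # Think: decide where to look
        img = crop(img, top, left, h, w)      # Act: execute the crop
        offset_top += top
        offset_left += left
        # Observe: the cropped view becomes the input to the next iteration
        steps.append((offset_top, offset_left, h, w))
    return img, steps


# Usage: an 8x8 image with one bright pixel at row 6, column 1.
image = [[0] * 8 for _ in range(8)]
image[6][1] = 255
final_view, trace = zoom_loop(image)
print(final_view, trace)
```

Each recorded step holds the crop's position in original-image coordinates, so the trace shows how the view narrows from a quadrant down to the single bright pixel.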

Announcement (X) ↗ · Docs ↗

#vision #agents #reasoning

Google DeepMind Jan 15, 2026

New Models · Open weights

MedGemma 1.5

Google releases MedGemma 1.5 for offline medical imaging

Google released MedGemma 1.5, a small (4B-class) open model for medical use cases, compact enough to run offline for medical imaging. The panel stressed it is a different model class from Byte's giant M3 medical LLM and that the two pair well together rather than replacing each other.

#research #open-source #vision

December 2025

Allen AI Dec 18, 2025

New Models · Open weights

OLMO 2 (multimodal)

Allen AI adds video-input multimodal OLMO models in 4B/7B/8B sizes

Allen AI extended its OLMO family with multimodal models that accept video input, released in 4B, 7B, and 8B sizes. It continues Allen AI's fully open approach to model development alongside the BOLMO byte-level work.

OLMO multimodal announcement ↗

#open-source #multimodal #vision

Mistral AI Dec 18, 2025

New Models

Mistral OCR 3

Mistral OCR 3 claims 74% win-rate over OCR v2 with aggressive pricing

Mistral released OCR 3, its latest document intelligence model, claiming a 74% win-rate over OCR v2. The panel highlighted its aggressive pricing and document performance gains as part of the open-source-adjacent European push on practical document AI.

Mistral OCR 3 blog ↗ · Mistral OCR 3 announcement ↗ · Mistral Console ↗

November 2025

Tencent (Hunyuan) Nov 27, 2025

New Models · Open weights

HunyuanOCR

Tencent's 1B HunyuanOCR beats 72B models on OCRBench

Tencent released HunyuanOCR, a 1B-parameter OCR model that scores 860 on OCRBench, beating models as large as Qwen3-VL-72B. It is a striking example of task-specialized small models outperforming generalist giants.

1B Parameters · 860 OCRBench score

HunyuanOCR on HuggingFace ↗ · HunyuanOCR on GitHub ↗ · HunyuanOCR Announcement on X ↗ · Hunyuan Vision Blog ↗

#vision #open-source #on-device

Meta AI Nov 20, 2025

New Models · Open weights

SAM 3

Meta SAM 3: open-vocabulary segmentation and tracking in video

Meta's Segment Anything Model 3 adds open-vocabulary segmentation with text and exemplar prompts, letting you click or type to segment and track any object across images and video. The panel demoed it live on golden retriever videos, and it ships openly as part of Meta's open-source push.

#vision #open-source

Meta AI Nov 20, 2025

New Models · Open weights

SAM 3D

SAM 3D turns single photos into 3D objects and human bodies

Released alongside SAM 3, SAM 3D reconstructs 3D objects and full human bodies from a single image with surprisingly high quality. It extends the Segment Anything family from 2D segmentation into single-image 3D reconstruction.

#vision #world-models #open-source

Baidu Nov 13, 2025

New Models · Open weights

ERNIE-4.5-VL-28B-A3B-Thinking

Baidu open-sources ERNIE-4.5-VL-28B-A3B-Thinking visual reasoning model

Baidu released ERNIE-4.5-VL-28B-A3B-Thinking, an Apache 2.0 open-weights visual reasoning MoE with only 3B active parameters that claims to rival much larger models like GPT-5 High on vision tasks. It features image zooming, spatial grounding, and reasoning. Its strong small-model performance was attributed to training with GSPO, a reinforcement learning method from the Qwen team.

3B Active Parameters

Baidu announcement on X ↗ · Hugging Face model page ↗ · GitHub repo ↗ · Ernie blog post ↗

#open-source #vision #reasoning

Allen Institute for AI (Ai2) Nov 6, 2025

New Models · Open weights

OlmoEarth

Ai2 launches OlmoEarth foundation models and open Earth-intelligence platform

Ai2 launched OlmoEarth, a family of foundation models plus an open, end-to-end platform for fast, high-resolution Earth intelligence. It applies the lab's open-model approach to geospatial and remote-sensing data, making Earth observation workloads accessible without proprietary stacks.

#open-source #vision #frontier-models

October 2025

Alibaba (Qwen) Oct 23, 2025

New Models · Open weights

Qwen3-VL 2B & 32B

Qwen3-VL adds compact 2B and 32B multimodal models

Alibaba's Qwen team extended the Qwen3-VL family with newly updated 2B and 32B checkpoints. The 2B is a general-purpose, OCR-capable VLM that holds up against the 4B and 8B siblings released in prior weeks, while the 32B reportedly outperforms GPT-5 mini and Claude 4 Sonnet on benchmarks.

X ↗ · Hugging Face ↗

#open-source #vision #multimodal

Allen Institute for AI (Ai2) Oct 23, 2025

New Models · Open weights

olmOCR 2 7B

Ai2 releases olmOCR 2 7B open OCR model

The Allen Institute for AI updated its open OCR line with olmOCR 2 at 7B (released as an FP8 checkpoint), landing in the same week as DeepSeek-OCR, Qwen3-VL, and Liquid AI's LFM2-VL. It was another sign that document understanding had become that week's hottest open-model category.

#vision #open-source

DeepSeek Oct 23, 2025

New Models · Open weights

DeepSeek-OCR

DeepSeek-OCR turns text into compressed vision tokens for massive contexts

DeepSeek open-sourced DeepSeek-OCR, a 3B model (~570M active parameters) that is less an OCR model than a context-compression breakthrough. It renders text as images, compresses it up to 10x while retaining 97% decoding accuracy (60% even at 20x), and reads it back with a tiny vision decoder. The approach suggests text tokenization is far from optimal and points to vastly cheaper long-context processing. Reportedly, alphaXiv OCR'd all of arXiv for $1,000, versus $7,500 with Mistral OCR, and a single H100 can process up to 200K pages.

97% decoding accuracy at 10x compression · ~570M active parameters (3B total) · 200K pages scannable on a single H100
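
The compression figures above translate directly into token budgets. The sketch below assumes the compression ratio means text tokens divided by vision tokens; the 100,000-token document length is a made-up example, and the accuracy and cost numbers are only those quoted in this entry.

```python
import math


def vision_tokens_needed(text_tokens: int, compression: float) -> int:
    """Vision tokens required to carry text_tokens at a given compression ratio."""
    return math.ceil(text_tokens / compression)


# (compression ratio, decoding accuracy) pairs as quoted in the entry above
REPORTED = [(10, 0.97), (20, 0.60)]

TEXT_TOKENS = 100_000  # hypothetical document length, chosen for illustration

for ratio, accuracy in REPORTED:
    tokens = vision_tokens_needed(TEXT_TOKENS, ratio)
    print(f"{ratio}x: {tokens} vision tokens, ~{accuracy:.0%} decoding accuracy")

# Reported cost of OCR'ing all of arXiv, in US dollars
ARXIV_COST_DEEPSEEK_OCR = 1000
ARXIV_COST_MISTRAL_OCR = 7500
cost_ratio = ARXIV_COST_MISTRAL_OCR / ARXIV_COST_DEEPSEEK_OCR
print(f"reported cost ratio: {cost_ratio:.1f}x")
```

The trade-off is visible in the two rows: doubling the compression halves the token count again, but the quoted accuracy drops from 97% to 60%.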

X ↗ · HF ↗ · Paper ↗

#vision #open-source #search

Liquid AI Oct 23, 2025

New Models · Open weights

LFM2-VL-3B

Liquid AI ships LFM2-VL-3B tiny multilingual vision-language model

Liquid AI released LFM2-VL-3B, a tiny multilingual vision-language model, part of a wave of OCR-and-VLM releases this week. It targets efficient on-device and edge vision-language workloads at the 3B scale.

#vision #open-source #on-device

Alibaba (Qwen) Oct 16, 2025

New Models · Open weights

Qwen3-VL 4B/8B

Qwen3-VL adds compact 4B and 8B open vision-language models

Alibaba's Qwen team released smaller Qwen3-VL vision-language models in 4B and 8B sizes, bringing the flagship VL capabilities down to edge- and laptop-friendly scales. Weights are open on Hugging Face as part of the Qwen3-VL collection.

X announcement ↗ · Hugging Face collection ↗

#open-source #vision #multimodal

September 2025

Alibaba (Qwen) Sep 25, 2025

New Models · Open weights

Qwen3-VL

Alibaba releases Qwen3-VL open-weights vision-language flagship

Alibaba's Qwen team shipped Qwen3-VL, its new flagship open-weights vision-language family, headlining the episode's 'Qwen-mas' barrage. The panel discussed it as a practical workflow tool for visual understanding and agentic GUI tasks, not just another model card, with weights, a blog post, and a Hugging Face demo all available at launch.

X ↗ · HF ↗ · Blog ↗ · Demo ↗

#open-source #vision #multimodal

IBM Sep 25, 2025

New Models · Open weights

Granite Docling 258M

IBM releases Granite Docling 258M compact document-parsing VLM

IBM published Granite Docling 258M, an ultra-compact open-source vision-language model for document understanding that converts documents into structured output. At just 258M parameters it reinforced the show's point that tiny specialized models are becoming genuinely useful workflow tools.

#vision #on-device #open-source

Moondream AI Sep 25, 2025

New Models · Open weights

Moondream 3

Moondream 3 preview punches above its weight in the tiny-VLM race

Moondream released a preview of Moondream 3, a small open vision-language model that punches well above its size class. CTO and co-founder Vik Korrapati joined the show to explain why small, capable vision models matter for real product building, framing Moondream 3 as a practical tool rather than a benchmark flex.

#vision #on-device #open-source

Moondream Sep 18, 2025

New Models · Open weights

Moondream 3 (Preview)

Moondream 3 Preview: 9B MoE VLM with 2B active parameters

Moondream released a preview of Moondream 3, a 9B mixture-of-experts vision-language model with only 2B active parameters. It targets frontier-level visual reasoning at small-model cost, continuing Moondream's run of efficient open vision models.
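
For a sense of what "9B total, 2B active" means, the sketch below computes the share of parameters used per token for the mixture-of-experts models on this page. The figures are the ones quoted in these entries; the 28B total for ERNIE is read off its model name, which is an assumption, and the dictionary layout is purely illustrative.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a mixture-of-experts model's parameters used for each token."""
    return active_params_b / total_params_b


# (total, active) parameter counts in billions, as quoted on this page.
# Assumption: ERNIE's 28B total is inferred from the "28B-A3B" model name.
MOE_MODELS = {
    "Moondream 3 (Preview)": (9.0, 2.0),
    "ERNIE-4.5-VL-28B-A3B-Thinking": (28.0, 3.0),
    "DeepSeek-OCR": (3.0, 0.57),
}

for name, (total, active) in MOE_MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B active ({share:.0%})")
```

The active share mainly affects per-token compute. All of the weights still have to be stored, so memory requirements follow the total parameter count.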

#vision #open-source #architecture

Perceptron AI Sep 18, 2025

New Models · Open weights

Isaac 0.1

Perceptron AI introduces Isaac 0.1, a 2B perceptive-language model

Perceptron AI released Isaac 0.1, a 2B parameter perceptive-language model with open weights on Hugging Face. Despite its small size, the show notes highlight that it 'points better than GPT', excelling at visual grounding and pointing tasks relative to much larger models.

X ↗ · HF ↗ · Blog ↗

#open-source #vision #multimodal

Alibaba (Tongyi Lab) Sep 4, 2025

New Models · Open weights

WebWatcher-32B

Alibaba's Tongyi Lab open-sources WebWatcher vision-language research agent

Alibaba's Tongyi Lab open-sourced WebWatcher, a vision-language deep research agent that sets new state-of-the-art results on agentic browsing and research tasks. The 32B model combines visual understanding with web research capabilities and is available on Hugging Face.

#open-source #agents #search

Apple Sep 4, 2025

New Models · Open weights

FastVLM-7B

Apple's FastVLM-7B lands with a speed-first vision encoder, 85x faster TTFT

Apple released FastVLM-7B, a vision-language model built around a speed-first vision encoder that delivers up to 85x faster time-to-first-token than peer VLMs. Quantized variants (7B-int4, 1.5B-int8) on Hugging Face make it practical for on-device and real-time vision use, anchoring the show's fast-VLM discussion.

X ↗ · HF ↗ · HF (1.5B int8) ↗

#vision #on-device #open-source

May 2025

ByteDance May 15, 2025

New Models

Seed1.5-VL

ByteDance publishes Seed1.5-VL, a 20B vision-language thinking model

ByteDance's Seed team published the technical report for Seed1.5-VL, a 20B-parameter vision-language model with thinking capabilities. It was covered among the big-company releases of the week, with the tech report shared on GitHub.

Technical report ↗

#vision #multimodal #reasoning

April 2025

NVIDIA Apr 24, 2025

New Models · Open weights

Describe Anything (DAM-3B)

NVIDIA releases DAM-3B for region-based image and video captioning

NVIDIA dropped the Describe Anything Model (DAM-3B), a 3 billion parameter multimodal model for region-based image and video captioning. You can point it at a specific region of an image or video and it generates a detailed description of just that area. NVIDIA also published an accompanying DescribeAnything dataset and a Hugging Face demo.

3B Parameters

X Post ↗ · HF Model ↗ · HF Demo ↗ · HF Dataset ↗

#vision #multimodal #open-source

Moonshot AI (Kimi) Apr 10, 2025

New Models · Open weights

Kimi-VL & Kimi-VL-Thinking

Moonshot drops Kimi-VL and Kimi-VL-Thinking, tiny A3B open vision models

Moonshot AI released Kimi-VL and Kimi-VL-Thinking, compact vision-language models with only ~3B active parameters (A3B MoE). The thinking variant adds reasoning to a tiny VLM, and both are available openly on Hugging Face.

A3B ~3B active parameters (MoE)

Hugging Face collection: Kimi-VL-A3B ↗

#open-source #vision #reasoning

March 2025

Mistral AI Mar 20, 2025

New Models · Open weights

Mistral Small 3.1

Mistral Small 3.1 24B: open-weights multimodal model

Mistral released Mistral Small 3.1, a 24B-parameter open-weights model that adds multimodal (vision) capabilities to the Small line. Both instruct and base checkpoints were published on Hugging Face, making it a strong local multimodal option at the 24B size class.

Blog Post ↗ · HuggingFace page ↗ · Base Model on HF ↗

#open-source #multimodal #vision

Roboflow Mar 20, 2025

New Models · Open weights

RF-DETR

Roboflow drops RF-DETR, a SOTA open-source object detection model

Roboflow released RF-DETR, a state-of-the-art real-time object detection model, announced as breaking news on the show by CEO Joseph Nelson. The model is fully open source on GitHub and targets practical, deployable computer vision workloads.

RF-DETR Blog Post ↗ · RF-DETR Github ↗

#vision #open-source

Roboflow Mar 20, 2025

Benchmarks & Evals · Open weights

RF100-VL

Roboflow launches RF100-VL benchmark for vision-language models

Alongside RF-DETR, Roboflow introduced RF100-VL, a new evaluation benchmark for vision-language models built from real-world detection datasets. It gives the community a grounded way to measure how well VLMs handle practical object detection tasks.
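
Detection benchmarks generally score a predicted box by how much it overlaps the ground-truth box. The entry does not say which metric RF100-VL uses, so the sketch below shows only intersection over union (IoU), the standard building block for detection metrics such as mAP. The `Box` type and the example coordinates are illustrative.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box with (x1, y1) top-left and (x2, y2) bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    """Area of a box, clamped to zero for degenerate boxes."""
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, in the range [0, 1]."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    intersection = inter_w * inter_h
    union = area(a) + area(b) - intersection
    return intersection / union if union > 0 else 0.0


# Usage: a prediction shifted by half a box width and height.
ground_truth = Box(0, 0, 10, 10)
prediction = Box(5, 5, 15, 15)
print(f"IoU = {iou(ground_truth, prediction):.3f}")
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold, and that threshold is part of the benchmark's definition.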

RF100-VL Benchmark ↗ · RF-DETR Blog Post ↗

#benchmarks #vision

Cohere For AI Mar 6, 2025

New Models · Open weights

Aya Vision

Cohere For AI releases Aya Vision 8B and 32B open multilingual vision models

Cohere For AI released Aya Vision in 8B and 32B sizes, extending the multilingual Aya family with open-weights vision-language capabilities. The models target multilingual multimodal understanding across many languages.

Announcement (X) ↗ · Hugging Face Collection ↗

#open-source #vision #multilingual

Mistral AI Mar 6, 2025

APIs & Platforms

Mistral OCR

Mistral announces state-of-the-art OCR API

Mistral AI announced Mistral OCR, a document-understanding API the company claims is state of the art at extracting text, tables, and equations from complex documents. It targets RAG and document-processing pipelines with structured markdown output.

February 2025

Microsoft Feb 20, 2025

New Models · Open weights

OmniParser v2

Microsoft ships OmniParser v2 for faster screen parsing in GUI agents

Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.

Gradio Demo ↗

#agents #vision

University of Cambridge researchers Feb 20, 2025

Benchmarks & Evals · Open weights

ZeroBench

ZeroBench: the 'impossible' benchmark where all top VLMs score zero

A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.

Announcement on X ↗ · Project page ↗ · Paper ↗ · Hugging Face ↗

#benchmarks #vision

January 2025

Alibaba (Qwen) Jan 30, 2025

New Models · Open weights

Qwen2.5-VL

Alibaba ships Qwen2.5-VL open vision-language model family

Alibaba's Qwen team released Qwen2.5-VL, open-weights vision-language models up to 72B that handle images, documents, video understanding, and on-screen agentic grounding. The 72B Instruct model was immediately available on Hugging Face and in Qwen Chat.

72B Largest variant

Project blog ↗ · Hugging Face ↗ · GitHub ↗ · Try it (Qwen Chat) ↗

#vision #open-source #multimodal

NVIDIA Jan 30, 2025

New Models · Open weights

Eagle 2

NVIDIA releases Eagle 2 open vision-language models

NVIDIA published Eagle 2, a family of open vision-language models with an accompanying paper, model weights on Hugging Face, and a live demo. It is a fully transparent VLM release covering training data strategy and recipes, competitive with much larger vision models.

Paper ↗ · Models (HF collection) ↗ · Demo ↗

#vision #open-source #multimodal

Hugging Face Jan 23, 2025

New Models · Open weights

SmolVLM (256M)

Hugging Face SmolVLM: tiny vision-language models run on WebGPU

Hugging Face released SmolVLM, a family of tiny vision-language models including a 256M-parameter version small enough to run entirely in the browser via WebGPU. It demonstrates how far efficient multimodal models have shrunk while remaining usable.

256M Parameters (smallest VLM)
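
A rough weight-memory estimate shows why a 256M-parameter model is plausible in a browser tab. The sketch counts weights only and ignores activations, caches, and runtime overhead. The entry does not say which precision the WebGPU demo uses, so the dtype table below is a generic assumption rather than a description of that demo.

```python
# Bytes needed to store one parameter at common precisions (generic assumption)
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in decimal megabytes (weights only)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


SMOLVLM_PARAMS = 256_000_000  # the 256M figure quoted above

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_mb(SMOLVLM_PARAMS, dtype):.0f} MB")
```

The same function applies to the larger models on this page; a 7B model at one byte per parameter, for example, needs roughly 7,000 MB for weights alone.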

SmolVLM-256M WebGPU demo on Hugging Face ↗

#vision #open-source #on-device