Vision & Perception

Visual understanding: VLMs, OCR, detection, segmentation, and document and video understanding. 39 releases covered on the show.

May 2026

March 2026

Reka AI
New Models | Open weights

Reka Edge

Reka AI ships Edge, a 7B multimodal VLM for sub-second on-device inference

Reka AI launched Reka Edge, a 7B-parameter multimodal vision-language model built for sub-second latency on edge devices. Weights are on Hugging Face and the model is available through OpenRouter, with the panel highlighting it as a notable efficient multimodal release for real-world deployment.

February 2026

ByteDance
New Models

Seed 2.0

ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing

ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. On video understanding it reportedly scores 77% against a human baseline of 73%. With near-comparable quality at up to 84% less than Opus 4.5, the panel called it a compelling option for price-conscious developers.

January 2026

Google DeepMind
New Models | Open weights

MedGemma 1.5

Google releases MedGemma 1.5 for offline medical imaging

Google released MedGemma 1.5, a small (4B-class) open model for medical use cases, compact enough to run offline for medical imaging. The panel stressed it is a different model class from Byte's giant M3 medical LLM and that the two pair well together rather than replacing each other.

December 2025

November 2025

Tencent (Hunyuan)
New Models | Open weights

HunyuanOCR

Tencent's 1B HunyuanOCR beats 72B models on OCRBench

Tencent released HunyuanOCR, a 1B-parameter OCR model that scores 860 on OCRBench, beating models as large as Qwen3-VL-72B. It is a striking example of task-specialized small models outperforming generalist giants.

1B Parameters | 860 OCRBench score

Meta AI
New Models | Open weights

SAM 3

Meta SAM 3: open-vocabulary segmentation and tracking in video

Meta's Segment Anything Model 3 adds open-vocabulary segmentation with text and exemplar prompts, letting you click or type to segment and track any object across images and video. The panel demoed it live on golden retriever videos, and it ships openly as part of Meta's open-source push.

Meta AI
New Models | Open weights

SAM 3D

SAM 3D turns single photos into 3D objects and human bodies

Released alongside SAM 3, SAM 3D reconstructs 3D objects and full human bodies from a single image with surprisingly high quality. It extends the Segment Anything family from 2D segmentation into single-image 3D reconstruction.

Baidu
New Models | Open weights

ERNIE-4.5-VL-28B-A3B-Thinking

Baidu open-sources ERNIE-4.5-VL-28B-A3B-Thinking visual reasoning model

Baidu released ERNIE-4.5-VL-28B-A3B-Thinking, an Apache 2.0 open-weights visual reasoning MoE with only 3B active parameters that Baidu claims rivals much larger models such as GPT-5 High on vision tasks. It supports image zooming, spatial grounding, and reasoning, and its strong small-model performance is attributed to GSPO, a training method from the Qwen team.

3B Active Parameters

Ai2
New Models | Open weights

OlmoEarth

Ai2 launches OlmoEarth foundation models and open Earth-intelligence platform

Ai2 launched OlmoEarth, a family of foundation models plus an open, end-to-end platform for fast, high-resolution Earth intelligence. It applies the lab's open-model approach to geospatial and remote-sensing data, making Earth observation workloads accessible without proprietary stacks.

October 2025

Alibaba (Qwen)
New Models | Open weights

Qwen3-VL 2B & 32B

Qwen3-VL adds compact 2B and 32B multimodal models

Alibaba's Qwen team extended the Qwen3-VL family with new 2B and 32B checkpoints. The 2B is a general-purpose, OCR-capable VLM that holds up against the 4B and 8B siblings released in prior weeks, while the 32B reportedly outperforms GPT-5 mini and Claude 4 Sonnet on benchmarks.

Ai2
New Models | Open weights

olmOCR 2 7B

Ai2 releases olmOCR 2 7B open OCR model

The Allen Institute for AI updated its open OCR line with olmOCR 2 at 7B (released as an FP8 checkpoint), landing in the same week as DeepSeek-OCR, Qwen3-VL, and Liquid's LFM2-VL. It was another sign that document understanding had become that week's hottest open-model category.

DeepSeek
New Models | Open weights

DeepSeek-OCR

DeepSeek-OCR turns text into compressed vision tokens for massive contexts

DeepSeek open-sourced DeepSeek-OCR, a 3B model (~570M active parameters) that is less an OCR model and more a context-compression breakthrough: it renders text as images, compresses it up to 10x while retaining 97% decoding accuracy (60% even at 20x), and reads it back with a tiny vision decoder. The approach suggests text tokenization is far from optimal and points to much cheaper long-context processing. alphaXiv reportedly OCR'd all of arXiv for $1,000 versus $7,500 with Mistral OCR, and a single H100 can reportedly process up to 200K pages.
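
To make the quoted figures concrete, here is a minimal Python sketch of the arithmetic. It only uses the numbers from the entry above (10x at 97%, 20x at 60%, $1000 versus $7500). The helper names and the 100,000-token document length are hypothetical, chosen for illustration, and nothing here reflects DeepSeek's actual code.

```python
import math

# Figures quoted in the entry above; helper names are illustrative only.
ACCURACY_AT_RATIO = {10: 0.97, 20: 0.60}  # compression ratio -> decoding accuracy
COST_DEEPSEEK_OCR_USD = 1000  # reported alphaXiv cost to OCR all of arXiv
COST_MISTRAL_OCR_USD = 7500   # reported comparison cost


def vision_tokens_needed(text_tokens: int, ratio: int) -> int:
    """Vision tokens required to carry `text_tokens` at a given compression ratio."""
    return math.ceil(text_tokens / ratio)


def cost_multiple(baseline_usd: float, new_usd: float) -> float:
    """How many times cheaper the new option is than the baseline."""
    return baseline_usd / new_usd


if __name__ == "__main__":
    text_tokens = 100_000  # hypothetical document length, not from the source
    for ratio, accuracy in ACCURACY_AT_RATIO.items():
        tokens = vision_tokens_needed(text_tokens, ratio)
        print(f"{ratio}x: {tokens:,} vision tokens, ~{accuracy:.0%} decoding accuracy")
    multiple = cost_multiple(COST_MISTRAL_OCR_USD, COST_DEEPSEEK_OCR_USD)
    print(f"{multiple:.1f}x cheaper")
```

Under these assumptions a 100,000-token document fits in 10,000 vision tokens at 10x, and the reported arXiv run comes out 7.5x cheaper.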

97% decoding accuracy at 10x compression | ~570M active parameters (3B total) | 200K pages scannable on a single H100

Liquid AI
New Models | Open weights

LFM2-VL-3B

Liquid AI ships LFM2-VL-3B tiny multilingual vision-language model

Liquid AI released LFM2-VL-3B, a tiny multilingual vision-language model, part of a wave of OCR-and-VLM releases this week. It targets efficient on-device and edge vision-language workloads at the 3B scale.

September 2025

Alibaba (Qwen)
New Models | Open weights

Qwen3-VL

Alibaba releases Qwen3-VL open-weights vision-language flagship

Alibaba's Qwen team shipped Qwen3-VL, its new flagship open-weights vision-language family, headlining the episode's 'Qwen-mas' barrage. The panel discussed it as a practical workflow tool for visual understanding and agentic GUI tasks, not just another model card, with weights, a blog post, and a Hugging Face demo all available at launch.

IBM
New Models | Open weights

Granite Docling 258M

IBM releases Granite Docling 258M compact document-parsing VLM

IBM published Granite Docling 258M, an ultra-compact open-source vision-language model for document understanding that converts documents into structured output. At just 258M parameters it reinforced the show's point that tiny specialized models are becoming genuinely useful workflow tools.

Moondream AI
New Models | Open weights

Moondream 3

Moondream 3 preview punches above its weight in the tiny-VLM race

Moondream released a preview of Moondream 3, a small open vision-language model that punches well above its size class. CTO and co-founder Vik Korrapati joined the show to explain why small, capable vision models matter for real product building, framing Moondream 3 as a practical tool rather than a benchmark flex.

Moondream
New Models | Open weights

Moondream 3 (Preview)

Moondream 3 Preview: 9B MoE VLM with 2B active parameters

Moondream released a preview of Moondream 3, a 9B mixture-of-experts vision-language model with only 2B active parameters. It targets frontier-level visual reasoning at small-model cost, continuing Moondream's run of efficient open vision models.
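
Several entries in this list quote both total and active parameter counts for mixture-of-experts models. As a rough illustration, the Python sketch below computes the active fraction from the figures stated in this list. The function and dictionary names are hypothetical. The active fraction is only a loose proxy for per-token compute, since all experts still have to be held in memory.

```python
# Total and active parameter counts in billions, as quoted in this list.
MOE_MODELS = {
    "Moondream 3 (Preview)": (9.0, 2.0),
    "ERNIE-4.5-VL-28B-A3B-Thinking": (28.0, 3.0),
    "DeepSeek-OCR": (3.0, 0.57),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


if __name__ == "__main__":
    for name, (total_b, active_b) in MOE_MODELS.items():
        share = active_fraction(total_b, active_b)
        print(f"{name}: {active_b:g}B of {total_b:g}B active ({share:.0%})")
```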

Perceptron AI
New Models | Open weights

Isaac 0.1

Perceptron AI introduces Isaac 0.1, a 2B perceptive-language model

Perceptron AI released Isaac 0.1, a 2B-parameter perceptive-language model with open weights on Hugging Face. Despite its small size, the show notes highlight that it 'points better than GPT', excelling at visual grounding and pointing tasks relative to much larger models.

Alibaba (Tongyi Lab)
New Models | Open weights

WebWatcher-32B

Alibaba's Tongyi Lab open-sources WebWatcher vision-language research agent

Alibaba's Tongyi Lab open-sourced WebWatcher, a vision-language deep research agent that sets new state-of-the-art results on agentic browsing and research tasks. The 32B model combines visual understanding with web research capabilities and is available on Hugging Face.

Apple
New Models | Open weights

FastVLM-7B

Apple's FastVLM-7B lands with a speed-first vision encoder, 85x faster TTFT

Apple released FastVLM-7B, a vision-language model built around a speed-first vision encoder that delivers up to 85x faster time-to-first-token than peer VLMs. Quantized variants (7B-int4, 1.5B-int8) on Hugging Face make it practical for on-device and real-time vision use, anchoring the show's fast-VLM discussion.

May 2025

April 2025

NVIDIA
New Models | Open weights

Describe Anything (DAM-3B)

NVIDIA releases DAM-3B for region-based image and video captioning

NVIDIA released the Describe Anything Model (DAM-3B), a 3-billion-parameter multimodal model for region-based image and video captioning. You can point it at a specific region of an image or video and it generates a detailed description of just that area. NVIDIA also published an accompanying DescribeAnything dataset and a Hugging Face demo.

3B Parameters

Moonshot AI (Kimi)
New Models | Open weights

Kimi-VL & Kimi-VL-Thinking

Moonshot drops Kimi-VL and Kimi-VL-Thinking, tiny A3B open vision models

Moonshot AI released Kimi-VL and Kimi-VL-Thinking, compact vision-language models with only ~3B active parameters (A3B MoE). The thinking variant adds reasoning to a tiny VLM, and both are available openly on Hugging Face.

A3B ~3B active parameters (MoE)

March 2025

Mistral AI
APIs & Platforms

Mistral OCR

Mistral announces state-of-the-art OCR API

Mistral AI announced Mistral OCR, a document-understanding API the company claims is state of the art at extracting text, tables, and equations from complex documents. It targets RAG and document-processing pipelines with structured markdown output.

February 2025

Microsoft
New Models | Open weights

OmniParser v2

Microsoft ships OmniParser v2 for faster screen parsing in GUI agents

Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.

Benchmarks & Evals | Open weights

ZeroBench

ZeroBench: the 'impossible' benchmark where all top VLMs score zero

A new benchmark called ZeroBench launched, claiming to be the impossible benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles like reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.

January 2025

Alibaba (Qwen)
New Models | Open weights

Qwen2.5-VL

Alibaba ships Qwen2.5-VL open vision-language model family

Alibaba's Qwen team released Qwen2.5-VL, open-weights vision-language models up to 72B that handle images, documents, video understanding, and on-screen agentic grounding. The 72B Instruct model was immediately available on Hugging Face and in Qwen Chat.

72B Largest variant

Hugging Face
New Models | Open weights

SmolVLM (256M)

Hugging Face SmolVLM: tiny vision-language models run on WebGPU

Hugging Face released SmolVLM, a family of tiny vision-language models including a 256M-parameter version small enough to run entirely in the browser via WebGPU. It demonstrates how far efficient multimodal models have shrunk while remaining usable.

256M Parameters (smallest VLM)