Meta Sapiens2: family of 6 human-centric vision models (0.1B-5B)
Meta released Sapiens2, a family of six ViT models ranging from 0.1B to 5B parameters trained on 1 billion human images. The models set new state-of-the-art results on human-centric vision tasks including pose estimation, segmentation, surface normals, and pointmaps, with weights available on Hugging Face.
Perceptron Mk1: frontier video + embodied reasoning at 1/10th the price
Perceptron released Mk1, a frontier video and embodied reasoning model priced at roughly a tenth of comparable models. It scores 88.5 on VSI-Bench and 72.4 on RefSpatialBench (versus 9.0 for GPT-5m on the latter) and is live on OpenRouter.
Reka AI ships Edge, a 7B multimodal VLM for sub-second on-device inference
Reka AI launched Reka Edge, a 7B-parameter multimodal vision-language model built for sub-second latency on edge devices. Weights are on Hugging Face and the model is available through OpenRouter, with the panel highlighting it as a notable efficient multimodal release for real-world deployment.
ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing
ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding score of 77% surpasses the 73% human benchmark. With pricing 84% below Opus 4.5 and near-comparable quality, the panel called it a compelling option for price-conscious developers.
Z.ai released GLM-OCR, a tiny 0.9B parameter document understanding model that achieves the #1 ranking on OmniDocBench V1.5. It shows that strong OCR and document parsing no longer require large models.
Gemini 3 Flash gains agentic vision: a Think-Act-Observe loop that can zoom, crop, annotate, and plot images by generating and executing Python code in the backend. Available in the Gemini app, AI Studio, and Vertex AI.
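To make the Think-Act-Observe idea concrete, here is a minimal sketch of such a loop on a toy image. It is not Gemini's implementation: the model is replaced by a hand-written stand-in planner, and the names `Box`, `crop`, `think`, and `run_loop` are hypothetical helpers invented for illustration. The planner repeatedly zooms into the brightest quadrant of a small pixel grid until a single pixel remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Crop rectangle in view coordinates; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def crop(view, box):
    """Act: return the part of a 2D grid (list of rows) covered by `box`."""
    return [row[box.left:box.right] for row in view[box.top:box.bottom]]


def think(view):
    """Think: stand-in planner (NOT a model). Picks the brightest quadrant,
    or returns None once the view is a single pixel."""
    height, width = len(view), len(view[0])
    if height <= 1 and width <= 1:
        return None
    mid_x, mid_y = max(1, width // 2), max(1, height // 2)
    quadrants = [
        Box(0, 0, mid_x, mid_y),
        Box(mid_x, 0, width, mid_y),
        Box(0, mid_y, mid_x, height),
        Box(mid_x, mid_y, width, height),
    ]
    # Drop empty quadrants, which appear when one side is already 1 pixel.
    quadrants = [q for q in quadrants if q.right > q.left and q.bottom > q.top]
    return max(quadrants, key=lambda q: sum(map(sum, crop(view, q))))


def run_loop(image, max_steps=16):
    """Repeat Think -> Act -> Observe until the planner stops or steps run out."""
    view, origin, trace = image, (0, 0), []
    for _ in range(max_steps):
        box = think(view)                                   # Think
        if box is None:
            break
        view = crop(view, box)                              # Act
        origin = (origin[0] + box.left, origin[1] + box.top)
        trace.append((origin, len(view[0]), len(view)))     # Observe
    return origin, view, trace


image = [[0] * 8 for _ in range(8)]
image[2][5] = 9  # one bright pixel at x=5, y=2
origin, view, trace = run_loop(image)
print(origin, view, len(trace))  # (5, 2) [[9]] 3
```

In the real feature the "think" step is the model deciding what code to run and the "act" step is executing that generated Python; the sketch only shows the shape of the loop.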
Google releases MedGemma 1.5 for offline medical imaging
Google released MedGemma 1.5, a small (4B-class) open model for medical use cases, compact enough to run offline for medical imaging. The panel stressed it is a different model class from Byte's giant M3 medical LLM and that the two pair well together rather than replacing each other.
Allen AI adds video-input multimodal OLMO models in 4B/7B/8B sizes
Allen AI extended its OLMO family with multimodal models that accept video input, released in 4B, 7B, and 8B sizes. It continues Allen AI's fully open approach to model development alongside the BOLMO byte-level work.
Mistral OCR 3 claims 74% win-rate over OCR v2 with aggressive pricing
Mistral released OCR 3, its latest document intelligence model, claiming a 74% win-rate over OCR v2. The panel highlighted its aggressive pricing and document performance gains as part of the open-source-adjacent European push on practical document AI.
Tencent's 1B HunyuanOCR beats 72B models on OCRBench
Tencent released HunyuanOCR, a 1B-parameter OCR model that scores 860 on OCRBench, beating models as large as Qwen3-VL-72B. It is a striking example of task-specialized small models outperforming generalist giants.
Meta SAM 3: open-vocabulary segmentation and tracking in video
Meta's Segment Anything Model 3 adds open-vocabulary segmentation with text and exemplar prompts, letting you click or type to segment and track any object across images and video. The panel demoed it live on golden retriever videos, and it ships openly as part of Meta's open-source push.
SAM 3D turns single photos into 3D objects and human bodies
Released alongside SAM 3, SAM 3D reconstructs 3D objects and full human bodies from a single image with surprisingly high quality. It extends the Segment Anything family from 2D segmentation into single-image 3D reconstruction.
Baidu open-sources ERNIE-4.5-VL-28B-A3B-Thinking visual reasoning model
Baidu released ERNIE-4.5-VL-28B-A3B-Thinking, an Apache 2.0 open-weights visual reasoning MoE with only 3B active parameters that claims to rival much larger models like GPT-5 High on vision tasks. It features image zooming, spatial grounding, and reasoning, with strong small-model performance attributed to GSPO training from the Qwen team.
Ai2 launches OlmoEarth foundation models and open Earth-intelligence platform
Ai2 launched OlmoEarth, a family of foundation models plus an open, end-to-end platform for fast, high-resolution Earth intelligence. It applies the lab's open-model approach to geospatial and remote-sensing data, making Earth observation workloads accessible without proprietary stacks.
Qwen3-VL adds compact 2B and 32B multimodal models
Alibaba's Qwen team extended the Qwen3-VL family with new 2B and 32B checkpoints. The 2B is a general-purpose, OCR-capable VLM that holds up against its 4B and 8B siblings from prior weeks, while the 32B reportedly outperforms GPT-5 mini and Claude 4 Sonnet on benchmarks.
The Allen Institute for AI updated its open OCR line with olmOCR 2 at 7B (released as an FP8 checkpoint), landing in the same week as DeepSeek-OCR, Qwen3-VL, and Liquid's LFM2-VL. It is another sign that document understanding became this week's hottest open-model category.
DeepSeek-OCR turns text into compressed vision tokens for massive contexts
DeepSeek open-sourced DeepSeek-OCR, a 3B model (~570M active parameters) that is less an OCR model than a context-compression breakthrough. It renders text as images, compresses it up to 10x while retaining 97% decoding accuracy (60% even at 20x), and reads it back with a tiny vision decoder. The approach suggests text tokenization is far from optimal and points to vastly cheaper long-context processing. Reportedly, alphaXiv OCR'd all of arXiv for $1000 versus $7500 with Mistral OCR, and a single H100 can process up to 200K pages.
97% decoding accuracy at 10x compression | ~570M active parameters (3B total) | 200K pages scannable on a single H100
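A quick back-of-the-envelope sketch of what those figures imply. The helper names and the 100,000-token example document are assumptions made up for illustration; only the compression ratios, accuracy figures, and dollar amounts come from the summary above.

```python
import math

# Decoding accuracy quoted above, keyed by compression ratio.
QUOTED_ACCURACY = {10: 0.97, 20: 0.60}


def vision_tokens_needed(text_tokens: int, compression_ratio: float) -> int:
    """Vision tokens needed to carry `text_tokens` at the given ratio."""
    return math.ceil(text_tokens / compression_ratio)


def cost_ratio(baseline_cost: float, new_cost: float) -> float:
    """How many times cheaper `new_cost` is than `baseline_cost`."""
    return baseline_cost / new_cost


text_tokens = 100_000  # hypothetical document size, not from the source
for ratio, accuracy in QUOTED_ACCURACY.items():
    tokens = vision_tokens_needed(text_tokens, ratio)
    print(f"{ratio}x: {tokens} vision tokens, quoted accuracy {accuracy:.0%}")

# Reported arXiv OCR cost: $7500 with Mistral OCR versus $1000.
print(f"cost advantage: {cost_ratio(7500, 1000)}x")  # cost advantage: 7.5x
```

At 10x, a 100,000-token document would fit in 10,000 vision tokens; the trade-off is the accuracy drop at higher ratios.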
Liquid AI ships LFM2-VL-3B tiny multilingual vision-language model
Liquid AI released LFM2-VL-3B, a tiny multilingual vision-language model, part of a wave of OCR-and-VLM releases this week. It targets efficient on-device and edge vision-language workloads at the 3B scale.
Qwen3-VL adds compact 4B and 8B open vision-language models
Alibaba's Qwen team released smaller Qwen3-VL vision-language models in 4B and 8B sizes, bringing the flagship VL capabilities down to edge- and laptop-friendly scales. Weights are open on Hugging Face as part of the Qwen3-VL collection.
Alibaba's Qwen team shipped Qwen3-VL, its new flagship open-weights vision-language family, headlining the episode's 'Qwen-mas' barrage. The panel discussed it as a practical workflow tool for visual understanding and agentic GUI tasks, not just another model card, with weights, a blog post, and a Hugging Face demo all available at launch.
IBM releases Granite Docling 258M compact document-parsing VLM
IBM published Granite Docling 258M, an ultra-compact open-source vision-language model for document understanding that converts documents into structured output. At just 258M parameters it reinforced the show's point that tiny specialized models are becoming genuinely useful workflow tools.
Moondream 3 preview punches above its weight in the tiny-VLM race
Moondream released a preview of Moondream 3, a small open vision-language model that punches well above its size class. CTO and co-founder Vik Korrapati joined the show to explain why small, capable vision models matter for real product building, framing Moondream 3 as a practical tool rather than a benchmark flex.
Moondream 3 Preview: 9B MoE VLM with 2B active parameters
Moondream released a preview of Moondream 3, a 9B mixture-of-experts vision-language model with only 2B active parameters. It targets frontier-level visual reasoning at small-model cost, continuing Moondream's run of efficient open vision models.
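As a rough illustration of what "9B total, 2B active" means, the sketch below computes the share of parameters used per token. The function name `active_fraction` is a hypothetical helper, and the rule of thumb in the comments (per-token compute tracks active parameters, memory footprint tracks total parameters) is a general approximation for mixture-of-experts models, not a Moondream-specific claim.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of a mixture-of-experts model's parameters used per token.

    Both arguments are in billions of parameters.
    """
    if total_params_b <= 0 or not 0 < active_params_b <= total_params_b:
        raise ValueError("expected 0 < active <= total")
    return active_params_b / total_params_b


# Figures from the summary above: 9B total, 2B active.
fraction = active_fraction(9, 2)
print(f"{fraction:.1%} of parameters active per token")  # 22.2% ...

# Rule of thumb (an approximation): per-token compute scales with the 2B
# active parameters, while all 9B must still be stored in memory.
```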
Perceptron AI introduces Isaac 0.1, a 2B perceptive-language model
Perceptron AI released Isaac 0.1, a 2B parameter perceptive-language model with open weights on Hugging Face. Despite its small size, the show notes highlight that it 'points better than GPT', excelling at visual grounding and pointing tasks relative to much larger models.
Alibaba's Tongyi Lab open-sources WebWatcher vision-language research agent
Alibaba's Tongyi Lab open-sourced WebWatcher, a vision-language deep research agent that sets new state-of-the-art results on agentic browsing and research tasks. The 32B model combines visual understanding with web research capabilities and is available on Hugging Face.
Apple's FastVLM-7B lands with a speed-first vision encoder, 85x faster TTFT
Apple released FastVLM-7B, a vision-language model built around a speed-first vision encoder that delivers up to 85x faster time-to-first-token than peer VLMs. Quantized variants (7B-int4, 1.5B-int8) on Hugging Face make it practical for on-device and real-time vision use, anchoring the show's fast-VLM discussion.
ByteDance publishes Seed1.5-VL, a 20B vision-language thinking model
ByteDance's Seed team published the technical report for Seed1.5-VL, a 20B-parameter vision-language model with thinking capabilities. It was covered among the big-company releases of the week, with the tech report shared on GitHub.
NVIDIA releases DAM-3B for region-based image and video captioning
NVIDIA dropped the Describe Anything Model (DAM-3B), a 3 billion parameter multimodal model for region-based image and video captioning. You can point it at a specific region of an image or video and it generates a detailed description of just that area. NVIDIA also published an accompanying DescribeAnything dataset and a Hugging Face demo.
Moonshot drops Kimi-VL and Kimi-VL-Thinking, tiny A3B open vision models
Moonshot AI released Kimi-VL and Kimi-VL-Thinking, compact vision-language models with only ~3B active parameters (A3B MoE). The thinking variant adds reasoning to a tiny VLM, and both are available openly on Hugging Face.
Mistral Small 3.1 24B: open-weights multimodal model
Mistral released Mistral Small 3.1, a 24B-parameter open-weights model that adds multimodal (vision) capabilities to the Small line. Both instruct and base checkpoints were published on Hugging Face, making it a strong local multimodal option at the 24B size class.
Roboflow drops RF-DETR, a SOTA open-source object detection model
Roboflow released RF-DETR, a state-of-the-art real-time object detection model, announced as breaking news on the show by CEO Joseph Nelson. The model is fully open source on GitHub and targets practical, deployable computer vision workloads.
Roboflow launches RF100-VL benchmark for vision-language models
Alongside RF-DETR, Roboflow introduced RF100-VL, a new evaluation benchmark for vision-language models built from real-world detection datasets. It gives the community a grounded way to measure how well VLMs handle practical object detection tasks.
Cohere For AI releases Aya Vision 8B and 32B open multilingual vision models
Cohere For AI released Aya Vision in 8B and 32B sizes, extending the multilingual Aya family with open-weights vision-language capabilities. The models target multilingual multimodal understanding across many languages.
Mistral AI announced Mistral OCR, a document-understanding API the company claims is state of the art at extracting text, tables, and equations from complex documents. It targets RAG and document-processing pipelines with structured markdown output.
Microsoft ships OmniParser v2 for faster screen parsing in GUI agents
Microsoft released OmniParser v2, a better and faster screen-parsing model that converts UI screenshots into structured elements for GUI agents. It improves the computer-use agent stack and is available with a public Gradio demo.
ZeroBench: the 'impossible' benchmark where all top VLMs score zero
A new benchmark called ZeroBench launched, billed as the "impossible" benchmark for vision-language models: all current top-of-the-line VLMs score zero on it. Tasks include visually demanding puzzles such as reading a question written in the shape of a star hidden among scattered letters, highlighting how far VLMs still are from true visual understanding.
Alibaba ships Qwen2.5-VL open vision-language model family
Alibaba's Qwen team released Qwen2.5-VL, open-weights vision-language models up to 72B that handle images, documents, video understanding, and on-screen agentic grounding. The 72B Instruct model was immediately available on Hugging Face and in Qwen Chat.
NVIDIA releases Eagle 2 open vision-language models
NVIDIA published Eagle 2, a family of open vision-language models with an accompanying paper, model weights on Hugging Face, and a live demo. The release is fully transparent, covering training data strategy and recipes, and the models are competitive with much larger vision models.
Hugging Face SmolVLM: tiny vision-language models run on WebGPU
Hugging Face released SmolVLM, a family of tiny vision-language models including a 256M-parameter version small enough to run entirely in the browser via WebGPU. It demonstrates how far efficient multimodal models have shrunk while remaining usable.