Multimodal Models

Models that natively combine text, image, audio, or video as inputs or outputs. 48 releases covered on the show.

July 2026

Google DeepMind Jul 2, 2026

New Models

OmniFlash

Google DeepMind debuts OmniFlash, first of the any-to-any Omni family

OmniFlash — first of Google's any-to-any Omni family — generates videos up to 10 seconds with precise conversational multi-turn editing via the Interactions API: say 'make it daytime' and it redoes light, sky and shadows. Editing Elo 1087 at $0.10 per second of output.

1087 editing Elo · $0.10 per second of video, up to 10s

🎙️ Hear our coverage →

#video-gen #multimodal
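
To make the pricing concrete, here is a minimal Python sketch of the arithmetic implied by the quoted $0.10 per second and 10-second cap. The function names are hypothetical, and it assumes (the entry does not say so) that every editing turn regenerates and bills the full clip.

```python
# Figures quoted in the entry above; everything else is illustrative.
PRICE_PER_SECOND_USD = 0.10
MAX_CLIP_SECONDS = 10


def clip_cost_usd(seconds: float) -> float:
    """Return the output cost of one generated clip, in US dollars."""
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"clip length must be in (0, {MAX_CLIP_SECONDS}] seconds")
    return round(seconds * PRICE_PER_SECOND_USD, 2)


def session_cost_usd(clip_lengths: list[float]) -> float:
    """Sum the cost of every clip produced in a multi-turn editing session.

    Assumption: each edit turn (e.g. 'make it daytime') re-renders the clip
    and is billed like a fresh generation.
    """
    return round(sum(clip_cost_usd(s) for s in clip_lengths), 2)


if __name__ == "__main__":
    # One 10-second clip plus two full re-renders during editing.
    print(session_cost_usd([10, 10, 10]))
```

Under that billing assumption, a maximum-length clip costs $1.00 and each further edit turn adds up to another $1.00.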

June 2026

xAI Jun 18, 2026

New Models

Grok Imagine Video 1.5

xAI launches Grok Imagine Video 1.5 with faster generation and native audio

xAI launched Grok Imagine Video 1.5 with nearly 2x faster generation, native audio, and a claimed #1 leaderboard position. The episode grouped it with Gemini Omni as part of the week’s video-generation frontier.

~2x faster generation

xAI announcement on X ↗ Grok Imagine Video 1.5 blog ↗ xAI video generation docs ↗

🎙️ Hear our coverage →

#video-gen #multimodal #consumer-ai

Google DeepMind Jun 4, 2026

New Models · Open weights

Gemma 4 12B

Google drops Gemma 4 12B, an encoder-free multimodal local model

Google released Gemma 4 12B, an encoder-free multimodal model under Apache 2.0 that targets 16GB VRAM local setups. Instead of bolting separate vision or audio encoders onto a language model, it uses one unified network, which LDJ and Yam argued makes smaller multimodal models cheaper, cleaner, and easier to run locally.

X announcement ↗ Hugging Face ↗

🎙️ Hear our coverage →

#open-source #multimodal #on-device

May 2026

Google DeepMind May 21, 2026

New Models

Gemini Omni

Gemini Omni: 'create anything from anything' conversational video editor

Google DeepMind launched Gemini Omni, a multimodal 'create anything from anything' model debuting as Google's first conversational video editor. Unlike pure text-to-video systems, Omni is an iterative multi-turn editing model that combines Gemini intelligence, world knowledge, multimodal inputs and generative media, in the same way Nano Banana brought Gemini to interactive image editing. It is available in the Gemini app, Google Flow and YouTube, with API support coming soon.

DeepMind model page ↗ Google DeepMind on X ↗ Logan on availability ↗ Gemini App ↗

🎙️ Hear our coverage (+1 follow-up) →

#video-gen #multimodal #image-gen

Meta AI May 14, 2026

Major Features & Updates

Muse Spark voice conversations

Meta launches Muse Spark voice conversations across its apps and glasses

Meta rolled out Muse Spark-powered voice conversations across the Meta AI app, WhatsApp, Instagram, Facebook, and Ray-Ban Meta glasses. The feature includes real-time image generation, live camera AI, and instant Reels/maps integration. Alex tested it live and called it surprisingly good, the first big consumer ship from Meta Superintelligence Labs.

X announcement ↗ Announcement ↗

🎙️ Hear our coverage →

#voice-ai #consumer-ai #multimodal

Thinking Machines Lab May 14, 2026

New Models

Interaction Models

Thinking Machines Lab drops Interaction Models: real-time multimodal 276B MoE

Mira Murati's Thinking Machines Lab released Interaction Models, a 276B-parameter MoE (12B active) trained from scratch for native real-time multimodal collaboration. It supports full-duplex audio/video/text with 0.40s turn-taking latency and scores 77.8 on FD-bench v1.5. The demo can react live to events like another person entering the camera frame.

276B MoE parameters · 12B active parameters

X announcement ↗ Blog ↗

🎙️ Hear our coverage →

#multimodal #voice-ai
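
The gap between total and active parameters recurs across the MoE releases in this list. Here is a rough Python sketch of that ratio, using only figures quoted in the entries here; the dictionary and function name are illustrative and do not come from any vendor.

```python
# (total, active) parameter counts in billions, as quoted in this list.
MOE_MODELS = {
    "Interaction Models": (276, 12),
    "Nemotron 3 Nano Omni": (30, 3),
    "Qwen 3.6-35B-A3B": (35, 3),
    "Qwen3.5-Omni": (397, 17),
    "Mistral Small 4": (119, 6),
    "LongCat Flash Omni": (560, 27),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the model's parameters used for each token."""
    return active_b / total_b


for name, (total, active) in MOE_MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {active}B of {total}B active ({share:.1%})")
```

For Interaction Models this comes to roughly 4% of the weights per token. Per-token compute therefore tracks the 12B figure, while memory still has to hold all 276B.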

April 2026

NVIDIA Apr 30, 2026

New Models · Open weights

Nemotron 3 Nano Omni

NVIDIA Nemotron 3 Nano Omni: hybrid Transformer-Mamba MoE

NVIDIA released Nemotron 3 Nano Omni, a 30B-total/3B-active hybrid Transformer-Mamba MoE with 256K context. It delivers 9x throughput on consumer hardware.

NVIDIA blog ↗

🎙️ Hear our coverage →

#open-source #multimodal #architecture

SenseTime Apr 30, 2026

New Models · Open weights

SenseNova U1

SenseTime open-sources SenseNova U1 unified multimodal MoE

SenseTime open-sourced SenseNova U1, a unified multimodal MoE model with 8B total and 3B active parameters that handles understanding and generation with no separate encoder or VAE. The architecture builds on a paper the team presented at ICLR last year.

8B total parameters (3B active MoE)

SenseTime announcement on X ↗ Hugging Face collection ↗ GitHub ↗ Try it ↗

🎙️ Hear our coverage →

#open-source #multimodal #architecture

Alibaba (Qwen) Apr 16, 2026

New Models · Open weights

Qwen 3.6-35B-A3B

Qwen 3.6-35B-A3B: Apache 2.0 MoE with 3B active hits 73.4% SWE-Verified

Alibaba Qwen open-sourced Qwen 3.6-35B-A3B under Apache 2.0 the same morning Opus 4.7 dropped: a 35B MoE with only 3B active parameters that scores 73.4% on SWE-bench Verified, rivaling models 10x its size. It is natively multimodal with 262K context extensible to 1M, and the crew called it the strongest mid-size LLM on nearly all benchmarks, putting to rest doubts about Qwen's open-source commitment after Junyang Lin's departure.

73.4% SWE-bench Verified

Qwen 3.6 announcement (X) ↗ Qwen3.6-35B-A3B on Hugging Face ↗ Qwen blog: Qwen 3.6-35B-A3B ↗

🎙️ Hear our coverage →

#open-source #architecture #coding

Meta (Meta Superintelligence Labs) Apr 9, 2026

New Models

Muse Spark

Meta launches Muse Spark, first model from Meta Superintelligence Labs

Meta dropped Muse Spark mid-show, the debut model from Meta Superintelligence Labs. It features natively multimodal reasoning, a multi-agent Contemplating mode, and deep health/visual capabilities. Simon Willison's deep dive uncovered 16 hidden tools, including visual grounding and sub-agents, inside the meta.ai chat UI.

AI at Meta announcement on X ↗ Introducing Muse Spark (Meta blog) ↗ MSL announcement ↗ Simon Willison's deep dive on the 16 hidden tools ↗

🎙️ Hear our coverage →

#frontier-models #multimodal #agents

Alibaba (Qwen) Apr 2, 2026

New Models · Open weights

Qwen3.5-Omni

Alibaba open-sources Qwen3.5-Omni, a 397B native omni-modal model

Qwen3.5-Omni is Alibaba's natively omni-modal open model handling text, image, audio, and video, with 397B total parameters and 17B active. It extends the Qwen family's open-source momentum into unified multimodal workloads.

Announcement (X) ↗ Qwen blog ↗

🎙️ Hear our coverage →

#open-source #multimodal

Google DeepMind Apr 2, 2026

New Models · Open weights

Gemma 4

Google releases Gemma 4 open-weights family under Apache 2.0

Google DeepMind's Gemma 4 crossed 10M downloads, with over 1,000 Gemma-4-based fine-tunes on Hugging Face; the Gemma family as a whole totals 500M+ downloads. Omar Sanseviero says Gemma is the foundation for the next generation of Gemini Nano shipping on Pixel and Samsung, with the AI Edge gallery letting people run it locally on Android and iOS. It punched above its size on Arena's Pareto curve and is now live on W&B Inference.

Hugging Face Collection ↗ Try in AI Studio ↗ Omar Sanseviero on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#open-source #agents #on-device

March 2026

Luma AI Mar 26, 2026

New Models

Uni-1

Luma Labs Uni-1 thinks and generates pixels simultaneously, #1 preference Elo

Luma Labs released Uni-1, an LLM-based image model that thinks and generates pixels simultaneously and claims the number-one human preference Elo. Unlike traditional diffusion workflows, you converse with it and iterate together toward results, and it can also generate infographics. It is a surprising pivot from Luma's video focus.

Luma Labs announcement (X) ↗ Uni-1 announcement page ↗ Try Uni-1 in the Luma app ↗

🎙️ Hear our coverage →

#image-gen #multimodal

Mistral AI Mar 19, 2026

New Models · Open weights

Mistral Small 4

Mistral Small 4: 119B MoE with 6B active unifies vision, coding, reasoning

Mistral returned to open source with Small 4, a 119B-parameter MoE with 128 experts and only 6B active per token, released under Apache 2.0. It unifies the previous Pixtral (vision), Devstral (coding), and Magistral (reasoning) lines into one model and can fit on a single H100 when compressed. Early WolfBench results are sobering at ~17% on OpenClaw agent tasks, roughly on par with similarly sized Nemotron.

119B Mistral Small 4 total params

Mistral blog ↗ Hugging Face ↗ X announcement ↗

🎙️ Hear our coverage →

#open-source #architecture #multimodal
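
A back-of-the-envelope check on the single-H100 claim, as a sketch only. It assumes the 80 GB H100 variant and counts weight storage alone (no KV cache, activations, or runtime overhead). The entry does not say which compression scheme is meant, so the 8-bit and 4-bit rows are assumptions.

```python
H100_MEMORY_GB = 80  # assumption: the 80 GB H100 variant
TOTAL_PARAMS_BILLION = 119  # Mistral Small 4 total parameters, from the entry


def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    gb = weight_memory_gb(TOTAL_PARAMS_BILLION, bits)
    verdict = "fits" if gb <= H100_MEMORY_GB else "does not fit"
    print(f"{bits}-bit weights: {gb:.1f} GB, {verdict} in {H100_MEMORY_GB} GB")
```

Under these assumptions only the 4-bit row (59.5 GB) fits; 16-bit needs 238 GB and 8-bit needs 119 GB. All experts must be resident even though only 6B parameters are active per token, which is why the total count, not the active count, drives the estimate.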

Xiaomi Mar 19, 2026

New Models

MiMo

Xiaomi MiMo revealed as the 1T-param stealth model topping OpenRouter

Xiaomi revealed MiMo, a 1-trillion-parameter family with omni-modal and language-only variants, unmasked as the stealth model that had been sitting at #1 on OpenRouter. The reveal surprised the panel, marking Xiaomi's entry into the frontier-model conversation.

Luo Fuli on X ↗

🎙️ Hear our coverage →

#frontier-models #multimodal

Google Mar 13, 2026

New Models

Gemini Embedding 2

Google launches Gemini Embedding 2, a natively multimodal embedder

Google launched Gemini Embedding 2, a natively multimodal embedding model that supports text, image, video, and audio in a single unified embedding space. It is available through the Gemini Embeddings API.

Gemini Embedding 2 on X ↗ Gemini Embeddings API docs ↗

🎙️ Hear our coverage →

#search #multimodal
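
What a single unified embedding space buys you is that a query in one modality can be scored directly against items in another. The toy Python sketch below shows the idea with hand-written 4-dimensional vectors. The vectors, file names, and the query are all hypothetical stand-ins, and no real embedding API is called.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical embeddings; a real model would produce these, in far more dimensions.
index = {
    ("image", "dog_photo.jpg"): [0.9, 0.1, 0.0, 0.2],
    ("audio", "barking.wav"): [0.6, 0.3, 0.3, 0.1],
    ("video", "cooking_demo.mp4"): [0.0, 0.9, 0.4, 0.1],
}

# Stands in for the embedding of the text query "a dog".
query = [0.85, 0.15, 0.05, 0.25]

# Because every modality lives in the same space, one ranking covers them all.
ranked = sorted(index, key=lambda key: cosine(query, index[key]), reverse=True)
for modality, name in ranked:
    print(f"{modality:5s} {name:18s} {cosine(query, index[(modality, name)]):.3f}")
```

With separate per-modality embedders, the image, audio, and video scores would not be comparable and each index would need its own query path.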

February 2026

Google DeepMind Feb 26, 2026

New Models

Nano Banana 2

Google DeepMind launches Nano Banana 2 image model mid-show

Google DeepMind announced Nano Banana 2 during the show, a Flash-quality tier of its image model line. Alex broke in mid-TLDR to describe near-Pro image quality at roughly half the price, plus a new image search capability.

Google DeepMind announcement on X ↗ Nano Banana page ↗

🎙️ Hear our coverage →

#image-gen #multimodal

ByteDance Feb 19, 2026

New Models

Seed 2.0

ByteDance Seed 2.0: frontier multimodal family at 73-84% lower pricing

ByteDance released Seed 2.0, a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing. Its video understanding surpasses the human benchmark at 77% vs 73%. With pricing 84% below Opus 4.5 and near-comparable quality, it struck the panel as a compelling option for price-conscious developers.

Seed 2.0 announcement (X) ↗ Doubao team model page ↗ ByteDance-Seed on Hugging Face ↗

🎙️ Hear our coverage →

#multimodal #vision #frontier-models

ByteDance Feb 12, 2026

New Models

Seedance 2.0

ByteDance Seedance 2.0 shatters video generation reality

ByteDance launched Seedance 2.0, a unified multimodal video generation model that accepts up to 9 images, 3 videos, and 3 audio clips as references and produces 15-second multi-shot clips with native stereo audio and strong character consistency (a 45-second internal test mode also exists). The panel compared the quality jump to seeing Sora for the first time. Available on the BytePlus platform.

Alex's demo thread on X ↗ Official launch blog ↗ Seedance 2.0 announcement page ↗ Seedance 2.0 in CapCut on X ↗

🎙️ Hear our coverage (+1 follow-up) →

#video-gen #multimodal #consumer-ai

OpenBMB Feb 5, 2026

New Models · Open weights

MiniCPM-o 4.5

MiniCPM-o 4.5: first open-source full-duplex omni model

OpenBMB released MiniCPM-o 4.5, the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously. It can listen while speaking and even interrupt the user, bringing real-time conversational behavior to open weights.

X announcement ↗ Hugging Face ↗ GitHub ↗

🎙️ Hear our coverage →

#open-source #voice-ai #multimodal

December 2025

Allen AI Dec 18, 2025

New Models · Open weights

OLMO 2 (multimodal)

Allen AI adds video-input multimodal OLMO models in 4B/7B/8B sizes

Allen AI extended its OLMO family with multimodal models that accept video input, released in 4B, 7B, and 8B sizes. The release continues Allen AI's fully open approach to model development alongside the BOLMO byte-level work.

OLMO multimodal announcement ↗

🎙️ Hear our coverage →

#open-source #multimodal #vision

November 2025

Google DeepMind Nov 20, 2025

New Models

Gemini 3 Pro

Gemini 3 Pro launches with record ARC-AGI-2 scores

Gemini 3 Pro is Google's new frontier multimodal model, with a 1M-token context window and large reasoning gains: it scores 31.11% on ARC-AGI-2 (45.14% with Deep Think mode), roughly double the previous SOTA, plus 81% on MMLU-Pro and major coding improvements. Amp switched to it as their default model on launch day, the first time they have ever switched defaults. It is also rolling out across Gmail, Calendar, and AI Mode in Google Search.

45.14% ARC-AGI-2 (Deep Think) · 31.11% ARC-AGI-2 (standard) · 1M token context window

🎙️ Hear our coverage (+1 follow-up) →

#reasoning #multimodal #frontier-models

Meituan (LongCat) Nov 6, 2025

New Models · Open weights

LongCat Flash Omni

Meituan releases LongCat Flash Omni, a 560B (27B active) omni model

Meituan's LongCat team released LongCat Flash Omni, a 560B-parameter mixture-of-experts model with roughly 27B active parameters that accepts text, audio, and video input. It extends the open LongCat Flash line into omni-modal territory from a lab better known for food delivery than frontier models.

X ↗ HF ↗ Announcement ↗

🎙️ Hear our coverage →

#open-source #multimodal

October 2025

InclusionAI (Ant Group) Oct 30, 2025

New Models · Open weights

Ming-flash-omni Preview

Ming-flash-omni Preview: sparse MoE omni-modal open model

Ant Group's InclusionAI team released Ming-flash-omni Preview, a sparse mixture-of-experts omni-modal model on Hugging Face. It handles multiple input and output modalities in a single open-weights model, adding to the wave of Chinese open omni-modal releases.

X announcement ↗ Hugging Face ↗

🎙️ Hear our coverage →

#open-source #multimodal #architecture

Alibaba (Qwen) Oct 23, 2025

New Models · Open weights

Qwen3-VL 2B & 32B

Qwen3-VL adds compact 2B and 32B multimodal models

Alibaba's Qwen team extended the Qwen3-VL family with newly updated 2B and 32B checkpoints. The 2B is a generic VLM (OCR-capable) that holds up against its 4B and 8B siblings from prior weeks, while the 32B reportedly outperforms GPT-5 mini and Claude 4 Sonnet on benchmarks.

X ↗ Hugging Face ↗

🎙️ Hear our coverage →

#open-source #vision #multimodal

Alibaba (Qwen) Oct 16, 2025

New Models · Open weights

Qwen3-VL 4B/8B

Qwen3-VL adds compact 4B and 8B open vision-language models

Alibaba's Qwen team released smaller Qwen3-VL vision-language models in 4B and 8B sizes, bringing the flagship VL capabilities down to edge- and laptop-friendly scales. Weights are open on Hugging Face as part of the Qwen3-VL collection.

X announcement ↗ Hugging Face collection ↗

🎙️ Hear our coverage →

#open-source #vision #multimodal

September 2025

Alibaba (Qwen) Sep 25, 2025

New Models · Open weights

Qwen3-Omni

Qwen3-Omni ships open-weights any-to-any audio, vision, and text

Alongside Qwen3-VL, Alibaba released Qwen3-Omni, an end-to-end omni-modal open-weights model that takes text, image, audio, and video input and can respond with streaming speech. The show treated it as direct evidence of how fast open multimodal systems are improving, with weights on Hugging Face, a GitHub repo, demos, and availability in Qwen Chat and the Model Studio API.

HF ↗ GitHub ↗ Qwen Chat ↗ Demo ↗

🎙️ Hear our coverage →

#open-source #multimodal #voice-ai

Alibaba (Qwen) Sep 25, 2025

New Models · Open weights

Qwen3-VL

Alibaba releases Qwen3-VL open-weights vision-language flagship

Alibaba's Qwen team shipped Qwen3-VL, its new flagship open-weights vision-language family, headlining the episode's 'Qwen-mas' barrage. The panel discussed it as a practical workflow tool for visual understanding and agentic GUI tasks, not just another model card, with weights, a blog post, and a Hugging Face demo all available at launch.

X ↗ HF ↗ Blog ↗ Demo ↗

🎙️ Hear our coverage →

#open-source #vision #multimodal

Meta AI Sep 18, 2025

Products & Apps

Meta AI Glasses with Display

Meta Connect: new AI glasses with a display and neural control interface

At Meta Connect, Meta unveiled new AI glasses featuring a built-in display, a neural wristband control interface, and a new AI mode. The panel treated the glasses as an interface milestone, arguing the product surface for AI is shifting from apps to display-equipped wearables.

🎙️ Hear our coverage →

#infrastructure #consumer-ai #multimodal

Perceptron AI Sep 18, 2025

New Models · Open weights

Isaac 0.1

Perceptron AI introduces Isaac 0.1, a 2B perceptive-language model

Perceptron AI released Isaac 0.1, a 2B-parameter perceptive-language model with open weights on Hugging Face. Despite its small size, the show notes highlight that it 'points better than GPT', excelling at visual grounding and pointing tasks relative to much larger models.

X ↗ HF ↗ Blog ↗

🎙️ Hear our coverage →

#open-source #vision #multimodal

July 2025

Baidu Jul 3, 2025

New Models · Open weights

ERNIE 4.5

Baidu open-sources ERNIE 4.5, a 10-model multimodal family

Baidu open-sourced the ERNIE 4.5 series, a family of 10 models ranging from 424B down to 0.3B parameters with multimodal capabilities, reportedly beating o1 on DocVQA. The release marks a sharp reversal from Baidu's previous anti-open-source posture and another sign that Chinese labs are setting the pace in open source.

10 ERNIE 4.5 models

X announcement ↗ Hugging Face ↗ Technical report (PDF) ↗

🎙️ Hear our coverage →

#open-source #multimodal #multilingual

May 2025

ByteDance May 15, 2025

New Models

Seed1.5-VL

ByteDance publishes Seed1.5-VL, a 20B vision-language thinking model

ByteDance's Seed team published the technical report for Seed1.5-VL, a 20B-parameter vision-language model with thinking capabilities. It was covered among the big-company releases of the week, with the tech report shared on GitHub.

Technical report ↗

🎙️ Hear our coverage →

#vision #multimodal #reasoning

Alibaba (Qwen) May 1, 2025

New Models · Open weights

Qwen 2.5 Omni

Qwen 2.5 Omni gets an update

Alongside the Qwen 3 launch, Alibaba updated its Qwen 2.5 Omni multimodal model line. It was mentioned briefly in the open-source roundup as part of the week's Qwen ecosystem push.

Alibaba Qwen announcement (X) ↗

🎙️ Hear our coverage →

#open-source #multimodal

April 2025

NVIDIA Apr 24, 2025

New Models · Open weights

Describe Anything (DAM-3B)

NVIDIA releases DAM-3B for region-based image and video captioning

NVIDIA dropped the Describe Anything Model (DAM-3B), a 3-billion-parameter multimodal model for region-based image and video captioning. You can point it at a specific region of an image or video and it generates a detailed description of just that area. NVIDIA also published an accompanying DescribeAnything dataset and a Hugging Face demo.

3B Parameters

X Post ↗ HF Model ↗ HF Demo ↗ HF Dataset ↗

🎙️ Hear our coverage →

#vision #multimodal #open-source

OpenAI Apr 17, 2025

New Models

o3 & o4-mini

OpenAI launches o3 and o4-mini, SOTA reasoning models with tool use

OpenAI shipped o3 and o4-mini in ChatGPT and the API, with o3 setting new SOTA records on Codeforces, SWE-bench, MMMU and more. For the first time the models can use tools (web search, Python, image generation) during the reasoning process, and they can think visually by cropping, zooming and rotating images. o3 scored $65k on the Freelancer eval versus o1's $28k, and o4-mini hits 99.5% on AIME with a Python interpreter.

$65k o3 score on the Freelancer eval (vs o1's $28k) · 99.5% o4-mini on AIME with Python interpreter · 200k-token context window

Blog ↗ Watch Party ↗

🎙️ Hear our coverage →

#reasoning #agents #multimodal

Jina AI Apr 10, 2025

New Models · Open weights

Jina Reranker M0

Jina Reranker M0: SOTA multilingual, multimodal document reranker

Jina AI released Jina Reranker M0, a state-of-the-art multimodal and multilingual document reranker model. It reranks documents that include both text and images, targeting retrieval and RAG pipelines, with weights available on Hugging Face.

Jina blog: Reranker M0 ↗ Hugging Face: jina-reranker-m0 ↗

🎙️ Hear our coverage →

#search #open-source #multimodal

Meta AI Apr 10, 2025

New Models · Open weights

Llama 4 (Scout & Maverick)

Meta drops Llama 4 Scout (109B) and Maverick (400B) open-weights MoE models

Meta released the long-awaited Llama 4 family in a chaotic Saturday drop: Scout (17B active / ~109B total, 16 experts) and Maverick (17B active / ~400B total, 128 experts), with a 2T-parameter Behemoth still in training. The models are multimodal, multilingual MoE architectures trained on ~30T tokens with FP8 and interleaved attention (iRoPE), claiming 10M context for Scout and 1M for Maverick. The release was marred by drama: the LMArena version differed from the released model, and the community criticized the lack of small local-friendly sizes.

10M Stated context window for Llama 4 Scout · 288B Active parameters of unreleased Behemoth (2T total) · 17B Active parameters for both Scout and Maverick

Meta blog: Llama 4 multimodal intelligence ↗ Hugging Face: meta-llama ↗ Try it at meta.ai ↗

🎙️ Hear our coverage →

#open-source #architecture #multimodal

Nomic AI Apr 3, 2025

New Models · Open weights

Nomic Embed Multimodal

Nomic Embed Multimodal: SOTA embeddings for visual documents

Nomic AI released Nomic Embed Multimodal, new 3B and 7B parameter embedding models built on Alibaba's Qwen2.5-VL. They achieve SOTA on visual document retrieval by embedding interleaved text-image sequences, ideal for PDFs and complex webpages. The 7B model ships under Apache 2.0 with open weights, code, and data; guest Zach Nussbaum discussed the release on the show.

3B parameters (smaller model) · 7B parameters (Apache 2.0 model)

Nomic Embed Multimodal blog post ↗ Models on Hugging Face ↗

🎙️ Hear our coverage →

#search #multimodal #open-source

March 2025

Alibaba (Qwen) Mar 27, 2025

New Models · Open weights

Qwen2.5-Omni-7B

Qwen launches Omni 7B: sees, hears, reads, and talks back

Qwen released Qwen2.5-Omni-7B, an open-weights omni-modal model that perceives text, images, audio, and video, and generates both text and speech. It packs end-to-end multimodal perception and spoken output into a 7B parameter model available on Hugging Face.

7B parameters

Hugging Face ↗

🎙️ Hear our coverage →

#open-source #multimodal #voice-ai

OpenAI Mar 27, 2025

Major Features & Updates

GPT-4o Native Image Generation

OpenAI enables native image generation in GPT-4o, internet goes Ghibli

OpenAI finally enabled GPT-4o's native auto-regressive image generation in ChatGPT, sparking the biggest mainstream AI buzz of the week as the internet ghiblified itself. Launched right after Gemini 2.5, it excels at instruction following, text rendering, and multi-turn editing, with viral demos ranging from ad mockups to a full Lord of the Rings trailer.

X thread with examples ↗ Ad threads ↗ Full Lord of the Rings trailer ↗ Native Image Generation System Card ↗

🎙️ Hear our coverage →

#image-gen #multimodal

Mistral AI Mar 20, 2025

New Models · Open weights

Mistral Small 3.1

Mistral Small 3.1 24B: open-weights multimodal model

Mistral released Mistral Small 3.1, a 24B-parameter open-weights model that adds multimodal (vision) capabilities to the Small line. Both instruct and base checkpoints were published on Hugging Face, making it a strong local multimodal option at the 24B size class.

Blog Post ↗ HuggingFace page ↗ Base Model on HF ↗

🎙️ Hear our coverage →

#open-source #multimodal #vision

Google Mar 13, 2025

Major Features & Updates

Google AI Studio YouTube link understanding

Google AI Studio adds native YouTube video understanding via link dropping

Google AI Studio now lets you drop a YouTube link and have Gemini natively understand the video. This unlocks video analysis, summarization, and support use cases without downloading or preprocessing the content.

Google AI Studio ↗

🎙️ Hear our coverage →

#multimodal #coding

Google DeepMind Mar 13, 2025

Major Features & Updates

Gemini 2.0 Flash native image generation

Gemini Flash gains native image generation and conversational editing

Google enabled native image generation in Gemini Flash Experimental, letting users generate and iteratively edit images conversationally inside the same multimodal model. The crew demoed it live on stream, editing photos of themselves with natural-language instructions, and saw it as a preview of how creative tools like Photoshop will work.

X announcement ↗ AI Studio demo ↗

🎙️ Hear our coverage →

#image-gen #multimodal

Google DeepMind Mar 13, 2025

New Models · Open weights

Gemma 3

Google open sources Gemma 3, 1B-27B multimodal family with 128K context

Google released Gemma 3, an open-weights model family spanning 1B to 27B parameters with multimodal (text, image, video) capabilities, support for over 140 languages, and a 128K context window. The 27B model runs on a single GPU, with Sundar Pichai claiming competitors need roughly 10x the compute for similar performance. It shipped with day-one open source ecosystem support (Hugging Face, Ollama, Kaggle) plus ShieldGemma 2 for content moderation.

Blog ↗ AI Studio ↗ HF Collection ↗ Hugging Face (27B) ↗

🎙️ Hear our coverage →

#open-source #multimodal #on-device

February 2025

Microsoft Feb 27, 2025

New Models · Open weights

Phi-4-multimodal

Microsoft releases Phi-4-multimodal and Phi-4-mini open weights

Microsoft expanded the Phi family with Phi-4-multimodal-instruct, a small open-weights model that handles text, vision, and audio in a single model, alongside a compact Phi-4-mini. The weights shipped on Hugging Face, continuing Microsoft's push for capable small models that can run on-device.

Blog ↗ HuggingFace ↗

🎙️ Hear our coverage →

#open-source #on-device #multimodal

January 2025

Alibaba (Qwen) Jan 30, 2025

New Models · Open weights

Qwen2.5-VL

Alibaba ships Qwen2.5-VL open vision-language model family

Alibaba's Qwen team released Qwen2.5-VL, open-weights vision-language models up to 72B that handle images, documents, video understanding, and on-screen agentic grounding. The 72B Instruct model was immediately available on Hugging Face and in Qwen Chat.

72B Largest variant

Project blog ↗ Hugging Face ↗ GitHub ↗ Try it (Qwen Chat) ↗

🎙️ Hear our coverage →

#vision #open-source #multimodal

DeepSeek Jan 30, 2025

New Models · Open weights

Janus Pro

DeepSeek Janus Pro: open multimodal models in 1.5B and 7B

Amid the R1 frenzy, DeepSeek also released Janus Pro, unified multimodal models at 1.5B and 7B parameters that handle both image understanding and image generation. The open release added to DeepSeek's week of dominating AI news headlines.

1.5B / 7B Model sizes

GitHub ↗ Try it (HF Space) ↗

🎙️ Hear our coverage →

#open-source #image-gen #multimodal

NVIDIA Jan 30, 2025

New Models · Open weights

Eagle 2

NVIDIA releases Eagle 2 open vision-language models

NVIDIA published Eagle 2, a family of open vision-language models with an accompanying paper, model weights on Hugging Face, and a live demo. It is a fully transparent VLM release covering training data strategy and recipes, competitive with much larger vision models.

Paper ↗ Models (HF collection) ↗ Demo ↗

🎙️ Hear our coverage →

#vision #open-source #multimodal