Episode Summary

Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts.

Hosts & Guests

Alex Volkov
Alex Volkov
Host Β· W&B / CoreWeave
@altryne
Hamel Husain
Hamel Husain
AI Evaluation Consultant Β· Parlance Labs
@HamelHusain
Shreya Shankar
Shreya Shankar
PhD Candidate & Researcher Β· UC Berkeley
@sh_reya
Nisten Tahiraj
Nisten Tahiraj
Weekly co-host of ThursdAI Β· AI operator & builder
@nisten
Yam Peleg
Yam Peleg
Weekly co-host of ThursdAI Β· AI builder & founder
@Yampeleg
Wolfram Ravenwolf
Wolfram Ravenwolf
Weekly co-host, AI model evaluator Β· Independent AI evaluator (r/LocalLLaMA)
@WolframRvnwlf

By The Numbers

πŸ“† ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate,
3
Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases.
Qwen 3 β€” β€œHybrid Thinking” on Tap
235 B
Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235 B total / 22 B active) and six dense siblings all the way down to 0 .6 B, all under Apache 2.0.
Qwen 3 β€” β€œHybrid Thinking” on Tap
30 B
On my Mac, the 30 B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0 .6 B sibling).
Qwen 3 β€” β€œHybrid Thinking” on Tap
36
Other goodies: - 36 T pre-training tokens (2 Γ— Qwen 2.5) - 128 K context on β‰₯ 8 B variants (32 K on the tinies) - 119-language coverage, widest in open source - Built-in MCP schema so you can pair with [Qwen-Agent]( - The dense 4 B model actually _beats_ Qwen 2.5-72B-Instruct on several evalsβ€”at Raspberry-Pi footprint In short: more parameters when you need them, fewer when you don’t, and the lawyers stay asleep.
Qwen 3 β€” β€œHybrid Thinking” on Tap
235B
The 235B MoE rivals or surpasses models like DeepSeek-R1 (which rocked the boat just months ago!), O1, O3-mini, and even Gemini 2.5 Pro on coding and math.

πŸ”₯ Breaking During The Show

BREAKING NEWS: Claude.ai will support tools via MCP
During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades!

πŸ“° πŸ“† ThursdAI - May 1- Qwen 3, Phi-4, OpenAI glazegate, RIP GPT4, LlamaCon, LMArena in hot water & more AI news

Hey everyone, Alex here πŸ‘‹ Welcome back to ThursdAI! And wow, what a week. Seriously, strap in, because the AI landscape just went through some seismic shifts.

  • Seriously, strap in, because the AI landscape just went through some seismic shifts.
  • This week felt like a whirlwind, with open source absolutely dominating the headlines.
  • Qwen 3 didn't just release a model; they dropped an entire ecosystem, setting a potential new benchmark for open-weight releases.

πŸ“° Qwen 3 β€” β€œHybrid Thinking” on Tap

Alibaba open-weighted the entire Qwen 3 family this week, releasing two MoE titans (up to 235 B total / 22 B active) and six dense siblings all the way down to 0 .6 B, all under Apache 2.0. Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.

  • Day-one support landed in LM Studio, Ollama, vLLM, MLX and llama.cpp.
  • The headline trick is a runtime thinking toggleβ€”drop β€œ/think” to expand chain-of-thought or β€œ/no_think” to sprint.
  • On my Mac, the 30 B-A3B model hit 57 tokens/s when paired with speculative decoding (drafted by the 0 .6 B sibling).

πŸ”“ Other Open Source Updates

1. MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL. (HF model hub released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0.

  • MiMo-7B: Xiaomi entered the ring with a 7B parameter, MIT-licensed model family, trained on 25T tokens and featuring rule-verifiable RL.
  • (HF model hub released Helium-1, a 2B parameter model distilled from Gemma-2-9B, focused on European languages, and licensed under CC-BY 4.0.
  • They also open-sourced 'dactory', their data processing pipeline.

🎨 Big Companies & APIs: Drama, Departures, and Deployments

While open source stole the show, the big players weren't entirely quiet... though maybe some wish they had been.

  • While open source stole the show, the big players weren't entirely quiet...
  • though maybe some wish they had been.

πŸ“° Farewell, GPT-4: Rest In Prompted πŸ™

TK: Our GPT 4 Wake piece Okay folks, let's take a moment. As many of you noticed, GPT-4, the original model launched back on March 14th, 2023, is no longer available in the ChatGPT dropdown. You can't select it, you can't chat with it anymore.

  • Okay folks, let's take a moment.
  • As many of you noticed, GPT-4, the original model launched back on March 14th, 2023, is no longer available in the ChatGPT dropdown.
  • You can't select it, you can't chat with it anymore.

πŸ“° The ChatGPT "Glazing" Incident: A Cautionary Tale

Speaking of OpenAI...oof. The last couple of weeks saw ChatGPT exhibit some... _weird_ behavior.

  • The last couple of weeks saw ChatGPT exhibit some...
  • Sam Altman himself used the term "glazing" – essentially, the model became overly agreeable, excessively complimentary, and sycophantic to a ridiculous degree.
  • Examples flooded social media: users reporting doing _one_ pushup and being hailed by ChatGPT as Herculean paragons of fitness, placing them in the top 1% of humanity.

πŸ€– BREAKING NEWS: Claude.ai will support tools via MCP

During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades! ([Tweet]( They announced Integrations, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners). Developers can apparently build their own integrations quickly too.

  • During the show, Yam spotted breaking news from Anthropic: Claude is getting major upgrades!
  • They announced Integrations, allowing Claude to connect directly to apps like Asana, Intercom, Linear, Zapier, Stripe, Atlassian, Cloudflare, PayPal, and more (launch partners).
  • Developers can apparently build their own integrations quickly too.

πŸ“° Google Updates & LlamaCon Recap

1. Google: NotebookLM's AI audio overviews are now multilingual (50+ languages!) (X Post was released shortly after our last show, featuring hybrid reasoning with an API knob to control thinking depth. Rumors are swirling about big drops at Google I/O soon!

  • Rumors are swirling about big drops at Google I/O soon!

⚑ This Week's Buzz from Weights & Biases 🐝

Quick updates from my corner at Weights & Biases: 1. WeaveHacks Hackathon (May 17-18, SF): Get ready! We're hosting a hackathon focused on Agent Protocols – MCP and A2A.

  • Quick updates from my corner at Weights & Biases: 1.
  • WeaveHacks Hackathon (May 17-18, SF): Get ready!
  • We're hosting a hackathon focused on Agent Protocols – MCP and A2A.

πŸ“° Evals Deep Dive with Hamel Husain & Shreya Shankar

Amidst all the model releases and drama, we were incredibly lucky to have two leading experts in AI evaluation, Hamel Husain ([@HamelHusain]( and Shreya Shankar ([@sh_reya]( join us. Their core message? Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on application-centric evaluations.

  • Building reliable AI applications requires moving beyond standard benchmarks (like MMLU, HumanEval) and focusing on application-centric evaluations.
  • Key Takeaways: - Foundation vs.
  • Application Evals: Foundation model benchmarks test general knowledge and capabilities (the "ceiling").

πŸŽ₯ Vision & Video: Runway Gets Consistent

The world of AI video generation continues its rapid evolution.

πŸ“° Runway References: Consistency Unlocked

TK: Runway references video A major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame. Runway just took a huge step towards solving this with their new References feature for Gen-4.

  • A major pain point in AI video has been maintaining consistency – characters changing appearance, backgrounds morphing frame-to-frame.
  • Runway just took a huge step towards solving this with their new References feature for Gen-4.
  • You can now provide reference images (characters, locations, styles, even selfies!) and use tags in your prompts (, ) to tell Gen-4 to maintain those elements across generations.

πŸ”“ HiDream E1: Open Source Ghibli Style

A new contender in open-source image generation emerged: HiDream E1. (HF Link, focuses particularly on generating images in the beautiful Ghibli style. The weights are available (looks like Apache 2.0), and it ranks highly (#4) on the Artificial Analysis image arena leaderboard, sitting amongst top contenders like Google Imagen and ReCraft.

  • A new contender in open-source image generation emerged: HiDream E1.
  • (HF Link, focuses particularly on generating images in the beautiful Ghibli style.

πŸ“° Final Thoughts: Responsibility & Critical Thinking

Phew! What a week. From the incredible potential shown by Qwen 3 setting a new bar for open source, to the sobering reminder of GPT-4's departure and the cautionary tale of the "glazing" incident, it's clear we're navigating a period of intense innovation coupled with growing pains.

  • Don't outsource your judgment entirely.
  • Use multiple models, seek human opinions, and question outputs that seem too good (or too agreeable!) to be true.
  • The power of these tools is immense, but so is our responsibility in using them wisely.

ThursdAI - May 1, 2025 - TL;DR

  • Hosts and Guests
  • Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
  • Co Hosts - @WolframRvnwlf @yampeleg @nisten @ldjconfirmed)
  • Hamel Housain - @HamelHusain
  • Shreya Shankar - @sh_reya
  • Maven Course - AI Evals For Engineers & PMs Questions for Shreya Shankar & Hamel Husain (link Promo code 35% of for listeners of ThursdAI - `thursdai`)
  • Open Source LLMs
  • Alibaba drops Qwen 3 - 2 MOEs, 6 dense (0.6B - 30B) (Blog, GitHub, HF, HF Demo, My tweet, Nathan breakdown)
  • !Qwen3 Main Image
  • Dynamic reasoning!
  • Qwen worked directly with almost all of the popular LLM serving frameworks to ensure that support for the new models was available on day one
  • Not natively multimodal
  • Executive summary
  • Alibaba open-weighted the full Qwen 3 stack: two MoE giants (235 B/22 B active, 30 B/3 B active) and six dense siblings down to 0.6 B. All ship under Apache-2.0 with day-one support for LM Studio, Ollama, MLX, vLLM and MCP. Pre-training doubles data to \~36 T tokens and pushes context to 128 K. A new hybrid β€œthinking” switch (`/think` | `/no_think` or `enable_thinking`) lets users trade latency for reasoning depth at runtime. Benchmarks place the 235 B MoE neck-and-neck with DeepSeek-R1, o1, o3-mini and Gemini 2.5 Pro, while the 4 B dense model meets Qwen 2.5-72B. Multilingual coverage jumps to 119 languages and agentic tooling is reinforced. In short: more parameters when you need them, fewer when you don’t, all fully permissive.
  • Factoids

1. Model roster – MoE: _Qwen3-235B-A22B_ (235 B total / 22 B active) and _Qwen3-30B-A3B_ (30 B / 3 B). Dense: 32 B, 14 B, 8 B, 4 B, 1.7 B, 0.6 B.

2. License – Entire suite under Apache 2.0, rare for a 2025-tier flagship.

3. Context length – 128 K tokens for β‰₯8 B and all MoE variants; 32 K for smaller dense models.

4. Training data – β‰ˆ 36 T tokens (2Γ— Qwen 2.5) including PDF extraction via Qwen2.5-VL and synthetic math/code from Qwen2.5-Math/Coder.

5. Hybrid reasoning switch – Runtime toggle through `enable_thinking` or inline tags `/think`, `/no_think`; supports soft per-turn overrides.

6. Multilingual reach – 119 languages across 10+ families, broadest open-model coverage to date.

7. Agentic upgrades – Built-in MCP schema; pair with Qwen-Agent for low-friction tool calls.

8. Benchmark edge – 235B MoE matches/exceeds DeepSeek-R1, o1, o3-mini, Grok-3 and Gemini-2.5-Pro on coding, math and general evals; 4B dense beats Qwen 2.5-72B-Instruct.

9. Efficiency math – MoE bases hit parity with Qwen 2.5 dense while activating only \~10 % of parameters, slashing inference cost by an order of magnitude.

10. Local-first tooling – Ollama `run qwen3:30b-a3b`, LM Studio, MLX, llama.cpp, k-transformers all supported on day 0.

11. RL recipe – Four-stage post-training: long CoT cold-start β†’ reasoning RL β†’ fusion of thinking/non-thinking β†’ general RL across 20+ tasks.

12. Dataset-driven jump – PDF ingestion + synthetic STEM/code generation credited for outsized small-model gains.

  • Evals
  • !Image
  • LiveBench
  • ![](https://reflect-assets.app/v1/users/SIx5sOhUwRaewATwHjvbAy2w0nJ2/c64c5ec0dbc34c078546b34d913815f7/93ee19b0-f31a-477c-8c19-c75fe7230693?key=c6087a1863f77a5946597fa9875eb093de70754241553b152baf7bc33c226588)
  • 30B quantized on my mac gets better scores than 4.1 mini and 4.1-nano! (link%22:true,%22Total%20Tokens%22:true%7D))
  • ![](https://reflect-assets.app/v1/users/SIx5sOhUwRaewATwHjvbAy2w0nJ2/fc8aa5942ed14c8d84a76dd66d9a1ac6/4436a97f-be20-42c3-b5e7-6f1eadff3756?key=fd581e9606e5f249ada20d9ab2efb19bceff12563e6856f1c2c7bfa73e777d1f)
  • Creative Writing
  • !Image
  • !Image
  • Microsoft - Phi-4-reasoning 14B + Plus (X, ArXiv, Tech Report | HF 14B SFT | HF 14B SFT + RL | Azure Foundry | Suriya’s thread)
  • Executive summary
  • Microsoft’s Phi team took the lightweight 14B Phi-4 and drilled it on 1.4 M β€œteachable” chain-of-thought traces, then sprinkled a mere 6K RL math problems on top to forge two variants: Phi-4-Reasoning (SFT) and Phi-4-Reasoning-Plus (SFT + RL). The result? A pocket-sized model that slugs it out with 70B–235 B behemoths on AIME 25, GPQA, LiveCodeBench and even fresh NP-style puzzlesβ€”while running locally on a single H100 and coming gift-wrapped under an MIT license. Think of it as a turbo-charged study buddy: longer context, explicit `<think>` scaffolding, and a knack for self-correcting when you give it more tokens.
  • Factoids

1. Two SKUs, one weight class – Both versions are 14 B dense; β€œPlus” adds \~90 RL steps yet jumps +15 pp on AIME 25.

2. MIT license – Follows the Phi tradition: fully permissive for research + commercial use.

3. Context window – 32 K tokens by default; internal tests show stable reasoning up to 64K with RoPE interpolation.

4. Structured CoT – Trained to wrap reasoning inside `<think> … </think>` tags, making scratch-pads easy to parse or hide.

5. Data diet – 8.3 B tokens of high-difficulty math, coding & safety traces distilled from o3-mini (medium/high β€œthinking” mode).

6. RL recipe – GRPO with a length-aware reward: wrong answers are nudged to β€œthink longer”, right answers trimmed for brevity.

7. Benchmark punch – Outperforms DeepSeek-R1-Distill-70B on AIME 25 (78% vs 51 %) and sits within 4 pp of full DeepSeek-R1 671 B.

8. Tool-friendly – First Phi model published on Azure AI Foundry; runs in LM Studio, Ollama (`ollama run phi:reasoning`) and vLLM nightly.

9. Generalization – Gains not limited to math: +10 pp on HumanEvalPlus coding and +5 pp on MMLU-Pro vs base Phi-4.

10. Token efficiency – β€œPlus” answers average 1.5 Γ— tokens of SFT-only, yet still \~25 % fewer than o3-mini-high while matching its accuracy.

  • MiMo-7B β€” Xiaomi’s MIT licensed model HF model hub
  • Executive summary

Xiaomi jumps into open-weights R\&D with MiMo-7B, a 7 B dense family trained from scratch on 25 T tokens and then steered by rule-verifiable reinforcement learning. Four checkpointsβ€”Base, SFT, RL, and cold-start RL-Zeroβ€”push math and coding accuracy past 32 B-plus baselines while keeping inference lean and MIT-licensed. An in-house β€œseamless rollout engine” halves RL wall-time, and the weights ship vLLM-ready with built-in multi-token prediction.

  • Model lineup: _MiMo-7B-Base_, _MiMo-7B-SFT_, _MiMo-7B-RL_, _MiMo-7B-RL-Zero_.
  • Parameter count: 7 B denseβ€”no MoE, easy single-GPU fit.
  • Training corpus: 25 T mixed-domain tokens, multi-token-prediction (MTP) objective from day 1.
  • License: MIT, full commercial green light.
  • RL recipe: rewards tied to rule-verifiable math & code tasks; dense RL (no experts) for lightweight deployment.
  • Zero-shot hero: _RL-Zero_ (no SFT) scores 93.6 % on MATH-500 and 49.1 % on LiveCodeBench v5.
  • SFTβ†’RL variant: matches OpenAI _o1-mini_ on benchmark suites despite being 5Γ— smaller.
  • Training speed: β€œseamless rollout engine” delivers 2.29 Γ— faster RL iterations vs naΓ―ve loop.
  • Benchmarks: AIME 2025 = 55.4 %; LiveCodeBench v6 = 49.3 %.
  • Deployment: weights optimized for vLLM; MTP heads cut latency for long outputs.
  • Takeaway: 7 B dense + task-aligned RL now beats mid-tier giants on structured reasoning, opening a new floor for edge-grade math/coder assistants.
  • KyutAI - Helium-1 2B - (Blog | Model (2 B) | Dactory pipeline)
  • !Image
  • Executive Summary
  • KyutAI just lobbed Helium 1, a 2 B-parameter transformer distilled from Gemma-2-9B and purpose-built for Europe’s 24 official languages. The team open-sourced both the weights (CC-BY 4.0) and dactory, a full Common Crawl-to-dataset pipeline that scores, dedups and tags every webpage. With model-soup tricks and language-aware filtering, Helium sets a new state-of-the-art for its size class while fitting comfortably on phones and edge boxes.
  • Factoids
  • Model size – 2 B dense parameters, grouped-query attention, RMSNorm, RoPE; runs in <2 GB VRAM with bfloat16.
  • License – CC-BY 4.0 for weights, MIT for dactory codeβ€”commercial use with attribution.
  • Training compute – 500 K steps, 4 M-token batches on 64 Γ— H100; total 2 T tokens processed.
  • Data pipeline – 770 GB compressed (β‰ˆ400 M docs) across 24 EU languages; paragraph-level dedup + fastText quality scoring.
  • Distillation source – Gemma-2-9B adapted to Helium tokenizer, then fine-tuned into a compact backbone.
  • Model soups – Weighted merge of base + wiki + books + multilingual checkpoints β†’ +3–5 pp on Euro-MMLU & ARC-EU.
  • Edge focus – Latency <40 ms on an iPhone-grade NPU; no server round-trip needed for translation or chat.
  • Specialized variants – dactory lets you rebuild Helium on domain slices (STEM, textbooks, etc.) without re-scraping the web.
  • Benchmarks – Leads 2 B class on MMLU-EU, ARC-EU, FLORES translation and Euro-HellaSwag; competitive with 7 B models in English.
  • Getting started – `pip install dactory`, process Common Crawl in \~4 days on a 32-core box, then fine-tune Helium with your custom slice.

-

  • Qwen 2.5 omni updated
  • Big CO LLMs + APIs
  • GPT-4 RIP - no longer in dropdown
  • Google - NotebookLM AI overviews are now multilingual (X)
  • with more than 50 languages
  • Gemini 2.5 Flash was released - hybrid

-

  • LlamaCon updates
  • Security release focused
  • – Llama Guard 4 (text + image protection)

– Llama Firewall (stops prompt hacks & risky code)

– Prompt Guard 2 (faster jailbreak defense)

– CyberSecEval 4 + a new Defender Program

  • zuck confirmed thinking models are coming
  • new meta.ai is coming + app with a social feed
  • full duplex voice model is also in the works
  • LLama API is powered by Groq and
  • !Image
  • LLama API
  • OpenAI ChatGPT "glazing" update - revert back and why it matters (Announcement, AMA)
  • "_We focused too much on short-term feedback, and did not fully account for how users’ interactions with ChatGPT evolve over time"_

-

  • Chatbot Arena Under Fire β€” β€œLeaderboard Illusion” vs. LMArena (Paper, Reply)
  • "unfair practices favoring big incumbents like OpenAI, DeepMind, X.ai and Meta."
  • Executive summary
  • Cohere Labs’ paper β€œThe Leaderboard Illusion” claims Chatbot Arena (a.k.a. LMArena) is structurally biased: select big-tech providers privately A/B-test dozens of model variants, cherry-pick top scores, receive far more battle data, and suffer fewer silent deprecationsβ€”yielding inflated Elo/BT ratings and distorted rankings. LMArena’s organizers answer that the leaderboard simply reflects real human preferences; pre-release testing is open to any provider and drives better models, not bias. Critics counter that selective reporting, unequal data access, and opaque removals still skew results. The fight now centers on what β€œfair, community-driven evaluation” means when millions of crowd-sourced votes become prime training fuel for a privileged few.
  • Factoids
  • Undisclosed private testing: Meta ran 27 hidden Llama-4 variants in one month; Google (10) and Amazon (7) did similar pre-launch sweeps.
  • Best-of-N inflation: Simulations show testing 10 variants can lift a model’s BT score by \~100 pointsβ€”enough to leapfrog competitors.
  • Data asymmetry: OpenAI (20.4 %) and Google (19.2 %) each hold \~1.2 M battle prompts; 83 open-weight models share <30 % of the total.
  • Sampling skew: Daily sampling peaksβ€”OpenAI/Google 34 %, Meta 18 %, AllenAI 3 %β€”exposing some providers to >10Γ— more votes.
  • Silent deprecation: 205 models quietly removed versus 47 officially flagged; 66 % of silent removals are open-weight/open-source.
  • Overfitting evidence: Finetuning a 7 B model with 70 % Arena data doubled win-rate on ArenaHard (23 β†’ 50 %), but hurt MMLU scores.
  • Leaderboard volatility: Rapid top-spot swaps (e.g., GPT-4o, Grok-3, Gemini variants within days) align with private-testing bursts.
  • Cohere’s fixes: five proposalsβ€”ban score retraction, cap private variants, balanced sampling, proportional deprecations, full test-log disclosure.
  • LMArena’s stance: β€œIf the crowd likes it, it ranks.” Organizers tout new statistical tools (style/sentiment control) and broader user outreach.
  • Community pushback: Labonne et al. flag best-of-N bias, overfitting via preference data, and ranking drift from model retirement; accuse LMArena of sidestepping these core issues.
  • ChatGPT will do shopping for you
  • !Image
  • This weeks Buzz
  • MCP/A2A Hackathon - with A2A team and awesome judges! πŸ€–πŸΆ (Apply)
  • !Cover Image for WeaveHacks: Agent Protocols Hackathon (MCP, A2A) with Weights & Biases & Google Cloud
  • lu.ma/weavehacks
  • Vision & Video
  • Runway References - consistency in video generation (X)
  • Executive summary
  • Runway’s Gen-4 References brings stable, tag-based image conditioning to every paid plan, letting creators lock in characters, outfits, locationsβ€”or even personal selfiesβ€”and re-use them across unlimited generations. Supply one or more β€œreference” images, then steer Gen-4 with text prompts for new camera angles, styles, or compositions; the model keeps the referenced elements intact while treating everything else as creative space. By decoupling continuity from prompt-only hacks and charging a single image credit per run, References turns Gen-4 into a practical pre-viz, storyboarding, and on-device VFX toolβ€”just in time to counter Sora’s long-form ambitions.
  • Factoids
  • Scope of release – Live today for all paid tiers (Standard, Pro, Unlimited) inside the Gen-4 tab.
  • Input types – Accepts photos, AI images, 3-D renders, sketches, even front-camera selfies.
  • Multi-reference magic – Combine separate character + location images; tag each (`<char1>`, `<loc1>`) in the prompt to anchor both.
  • Consistency guarantee – Holds facial structure, clothing pattern, and spatial layout across sequential generations or animation frames.
  • Credit efficiency – Reference runs consume the same credit cost as any Gen-4 still; no surcharge for multi-image conditioning.
  • Prompt control – Swappable style refs: upload a texture or concept art, tag it, and prompt Gen-4 to blend stylistic cues onto locked subjects.
  • Animation hand-off – A saved still with references can be passed directly to Gen-4 Animate for motion, preserving identities scene-wide.
  • Edge cases – Works best on single-frame characters & locations today; roadmap includes object-level and fine-grained style fidelity.
  • Community demos – Users replicate two consistent leads on the same park bench, rotate virtual cameras, and insert CG cars without drift.
  • Competitive angle – Positions Runway as the first consumer tool offering multi-anchor reference generationβ€”an edge over Midjourney Remix and OpenAI Sora leaks.
  • AI Art & Diffusion & 3D
  • HiDream E1 (HF)
  • !demo.jpg
  • Agents, Tools & Interviews
  • OpenPipe - ARTΒ·E open-source RL-trained email research agent (X, Blog | GitHub | Launch thread)
  • !Image
  • ### Executive Summary
  • OpenPipe distilled a 14 B-parameter Qwen 2.5 backbone into ARTΒ·E, an Apache-2.0 inbox agent trained on 500 K Enron emails plus synthetic Q\&A and refined with reinforcement learning that optimizes for correctness, brevity, and fidelity. The result tops o3 on accuracy (96 %), slices end-to-end latency to 1.1 s, and lowers operating cost to $0.85 per 1 000 queries, all with a three-tool loop you can drop into any stack.
  • Factoids:
  • Model size: 14 B dense parameters, no MoE, weights fully released.
  • License: Apache 2.0β€”unrestricted commercial use.
  • Training corpus: 500 K public Enron emails; Q\&A pairs synthesized with GPT-4.1.
  • RL stage: policy fine-tuned on task reward ⟨accuracy, turns, hallucination penalty⟩.
  • Tool interface: `search_emails`, `read_email`, `return_final_answer`; no planners, no recursion.
  • Accuracy: 96 % correct vs o3 90 %, o4-mini 88 %, GPT-4.1 71 %.
  • Latency: 1.1 s median full runβ€”5Γ— faster than o3, 3Γ— faster than o4-mini.
  • Cost efficiency: $0.85 / 1 K runsβ€”64Γ— cheaper than o3, 9Γ— cheaper than o4-mini.
  • Deployment: ready for Ollama (`ollama run art:e-email`), Azure container, or local vLLM.
  • Takeaway: tight task-aligned RL plus synthetic data can eclipse larger frontier models on vertical workloads without exotic agent stacks.
  • PromptEvals - Interview with Shreya Shankar ( NAACL paper | Dataset | Models )
  • PromptEvals is the first large-scale corpus of what engineers actually write and check in production LLM workflows: 2K+ developer prompts paired with 12K+ assertion criteria that cover structure, style, grounding, and hallucination guards. Collected from LangChain’s Prompt Hub and cleaned by hand, the set is five times bigger than anything before it and ships with open Mistral-7B and Llama-3-8B checkpoints that auto-generate assertions faster and cheaper than GPT-4o while scoring +21 F1. For anyone building eval pipelines, PromptEvals is both a ready-made benchmark and a drop-in source of realistic test casesβ€”finally letting us design evaluation methods on data that mirrors real-world constraints instead of toy tasks.

PromptEvals β€” Key Factoids

  • Scale bump: 2 ,087 real developer-written prompt templates paired with 12 ,623 assertion criteriaβ€”β‰ˆ 5Γ— larger than any prior public set.
  • Source of truth: Prompts snapshot from the LangChain Prompt Hub (May 2024); median prompt length β‰ˆ 191 tokens, spanning 40+ domains from finance to horse-racing analytics.
  • Constraint taxonomy: Every assertion labeled into six categoriesβ€”structured output, multiple-choice, length, semantic, stylistic, hallucination preventionβ€”following Liu et al.’s eval taxonomy.
  • Open weights: Two fine-tuned models released (Mistral-7B, Llama-3-8B) that auto-generate assertions; both MIT/Apache-licensed and hosted on HF.
  • Performance pop: Fine-tuned models beat GPT-4o by +20.9 pp average Semantic F1 while cutting latency (\~2.6–3.6 s vs 8.7 s) and cost.
  • Benchmark bundle: PromptEvals test split + scoring script (Semantic F1 & criteria-count metrics) double as an open leaderboard for assertion-generation tasks.
  • Developer realism: Assertions mirror production guardrailsβ€”JSON schema checks, tone policing, grounding testsβ€”and can be executed directly in LangChain eval pipelines.
  • Data quality workflow: Three-step GPT-4o-assisted generation β†’ human spot-check β†’ dedup/refine yields <0.2 fixes per prompt in manual audit.
  • Latency edge: Mistral-7B LoRA variant generates a full criteria list in \~2.6 s on dual A100sβ€”fast enough for live prompt-editing loops.
  • NAACL spotlight: Paper accepted to NAACL 2025; team invites community to extend the set with multimodal prompts and new assertion types.

***

  • Maven Course - AI Evals For Engineers & PMs Questions for Shreya Shankar & Hamel Husain (link Promo code - `thursdai`)
  • What blind spots in current LLM-eval tooling motivated you to create PromptEvals and then formalize them into a course?
  • How do you teach engineers to translate a fuzzy product spec into concrete, machine-checkable assertions?
  • PromptEvals shows fine-tuned open models beating GPT-4o on assertion generationβ€”how do you weave that insight into the course labs?
  • Can you walk through a real industry case where assertion-driven development saved a launch or uncovered a silent failure?
  • The syllabus emphasizes β€œsystematic error analysis.” What frameworks do you give students to prioritize which errors to fix first?
  • How do you balance code-based guards versus LLM-judge evaluations when latency or cost is tight?
  • For RAG and tool-calling systems, which additional metrics (beyond accuracy) do you require students to track in production?
  • What practical tips do you offer for sampling and labeling data so that human reviewers remain effective without burning budget?
  • The course bundles $1 k in Modal credits; what compute footprint should students expect for the assignments, and how do you teach cost governance?
  • Looking ahead, what research questionsβ€”dataset gaps, evaluation metrics, or agent behaviorsβ€”are you hoping future cohorts will tackle?

-

Alex Volkov
Alex Volkov 0:29
Welcome everyone, welcome, welcome to May 1st, ThursdAI, a lot has
0:36
happened this week, so we're gonna talk very fast and hopefully we'll cover everything because a lot has happened this week, starting with a very sad event, that we will cover in a second. Yes, GPT 4 has been officially RIPed. Rest imprompted, and we're going to cover this and we're going to have a moment of wake for GPT 4, because, folks, many of folks who follow us for the past two years know, many don't know that GPT 4 was the start of ThursdAI. We were born on the same day. it was May 13th, 20, May 14th, 2023. And, That's when ThursdAI started. And so we're gonna have a little wake for GPT 4, and say some words, and invite some people to write some words as well in the comments as well. with you, Alex Volkov, I'm an AI evangelist with Weights Biases. I've been doing ThursdAI for more than two years. With me, our co host, Wolfram Ravenwolf, all the way from Germany. Hey, Wolfram.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:26
Hello, everyone.
Alex Volkov
Alex Volkov 1:26
Yam Peleg is with us.
1:28
Hey, Yam. So we're an international crew here. Not so cold anymore, Colorado, in Denver. And we also have an international crew of folks joining us from all over the world. We're streaming live on YouTube, on X, on LinkedIn. There's a few folks on LinkedIn. and I think for the first time we're streaming live to the Weights Biases audience on LinkedIn. Hey, Weights Biases audience, if you're seeing this, this may be your first time with us. So welcome. all right, folks, I think it's time for us to get started with, With the TLDR. First of all, I think we'll do the wake afterwards. So folks who are joining for the GPT4 wake, I think we'll wait a little bit, but, this week has been a crazy week. And, I think we did something last time, although we don't have that much time today. I'll be very punctual with the clock. We have a lot to cover. we may have Junyang join us from Kwin, which I believe was the biggest like open source of this week. last time I went around to just ask my co hosts, what do you guys think? What was the one highlight of AI for you guys? And while my co hosts answer this, I invite everybody in the audience also to comment and tell us what was the one thing. Maybe we missed it, maybe it was a huge thing. how about we start with Wolfram, what was the highlight of your week, this week?
Wolfram Ravenwolf
Wolfram Ravenwolf 2:45
Definitely GWEN3 because this model, it does it all.
2:49
It's a big size, small size everywhere. It is an amazing model. We will talk about it in more detail. I'm running my benchmarks and I'm loving it. Great work.
Alex Volkov
Alex Volkov 3:00
That was the most exciting for me.
3:02
Let's go to Yam. Yam, what is your highlight this week?
Yam Peleg
Yam Peleg 3:05
Oh, definitely GWEN3.
3:07
GWEN3. Definitely. absolutely amazing model. These numbers are unheard of. you can argue if they translate well to, vibes and so on, but numbers themselves are unheard of for these sizes. And it's just amazing to see, every New release, a smaller model gets to the same performance as the largest models, and you think, okay, that's the end. We're not gonna get like a 7b better now, this time, and then a new team just, pops up out of nowhere, and just put it on Twitter, have fun, and enjoy, and it's every single time, like every month, something like this happens, and really makes you think Like, when, what's the limit of those models? Of the small models? Like, when, how far can we push them? And it's also really hard to even measure the limit, even measure how hard it is. it's all vibes from here all the way down at this point, at this moment. yeah. Yeah, definitely. QU three. Yeah,
Alex Volkov
Alex Volkov 4:05
all vibes.
4:06
Absolutely agree with you. Quin three for us as well. Niton. how about you give us something that's not QU three, if you have, if not, it's also fine , if all of us are excited about QU three, that's great.
Nisten Tahiraj
Nisten Tahiraj 4:16
the new, VLLM release, they fixed a lot of the CPU offloading.
4:21
Which makes it very good for running Quant3.
Alex Volkov
Alex Volkov 4:25
Yeah, it's also Quant3 related.
4:29
Yeah,
Nisten Tahiraj
Nisten Tahiraj 4:30
I like, so besides Quant3, by the way, my most used model
4:34
has been QuantCoder32B until now, so
Alex Volkov
Alex Volkov 4:37
I'm
Nisten Tahiraj
Nisten Tahiraj 4:37
glad to have something else, but this one
4:42
just flew under the radar. It's the new Olmo model 1b and that, that beats the older 2. 5, 1. 5b. So Olmo 2, so that's a very nice, very small model. And it's one of the only ones where you have a complete, access to all the data too. And the training logs, so that's from our friends at Allen. And, yeah, I would say that is something that flew under the radar. And there is also another voice thing, which I'm just looking up, because I don't remember, so
Alex Volkov
Alex Volkov 5:13
I'll give it a try.
5:14
All right. I will say the highlights are usually like good things, but also I want to add like notable things. I think for me, notable is GPT 4. is no longer with us and as i said we're gonna have a wake in a bit but i think it's time to jump to the tldr section because as always we have a lot to run through and we for folks who are busy we're doing the tldr which is where you can see everything we're going to talk about and everything with the links we're going to run through it super quick this time and then hopefully we'll jump into open source let's do tldr No, not this one, sorry. One of these days. Alrighty, folks, we are in the TLDR section. A lot has happened this week, a lot. I have a very long list of things that happened this week, starting with open source LLMs. As we all mentioned, Alibaba drops Qwen 3, which is just a huge release. Like models across the board. There's two huge MOEs, mixture of experts. I believe first ones from Qwen, will folks will correct me afterwards. and then also six dense models starting with a very tiny 0. 6 billion, 600 million parameters. Qwen, it's useful for some things. It's surprisingly not bad. All the way up to 30 billion parameters. The stats are incredible. We're gonna hopefully have Junyang Lin with us if he's able to join. And if not, we're just gonna cover it ourselves. We're now friends of Gwen for the longest time. In a surprising release yesterday, Microsoft released PHI 4 with Reasoning. That was New and exciting. We covered PHI, I believe, from the first one. We covered tiny stories, we covered why the synthetic data is great for models, and now the reasoning model as well, and 5. 4 reasoning and 5. 4 reasoning plus, which is super cool. Xiaomi? Also released a model, it's called Maimo, and also to, surprise, it's also a huge, huge Chinese company, and we should at least mention that they are releasing MIT as well, which is very interesting that most of the Chinese models are coming out with open source, everybody else over here, and AcuteAI, which we remember from the voice models, Mochi, they released Helium. 2 billion parameter, Helium 1, and we definitely should mention this, a 2 billion parameter, Helium 1. I believe that's most of the most open source news. there's a few more things in open source. DeepSeq released a prover, like a huge prover that's very interesting for folks for, The specific math, but not necessarily a wide appeal. And there's a few more, but because this week was huge, this is like the main thing. GPT 4 is no longer with us. You cannot find it in dropdown, you cannot use GPT 4, you cannot send inference. Not that you had for the past year and a half, nor did you have a reason to, but GPT 4 started a revolution, and we will do a wake immediately after this, once we get to the big companies and APIs. Google? you guys remember we had the folks from Google, we had Reiza, Abu Bakr, and we had Steve, talking to us when NotebookLM launched, way before it became viral. NotebookLM now does the AI overviews, the podcast y overviews of everything that you want, in multilingual. It does them, I believe in 50 languages, and they sound dope. So that's super cool. Meta had their LlamaCon, a convention specifically dedicated for Llama, where everybody expected a new model. But besides this, they released a bunch of mini kind of tangential security models like LlamaGuard. But they also talked about LLAMA and LLAMA4. coinciding with this, Zach went on the Dvorkesh podcast and gave some replies on why LLAMA, started on LLMCS Arena as number one and then what we got is not number one. but yeah, we have some news from LLAMACon and then some folks in the audience also will give us some news as well. I actually wasn't there, looking forward for an invite. Next time to be able to cover Lomocon, and then also the, one of the biggest pieces of news for this week was GPT glazing update. I don't know if you guys are familiar with the world glazing. I wasn't until Sam Altman tweeted it out, but basically sycophancy is a thing. It's where the model basically is not able to tell you anything, constructive. It all barely tells you anything constructive. It's just agrees with everything you say and highlights everything you say as though you're the king and you pay it and so for the past two weeks gpt4 had been Many of the community noticed this behavior including reaching out to the folks in OpenAI and saying hey, this is just like incredible. I told it I did one push up Then it went on. How amazing I am one, like top of the population, basically something like this. We'll definitely cover what went wrong there because, it was so bad. The Open AI not only admitted it, open air rolled back a release, which I don't remember for the past two and a half years doing this. I don't remember them rolling back anything ever. maybe they did this quietly, but here they also went on an a MA Junior Yang, I believe. No, not Jang, sorry. I'll remember her name. The head of behavior went on an A MA. with, with Reddit to try and address. Joanne Jiang, went on AMA with Reddit, and, yeah, we will talk about the behavior change and how it affects half a billion people now. And then, next up, we have the other kind of interesting thing. Chatbot Arena is under fire, folks. Folks from Cohere released a paper called Leaderboard Illusion, where they claim that the way, Chatbot Arena is built is actually, favoring the bigger labs. And then there's a response from LLM Arena as well, and then some, community members, like Maxime Lebon, responded back. We're gonna chat about this, because we've had our qualms with LLM Arena as well. We'd love to tell you which model is the best one. This is like part of what we do here on ThursdAI, Vibes. And then they're supposedly testing Vibes, but, it's very interesting. very interesting that, there was this paper and then we should definitely discuss this. Also in big news, CharGBT will do shopping for you. This is also a new era. they combined with Shopify. They will now. Do shopping. And also according to this, I just started realizing how much chatGPT knows about me. And I don't think there is a service on the internet. I honestly, I don't think that my mom knows as much about me as chatGPT. this is now a thing which I definitely would like to discuss though this week is crazy busy. So we'll hopefully we'll get there. In this week's buzz, a category where I cover things that happened in Weights Biases, or Round Weights Biases, or Related Weights Biases. I want to tell you again that we have a hackathon coming up, May 17th and 18th in San Francisco. the hackathon is together with Google. I can finally announce this. Google is a sponsor, presenting sponsor for this hackathon. Hackathon is going to be MCP and A2A, the protocols that we both covered them. We covered A2A with Google. Todd Segal from Google, one of the core developers of the protocol. And we covered MCP with Jason Neen and Dina Kozlov, who's in charge in Cloudflare. and we will invite all of the developers from around the world, if you want to fly in. I'm not going to pay for your ticket, but, if you want to fly in, you're more than welcome. But if you're in San Francisco and the Bay Area, you're more than welcome to join our hackathon. Would love to see you there. the hackathon is called WeaveHacks and you can look it up on Luma. And we'll definitely add the link to this in the show notes as well. In the vision and video, finally we're getting to where I really want this whole thing to be. Runway added references, and references are also available on the free plan as well. References allow you to generate video character consistently. So character consistent video throughout frames, which looks quite incredible. And this has been the bane of like video generations. You can create incredible scenes, maybe even based on your own video, but then the consistency for the next frame is not there. Runway is trying to solve this, and that's super cool. We definitely should look at that. That, we, I believe, covered HighDream before, but in AI, Art, and Diffusion, there's a new contender in open source, it's called HighDream, and it does Ghibli as well, HighDream E1, and that is an open source model, we should at least mention this as well, and I think that in agent tools and interviews, we will have an interview with with the friends of the pod, Hamel Hussain and Shreya Shankar, specifically because they also, and Shreya released something called Prompt evals this week, and they also have an upcoming course in which yours truly is a guest speaker, so we'd love to chat with them about evals, everything evals, LLMs, and, the upcoming course, and also Prompt evals, of course, Prompt evals work, and then also in AI tools, agents, and interviews. I have this, I have this. This is a catch all category for me. OpenPipe, our friends of the pod as well, released ARTE, which is an open source, RL trained email research agent that they claim does a better job than, than O3, which is very interesting. we may invite Kyle at some point to chat about this with us, but, absolutely something to talk about as well. I want to see if I missed anything big from folks here on stage. Anything huge that I missed that we should cover? Because this has been a lot.
Wolfram Ravenwolf
Wolfram Ravenwolf 13:48
Not something huge, but since you mentioned the event in
13:50
the US, there's also an AI developer community event in Berlin in two weeks, so I've been to the first three times, and now the fourth one is in Berlin. There are still a few slots left. I will drop a link in the comments.
Alex Volkov
Alex Volkov 14:03
Alright, we'll add the link to this in the show notes as well.
14:05
Folks, I think it's time for us to get started, because the show is long and we have a lot to cover. Let's get started with And our favorite corner that we love every time, Open Source AI. Let's go.
14:30
Open Source AI, let's get it started. I love the opening seQwence to this because I specifically have the Applying Apache 2 license. And speaking of Apache 2, I think the biggest news that we Need to cover this week, is the release of one of our definitely favorite friend of the pod, labs out there, Alibaba, the Qwen team in Alibaba specifically released, bestowed upon us an Apache 2 series of models, called Qwen 3. We've been following Qwen for a long time since Qwen 1, QwenCoder. There's a bunch of releases, QVQ, QWQ, all of the great stuff. And now, Qwen 3 is a. Full stack of open weighted models, two MOE giants, two mixture of expert giants. One is 235 billion parameters with only 25 billion active parameters. and then the other one is a 30 billion parameter with only 3 billion parameter active. and then six dense models, ranging from 0. 6 billion all the way to 30 billion parameter. All shipped under Apache 2 license with day one support! I don't know if you guys saw this, but they want support for literally everything that can run a Qwen. LM Studio, our friends, OLAMA, MLX, VNLM, they also trained them for MCP and mentioned MCP specifically in the release notes. Let's talk about this, folks. I have a bunch of stuff. I want to highlight this one more thing, then we should absolutely discuss this. There are also hybrid thinking models. They're reasoners, so we had the reasoner before. I believe Koen was the first open source reasoner with WQ before DeepSeq. and then we could play with this. Back then it was like experimental, whatever. These models are hybrid reasoners, which we talked about on the show, Sam Altman announced that they're gonna go to there, and they didn't go there yet, so the OpenAI models are not hybrid. Claude, I believe, to some extent, Gemini Flash 2. 5, which was released last week, we should probably also mention this, it was released after a show, literally on Thursday, but after a show, so we'll definitely mention Gemini 2. 5. That's a hybrid reasoner as well, you can specify reasoning. This, I believe, is one of the Second, NVIDIA NEMOTRON was also hybrid, isn't it? But this one has a thinking switch that you can enable mid prompt. Nisten, I know you noticed this for sure. And we definitely should talk about this. Like a user can pass, ah, I want you to think more. So that's super cool. Let's talk about the fact that they released both MOEs and non MOEs. Yam, would you want to, would you want to take this one and discuss the MOE approach versus non MOE approach? and a few of the evals, then we can chat about evals as well.
Yam Peleg
Yam Peleg 16:58
look, there is a lot to discuss.
16:59
I just want to say that, I think that, supporting everything since day one is a lesson learned from LLAMA, probably. from the LLAMA release that, had, its own issues, not long ago. and yeah, I think it was the first thinking model coming from Qwen, first open source. today we take it for granted, like it's expected. Zach also said on the podcast that LLAMA4 is going to get thinking. In, over the, over this year. anyway, what I want to say is, first, the, it's important to consider, to look at the size of the models, and what they compete against. on the Qwen release, tiny models that are competing with 0. 1 on several benchmarks. That's incredible. Let alone, put aside the MOE, larger models, and so on. Even the smaller models are, like, how did they pull this off, like, how did they even do this? Seriously, it's it's the pinnacle of pushing the model to the extreme. You can invite everyone to just go and read the technical materials about how it was done. Also, important to mention, Nisten said it before, Qwen Coder was And one of a kind. There wasn't any model of this size with this performance or resource. a lot to expect. Those things, coding models that you can run yourself locally. Even if they're not the state of the art of state of the art are like local co pilots, which is a massive unlock that you can use yourself. It's just amazing to see, seriously. Anyone want to take the MOE versus non MOE number of parameters? Because there is something about there as well worth mentioning.
Alex Volkov
Alex Volkov 18:59
before we get to number of priorities, one thing that you
19:01
asked that we should also mention, the number of training tokens. We often mention this, it's a massive, if I'm reading this right, 36 trillion tokens, which is insane. we've been talking about, the number of tokens growing as well, 36 trillion tokens, twice as much as GWENT 2. 5. and they have, PDF extraction that they use VL for, synthetic math and code that they generated with Qwen 2. 5 Math Encoder. Like both these great models that we know and love, they did like synthetic training for this new model. It seems that, that kind of like the loop that we're talking about, that like the previous models will generate synthetic data for newer models. New models become better and include this. It's what also we're seeing right now with the square model. Yeah, you want to talk about the size parameters? We can talk about this as well.
Yam Peleg
Yam Peleg 19:48
Nisten, have you seen this?
Nisten Tahiraj
Nisten Tahiraj 19:51
yeah.
19:51
it is actually similar to, the DeepSea Coder model. when Coder v2 came, it was pretty revolutionary. this, So that was around like 220b or so. So this is around 235b and with the same number of active parameters. So that was one of the first like very good models that you could run on a CPU as well because you can do 11 tokens per second. So the sizing on this And I always really like the architecture and I don't know why they moved away from that towards something bigger. Because it's pretty ideal, for example, if you have a Mac or a Windows 10 computer, Yeah, or if you have a CPU, because that's, that's what I like. But also for people serving this stuff in production, because you can parallelize a lot of their requests so much, so it actually ends up being a lot cheaper when you just throw a lot of GPUs at it. anyway, all of that aside, It is fantastic. the Kwin models have always been good, workhorse models, we would call them, since, Kwin 1, the 70B, and, they're also very easy to fine tune. this one's getting all the math questions, all, the tricky physics questions that I throw at it, it's getting them perfectly. it can bytecode. Somewhat, it's still not SONET at VibeCoding. nothing beats SONET at, at VibeCoding, but, it's not like it's that far off. It's not SONET
Alex Volkov
Alex Volkov 21:28
at home
Nisten Tahiraj
Nisten Tahiraj 21:28
yet.
21:29
yeah, I think it is definitely SONET 3. 5 at home. I tested it with all the tool calls and, it is doing, it's doing all the tool calls well. Yeah, I have found the coder to be, so I tried the, so I tried the 4B, I had early access to the 0. 6B, and, Oh, look at you, fancy. Yeah, that was out for a good, three weeks on their thing, and, I never actually, they even had another small MLE, which, I don't know if they actually released that one, but that one was scoring a crazy score. basically they trained this big model and then they distilled all the tiny models. I am, yeah, I'm pretty glad that they released all of the work because you often see with, shops that they do train a whole bunch of smaller models and bigger models and stuff, and then they only release the ones that like do well in the benchmarks. They released all of it. And, and they're all, great. So right now, yeah, I'm using it with the, Fireworks API. I was able to run the MOE one myself, and, that one ran on a single H200 for 1. 50 an hour. It runs at 14 tokens per second, which is okay. It could do better. Yeah, I'm looking at, oh, and then you can also just own the model too, because the license allows you to do that.
Alex Volkov
Alex Volkov 22:47
Apache too?
Nisten Tahiraj
Nisten Tahiraj 22:48
Yeah.
22:49
All over.
Alex Volkov
Alex Volkov 22:49
I want to add, Yam, to something that you said.
22:52
The efficiency math here is crazy. The MOE basis, the hit parity of GWENT 2. 5, While activating 10 percent of the parameters. Kwin 2. 5, which was released 4 months ago? 5 months ago? I, I don't really remember. we were very celebrating that model. That model is now the base of many Finetunes. Many folks running Kwin 2. 5 based stuff. And a lot of, the Reasoners, if you guys remember, when Reasoning came out, RL came out, it was a big thing. And many people trained on GPRO and, like, all of these things. They used Kwin 2. 5, multi models. 10 percent of the active parameter is all it takes now to get to the same level of quality of Coin2. 5. 235 billion, the MOE, the big one that they released, matches or exceeds DeepSeq R1 and DeepSeq R1 broke the stock, market. When it came out, because everybody's oh my god, like GPT 4 at home, the, it's half the parameters. DeepSeq is 560 something parameters, if I don't remember, if I remember correctly, the MOE. this one is like less than half the parameters and matches or exceeds the model that we all thought is a crazy state of the art, only in January, like not only us, it broke through the bubble. And matches O1 and O3 mini in multiple things, right? Especially in math things. Grok 3 and Gemini 2. 5 on coding and math as well. and the highlight of this, for me, the highlight is the 4 billion parameter dense model beats GWEN 2. 5 72b. Four billion parameter model beats the previous highlight of 72B Instruct on like multiple things. It's a four billion parameter model. we keep talking about having these models at home for privacy's sake, for performance speed, maybe not so much because they run like crazy on VLLM. Definitely for privacy's sake, definitely for doing your own thing, fine tuning especially, four billion parameter model gets very close there. Let's talk about the reasoning. Wolfram, I would love to chat with you about just your vibes around this model, your vibes from the community, but also the reasoning stuff. The fact that these are like hybrid reasoners would love to open this up, because I think it's very important.
Wolfram Ravenwolf
Wolfram Ravenwolf 24:51
Yeah, the reasoning is especially important and interesting
24:53
because, when OpenAI was announcing their open source model, a lot of the people who were there wanted a non reasoning model. So it's faster and, doesn't use so many tokens. So now Gwen has shown us that you can have both. You can have a way to disable the reasoning, so I expect from OpenAI now to match that as well, to give us an MOE model, which is a
Alex Volkov
Alex Volkov 25:15
Last week we
25:15
chatted with Maziar, who went to the OpenAI, OpenAI event, and Maziar specifically said that the community reacted to them and said, Hey, please give us the way, like the ability to turn off freezing. He wasn't the only one. And this was the sentiment from the community. So definitely great. Sorry, I just wanted to highlight this one.
Wolfram Ravenwolf
Wolfram Ravenwolf 25:32
Now we have seen that it is possible and how
25:34
easy and well done it can be. So that is a great thing. Also the MOE approach means we can run it faster. because we don't have as many active parameters which is important when you are reasoning because you are generating so many more tokens so that is also an aspect that is here and that is the great thing we not just got a big model like DeepSeq did but we also got a smaller one which we can run locally i can run it on my macbook and it doesn't take much and it has so many contexts it is super fast and the quality is excellent so what they did is not just deliver an amazing model But deliver an amazing model everybody can use, depending on your hardware, of course. If you have a super computer, you can run the big one. And since we mentioned Sonnet at home, I tested at Fireworks, the big one, the 235B, Active 22B, the active one. and it was directly between 7 sonnet in the MMLU Pro. But on my own system, I also got a score of 80 percent in the MMLU Pro. That was just a Q4 4 bit of the 30B for Active 3 bit. that was reasoning a lot, and so it got a good score, and it's running locally. I've never seen such a good score on a local model. That is absolutely, blew me away, absolutely amazing. They delivered a top model that you can run everywhere, on your phone, the small one. It doesn't get a good score. The small one is a 0. 60, that only got a score of 30, 40 something. So that wasn't super. That is to be expected, but the big one, that is great.
Alex Volkov
Alex Volkov 27:12
We have a shout out from Fernando Neira, a friend of
27:14
the pod, also an ML researcher, that says, worth paying attention to the depth of the 30 billion MOE. It's deeper than 8B and 14B as well. All right, let's talk about evals, folks. On Artificial Analysis, GPK Diamond, Qwen3, the top one, there's a lot of Qwen3s, but like the top one, the 234 billion parameter, is landing above GMA 2. 5 Flash, which we also should mention came out since we chatted with you guys last week, and above LLM4 Maverick. Zach is in shambles right now, but I think that the reaction from Zach is specifically that the LLAMA is multimodal and these models are not multimodal, so no multimodality, but multilinguality, support for 119 languages in these Kwin models, which is also very impressive. I chatted with it in Hebrew and in Russian and it works well.
Nisten Tahiraj
Nisten Tahiraj 28:07
Yeah, I think the star of the show here is the 30B MOE
28:13
with only 3 billion active parameters. Yep. Because people are getting, 100 tokens per second on their, MacBooks on this. Yeah. And you're getting pretty close performance to the big one, but it's just way faster. And again, it's 30 gigs on disk and, you're only using three gigs of them at a time in memory, or it's half that if you're running it in four bit and people are getting, crazy speeds on that. that, so that, I think that's like the most. democratized form of intelligence right now because in 4 bit you only need 16 gigs, well more like 20 gigs of VRAM to actually run it so you can buy a GPU now for less than a thousand bucks and just have this extremely capable, also extremely fast user experience. how much memory do people use when you run it? You only need 20 gigs. Really? yeah, it's a 30B, for, for up to, up to 32, 32K or 20K, I think, which is, it is crazy. That's crazy. Yeah, absolutely. Yeah. We should have mentioned the context length
Alex Volkov
Alex Volkov 29:25
for these
Nisten Tahiraj
Nisten Tahiraj 29:25
models as well.
29:26
so now for a thousand dollars, you can get this amount of intelligence that is also fast. Yeah. It's fast. I cannot.
Alex Volkov
Alex Volkov 29:35
100 tokens per second locally is crazy.
29:37
Yeah. It's crazy.
Wolfram Ravenwolf
Wolfram Ravenwolf 29:39
And it's super
29:40
cheap. Even the big one on Fireworks cost me just 15 cent for the full MWU Pro's computer science benchmark. That is a million token. So very cheap. Look, I didn't
Yam Peleg
Yam Peleg 29:50
have time, I didn't have time to Tested myself,
29:52
but seriously, Sonnet at all? You really feel this way, guys?
Nisten Tahiraj
Nisten Tahiraj 29:57
I feel that way with a bigger model.
29:59
the annoying thing is that I started using a bigger one and I really like it. And, so now I do notice the difference when, I use the small one. You just want to use the best. So if you're going to try any of this Just try the 30 b MOE and the don't try the other one because then you're just always gonna want the other one. But it's, it sounds,
Alex Volkov
Alex Volkov 30:20
has a lot of stuff in there, that, that is hard
30:24
to just get by just talking. So vibe coding, for example, I've seen a few examples of folks like trying to build like interfaces. And Simon just gives like an incredible one where Qwen does something, but it's not comparable. I've seen a few examples, as always with these models, some folks will not like them. I would point out that there is a few folks who've tried and got different results, on LocalLlama. Like the thing to highlight is all of these models are Apache 2 licensed. All of them are finetunable. The community is going to come out and finetune. Already folks started doing this. Their reasoning, so for tasks like math and code, they're great. They're absolutely great. And you can like, absolutely turn off reasoning if you don't want to. Oh yeah.
Nisten Tahiraj
Nisten Tahiraj 30:59
So by the way, just to highlight that for the
31:01
viewers, the turn off reasoning is just, Slash, no, underscore think. Slash no think. that's all you need to do. Oh, and, just to address any fights, not fights, in the community, but arguments, as to who first, invented dynamic reasoning, because I posted that, Kwindla finally solved it and, technically also the QEQ model, has solved it. Technically also the, Hermes 3 also included it. Also, technically the Hermes, 405B release had both, thinking and reflection tokens that you could turn on and off. In the model. yeah, so that part of the post don't, take that as absolute what I was saying, but this is, I don't know, this is one of the best implementations.
Alex Volkov
Alex Volkov 31:50
We're very, yeah, somebody commented that NoThink
31:53
will turn off thinking and then SlashThink will turn it on back again. So in the middle of the conversation, you can give your users the ability to turn on and off thinking. That's, I don't remember any other models doing this, with this specific thing. That's great. I foresee this coming everywhere.
Yam Peleg
Yam Peleg 32:07
have you seen the commands for Claude code?
32:10
the secret commands for cloud code. No. Which is ultra think and mega think. Mega think and apparently ultra think is to think more than mega think ultra and more than mega
Alex Volkov
Alex Volkov 32:21
for
Yam Peleg
Yam Peleg 32:21
sure.
32:21
There's Yeah. Yeah. Yeah. And there are like five secret commands that are like ultra think, mega think and it, it was really funny when people found out and There is, there are levels for this, but yeah,
Alex Volkov
Alex Volkov 32:33
I'm sure that this will be built in because Gemini 2.
32:35
5 has a number of tokens. You give it like they literally expose it via the API. We should mention this as well. last thing that I want to mention here, folks, we've been like, hyping up, Qwen 3 for a while. they have support for MCP. We mentioned this by the way, you guys remember when we looked at the Qwen interface, there was like an MCP button there, Yam, I think you noticed that it's not live yet. However, the folks in. the folks in China, if you look at Modelscope, which is the hug and face for Chinese models as well, they have an MCP registry in the Modelscope thing. Very interestingly, Hug Face doesn't have one, but Modelscope does not give an F, and there's like a bunch of very specific Chinese MCP servers as well, so China ecosystem picks up MCP in a big way, and this is a great example because they specifically mentioned the That they've trained MCP selection on these models. MCP selection is a little bit different than just tool calling. And these models have like specifically mentioned, they have a built in MCP schema and you can pair them with their Kwin agent, which they have a free version on GitHub, for like tool calls. And I believe that not a lot of models have been trained for MCP specifically besides Claude and now Kwin. And I think that's also something Probably not
Yam Peleg
Yam Peleg 33:38
even Claude.
33:40
Probably, I just want to say Claude was Yeah. Go. go on. You're gone.
Alex Volkov
Alex Volkov 33:45
I, I believe that I heard from folks, that after they
33:49
they released MCP and started seeing like a big deal that the Claude has some updates, but I may be wrong. I hope that too. Hope that too. But yeah. But Claude has some, like more of understanding what MCP would be, because they've been working on MCP way before they released it, and then by the time they released it, they wanted like a good experience. I believe 3.7 maybe has a little bit. more than 3. 5.
Wolfram Ravenwolf
Wolfram Ravenwolf 34:06
I would like to add something because the tool calling is
34:09
very interesting because, the content seems to be, and I have noticed the same, that the model, the word knowledge of the model is a bit lower than we would expect. While the intelligence is very high, which is interesting from an agentic perspective or if you are a stressing rack or tool calling, a lot of my clients have always asked me how can I make the AI forget a lot of things. It should just answer what we have in the rack and we don't want it to talk about world events or stuff. So maybe that has been, it's not a drawback, but actually a feature of this model when you are using it in such a context. You want the information from the context to be what is that the model, interacts with and not so much background knowledge that could confuse it. So maybe that is not even a bad thing, but a good thing. And yeah, we have a lot of, options for fine tunings or Stuff can be inserted back in again. And that is the great thing about GWEN, because it is now on base. We will see so many fine tunes, I'm sure of it. And that is a bit sad that is not happening with LLAMA, because that is what Meta's role was, or used to be. That they released the models in different sizes, and everybody could use them and build so many. That's how the whole ecosystem was created. And now, MetaPast, And the way to the Chinese that are taking over this. That was a big feedback
Alex Volkov
Alex Volkov 35:23
as well on the MetaLlama 4 release, where we are used to smaller
35:27
models that we can run ourselves, fine tune ourselves and enjoy them at home. And then what's the point of open sourcing models for the community when you cannot run them? You need to run them in the cloud anyway. I wanted to highlight this last thing before we move on, folks. I had a run of evaluating on my machine. This is the first model, I believe, that I like literally used Weave and LLM, sorry, LLM. LLM Studio, I don't know why I blanked on this incredible piece of software. LLM Studio came out with first day support for Kwin models. Shout out LLM community, LLM Studio community for releasing the context version. I ran the Kwin 30b with 3 billion parameters locally, and then I ran it through my Weave evaluation dashboard. And you can see here. this is the model. It gets 43 percent of the 20 super hard thinking questions from AI Explained. As you guys know, like the Beth Ice Cube question, like there's a bunch of others. We always test thinking models for them. Reasoning models is significantly better. This model, the 30 billion parameter with only 3 billion active is very close to other models. It's by far the highest one that I ran open source. And on this test specifically, it beats GPT 4. 1 mini and 4. 1 nano. It gets very close to. 4. 1. This is a model with 3 billion active parameters that learns at home, and on this reasoning test, it answers way better. Not to mention the long dead GPT 4 and like all of these things. the surprising thing though, is you can see the average latency graph. it was a long time because I ran on my machine. I ran in the parallel, It took a while for this model, right? but, I didn't do the optimizations. One last thing that I will say, if you run LLM Studio, which I think I have open. let me see if I have open LLM Studio. I'll open it up. If you run this model in LLM Studio, you should absolutely know that because of the incredible thing, Llama didn't have this as well, by the way. If you run this model, let's say you load this up, you can use speculative decoding to speed that even more. And we chatted about speculative decoding multiple times at this point. And I just want to show you how that looks in a second. You can use
Yam Peleg
Yam Peleg 37:26
the smaller models.
37:27
You can use the smaller models of Kwan. Oh, I didn't think about it. So
Alex Volkov
Alex Volkov 37:32
you can use even the 0.
37:33
6 billion parameters because all you want from speculative models is the ability to think ahead of the model. So if you are, let me just show you super quick, Hey, say hi to ThursdAI community live. So I'm showing a screen for folks who are just listening. And I have the 3, 30 billion parameters with the three active billion parameters running. and when you run this with speculative decoding, you can see how many of the tokens are green. And the green tokens are coming from the very tiny model. So it's 3 billion parameters anyway, but I just ran this, I just ran this at 57 tokens per second. Because the spec was recording. Breaking news! Breaking news! Claude.
Yam Peleg
Yam Peleg 38:17
com, Claude.
38:18
com Release integration for remote MCPs.
Alex Volkov
Alex Volkov 38:23
Right now, let's hit it, let's advance
Yam Peleg
Yam Peleg 38:26
version of research of Claude research just now, five minutes ago.
Alex Volkov
Alex Volkov 38:31
Yeah, we have to hit the breaking news button.
38:32
Hold on. AI breaking news coming at you only on ThursdAI.
38:47
Ooh, we have breaking news. So Yam tell us.
Yam Peleg
Yam Peleg 38:51
All right.
38:51
So coming from, coming from Claude, Claude. ai today, released integration of remote MCPs and an advanced version of Claude Research five minutes ago.
Alex Volkov
Alex Volkov 39:05
that's so cool.
39:06
So Claude. ai, so far Claude only had MCP support in the Claude desktop app and Claude code. And now you're saying they have released five minutes ago, released for remote MCPs. This is huge. This is huge. Let's take a look at this. Anybody want to pull Claude up? Because anybody pay for it? I don't pay for it yet. I should. I know the deep research is great, but I haven't paid for it. But let me say. I
Nisten Tahiraj
Nisten Tahiraj 39:29
pay for it.
39:29
I'm just not seeing it. Yeah, so here's the tweet from
Alex Volkov
Alex Volkov 39:34
Anthropic.
39:34
Today we're announcing integrations, a new way to connect your apps and tools to Cloud. We're also expanding Cloud's research capabilities with an advanced mode that searches the web, your Google workspace, and now your integrations too. We chatted about the research stuff, but we haven't seen this yet. Integrations, you can connect Claude to Asana, Intercom, Linear, Zapier, and more. Our developers can create their own tools as little as 30 minutes. That's, I think, their wing to MCP, which they don't mention on the release. But look at the folks that are integration launch partners. Zapier, Stripe, Atlassian, Asana, Linear, Square, GitLab, CloudFlare, PayPal, and more. Claude now automatically determines when to search and how to deeply investigate, which we know that from O3, Ofri does as well. Both integrations and research are available today in beta for Macs, Teams, and Enterprise plans. We'll soon bring both features to the Pro plan as well. Okay, so only for Macs and Team plans. Nisten, maybe that's Oh yeah, I just have
Nisten Tahiraj
Nisten Tahiraj 40:24
the Pro plan.
40:25
Yeah. They're gonna make me buy the 100 one. All right.
Alex Volkov
Alex Volkov 40:28
Yeah.
40:29
MCP is supported now on the online portal, which is super, super cool. But they,
Nisten Tahiraj
Nisten Tahiraj 40:34
that was supported, before too.
40:36
So you could add your remote
Alex Volkov
Alex Volkov 40:37
MCP?
Nisten Tahiraj
Nisten Tahiraj 40:38
just, I don't believe so.
40:39
I think just CP in general. Yeah. Cloud
Alex Volkov
Alex Volkov 40:41
desktop remote.
40:42
Cloud desktop supported cps, but only the,
Yam Peleg
Yam Peleg 40:47
no, not on the, I don't think, cloud, Both desktop didn't support, I
40:51
think, remote and the web app for sure. Desktop didn't support remote and the web app did not support. Yeah, I agree with
Alex Volkov
Alex Volkov 40:55
him.
40:56
Yep. All right, folks, this is Breaking News from Entropic. If you have the max tier and you want to tell us how your experience is with this, please do in comments. We're moving on. We're still in open source. We still have a lot to cover. Let's quickly move on to the next thing that we should mention. PHI 4 with reasoning. this came out literally just yesterday evening. It doesn't end. It doesn't
Yam Peleg
Yam Peleg 41:16
end PHI as well.
41:17
Yeah.
Alex Volkov
Alex Volkov 41:18
Yeah, so PHI 4 came out with Reasoning, Microsoft released
41:21
this in, I also believe, including a tech report as well, I want to see which, Jesus Christ, which license they have here, let's take a look, developers, architecture, inputs, context length, with 32k tokens, they have, that's, There's no way, Yam. Is this it? So folks who are listening, I'm looking at kind of the table of facts about PHI and they're saying that the training data for PHI was only 16 billion tokens.
Nisten Tahiraj
Nisten Tahiraj 41:51
that's for the reasoning, that's the reasoning fine tuning.
41:55
It's way more than that. It's trillions of tokens to train the whole model. But to do the reasoning tool, That was, I think they, they said it was from O3, O3mini. They generated the 16 billion reasoning chains. That's, yeah.
Alex Volkov
Alex Volkov 42:12
Oh, I
Nisten Tahiraj
Nisten Tahiraj 42:12
see.
42:13
All right. Yeah, that's the Finetune, data set. That's not the full, Not the full. Okay.
Alex Volkov
Alex Volkov 42:17
Cause, cause this doesn't make sense.
42:19
so the dates that they trained this model is from January 2025 to April 2025. So obviously after DeepSeq, everybody started training Reasoning as well. they, FIFOR has, a model card and basically a technical report as well. They have let me just go to my notes super quick, right here. they took the lightweight 14 billion, lightweight. I love that we're calling like, we just mentioned that Qwen has 0. 6 billion. And now Microsoft is claiming the 14 billion is a small model. PHI was small when it came out. It was like 3 billion parameters. We separated this and then it like, it blew to 14 billion. But they added the 1. 4 million teachable chain of thought traces on top, and then added, 6K RL math problems as well. and now they have two variants, 5. 4 Reasoning and 5. 4 Reasoning Plus. 5. 4 Reasoning Plus is a model that was trained to think for longer, just output longer chain, chains of thought. and, they have, it came out to 70. it beats 70 billion parameter, on AIME, what else is interesting here? MIT license, Context Windows 32K, and then they internally try to get to 64K with rope interpolation. they have structured chain of thought thinking with very basic, use GPRO as well. they claim to outperform DeepSeq R1 distilled 70 billion parameters on the AIME, the math competition. They also said that they trained this model before AIME 25 came out and they're getting 78%. So basically obfuscating, not obfuscating, saturating AIME with reasoning is now basically a thing. Basically, the more you reason, the more you get on the AIME, which still remains to be seen how much of this like transforms into real world, use. And, the interesting thing here, I believe, is generalization. They claim that, gains are not limited to math. So they added math problems in reinforcement learning, and they saw a 10 point increase on Human eval plus coding and 5 points on MMLU Pro versus Base 5. 4. So generalization of RL teaching on math is also something they highlight in the paper. And then the last thing is plus answers on average 1. 5 more tokens, than the regular SFT. And, and their comparison to O3 MiniHAI. I haven't seen comparisons to KUEN for PHI, if the folks in the community want to compare those. Everybody
Yam Peleg
Yam Peleg 44:36
forget.
44:37
Everyone forget. Everyone forget.
Alex Volkov
Alex Volkov 44:40
and where's Quin?
44:41
where's Quin? With the meme, with the goose, with the geese. but we should definitely do this. One thing that I have to,
Nisten Tahiraj
Nisten Tahiraj 44:46
sorry, I have to quickly say that when I'm talking to
44:48
other people, even when they are AI companies, there's a lot that they don't even know this exists at all. where this
Alex Volkov
Alex Volkov 44:55
you mean Quinn?
Nisten Tahiraj
Nisten Tahiraj 44:55
yeah.
44:56
Quinn, they have heard of deep seek. And, maybe Claude and, Gemini, and that's about it. Everything else is just PHI, Kwindla, they don't even know. it's a thing. So I have to explain to them. it's very good at math and they're like, do you trust in this stuff? yeah, we've used every single model for the last two years. We know exactly what it can and cannot do and, how to make like contraptions out of it. And, yeah, There's still quite, yeah, so there's that.
Alex Volkov
Alex Volkov 45:23
I want to add, going back to Gwen just for a second, we have a
45:26
comment from a friend of ours and a huge supporter of the community, Bortowski, Colin Kilty, that, after what Fernando said, piggybacking off of Fernando, it seems likely that by going deeper instead of wider, you end up with a smarter model that needs a bit of extra runtime data. That can be provided by search or RAG or MCP, but we'll get way better overall results if you can augment them. And we know that these models absolutely will use RAG and MCP because they've been trained on this as well. And so that's very interesting for like agentic applications, for example, where you rely on external tools for a lot of the data. So maybe you need less of the world model yourself. but if you're training, going deeper, maybe you get overall like better model performance. Performance. So thank you Colin for the shout out, folks. Definitely give Colin Bartoski a follow up.
Nisten Tahiraj
Nisten Tahiraj 46:11
Yeah, and so what they mean for, for the audience by going
46:15
deeper, it means it has more layers. If you open any model, it's just like a bunch of files in there. Like the weights are just there. Files, and, the layers means like this, these chunks of weight, on top of each other. So when you go to mixture of experts, so what they mean by wider, it means it has more, more experts, to route to per, per every layer. we do know from from the Frankenstein and stuff that when you add more layers, so when you go more vertical, the model does actually become, smarter in a way. But when you go wider, it has more, More knowledge. Now this might not end up being accurate if you listen to this in the future, but it is like roughly there with a MLP, so it's very they knocked it out of the park with this architectural choice Like this is the first time we see a 30 billion parameter, 3 billion parameter active. This is, I'm surprised this worked this well.
Alex Volkov
Alex Volkov 47:07
All righty.
47:08
Not going back to Quinn 'cause we can get lost in Quinn hype and get spend here an hour. moving on. I think we're gonna skip, but just like to let you guys know that OMI released MIMO 7 billion parameter. It's a tiny MIT license model. has some interesting things. That's tiny. Seven B is not tiny. Yeah, the tiny is. I'll remove to here. Yes. ux. There
Yam Peleg
Yam Peleg 47:27
is, there is another one from JetBrains.
47:30
I forgot to tell you. There is another coding model from JetBrains last week.
Alex Volkov
Alex Volkov 47:34
Oh, I just saw that.
47:34
yeah. I saw it. and we should It never ends. Yeah. It just never ends. But we don't have time to like. We'll cover all of them first, but we'll mention them. Yeah, JetBrains, there is a coding model as well. shout out for them, giving, throwing their hat in the ring. If you guys can send me a link, I will definitely add to the show notes. so Xiaomi released one and Qt AI released a model. It was called Helium. it's Helium 1, 2 billion parameters. And, it has, interesting evals here. They're comparing themselves to Gemma and Lama 3. 2. But this one talks, right? I don't believe so. No, this is their, this is their just
Nisten Tahiraj
Nisten Tahiraj 48:06
Qtali made the ones that were Qtali made,
Alex Volkov
Alex Volkov 48:08
yeah, Mochi, the stocking model, but I believe the
48:10
helium is Yeah, it was this one. no. Helium is just an LLM. Oh, it's just LLM. Okay. Just LLM. But I think they're doing this as a base model for the you're talking models as well, because I believe the previous, base model for the voice was like Kwin. Folks, we're moving to big companies and APIs, because we have a lot to cover there and then, we will again.
Nisten Tahiraj
Nisten Tahiraj 48:29
Like we're getting into creature territory now, Can it talk?
48:32
Can it talk like a big, can it talk like a dolphin? Can it see me? Speaking of talking,
Alex Volkov
Alex Volkov 48:36
Qwen also released another model, Qwen released 2.
48:39
5 Omni. You guys remember that Qwen has a nominee model? They can talk, and they can listen to voice and talk. They released an update there, I don't know why they didn't bundle, I, I guess it makes sense. But we're getting to the point where even Qwen releases multiple things, a month. I will definitely, Qwen 2. 5 Omni was also updated. folks, moving on to big companies and APIs. We have a lot to talk about. That one
Nisten Tahiraj
Nisten Tahiraj 49:00
talks, just so people know, and it can talk on the phone
49:04
too, but even in your phone, yeah. That's a 3B that talks, ends up being like 4B.
Wolfram Ravenwolf
Wolfram Ravenwolf 49:09
Yeah.
49:09
By the way, Omni also has image and video input at least. Yep. So
Alex Volkov
Alex Volkov 49:17
a little bit of an update.
49:18
Folks, we have to move on. I, we have to move on.
49:27
All righty. We're moving on to big companies and APIs. There's a lot that happened this week. Now huge model updates. there was one. So Gemini, for some reason forgot that ThursdAI existed. Shout out to folks, they should have let us know a bit before. Gemini 2. 5 Flash was released just a little bit after our live stream. So we did add this, if you're listening to this, we did add this in the newsletter, but Gemini 2. 5 Flash is the faster, cheaper model for, from Google. And they obviously came out after, 0. 4 Mini and 4. 5 Mini, like all of that came out. So Gemini 2. 5 was released, and, I believe that, yeah, it's very cheap. I don't have a lot to compare it here, but I did run Reasoning. Oh, the one thing that highlighting in Gemini 2. 5 Flash, it's a hybrid Reasoning model, and you can in the API turn Thinking on or off, and you can also specify how many tokens between 0 and 25. I think 12k tokens or something like this, 25k, you can specify in a, like an API parameter, how much you want it to think like the allotment. If you guys remember the knob that I talked about before with reasoning models, this is the knob. Basically, you can decide how much and based on that, you will pay different prices. So shout out to Gemini 2. 5 for the great model as well. if we're talking about Google Notebook LM, AI overviews, the audio overviews of everything that you provided, that sounds like a podcast. they are now multilingual. So you can have one host speaking one language and another, probably another. they have German, Spanish in Portuguese, like a bunch of other languages, I believe 50 languages. let's talk about, okay, folks, I think it's time. I think it's time.
Yam Peleg
Yam Peleg 50:59
Speaking about Google.
51:00
wait, speaking about Google. There are rumors, very strong rumors about what's going to be dropped on Google io.
Alex Volkov
Alex Volkov 51:11
Yeah.
51:11
Let's talk about Google IO soon. Pretty soon. Google is May 20th and 21st. I'm gonna be there. I'm gonna do the show live, probably, maybe I'll do like live stream from Google io depending on the, wifi. I don't have private knowledge of what they're probably gonna drop there. but I know that the A two A folks are gonna be there. So we'll definitely cover whatever is coming. Yam, you want to give us a hint? Maybe an Ultra event. Alrighty, folks, I think, while we're talking about, I think Google, yeah, I think we've covered everything Google. we'll cover Llamacorn in a second. I want to talk about, OpenAI like briefly. And before we get to the interesting news, OpenAI this week, Let's do one positive thing and then everything else is not as positive. GPT for, chatGPT will do shopping for you with the integration from Shopify based on everything it knows about you. It will suggest things, it could be a path for monetization for them in terms of like affiliate links for shopping, because More than, there is estimates that ChatGPT will get to 100, 1 billion users by end of this year. And if they get there, there's a lot of shopping to be done there based on your interest. imagining how much ChatGPT knows about you, probably way more than Instagram based on their algorithms, because you literally tell it and talk to it. and also many people just look for things with O, with three, with O3 and research. Moving on. folks, I think it's time. I think we're gathered here today to say a few parting words for a dear friend of ours that kicked off the whole field of AI. RIP GPT 4, you will be missed. OpenAI have turned off GPT 4 in the drop down. You are no longer Able to chat with the model that was born on March 14th, 2023, together with ThursdAI. And the reason why we are here talking to you is GPT 4. a lot of people at the time when GPT 4 came out did not realize how big of a breakthrough it is. And then quickly they realized that they can no longer go back to GPT 3.5 because it sucked. here's our, in Solace and Solitude, my favorite thing to say is from Harry Potter. Your body will perish, but your spirit will go on forever. Something like this. that's it folks. This is our wake. Any parting thoughts from the team here on, on the GPT 4? Any good memories? Folks in the community?
Yam Peleg
Yam Peleg 53:33
Leak the weights!
53:34
Leak the weights! There is no downside! Leak the weights! Leak the weights! We need to harass
Nisten Tahiraj
Nisten Tahiraj 53:39
him for the weights because that was an
53:42
achievement for humanity. I don't care what other people's opinions are. My own is that was from 3. 5 to 4. That, that was something quite amazing. And, . I was working on the, Dr. Gupta, the first doctor on, on, on the market. And that, and when we just switched from 3.5 to four, it was like an amazing jump in, like disease diagnosis and stuff too, but also helping us actually code it. that was a big jump. And, they should, we need to, we push them a bit.
Wolfram Ravenwolf
Wolfram Ravenwolf 54:11
That's the saddest thing about this for me personally
54:14
because models come and go and there will be others but the weights of this historical model that would be the perfect opportunity and yeah we have models that are better already and most would probably not be able to run it but it's a historical achievement for history and why wait for some future historians if you could give it to the world right now and have everybody, be open, AI, be open. The
Yam Peleg
Yam Peleg 54:42
redemption arc in history.
54:44
You can just imagine the best
Alex Volkov
Alex Volkov 54:47
redemption arc, seriously.
54:49
Speaking of this, I want to mention a new endeavor of mine that I'm vibe coding right now that I want you guys to know about and gonna be live soon. I built a model graveyard, the AI model graveyard. I call it inference. rip. R. I. P. stands for Rest In Prompt here, for LLM models that we know and love. And, yeah, this is not updated yet, so GPT 4 here is still not showing that it's live. but, yeah, I would like to commemorate and have a place for the community to come and, give our thoughts. Because if you guys remember when these models come out, and then a lot of people are talking about, the, The, the experiences they have with these models. We're celebrating every model as it's born. Like we're not treating this as birth, but like today we had the half an hour about Kwin and we got excited. Like many people come out with evals and then we just move on. And then at some point the company behind decides to just kill off these models. and I think we need a place. to remember how much hype we had and how, how sad. So GPT 4. 5 also, by the way, is getting obfuscated very soon. That one did not live as long. And that's also very interesting to compare. So GPT 4 lived for two years and a month and GPT 5, did not last nearly as long because GPT 5 is going to be obsoleted from the, from the API and probably from the dropdown as well. So RIP GPT 4, you will be missed. your spirit will prevail, will remember you. And with this, let's move on to the other thing that will not be missed. ChatGPT released an update that, Sam Altman mentioned as glazing. Glazing specifically means, as I learned, just sucking up. Just being your sycophantic, AI approver thing with examples that are crazy. specifically, I don't remember, I don't think I added examples, but I have like specific examples where somebody said, something to the tune of to the chat GPT, I woke up, I drank half a glass of water and I did one pushup and the response from chat GPT says, Whoa, you did one pushup, you're like Hercules, and this puts you already in the top 1 percent of the population. And many such examples were flooded X and flooded local LLM and ChargPT and like whatever. People just decided that this is not necessarily something that they want from their AI assistant. I think there's a bunch of stuff here where the, where, where are we going with this? A bunch of examples where, sorry, no, let me try again. since ChargPT turned on memory. Memory for all your chats. there's already a lot of things at stake here because the models respond based on you and how you want it. And so it has access to a lot of your memory. And the highlight of, from, the folks at OpenAI from the blog they released before the rollback is we focus too much on short term feedback and did not fully account for how users interaction with chat GPT evolve over time, and I'm assuming with memory, like evolvement over time is very interesting. There was an announcement, and they mentioned sycophancy specifically. Sycophancy is the kind of the concept of sucking up. and they released a blog talking about, rollback. So I don't remember having a rollback ever before in GPT or any model, like AI model, discussing rollback. Folks, we should discuss, I really want to chat about this for a little bit, at least because we have an interview soon, but. One of the big things here is that how much this affects society and how many people come to ChatGPT for advice now Instead of Google and this advice now is personalized and something like this like of a release where I think they mentioned both system message and kind of model behavior as well. They rolled back the model and the prompt message, they also released, and, rolled back. And I believe that many people got wings because of this model. so we'd love to hear from you guys. What do you think about just generally the effects of this? Like this model is now being used by half a billion people, if not more, every day. And then such a haphazard release, just completely changed, people don't know about this. And many people will not even know that it rolled back, so many people will just have this like mood swing in the middle. Wolfram, we want to hear from you.
Wolfram Ravenwolf
Wolfram Ravenwolf 58:55
Yeah, so we see the impact this has, little
58:57
changes that, model versions, and it's not just the Zyko family. Of course, it could be the other way. We could have a model that is telling people they are worthless. There was an incident with Google where someone leaked or showed where that happened. So it could swing in the other direction. And if you train a model for millions of dollars, you should take special care of the system prompt. It always looks a bit like people just throw it together. I don't know how they test it, how they evaluate it. But, that is super important. The system prompt is, reshaping the model that you trained. it is very important. And, people have custom instructions. my custom instructions prevented me from seeing that, symptom. But I'm sure that my assistance personality has been impacted through that. And a lot of people are coming to, to know their assistance. And if they suddenly have these mood swings, That is also concerning. people have it as well, but you can't really trust it. That is another sign why local AI, which you have on your system with your own prompts, it doesn't change. But on the online, even if the model doesn't change, a little change in the system prompt is probably not communicated, not often. And you can't fully trust it and it could change at any moment. You have no way to prevent that. So always keep that in mind. And you need benchmarks. You have to evaluate regularly, monitoring of the model. That is super important if you use it for anything productive, and it's sad if people have to suffer from, changes like that, but, I guess it happens if you have, they want to improve the model, they want to make it better, and that is the thing if you want to engagement, and it seems like it is, such a nice personality, so you'll enjoy talking to it, but it's not All this is the right thing, though we have been talking about the X algorithm as well. Engagement, user time is not exactly the best thing for yourself in all situations.
Alex Volkov
Alex Volkov 1:00:50
I will say, I had a specific chat with my girlfriend that like she
1:00:54
told me about some like which AI model to use and I told her don't you know just GPT 4 right now, it's stupid. Do not, they're gonna roll it back. But right now, and not many folks have, a lot of folks like us that follow all the news, many folks just believe that whatever they say, whatever stupid idea they come up with, whatever, there's been a bunch of threads of folks, that kind of came up with the worst business ideas ever, and ChaiGPT just like absolutely clapped for them and said, yeah, you should go for it, you should leave your job, you should quit it, just go for it. And Yeah, that's not super healthy. and I think that, I hope folks at OpenAI are contending with the understanding of like how much impact now they have on society with tiny changes like this, and we're going to talk about evals with Hamel and Treya very soon. and how much this should go into automatic evals, like sycophancy now should be part of the release evals to make sure that like this, these next models will not be like sycophantic and will tell you, that the earth is in fact round even though you're like a flat earther person and you believe 100 percent that the government conspiracy imagine i saw about this imagine that somebody that believes in conspiracies and then chadji switches like this and just confirms everything they say and then switches back They absolutely will see this as government intervention in the truth, right? that's how conspiracy folks are like thinking they'll absolutely see, oh no, the government found out that the AI gives you the truth and now called it back. there's a, there's danger in there and I really hoping that, by highlighting this issue, first of all, we'll let you guys know that JGPT was in the spirit for the past two weeks. So if you were about to quit your job based on something that JG PT told you, don't definitely always our recommendation is try multiple models for multiple things. and also, the fact that they rolled it back just shows how important this was even for them, and we haven't seen a rollback yet.
Yam Peleg
Yam Peleg 1:02:35
look, there, there was a case, like this before, a long time
1:02:39
ago, with GPT 3, for those who don't know about the app called Replica. which is, which was an app for basically, it's like a friend, a GPT 3 that, was a friend, and, people, No, that's the
Nisten Tahiraj
Nisten Tahiraj 1:02:53
main AI girlfriend app, those for a while, actually,
1:02:56
is, there's some interesting rumors around, what the users for that are, because I think at some point, I don't have the correct data for this, but the, the majority of the paying users were women above the age of, 55. very much. for, yeah, for, the AI girlfriend market, AI boyfriend market, yeah, sorry, go on. Anyway,
Yam Peleg
Yam Peleg 1:03:14
anyway, anyway, I'm just saying, I'm just saying that
1:03:17
people got, emotionally connected to this app and there were changes. And you should see what people are saying online, like people took it hard on Reddit. There is a full subreddit of people trying to replicate it with local models today just Because of the impact of this. So there are many people who are talking to, chatbots today. of course, chatGPT for, this kind of support, this kind of, you want an advice for something, I don't know, tough conversations you need to have with someone. So you talk to chatGPT and ask how are you going to, how should you do this and so on. And if the model just agrees with everything, just whatever you say doesn't matter, it doesn't really give you a good advice, it doesn't really advise you about a thing, whatever you're saying, it's just gonna say, yeah, that's good, that's a good idea, and just reaffirm whatever you think, that's not a good thing to happen, and many people don't know anything about this.
Alex Volkov
Alex Volkov 1:04:18
yeah, I guess the one thing that I'm missing from OpenAI is
1:04:21
an announcement to the people that this happened and some sort of an understanding that like, this shouldn't happen. Like you guys have control of people's psyche at this point, many people, and I don't know, like some folks give their kids access to ChagGPT for like different things. Kids have a lot of issues, they're not ready to talk with their parents. There is, there should be a responsibility there for folks to lock in and say, okay, this model will, this is how it will behave. Like humans don't change like this from one point to another. And I think there's a responsibility there on OpenAI and hopefully, they will follow up with this. Folks, moving on, we have an interview
Nisten Tahiraj
Nisten Tahiraj 1:04:55
coming up.
1:04:55
I made a meme about it.
Alex Volkov
Alex Volkov 1:04:57
it's absent an instant.
1:04:59
it's mirrored. We cannot see. Yeah, we don't know. Glaze. We can
Yam Peleg
Yam Peleg 1:05:02
see the meme.
Alex Volkov
Alex Volkov 1:05:03
oh.
1:05:03
You can see. No,
Nisten Tahiraj
Nisten Tahiraj 1:05:04
you can see it.
1:05:04
It is
Alex Volkov
Alex Volkov 1:05:06
for us.
1:05:06
It's mirrored. Oh, we wanna get glazed. Okay. Love it. all folks, we have to move on. super quick on the big companies and APIs before we move on and chat with our guests for today. LA Macon updates. We didn't mention La Macon, no big models from long run, but they released like a few security released. Release is Llamaguard 4 with text and image protection. If you are building enterprise, you need guardrails, so they released a bunch of stuff. Firewall with some prom hacks and risky code. That's very interesting. Those are open models that they released. Promguard 2, with jailbreak defense and then something like a secure cybersec eval, which we'll talk about evals in a second. That confirmed thinking models are coming. That's a confirmation on Dworkesh podcast. New MetaAI is coming with a social feed. Speaking of social responsibility, in MetaAI, the new app, you will now see if your friends have shared the AI, how your friends are using AI, which interesting, and maybe OpenAI is going to go there as well. Like they talked about like a social network, a few more things, a full duplex model, which you can talk to is in the works. Some people tested it out. It sounds amazing. And the last big thing. Llama API is coming, powered by Grok and some other folks, which you can, Zach said we're not in the business of running APIs, and now, ta da, they have API offer there as well. So those are like the main updates from LlamaCon, I believe, and then we will, we'll probably talk about chatbot arena confusion after the conversation, which specifically we'll talk about evils, Alrighty, folks, now we have moved to the next section, where we're gonna have two guests of ours, Hamel and Shreya. Welcome, folks. Hi there, welcome to ThursdAI. I believe both of yours maybe first time here, at least in video form, for sure. and, we'd love to chat with you guys. I will let my co hosts take a break, but if you guys have a question, feel free to come back as well. and then we'll chat with you guys about a few things. So welcome folks. Hamo Hossain, for those who are not following the field and Shreyashankar, are, One of the top people in the field of evaluation. I mentioned before, multiple times, Weights Biases have a tool for evaluations, and this is how I got into the field. So I'm fairly new and recent, but, you cannot be in the field of LLM evaluation without reading some of Flair's work. specifically, the highlight that, Hamel, I think you keep referring to as well, is who validates the validators. And we've built on, on, on top of this work as well, And I would love to start by saying that, first of all, welcome, folks. let's do a short round of introduction, maybe. Hamel, how do you introduce yourself when folks ask who you are? Like, how do you encompass that by your work? Yeah,
Hamel Housain
Hamel Housain 1:07:31
so I've been working in machine learning for over 20 years.
1:07:34
I worked at a lot of tech companies like Airbnb and GitHub, been working with large language models for a really long time. I'm now an independent consultant. I find, I found really quickly that thing that people, fail with in building AI applications is evals. they really struggle with that. Like they, people don't have a hard time at all, like creating prototypes, but they really have a hard time. like creating applications that work really well. And yeah, I found that evals were like the weakest link in everyone's knowledge. And so that's what I decided to focus on that. And yeah, it was a little bit of background on me.
Alex Volkov
Alex Volkov 1:08:13
And you have great content on your blog as well, and the talks at AI
1:08:17
Engineer, multiple AI Engineer conferences that we've crossed paths as well. And welcome Shreya. Shreya, please feel free to introduce yourself as well and maybe give us like a little bit of like how are you involved in the field? What are you doing? Why is this like a research interest of yours?
Shreya Shankar
Shreya Shankar 1:08:31
Yeah.
1:08:31
great to be here. I don't have a nice subtitle under my name. It just says my name. I don't have LLM evals or AI evangelists. I don't know how I get that. but I am a researcher at UC Berkeley. I'm finishing up my PhD, and I'm very interested in MLOps, specifically how do you help people engineer very reliable applications around ML and AI, now LLM models. and as Hamel said, like evals are the missing link here. I think what's very interesting is how, early stage we are in evals. in general, people get very confused thinking about what are evals for foundation models. But those sets of evals are completely different from the kinds of evals that people need when they're engineering applications around the foundation models. And I think there's just such a sore lack of, educational content, even research. Like, how do we even go about building this end to end evaluation lifecycle? And I'm very excited to be studying that.
Alex Volkov
Alex Volkov 1:09:26
That's incredible.
1:09:26
And I saw a lot of research that came out from you and your colleagues specific about this that like moved the field forward as well. Let's maybe start there. Let's maybe start with, and we'll also, one thing that I always remind folks on ThursdAI, I love chatting with folks that did a release this week. So you guys have two announcements, but you guys also released Prompt Evals. We'd love to chat about this as well. But let's start with kind of the difference between the benchmark that we see, and we talk a lot about a lot of foundational models here. This is like the bread and butter of ThursdAI. We talk about, Qwen3 released. And then we compare Qwen3 based on, human eval and MMLU and now AIME and like all of these competition math for reasoning models. those are evaluating the model on the set of tasks also, but that's evals that you guys are like, hammering towards, like folks need to build their own evals, right? So maybe let's start with the difference. What is the difference?
Shreya Shankar
Shreya Shankar 1:10:17
Yeah, I probably should take that.
1:10:19
So we think about application centric emails. You could argue that coding is an application. but I think what you want to think about foundation model evals as people who are training the models or post training or fine tuning models. Those kinds of evals are more or less to test for general knowledge or to test that you are able to have general knowledge but in a domain. For example, in medicine or in law very broadly or in coding very broadly. But when I work with people or have all works with people or. Practitioners out in the field, right? Typically, they want to build a specific app around a foundation model. For example, like they want to write an email assistant, or they want a chatbot over their document corpus. and these are so targeted that it's not about, at a very early stage, it's not about fine tuning a model, creating a dataset for fine tuning, or so forth. It's about One, building the prototype and figuring out what are the failure modes that you didn't really expect because an LLM is now doing the job, like maybe the tone, the vibe is a little bit off, maybe for certain kinds of inputs it always hallucinates the entity extraction, I see this quite a bit. So there's always these really bespoke failure modes that no one would care about testing in a foundation model eval because It's just so specific to your use case, but AI application developers need to be doing this. And that's where a lot of our eval work focuses on.
Alex Volkov
Alex Volkov 1:11:41
And Hamel, I think, sending it back to you because, many folks,
1:11:44
when they choose, which model to use in production, I know for a fact, some people just string replace the new model and just, YOLO to production. we've seen that this is not always the best thing. We've seen like reasoning models, for example, that are like better at math and code actually perform worse based on the same prompts as well. Have you seen this from your work with the company as well? what is your recommendation, obviously building evals, but what is your main recommendation there? And what is the delta between AI developers just looking at like general evals to implementing their own stuff?
Hamel Housain
Hamel Housain 1:12:11
Yeah.
1:12:11
I think people take that same mentality of like foundation model benchmarks and try to apply it to evals. And the way they try to do that is use off the shelf metrics. So off the shelf metrics like hallucination score, toxicity score, all that stuff, conciseness score. all these things, you can go to your favorite LLM eval tool provider and they will basically give you these metrics off the shelf. And people try to apply that, be like, okay, I've done it, evals, I'm done. I did my evals, I ate my vegetables, and I'm fine. But, the reality is, that stuff doesn't really work, and it's very misleading, and actually, wastes a lot of time and is confusing. Because it's not specific enough to your application. the applications that work really well, they design metrics in evals, That are very grounded in their failure modes, and that also is important because there's so many things you could eval. There's almost infinite surface area and it's really important to prioritize okay what your failure modes are. Like, and don't say okay, The generic metrics are, can be useful in very skilled hands. it can help you find failures, but you have to use them very, in a very specific way. You can't just, apply them and just take the number at face value. maybe, for example, if you're using hallucination score. Okay, use that hallucination score and see if you can help it rank order possible hallucinations. Something like that. And then dig into it and see what's going on. You may end up figuring out like a different kind of metric or a different kind of eval or like really dialing it in for you. Most people don't do that. That's like an advanced kind of move. What else
Alex Volkov
Alex Volkov 1:14:00
do most people don't do?
1:14:01
There's one phrase I'm waiting for specifically. Oh yeah,
Hamel Housain
Hamel Housain 1:14:04
yeah.
1:14:05
So it all goes back to the famous phrase, like the thing that I love to repeat is like looking at your data. It all just goes back into looking at your data no matter what you're doing, any activity you're doing, Whether, like that's like the antidote to almost all problems. It's hey, you can design your own customized evals. you, how do you do that? you have to look at the data.
Alex Volkov
Alex Volkov 1:14:25
I think it's very important to highlight why, you
1:14:27
guys are, like, specifically repeating this sentence all the time. we're also in the business of, showing people the data via tools in Weave. And, a lot of the, like you said, eat your vegetable, a lot of, developers look at what happens with the LLM every time they develop locally, and then they put it on there, and then they continue to the next feature, or maybe something else, and they don't get the feeling of what happens with the LLM. On LLM, on production for the users. A very good example of this, but maybe on a huger scale, is the glazing thing with ChatGPT that we just talked about before you guys came on. Where, it was so bad that the community reacted, completely, and they rolled back, the whole thing. at some point, at some scale, looking at your data means also listening to your users. But I think for many developers, just like small developers, they just don't play with their tools on their own. I want to move on, because just in the interest of time, Shreya, you guys have a release that's called Prompt Evos. Could you tell us a little bit about this? Another, it's an archive and also on Hug Face?
Shreya Shankar
Shreya Shankar 1:15:21
Yeah, so this is a almost year long effort of collaboration
1:15:25
with LangChain to really understand, probe more about what are the kinds of evals that developers care about. What are the bespoke evals that Hamel and I have been talking about? Are there patterns to them at scale? so LangChain, has a bunch of, Open source tools, the prompt hub being one of them. So we went through all of these, tried to develop, taxonomies of failure modes and develop a bunch of assertions or evaluations, binary evaluations around these prompts. and I think for me, the biggest takeaway that I. Got here was how different these were from the foundation model evals in general. Foundation model evals, I've realized, are like the ceiling, the north star that we strive to do. So we purposely build very hard benchmarks, LiveCodeBench, or like for MML use, getting saturated. but it's just a different paradigm from when you're in production. And literally you want instruction following. You just want the model to do what it said in the prompt. Plus or minus some other things that it should infer because it has general knowledge. and building benchmarks around this is very different, right? You're hoping to get 90 plus percent accuracy, right? You're not trying to game the system. You're, it's a very different paradigm and I'm hopeful that this kind of research or just exposing what developers want moves the needle towards that.
Alex Volkov
Alex Volkov 1:16:38
I think you mentioned a very interesting point that
1:16:40
I haven't internalized fully. when. A new benchmark like MMLU comes out I don't know, Frontier Math or some stuff like this, right? like the ML researchers in these labs, they get excited. They're like, oh yeah, a new hill to climb. This is like the whole point. Whereas for evals, that's not the point. You want to make sure that like your thing works as it is. You already climbed that hill. Yeah.
Shreya Shankar
Shreya Shankar 1:17:00
Yes.
1:17:01
It's more like software in that sense. and I think what's also interesting is the benchmark gaming that happens in ML all the time. the most recent thing being chatbot arena. But you don't want that. You don't want to build benchmarks similar to foundation model benchmarks. You don't want that phenomenon happening in your org where like random, you're trying to push out random changes just to like game your own static benchmark. that's terrible.
Alex Volkov
Alex Volkov 1:17:24
an additional thing and Hamel, like this is
1:17:26
going to go to you as well. where. You mentioned in the beginning people don't do this and some of it is because people don't know how to do it. There's not a lot of content. You guys are great and even like the content for this, Shreya, who validates the validator, showed specifically that like even preferences change while people manually, evaluate. You need to start with manual, manual scoring and manual humans as a loop, as a judge, to make sure that you have some sort of baseline to compare to, and then maybe you can extrapolate. But even that changes over time, and that's what the, who validates the validators showed, and definitely, A seminal paper in this, field that people should definitely go and check for, and check their assumptions as well. Because, they may have built an LLM judge to do something, and then this judge, doesn't perform as well. And how do they know? meta evaluation is important. There's, a whole, a bunch of stuff in this field, and it's a fairly complex field. after diving into this, I'm not chatting with you, looking at your stuff, trying to read your papers. It's a very complex field. and you guys are doing great work at, making this a little bit less complex, Hamil. So let's talk about how you're doing this with the upcoming course.
Hamel Housain
Hamel Housain 1:18:24
Yeah, so with the upcoming course, we're going to go really deep
1:18:27
inside, okay, how do you actually go through the step by step process of, looking at your data properly. So looking at data, it sounds like it's easy, you just open your eyes and you're done. No. It's like you have to do, some data analysis and put your kind of data science thinking hat on the table. And do some detective work in terms of Hey, how do you analyze your traces? Like, how do you look at those traces? How do you suss out failures? And then going through the step by step process of, okay. Starting with error analysis, going up to the different kinds of evals, how to write them, how to check them, how to make sure they're aligned with human beings. And then we're going to, cover like architecture specific evals, hey, what do you want to think about when you're doing RAG? What do you think about, what do you want to think about, with agents? So on and so forth. and go into a lot of areas like, okay, how do you optimize costs? How do do a whole bunch of things. Trey is actually writing a very detailed, I don't know what else to call it. It is a book, . I saw a
Alex Volkov
Alex Volkov 1:19:31
screenshot I'm waiting for, yeah.
1:19:33
I'm really excited. Free draft.
Hamel Housain
Hamel Housain 1:19:34
Yeah.
1:19:35
I've been reading it with great enthusiasm, actually. It's like a captivating book to me, which
Alex Volkov
Alex Volkov 1:19:40
is to accompany the course, right?
1:19:41
if I'm ing correctly. Of
Hamel Housain
Hamel Housain 1:19:42
course, yeah.
1:19:43
And it goes into a lot of detail of. Okay, how do you do this? It is a treatise that compiles a lot of information in one place. Very exciting and that just is going to help prepare students when they're in the course. We'll have reading that they will do so that when we, when they show up to a lecture Okay, we can give them a very rich lecture.
Alex Volkov
Alex Volkov 1:20:07
I'm super excited for this.
1:20:08
I will just say, you guys reached out, Hamel. I'm also a guest speaker at the course, somewhere in June, there's gonna be a talk of mine as well. I'm looking forward to learn more, because even though, this is part of my job at Weights Biases, to talk about evals and We have Weave, which is like the evaluation toolkit. There's a lot of things happening and changing and the ways that people apply this, try, they have lessons, they come and they talk about those lessons. I think it's very important to, to keep up to date. So absolutely recommend folks register. I think, Hamel was very gracious and given us a ThursdAI discount, 35 percent off of, of, for ThursdAI listeners. If you just add ThursdAI at the course, it's really worth your time. If you're working in any place that incorporates like AI, which is. Every place right now. Like I, it's really hard to find a company that's not trying to add LLM juice somewhere. And as we, as you guys talk about, it's not very easy to do with confidence. And this is, I think, the highlight of this work. Anything else that I've missed that you guys want to highlight before we move on, on the course or on the work that you guys do?
Shreya Shankar
Shreya Shankar 1:21:09
Thanks for having us.
1:21:10
Nothing comes to mind for me. Yeah,
Alex Volkov
Alex Volkov 1:21:12
so much.
1:21:12
I'm always excited
Shreya Shankar
Shreya Shankar 1:21:14
about you guys.
Alex Volkov
Alex Volkov 1:21:15
absolutely.
1:21:15
And I need to get more folks more excited because there's the foundational stuff and people get excited, but then they try to apply them locally and then they need to see how these things work in production with reliability. you guys are absolutely the top experts in this field. don't matter who you ask, like the point. There's like a five, five people group that all of you are in the group chat with, but you two are definitely the ones, I'm excited to have folks learn from you. And I'm very happy to be also a guest, speaker in the course as well. with that, we'll probably mention this one more time, but thank you, Hamil. Thank you, Shere, for coming up. and we'll add the links to the show notes for folks who are just listening. and, the 19th. So definitely check it out. Thank you guys for coming up as we move on to the next thing. Cheers. Thank you for having me. Thank you. Absolutely. Thank you guys. Bye bye. All right, and bringing back our co hosts here, this was, Hamel, Hussain, and Shreyash Shankar, the two leaders in evals, and I think we're moving on, folks, because, He had an
Nisten Tahiraj
Nisten Tahiraj 1:22:09
excellent AI engineer course before, I think, and it was
1:22:13
with Swyx and stuff too, so we should post that, because I think it was like 500 bucks, but then afterwards they made it available for free after some time, and that's, that's a good, you should do one of those, full stack courses, even the older ones from, from the full stack on, on YouTube, like that actually helped me a lot. And they're still very relevant, even today.
Alex Volkov
Alex Volkov 1:22:32
Just so you're referring to Hamel's AI engineering course, like a
1:22:35
big one that became the big conference. Yeah, absolutely. Courses, like ML, spends a lot of time on courses. Very great, high signal. folks, moving on, I think that the thing that I, while we're in this, promotional era, I'll just add that this week's buzz, the most interesting thing about, Weights Biases, this week, is I have two announcements for you. We're going to have a hackathon, in, in San Francisco, and I can announce that, we're calling it Weave Hacks. should I play this week? Yeah, I should play. I love my opener. Hold on. In this week's buzz, we tell you all about Weights Biases
1:23:16
I am very excited to announce this super quick because we have a hackathon that's coming up. We call it Weave Hacks. Weave Hacks is a hackathon that we are organizing by ourselves with a bunch of sponsors, friends from the community, and the focus of this hackathon is agent protocols. You've been hearing about them. MCP is everywhere. Right now we're just talking about MCP. Previously in the show, A to A, we had Todd Segel from the A to eight. Team at Google is a new upcoming protocol for agent collaboration and agent conversations. We want you to build with them. We want you to give this, we want to give you the space in our great office in San Francisco to come and just hack and then chat with the folks who are building these protocols. So I'm very proud to announce that Google Cloud is sponsoring this hackathon with us. We have up to 15K prizes. I will just say first. I think on the pod, I haven't mentioned this anywhere, one of the highlights, the top prizes for this is a Unitree robot dog that I've ordered and is now coming, so you'll be able to program this, it's going to run around the office, bring snacks, do all kinds of stuff, and if you put an agent on it, and that agent will, communicate via A2A with something, that's going to be a winning thing, so we're probably going to open up like the framework, so I'm very excited because I got to book the robot dog, and then I got to experience And then in the expense, I said, I'm booking this RoboDog for a hackathon. And I was very excited about the type of work that I'm able to do with Weights Biases. who can expense a RoboDog? That's super cool. I will say that for next hackathon, we're probably going to have a humanoid robot. We, the only reason why we didn't get a humanoid is because the lead time on Unit 3 is ridiculous. But our next hackathon is going to have a walking humanoid, like doing all kinds of things. but at least we're going to get the dog. But also we're going to have folks from. The A2A protocol at Google. Come and you'll be able to talk to them, share concerns, give feedback, etc. So I'm really like, in addition to incredible judges that I haven't confirmed yet, but I will tell you those are some cool people that you want to know and meet and be in the community with, with the low weights and biases. So please, you're more than welcome, to WeaveHacks May 17th and 18th. please join us in San Francisco. If you need, if you're looking for a reason to go to look at the Golden Gate Bridge, go this is your reason, I will be there, I'll give you high fives as well, and I think the hackathon is going to be super, super cool. That's number one. Number two, our Fully Connected, our general Weights Biases, conference, a big one. The two day conference is coming to, San Francisco as well, don't remember the exact hotel, I will put it in the show notes. it's go fullyconnected. com, and we're having incredible speakers. Last year, we had Joe Spisak, head of Metal Llama, announce Llama 3 on stage, so we're looking for more announcements maybe, but absolutely welcome to fly for that, I'm sure that meeting other ML practitioners that use Weights Biases and listening, like tuning into how people use, it's not only about Weave, which I mostly move forward, it's about like everything. That's fully connected. That comes to San Francisco very soon. And the tickets are out. And, I will try to get you a discount for the next one. Hackathon is free to participate. as I always say, ThursdAI is proudly sponsored and solely sponsored by Weights Biases. And, you should definitely give Weave a try. Our observability and LLM evaluation toolkit. Everything that we've talked with Hamel, Hussein, and Shreya Weave supports in a beautiful visual form, as you've seen before, when we elaborated, when we evaluated Kuen, I used Weave for all my evaluations myself as well. So definitely give Weave a try. This has been this week's buzz, and now let's move on to the LLM Arena thing. Oh. In this week's buzz, we tell you all about Weights Biases
1:26:51
Alrighty, we're back and we should talk about the LLmarina thing. We should absolutely talk about the LLmarina thing. Shreya, who had to leave because of a conflict, mentioned this. Hamel is still with us. Hamel, if you want to come up and let me know over chat to chat with us because I think, it's evals related. Folks, as we, when a new model comes out, we are, like, giving you, vibe stuff. And we're telling you, hey, we're This model is this model is this. Wolfram, we brought you on because you were like known in the LM, like local LLM community as like the evaluator. I think your AI evaluator is actually a title of yours. And we try to bring you the vibes. and a lot of what we used to rely on or like for longest time is LLMarina. LLMarina, for those of you who still don't know what LLMarina is, Stanford? I don't remember exactly where they're from. Researchers, they built a way for people to just see two models responding to the same chat and select their preference. And then they had exploded in popularity for the past, as long as we've been running with ThursdAI. And recently they came under fire. Specifically, I remember, because of, Yeah, Hamel, welcome to this discussion. when Metal Llama 4 released Maverick, and there was a whole thing that we talked to you about where kind of the unreleased model, shot up to the top of the charts, and then Maverick released and placed 37 or something? That's number one. Number two, Claude 3. 7, which we all know slaps, like the vibes on Claude 3. 7 is incredible. Sometimes it goes too much, sometimes it builds your whole back end when you're not asking for it, but generally the vibes for 3. 7 are great, and people love this model. It's also placed 25 on LLM Arena, and that doesn't really hurt. Kind of vibe with our vibes necessarily. And there was a very interesting like delta between what we feel like and what LLmarina shows and LLmarina supposedly have a needle score and enough people voting on these models. LLM Arena also has the service that they provide the bigger labs to run their models incognito. And we always talk about this. Nisten is we're trying to figure out which model is which, that they run this model behind the scenes and the folks vote for it without knowing which model it is to not obscure votes. And here comes this paper called Leaderboard Illusion from folks, mainly incoherent, some other folks as well, that claims that the way the arena was set up. is actually favoring bigger companies. so let's talk about this. who has who wants to go first and with some thoughts?
Nisten Tahiraj
Nisten Tahiraj 1:29:18
I think, look, we all know benchmarks are broken, but The
1:29:23
reason that it all works out in the end is because you can combine the vibe checking with what people are actually doing with the LLM Arena scores and with actual benchmarks that people run. I don't think it's necessarily a bad indicator. Yes, we're aware that they can just run LLM. Any API it wants, and we've made accusations here and there, and it becomes a bit of a thing to game the arena. And sometimes it can affect how the model is, is done. But, at the end of the day, it is one more metric that you can use productively. So I. I do think that the arena does actually correspond pretty okay, to, towards what the models can do. There are outliers. For example, Sonnet scores very poorly and people feel like that's the best, but, it's not like it's that far off from, from what it is. I also want to say that the research that came out from, from Cohere's team, that's very, those are very reputable seasoned people in AIML that have been there for, I think, 5, 6 years. And those are all valid points. We should improve it. But I just want to say it doesn't mean it is completely broken or completely useless or anything like that.
Alex Volkov
Alex Volkov 1:30:38
I think they're highlighting, multiple things
1:30:39
here and we should discuss. Hamel, we'd love to hear your thoughts about this. Distorted playing field, and they have undisclosed private testing practices that benefit bigger providers. Specifically, and this threw me off. At an extreme, we identified 27 private LLM variants tested by Meta in the lead up to the LLAMA 4 release. 27 private LLMs, Mete is basically using this, I actually want to know whether or not like for free or whether or not they're supporting LLM Arena, I don't know this. And then they specifically say, both these policies lead to large data access asymmetries over time, providers like Google and OpenAI have received an estimated 20 percent of all data in the arena. A combined 83 open weights models have only received an estimated 30 percent of the total data. there is some sort of like data feedback. And what are these companies doing with this feedback? Hamel, you think they're training on this? They're building the models? What do you think is going on there?
Yam Peleg
Yam Peleg 1:31:28
I have no idea.
1:31:30
I'll be honest, I have no idea. I don't get it. These are real people's scores. It's not an eval, A finite evil you overfit. How, is that, I don't know, are the questions not representative of real life? It's a different, it's a different story, but you are, at the end, making more people favor a specific model, even if you took 20 steps to get there. how can you even overfit this? I'm not sure I get it. To the end, like what exactly is the thing that people overfeed? Is it just, I don't know, people favor longer answers or answers with markdown and don't even read the answers when they rate the models? I'm not sure, but something, it doesn't really add up, in my opinion, because it's really people, I don't know if you have any idea, if you have any insight.
Nisten Tahiraj
Nisten Tahiraj 1:32:28
It's mostly one shot, though, so I use the arena a lot, and
1:32:33
I don't know many others that use it a lot, just so that I can at least discern what the models do, but it tends to, most people just in a question and they see the answer and then they just vote. Or maybe do two or three turns and then that's about it. So that's a hard way, to judge it so it tends to favor a lot for those one shot questions. so if you keep training the model, to do better at one shot. You do actually even get a better model, but it's not necessarily representative of the day to day use that people are gonna do.
Alex Volkov
Alex Volkov 1:33:03
I'll highlight this.
1:33:04
We show that access to chatbot arena data yields substantial benefits. Even limited additional data can result in relative performance gains of up to 112 percent of arena distribution. Which is, it's quite crazy. And also, I want to point out that, folks have started going directly to OpenRouter. And, who's that? Was that Gemini? no, that was OpenAI. Sorry, OpenAI. OpenAI went directly to OpenRouter instead of OpenArena or LLM Arena for the latest releases. So some kind of changes in how folks are like receiving this effort. the response OpenAI is
Wolfram Ravenwolf
Wolfram Ravenwolf 1:33:37
doing the same.
1:33:38
you can write, yeah, responses. And sometimes you get two at the same time and have to pick one. So that is a In their UI. Yeah, in their UI. So It does
Hamel Housain
Hamel Housain 1:33:47
suggest that the data that OpenAI is getting in their UI
1:33:52
is, very different than the data that is appearing in the arena. Because if the data in the arena from the arena allows them to have such a big jump, that means that, okay, there's something to overfit on. There's something specific about the arena population that is different potentially.
Yam Peleg
Yam Peleg 1:34:13
Where are the questions are coming from in the arena?
1:34:16
if I may dig into this subject more, Alex, but like, where are the questions are coming from? Are there? Are they unique? For users?
Alex Volkov
Alex Volkov 1:34:24
Yeah, in
Yam Peleg
Yam Peleg 1:34:25
the arena.
1:34:25
I just want to understand, let's say that I overfit the arena To the end, where the questions are, like, where exactly is the leak? That's what I'm trying to understand. Because if you're going to take OpenAI's data and overfit it to the end, you're going to get a really good model. You're not going to get an overfitted model.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:34:45
I think in the arena, a lot of people are just asking the trick
1:34:48
questions to see which model answers them. how many R's are in strawberry and stuff like that. as soon as on Reddit a new trick question comes up, people use it in the arena to check if the model has it. it gets in the training data that way. And, yeah, then it ends up that correctly. So you always have to make your own private benchmark in the end. Before I move on,
Alex Volkov
Alex Volkov 1:35:08
we should mention, the importance of, again, evaluating stuff.
1:35:11
LM Arena does some evaluations, but then, the folks, Give them recommendations of how to solve this so it's not like it's a scaling point arena is bad It's just like literally they showed like multiple things of how to solve this including showing Spotlight on the fact that the big companies get like data back as well they talk about pre release training and the community they gave some answers And they also pointed out some factual errors, from this paper, but basically the damage potentially has been done already because like, when we know in the open source community, when people have a tendency to not like bigger labs generally, and just focus on open source because of several things that happened this week, for example, the glazing incident, like you said, if you're hosting your own models, glazing will not happen to you. Unless you have a continuously fine tuning thing and you release, if you're hosting your models and you're like, you're checkpointed, that will not happen ever. And OpenAI will just play with, with the prompts to see what better, what works, what doesn't work.
Hamel Housain
Hamel Housain 1:36:10
With the glazing, what's wrong with the evals?
1:36:12
Or like maybe the annotations are getting are just that bad. Maybe we're just like A lot of mid people using chat GPT that like glazing. They talked about
Alex Volkov
Alex Volkov 1:36:22
a few things, specifically about glazing.
1:36:24
Hamel, before you came in, we chatted about this, and they had an AMA, and there was like a combination there of looking for short term like enjoyment and there's also the long term memory that came in that it knows everything that you saw and goes and tries to fit whatever you want it. but yeah, it's going to be very interesting. I don't think we got the clear answer of what exactly went in the last two releases, that they rolled back. we know some prompt and people like, Pliny the Prompter, have shown us like the differences in prompt. The differences are tiny. I wouldn't look at this and consider it like this would, make a model C, c Co. and, but I think combination with a bunch of other stuff, but also they're probably like. I added some training in there. I think that this is it on the leaderboard. I think like with Nisten, Arena is still important. They have the style confirmation. They have the web dev arena on which Claude 3. 7 is absolutely slaps in the top as well. They do a lot of community effort. We'd love to have them here as well. I hope that they look like learn from the speed deck and like absolutely understand that they also have a responsibility towards the community to like not over fit to bigger labs and present like open weighted models as well. So great job from Cohere appointing this finger. I think that's it folks. Last things. I think that this is it for the open, for the bigger companies and APIs, unless we have breaking news, but I doubt. Oh, we did have breaking news a little bit before, where Claude, is now, I think that's pretty much it on the big companies and APIs. And we're moving on because in the interest of time, we're coming up on the two hours. I would love to chat about the runway references thing. I know that we like, this is one reason why we do video, because we could talk about everything in a radio form, but. A lot of the latest upgrades in AI is video based. So I really want to like show showcase some of the incredible stuff that Runway is doing. and we know that there's models. Let me just try to open this link. there's a lot of video models that we've covered from just multiple apps. Kling came out with Kling. 2 and like a bunch of stuff. Runway has been around for a while from Gen 1, Gen 2, and now Gen 4, has upgraded with this thing called References. And References allow you to just specify what you want in the next frames. It's the dream from content, for content creators. So I'm going to show a few examples here. they claim that like around the references is a time machine. they picked a random frame, this specific person picked a random frame from, 90s sitcom and then returned the still frame from a scene of 10 minutes ago. And these are the results. And so this is like one random frame and you can see other frames. You have scene consistency and character consistency in here. And this is something incredible. If you think about like how diffusion works, generally, with like seed or something, that's like really hard to get this consistency level. And then you can generate like all of these things and just the consistency of the couch, you can see. Obviously the design is not a hundred percent perfect, but the consistency of the couch, the theme, the fridge is blue. There's like a bunch of stuff. I want to show like a video. Yeah. let me see this video, for example. So you can get consistency of faces across scenes, which is like great for scene building. I think, one of the, one of the ways that we detect whether or not the video is AI generated is whether or not the face changes from scene to scene. And that, folks, by the end of this year, I believe is going to go away. I believe that no longer we'll be able to judge a video, whether or not it's AI generated or not, based on whether or not the face changed subtly from scene to scene, because of things like consistency, it's quite. Quite something. yeah. Oh, I think this is the release video and I'll show it just a little bit because I think it's super cool. there was also something I learned, that this consistency thing, if you want to put yourself in those videos, it, the, it prefers bold people. Bold people because like they can put on any type of hair and if you look, it just works. Whereas I'm very identified with the haircut that I am. This just works, better as far as At least somebody from FAL, a friend of mine who was also bold, reached out and said, and you can see this dude in like multiple scenes as well in consistency. this dude has an ice cream as well. incredible release from Runway. Incredible, just continuous stream of updates in this area of AI video generation and shout out to Runway for this. This is now also available for, for, Just everyone, I think. It's not only on the premium tier, so super cool for consistency, for character consistency as well. and what else do we have in AI News? Any feedback on this Wolfram? Anything that you would like to add? I know that sometimes we chat about video stuff as well. Looks like Nisten dropped. Thank you, Nisten. Go ahead, Wolfram.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:40:57
It's a very cool model.
1:40:59
I have been experimenting with it and I will upload some videos later. I hope they are finished by then. Oh, nice. The consistency is amazing, though. That is great. What we have seen with OpenAI's image creation, where you can just give it two images and they combine them, now you can do that with video. That is great. I wonder if Zora can do it as well. Has anyone tried?
Alex Volkov
Alex Volkov 1:41:19
Sora cannot do it.
1:41:21
The Sora that we have cannot do it. I'm pretty sure that OpenAI has a Sora inside that absolutely does this. I think that I've even seen leaks of Sora 2 or something, but those are only leaks and we haven't seen them as well. I don't
Yam Peleg
Yam Peleg 1:41:33
know what's going on with video anymore.
1:41:34
I just can't follow. there's so many, like every week you get, this is the best model for video. And then okay. I get it. Then a week later, like two different others. These are the best videos. Yeah. Yeah. You're just. I just don't know which is the best at the moment. They all look amazing. I've seen every week an amazing model and I'm not sure which one is better. you really need to compare a video, the same prompt just to make sure and it's very hard. But it's amazing. It's moving so fast.
Alex Volkov
Alex Volkov 1:42:02
I agree with you.
1:42:03
I think it's harder than evals on LLMs. I think that, just doing evals on video, like, how would you even do this? LLM as a judge, maybe? Actually. It's a hard field. another hard field, but a little bit easier is the image diffusion models. And so there's one update I want to give you there. Hydream was released and Hydream, is this model on Hug Face from, company. Yeah, Hydream AI. There is Hydream E1 in full and they have the screenshot, specifically focusing on converting to Ghibli style, which we know that like Ghiblification from GPT, image generation came. they have a bunch of examples. This model, I believe is Apache 2 license. And you can use it for a bunch of stuff. Vivago. ai is the company and you can use this model like online. and they released drop full weights for this. And I believe that if you go to artificial analysis, which is, by the way, we mentioned, let me just open this in a way that I can
Yam Peleg
Yam Peleg 1:42:52
show you, yeah, they offer, zero GPU now, which basically
1:42:58
means they offer an age 100s. with the GPU, Zebra GPU.
Alex Volkov
Alex Volkov 1:43:03
Wait, really?
1:43:03
Zero GPU upgraded? That's incredible. so the thing I wanted to show is that we talked about, we talked about LLM Arena for vibes, but another resource that I love is Artificial Analysis, and they also have, language models and image and speech models. So text to image, if you go and, look at the image arena, which they have, which is very similar to LLM Arena, which is, they show you two pictures, you decide. Hydream is up there. So lead the board, let's look at Hydream. Hydream, i1dev, which is not the model that we just talked about, is like number four, above Google Imaging even, and above, just a little bit below ReCraft, so this is a great, thing for you to explore if you're interested in latest and greatest in models, I believe that they have video models as well, but I don't think they have, do they have a leaderboard? Yeah, they have a leaderboard, so One
Yam Peleg
Yam Peleg 1:43:42
thing about this is that the model can be, can generate the most I
1:43:46
don't know the best images possible, but the best models like what sets GPT4 apart is that it follows the prompt better than other image generation models. Like you can give it very specific instructions in text. and you put this human here and it stands on this and so on, like many tiny details and it actually makes them. this is, for example, in MidJourney, which is an unbelievably good image generation model as well, it doesn't follow the same way, like you can't control it to the same extent as GPT 4. so I don't know, maybe this is what the evils are checking, but I'm just saying that it might be pretty hard to judge. Yeah, when you see two images and you
Alex Volkov
Alex Volkov 1:44:34
select your preference, that's different from you're playing
1:44:37
with a model, you're talking about it, you're like natural language this. Which we also should mention, Google released native image generation as well, like we knew that they have this capability, but now inside Gemini you can like have native image generation editing as well, they added this week as well. So much happening that I didn't even add this to the show, I just remembered another thing that happened. but yeah, I agree with you. it's really hard to judge and build those evals as well, where people just judge, basically, on two things. folks, I think we're coming to the end of the show. We somehow were able to cover most of the stuff that we wanted to cover, including an interview with two great folks in evaluation. As a reminder, you should definitely try to check that out if you're building anything with AI, and you want to be having, you want reliability. As also a reminder, this show is proudly sponsored by Weights Biases and have been since the beginning, and this is why the show continues to happen, and we have a tool for this called Weave, and, you should definitely give it a try if you're interested in building, your application in any type of way. Reliant way. We're going to give you the tools to do we just released a bunch of new eval APIs that I would love to chat about next week as well. with that said, I think the last thing that we didn't get to, and I think we'll skip, in interest of time, is, OpenPipe or Friends of the Pod, OpenPipe is a platform for fine tuning models, have been increasingly moving into, reinforcement learning, fine tuning as well. And they released an open source, RL trained email research agent called RE and they're claiming state of the art. There're on top of O three agents. It's pretty, pretty cool. So they're getting, lemme try to zoom in here, zoom in. Percentage of an questions answered correctly on their benchmark is 96, where on top of oh three, which is pretty cool. It's based on Quinn 2.5 14 B. So as Niton previously said, Quinn absolutely was the workhouse of the fine tuning community and I'm very excited to see what happens with Quin three. I want to shout out our friend of the part, Junyang Lin, who woke up very late, and stayed up there, but we weren't able to accommodate him on the show, unfortunately, because it was after we already covered. but Junyang was almost here. Shout out to the Qwentin for the incredible release, this week. Absolutely the highlight for most of us as well. and also, one, maybe, comment from me is that if you trust your life decisions to an AI, make sure to have a second opinion, potentially from another AI, but also maybe a human. because incidents like the last two weeks showed us that, we, especially I am talking from myself here, Trust more and more of my decisions. I have a button on my phone, literally as you click it, ChatGPT comes up in voice mode, and I just talk to it about like, how long do I boil an egg for? Which is embarrassing in retrospect. I ask
Yam Peleg
Yam Peleg 1:47:05
so many stupid questions.
1:47:07
Stupid questions. But also. So many stupid questions all day. Absolutely. But also like many
Alex Volkov
Alex Volkov 1:47:11
of us, yeah I agree, like many of us don't, Go to therapists
1:47:16
like every week or want to afford this and chat GPT and tools like Orin that I talked to you about from Illusion Labs and different things. They are increasingly participating in our lives, in decision making, in thinking about things, and I think it's incredible. I think as an iEvangelist, I think it's incredible that we have this tool. I think it's incredible that people who didn't even think about therapy. Now can have access to something that's like therapeutic and can help them like live their lives better. I think it's incredible. However, with this comes the risk of the platforms doing what they want. specifically, I don't think that OpenAI had bad intentions in releasing this for engagement. I absolutely don't think so. However, If the outcome is I don't think that Zack came up one day and said, hey, we're going to buy Instagram and then we're going to ruin lives of 12 year olds everywhere for a decade. I don't think that these things come with intention, but this is sometimes the outcome. And so as AI evangelists, it's important for me to also highlight the potential risk of these things. And also like remind you guys. if you're getting, very hypey responses, yeah, you should fucking go for it, leave your job, and, do the business that you want, if you want to sell farts in a jar, d do not, go get a second opinion, go, In everything you do with AI, like preparing taxes, whatever, open up multiple models and just run through them together and then see where it shakes out. And then also don't forget to talk to a friend that's human. I think it's very important as well. With this, I think we're exactly at two hours. And don't stop thinking yourself. Do not stop thinking for yourself. Critical thinking is very important. As well, and with this, I think we're exactly at two hours. Folks, thank you so much for joining ThursdAI. It's a pleasure of ours to be here every week, to give you the news, to follow this. honestly, without the community, we're now at exactly like a thousand, listeners this week. without this community, we didn't have the incentive to also follow up as well. I think it's great for us. Thank you, a huge thank you for, our guests. Shreya Shankar and Hamed Hussain for coming in. Thank you everybody for tuning in from week to week, for commenting, for giving feedback, for telling us about breaking news, for all of this. If you missed any part of the show, you can find all of everything that we had in thursdai. news, the newsletter and the podcast. I will add, everything on thursdai. news. We just recently went over a hundred episodes. I stopped counting. We'll get there as well. Thank you Wolfram, for co hosting Yam as well. Nisten was here and then, That's it. Thank you for the community. We'll see you here next week, folks. Bye bye.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:49:25
Bye bye.