Episode Summary

Gemini 3.1 Pro dropped live during the show โ€” Google's biggest model yet with 44% on Humanities Last Exam and 77% ARC-AGI. Anthropic launched Sonnet 4.6 with 79.6% SWE-Bench Verified, Alibaba shipped Qwen 3.5 with 397B parameters, and xAI unleashed Grok 4 20 with four 500B-parameter agents collaborating. Ryan Carson laid out the Code Factory blueprint for agentic engineering, and the panel unanimously declared one-shot coding officially dead. Plus OpenClaw's creator Peter Steinberger joined OpenAI in what might be the first single-founder billion-dollar acqui-hire.

Hosts & Guests

Alex Volkov
Alex Volkov
Host ยท W&B / CoreWeave
@altryne
Ryan Carson
Ryan Carson
AI educator & founder
@ryancarson
Wolfram Ravenwolf
Wolfram Ravenwolf
Weekly co-host, AI model evaluator
@WolframRvnwlf
Nisten Tahiraj
Nisten Tahiraj
AI operator & builder
@nisten
LDJ
LDJ
Nous Research
@ldjconfirmed
Yam Peleg
Yam Peleg
AI builder & founder
@Yampeleg

By The Numbers

Humanities Last Exam
44%
Gemini 3.1 Pro โ€” dropped live during the show, state-of-the-art reasoning benchmark
SWE-Bench Verified
79.6%
Claude Sonnet 4.6 โ€” nearly matching Opus 4.6 at a fraction of the cost
Qwen 3.5 Parameters
397B
Alibaba's open-weight model with only 17B active params and 512 experts
Grok 4 20 Architecture
500Bร—4
xAI's multi-agent model โ€” four 500B-param agents collaborating, no evals released
ARC-AGI
77%
Gemini 3.1 Pro โ€” state-of-the-art without custom harness, terminal bench at 68
Live Listeners
1,500+
ThursdAI's live audience for this episode, approaching three years of weekly broadcasts

๐Ÿ”ฅ Breaking During The Show

Gemini 3.1 Pro โ€” 44% Humanities Last Exam, 77% ARC-AGI
Dropped minutes before the show started. Google's biggest model yet at the same price point. State of the art alongside Opus 4.6 on most benchmarks. Nisten tested it live โ€” blazing fast but didn't pass the vibe check for coding.

๐Ÿ“ฐ Introductions & Top AI News Picks

Alex opens with breaking news โ€” Gemini 3.1 Pro just dropped from Google. The panel shares their top picks: LDJ picks Zuna (thought-to-text BCI), Nisten picks Qwen 3.5, Wolfram picks Gemini 3.1, and Yam drops the bombshell that OpenAI acqui-hired OpenClaw creator Peter Steinberger.

  • Gemini 3.1 Pro drops live as the show starts
  • OpenClaw founder Peter Steinberger joins OpenAI
  • ThursdAI approaching 3 years of weekly broadcasts
Wolfram Ravenwolf
Wolfram Ravenwolf
"Ah, what a day. What a day. You pour a smile to my face when I read your post about the breaking news."
Yam Peleg
Yam Peleg
"OpenAI just acqui hired the goat. Peter Steinberg."

๐Ÿงช Brain-Computer Interface (Thought to Text)

LDJ highlights Zif's release of Zuna, a sub-billion parameter model that translates EEG brain signals into text โ€” what people are calling 'thought to text'. A glimpse into non-invasive brain-computer interfaces becoming accessible.

  • Zuna: 380M parameter BCI foundational model
  • Translates EEG brain signals to text
  • Open source and Apache licensed
LDJ
LDJ
"Zif releasing Zuna. The sub billion parameter model that is basically people are referring to it as thought to text. So it could take in EEG signals from your brain, and basically, better interpret that than previous models."

๐Ÿ”“ Qwen 3.5 Release

Nisten picks Alibaba's Qwen 3.5 as his top news โ€” almost 400B parameters with only 17B active. Qwen models have historically excelled at multilingual and medical performance, and this new release runs faster with fewer active parameters.

  • 397B total parameters, 17B active (down from 22 in previous version)
  • Qwen excels at multilingual and medical tasks
  • Runs faster for data generation workloads
Nisten Tahiraj
Nisten Tahiraj
"It's Qwen 3.5. It's almost 400 billion parameters with 17 billion active. I'm pretty excited about this one because Qwen models tend to usually have the best medical performance."

๐Ÿ“ฐ Are We Still in the AI Bubble?

Alex shares his experience at a Claude Code meetup where even attendees weren't running agents. Ryan reports meeting normies whose reaction to AI progress is mostly fear and dread. The panel discusses the widening gap between the AI-native bubble and everyone else.

  • Even Claude Code meetup attendees barely running agents
  • Ryan closing his seed round, planning to hire one 10x engineer instead of a team
  • Eric S. Raymond (open source pioneer) embraces AI as 'wizard mode'
Ryan Carson
Ryan Carson
"If we're behind, we're in trouble."
Wolfram Ravenwolf
Wolfram Ravenwolf
"Recently there has been a lot of discussion about Open Claw and it has been so negative. I was really shocked to see this. They are claims that it's not really a thing, that it is just crypto people pushing it."
Ryan Carson
Ryan Carson
"I'm working more than I ever have in my whole life."

๐Ÿ“ฐ TL;DR - Weekly AI News Roundup

Alex runs through the week's releases: Qwen 3.5 from Alibaba, OpenClaw joining OpenAI, Anthropic's terms controversy, ByteDance's Seed 2.0, Gemini 3.1 Pro dropping live, Grok 4 20, Google's Lyria 3 music model, and Cohere's multilingual Aya model.

  • OpenClaw founder joins OpenAI โ€” possibly first single-founder billion-dollar deal
  • Gemini 3.1 Pro: 44% HLE, 77% ARC-AGI, same price point
  • Grok 4 20: 500B params ร— 4 agents, no evals released

๐Ÿ”ฅ Gemini 3.1 Pro - Breaking News

The panel dives into Gemini 3.1 Pro which dropped minutes before the show. Same price point with significantly better performance. Ryan insists he only cares about SWE-Bench scores, while Wolfram argues Terminal Bench is more relevant for agent use cases.

  • Same price as previous Gemini, significantly better performance
  • 77% ARC-AGI, 44% Humanities Last Exam, 68 Terminal Bench
  • State of the art alongside Opus 4.6 on SWE-Bench
Ryan Carson
Ryan Carson
"As soon as I don't see bold on SWE bench verified, like I'm not interested. All I want is SWE bench verified and SWE Bench Pro. And they're not top. Logan and team I love you, but I'm not interested."
Wolfram Ravenwolf
Wolfram Ravenwolf
"Terminal bench is more interesting because it's often put in the coding category. But what it is actually is agent stuff like what we are asking of Open Claw. Since I am using AI mostly as an assistant in that way, that is why I'm most interested in the terminal bench scores."

๐Ÿข Gemini 3.1 Pro - Benchmarks & Long Context

LDJ reveals a massive discrepancy in long-context benchmarks โ€” Opus 4.6 scores 76% on MRCR at 1M context vs Gemini 3.1's 26%. The panel debates whether Google is under-reporting competitor scores and highlights the difficulty of comparing benchmarks across different methodologies.

  • Opus 4.6: 76% MRCR at 1M context vs Gemini 3.1 Pro: 26%
  • Google's eval table may be under-reporting Anthropic scores
  • Different measurement methodologies make direct comparison difficult
LDJ
LDJ
"On the MRCR V two eight needle, I do actually have scores for Opus 4.6 at 1 million context, which does show it's significantly better than Gemini here for Opus."

๐Ÿ› ๏ธ Gemini 3.1 Pro - Live Vibe Coding Test

Nisten runs a live vibe coding test in Google AI Studio โ€” the same Martian mass driver simulation they tested with previous models. Gemini 3.1 Pro is blazingly fast but the output doesn't match what Opus 4.6 and Codex achieved.

  • Extremely fast generation โ€” completed in about 20 seconds
  • Created a functional simulation but less polished than Opus/Codex
  • Fast but not passing the initial vibe check for agentic coding
Nisten Tahiraj
Nisten Tahiraj
"It is extremely fast. So I'll give it that. But it often just doesn't work."
Nisten Tahiraj
Nisten Tahiraj
"I'm not impressed compared to what Opus and Codex did. They had a fully working one with like trajectories and stuff. And this is like, I'm just stuck here."

๐Ÿข Codex 5.3 vs Gemini vs Opus Discussion

The panel debates why models perform best in their own harnesses. Ryan argues this is why agent labs are struggling โ€” the model maker always has the natural advantage. LDJ points out that Codex in its own harness scores 77% on Terminal Bench, the true highest score.

  • Codex 5.3 gets 77% in Codex Harness โ€” true state of the art
  • Model labs have natural harness advantage over third-party agents
  • Claude Code's success proves the model+harness synergy
Ryan Carson
Ryan Carson
"Every single agent lab is fighting this natural flow where the model is made by the lab and then the harness is made by the lab. And it's just hard to believe that they're not gonna do the best job at implementing a harness around their model."
LDJ
LDJ
"In the Codex Harness itself, OpenAI with GPT 5.3 gets 77%. And I would say that's the true highest score here. What we ultimately want is in the best conditions, what do the models get if you put them in the best harness."

๐Ÿข Claude Sonnet 4.6 Release

Anthropic releases Sonnet 4.6 โ€” 79.6% on SWE-Bench Verified, 1M token context window, now the default model on Claude AI. LDJ notes it feels like a smaller Opus 4.6 that may have been trained for longer. In Claude Code testing, users preferred Sonnet 4.6 over the previous Opus 4.5 59% of the time.

  • 79.6% SWE-Bench Verified โ€” very close to state of the art
  • 1M token context window in beta, $3/$15 per million tokens
  • Users preferred Sonnet 4.6 over Opus 4.5 59% of the time in blind testing
LDJ
LDJ
"It just kind of feels like a smaller Opus 4.6 basically. I get the impression that maybe it's trained for a lot longer. So like maybe less parameters, but trained for a lot longer than Opus."
Alex Volkov
Alex Volkov
"Sonet beats the previous Gemini three pro in pretty much every benchmark. The smaller version beats the thought leaders. We're moving so fucking fast. What's going on?"

๐Ÿข ByteDance Seed 2.0

ByteDance steps up as a leading Chinese AI provider with Seed 2.0 โ€” a frontier multimodal LLM with video understanding that surpasses the human benchmark (77% vs 73%). Priced at 84% cheaper than Opus 4.5, it's a compelling option for price-conscious developers.

  • 84% cheaper than Opus 4.5 with near-comparable quality
  • Video understanding surpasses human benchmark: 77% vs 73%
  • Pro, Light, Mini, and Code variants available
Alex Volkov
Alex Volkov
"Seed is very close to GPT four quality, and maybe even Opus 4.5, but 84% cheaper than Opus 4.5. So if you're absolutely price maxing, this could be a great model for you."

๐Ÿ’ฐ Anthropic Terms of Use Controversy

Anthropic updated their terms of use, causing panic that Max account OAuth couldn't be used with third-party agents like OpenClaw. They partially reverted, but the situation remains unclear. Meanwhile, Chinese labs and OpenAI explicitly welcomed agent usage with their subscriptions.

  • Anthropic's terms briefly banned using Max accounts with agents
  • OpenAI confirmed Pro subscription works everywhere including OpenClaw
  • Chinese labs explicitly host OpenClaw instances on their platforms
Wolfram Ravenwolf
Wolfram Ravenwolf
"Basically everybody except Anthropic and maybe Google are saying, yeah, you can do this. While they are saying no, you can't."
Ryan Carson
Ryan Carson
"I hadn't paid for the Claude Max plan until Open Claw. I did it and then I just canceled it. And I switched to OpenAI, because you can use Codex on it."

๐Ÿข ChatGPT Personality & OpenAI Model Deprecations

A brief transition segment โ€” Alex acknowledges the need to move on from big lab discussions to cover Grok, open source, and evals. The panel has been discussing for nearly an hour and still hasn't touched half the topics on the docket.

  • Panel acknowledges the sheer volume of news to cover
  • Transition to Grok and open source coverage

๐Ÿข Grok 4 20 Review

xAI releases Grok 4 20 โ€” four 500B-parameter agents collaborating in a multi-agent UI. No benchmarks or evals released. The panel finds it underwhelming for coding and day-to-day work, but acknowledges its strength for deep research via X's data. A $300/month Heavy tier with 16 agents exists.

  • 500B params ร— 4 agents (or ร—16 for Heavy at $300/month)
  • No benchmarks or evals released โ€” silent drop
  • Grok 4.1 Fast still #8 on Open Router for API usage
Nisten Tahiraj
Nisten Tahiraj
"It's not bad. It's not good for day-to-day work, like for agent stuff. But what it is, I'd say still the best at maybe or top tier at is this research stuff just because of whatever RAG system and research system that xAI has."
Ryan Carson
Ryan Carson
"I want X to win. I love X. I've been on X for like 19 years. But nobody uses Grok for production stuff that I know of. Nobody uses it for coding."
LDJ
LDJ
"Grok 4.1 Fast is actually surprisingly popular. Right now it's number eight on Open Router. And the only American models beating it in API usage are Claude Opus, Sonnet, and Gemini Three Flash."

โšก This Week's Buzz - Terminal Bench Benchmarking Deep Dive

Wolfram presents his Terminal Bench benchmarking work for W&B. He reveals that benchmarks are far more nuanced than single scores โ€” runtime limits, harness settings, thinking mode, and resource allocation all dramatically change results. He also shares how Weave tracing caught an inference bug that was causing GLM-5 to score only 5%.

  • Terminal Bench tasks include building Linux kernels and cracking passwords โ€” not just coding
  • Qwen 3.5 scores 52.5% โ€” third place among open source models
  • Kimi K2.5 achieves 67.4% ceiling score across multiple runs
  • Weave tracing caught a critical inference bug affecting GLM-5 scores
Wolfram Ravenwolf
Wolfram Ravenwolf
"Benchmarks are complicated, but they are important and you get a lot of information if you take a closer look and not just compare some numbers, which may not even be directly comparable."
Wolfram Ravenwolf
Wolfram Ravenwolf
"We had abysmal scores. I got only 5%. So I checked and looked in the Weave trace to see what is happening. I saw issues with the code. I reported to our engineering department and they found it and fixed the problem."

๐Ÿค– Code Factory - Agentic Engineering with Ryan Carson

Ryan walks through his viral Code Factory article โ€” a system for fully automated code generation, review, and deployment. Inspired by OpenAI's Harness Engineering article, the setup uses GitHub Actions, Reptile for code review, CI gates, and a self-healing loop where agents fix their own PR issues until all checks pass.

  • Code Factory: agents write, review, and ship code in a loop
  • Risk classification system flags high-risk file changes for extra review
  • Self-healing loop: Codex fixes PR issues until all CI checks pass
  • Takes a week+ of setup but unlocks massive throughput
Ryan Carson
Ryan Carson
"OpenAI released this article called Harness Engineering and they basically documented how they have set up Codex as what I'm calling a code factory. Really a system that makes it easier to build, test, and deploy in a reliable fashion."
Ryan Carson
Ryan Carson
"From the beginning, think like you have a team of a hundred engineers. Even if it's just you, take the time, it's like a week or more of setup. And it really unlocks absolute magic."

๐Ÿ› ๏ธ One-Shot is a Myth - Front End vs Backend AI Coding

Alex demos the new ThursdAI website built entirely with agents, but emphasizes it took days of iteration โ€” not one shot. The panel agrees: one-shot coding is a myth, especially for front end. Ryan recommends design systems and Instill for UI feedback loops, but notes frontend still requires human-in-the-loop driving.

  • New ThursdAI website built with OpenClaw โ€” agents extracted 160+ guests from 152 episodes
  • Running agents overnight produced near-complete website rewrites daily
  • Backend loops work; frontend still requires human steering
  • Design systems dramatically improve agent UI output consistency
Alex Volkov
Alex Volkov
"None of this was one shot. The amount of conversations that I had with my agent to get to a level that this looks coherent between pages is absurd."
Ryan Carson
Ryan Carson
"There is no magic here, no silver bullet and anyone who's saying otherwise is lying or doesn't use the tool."

๐Ÿ“ฐ Will Software Engineers Lose Their Jobs?

Yam reveals he's fired a crazy number of agents this week โ€” models are inherently random and can destructively delete your entire computer by accident. Ryan emphasizes document drift as a critical Code Factory concern. Nisten argues frontend developers are still essential to take projects to completion.

  • Models are inherently random โ€” destructive mistakes are a matter of 'when' not 'if'
  • Document drift is a major Code Factory challenge
  • Frontend developers needed to take things to production quality
Yam Peleg
Yam Peleg
"I can't even tell you how many agents I fired this week. Models can mistakenly, without you even realizing, just delete the entire computer. And you can't even blame anyone because a minute later the context is compacted."
Nisten Tahiraj
Nisten Tahiraj
"You need to finish the job. That's why you need an actual frontend developer. If it's an app you're gonna hand over to a customer, at some point you need to take the thing to completion."

๐Ÿ”Š Google Lyria 3 - AI Music Generation

Google DeepMind launches Lyria 3, their most advanced AI music generation model, available in the Gemini app. It generates 32-second high-fidelity tracks with creative controls, and can compose music from uploaded images. A prompt guide is available for vocals, lyrics, and different styles.

  • 32-second high-fidelity music tracks
  • Image-to-music: upload an image and generate matching music
  • Prompt guide released for vocals, lyrics, and styles

๐Ÿ”“ Open Source Roundup - Qwen 3.5 & Cohere

Deeper dive into Qwen 3.5 โ€” Nisten reports benchmarks look good but coding is behind GLM-5. The model uses a different architecture from DeepSeek, with 512 experts and 262K native context extendable to 1M. Cohere releases Aya 3.3B, a tiny multilingual model supporting 70+ languages.

  • Qwen 3.5: 512 experts, 11 active, 262K native context (extendable to 1M)
  • GLM-5 still ahead on coding; Qwen excels at multilingual
  • Cohere Aya: 3.3B params, 70+ languages
Nisten Tahiraj
Nisten Tahiraj
"The benchmarks are showing very good. The coding is alright. In some of the tests that I wrote, GLM five is still above. Actually in my opinion, GLM five is better than Gemini."

๐Ÿงช Zuna - Open Source Brain-Computer Interface Model

The panel revisits Zuna, the 380M parameter open-source BCI model. Nisten notes it could work with $500 non-invasive EEG headsets, would likely need personalized training per user, and is small enough to run in real time on a gaming GPU. He's considering buying a headset to experiment.

  • 380M params โ€” small enough for real-time on consumer GPUs
  • Compatible with ~$500 non-invasive EEG headsets
  • Needs personalized training per user but fully open source
Nisten Tahiraj
Nisten Tahiraj
"This is the best thing that we have right now. I might end up just buying one of those to see if I can make it work and train it more."

๐Ÿ“ฐ Wrap Up & Outro

Alex recaps the highlights โ€” Sonnet 4.6 and Gemini 3.1 Pro tested live, Code Factory discussion, and the one-shot myth debunked. He promotes the new ThursdAI website and reminds listeners the show is available as a newsletter and podcast everywhere. Over 1,500 listeners tuned in.

  • 1,500+ live listeners
  • New ThursdAI website launched at thursdai.news
  • Approaching 3 years of weekly broadcasts
Alex Volkov
Alex Volkov
"The highlights would be probably Sonnet 4.6 and the new Gemini 3.1 Pro that we've been able to test on the show. Everything that happens on the show ends up as a newsletter and a podcast on ThursdAI News."
TL;DR of all topics covered:

  • Hosts and Guests

  • Open Source LLMs

    • Alibaba releases Qwen3.5-397B-A17B: First open-weight native multimodal MoE model with 8.6-19x faster inference than Qwen3-Max (X, HF)

    • Cohere Labs releases Tiny Aya, a 3.35B multilingual model family supporting 70+ languages that runs locally on phones (X, HF, HF)

  • Big CO LLMs + APIs

    • OpenClaw founder joins OpenAI

    • Google releases Gemini 3.1 Pro with 2.5x better abstract reasoning and improved coding/agentic capabilities (X, Blog, Announcement)

    • Anthropic launches Claude Sonnet 4.6, its most capable Sonnet model ever, with 1M token context and near-Opus intelligence at Sonnet pricing (X, Blog, Announcement)

    • ByteDance releases Seed 2.0 - a frontier multimodal LLM family with Pro, Lite, Mini, and Code variants that rivals GPT-5.2 and Claude Opus 4.5 at 73-84% lower pricing (X, blog, HF)

    • Anthropic changes the rules on Max use, OpenAI confirms it’s 100% fine.

    • Grok 4.20 - finally released, a mix of 4 agents

  • This weeks Buzz

    • Wolfram deep dives into Terminal Bench

    • We’ve launched Kimi K2.5 on our inference service (Link)

  • Vision & Video

    • Zyphra releases ZUNA, a 380M-parameter open-source BCI foundation model for EEG that reconstructs clinical-grade brain signals from sparse, noisy data (X, Blog, GitHub)

  • Voice & Audio

    • Google DeepMind launches Lyria 3, its most advanced AI music generation model, now available in the Gemini App (X, Announcement)

  • Tools & Agentic Coding

    • Ryan is viral once again with CodeFactory! (X)

    • Ryan uses Agentation.dev for front end development closing the loop on componenets

    • Dreamer launches beta: A full-stack platform for building and discovering agentic apps with no-code AI (X, Announcement)

Alex Volkov
Alex Volkov 5:43
Ha ha.
5:44
Good morning. Good morning everyone. Welcome to ThursdAI, today is February 19th, and my name is Alex Volkov. I'm an AI evangelist with Weights, & Biases. I'm your host for today and what a day, what a reason to call our show ThursdAI with breaking news. Just as we start the show. So I'm gonna add my co-host here. I have Wolfram Raven Wolf with this beautiful yellow jacket, and I have LDJ with a beautiful yellow background, almost like a jacket as well. Wolfram, how are you doing, man?
Woflram Ravenwolf
Woflram Ravenwolf 6:16
Ah, what a day.
6:17
What a day. You pour a smile to my face when I read your post about Yeah. The breaking news.
Alex Volkov
Alex Volkov 6:22
Yep.
6:22
A hundred percent. So we're gonna mention breaking news as well. LDJ is with us as well. What's up, LDJ? How you doing?
LDJ
LDJ 6:27
Yeah.
6:28
thank you. Yeah, so I'm doing great. I just got like a good 10 hours of sleep, so I'm, I'm really energized right now, feeling really good. And
Alex Volkov
Alex Volkov 6:36
let's go.
LDJ
LDJ 6:37
Yeah.
6:37
Ready?
Alex Volkov
Alex Volkov 6:38
Looking forward for some, energized tests from you for
6:40
these new models while we at Nisten. Nisten. What's up?
Nisten Tahiraj
Nisten Tahiraj 6:44
What's up?
Alex Volkov
Alex Volkov 6:47
Did you already see the news or you just jump jumping in?
Nisten Tahiraj
Nisten Tahiraj 6:50
I see the news here while we're doing it.
Alex Volkov
Alex Volkov 6:53
Yeah, this is you and everybody else.
Nisten Tahiraj
Nisten Tahiraj 6:55
Yeah.
Alex Volkov
Alex Volkov 6:56
So it looks like we're assembled enough
6:58
to start kind of chatting. obviously the breaking news that we have is Gemini 3.1 Pro just dropped from Google. Noam Shair posted and said, this is the best model yet. so now we're looking at, Gemini 3.1, pro in preview in Google AI Studio. Obviously, we're gonna cover this, very, very soon and talk about this and play with this and maybe run, few tests as well. So if you are here and you are not sure what the heck is going on with all the news in ai, we've been here for the past almost two years now. Folks, as a reminder, in the month, we're gonna have exactly three years as we started weekly broadcast of Thursday, AI to talk about everything. I recently had a chance to go back to some of these episodes. I'll tell you why later down the episode. And, the stuff we talked about back then just looks so meaningless now. 4,000 contacts window, the intelligence level of a boot. and now we're having these intelligence that's just like, come at us. And a GI is very, very close, right? I think we'll start with, we'll start with LDJ. You always have a good one. what is the biggest piece of news that must not be missed from this week?
LDJ
LDJ 8:06
Yeah.
8:07
in terms of things that we should go over that and definitely not miss this week, Zif a releasing zuna. The, the sub billion parameter model that is basically people are referring to it as thought to text. So it could take in EEG signals, which is electroencephalography from this signals from your brain, and basically, better interpret that than than previous models. And it's really interesting, really efficient, really small. I'm excited to see more work in that direction.
Alex Volkov
Alex Volkov 8:37
Yep.
8:37
we're gonna mention this thought to text. How about you, what is your top AI news, update from this week to tell folks?
Nisten Tahiraj
Nisten Tahiraj 8:45
I, I mean, it kind of went quiet.
8:47
It, it's Qwen 3.5 3.5. It's, almost, 400 billion parameters with 17 billion active.
Alex Volkov
Alex Volkov 8:52
Yeah,
Nisten Tahiraj
Nisten Tahiraj 8:52
I'm pretty excited about this one because Qwen models
8:56
tend to usually have the best, medical, performance when it comes to stuff and generating data sets and doing that at multilingual stuff, which is things in data sets that I've, I've published before, and this one has less active parameters from before it was active 22. Now it's active 17, but it's, it's a bigger model, so that means it'll actually run faster for, this type of, data gen. Well, that is when you run it with at least four GPUs or, or hpu.
Alex Volkov
Alex Volkov 9:27
Yeah.
Nisten Tahiraj
Nisten Tahiraj 9:27
So yeah, the Qwen 3.5, release one, 3.5,
Alex Volkov
Alex Volkov 9:31
five missed, almost single handedly, holding us to the
9:34
open source roots that this, the show started with, despite multiple big labs dropping releases this week as well. Wolfram, what, what about you? What is your, what is your top one release?
Woflram Ravenwolf
Woflram Ravenwolf 9:48
Gemini for sure.
9:49
Gemini 3.1. So excited.
Alex Volkov
Alex Volkov 9:52
Gemini 3.1.
9:53
Folks, the evals look ridiculous. I will say, yeah, I think Gemini for me as well, folks, this feels big. this feels big, like, like a big release. And apparently I think, OpenAI may follow up and not, not give Google even one inch of news. so we'll see today if we can have a couple of breaking news.
Woflram Ravenwolf
Woflram Ravenwolf 10:13
Feels like they're holding back and
10:15
waiting with their releases. When another one, forges ahead, they just pull out one out of their drawers and put it out there. It feels like that.
Alex Volkov
Alex Volkov 10:23
I will, I, I'll tell you a short story before the TL DR while we wait
10:26
for some folks, maybe to join us as well. I went to a meetup for Cloud Code sponsored by Anthropic and Tropic wasn't there. And I asked a few folks whether or not they have agents running on behalf of them as they're at the meetup. and not many folks have kind of intelligence churning. For them while they're there. but some do. And so I had a, a great chat with some of these folks. and I just wanted to bring this to the group here because I know for a fact, the, the other two co-hosts, they're not here right now, both Yaman and, and Ryan, they kind of like focus on having agents running all the while. And so I, I, I had a very interesting experience and, it looks like Ryan Boyle from comments also like, said he has a very similar experience. we're still in the bubble, folks. Like the, the, the stuff that we talk about here, I had folks over there that I didn't have to convince. I just told them my experience with Open Claw and only then they're like, oh, okay, then I, I may, I may need to go install this. some of them maybe are listening 'cause I, I promoted the show there as well. but it, it does feel like even, you know, even within the folks who know what cloud code is, we, there's an advantage of knowing things early. I don't know if this is your experience, guys, if you meet people outside the digital world.
Woflram Ravenwolf
Woflram Ravenwolf 11:46
Yeah, in Germany, it's, also the same.
11:48
But, I can tell you something I was really surprised to see because, you know, I come from the local Lama, Reddit, basically where I started and, recently there has been a lot of discussion about open Claw and it has been so negative. I was really shocked to see this, like, they, they, there are claims that it's not really a, a thing that it is just crypto people pushing it. And that, Peter just tried to make money and I don't understand even people are asking, they have no use case for this. I mean, this is a local LA up, Reddit, this is open source software. You can use it with local models as well. They may not be as good, but we didn't have, out of the box experience like this and people were Yeah, just like the ai, yeah, pessimist basically. I was really surprised what happened in that subreddit. Yeah. I mean, it's so different from when I was there back then.
Alex Volkov
Alex Volkov 12:38
So it's very interesting.
12:39
I as though I summoned both Yam Peleg and Ryan Carson with this question. I'll repeat the question 'cause they just came in. What's up, Ryan? basically, first of all, I would love to hear from you the one piece of AI news that we must cover, but, but also, I brought to the panel this feeling that I went to a cloud code meetup and even with that group of folks who go to a meetup to like learn about cloud code and running cloud code, and someone was there like I heard Cloud code. I love it so much. even with that group of people, many of them are kind of not following the, the, the news lately. I asked a bunch of questions of people where they, they have agents running for them while they're the meetup. Many people said, no, there's there barely, barely. A few people have, gone through the same, ringer as many folks on the show. And as I was talking about this, I, I mentioned Ryan and I mentioned Yum and Ryan, we should absolutely mention Code Factor as well. while you answer, but would love to hear from you your thoughts about this when you meet people in the real world. If, if you are, if you're so inclined, do you have that experience as well? Do you see that like we're in our bubble, like way, way, way before folks even discover some of this stuff.
Ryan Carson
Ryan Carson 13:43
If we're behind, we're in trouble.
Alex Volkov
Alex Volkov 13:44
No, we're behind by exactly a week, every week,
13:47
literally by, and we're catching up.
Ryan Carson
Ryan Carson 13:49
and hello to everyone listening.
13:50
so yeah, I actually had this experience. So I have this lovely group of guys that I meet with every two weeks. We have coffee. they're all older than me. and, I said, everybody read Matt Schumer's article before and then we're gonna discuss it. And they, many of them had seen it before, number one, which is interesting 'cause they're definitely normies. and then they, they basically, it was a mix of kind of, it was actually mostly fear and dread. It was, it was no optimism. and, you know, they're worried about their kids and their grandkids. I just said, y'all, I care about you and I care about your kids and your grandkids. And the thing you need to tell them is they need to sign up for a paid account on either Gemini, Claude, or Chad, TBT, and use it. And you have to use this stuff. so there's a lot of dread out there that, I mean, I think we're very lucky to be in this bubble because, we're probably gonna benefit massively from it. And that kind of goes on to code factor. You know, I, I saw the article by OpenAI. really documenting how they built a code factory using Codex and we all knew about, and we've all been trying to do these things. So I set two, three days side and I got my repo completely ready as a code factory. and, you know, I'm shipping a lot. In fact, I'll be shipping while we talk, you know,
Alex Volkov
Alex Volkov 15:13
I, I dunno if you guys remember from, from last stream, at
15:16
some point we talked about the, the psychosis and the AI vampire, like we're working more and more and more. And I was like, isn't the whole point of this to work less? and so I want us to all to get to a point where we can choose to work less if we want to, but it doesn't seem like it's happening.
Ryan Carson
Ryan Carson 15:32
I'm working more than I ever have in my whole life.
Alex Volkov
Alex Volkov 15:35
Yeah.
Ryan Carson
Ryan Carson 15:35
and
Alex Volkov
Alex Volkov 15:35
that, well, you, you also left a, a job to become a startup founder.
15:38
So there's that. Yeah.
Ryan Carson
Ryan Carson 15:39
And by the way, I think I'm gonna close my seed round this week.
Alex Volkov
Alex Volkov 15:41
that's good.
Ryan Carson
Ryan Carson 15:42
I got, got my lead investor, so, yeah, that's good.
15:44
It's more. And my wife is traveling this week, and so I'm literally like, I, I've showered today, but that, that, that's a win.
Alex Volkov
Alex Volkov 15:54
That's why I asked whether or not there are other
15:56
people that you talk to besides the family in the work, Yam Peleg. What's the one piece of AI news that we absolutely must cover based on your experience,
Yam Peleg
Yam Peleg 16:05
man.
16:06
OpenAI, just acqui hired the goat. Peter Steinberg.
Alex Volkov
Alex Volkov 16:12
Yeah.
Yam Peleg
Yam Peleg 16:13
it's interesting.
16:14
It's the entire, I think it's fascinating, the entire thing about open claw, how simple everything is at the end. Like how extremely simple harness, but you know, well thought out can, I don't know, just. Make the, this emergent phenomena of self-improvement, extreme great memory, personality. So on I, I, I, I just wanna say that I completely agree with the entire, piece by Ryan, about, harness engineering and so on, factory code factor and so on. Absolutely.
Ryan Carson
Ryan Carson 16:52
yeah.
16:52
And Yam, it's funny 'cause I've been seeing your screenshots of your like matrix set up. and it makes me laugh. And I also think every time we do this show, how can you be the most advanced person in the world? And you have shitty wifi. Whoa.
Yam Peleg
Yam Peleg 17:06
Oh yeah.
Ryan Carson
Ryan Carson 17:08
I just don't understand.
Yam Peleg
Yam Peleg 17:10
Oh yeah, you're absolutely right.
17:12
What
Ryan Carson
Ryan Carson 17:12
is going on?
Alex Volkov
Alex Volkov 17:13
absolutely right.
Ryan Carson
Ryan Carson 17:13
I wanna pile on with ya really quick and say, I totally agree.
17:16
Obviously open Claw going to OpenAI is, is probably the news of the week. And then I've piled on, and you probably saw this, like my rant about Anthropic and they're just complete lack of derel. except for Ari and I, I, except TAR is doing a good job mostly, but, but it feels like ghosts and the fact that Dario doesn't follow anybody, it's just like, guys just do dere. This is not that complex.
Alex Volkov
Alex Volkov 17:41
Yeah.
Yam Peleg
Yam Peleg 17:42
At end, can we use, oau or not use oau?
17:45
I'm not sure what's the end.
Alex Volkov
Alex Volkov 17:47
This is, I, I don't think there's a clear answer there, but,
17:50
this is definitely something we should get into, in our, in our, section called Tools and the gente coating. I don't know what to call this collective psycho, I don't know what to call this, but we definitely, we definitely must talk about this. This is like a huge thing that happens. The show usually is news folks, but there's just so much news happening with the world. I saw multiple companies now deliver agent stuff just by like repackaging this open source. There's like so much happening in the world of that. And it feels like most of us are clo like at least four people I know on the stage are running OpenClaw right now, doing stuff behind the scenes for them, and many others as well. So definitely we're gonna, we're gonna dedicate a section, but not go fully crazy. Alright, so, go ahead. go ahead Ryan, and then Wolfman. We'll continue. I,
Ryan Carson
Ryan Carson 18:34
I think you should just call the section bonkers time and We'll, you
18:37
know, we'll just talk about the, I mean, like, I went over to a friend's house for an hour and Gemini three, one drops. I'm like, God damn it.
Alex Volkov
Alex Volkov 18:44
Yes sir.
Ryan Carson
Ryan Carson 18:44
I can't keep up.
Alex Volkov
Alex Volkov 18:46
Yes.
18:46
So we'll hopefully have, Friends joined the show to talk about G 9 3 1, but for now, let's go to Til Wolf.
Woflram Ravenwolf
Woflram Ravenwolf 18:53
I saw very interesting post by, Eric S. Raymond.
18:56
You know, he's one of the open source, founders, basically who pushed it, way back then in the nineties. he posted as a veteran programmer that he loved the current state of ai where he feels like a wizard. You know, that term was popular way back then and we were all wizards when we were doing Linux. And you talk to your familiars, you summon your demons and you have them do your bidding, basically mess your will in code. And I think that's a great way to look at it. And yeah, the language has changed. Now we are talking to AI instead of programming assembly, but it is still the same thing. And people who adapt to that, they will always flourish. I think.
Alex Volkov
Alex Volkov 19:32
All right.
19:32
So folks, TLDR, and, and then we're gonna dive in.
19:46
welcome to T-L-D-R-A section where we basically run through every piece of AI news that matters for the past week. today is February 19th. I'm your host Alex Welcov with Weights, & Biases from CoreWeave. with me on the show today, LDJ Wolf and Raven Wolf Nisten to hear Ryan Carson and Yam Peleg. A full panel of us, multidisciplinary panel of AI experts and agent. Orchestrators, let's call us. we have a full show this week, folks. There's a lot of stuff that happened. let's run through them super quick. In open source. We have Alibaba, our friends from Togi Lab released Qwen 3.5. It's a 397 billion parameter, open weight native multimodal, MOE, and we're gonna cover what that means. we all were collectively waiting for deeps sick and deeps sick did not come. There's, the Chinese New Year basically, and all of the major labs, I believe that all of them were waiting, for, something to, you know, to, to like be six. So all of them released last week. We talked with, folks from, from ZI about GLM five, and we talked with Minimax about Minimax, 2.5, and then Alibaba also released theirs. So basically all major labs including released models besides deep seek. That's very, very interesting. no Deeps came. Cohere Labs released a tiny ia, a 3.3 billion multilingual model. Family supporting 70 languages. definitely worth noting that if you're into, small models, now to big companies, LMS and news. There is a bunch. So let's start. yam mentioned this, I I even forgot this, was this week. open Claw founder creator Peter Steinberger, joins OpenAI in what could be the first single founder billion dollar deal. We don't know the detail, we don't know the price. He didn't mention he went on Lex Freedom's podcast to talk about some of it. but wow, what a turn Open Claw is now part of OpenAI. OpenAI is bringing back open with this acquisition. Not only that, they will allow you to use your, your pro plan, with open cloud. We, we should mention this as well. and Tropic started this week. Go ahead.
Woflram Ravenwolf
Woflram Ravenwolf 21:53
Yeah, I just want to say that Tropic has, basically
21:56
said that it's not allowed, they made it very clear that it's not allowed So that is a big difference here.
Alex Volkov
Alex Volkov 22:01
Yeah.
22:02
Anthropic updated their terms of use and everybody freaked out because apparently you're not supposed to use your plans. basically I think the subsidized plans are crazy. For the $200 a month you get like. $3,000 worth of tokens if, if people like, correctly counted them. and then they updated them again and said that you, you are able to use this with the, with the SDK. we would not recommend you to run, your bots connected directly to your Anthropic plan. however, you know, we're, we're still waiting for them for a full confirmation. But yeah, earlier this week, while this was going on openly, I confirmed that you can in fact use your plans, with, with, open Cloud. All right. Let's, let's continue. ByteDance releases, Seed 2, Bytedance the, you know, the TikTok company, they release C two, which is a frontier, multimodal, LLM. And it's really surprising in the, in the evals, which definitely should take a look at this. I dunno why you would use this if you are western oriented, because it's not open source. But it's, it's very surprisingly, surprisingly good. And obviously the big breaking news that we have, and, I know we're in TLDR, but we, we have to do this, Gemini 3.1 launch, AI breaking news coming at you only on Thursday. I.
23:24
So I think that this is the, the, the breaking news from today. Gemini 3.1 Pro is live on the Ice Studio. Let's go. Biggest and baddest, LLM around, besides the deep think version that they had, just a week ago. Noam Shazeer post about this folks, 44% of humanities last exam, 77 RKGI, terminal bench at 68. Also, state of the Arc, besides Codex 'cause uses other tools and harnesses. Just an incredible, incredible release from Gemini. Folks. shout out to the whole team for this effort. We're definitely gonna dive into this. Try to understand what's going on and when we can, use this model. So, Gemini 3.1 Pro, from the big companies. we also have, have this here, even GR four 20 finally released, Grok, four 20 from XAI released, last week we told you a little bit about the shake up in X AI's leadership, a post acquisition with SpaceX, and now, GR four 20 released. supposedly it don't even leak the, the, the size. I don't know if you guys saw it. It's kind of like, it's not a mixture of experts. It is Angen model with four agents running, and they're running 500 million parameters, sorry, 500 billion parameters. and with like four agents that all together talk, they, they did not release one eval. One single eval was not released with this model. it's really hard to judge, X Excel releases I've played with this one. not, not amazing.
Yam Peleg
Yam Peleg 24:48
Oh, how is it, how is it like, not amazing like the other gr
24:52
because they are not bad, you know?
Alex Volkov
Alex Volkov 24:54
Yeah.
24:54
GR 4.1 I use all the time. I talked about this last week on the show, Ryan, if you remember. grog four 20 is not amazing. I, I, it's hard to explain to me why, but like the, the mixture of experts that they talk to each other, the agents, that's not what I want from a major lab. I don't want them to, you know, and if they do this, have, have them do this in background.
Yam Peleg
Yam Peleg 25:14
today, you don't care.
25:15
I think how it works,
Alex Volkov
Alex Volkov 25:16
I mean, yeah, deep think, kind of works like this,
25:19
like deep research works like this, but like, I think they're making a show of it in the UI. And, and then I, the responses that I got were, you know, I, I wasn't blown away. but it is really hard to, to get blown away lately.
Nisten Tahiraj
Nisten Tahiraj 25:31
felt the same way.
25:33
It just started like asking me a lot of questions back and it was very weird. It was very on and off. I had to prompt it a few times to, to get the right thing. It's just, it's really strange. Yeah, strange. I, it can be good 'cause I tested it quite a bit. I, I didn't try to do any tool calls or anything like that, but it, it is, this one's weird.
Alex Volkov
Alex Volkov 25:56
yeah,
Nisten Tahiraj
Nisten Tahiraj 25:57
not amazing.
Alex Volkov
Alex Volkov 25:57
in this week's buzz, our own Wolfram is going to walk us through
26:01
some benchmarking things that we did. I'm looking forward for that, Wolfram. And then we're gonna go to, vision and video. ZR releases zuna a 380 million parameter, BCI foundational Model BCI stands for brain Computer interface, and EEG. The reconstructs, brain signals from sparse data. That's very interesting. in other news from voice and audio, Google DeepMind released Lyria three, music Generation model. it's now available in G App so you can generate music. We've had a few of them, on the show before. There's a great one in open source called Ace Step. but now, Google is, launching their own as well, in tools and gen decoding and whatever we're gonna call this segment, or Ryan is viral once again with Cofactor. I would love to hear from, from you, Ryan, about what. It has changed in how you are preparing this. Basically, if you run through Code Factory for us, it'd be great. I would love that. Ai, launched fairly, fairly to, to big, to big, noise. So we'll see. And even friends of the public next week are sharing, good stuff about this. I think that this is most of the show this week. We may have, a few guests joining. we'll see.
Ryan Carson
Ryan Carson 27:10
I think you nailed it.
Alex Volkov
Alex Volkov 27:10
Okay.
27:11
So I think we'll start with big companies in LMS first. many, many folks, I, I know that we've been hired in open source, recently, but just there's so much to talk about here. So we'll spend the next hour, or so, maybe next, like 45 minutes to talk about the big companies and LLMs. And obviously we should start with the biggest breaking news, Gemini 3.1 folks. It just dropped. so let's, let's start with Gemini 3.1 Pro. literally just dropped on, on, the Google AI studio right now. Let's take a look at these. In terms of to hit the breaking news button again, let's take a look at these emails. They, they are not left behind. So I, first of all, a few things, the same price, same price point, but with significantly better performance. very interesting. They didn't launch Flash. They launched, pro
Ryan Carson
Ryan Carson 27:58
the problem is like, as soon as I don't see bold on Swyx bench
28:01
verified, like I'm not interested.
Alex Volkov
Alex Volkov 28:05
Yeah.
28:05
tell us about this, Ryan.
Ryan Carson
Ryan Carson 28:06
I mean, you know Yeah, yeah.
28:08
Humanities, last exam, a fine, blah, blah. I, I don't care. Like, all I want is Swyx bench verified and Swyx Bench Pro. Like, and, and they're, they're not Top and Logan and team I love you, but I'm not interested. I'm just gonna stay on Codex five three.
Alex Volkov
Alex Volkov 28:21
So this is 0.2% difference.
28:24
You see this, right? So the, the, the only, one comparison to that that we have here is, 80.8 for Opus 4.6 and 56.2 for, on, on the pro version for GPT five three Codex. So basically. Very, very closely tracking, if not state of the art for all for those two Swyx bench verified, but Terminal Bench Pro, we we're gonna talk about terminal bench and its importance for Wolfram. But Wolfram, what'd you think about this one? Look at this
Woflram Ravenwolf
Woflram Ravenwolf 28:54
Yeah, because it's interesting.
28:55
I mean, Ryan, you are basically, agent engineer. Of course the programming benchmarks are the most interesting to you. For me personally, terminal bench is more interesting because it's often, put in the coding category. But what it is actually, and we will look at it later, is, agent stuff like what we are asking of open claw. And since I am using AI mostly as an assistant in that way, and not just as a coder, since I'm more an ai, evangelist, not an engineer, that is why I'm most interested in the terminal bench scores. And I'm excited to see that this is now the top score here. It even better the just, released, Opus 4.6. So, yeah, I'm excited about this and how it'll work in the open claw and other agentic stuff.
Alex Volkov
Alex Volkov 29:37
Very interesting reporting here, Wolfram that they
29:39
have determinants to harness and then other self-reported harnesses, right? So like, they, they differ in scores here and they say that, you know, codex gets a higher score in terminal bench, significantly higher score, 10% higher score. But it's because it's using like a different, a different, harness. Yeah.
Woflram Ravenwolf
Woflram Ravenwolf 29:55
It's using itself.
29:56
So it's a Codex model in its own harness. A Codex harness. And that is why they have the scores for these. And that is the state of the art actually. The, the, codex model. And the Codex Harness has the highest score. Yeah. Basically very high.
Alex Volkov
Alex Volkov 30:11
the one notable thing that I wanna talk about, LDJ,
30:13
I saw you raised your handler. We'll get to as well in a second M-R-C-R-V two eight needle, deep kind of like context. Long, long context. check. This is a very important one, especially for folks running agents and they just like stack everything in their memory versus like running new ones with context. MRCR is basically the quality of recall versus long context. And it looks like they are now showing the Gemini 3.1 pro is state of the art together with the latest sunnet, which also has 1 million tokens. very interesting. They say not supported here. Song at 4.6 just definitely supports 1 million tokens. so we found the bug in, in the evol table, but
Woflram Ravenwolf
Woflram Ravenwolf 30:52
So maybe if that's a reason,
Alex Volkov
Alex Volkov 30:53
could be.
30:54
Yeah, could be. but
Woflram Ravenwolf
Woflram Ravenwolf 30:56
everyone,
Alex Volkov
Alex Volkov 30:56
No, look, look at, look at this poor performance
30:58
for 1 million, needles.
LDJ
LDJ 31:00
Yeah.
31:00
So a couple of things. on the MRCR V two eight needle, I do actually have scores for Opus 4.6 at 1 million context, which does show it's significantly better than, than Gemini here for Opus. yeah. So I just linked it in chat here. Nice. Okay. and. I do think benchmarks aren't everything. And from what I've seen in, in just some, a variety of tests, like, especially like the SPG tests and like a Minecraft bench where they have the different models, you know, build things in Minecraft. It does seem like Gemini definitely has some interesting agentic abilities here and there that might not be fully captured by the benchmarks.
Alex Volkov
Alex Volkov 31:37
Yeah.
LDJ
LDJ 31:37
But here we could see at 1 million context for Opus 4.6, it gets 76% and
31:42
this is also MRC, RV two eight needle.
Alex Volkov
Alex Volkov 31:46
Look at this.
31:46
Incredible difference. Yeah. What could explain, so just for folks who are just listening, LDJ just sent us a confirmed screenshot, I think from Opus 4.6 on MRCR on 1 million gets 76% recall, for long context retrieval. While the new Gemini 3.1 Pro, self-reported from Google gets 26%. So 76 versus 26%. This seems very, very interesting. This is almost like a bug. Interesting. So first of all, Gemini folks did not report the, the 1 million, scores for, for Anthropic in their comparison. but they did report the, the 1 million, the, the regular ones.
LDJ
LDJ 32:29
Mm-hmm.
Alex Volkov
Alex Volkov 32:29
So,
LDJ
LDJ 32:30
to be fair, like the 26% number here is actually pretty typical,
32:35
and I think even just like 3, 4, 5 months ago is actually really good. So yeah, like Opus 4.6 is actually unusually high in this regard. I think, if I remember right, I wanna say 5.2 or 5.3 from OpenAI is like around 50 ish.
Alex Volkov
Alex Volkov 32:51
I think something is wrong with, with one of those tables.
32:54
Okay. So look at here. Opus 4.6. On the Google's table, they report 84%, for 128 average, tokens, 128,000 tokens. In your screenshot that you just sent, Opus is reported at 93% for 256 K. Yeah. So, so like generally looks like they're under reporting because the highest score belongs to Opus based on the screenshot that you sent. Unless they have looked at different things. So that's very interesting.
LDJ
LDJ 33:25
noticed if you go all the way to the left in small text,
33:28
Gemini's Benchmark does report. It says average it's reporting for 1 28 K and then point-wise for 1 million. Yeah. I'm not sure why they chose to do, different measurements for the different context, but that might be a discrepancy of like, maybe Opus is 2 56 K figure. Maybe that's using a pointwise value. I'm not sure.
Alex Volkov
Alex Volkov 33:49
Well, we'll get to the benchmark, segment and Wolfman will
33:52
tell us that it's extremely difficult to pinpoint specific, exact and comparisons.
Nisten Tahiraj
Nisten Tahiraj 33:57
It's not, For me, I tried the same thing that we threw at, same
34:01
Martian thing that we, that we threw at. You mean the
Alex Volkov
Alex Volkov 34:03
Gemini model?
Nisten Tahiraj
Nisten Tahiraj 34:04
I'm just trying it right now in, in AI studio.
34:06
So the, the prompt is calculate how long a mass driver we need to be, like, like a game. And, we'll, we'll just try it again here. And, you, you're gonna see it is extremely fast. Like this is gonna be done. In about 20 seconds. I ran another version here. And, that one is just not,
Alex Volkov
Alex Volkov 34:28
nothing rendered,
Nisten Tahiraj
Nisten Tahiraj 34:29
you'll see it up here in a sec. It's, it, it, it
34:34
writes the components and stuff. it is extremely fast. So I'll give it that. Let's just give it another five.
Alex Volkov
Alex Volkov 34:43
is it writing multiple files in parallel or just like the one file?
Nisten Tahiraj
Nisten Tahiraj 34:46
no, just just one file at a time.
34:48
I think it's almost done there. It's done.
Alex Volkov
Alex Volkov 34:50
Oh, wow.
Nisten Tahiraj
Nisten Tahiraj 34:51
So, so we can do a
Ryan Carson
Ryan Carson 34:53
was fast.
Nisten Tahiraj
Nisten Tahiraj 34:54
it's fast, but it often just doesn't work.
Yam Peleg
Yam Peleg 34:57
that's the question.
34:58
Yeah. The question if it's good. Yeah.
Nisten Tahiraj
Nisten Tahiraj 35:00
Yeah.
35:00
So let's see if we can go back here to the one.
Yam Peleg
Yam Peleg 35:04
oh,
Alex Volkov
Alex Volkov 35:04
something happened and then just the launch didn't work.
Nisten Tahiraj
Nisten Tahiraj 35:06
Yeah.
35:08
So, I, I can just restore the first one because the first time it did do it.
Alex Volkov
Alex Volkov 35:12
Okay.
Nisten Tahiraj
Nisten Tahiraj 35:13
so we can kind of compare here to how we had it last time.
Alex Volkov
Alex Volkov 35:17
Oh,
Nisten Tahiraj
Nisten Tahiraj 35:17
nice.
35:17
And I mean, it did make it. Okay. I'll give it that.
Alex Volkov
Alex Volkov 35:20
Yeah.
Nisten Tahiraj
Nisten Tahiraj 35:21
But again, it's, it's not quite there compared to, Codex or, or 5.2.
Alex Volkov
Alex Volkov 35:27
we see a simulation, right?
35:28
Like we see orbital launch thing happening, we see speed. This looks more like a game than the other ones that we've tested that look more like a simulation like strategy thing.
Nisten Tahiraj
Nisten Tahiraj 35:38
yeah.
Alex Volkov
Alex Volkov 35:38
And it has like a full,
Nisten Tahiraj
Nisten Tahiraj 35:39
I'll give you this.
35:40
it's not that bad, but
Alex Volkov
Alex Volkov 35:41
it's not bad at all.
Nisten Tahiraj
Nisten Tahiraj 35:43
I'm not impressed compared to what Opus and, and codex did.
35:47
They had a fully working one with like trajectories and stuff. And this is like, I'm just stuck here. Opus 4.6, but it did make an actual simulation and I I, I would see the, uh, like the orbit and it, it felt good. Like it, it felt like something that was somewhat finished and, and correct and, and this one, maybe it's just Google's harness. It, it's, it is also likely that plays a big factor too, but yeah, it's not passing the initial vi track.
Alex Volkov
Alex Volkov 36:18
all righty, LDJ,
Nisten Tahiraj
Nisten Tahiraj 36:18
other
Alex Volkov
Alex Volkov 36:19
comments?
Nisten Tahiraj
Nisten Tahiraj 36:20
Yeah, even client build one.
36:22
So
LDJ
LDJ 36:23
yeah, on the terminal bench two scores, I noticed they were specifically
36:27
saying, so they showed bold, which indicates that they have the best score, but that's on the Terminus harness. but when it comes to actual custom harnesses or the best harness for each model, which I think at the end of the day, that's what ultimately matters. Seems like that's not reported, but I recall Codex and the Codex Harness, no, not T two Bench. yeah, bench. Yeah. There we go. See, so in the Codex Harness itself, OpenAI with GPT 5.3 gets 77%. Mm-hmm. And I would say that's the true highest score here. Just in terms of like, what we ultimately want is in the best conditions, what do the models get if you put them in the best harness? Right?
Alex Volkov
Alex Volkov 37:05
Yeah.
37:06
So, folks who are listening and have no idea what harnesses are basically a set of prompts to tell the model how to behave. And the folks who are building the models, they know the best prompts for the models that they release. they iterate on on them. Many of them put the same prompts kind of in, in, in the training in RL as well. So the harnesses from this is why cloud code slaps so much for so many people. Ryan, go ahead.
Ryan Carson
Ryan Carson 37:30
I honestly, think that this is why the agent labs are struggling.
37:34
every single agent lab is fighting this natural flow where the model is made by the lab and then the harness is made by the lab. It's just hard to believe that they're not gonna do the best job at implementing a harness around their own model, right? So this is why Codex CLI and the Codex Mac app are so good with Codex, and this is, like you said, why Opus 4.6 is so good with Claude Code. I think it's gonna be hard to beat that. I think it can be done, but it's like fighting gravity.
Alex Volkov
Alex Volkov 38:06
It's, it's really an interesting choice to work in an
38:08
agent lab when all of these labs saw the success of Claude Code, which is now a $2 billion business, whatever, and which started as a side project. It's like a TUI, a set of prompts and a terminal UI on top of, you know, their models. And Anthropic goes all the way.
Wolfram Ravenwolf
Wolfram Ravenwolf 38:27
It's also in OpenClaw already.
Alex Volkov
Alex Volkov 38:28
What, what's the
Yam Peleg
Yam Peleg 38:29
Continue.
38:29
Yeah.
Alex Volkov
Alex Volkov 38:30
Yeah.
38:30
What's the, what's the first vibe check for you?
Wolfram Ravenwolf
Wolfram Ravenwolf 38:33
I had it build a website and yeah, it's cool.
38:36
I even told her to get her avatar from online and she did it, and yeah, it looks great.
Alex Volkov
Alex Volkov 38:41
Yeah, we wanna see the one-shot ability of this.
38:44
Although I posted this week, I, I believe the one shot is a myth, and I would love to talk to you guys about why I feel like this, especially for front end
Ryan Carson
Ryan Carson 38:53
dude, whoever says one shot is like, doesn't
38:55
actually use these tools or
Alex Volkov
Alex Volkov 38:57
No.
Ryan Carson
Ryan Carson 38:57
Understand anything about anything, like nothing
39:00
is one shot in the real world.
Alex Volkov
Alex Volkov 39:04
so much of this is just like, you need the human to be in the loop.
39:09
You need the human to give directions on how exactly you want it. No matter how intelligent the models seem to be, the decision-making part seems not to scale together with the ability to code, at least as far as I can see. Right? What do you think?
Ryan Carson
Ryan Carson 39:24
I'm just about to close my seed round.
39:25
I've got cash in the bank and should be getting ready to hire. But the truth is I'm not. I'm gonna continue to be like the 10x engineer on the project until I start to break. And then what I'll do is, instead of hiring two or three engineers, I'm gonna hire one and pay them very well, way above market, right? But I'll expect them to be running agents 24/7/365 and be probably at least 5 to 10x the output and effectiveness of what I would expect from a normal SWE. And this is real; I feel like I live in the trenches on the front line. So I think that's where we're going: software engineers are not gonna lose their jobs, but no new software engineers are gonna be hired en masse. And the problem is I need experience. I need the person to have the five to ten years of experience in the real world building real stuff. I think that's a really gnarly thing that's gonna happen.
Alex Volkov
Alex Volkov 40:23
You need experience both in software engineering and
40:26
also agentic kind of approach. So you need folks who've been around kind of like most everybody here on this panel, but also are agentic to their core to be able to use that experience and then multiply this with those tools.
Ryan Carson
Ryan Carson 40:39
Yes.
40:40
And the problem is a lot of those folks are gonna wanna start their own startup. Like they're not gonna wanna work for anybody. So it's gonna get gnarly, I think pretty fast.
Alex Volkov
Alex Volkov 40:50
Right.
40:51
Folks, we have to move on, otherwise we're gonna sit here the whole day. Anthropic released Claude Sonnet 4.6. Sonnet is kind of the everyday-use model from Anthropic; before the Max accounts, Sonnet was the day-to-day model for many, many people, and with Max accounts we all kind of switched to the most expensive, most intelligent Opus 4.6. But Sonnet, for the longest time, has been the daily driver for many, many folks, and Sonnet 4.6 still holds up; it's a great model. It supports a 1 million token context window, it gets 79.6 on SWE-Bench Verified, very close to state of the art, almost 80%, and 72 on OSWorld; it's great at computer use. It's now the default model on claude.ai, and I think it's also the model now for everybody who doesn't pay yet on Claude. The 1 million token context window is in beta, so you can send entire codebases into it. Pricing is $3 per 1 million input tokens and $15 per 1 million output tokens, still not super cheap, but you get 90% savings via prompt caching as always, so definitely choose a harness that has caching. What do we think, folks? Did you try the new Sonnet? What do we think?
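For context on those prices, here is a back-of-the-envelope sketch, in Python, of what the quoted ~90% prompt-caching discount means for a long agent session at the $3 / $15 per-million-token rates mentioned above. The token counts and the exact billing mechanics are illustrative assumptions, not Anthropic's published billing formula.

```python
# Rough cost comparison at the quoted Sonnet 4.6 prices ($3 / $15 per million
# input / output tokens), with and without the quoted ~90% discount applied to
# cached input tokens. Token counts below are made up for illustration.

INPUT_PER_M, OUTPUT_PER_M = 3.00, 15.00
CACHE_DISCOUNT = 0.90  # assume cached input tokens bill at ~10% of the normal rate

def session_cost(input_tokens, output_tokens, cached_fraction=0.0):
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    input_cost = (fresh + cached * (1 - CACHE_DISCOUNT)) * INPUT_PER_M / 1e6
    output_cost = output_tokens * OUTPUT_PER_M / 1e6
    return input_cost + output_cost

# A hypothetical session that re-sends a large codebase context on every turn:
print(session_cost(5_000_000, 200_000))                       # ~ $18.00 uncached
print(session_cost(5_000_000, 200_000, cached_fraction=0.9))  # ~ $5.85 with caching
```

The exact savings depend on how much of each request is a cache hit, which is why the hosts keep saying to pick a harness that actually uses caching.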
Ryan Carson
Ryan Carson 42:17
I haven't had time.
Alex Volkov
Alex Volkov 42:18
So when you say you haven't had time, I just wanna like, clarify,
42:21
you're using Opus, and so switching back to Sonnet is an opportunity cost to you, right?
Ryan Carson
Ryan Carson 42:26
So this is actually, I was thinking about writing about this.
42:29
I do think at some point you have to pick, yes. And so I've decided I'm all in on the OpenAI stack. I just am, top to bottom. And as tempting as these trinkets are, it completely destroys my productivity.
Alex Volkov
Alex Volkov 42:45
real.
Ryan Carson
Ryan Carson 42:45
and, and if I was gonna be using an Anthropic model
42:48
it'd be Opus anyway. So I just don't care about Sonnet 4.6, unless I was in my app and I needed a specific inference reason to use it.
Alex Volkov
Alex Volkov 42:56
Yeah.
42:57
LDJ, what about you?
LDJ
LDJ 43:00
Yeah, so I tested it a bit.
43:01
I mean, it's hard to say cool new things about things that are not frontier models like the Opus models, but I'd say it just kind of feels like a smaller Opus 4.6, basically. It does feel, though, based on the benchmarks and some of my tests, like maybe it's trained for a lot longer. So maybe fewer parameters, but trained for a lot longer than Opus, in terms of more iterations, more chances for RL feedback during the training loop. And if you actually look at some of the benchmarks, what's interesting is that Sonnet 4.6 actually scores slightly higher than Opus 4.6 on some of them, which I think is maybe because of this: because it has fewer parameters, they might have had the opportunity to train it for much longer than they did Opus.
Alex Volkov
Alex Volkov 43:49
So what I have in my notes is that Anthropic said in Claude Code
43:52
early testing, and they sometimes test behind the scenes for people, so sometimes when you get results from Claude it's because they're testing, users preferred Sonnet 4.6 over 4.5 roughly 70% of the time, and preferred Sonnet 4.6 over Opus 4.5 59% of the time. So almost 60% of the time, when people got the new Sonnet 4.6 versus the previous Opus, they preferred Sonnet, which is supposedly faster as well. I've been testing this within OpenClaw. I switched to it in OpenClaw, and I fairly quickly switched back to Opus again.
Wolfram Ravenwolf
Wolfram Ravenwolf 44:29
the quota is also higher even if you
44:30
are on the subscription. The quota for Sonnet is higher. So if you are doing multimedia stuff, like ingesting images and so on, you can quickly blow the context and get surprised by rate limits. If you use Sonnet, which is also multimodal, you can do more before you hit them.
LDJ
LDJ 44:47
I did pull up the Sonnet 4.6 versus Opus 4.6 benchmarks, which
44:51
I linked in the StreamYard chat. And I am confirming with my own eyes here: on Finance Agent v1.1, agentic financial analysis, Sonnet 4.6 does score a couple of percentage points higher than even Opus 4.6.
Alex Volkov
Alex Volkov 45:05
Yeah.
LDJ
LDJ 45:06
As well as on GDPval, it also scores a bit higher in ELO
45:10
than Opus 4.6 there as well. Yeah.
Alex Volkov
Alex Volkov 45:13
Sonnet beats the previous Gemini 3 Pro in
45:16
pretty much every benchmark. The smaller version beats the previous flagship; guys, we're moving so fucking fast, what's going on? This feels like the same strategy all these companies have: they release a big model and then they use it to distill into smaller models. We've seen this over and over again. Sonnet has the same vibe as Opus as well; it has the same kind of soul, the same fun-to-talk-to quality, so that's very interesting. You can absolutely use one of these Sonnet models to run Codex for you. If you feel what Yam and Ryan here feel, that Codex is the best coding model but you don't necessarily wanna talk to it directly, you can have a middle agent talking to your lower agents, have them code, then review their results and write prompts for them. That's definitely a thing. Obviously, if you want to get super good at it, you need to talk to the intelligence directly and not delegate. But yeah, 4.6 is also multimodal, though not fully multimodal, right? It has MMMU in there, it can see images, but I don't think it understands videos yet, for example. It has visual reasoning that looks very, very good. All right, folks, so that's Sonnet. Any other comments about Sonnet before we move on? Folks are asking in the comments: is Sonnet a more expensive model versus Opus? No, Sonnet is much cheaper than Opus, much cheaper; it looks like at least two times cheaper, depending on the context length. The reason it gets a dedicated usage bar in the Claude app is because it's cheaper, and Opus is the higher one. Everybody switched to Opus because of this release cycle, that's what Anthropic did, and because of OpenClaw. But Sonnet is much cheaper, it costs significantly less money for Anthropic to run, and so it gets its own usage bar. I think it's time for
Yam Peleg
Yam Peleg 47:14
us, but you can blow up your quota really
47:18
quickly if you go with long context. It also burns your quota much quicker, so you really need to pay attention to how much context you actually use.
Alex Volkov
Alex Volkov 47:29
All right, super quick coverage of ByteDance's Seed
47:32
2.0. Seed 2.0 is a frontier multimodal LLM with Pro, Lite, Mini, and Code variants. They compare it against GPT 5.2, which is no longer in the race at all, and Claude Opus 4.5, but the pricing is insane. So here's the thing with the Chinese releases: ByteDance now steps up to be one of the leading AI providers from China. They join Ernie, they join Qwen and Alibaba, and a bunch of others; there's a whole universe happening there. For us Western folks, I don't necessarily see a reason to run this, but the reason for many price-conscious folks is the price. Seed is very close to GPT 5.2 quality, maybe even Opus 4.5, but 84% cheaper than Opus 4.5. So if you're absolutely price maxing, which, if you are only starting out, is probably what you should be doing, this could be a great model for you to try out. You have to register with ByteDance, though. It is multimodal with video understanding, and I think this is the highlight. As far as I know, Gemini is the only video-understanding model we have access to from the big ones. OpenAI has video understanding but never released it to customers; we were able to trick it with some WhatsApp forwarding tricks, if you guys remember. So the OpenAI models definitely understand video, they just seem to not give that ability to people. ByteDance's model has video understanding that surpasses the human baseline: 77% versus 73% for humans. So if you need video understanding at that price, this is probably a very, very good model for you to try,
Nisten Tahiraj
Nisten Tahiraj 49:15
actually.
49:15
Kimi can do up to 30 minutes.
Alex Volkov
Alex Volkov 49:17
Oh, the latest, of course.
49:19
Yeah, yeah. K2.5 is multimodal. Yes.
Nisten Tahiraj
Nisten Tahiraj 49:22
Yeah.
49:22
And it can do like 2048 by 1080 video.
Alex Volkov
Alex Volkov 49:28
That's right.
49:29
Thank you, Nisten, that's a great call-out. Kimi, on their service, can also do video. What else about ByteDance is interesting, folks? Let's take a super quick look here: MMMU Pro and MathVista are very, very high; this seems like a very good model, with math reasoning very high as well. On agent performance they say it beats Opus 4.6. The thing that is very clear after doing ThursdAI for such a long time is that no matter how performant these models are on benchmarks, sometimes the real-world use is just not there. We've seen this time and time again. We talk about benchmarks because, well, that's what we have, and obviously all of us are busy; we don't have time to test all the models, and then at some point when we do get to test them, some of them fall completely flat. Kimi K2 is an example: on benchmarks it maybe loses to GLM 5, but it's just much nicer to talk to, kind of similar to Opus. So there's definitely that, but we're gonna keep you up to date about the releases. It's very interesting that Seed 2.0 and Seedance have very specific naming criteria for their models. We also talked about Anthropic changing the rules; we should probably mention this again, because we only covered it in the TLDR. At some point yesterday or the day before, somebody noticed Anthropic's documents changed to say it's against the terms to use the OAuth with the Max accounts for anything besides Claude Code, including using their Agent SDK, and I think they reverted it back. So right now we are not a hundred percent clear on where we stand, but Anthropic definitely fumbled the bag a little bit with both OpenClaw and now enabling people to use their Max accounts.
Wolfram Ravenwolf
Wolfram Ravenwolf 51:14
I mean, but the Chinese labs have explicitly
51:17
stated that their subscriptions can be used with this, like I did.
Alex Volkov
Alex Volkov 51:20
Yeah.
Wolfram Ravenwolf
Wolfram Ravenwolf 51:21
So basically everybody except Anthropic and maybe Google
51:24
is saying, yeah, you can do this, while they are saying no, you can't.
Alex Volkov
Alex Volkov 51:29
many of them by the way, jumped directly into hosting.
51:31
So Kimi jumped into hosting OpenClaw; you can host a full OpenClaw instance on Kimi, and many other ones as well. And Anthropic seems to be the one falling behind there that does not want people using their Max account. OpenAI responded that you can absolutely use your Codex subscription, your OpenAI Pro subscription, everywhere you want, which is a very interesting approach. Again, the number of people I saw decide to buy the $200-a-month Claude account because of OpenClaw is surreal.
Ryan Carson
Ryan Carson 52:04
I hadn't paid for the Claude Max plan until OpenClaw.
52:08
I did it and then I just canceled it, and I switched to OpenAI, because you can use Codex, et cetera, on it. Yeah, it was a smart move by Sam to get Pete on board; they're lucky to have him.
Alex Volkov
Alex Volkov 52:24
And we're gonna look forward for more, updates on kind of like
52:28
OpenClaw and OpenAI compared to this. Just the amount of telemetry they get from bots, the usage, the number of people working on that repository, is insane. There are 200,000 stars on GitHub that OpenAI essentially bought, and they bought not necessarily the code; they bought a whole community of people trusting that OpenAI could get there. Folks, we have to move on, because we've been here for a while and we haven't even started talking about Grok or open source or evals or code. Okay, let's move on. Let's talk about Grok 4 20. I will show you Grok 4 20, and then we're gonna mention this. So on grok.com there are two new releases. xAI finally released Grok 4 20; it's in beta. And now if you run anything like "who are the co-hosts of the ThursdAI show on X," you will see a very interesting agents-thinking UI here, and you can see that multiple AIs are talking between themselves. You can see their thoughts, and all of them are researching. This is a very interesting attempt, because you can see all four answers here, and they're pretty much the same. And it found all the co-hosts of ThursdAI. They're listed explicitly: regular co-hosts on the official website, appeared in 50-plus episodes each, frequently tagged and shouted out by Alex in episode recaps. This is not bad research for 26 seconds; four agents went and did a deep dive, 206 sources scanned, including Spotify. The fan-out search is very similar to Google AI Mode; they have a very big fan-out search. It explained some stuff; actually, we'll give it a thumbs up, why not, let 'em learn. But the very interesting thing is that this release from xAI is not a standard release. We haven't seen any evals at all. No benchmarks, no evals. Grok 4 20 just released silently, and then Elon Musk started tweeting about it. This is now kind of how things work. It's not in the API yet, so nobody could test it independently either. And the things they say about it: one, it's four 500-billion-parameter models, or sixteen in the heavy version they released. If you go here and you see Heavy, for $300 a month, and I literally don't know who's paying xAI $300 a month, you get the highest usage and you can get Grok 4 20 Heavy with 16 kind of personalities, 16 bots, doing things together. I don't know who would use this. I do know that both Gemini Deep Think and the Pro versions of GPT-5 are the same kind of thing behind the scenes: they all do fan-out and then some model ranks and compares. This is how they're so good at different tasks. Any comments on Grok 4 20, folks? And the fact that it was supposed to release a couple of weeks ago, now it's February, it just released, and it's very much underwhelming compared to where we thought xAI would be with Memphis and the GPUs they have. Go ahead.
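To make the "fan-out then rank" idea concrete, here is a generic sketch of the pattern Alex describes: several agent instances answer the same question in parallel, and a ranking call picks or merges the best candidate. The ask_agent and rank_answers callables are placeholders; this is an illustration of the general pattern, not xAI's or Google's actual implementation.

```python
# Generic fan-out-then-rank pattern, sketched to illustrate what "four agents
# collaborating" roughly amounts to. ask_agent and rank_answers stand in for
# real model calls; this is not any lab's actual architecture.
from concurrent.futures import ThreadPoolExecutor

def fan_out_answer(question, ask_agent, rank_answers, n_agents=4):
    """Ask n_agents copies of the model in parallel, then let a ranking call
    pick (or merge) the best answer from the candidates."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(lambda i: ask_agent(question, seed=i),
                                   range(n_agents)))
    return rank_answers(question, candidates)
```

The trade-off is cost: you pay for every parallel agent plus the ranking step, which is roughly why these fan-out modes sit behind the most expensive subscription tiers.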
LDJ
LDJ 55:48
So I recall it was confirmed by xAI that Grok 3 and
55:52
Grok 4 are 3 trillion parameters. And so now, this 500 billion parameter thing you're saying, is that confirmed? 'Cause if that is confirmed, that's Elon
Alex Volkov
Alex Volkov 56:01
posting.
LDJ
LDJ 56:02
Okay.
56:03
So then I guess they like shrunk the size of each model. I guess, yeah, it's interesting.
Alex Volkov
Alex Volkov 56:08
Yeah.
56:09
Elon posting this almost verbatim; let me go find it, but yep.
Nisten Tahiraj
Nisten Tahiraj 56:14
Look, it's not bad,
Alex Volkov
Alex Volkov 56:16
it's not bad.
Nisten Tahiraj
Nisten Tahiraj 56:17
It's not good for day-to-day work,
56:19
like for agent stuff. But what I'd say it's still the best, or top tier, at is this research stuff, just because of whatever RAG and research system xAI has; that one is still excellent. So it is worth it: if you're gonna do a deep research task, maybe you can just tell your agent, hey, go on Twitter or grok.com and type this in and get a report from it with all the sources, because it's very good at the sources.
Alex Volkov
Alex Volkov 56:52
Yeah.
56:52
Sourcing is really good. Ryan, go ahead.
Ryan Carson
Ryan Carson 56:55
I want X to win.
56:56
I love X. I've been on X for like 19 years, literally. You know, the new algorithm is bonkers; I'm getting over a million views on all my articles now. I posted about my Dell monitor the other day and Michael Dell stopped by. It's a magical place to be, but nobody uses Grok for production stuff that I know of. Nobody uses it for coding. So I'm kinda left with that: it's great to see it cranking in the X interface, doing amazing research for you, Alex, and I know you use it heavily. But it seems like the talk from Elon and where the world is, as far as actually using Grok, are very different.
Alex Volkov
Alex Volkov 57:37
when you say nobody, we should acknowledge that X
57:40
is using Grok, and very heavily.
Ryan Carson
Ryan Carson 57:42
Yeah, that's great.
Alex Volkov
Alex Volkov 57:43
And I mean, no, but also like X users use grok a lot
57:47
within X to reply and give feedback. There are a lot of feedback loops there that aren't counted in website visits, for example. So all these companies like SimilarWeb that compare traffic to the major labs: grok.com is not the main place where Grok gets used. For many, many people it's on X, which is nearing 500 million active users, so some of them probably use Grok in some capacity or another, and they're shoving it everywhere; there's a button for Grok literally on every post, in the main bar, et cetera. So they're really, really trying. I agree with you, though: it's not a production model. They talked about working on a coding-specific model; nobody uses Grok for code, for example. Yeah.
Ryan Carson
Ryan Carson 58:25
Yep.
58:26
And I'm talking about API usage. It's awesome to see it woven into the actual X experience. The X algorithm is being driven by Grok now, so that's all great, but I just don't see how they get to be this massive API company on this trajectory.
Alex Volkov
Alex Volkov 58:40
LDJ,
LDJ
LDJ 58:41
I don't know if you guys remember, but, for a little while, and even right
58:44
now when I'm checking, it does seem like Grok 4.1 Fast is actually surprisingly popular. Right now it's number eight on OpenRouter.
Alex Volkov
Alex Volkov 58:54
Yeah.
LDJ
LDJ 58:54
And half of the ones beating it are from the Chinese companies.
58:58
But it seems like the only American models beating Grok 4.1 Fast in API usage, at least on OpenRouter, are Claude Opus, Sonnet, and Gemini 3 Flash.
Alex Volkov
Alex Volkov 59:09
They've been giving it away for free on OpenRouter
59:11
for the longest time, I think.
LDJ
LDJ 59:13
That's curious.
59:13
Yeah, that's true.
Nisten Tahiraj
Nisten Tahiraj 59:14
And for very cheap as well.
Alex Volkov
Alex Volkov 59:17
look at the research though, because this happened
59:19
really fast, and this is the type of research Grok is really good at. I asked it, first of all, how long Ryan has been on X, to confirm what he's saying live; for quick fact-checking live, it's great. And then I asked it, hey, who among the co-hosts has been on X the longest? It looks like Ryan, with 19 years; I'm very close there, with 18 years, nine months. And it looks like it gets it based on the user ID. It's very interesting.
Nisten Tahiraj
Nisten Tahiraj 59:39
Seems accurate.
Alex Volkov
Alex Volkov 59:41
Yeah.
Ryan Carson
Ryan Carson 59:42
damn, 18 years, Alex.
59:43
I'm impressed.
Alex Volkov
Alex Volkov 59:44
Dude, 18.9 years.
59:46
I'm very, very close.
Yam Peleg
Yam Peleg 59:47
sure
Ryan Carson
Ryan Carson 59:47
almost
Yam Peleg
Yam Peleg 59:48
18.
Alex Volkov
Alex Volkov 59:48
No,
Yam Peleg
Yam Peleg 59:49
I'm from way before on X, but yeah, I started being
59:53
active around 2020.
Alex Volkov
Alex Volkov 59:56
Alright, folks, moving on.
59:57
We have to cover open source as well, but before open source, Wolfram, I would love for you to, actually, it's connected to open source, right? You've tested a few models, and I would love for you to give us, in This Week's Buzz, a little summary, a little discussion about the stuff you've been doing. So let's go to This Week's Buzz and then we'll be back, folks; we have a bunch more to discuss, including a new audio model from Google.
1:00:37
Welcome to This Week's Buzz corner of ThursdAI, where we talk about everything happening in the world of Weights & Biases. For the last month or so, Wolfram has joined Weights & Biases and started looking at different models. So we'd love for you to take the next five minutes to talk about what you found, man.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:00:55
Yeah, sure.
1:00:55
So the thing is, I'm getting started with the benchmarking stuff again. I used to do MMLU Pro, but that is not what we need now. About benchmarking: everybody has their favorite benchmarks, like Ryan said, because he is a programmer. I want to use my agent with the most capability, so I'm looking for something that is agentic, and a multiple-choice test is not doing it. So the benchmark I chose is Terminal Bench. Can I already share my screen? Okay. What we are looking at right now is what we looked at before, the Gemini 3.1 Pro that just released, and Terminal Bench. The reason I like Terminal Bench is that it is not just a coding benchmark but an agent benchmark. We can look at the webpage: the examples they show are interesting; it's not just "program something," they have tasks like build a Linux kernel, or crack a password in a password-protected archive, or find something out. These are basically the various tasks we give to our agents, and that is why I chose this benchmark. It has its own agent by default, but you can use other agents, like we have seen here with the Codex agent being used, and I am building an OpenClaw agent actually, so I can test that, which is what interests me the most. But by default it's the Terminus 2 harness. What's important about benchmarks is to always check how they are done, because if we look at the different models we see different scores: 68, 80, 68.5% here, and here we have Sonnet, for instance. It is important to always check how they test, because they usually give you information about what they are doing. Like, they're using Terminus 2 here and they changed the resource allocation. These agent tasks run in sandboxes; you need containers where this happens, because the agent is changing the system, and there are 89 tasks in this benchmark, so you need separate containers that don't influence each other or your host system. So check what is going on here, like GLM 5, which is also a new model, and yeah, we are providing it via our inference as well. For Terminal Bench 2.0 they have scores for Claude Code as an agent, actually, and it's a little bit better in their own benchmark, but not even much better than the default benchmark, and they changed the settings. This is also very important, right? It moved a bit. It is also super important to check which model settings they are using and how they differ. They also changed the resource limits here; they have different timeouts than the defaults. That is also a very important thing,
Alex Volkov
Alex Volkov 1:03:40
Mm-hmm.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:03:41
And so Kimi, for instance, said they turned off the
1:03:46
Thinking mode.
Alex Volkov
Alex Volkov 1:03:48
running the terminal bench.
1:03:49
Right?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:03:50
Yeah.
1:03:50
Which is interesting, because at first you would think you want the highest intelligence for these benchmarks. But in this kind of benchmark, thinking is not even helpful and is turned off, because you also generate so many tokens: it takes longer, it fills the context faster, and then the model fails harder than if it's not using thinking. That's
Alex Volkov
Alex Volkov 1:04:07
interesting.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:04:07
Yeah, it is a super interesting benchmark, what you find
1:04:10
out about what's happening here. And now we are looking at Qwen 3.5, which has a great score, by the way, which
Alex Volkov
Alex Volkov 1:04:16
was released this week early.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:04:18
out
Alex Volkov
Alex Volkov 1:04:18
to the Qwen
Wolfram Ravenwolf
Wolfram Ravenwolf 1:04:18
release.
Alex Volkov
Alex Volkov 1:04:19
yeah,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:04:19
So it has 52.5 and I made a table of all these models
1:04:24
and checked which is the best. On the open-source side it is in third place; GLM 5 is in second place, and first place is not even open source. In the comparison we're doing here, Gemini is the best, followed by Sonnet, and then we have GLM 5, followed by Qwen, followed by MiniMax, and then Kimi. And Kimi is personally still one of my favorites: it's multimodal, it has a big context, K2.5, basically on our own inference. About 20% of these 89 tasks succeeded every time; they are the rock-solid core, it always does these. Then, for the others, there were a lot of fluctuations, which depended on various things. For instance, if you change the runtime from one hour to two hours, the ceiling rises a lot, because now there are tasks that can be done that could not be done before. When you take a closer look at the benchmark, there are tests that are easy; you could even take them out of the benchmark and still get interesting scores. And there are those tests that sometimes work and sometimes don't. And you have a ceiling: if you check, out of all 89 tests, which tasks the model could actually do, not in one run but across all the runs, then you have the ceiling. That is what the model could potentially get as its best score if everything went perfectly. And some other tasks, basically around 30% here, were never done, which gets much better if you give it more time.
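Wolfram's "ceiling" versus "average" distinction is easy to compute. Here is a small sketch in Python with invented run data: a task counts toward the ceiling if it succeeded in at least one run, while the average per-run score is what a single benchmark run typically reports.

```python
# Sketch of the "ceiling" idea from repeated Terminal-Bench-style runs:
# a task counts toward the ceiling if it succeeded in at least one run,
# while the average is what you typically see in a single run. Data invented.

runs = [
    {"task_1": True,  "task_2": False, "task_3": True},   # run 1
    {"task_1": True,  "task_2": True,  "task_3": False},  # run 2
    {"task_1": True,  "task_2": False, "task_3": False},  # run 3
]

tasks = runs[0].keys()
average = sum(sum(r[t] for t in tasks) for r in runs) / (len(runs) * len(tasks))
ceiling = sum(any(r[t] for r in runs) for t in tasks) / len(tasks)

print(f"average per-run score: {average:.0%}")           # 56%
print(f"ceiling (solved at least once): {ceiling:.0%}")  # 100%
```

That gap between average and ceiling is exactly why settings like timeouts and container resources move the reported numbers so much.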
Alex Volkov
Alex Volkov 1:05:56
Well, all these tasks are tasks on terminal bench two that have to
1:06:00
do with how engineers do their work. The interesting thing here is that the sandbox environment you give it also has an effect, and the different settings affect the scores. So even when we compare, you know, the model makers just released this chart and we're comparing this chart to that chart, the methodologies are not always the same, right? So it's very hard to compare between them.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:23
It's hard to compare.
1:06:23
You are not comparing apples to apples if you don't check what the numbers mean. And you can report various scores; we don't have one score per model, because it's interesting to also give a score for the ceiling the model achieves. So I can say Kimi K2.5 could do 67.4% of this, which is even higher than what we just saw for Sonnet 4.6 or Gemini; Gemini has 68.5%, which is an average, I guess. But for Kimi, 67.4% of the tasks are possible; that's just not what you get in a usual run. So what I am trying to say is that benchmarks are complicated, but they are important, and you get a lot of information if you take a closer look and don't just compare some numbers, which may not even be directly comparable. So there's a lot of information in here if you are interested in these things, and you can look at them and see how the models really do.
Alex Volkov
Alex Volkov 1:07:19
That's awesome.
1:07:19
Wolfram, thank you so much for doing all this work. A lot of the evals you ran also ran with Weave, our tracing and evaluation framework, by the way. Oh,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:07:27
That, that reminds me.
1:07:28
We noticed a discrepancy when I did the benchmark with our GLM 5. We had abysmal scores; I got only 5%. So I checked, and I looked in the Weave trace to see what was happening, and I saw there were issues with the code it was writing. The Python scripts had brain-dead errors, below the model's level. So I reported it to our engineering department, they found and fixed the problem, and now it's getting the real score. That is why you do the evaluations: not just to find out which model is the best one, but also to make sure that your inference is doing the work. It was quickly fixed, and yeah, I can now fully say that we are serving the right version, basically.
Alex Volkov
Alex Volkov 1:08:05
So you used Weave to identify these errors and
1:08:08
figure out what's going on. This is why Weave exists. We need to get back to our chat because there's a bunch of other stuff happening, and as I promised, I saw another blow-up article from a panelist here. Ryan, you just mentioned that the X algo loves you; we also love you, and we would love to hear from you. If you give us a summary of what you posted, that'd be great: a visual review and a summary here on ThursdAI would be amazing, because you keep turning out incredible stuff.
Ryan Carson
Ryan Carson 1:08:42
I feel blessed by the X algo, so thank you, probably to Nikita.
1:08:47
Essentially all of us are trying to figure this out, right? We're trying to figure out how to build a system that lets us ship faster and more reliably. And I found that OpenAI produced a really good article, so I'm gonna pull up mine really quick. It's called Code Factory; let's go to the actual article. So OpenAI released this article called Harness Engineering, and they basically documented how they have set up Codex as what I'm calling a code factory: really a system that makes it easier to build, test, and deploy in a reliable fashion. But more importantly, it's about the agents ultimately writing a hundred percent of your code and reviewing a hundred percent of your code. I would encourage everybody to read it in detail, and then have your agent read it, to help you set up this system. So what I did is I wrote this article, let me zoom in, that basically documents how I set this up. What I often do is work with the agent for a day or two, get the system set up, and then ask the agent to document what we did; then I take that markdown and edit it into an article. And it's pretty straightforward: you want a loop where the coding agent writes the code and the repo enforces risk-aware checks before merge. What that means is you go in and define which high-risk files or routes or systems, if they get touched, should flag the PR as high risk, which adds extra checks. Then a code review kicks off. I use Greptile; it's great, I pay for it, but there are plenty of good code review agents. Then there's evidence: tests. So there's CI, all happening with GitHub Actions, so you have tests, you can do browser testing via GitHub Actions, and then you have a review. And then what happens is Greptile will often find problems, which triggers a Codex remediation step that fixes them, and it loops until there are no more comments from Greptile and all the CI goes green. You get all those beautiful green checks, and then, okay, this code is safe to merge, and then you can, theoretically, merge it. Let's go into the high-level view here. You open a PR, you classify the risk, is it high risk or not; if it is, you compute the required checks from those changed files, and then you have a risk policy gate. Again, y'all, your agents can do all of this: point them at this article and say, help me set this up, and then you just keep cranking in these loops until you get a good green PR at the end. The article walks you through it, with a little bit of code snippets on how to get it done and some things to think about, and then boom, you have a code factory set up and running. And I will say I use this every day. It's not easy to set up, so anyone looking for some one-shot magic silver bullet: you just gotta grind with your system until it's set up. And you can do this on Codex, Amp, Factory, Claude Code, pick your agent of choice, but dig in, because it is absolutely worth it, and it's where we're going. Once you get your code factory set up, you're really gonna start to move into a company factory, where you start to chain some of these flows together.
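To make the "risk-aware checks before merge" step concrete, here is a small sketch of a gate that classifies a PR as high risk from the files it touches and expands the required check list accordingly. The path patterns, check names, and thresholds are invented for illustration; in practice something like this would run as a CI step over the list of changed files.

```python
# Sketch of a risk-aware merge gate: classify a PR as high risk from the files
# it touches, then require extra checks before merge. Patterns and check names
# are illustrative, not taken from any specific setup.
import fnmatch

HIGH_RISK_PATTERNS = ["migrations/*", "auth/*", "billing/*", "*.tf"]
BASE_CHECKS = ["lint", "unit-tests"]
HIGH_RISK_CHECKS = BASE_CHECKS + ["integration-tests", "security-review", "human-approval"]

def required_checks(changed_files):
    """Return the list of checks a PR must pass, based on the paths it changes."""
    high_risk = any(fnmatch.fnmatch(f, pat)
                    for f in changed_files for pat in HIGH_RISK_PATTERNS)
    return HIGH_RISK_CHECKS if high_risk else BASE_CHECKS

print(required_checks(["src/ui/button.tsx"]))   # base checks only
print(required_checks(["billing/invoice.py"]))  # full high-risk gate
```

The agent loop then simply refuses to merge until every check returned by this gate is green, which is the "contract" Ryan describes next.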
Alex Volkov
Alex Volkov 1:12:17
So I wanna use this kinda as, first of all, huge, huge props on posting
1:12:21
incredible stuff that people who follow you get a lot of benefit from. Second of all, the thing you mentioned, which is also new for many of us with agentic work: hey, point your agent to this article, the agent reads it, and it sets this up. This is new from the last two months, right? Just for folks who are listening and feel behind, this is new to us as well. We previously pointed agents at documentation websites for them to learn how to implement stuff; now it's in skills, but now, because agents can write code for themselves and self-improve, it's absolutely possible to do what Ryan just said: hey, there's this article, it's very complex, point an agent at it, and the agent will walk you through setting itself up this way. This is definitely a new thing. Ryan, what would you say is the highlight of the change in thought with Code Factory for you? Is it the self-healing via Greptile, or is it just
Ryan Carson
Ryan Carson 1:13:20
like that's the, I I think it's about this
1:13:22
idea of setting up a contract. You basically set up YAML files and JSON files that act as a contract between you and the code factory, and this is all machine-executable and machine-enforced. That's the key here: you need to get to the point where none of these things can happen until these gates are passed. We used to hate this stuff in software engineering, right, because it would really slow you down as an engineer; if you had to get green CI before you could do anything, you would just wanna kill yourself. But now the agent has infinite time and infinite patience, so you force it to go through these gates, and real magic happens. So that's the difference: from the beginning, think like you have a team of a hundred engineers, even if it's just you. Take the time, it's like a week or more of setup, and it really unlocks absolute magic.
Alex Volkov
Alex Volkov 1:14:18
I think.
1:14:18
This also brings us to the discussion we started having, and probably should have right now: for some people, running agent things 24/7 is easier when they work on backend stuff, and I don't know how much of what you said is applicable to, like, a website.
Ryan Carson
Ryan Carson 1:14:35
And I do think there's a big difference between front end and backend.
1:14:38
But guess what, y'all, there's a system for front end that works. Just install it, it's free, it's amazing. What it does is put a little pill on your front end; you can click it, highlight the area you want to comment on, comment, and then paste that comment into your agent, and it targets the exact components and UI. So you start to get that loop. But I think UI and front end is still very much driving with your agent and grinding through it. There's no loop yet for that.
Alex Volkov
Alex Volkov 1:15:12
I want to, also highlighted like we also experimented with tools.
1:15:16
Some of us are running companies, some of us are playing with this. My experimentation with the Sonnet tools and OpenClaw was building a new website. Since the last show, by the way, the world moves so fast: last show was on Thursday, and on Friday morning my agent, OpenClaw-based, looking at my email, said, hey Alex, your website is about to be dead because the place where you hosted it is going down today. And I was like, oh, I need a new website. I decided, hey, we should use what we preach; we talk to you about this intelligence, and it's never been easier to build websites. So my experiment with building with OpenClaw and Codex and whatever was to try and lift a new website for ThursdAI. The previous website was just a regular website builder with like five links; there was nothing there, no system. I wanted to show you what's possible, but also to talk about the fact that front-end one-shot is a complete myth. So here's what we have: the new ThursdAI website. I would encourage folks to go to thursdai.news and check it out for yourselves, and please give me comments, because again, one-shot is a myth. I asked Opus 4.5, which is by far the best designer on Design Arena and everywhere else, to give me three mocks of how the website should look. Before that, I asked a bunch of agents to go and research what podcast websites have: some have episodes, some have links to different social media, some have the co-hosts and the hosts, et cetera. So here's what we have. I find it beautiful, and this is just one of the three examples. The thing I wanted to highlight, though, is this guest directory. As I was iterating with this intelligence, we figured out that, hey, I mention everybody who's a guest on the show in the transcript, but it's not stored anywhere, so we need a guest directory. We've had over 160 guests on the show at this point, many of them from big labs like Google; look at how many people from Google we've had on the show throughout the years, right?
Nisten Tahiraj
Nisten Tahiraj 1:17:08
is Steinberger not in there?
Alex Volkov
Alex Volkov 1:17:12
Peter Steinberger?
Nisten Tahiraj
Nisten Tahiraj 1:17:13
Yeah.
Alex Volkov
Alex Volkov 1:17:14
No, we didn't have him.
Nisten Tahiraj
Nisten Tahiraj 1:17:16
Oh, okay.
Alex Volkov
Alex Volkov 1:17:16
anyway, so, so, the guest directory, 15 guests from Google.
1:17:20
I will just highlight that extracting those guests from the raw transcripts was a task I would never have bothered to do before. I asked the agent to spin up multiple subagents to go and read every transcript of every show we had, over 152 episodes at this point, I think, and try to extract the guests; then this orchestration thing happens. It also one-shotted giving every guest their own page for SEO purposes; here's Stan's page, for example, with dynamic OG tags. For those of you who know what OG tags are, making them dynamic is kind of the best thing you can do for your website, but it's really, really difficult. Agents decided all of this. With that said, none of this was one-shot. The amount of conversation I had with my agent to get it to a level where it looks coherent between pages is absurd. I set up an automation to work on a few things at night while I'm asleep, and for the three or four days afterwards I woke up to an almost completely new website every day. That is just not sustainable for people and businesses. If you don't like that, as I said, I don't care, because first of all the website is barely up, and second, as long as the information is there, it's good for Google to refresh the website, et cetera. But think about a company, for example, that would try to get an automation up and have agents working 24/7. Basically, where I'm going with this is: one-shot, I think, is a myth, especially for UI or front end, and so is running things 24/7. You have to have exact rules to define exactly what you want, otherwise the agents will go haywire, at least in my experience. I challenge everybody else to come and show me how they keep things coherent between pages, between multiple things, while running 24/7.
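For listeners unfamiliar with OG tags: they are the meta tags social platforms read to build link previews, and "dynamic" just means generating them per page from data instead of hardcoding one set. Here is a minimal sketch; the guest record fields and the URL scheme under thursdai.news are assumptions for illustration, not the site's actual implementation.

```python
# Minimal sketch of "dynamic OG tags": render per-guest social-preview metadata
# from structured guest data rather than one static tag set for the whole site.
# The guest record fields and URL paths below are invented for illustration.
from html import escape

def og_tags(guest):
    base = "https://thursdai.news"
    return "\n".join([
        f'<meta property="og:title" content="{escape(guest["name"])} on ThursdAI" />',
        f'<meta property="og:description" content="{escape(guest["blurb"])}" />',
        f'<meta property="og:image" content="{base}/og/{guest["slug"]}.png" />',
        f'<meta property="og:url" content="{base}/guests/{guest["slug"]}" />',
    ])

print(og_tags({"name": "Example Guest", "slug": "example-guest",
               "blurb": "Joined to talk about agentic engineering."}))
```

In a real site this function would run at build time or per request, one call per guest page, which is what makes the previews "dynamic."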
Ryan Carson
Ryan Carson 1:19:13
I think number one, you're right.
1:19:14
There is no magic here, no silver bullet, and anyone saying otherwise is lying or doesn't use the tools. But I will say having a design system in place has really changed the game for me. I work with Josh Puckett, who just released this really cool interface craft tool, and basically I have a design system folder with three documents in it, all markdown, that explain the UI and the way the design works. Then I set up a skill that reminds the agent to always use the design system, and using that plus orientation works very well. But none of the front-end UI stuff is happening in a loop for me; it's all backend. So I think designers have some of the safest jobs in the world.
Alex Volkov
Alex Volkov 1:20:01
alright folks, comment on this.
1:20:02
We have a few items still to discuss. Any comments on how developers will still be needed in the world, especially when it comes to front end?
Yam Peleg
Yam Peleg 1:20:11
I completely agree.
1:20:13
I can't even tell you how many agents I fired this week. It's a crazy, crazy amount of agents. The thing that is important to understand is that these things are a little bit random from time to time; that is just how the method works. So it's not a question of whether a model can avoid messing up your code, or avoid making a terrible mistake that throws away the entire computer. It's just a matter of when, because it is a little bit random all the time. That's the reason you need all these gates and all these things that apparently slow humans down. Yeah, they absolutely slow humans down, but humans don't mistakenly do very destructive things to the computer. You have the power to delete the entire computer, but you're not gonna do it by mistake on your own, because you're human. Models can mistakenly, without you even realizing it, just delete the entire computer, and you can't even blame anyone, because a minute later the context is compacted and the model doesn't even remember what happened. That's it. So these things are guarding you from the mistakes that, once they happen, are very destructive.
Ryan Carson
Ryan Carson 1:21:43
Amen.
1:21:43
And then quickly, this is why documentation drift is such a big deal with Code Factory. Part of the CI check is actually making sure you don't have documentation drift, and Yam's nailing it, because it can happen really quickly.
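One simple way a documentation-drift check could look in CI is to verify that every file path referenced in the docs still exists in the repo and fail the build otherwise. The regex, docs location, and file extensions below are illustrative assumptions, not Ryan's actual check.

```python
# Sketch of a documentation-drift CI check: scan markdown docs for backticked
# file paths and fail if any referenced file no longer exists in the repo.
# The docs directory, regex, and extensions are assumptions for illustration.
import pathlib
import re
import sys

PATH_RE = re.compile(r"`([\w./-]+\.(?:py|ts|tsx|md|yml))`")

def find_drift(docs_dir="docs"):
    missing = []
    for doc in pathlib.Path(docs_dir).rglob("*.md"):
        for ref in PATH_RE.findall(doc.read_text(encoding="utf-8")):
            if not pathlib.Path(ref).exists():
                missing.append((str(doc), ref))
    return missing

if __name__ == "__main__":
    drift = find_drift()
    for doc, ref in drift:
        print(f"{doc}: references missing file {ref}")
    sys.exit(1 if drift else 0)
```

Because the agent has "infinite patience," it can be forced to rerun and fix docs until a gate like this goes green, the same way it handles failing tests.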
Alex Volkov
Alex Volkov 1:21:57
Yep.
Nisten Tahiraj
Nisten Tahiraj 1:21:59
You need to finish the job.
1:22:00
That's why you need an actual frontend developer, if it's an app you're gonna hand over to a customer, or if it's something that makes money. Even though Microsoft and other big labs make the same mistakes now, and you see random UI stuff breaking everywhere, that's still not an excuse for you. You can lose that contract. At some point you need to take the thing to completion.
Alex Volkov
Alex Volkov 1:22:24
yeah.
Nisten Tahiraj
Nisten Tahiraj 1:22:25
Yeah, that's the spot for good frontend developers,
Alex Volkov
Alex Volkov 1:22:29
Folks, this has been great discussion.
1:22:31
Many of the listeners of our show are definitely experimenting with agentic building and tools and agentic engineering. It's not vibe code anymore when you actually need to put stuff in production; hardening your app with Code Factory and gates is not vibe code, it's the step after vibe code. So I'm definitely happy we're having this discussion here. The last few updates we have to cover, let me just pull them up. We have Google Lyria, which I wanted to play for you as well. Google DeepMind launched Lyria 3. They call it their most advanced AI music generation model, available in the Gemini app. It has creative controls and 30-second high-fidelity tracks. It's very interesting: they call it the highest quality one, but the tracks are only 30 seconds. I really wanna listen to a few of them. They have a prompt guide for it, but let me play a sample of Lyria by sharing the tab, because I do wanna show you some of the music.
1:23:43
So I don't think 30 seconds is enough, but you can compose with images: you can upload an image and say, hey, generate music for this image.
AI
AI 1:23:51
Green Hills Ocean.
Alex Volkov
Alex Volkov 1:24:01
Alright, we gotta say bye to Ryan.
1:24:02
Ryan, thank you so much for joining; folks should definitely learn Code Factory from you. We'll see you here next week. Cheers, man. So the cool thing with Lyria is that they released a prompt guide that tells you how to prompt it. You can prompt vocals and lyrics, you can prompt different styles, and that's obviously very useful, especially because you can send your agents to it and say, hey, write me some drafts, because you don't know music. So Lyria from Google is very interesting, that they go into this. Apparently OpenAI has had a music model somewhere that they've been refusing to release for a while, so we'll look forward to that. We didn't talk about open source at all, so I really wanna talk about Qwen. Nisten, time for you to talk. I know you played around with Qwen a little bit; I would love for you to pick up Qwen and let's chat about it. This happened on Monday, I think: the release of Qwen 3.5.
Nisten Tahiraj
Nisten Tahiraj 1:24:58
Yeah, the benchmarks are showing very good.
1:25:00
The coding is alright. In some of the tests that I wrote, I think GLM 5 is still above it; actually, in my opinion, GLM 5 is better than Gemini. Qwen seems like it's on par with today's Gemini, but again, this is not the big coding model, so this is something for everything else. I'm more excited about the medical part and also trying to train it. So yeah, it's interesting. And this one is not a DeepSeek architecture, because the other ones, I think both Kimi and GLM, switched to the DeepSeek architecture, except DeepSeek itself; I don't know what DeepSeek is doing.
Alex Volkov
Alex Volkov 1:25:47
yeah.
1:25:47
DeepSeek, we don't know what DeepSeek is doing, for sure. So Qwen 3.5: almost 400 billion parameters, with 17 billion active. A ton of experts, 512 of them, with around 11 active at a time. A 262K native context window, which you can extend to 1 million with YaRN, and 201 languages supported.
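To put those mixture-of-experts numbers in perspective, here is a rough arithmetic sketch: compute per token scales with the roughly 17B active parameters, not the roughly 397B total. The even per-expert split below is a simplifying assumption purely to make the ratio concrete; real MoE layouts also include shared and attention parameters.

```python
# Rough arithmetic for a sparse MoE like Qwen 3.5: per-token compute tracks the
# ~17B active parameters rather than the ~397B total. The even per-expert split
# is a simplifying assumption just to make the ratio concrete.
TOTAL_PARAMS = 397e9
ACTIVE_PARAMS = 17e9
NUM_EXPERTS = 512

active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"~{active_fraction:.1%} of weights touched per token")       # ~4.3%

per_expert = TOTAL_PARAMS / NUM_EXPERTS
print(f"~{per_expert/1e9:.2f}B params per expert (even-split assumption)")
```

That small active fraction is why the hosts describe the model as running fast despite its total size, though you still need enough memory to hold all the experts.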
Nisten Tahiraj
Nisten Tahiraj 1:26:09
always been fantastic at multilingual.
Alex Volkov
Alex Volkov 1:26:13
Yeah.
Nisten Tahiraj
Nisten Tahiraj 1:26:13
Even on the tiny model.
1:26:14
And this is the other major thing about Qwen: the multilingual performance.
Alex Volkov
Alex Volkov 1:26:20
Yeah.
Nisten Tahiraj
Nisten Tahiraj 1:26:20
Oh, and, actually the dev lead, I think what he
1:26:24
went to school for was not AI stuff, it was actually linguistics, so that shows in the model.
Alex Volkov
Alex Volkov 1:26:30
It shows, yeah.
Nisten Tahiraj
Nisten Tahiraj 1:26:31
Yeah.
Alex Volkov
Alex Volkov 1:26:32
Wolfram, do you have comments about multilingual?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:26:35
At least, they cover a lot of languages,
1:26:37
but the quality was never there, and it was a known issue. When we had him on the show, we talked about it and he said it would be a focus for a future release. So I will definitely test this one to see if the multilingual quality is now there; it's not just the coverage, but also the quality that they're hoping for.
Alex Volkov
Alex Volkov 1:26:53
Yep.
1:26:54
We also wanna shout out that Cohere released a small Aya multilingual model, 3.3 billion parameters, with 70-plus languages supported. Speaking of multilingual, Cohere was always great.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:27:05
The Command R model was one of my main models. Command
1:27:08
R and Command A, but R even more. I used that as my main model for a while because I could run it on my own system, even at low quantization. So I'm a big fan of Cohere.
Alex Volkov
Alex Volkov 1:27:20
Yep.
1:27:21
I think with this we covered pretty much everything on our docket, besides Zuna, the 380-million-parameter open-source BCI model. Nisten, let's talk about this for a little bit, because it uses EEG, electrical signals from your brain, to reconstruct brain signals from noisy data. Have you heard about this? What's going on? And will this bring us to a world where, with noninvasive BCI, I'll be able to control computers with my brain?
Nisten Tahiraj
Nisten Tahiraj 1:27:50
Y Yeah.
1:27:51
I think this is one of the best steps. I haven't tested the model; it feels a little bit small, but this is also one of the best efforts right now in this field, and they released it completely Apache too. So that means you could probably take one of those $500 non-invasive headsets, you just put it on, it looks all right, and you can actually start to train it on your own data. I don't know how good it is, or if it's usable, but this is the best thing we have right now. I think it will need personalized training for each person, because that is kind of how this brain-to-transformer stuff has worked lately. But yeah, I like it. Again, I haven't looked at the architecture, but basically it just looks at electrical signals from your head, and the headsets that measure them are pretty cheap now. Up until this point there were a lot of efforts, but they weren't quite there, or they sort of worked, and they weren't going fully all in on getting people to modify them and train them and make them better. So I'm actually pretty excited about this. I might end up buying one of those headsets to see if I can make it work and train it more.
Alex Volkov
Alex Volkov 1:29:22
Yeah, I think the cool thing about this is like, very small,
1:29:25
380 million parameters, a very small model that can run on device, likely with those headsets. Not necessarily in real time, but you can imagine that in two or three years this runs in real time.
Nisten Tahiraj
Nisten Tahiraj 1:29:36
at that speed it can probably run in real time on a,
Alex Volkov
Alex Volkov 1:29:38
yeah,
Nisten Tahiraj
Nisten Tahiraj 1:29:39
like on a gaming
1:29:40
GPU, not even a very expensive one. It's tiny, and it's actually trained on quite a lot of data for a 0.4B model, so they would have needed a lot of EEG data. I think this one actually has potential: you would just wear the headset around, look at the screen, and maybe be able to move some windows around, or fire up your agents or close them. You're probably still gonna want to talk to them, but yeah.
Alex Volkov
Alex Volkov 1:30:06
All righty folks,
1:30:06
It was a very exciting week. The highlights would probably be Sonnet 4.6 and the new Gemini 3.1 Pro that we were able to test on the show; we're still gonna wait and see the vibe feelings on that one. Last week we had Deep Think, and I don't know if folks use Deep Think, but it was definitely a big thing. Everything that happens on the show ends up as a newsletter and a podcast on thursdai.news. Please check out our new website; my agent really worked hard on it, and so did I, we worked really hard. Please share it with your friends; we would love to see more folks going in there and discovering and subscribing to the show. The show ends up as a newsletter and a podcast everywhere you listen to podcasts, so if you missed any part of the show, look for ThursdAI News wherever you get your podcasts. We have a 4.9 rating everywhere. Thank you guys for joining, and thank you to everybody who listens in; I think we're at over 1,500 folks tuning in for the show. Appreciate your time as always. LDJ, Wolfram Ravenwolf, Yam Peleg, and Ryan Carson were here, and your host is Alex Volkov. So with that, thank you all, and we'll see you next week. Bye-bye.