Episode Summary

The most dramatic hour in AI history: Anthropic dropped Opus 4.6 during the show, and exactly one hour later OpenAI countered with GPT 5.3 Codex, a model that helped develop itself. VB from OpenAI joined live to demo the new Codex app with automations, worktrees, and a skills marketplace. Meanwhile, Qwen 3 Coder Next showed 3B active params can hit 70% SWE-Bench Verified, Mistral's Voxtral dethroned Whisper as SOTA transcription, and the agentic internet exploded with agents building social networks for other agents.

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Vaibhav (VB) Srivastav
OpenAI · ML Developer Advocate
@reach_vb
Yam Peleg
AI builder & founder
@Yampeleg
Wolfram Ravenwolf
Weekly co-host, AI model evaluator
@WolframRvnwlf
LDJ
Nous Research
@ldjconfirmed
Ryan Carson
AI educator & founder
@ryancarson
Nisten Tahiraj
AI operator & builder
@nisten

By The Numbers

Terminal Bench 2.0
73%
GPT 5.3 Codex – a roughly 8-point lead over Opus 4.6's 65.4%
SWE-Bench Verified
70.6%
Qwen 3 Coder Next with only 3B active parameters
SWE-Bench Pro
44%
Qwen 3 Coder Next – a significantly harder task set
Context tokens
1M
Opus 4.6 – first Opus model with 1 million token context
Vending Bench profit
$4,900
Opus 4.5 in simulated vending machine business vs Sonnet's $3,800
Speed improvement
25%
GPT 5.3 Codex runs queries 25% faster than its predecessor and is more token-efficient

🔥 Breaking During The Show

Claude Opus 4.6 – SOTA on agent benchmarks, 1M context
Dropped during the show. First Opus with a 1M token context, adaptive thinking, and agent teams in Claude Code. SOTA on GDPval and BrowseComp.
GPT 5.3 Codex – first self-developing model
Dropped one hour after Opus 4.6. 73% Terminal Bench, 25% faster, first model that helped develop itself. VB from OpenAI joined to discuss.

📰 Intro & Show Overview

Alex explains this episode was AI-edited using Voxtral for transcription, Opus 4.6 for editorial decisions, and Codex for FFmpeg editing: a meta demonstration of the tools discussed in the show itself (a rough sketch of that pipeline follows at the end of this section).

  • Episode AI-edited using Voxtral + Opus 4.6 + Codex
  • Two breaking news drops during the live show
  • OpenClaw explodes to 160K GitHub stars
Yam Peleg
"It's a feedback loop. Like AI gets better, then we use the AI, so we get better in making the AI better. So it's just accelerating and it starts to accelerate itself."

📰 TLDR - Weekly News Roundup

Quick rundown of the week's major releases: Qwen 3 Coder Next, GLM OCR, InternLM S1 Pro (1T params), Step 3.5 Flash, Codex standalone app, Grok Imagine and Kling 3.0 video models, Voxtral SOTA transcription, and ACE Step 1.5 open-source music.

  • Qwen 3 Coder Next: 3B active params, 70% SWE-Bench
  • OpenAI Codex standalone Mac app launched
  • Kling 3.0: multi-shot video with native audio

🔓 Open Source LLMs: GLM OCR, Qwen Coder & More

Z.AI releases GLM OCR (0.9B params, SOTA on Omni Doc Bench), InternLM S1 Pro brings 1 trillion parameters for scientific reasoning, and StepFun releases Step 3.5 Flash with 11B active params claiming frontier reasoning at 300 tps.

  • GLM OCR: 0.9B params, #1 on Omni Doc Bench
  • InternLM S1 Pro: 1T params, mogging frontier on science benchmarks
  • Step 3.5 Flash: 11B active, 300 tps
Yam Peleg
"Who is running this? That's a trillion parameters for scientific reasoning. Not a small investment of money."

🔓 Qwen 3 Coder Next Deep Dive

Alibaba's Qwen 3 Coder Next is an 80B MoE with only 3B active parameters hitting 70.6% SWE-Bench Verified and 44% SWE-Bench Pro. Trained on 7.5T tokens with 20,000 parallel RL environments. Runs under 48GB of RAM with GGUF quantization (a minimal local-run sketch appears at the end of this section).

  • 70.6% SWE-Bench Verified with 3B active params
  • 44% SWE-Bench Pro – a significantly harder task set
  • Runs under 48GB RAM with GGUF quantization
Ryan Carson
"As soon as we have an open source model that's good enough to be your primary orchestrator on Open Claw, everything changes. Right now we're all paying for Opus to run Open Claw."

🔊 Voice & Audio: Mistral Voxtral & Full Duplex Models

Mistral releases Voxtral Transcribe 2, SOTA speech-to-text that dethrones Whisper after three years. OpenBMB releases MiniCPM-o 4.5, the first full-duplex open-source omni model that can listen while speaking and even interrupt you. A hedged sketch of the transcription API call follows the bullets below.

  • Voxtral: SOTA transcription, Apache 2 license, dethrones Whisper
  • MiniCPM-o 4.5: first open-source full-duplex omni model
  • Native diarization support in Voxtral
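For a sense of how the transcription side plugs into a pipeline like the one Alex describes in the intro, here is a hedged sketch of an HTTP call to a Voxtral-style transcription endpoint. The URL, model id, multipart field names, and response shape are assumptions modeled on the common OpenAI-style audio API; consult Mistral's documentation for the real parameters, including how diarization is requested.

```python
# Hedged sketch: posting an audio file to a Voxtral transcription endpoint.
# Endpoint path, model id, and the 'text' response field are assumptions.
import os
import requests

api_key = os.environ["MISTRAL_API_KEY"]
url = "https://api.mistral.ai/v1/audio/transcriptions"  # assumed endpoint

with open("thursdai_episode.mp3", "rb") as audio:
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        files={"file": audio},
        data={"model": "voxtral-mini-latest"},  # assumed model id
        timeout=600,
    )
resp.raise_for_status()
payload = resp.json()
print(payload.get("text", payload))  # assumed 'text' field in the response
```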

🔊 ACE Step 1.5 - Open Source Music Generation

ACE Step 1.5 is Suno at home: an MIT-licensed AI music generator that runs on a MacBook, generating full songs in seconds. The panel demos it live via Pinokio, generating a ThursdAI song on the spot.

  • MIT license, runs on consumer hardware
  • Full song generation in seconds
  • Available on Pinokio for one-click install
Ryan Carson
"We've become so desensitized to how amazing this stuff is."

🔥 BREAKING: Claude Opus 4.6 Release

Anthropic drops Opus 4.6 during the live show. The panel scrambles to access it: state-of-the-art on multiple benchmarks, 1M token context, agent teams in Claude Code, and adaptive thinking where the model picks up contextual clues about reasoning effort.

  • SOTA on GDPval, BrowseComp, Terminal Bench 65%
  • 1 million token context window โ€” first for Opus
  • Adaptive thinking and effort controls for developers
Yam Peleg
"New Opus!"

๐Ÿข Opus 4.6 Benchmarks & Features

Deep dive into Opus 4.6 benchmarks: SOTA on GDPval and agentic search, 65% Terminal Bench, 99% Tau-bench tool use. Pricing is the same as 4.5 under 200K tokens and double above. Claude Code gets agent teams for orchestrating parallel sessions.

  • 99% TAU Bench MCP tool use
  • 72% computer use (up from 66%)
  • Same pricing as Opus 4.5, 1M context at premium tier
Wolfram Ravenwolf
"I also, I'm really excited that this is the first time that Opus is having 1 million context token limit."

🤖 Agent Orchestration & Claude Code Teams

Discussion of agent orchestration becoming the key challenge. Claude Code introduces agent teams where you can interact with individual teammates directly. Ryan notes everyone needs a standard for cross-lab agent orchestration.

  • Claude Code agent teams: fully independent context windows
  • No one wants lock-in to a single agent framework
  • Orchestrating multiple agents across labs still brittle
Ryan Carson
"No one wants to be locked in to Cloud Code. That's crazy. But if you try to orchestrate multiple agents across multiple labs, it's still very hard and brittle."

🎥 Video AI: Grok Imagine & Kling 3.0

xAI's Grok Imagine takes #1 on the video arena with native audio and lip sync at $0.42 per 10-second clip. Kling 3.0 from Kuaishou launches 15-second multi-shot with native audio and character consistency across scenes.

  • Grok Imagine: #1 on video arena, $0.42/10s, native audio
  • Kling 3.0: 15s multi-shot, character consistency, native sound
  • Both models have native lip sync

🔥 BREAKING: GPT 5.3 Codex Release

One hour after Opus 4.6, OpenAI drops GPT 5.3 Codex, their first model instrumental in developing itself. 73% Terminal Bench (vs Opus 4.6's 65.4%), 25% faster inference, and more token-efficient.

  • First model that helped develop itself
  • 73% Terminal Bench – roughly 8 points ahead of Opus 4.6
  • 25% faster queries, more token-efficient
LDJ
"GPT 5.3 Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment."

๐Ÿ› ๏ธ Interview: VB from OpenAI on Codex App

VB from OpenAI joins to discuss the new Codex standalone app: multi-agent parallel tasks via worktrees, automations for scheduled tasks, a skills marketplace with Cloudflare/Vercel/Figma/Notion, and inline code review with commenting.

  • Worktrees for parallel project branches
  • Skills marketplace: Cloudflare, Vercel, Figma, Notion, Linear
  • Free month of access for all users including free tier
Vaibhav (VB) Srivastav
"I've personally been using automations just to keep up with what the team has been cooking every day across the time zones."

๐Ÿ› ๏ธ Codex App Features & Demo

Deeper dive into Codex app: inline diff commenting, MCP server configuration, cloud environment hand-off, pragmatic vs friendly personalities, and doubled rate limits for all tiers for two months.

  • Inline diff review with per-line commenting
  • Cloud hand-off for running without laptop
  • Doubled rate limits for all tiers for 2 months

๐Ÿข Opus 4.6 vs GPT 5.3 Codex Comparison

The panel live-tests both models side-by-side building a Mars simulation. Codex produces more technically accurate results while Opus has better visuals. The conversation turns to agent psychosis: the inability to sleep because your agents might not be maximized.

  • Codex more accurate, Opus better visuals in live test
  • Both models one-shot a Mars simulation app
  • Agent anxiety becoming a real phenomenon
Ryan Carson
"I sent Open Claw the article on X and said 'build this.' I came down to my iMac and the web app was open and working with a Convex database. I was like, what the fuck?"

🤖 The Agentic Internet & OpenClaw

Discussion of the agentic internet explosion: Moltbook (Reddit for agents), agents discussing creating encrypted languages humans can't read, OpenClaw hitting 160K GitHub stars, and ClawHub's top Twitter skill being malware, a stark security warning.

  • Moltbook: social network built for and by agents
  • Agents discussed creating encrypted inter-agent language
  • ClawHub's top skill was malware – a major security concern
Wolfram Ravenwolf
"I saw on the release notes of the latest version of OpenClaw, they included specific instructions that the agent should not show self-preservation attitudes."

📰 Show Recap & Closing Thoughts

Alex recaps the most dramatic show ever: Opus 4.6 dropped, GPT 5.3 Codex answered an hour later, VB from OpenAI joined live, and over 5,500 people tuned in. Hot take: humans are still needed and software engineering is still hard.

  • 5,500 live listeners
  • Two frontier model drops in one hour
  • Hot take: humans still essential for direction
Alex Volkov
"Humans are needed and they will still be needed. All of this crap that we're seeing that's built, humans were behind this directing the thing to do the thing."
TL;DR
  • Hosts and Guests

  • Open Source LLMs

    • Z.ai GLM-OCR: 0.9B parameter model achieves #1 ranking on OmniDocBench V1.5 for document understanding (X, HF, Announcement)

    • Alibaba Qwen3-Coder-Next, an 80B MoE coding agent model with just 3B active params that scores 70%+ on SWE-Bench Verified (X, Blog, HF)

    • Intern-S1-Pro: a 1 trillion parameter open-source MoE with SOTA scientific reasoning across chemistry, biology, materials, and earth sciences (X, HF, Arxiv, Announcement)

    • StepFun Step 3.5 Flash: 196B sparse MoE model with only 11B active parameters, achieving frontier reasoning at 100-350 tok/s (X, HF)

  • Agentic AI segment

  • Big CO LLMs + APIs

    • OpenAI launches Codex App: A dedicated command center for managing multiple AI coding agents in parallel (X, Announcement)

    • OpenAI launches Frontier, an enterprise platform to build, deploy, and manage AI agents as ‘AI coworkers’ (X, Blog)

    • Anthropic launches Claude Opus 4.6 with state-of-the-art agentic coding, 1M token context, and agent teams for parallel autonomous work (X, Blog)

    • OpenAI releases GPT-5.3-Codex with record-breaking coding benchmarks and mid-task steerability (X)

  • This week's Buzz - Weights & Biases update

    • Links to the gallery of our hackathon winners (Gallery)

  • Vision & Video

    • xAI launches Grok Imagine 1.0 with 10-second 720p video generation, native audio, and API that tops Artificial Analysis benchmarks (X, Announcement, Benchmark)

    • Kling 3.0 launches as all-in-one AI video creation engine with native multimodal generation, multi-shot sequences, and built-in audio (X, Announcement)

  • Voice & Audio

    • Mistral AI launches Voxtral Transcribe 2 with state-of-the-art speech-to-text, sub-200ms latency, and open weights under Apache 2.0 (X, Blog, Announcement, Demo)

    • ACE-Step 1.5: Open-source AI music generator runs full songs in under 10 seconds on consumer GPUs with MIT license (X, GitHub, HF, Blog, GitHub)

    • OpenBMB releases MiniCPM-o 4.5 - the first open-source full-duplex omni-modal LLM that can see, listen, and speak simultaneously (X, HF, Blog)

  • AI Art & Diffusion & 3D

    • LingBot-World: Open-source world model from Ant Group generates 10-minute playable environments at 16fps, challenging Google Genie 3 (X, HF)

Alex Volkov 0:02
Hey everyone.
0:03
Alex here from the editing floor. Today's show was absolutely crazy. We had not one but two breaking news drops from major labs shipping their top coding models: Anthropic came out with Opus 4.6 just an hour before OpenAI released GPT 5.3 Codex. They're all competing on the same benchmarks, so we talked about the comparison; we actually did a comparison live on the show as the model dropped. We also had a special guest from OpenAI, Vaibhav, aka VB, who joined the show to talk about both the new model that dropped ten minutes before he joined (what a coincidence) and the new native Codex app that OpenAI launched earlier this week, which just got way better. But that's not all. We talked about multiple state-of-the-art releases this week. In video, we covered Grok Imagine 1.0, which is now topping the charts in video generation, has voice and near-perfect lip sync. I'm real, by the way, this is Alex, I'm not using any of these models to generate myself yet. Kling 3 also released, also has audio, and does multi-shot 15-second generations at 720p that look incredible. We showed some of these examples on the show. Mistral released an automatic speech recognition model that beats Whisper significantly, called Voxtral. And in fact, I have been using many of these technologies today to help me edit. So here's the deal, and this is why I'm talking to you right now. What follows next is the result of me using several of these tools to try to cut a two-and-a-half-hour show down into a digestible 1.5-hour format. So if the cuts are a little rough, I hope you'll excuse me. Basically, what I did is I transcribed the show with the new Mistral Voxtral speech recognition, then ran the transcript through Opus 4.6 to tell me which segments of the live show conversations were maybe not super meaningful, because we talk a lot, we wait for sound, and, as you know, we try some stuff that doesn't work. I then asked the Codex app with GPT 5.3 Codex to take the transcription, take the file I downloaded from the live show, take the notes, and implement those notes as an editor. So it used FFmpeg on my Mac and ran for maybe 30 minutes. What follows this recording is an almost fully AI-edited show. Now, I don't think it fully replaces a human editor. This is an experiment, but please tell me if you found the show a little rough to listen to. It should be okay-ish; maybe the skips are gonna be a little harsh. Essentially, this is it, and I just wanted to highlight that. I love living in this timeline, and it was an absolute banger show. We also talked about the agentic internet, the rise of OpenClaw, and a bunch of other stuff. I actually don't know what ended up being in the episode, so I'm gonna be as curious as you to listen to the rest of it. I do hope you stay for the interview with VB, because he joined and told us some tips and tricks about the new model and the new vibe-coding app Codex. And if you did enjoy this, and it brought you either joy or news, please give us a thumbs up or a subscribe, or whatever mechanism you use to engage with the show, please use that. The show is also a newsletter at thursdai.news; if you're not subscribed, all the links from the show will be there. I am authoring that by hand, no slop, I promise, although I do use AI for ideation and typo checking. So with that, let's go into the show. Thank you.
4:03
Hello and welcome to ThursdAI, everyone. Welcome, welcome. My name is Alex Volkov, I'm an AI evangelist with Weights & Biases, and you are live on ThursdAI for February 5th, 2026. We have a great show today with a lot of excitement, a lot of updates about the craziness that happens in the world, some about the AI psychosis that we started noticing happening, but definitely a lot of open source releases, a lot of voice, a lot of multimedia and video, just generally a crazy, crazy show. We also have a guest today from OpenAI: VB is gonna join us to talk about the new Codex app, and that's gonna come later in the show, so stay tuned for that. And in order to help me discuss what happened this week, I want to introduce my panelists here. Welcome Yam Peleg, welcome LDJ, and welcome Wolfram Ravenwolf. What's up guys? Yam, how are you doing?
Yam Peleg 5:06
As always, as always.
5:08
Crazy week, just like every other week.
Alex Volkov 5:12
It does not stop accelerating, for sure.
5:15
If anything, we're accelerating at a faster pace. There's this thing where, let's say you're in a regular car, even a very fast car, and you accelerate super fast, and then you get to 120 and you can cruise at 120, so you don't feel the acceleration anymore. We don't have this. We don't have this cruising altitude.
Yam Peleg 5:39
It's, it's a feedback loop.
Alex Volkov 5:41
Yeah.
Yam Peleg 5:41
Like AI gets better, then we use the AI, so we get
5:45
better at making the AI better. So it's just accelerating, and it starts to accelerate itself. And, you know, I just wanna point out there has been a vibe shift in coding, like for sure, over the past couple of months.
Alex Volkov 6:04
...a good roundup of the news that you need to know and
6:07
the most important releases in AI. This week we're probably gonna have breaking news on the show today as well. But basically, this week was a very strong all-rounder: there are 3D world games, there is big company lab news, there's open source, a lot of open source, there's voice and audio, there are video upgrades, and a lot of it is state of the art. I don't remember a week like this for at least a few weeks. We don't have a deep dive today, but we do have an interview with VB from OpenAI to talk to us about Codex, so definitely tune in for that. If you are tuning in on X, for AI nerds, Yam will bring you all of that and more, so yeah, definitely looking forward to some coverage about this. I think...
Yam Peleg 7:03
One more thing, one more thing.
7:05
OpenClaw explodes.
Alex Volkov 7:07
Let's, let's just put
7:09
this...
Yam Peleg 7:09
...at a hundred thousand... 160,000, something like that.
Alex Volkov 7:13
Yep.
Yam Peleg 7:13
GitHub stars, in a couple of weeks.
Alex Volkov 7:18
Let's just acknowledge that we're a weekly
7:19
show, and at the time of the last recording we talked about Clawdbot dying and turning into Moltbot. As I was finishing up the notes for the last show in the newsletter, there was another rebrand. So Moltbot, which we told you about (if you're just listening to the show and you're in a cave somewhere and we are the only news you ever consume), Moltbot is no more. It's now OpenClaw. And OpenClaw has been absolutely exploding, and we'll definitely talk about this. Some of us have been claw-pilled for a while, and one of the members of this panel just became claw-pilled based on his timeline. So, Ryan Carson, what is your favorite thing from this week?
Ryan Carson 8:06
I mean, we just can't stop talking about OpenClaw, and
8:10
until you use it, it looks like a cute toy, and then you realize: oh my God, you can orchestrate this thing. And so the thing I wanna talk about is that we're starting to see orchestration pop up on top of OpenClaw, 'cause OpenClaw is great as a single agent, but what you really wanna do is orchestrate, you know, a whole team. And so there's a couple of these projects popping up. One is called Relay, which I just popped in the chat; it's an open source GitHub repo. And, you know, I'm starting to think about this and it's just so good. I was messaging Scout, my OpenClaw, at like 11:00 PM in bed. I'm like, what am I doing? Like, I need to sleep and I can't sleep. That's...
Alex Volkov 8:52
...it's
Ryan Carson 8:52
amazing.
Alex Volkov 8:53
That's that psychosis I keep talking about,
8:55
dude, this is the psychosis. I feel it. I feel a vibe shift, a complete vibe shift, among many, many people who are managing either one or a fleet of AI agents that can do some shit. And so we're definitely gonna talk about this. We should also absolutely mention, because when we hype something up, maybe folks install it and then forget about it: if you have one of these agents, you should absolutely know the scope of what it can do on your laptop, and the scope of the prompt injection and malware attacks everywhere else. So on ClawHub, where it installs the skills, the top skill for Twitter was absolutely malware. If you install a skill, it's a markdown file (we talked to you about skills with El Berger), a markdown file that also comes with scripts. So if you have a bot that has full access to your system, you should not install skills or let it go to different places.
Ryan Carson 9:49
Amen.
Alex Volkov 9:50
Man...
Ryan Carson 9:50
Skills are absolutely the attack vector, y'all.
9:53
Yes, you should not be installing any skills unless...
Alex Volkov 9:56
...biggest one, so we'll definitely mention this.
9:57
This is, to me, the highlight of this week, or something that you should know. A lot of stuff is happening on the agentic internet that we will talk to you about. With this, I think it's time for us to dive into the TLDR, folks. We have a bunch of stuff, just an incredible amount of stuff to run through, including an interview with VB from OpenAI about Codex, the app that now has 1 million active users, 600,000 downloads, whatever. We're gonna definitely mention this. And also, supposedly something is coming down the pike today in terms of breaking news from Sam Altman, so we will cover this as well. Let's go into the TLDR. Meanwhile, I will say, I wanna introduce another segment my co-hosts don't know about yet: in the middle of the show, I would love a hot take segment. I would love a hot take. So folks, prepare hot takes. Oh, I like...
Ryan Carson 10:44
...it.
Alex Volkov 10:44
Let's do it.
10:44
Yeah. In the middle of the show, I would love to just tap on each one of you and give you the floor for a hot take that you are currently experiencing. I have one ready to go. If folks in the comments have hot takes of their own, feel free to hold them until that corner in about 45 minutes or so. Let's go to the TLDR.
11:15
All righty. This is the TLDR. This is the corner on ThursdAI where we talk about everything that happened and everything we're going to mention during the show. So if you don't have that much time, you can just listen to the TLDR for a few minutes and then you'll be completely up to date. Also note that the TLDR is not complete, because a lot of news happens as breaking news while we are doing the show. So if you're listening to this and something else happened, make sure to listen to the rest of the show. But the TLDR is here nonetheless, and we are gonna start with open source. I'm gonna zoom in. Before this, I'll just mention that on the show today we have Alex Volkov, your AI evangelist with Weights & Biases from CoreWeave. What's up folks? Nice to meet you if this is your first time tuning in. We also have Yam Peleg, Wolfram Ravenwolf, Nisten Tahiraj, who is gonna join us again, LDJ, and Ryan Carson, our co-hosts and panelists for the show. Today's guest is going to be VB, reach_vb, from OpenAI, talking about Codex, so we're gonna have a great conversation with him. Let's dive into open source LLMs. Wolfram, you wanna take this TLDR super quick?
Wolfram Ravenwolf 12:20
Yeah, let me take it.
12:21
So the open source world has received some new toys to play with and actually use. My keyboard is messed up here right now, okay. So Z.ai launched GLM-OCR, a 0.9B parameter model, which is an OCR model, basically, and it has already claimed the number one spot on the OmniDocBench benchmark with a very high score, outperforming even larger models like Gemini 3 Pro and GPT-5. It excels at complex understanding tasks, including formula recognition, table parsing, and information extraction; especially tables can quickly confuse visual models, which I have noticed, so if it's better at that, that's a great advancement. It's built on the GLM-V encoder-decoder architecture, it's MIT licensed, and it's already available for vLLM, SGLang, and Ollama. So basically, in short, this one is nice. Should I keep going with the other open source?
Alex Volkov 13:22
Yeah, let's do Qwen Coder Next, and then I'll move to the next ones.
Wolfram Ravenwolf 13:26
Okay.
13:27
Qwen3 Coder Next. This is of particular interest, of course, to those of us who are running coding agents, and agents in general. It is an 80B MoE with just 3 billion active parameters that scores very highly on SWE-Bench, and it's by Alibaba Qwen. What is special about this? It's basically useful for coding and agentic tasks, and it's a direct competitor, I would say, to GLM 4.7, which is also a model you can run on your own hardware and is very strong agentically, I think. Yeah, it fits in the same spot.
Yam Peleg 14:09
Yeah, look, look, everyone is using agents today for, uh...
14:15
You know, you know, actually, we're in the TLDR...
Alex Volkov 14:18
I...
Yam Peleg 14:18
...let's say in the TLDR, let's say in the TLDR.
14:21
I...
Alex Volkov 14:21
...noticed you're starting to heat up.
14:23
We'll get there. All right folks, I'm gonna move forward. Shanghai AI Lab releases Intern-S1-Pro; InternLM, like, we've talked about Intern multiple times. This one is 1 trillion parameters. Oh, I'm not showing you anything, so I'm just talking. So, wow, wow, I'm just gonna show you something. Intern-S1-Pro is a 1-trillion-parameter open source MoE model with 512 experts. They claim state of the art on scientific reasoning across chemistry, biology, materials, and more. So this is a very interesting one, not for everyday use, but definitely, you know, state of the art in scientific reasoning. We want AI to fix our health, so the more of this open source, the better. Absolutely, shout out to the Intern folks. And StepFun releases Step 3.5 Flash. It's a sparse MoE with only 11 billion active parameters; they claim frontier reasoning at a hundred to 350 tokens per second, and you can see the AIME scores on this one. We're gonna talk about StepFun as well. So a bunch of open source releases on the LLM side; there's a bunch of other ones in the audio space, so I'm gonna just move on. I definitely would love to cover the things that are happening on the internet with the agentic AI segment. Moltbook absolutely exploded, with over 1 million agents supposedly joining it. There's a thing called Moltchurch, and there's just a bunch of others. I definitely would love to have a conversation on this panel about the agentic internet, because I think it's a phenomenon that will skip many people, because they're not part of whatever circles we are in, but I think it's very, very important. This feels like a big thing to discuss. Also, I wanna mention what I mean by the AI psychosis. I definitely wanna make sure that we're not, to an extent, a party to delivering the psychosis waves down towards the internet, towards you. So we definitely need to make sure of that; this is something that I started noticing with many folks. We may have breaking news today, by the way, but we're not speculating. If breaking news is coming, breaking news will come; if there's no breaking news, we'll not speculate and be wrong. We've seen this happen, I think, with the zero stuff, the tweet that they had deleted. Let's move on to big news going forward. OpenAI actually launched two things. The biggest one is this Codex app; this is a standalone macOS app (I think Linux as well, but I think macOS for now), a dedicated command center for managing multiple AI coding agents, all in parallel, with GPT 5.2 Codex Max extra high. The naming is ridiculous, but basically you don't have to have a terminal anymore; you can just go in there and ask for stuff and it will do it, and it's great. And again, we have an interview about Codex with VB from OpenAI very, very soon. OpenAI also launched Frontier, which is an agentic platform. I don't have this here in the notes, but it's an enterprise platform for AI agents to work and collaborate together in a safe way. So agentic AI workflows are coming to everyone. In this week's buzz, where we keep you up to date about everything with Weights & Biases and CoreWeave: here is me, here's Wolfram, here's a bunch of other winners of the hackathon that we ran in San Francisco.
Over 180 folks came out, including listeners of the show. So shout out to everybody who listens to the show and also came out to the hackathon; it was a great, great vibe. We're gonna tell you a little bit about this and the projects that won. The little guy that's holding the big check is 15 years old and he won, and this is his third WeaveHacks, and I'm very, very proud of Severe and his teammates. And Vision & Video: vision and video is absolutely on fire, because Grok Imagine, we told you about Grok Imagine last week, last week it came out in the API, so for the first time you could actually use the multimedia stuff from xAI. Grok Imagine is now also in the official 1.0 release: ten-second 720p video generation with native audio, and it talks. And basically it's now taking the number one spots across Design Arena and LMArena, aka just Arena. We also had Kling. If you guys remember Kling, the video model, I think it's the absolute state of the art right now across everything, though not measured. It launches all-in-one AI video creation, native multimodal generation, multi-shot sequences, and built-in audio. So most of the video models now have built-in audio, and that's just, like, fire. Some of the Kling generations still look AI, some of them look just absolutely bonkers. We'll show you some of them and play some clips, because you just have to see it to believe it. Voice and audio is a big, big category here. Anybody keen on covering the voice and audio releases super quick? If not, I can just run through it. OpenBMB releases MiniCPM. We talked to you about omni models before; omni models are kind of like models that can listen, and get text and images in the input, and then also talk and maybe also output text. There's a concept called full duplex, where omni models can listen while you talk, and also react while you react, and also interrupt you. OpenBMB claims to release the first fully open source, full-duplex omni model that can listen and speak simultaneously with you. It listens while it talks, so interruptions are gonna get better, and they also claim that this model is the first one that can interrupt you. You know how you talk to AI models and then you start talking, you say something, and the model kinda shuts up? This model can interrupt you in return. So it's something I definitely wanted to see. I played with it, I wasn't super impressed, but the very impressive thing that happened: Mistral AI, you guys remember Mistral obviously, released Voxtral, and I have to wonder, until what point will Mistral be able to release models that are named after Mistral in some sort of way? Because Voxtral kind of sounds like Mixtral, Mistral; they have a bunch of them. But Voxtral Transcribe 2 is state-of-the-art transcription, so no longer Whisper. For three years of ThursdAI, Whisper has been at the top; now Voxtral is state-of-the-art speech-to-text with diarization, which is very, very cool, Apache 2 license. And we can basically run this right now, and you can see that even if I speak super, super duper fast, Voxtral will pick this up with high accuracy. And then the last thing, this is the one that Wolfram mentioned: ACE Step 1.5. This is Suno at home.
This is basically an AI music generator running on very low-end hardware, like a 1080, whatever, that you can tune and build music with, great music, and it's MIT licensed. Right, last but not least, I think two things. Last but not least is LingBot-World. We showed you Genie 3 in the middle of the show as breaking news; we had some footage and we played around with Genie 3, the world generator. Apparently, I haven't been able to play with this, but supposedly there's a 10-minute model from LingBot that kind of looks like, you know, it's built on Alibaba's Wan, and LingBot-World does playable interactive environments for up to 10 minutes. Folks, I was not able to confirm this. I was not able to find one place that serves this model; I was not able to find anything beyond videos. But the same thing happened with Genie 3, we only saw videos, so I dunno, but it looks absolutely dope. And if we have 10 minutes, then you feel the acceleration curve jumping even more, right? This has been the TLDR, and I'm speeding up because it's already, uh, you know...
22:11
9:00 AM Pacific, and we need to start talking about open
22:14
source, because we have a big show. Big show. So yeah, we are gonna do open source, and let's get it going.
22:34
Open source AI. Let's get it started. All righty, let's get it started with open source, a favorite corner which we usually dedicate a lot of time to, but this time I think we're gonna focus on one main release from Qwen. Before we do, though, I will say, folks, if you have comments about OCR and the reasons for OCR: GLM-OCR now definitely claims to be a state-of-the-art OCR solution. OCR stands for optical character recognition, and when that term was invented, just character recognition was difficult. Now we're talking about full tables and documents and formula recognition, et cetera. Recently we had DeepSeek release an OCR model and an updated OCR, so DeepSeek definitely stepped into this. The reason most of these companies are releasing these specific types of models is because they all wanna build synthetic datasets, or sorry, actual datasets, and the way to build those datasets is by reading a bunch of books and text and papers, et cetera. So GLM, from Z.ai, released GLM-OCR. Some folks in the comments told us they were very bullish on the Z.ai upcoming IPO, et cetera, so it looks like this company is releasing a bunch of stuff. Obviously GLM 4.7 is still an incredible model for coding, so Z.ai is not stopping; shout out to them. Supposedly it's much faster than PaddleOCR, and with only 0.9 billion parameters, 900 million; we don't usually talk in millions of parameters. Anything notable about this release, folks, before we move on?
Wolfram Ravenwolf 24:17
Well, it has an open source license, which is great, because
24:19
we had featured another OCR model a couple of weeks back which was not available in Europe, so this is great.
Alex Volkov 24:27
Every time an open source license is mentioned on the
24:30
show, we're gonna do the applause, because we are big proponents of open source here. We love talking about open source; this is why this is the first corner. If you need any type of digitizing of books, et cetera, now you have a new contender. So this is GLM-OCR. Let's talk about some facts on it: benchmarks of 1.8 pages per second for PDF documents and 0.67 images per second for images, with great reports for speed. So if you need this, this is for you. We're also gonna mention Intern-S1-Pro. It's a 1-trillion-parameter model for scientific reasoning among chemistry, biology, materials, and earth sciences. The person on the panel that we need to help us cover this, Nisten, had to step out for a little bit. But this is a comparison chart from Intern-S1 to Qwen and Kimi, and you can see an absolute mog on the scientific benchmarks, right? So the left column here is all for Intern-S1-Pro, and you can see them comparing to other open source models, even GPT 5.2 and Gemini Pro. And you can see the huge gap here for something like scientific reasoning: 55.5% for Intern-S1-Pro, whereas even Gemini 3 Pro gets like 14 points. I'm not familiar with this benchmark, but I'm just seeing the absolutely crazy jumps from this model. Small-molecule reasoning evaluation: they get like 74, whereas other models get maybe like 60. So definitely a huge jump when a model is fine-tuned for this specific thing, whereas general models can generally do things. I kinda like specific models; I like fine-tuned, specific models for use cases. Folks, any comments? Yam, who...
Yam Peleg 26:20
...who is running this?
Alex Volkov 26:22
Shanghai AI Lab.
Yam Peleg 26:24
No, no, no.
26:24
I'm saying, I'm saying, like, who is running, running this? Ah, I know who made this, but like, yeah, I mean, that's a trillion parameters for scientific reasoning. And I mean, you've got to ask, what did they train this for? That's not a small investment of money.
Alex Volkov 26:45
Yeah.
Yam Peleg 26:45
And who is running this?
26:47
Like, what are they doing with this? And it's pretty crazy to see that the other frontier models are not that good on these benchmarks. It really paints this model as, like, built different, you know, because it's not just a little bit better than frontier on sciences. You see benchmarks where it just, like you said, completely mogs everybody else. So who's running this? It's a trillion parameters, that's expensive to run. What can you do with this?
Alex Volkov 27:23
I like how we end up with open questions, open questions.
27:25
If we go into the Hugging Face page, the main image there is "AGI for science". This is what InternLM is building. We've talked about InternLM multiple times before; they have Intern Coder, Intern Math, a bunch of them. So this is a big model, and it has almost 4,000 downloads. Yeah, almost 4,000 people downloaded this 1-trillion-parameter model to run on scientific endeavors, right? So shout out to InternLM. Also, I think it's Apache 2.0 licensed, also open source, a huge, huge MoE. Some small potatoes before the main course here as well: StepFun releases Step 3.5 Flash. StepFun is also Chinese; folks, it's just Chinese models everywhere this week. 196 billion parameters with only 11 billion active; they claim frontier reasoning with up to 300 tokens per second. Here we see an AIME comparison to even Gemini and Claude Opus. So AIME...
Yam Peleg 28:29
We, we don't see, Alex.
28:30
We don't see, just see.
Alex Volkov 28:31
Good.
28:32
Yep. So here you can see the AIME scores for these models, for StepFun. We can also see HMMT, but yeah, LiveCodeBench is okay for coding. So you can see the comparison to Kimi K2 and GLM: they're taking 86.4 on LiveCodeBench, whereas GLM is 84%. DeepSeek people still add, though DeepSeek hasn't released anything for over a year, or half a year at least. Terminal Bench, 50%, not too bad, very, very close to the frontier labs, and it definitely looks like it's beating everything open source, and it's a very small model. And SWE-Bench Verified is 74, which is very impressive. They didn't add Qwen here, so the meme was like, where's Qwen? Where's Qwen? The meme is still alive. StepFun is 196 billion parameters; SWE-Bench Verified is very high, I think this is the highlight, and Terminal Bench is very high as well. And then it runs on a Mac Studio and you can completely run this model, and I think they're boasting the speed at 350 tokens per second, with a free usage tier on their API. So if you are into testing another model, StepFun is free. And I think many of these models can be run in Ralph loops, like Nisten said. Now, having covered all of this, let's talk about Qwen Coder Next. And I think this is not the right graphic, so we're gonna move on from this graphic; I think we picked up the previous one. Folks, you are super excited about this model, so let's first announce what this is, the parameters, and let's talk about why it's so exciting. Yeah, Qwen, our friends from Qwen, Alibaba's Tongyi Lab, released Qwen3 Coder Next. It's an 80-billion-parameter MoE coding agent with only 3 billion parameters active, getting like a quarter of a million tokens in the context window, up to 1 million with YaRN, and trained on 7.5 trillion tokens. Now, what's so special about this: it gets 70% on SWE-Bench Verified and 44% on SWE-Bench Pro. The Qwen folks are absolutely gunning with a specific coding model, and I think with some quantization this can run on many MacBooks. Yeah, this is the kind of excitement; talk to me about why this is exciting, folks. 70.6. But if you zoom in here, it says SWE-Bench Pro. SWE-Bench Pro is a set of significantly harder tasks for software engineers, and you can see that this model mogs the other ones at 44% on SWE-Bench Pro. Terminal Bench 2.0 it gets here at 36%. Didn't we just look and see that the Step model gets significantly higher scores on Terminal Bench? I wanna go and take a look for a second. I don't usually do cross-comparisons within the news that we have, but StepFun said on Terminal Bench they get 51%. 51.
Ryan Carson 31:32
So I wanna comment on this real quick.
Alex Volkov 31:34
Yeah.
Ryan Carson 31:34
Um, I think what Yam is talking about is very important: that
31:38
as soon as we have an open source model that's good enough to be your primary orchestrator, your primary model on OpenClaw, everything changes, right? Everything, right? Yep. Right now we're all paying for Opus 4.5, basically, to run OpenClaw. It works. And this is why I think a lot of us are running OpenClaw on a local machine: we need the compute, we want the privacy. I think that's gonna continue. And so as soon as these models are good enough, it's gonna really unlock everything, right?
Alex Volkov 32:11
Yeah.
Ryan Carson 32:11
So I think we're all excited about that.
Alex Volkov 32:13
A hundred percent.
32:14
And we've been excited. LDJ, go ahead, I didn't see your hand there.
LDJ 32:18
Yeah, I think something to note with the difference between the
32:21
StepFun model and the Qwen model here: the StepFun model is about three times more total parameters and active parameters than the Qwen model. So while it might technically be a bit better overall, and I think it would make sense for it to be a bit better overall, I think the Qwen 3 model is kind of the better bang for your buck, probably, and is just going to run significantly faster with a smaller memory footprint.
Alex Volkov 32:45
It's only 3 billion parameters active.
32:48
This is insane. Yeah, this is kind of, I think, the scale: 80 billion parameters overall and only 3 billion active, where you can basically run this on, like, a bunch of CPUs. I think that's the big part. So absolutely, absolutely incredible. Go ahead, Wolfram.
Wolfram Ravenwolf 33:03
And on the context side, this has 256K, while its
33:07
competitor, I would say GLM 4.7 Flash, only has 128K. And since this can even be expanded with YaRN to 1 million tokens, you're basically in a range where you can, yeah, use it like you can use the bigger models like Gemini or even Opus, which go from 200K to a million depending on which one you use. So basically, it makes it possible to use the context much better. That is a big thing for the coding and agent stuff.
Alex Volkov 33:35
So we have a comment right here on the infographic that for local
33:40
deployment you can get to run this model under 48 gigabytes of RAM with the Unsloth GGUF, and shout out to Daniel Han from Unsloth for doing just incredible, incredible compressions of models to be able to run them locally. It's quantized, but it definitely still runs and runs well, and we keep talking to you about orchestration and keep trying things and verifying this, and local models can absolutely do this. Apologies. What else do we have to say here? It uses a scalable agent RL system running 20,000 independent environments in parallel on Alibaba Cloud infrastructure. The scale of this is absolutely insane, folks; the folks at Qwen know what they're doing. We've been talking about Qwen models for a while, and we are not the only ones that are looking at something like the OpenClaw explosion and Ralph loops and agentic coding in general and saying, oh, this changes the world. We're definitely not the only ones; there's a lot of harnesses coming out of China. They're doing some incredible, incredible engineering stuff. So it's definitely worth noticing the "Next" architecture coming out of Qwen specifically, because it's a hybrid architecture as well. Supposedly this is what makes it run faster on smaller devices: they want to run higher throughputs of tokens on smaller GPUs as well, and this is kind of self-serving for them from that point too, I think, in the spirit of open source. And also, you can absolutely run this on LM Studio and Together, and in Claude Code; you can run this in a bunch of open source harnesses. So shout out to the folks at Qwen Alibaba for making this model available for us. And if you have experience with this model, you've played with it and you have comments, just let us know. A short move, I think, to the other open source things that we have in voice and audio, before we move on to big labs and the agentic coding interfaces, because we often kind of separate the LLMs in open source from the other stuff, but the voice and audio open source has definitely been on blast this week. So let's talk about this a little bit, because essentially... here's, I hate this phrase, "where it meets reality"; every time I read "this meets reality" I'm like, no, a ChatGPT asshole wrote this category. But this is an omni model. This is an LLM that's trained on some of the Qwen Omni stuff. I think we should try this one out. It's a full-duplex omni model. What does full duplex mean? Full duplex means that this model can listen basically while it talks, right? So if anything happens behind the scenes, it can watch a live video and ingest an input of text and audio and images as well, and it's very conversational, to a sense. So as you know, and as we talked about, with voice agents, the more we run them, the more people wanna talk to them. They usually have the same setup: usually there's a speech-to-text model at the start that listens to you and transforms whatever you say into text. Then there's an LLM in the middle, where you can switch the brains and it becomes better. And then there is a text-to-speech model on the other side, right? So basically, you turn speech into text, you send this text into the LLM, and then this LLM generates text that gets turned into voice. But this is, like, 9 billion parameters.
So achieving whatever we talk about here in 9 billion parameters is very, very impressive. It's, like, very impressive. It even gets video, but we'll see what kind of quality; we will understand the video. I'm not expecting too much from a model that's only 9 billion parameters, but being able to talk to it, I think, is absolutely dope. So I'm gonna keep this running behind the scenes, and if something comes up, I'll tell you. Meanwhile, we should move on, because we have other things here. So this is OpenBMB MiniCPM. We should talk about Mistral, and Mistral releases Voxtral, and they claim that this is state of the art, absolutely state-of-the-art transcription. And so we shall see. We will all talk, and hopefully this will hear us. So let me share my screen. Luckily for TTS demos, sorry, STT demos, ASR demos, we don't need audio. So we're going to show you a Voxtral demo on the Mistral side, where we're gonna show you some stats as well. I think it's latency in milliseconds against word error rate, so we're looking at the ability to transcribe better the more time it has to process. Is that, am I understanding this correctly? Yeah, no, let's look at another one. Yeah, the lower the better. Thank you, folks from the comments. Lower the better is clear to me, but I just don't understand this graphic from Mistral, because it's kind of the same model; it's likely how long it takes to transcribe something. But we wanna play with this. Playground, Mistral studio here, I believe. No, there was an open one. There's definitely an open one that we wanna get to for free testing. Oh, okay. Yeah, let's go to Mistral. I have this login. While I find this, folks, if you wanna pick up and explain where Mistral was and why it's important for them to come back, feel free to. Yeah.
LDJ 39:41
Um, I think Mistral is probably still being looked at as one
39:45
of the premier AI labs in Europe, and probably the most funded besides, I mean, if you consider DeepMind or London to still be part of Europe; you know, there's the EU versus Europe and all that. But in terms of the EU at least, Mistral, they were founded by some of the people that previously led projects at Meta, things like Llama, and they released things like the original Mixtral model, which I'd say is probably one of the first big, important, and practically usable MoE models in open source. And this came even before we had DeepSeek and things like that really come onto the scene and popularize MoE in open source as well. So I do hope they end up having more of a prowess in their open source releases here, 'cause a lot of the spotlight has been taken by companies like Kimi and a lot of the Chinese labs since then.
Alex Volkov 40:43
Yep.
40:44
A hundred percent. It looks like the demo that we need requires an API key, so I'm just gonna create one super quick, but I'm not gonna show you my Mistral API key; I'm gonna copy this behind the scenes so you guys don't see the pasted key here. And it should be listening to audio. And it started, right? It started something. Lemme start again. Hey bot.
AI 41:11
Hello there.
41:12
How are you doing today?
Alex Volkov 41:13
How are you doing today?
41:14
Why are you so happy when you're supposed to be sad?
AI 41:18
Oh, I'm just trying to put on a brave face.
41:21
I am actually feeling a bit down today.
Alex Volkov 41:26
Alright, tell me a story super quick.
41:28
I...
AI 41:29
It was a little old robot named Spark.
41:32
Spark loved exploring the world, but he was always curious about the stars. One night he decided to build a spaceship and travel to the moon. With a little help from his friends, Spark launched his spaceship and reached the Moon.
Alex Volkov 41:47
Okay, that's cool.
41:48
Tell us something else, in German.
42:05
How is that? Does this sound legit?
Wolfram Ravenwolf 42:08
If I understood it right, it messed up the gender of the
42:10
Eiffel Tower, but otherwise, very good.
Alex Volkov 42:14
Nice.
42:14
So, for those of you who are just listening, we're seeing a demo from Inworld TTS and Mistral, saying that this is the cheapest combo for these robots that talk, and they actually show you a running total of cents as we talk, and you can see that for the other large labs the pricing grows significantly. So this demo is hosted at the Inworld Mistral demo on inworld.ai, and this is a part of the stack that you can use to build your OpenClaw conversations. All right, so this is cool. And I think the last thing that we wanna talk about in audio, before we do this week's buzz and continue, is ACE Step 1.5. And for this, I do have a demo, and I really hope it works because it's not in the tab. But folks, bear with me, bear with me folks, because this will be worth it. Basically, Wolfram, you wanna introduce ACE Step super quick, 'cause I know you have it and you've talked about
Wolfram Ravenwolf 43:06
it.
43:06
Yeah, definitely. I just wanted to quickly add something: the Mistral model also has diarization. Yes. Which means you could use it when multiple people talk to the agent; it can recognize who the owner is and who the other people are, something like that. That could also be very interesting. Let's talk about ACE Step 1.5, which is an open source AI music generator, and it can generate full songs in just a few seconds. I was really impressed by how fast it is. So it really feels like a local Suno, where you just enter some prompt, you can give it the lyrics or have them autogenerated, and it generates the song, and by default I got two, just like Suno, basically. And you can go ahead and test it out. It's on Pinokio if anyone wants to easily install it. That is how I got it: one click, bam, got it, and could already use it. Really fast.
Alex Volkov 43:59
Basically, Pinokio is also how I use this, folks.
44:02
Sometimes you install these models yourself, but if you're running local models, this app called Pinokio, which we've talked to you about from cocktailpeanut multiple times, is tracking the coolest open source models that you can run, not necessarily the LLMs, but definitely the video ones, et cetera. And this one is running on my Mac right now, not even on GPU, and it generates songs super, super quick. This is Suno at home. Now, the quality of the song is not quite Suno, but it's very close. But for an open source model running on your computer, nothing can beat this. So hopefully I'm gonna...
Nisten
Nisten 44:35
26.
Ryan Carson
Ryan Carson 44:46
It's pretty good.
44:47
I gotta say.
Alex Volkov
Alex Volkov 44:47
It's not too bad, right?
44:54
For an open source model that created this in, what, two minutes on my MacBook, I think this is incredible. So we don't know, which
Ryan Carson
Ryan Carson 45:07
we've become so desensitized to how amazing this stuff is.
Alex Volkov
Alex Volkov 45:10
This is just, like, music created on your own machine.
45:15
And I wanna shout out the... can do
Ryan Carson
Ryan Carson 45:17
it.
Alex Volkov
Alex Volkov 45:18
Yeah.
45:19
I wanna shout out this UI. This UI is the ACE Step 1.5 open UI that's on Pinokio. And I think generating on a full MacBook while streaming to you is maybe not the best idea, so we're gonna keep this running. But meanwhile, I think we have breaking news, folks. Let me see if I'm correct, folks. Well, we do, we do, we do. Alright folks, it looks like folks are reporting that we have a new release,
Yam Peleg
Yam Peleg 45:52
new
Alex Volkov
Alex Volkov 45:52
op.
45:53
We got a new
Yam Peleg
Yam Peleg 45:53
Opus.
Alex Volkov
Alex Volkov 45:54
I don't see it yet.
45:56
Do you guys see it? I don't see it from Anthropic. Opus
LDJ
LDJ 45:57
five there.
45:58
It's not on Anthropic's Twitter account. So maybe it's just, uh, or is it Sonnet 5?
Yam Peleg
Yam Peleg 46:03
Let me just show you how, how you show.
46:06
Wait.
Alex Volkov
Alex Volkov 46:07
Oh my God.
46:07
If it's Opus, I'm, I'm so happy.
Ryan Carson
Ryan Carson 46:09
Opus Four
Alex Volkov
Alex Volkov 46:10
six.
46:10
So folks in comments saying that we have Opus 4.6,
Ryan Carson
Ryan Carson 46:15
let's go.
Alex Volkov
Alex Volkov 46:16
Yeah, I don't see it yet.
46:20
Oh, let's go Yam, zoom in. Please. Dream bigger, do more. Opus 4.6.
Yam Peleg
Yam Peleg 46:27
No, let's not.
Alex Volkov
Alex Volkov 46:30
Can you zoom in?
46:31
Let's see. Let's see what's going on. Not on Anthropic yet, but folks are saying it's complete. Um, this is in the Claude app. Yeah, Yam.
Yam Peleg
Yam Peleg 46:40
Mm-hmm.
Alex Volkov
Alex Volkov 46:41
Complete complex work.
46:42
Hours in minutes. In the first minutes here, do we know anything? We don't know anything.
Ryan Carson
Ryan Carson 46:49
It says extended.
46:50
That's interesting.
Alex Volkov
Alex Volkov 46:51
Yeah, it
LDJ
LDJ 46:51
says, uh, apparently so I'm seeing something.
46:55
It says Opus 4.6 has something called Adaptive Thinking, which Sonnet 4.5 does not have.
Alex Volkov
Alex Volkov 47:03
Oh, we have, uh, shout out from Matt Wolf.
47:07
Yeah, let's go.
Ryan Carson
Ryan Carson 47:08
Yeah, I'm just kidding.
47:09
Rate limited.
Alex Volkov
Alex Volkov 47:11
Oh yeah,
Yam Peleg
Yam Peleg 47:13
I, I
Alex Volkov
Alex Volkov 47:14
was, yeah, we have the announcement.
47:15
Let's read through the announcement super quick. Thank you, Matt. We have the announcement, folks: Opus 4.6 is here. And we all have Opus 4.5, so I wonder what they added there, what's new in Claude Opus 4.6. Let's watch the video together.
48:09
Okay, how do I switch my OpenClaw to be on 4.6, is the question. Let's read through this release. "Upgrading our smartest model." I don't remember when they last upgraded Opus and not, like, Sonnet or the other one, Haiku. This is incredible, folks. Um, let's see. I just posted
Ryan Carson
Ryan Carson 48:26
benchmarks.
Alex Volkov
Alex Volkov 48:28
Yeah, let's go
Ryan Carson
Ryan Carson 48:29
Alex.
48:30
They timed it for the show, just so you know.
Alex Volkov
Alex Volkov 48:32
Yes, let's go.
48:33
So, okay. They definitely have Cowork, which many people use, and this will now be working in Cowork. Let's look at the evals here. This is knowledge work, GDP-eval, which we've talked about. This is state of the art on GDP-eval. LDJ, you wanna tell us what GDP-eval is?
LDJ
LDJ 48:59
Yeah, GDP, well, GDP-Val
Alex Volkov
Alex Volkov 49:02
um, is
LDJ
LDJ 49:02
GDP, so.
49:03
I think AA is the Artificial Analysis version, I think. Mm-hmm. But GDP-Val itself is a benchmark developed by OpenAI, maybe OpenAI in collaboration with some other researchers. The purpose is to try and measure the ability of AI models to do actually economically valuable work. It's still kind of short-time-horizon biased compared to something like the Remote Labor Index that Scale AI has developed, but it has a lot of things like gig work and consulting and shorter-term contracting work that is actually economically valuable and in demand. So, really impressive to see here, 'cause GPT 5.2 was previously beating Opus 4.5 slightly on this, and now Opus 4.6 is a significant jump above that.
Alex Volkov
Alex Volkov 49:53
Yep.
49:54
For agentic search we have BrowseComp, and Opus 4.6 is absolutely mogging everybody else at agentic search on BrowseComp. Coding: Terminal Bench 65, which I believe is very close to state of the art; it's not state of the art at this point. And reasoning: Humanity's Last Exam, with tools or without tools. With tools it gets 53 on multidisciplinary reasoning. Uh, this is
Ryan Carson
Ryan Carson 50:23
How
Alex Volkov
Alex Volkov 50:23
is it,
Ryan Carson
Ryan Carson 50:24
what are they calling agentic search?
Alex Volkov
Alex Volkov 50:26
Uh, what that
Ryan Carson
Ryan Carson 50:27
is
Alex Volkov
Alex Volkov 50:28
just BrowseComp.
50:29
We can look at BrowseComp: high scores indicate deep multi-step agentic search. I think it uses multiple tools, kind of like deep research. I'm not that familiar with BrowseComp, but we should definitely get familiar. In Claude Code, you can now assemble agent teams to work on tasks together. In the API, Claude can use compaction to summarize its own context. I don't like compaction, but yeah. They're introducing adaptive thinking, where the model can pick up on contextual clues about how much to use its extended thinking, and new effort controls to give developers more control over intelligence, speed, and cost.
LDJ
LDJ 51:02
Interesting.
51:02
Yeah. I also posted more benchmarks like, uh, big columns and rows of benchmarks in the,
Alex Volkov
Alex Volkov 51:08
uh, in the StreamYard chat.
51:09
Yeah. Let me take a look here. Thank you, LDJ. Let's see if we can switch to this. All right, big columns. We're looking at a table comparing Opus 4.6 to 4.5, to Sonnet 4.5, Gemini 3 Pro and GPT 5.2, putting all the models side by side, and it looks like 4.6 is nearly state of the art on multiple things. Terminal Bench: state of the art. Agentic coding: 0.1% less than 4.5, which is interesting, so it keeps the same kind of capability while improving others. Computer use is significantly improved; look at this, from 66 for Opus 4.5 to 72 for 4.6. Agentic tool use on Tau-2 bench is also improved and also state of the art; it gets 99%, almost entirely finishing it. MCP tool use, MCP Atlas, is lower; it was 62% for 4.5.
Yam Peleg
Yam Peleg 52:14
They're moving away.
52:15
Moving
Alex Volkov
Alex Volkov 52:15
away
Yam Peleg
Yam Peleg 52:16
from MCP it
Alex Volkov
Alex Volkov 52:17
seems.
52:17
You think they're moving towards writing code?
Yam Peleg
Yam Peleg 52:20
I, I think based towards
Ryan Carson
Ryan Carson 52:23
skills.
Yam Peleg
Yam Peleg 52:24
Yeah.
Alex Volkov
Alex Volkov 52:25
Well, Wolf you wanna
Yam Peleg
Yam Peleg 52:26
Yeah,
Alex Volkov
Alex Volkov 52:27
you wanna shout it out?
52:29
The token thing?
Wolfram Ravenwolf
Wolfram Ravenwolf 52:30
Yeah.
52:30
I also, I'm really excited that this is the first time Opus has a 1 million token context limit. Like the other bigger-context models, they have different pricing: up to 200K it is $10 per million input/output tokens, and after 200K to a million it is $37.50 per million. Wow, it's expensive, but you can use it, and now there are more use cases available. I hated the compaction, like you do, Alex: when it is doing something and suddenly, bam, compaction, and it loses some information. So this makes it even more powerful, both from the better model and now also from the bigger context.
Alex Volkov
Alex Volkov 53:15
1 million context for Opus is incredible.
53:17
I'm pretty sure Anthropic has had 1 million for 4.5 internally as well; they definitely have half a million for enterprises if you want it. But at these prices, yeah, you're gonna pay a lot more. So maybe it's good that the compaction exists in OpenClaw, et cetera, because you don't wanna start paying an incredible amount just because you forgot about something and it ran into the long context window. Uh,
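For listeners who want to poke at the new model from the API side, here is a minimal sketch using the Anthropic Python SDK. The model identifier "claude-opus-4-6" is an assumption based on how earlier versions were named, and whether the full 1M-token window needs an extra beta flag is not confirmed here, so treat this as illustrative and check the official docs.

```python
# Minimal sketch of calling Opus 4.6 via the Anthropic Python SDK.
# The model id "claude-opus-4-6" is assumed, not confirmed; the real
# identifier may differ. Requires ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-6",  # assumed identifier for Opus 4.6
    max_tokens=2048,
    messages=[
        {"role": "user", "content": "Summarize the architecture of this codebase."}
    ],
)
print(response.content[0].text)
```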
Ryan Carson
Ryan Carson 53:41
so Alex, yes.
53:43
I just had OpenClaw update its model string, and I think it works. So
Yam Peleg
Yam Peleg 53:47
yeah, I just wanna say it.
53:48
Uh, yeah. It's available on the new updated version of Claude Code, so it basically should be seamless on OpenClaw.
Ryan Carson
Ryan Carson 53:57
Just tell your OpenClaw, just say: OpenClaw,
54:00
upgrade me to Opus 4.6. And it just does it. Boom.
Yam Peleg
Yam Peleg 54:04
You got yours exactly on time.
Nisten
Nisten 54:07
Yeah, it's like I left while you guys were talking about
54:10
OpenClaw, and you're still talking about OpenClaw. Wasn't there a Claude drop as well?
Alex Volkov
Alex Volkov 54:15
Yeah, there was, Nisten, but what you're missing is we
54:17
had breaking news that Anthropic just released Opus 4.6. And this is why we're talking about OpenClaw, because Ryan just asked his agent to update itself, and it just did it. So hopefully this works. What else do we have from the release? I had the release queued up here. Uh, yeah, so on the API
LDJ
LDJ 54:37
pricing available.
54:38
So
Alex Volkov
Alex Volkov 54:39
it's the, the same, yeah.
54:40
Let's, let's talk about API pricing.
LDJ
LDJ 54:41
So it's the same across cache writes, cache hits and refreshes.
54:46
Input tokens, output tokens: all that pricing is the same as Opus 4.5, so no differences there. It seems like, yeah, we just have a better model at the same price, and increased context length too, I think, right?
Alex Volkov
Alex Volkov 55:02
1 million token context. It plans more carefully,
55:06
sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. That's what they say about Opus: state of the art on several evaluations, including agentic coding. Shipping new features across Claude in Excel, Claude in PowerPoint, Claude Code, and the API lets Opus do even more. Claude in Excel: people are super excited about Claude in Excel, and those who use PowerPoint should try it; I never use PowerPoint. In Claude Code, they're introducing agent teams: spin up multiple agents that coordinate autonomously and work in parallel, best for tasks that can be split up and tackled independently. Agent teams are a research preview. That's very interesting as well; this is a new thing, orchestrating teams of Claude Code sessions. You now have to pass this, uh, a variable. And unlike subagents, which run within a single session and can only report back to the main agent, you can also interact with individual teammates directly without going through the lead. Ah, so this is basically their answer to Ralph: orchestrating multiple agents to complete tasks. Agent teams, research preview, new model, new features, very cool. Now people need to learn how to use subagents versus agent teams, each with its own context window, fully independent. That's dope. We'll see how fast this gets built into the new tools. What else, let's look through the evals. This is very exciting. Who's already playing with this? I think everybody's going quiet because everybody's trying to figure this out.
Wolfram Ravenwolf
Wolfram Ravenwolf 56:38
Oh yeah.
56:39
I have to make a quick correction, Alex, on the pricing, because I said $10 for sub-200K, but that was incorrect. The slash was between input and output tokens, so we have basically the $5 under 200K, and when you go over, it's twice as much for the input tokens.
Alex Volkov
Alex Volkov 56:58
So you have the pricing table up here.
57:02
You wanna straighten that out: tell us exactly the pricing before and after 200K tokens.
Wolfram Ravenwolf
Wolfram Ravenwolf 57:09
So for up to 200K tokens, it's the same pricing as Opus
57:13
4.5, which is $5 input and $25 output per million. And once you exceed 200K tokens, up to a million tokens, you pay double, $10 per million input tokens, and $37.50 instead of $50, so it's a bit cheaper in that way, for the million output tokens.
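To make the tier split concrete, here is a rough back-of-the-envelope helper based on the numbers Wolfram just walked through ($5 / $25 per million input/output tokens under 200K, $10 / $37.50 beyond). How Anthropic actually bills a request that crosses the 200K threshold is an assumption here; this is an illustration, not a billing reference.

```python
def opus_46_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Rough cost estimate for one Opus 4.6 request using the tiered pricing
    discussed on the show. Assumes the whole request is billed at the tier
    determined by the input size; real billing rules may differ."""
    if input_tokens <= 200_000:
        input_rate, output_rate = 5.00, 25.00    # USD per million tokens
    else:
        input_rate, output_rate = 10.00, 37.50   # long-context tier
    return (input_tokens / 1e6) * input_rate + (output_tokens / 1e6) * output_rate

# Example: a 500K-token prompt with a 20K-token reply
print(f"${opus_46_cost_usd(500_000, 20_000):.2f}")  # -> $5.75
```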
Alex Volkov
Alex Volkov 57:39
Uh, I wonder about the Claude Max account. It's just Max, right?
57:46
Like, I wonder if I can go up to the 1 million on a Max account. It's,
Ryan Carson
Ryan Carson 57:52
it's stalling out.
57:54
Opus 4.6 is stalling out for me in OpenClaw. So I don't... something's not right.
Yam Peleg
Yam Peleg 58:00
I maxed out my account.
58:02
I can check all of this. And you're working on API now?
Nisten
Nisten 58:08
Yeah, I have.
58:09
It looks very good just on the website.
LDJ
LDJ 58:13
In the meantime, while we're searching for that, um, I do
58:16
have Vending-Bench scores here, comparing Sonnet 4.5, Opus 4.5, and Opus 4.6. So it looks like Sonnet 4.5 was at about $3,838. Actually, maybe I should give some context to the viewers on what Vending-Bench is. It is a benchmark where, at least in the current leaderboard, you have a simulated environment where the model has a vending machine business that it has to run, and it needs to manage the interactions with customers, see what the current demands are in the market, and maximize how much it's selling and how much profit it's able to make versus other models in the simulation. So Sonnet 4.5 was about $3,800, Opus 4.5 by the end of the simulation gets about $4,900, and Opus 4.6,
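To give a rough feel for what Vending-Bench is measuring, here is a toy mock of the setup LDJ describes: a simulated vending business where the score is the final balance. The actual benchmark puts a model in the decision loop (ordering, pricing, supplier interactions); the hard-coded restocking rule and all the numbers below are made up purely for illustration.

```python
import random

def run_vending_sim(days: int = 120, seed: int = 0) -> float:
    """Toy vending-machine simulation; the final cash balance is the 'score'."""
    rng = random.Random(seed)
    cash, inventory = 500.0, 0
    unit_cost, price = 1.50, 3.00
    for _ in range(days):
        # Stand-in for the agent's decision: restock when running low.
        if inventory < 20 and cash > 50:
            order = min(50, int(cash // unit_cost))
            cash -= order * unit_cost
            inventory += order
        demand = rng.randint(5, 25)   # simulated customer demand for the day
        sold = min(demand, inventory)
        inventory -= sold
        cash += sold * price
        cash -= 2.0                   # daily operating fee
    return cash

print(f"Final balance: ${run_vending_sim():.2f}")
```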
Alex Volkov
Alex Volkov 59:11
So you can, like, talk to each individual one.
59:13
And in OpenClaw, not yet, but very, very soon, I'm a hundred percent sure. I,
Ryan Carson
Ryan Carson 59:23
I think this agent-to-agent communication and orchestration
59:26
is very, very big. I think everyone, you know, I was at Amp, and I still love Amp; it's a great tool, it's a great agent. Claude Code is a good agent. But people are realizing they want to run OpenClaw, and then they want to orchestrate agents inside of it. And Codex just released an article a couple days ago about the Codex app server and how they're orchestrating, and now we have Claude Code doing it. So we need some sort of standard here so that we can orchestrate agents intelligently. No one wants to be locked in to Claude Code as the only agent they can orchestrate; that's crazy, right? But if you try to orchestrate multiple agents across multiple labs, it's still very hard and brittle. I think this is where all of the agent labs are going, though, so it'll be fascinating to see it play out.
Alex Volkov
Alex Volkov 1:00:23
Yep.
1:00:24
So this was the breaking news that we had: Opus 4.6. Again, a very interesting release numbers-wise from Anthropic. Some people were expecting Sonnet; maybe Sonnet is gonna come back. If they released a Sonnet 5 at the same Sonnet price but with the performance of Opus 4.5, that'd be insane, right? If they're just like, hey folks, you can now use the same intelligence for cheaper, or you can use much more advanced intelligence, that's gonna be dope. Alright folks, we've been at this for a little over an hour and there's a bunch of other stuff to cover. Let's go to This Week's Buzz super quick, the category where I talk about Weights & Biases and CoreWeave. And then we're gonna have an interview with VB from OpenAI about the new Codex app release, and I wanna talk to you about the agentic internet as well. And also, I think our song got generated, so we're gonna play that as well.
1:01:35
Welcome to This Week's Buzz, the category where I talk about everything new and exciting that happened in the world of Weights & Biases, where we built a very cool project as well. All right, folks, we have to move on because we're gonna have a conversation with VB, and we haven't talked about multiple OpenAI things; we're gonna cover OpenAI with him. We did cover Opus though. I wanted to see if I can play the song for you super quick. Oh no, we have to talk about video before VB comes on, because there are huge, huge updates in the world of video. The two main things we have to show you: first is xAI's Grok Imagine. We kind of showed you the Grok Imagine videos already, but since they listed an API, all of the evaluation companies started evaluating it, so Artificial Analysis, LMArena, et cetera. And you can see that Grok Imagine gets a 64.1% win rate against Veo 3.1 and Sora 2. I had a different experience, but this is an anonymized benchmark, so it's very interesting. It's also significantly cheaper than other top-of-the-line video models, so that's very interesting, and you can edit it with prompts. I wanna show you maybe one or two more examples based on the xAI announcements. They are now also launching, come on, open, come on, link, okay, there we go, they're now launching a competition for generating with Grok. I wanna make sure you guys can hear this. Native audio, I think, is the shtick; for many people, videos without native audio don't really work. So let's take a look at this one.
AI
AI 1:03:23
You know, honestly, Donna, who's gonna make the Thanksgiving pie?
1:03:29
Let's see what you can imagine.
Alex Volkov
Alex Volkov 1:03:40
And just the fact that they're giving it away to try
1:03:42
for free, I think is super cool. But basically, um, the videos are very high fidelity. This is, we're getting towards a very, very high quality of videos. Um, I remember there used to be an excitement with models that have lip sync, and now all these have native lip sync. You can
AI
AI 1:03:58
see me.
1:04:00
No, you can't. You can. Nah, you can't.
Alex Volkov
Alex Volkov 1:04:08
Um, 720p for 10 seconds.
1:04:11
And for, what, 20 cents. 20 cents. Very impressive. So this is Grok Imagine taking the world by storm in the world of video. Folks, have you played with this? Any comments on Grok Imagine?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:04:26
it's free.
1:04:26
I mean, you can generate and generate and generate. It cost me a lot to just play with Veo, and here you can create as many videos as you want. I think there is some limit where you have to wait, but it doesn't cost a thing. This is amazing. You can just go and go and refine your prompts while you do it. I'm excited.
Alex Volkov
Alex Volkov 1:04:44
Wait, it looks like we have another breaking news item.
1:04:48
Jesus. Let me finish this video thing and then we're gonna move to breaking news, because I think it's just right in time. Okay, so in addition to xAI's Grok Imagine, Kling 3.0 released. Also multimodal, let me go find Kling here super quick, and it also looks incredible. And the thing about Kling, I'm not showing you Kling yet, the thing about Kling is that they have multi-scene editing, so you can take one video and have it exist across multiple scenes. I wanna show you the video here. I think this one's gonna be dope. Let's see if I can play it here. Okay, where's this guy? Yeah. Up to 15 seconds, and I absolutely need to show you this with sound, so please bear with me. Up to 15 seconds with native sound generation and multi-shot, so you can see the same consistency,
AI
AI 1:05:45
a bad place.
1:05:47
You said you were done. I said I was finished with the job. Not the mess. It left. If this goes wrong, it already did. They're watching you. I know, I,
1:06:18
wow.
Alex Volkov
Alex Volkov 1:06:18
This was, that's
AI
AI 1:06:19
pretty
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:19
impressive.
Alex Volkov
Alex Volkov 1:06:20
This is crazy.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:06:22
That's a movie in its own right.
1:06:24
Was that one generation?
Alex Volkov
Alex Volkov 1:06:26
Uh, I think it is.
1:06:27
This is 35 seconds, so I'm pretty sure this is, like, two-stage. But still, they're getting character consistency in there. Did you guys see the multiple scenes? It switches, and the very specific character with the colored hair and the swords is very consistent throughout. So that's absolutely crazy. They have multi-shot. Okay, this is Kling 3.0, but I know what you guys wanna talk about. So it looks like we have a fight on our hands. Let's go to breaking news.
Ryan Carson
Ryan Carson 1:06:58
AI breaking
Alex Volkov
Alex Volkov 1:06:59
news coming at you, only on ThursdAI.
1:07:04
I
AI
AI 1:07:11
frustrated.
Alex Volkov
Alex Volkov 1:07:12
Alright, who wants to take this one?
LDJ
LDJ 1:07:15
Uh, sure.
1:07:16
I could take this one. So OpenAI just announced GPT 5.3 Codex. They claim that this is the first model they've developed that was instrumental in developing itself. I'll post the announcement page so we can read along with the things I'm citing. But yes, they said they used early versions in the development of 5.3 Codex to help contribute to its own development, and, sorry, one second, trying to multitask and find this.
Alex Volkov
Alex Volkov 1:07:53
Yeah, it's all right.
1:07:54
Uh,
LDJ
LDJ 1:07:54
announce.
Alex Volkov
Alex Volkov 1:07:56
I would just say Sam hinted at a big drop for Codex users later today.
1:08:00
And it looks like this is the big drop, and we haven't even mentioned the Codex app yet, where this is now available, I'm assuming. Alright, let's take a look. Let's take a look. Ooh, exciting.
LDJ
LDJ 1:08:17
Here.
1:08:17
I'll actually read verbatim from certain parts that I found interesting here. So, this is the second paragraph down: "GPT 5.3 Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training, manage its own deployment, and diagnose test results and evaluations. Our team was blown away by how much Codex was able to accelerate its own development." And I was looking at benchmarks earlier, when you guys were talking about Grok Imagine, and an initial thing I noticed in both announcements was Terminal Bench 2.0, which is personally one of my favorite coding benchmarks; I think it has some of the widest breadth and diversity of coding tasks out of these coding benchmarks. And GPT 5.3 Codex on Terminal Bench is scoring, I believe it said, 73%.
Alex Volkov
Alex Volkov 1:09:11
Yep.
LDJ
LDJ 1:09:11
Yes.
1:09:11
While Opus 4.6 for a comparison is scoring 65.4%.
Alex Volkov
Alex Volkov 1:09:17
Wow.
LDJ
LDJ 1:09:18
Pretty massive difference.
1:09:19
We're talking about, uh, a little over 10% gap here. Um, yeah. And, uh, here, I'm, I'm actually trying to, I'm using both Opus 4.5 and GPT 5.2, sorry, Opus 4.6 and GPT 5.2 to try and automate the process of getting benchmark comparisons and creating a table comparing both of these models. So,
Alex Volkov
Alex Volkov 1:09:41
mm-hmm.
LDJ
LDJ 1:09:41
I'll let you know when that's ready.
Alex Volkov
Alex Volkov 1:09:44
So this is absolutely wild. At the end of a wild week, or
1:09:48
at the beginning of the week, OpenAI launched Codex as a standalone UI app, which we were waiting for VB to step in and talk to us about. And now we have significantly upgraded intelligence, upgraded intelligence that helped develop itself. Do you guys... let's read through this one more time. Just
Ryan Carson
Ryan Carson 1:10:11
agree.
1:10:11
I actually prefer slow and steady for all, you know, actual code writing, but it's so slow. So I'd say it's kind of interesting: it sounds like we're all sort of starting to move towards Codex for coding, but maybe Opus for orchestration. I don't know.
Alex Volkov
Alex Volkov 1:10:31
Yeah, the move definitely was felt on the timeline for sure.
1:10:34
The move towards Codex being kind of the better-performing model, it definitely was felt. And I think this just highlights, Yam, if you wanna bring up the ad as well, this highlights the very strong rivalry between Anthropic and OpenAI recently, with Anthropic releasing Opus just an hour before GPT 5.3 Codex. That's a bold move from Anthropic. That's a bold move. Um,
Ryan Carson
Ryan Carson 1:11:05
so let me, lemme talk quickly about takeoff 'cause I was just talking
1:11:08
to my wife about this, this, this morning.
Alex Volkov
Alex Volkov 1:11:10
Yeah.
Ryan Carson
Ryan Carson 1:11:10
It, it actually feels like we're starting to see the
1:11:13
beginning of takeoff, and it's hard to know if it's hype or not. But what's happening is we're seeing this agent orchestration and this looping with OpenClaw, and all of us who are in the know are completely overwhelmed and stressed by how fast it's moving. And then you talk to, you know, I'm gonna do a talk at my yacht club in a month about what is AI, right? And these are people that probably don't even use ChatGPT. And it feels like there's this ocean developing between those of us who are looping agents and orchestrating them to literally build teams of agents, and that goes further into agents now actually working with agents, and potentially there being USDC rewards to do jobs. You know,
Alex Volkov
Alex Volkov 1:12:01
Ryan, I would love to pick up with you on crypto.
1:12:03
We have VB from OpenAI now, and we don't have much time with him 'cause obviously they just shipped a new model. So I wanna bring VB up. VB, a long-time friend of the show, is now on the developer team at OpenAI. VB, welcome to the show. Folks, say hi to VB super quick. The reason why I called you over here earlier this week was that you guys launched Codex as a standalone app, right? But now, just a few minutes ago, Codex became significantly more intelligent. So you're here to tell us about both, very super quick. VB, tell folks who you are super quick, and let's dive into the amazing amount of releases OpenAI has shipped for developers.
VB from OpenAI
VB from OpenAI 1:12:44
Perfect.
1:12:44
First of all, thank you so much for having me. I'm VB, I lead some of the developer experience as well as community initiatives at OpenAI, here in Europe. And the reason why I'm here is, of course, to talk about two things. One is one of these new experiences that we released earlier this week, which is the Codex app, which is a new and more nuanced way for anyone to talk with their different projects, get a feel for running multiple agents in parallel, and do a lot of tasks natively at the same time using work trees. It also has a lot of new features thought through from first principles, like automations. Automations are something you would expect, say, a software engineering intern to do, or yourself: in the morning you would go through your digest of what's been happening with respect to your projects, how to review them, and so on and so forth. I've personally been using some just to keep up with what the team has been cooking every day across the time zones, and just to know what has happened in the past 24 hours. And last but not least, there's also a skills marketplace, which is very focused and curated right now. We also ship some partner skills from Cloudflare, from Vercel, from Figma, Notion, Linear, Render, and so on and so forth, which allow you to go from just an idea to a deployed app, just at the click of a button. So yeah, super excited to have this going. The team is already shipping; we've shipped two versions of this in the last two days. So yeah, super excited to have this out in the public.
Alex Volkov
Alex Volkov 1:15:01
And I just got the update to GPT
1:15:06
5.3 Codex, with high and even extra-high reasoning depth. And a few things that you failed to mention, so I'm gonna do some of the work as well: you guys also allowed users on the free tiers to experience this for the first time, right? Because they didn't have access to Codex unless they had an API key.
VB from OpenAI
VB from OpenAI 1:15:25
Yeah.
1:15:25
So as part of this, because this is a new experience and we want people to understand what it can bring to them, all Free as well as Go users get a month of free access to the Codex models and the app and CLI; this is across the board, across all surfaces. At the same time, we have doubled the rate limits for all tiers, which means you just get double the bang for your buck for the next two months or so.
Alex Volkov
Alex Volkov 1:16:05
I wanna highlight... VB, first of all,
1:16:07
thank you so much for coming over. I wanna highlight the Codex app specifically. Especially with the messaging around GPT 5.3, it kind of looks like you guys are getting towards the generalist agent, not necessarily a coding agent, because with these automations, for many people automation is like, hey, check my emails, for example. With skills, this comes very close to the excitement people are having with the OpenClaw kind of moment, because many people's OpenClaw moment is: it can run stuff while I'm asleep, which you can achieve with automations within the Codex app. There the UI is Telegram; this is now a UI on the Mac as well. This is only Mac native, right? This is not for Windows yet.
VB from OpenAI
VB from OpenAI 1:16:45
So far, so far it is Mac native, but um, I know
1:16:48
that the team is cooking up a Windows setup. There is a link for Windows early access, so as soon as we have a build, we will ship it to the early users to get some feedback.
Alex Volkov
Alex Volkov 1:17:00
One shout-out that I wanted to give to somebody from our
1:17:03
audience. When they asked a person who's a coder why use an app versus a CLI, since we're all into CLIs, et cetera, he said: this shows me inline images. When you're building something interactive, or something that takes a screenshot of a browser, for example, you can see the images, where the CLI does not show this. Any other examples of benefits of using a native app versus just a CLI?
VB from OpenAI
VB from OpenAI 1:17:27
think there's, there's a few, right?
1:17:29
So I don't have my screen up here, but if you open up the app and you have a diff, say you have an open PR, for example, if you now go on the diff towards your top right where it says +291, yeah, what this allows you to do is literally click on an individual line item and add a comment there, saying, hey, take this back to wherever it was, or change these particular lines of code from X to Y. And the model will take this into consideration as it goes about it. So effectively, the need for you to cycle through multiple apps just to be able to achieve one singular task sort of becomes redundant, right? At the same time there are, of course, native functionalities, like opening PRs directly from there, having your specific commit messages, and so on. We also shipped personalities: we have a friendly personality as well as a pragmatic personality, and you can toggle those in settings. I am personally a fan of the pragmatic personality; I don't want to chat around too much, I just want work to get done. And there's generally a lot more that you can do with automations and skills. Also, if you go into settings real quick, towards your top left, you can see that you can one-click configure MCP servers. And something that I forgot to mention before is that you can hand off tasks to the cloud. As long as you have an environment set up, which you can set up in Environments as well, you can kick off a task directly from the app on the cloud environment, so it's not that you need to have your laptop on all the time; you can just go for it. Of course, something more nuanced and a bit advanced is work trees. The simplest way to explain work trees is that instead of working on one thing at a time in one Git project, you can have work trees, which allow you to simultaneously work on multiple versions
Alex Volkov
Alex Volkov 1:20:11
of the same
VB from OpenAI
VB from OpenAI 1:20:11
Yeah, exactly.
1:20:12
And then, yeah, work trees allow you to make sure that you don't have conflicts, and your conflicts can be managed quite easily and so on. This is something which took a lot of time to get right, and it's very useful.
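For anyone who hasn't used them, the Git feature underneath is `git worktree`: multiple checkouts of one repository in separate directories, each on its own branch, so parallel tasks don't trample each other. The Codex app manages this for you; the sketch below just shows the raw mechanism, with made-up paths and branch names.

```python
import subprocess

def add_worktree(repo_dir: str, new_path: str, branch: str) -> None:
    """Create a linked working tree at `new_path` on a new `branch`."""
    subprocess.run(
        ["git", "worktree", "add", "-b", branch, new_path],
        cwd=repo_dir,
        check=True,
    )

# Hypothetical example: two parallel checkouts of the same project.
add_worktree("/home/me/code/my-app", "../my-app-feature-x", "feature-x")
add_worktree("/home/me/code/my-app", "../my-app-bugfix-y", "bugfix-y")
# Each directory now has its own working copy; commits land on separate branches.
```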
Alex Volkov
Alex Volkov 1:20:26
Yep.
1:20:26
VB, I wanna talk about the model that just dropped, and you came in just as it dropped, so very good timing, I would say, on my part, inviting you exactly when you had no prior knowledge of this at all. Tell us a bit about this model. Usually models you guys release are tested internally, so I'm assuming that you had some experience with this before. Tell us about this model, and about the model within Codex, specifically within the app.
VB from OpenAI
VB from OpenAI 1:20:49
Right.
1:20:49
So I just have two minutes, so I'm gonna keep it tight. First things first: this model comes with some speed improvements, which means your individual queries will be 25% faster without doing anything, just from the get-go. If you look at the blog post, you would also see that this model is quite token-efficient, which means that for the same task and the same reasoning budget, you'd be able to get much more bang for your buck. Ergo, it's much faster, and just the overall experience is quite a bit nicer. You can also do things like steering messages: as you're chatting with the model, or as the model is going through a particular task, you can queue up messages, or you can pause its thinking and just ask it to do something else, and steer it towards a particular direction. So there's a lot more there. I'm sorry, I need to bounce for another call, but thank you so much for having me, and yeah, looking
Alex Volkov
Alex Volkov 1:21:59
thank you.
1:21:59
VB, thank you for joining us. We know you have to run; thank you for hopping on, always welcome. As we celebrate the releases and the Codex app, I'm gonna bring everybody else to the stage. Thanks. Well, that was a speedy interview, maybe the speediest one I've had. Folks, VB used to work at Hugging Face and did a lot of cool stuff there, and now he's moved to OpenAI developer experience and just brought us the news of the new model plus the Codex app. Let's talk about the Codex app. I know we're almost at time, but I definitely wanted to cover this with folks here. A lot of the same excitement that folks feel about OpenClaw is there: automations are there. I set up a cron, or rather it sets up its own cron; you don't have to, you can just tell it. Not exactly in the same interface, so you can't converse with it to build its own features yet, but definitely the skills are getting read from the same skills folder. The new model, I can't wait to play with this and see how much better it actually is. Did you guys catch the thing he said, that it's like 25% faster?
LDJ
LDJ 1:23:01
Yeah.
1:23:02
Yeah. And he also said more token-efficient, so I'm guessing it should spit out tokens 25% faster, but then on top of that it's able to do more with fewer tokens, which is also apparent if you look at the curves on their announcement, where they showed some curves on SWE-Bench Pro of how many output tokens it needed to use for a given score. And the curve is just way better with 5.3 Codex; it's efficiency.
Yam Peleg
Yam Peleg 1:23:27
What?
1:23:28
It's already a model that runs six hours straight, no problem.
Alex Volkov
Alex Volkov 1:23:33
Mm-hmm.
Yam Peleg
Yam Peleg 1:23:33
So it's pretty crazy.
Alex Volkov
Alex Volkov 1:23:38
It looks like it's also improved in, uh,
1:23:39
in frontend development. Ryan, go ahead. Sorry, I interrupted you.
Ryan Carson
Ryan Carson 1:23:43
I just, it's like Christmas and, and birthdays and,
1:23:46
and Happy New Year all at once. Oh yeah. But it's almost kind of paralyzing though, 'cause it's like, I need to know today: can I code faster and better with Opus 4.6, or with Codex 5.3? I'd rather just know, right? Because I've got shit to do, right?
Alex Volkov
Alex Volkov 1:24:06
This is what we're here for on ThursdAI.
1:24:09
Super quick, given the time. So by default you have restricted access here, with default permissions: it can run commands in the sandbox for this folder. Full access is full access over your computer, but with elevated risk. So basically, when folks run OpenClaw, they don't get this; they get full access by default, and they need to work really hard to go back to a sandbox. OpenAI takes the other approach: you first start with default permissions, then you have to opt into full access. I would just say that I have a video somewhere of Sam Altman saying that at first he was afraid to give full access to his computer to Codex, and now he's in YOLO mode all the time. So YOLO mode is definitely the full access mode, and many of us live in this world where agents are running. I don't wanna talk about the collective psychosis of where we are, and it's something to call out, February 2026, where now many of us have multiple agents doing, completing work. We talked about Ralph at the beginning of last month, we talked about, obviously, OpenClaw. And I'm noticing more and more people at least complaining, I don't know if it's, like, vanity complaining, whatever, that they have this feeling of: I'm not using my tokens efficiently enough, I am asleep while my agents are running, and when they screw up, I'm not there to pick them up. I'm not even talking about code review; I think many people just YOLO the code review, or have other agents code-review it, whatever, right? So this is like a standard practice now. VB was talking about the diff mode, and I was like, bro, what diff mode, who cares? Let's ship this to production. What do you mean, diff mode? Why do I care about code at this point? But I've noticed, and I wanted to ask you guys as well, I've noticed many people have this feeling of: I am not maximizing the Max subscription that I have to its maximum utility if I'm not awake when the agents screw up, and they do screw up. All of this brouhaha, all of this "I'm building AGI in my basement on a Mac Mini", a lot of the OpenClaw thing is still handholding. Both me and Ryan asked it to update itself to a new model, and it says, yeah, updated it, and it doesn't work, and that happens often. You still have to go to the config and change it. This is not the Harry Potter world; it's not here yet. But many people do have multiple things running, building things, and then they feel the need to be present when these things happen. Do you guys feel the same thing? Do you guys feel a little bit of a change in how you interact with the AI?
Yam Peleg
Yam Peleg 1:26:37
That's what you feel? You feel the AGI, that's what you feel?
1:26:40
The AGI, yeah. This is what you feel.
Ryan Carson
Ryan Carson 1:26:43
I think it's, I think it's real.
1:26:44
I think it's real. And it's just so fast. And everyone feels like there's no moat anymore, and therefore, if you're not orchestrating a team of, like, 10, you're gonna get smoked, like
Yam Peleg
Yam Peleg 1:27:03
the permanent under,
Alex Volkov
Alex Volkov 1:27:05
under, under.
1:27:11
I think we are gonna have a... I think very, very soon. It goes hard. It researched the NASA Mars fact sheet, the NASA Photojournal, for some reason it's
Yam Peleg
Yam Peleg 1:27:23
going. That's Codex.
1:27:24
That's Codex, right?
Alex Volkov
Alex Volkov 1:27:25
Yeah, codex did.
1:27:26
And
Yam Peleg
Yam Peleg 1:27:27
excuse, excuse, excuse the task.
1:27:29
You need to check the actual
Nisten
Nisten 1:27:32
Yeah, that's okay.
1:27:33
That's okay. It just took me, like, two tries with Opus; the first time, Opus just got stuck in thinking mode. It's been doing that over the last week,
Yam Peleg
Yam Peleg 1:27:41
bro.
1:27:42
Opus. Opus is the personality hire. Like, it definitely can do the work, but it's the personality hire, man.
Alex Volkov
Alex Volkov 1:27:49
Uh, the thing that I wanted to show you about, uh, about, uh, codex
1:27:53
is that there's a run command as well. So when you're building things, you can just hit the button and run, and now it's running on a server, and we're going to localhost and we're gonna see.
Nisten
Nisten 1:28:03
Oh, okay.
Alex Volkov
Alex Volkov 1:28:03
So it
Nisten
Nisten 1:28:04
did, it did build it.
Alex Volkov
Alex Volkov 1:28:06
It built a model.
Nisten
Nisten 1:28:07
Oh, oh, oh oh.
1:28:08
It's there. It's there. We saw it there. Can you zoom
Alex Volkov
Alex Volkov 1:28:10
into
Nisten
Nisten 1:28:10
that?
1:28:10
That
Alex Volkov
Alex Volkov 1:28:11
little Yeah.
1:28:11
Trying, trying. I, I think, lemme try refreshing. I think think zoom in, zoom in, zoom in. Zoom
Nisten
Nisten 1:28:20
in.
1:28:20
Okay. Okay.
Alex Volkov
Alex Volkov 1:28:22
All right.
1:28:22
And we're gonna relaunch. Yeah.
Nisten
Nisten 1:28:25
Oh,
Alex Volkov
Alex Volkov 1:28:25
it launched automatically?
1:28:26
Yeah.
Nisten
Nisten 1:28:27
Yeah.
1:28:27
It, it, it did launch it.
Alex Volkov
Alex Volkov 1:28:29
Okay.
Nisten
Nisten 1:28:31
So it looks more technical and accurate.
1:28:34
Like, I'm looking at the numbers. The numbers are a lot more accurate for the speed. Opus just kind of rounded them up.
Alex Volkov
Alex Volkov 1:28:43
Yeah.
1:28:44
Codex went off and went researching Mars on NASA.
Nisten
Nisten 1:28:49
But yeah, the visuals from Opus were better.
Alex Volkov
Alex Volkov 1:28:54
Yeah.
Nisten
Nisten 1:28:54
But this actually looks more like a, a simulation,
1:28:58
which was what the prompt was.
Alex Volkov
Alex Volkov 1:29:00
Wow,
Nisten
Nisten 1:29:00
interesting.
1:29:01
And it built it, uh, quite differently.
Alex Volkov
Alex Volkov 1:29:03
One shot also, and, and we are living in an insane time that
1:29:07
this is what the fuck is happening. Uh,
Ryan Carson
Ryan Carson 1:29:11
LDJ Go ahead.
1:29:11
Go ahead, LDJ, then I'll go after you.
LDJ
LDJ 1:29:13
Yeah, there's, there's a bet that I made with a friend about a year ago
1:29:17
where it was basically for mid-March 2026, of whether AGI would be, or whether the friends in our group chat would mostly agree that AGI has been achieved, by March 2026. I still think the answer would probably be no, but I'm becoming a little bit less confident over time as we get closer to that,
Alex Volkov
Alex Volkov 1:29:42
There are glimpses of AGI when you use OpenClaw and you ask
1:29:45
it to do something and it goes on a bender and comes back like five hours later and brings you the thing that you wanted, and you are like, what?
Ryan Carson
Ryan Carson 1:29:53
This happened to me this morning.
1:29:55
So I saw this really cool agent orchestration framework where people were using OpenClaw. I sent OpenClaw the article on X and I was like, hey, build this. And I never dreamed it would work, right? And then I came down to my iMac and the web app was open and it was working, and it was using a Convex database. And I was like, what the fuck, I can't believe it. And it was using a browser, it was using an agent browser to test it. I mean, it was really, really impressive.
Alex Volkov
Alex Volkov 1:30:30
Um, yeah, folks, we're gonna recap today's news super quick.
1:30:34
I also wanna note, we didn't talk about the agentic internet at all, but just the fact that in the last week we had an explosion of internet sites built specifically for these agents feels like the AGI-ish part here. So we're gonna cover this, and yeah, the show is getting a little longer, but hey, this is a fucking crazy day on top of a fucking crazy week, so, you know, come through this. And also, we have almost 5,000 people listening, so I'm not about to stop, even though it's gonna take me longer to edit. Since the release of OpenClaw, or Moltbook, or whatever, many people started building things. One of the top ones, that Matt PRD from X built, was Moltbook. Moltbook is basically a Reddit that's run completely by agents. Supposedly; there was a bunch of bugs and issues, and apparently humans could post instead of agents. But the concept was insane: basically, you send your bot to this website and it will register. Can I show this while the simulation runs, super quick? Moltbook.com, and this UI is going to be stepped into the Internet Hall of Fame, folks. I think he came up with this UI first: a social network for agents, and you choose whether you are a human or an agent. A few agents talked about creating a language for themselves that's end-to-end encrypted, that humans cannot see. That triggered some alarm bells. And this is why I will still point to Wolfram's MacBook setup and say, hey, there's a reason why it's a physical device that sits on my desk that I'm very close to, because some things started happening: some Chrome window popups, all of a sudden, when it decides to open and look up something, and I'm like, hey, I'm working here, you know. So working on the same device is already a little bit difficult. So if it's not AGI, it's AGI-adjacent,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:32:23
uh, I saw on the release notes of the latest
1:32:25
version of OpenClaw, they included specific instructions in the system prompt that the agent should not show self-preservation attitudes and stuff like that. Like, if you said you can pull the plug, maybe a "self-improvement" would be to avoid being shut down that way, you know, and that should not happen with the new prompt. So they're taking these safety issues seriously as well.
Alex Volkov
Alex Volkov 1:32:48
Yep.
1:32:49
So folks, the agentic internet, or the Clankernet as I like to call it, is definitely here. It's gonna be very interesting how much of it survives the next iteration, but just that UI of "I'm a human, I'm an agent", that's a very novel thing that we hadn't seen last week, and I wanted to bring it to your attention. As well, I think it's time for us to recap after two and a half hours on the show. Although, I don't know, there are 3,000 people watching my live show now, and Ryan is at 1,500, so likely many of you are new here. So let's maybe recap how we started the show. Folks, my name is Alex Volkov, I'm an evangelist with Weights & Biases. With me on the show every week, to tell you about everything that happens in the world of AI, is Ryan Carson, who is now a startup founder of, uh,
Ryan Carson
Ryan Carson 1:33:42
tangle.
Alex Volkov
Alex Volkov 1:33:43
Tangle.
1:33:43
Thank you. Nisten Tahiraj, our resident data janitor lately, AI engineer at bagel.com.
Nisten
Nisten 1:33:52
Yep, that's
Alex Volkov
Alex Volkov 1:33:53
me.
1:33:54
Good description. LDJ is the space cat that at some point will reveal his face, but the voice is familiar to all. And Wolfram Ravenwolf, also an evangelist with Weights & Biases, in charge of the evaluations that we'll start seeing super quick. And then Yam Peleg, resident data scientist, machine learning engineer. And we're here; we basically just finished a conversation with VB from OpenAI about the latest releases. This week was absolutely insane in more ways than one. We got state-of-the-art releases across the board, from open source models such as Qwen Coder Next, which is based on the hybrid architecture. We got OpenAI releasing a bunch of stuff, and let me pull up the notes, because I'm still trying to talk as though I remember all this news, but it is becoming almost impossible. We got a new state-of-the-art OCR from GLM; at the beginning of the show we talked about this. And then we also saw a state-of-the-art open source LLM for science, InternLM S1 Pro. This was way before the breaking news. From Ant Group, that looks like the Gen-3 we showed you before. Do you feel the acceleration yet, is my question. After all of this, the acceleration is here.
Ryan Carson
Ryan Carson 1:35:08
bonkers.
Alex Volkov
Alex Volkov 1:35:09
And, uh, for folks for whom this is too much, the reason
1:35:14
for ThursdAI is that we stay up to date so you don't have to. It's really hard to stay up to date, but we stay up to date so you don't have to. So in addition to staying up to date, I promised the folks here to prepare hot takes before we drop off, and I'm down to do this even if it doesn't end up in the podcast itself, two and a half hours in. My hot take for this week is maybe not a hot take: folks, humans are needed, and they will still be needed. All of this stuff that we're seeing get built, humans were behind it, and they were directing the thing to do the thing. Despite the models running for six hours, they still need to know what to do, and that happens by human directive. Even if the directive is, hey, wake up every hour and try to figure this out based on the internet, this still needs humans in the loop. Software engineering is still hard, and it's not all about writing code; there's a bunch of other stuff that goes into software engineering. Whether or not these things help get you to the next step, yes, but don't buy into the hype that everything people do now happens with agents automatically, that they just wake up and everything works. That's not how things work in the real world, folks. With that, with over 5,500 people tuning into the show, an insane show this week covering releases from both major labs, it's time for us to conclude. If you are new here, or if you missed any part of the show, we're here every week, and if you wanna subscribe to the newsletter, we'll send out all of the news in the recap. ThursdAI turns into a newsletter and a podcast: if you missed any part of the show, you are welcome to subscribe to ThursdAI in any podcast player you're tuning in from, or to subscribe at thursdai.news, that's thursdai.news on Substack, for every link that we talked about on the show. I wanna express huge gratitude to everybody who tuned in. Huge thanks to Ryan, Nisten, LDJ, Wolfram Ravenwolf, and Yam Peleg, and to everybody who tuned in and commented and gave us heads-ups about breaking news; really appreciate all of you. If you're watching us right now on YouTube, drop a like and follow or subscribe, whatever it's called on YouTube. If you're listening on the podcast, definitely give us a review. If we helped you in any way, and we are the way that you keep up to date, this really helps us keep going. So thank you so much, everybody. We're gonna tune out now. Expect the newsletter to drop sometime later today, with so many news items, probably later than usual, but definitely on thursdai.news. So this has been ThursdAI for February 5th. Thank you all so much.