Episode Summary

The week OpenAI went full throttle. GPT-5.5 dropped mid-show: SOTA across Terminal-Bench, SWE-Bench, GDPval and Frontier Math, using ~40% fewer tokens than 5.4. GPT-Image-2 posted the biggest Arena ELO jump ever (200+ points), generating functioning QR codes, perfect infographics, and 360° street-view images that Peter Gostev stitched into a 24-hour walkable world. Codex now has real multi-cursor computer use on macOS plus Chronicle screen-memory. On the open-source side, Kimi K2.6 became Wolfram's best-ever open model and Qwen3.6-27B dense beat Alibaba's own 400B flagship. Oh, and Claude Design shipped, dropping Figma stock 7%.

[Interactive recap: "The Week That Broke The Chart", an infographic generated with Claude Design]

Hosts & Guests

Alex Volkov
Host · W&B / CoreWeave
@altryne
Peter Gostev
Head of AI · Arena (formerly LMArena)
@petergostev
Wolfram Ravenwolf
AI model evaluator · r/LocalLLaMA
@WolframRvnwlf
LDJ
Nous Research
@ldjconfirmed
Nisten Tahiraj
AI operator & builder
@nisten
Ryan Carson
AI educator & founder
@ryancarson
Yam Peleg
AI builder & founder
@Yampeleg

By The Numbers

Terminal-Bench 2
82.7%
GPT-5.5 state-of-the-art, up from 75% on 5.4
GPT-Image-2 Arena jump
+200 ELO
Biggest single jump ever recorded on Arena; beat prior top by 300 points
Longest task
8.5 hrs
Peter Gostev: 'It hasn't literally finished the first one.' GPT-5.5 ran one task overnight without stopping
Qwen3.6
27B dense
Apache-2.0, beats Alibaba's own 400B flagship on every major coding benchmark
Kimi K2.6
1T MoE
32B active, SOTA open-source on SWE-Bench Pro at 58.6
Anthropic
$30B ARR
Crossed the $30B annualized revenue mark this week

🔥 Breaking During The Show

GPT-5.5 drops mid-show
OpenAI ships GPT-5.5 and GPT-5.5 Pro during the livestream. State-of-the-art on Terminal-Bench 2 (82.7%), SWE-Bench Verified (73%), GDPval (84%), Frontier Math (35%). Uses 40% fewer tokens than 5.4, which offsets most of the doubled API pricing (roughly 0.6 x 2 = 1.2x the cost per task). Codex-first rollout.

📰 Intro & TL;DR - Week in Review

Alex welcomes the full cohost lineup back (Ryan from Japan, Wolfram, Yam, LDJ, Nisten) and runs through the TL;DR. OpenAI's week of dominance: GPT-Image-2 shattering Arena, a GPT-5.5 leak in Codex teased via base64 ('NS41'), Claude Design crashing Figma stock, xAI's $10B Cursor collaboration with a $60B acquisition option, and two massive open-source drops from Kimi and Qwen.

  • Full cohost panel reunion: Ryan back from Japan, everyone live
  • NS41 = base64 for '5.5'; OpenAI leaked their own model in Codex
  • Cursor → xAI: $10B collab structure with $60B acquisition clause
  • Anthropic crosses $30B ARR, resets all Claude quotas, admits degradation
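The base64 claim is easy to verify. The sketch below assumes the posted string is the four characters NS41 (which is what the base64 encoding of "5.5" works out to); the variable names are illustrative.

```python
import base64

teaser = "NS41"  # assumed spelling of the string discussed on the show

# Decode the teaser back to plain text.
decoded = base64.b64decode(teaser).decode("ascii")
print(decoded)  # 5.5

# Round trip: encoding "5.5" yields the teaser string again.
reencoded = base64.b64encode(b"5.5").decode("ascii")
print(reencoded == teaser)
```

Note that the string decodes to the version number only, not to the full model name.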
Wolfram Ravenwolf
"The benchmarks take time. The analysis takes time. And when you are done with one, the next one is already there. But I'm not complaining. This is the acceleration we've been waiting for."
Alex Volkov
"Welcome to livestream number five since the last show."

🔓 Open Source: Kimi K2.6

Moonshot AI drops Kimi K2.6: a 1T MoE with 32B active parameters, 256K context, and a modified MIT license. It claims open-source state-of-the-art on SWE-Bench Pro at 58.6. Wolfram calls it the best open-source model he's ever tested on his private wolf-bench.

  • 1T parameters MoE, 32B active, 384 experts, MLA attention
  • 256K context window, modified MIT license
  • 58.6 on SWE-Bench Pro (SOTA open source)
  • Wolfram's best open-source model ever on wolf-bench
Wolfram Ravenwolf
"Kimi 2.6 is the best model in the open source department. Both are the best."
LDJ
"Kimi seems to be the one that's less academically minded than Qwen, but kind of more creative and more poetic, more diverse in its outputs."

🔓 Open Source: Qwen 3.6-27B

Alibaba ships a dense 27B Apache-2.0 model that beats their own 400B flagship on every major coding benchmark. Plus Qwen3.6-Max-Preview on API. The dense-beats-MoE story keeps evolving.

  • Dense 27B, Apache 2.0 license
  • Beats Alibaba's own 400B flagship on coding benchmarks
  • Qwen3.6-Max-Preview also live on API
Yam Peleg
"Have you guys seen Qwen? The one that gives you Opus four or five at home."

🔓 OpenAI Privacy Filter (Apache 2.0)

OpenAI open-sources a tiny 1.5B MoE with only 50M active params: a privacy/PII filter that runs in the browser on WebGPU. Perfect companion for agent security stacks like Brex's CrabTrap.

  • 1.5B MoE, 50M active params, Apache 2.0
  • Runs fully in browser via Xenova's Transformers.js
  • Designed to identify and remove PII in datasets
LDJ
"It's a model for helping identify and remove personally identifiable information within datasets, whether that's a company wanting to fine-tune on their own personal data or for whatever other reason."

🎨 GPT-Image-2 - Thinking Mode for Images

The biggest jump in Arena ELO history: GPT-Image-2 is 200+ points above the last top model. A thinking/reasoning image model that generates functioning QR codes, renders equirectangular 360° images, produces photo-perfect character consistency (even Dario Amodei), and 'writes code' by generating screenshots of IDEs containing SVGs that actually render. Ryan is integrating it into his weekly marketing pipeline today.

  • +200 ELO over prior top model on Arena (biggest jump ever)
  • Functioning QR codes embedded in generated images
  • Multi-image character consistency: can generate full manga pages
  • 4K output, equirectangular 360° images (Peter's street-view hack)
  • Generates pixel-perfect screenshots of IDEs with working SVG code
  • New meta: GPT-Image-2 designs UI → Codex implements
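To put a 200-point gap in perspective, the standard Elo expected-score formula estimates how often the higher-rated model would win a head-to-head vote. This is a sketch that assumes Arena scores behave like classic Elo ratings on a 400-point scale; Arena's actual rating method may differ, and the function name is illustrative.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Compare a typical gap between neighbors with the gap quoted for GPT-Image-2.
for gap in (50, 200):
    print(f"{gap:>3}-point gap -> {expected_win_rate(gap):.1%} expected win rate")
```

Under these assumptions a 50-point gap corresponds to winning a bit under 6 in 10 matchups, while a 200-point gap corresponds to roughly 3 in 4.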
LDJ
"There's not more than a 50-point gap between any of those 50 top-ranking neighbors. The exception is GPT-Image-2. Even on medium reasoning mode, it's over 200 points above the last top place. It's insane."
Ryan Carson
"It's good for real stuff, not fancy fun play stuff. I'm already integrating this into my marketing engine."
Wolfram Ravenwolf
"It's not just an image model. We have intelligence in the images that we didn't have before. It is so mind-blowing to see what you can do now outside of just good-looking images."

🤖 Codex: Computer Use & Chronicle

Codex now has true background computer use on macOS: a second cursor that works while you work, running on its own thread. It's so good, 'any other computer use is computer useless.' Plus subagents each controlling different windows in parallel. And Chronicle: Codex takes a screenshot every 10 seconds and has total screen memory. Ask 'what was I doing an hour ago?' and it knows.

  • Background cursor that doesn't take over your mouse; it works while you work
  • Multi-agent: subagents click in parallel windows
  • Software Apps Inc. (ex-Apple Shortcuts team) acquisition paying off
  • Chronicle: 10-second screenshots feed into Codex context
  • Alex used it to auto-quote-tweet from a prompt, with verification
  • OpenAI Codex passes 4M users
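The Chronicle idea (periodic captures that can later be queried by time) can be sketched as a timestamped buffer. This is purely illustrative: the class, its fields, and the text summary stored per capture are assumptions, not how Codex implements Chronicle. Only the 10-second interval comes from the show.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional

CAPTURE_INTERVAL_S = 10  # interval mentioned on the show


@dataclass
class ScreenMemory:
    """Hypothetical time-indexed log of screen captures, stored as text summaries."""
    timestamps: List[float] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)

    def record(self, ts: float, summary: str) -> None:
        # Captures are assumed to arrive in chronological order.
        self.timestamps.append(ts)
        self.summaries.append(summary)

    def at(self, ts: float) -> Optional[str]:
        """Return the most recent capture taken at or before `ts`, if any."""
        i = bisect_right(self.timestamps, ts)
        return self.summaries[i - 1] if i else None


memory = ScreenMemory()
for step, activity in enumerate(["editing show notes", "reading X", "running Codex"]):
    memory.record(step * CAPTURE_INTERVAL_S, activity)

print(memory.at(15))  # the capture taken at t=10
```

A question like "what was I doing an hour ago?" then reduces to a lookup at `now - 3600`; a real system would also need retention limits and privacy controls.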
Alex Volkov
"Once you try Codex computer use, any other computer use is absolutely useless. It's computer useless."
Wolfram Ravenwolf
"I've been waiting for this from the computer operating system manufacturer. Apple or Microsoft could have built this already: a multi-user system where the AI is another user working with you on its own desktop."
LDJ
"OpenAI acquired a company called Multi back in June 2024. Their goal is to make computer use an inherently multiplayer experience. Ever since then I've been waiting for this."

🛠️ Brex CrabTrap - Agent Security

Brex's CEO pair-programs with Codex and open-sources CrabTrap, an LLM-as-judge HTTP proxy that intercepts outbound agent requests, uses natural-language rules, and blocks risky activity. Wolfram changes his pick of the week on the spot.

  • LLM-as-judge proxy for outbound agent traffic
  • Natural-language rule definitions for risky behavior
  • OpenClaw banned at CoreWeave; this is the enterprise fix
  • Ryan: 'intelligence monitoring all traffic, absolutely going to happen'
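The proxy pattern described here (intercept an outbound request, ask a judge whether it violates natural-language rules, then allow or block) can be sketched as follows. Everything in it is hypothetical: the rule text, the `OutboundRequest` type, and the keyword-based `stub_judge` that stands in for a real LLM call. It is not CrabTrap's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical natural-language rules a judge model would be prompted with.
RULES = [
    "Never send credentials or API keys to external hosts.",
    "Do not issue DELETE requests to production services.",
]


@dataclass
class OutboundRequest:
    method: str
    url: str
    body: str = ""


# A judge takes (rules, request) and returns (allowed, reason).
Judge = Callable[[List[str], OutboundRequest], Tuple[bool, str]]


def stub_judge(rules: List[str], req: OutboundRequest) -> Tuple[bool, str]:
    """Stand-in for an LLM call; a real judge would send the rules and request to a model."""
    if "api_key" in req.body.lower():
        return False, rules[0]
    if req.method.upper() == "DELETE" and "prod" in req.url:
        return False, rules[1]
    return True, "no rule violated"


def gate(req: OutboundRequest, judge: Judge = stub_judge) -> bool:
    """Return True if the request may be forwarded, False if it is blocked."""
    allowed, reason = judge(RULES, req)
    print(f"{'ALLOW' if allowed else 'BLOCK'} {req.method} {req.url}: {reason}")
    return allowed


gate(OutboundRequest("GET", "https://example.com/docs"))
gate(OutboundRequest("POST", "https://example.com/upload", body="api_key=abc123"))
```

Keeping the judge behind a plain callable means the keyword stub can be swapped for a model-backed judge without touching the gating logic.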
Wolfram Ravenwolf
"I want to change my pick of the week to CrabTrap. Every week my agent is doing deep research on how to secure agents, because the more access I give them, the more concerned I am."
Ryan Carson
"Intelligence is on demand now. What company would not want intelligence monitoring all their traffic to make sure their employees are not doing bad things? Absolutely this is going to happen."

🔥 BREAKING: GPT-5.5 Drops Live

Mid-show, OpenAI ships GPT-5.5 and GPT-5.5 Pro. Terminal-Bench 2 jumps to 82.7% (from 75%), SWE-Bench Verified to 73%, and GDPval is state-of-the-art, beating Opus 4.7 and Gemini 3.1. It uses 40% fewer tokens than 5.4, so the net cost per task rises only ~20% (0.6 x 2 = 1.2x) despite pricing doubling to $5/$30 per million. Alex gets it live in Codex and runs a computer-use quote-tweet in real time.

  • 82.7% Terminal-Bench 2 (SOTA), up from 75% on 5.4
  • 73% SWE-Bench Verified, 84% GDPval: state of the art
  • 40% fewer tokens at double the price → roughly 1.2x the cost per task, not 2x
  • $5 / $30 per million tokens; Pro: $30 / $180
  • Live demo: computer use quote-tweeting in Chrome
  • Not yet in ChatGPT; Codex-first rollout
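The cost trade-off follows from two numbers quoted on the show: roughly 40% fewer tokens and doubled per-token pricing. A back-of-the-envelope sketch, assuming both figures apply uniformly to a task (real workloads mix input and output tokens at different rates, so this is only indicative):

```python
def relative_task_cost(token_ratio: float, price_ratio: float) -> float:
    """Cost of a task on the new model relative to the old one (1.0 = same cost)."""
    return token_ratio * price_ratio


# 40% fewer tokens means 0.60 of the old token count; pricing doubles.
ratio = relative_task_cost(token_ratio=0.60, price_ratio=2.0)
print(f"Relative cost per task: {ratio:.2f}x")
```

Under these two assumptions alone, the per-task cost comes out around 1.2x, so the token savings offset most, but not all, of the price increase.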
Yam Peleg
"Just to be clear: across the board state of the art, right? From thinking and above, everything is state of the art."
Alex Volkov
"State of the art while using almost 50% less tokens. All right folks, let's welcome Peter Gostev from Arena."
Wolfram Ravenwolf
"If a model is thinking longer, it can actually be detrimental on the agentic benchmarks. That's probably why the score is higher now: it decides it doesn't have to think so much, but act and then correct instead of overthinking."

💬 Peter Gostev Joins - First Impressions

Peter from Arena AI (ex-LMArena) joins with early access impressions. The headline: 'This is the first time a model can actually properly do long-running tasks.' He queued up prompts overnight expecting them to finish by 3am, then woke up to find the first one still running. 8.5 hours on a single task, then seven-and-a-half hours on another. 'Reflex loops are dead.'

  • First model that genuinely sustains multi-hour coherent work
  • Three long-running tasks going simultaneously
  • Better conversational feel, less abrupt than 5.2-5.4
  • Still needs iteration; vision reflection is lacking
  • Front-end design: great with a spec, poor one-shot
Peter Gostev
"The biggest thing that jumps out is that this is the first time when a model can actually properly do long-running tasks. All previous models, they kept saying you can do it for many hours, but every time I shouted, it never did it."
Peter Gostev
"I queued up thermal prompts to keep it going, and then when I woke up I thought okay, it'll be done at 3am. I woke up and it hasn't literally finished the first one. All of this queuing up was completely unnecessary."
Peter Gostev
"We are not at AGI yet. We still need to trick them a little bit, massage them, understand how they behave."

🧪 Peter's 24-Hour Babylon Street-View Experiment

Peter's overnight project with GPT-5.5 + GPT-Image-2: planning out the Hanging Gardens of Babylon and generating ~400 equirectangular 360° images that stitch into a walkable, Google-Street-View-style reconstruction of a place whose real appearance nobody knows. Started at 1am London time, still running at broadcast. 'Reflex loops are dead.'

  • ~400 equirectangular 360° images of ancient Babylon
  • GPT-5.5 orchestrated planning, coordination, and code
  • Topaz upscaling on Replicate for 4K fill-in
  • Alex: 'Street view of a place that doesn't exist'
  • Peter: 'It did exist, we just don't know what it looks like'
Peter Gostev
"I came up with this idea at about 1am London time, and it worked the whole night. It's been running about seven and a half hours on another task. Every time I check: seven hours. Literally seven hours. I can't even update the bloody app because it keeps running."
Alex Volkov
"You basically created street view of a place that doesn't exist."
Peter Gostev
"Well, it did exist, but we don't know what it looks like."

🎨 Claude Design - Figma Dropped 7%

Anthropic ships Claude Design on Friday as a research preview on Opus 4.7. It's not a Figma replacement, but it's magical enough that Figma stock dropped 7% at the news. Alex generated a full ThursdAI brand kit (logo, tokens, the opener videos for this episode) end-to-end in Claude Design, a flow Codex then used live to produce a GPT-5.5 launch video.

  • Research preview on Opus 4.7, claude.ai/design
  • Figma stock -7% at release
  • New usage meter added to Claude Max settings
  • Alex generated ThursdAI brand kit + opener videos with it
  • Companion: Codex picks up the kit, generates launch video in 9 min
Nisten Tahiraj
"I am kind of blown away by this design thing."
Ryan Carson
"We have crossed a new threshold. With the entrance of Claude Design plus GPT-Image-2, we are now in a spot where you can really begin to get professional design out of AI."

⚡ This Week's Buzz - W&B LEET TUI Workspace Mode

W&B LEET, the W&B terminal UI (arriving just as everyone is talking about TUIs), ships workspace mode: multi-run comparisons, GPU metrics, and images rendered right in your terminal.

  • Multi-run comparison in the terminal
  • Live GPU metrics
  • Images rendered directly in TUI
Alex Volkov
"Everybody's like going home about TUIs. W&B also has a TUI. It's called LEET, and it now shows GPU stats inside the TUI, which is really really good."

📰 Recap & Outro

Four hours live, 5,000 viewers, GPT-5.5 dropped mid-show, GPT-Image-2 reshaped image gen, Codex learned to use your Mac, Claude Design crashed Figma, and two new open-source SOTA models landed. 'How could we not have covered everything?'

  • Almost 4 hours on air
  • ~5,000 concurrent viewers at peak
  • Full coverage of GPT-5.5, GPT-Image-2, Codex CUA, Claude Design, Kimi K2.6, Qwen 3.6-27B, Privacy Filter, CrabTrap
Alex Volkov
"Crazy, crazy week in AI. With almost 4 hours live and almost 5,000 of you tuning in throughout, it's been a great show. Thank you so much for joining us."

TL;DR

  • Hosts and Guests
  • Big CO LLMs + APIs
    • OpenAI launches GPT-5.5 and GPT-5.5 Pro: SOTA across the board (Blog, Livestream)
    • OpenAI GPT-Image-2: biggest Arena Elo jump ever, thinking mode for images (X, Eval site, Livestream)
    • OpenAI Codex: Background Computer Use + Chronicle (screen memory), hits 4M users (Chronicle)
    • GPT-5.5 pre-launch leak in Codex dropdown (X)
    • Anthropic Claude Design: research preview on Opus 4.7, Figma -7% (X)
    • Anthropic resets all Claude quotas, admits degradation, allows OpenClaw CLI back (X)
    • Anthropic ARR crosses $30B
    • Google Gemini Deep Research + Deep Research Max on Gemini 3.1 Pro (X)
    • Google Gemini Enterprise Agent Platform (X)
    • ChatGPT Agents “Hermes” leak: builder/studio + Slack integration (X)
    • OpenAI clinician/medical model + workspace agents released
  • Open Source LLMs
    • Moonshot Kimi K2.6: 1T MoE, 32B active, SOTA open source on SWE-Bench Pro (X)
    • Alibaba Qwen3.6-27B: dense 27B, Apache 2.0, beats own 400B flagship (X, HF)
    • Alibaba Qwen3.6-Max-Preview on API (X)
    • OpenAI Privacy Filter: 1.5B MoE, 50M active, Apache 2.0, runs in browser (X)
  • Tools & Agentic Engineering
    • Brex CrabTrap: LLM-as-judge HTTP proxy for agent security (X)
    • OpenAIDevs Euphony: open-source Codex session log visualizer (X)
  • This week’s Buzz - Weights & Biases
    • W&B LEET TUI goes workspace mode: multi-run, GPU metrics, images in terminal (X)
  • Voice & Audio
    • StepAudio 2.5 TTS: natural-language control of emotion and delivery (X)
  • Deals & Industry
    • SpaceX/xAI <> Cursor: $60B acquisition or $10B collaboration structure

Alex Volkov 0:45
Hello, hello, welcome to ThursdAI.
0:49
This is Alex Volkov coming to you live from Denver. It's a little bit later than we usually start, but I hope some of you who joined us on the livestream saw a few of the openers that were prepared by Claude and Hyper Frames. I'm gonna tell you all about this. Today is a big day. NS41. If that means anything to anyone here, then you are too connected to X. You need to leave your house and go touch some grass. But if it means nothing to you, and if you're asking in our chats, "What is NS41? Everybody's saying NS41 is today," then we'll tell you all about this. Plus we have a huge show. And to help me through explaining everything that happened in the world of AI today, let's bring up some cohosts here. We'll get Ryan Carson, who's back, Wolfram Ravenwolf, Yam Peleg, and LDJ. What's up, folks? How are you doing? Let's start with our long-lost brother, Ryan Carson. Welcome back, dude. What's up?
Ryan Carson 1:46
Let's go, everybody.
1:47
It's so good to be here. I was in Japan with my family and I'm back.
Alex Volkov 1:51
And you are back, and you chose a hell of a week to be back, man.
Ryan Carson 1:55
I'm excited.
1:55
Good to be here.
Alex Volkov 1:56
It's a crazy week.
1:57
Were you up to date at all, or were you just disconnecting? Do we need to bring you up to date?
Ryan Carson 2:03
Dude, this is the problem, man.
2:04
You can't turn off now. And I can code from my phone, so, yeah, I didn't go anywhere.
Alex Volkov 2:10
You didn't disconnect.
2:10
Just a different time zone. We'll say hi to Wolfram. Wolfram, what's up? How are you doing?
Wolfram Ravenwolf 2:15
Hey, it's hard to keep up with all the model releases.
2:18
The benchmarks take time. The analysis takes time. And when you are done with one, the next one is already there. But I'm not complaining. I mean, this is the acceleration we've been waiting for, so keep going.
Alex Volkov 2:30
This week definitely felt accelerated.
2:33
My usual spiel is that until Wednesday it's kind of "here's a piece of news, and another piece of news," and only when I start preparing the show notes (we're a serious business here, folks, we have a rundown, a show document and everything) do I start feeling the "oh my God, there's so much to talk about." Not this time. I knew this week was gonna be insane from the moment we ended the last episode, because the moment we ended that episode, Codex dropped a huge new update. We didn't have enough time to tell you about it, and then I went on four livestreams since then. Some of you, like Milosh and some folks in the audience, have been with me throughout all these livestreams. So welcome to livestream number five since the last show. We have tons to talk about. Yam Peleg, how are you doing, man? What's new? I see the glasses. You ready for NS41?
Yam Peleg 3:23
Have you guys seen Qwen?
Alex Volkov 3:25
Mm. Which of the Qwens?
3:27
There's two Qwens, man. Which one?
Yam Peleg 3:29
The one that gives you Opus four or five at home.
Alex Volkov 3:35
That...
Yam Peleg 3:36
That kind of one, you know. Yeah.
3:40
And Codex, and there are also some rumors. No spreading rumors.
Alex Volkov 3:45
It's not even rumors, bro.
3:46
It's not even rumors at this point.
Yam Peleg 3:47
Not even rumors at this point.
Alex Volkov 3:48
If you go to the OpenAI official account, they posted something that says NS41, and NS41 in base64 is basically 5.5. If you take the string that they posted and convert it back from base64, you get 5.5. So we are going to...
Yam Peleg 4:06
Conspiracy.
Alex Volkov 4:06
Yeah.
Yam Peleg 4:07
Conspiracy confirmed.
Alex Volkov 4:09
Conspiracy confirmed.
4:09
It's not that big of a conspiracy. Yeah, they leaked it in Codex. Let's just say somebody saw a screenshot in Codex with a bunch of other models that we also have to talk about. Again, for folks who are just joining us, OpenAI is about to drop a new model. We don't know when. So we're really, really hoping that they know that ThursdAI is going on and they're just gonna drop in the middle. They love dropping in the middle. So I'll talk directly to you, the audience: please, if you are monitoring the situation like us, send us a link in the chat if anything happens from OpenAI, in case we're getting too excited about the show and we're all in some debate. Tell us that you're seeing that OpenAI is about to launch something. But there's a bunch of open source as well, which I think we should probably start with very soon. LDJ, what's up? What is on your mind? What is the one thing that must not be missed in AI from last week?
LDJ 5:05
From last week, or from the past seven days?
Alex Volkov 5:08
Since we finished the last ThursdAI.
LDJ 5:11
Yeah, well, since he already mentioned Qwen, I'll mention Kimi.
5:15
Kimi's model seems pretty impressive. I think, as usual, Kimi seems to be the one that's maybe less academically minded than Qwen, but kind of more creative and more poetic, kind of more diverse in its outputs. And I think it'll be especially interesting to see what types of web designs people make out of that.
Alex Volkov 5:39
Yeah.
5:40
Kimi K2.6. Let's just make sure that folks who follow us know exactly, 'cause Kimi was out there for a while. Alright folks, I think it's time, maybe, to start with the TL;DR. Folks are saying in comments that the FOMO is unreal. I agree, Hela, the FOMO is unreal. Somebody wants to ask us a question. Folks are saying they're watching Twitter like a hawk, monitoring the situation. Just so folks aren't confused, this show is not called Monitoring the Situation. You are on ThursdAI. There's a different show called Monitoring the Situation. We've been at this for way longer than they have, and we dive significantly deeper than just covering the news. So hopefully we'll dive deep. I think it's time for the TL;DR. We'll tell you about everything that happened in the TL;DR section before we actually get to the deep dives. And then we will definitely wait for GPT-5.5 today. We'll stay on air for 10 hours if we have to. No, we're not gonna stay on air for 10 hours. That's not gonna happen. Some of us have work to do. But we're really hoping that OpenAI will stay true to the name and drop GPT-5.5 in around an hour. That's usually when they do this. If there's gonna be a livestream, we will restream it, like we did with GPT Image, by the way. My big thing from this week: I have two, but I have to focus on one. It's really hard. Actually, I have three, so in this case I'll just go to the TL;DR. I won't go through three "one thing that must not be missed" picks, but it's been a hell of a week and it's about to get disrupted even more. Right. So let's jump into our corner called the TL;DR, where we talk about everything that we're gonna cover. I'm gonna do hopefully a quick one.
7:31
We are in the TL;DR. Let's add Nisten. We are doing the TL;DR. My name is Alex Volkov, and I'm an AI evangelist with Weights & Biases, your host for today. Cohosts: we have everyone. Everyone's here, finally. Yes. Ryan Carson to my right. I don't know if you guys see it mirrored, but Ryan Carson, right here. Wolfram Ravenwolf, LDJ down there, Yam Peleg, and Nisten Tahiraj. We have no guest today. It's just us. I think it's gonna be plenty, because there's tons of stuff to talk about. So, okay. The number one thing we have to talk about is GPT Image 2, folks. OpenAI finally released the response to what Google has had leadership in for a long, long time. GPT Image 2 is in the API and in Codex. It's OpenAI's new image model, and it renders images up to 4K resolution. It's a thinking and reasoning image model, which means that the more thinking you give it, the better it does. It does insane things like generating fully working QR codes and barcodes. It does equirectangular images in 3D. It's absolutely insane. Production-grade editing workflows. Some of the imagery, if you saw the thumbnails for the show, was generated with Image 2. It has great character consistency. It's not perfect, nothing is, but it's really, really, really good. So we're gonna talk about this. It's so good, man. It's so good. OpenAI just fucking knocked it out of the park with this one. I'll... hmm, no, I'll have to fix the showing part, 'cause I really wanna show you stuff. But it broke out in the Arena Elo score by a significant margin. I don't know if you guys saw this. And we went on a livestream, me and Peter Gostev from Arena, and he went through a very detailed breakdown. So if you missed any part of this and you want more examples than we'll show you today, definitely check that out. OpenAI internal models leaked.
So we know that GPT-5.5 is ready to go. We don't know when, but OpenAI today posted "NS41" on their account, which is base64 for 5.5, as in GPT-5.5. There are other models in that leak, with code names like Aine and Glacier Alpha and GPT Rosalind. I have no idea what those are, but I definitely know 5.5 is ready to go, from leaks. No background information at all. Also in big companies and LLMs, one big one was Claude Design. I don't know if you guys tried that, but that is fucking magical. It was released on Friday. This is an early preview of something. And you know as well as I do, many people use Claude to do some designs. Many people use the Figma MCP, blah, blah, blah. They released the whole thing on Friday. They crashed the Figma stock by like 5%. And I can see why. It's not a Figma replacement, but oh my God, that UI is so good. I will absolutely have to fix my shit to show you, because I generated a whole set of brand guidelines for ThursdAI, with the logo and everything. Some of the opening videos that I showed you are the result of those brand guidelines. They have added a new usage meter in Claude Max settings. This is only available for Claude Max subscribers. And I blew through that like nobody's business. I can't use it until tomorrow. So hopefully I'll be able to show you something. Also in big companies news, have you guys heard of this VS Code clone, Cursor? Anybody here heard about Cursor?
Ryan Carson 11:05
What's that?
11:06
I don't know.
Alex Volkov 11:06
Apparently Elon Musk did.
11:08
And apparently Elon Musk is ready to give them $10 billion to experiment inside the GPU system of xAI. And apparently, if that is successful, there is a $60 billion deal to buy Cursor into xAI.
Ryan Carson 11:25
It's basically $60 billion to buy it, with a $10 billion break clause.
11:29
So I think it's gonna happen.
Alex Volkov 11:33
There's gonna be a lot of training happening, and they're gonna test the...
11:35
I have a very interesting take on this, but yes, it's insane. SpaceX and Cursor, sorry, SpaceX, xAI, and X are all gonna IPO at some point very, very soon. Probably the $60 billion is just, you know, Elon's gonna sneeze and reap $60 billion out of the air. So it's not that big a deal; it's just that the fact that Cursor is validated at that price point right now is absolutely mind-blowing. Insane. Specifically because just two weeks ago we talked about IDEs being dead, and then a week ago we had folks from Devin and other places to talk about IDEs. Okay, so this is big news. In open source, we have two big releases this week, also very big releases. We have a full show to talk about these releases. Moonshot AI open-sourced Kimi K2.6. It's a 1 trillion parameter mixture of experts claiming open-source state of the art on SWE-Bench Pro. And we have Wolfram, who I think tested this out, and we can talk about Kimi K2.6 already, and BrowseComp as well. Great evals all around. And then Qwen, our friends from Alibaba. Week after week, they keep releasing things. This one is Qwen 3.6 27B. It's a dense 27B model. Last week we talked about the MoE of Qwen. This is a dense Qwen that beats their own flagship 397 billion parameter model on every major coding benchmark. So a 27 billion parameter model beats their almost 400 billion parameter model. We're gonna talk about that one as well. One of the biggest updates from this week, I'm pretty sure, is in tools and agentic engineering... Somebody said 3.6. Did I say a different number? Yeah. Qwen 3.6. That's what I said.
Nisten Tahiraj 13:19
2.6.
Alex Volkov 13:20
No, Kimi K2.6.
Nisten Tahiraj 13:23
3.6.
13:23
Yeah, yeah, yeah. Qwen 3.6.
Alex Volkov 13:25
Yes.
Nisten Tahiraj 13:26
It's just...
Alex Volkov 13:26
Yeah, there's too many points.
Ryan Carson 13:28
You can't keep 'em straight.
Alex Volkov 13:29
I dunno.
13:29
Yeah, it's hard. That's why I have notes that I can't show you, but hopefully I'll be able to fix this once one of you starts talking for a minute. All right, tools and agentic engineering. This is the corner where we talk about, you know, many of the folks who watch the show, many of you who didn't use to be AI engineers, are becoming AI engineers working with AI tools, so agentic engineering is an important corner. One of the biggest things in tools and agentic engineering is that OpenAI is trying to catch up to Anthropic, and Anthropic passed the $30 billion valuation, blah, blah, blah. We talked about this. ARR, not valuation; it's a small distinction. Anthropic passed $30 billion in ARR, not valuation. Codex from OpenAI has passed 4 million users. That's a big one. And last week at the end of the show, we told you that they released a bunch of stuff, and we absolutely missed the most important fucking thing. Codex now can do computer use. Now you may say, "Hey, Alex, Claude could do computer use for a while. Why are you so...?" No, you don't understand. You have to understand this. A year ago (this is the TL;DR still, but I'm gonna go on a little spiel), OpenAI bought this company, or, like, six months ago. They bought Software Apps, Inc., the folks who almost released a thing called Sky. These folks created Shortcuts on iOS. These folks built some stuff inside Apple. Their computer use, only on macOS for now, is incredible. I don't know how many of you are Mac users. I'm assuming, Wolfram, you have a Mac, but also a PC. Ryan, I'm pretty sure you're on Mac; I saw videos. Yam, I don't know about you. Nisten, you're a Linux guy. Their computer use is insane. It works while you work. I will have to show this. I will stop the stream and start a new one just to show you, because you guys don't understand how cool that thing is.
It's just incredible. And they're promising a speed up of like 10 x. They're saying we just like started experimenting with this. Um, just incredible computer use. But also Codex now supports image generation. Last week it was GPT Image two 1.5, but now Codex supports GPT Image two. It's really, really good. Uh, they have plugins and can see automations, but like Codex computer use is something I wanna show. Also, they release something called Chronicle. I dunno if you guys saw Chronicle. Chronicle is like, do you guys remember Rewind The ai, the thing that watches your screen. All the
Ryan Carson
Ryan Carson 15:42
Oh yeah.
Alex Volkov
Alex Volkov 15:44
Codex Now.
15:45
Codex now watches your screen, takes screenshots every 10 seconds and has full context into everything that you're doing on your computer. It's magical. It's creepy as fuck, but it's magical. Supposedly it doesn't go to OpenAI, supposedly, but, uh, I really like it. Uh, so Codex has big updates. Um, and also in tools and agentic engineering: the CEO of Brex, no less, open sourced CrabTrap. It's an LLM-as-a-judge proxy that you give your agents, and it judges whether or not, uh, your agent is doing something illegal. It's really, really cool. Also in open source, I completely forgot, 'cause my notes are not perfect. Uh, OpenAI released a new model under Apache 2. Did you guys see this? It's incredible. I think, uh, I will need to go and bring this up because I don't have the full details, but OpenAI released it on Hugging Face. Go ahead.
Yam Peleg
Yam Peleg 16:36
Yeah.
16:37
Tiny model. Tiny
LDJ
LDJ 16:38
model
Yam Peleg
Yam Peleg 16:38
classifier.
LDJ
LDJ 16:39
It's, it's a model for helping identify and remove personally
16:42
identifiable information within data sets. Yeah. So whether that's company wanting to make a Finetune on their own personal data or for whatever other reason, this is something really useful.
Alex Volkov
Alex Volkov 16:53
It's so useful.
16:54
And we're gonna have a demo for you as well from our friend Xenova from Transformers.js. Yes, 'cause it runs in the browser. It's a 1.5 billion parameter model, but it's MoE with 50 million active. So it's, like, very, very tiny and runs quick. Uh, it's a privacy filter model. It's very important to have. And the reason I'm bringing this up is because there's the CrabTrap thing that I told you about. You can absolutely use this model in the middle of it. Ryan, go ahead.
Ryan Carson
Ryan Carson 17:17
Uh, two other things from big companies that I think we're gonna
17:19
cover, but just to make sure we do, uh, OpenAI just released their workspace agents, which I think is a big deal. Uh, and then also they just released this clinician, uh, model slash uh, product for doctors. So there's a lot of stuff coming out of OpenAI right now.
Alex Volkov
Alex Volkov 17:36
OpenAI has been dominating this week.
17:37
There's no doubt about this in my mind. And they're about to, um, execute a killer blow today, hopefully at some point. LDJ, go ahead.
LDJ
LDJ 17:46
It's something you mentioned earlier in the leaks, uh, was GPT Roseland,
17:50
which that was actually something that was announced last Thursday, but I think we didn't get to cover it. But just a really quick summary of it is basically it's a specialized model they developed for things like drug discovery and related operations happening in biotech.
Alex Volkov
Alex Volkov 18:04
Mm-hmm.
18:05
Yeah, that's great. Um, we aim for full coverage. Uh, folks in the comments, if we're missing any parts, please let us know as well. Uh, so the last thing that we landed on is a bunch of stuff from OpenAI. I think the Apache 2 one is very important, like, OpenAI is open sourcing again. It's really, really, really good. Um, I think, oh, in This Week's Buzz we have our TUI. Everybody's, like, gung-ho about TUIs. W&B also has a TUI, it's called W&B LEET, and it has a bunch of updates including, uh, we're now showing you GPU stats inside the TUI, and that's really, really good. And also Wolfram is gonna talk about some open source, uh, and not only open source, uh, evals. Wolfram, you wanna give us a highlight, one or two sentences from the evals, so people have something to wait for?
Wolfram Ravenwolf
Wolfram Ravenwolf 18:46
Yeah, Opus 4.7.
18:48
I've evaluated it and, uh, compared it to the old one, and looked at which agent it works best in, and also the same for Kimi K2.6
Alex Volkov
Alex Volkov 18:57
and, uh,
Wolfram Ravenwolf
Wolfram Ravenwolf 18:58
which is the best model I've ever tested
18:59
in the open source department. And, uh
Alex Volkov
Alex Volkov 19:02
K2.6
Wolfram Ravenwolf
Wolfram Ravenwolf 19:03
is the best model ever tested.
19:05
Both are the best. Uh, Opus is the best model I tested on Wolf Bench, uh, among the proprietary models. And Kimi K2.6 is the best model in the open source department. So that was the best.
Yam Peleg
Yam Peleg 19:17
That's that's surprising.
19:18
That's actually surprising. Okay.
Wolfram Ravenwolf
Wolfram Ravenwolf 19:20
Yeah, it's always, always, uh, more differentiated when
19:23
you look at it closely though. We look at it where it's really good and where not.
Alex Volkov
Alex Volkov 19:27
So we absolutely know what folks are waiting for.
19:29
Wolfram, I think we start with open source. Let's start there. We have two incredible models to cover. Let's start with Kimi K2, so let's go to open source, and we'll start there.
Nisten Tahiraj
Nisten Tahiraj 19:49
Open source ai.
19:51
Let's get it started.
Alex Volkov
Alex Volkov 19:56
All right, let's get it started.
19:57
Uh, I have no idea how quickly all y'all put glasses on in the 13 seconds this transition was in there, but I love it. I think for every transition we should show up with different clothes and, like, props and everything. That'd be very, very cool. Uh, but also, uh, open source AI. Let's get it started. Uh, Wolfram, please take it away. Uh, LDJ, Nisten, I count on you guys to support, uh, Yam as well. Uh, open source. Let's start with Kimi K2. I'll just, like, do the intro. Okay. This week, Moonshot AI, uh, the Chinese company Moonshot AI, uh, decided to finally update their Kimi K2.5 to a new model, K2.6. It's a 1 trillion parameter mixture-of-experts model. Uh, they are claiming open source state-of-the-art on agentic coding on stuff like SWE-Bench Pro, uh, 32 billion parameters active with 384 experts, MLA attention, a 256K context window, and a modified MIT license. So not fully, fully, fully open, et cetera. Um, it gets 58.6 on SWE-Bench Pro. SWE-Bench Pro is the more difficult version of SWE-Bench Verified, which we no longer cover. Please take it away and tell us why this model slaps. Wolfram, you start with something. And, uh, Yam, welcome to join as well.
Wolfram Ravenwolf
Wolfram Ravenwolf 21:06
Do you want me to look at the benchmark results
21:08
already or do we do it in the
Alex Volkov
Alex Volkov 21:10
No, no, you can, you can show benchmark.
21:11
Yeah, I can show anything. So do something visual for the audience.
Wolfram Ravenwolf
Wolfram Ravenwolf 21:16
You already gave the technical details and, um, Kimi
21:19
has always been a special model to me because, uh, like LDJ said, it is, um, not so robotic like other models. From the beginning, when Kimi K2 came out, it was very good in the writing department and felt really, really good at creative work. And when it came out, it was always one of the top models. So the new version, it is not just at the top among the open source models, but it is also near the top overall, I would say. Um, yeah, it even beats proprietary models. Let me show you what Wolf Bench has shown. I will just open the page and share it. So, okay. We definitely have to fix the zoom first because there's so much information on here. And just a quick thing about Wolf Bench: I'm not just looking at the average score, but I'm also looking at what percentage of the benchmark a model can solve and how much of it it solves consistently, for the different models and different harnesses. So what we are now doing is basically we are looking at Kimi K2.6, which I tested from Moonshot directly, and comparing it to Sonnet and the other models that are even better, uh, still better. But it beats, for example, Gemini 3.1 Pro Preview, and, um, of course it's better than the old one and all the others. So it's the best open source model I have tested so far. And if we look at it closely, it is basically on the Sonnet level, very, very close to it. And the different colors are different benchmark agents. Um, Terminus 2 is the Terminal-Bench 2.0 basic agent, which is, uh, the default. Whenever you see somebody give a Terminal-Bench score, it is with this agent. I also tested, uh, Claude Code, but this is not relevant for this, so I will just remove it. And Hermes Agent and OpenClaw, these are also agent harnesses. Terminal-Bench is an agent benchmark, and I care about which agent works the best with a model. And if you look at Kimi K2.6 on Terminal-Bench with just Terminus 2, then, uh, like I said, it is better than Gemini 3.1 Pro.
And if we take OpenClaw, for instance, now it is even better than Opus 4.6 in this, which is a super amazing score. We get 59%, which is, uh, the best open source model score I've ever seen with OpenClaw. And if you look at Hermes, um, there, interestingly, the harness makes a big difference. So in this case it gets basically the same score as with OpenClaw, so they are almost the same here, but, um, Hermes Agent is still better with Opus 4.6. So it depends on what you are doing. But if you want to use an open source model, definitely Kimi K2.6 is the one to go for.
Nisten Tahiraj
Nisten Tahiraj 24:12
Uh, did you have any trouble with the tool calling?
24:17
I'm just hearing from a lot of people that the tool calling setup was a bit, uh, confusing for them. They were trying to run it on their own. Or was that okay? It seems like it was completely okay here.
Wolfram Ravenwolf
Wolfram Ravenwolf 24:31
Yeah, I was using OpenRouter and, uh, the Moonshot AI endpoint
24:35
and I did not have problems with this.
Nisten Tahiraj
Nisten Tahiraj 24:37
Ah, okay.
Wolfram Ravenwolf
Wolfram Ravenwolf 24:38
The interesting thing is in the OpenClaw benchmark and in the
24:41
Hermes benchmark as well, I do not even tell it that it is running as an agent. I'm just using the default setup. So no custom prompts. I think it could get even better if you tell it. Do not ask the user some questions, because some tasks failed because the model couldn't figure it out on its own. So it tried to ask the user, but there was no user to answer in the benchmark. So it could be even better with prompting, which is another level to look at.
Yam Peleg
Yam Peleg 25:04
I just wanna mention there is no doubt that the scores are high.
25:08
However, the scores are, uh... this is a very low-tech solution, I hope you can see. Uh, we have a report from Bright Mind, uh, that the model might be a little bit benchmaxxed. Uh, what we see here, uh, is a rendering of a lava lamp by, uh, Kimi and by the other leading models. Uh, you are more than welcome to go and check Bright Mind's, uh, profile to see the exact source for this. I hope you can see anything. Uh, basically what you see is that the lava lamp doesn't look good at all, and, uh, all the other, uh, models that he's comparing with are producing pretty, pretty nice lava lamps, uh, as you can see. I wish I could show it to you in a normal way. All I'm saying is that, uh, benchmarks are not everything, and I just want to, uh, you know, push back and give, uh, the other side of this, uh, because clearly, clearly the benchmarks are high. So,
Wolfram Ravenwolf
Wolfram Ravenwolf 26:18
uh,
Yam Peleg
Yam Peleg 26:19
yeah,
Wolfram Ravenwolf
Wolfram Ravenwolf 26:19
it's the best one.
26:20
I've tested with the agent tests, so I can only say that, uh, everybody has to do their own experiments, of course. I think benchmarks are a great way to see which models are worth looking at more closely, 'cause if they fail completely on a benchmark, it's probably not worth investing the time to test them with your own tests. But if they are really good, that is a model you should look at.
Nisten Tahiraj
Nisten Tahiraj 26:41
Yeah.
26:42
You also have to keep in mind, sorry, that 3D performance, that's something that they have to train for and need a lot of datasets for. Uh, so it might be very good agentically and as a tool and just not be great at all at doing stuff in 3D. For example, the new Qwen, the closed source one, the 3.6 Max Preview, was pretty bad for me when it came to stuff like Three.js and making animations and things in 3D. But it looked great at everything else. So there's also... but yeah, the frustrating thing is that there is benchmaxxing going on, but the models are also good. So it's getting, uh, a bit hard to tell in that regard.
Alex Volkov
Alex Volkov 27:31
So I wanted to join this, uh, I'm back by the way.
27:34
Hopefully now we're good. Uh, I've tested this model, uh, myself as well, uh, if you guys are still on Kimi, and, uh, I remained unimpressed. I tested it on a bunch of stuff. It feels like it's overthinking too much. Uh, and, well, I don't know if you mentioned this, sorry, I dropped, but, like, uh, definitely there's way, way, way too much thinking going on to achieve something. So the scores that they show, they don't usually show the times as well. And that's something that we need to, like, start thinking about as well. Like, how long does it take that model to get to that score? Uh, something that I don't think they publish. That's why Terminal-Bench has the cutoff, right? Um, like, how much you can do.
Wolfram Ravenwolf
Wolfram Ravenwolf 28:13
We can also look at how many tokens it generated.
28:15
Um, I'm actually writing an article for our blog about this, where I'm looking at a comparison that will go much more deeply into this because, yeah, this has also been one of the most expensive open source, uh, benchmarks I did, because it, uh, takes so many tokens to, uh, respond. That is something you don't see when you just look at the scores. But I'm also working on a way to visualize that as well in a future update of the Wolf Bench page.
Alex Volkov
Alex Volkov 28:40
Alright folks, are we ready to move on to the next open source?
28:43
Let's talk about, uh, Qwen. We have a bunch of other stuff to talk about, but let's talk about Qwen, uh, 3.6 27B. Who got super excited about this? Nisten, I think I heard you guys excited, LDJ as well.
Nisten Tahiraj
Nisten Tahiraj 28:57
Oh yeah, it is my audio.
28:58
Okay. By the way, there's landscapers, uh, cooking
Alex Volkov
Alex Volkov 29:00
outside your audio.
29:01
We hear some landscapers, but also, like, you're coming through a little bit louder than usual. But let me just, uh, while you fix this, let me just, like, say the thing, folks. Uh, Alibaba released another Qwen for us. It's the Qwen 3.6 27B dense model. Uh, last week they released the MoE version of a very similar size. This one is a dense model, which means, um, you know, just one model, no MoE. 15x smaller, uh, in total parameters than their flagship 397B, almost 400 billion, MoE. Uh, and it beats it. It wins on every coding benchmark against a model, um, you know, 15 times the size. It's Apache 2 licensed. It gets a SWE-Bench Verified... I don't care about SWE-Bench Verified. Let me just skip to, uh, Terminal-Bench: at 59, it matches Claude Opus 4.5 exactly. This is, you know, we keep talking about, like, uh, benchmaxxing or not benchmaxxing. This 27 billion parameter model matches Opus 4.5. This is quite, quite crazy. Yam may have
Yam Peleg
Yam Peleg 30:00
comments Ladies and gentlemen.
30:01
Ladies and gentlemen. Opus at home.
Alex Volkov
Alex Volkov 30:03
Opus
Yam Peleg
Yam Peleg 30:03
4.5 at
Alex Volkov
Alex Volkov 30:04
on terminal bench.
30:05
If, if you on terminal
Yam Peleg
Yam Peleg 30:06
bench
Alex Volkov
Alex Volkov 30:06
exactly, exactly the thing you doing Terminal bench.
30:08
Yeah, we'll put it home.
Yam Peleg
Yam Peleg 30:10
No problem.
30:10
If this is what you're doing, you get Opus at home.
Alex Volkov
Alex Volkov 30:13
Yep.
Yam Peleg
Yam Peleg 30:13
Just, just saying.
30:14
Yeah. It's a really good model, man. It's a dense model, people. You can't go better than that.
Ryan Carson
Ryan Carson 30:22
Yeah.
Yam Peleg
Yam Peleg 30:22
You can't go.
30:23
But then that's, that's the top, man. And you know, the, uh, that guy, what's his name, is just gonna drop a fine-tuned version of it, like Super Qwen Dense, Super 27B, and it's gonna be even better, like, tomorrow or something. And man, it's just like Christmas every day in AI. Seriously,
Alex Volkov
Alex Volkov 30:41
with Unsloth, every
Yam Peleg
Yam Peleg 30:42
day you get something.
30:43
Yeah.
Alex Volkov
Alex Volkov 30:43
With Unsloth's dynamic
30:44
GGUFs. Uh, we had Daniel Han from Unsloth here on the show. Great dude. Uh, and they're doing incredible things. Uh, this runs on 18 gigabytes of RAM. This is it, like, the Opus at home runs on 18 gigabytes of RAM. Uh, unlike the Kimi model that we just told you about before, which is a 1 trillion parameter model that only us at CoreWeave and some other folks can run. You're not gonna pay to host Kimi, let's be very, very clear. Nobody here is gonna host their own Kimi unless the business pays for it. It does not make sense financially. If you are at that level of paying, just pay for the Max account or whatever API it is, right? But, um, there is this thing where, like, oh, we like open source. But also there's this thing where, like, open models, uh, and sorry, local models, open local models. This is an open local model. 18 gigabytes of RAM is very affordable. And so, like, definitely some folks will like to run this. Um,
Nisten Tahiraj
Nisten Tahiraj 31:33
it, it's more like sonnet 4.5 at home when people
31:38
are actually using it. Uh, when it comes to terminal stuff, it can be... it's on par with Opus. And when it comes to judging things visually, it is very, very good. Uh, there were some issues that people had with 3.6, so it's not quite plug and play where you just swap it in through a proxy for Claude Code. Uh, they noticed very different issues, like managing, uh, hard Git merges and stuff that it was making a mess of. So it's not quite there, but people do feel like it is Sonnet 4.5 at home. And that is a huge deal because that can do most of your non-important tasks now. And, uh, you can buy a used 3090 for under a thousand bucks and run it. Yeah. This is the biggest, uh, change here, because yes, we always talk about open models, but people had to run, like, four 3090s at home or figure out some crazy setup. Now you can just go and buy a 24 gig Mac mini and, uh, run it, and just run Hermes on it, check your emails, everything. It's,
Yam Peleg
Yam Peleg 32:54
bro, I've been running sonnet 4.5 for a long time.
32:58
It's absolutely usable, man. It's a great model. It's usable. Like, yeah, it's great. I'm taking Sonnet at home. I'm having 4.5 for, what is it, $700? $600? A 3090. Man, we never had this.
Nisten Tahiraj
Nisten Tahiraj 33:15
And you, you paired this now with OpenAI, uh, privacy filter
33:20
model, which I find the architecture of this tiny 1.5 B just completely insane because it's a mixture of Wait, listen,
Alex Volkov
Alex Volkov 33:28
don't skip.
33:28
We have to announce the next piece of news. We can just skip. Alright.
Nisten Tahiraj
Nisten Tahiraj 33:31
Alright.
33:32
Alright, we'll, we'll, we'll segue into that. Alright.
Alex Volkov
Alex Volkov 33:34
Yeah.
33:35
But yeah, let's segue into that. Like, you just did the job, folks. Uh, the last piece of open source news that we're gonna cover before we jump into, like, a big bunch of other stuff, uh, is: OpenAI is open again, OpenAI is open again. First of all, let's say Codex is open. The whole thing about Claude Code, uh, code leaks, whatever, Codex has been open source since the beginning. So there's openness at OpenAI. Uh, but also OpenAI is open sourcing models again. Not LLMs, but fine, they open sourced Privacy Filter. You can see that there's text that's marked: hey, private person, here's a time scheduled, private date, and here's an account number, private account number. Uh, this is kind of boring to show you, but basically Privacy Filter is all about PII, personally identifiable information. So the stuff that, you know, you're afraid that your OpenClaw is gonna leak, for example. Uh, they also count API keys as, uh, privacy stuff. Um, let me just show you the best demo example, from a friend of the pod, Xenova, who just built this, uh, incredible demo, because this privacy filter is so small, it runs on your computer. So we're gonna, um, we're gonna do this. We're gonna show you this privacy filter demo on your
Nisten Tahiraj
Nisten Tahiraj 34:47
browser, on
Alex Volkov
Alex Volkov 34:47
a cpu,
Nisten Tahiraj
Nisten Tahiraj 34:48
on
Alex Volkov
Alex Volkov 34:48
any, this is, this runs on browser.
34:50
You can see the loading model goes 20, 25, 30, you can see this, right? Uh, and
Yam Peleg
Yam Peleg 34:56
that's, that's really important.
34:57
By the
Alex Volkov
Alex Volkov 34:58
way, this is me downloading and loading the model into memory.
35:01
This is me downloading and loading the whole model into memory. It's about 1.5, uh, gigabytes. It's not that big. I think it's even more quantized. Uh, actually, let me see if I can zoom in here, zoom out, uh, and, uh, we have this case file, this beautiful case file. Let me, let me make sure that you guys can see the picture. Okay. You open up and you see this text. Uh, this morning review began with a careful note from our CEO Sam Altman. It looked harmless, but it still contained contact details, blah, blah, blah. There's a date here, my birthday, there's a phone number. You hit this beautiful redact button. They run the model with beautiful effects, and the model super quickly identifies. There's one person name, there's one email and there's one phone number. I don't know how, like he, he posted this demo like a second after this model got released. So like this beautiful demo and there's a date.
Wolfram Ravenwolf
Wolfram Ravenwolf 35:48
I think this is a very important model.
35:50
If you think about the agent use cases, like when you have your OpenClaw and Hermes agents running, and, uh, there's always a threat that they leak data that goes out. So if you use this as, basically, an intermediary model that checks what goes in and out, it could, uh, redact the data and notify you. I think that is an important security measure for using agents in a more secure way.
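The intermediary check Wolfram describes (inspect outgoing text, redact, notify) can be sketched roughly like this. It is a minimal sketch under loud assumptions: `detect_pii` is a hypothetical regex stand-in, not the OpenAI privacy filter or its API, and the label names `private_email` and `private_phone` are illustrative only. In a real setup a model call would replace `detect_pii`.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in detector: two regexes instead of a real PII model.
PATTERNS = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "private_phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


@dataclass
class Finding:
    label: str  # illustrative category name
    start: int  # span start in the original text
    end: int    # span end (exclusive)


def detect_pii(text: str) -> list[Finding]:
    """Return every pattern match as a Finding, ordered by position."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(label, match.start(), match.end()))
    return sorted(findings, key=lambda f: f.start)


def redact(text: str) -> tuple[str, list[Finding]]:
    """Replace each detected span with a bracketed placeholder."""
    findings = detect_pii(text)
    parts, cursor = [], 0
    for f in findings:
        if f.start < cursor:  # skip spans that overlap an earlier one
            continue
        parts.append(text[cursor:f.start])
        parts.append(f"[{f.label.upper()}]")
        cursor = f.end
    parts.append(text[cursor:])
    return "".join(parts), findings


def guard_outbound(message: str, notify=print) -> str:
    """Redact an agent's outgoing message and notify the owner if anything was hit."""
    clean, findings = redact(message)
    if findings:
        labels = ", ".join(f.label for f in findings)
        notify(f"redacted {len(findings)} item(s): {labels}")
    return clean
```

Calling `guard_outbound(text)` on everything an agent is about to send gives the "redact and notify" behavior; as Alex says later, the output still deserves a human review.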
Alex Volkov
Alex Volkov 36:14
A hundred percent.
36:14
A hundred percent. And this is why the categories that they have there matter. The categories they have are: private person, so everything, name, last name, et cetera, like, identifiable information; private address; uh, email and phone; uh, private URLs as well, so apparently this blocks URLs; private dates; and also account number and secret. Account number and secret are two of the most, like, important, uh, things, I think. So account number is everything related to a bank account. Literally, I can show you, I posted an example here. This is my email today to somebody at, uh, AI Engineer Miami, where I requested, uh, you know, uh, reimbursement for flights. Uh, this literally includes my phone number and my account number and routing number in here. So, uh, you know, I was not afraid to post this because, like, it redacted the whole thing. Obviously I reviewed it before. Um, again, you should not trust this model completely. You should definitely, definitely review. But I think the thing is with agents, um, other agents, LLM-as-a-judge setups, could review this; supposedly, like, GPT, uh, 4.5 can do this in a structured way. This model is so much smaller that you can run it in the browser. This is, I think, the important part. You can run this as part of, uh, CrabTrap. Uh, this is, like, a proxy for agents where, um, you run everything that your agent is doing through, like, an LLM-as-a-judge. And this model, uh, is definitely a huge, huge deal for, like, a proxy, like Wolfram said, because the whole fear that people have is that, you know, there's two things there. One, agents can run some scripts and, like, inject some stuff, but also they can just, like, reply to someone after being prompt-injected and say, hey, my owner's bank account is this. So we definitely don't wanna do this. Um, great model. Privacy Filter is on Hugging Face. And what else is very important there?
Oh, it's multi multilingual as well. Did you guys see this? It's great. Not only in English, which is a very standard thing with very small models. Great in Hindi, Japanese in Mandarin, Chinese in Turkish, and order in Korean. Uh, in Russian. We actually test it out in, in other languages. But, um, any other comments folks? What are we first?
Nisten Tahiraj
Nisten Tahiraj 38:26
No, the architecture is completely crazy.
38:29
It's only 50 million active parameters, so it will use 1.5 gigs in your browser, but it's only using about 50 megs for active inference. So this is one of the most compressed things. And I am not an OpenAI fan. I canceled my ChatGPT, like, over two years ago and never used it. Uh, I also didn't like gpt-oss all that much, although I appreciate it. I think this is one of the most important models that they've ever made. And if you start to count the total amount of tokens processed in the future, this might be the model that just processes the most tokens out of all of them, uh, because it makes it very cheap and very easy for you to just filter everything, either at home or commercially at scale. It does not hurt to put this in. Uh, before, if you were going to use, like, a Llama Guard or some such model, that was something you had to test and set up on your own, and it was somewhat expensive. Even running a 3B at scale can be expensive. You do need infrastructure, uh, you do need GPUs. And for this one, you don't. You can just do the filtering right away, and you should use this everywhere. You can run it client-side in the browser before they even give you the data. Like, that's just such a huge enablement. It's, yeah, I think this is the most important open source model that they've ever released other than Whisper.
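The arithmetic behind those two figures can be reproduced with a back-of-the-envelope calculation. This sketch assumes one byte per weight (8-bit quantization), which is an assumption chosen because it makes 1.5 billion total parameters come out at 1.5 GB; the actual precision of the browser build is not stated in the conversation, and runtime overhead such as activations is ignored.

```python
def weight_bytes(params: int, bits_per_weight: int) -> int:
    """Bytes needed to store `params` weights at the given precision."""
    return params * bits_per_weight // 8


TOTAL_PARAMS = 1_500_000_000   # total parameters, as quoted in the discussion
ACTIVE_PARAMS = 50_000_000     # parameters active per token, as quoted
BITS_PER_WEIGHT = 8            # assumption: 8-bit weights, not confirmed

total_gb = weight_bytes(TOTAL_PARAMS, BITS_PER_WEIGHT) / 1e9
active_mb = weight_bytes(ACTIVE_PARAMS, BITS_PER_WEIGHT) / 1e6
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"all weights resident in memory: {total_gb:.1f} GB")
print(f"weights touched per token:      {active_mb:.0f} MB")
print(f"share of the model active:      {active_fraction:.1%}")
```

The point of the mixture-of-experts layout is visible in the last line: the whole model has to be downloaded and held in memory, but each token only runs through a small slice of it, which is why it stays fast on a CPU.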
Alex Volkov
Alex Volkov 40:07
Ryan and LDJ.
Ryan Carson
Ryan Carson 40:09
So, uh, quickly to touch on Crab Trap.
40:12
Um, so I think this is very important because as more and more of us have digital employees attached to us, or digital employees, uh, deployed through our organization, like R2 is my OpenClaw... Um, the thinking behind CrabTrap is that you can't manage a digital employee fast enough, right? They're too fast. Um, and you can't have one-on-ones with them. It just doesn't work. None of the human, uh, models make sense. And so CrabTrap basically is a check that whatever your digital employee is doing, uh, is basically correct. Like, if they're an SDR, are they behaving like an SDR? Are they doing something weird? And so these small local models, you know, and I think this redaction model is an example, we need more of these that can run quickly, uh, cheaply, to basically monitor our digital employees. Um, and so it's very exciting to see that they open sourced this. Uh, and also as a side note, um, the, uh, OpenClaw setup that the Brex CEO is running has inspired me to update my, uh, Claw Chief. So there's a lot of things going on here with digital employees.
Alex Volkov
Alex Volkov 41:24
The, the other thing is, and I think it's very important to also
41:26
mention, um, this is fine-tunable. So on specific domains, like your company, they are saying out of the box it's not gonna work, but it's very easily fine-tunable on your data. So if you have very specific things that you're afraid that your company is gonna leak, then, like, with a few examples, this model becomes just great. So, uh, LDJ, let's go to you, and then we'll move on to different things.
LDJ
LDJ 41:50
Yeah.
41:50
Just gonna add to what, uh, Wolfram said. I feel like, with how efficient it is, with how few active parameters it has, Wolfram's idea of having this as basically a private local check for people that really care about privacy and are using local models a lot, it is just kind of a no-brainer to have this do a pass over your text and conversation before you pass some information over to a larger closed source model.
Alex Volkov
Alex Volkov 42:14
Yeah.
42:15
All right. I think it's time for us to move on, folks. Um, we don't have any other news from OpenAI yet, uh, but GPT-5.5 is on the way. We know, uh, the reason we know is that they're vagueposting with base64 examples. Uh, should we show this example? I mean, at some point they're gonna drop the model and then we'll just talk about this. But there's a bunch of stuff that happened, uh, from OpenAI this week that we should definitely talk about. Uh, I think it's important for us to start with GPT-Image-2. I think this is the biggest thing that, like, you know, took at least Twitter by storm. Uh, you all probably have tons of examples. I have a few examples as well. Uh, let's take a look. So, uh, OpenAI launched GPT-Image-2 with the biggest jump in Arena Elo score that we've ever seen. Uh, I'm gonna pull this up. GPT-Image-2 is OpenAI's, um, thinking and reasoning, uh, image model. Why does it matter that it's thinking and reasoning? Because it can do things that no image model before could do. It's just absolutely incredible. The highlight there is it can create QR codes, which, I don't know if it's a tool use thing, I dunno if it's Photoshop. Somebody here explain to me how the heck can a diffusion image model create QR codes that work?
Yam Peleg
Yam Peleg 43:33
LDJ, go ahead.
43:34
I think, I think it just generated it seriously.
LDJ
LDJ 43:37
Yeah, I think it's probably omnimodal, where it's like this
43:41
unified model that's actually doing reasoning all in the same network. 'Cause they have it, actually, and you can have medium reasoning, high reasoning. And LMArena, actually, I'll link this, but I know we've been skeptical of LMArena lately, but for images, I feel like it's still a bit reliable, and it's insane. So I actually went through the top 50 rankings in LMArena. Yeah. And there's not more than a 50 point gap between any of those 50 ranking neighbors. The exception here is GPT-Image-2, which just released: even on just medium reasoning mode, it's over 200 points above the last top place. It's insane.
Alex Volkov
Alex Volkov 44:24
It's absolutely insane.
44:26
The jump is insane. So I went on a livestream when this got released and played around with it. And then, uh, I had the chance to host Peter Gostev, who's doing evals at LMArena. And obviously he had some access to this model before. He had, like, I think over 500 examples to show us where he ran this model against Nano Banana Pro and GPT-Image-1.5, which was awful, awful. Just completely awful. You can see it here. I don't know if, uh, 1.5, uh, oh, 1.5 is number four here. I don't even think it's number four. So this, like, throws a little bit of a, um, hmm, about this ranking. But 1.5 was not great. Specifically, very, very bad in comparison. GPT-Image-2 blew every other model out of the water. The character consistency is great. Everything is great. Uh, I definitely wanna show you some examples, but this is the kind of jump that we see from Nano Banana 2 and Nano Banana Pro, which are, like, incredible models. You can see, uh, this Elo rank jumps by almost 300 points. Now what does this mean? This means that people are looking at two examples from models they asked to generate images, and they just prefer GPT-Image-2 93% of the time. I think this is quite crazy. Anonymously, too. Yeah. And they don't know which model they're using, so there's no, like, um, recency bias, et cetera. Let's show some examples, folks, while we talk about this. Uh, has anybody here played with it? Ryan, I think you said this is an incredible model.
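For a rough sense of what a rating gap means in head-to-head votes, the textbook Elo expected-score formula can be applied to the gaps mentioned here. This is a sketch that assumes the classic logistic Elo model with a 400-point scale; how Arena actually fits its scores is not described in the conversation, so a preference rate quoted for a specific pairing may come out differently from what this formula gives.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Classic Elo expected score for the higher-rated side, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))


# Gaps mentioned in the discussion: ~50 between neighbors, 200+ and ~300 for the jump.
for gap in (50, 200, 300):
    print(f"{gap:>3}-point gap -> preferred in {expected_win_rate(gap):.1%} of matchups")
```

Under this model a 50-point gap is close to a coin flip with a slight edge, while a 200 to 300 point gap means the higher-rated model wins the large majority of blind comparisons, which is why a jump of that size stands out against neighbors that are all within 50 points of each other.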
Ryan Carson
Ryan Carson 45:50
Oh, it's so, so good.
45:51
Um, and it's good for real stuff, right? Not fancy, fun, you know, play stuff. So I'm already integrating this into my marketing engine. What I do every week is I interview somebody in the divorce space, which is what Untangle is, and then I basically have a pipeline that runs every night, looks at that video, and then creates a killer Instagram cover. And it's on brand. And it's just so good. So this is really, really gonna change people's marketing workflows.
Alex Volkov
Alex Volkov 46:27
It's nearly perfect at text.
46:29
I think I was able to see like one typo as well. And again, if you guys remember, every model that was released since Nano Banana Pro, we compared to Nano Banana Pro, and we kept saying, no, nobody's close. This beats Nano Banana Pro on realism. This beats Nano Banana Pro on, uh, everything. Oh, I just got a ping from Peter Gostev, our friend who just hosted the thing that I wanna show you. Let's see if he sent me a link. Yes, amazing. Shout out to Peter Gostev, folks, from LMArena. He did this demo, and I'm gonna show you how deep LMArena goes into evals. Okay? There are a lot of tests here. A lot. I can't even count. I think it's over 150 tests, if not more. It's gonna take a while for me to load these images, but we're gonna see a lot of Peter Gostev here, okay? You can see that he has a bunch of prompts in here as well, very, very specific prompts. There's like 30 things that the model needs to get right: the framing is slightly sloppy, not composed; a wet footprint trail crosses the tiles. There's a lot of very specific prompts. So this is an example of GPT Image. Lemme zoom out here so hopefully we can see something. We're not loading the image on zoom, probably because I'm loading 118 prompts, 517 images. The dude goes deep. So, okay, you can see on the left this is GPT Image 2 and GPT Image 1.5. You guys don't know Peter, but that's not what Peter looks like. So the previous GPT Image was not as good at character consistency. Grok, I don't know why he decided to compare to Grok. Grok is consistently the worst out of all of these comparisons. And then also this is Nano Banana 2. And, alright, we're gonna have to wait until the whole page loads before we zoom in. For some reason, this is a vibe-coded example.
But he also provided reference inputs, his selfies, and GPT Image just absolutely knocks it out of the park. This looks like an actual thing. You can see a wet footprint. You can see this cone. Let's move into some more compositional examples. This is also a great image that he showed. The prompt here is the rescue helicopter: he provided reference images and said, hyper-realistic 1970s newspaper photo of Peter hanging half out of a rescue helicopter over a flooded town. Okay, he should be grimacing at the wind, wearing a soaked yellow rescue headset. So this looks like a photo. This literally looks like a photo from a magazine. You guys can see there's a little bit of magazine here; GPT Image even added the specific commentary in the magazine: helicopter crews were in action throughout the day, rescuing people trapped. The model just added this text. This text wasn't part of the prompt. I love it. And you can see the other ones. I don't know what's going on here with Grok, but this is not a person that's grimacing. It kind of does look like him a little bit. And then Nano Banana 2, sorry, where's Nano Banana 2? Here's Nano Banana 2. This does not look like a 1970s newspaper. So this model wins on a bunch of stuff. The reason why there's two GPT Images here is because I think one of them is on high thinking mode. So the thing we absolutely must mention here is that this model performs better the more thinking you give it. If you go to the ChatGPT interface, we can do so right now, and you select the reasoning level, this model performs significantly better the more thinking you give it. And on Pro, this model just absolutely mocks every other image generation model that we've seen, with perfect text, with perfect fidelity of character consistency.
The thing that they said on the livestream is that this can also generate multiple images with character consistency. So they literally generated a manga and a comic, and from page to page it looked the same. LDJ, go ahead while I show some more images.
LDJ
LDJ 50:39
Yeah.
50:39
So one of the theories here too is, because it's so good at generating text and doing character consistency and things in context and everything, it might even be based on the GPT-5.5 that's coming out, and they just have it named as a separate image-specific thing. But, oh, that's interesting. For Nano Banana and Nano Banana Pro, the Google Gemini team has confirmed that those are actually based on Gemini Flash and Gemini Pro 3.1 respectively. So essentially those models are fine-tunes of those models, specialized for image generation. I think it might be similar here.
Alex Volkov
Alex Volkov 51:16
Yeah.
51:17
Yeah. I'll, I'll go ahead.
Nisten Tahiraj
Nisten Tahiraj 51:19
Oh.
51:19
Yes, and to what LDJ was saying, most current diffusion models are actually still just a transformer. They're no longer like Stable Diffusion, having a U-Net type of thing. The actual part that does the image generation still just looks like any other transformer, and you have to pair it with a full, big language model to begin with. This is why Flux 2 has to be so large: they package Mistral 24B with it. So I'm really excited as to how they might have done it in this case. I do also suspect that they figured out a way to dynamically pair their largest model with the entire diffusion (DiT) transformer side to get to this level, because this just looks insane. I'm wondering how they've even run it. The other thing is, I think whatever deals OpenAI did with all the newspapers and stuff just gave them much, much better data, because it is pretty hard to find audio, video, and especially image datasets. And they've done a terrific job of labeling all of that properly now, to the point where all of those things just come together to make something that's like twice as good as everything else.
Alex Volkov
Alex Volkov 52:52
It's so good, dude.
52:52
Okay, so, here's an example. This is Grok Imagine, right? And folks, Grok Imagine doesn't have versions. For some reason they just keep updating this, and every week Elon is like, hey, this is the best model. The really funny thing about Grok Imagine, okay, I'm gonna read out this prompt. It's a long prompt, but: hyper-realistic backstage green room photograph, three minutes before an AI summit panel; Sam Altman, Dario Amodei, Elon Musk and Jensen Huang are all present and immediately recognizable; this room is cra, blah, blah, blah, blah. This is Grok Imagine. The only person that Grok Imagine can depict is Elon Musk. I think it's really, really funny. Sam Altman doesn't look like Sam Altman. Dario Amodei is nowhere close. It's a bald dude. Grok does not know who Dario is. Jensen Huang is just some random Asian guy, looks nothing like Jensen. No one here looks like anyone besides Elon Musk. I think it's really, really funny that this is Grok Imagine. This is, like, absolutely... This is GPT Image 1.5. So, okay, you can see Elon's kind of Elon, Jensen is kind of Jensen with the, um, leather coat. Sorry, Grok Imagine doesn't even know that Jensen has a leather coat. Get the fuck outta here, man. Come on. But 1.5 has Jensen with the leather coat as well. You can kind of see Demis Hassabis with the glasses, et cetera. Dario, no, no other model knows about Dario. This is almost like a fucking photograph of how this would look. This is GPT Image 2. Everyone here is spot on, including Dario. Folks, I don't think this detail I'm about to share with you is important, but I tried similar images on Nano Banana Pro. Models don't know what Dario looks like. I think this was before Claude exploded; models just don't have him in the training dataset.
But here, it's obviously been trained on him. And also all the compositions here: you can see that Jensen's tag says Jensen Huang, Dario's tag says Dario, Elon Musk's tag says Elon Musk. You can kind of see the artifacts here a little bit if you zoom in, but that's also what actual pictures would look like at low resolution. All the tags are perfect. You can see that the prompt said AI summit, and the lanyards have AI Summit on them, right? This model is just something else. It is just, the reflections
Ryan Carson
Ryan Carson 55:18
right, too in the mirror.
Alex Volkov
Alex Volkov 55:20
The reflection from Elon.
55:21
Yeah. Yeah, yeah. Wait, is it though? Shouldn't be looking at the different side in reflection.
Ryan Carson
Ryan Carson 55:25
He should be
Alex Volkov
Alex Volkov 55:26
if he's, if he's looking left,
Ryan Carson
Ryan Carson 55:28
he Oh, okay.
Nisten Tahiraj
Nisten Tahiraj 55:29
Found something.
55:30
Yeah.
Alex Volkov
Alex Volkov 55:30
We should be
Nisten Tahiraj
Nisten Tahiraj 55:31
looking.
Alex Volkov
Alex Volkov 55:34
Uh, but I think that is absolutely insane.
55:36
Dario is writing some stuff on the whiteboard. I think this was part of the prompt as well. As you can see, Dario's writing stuff here. This is not Dario. So obviously, personalities is absolutely crazy. I wanna do one more comparison, Nano Banana, and then we'll skip. For some reason, it didn't load fully here. Okay,
Nisten Tahiraj
Nisten Tahiraj 55:53
so it is getting reflections wrong.
55:55
Anyone wants to,
Alex Volkov
Alex Volkov 55:56
uh, I I think, uh, that, that also, uh, Peter ran it
55:59
on medium in some places and some on hard thinking. I think on hard thinking, it looks really good. Let's talk about infographics. This model really slaps on infographics. I think Nano Banana Pro is still great. So I had a few examples, but basically I wanna show you, let's show you this. This is the evolution of human language families. This is 1.5. I don't have 2 here. Not everything loaded. Let's find something with 2. Okay, this one, this is the NOA Disaster caste Frequency Analysis. This is an infographic generated by GPT Image 2. Text is absolutely perfect. Framing is perfect. You know, the only way to show you the difference between this and something else is to show you the previous one. This is 1.5. It's kind of there, but it's not as dense information-wise, right? The text is okay on 1.5, but it's not as dense. 1.5 was six months ago. And this is Grok Imagine. It's not too bad. Grok Imagine, it's not bad.
Nisten Tahiraj
Nisten Tahiraj 57:03
It's
Alex Volkov
Alex Volkov 57:03
not
Nisten Tahiraj
Nisten Tahiraj 57:04
actually.
Alex Volkov
Alex Volkov 57:04
Yeah,
Nisten Tahiraj
Nisten Tahiraj 57:05
I guess I've had good data on this.
Alex Volkov
Alex Volkov 57:07
Yeah, they probably have a good data on infographics.
57:09
Um,
Yam Peleg
Yam Peleg 57:10
the thing about this is that you need the data to be correct.
57:13
If you are generating something like this from data, you absolutely care that the thing represents the data. Probably it's a professional thing that you are generating, I suppose. I'm not claiming anything. I'm just saying that this thing has very distinct characteristics for it to be successful. But yeah, all the text is so perfect. I've seen it generate the HTML code of an SVG inside the image, and people took it and then rendered it. And it actually was the... that thing is crazy. It can write code in the image. I
Alex Volkov
Alex Volkov 57:56
wanna show this.
Yam Peleg
Yam Peleg 57:57
It's
Alex Volkov
Alex Volkov 57:57
crazy.
57:57
I wanna show this 'cause this is crazy. So, Yam's referring to, I think, this. Let's see if I can show this to you guys. I think, Yam, this is what you're referring to. Somebody tested friend of the pod Simon Willison's very famous SVG pelican experiment, where you ask the model to generate SVGs, and then the coding model just writes the SVGs. This model generated a screenshot of a Mac with a, was it a VS Code IDE, I think
Yam Peleg
Yam Peleg 58:31
I, I
Alex Volkov
Alex Volkov 58:32
just want, yeah, it's a vs code.
58:34
Lemme just say, wait, hold
Yam Peleg
Yam Peleg 58:36
on.
58:37
This is not, this is not a screenshot and this is not real code written. This is pixels.
Alex Volkov
Alex Volkov 58:41
Yeah, yeah,
Yam Peleg
Yam Peleg 58:42
yeah.
58:42
I mean, it took me a while to understand these are pixels. It generates the pixels,
Alex Volkov
Alex Volkov 58:47
the model generated a screenshot of a Mac.
58:50
In that screenshot, there's a VS Code editor. Inside the VS Code editor, there's an SVG. And this SVG, taken through a text OCR model that turns text in an image into actual text, shows you almost a pelican riding a bike in SVG. This renders. So this model is basically, like, LDJ, I think this supports what you're saying. This could be 5.5 with, you know, the presumption of, like, this is an image model. This is
LDJ
LDJ 59:17
exactly, yeah, yeah.
Nisten Tahiraj
Nisten Tahiraj 59:18
Generated running code in the picture.
59:21
Like
Alex Volkov
Alex Volkov 59:21
not only running code, dude, it, it is more than running code.
59:24
Running code is fine. This is running SVG code that actually depicts something. It's like another layer on top of just running code. You can generate, like, an HTML page generating an SVG. There's a reason why this is a benchmark that Simon runs: it's hard to generate SVGs. These models don't see how it looks. This model generated a screenshot of code that actually renders into an SVG, and it kind of looks like a pelican. This is absolutely mind-blowing and insane. There's like multiple levels of gade that's going on here.
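A minimal sketch of the last step of that experiment: once an OCR tool has transcribed the code shown in the generated image, one can at least check that the text is well-formed XML with an `svg` root before trying to render it. The function name and the sample string are hypothetical, the OCR step itself is not shown, and this only checks structure, not whether the drawing resembles a pelican.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"


def check_transcribed_svg(svg_text: str) -> Tuple[bool, str]:
    """Return (ok, message) for SVG source transcribed from an image."""
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        # OCR slips (a dropped quote or bracket) usually show up here.
        return False, f"not well-formed XML: {exc}"
    if root.tag not in ("svg", SVG_NAMESPACE + "svg"):
        return False, f"root element is <{root.tag}>, not <svg>"
    # Count every element below the root (shapes, groups, paths, ...).
    descendants = sum(1 for _ in root.iter()) - 1
    return True, f"well-formed SVG with {descendants} nested elements"


if __name__ == "__main__":
    # Hypothetical transcription, standing in for real OCR output.
    sample = (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">'
        '<circle cx="30" cy="70" r="15"/>'
        '<circle cx="70" cy="70" r="15"/>'
        '<ellipse cx="50" cy="40" rx="20" ry="12"/>'
        "</svg>"
    )
    print(check_transcribed_svg(sample))
    print(check_transcribed_svg(sample.replace("</svg>", "")))
```

A check like this separates two failure modes: text the image model rendered incorrectly (or the OCR misread) fails to parse, while text that parses can be saved to a `.svg` file and opened in a browser to judge the drawing itself.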
Yam Peleg
Yam Peleg 59:51
Yeah, I mean,
LDJ
LDJ 59:51
what's funny about this too is that the Pelican test is a common
59:55
test for testing abilities of models. And this is a pelican that's better than some of the models just from like a couple of years ago. This is
Alex Volkov
Alex Volkov 1:00:01
a
LDJ
LDJ 1:00:01
better pelican.
Alex Volkov
Alex Volkov 1:00:02
Yes.
LDJ
LDJ 1:00:03
Yeah.
1:00:03
So, um, I think one more thing too. I heard Yam wanting to say something too, but one thing before we get off of GPT Image 2: I think we should definitely show the front-end UI that it's generated.
Alex Volkov
Alex Volkov 1:00:13
Yes, we will, after this, super quick.
1:00:16
Uh, Riley Goodside, one of the like more incredible people who test like very, very difficult things. Uh, he basically said, GPT image two, generate a game die. But instead of numbers, it has working QR codes for each of the Wikipedia articles of the actual numbers. So I'll say this again, robot.
Yam Peleg
Yam Peleg 1:00:31
What the fuck?
Alex Volkov
Alex Volkov 1:00:32
I'm, I'm literally saying what the fuck as well.
1:00:34
Uh, what the bleep. Because, YouTube, we don't want YouTube to censor us. Every side of this cube has a QR code linking to a Wikipedia article about that number. I have a QR code scanner here. I'm gonna actually do this right now. I'm gonna do like this and then see, yes, this all works. These are functional QR codes for every number. The model needs to understand what the hell you are talking about. It's just mind-blowing. Menus are mind-blowing. This one from Claire Vo, like, a selfie turns into the whole, uh, you know, I don't even know what this is. Ryan, can you describe this? I know you're friends with Claire. What are we seeing here? I think everybody's like,
Ryan Carson
Ryan Carson 1:01:20
all right, I've also got sisters, so I can tell you all about this.
1:01:24
Um, this is basically a color palette for your skin. Um, so you take a, a, a picture of yourself and you say, show me your color palette. And apparently all of us who are married should do this for Mother's Day. Uh, and then it will generate this amazing kind of style, color palette, or it's, I think it's called your, uh, your, your, it's a girl thing. Uh, it's cool
Alex Volkov
Alex Volkov 1:01:46
girl
Ryan Carson
Ryan Carson 1:01:46
thing.
1:01:46
It's a girl thing. So do this for your girls.
Alex Volkov
Alex Volkov 1:01:48
Feels like, feels like we should bring at least one girl here on
1:01:51
the panel as well to tell us about this. Shout out to Claire. Maybe Claire can come. But this is incredible from one picture. Folks, we can go on and on. Me and Peter, we just sat for an hour and a half, and I invite you all, I'll give you a link, and we just got excited. The last thing that we'll do before we move on, and LDJ, you're absolutely right. I feel like Claude saying this, but you're absolutely right. GPT Image 2 is incredible at UI, period. Given the fact that Codex is not incredible at UI, what people are now doing is this thing, this is the new alpha: you ask GPT Image to generate a beautiful UI and then you send it to Codex to implement. Instead of asking Codex to come up with stuff, this is now the creative brain, and then you just send it to Codex: Codex, implement this. And this is beautiful. Here's an example. Yam, I know you have comments, but LDJ sent this to us, so if you wanna comment on this, feel free. This is a UI that GPT Image 2 created and, I think, Codex coded. I'm pretty sure.
Yam Peleg
Yam Peleg 1:02:49
I'm just saying you might be getting it in Codex in
1:02:53
like five minutes or something. Oh yeah. That's saying, I'm just saying yeah. It might be
LDJ
LDJ 1:02:58
a reason.
1:02:58
Yeah. I also sent another one too, which is kind of even deeper into the meta. These are images generated with GPT Image. Then Seedance 2 animates some of the images, and then Claude Design creates the whole website.
Alex Volkov
Alex Volkov 1:03:11
Wow.
LDJ
LDJ 1:03:12
In an actual code.
Alex Volkov
Alex Volkov 1:03:14
I That looks super cool.
1:03:15
I dunno how usable this website is. Honestly, Ryan, I don't know if you're gonna rebuild your own website with animated people. But, you know, for some stuff this looks incredible. And the thing that we must highlight is that this looks nothing like any website that Claude, or Claude Opus, or Codex will generate for you if you just ask. This has to come from, like, a visual part of the brain. I dunno if we're doing this metaphor with the left part being the creative and the right part being the analytical, but let's say we do this metaphor for a second: Codex and the coding agents and everything, those are the analytical part. They can write. There's some attempts at getting excitement and creativity in there. We know it's not that great. It is better in Claude than in Codex. Uh, Ryan, go ahead.
Ryan Carson
Ryan Carson 1:04:01
Uh, we have crossed a new threshold.
1:04:03
So this feels very much like what we experienced in December, January, when everybody realized how good Opus 4.6 was. Now we're experiencing this with design. So with the entrance of Claude Design plus, you know, Image 2 from OpenAI, we are now in a spot where you can really begin to get professional design out of AI. Now, what I do personally is I pay Brett from Designjoy to do an initial layout, right? So all of my web UI, all of my brand is done by a human, but now I can fully hand it off to AI, right? So as soon as you have your design system locked in, sorry.
Alex Volkov
Alex Volkov 1:04:49
Sorry, Brett, I'm sorry.
Ryan Carson
Ryan Carson 1:04:51
Well, it's just the truth.
1:04:52
Like, this is where we're at now. Now, I think any serious brand needs to pay a human to build out their brand and their UI initially, but pretty much after that you can do this with AI. It's, uh, exciting.
Alex Volkov
Alex Volkov 1:05:06
So, uh, just, uh, for, for, uh, completeness of
1:05:09
sakes, uh, LDJ, you want to like send whatever you send in chat? Like what, what's actually going on here? 'cause I think it's confusing folks as well.
LDJ
LDJ 1:05:17
Um,
Alex Volkov
Alex Volkov 1:05:17
what we're seeing
LDJ
LDJ 1:05:18
here.
1:05:18
Okay. Yeah. So GPT image generates a mockup of a website and then, uh, at least in the, in the image you were showing above at least. So if you go back up Yeah. Or this video, I mean,
Alex Volkov
Alex Volkov 1:05:31
yeah.
LDJ
LDJ 1:05:32
Um, yeah.
1:05:32
Actually, if you scroll a little bit more up, they mention it, yeah. So they pass the image that GPT Image 2 created: they give that image to Codex and they're like, hey, can you turn this image into an actual working website? Then it goes from there and makes the website. And in the other example I gave you, they do basically the same thing, except they also send those images into Seedance 2 to animate some of them into videos, and then pass those images and videos into Claude Design and ask Claude Design to make that into a website.
Alex Volkov
Alex Volkov 1:06:09
Absolutely.
1:06:09
Mind blowing.
Nisten Tahiraj
Nisten Tahiraj 1:06:10
I got scared for a second.
1:06:12
I thought those were actual 3D animations and I thought we scared me. We reached Oh, they, they thrust
Alex Volkov
Alex Volkov 1:06:20
into something crazier.
Nisten Tahiraj
Nisten Tahiraj 1:06:21
Yeah, but why does it
Ryan Carson
Ryan Carson 1:06:23
ni Nisten, why does it matter?
1:06:25
Like if they're just pixels and images, they don't even have, they don't have to be 3D, right?
Nisten Tahiraj
Nisten Tahiraj 1:06:31
Well, that would've been indicative of like a much
1:06:34
smarter model that can build. That's what said something of that quality. Yeah.
Alex Volkov
Alex Volkov 1:06:38
We have this one, this
Yam Peleg
Yam Peleg 1:06:39
one.
1:06:40
This one.
Alex Volkov
Alex Volkov 1:06:40
Yeah.
1:06:40
This is, we have, uh, Rickles Smith sent us this. Thank you, Rick. This is, again, Riley Goodside: GPT Image 2 generated a photo of a cake decorated with SVG that, when transcribed to a file, renders another cake. So again, this is a photo of a cake that has SVG written on it in glazing. And this SVG looks like
Yam Peleg
Yam Peleg 1:07:05
food,
Alex Volkov
Alex Volkov 1:07:06
like this cake.
Yam Peleg
Yam Peleg 1:07:07
It looks like food and it actually renders looks.
1:07:10
Yeah, it, look, look at the cake itself. It
Alex Volkov
Alex Volkov 1:07:13
looks good.
1:07:14
It also costs $24.99. White cake with buttercream icing, net weight 43 ounces. It's 43 ounces, two pounds and 11 ounces. I'm not sure if that's correct. 1.2 kilograms, that kind of tracks with the size. So I'm also looking at, you know, the extra stuff. It's insane. It is absolutely insane that we're here, that all of us started our AI journey with Stable Diffusion too. That could barely generate stuff. Now we're talking about, like, actual intelligence. Wolfram, let's comment and let's move on. We have a bunch of stuff to cover. There's no way.
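The label arithmetic read out above can be checked with a few lines. This is only a unit-conversion sketch using the standard avoirdupois definitions (16 ounces per pound, 28.349523125 grams per ounce); the function name is illustrative.

```python
OUNCES_PER_POUND = 16
GRAMS_PER_OUNCE = 28.349523125  # exact definition of the avoirdupois ounce


def describe_weight(total_ounces: int):
    """Split a weight in ounces into (pounds, remaining ounces, kilograms)."""
    pounds, ounces = divmod(total_ounces, OUNCES_PER_POUND)
    kilograms = total_ounces * GRAMS_PER_OUNCE / 1000
    return pounds, ounces, kilograms


if __name__ == "__main__":
    pounds, ounces, kilograms = describe_weight(43)
    print(f"43 oz = {pounds} lb {ounces} oz = {kilograms:.2f} kg")
```

So the three figures on the generated label are mutually consistent: 43 ounces is 2 pounds 11 ounces, which is about 1.22 kg, in line with the "1.2 kilograms" quoted.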
Wolfram Ravenwolf
Wolfram Ravenwolf 1:07:48
Yeah.
1:07:48
This was my highlight of the week, actually. So the first thing I did: I still have some old prompts from the DALL-E era, from 2022, so I did the same render with this, and it didn't blow my mind. It was nicer, of course, but it wasn't that much. But I changed the prompt a bit and I suddenly got a complete image with text on it that accompanied the image. It was a character from a role-playing game, so it had a character description, it got all the information, even the animals with it got their own thing. So the thing is, it's not just an image model. We have intelligence in the images that we didn't have before. Nano Banana was a big step up above what we had before. I think this is an even bigger step. I don't know if they are using agentic stuff or it's just a very smart omni model; whatever it is, it's like magic, and it is so mind-blowing to see what you can do now beyond just good-looking images. The intelligence that is baked into the images, that is, wow. I think we will be using this for a long time to come.
LDJ
LDJ 1:08:50
Oh, okay.
1:08:51
Alex popped out, but no, I'm here. I'm here. What I was gonna mention is, I think a disclaimer worth mentioning: since it does seem to be a reasoning model and does have different reasoning settings, like the most recent image that Alex had up, they mentioned that they specifically were using Pro, and I believe the heavy reasoning is only available to Pro users; Plus users and below, I think, only have access to medium or high reasoning or lower. So if people are trying to replicate some of these things and don't see the same quality, do keep in mind there is a different limit on what you can get with different plans.
Alex Volkov
Alex Volkov 1:09:28
Yes.
1:09:29
Uh, no. No.
Ryan Carson
Ryan Carson 1:09:31
So speaking of plans, while you find that, Alex, I mean, but aren't
1:09:34
we all now in a place where it's like, at minimum, except for Nisten, I know, if you want to do serious work as a professional, you're gonna be paying $200 a month to Anthropic, and you're gonna be paying $200 a month to OpenAI. I feel like you have to do that now. And the only reason why is because they're so heavily subsidizing the tokens. If I was to go off my OpenAI plan and had R2 run off of the API, it would be $3,000 a month. But I can basically run R2 for $200 a month because I'm on the Pro plan
Alex Volkov
Alex Volkov 1:10:12
for
Ryan Carson
Ryan Carson 1:10:12
mean, is everybody else else doing this or, or what?
Alex Volkov
Alex Volkov 1:10:15
Uh, some people do API and, and for some people, I think it's very
1:10:18
important to run Opus only in their, like, claws and harnesses, for example. And hopefully today, we still don't have any news from OpenAI, by the way, we're still live, waiting, but hopefully today some of the comments that people have about, you know, agent stuff that Opus is much better at, they will address with this new model. But I agree with you, dude. Look, everybody should be running these models, and the heavily subsidized thing is very scary, because when Anthropic yanked the heavy subsidization of Opus out of the OpenClaw ecosystem, you could kind of see the thing go down. And then also all these companies are now copying all these features into Codex, into Claude Code, into Cowork, into Design, into a bunch of features. So, you know, there's a diffusion of where things are landing. Folks, I think it's enough by the
Nisten Tahiraj
Nisten Tahiraj 1:11:09
way, they allowed it back, which is a complete mess.
1:11:13
Allowed
Alex Volkov
Alex Volkov 1:11:13
what Back Nisten?
1:11:14
We have to, yeah.
Nisten Tahiraj
Nisten Tahiraj 1:11:15
Allowed the use of OpenClaw back in, uh.
1:11:18
Uh, back in. Yeah,
Alex Volkov
Alex Volkov 1:11:21
you mentioned
Nisten Tahiraj
Nisten Tahiraj 1:11:21
this, they reverted again.
1:11:22
So for, but
Ryan Carson
Ryan Carson 1:11:23
wasn't that just for like the CLI or something?
1:11:25
There's, it's this weird
Nisten Tahiraj
Nisten Tahiraj 1:11:27
case.
1:11:27
No, no. They just allowed, allowed it all again, what I think
Alex Volkov
Alex Volkov 1:11:30
it's
Nisten Tahiraj
Nisten Tahiraj 1:11:30
only there are mixed signals from different developers
1:11:33
and Anthropic that are, are saying
Ryan Carson
Ryan Carson 1:11:35
this, but they have a Twitter account.
1:11:36
Everybody at Anthropic has a dev Twitter account, just so you know.
Alex Volkov
Alex Volkov 1:11:40
Yeah.
1:11:40
So
Nisten Tahiraj
Nisten Tahiraj 1:11:42
you
Ryan Carson
Ryan Carson 1:11:42
can use it
Nisten Tahiraj
Nisten Tahiraj 1:11:42
in, just use it in her
Alex Volkov
Alex Volkov 1:11:45
folks.
1:11:45
We're talking over each other. What was allowed again, as far as I saw, is the CLI usage. So if you go to the OpenClaw documentation now, you can see that if you do wanna run Anthropic via your Max account, the CLI usage is fine. What this means is that if you have Claude Code installed and logged in, you can do claude -p and send a prompt. You can say, hey, I'm OpenClaw, the assistant, and Claude will not block you anymore. They had this thing, and now that's fine. I don't believe that everything is fully back like it was, but I definitely should test, 'cause I know that mine fails if I don't have extra usage turned on. We must continue, because I do wanna talk about two features of Codex. I'm not gonna bore you with the details, but no matter what I try, I can't share the screen with you. I need to do a full restart and I don't have time. But I think the two features of Codex are the more important things. So let's see if I can share this with you. Anybody use Codex here? Ryan, I know you moved on to Devin, but who is, like, a user? I use, I use Codex a
Ryan Carson
Ryan Carson 1:12:50
little bit.
1:12:51
I use it a little bit, but I'm mostly Devin now.
LDJ
LDJ 1:12:54
I've been using Codex.
1:12:55
What's your,
Alex Volkov
Alex Volkov 1:12:56
what's your,
Ryan Carson
Ryan Carson 1:12:56
what's
Alex Volkov
Alex Volkov 1:12:57
your take on Codex LDJ?
LDJ
LDJ 1:13:00
Um, well since I'm, it's my main thing then I, I could
1:13:03
only really say I love it. But yeah, especially if you're doing one-shot or few-shot, it's definitely worse at front end than Claude. But when you have a specific vision in mind, and I have been kind of getting more into that flow of actually enjoying that design process, creating my own design and giving those instructions to an agent, Codex is actually quite good at following those instructions over long time horizons, just working really hard for a very long time on implementing the very specific specifications I give it. And, yeah, I haven't really been building anything that crazy with it, though; mostly just small CRUD applications that are useful for myself, various tools like training calculators.
Alex Volkov
Alex Volkov 1:13:54
So I think that the, the thing about Codex is, um, specifically
1:13:58
how much work is being put into Codex. Last week we told you that OpenAI famously decided to consolidate, cut some side projects, and fold things into one super app, and there are early signs that Codex is the super app they're going to focus on. A lot of the, you know, promotions within OpenAI are focused on Codex, and Codex is getting a lot of new features. So the feature that I really wanted to show you, I'm just gonna show you videos of it, because you have to see it, is Codex computer use. Last week, Codex released a bunch of updates, and we told you about this at the end of the stream. We got very, very excited, but it was the end of the stream, so I wasn't able to test it out fully. Since then I have been able to test it out. Codex now has, at least on the Mac, computer use that beats anything else that I've seen. Not from the perspective of, hey, this is better at computer use by clicking buttons and identifying things; just from the UI of it. I wanna see if, uh, Codex can use your Mac now. This is an example. Do you guys see this cursor? The little cursor that jumps and clicks and moves and plays tic-tac-toe. This is a quick demo video of all of the features, but I wanna highlight this cursor. This little thing is running on a background thread somehow. This is not your Mac cursor that's getting taken over. I have no idea, and I think the industry still has no idea, how they achieved this. You know, OpenAI bought Software Applications Incorporated, and those folks worked at Apple before and worked on different things like Workflow. They're running something with accessibility. I think they're the only ones. Maybe other labs will catch up. The most important thing there is that this happens while you are able to control your computer yourself.
With most computer use, what happens if you use Claude Code, for example, and say, hey, you know, control my computer, which works, is that it will just use your mouse cursor and open windows in front of you, and you aren't able to work. You're basically sitting like this while Claude is using your computer. I ran this thing and asked it to go to tldraw and just, you know, draw things in the UI, and I was able to do it in a background window that I didn't even see. It's so much more powerful. Once you go back to any other computer use, it's useless. It's computer useless. Codex computer use is so good that any other computer use is absolutely useless. Dunno if folks have tried it, but I absolutely recommend a full trial of this thing. Comments, folks? Anybody tried this already? Anyone play with this?
Nisten Tahiraj
Nisten Tahiraj 1:16:47
I, I think I know how they did it.
1:16:50
Or I have a guess. In older Linux, before everyone for some reason switched to Wayland, you could have two cursors on Linux with X11 or Xorg. I just found this during university; people were trolling each other, and you could move cursors between laptops. And there is a port on Mac, which I do use, called XQuartz, and that lets you stream Linux applications to your Mac. So I think there's not a whole lot of work there to get that other mouse going that way. This is just my guess here, because this has been a fun Linux trick for a while that people don't realize. So it's, that's black magic here.
Alex Volkov
Alex Volkov 1:17:38
I dunno if it's the same, but I definitely know that
1:17:40
the little cursor that they have, they showed somebody building the animations. That's a layer on top that they're putting, and they're faking clicks. I don't know. The thing I want to highlight, and I think I'm going to put up this repo: I noticed, and I don't know if you guys noticed this as well, that browser use sometimes needs actual computer use. So, you know, Claude can control a browser. Playwright is a thing. And a native DevTools API MCP is also a great way to control browsers. All of them are great at clicking things within the browser, within the website, but then something like a download, for example, or a canvas interaction, these things they cannot do. The DevTools thing cannot drag or do canvas drawings, et cetera. So computer use paired with browser use, I think, is the full picture. I've been working on a skill that I will publish later, if you guys are interested, about how to combine these things, how to do a hybrid computer use slash web use. And I think that beats every other browser thing out there. LDJ, go ahead.
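The hybrid idea Alex describes can be sketched as a small dispatcher: DOM-level actions go to browser automation, and actions the browser tooling cannot perform (drags, canvas drawing, native download dialogs) go to computer use. This is only an illustrative sketch, not Alex's unpublished skill; the action names, the `Action` type and the `route` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action kinds, invented for this sketch (not from any real tool).
DOM_ACTIONS = {"click", "type", "navigate", "read_text"}
OS_ACTIONS = {"drag", "canvas_draw", "download_dialog"}


@dataclass
class Action:
    kind: str    # what the agent wants to do
    target: str  # selector, URL, or on-screen description


def route(action: Action) -> str:
    """Pick a backend for one action."""
    if action.kind in DOM_ACTIONS:
        # Cheap, structured, and reliable inside a web page.
        return "browser"
    # Everything else, including unknown kinds, falls back to computer use,
    # which operates on whatever is visible on screen.
    return "computer_use"


plan = [
    Action("navigate", "https://example.com"),
    Action("click", "#export"),
    Action("download_dialog", "Save"),
    Action("canvas_draw", "whiteboard"),
]
backends = [route(a) for a in plan]
print(backends)
```

The design choice here is to prefer the structured browser path whenever it applies and to treat computer use as the general fallback.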
LDJ
LDJ 1:18:46
So nearly a full two years ago, OpenAI actually acquired a
1:18:51
company called Multi, in June 2024. And this company was trying to do precisely things like that. They said their goal was to make the experience of using a computer inherently multiplayer, basically a multiplayer experience where you could have multiple cursors on a screen. And ever since OpenAI acquired that company in June of 2024, I've been waiting for them to release something like this, where you have another cursor on your screen that ChatGPT controls for you. Now we finally have it. It's a little bit underwhelming because it was a slow boil up to this point. But
Alex Volkov
Alex Volkov 1:19:26
yeah,
LDJ
LDJ 1:19:26
I'm glad it's here.
Alex Volkov
Alex Volkov 1:19:28
The video I'm showing right now is from VB, friend of
1:19:32
the pod, on OpenAI's Codex team. He says that not only can you have computer use, you can have multiple computer uses. Because of the way they did this, it's not your cursor and it's not your windows, so you can have subagents with Codex perform actions within different windows. Again, subagents plus computer use. They will all go and click different things in there. How insane is this? How absolutely insane this is. So he has an X window on the bottom left where it's typing things. The confetti, did you guys see the confetti thing? It opens Raycast in another one and types "confetti," and then it types some other stuff in Notes. And all these subagents are doing things in parallel in all these windows. Just
Wolfram Ravenwolf
Wolfram Ravenwolf 1:20:18
been waiting for this from the computer
1:20:21
operating system manufacturers. They could have built this already, like Apple or Microsoft. They are in AI. Okay, Apple not so much. But they could put this in their systems, like a multi-user system where the AI is another user working with you, on its own desktop or on your own desktop, sharing between those. That is stuff the operating system could provide, and the technology is there now. Somebody just has to be brave enough to actually build this thing. And it looks like the labs are doing it now. They're building everything.
LDJ
LDJ 1:20:51
Yeah.
1:20:51
And including the hardware. OpenAI said that over the next six to 12 months they're going to be announcing their first hardware product.
Alex Volkov
Alex Volkov 1:20:59
This is after they're saying they're focusing
1:21:01
on, no longer on side quests. The other thing in products... Ryan, you had a comment. I see you getting excited about something.
Ryan Carson
Ryan Carson 1:21:09
I was just laughing about your
1:21:11
comment about staying focused.
Alex Volkov
Alex Volkov 1:21:14
The stay focused.
1:21:14
Yes. I'm very happy that GPT Image is not a side quest. I'm very happy that they're doubling down on images, because it's nothing Anthropic does. Anthropic famously focuses on text generation, right? There's no image generation, no voice, nothing. I'm very happy to see OpenAI staying in, leading the pack and fighting the good fight as well, because we all benefit from this. Nano Banana: I don't remember another AI technology that dominated as long as Nano Banana did. Just absolute domination until yesterday or whatever, when Images got released two days ago. The other thing in Codex that you guys absolutely must know about, and I really wanted to show you all this, but the technology gods are not with me today: it's called Chronicle in Codex. This is a research preview that uses what's on your screen. Codex, behind the scenes, takes pictures of everything you do, every 10 seconds or so I think, and then adds it to context. So if you ask, "Hey, what am I working on? What was I working on an hour ago?" Codex knows. Codex just knows. It fills in the missing context. It is incredible. It's hard to explain how incredible it is, but folks who've used something like Recall or Rewind AI will know. Rewind got bought up by Meta and shut down, so people can't use Rewind anymore. If your whole screen is recorded all the time, it's an incredible addition to the context of your model, right? Famously, Rewind's tagline was something like "an AI system that knows everything you've seen, read, or heard." This doesn't transcribe everything you hear, but Codex now has screenshots of everything you did outside of Codex. It's kind of awesome. Honestly, it's kind of awesome. Here's an example of why it's awesome.
I have Granola running on my meetings, so when I do meetings, Granola runs behind the scenes and prints out the output of the meeting automatically. Codex sees that. So technically Codex now, without extra steps, has insight into every meeting I had throughout the day. I can ask Codex, "Hey, when I met with Wolfram, what did we talk about?" And it's just mind-blowing. Now, this does mean that you're enabling screenshots of everything you see on your Mac and potentially sending this to OpenAI, right? The images are stored locally, but at the time of processing, obviously, they're sent to the model. So that's not for everyone. In addition to this, I think there was news about Meta now adding stuff like this for all their AI engineers to measure their productivity. So that's the spy stuff; the conspiracy-minded folks may say, "Hey, this is too far." But I think for usability it's great. Wolfram.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:24:11
It was explicitly not to measure their productivity,
1:24:14
but for training models on what they are doing, basically.
Alex Volkov
Alex Volkov 1:24:18
Oh, for Meta you mean, not,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:24:20
yeah, that was what Meta was doing or is doing.
Alex Volkov
Alex Volkov 1:24:23
Yeah.
1:24:24
Uh,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:24:24
they need the training data.
Alex Volkov
Alex Volkov 1:24:26
Anybody used, uh, Chronicle yet?
1:24:31
I really wish I could show you though.
Ryan Carson
Ryan Carson 1:24:32
It sounds cool.
1:24:33
I, like I said, unfortunately I rolled off Codex, so, uh, I would like to try it, but it sounds good. I wonder how much signal to noise there is though. So that
Alex Volkov
Alex Volkov 1:24:43
was pretty good.
1:24:44
I can test this and maybe show you, but I can definitely say that I asked it, "Hey, what was I working on an hour ago?" and it was able to figure it out. It was able to tell me, "Hey, you were working on this and this and this." Nisten, what are your thoughts on an always-on, computer-screenshot-taking Codex?
Nisten Tahiraj
Nisten Tahiraj 1:25:03
I mean, just look at the last few court cases that involved
1:25:07
ChatGPT documents, with CEOs saying, "Oh, we had deleted everything," and they had zero-data-retention policies, and it all showed up in court. So you have to think of it this way: the agreements do not mean anything, and your data is always recorded. It doesn't matter, and it can come back to you or to your customers. So I would not use this unless it's running at home. I do want it, but I would not trust this thing
Alex Volkov
Alex Volkov 1:25:44
LDJ.
LDJ
LDJ 1:25:45
Yeah.
1:25:46
I'm going to take the opposite stance here. I'm definitely going to use it. Apple, all of these companies... I mean, I shouldn't give in to it in this way, and it's probably not the best argument, but they already have a ton of my data, and I'm not necessarily doing super compromising stuff here. I'm probably going to keep the "improve the model for everyone" setting off. Yeah, that doesn't guarantee that they're not going to store or train on my data, but I think the benefits and usefulness here are such that it's worth it. For things like contracts and reviewing legal work, I'm still going to go to private models, because they're good enough for that and it's just worth it to do so.
Alex Volkov
Alex Volkov 1:26:26
I definitely think there's a path towards building something
1:26:29
like this fully locally, right? We're talking about, like, Sonnet at home. The new one that we just covered is multimodal, so it can understand this. Wolfram.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:26:36
I actually had something like that, a screen watching
1:26:39
assistant. I was using Florence from Microsoft, the image recognition model, for that, and it would turn the screen into text. But it's been over a year, so the technology just wasn't there then. I think with the newer models that are smarter and faster, that would be something to run locally for sure, because you don't want to send all of your data, and in a company you probably can't or are not allowed to do this. But if it's all local, and it's just an index that goes into a knowledge base that your AI assistant can refer to, that makes a lot of sense. So yeah, definitely, a screen-watching assistant is one of the big unlocks.
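The local variant Wolfram describes (screenshots turned into text, then indexed for the assistant) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Chronicle or Wolfram's assistant actually works: the `Snapshot` and `ScreenIndex` names are hypothetical, and the text is assumed to come from some local vision model that is not shown here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    taken_at: datetime
    text: str  # text a local vision model extracted from one screenshot (assumed)


class ScreenIndex:
    """In-memory stand-in for a local knowledge base of screen text."""

    def __init__(self) -> None:
        self._snapshots: list[Snapshot] = []

    def add(self, taken_at: datetime, text: str) -> None:
        self._snapshots.append(Snapshot(taken_at, text))

    def around(self, when: datetime, window: timedelta = timedelta(minutes=5)) -> list[str]:
        """Return the text of every snapshot within `window` of `when`."""
        return [s.text for s in self._snapshots if abs(s.taken_at - when) <= window]


# Arbitrary fixed "now" so the example is deterministic.
now = datetime(2026, 1, 1, 12, 0)
index = ScreenIndex()
index.add(now - timedelta(hours=1), "Editing slides for the weekly show")
index.add(now - timedelta(minutes=2), "Reading benchmark results")

# "What was I working on an hour ago?"
hour_ago = index.around(now - timedelta(hours=1))
print(hour_ago)
```

A real version would persist the snapshots to disk and add text search, but the time-window lookup is the part that answers the "an hour ago" question.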
Alex Volkov
Alex Volkov 1:27:18
All righty.
1:27:20
I think we have a few more things to cover. Before I do, I want to talk about Crab Trap. We mentioned this in the beginning; Ryan, I think you saw this. The CEO of Brex joins the litany of CEOs who find newfound time to pair with Codex and actually build things. He built Crab Trap, and he says basically that OpenClaw is not great for enterprises. In fact, I'll say it's banned at CoreWeave, so none of us can use OpenClaw. The reason why, and Jensen mentioned this on stage at GTC, is that it has access to sensitive data within the enterprise, it can communicate externally, and it can be prompt injected. So, not great. But he's like, "Hey, we use OpenClaw at Brex internally." This is a great admission from the company. He says: "We started deploying agents internally at Brex. We couldn't stop thinking about this question." Let me actually show you this. "Agents work. Nobody wants to give them real credentials. Instead of waiting for a solution, we decided to try a novel approach: using LLMs to judge the network traffic of an AI agent." So they built Crab Trap, an open-source proxy that intercepts every outbound request and blocks risky activity using LLMs. Like we told you before, the privacy filter from OpenAI is a great tool to add to this arsenal. So this is an open-source proxy that you proxy all the network traffic through. I think it supports OpenAI, but I'm not sure if it supports Anthropic. You basically proxy everything, and it catches everything your agent sends out: it decrypts, applies static rules, and then uses an LLM as a judge. So this is kind of expensive, right? You're running another LLM to review all the other LLMs, so you have to consider context windows. But given that a leak of your private credentials can cost an enterprise significantly more, this is maybe worth it.
LDJ, comments? Ryan Carson, comments on whether or not you're going to run Crab Trap on your agents?
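The two-stage flow Alex describes (static rules first, then an LLM as a judge for whatever the rules don't settle) can be sketched roughly like this. It is not Crab Trap's actual code or API: the allowlist, the secret markers, the function names and the stubbed judge are all assumptions made for illustration, and a real judge would call a model rather than match a keyword.

```python
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical static rules, invented for this sketch.
ALLOWED_HOSTS = {"api.openai.com"}
SECRET_MARKERS = ("BEGIN PRIVATE KEY", "AWS_SECRET_ACCESS_KEY")


def static_verdict(url: str, body: str) -> Optional[str]:
    """Cheap deterministic checks; return None when the rules cannot decide."""
    if any(marker in body for marker in SECRET_MARKERS):
        return "block"  # obvious credential leak, no judge needed
    host = urlparse(url).hostname or ""
    if host in ALLOWED_HOSTS:
        return "allow"
    return None


def review(url: str, body: str, judge: Callable[[str, str], bool]) -> str:
    """Decide on one outbound request: static rules first, judge as fallback."""
    verdict = static_verdict(url, body)
    if verdict is not None:
        return verdict
    # Only undecided traffic pays for the (expensive) judge call.
    return "allow" if judge(url, body) else "block"


def stub_judge(url: str, body: str) -> bool:
    """Stand-in for an LLM call: returns True when the request looks safe."""
    return "password" not in body.lower()


print(review("https://docs.example/page", "fetch docs", stub_judge))
print(review("https://evil.example/upload", "my password is hunter2", stub_judge))
```

Putting the static rules in front keeps the extra LLM cost Alex mentions limited to requests the rules cannot classify.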
LDJ
LDJ 1:29:29
I've heard it's, it's especially effective if you have, like, if you're
1:29:33
using, let's say, Claude as the judge, and the main model you're using is also Claude, but you tell the judge, "Hey, the model that you're monitoring is Grok," or something like that, apparently it's especially good at actually catching things, because it's extra critical about it.
Alex Volkov
Alex Volkov 1:29:52
Oh, nice.
Ryan Carson
Ryan Carson 1:29:52
This is absolutely gonna be a thing.
1:29:55
Like intelligence is on demand now. So what company would not want intelligence monitoring all their traffic to make sure that their employees are not doing bad things? Like absolutely this is gonna happen,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:30:07
I just wanted to say that, uh, I want to change my
1:30:10
pick of the week to Crab Trap. I haven't looked at it in detail, but every week I'm doing a deep research, my agent is doing it, looking at how to secure agents, because the more I use my agents and the more access I give them, the more I'm concerned about this. So basically, a security solution running in the background, observing what is happening and being able to intervene: this is what I've been looking for all the time. I looked at all the guardrail implementations. But Crab Trap, definitely, this is my weekend project. I will implement this, and my agent is already on it. I want this, and I think we all need this: something to make sure that our agents are not doing stuff they shouldn't do. Hey everyone, this
Yam Peleg
Yam Peleg 1:30:50
is Pedro from Brex.
Alex Volkov
Alex Volkov 1:30:51
So this is the demo.
1:30:53
We're not going to listen to Pedro from Brex, but he is, I think, the CEO. There's four minutes of things, but basically, not only does it look at every request that your agent makes, you can also define with natural language, "Hey, this does not look dangerous," or "this is something that looks dangerous." You can add those rules, and I think that's very important for malleability. Crab Trap is, as somebody mentioned, Okta for agents, and I love this: Okta for agents. So I'm definitely going to implement this for my agents as well, and go forward from there. We're breaking news. AI breaking news coming at you only on Thursday. I,
1:31:40
all right. Now, finally, we have breaking news, folks. OpenAI's newest model, GPT 5.5, just launched. They call it "a new class of intelligence for real work." We're not going to look at the video because we want to go directly into the evals and show you that on Terminal-Bench, GPT 5.5 gets 82%, jumping from 75 on GPT 5.4, beating every other model they have here. GPT 5.5 Pro also launched, but for some reason they didn't test it on Terminal-Bench. We have Expert SWE, an internal benchmark, jumping to 73 from 68. OSWorld-Verified is a little bit of a bump. GDPval, which we specifically love here: this is the state-of-the-art model now. It beats Claude Opus 4.7, beats Gemini 3.1. Oh, we have Yam in the car, joining us for the breaking news. BrowseComp is almost state of the art. Let's go. Can we test this out? But yeah, what else? Frontier Math is incredible: 35%. Model capabilities. "OpenAI is building the global infrastructure..." "Over the past year, we've seen AI dramatically accelerate software engineering. With 5.5 in Codex and in ChatGPT, the same transformation is beginning to extend into scientific research and the broader work people do on computers. Across these domains, GPT 5.5 is not just more intelligent. It is more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries." This is a trend that we showed you before. Not only are model capabilities blowing up, the models also do it with fewer tokens. So let's take a look here. Artificial Analysis index. I love the fact that the big labs show Artificial Analysis here. They show that GPT 5.5, which is the purple here, uses significantly fewer output tokens on the Artificial Analysis Intelligence Index. Yeah, this is great. It's absolutely great.
We have folks in, in the comments like freaking out as well. Uh, somebody says it will work for 30, 60, 90 minutes or more as, wait,
Nisten Tahiraj
Nisten Tahiraj 1:33:52
people are already using it.
1:33:54
Oh, wait, what?
Alex Volkov
Alex Volkov 1:33:55
So they're saying this is "our strongest agentic
1:33:58
coding model to date. On Terminal-Bench 2, which tests complex command-line workflows, it gets state-of-the-art accuracy of 82.7%." This is now state of the art on Terminal-Bench 2. LDJ, go ahead.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:34:12
Yeah.
LDJ
LDJ 1:34:14
Yeah, so I don't have access to it in ChatGPT
1:34:15
or Codex yet. I've been refreshing. But overall, on the sheet of different benchmarks that they showed earlier in the blog post, I'm not seeing a single benchmark where Opus 4.7 is beating 5.5, which I think is pretty impressive. And yet it's maybe partially I
Alex Volkov
Alex Volkov 1:34:32
have access.
1:34:32
Let's go.
LDJ
LDJ 1:34:34
Oh, there we go.
1:34:34
Okay,
Alex Volkov
Alex Volkov 1:34:35
there we go.
1:34:36
Extra high and high. Let's say speed is fast. Okay, I'm going to use this with fast speed. Nisten, let's do the Mars thing.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:34:44
By the way, we had Opus 4.7 last week at terminal bench 2.0 at
1:34:48
69.4 then, so that is a huge jump here.
Alex Volkov
Alex Volkov 1:34:53
I'm going to do GPT 5.5 on fast mode with high reasoning.
1:34:59
Let's take a look.
1:35:07
Oh, folks are saying that they had a run for eight hours. What's up, Peter Gostev from Arena, who gets early access. All right, we're going to send this Mars thing. Let's keep reading here. For Terminal-Bench 2, not only does this model beat the scores, as you guys can see, it uses significantly fewer tokens, almost half the tokens. That's incredible. So let's look at this one. This is medium, this is low reasoning effort. Okay. The low reasoning effort gets a slightly lower score but uses about one half of the tokens. And for medium reasoning effort, you can see that GPT 5.5 gets a score of 75% on Terminal-Bench, while medium reasoning effort for 5.4 gets 63. So an over-10-point difference on medium thinking with significantly fewer tokens: 7,000 versus 9,000. That's very important as well, right? Wolfram, we talked about how important it is how many tokens you use.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:36:16
And
Alex Volkov
Alex Volkov 1:36:17
I'm so excited I can speak
Wolfram Ravenwolf
Wolfram Ravenwolf 1:36:19
the price of the intelligence you are
1:36:21
getting, that is super important. And we also found out that if a model thinks longer, it can actually be detrimental on agentic benchmarks. So finding a good balance, that is probably also why the score is higher now: it decides it doesn't have to think so much, but can act and then correct instead of overthinking.
Alex Volkov
Alex Volkov 1:36:39
let's use, let's build a simple website.
1:36:43
Build me a okay,
Nisten Tahiraj
Nisten Tahiraj 1:36:45
you guys do that.
1:36:46
But I am kind of blown away by this design thing.
Alex Volkov
Alex Volkov 1:36:51
Nisten, not now.
1:36:52
Nisten, you should have joined me a week ago. Now the big news is GPT 5.5. You're killing me. Yes, Claude Design is amazing. Yes, I agree with it.
Nisten Tahiraj
Nisten Tahiraj 1:37:00
It's,
Alex Volkov
Alex Volkov 1:37:02
I have to edit this out.
1:37:03
All I wanted
Nisten Tahiraj
Nisten Tahiraj 1:37:04
is, is
Alex Volkov
Alex Volkov 1:37:05
okay.
1:37:05
Yes, it's incredible. But please. What else do we have here? So, Expert SWE as well: you can see that it actually uses fewer tokens. They're showing an example here of the space mission app with 3D things. I'm going to build a website that shows a price comparison. I'm hoping this is going to go and actually look at the prices. GPT 5.5, Opus 4.7. Where's Gemini, by the way? Gemini 3.1. Gemini's latest one is 3.1, right? In 3D somehow, with three.js. So I asked it to go and look up the scores. I think the new meta that we're now waiting for is also generating things with images. So let's see if it builds the Mars thing. Nisten, it still thinks a lot. So I have an answer,
LDJ
LDJ 1:38:02
by the way, for the pricing.
Alex Volkov
Alex Volkov 1:38:04
Oh, okay.
1:38:05
Tell us. I just noticed I ran it with 5.4. I want it with 5.5. Yeah, go ahead. About pricing.
LDJ
LDJ 1:38:12
Sure.
1:38:12
So it looks like regular GPT 5.5 is priced at $5 per 1 million input tokens and $30 per 1 million output tokens. I think 5.4 was $25 per 1 million output tokens. So yeah, that's a little bit more expensive, but not insanely much. And then 5.5 Pro is the usual cost of the Pro models, it seems, at $30 per 1 million input tokens and $180 per 1 million output tokens. So still a lot, but the other Pro models were a lot like that too.
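A quick worked example of how these prices translate into cost. The per-token prices are the ones LDJ quotes; the doubled pricing and the roughly 40% token reduction are the figures Artificial Analysis reports elsewhere in this episode; the token counts in `run_cost` are made-up workload numbers for illustration only.

```python
def run_cost(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Dollar cost of one run; prices are in dollars per 1 million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def relative_cost(price_multiplier: float, token_multiplier: float) -> float:
    """Net cost change when per-token price and token usage both change."""
    return price_multiplier * token_multiplier


# Hypothetical workload: 1M input tokens and 100k output tokens on GPT 5.5.
example = run_cost(1_000_000, 100_000, in_price=5.0, out_price=30.0)

# Price doubled (x2.0) while token use dropped about 40% (x0.6).
net = relative_cost(2.0, 0.6)

print(f"example run: ${example:.2f}")
print(f"net cost multiplier: {net:.2f}")
```

Doubling the price while using 60% of the tokens gives a multiplier of about 1.2, which is where the "net 20% more expensive" figure comes from.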
Alex Volkov
Alex Volkov 1:38:48
Oh, look at this.
1:38:48
We have friends from Twitter showing up on the actual page. Dan Shipper, founder of Every, describes 5.5 as "the first coding model I've used that has serious conceptual clarity." And Pietro Schirano, our friend from MagicPath, saw a similar step change when 5.5 merged a branch with hundreds of frontend and refactor changes into a branch that had also changed substantially, resolving the work in one shot in about 20 minutes. "It genuinely feels like I'm working with a higher intelligence, and there's almost a sense of respect." I gotta wonder if they fixed OpenClaw or something. Did they mention OpenClaw here? OpenClaw? Nope, they didn't mention OpenClaw. Because we know that once Anthropic yanked the OpenClaw thing, everybody was waiting for OpenAI to catch up with Codex. Let's look at GDPval. GDPval tests agents' abilities to produce well-specified knowledge work across 44 occupations. GPT 5.5 scored 84%. 84%. Where's my 84 here? And the industry expert baseline is here. So all these models beat it. It's just a little bit above GPT 5.4, not a huge amount, but somebody says, "It's a good model, sir." Yes. Okay. All models now are good models. OSWorld-Verified. Oh, this is a nicer model for Tulio as well. Oh, looks like we're about to see the Mars generator, Nisten, and we can compare it to the previous one that we ran with Opus 4.7.
Nisten Tahiraj
Nisten Tahiraj 1:40:26
I expect that to take some time back and forth,
1:40:30
but, uh, yeah, yeah, yeah. We're gonna see
Alex Volkov
Alex Volkov 1:40:33
the desktop view is alive.
1:40:35
The default target was one kilometer under the exact rail length after rounding. So it labeled the minimum orbit, blah, blah, blah. It, it really takes, you know, it really thinks about the, the, the math there. 'cause we asked it to do all of the math.
Nisten Tahiraj
Nisten Tahiraj 1:40:47
It's doing mobile page view too.
Alex Volkov
Alex Volkov 1:40:49
I think so.
Nisten Tahiraj
Nisten Tahiraj 1:40:50
Interesting.
Alex Volkov
Alex Volkov 1:40:51
That's the first time we've seen this, right?
1:40:53
That we have two, two examples that the model like compare, wait, show
Nisten Tahiraj
Nisten Tahiraj 1:40:56
the picture that it took.
1:40:58
It took a picture, right? A bigger one. Yeah. Okay. Okay. All right.
Alex Volkov
Alex Volkov 1:41:02
Oh, it's gonna use my browser now.
1:41:05
It asked me for permission to use the browser, because Codex is like that. I'll give it permission to use my browser and we'll see what's going on. So, folks who are just tuning in, we have a bunch of folks here. We're testing GPT 5.5 from OpenAI. They just dropped it. We're testing it in multiple ways, but first of all, we're running this on a Mars rails calculator from Nisten that we usually test things with on the show. The thing we're noticing is that it has a verified-mobile PNG. It now deletes it, but this model decided to test its own UI on both a desktop browser and a mobile browser. And I've never seen this before. And it's done. It's running now; let's take a look. I want to open this in the actual browser for a second. Nisten, I will let you verify the numbers if you want to, but,
Nisten Tahiraj
Nisten Tahiraj 1:41:59
uh,
Alex Volkov
Alex Volkov 1:42:00
there we go.
1:42:02
So we have Mars
Nisten Tahiraj
Nisten Tahiraj 1:42:04
it, it should get the numbers right?
1:42:05
Even
Alex Volkov
Alex Volkov 1:42:05
small ones.
1:42:05
Yeah. Yeah. We have the Target. Minimal orbit eastward, no rotation assist. Oh, escape. Outward Escape. No rotation assist, and then custom rail. I don't know what that, yeah, let's
Nisten Tahiraj
Nisten Tahiraj 1:42:15
just, no, let's just do minimum orbit eastward.
Alex Volkov
Alex Volkov 1:42:17
Okay.
1:42:18
Eastward. And then we have 65.
Nisten Tahiraj
Nisten Tahiraj 1:42:21
Looks good.
Alex Volkov
Alex Volkov 1:42:23
We have acceleration time, et cetera.
1:42:25
Exit angle. We can, what the hell was this? Oh, this is a different one. This is, okay. Uh, and then we, we hit launch and let's see. We can see the, the model goes. Oh, oh, we
Nisten Tahiraj
Nisten Tahiraj 1:42:36
launched it.
1:42:36
Okay. Interesting
Alex Volkov
Alex Volkov 1:42:38
launch.
1:42:38
But I don't see,
Nisten Tahiraj
Nisten Tahiraj 1:42:40
yeah, it maybe kinda, maybe, uh, do the exit angle.
1:42:43
I don't know, like 15 degrees or something.
Alex Volkov
Alex Volkov 1:42:45
Okay.
1:42:46
Let's Like this
Nisten Tahiraj
Nisten Tahiraj 1:42:47
or, yeah, just three.
1:42:48
That's fine. Okay.
Alex Volkov
Alex Volkov 1:42:51
It's not the best one that we've seen.
Nisten Tahiraj
Nisten Tahiraj 1:42:52
Yeah, it's not, it's
Alex Volkov
Alex Volkov 1:42:55
every, everything else we showed has like multiple V views, angles.
1:42:59
I did custom. Let me refresh this guy and let's start again.
Nisten Tahiraj
Nisten Tahiraj 1:43:02
Sometimes you have to tell it Add orbit controls
1:43:04
and other cinematics stuff.
Alex Volkov
Alex Volkov 1:43:07
Yeah.
Nisten Tahiraj
Nisten Tahiraj 1:43:09
And it's not showing it in orbit either.
Alex Volkov
Alex Volkov 1:43:12
No, but it did do Mars, so that's pretty cool.
Nisten Tahiraj
Nisten Tahiraj 1:43:14
Yeah.
1:43:15
Yeah. Can you rotate it? Can you drag and rotate?
Alex Volkov
Alex Volkov 1:43:17
No, it's, it, it's locked in place.
1:43:19
So we didn't get any of the fancy stuff that we got from like, uh, Opus or even the previous GPT. Meanwhile though, I've been running this Codex, so on the Mars thing we're saying it's, it wasn't the best one, but maybe we need to specify a little bit better.
Nisten Tahiraj
Nisten Tahiraj 1:43:34
Yeah, I just need better, better prompting.
Alex Volkov
Alex Volkov 1:43:36
Meanwhile, I asked this prompt: build me a beautiful website
1:43:39
that shows the price comparison between GPT 5.5, Opus 4.7, and Gemini 3.1, in 3D somehow, with three.js. It's still running, but it built me this. You guys want to see? "Frontier Model Price Field," 450 to 1150. There's some text overlapping some other text, but it is a price comparison with blended input and output. You can see the 3D kind of rotates, and you can see the prices. So GPT 5.5: $5 per input, $30 per output. It added the Artificial Analysis index, so this model is at 60, Opus 4.7 is at 57.3, and Gemini is at 57.2. It added Terminal-Bench evals. I didn't ask for evals. It added SWE-bench Verified and SWE-bench Pro evals, and added GDPval evals as well. Not only that, Codex asked the model to confirm. You guys can see the little Codex window here: it took a screenshot and confirmed that it works. "And the mobile pass exposed the usual absolute layout trap. The cards were starting too high." So this is now the second time that it looks at and verifies its own work on mobile. This is the first model I've seen that does this without any prompting at all, which is very, very cool. What else do we have? Folks are saying Artificial Analysis posted their benchmark. Let's take a look at Artificial Analysis. Okay, we have the official score here and also Artificial Analysis. Let me just open this in a new tab. I have some scores to compare with Mythos, by the way, when you're ready. Oh, with Mythos. Let's go. Let me find Artificial Analysis. Here is their official thing: independent analysis of 5.5. All right, from Artificial Analysis: GPT 5.5 takes OpenAI back to the clear number one in AI. OpenAI's new model tops the Artificial Analysis Intelligence Index by three points. It's not that much. Breaking a three-way tie with Anthropic and Google. OpenAI gave us pre-release access to test all five reasoning effort levels:
Uh, extra high, high, medium, low, and non-reasoning. GPT 5.5 tops, uh, Terminal-Bench Hard, GDPval, and our newly hosted APEX-Agents Artificial Analysis eval. The model trails only other OpenAI models in CritPt and comes second to Gemini 3.1 Pro Preview on three additional evaluations. Uh, 20% more expensive to run the Intelligence Index. Per-token pricing was doubled from GPT 5.4, double the pricing, to $5 and, like, $30 per 1 million output tokens. However, a 40% token reduction largely absorbs the hike. So this model is, uh, like, more expensive, but there's a 40% token use reduction on, uh, Artificial Analysis, resulting in a net 20% cost increase to run our Intelligence Index. Effort: a clear ladder for balancing intelligence and cost. GPT 5.5 scores the same as Claude Opus on our Intelligence Index at one quarter of the cost. Wow. All right. Cool. What else? Number one on GDPval and, uh, trailing the frontier on hallucination. Our private AA-Omniscience benchmark rewards factual knowledge. Uh, GPT 5.5 extra high has the highest accuracy at 57%, meaning the model can recall facts in the Omniscience corpus more effectively than any other model. However, it has a hallucination rate of 86% versus Opus at 36. Uh, this makes it more likely to answer a question where it does not know the answer. That's not great, honestly. Uh, but great model, sir. Great model, sir. Alright, this, the pricing thing has finished. Let's see if it changed anything. Uh, no, it's still kind of, like, wonky, but I, I kinda like the price comparison thing. I didn't ask it for too much. Besides the fact that it's a little bit, uh, there's text overlapping here, it's pretty cool. Um, some, some folks are saying they can't wait to test the 5.5 Pro. Let's see if 5.5 Pro is up in ChatGPT.
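The pricing arithmetic read out above can be sanity-checked in a few lines. This is only a sketch using the two figures quoted on the show (per-token price doubled, about 40% fewer tokens); the function names are made up for illustration, and it ignores how Artificial Analysis actually blends input and output tokens.

```python
def relative_run_cost(price_multiplier: float, token_reduction: float) -> float:
    """Cost of a fixed workload relative to the previous model (1.0 = unchanged).

    price_multiplier: new per-token price divided by old per-token price.
    token_reduction: fraction of tokens saved on the same workload (0.40 = 40%).
    """
    return price_multiplier * (1.0 - token_reduction)


def break_even_reduction(price_multiplier: float) -> float:
    """Token reduction needed for a price increase to be cost-neutral."""
    return 1.0 - 1.0 / price_multiplier


# Figures as quoted in the episode: price doubled, roughly 40% fewer tokens.
ratio = relative_run_cost(price_multiplier=2.0, token_reduction=0.40)
print(f"relative cost: {ratio:.2f}x")
print(f"break-even token reduction: {break_even_reduction(2.0):.0%}")
```

Under those assumptions the workload costs about 1.2x what it did before, which matches the "net 20%" figure, and a doubled price would need a 50% token reduction to break even.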
LDJ
LDJ 1:47:59
I tried to check my ChatGPT, but I don't see it on mine,
1:48:03
but maybe you have it on yours.
Alex Volkov
Alex Volkov 1:48:04
Not even logged.
1:48:04
You're still
LDJ
LDJ 1:48:05
sharing, by the way, by the way.
Alex Volkov
Alex Volkov 1:48:06
Yeah.
1:48:08
Thank you. Am I still sharing even now?
LDJ
LDJ 1:48:10
Yes.
Alex Volkov
Alex Volkov 1:48:11
Lovely.
1:48:11
Even if I moved it away, I'm still, well, I don't see
LDJ
LDJ 1:48:14
your browser anymore, but I
Alex Volkov
Alex Volkov 1:48:15
see.
1:48:15
Okay. Yeah, that, that's fine. So let me, let me, let me log in, uh, and see if I have the Pro. I don't think Codex has access to Pro, I think it just, uh, it is just online, right? So we'll see. All right. Log into ChatGPT, lemme confirm that I do not have
LDJ
LDJ 1:48:35
here.
1:48:35
While you do that, I could say some, uh, Mythos versus 5.5 scores.
Alex Volkov
Alex Volkov 1:48:39
Yes, please.
LDJ
LDJ 1:48:40
So, um, it looks like on Humanity's Last Exam and GPQA and most of the
1:48:44
benchmarks, Mythos is significantly beating it, but it is interesting in CyberGym, which is, it seems like, really the only popular cybersecurity benchmark that Anthropic tested Mythos on. Mm-hmm. Uh, in that case, Mythos Preview got 83.1%, Opus 4.7, which just released, gets 73.1%, and GPT 5.5 gets, sorry, I just had it pulled up here. Okay. GPT 5.5 gets 81.8%.
Alex Volkov
Alex Volkov 1:49:19
Could you
LDJ
LDJ 1:49:19
send me something so
Alex Volkov
Alex Volkov 1:49:20
we, we, we have a visual as well.
LDJ
LDJ 1:49:23
Uh, sure.
1:49:23
So basically GPT 5.5 only scores about 1.5% lower than Mythos here, than
Alex Volkov
Alex Volkov 1:49:30
Mythos.
LDJ
LDJ 1:49:31
While, yeah.
1:49:31
While Opus 4.7 is a full 10% behind.
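The gaps quoted here are easy to mix up between percentage points and relative percent, so a small sketch may help. It uses only the three CyberGym scores read out above; the helper names are hypothetical.

```python
# CyberGym scores as read out on the show, in percent.
scores = {"Mythos Preview": 83.1, "GPT 5.5": 81.8, "Opus 4.7": 73.1}


def gap_points(leader: str, other: str) -> float:
    """Absolute gap in percentage points."""
    return scores[leader] - scores[other]


def gap_relative(leader: str, other: str) -> float:
    """Gap as a percentage of the leader's score."""
    return gap_points(leader, other) / scores[leader] * 100.0


for model in ("GPT 5.5", "Opus 4.7"):
    points = gap_points("Mythos Preview", model)
    relative = gap_relative("Mythos Preview", model)
    print(f"{model}: {points:.1f} points behind, {relative:.1f}% relative")
```

On these numbers GPT 5.5 trails by about 1.3 points (roughly 1.6% in relative terms), and Opus 4.7 trails by 10 points (roughly 12% relative), so "10% behind" reads most naturally as percentage points.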
Alex Volkov
Alex Volkov 1:49:35
Oh wow.
1:49:37
So this is a cybersecurity model as well. Is this Spud, do we know if it's Spud?
LDJ
LDJ 1:49:43
Uh, I don't think it's confirmed.
1:49:45
I think it's kind of been implied
Alex Volkov
Alex Volkov 1:49:47
that
LDJ
LDJ 1:49:48
it might be, or at least like an early version of
1:49:50
Spud or something like that.
Alex Volkov
Alex Volkov 1:49:51
Yeah.
1:49:52
I think that folks have posted that, you know, folks from OpenAI I think posted something about, like, Spud is coming. Uh, what else do we want to, uh, let's see what Sam says. Sam Altman: We believe in iterative deployment. Although 5.5 is already a smart model, we expect rapid improvements. Iterative deployment is a big part of our safety strategy. We believe the world is best equipped to win, uh, at the team sport of AI resilience this way. Okay. We believe in democratization. We want people to be able to use lots of AI. We aim to have the most efficient models, the most efficient inference stack, and the most compute. We want our users to have access to the best technology. We have been tracking cybersecurity as a preparedness category for a long time and have built mitigations we believe in that enable us to make capable models broadly available. He's taking direct shots at Anthropic, with Mythos and the "too dangerous to release." Oh yeah.
Yam Peleg
Yam Peleg 1:50:43
Oh yeah.
1:50:43
Oh yeah. Oh yeah. That's,
Alex Volkov
Alex Volkov 1:50:44
uh, shots fired.
1:50:46
Absolutely. We love you and we want you to win. Sam Altman says, we want to be a platform for every company, scientist, entrepreneur, in person, and in app. Parentheses. My whole career has largely been about the magic of startups, and I think we're about to see that magic at hyper scale. Uh, this is great. This is great. So shout out to Sam Altman. Uh, let's see what else OpenAI posted about OpenAI. You guys know what? I gotta wonder if GPT Image now is significantly better because GPT 5.5 is significantly better. Or, like we said, GPT 5.5 was already in GPT Image, uh, and we just didn't know about it yet.
Nisten Tahiraj
Nisten Tahiraj 1:51:25
Oh, yes.
1:51:26
Again, the, for these types of models, you have to include a full language model, uh, in it. Yeah. So, yes. Yes, that will help.
Alex Volkov
Alex Volkov 1:51:36
So we can go and take a look.
1:51:38
Uh, lemme log into fal.ai, we'll try GPT Image. Meanwhile, uh, folks, let us know if you want us to test anything specific because, um, oh yeah, Peter, please, please do. Do you wanna jump on? Let me, let me invite Peter Gostev, 'cause he's been testing this model for a bit. Let's invite Peter. Uh,
Wolfram Ravenwolf
Wolfram Ravenwolf 1:51:55
the new GPT Image is also on OpenRouter, and there
1:51:57
it's called GPT 5.4 Image 2. So basically the language model below the image model.
Alex Volkov
Alex Volkov 1:52:04
Wait, could you say this again?
1:52:05
Wolfram?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:52:06
Uh, yeah.
1:52:06
On OpenRouter, the model is called OpenAI slash GPT 5.4 Image 2. So basically you have the name of the language model in the name of the image model.
Alex Volkov
Alex Volkov 1:52:15
Oh, GPT 5.4
1:52:17
Image 2. On the API, you mean?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:52:19
Yeah, on the API.
1:52:20
So basically if you, uh, expect it to be 5.5, then it will be updated there as well. So, so far it has been 5.4 with image two as a layer.
Alex Volkov
Alex Volkov 1:52:30
Yep.
1:52:31
We'll see if Peter Gostev from Arena comes on. Uh, I texted him. Let's see. But Peter, if you're listening, I sent you a DM, uh, on X with the link. Please join us on stage. Uh, meanwhile, let's see what OpenAI says here. Uh, so we kinda looked at those evals, uh, but I think what is standing out to us: they have GPT 5.5 Pro, and very interestingly, they compare, uh, the four models here. They compare the thinking and the pro to both 5.4. Uh, GPT 5.5 thinking looks like it beats nearly all the models on Lon. I don't know what LON is. Um, hmm. I dunno. Wolfram, did you try and switch to 5.5 in, uh, in Hermes?
Wolfram Ravenwolf
Wolfram Ravenwolf 1:53:20
Uh, I don't have it in Germany yet.
1:53:22
Uh, it is being rolled out, so I asked, at least, uh, I did some research. Uh, it should come, but it's not there yet. Otherwise I would definitely switch, and I just, uh, subscribed to OpenAI again with the Pro account.
Alex Volkov
Alex Volkov 1:53:35
Oh, nice.
1:53:35
Okay.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:53:36
So when 5.3 or something in December, I unsubscribed
1:53:40
and now I'm, I will test and see, but I have high hopes and high expectations.
Alex Volkov
Alex Volkov 1:53:45
Meanwhile, while we wait for Peter to come on, folks, I
1:53:48
wanna show you, uh, the thing that I wanted to show you this whole time. I wanna show you computer use. Okay? So I'm gonna, I'm gonna start a new chat here with 5.5. I don't think we need high. Let's do medium and use @computer use to... You got it? You got it in Codex already? Yeah, I got it in Codex. Yeah. Crazy, crazy, crazy. Let's try: uh, use computer use to interact with the Chrome browser and, uh, tweet from my account, "Hey, we're live and testing GPT 5.5 that just dropped. Join us on our live stream," but also, uh, quote tweet my previous tweet that has the live stream. Not an easy task. Definitely for computer use, not an easy task. Uh, let's, let's see, folks. Let's see. So right now we're gonna use GPT 5.5 on medium, and I'm also on fast mode, so it burns like 1.5x the tokens. We'll, we'll see how the computer use is gonna work. And the thing that I've wanted to show you all this time, and finally I can show you, is that it's already clicking in Chrome while I'm focused here. So it's already doing the clicks in Chrome. Uh, "Chrome is already on X, logged into @altryne. I can see your live broadcast in the sidebar, so I'm going to use the..." It found the livestream post. It clicked. Do you guys see this? It clicks the, the thing. It's gonna do quote. Are you seeing this? Let's go. It's beautiful. And now it's gonna post for me. You guys see the little cursor? It's gonna focus. Hopefully it's focusing.
Nisten Tahiraj
Nisten Tahiraj 1:55:29
Do not let Nikita see this, uh, guy.
1:55:33
It didn't have it.
Alex Volkov
Alex Volkov 1:55:34
Nikita, don't look at, don't look at it.
1:55:36
"I have the quote tweet drafted against the livestream post, ready to send. Because this is a public post, please confirm. Should I click post now?" Yes, let's go.
Wolfram Ravenwolf
Wolfram Ravenwolf 1:55:48
Alex, did you give it a prompt to wait for
1:55:50
confirmation or is that, uh, no. Default.
Alex Volkov
Alex Volkov 1:55:52
It's default.
Yam Peleg
Yam Peleg 1:55:54
That's great.
Alex Volkov
Alex Volkov 1:55:55
That's great.
1:55:56
It's clicking, it's clicking, it's sending, and it's sent. Ah, this was awesome. This is, folks, this is computer use with 5.5. And "posted and verified. The new quote tweet is live on your..." It did the verification as well. Uh, folks, I wanna welcome Peter Gostev from Arena AI to the show. Uh, Peter, welcome. We just did a live stream with you, well, two days ago, talked about GPT Image, and now we're talking about a new model, GPT 5.5, that just dropped from OpenAI. Uh, first of all, thank you for joining. Second of all, uh, impressions from you. Would love to hear about this model. Please.
Peter Gostev
Peter Gostev 1:56:35
Yeah, let me, you know what?
1:56:38
I feel like I can hear you from two places and I clearly have you open too many times.
Alex Volkov
Alex Volkov 1:56:45
Alright, so while we fix this one second, I was just like,
Peter Gostev
Peter Gostev 1:56:47
yeah, yeah.
1:56:47
Lemme gimme a second.
Alex Volkov
Alex Volkov 1:56:49
I will reiterate with folks that we just saw a demo of computer use
1:56:53
with GPT 5.5, and I asked it not only to post something on Twitter, which is easy, I asked it to quote tweet another tweet of mine with the livestream. There's quite a lot of, like, intelligence involved in, like, figuring this out, and it did it, like, super fast. We, we all saw it happening on the fly, and I think it's, like, incredibly, incredibly cool. Um, Peter, lemme know when, when you're ready. I go,
Yam Peleg
Yam Peleg 1:57:16
just to be clear, uh, before that, uh, across the board state of
1:57:20
the art, right, from thinking and above everything is state of the art.
Alex Volkov
Alex Volkov 1:57:24
Yeah.
1:57:24
Correct. Yeah. State of the art while using, like, what, 20% fewer tokens or something? Or, uh, sorry, like almost 50% fewer tokens. All right folks, let's welcome Peter Gostev from Arena to the show to talk about GPT 5.5.
Peter Gostev
Peter Gostev 1:57:38
Yeah, so, uh, I haven't had huge amounts of time with it,
1:57:41
but uh, it was certainly fun to test. I would say the biggest thing that jumps out is that is the first time when a model can actually properly do long running tasks. Um, all previous models, I know they kept saying, oh, like you can do it for many hours, but every time, I dunno about you guys, like, I, I just, I, I shouted it, I do anything I can think of. I come up with these constructs of how it's supposed to do it and then it never does it. So that was always very annoying. And now is the first time when I could really, maybe it's not completely, I say work for 10 hours and it works for 10 hours, but without too much prompting, you can get it to work for a long time. So, I'll give you one example. Uh, yesterday I came up with a, with a little idea, uh, it, it's not done yet considering how long it's running, but I wanted to kind of generate some images, create an app, you know, the, and create the whole kind of experience around it. And, um, what I did before going to sleep, I came up with a prompt and then I queued up, you know, how you do, you queue up like thermal prompts to like keep it going?
Alex Volkov
Alex Volkov 1:58:48
Yeah.
Peter Gostev
Peter Gostev 1:58:48
And then when I woke up I thought, okay, it'll be done, and
1:58:51
it'll probably be done at like 3:00 AM. I woke up, it hasn't literally
1:58:55
finished the first one, so like all of this queuing up was completely unnecessary, so it just kept going.
Alex Volkov
Alex Volkov 1:59:00
Oh wow.
Peter Gostev
Peter Gostev 1:59:01
Um, so that's the first time I've ever had that happen.
1:59:05
Um,
Alex Volkov
Alex Volkov 1:59:05
so how long did it run for?
Peter Gostev
Peter Gostev 1:59:07
So I, at about eight and a half hours, I kind of stopped
1:59:11
it to, to just kind of check in with it and try and rearrange things a little bit just to speed it up. Um, like for example, I wanted to use subagents a bit more, so it's like, so it's not gonna run for another 20 hours. So, um, but yeah, probably, I dunno how long it would've been running for. I'll tell you even now I've got the button to, to do an update on Codex, um, up and I literally cannot do it because it's gonna ruin my, my long running tasks. And I've got, like, I've got a couple of, I I've got, I've got three now. Long running tasks running one, let me just check. I think it's been running for about seven, seven and a half hours. And I, I have this little lab that I have just for my work where, uh, I've got like different visualizations pulling in the data. I've got very custom visualizations that I'm doing. Um, and it's kind of, I code it in a really crappy way, so it's like it's all breaks and so on. So I want to keep migrating it to like better, better architecture. Um, and uh, yeah, it's been going for like seven hours every time I check. Seven hours. Come on. Yeah. Literally, literally started today seven hours. So I can't even update the bloody app. Uh, 'cause it keeps running.
Alex Volkov
Alex Volkov 2:00:24
It's still running.
Peter Gostev
Peter Gostev 2:00:25
Yeah.
Alex Volkov
Alex Volkov 2:00:26
Um, the, it's quite crazy.
2:00:28
Uh, so like, Ralph Loops are dead essentially with models that are essentially running through all this time. Seven hours is insane.
Peter Gostev
Peter Gostev 2:00:34
Yeah.
Alex Volkov
Alex Volkov 2:00:35
What else did you notice?
2:00:36
What, what differences?
Peter Gostev
Peter Gostev 2:00:38
Um, so it feels, I mean, it's kind of
2:00:42
silly, but it does feel nicer. Like it does feel kind of a little bit smart and just nicer to, to speak to, and it, it just kind of explains things a bit better. So I feel like when, when I'm trying to get something done, sometimes a bit like, especially starting with like, I wanna say maybe 5.2 or something like that, it would just be kind of abrupt and just kind of do stuff and then you're like, I, I don't really know what's happening.
Alex Volkov
Alex Volkov 2:01:09
Yeah,
Peter Gostev
Peter Gostev 2:01:09
this one is a bit better.
2:01:10
I don't know if it's a, an important change or they just changed the style of it or something like that. So it's, like, hard to know for sure. Uh, but it does feel kind of smarter that way. I did some, uh, kind of one-shot 3D, uh, generations, like, uh, I like to do, and that was noticeably better, uh, kind of one-shot, versus, uh, uh, 5.4. Um, yeah. So that was really nice. Yeah, I was using computer use as well, like you're saying. I like that as well. Um, I would say though, like, I, I still feel like we are not quite there in terms of it being, like, quite as good as I, I hope it would be. So what I mean is that what I want it to do is that it can literally use the app itself and then, like, properly reflect on what's wrong and then make it better. It's like, it's not quite getting it. Like, there's not really any model that does. And I used it with Opus as well, with, like, Gemini. They'll kind of look at it and just say, oh yeah, that's good, and just not really register very obvious issues. But I dunno if it's, like, vision kinda sucks still, or, or what. I dunno. But, so it's kind of, it kind of does something, but it feels like it, it needs, like, another generation probably, probably with vision.
Alex Volkov
Alex Volkov 2:02:26
Yeah.
2:02:28
I, I, uh, you know, we're just getting it now, but I'm running this now with, uh, 5.5 medium and computer use, and I asked it to go and download the, the brand kit that we generated from claude.ai and just generate, like, a launch video for itself. So we're gonna see, but this is a long running task. I don't know, like, we're gonna sit here. Uh, but folks, I can show you, like, what's going on. Uh, we, we have kind of a, uh, like, it went and found the brand kit and says the kit files are now in, in this, uh, folder. Uh, the README, the tokens, the Claude design system skill. "Next, I'm scaffolding the HyperFrames project." So basically, all the tools that I told you about before, GPT 5.5 is now using: computer use, plus, via Codex, using the, like, writing CLI and doing some things to create a video. The only thing that I have, and this is a comment that I wanna give, not, uh, 5.5 related necessarily. It's just a, a comment about how Codex works. Uh, like, uh, Peter said, you said you queued up some stuff. I chose the steer function versus the queue function. So steer is something that only, I think, GPT has. I think Devin also has steer, but GPT, the model has steer: while it's running, you can steer it. So it's great for long running tasks. Peter, the, like, the one that you said you had, like, eight hours. Uh, usually in Claude, you, you have to pause it or stop it completely with tool calls and say, hey, do this instead. Uh, GPT has steering built into its thingy, so you can actually tell it to do some stuff. So I can say, uh, for example, uh, you can see this, probably: don't render the video, uh, just show me the, uh, HyperFrames UI to confirm before. I'll say, by the way, don't render the video. This is steering, so now it'll throw this in the middle of the, of the reasoning process, and then you can kind of, like, uh, you know, you can join the, the long running process with your thoughts. Um, anything else we should cover, folks?
I'm seeing this, uh, Mythos comparison is very interesting. Uh, Wolfram, you wanna talk about this a little bit? 'Cause I think LDJ showed us this and we can show this one once again.
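The difference between steering and queueing described above can be pictured with a toy loop. This is a hypothetical sketch, not how Codex is implemented: `run_agent`, `steer_at`, and the step names are all invented for illustration. A steering message is folded in between steps of the running task, while a queued prompt only starts once the current task has finished.

```python
def run_agent(task_steps, steer_at, queued_prompts):
    """Toy agent loop that returns an ordered log of events.

    task_steps: ordered step names for the current long-running task.
    steer_at: dict mapping a step index to a steering message that
        arrives just before that step runs (hypothetical mechanism).
    queued_prompts: follow-up prompts that wait for the task to end.
    """
    log = []
    for index, step in enumerate(task_steps):
        # Steering: the message joins the run mid-task, no restart needed.
        if index in steer_at:
            log.append(f"steer: {steer_at[index]}")
        log.append(f"step: {step}")
    # Queueing: nothing here is seen until the task above is done.
    for prompt in queued_prompts:
        log.append(f"queued: {prompt}")
    return log


log = run_agent(
    task_steps=["download brand kit", "scaffold project", "build preview"],
    steer_at={2: "don't render the video, show the preview UI first"},
    queued_prompts=["write a short launch caption"],
)
for entry in log:
    print(entry)
```

In this sketch the steering note lands before the third step, while the queued prompt is only picked up after all three steps complete, which is why queueing extra prompts did nothing during an eight-hour run.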
Wolfram Ravenwolf
Wolfram Ravenwolf 2:04:33
LDJ has a nice comparison with the scores and, uh, the
2:04:35
Mythos model still has some advantages. Um. Can you bring it up or should I
Alex Volkov
Alex Volkov 2:04:40
just Yeah, I'll bring it up.
2:04:41
I just checked on, uh, ChatGPT. I don't have 5.5 still, on ChatGPT.
Ryan Carson
Ryan Carson 2:04:45
Yeah, me
Alex Volkov
Alex Volkov 2:04:45
neither.
Wolfram Ravenwolf
Wolfram Ravenwolf 2:04:46
Alright.
2:04:47
Yeah, so basically Terminal-Bench, it's very, very close. But, uh, in Humanity's Last, uh, Exam, which is not the last, uh, there is still a big gap over there. Even if it's using tools, it's still over 12%, uh, of a difference. Percentage, uh, points.
Alex Volkov
Alex Volkov 2:05:05
you, you're talking about the unreleased Claude Mythos
2:05:07
from Anthropic compared to GPT 5.5 that we all just got in Codex. So I think that's, that's also a big difference.
Wolfram Ravenwolf
Wolfram Ravenwolf 2:05:13
Yeah.
2:05:14
GPT 5.5, uh, when everybody can access it, this is what they said. It is close to Mythos. It's much better than Opus 4.7 in all those scores. I think almost all those scores, I think somewhere, uh, Opus 4.7 is still ahead, but it's very, very close. And, um, yeah, like Sam Altman said, it's great to have the AI available for everybody and not restricted, or different classes of who can access what.
Alex Volkov
Alex Volkov 2:05:41
did you guys read this, uh, this thing where, like, Claude Mythos
2:05:44
was actually available for some folks on Discord, and what they used it for is to generate some websites? That was really funny. On the first day or something, a, a bunch of folks on the Discord got access to Claude Mythos. What they used it for was, like, not to jailbreak or, like, break computers, just to, you know, just generate websites. I love it.
Wolfram Ravenwolf
Wolfram Ravenwolf 2:06:03
No benchmark has shown, and mine, mine doesn't show either.
2:06:06
That's what I do the vibe checks for: uh, if the personality has changed, because 5.4, it was so boring and robotic to talk to. Yes.
Wolfram Ravenwolf
Wolfram Ravenwolf 2:06:14
But just something if you, if you want to, you notice
2:06:16
the same when you have your agent and you talk to your agent and it feels like just, uh, robots, uh, it's not as much fun to use it. And so I, I hope I will test this. Uh, hopefully they also changed a bit of this. Uh, yeah.
Alex Volkov
Alex Volkov 2:06:31
I, I gotta ask you, I gotta ask, uh, Peter, I gotta
2:06:33
ask you guys who are focusing on evals, how do we even test this? This is all just vibes. You just, like, work with your assistant for a while and you're like, oh, this is better. 'Cause I know that I have no idea. Like, I know that Opus is better than GPT 5.4, but I, I don't know, know of an eval that compares this. Wolfram, what, what do you think? You
Wolfram Ravenwolf
Wolfram Ravenwolf 2:06:49
could call it?
2:06:50
Uh, a private eval in a way. When I, when I talk to an agent, or anywhere when I get a good AI response that is funny, that touches me on a level, then I copy it into a quotes file and I keep it that way. And I always write which, uh, AI did that. And so basically I could do a, a list of which quotes came from which models. And I know that Opus is super, super strong in there. And with ChatGPT, 4o was the last, uh, last model where I had some quotes from. So basically, uh, that is my personal thing though. I notice when I copy a lot of stuff in there, that is a model that, uh, I really like.
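The quotes-file idea described above amounts to a simple tally per model. Here is a minimal sketch, assuming a hypothetical plain-text format of one `model | quote` entry per line; the actual file layout was not described on the show, and the sample entries below are placeholders, not real quotes.

```python
from collections import Counter


def tally_quotes(lines):
    """Count saved quotes per model from 'model | quote' lines.

    Blank lines and lines without a '|' separator are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or "|" not in line:
            continue
        model, _quote = line.split("|", 1)
        counts[model.strip()] += 1
    return counts


# Placeholder entries for illustration only.
sample = [
    "Model A | placeholder quote one",
    "Model B | placeholder quote two",
    "Model A | placeholder quote three",
    "",
]
for model, count in tally_quotes(sample).most_common():
    print(f"{model}: {count}")
```

The ranking that falls out of `most_common()` is the informal signal being described: the models whose replies get saved most often are the ones that feel best to talk to.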
Alex Volkov
Alex Volkov 2:07:25
I see.
2:07:26
Okay. So we'll see. Uh, and we'll see how many, like, things you quoted from GPT 5.5. Peter, on the, on the Arena, I'm assuming this model is just now running. There's no, no confirmation like we had with GPT Image, that this was masking tape, paper, duct tape or something, right?
Peter Gostev
Peter Gostev 2:07:42
No, uh, I think there's no API yet.
2:07:47
Should you guys see that?
Alex Volkov
Alex Volkov 2:07:48
There's no API, it's just like available in
2:07:50
Codex and not even in ChatGPT. I, I only have it in Codex. I think some people said it's rolling out in the public CLI. Do you
Peter Gostev
Peter Gostev 2:07:58
have it in.
2:08:00
So, uh, I haven't checked, but, uh, it should be in ChatGPT and in, um, and in Codex as well. Um, but, uh, I, it's not gonna be on the API as far as I understand, at least for the time being. So we can't test it. Interest.
Alex Volkov
Alex Volkov 2:08:17
Um, so we have the task that I asked it for, again, I
2:08:21
wanna show you guys, that involved computer use, involved a bunch of other stuff, involved HyperFrames. So I asked it, uh, I asked GPT 5.5 to create the launch video for itself. This is the prompt: open a new window, go to claude.ai, to Design, download the ThursdAI brand kit into a folder, and then generate a launch video using the brand kit for GPT 5.5, using our brand guidelines, with HyperFrames. Uh, it's quite a complex thing to do. And now we have some sort of a video. This is with medium thinking. Let's, let's see. I, it still controls my browser. I'm not sure what I can show you, but this is the, this is the video that we generated. Let's take a look. Uh, let's play. Dismiss. Okay, dismiss. Uh, I did, I asked it not to render the video. Let's play. Oh, it's pretty cool. Built for agent... I, I wish I could, like, zoom in here. I'm not sure how I can, like, full screen this video. Oh, maybe. Okay, like this. Let's start again. There's no music, but here's the video: "Breaking model drop. GPT 5.5 just landed." Uh, this is literally our branding kit. "Built for agent work: write and debug code, research across the web, operate software. More capable, same pace. Uh, latency matches GPT 5.5. We're live testing it right now. No hype, no fluff, just signal. ThursdAI, news that way." This was impressive, folks. This is very impressive. Uh, I will say specifically, it's impressive because I tested, uh, both 5.4, the previous model, and Opus, and Opus was way better than 5.4 before at, like, creating these videos. This model not only just created the video, it understood what to talk about, it understood where to get the brand kit, it downloaded the brand kit and did all the things, uh, very quickly as well. How long did it run? Nine minutes. It ran for nine minutes. While, while we're covering this. Yep.
Yam Peleg
Yam Peleg 2:10:13
One thing, one thing you might wanna check is, uh, front
2:10:16
end design because that thing, uh, front end and design in general is something that, uh, is specifically known to be hard to the Codex models.
Alex Volkov
Alex Volkov 2:10:26
Yeah,
Yam Peleg
Yam Peleg 2:10:26
all of them.
2:10:27
So, uh, it's gonna be, so we did
Alex Volkov
Alex Volkov 2:10:30
kind of test this.
2:10:31
We have a
Yam Peleg
Yam Peleg 2:10:31
competition, if we have competition for Claude
Alex Volkov
Alex Volkov 2:10:34
at this point.
2:10:34
I mean we, we did kind of test this. Okay. So we, we asked this and, uh, and this one is kind of, like, not the best. This is the comparison with 3D. The 3D is here, but, like, this front-end design is not the best one. And then we also checked it on our Olympus Mons, uh, Mars driver thing that we also always test with Nisten. This is lacking, let's say it's lacking. Opus 4.7 was just incredible on this. Uh, but for regular web designs, I think the new meta is we have to test, and Peter, we talked about this when we went live with GPT Image.
Peter Gostev
Peter Gostev 2:11:04
Yeah, yeah, yeah.
2:11:06
No, I agree. I don't think it's quite good enough. Which is kind of, kind of odd, right? Because they didn't, they obviously know it. They know, yes, that they need to get better at this. So it's kind of interesting why they can't quite get it right. Um, I dunno why. So I guess, well, it's interesting: if, if this is Spud, as we guess, uh, and I dunno for sure, but if it is, it means it's not pre-training, right? It's, like, something in post-training that they're not quite getting right yet. Um, but I, I do think with, with some, like, uh, Codex, uh, or with GPT 5.5, you just kinda need to fight it and then it's great. But yeah, the, the initial instincts are, are terrible. So yeah, the one-shot thing is much better with Opus.
Alex Volkov
Alex Volkov 2:11:52
Just for design's sake.
2:11:53
Uh, in, in the case of, in the case of, uh, just this video, I will say, like, I, I asked it to go and download the design, um, how should you say, the, the, the brand guidelines. I will say those are spot on. This looks like the brand guidelines for ThursdAI. You guys can see the logo clearly. You can see the, the, the font rendering fine. Like, all of this. I wouldn't say this is, like, the most beautiful design, but it's spot on for what I asked it for. Um,
Peter Gostev
Peter Gostev 2:12:21
yeah,
Alex Volkov
Alex Volkov 2:12:21
but,
Peter Gostev
Peter Gostev 2:12:21
and that, that's what I find as well.
2:12:23
Like if you do have some guidelines, some structure, it's completely excellent. Like it's really, really spot on. But to do the initial thing, yeah. Wouldn't, wouldn't rely on it to be honest.
Alex Volkov
Alex Volkov 2:12:34
But we, we do know the GPT image is great.
Peter Gostev
Peter Gostev 2:12:39
Yeah.
Alex Volkov
Alex Volkov 2:12:39
We, we know the GPT image is great.
2:12:41
So how about we test this? You guys wanna test this? Yeah. What type of web design would you imagine? Do it. Absolutely do it. Okay. Let's do it. Absolutely
Yam Peleg
Yam Peleg 2:12:50
do it.
2:12:50
Imagine it with GPT image. And let's see,
Alex Volkov
Alex Volkov 2:12:52
let's see.
2:12:53
So we're gonna open, like, proper Codex, not, not the little side window here that we have. Uh, we're gonna build, uh, Nisten, how about we do the Mars thing, but with GPT Image first?
Nisten Tahiraj
Nisten Tahiraj 2:13:04
can I just give you the design file that Claude generated and
Alex Volkov
Alex Volkov 2:13:09
No, but hold on, hold on.
2:13:11
We, we said that like we wanted one shot it with, with the thing. Yes. We know that, like using the design, Peter just said, using the design. It's really good at, we wanna see, uh, the ability to use Codex as a stand in for like, the creativity of, uh, of, of, uh, uh, GPT 5.5. Okay. Send me the design file. I'll take a look, but for now, I wanna say, uh,
Nisten Tahiraj
Nisten Tahiraj 2:13:33
you can just copy the prompt and you can say, generate
2:13:35
a screenshot of this game, uh,
Alex Volkov
Alex Volkov 2:13:39
from the, the
Nisten Tahiraj
Nisten Tahiraj 2:13:39
Mars one,
Alex Volkov
Alex Volkov 2:13:40
this interface first with image.
2:13:43
How do you, how do you, how do you image, imagine.
Nisten Tahiraj
Nisten Tahiraj 2:13:47
I
Alex Volkov
Alex Volkov 2:13:47
guess image two, uh, and then implement with code.
2:13:56
Okay. Let's do high thinking on, on, on, on speed. It should be fine. Yeah, on speed. It should be fine. Okay. So folks, we're, we're generating. You guys are not watching this, of course. Why? Yeah, there we go. Uh, in the new Codex, we're generating the Olympus Mons driver rail thing, but in parentheses I said, generate a screenshot of this interface first with image gen, uh, GPT Image 2, and then implement with code, and I sent it. So this is kind of the new method that we talked about, that, uh, you can substitute Opus's creativity with, potentially, GPT Image, 'cause it is really good at web design. It's really good with the different things. Um, and it says, "I'm using image gen to make the requested visual target. Then I'll build the working version with three.js."
Nisten Tahiraj
Nisten Tahiraj 2:14:41
Let's see.
Peter Gostev
Peter Gostev 2:14:44
Um, Alex, while this is working,
2:14:46
yeah, I can show you a fun project that, yeah, please. This is the one that's been working overnight. So basically, in the demo they showed this idea that you can generate these 360 images.
Alex Volkov
Alex Volkov 2:14:58
Mm-hmm.
Peter Gostev
Peter Gostev 2:14:59
And I thought, oh, that, that's a really cool idea.
2:15:01
What if you actually just generate a whole bunch of them? So what I was getting it to do is to plan out the whole, the Hanging, uh, Gardens of Babylon, kind of what that would look like. And I tried to do this kind of 360 view, um, of, of them. Uh, what I tried to do is, this is GPT
Alex Volkov
Alex Volkov 2:15:19
image two, right?
Peter Gostev
Peter Gostev 2:15:21
Yeah.
2:15:21
So, so this is, yeah. Um, GPT image two, together with, uh, all of the planning, all of the code, all of the coordination is by, uh, 5.5. And what I tried to do is to kind of create this Street View kind of, um, view. Oh wow. I mean, you can see it's still a bit buggy, but the idea is that what I was trying to do is to do kind of a walkthrough. I'm gonna try and fix it. It's still, it's still working, but this is like a few hundred images. Um, so, and then I can just like go into here and then like look around. That's insane. See what? Like
Alex Volkov
Alex Volkov 2:15:55
insane.
2:15:57
That's incredible. That's incredible. Lemme just, yeah, it's incredible. Lemme just repeat so that folks understand what's going on here. GPT image, uh, two can do 360 images, the equirectangular image format that you can then put in and like rotate around. Uh, it can do them very, very well. Uh, Peter, you're saying you're generating a few hundred images and you can walk through that whole universe that was generated?
Peter Gostev
Peter Gostev 2:16:22
Yeah, exactly.
2:16:23
Yeah.
Nisten Tahiraj
Nisten Tahiraj 2:16:25
Okay.
2:16:27
Absolutely insane. That's insane. Bonkers.
Alex Volkov
Alex Volkov 2:16:31
So you built like a Street View thing, completely.
2:16:34
Yeah. Generated. Obviously it's not 3D, 'cause it's all equirectangular, but that's how Google Street View works, right? It's all equirectangular images, one after another, and they have like this nice animation that they fake.
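For readers unfamiliar with the equirectangular format mentioned here: a 360 image stores longitude along the horizontal axis and latitude along the vertical axis, so "looking around" is a lookup from a viewing direction to a pixel. The snippet below is a minimal sketch of that mapping. The function name `direction_to_pixel` and the angle conventions (yaw 0 at the image center, pitch +90 at the top row) are assumptions chosen for illustration, not anything taken from Peter's project or from Street View itself.

```python
def direction_to_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) in an equirectangular image.

    Assumed conventions (illustrative only):
      yaw_deg   -180..180, 0 is the horizontal center of the image
      pitch_deg  -90..90,  +90 is straight up (top row of the image)
    """
    # Longitude wraps around, so yaw is taken modulo 360 degrees.
    u = ((yaw_deg + 180.0) % 360.0) / 360.0
    # Latitude does not wrap: +90 maps to the top, -90 to the bottom.
    v = (90.0 - pitch_deg) / 180.0
    # Clamp so the extreme values stay inside the image bounds.
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return x, y


# Looking straight ahead lands in the middle of the panorama.
print(direction_to_pixel(0, 0, 4096, 2048))
```

A viewer renders a frame by running this kind of lookup for every screen pixel's direction, which is why a flat 2:1 image can feel like a place you can turn around in without any real 3D geometry.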
Nisten Tahiraj
Nisten Tahiraj 2:16:46
Dude, this is a crazy demo, man.
2:16:48
It should waste. Holy cow. Wow.
Peter Gostev
Peter Gostev 2:16:52
Yeah.
Alex Volkov
Alex Volkov 2:16:53
How long did it go for?
2:16:54
And also, how many images were generated?
Peter Gostev
Peter Gostev 2:16:57
So I need to check how many images I think it ended up so far.
2:17:00
I think I have like about 400, but I'm going to, so the issue that I have is that what I was trying to do is I was trying to like get it to, um, work in the same way as Street View, where you can literally move from one to the next, but it's a little bit buggy. It's like the images don't quite align. Yeah. And I think that is the, I mean, I think I'm asking a lot from it, right? To plan like literally everything. Yeah. And I think that I probably need like many thousands for it to work properly. So I just need to work out like what's the nice balance. So you're asking how long did this work for? I came up with this idea at about like 1:00 AM last night London time. So that was like the end of your working day. And then it worked the whole night. In the morning, then, I was like tweaking, uh, and giving it a bit more direction, and it's still working. So I guess, I dunno, it's gonna be coming up to 24 hours in terms of time to build this. Uh, but I think maybe I went a bit too ambitious trying to create the whole thing. I think if I had like one road with like some cool stuff, I think I probably could have done that overnight. So maybe if you've got some better-scoped ideas, maybe you can try that as well. But yeah, you can literally do it in Codex now.
Alex Volkov
Alex Volkov 2:18:15
That's so cool.
2:18:16
You basically created street view of a place that doesn't exist.
Peter Gostev
Peter Gostev 2:18:20
Yeah, well it did exist, but we dunno what it looks like.
Alex Volkov
Alex Volkov 2:18:23
Yeah, we know it existed, but we dunno what it looks like.
2:18:26
And then the hallucinated, like latent space version of it that comes from GPT image. Yeah.
Peter Gostev
Peter Gostev 2:18:31
Yeah.
Alex Volkov
Alex Volkov 2:18:32
Wow.
2:18:33
That is
Peter Gostev
Peter Gostev 2:18:33
crazy.
2:18:34
The, the, the only caveat to this is that at least I didn't quite work out how I can get the, like the 4K resolution. So some of it is like, looks a little bit rubbish. Uh, just because, and you'll see a bunch of artifacts here. I did also use upscaling to just get it a bit nicer, but if you zoom in, some of it kinda looks bad, and I don't think it's because the image model couldn't technically do it, but I, I think it's just because the, the resolution that you can get access to via Codex is not the highest. So then it starts just doing that. So that's like a little bit of a downside. So if you're gonna try and replicate this, you, you, so I was using Topaz upscaling via Replicate. Um, but yeah, I mean, it cost me some amount of,
Alex Volkov
Alex Volkov 2:19:18
some amount
Peter Gostev
Peter Gostev 2:19:18
of as well.
Alex Volkov
Alex Volkov 2:19:19
Yeah.
Peter Gostev
Peter Gostev 2:19:20
Yeah.
2:19:20
Well, dollars I have to pay for it separately, so Yeah. Not, not ideal.
Alex Volkov
Alex Volkov 2:19:24
Wow.
2:19:25
But this, this is like a, a very long, long lived project. Peter, go ahead. Sorry to interrupt.
Peter Gostev
Peter Gostev 2:19:31
No, the, the, there are always tricks to, to using these.
2:19:34
Yeah, for sure. So yeah, I, I don't wanna, I think we're gonna read a lot of hype about like, yeah, GPT 5.5 or the next model or the next, but we are not at AGI yet, right? So let's remember, we still need to trick them a little bit, massage them, understand how they behave. I would say I have had a couple of times when I was testing it where it was doing something a little bit weird. For example, I was asking it to basically do a little bit like what we were just doing, in terms of validating its work and working until completion and so on. And then it just randomly created like an automation that would run every 30 minutes. I'm like, what the hell? That's literally never happened. Like why, why would you do that? And then it just like took my "work until completion" somehow as if it needs to run on automation. So I think that probably, I don't think it's like dumb or something, but it's probably just that the behavior is a bit different. So I could imagine if you're gonna try it yourself now and do exactly the same thing as before, you might be disappointed for whatever reason.
Alex Volkov
Alex Volkov 2:20:39
Mm-hmm.
Peter Gostev
Peter Gostev 2:20:39
But as always, just adjust a bit.
2:20:42
If, if it's a new base model especially, it'll probably be a bit different. Um, so yeah, it's not AGI. Definitely try it and get used to it if you're using it.
Alex Volkov
Alex Volkov 2:20:50
Yeah.
2:20:51
One-shot is fun for demos, but for an actual thing, you have to iterate. You have to work, you have to learn the model. Yeah. And that's what we're trying to do here, uh, folks. So we've been on air for a while now, almost four hours. So I think let's do a recap and then start talking about this. We had an absolutely insane week, uh, capping with quite an incredible model that looks like, based on benchmarks, state of the art in almost anything. I won't treat Mythos benchmarks as relevant because Mythos is not available, and those are just marketing numbers from Anthropic. Um, we covered pretty much everything on the stream. How could we not? We've been live for almost four hours. Thanks, Peter. Peter had to drop, looks like. Uh, thank you, Peter Gostev, for joining us and giving us first, uh, thoughts on GPT 5.5. Obviously the big release from this week is GPT 5.5 from OpenAI, which we were waiting for most of the stream and which finally dropped. We also talked about GPT image, which is a huge, huge model, and we're now trying to get GPT image and GPT 5.5 to collaborate. Some people are asking, did we talk about privacy filter? Yes, absolutely. We talked about OpenAI's latest open source release, called, not even GPT, just Privacy Filter. That's an Apache 2 licensed model that's on the Hugging Face Hub. We talked about and demoed at length Claude Design, which is a new skill that, uh, we're all getting very excited about. It's just claude.ai/design. We talked about the fact that Anthropic reset the quotas for all of their users. So basically, if you did quota out for this week, you can go back and look at the quotas. Um, we talked about Codex computer use. We showed off computer use. Uh, a lot of stuff. Crazy, crazy week in AI.
I think at this point it's time for us to drop, because we've been live for almost four hours, with almost 5,000 of you tuning in throughout. It's been a great show. Thank you so much for joining us. All right. Cheers everyone. Bye-bye.