Episode Summary

This episode captures the moment AI browsers stop feeling theoretical and start looking like a real product category. DeepSeek's OCR trick, Atlas browser, Browserbase authentication, and Kwindla's real-time voice/video stack all point to the same thing: interfaces are getting more agentic and more multimodal at the same time.

Hosts & Guests

Alex Volkov
Alex Volkov
Host ยท W&B / CoreWeave
@altryne
Paul Klein
Paul Klein
Founder & CEO ยท Browserbase
@pk_iv
Kwindla Hultman Kramer
Kwindla Hultman Kramer
Co-Founder & CEO ยท Daily.co
@kwindla
Nisten Tahiraj
Nisten Tahiraj
AI operator & builder
@nisten
Yam Peleg
Yam Peleg
AI builder & founder
@Yampeleg
LDJ
LDJ
Nous Research
@ldjconfirmed

๐Ÿ”“ DeepSeek OCR and the Open-Model Angle

DeepSeek kicks off the episode because it feels like another reminder that product breakthroughs can come from unexpected corners. The panel is less interested in a single demo than in what the OCR shift suggests about capability jumps and interface design.

  • DeepSeek changes the tone of the opening segment
  • OCR is discussed as a workflow unlock, not just a benchmark win

๐Ÿ› ๏ธ Atlas and the Start of the Browser Wars

ChatGPT Atlas pushes the discussion into product territory very quickly. Alex and the co-hosts treat AI browsers as the next UX battleground because they bundle search, memory, automation, and interaction into the same surface.

  • Atlas is framed as a category-defining product move
  • The browser becomes the new place where agents meet users

๐Ÿค– Paul Klein on Browserbase and Authentication

Paul Klein joins to talk about the hardest part of browser agents: making them work in real environments with real credentials, approvals, and constraints. The segment stays concrete about tradeoffs, which makes it one of the most useful builder conversations in the episode.

  • Authentication and approvals are treated as core product challenges
  • Browserbase is positioned as infrastructure for trustworthy browser agents

๐ŸŽฅ Kwindla on Real-Time Voice and Lip Sync

Kwindla Hultman Kramer helps bridge browser agents to the multimodal future. The conversation moves through native voice, low-latency interaction, and real-time lip sync, giving the episode a second major thread around what live AI interfaces will feel like.

  • Voice and video are discussed as live system problems, not static generation tasks
  • The segment feels like a preview of next-generation multimodal products

โšก Video Releases and the Week's Buzz

The closing stretch sweeps through the rest of the release board without losing the episode's central theme. Video tooling, weekly buzz items, and conference chatter all reinforce the sense that product surfaces are evolving just as fast as the models underneath them.

  • The finale stays release-dense without losing coherence
  • The running theme is interface change, not just raw model progress

Hey everyone, Alex here!

Welcome... to the browser war II - the AI edition! This week we chatted in depth about ChatGPTโ€™s new Atlas agentic browser, and the additional agentic powers Microsoft added to Edge with Copilot Mode (tho it didnโ€™t work for me)

Also this week was a kind of crazy OCR week, with more than 4 OCR models releasing, and the crown one is DeepSeek OCR, that turned the whole industry on itโ€™s head (more later)

Quite a few video updates as well, with real time lipsync from Decart, and a new update from LTX with 4k native video generation, itโ€™s been a busy AI week for sure!

Additionally, Iโ€™ve had the pleasure to talk about AI Browsing agents with Paul from BrowserBase and real time video with Kwindla Kramer from Pipecat/Daily, so make sure to tune in for those interviews, buckle up, letโ€™s dive in!

Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.

Share

Open Source: OCR is Not What You Think It Is (X, HF, Paper)

The most important and frankly mind-bending release this week came from DeepSeek. They dropped DeepSeek-OCR, and let me tell you, this is NOT just another OCR model. The cohost were buzzing about this, and once I dug in, I understood why. This isnโ€™t just about reading text from an image; itโ€™s a revolutionary approach to context compression.

We think that DeepSeek needed this as an internal tool, so weโ€™re really grateful to them for open sourcing this, as they did something crazy here. They are essentially turning text into a visual representation, compressing it, and then using a tiny vision decoder to read it back with incredible accuracy. Weโ€™re talking about a compression ratio of up to 10x with 97% decoding accuracy. Even at 20x compression they are achieving 60% decoding accuracy! My head exploded live on the show when I read that. This is like the middle-out compression algorithm joke from Silicon Valley, but itโ€™s real. As Yam pointed out, this suggests our current methods of text tokenization are far from optimal.

With only 3B and ~570M active parameters, they are taking a direct stab at long context inefficiency, imagine taking 1M tokens, encoding them into 100K visual tokens, and then feeding those into a model. Since the model is tiny, itโ€™s very cheap to run, for example, alphaXiv claimed they have OCRdโ€™ all of the papers on ArXiv with this model for $1000, a task that would have cost $7500 using MistalOCR - as per their paper, with DeepSeek OCR, on a single H100 GPU, its possible to scan up to 200K pages! ๐Ÿคฏ Really innovative stuff!

OCR and VLM models had quite a week, with multiple models besides DeepSeek OCR releasing, models like Liquids LFM2-VL-3B (X, HF), and the newly updated 2B and 32B of Qwen3-VL (X, Hugging Face), and AI2โ€™s olmo-ocr 2-7B (X, HF).

The Qwen models are particularly interesting, as the 2B model is a generic VLM (can also do OCR) and is close to previous weeks 4B and 8B brothers, and the newly updated 32B model outperforms GPT-5 mini and Claud 4 sonnet even!

The Browser Wars are BACK: OpenAI & Microsoft Go Agentic

Look, I may be aging myself here, but I remember, as a young frontend dev, having to install 5 browers at once to test them out, Chrome, Internet Explorer, Firefox, Opera etcโ€™. That was then, and now, I have Dia, Comet, and the newly released Atlas, and, yeah, today I even installed Microsoft Edge to test their AI features! It seems like the AI boom brought with it a newly possible reason for folks to try and take a bite out of Chrome (whoโ€™s agentic features are long rumored with project mariner but are nowhere to be found/shipped yet)

OpenAIโ€™s ChatGPT Atlas: The Browser Reimagined (X, Download)

OpenAI is proving that besides just models, they are a product powerhouse, stepping into categories like Shopping (with a shopify integration), app stores (with ChatGPT apps), social (with Sora2) and now... browsers! This week, they have launched their tightly integrated into ChatGPT browser called Atlas, and itโ€™s a big release!

Iโ€™ll split my review here to 2 parts, the browser features part and the agentic part.

New fresh take on a chromium based browser

The tight integration into ChatGPT is everywhere in this browser, from the new tab that looks like the basic ChatGPT interaface, one line of text, to the sidebar on the left that... is the ChatGPT web sidebar with all your chats, projects, custom GPTs etc.

The integration doesnโ€™t stop there, as you have to sign in to your ChatGPT account to even use this browser (available only to MacOS users, and Pro, Plus and Nano tiers). The browser has a few neat tricks, like a special tool that allows you to search your browsing history with natural language, a-la โ€œwhat were those shoes I was looking at a few days agoโ€ will find your the tabs you browsed for shoes.

Email draft window titled โ€˜Team Meeting Follow-Up.โ€™ The body text thanks the team for attending and reminds them to update a shared document. A ChatGPT cursor suggestion appears above highlighted text, showing a prompt that reads โ€˜Make this sound more professional,โ€™ with the ChatGPT logo and an upward arrow icon beside it.

A special and cool feature is called, confusingly โ€œCursorโ€, wherein you can select a text, and then click the little OpenAI logo that pops up, allowing you to ask ChatGPT for changes to that selected text (like fix typos, spruce up your writing etc). Itโ€™s surprisingly convenient to rewrite tweets or for any type of document editing.

ChatGPT Atlas also stores memories about your browsing patterns, which will be additional to the ChatGPT memories it stores about you from chats, helping even more by knowing your browsing patterns, which software you prefer to use, which websites you prefer to order food from etc. This IMO is one of the hugest unlocks for folks inside the ChatGPT ecosystem, as much of a stanard persons peferences can be gleaned from their browser usage and patterns.

Atlas Browser window displaying a Wall Street Journal article about 2026 federal tax bracket changes. The article includes a table of inflation-adjusted income tax rates for single and married filers. On the right, ChatGPTโ€™s sidebar shows a conversation where the user asks, โ€˜Can you explain this in simple terms?โ€™ ChatGPT responds with a plain-language summary outlining whatโ€™s changing, how much rates are shifting, and standard deduction updates.

Lastly, the โ€œAsk ChatGPTโ€ sidepane on the right (which can be opened with cmd+.) is really great for chatting with a webpage, or going down search rabbit holes. It receives the context of the webpage youโ€™re looking at by default (only 1 page so far, competitors allow you to add additional tabs with @, (which is supposedly coming to ChatGPT soon) and ask... ChatGPT anything about this.

Agentic SOTA? not so fast

The most important โ€œchangeโ€ to how browsers work in Atlas imo is the agentic mode. This isnโ€™t new, we remember when ChatGPT launched thier Operator Agent back in January of this year (our coverage) and then renamed it Agent Mode and integrated into ChatGPT itself back in July.

So, web browsing agents are not entirely new, whatโ€™s novel here though, is the integration into your browser, and the ability for the Atlas browser to use your logged in sessions and cookies, to pretend to be you! This... can be quite scary for some, as prompt injection attacks are getting more popular (where-in malicious assholes add hidden instructions to their website that will get the agent to do something you donโ€™t like) but itโ€™s also very exciting, as the agent can do much much more, without getting blocked by providers who could previously just block Agent Mode as it ran on OpenAI servers!

Until today, there were 2 main Agentic browsers in the mix, Perplexityโ€™s Comet (where you can choose which model runs the agent) and Atlas. Comet seems to be doing a little bit better on some stuff on my tests, but not by much. I have the same agentic task (go to X.com, find my bookmarks, open all links, summarize per my specific format) that Iโ€™ve been running for a while now, and Comet outdid Atlas this week on that task.

Who needs agentic browsing?

For some reason, most of the demos for agentic browsing are showing the same, boring-ish examples. Book some flights, collect a grocery shopping cart. Iโ€™ve tried new and different things this week, for example, letting Atlas choose and order food for me (as ChatGPT knows my pescatarian preferences, itโ€™s better than Comet for personal stuff), and one of the longest task Iโ€™ve had an agent do yet, I asked it to complete a Compliance training I had to take at work!

Mind you, this is a very complex task, even for regular people, as these compliance websites are built to not be messed with. They have video players that stop if you switch focus to some other tab, they have interactive quizes and games, drag and drop interfaces, audio buttons, to make sure you really are taking the test. I can happily report, that after 5 hours, and a few stops along the way (where I had to convince the agent to keep going), it completed this very hard task! (and now I have to take this course myself again to actualy be compliant ๐Ÿ˜… it will probably take me 2 hours to do manually)

This experiment made me think, who needs the agentic browsing features and for what? Well, for tasks that require a lot of manual steps to do the same thing over and over again, agentic browser is going to make a lot of peoples browsing a lot easier. Things like kids schedules reviewing in multiple websites, collecitng data and formatting it differently etc.

Scary security implications

Atlas could only finish my compliance task while being logged in as me, and ChatGPT Atlas gives a all or nothing control. You can run your agent with full access to your logged in websites (think Gmail etc) or you can essentially give it an incognito mode.

This, again, due to the risk of promp injections in malicious websites being more and more prevalent. In a rare post detailing how they are thinking about this, OpenAI Chief Information Security officer offered a deep dive into their attempts to mitigate this issue (Simon Willison had a great breakdown of that information here) but thatโ€™s likely not enough, so definitely be aware when youโ€™re running agent mode (which needs to be explicitly turned on right now by selecting Agent)


This Weeks Buzz - Weights & Biases // Coreweave

Weights & Biases (now proudly part of CoreWeave) had some exciting updates. Our Fully Connected conference series is hitting Tokyo on October 30-31 and London on November 4-5โ€”perfect for ML practitioners and AI engineers. If youโ€™re in the area, join us for talks, networking, and deep dives into the latest. Register at Fullyconnected.comโ€”DM me if you need a hook-up!

We also collaborated with Meta and Stanford on Torch Forge, a new PyTorch-native library for scalable RL post-training and agent development. Itโ€™s built for massive GPU runs (we provided 520 H100s!), competing with Ray via tools like Monarch scheduler. If youโ€™re training on clusters, check the blog โ€”itโ€™s a big deal for efficient multi-GPU workflows.


Microsoft goes after OpenAI with Edge copilot mode (X)

Image

In a pretty surprising move, Microsoft announced today their take on the agentic browser war, with a bunch of enhancements to Copilot (their overall word for their AI assistance across Microsoft 360, Browser, Bing search etc), Think.. clippy, for the AI age (they even brought clippy back as an easter egg)

The short version is, Edge is getting more powerful with custom agentic features (which I enabled and couldnโ€™t get to work no matter how much I tried, so I canโ€™t tell you how they compare to Atlas/Comet), and they have a voice mode that allows you to talk to your browser, with Edge having a sense of whatโ€™s on the actual page! Of course, this being Microsoft, marketing aside and features aside, when I asked Copilot if it has access to other tabs (like the marketing video claims) it said it doesnโ€™t have access, agentic mode didnโ€™t work, and Iโ€™m very unlikely to be testing it further! But hey, if you use Copilot app on your mobile phone, and click the new Mico avatar like 25 times it will turn into Clippy, so.. yay?

Claude Code on the Web, Claude on Desktop upgraded (X, Anthropic)

Anthropic also made waves by bringing Claude Code to the web. Now you can delegate software tasks to Claude through a web interface with GitHub integration. Nisten was particularly excited about being able to manage his coding projects from his phone. It runs tasks in a secure sandbox, can handle multiple repos, and automatically create pull requests. Itโ€™s another powerful coding agent becoming more accessible to developers everywhere.

They have also made changes to the desktop Claude app, allowing it to see the context of your screen with screenshots, and file sharing, and even a new voice mode that allows you to talk to Claude (which is unfortunately mapped to the tab button, without the ability to remap)

Browser Automation and Delegated Authentication with Browserbase (X, Director.ai, Stagehand)

While OpenAI and Microsoft are building chat into the browser, what about bringing the browser into our chat-based agents? We had Paul Klein, the founder of Browserbase, join us to talk about this exact topic. His company is tackling one of the biggest hurdles for AI agents: authentication.

Paul and his team launched Director 2.0, a platform that lets you build web automation with natural language prompts. But the real innovation here is their integration with 1Password. Instead of giving an agent the โ€œmaster keysโ€ to all your logged-in sessions like Atlas does, Browserbase allows for delegated, per-site authentication. When an agent running in the cloud needs to log into a site on your behalf, you get a prompt on your local machine to approve it. This is a much safer, more granular way to give agents the access they need. As Paul said, you shouldnโ€™t let an AI the master keys into your house; you should give it permission to enter one room at a time. Itโ€™s a brilliant paradigm for secure agentic workflows and I really like this approach of a piece-meal authentication for browser agents. I wish Atlas has something like this for the incognito mode!

Director 2.0 itself is like V0 for web automationโ€”you give it a prompt, it performs the task, and then it gives you a repeatable script you can deploy. Itโ€™s a way to create robust automations without needing to be a developer, and itโ€™s already being used to automate thousands of hours of manual work.

Video & Audio: The Race to Real-Time

The world of generative media is moving at lightning speed, with a clear trajectory towards real-time, interactive experiences.

Decartโ€™s Real-Time Lip Sync API (X)

We had Kwindla Kramer, one of the worlds leading experts in real-time audio, join us to break down a phenomenal release from Decart AI: a real-time lip-sync API. This isnโ€™t the pre-rendered, slightly-off lip-sync weโ€™re used to. This is a pipeline of models working together to generate perfectly synchronized lip movements for an avatar in real-time.

Kwindla explained the tech stack: it captures your audio via WebRTC, sends it to Whisper for transcription, gets a response from an LLM like Grok, generates a voice with ElevenLabs, and then Decartโ€™s model modifies the avatarโ€™s video frames to match the new audio, all with a sub-two-second latency. This is how we get to truly interactive, believable AI characters. Kwindla even built a quick demo, though it didnโ€™t seem to work the in the morning, probably GPU issues, so we just played the demo videos.

LTX-2 and Soraโ€™s Pet Cameos

The trend towards high-fidelity, real-time generation continued with a breaking news release from Lightricks: LTX-2. This is an open-source (weights coming this fall!) engine that can generate native 4K video with synchronized audio. Itโ€™s fast, efficient, and is set to be a powerful open alternative to closed models like Sora. And itโ€™s a native 4K, no upscaling!

Speaking of Sora, they announced that character cameos are getting an upgrade. Soon, youโ€™ll be able to turn anythingโ€”your pet, a coffee cup, or even a sunny-side-up eggโ€”into an animated, talking character. Iโ€™m really looking forward for this new Sora update and will let you know my impressions when it drops (soon, according to Bill from OpenAI)


What a week folks! What A WEEK! ๐Ÿ˜… My head is still spinning!

From browsers that can do our work for us to OCR that redefines context, weโ€™re seeing foundational shifts across the board. The tools are getting more powerful, more accessible, and more integrated into our daily workflows. The future is being built right now, and we get to watch it happen week by week.

Thank you for being a ThursdAI subscriber. As always, here are the show notes with all the links and details from this weekโ€™s whirlwind of AI news.

  • Hosts and Guests

  • Open Source LLMs

    • DeepSeek-OCR: Efficient Vision-Text Compression for Massive Contexts (X, HF, Paper)

    • Liquid AI LFM2-VL-3B: Tiny Multilingual Vision-Language Model (X, HF)

    • PokeeResearch-7B: Open-source SOTA Deep Research Agent (X, HF, Web, ArXiv, GitHub)

    • Qwen3-VL 2B & 32B: compact STEM-tuned multimodal powerhouses (X, Hugging Face)

  • Big CO LLMs + APIs

  • This weeks Buzz

  • Vision & Video

    • Sora is about to get pet cameos

    • Krea openโ€‘sources a 14โ€‘billionโ€‘parameter realโ€‘time video model (X, HF)

    • Reveโ€™s unannounced video mode!? 1080p + sound

    • LTX-2: open-source 4K audio+video generation engine from Lightricks (X, Website, GitHub)

  • Voice & Audio

    • Decart Lip Sync API: Real-Time Avatar Lip Movement (X)

  • Tools

Alex Volkov
Alex Volkov 0:56
Welcome everyone to ThursdAI for October 23rd.
1:00
My name is Alex Volkov. I'm an AI evangelist with Weights, & Biases, and host of ThursdAI and we have a very exciting week, very eye opening week, in terms of A lot of OCR getting released. Deep Sec. Came back with a new one. So we're gonna chat about all of this. With that, I will also say hello to Nisten and hello to Yam. Welcome friends. Welcome, welcome. Happy Thursday. Happy Thursday. Nisten, you're looking fresh. we already noticed this on the previous live stream.
Nisten Tahiraj
Nisten Tahiraj 1:32
I've literally just being outside coding in the balcony
1:35
because it was really warm in Canada this week, and it's, uh, winter is coming.
Alex Volkov
Alex Volkov 1:41
Winter is coming.
1:43
let's do some benter for the, the favorite AI release of this week. we should do this before the TLDR starts and to get like folks a chance to get in into the show, say hi to us in comments. while we share ours, folks who are listening, feel free to share yours as well. somebody already shared the cloud code web is their favorite Nisten. What is your favorite AI release this week?
Nisten Tahiraj
Nisten Tahiraj 2:04
I was going back and forth between what I liked,
2:07
but then when I actually tried the deep seek OCR one, it not only does optical character recognition. It rebuilds it as HDML, so it looks at a chart and then it rebuilds the chart back up. So that's why this one became my favorite. I've also been trying the Claude, web, one, I have some opinions on that. It's so, so, but I think it's gonna end up being a little bit better. I've been using it quite a bit too.
Alex Volkov
Alex Volkov 2:40
Nice.
2:41
Obviously folks love your opinions and also we do, we try to give you listeners multiple opinions and not just like hype a release. Nisten is very good at that. so I really appreciate the outlook there. yam, what is your must have AI release for this week
Yam Peleg
Yam Peleg 2:57
both, cloud Code Web and Deep Sea OCR are sick.
3:02
Seriously. deep Sea ocr, you all get the, the idea of it, like you can convert loads of PDF only from the web into markdown. That's a lot of tokens. That's, that unlocks a lot of tokens.
Alex Volkov
Alex Volkov 3:19
Yep.
3:20
I'll share mine, then we'll jump into TLDR because a lot, topics this week to talk about, obviously, browser from Chat GPT and potentially other browsers for me, I realized why, we were on a live stream when Chat GPT Atlas, the browser that OpenAI launched this week that Chrome based released, I then realized that I used to work on a browser. I spent two years, working on a chromium based browser back, I don't know, a decade ago, over a decade ago, maybe 15 years ago. and I have experience with browsers. as a fronted developer, I participated in the first browser where I followed every release for HTM oh five css, three, edge Internet Explorer, et cetera. So this hit home very hard for me, and this reminded me of the browser wars as well. it's really funny actually, that my manager from back this time is now watching with us. shout out to Ariel. back when we were building browsers and I posted about this Building a browser is not easy. There's so many decisions. this was definitely my favorite one. I also ran the. The JGB Atlas agent for four hours trying to complete a compliance training with the company one of those compliance training you have to, and it did after five hours and I was very, very impressed. Also annoying. So, I definitely have a lot to say about this. We'll add LDJ. Super quick. LDJ, welcome. And we're gonna go and talk about, the TL DR
4:46
This is it. This is the TLDR, the section on Thursday. I will read, just tell you about all of the news together in a very short period without going into detail so that you'll know. And then, you'll know which, Generally the outline of our show, and you also know which segments to stick around for. so for today, your host is Alex Volkov , AI Evangelist, with some biases. We have Yam Peleg Nisten here, and LDJ here. And we're gonna have a few other, guests and friends. We're gonna have Paul Klein from, browser Base. And we're gonna have Qwen Kramer, our friend, step in very, very soon to talk to us about some realtime stuff. As everybody here already told you, once we dive into open source, LLMs, this is a few of the releases this week. we had a bunch of vision model releases in the tiny, tiny space. Yeah. Including one from yesterday from Qwen. But yeah, go ahead. I, I
LDJ
LDJ 5:32
guess I could, I'll briefly at least summarize DeepSeek OCR.
5:35
So just give us
Alex Volkov
Alex Volkov 5:36
like a one, two sentence.
LDJ
LDJ 5:38
Yeah.
5:39
So I guess the really short simplified version would be by looking at text more so as an image and compressing it as an image. You end up being able to more efficiently have the model use context and basically have visual tokens used that is able to more efficiently represent text and then have models be more efficiently able to run and as if they were running on less context. Yep. so liquid ai, LFM two, VL three B. This is a vision language model. Then we have pokey research seven B, open source soda, deep research agent, and then Qwen three, VL two B 32 B. is that two B active parameters 32 B total, or is that two different models?
Nisten Tahiraj
Nisten Tahiraj 6:21
No, it's two different models.
6:23
And yeah, folks are saying that the two B
Alex Volkov
Alex Volkov 6:25
these are
LDJ
LDJ 6:25
both vents, right?
6:27
Yeah. Yeah, I remember seeing it now. compacts de tuned. Multimodal powerhouses outperforming GPT five mini and Claude force on it. That's big. If true, if when people actually vibe test it, we'll see.
Alex Volkov
Alex Volkov 6:38
Yeah, maybe we'll wipe test it on the show.
6:40
So I had this like some screenshot as well from, somebody Harvin Singh. he said a bunch of OCR model released in the past few weeks. Deeps Synq, OCR, which we're gonna talk about released but like the folks from, Allen Institute released, their Omo, OCR second version. there's one called Chandra, OCR, also new, around 8 billion parameters. And then also dots, OCR, like Nisten, you just mentioned this one. Obviously the vl are General V lms, but also do OCR Paddle released the one. And then there's just a bunch of like new releases in the space and looks like Deep Sea OCIs, like for the win. So this is the one we're gonna focus on, but like, just so folks know, we're not gonna cover all these, but also like a bunch of these are recent releases as well. We're gonna talk about the, you know, the movers and shakers of the AI world, the people who spend billions of dollars on infrastructure and, invent and reinvent the world every week. We'll obviously have a dedicated discussion for the newly announced chromium based Chat GPT Atlas browser. it's an agent I browser with Deep Chat GPT integration. We're gonna cover some of the feedback on the, on the vibes. And, I also used it as my primary one, although right now I'm recording from, from Arc. Cannot, cannot let Arc go no matter what, even though that one is a default for mine. we're gonna cover some of the shortcomings, some of the stuff that it doesn't do well. Some comparisons to the Comet browser, which, supposedly after our livestream, this studio that I got ya, I'm excited about. So ya maybe already used come, from, from perplexity and other AI browsers and also Microsoft is now having an event right now for their co-pilot sessions, whatever. And then they're gonna bring actionable, actionable browser into Edge as well. This probably few in a few minutes. So we'll, we'll take a look at that. and then also like bring you the news that whether or not we got two agentic browsers this week and Edge is huge. Edge may not take as much user, thing as, as Chrome, but Edge is like one of the bigger browser, install bases out there. So if they're adding Angen system, like last week we talked about the whole of Windows 11 is adding Agen system. I think it's big. Cloud code on the web folks cloud code, the, the, the, you know, if any, we can say cloud code in the cloud. So, you know, every time we transcribe cloud code, it says cloud code, but we can say cloud, cloud code. this is what happened this week. you can now chat with this agent not only via your terminal, it can do background stuff for you. You can be on the toilet, which is, a highlight of many people coding on the toilet. they have secure sandboxing in there as well. That's very exciting for many, many folks. super quick meta bans. One 800 GPT on WhatsApp. meta decided that on the WhatsApp platform, the only AI that's gonna be there is META'S ai. So they are kicking GGBT out. if you use using that move to the app. again, like I said, Microsoft is about to announce something generic. agent and a friend from Google AI Studio launched the new vibe code. Did you guys see this login launch? The new vibe coding platform, and it's pretty, pretty good. actually liked that we can live, code something live. it's pretty dope. And this week's buzz, we have, I think we have two updates for you. One of them is, fully connected. The conference that we have is coming to Tokyo on October 30th and London on November four and fifth. If you are a listener from Europe or from Japan, or Southeast Asia, those two conferences are definitely for you. We participated in the release together with the PyTorch Library and Stanford called Torch Forge. And we'll briefly cover that as well in a few sentences. basically a new, supervised fine tuning library from the folks at PyTorch, which is supposedly a big deal. Vision and video was also hard this week, including a breaking news release from today. just an update. SOA is about to get PET cameos, so expect SOA to blow up. the folks bill Peoples from soa, came out with the news and said, we're proving soa. for now, you are only able to add cameras of yourself from very soon you're gonna be able to take your pet and create a cameo for it and take a coffee cup, create a cameo for it. So you'll be able to add character consistency not only to your face. I think it's a big deal. we'll see how viral that becomes, because people will do all kinds of characters. the Halloween, skeleton that I did last year for example, could be a talking skeleton cameo that everybody could use. So we'll see a new social dynamic there. our friends from Korea released open source, 14 billion parameter realtime video model. fine tuned on top of WAN to, 14 b, distilled from WAN. That's pretty cool. Rev, you guys remember Rev, the image editing, UI plus model from, Christian Control and some other folks, rev announced, sorry, rev didn't announce, but they have a video model in there and I looked and nobody of the people including their main account didn't talk about this. And also, I haven't seen anybody post about this, but yeah, they have a video model in there. I, it looks pretty good. It even has sound. And speaking of sound, it looks like most of the video models for the past. I dunno. Three months are adding sound as well. And we have a breaking news today from LTX, the Israel Company Light Tricks. their model is called LTX. they just open sourced a 4K audio and video generation engine. I believe it's open sourced, this called Ltx two and they're claiming it's live. you know, also, also real time, nearly real time. So that's crazy. 4K, nearly real time. That's crazy with synchronized audio and video and lip sync, hybrid diffusion performance stack, So the incredible, I think I have some friends in the LTX world. ya, maybe you know, some folks as well. We should absolutely bring them because they've been killing this. another company that's been killing this, also Israeli company is, Decar ai. they released a new lip sync API, but it's real time. So we've seen other companies like Sync and others just basically take a video that, or a character and then, you provide audio and then they kind of like modulate the lips to look like the person is talking according to the audio. LTX release this in real time, API. so that is very impressive that it's real time. it works for characters that you wanna talk to. If Wolf Firm were here, he would mention Amy and how, you know, this, this would work with Amy, for example. similar to, how Annie works in Grok. So we're definitely gonna cover this a little bit. This is the TLDR for this week. I think there's plenty for us here to talk about folks. It's time for our favorite corner. Let's go to open source ai,
13:05
Let's get it started. Let's get it started. And I will just say I just saw Mustaf Sunman from MAI, the rebranded inflection AI team within Microsoft that Mustaf runs. It's called Microsoft AI rebranded, MAI just say T minus 10 minutes for the Microsoft copilot. we may just tune in, folks, what do you think? Like why not? if it's as big two browsers live streams to two agent browsers the this week. so we may just tune in and it's gonna happen in a few minutes. but meanwhile we're gonna start with open source. And I think, let's dive right in. the most important and interesting release this week from open source is Deeps Seeq, OCR Deeps Seek the Chinese whale, the folks that don't hype anything, that just release great releases, one after another. The folks that gave us GRPO together with R one, that turned out that RR one itself was less consequential than the GRPO that everybody now uses, pretty much single handedly re re returned the, the RL world back, or, you know, like deeps sick. We talked about deeps sick. Like every week they released something. this week they released a tiny O-C-R-M-O-E-A tiny O-C-R-M-O-E. This is like a very, very tiny model. if I'm not mistaken, it's like 500 million parameters, active only. it's a, a. Folks, you all look to this. You all mentioned this. what, what's special there? Yam? Let's start with you LDJ, in following up with you. What is special about this like tiny model and why Deeps seek innovates continuously.
Yam Peleg
Yam Peleg 14:35
It's insanely powerful and it's small and fast.
14:38
It is, that's the deal. LDR it's really, really good and it's really, really fast and small. Therefore you can run it at scale on whatever you want to generate. Tokens, it's OCR. You can just feed, I don't know, screenshots of stuff, like screenshots of code, screenshots of stuff, written newspapers, book PDFs, whatever you want. it is, the most powerful for its size that we ever seen. people have been doing wild stuff with it already, even though it was released a couple of days ago. What do you guys think?
Alex Volkov
Alex Volkov 15:17
Yeah, I want to hear about this like compression thing
15:19
that they have going on there.
LDJ
LDJ 15:21
Yeah.
15:21
So, I didn't get to fully, read the paper, but from what I understand, it's you're essentially taking text tokens, representing it as images and compressing that in the way that you would usually create visual tokens out of images and a lot of, image processing models. And then doing that ends up being able to actually compress it in a way that's more accurate and basically better compression ratios than what you'd normally get with just tokenization methods of text in a raw way.
Alex Volkov
Alex Volkov 15:53
I have some notes on this where, where I read for some
15:56
folks the compression thing is insane. Yeah. The compression thing is insane, like turning, text that they see into vision coders, but then it be able, be able to like decompress it back with like high accuracy. but like even at like 10, less than 10 x compression, OCR decoding accuracy is 97%. What, bro, this is like Silicon Valley middle out algorithm.
Yam Peleg
Yam Peleg 16:20
it's wild because if you just, it's not intuitive, but all of it, it's
16:24
even possible to do something like this 10
Alex Volkov
Alex Volkov 16:27
It's
Yam Peleg
Yam Peleg 16:28
compression.
16:28
The, the compression is like, why would it, taking a string shot of code is better than, than reading the code. Yeah. Like, like, I mean, makes no sense. That's, one of the things that just got people so surprised by this model. You
Alex Volkov
Alex Volkov 16:42
know, on Thursday, I prepare the notes and oftentimes
16:45
there's other releases that dive deep. We only have so much time to dive deep into stuff. So I go deep into et cetera. And then we come here and we talk about this, and then my head explodes live on the thing. this makes no sense as I read this, like this, this is one of those WTF made
Yam Peleg
Yam Peleg 16:58
look, I don't know, I don't know if it makes no sense.
17:00
It basically just points to probably the way we, tokenize text is not optimal. Yeah, that's probably what it just points to that there is a better way to represent text. okay. Look, everyone is suspecting that the way we tokenize text is kind of backwards today. there must be a better way, but, it is what it is. it is the way it works. We just, do it because it works best. but I don't know. That's, that's wild.
Nisten Tahiraj
Nisten Tahiraj 17:25
Oh, no, no, no.
17:26
I was trying to run it. That's why I was ah, see, see, see.
Alex Volkov
Alex Volkov 17:29
I think there's a few hug and space, things that already
17:31
run this so we can take a look. here's the note that I have, that I, I took lifted of someone. Obviously, you could visually encode 1 million tokens into a hundred K vision token sequence. So that like for long context models, This is a text compression mechanism. The, the, the, that supposedly shoves 10 x more context into the same context. Wind. What the fuck? LDJ, go ahead.
LDJ
LDJ 17:57
think, so some, some limitations, I think are, that are worth
18:01
noting is so when they say 97%, precision or accuracy here mm-hmm. based off what I was able to see in the paper, they don't show traditional language model benchmarks, but they show compression just in terms of basically how accurate is it in actually being able to, decode or actually truly see the original text and that what they say as 97% precision here in the compression itself. When it comes to benchmarks that might result in more than a 3% drop. That might result in more like. Only 90% or 80% accuracy in benchmarks, but we don't really know. So we, we will have to see that. And also the model that they do it in here is quite small. It's like only roughly around one B size, 1 billion parameters. And so we'll also have to see how the scales up. But I do think it's really interesting so far, and they say 97%, compression accuracy with, 10 x compression. And I think, like, conservatively speaking, like hopefully they can get at least two or three x compression with basically 99 or 99.5% accuracy or something. And in that case, like it would be really good. And because of the quadratic scaling of flops, with context, it would lead to more than three x savings of flops. So you might get five or 10 x savings and flops or more when you're doing really long context inference.
Alex Volkov
Alex Volkov 19:27
So I just wanna make sure that we're like
19:30
highlighting the right thing here. This is not just a model to take text, and do optical character recognition. This is way more than this. Like they're preparing for, and specifically highlighting the amount of, training data. This can generate synthetically from PDFs, et cetera.
Yam Peleg
Yam Peleg 19:47
They want all the tokens.
Alex Volkov
Alex Volkov 19:49
All the Totally.
Yam Peleg
Yam Peleg 19:49
just performance wise?
19:50
I don't know if, we said it. you can process, like 200,000 pages a day on a single GPU with these things. I mean, these are in the numbers, like single GPU 200,000 pages. That's really crazy. another thing that's worth, mentioning, which, had been coming to us, through the comments, that's absolutely true. what they do, I mean basically because they feed, images into the model, you have a part of the model which is not causal, which is not predict the next token, but is fully connected attention, which is something that we haven't, been kind of moved away from at this moment. there are, people suspecting that this might be, the boost in performance, quote unquote but the unexplained thing that, pushes this model beyond what we all anticipated. Anyway. It's extremely interesting and pretty, I don't know. I'm grateful that, we even got this release. I'm not sure that everybody else would release something like this because of what you can do with this. everybody online is going to just run it and scan now for good reason.
Alex Volkov
Alex Volkov 20:59
I saw librarians get excited.
21:01
I saw somebody scanning sheets of Mixtral fascia. Do you guys know what the Mixtral Mixtral fascia is? That library stores actual screenshots of, like papers or whatever, in like very, very, very tiny compressed microfilm type things. so they scan all of this, with precision, with this model. Shout out to Deeps, for open sourcing this. at some point Deeps Seek is gonna release something that's again, gonna change the world. And these small releases on the way there from the, 3.1 terminus, and some other things. This is like leaders there. So we're waiting for deep seek as well. But shout out for this, release. I think it's Apache two as well. Tiny model. You can run on anything. incredible. LDJ, go ahead. You're still muted.
Nisten Tahiraj
Nisten Tahiraj 21:45
Yeah, go ahead, Mr. there's still some issues with
21:48
the inference engines and, a lot of the spaces were not working. I don't know if Lama CDP supports it properly, so that's why I've kind of been waiting, because some worked and some didn't quite work correctly, but, this probably was like an internal tool or something where they do take in all the training data, that they made because, in that sense, if you have to scan trillions, it makes a lot of sense to make a tiny model because then it might take you 10 years if you just run a whole, cluster if you did large, right? So you, you do want to get the fastest, most accurate, smallest, possible thing. but yeah, it will be very interesting once they apply this, to the bigger models. it is pretty exciting for me because now if you want to do computer use, before you had to parse it or you had to dump the document object model, off the site. But now it's a lot faster to just take a picture, which is unexpected. It's also 10 times more compressed. So all that input data. Which you have to do before making a decision on an action on any agent thing that you build now can just be an image that's pretty nice that just became 10 times faster. Well, if you count the small tiny size, given what it can do, it's almost like we got a few hundred x improvement in speed, in capability because before you had to run full GPUs and it'd be a lot slower too. Yep.
Alex Volkov
Alex Volkov 23:19
Open source, is, only one of the corners here,
23:21
and we have to talk about others. Go ahead.
LDJ
LDJ 23:23
Yeah.
23:24
I think, it is worth giving an honorable mention to, ZFU ai, the creators of GLM since literally the day before the Deep C OCR paper released. They released a very similar paper basically doing the same thing, and it's achieving also pretty good compression ratios. and that paper is, called glyph, if anybody's curious glyph scaling context windows via visual text compression.
Alex Volkov
Alex Volkov 23:46
Yep.
23:47
Alright, thank you LDJ. Thank you folks. in the other open source releases this week We have a liquid, foundation model vl, 3 billion parameters. We talked about liquid AI multiple times here, even had some folks from them on, to talk about different releases and what the difference is. and liquid releases a VL version, that supposedly beats other small VL versions. But this is only true to like Tuesday because even to, like yesterday when released, a version, of their V model that's probably beating this. But, they have average score here for multiple, multimodal benchmarks and star blink MA bench, cr bench, pope and real world ca. And they claim SOTA this week with their LFM 3 billion parameters, ultra compact vision language model. Unlike the OCR one, this is the generative, agent vision model. 11 language, multimodal, multilingual performance, and, strong performance on multimodal tasks. LFM two delivers 51% on mm, if eval, and 71% of real world qa, which is both multimodal, benchmarks, Can we say 3 billion parameters, smaller than most competitive v lms? this is only true because this happened this week, but we can skip forward to the Qwen stuff because I think the Qwen stuff is kind of in this category, is also more interesting. They compare themselves to the previous Qwen, vl, Qwen 2.5 vl. we now have a tiny Qwen, three V. then we have poke research releasing a state-of-the-art well compared to other open source, deep research agent. it's fairly impressive the HLE results that they have here for, like around 15% or like, it's really hard to, to to count here. Like I think around 15 to 17% on humanities. Last exam from poky research is a 7 billion parameter open source state of without deep research, on par with or ahead of closed models the same size. shout out to PKI ai. I don't believe we covered any models from PKI AI so far. so shout out to them for this release. they have a multi third of Reasoning scaffold and, the combined RL af, reinforcement learning with AI feedback, which is a technique we discussed on the show previously. Let's see what else we can talk to you about this. Open soda, robust reasoning, scaffold recovery verification. automatically reruns and corrects failed tool calls and perform self verification loops, reducing error cascade escapes and boosting final and reliability. So this is like a model that also does a harness type modelly stuff. Interesting. And now let's get to the breaking news from yesterday. Our friends from Alibaba Qwen Togi Lab, released two new models. Nisten would love to hear from you about this because like this is the small one is crazy. Absolutely. Qwen three, vl 2 billion parameter and Qwen three vl 32 billion parameter folks. I was confused this week because I looked at this news and I didn't look at the release sizes and I was like, didn't we already talk about Qwen VL last week? Is this really new? Sometimes what happens is we cover things so early that people only get to the things we cover after we cover. So I almost skipped this release entirely. what's new? these are new sizes. These are, these are new sizes for these models Nisten, you wanna cover some of the releases here, I saw like multiple folks react to the 2 billion parameter. One that it's kind of like crazy.
Nisten Tahiraj
Nisten Tahiraj 26:59
Yeah.
27:00
why people are excited about this is that Qwen usually, it's pretty accurate with their benchmarks and they put a lot of benchmarks. some of them do seem optimistic, but usually they are, pretty right on. the reason that people got excited is because they have a track record of putting correct benchmarks when it compares this well with the SONET four, which, can have some issues with OCR, and we're not compar SONET 4.5, but that's fine. When you start to get something that's very, very good, or comparable to Sonet four and that you can run on your phone and it's performing that well, that's pretty exciting. the coin models are really easy to work with. I mean, most tutorials and stuff that you're gonna see in, Google collab are with coin models. Part of the reason why people are excited about this is because they can modify them to do what you want. you can train these models pretty easily too. but they also have crazy performance to boot. the two B model, which, you should be able to run on less than two gigs of ram. this is now something you can run even on a four or 5-year-old iPhone, and you are getting a performance comparable to GBT five mini. So it's like putting the image on GBT five. and, it will recognize pretty quickly, and I tried, I gave it some hard tasks like the terminal and, and things, and, the two B was still hallucinating.
Alex Volkov
Alex Volkov 28:32
yeah,
Nisten Tahiraj
Nisten Tahiraj 28:32
because it would try to interpret stuff as to what it
28:34
meant, but it did almost get all of it. when I asked it to repeat verbatim Tim, what it said it did actually, do that. so that was a good experience for me. Now, the more exciting thing, which nobody has really tried and we're probably gonna post about this week, is that these can do video and, video's very expensive to process. So this is why you need as small as possible. As high quality of a model to be able to actually input video properly or for it to just like look at you throughout the day and kind of figure out what's going on. Before you couldn't really do this unless you just like bought the API from Google. Now you can't, and this is something that we haven't tried and we don't know how well it works, but it'll probably be good.
Alex Volkov
Alex Volkov 29:24
yeah, they have a bunch of video benchmarks like ENV bench, and it
29:28
looks like even the four B model is like comparable to some of the other ones. but absolutely if you look at the chart, they're showing in red the state of the art on this benchmark. And Qwen VL 32 B instruct the 32 billion parameter one that you can also run. most M four max or the M three max can run the 32 B one. for sure. they have most of the benchmarks beating cloud sonnet for N GPT five mini, and not to mention the previous 72 billion parameter Qwen 2.5 vl, right? So we're, we're always seeing these step function jumps with Qwen, where the previous bigger model is outperformed with a newer model while also getting half of the parameters in place. we're getting a significant jump in most benches besides maybe OCR bench, which surprisingly the Qwen three reveal 8 billion parameters is a little better. Everything else is absolutely, crushing, including the video stuff. one comment about video. Super cool. Before we move on, do you guys know the Chat GPT can watch video? Do you guys know this? No, I posted about this
Yam Peleg
Yam Peleg 30:31
negatively, natively, bro.
30:33
Since went, since went since we,
Alex Volkov
Alex Volkov 30:34
I found somebody that, that saw that, there's
30:37
only one way to get it done. You cannot upload the videos. MP four is locked. folks, we're a little bit out of the open source now. We're moving towards some exciting stuff. basically there is a trick that I'll tell you about this trick later, of how to get native Chat GPT, understanding of video. it's, it's a trick that they don't want you to know, but we know. Are you
Yam Peleg
Yam Peleg 30:57
sure it's hundred
Alex Volkov
Alex Volkov 30:59
percent sure.
31:01
That's native. A hundred percent. And I know how, I'll tell you how, because I use the same, the same video on Google Gemini and I used it on gr and grok is supposedly multimodal native, but it's not really just like watches frames. I, I wanna highlight why, it's multimodal. there's a video of my son learning phonics with an iOS app. And the way he learns, like he watches an iPad video, like explains to him how to read and he spells it out. Chat GPT knew he didn't spell it correctly, like Chat GPT listened to the audio and understood he's not spelling this correctly. You cannot do this with frames. MRTA we just talked about. Qwen vl, for example, the video processing. I don't think that they have a multimodel video processing, for example. I like, I don't think that co VL listens to the audio in the, in the video. Like I'm pretty sure it's all frames. Correct me please if I'm wrong, but like, I don't, I don't know, like, they have the omni model that's like Omni, but like the VL model doesn't, doesn't process like, like a sound into it. Did you guys remember we talked about multimodal things? I was like, wait a second. there's only one way to do this as far as I know, you can share into the Chat GPT app, via the iOS share feature. but you cannot select the video from the dropdown, 'cause they restricted to images. But apparently if you send the video to yourself on, WhatsApp and then share this video from outside of WhatsApp Chat GPT shows up at the shareable location apps.
Yam Peleg
Yam Peleg 32:26
That's new.
32:27
It wasn't there before for sure. So now that's really new. You can send
Alex Volkov
Alex Volkov 32:30
Yes, you can send the video from WhatsApp into Chat GPT.
32:33
Chat GPT will understand the video fully natively. I really needed this a couple of times and I didn't like know how, I sent the same video to Gemini, which we know Gemini is multimodal and like, Gemini did not do as good a job as, as this. So I'm very, very excited about this. LDJ, go ahead.
LDJ
LDJ 32:50
Yeah.
32:51
Quick clarification on Qwen. Qwen three Omni 30 BA three B. Yeah, it can do
Alex Volkov
Alex Volkov 32:57
Omni does, Yeah.
32:58
Omni does audio input. And also I believe video input, but I'm not sure that the video input includes audio, if that, if you understand what I'm saying. Oh, I see what you mean. I'm not sure either. Do you wanna record the video? You wanna record the video of us and share it to Chat GPT? Yam. What are you showing us? That it works? Yam confirmation, everything.
Yam Peleg
Yam Peleg 33:14
You are right.
33:15
You're right. But, again, you can't just share. From the iPhone. You have to, exactly like you said, you need to send the video yourself. Yes. Then go click on it and select in the WhatsApp selection and then share. And only there you can see Chat GPT. let's see if I have it on the web. If I actually have a video player.
Alex Volkov
Alex Volkov 33:33
probably on the web.
33:34
Yeah. So native.
Yam Peleg
Yam Peleg 33:35
we know that they absolutely have a native, video input to GPT because,
33:38
advanced voice mode work natively.
Alex Volkov
Alex Volkov 33:41
Yes.
Yam Peleg
Yam Peleg 33:41
with video.
33:42
yeah. you heard it first. I heard it first time. Here. Go try It actually works.
Alex Volkov
Alex Volkov 33:47
And that's pretty cool.
33:48
it really is like actual multimodal understanding of video. all right, so Going back to the browser wars, et cetera, like something very, very exciting to me.
Nisten Tahiraj
Nisten Tahiraj 33:57
well, I can also confirm.
33:59
Quin three VL does not do audio in only the model one. I just tested that.
Alex Volkov
Alex Volkov 34:03
Yeah.
Nisten Tahiraj
Nisten Tahiraj 34:04
Doesn't do audio.
Alex Volkov
Alex Volkov 34:06
All right, folks, we have to move on to other news.
34:09
we got some news from, Microsoft ai, and their releases. I didn't see anything like super, super exciting there. not that I expected, but yeah. Let's take a look. most money posted. where's, where is this? Blah blah. Okay. Microsoft Edge. So, before we get to the actual Atlas point of the story, Microsoft announced copilot mode for the edge that turns edge into ai. Agent browser Edge can now navigate pages for you with actions, managed tabs, and access browser history. this is not new for us, but it's very interesting to see this coming to one of the bigger browser in the, in the browser wars. Copilot, I think is still powered by GPT five, I believe. I don't use copilot, like, to be very honest with you. Like I try it when new things come out, but like I, you know, I stick to Chat GPT. and we have a very flashy video that's not super interesting, but basically they have a copilot, like browse and click things, which is now the state of the art in browsers. And the only one that doesn't do this is Chrome. so shout out to copilot. They also have, right now, should we try it? Can we try it? Oh, okay. Microsoft, windows Edge and Mac folks, we're gonna try this live on real time. Do you guys have, we're gonna try to download Edge. We will download Edge. we're gonna download one browser, agent browser from another agent browser. so this copilot mode in edge, once we get there, we will try it, but for now, I think the more bigger release, I think it's bigger. JGBT Atlas release. We already had a, a two hour stream this week. So for those of you who are coming back, this may be some something repetitive, but we also, we've used this browser so far as well, this week, open the, I stepped into the browser war after two weeks ago, stepping into This is the copilot fall release before again, ti Atlas, copilot mode Edge. They have a new clippy thing called Micko. this is the new assistant. they have memory built in there, which is pretty cool. And proactive actions. Ooh, this is, I'm interesting about what is proactive actions, and then some other stuff and connectors. maybe we should cover this like fully at some point. I don't know, like who uses Edge? maybe we can use Edge now that we downloaded and try it.
Nisten Tahiraj
Nisten Tahiraj 36:20
Remember two years ago when a lot of us were using Edge
36:25
and I kept posting stuff, how, I would open three different side panels on it and I would have, I forgot what model I would have on one side panel GPT
Alex Volkov
Alex Volkov 36:35
This is the model you would have.
36:36
Like at some point, edge was the only place to get GBT four before it was released. GP four with Vision for example. Yeah.
Yam Peleg
Yam Peleg 36:42
was a good thing.
36:43
Okay. it was a good thing being a lot. It's funny to think about it today, but I used Bing a lot. That was the only one with internet access. Absolutely. Yeah.
Nisten Tahiraj
Nisten Tahiraj 36:51
That was before they hired must and then nobody used it anymore.
36:56
it's still there, but, yeah, it's kind of interesting because we all thought they were going to dominate at that point. It was the only thing that was actually aware of what set you were on too.
Alex Volkov
Alex Volkov 37:06
I'm installing Bing right now, and it asks me for my permission,
Yam Peleg
Yam Peleg 37:10
I'm not, so many different things are called copilot at the moment.
37:14
Yes. So I'm not entirely sure what exactly. Like you're saying, it added to copilot and then it added to edge and native on windows. And you have the IDE plugin called copilot, and the agent called copilot UB copilot. And you have the, chat completion in the, code called copilot. So many things are just called copilot and getting pretty good. Oh, there is a windows native, chat client called Co-pilot. And I don't know, it kind of feels like anything that is branded co-pilot is like trying to strap AI onto a different thing.
Alex Volkov
Alex Volkov 38:26
But let's talk about Atlas Chat.
38:28
GPT released a new browser this week also. this is the first step of Chat GPT into the browser wars. This is the new tab and you can ask it to do some stuff. And, there's a, draft Thursday News Recap feature that I already used. there's a bunch of other things. It's Chat GPT in the browser. it's not just Chromium Fork, which Chat GPT slept on. It's really is deep native integration with your profile on the top right, which is your, charge G profile. One thing I've noticed that we didn't notice during the livestream folks, there's this like accent color thing. and now this is the, the, the accent color of my Chat GPT. So if I do this like purple and I go to Chat GPT, charge G is gonna be purple. The iOS app is gonna be purple. Everything's gonna be purple, and you can change it back to like yellow, whatever. so now this is like the accent color for, for all of your Chat GPT another way to customize, not that it matters, also, folks who are complaining online that this is just a chromium fork. No, this is a completely different menu than even Edge. if you guys see this with the Zoom and whatever, this is the standard chromium menu, new tab, new window, blah, blah, blah. Favor. It's like the dropdowns in more settings. This is the standard menu. So if you see like a blank, like chromium fork, you'll see the same menu everywhere. New tab, new private tab, then zoom, like yeah, I remember the browser we built had exactly the same menu, JGBT bu the menu with, with SWIFT I native and the settings panel is completely, completely new. this thing in the settings panel, we talked about this folks, they added a virality feature within JGBT Atlas that shows when you joined, a OpenAI as an account, not J GPT specifically, even though it says J GPT. 'cause most people joined OpenAI via J GPT. I joined 18, 124 days ago, 1,824 days ago. 1,824 days in years is 4.99 years, almost five years ago. So I joined in June, 2020. if any of our listeners joined before me, definitely post your score as well. Atlas is, I think the only way right now to know when you actually joined, OpenAI. what does Atlas do? Besides having Chat GPT fully integrated into Atlas, which is, you get, the whole Chat, GPT sidebar in here on every place. So the new chat search, your library of images, custom, projects, your chats, everything is here. besides this, you have, context for every new tab with the new site. The S GPT feature. One thing that I didn't know during the livestream is that, command dot opens up the site pane so you can shortcut it, command dot, and it's really, really useful. And there's no way in any of the browser, any of the settings that I looked to actually know this. the only way I knew this is a thread from one of the folks on JGBT Atlas that said, Hey, we need to build up a page before now, like you know, this opens the site pane. But again, you have this Ask j GPT mode on the side, and it has context access to the current tab. Only one tab, by the way. Again, comparing it to, for example, comet or, even DIA from the browser company. J GPT for now only has access to one tab and you cannot like, give it context for other tabs. It's annoying as heck and it's really like messing with my ability to like do cross multi-page research, which I do often for you guys, here on Thursday. what else do we have in Chat? GPT, Atlas. That's super, super cool. We have summarized feature, and yes, we have the agent mode. I think the agent mode is like the number one important thing besides the feature. There's a bunch of features, it has this cursory thing where you type something in any type of typeable thing. Once you select it, you have this little che BT thing. A GBT bubble pops up and you can do charge GBT. Well, action on the selected text, which is. I haven't seen this in any other browser. It's novel. I've seen this Chrome extensions. and here's an example I typed in like a print bullshit. S-D-F-S-D-F-S-D, blah blah. And I told it, Hey, clean this up. Che, GBT made the poem out of all of this bullshit. It's really funny. So, and then you can use this to replace the text, for example. it's really funny because Google Docs also has a feature to do this natively. So you can have AI on top of ai, but this, I think they call this cursor, bubble whatever, if you guys remember exactly what it's called. this is another feature of Chat GPT that I haven't seen anywhere else. However, I think the most important feature of Chat GPT, while we add Quila here, our friend Quila. What's up Qwen? Uh, we're gonna get to talk about some stuff very soon. Uh, the other exciting feature is, uh, the agent mode. Agent mode is the agent ability of Chat GPT to go and do stuff on your behalf. Uh, and as far as I, if you guys remember operator in Chat GPT native interface that became Cha GPT agent, this is now built into the browser with the main difference that agent has access to the webpages that you have access to and blocking it is now impossible. So, for example, if, uh, J GPT. Agent would use a, you know, you know, a secure environment, and then you log into that secure environment. Uh, Chrome can block access, uh, to, to those actions. Now, J GBT is running within your browser. This is your browser. You're logged into stuff that you're logged in. Uh, it has a security feature that shows, uh, that you can run this agent with, uh, log in sessions or without. So like, basically give it like any computer mode. I had this yesterday, I had this mode, uh, and I like almost live streamed it, uh, do a compliance training for me. Uh, and for the five hours that I tried it completed, the compliance training. The compliance training is hard. Uh, specifically that, uh, platform called Easy Lama. It has not only like, Hey, watch a video. Here's a few questions. It has questions in different formats. It has quizzes, it has multiple visual things that you need to complete. It has drag and drop interfaces where like, yeah, drag this plaid scenario to the thing. It has like a bunch of stuff like this and charge BT after five hours, five or something after three stops that I had to tell it to continue. And it said I am continuing. And I told it, no, you're not, because I can see that you're not doing this. Uh, after a few gaslighting moments, it, it did complete. I had to restart the chat for, for once. Uh, but I was very, very impressed because, um. This is stumbling through this pressing buttons, not listening to the audio, because a lot of it is watching a video and listening to the audio that they're saying. So Che g bt agent cannot do this yet. Um, it still was able to complete a compliance training, uh, for me, which I need to now take again because obviously I'm not compliant if I use an agent to finish this for me. Uh, but I was very, very impressed. So the agent mode is something, uh, it needs supervision still. It is dangerous because of injection problems, but, uh, it, it is, is absolutely something. Folks, have you tried the Atlas browser? Will you be trying the Atlas browser? What are your thoughts? What are your, um, um, you know, general reaction to the world of, uh,
Yam Peleg
Yam Peleg 45:09
people on Windows and Linux would like to have a chance as well.
45:14
I'm just saying that if anybody listens and can influence this, there are many, uh, many of your users are eagerly waiting to, to try. Well, yes,
Alex Volkov
Alex Volkov 45:24
Atlas was released only to Mac Os.
45:26
it's a research preview and they're working on other platforms. There's no iOS equivalent, no Android, definitely no Windows and no Linux. who here did try it? Probably not. Nisten instant.
Nisten Tahiraj
Nisten Tahiraj 45:40
I tried to try it.
Alex Volkov
Alex Volkov 45:41
tell us security is your concern.
Nisten Tahiraj
Nisten Tahiraj 45:44
Yeah.
45:45
Also, I don't really need it because any other. Thing I do with Clotter, GLM, it just fires up chromium, headless, it, grabs it down, it takes pictures, has full control of Lin already. So I don't, there's nothing I can do that I can't already do. But, OpenAI does make a nicer interface. I mean, I've tried to build a browser before too, and the Chrome build takes like 90 gigs of space on your thing to do like a modern chromium compile, and that's just hard drive space. And you can take like two hours even on an M two Ultra. so to make a good browser, it's actually pretty hard. And, I do think OpenAI have done a pretty good job rebuilding chromium in that sense. So I did want to test all of that.
Yam Peleg
Yam Peleg 46:38
I, I just wanna say on this topic, anyone listening, the, I tried,
46:42
this week, the Chrome, web dev MCP and it surprisingly, is able to control whatever you want in the bright in the browser. I thought it's just gonna be, a bunch of tools to, monitor websites or to check for, I don't know, test, test some performance. But no, it can absolutely control the entire browser. it's really powerful. You just. Install it in Cloud Code Codex, whatever kind that you run, and that's it. It can control Google Chrome on the same computer. You see the browser in front of your eyes doing whatever, clicking stuff and doing stuff. That's a good alternative. That's what I'm using at the moment. would love to try the other options.
Nisten Tahiraj
Nisten Tahiraj 47:21
Wait, wait.
47:22
I didn't know this was out. Oh, yeah, this was out a month ago.
Yam Peleg
Yam Peleg 47:26
yeah.
Nisten Tahiraj
Nisten Tahiraj 47:26
20th.
Yam Peleg
Yam Peleg 47:26
look, look, look.
47:27
It's out a month ago. And I thought, just like you, okay. It's a nice tool for benchmarking websites and so on. It's not gonna control the entire browser, but is this what they're using? I used it to read archive papers. Seriously, I just went to chat with cloud code and told it the code search for the paper, and it went, and I saw the browser open and it went to find the paper I'm talking about. And it was really, really cool. It's really cool. It's one command just to install like MCP a. Google, Chrome Web, web dev, whatever the MCP is named. And that's it.
Alex Volkov
Alex Volkov 47:58
So I'll try to bring back the discussion into Atlas.
48:01
I will just say, though, we should talk about the alternative ways of controlling the internet. We're gonna chat with Paul Klein, in a moment to talk about browser base and the director and some other things that they released. MCP is another way to install, more. All of these have risks, different levels of risks. One of the things that we know about the agentic browsers is that, if browsers have malicious instructions in them and the model is not sufficiently trained to detect malicious instructions, I think brave also a browser, not agentic, but also AI browser released a security research into multiple browsers, just failing, context injection attacks. Simon Wilson has an outline three years ago, content injection as a thing for iGen browsing features. and we had, this week, the ciso, the Chief Security Officer from J GPT, information Security Officer Dane, He talked at length about their new security measures in the browser. This is like the most. Transparent, the company has been about mitigating this type of attack. Very, very cool. I definitely recommend folks add this to show notes. here's some of the stuff that they are doing for this prevention. They're saying we prioritize rapid response systems to help you quickly identify block attacks campaigns as we become aware of them. This is very, very important because cht PT is also a. Ecosystem. And now if they detect multiple attacks in the browser, they can block it for every other user versus, you know, chromium for example, can't. they're also, they invest in security teams research to improve robustness of the models. design Atlas to give you controls to help protect yourself. This is one of the main features that I want to make you guys aware in Atlas as well. Atlas has the agent mode. Once you run agent mode, it gives you this option. let me zoom in here. It leads you the option to stay logged in. JGBT works alongside you using your logged in account. So if you logged into X or Gmail or whatever, it will have access to the same cookies or there will basically give your incognito browser experience to the agent logged out. So like, if you wanna do some research but you don't want it, to be there, it will browse with you locked out. So that's one of the features that they have. they said when agent is operating in sensitive sites, we have also implemented a quote watch mode that alerts you to the sensitive nature of the site and requires you to have the tab active to watch the agent do its work. The cool thing about agents is that you can run something in a background tab and it will do things for you and you don't have to watch it so this is what I did. now with all this said, and hopefully this will come live, the browser security implications are very interesting and as Yam said, there's a few ways to control browser, chet's controlling the browser on my system. but there's also other agents and other things maybe you want cloud to use. It's os world score of like 67, that sonnet was released with. And, many of them are using tools and one of those tools is browser base. And we have the browser base guy, Paul with us. What's up Paul Line? How you doing? What's up
Paul Klein
Paul Klein 50:45
Hey, I'm on my way to the browser right now.
50:49
super exciting week in the world of browser. happy to chat about it.
Alex Volkov
Alex Volkov 50:52
Yes.
50:53
you guys launched something also today, and the thing that I wanted to call you for is because I think that you guys innovated something that I, I was very, very excited about the integration with one Password and authentication on my behalf as a user control thing. I have never seen this before. I'm a user of one password for the last, I think, maybe 15 years, and I use it to log in as me. And I was always saying that, hey, if I have an agent, it should either have access to the tools that I'm logged in as, or it should ask me access to my password in secure way without like me giving away the password. So I saw you guys implement this. would love for you to talk about what you guys launched this week and this integration both and how you guys see authentication for agents in the browser world.
Paul Klein
Paul Klein 51:30
Yeah, exactly.
51:31
With we think about browser automation in two ways. There's a browser in chat or chat in the browser, right? And I think with Atlas and Comm and Dia, you're seeing a lot more of the chat in the browser where someone's giving you a browser and you're chatting with ai. Now we kind of tend to see this more with traditional AI agents where you have a browser in a chat where you're having your AI agent chatting with it within a website or some SaaS, and you're now it's deploying or talking to a browser, running in the background in the cloud or something like that. And regardless of both of these things, you need to let your AI authenticate as you, when you're offering someone a local browser experience, much easier, right? Because it's running on your computer, it's next to you and you're able to kind of use the same IP address as the person, and keep things a little bit more safer there. But what happens there in the kinda browser and chat experiences is that there's no background browsing. As soon as you close the laptop, all of a sudden your workflow stop. And we've seen more of these kind of AI browser companies start to offer some ideas of background browsing, but then you run into what you need, which is the, delegated authentication. And at browser base, we think a lot about cloud browsing, background browsing, and really background browsing powers. A lot of this browser in chat type of experience that I talked about earlier and what our partnership with one password was, was, Hey, you have your one password set up, which is a vault running on your local machine, and your browser locally can talk to it, right? You never want your passwords to be stored in the cloud, but how do you enable a cloud browser to access your password securely? And with one passer, what we did was we found a way that we could have a vault running in the cloud environment, a vault running locally and using the extension. The extension establishes connection from the cloud browser to your local vault, and can request access to it. So when you use a cloud browser with one survey, you able to say your agent, Hey, give me access to this credential. Pop on your screen, on your local machine saying your browser wants access to this credential. It's almost like the cloud browser is reaching into your local device, asking for permission to the password, using it in the cloud browser, and the proceeding forward. This is a cool paradigm 'cause it means that if you have access to one password on your phone or on your laptop or your desktop wherever your AI agent is, it will be able to request access to your resources. And it's a paradigm that's a lot more flexible. If your AI agent only has access to one machine, it's very limited to that machine. But if your AI agent can access the services and tools you have on all of your machines, it might get a lot more done.
Alex Volkov
Alex Volkov 53:55
Yeah.
53:56
So I have to say that as a long user, longtime user, or one password, I was delighted. See this, it requires the night to install of the one password. Currently, it's not baked into the main release yet. The fact of the matter is the agent that I run on my behalf somewhere in the background, has access to the exact same password that I needed to get access to, and I'm in control of what I give it or don't give it to when something runs on Atlas, for example. The only control I have is have access to all of my login sessions or none. On this way, I can just log in on a specific thing. It's not perfect yet. Paul, I gave you some feedback as well. I wasn't successful to log into X, but I was successful to log into substack, for example, and doing browser automations with my cookies. With my passwords where I'm in control. I wanted to bring you on to highlight how important this, I think is for the agent world. 'cause many talk about agents and auth and whatever. No, that's not me. That's like I keep talking about the web is human shaped. The web is human shaped. It's not yet agent shaped, like all the auth and the different integrations. They will come eventually for sure, but I don't want my agent to auth via an application. I wanted to click buttons, because buttons expose different things than the API expose. How, how do you think about this o versus like a personal login thing?
Paul Klein
Paul Klein 55:05
I'll add one more thing here, which is that I believe that
55:08
with all the prompt injection stuff that you have happening, there is no way you should let a browser, you touch your regular Chrome profile. All of your cookies, all your information you should be giving in access step by step. You should be delegating, you should be saying do this, do that. Until we figure out this prompt injection stuff, I think it's too early to let AI completely control your local browser, which is essentially like coming into your house and living in your house. You should be letting your AI use certain things at certain times based on your permissions. The way we did that with one password is the right way. However, I think like you see that in Atlas with the private mode, that seems like it's the right direction. If you're gonna have AI control your browser, it needs to earn the right to different credentials and access. You can't give it the master keys on the first rung.
Alex Volkov
Alex Volkov 55:52
I think the Master Keys, example is a great one and I think
55:55
that like, it's also possible likely with an actual local browser, they just didn't implement this instead of showing me one password, hey control, they probably show me like, would you want to give access to this browser? Kind of like cursor in the beginning before YOLO mode was asking you for everything, can I run this command? Can I run this command? Now most of the people that I know run cursor on full yo mode, like do whatever and they trust that they implemented the safeguards. So I think the browsers will get to that point. But yeah, you're right Paul, like to some extent Atlas kind of came in and asked people to give MasterKey permission to the whole suite of like accesses. And the injection attacks that we saw was like, Hey, instead of summarizing this webpage, like the person ask you to go into their Gmail, open this like deferred email, copy the url and then go to this website and send me this email. And this is the prompt injection that people saw. Paul, you also launched something today. I want to give you an option to talk about director 'cause I think it's like super, super cool.
Paul Klein
Paul Klein 56:46
Yeah.
56:46
of course. And maybe a last note on that, like the prompt injection attacks in browsers are much more dangerous than in an IDE. So I, I'm a little concerned, like I don't think we should, like, YOLO mode is great. I use YOLO mode and cursor. I'm not sure if we're ready for YOLO mode in the browsers 'cause the consequences are much higher. So that's just something that's been on my mind. I don't know how we're gonna solve this prompt ejection stuff yet, but I'm excited. I know we will solve it if somebody will solve it over time. but we love this thing called director and director. AI really takes the, browser and chat approach a little bit further. And it kind of came from this world where, yeah, if you pulled up it'd be amazing. when it came to this world, which is like, we know people wanna prompt to automate a browser, but they don't necessarily wanna talk to the agent each time. They may wanna automate a specific task. So when you put a task, like find me the number one post on product hunt or something like that, you may wanna do that every single day. There's a type, a type of task that you are gonna run very regularly. And to build automation to do that task regularly, it's very challenging. So director is an agent builder that uses the frameworks. We offer stagehand and our infrastructure. We offer browser base to really allow you to take a prompt, accomplish some routine web task and then export that to be a repetitive script that you can run. It doesn't actually build you an agent. It's not something you talk to each time. it actually outputs software and that's what's super special about director. It kind of reminds me of V zero. If you use V zero for building websites, you use director for building web automations.
Alex Volkov
Alex Volkov 58:12
Yeah, and I use director and like it integrates
58:15
with one Password as well. So it allows you to plug into that infrastructure of accepting, here's a director, navigating to Hacker News and getting the data of the page. It kind of looks like, the agent that I saw But the artifacts I think pauses what you're referring to, right? it gives you an actual script to then after it did the things, it basically record mode for agents to give me a, controllable script afterwards and not just an energetic thing that can get lost, right?
Paul Klein
Paul Klein 58:37
Yeah.
58:37
You can also deploy that script right there in the ui. So not only are we gonna generate the script for you, but if you click that deploy button, you may have to sign in. if you click that deploy button it will actually go ahead and run that on a schedule for you or let you call that with an API eventually. So we feel like this will be a one-stop shop for people who are we built this product 'cause I kept getting, people kept hitting me up, like dentists were calling me saying, Hey, how do I use browser based to automate something? So we built director for these semi-technical people who are AI curious and wanna automate and really wanna try and wanna make it a one stop shop for everything they're doing.
Alex Volkov
Alex Volkov 59:10
So, a question super quick before I let you go.
59:12
I know you're on your way as well. if I use the one browser, one password integration in my script, every time it will hit like log in as Hacker News as me. I'll get it prompted saying, Hey, your agent is trying to log in as you, would you approve. This happens every time.
Paul Klein
Paul Klein 59:25
For right now, you're approving it every time, but
59:27
we're also gonna offer an option. we're working with the one password team where you could say like, it's kinda like, the cookie banner, like, or like, log in, do I keep me signed in, type of thing.
Alex Volkov
Alex Volkov 59:36
Yeah.
Paul Klein
Paul Klein 59:36
So it's like, yes, accept it or like, yes, always accept it.
59:39
But right now we wanna, like, once again, we try and bias as an infrastructure provider, we wanna be as secure as possible in the beginning, give you a kind of teaser of what's happening. And then over time as we work with the community, see what's best, see what people like, move in the direction that people feel comfortable.
Alex Volkov
Alex Volkov 59:53
I think this is ahead of a lot of question before we let to go, Paul.
Nisten Tahiraj
Nisten Tahiraj 59:55
Yeah.
59:55
As someone that likes vibe coding on their phone. Paul you mentioned, so there's a cloud browser and there's a local browser. Is there a way, like, for you right now, if you have an agent running right now in the background, could you approve it with one password from your phone? Is that functionality, it's a, there with browser, browser base or it's coming or.
Paul Klein
Paul Klein 1:00:13
So if you're asking about the one password integration we have
1:00:15
with director, no, we don't have mobile support for one password yet. We have to do a big mobile app update with them. it's coming soon. if you're asking, can I use my phone to control a cloud browser, the answer is yes. director.ai works on mobile two, and it's pretty fun. Like you can actually dispatch a browser from your phone to go do some task. And I think that like the, the interfaces we use to interact with AI are changing. And I always tell people, you know, browser based, for now, you know, we're, we're gonna be in good shape until Neuralink comes out and we're just talking to each other's brains. We might not even need software anymore. We might not be browsers, but until that happens, I think the company's gonna be, in a pretty good spot.
Alex Volkov
Alex Volkov 1:00:51
All righty.
1:00:52
Paul, thank you so much for joining us.
Nisten Tahiraj
Nisten Tahiraj 1:00:53
I just tried it in the mobile view.
1:00:55
I asked it to look up a whole bunch of used stuff on, Kaji and Craigslist. It's actually doing a pretty good job. it knows roughly where I am. It found the prices. It's thinking through it. This looks good.
Kwindla Kramer
Kwindla Kramer 1:01:06
Paul, we gotta build the native voice We, we
1:01:08
got one more step to take before.
Alex Volkov
Alex Volkov 1:01:10
I wanna talk to my browser.
1:01:11
I wanna have a do stuff. folks definitely give, I wanna take to my browser. Yeah. I'll see you guys. Thank you so much. All right. definitely check out BB browser base. one of my favorite browser integrations, for sure. We guys also collaborated on a bunch of hackathons. You guys always sponsoring and doing incredible things and the people who are building agents are finding the browser base, tools are very important for them. To give agen capabilities to the like, go and search something is not enough with like X APIs. Do you want to maybe download something and maybe parse to like browser base is great for that. director is in very easy way to get started with browser base, end stage hunt 'cause it records whatever you wanna do. So users, director is very, very good. And congrats on launch. Paul, thank you so much for joining. And I think for me, the most exciting thing is the new paradigm of, pair website authentication for my vault. I don't think that people should store passwords in the browser. I think those two things should be separated for this exact purpose. If you store your browser passwords in Chrome, like many people do, because this is a default option. Some like this will be impossible for you until you get out and sort this. I personally vouch for one password and have been for the past 15 years, they've never been hacked. no password leaks from, they're very, very well integrated into the security ecosystem. And so seeing them innovate together with you guys on this newgen paradigm of authentication, I'm very, very excited about that. I was very happy to bring you on. To chat about this, and as Paul said, don't give your master keys to your whole house, when you can decide room by room whether any AI can get in there or not. That's a great metaphor. Thank you, Paul. folks, we need to move on. Qwen, we brought you in to talk about some super exciting stuff, but there's one thing in the big company's API that we have to cover before we chat about some stuff, but obviously you are part of the panelists here. You feel free to chime in. cloud code is now on the web. you guys were way more excited about this than I am because I, you know, I use Codex and I use some other stuff. And Christian Windsurf. let's talk about cloud code on the web, Nisten and Yam. What is the most exciting thing there for you? Why is it exciting? What's, what's going on? Nisten, you go first.
Nisten Tahiraj
Nisten Tahiraj 1:03:00
I could show it, but we don't have that much time I found it.
1:03:03
I could use it from the phone. I was hoping it was a lot better, but it was still good enough that, on the chat side, I have the CloudFlare MCP connected. So I can see when deployments failed or, I can check the databases, how they're doing or when new users came, but I never had a good way to, send it stuff unless I used my own tools. the fact that this is integrated only with GitHub, not with GitLab, is that, It is just there, like it is just on the website. I don't have to worry about like, which machine am I SSHing to or is my server up and stuff. it doesn't do everything you can on a local environment, but it does do almost like 90% of it.
Alex Volkov
Alex Volkov 1:03:48
So they have like a secure sandbox thing where like
1:03:50
it supposedly runs your stuff, but like it's still very complex, right? Like running your stuff is complex. so how does it actually run your things? I think people will
Nisten Tahiraj
Nisten Tahiraj 1:03:58
like it.
1:03:58
I'll just show it really quickly. So, you have your, you gotta turn it on, if
Alex Volkov
Alex Volkov 1:04:03
you can zoom in,
Nisten Tahiraj
Nisten Tahiraj 1:04:04
Yeah.
1:04:04
Yeah. And they, let me just zoom in a bit so you can make the personas now. So this is the stuff that I liked. sure use memory and then you can do a use CloudFlare MCP to check pages. However, what you couldn't do was you couldn't really look at the code or do anything with the code. But what you can do now is. They have this one on the side, and here you can see, oh, I had, I was fixing my actual code, my authentication middleware. And like sometimes it will merge the pr, sometimes it won't, like often. Okay, so this one, it, it decided, it, it already merged, but, review the code again. update to dos with the 20 points. So it is actually just the command line, cloud code on the web. So it does keep track of all of the to dos. you have some limitations. Oh, and the other strange thing is that you can now teleport your session, to, to the cli the actual clock code. Yeah. On a whole different machine. So I could probably, show that, but like, it is just, it's a UI for, for the terminal. What's happening here. It's also exactly what's happening in the terminal. You can teleport this and go back to the terminal. Now what I really liked was that, there was a mobile interface, so it does work. On the app as well. So now you can actually use, cloud code on the go and it keeps the session running. So you don't have to worry about, like, did I close my laptop?
Alex Volkov
Alex Volkov 1:05:37
Did
Nisten Tahiraj
Nisten Tahiraj 1:05:37
You don't have to worry that much about that stuff
1:05:39
because it'll just keep going. if you have your agenda stuff set up well enough, it will go on for like a good 20 minutes, or so. and then you can, you can check up, you can go on the site, you can check up again. I feel bad for all the startups that we're doing this. it's still missing quite a lot of stuff, like, permissions that you sent. it can't really do screenshots. you can't really fire up automated actions, which you could do with the chromium. it doesn't have access to the stuff that the chat has, so it's still not really integrated. There's no voice input but there is voice input in the chat. So it is a little bit weird, but I think a lot of people will use this that would not normally use cloud code.
Kwindla Kramer
Kwindla Kramer 1:06:21
know we've gotta take it step by step, but for me,
1:06:23
what I want to see is a new UI here. But what I can see coming is, okay, you've got everything in this sandbox. There's a new UI that you could build from scratch for this kind of loop with an agent, co-pilot, collaborative coding thing. That's what I want to see.
Nisten Tahiraj
Nisten Tahiraj 1:06:41
It's more of a just be tell it a thing.
1:06:43
You wait 15 minutes or you just like check up on it.
Alex Volkov
Alex Volkov 1:06:46
Alright folks, we have, a bunch of other yam you
1:06:48
haven't, this mostly talked about. Tell, tell me about cloud code on web.
Yam Peleg
Yam Peleg 1:06:51
look, cloud code, the thing about it is that, it's a tool
1:06:54
that can control a computer, whatever computer, and that might happen to have your code hosted on this computer, therefore it can edit your code. but first and foremost, it's a tool that controls the computer through a terminal. Okay, now we're coming full circle because people, realize how useful it is. there started to be ports of the same tool to within your, code editor and so on. But now people wanted to, for example, just use it on a mobile, by the way, have it on a mobile. the thing that, people are mentioning here about, sandbox is okay, when it's running on a mobile, which exec computer does it control? So you have a sandbox environment on tropic servers and basically, in order to edit your code, you can plug it into guitar. That's the way it works. And you can just run it in a sandbox and. Use it to merge PRS later on. but again, it's becoming a little bit convoluted at this point. what exactly is cloud code? What exactly is cloud code for mobile? I fully agree that there might be a need for a new, user experience here. Codex is going the same way as well. There is Codex, CLI and Codex web app, and they're kind of different. if you need a cloud code and you like it, you can now run it. once you can get the toilet right here, it's very easy. Just integrate it in into GitHub. It takes five minutes and you can send it to do, whatever you want and it'll add the code. the cool thing I, the cool thing that I like about Cloud code is because it can control the computer. when I'm running it locally, I can ask it to run the code to test it out, not just to write the code, not to, not just to edit, because you can run the code and run tests as well.
Alex Volkov
Alex Volkov 1:08:39
right?
1:08:39
If you have GPU cooler stuff, it's not gonna run them.
Yam Peleg
Yam Peleg 1:08:42
Yeah.
1:08:43
definitely a good, a good thing. I mean, definitely. a good need to have this app. But, you know, maybe I'm not entirely sure where we're going. Everyone is launching everything and just moving their stuff from here to this platform. I'll just shout out that
Alex Volkov
Alex Volkov 1:08:57
Swix was here last week, and Ai e code, the AI engineer code where all
1:09:02
these people will come and discuss about the advancements is coming up in November. So sign up for that. the speakers was just released. I would love to see you there. Cloud Code was released earlier this year. Cloud Code is not even a year old. This whole paradigm is not even a year old and it changed completely how much money tropic is making within last than a year, and now they're releasing the web interface and TX is also web interface. Alright, folks, stay with us. we haven't talked about the video stuff yet. We haven't talked about the audience stuff yet. We have the world's top expert in audio, real time stuff, Qwen, Quin, LA Halman Kramer here from Pipe, when we're gonna get to that conversation very, very soon. but before this, this show was sponsored by Wits and Biases and, this is this week's buzz.
1:09:56
I really have to update this transition thing to say Weights, &, Biases from CoreWeave, which is now the new thing because we joined CoreWeave, you know, six months ago it said blah, blah, blah. So this actual whole segment is about this week's buzz from Weights, & Biases, from CoreWeave. all right, so here's what I have for you. Not gonna take too much time. we have two you guys know, fully connected, which is an ML term that ya mentioned previously in the chat as well. Fully connected is the name of our conference as well for machine learning practitioners, AI engineers, et cetera. you guys are welcome. If you are in Europe, London is coming. You are more than, more than invited to come to London, November four and five to meet a bunch of friends. if you are in Asia, Tokyo, I wish I'd be there, dude, if it, if this wasn't on Halloween, if this wasn't on October 31st, I would've been in Tokyo. MCP, like MCing this conference. I'm seeing this conference. I'm very excited about that because like, it's super cool and there's a bunch of folks in Asia that are like leaning strongly into ai. you will know this if you go to the soa, Japanese feed, you'll see some incredible stuff as well. So fully connected. if you need some tickets, hit me up on dms. I'll definitely get you in there. if you're in Europe, or in Tokyo, fully connected.com for that. And the other thing is our, host company, CoreWeave is one of the top, if I'm not mistaken, the only platinum. Neo scaler Cloud, GPU provider on the, our semi analysis, the scale, whatever they had. we have partnered with Meta and with Stanford and we cooperated on this new release of a new library called Torch Forge. Torch Forge is a new RL library from PyTorch, a Python native library, PyTorch native library for scalable RL post training and agent development. if you read the blog, you'll see that, this was, developed together with, Stanford. I'm not gonna go into this whole thing. Maybe I'll bring Aaron who's Ai.engineer on this from our team. but basically for scheduling multiple. Hundreds of GPUs runs. There's Ray, and now they release something called Monarch. And Monarch is from the team at Meta. this competes with Ray and the scheduling part, et cetera. and now PyTorch natively integrates with all of this, to give you, the ability to scale a l and post-training energy development across multiple tons of GPUs. And CoreWeave was a proud provider of those GPUs to send virtual research this research. So if you are interested in this, if this is for you, if you are training on multiple GPUs, definitely read the blog post, scaling Reinforcement Learning with to Forge on Corporate Cloud. And obviously those runs are now supported on Sunk, which is our way to provide the this thing. so we had the partnership with them. we provided 520 VH one hundreds, on our platform for this research. so look at Torch Forge if you are using the Core V infrastructure. sunk is, is, I think it's slum on Kubernetes, which is what, what they call sunk, is what we, what we give, folks. And, you are able to now use this new Torch Force Library. so shout out to the team that did this and shout out to corwith, for helping supporting this important research because we need alternatives because AEL is here. And also AI engineer code. Not this week's buzz, but we're sponsoring, we're a proud gold sponsor of that conference. please join us in New York. ThursdAIs also gonna be there covering this as we did for every AI engineer since it's fucking existence. I'm very proud of this. Alright folks, this is, this week's buzz. Let's talk about audio.
1:13:25
there's a reason I played this twice because earlier this week LDJ DMed me on X and said, you know what, Alex, you asshole. I found myself in the middle of the day just like singing to myself in my head in this wing. so yeah, this, this works. LDJ. Hopefully you'll also be dream about this. yeah. Marketing works. Yeah. Alright, folks, we are moving on. We have, we have like 15 minutes. We have some time to talk about some video releases, including, very important ones. But before this, it's kind of his video. Qwen, I ask you to come here though. You're not the purveyor of this new technology, but they definitely use podcast for some of this. we saw a company called De Cart AI that we talked about before. They did, realtime mine ization of video streams. They did a realtime diffusion model. they're doing like a bunch of interesting stuff. they have released something very cool, which is real time lip sync, augmentation. And, that works very well for. This. If we add AI in here, there's two ways to add this. One of them is gonna be voice. We'll only hear this. but the other one is to put a character. If we put a character when it says something, it needs to look like it's saying something that's like basically lip sync. it's possible in the world of offline. So like a bunch of models now, pretty much SOA and VO and like all these, they nailed this completely. So the, the, the lip sync is perfectly sync to what the character is saying, but it wasn't as easy before. we chat about like a bunch of open source stuff that do this. and now they added this in real time. have you seen this release? What's your participation level like? do you have a demo? Tell me everything.
Kwindla Kramer
Kwindla Kramer 1:14:54
build a demo last night in a Waymo while I was driving here.
1:14:57
No way. But I don't think we should know the demo live. 'cause this morning it looks to me like they're maybe having some GPU issues. We need to hook 'em up with CoreWeave. That's a good problem to have after a launch. there's a nice clip though I sent you if you want to show it Sure. That they posted on Twitter. That gives people a sense of like what this feels like and then we can talk about the tech. 'cause the tech is really fun.
Alex Volkov
Alex Volkov 1:15:16
The tech is really fun.
1:15:17
Let me actually show this in a way where we can, we can hear, because I think this is very important. so basically this is an example of their live stream. You can go to their
Kwindla Kramer
Kwindla Kramer 1:15:28
playground, you can do exactly this thing in their
1:15:30
playground or through the API.
Alex Volkov
Alex Volkov 1:15:32
here, let's go.
1:15:33
we're gonna do a tab and we're gonna do this. Okay. Hopefully you guys can see, yes. And then let's play this. Hey, can you hear me? All right. Cool. So yeah,
Decart
Decart 1:15:43
this is real time lip movement.
1:15:45
No pre-render, no editing. My lips move perfectly in sync with the audio.
Alex Volkov
Alex Volkov 1:15:50
Hey, can you, how Qwen explain this to me.
1:15:52
Like how is this possible? I don't get it.
Kwindla Kramer
Kwindla Kramer 1:15:55
architecture diagram?
1:15:56
I sent you a, like a little architecture diagram too. I mean, so the, the interesting thing here to me about the tech is that everything we're now building in like real time conversational AI is multiple models. And so you're, you're trying to put together multiple models in like a processing graph or a processing pipeline. So what they released yesterday at Decart is this lip sync model that can operate in real time on a pretty wide variety of, of video inputs. So you got the demo that you just showed, Alex, you can make your own video and you can stick it into this pipeline. They have an element in the pipeline that pulls the frames out of the video, runs a model inference on each of those frames and adjust the lip position outputs new pixels. We're seeing this pattern even in just voice stuff. So we want to be really fast in a conversational voice AI app, but you're doing things like tool calling. You're starting to build multiple specialized models to do different pieces of even a voice interface. And as soon as you go to image input, image output, video input, video output, robotic sensing, you're definitely in the world of multiple models. So my touchpoint here is that we work with sort of everybody who does this kind of stuff because pipe Kat's a open source framework. You can plug into it. Descartes did a really nice, cookbook. You can clone it from GitHub. it's got very clean pipe code. You can modify any of those elements in the pipeline. You can write your own elements or you can just take it as is and make a YAML file. And from that YAML file, you get a new character you can talk to on a video call or integrate into any kind of real time app. there we are at the point with real time video that we were like 18 months ago with voice, where if you know exactly what you're doing, you can put all the pieces together and you can build something really compelling. But it's super fragile. Mm-hmm. And it takes kind of expert knowledge and just the right models. Like I spent a bunch of time trying to. Make a perfect video character that the lip sync would be as good as their demo characters. I didn't quite even get there, right. Because there's like some specialized knowledge about how that model works. so we're like almost there. I really think over the next 18 months we're gonna have a explosion of real time video because when you do get it right, it's so compelling. And especially for social applications, games, NVIDIA's doing a bunch of really cool stuff with on device rendering that is like this pipeline, in fact they're using Pipe Kat for a fair amount of that kind of, you know, on the bleeding edge Invidia character rendering stuff. I think we're just gonna see so much stuff that's hard to imagine coming together over the next 18 months.
Alex Volkov
Alex Volkov 1:18:26
I absolutely agree and I think when I saw that this is like
1:18:29
real time lip sync where I know that like you previously have to run, I don't know, five, five B or even like 14 B models locally to, to pre-process to give you an output and that's not real time and this is real time. I was like very excited to see this. L did you have any comments on this? I saw you go off mute.
LDJ
LDJ 1:18:49
No, it just accidentally clicked it.
Alex Volkov
Alex Volkov 1:18:50
Ah, cool.
1:18:51
so we just, the diagram that you have for us Qwen, heres the diagram. basically the request goes through, web RTC component transfer, and then, they run it to whisper, speech to text, and then they run it to GR lamp to understand what, what you asked it.
Kwindla Kramer
Kwindla Kramer 1:19:06
So uses the pipe cat small web RTC transport, which
1:19:09
is a zero dependency peer-to-peer web RTC transport that just does a direct connection to a server. In this case, when you're running this, cookbook, you're probably just running the server locally. And then you're running a web client locally and you're just making a UDP network connection between them. Then the server component strings together the models and the rest of the processing graph. so they run, you're running Whisper locally the way they did the cookbook. Then you're calling out to Grok for the LLM to generate text
Alex Volkov
Alex Volkov 1:19:36
a response to what the user asked.
Kwindla Kramer
Kwindla Kramer 1:19:38
exactly.
1:19:38
So text in from the user grok text out for what the avatar's gonna say. Then they run it through, 11 labs for the voices. So you're getting a voice generated by 11 labs. you're putting the video frames together with that voice in their lip sync processor, and then you're sending that. time synced audio and video back to the user through the web RTC transport.
Alex Volkov
Alex Volkov 1:20:00
How fast does this actually work?
1:20:02
Like you've tested it out. We only saw videos prerecorded, which is no different from a prerecorded lip sync. How fast, like what, what is the round trip around all of this for the user?
Kwindla Kramer
Kwindla Kramer 1:20:11
I think what I was seeing in the Waymo on a cellular data connection
1:20:14
was sub sub two second, which is good. once you get down to about one second, it feels completely instantaneous, but sub two second is pretty impressive.
Alex Volkov
Alex Volkov 1:20:23
Wow, that's very impressive.
1:20:24
I think, one thing that I wanted to also ask you about, we've seen an animated AI character in grog. There's Annie, there's Valentino, that's not the same thing, right? Like they are animating the lips with a 3D software that was like, you know, receives the audio and then understands the phone names or whatever, and then modulates the news based on that. But that's not a diffusion model changing the lips. That's just like a 3D model that's baked in. This is completely different. Yes.
Kwindla Kramer
Kwindla Kramer 1:20:50
Like the, the, the Elon GR stuff is like a puppet
1:20:55
and you are driving that puppet with digital com control points. Yeah. 3D mesh. It, it's great. This is basically video to video. So if I give you a reference video, you can turn that reference video into. Character.
Alex Volkov
Alex Volkov 1:21:09
And, in the next segment we're gonna watch a few video
1:21:11
models that are now doing perfect lip sync, including the one from LTX, including the one from Reve. and you know, I use the same image of me sitting in here, to judge how closely they match the person. those models all do lip sync almost perfectly now. It's quite incredible how far we moved from a model to generate something and it doesn't match the voice to perfect lip sync.
Kwindla Kramer
Kwindla Kramer 1:21:31
The funny thing for me is we're still, I don't
1:21:34
think for photorealistic lip sync, any of the models are quite on the other side of the uncanny valley. They're good and you can certainly use them for use cases where, you know, people are not super concerned about that. But what I would like to see, even before the photorealistic, you know, video folks get all the way there is. Our brains process like cartoon images and non-photo real realistic images completely differently. And you can get away with a lot with lip sync in a non photorealistic image. And it's still really, really compelling. I'd love to see people working on these models really push in the non photorealistic direction and get that all the way nailed with like, really, really low latency, really reliable models for any kind of cartoonish input. I actually think that would unlock a bunch of super interesting stuff. But it's a little like, you know, the big labs train their LLMs on certain things that they care about the video folks really want to get to perfect photorealism. And I'm like, I get it. And I'm excited about that too, but like, spend a little time on the non photorealistic. 'cause we could do a lot with it.
Alex Volkov
Alex Volkov 1:22:36
I think that this is a perfect segue to our next segment,
1:22:39
which is gonna be about video. and definitely let's keep chatting about this as we show this, the next, piece of our, Thursday segment is video obviously, as you guys understood. And the bill I think is From soa, Soa, I think I'll find it like his, he posted about the next SOA roadmap update, and you'll see something exactly like this. So, SOA obviously has cameos. cameos in SOA are only people. You need to scan your face. You need to read out the three numbers. You need to say your voice in order for it to create a cameo. You cannot upload images to SOA or videos. You can upload images, but they can't have human faces in it. So basically they took a lot of steps. So that SOA feed is only SOA generated a hundred percent. and then the cameras are only your cameo. you can only use other cameos if people open them up. the next update to SOA though is gonna have something coin that you mentioned, very, very distinctly. and Sora is perfect lip sync, I believe like the uncanny valley on the other side of this. If I show somebody a video of Sora, they're not sure if it's reality, or not. If I show somebody reality, they're not sure if it's Sora or not. Like I had both of these I talked about on the podcast with Sam Mann. the next iteration of soa though, they will introduce, cameos that I can upload of, not of people, but of characters. And this is kind of how it looks and looks super cool. I'm just gonna show you guys, hopefully you'll see in here.
sora
sora 1:23:49
Hi there.
1:23:50
I'm a character cameo, Anything can be a character
Alex Volkov
Alex Volkov 1:23:53
cameo.
1:23:53
So for folks who are listening right now we're watching a sunny side up egg in the frying pan, sizzling with the eyes as the, the yolks of the eggs. And this face has lip sync that says, and I'm a cameo too.
sora
sora 1:24:06
The O2, anything can be a character.
1:24:08
Camille can even upload your pets as cameos and put them into any scene you want. Character cameos coming soon to soa.
Alex Volkov
Alex Volkov 1:24:18
I think that this is, hopefully you guys
1:24:21
saw this and heard this. I think that this is absolutely the next step in, solar generation. 'cause people will just create incredible ones and share them and then people will reuse them and we'll see the rise of the cameo world, We'll see the social dynamics of that. generally this is also a way to get to the place where it's like animated stuff. That looks very good. The egg on the frying panel was insane.
Kwindla Kramer
Kwindla Kramer 1:24:41
I feel like it's not a big leap for what the models can
1:24:44
do today for those to be real time. And when I was talking about not quite over the uncanny valley, I was mostly thinking about the realtime stuff,
Alex Volkov
Alex Volkov 1:24:50
So you can like, have them on with us as co-hosts for ThursdAI and
1:24:53
chat with them like we're doing right now.
Kwindla Kramer
Kwindla Kramer 1:24:55
you have the Sunnyside up egg be a co-host?
1:24:57
A hundred percent.
Alex Volkov
Alex Volkov 1:24:57
First of all, I would animate LGAs, cat Avatar
1:25:00
that's been sitting with us. Nobody here knows what LD looks like. We would just like animate this. I would be able to give LDJ the anonymity he wants to without compromising, while also seeing some animated.
LDJ
LDJ 1:25:10
You've seen me.
Alex Volkov
Alex Volkov 1:25:11
Just me.
1:25:12
Yes. But we have, yeah, we have, 7,000 views for today's stream. And none of them, none of them know what the co-host looks like, which is fine. Maybe for the three year anniversary for Thursday. all right, so this, this was from Sora. we definitely want to get to the point where it's real time Qwen. it's gonna be there. Like, it's clear to all of us that the trajectory is clear. we're moving towards real time ability to chat with those characters, with those cameos, with those AI generations. and it's like, it's real time. speaking of real time, we should move on to the next thing that we have, which is breaking news, from, today with LTX, We talked about electrics, the company, a couple times ago. LTX has a studio of video production, that you can plug in any other video model into called the LTX Studio. they also released their model. So we talked about the realtime model and they released ltx two. I really wanna show you l TX two because I think it's like super, super cool, they said we're more than a studio. We also have, the model LTX is a major breakthrough. It's fast generation, native 4K lip sync and high frame rate, up to 10 seconds of video. let's take a look at some examples.
1:26:22
I just hope this. So here, here's a few of the highlights here. Native 4K. Most of the video models that we currently know about, they are upscaling to even 10 80 PI think VO is not even 10 80 p native or it only starts to be, this is native 4K, which is very, very impressive. but they also mentioning that they have three versions of this fast pro and ultra that's coming soon, supposedly, fast to move quickly, ultra whenever pixel counts. So you can iterate fast on an idea, on prompts, and kind of the same prompt will get you the same results. In the Ultra model, they claim to get the fastest generations on the market, the speed remove blocker. So definitely one thing about Sora and Sora Pro was released what, two weeks ago. waiting for a solar generation to see whether or not your prompt hit exactly like you wanted it, is excruciatingly painful as an iterative process for iteration, you need some speedy stuff. VO three is good at this. V has two models and one of them is faster. You can like really iterate with vo you can try a bunch of stuff. Oh no, I'm actually gonna change the, the prompt, et cetera. really hard to iterate with SOA because it's so slow, especially with SOA Pro. and LTX is claiming that they have near realtime, speed. One of the faster generations on the market is take a look at that video so quick, like seven seconds. They call it faster than real time, which is an interesting choice. the previous version of LTX, the one that was released in, open source, I think the, do you guys remember off the top? I'm trying to remember. There was a 5 billion version of in a 12 to 14 range billion version of LTX. the 5 billion was generating videos under five seconds. So they would generate five second videos under five seconds, which was, faster in real time. And what else From LTX, native 4K and 1440 p and 10 80 p. it's very, very cool. And the last thing is what I told you about guys. Lip sync is coming everywhere. Sound is coming everywhere. This model supports sound and lip sync. Let's take a look.
1:28:25
Bring it on. You said milk made from oats. So native lip sync as well with high frame rate. Is it open source though? I don't believe it is. it's only in their LTX studio.
Kwindla Kramer
Kwindla Kramer 1:28:38
This waits are coming later this fall.
1:28:40
Yeah, so the training
Alex Volkov
Alex Volkov 1:28:41
code's coming, which is awesome.
1:28:43
Oh, for custom things, like, kind of like when we talk about cameral, like folks should understand, like, is this possible open source models via lus? Like we've trained custom video things via lus. If you go to civic ai, for example, there's custom video for v from, a bunch of video models. the open source community will get to that level. It's not gonna be as smooth as cameos in the beginning, but we'll get to the point where like we can, train this coffee cup with the smiley on his face to also talk.
Kwindla Kramer
Kwindla Kramer 1:29:08
I think it's really important 'cause you can't get
1:29:10
all the way there with prompting with image and video models. Like prompting is a skill and really important. But if you want a particular style, a particular, look, a particular way of camera movement, at least as far as I am able to do it, you can't get all the way there with prompting. You need to do some fine tuning some Laura stuff. Something.
Alex Volkov
Alex Volkov 1:29:29
Yeah.
1:29:30
Laura is definitely for character consistency is one of the biggest like hurdles for character consistency. alright, I wanna comments on this folks super quick before I move on. Comments on LTX?
Yam Peleg
Yam Peleg 1:29:41
Yeah, very impressive.
1:29:42
I wanted to share screen, just the, from their website, the officers as strategic advantage. confirm that weights and training code releasing later this fall. Absolutely.
Alex Volkov
Alex Volkov 1:29:54
Yeah.
Yam Peleg
Yam Peleg 1:29:54
Oh, I, seriously, amazing.
Alex Volkov
Alex Volkov 1:29:56
I think I will have to bring them on.
1:29:59
I think I know some of the folks involved in LTX, and I think I'll have to bring them on, for this because real time video model of this quality is definitely very, very impressive. So shout out to them. folks, we're almost at the end. we have, we talked about soa, we talked about thing. I wanna show you another unannounced release like I told you about how to get Chat GPT to talk, to understand videos. And yamas now gonna use this everywhere. there's also an analysis release that it's there, so I don't feel bad. I don't have like power, knowledge, et cetera. We, I, I talked to you about ve I was also able to show you some examples of ve, but I really wanna like show, an example of ve the interface, I use Rev to regenerate thumbnails for the show. So this week's a thumbnail was created for the show, and you can see kind of the progression. I uploaded an image of myself here, and then an image here, and then an image here, and I asked it, and now it can like, use my face. This is not Laura, this is not my face trained on it. This is just the image getting context from multiple images as input and understanding is. but I learned this week that Rev also has video. When you go in there, there's this button create video and, let's see if this is the video one. Yeah, this is how it looks. So you can see the apple in my image that I recorded before. And, I believe that this is one of my best lightning moments, in my studio. it's no longer as high quality lightning. I need to get it there. and basically I created this video. It took a while. It's way longer than, VO three by example, but I think you can hear this also and look at the lipstick as well. Welcome everyone to thrust DAI. October doesn't, 23rd doesn't, it can say Thursday. I, which is by the way, so is great at this. The lip sync is really, really good for a model that like, you know, the, the, this ncees the name of ThursdAI, for Revit. And, yeah, this is, it take a while. It looks like their own model. It doesn't seem like they're using anything, below and coming out of the gate with a model that also does videos. Very impressive. Shout it to Revit. I think I showed you this. Guys, I'll show you this again. This was my, template for this week's, thumbnail. And then the cool thing about Rev is when the edit mode, it understands the things in the picture, and you can just talk to it like with nano banana. But the cool thing is you can add objects with the thing and just say, I want a picture of a cat on a yellow. Background in a circle, you can just add things. And the editing interface will allow you to just like move this where you want to add this. And you can say, cut in circle inside, inside this monitor. So like, let's say we'll have a legit, so we will understand where you pointing and understand what you want to put there and the apply edits. I think that this way of editing, with, nano doesn't exist and the this way of editing with, C Dream, four also doesn't exist. this is like Photoshop for AI editing and this is why I keep using Rev. And you see the new model, like very quickly just added another like, cat here. It did remove the previous monitor, but it added another cat. Exactly why I wanted it. Exactly with the yellow background, this is a, shadow to LDJ a different cat though. We covered pretty much everything on the show. There was no breaking news. It looks like from the comments, folks are, yeah. Jose said, did you see the Genie three is released? Genie three is not released, but there was a, there was like, they're preparing to release this, I think. so Genie three is the generative world model from Google. And, this, it apparently based on testing catalog, they started to work on, on, on working, on releasing this. So, let me show you how this looks. Here. testing catalog finds all kinds of things, that people before they like, ready to experiment, but it looks like a genie experiment is, is coming and you'd be able to create your world and walk in this world with Genie three something. I'm, I'm so looking forward to testing it out and telling you guys when it comes, it looks like it's, it's about to happen. It's coming soon. And also, other three, coded, 3.0 coded like we keep waiting for Gemini three as well. keep waiting for that. That's gonna come soon. But with that, I think folks we're over our time a little bit and, wanted to say huge thanks to the folks who came and made the show. Paul Klein was here talking about browser base. Nisten had to drop LDJ and Yum and everybody from the almost 7,000 folks that tuned in, to the livestream, which is incredible. thank you so much for joining Thursday from week to week. we have the distinct pleasure of having to follow up with AI so that we keep you up, up to date with AI as well. So this is like a, a benefit to us as well. if you are new to any part of the show, the show is turning into a newsletter and a podcast immediately on the same Thursday. So if you are new and you wanna not miss any show anymore or any part of Thursday, please subscribe to Thursday. I news. to get it, if you missed an episode or missed any part of the show. And with that, I will say goodbye. Thank you so much folks, and we'll see you here next week. Cheers. Bye bye.