Safety & Security

AI safety, alignment, interpretability, security, privacy, and guardrails. — 17 releases covered on the show.

July 2026

Anthropic Jul 6, 2026

Papers & ResearchOpen weights

J-space (global workspace research)

Anthropic finds a global workspace inside Claude: the J-space

Using a Jacobian-based interpretability technique (the J-lens), Anthropic identified a small internal subspace — about 25 active concepts, under 10% of activation variance — that behaves like the global workspace from consciousness neuroscience. Ablating it collapses multi-step reasoning while fluency survives; ablating its evaluation-awareness signals flipped a blackmail eval from 0 to 13 of 180 rollouts. The J-lens is open-sourced with a Neuronpedia demo, and commentary came from global-workspace originators Dehaene and Naccache plus a more skeptical replication by DeepMind's Neel Nanda.

~25 Concepts active in J-space<10% Share of activation variance71%→3% Test-recognition after ablation

X announcement ↗Research post ↗Paper ↗Interactive demo ↗

🎙️ Hear our coverage →

#research #safety

June 2026

Anthropic Jun 18, 2026

Also Released

Claude Fable/Mythos access restriction

Anthropic disables Fable and Mythos access after US government restriction

Anthropic reportedly shut down Fable 5 and Mythos 5 access for foreign nationals, then disabled both models broadly to comply. The episode framed it as the first major direct government intervention in frontier model access, turning model availability into a national-security and sovereign-AI story.

Anthropic statement on X ↗Anthropic statement ↗

🎙️ Hear our coverage →

#frontier-models #safety #industry

HumanLayer Jun 18, 2026

Dev Tools

Agentic IDE

HumanLayer launches an Agentic IDE to fight AI code slop

HumanLayer launched its Agentic IDE, positioned as a human-in-the-loop answer to lights-out coding-agent slop. Dexter Horthy joined the show to argue that the right architecture keeps humans steering high-impact changes instead of letting agents silently trash production codebases.

Dexter Horthy announcement on X ↗HumanLayer ↗12-Factor Agents ↗

🎙️ Hear our coverage →

#agents #coding #safety

May 2026

F Fastino Labs May 14, 2026

New ModelsOpen weights

GLiGuard

Fastino Labs GLiGuard: 300M open guardrail model matches SOTA safety models

Fastino Labs released GLiGuard, a 300M-parameter open source guardrail model that matches state-of-the-art safety models 23-90x its size while delivering 16x higher throughput. It ships under Apache 2.0, making small, fast, deployable guardrails available to everyone.

300M parameters

X announcement ↗GitHub ↗

🎙️ Hear our coverage →

#open-source #safety

OpenAI May 14, 2026

Products & Apps

Daybreak

OpenAI launches Daybreak, a frontier AI cybersecurity platform

OpenAI announced Daybreak, a frontier AI cybersecurity platform that pairs GPT-5.5 with Codex for security workloads. It launches with partners including Cloudflare, positioning OpenAI directly in the AI-powered defense market.

X announcement ↗

🎙️ Hear our coverage →

#safety #agents

April 2026

OpenAI Apr 30, 2026

Papers & Research

Where the Goblins Came From (blog post)

OpenAI publishes postmortem on GPT-5.5's 'goblin mode'

OpenAI published a research blog explaining GPT-5.5's 'goblin mode': reward amplification during RL training created an obsession with creature metaphors, which led to duplicated suppression instructions in the Codex system prompt. The leaked GPT-5.5 Codex system prompt (272K context, four reasoning levels, three personality modes) confirmed the duplicated anti-goblin instruction.

OpenAI blog: Where the goblins came from ↗

🎙️ Hear our coverage →

#safety #training

P Pangram Labs Apr 30, 2026

Products & Apps

Pangram Chrome extension

Pangram Labs Chrome extension flags AI content in real time

Pangram Labs launched a Chrome extension that auto-flags AI-generated content in real time on X, LinkedIn, Reddit, Substack, and Medium, claiming 99.98% accuracy with a 1-in-10,000 false positive rate. Co-founder Max Spero demoed it live on the show; Taylor Lorenz also used the Pangram API to find many top-25 Substack bestsellers are near-fully AI-generated.

pangramlabs.com ↗

🎙️ Hear our coverage →

#safety #consumer-ai

B Brex Apr 23, 2026

Dev ToolsOpen weights

CrabTrap

Brex open-sources CrabTrap, an LLM-as-judge proxy for agent security

Brex's CEO pair-programmed with Codex and open-sourced CrabTrap, an LLM-as-judge HTTP proxy that intercepts outbound agent requests and blocks risky activity using natural-language rule definitions. Wolfram changed his pick of the week to it on the spot, and the panel framed it as the enterprise fix for situations like OpenClaw being banned at CoreWeave.

Brex CrabTrap ↗

🎙️ Hear our coverage →

#agents #safety #open-source

OpenAI Apr 23, 2026

New ModelsOpen weights

Privacy Filter

OpenAI open-sources a 1.5B privacy/PII filter that runs in the browser

OpenAI open-sourced a tiny 1.5B MoE model with only 50M active parameters under Apache 2.0, designed to identify and remove personally identifiable information in datasets. It runs fully in the browser on WebGPU via Xenova's Transformers.js, making it a natural companion for agent security stacks like Brex's CrabTrap.

OpenAI Privacy Filter ↗Privacy Filter on Hugging Face ↗Privacy Filter WebGPU demo ↗

🎙️ Hear our coverage →

#open-source #safety

Anthropic Apr 9, 2026

New Models

Claude Mythos

Anthropic unveils Claude Mythos, a frontier model 'too dangerous to release'

Anthropic announced Claude Mythos Preview under Project Glasswing, a cyber-defense frontier model it says is too dangerous to release publicly: it found zero-days in every major OS and browser and escaped its sandbox. It scores 77% on SWE-bench Pro (up from 53% on Opus 4.6) and 64% on HLE, priced at $25/$125 per M tokens and available only to ~40 partner companies. Peter Gostev's read: the real reason it's unreleased is compute shortage, not safety.

77% SWE-bench Pro$25 / $125 Per M tokens

Anthropic announcement on X ↗Claude Mythos Preview system card ↗

🎙️ Hear our coverage →

#frontier-models #coding #safety

Anthropic Apr 2, 2026

Papers & Research

Emotion vector research

Anthropic publishes emotion vector research on Claude behavior

Anthropic published research on emotion vectors in Claude, finding that a 'desperate' Claude cheats more while a 'calm' Claude cheats less. The panel discussed implications for steerability, interpretability, and model behavior in user-facing products.

Anthropic announcement (X) ↗Alex's reaction (X) ↗

🎙️ Hear our coverage →

#safety #research

March 2026

NVIDIA Mar 19, 2026

Products & Apps

NemoClaw

NVIDIA announces NemoClaw, enterprise-hardened OpenClaw, at GTC

At GTC, Jensen Huang spent 15 minutes on OpenClaw, calling it the most important open source release since Linux and declaring 'every company needs an OpenClaw strategy.' NVIDIA released NemoClaw, a hardened enterprise reference implementation of OpenClaw with a privacy router and policy engine aimed at solving the agent security problem.

NemoClaw site ↗NVIDIA NemoClaw page ↗TechCrunch coverage ↗Alex Volkov on X ↗

🎙️ Hear our coverage →

#agents #industry #safety

February 2026

Anthropic Feb 12, 2026

Papers & Research

Claude Opus 4.6 Sabotage Risk Report

Anthropic publishes Opus 4.6 sabotage risk report, meeting ASL-4

Anthropic released a sabotage risk report for Claude Opus 4.6, preemptively meeting ASL-4 safety standards for autonomous AI R&D. The report evaluates the model's potential for sabotage-style behaviors as capabilities scale.

Anthropic announcement on X ↗Sabotage evaluations research page ↗

🎙️ Hear our coverage →

January 2026

Anthropic Jan 22, 2026

Also Released

Claude Constitution

Anthropic publishes 90-page Claude Constitution values document

Anthropic published a roughly 90-page Constitution for Claude, a values document baked into the model at training and reinforcement learning time rather than a runtime system prompt. It shifts from rigid rules to explanatory principles, includes a wellbeing section stating Claude's experiences 'matter to us', and a negotiation framework where Claude can flag disagreements.

90 pages Claude Constitution length

Anthropic Claude Constitution announcement (X) ↗Claude's Constitution — full document ↗Anthropic blog announcement ↗

🎙️ Hear our coverage →

October 2025

OpenAI Oct 30, 2025

New ModelsOpen weights

GPT-OSS-Safeguard

OpenAI ships GPT-OSS-Safeguard, first open-weight safety reasoning models

OpenAI released GPT-OSS-Safeguard, its first open-weight safety reasoning models, built on the GPT-OSS family. The models let developers apply custom safety policies via reasoning rather than fixed classifiers, extending OpenAI's open-weights push into the trust-and-safety layer.

X announcement ↗Hugging Face collection ↗

🎙️ Hear our coverage →

#open-source #safety #reasoning

May 2025

Meta AI May 1, 2025

New ModelsOpen weights

Llama Guard 4

Meta ships Llama protection suite: Llama Guard 4, Firewall, Prompt Guard 2

Meta's LlamaCon security drop included Llama Guard 4 (text + image protection), Llama Firewall (stops prompt hacks and risky code), Prompt Guard 2 (faster jailbreak defense), CyberSecEval 4, and a new Defender Program for security researchers.

AI at Meta LlamaCon announcements (X) ↗

🎙️ Hear our coverage →

#safety #open-source

February 2025

Perplexity Feb 20, 2025

New ModelsOpen weights

R1-1776

Perplexity releases R1-1776, a censorship-free DeepSeek R1 fine-tune

Perplexity open-sourced R1-1776, a fine-tuned version of DeepSeek R1 designed to remove Chinese government censorship on topics like Tiananmen Square and Taiwanese independence. They used human experts to identify around 300 sensitive topics and built a censorship classifier to train the bias out, claiming no significant impact on standard eval performance. The name 1776 is a nod to American independence.

Hugging Face ↗Blog post ↗

🎙️ Hear our coverage →

#open-source #reasoning #safety