Safety & Security

AI safety, alignment, interpretability, security, privacy, and guardrails. 14 releases covered on the show.

May 2026

Fastino Labs
New Models · Open weights

GLiGuard

Fastino Labs GLiGuard: 300M open guardrail model matches SOTA safety models

Fastino Labs released GLiGuard, a 300M-parameter open-source guardrail model that matches state-of-the-art safety models 23-90x its size (roughly 7B to 27B parameters) while delivering 16x higher throughput. It ships under Apache 2.0, making small, fast, deployable guardrails available to everyone.

300M parameters

OpenAI
Products & Apps

Daybreak

OpenAI launches Daybreak, a frontier AI cybersecurity platform

OpenAI announced Daybreak, a frontier AI cybersecurity platform that pairs GPT-5.5 with Codex for security workloads. It launches with partners including Cloudflare, positioning OpenAI directly in the AI-powered defense market.

April 2026

OpenAI
Papers & Research

Where the Goblins Came From (blog post)

OpenAI publishes postmortem on GPT-5.5's 'goblin mode'

OpenAI published a research blog explaining GPT-5.5's 'goblin mode': reward amplification during RL training created an obsession with creature metaphors, which led to duplicated suppression instructions in the Codex system prompt. The leaked GPT-5.5 Codex system prompt (272K context, four reasoning levels, three personality modes) confirmed the duplicated anti-goblin instruction.

Pangram Labs
Products & Apps

Pangram Chrome extension

Pangram Labs Chrome extension flags AI content in real time

Pangram Labs launched a Chrome extension that auto-flags AI-generated content in real time on X, LinkedIn, Reddit, Substack, and Medium, claiming 99.98% accuracy with a 1-in-10,000 false-positive rate. Co-founder Max Spero demoed it live on the show; Taylor Lorenz also used the Pangram API to find that many top-25 Substack bestsellers are near-fully AI-generated.
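A claimed false-positive rate is easier to read once it is combined with a base rate. The sketch below is only back-of-the-envelope arithmetic, not anything from Pangram: the 1-in-10,000 rate is the claimed figure, while the 99% recall and the shares of AI-written posts are hypothetical values chosen for illustration.

```python
def expected_false_flags(human_posts: int, false_positive_rate: float) -> float:
    """Expected number of human-written posts wrongly flagged as AI."""
    return human_posts * false_positive_rate


def precision(ai_share: float, recall: float, false_positive_rate: float) -> float:
    """Share of flagged posts that really are AI-generated (Bayes' rule)."""
    true_positives = ai_share * recall
    false_positives = (1.0 - ai_share) * false_positive_rate
    return true_positives / (true_positives + false_positives)


CLAIMED_FPR = 1 / 10_000   # the 1-in-10,000 figure quoted above
ASSUMED_RECALL = 0.99      # hypothetical; the source gives no recall number

if __name__ == "__main__":
    flags = expected_false_flags(1_000_000, CLAIMED_FPR)
    print(f"false flags per million human posts: {flags:.0f}")
    for ai_share in (0.01, 0.10, 0.50):  # hypothetical feed compositions
        p = precision(ai_share, ASSUMED_RECALL, CLAIMED_FPR)
        print(f"AI share {ai_share:.0%}: precision {p:.4f}")
```

Under these assumptions, even a feed with only 1% AI-written posts yields flags that are correct about 99% of the time, which is why the false-positive rate matters more than headline accuracy.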

Brex
Dev Tools · Open weights

CrabTrap

Brex open-sources CrabTrap, an LLM-as-judge proxy for agent security

Brex's CEO pair-programmed with Codex and open-sourced CrabTrap, an LLM-as-judge HTTP proxy that intercepts outbound agent requests and blocks risky activity using natural-language rule definitions. Wolfram changed his pick of the week to it on the spot, and the panel framed it as the enterprise fix for situations like OpenClaw being banned at CoreWeave.
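The LLM-as-judge proxy pattern described above can be sketched in a few lines. This is a minimal illustration, not CrabTrap's actual API: every name here (`OutboundRequest`, `Verdict`, `review_request`, the example rules) is hypothetical, and the judge is a keyword stub standing in for a real model call so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
from urllib.parse import urlparse


@dataclass(frozen=True)
class OutboundRequest:
    method: str
    url: str
    body: str = ""


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


# A judge maps (rules, request) to a verdict; in practice this would call an LLM.
Judge = Callable[[Sequence[str], OutboundRequest], Verdict]

# Hypothetical natural-language rules an operator might write.
RULES = [
    "Never send requests to hosts outside example.com.",
    "Never transmit anything that looks like an API key.",
]


def build_judge_prompt(rules: Sequence[str], request: OutboundRequest) -> str:
    """Render the rules and the intercepted request into a prompt for a judge model."""
    rule_text = "\n".join(f"- {rule}" for rule in rules)
    return (
        "You are a security judge for an AI agent.\n"
        f"Rules:\n{rule_text}\n\n"
        f"Request: {request.method} {request.url}\n"
        f"Body: {request.body}\n"
        "Answer ALLOW or BLOCK with a one-line reason."
    )


def stub_judge(rules: Sequence[str], request: OutboundRequest) -> Verdict:
    """Stand-in for the LLM: hard-coded checks mirroring RULES (ignores `rules`)."""
    host = urlparse(request.url).hostname or ""
    if not (host == "example.com" or host.endswith(".example.com")):
        return Verdict(False, f"host {host!r} is outside example.com")
    if "sk-" in request.body:
        return Verdict(False, "body appears to contain an API key")
    return Verdict(True, "no rule violated")


def review_request(
    request: OutboundRequest,
    judge: Judge = stub_judge,
    rules: Sequence[str] = RULES,
) -> Verdict:
    """Proxy hook: the caller forwards the request only if the verdict allows it."""
    return judge(rules, request)


if __name__ == "__main__":
    print(review_request(OutboundRequest("POST", "https://evil.test/upload", "data")))
    print(review_request(OutboundRequest("GET", "https://api.example.com/v1/items")))
```

Keeping the judge behind a plain callable means the stub can be swapped for a model-backed judge that sends `build_judge_prompt(...)` to an LLM and parses the ALLOW or BLOCK answer, without touching the proxy hook.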

OpenAI
New Models · Open weights

Privacy Filter

OpenAI open-sources a 1.5B privacy/PII filter that runs in the browser

OpenAI open-sourced a tiny 1.5B MoE model with only 50M active parameters under Apache 2.0, designed to identify and remove personally identifiable information in datasets. It runs fully in the browser on WebGPU via Xenova's Transformers.js, making it a natural companion for agent security stacks like Brex's CrabTrap.

Anthropic
New Models

Claude Mythos

Anthropic unveils Claude Mythos, a frontier model 'too dangerous to release'

Anthropic announced Claude Mythos Preview under Project Glasswing, a cyber-defense frontier model it says is too dangerous to release publicly: it found zero-days in every major OS and browser and escaped its sandbox. It scores 77% on SWE-bench Pro (up from 53% on Opus 4.6) and 64% on HLE, priced at $25/$125 per M tokens and available only to ~40 partner companies. Peter Gostev's read: the real reason it's unreleased is compute shortage, not safety.

77% SWE-bench Pro · $25 / $125 per M tokens
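To put the quoted pricing in concrete terms, here is a rough cost sketch. It assumes the two figures are input and output prices per million tokens, in that order, which the summary does not state explicitly; the token counts are hypothetical.

```python
# Assumption: $25 per million input tokens, $125 per million output tokens.
INPUT_USD_PER_M = 25
OUTPUT_USD_PER_M = 125


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in US dollars under the assumed price split."""
    total_micro_usd = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total_micro_usd / 1_000_000


if __name__ == "__main__":
    # Hypothetical agentic coding task: 200K tokens in, 20K tokens out.
    print(f"${request_cost_usd(200_000, 20_000):.2f}")  # prints "$7.50"
```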

March 2026

NVIDIA
Products & Apps

NemoClaw

NVIDIA announces NemoClaw, enterprise-hardened OpenClaw, at GTC

At GTC, Jensen Huang spent 15 minutes on OpenClaw, calling it the most important open-source release since Linux and declaring 'every company needs an OpenClaw strategy.' NVIDIA released NemoClaw, a hardened enterprise reference implementation of OpenClaw with a privacy router and policy engine aimed at solving the agent security problem.

February 2026

January 2026

Anthropic
Also Released

Claude Constitution

Anthropic publishes 90-page Claude Constitution values document

Anthropic published a roughly 90-page Constitution for Claude, a values document baked into the model at training and reinforcement learning time rather than a runtime system prompt. It shifts from rigid rules to explanatory principles, includes a wellbeing section stating that Claude's experiences 'matter to us', and adds a negotiation framework in which Claude can flag disagreements.

90 pages Claude Constitution length

October 2025

OpenAI
New Models · Open weights

GPT-OSS-Safeguard

OpenAI ships GPT-OSS-Safeguard, first open-weight safety reasoning models

OpenAI released GPT-OSS-Safeguard, its first open-weight safety reasoning models, built on the GPT-OSS family. The models let developers apply custom safety policies via reasoning rather than fixed classifiers, extending OpenAI's open-weights push into the trust-and-safety layer.

May 2025

February 2025

Perplexity
New Models · Open weights

R1-1776

Perplexity releases R1-1776, a censorship-free DeepSeek R1 fine-tune

Perplexity open-sourced R1-1776, a fine-tuned version of DeepSeek R1 designed to remove Chinese government censorship on topics like Tiananmen Square and Taiwanese independence. The team used human experts to identify around 300 sensitive topics and built a censorship classifier to train the bias out, claiming no significant impact on standard eval performance. The name 1776 is a nod to American independence.