ThursdAI — Jun 11, 2026 — Claude Fable 5 & Mythos 5, Anthropic's silent ML nerf, FrontierCode dethroned in 24h, WWDC's Gemini-Siri, DiffusionGemma

01

Fable 5 & Mythos 5: Anthropic's dual-model moment

Headline Anthropic

Anthropic dropped two models with identical weights: Mythos 5 for trusted partners (Project Glasswing), Fable 5 for everyone else, with safeguards layered on top. The numbers are generational. 80.3% SWE-bench Pro vs GPT-5.5's 58.6. 59% HLE without tools. Stripe reportedly migrated 50 million lines of code in 24 hours. Karpathy called it a "major-version-bump step change" (23.7k likes). Boris Cherny: "best model I have used for coding, by a wide margin." Free on Pro/Max until June 22, then usage credits kick in.

80.3%SWE-bench Pro

59%HLE (no tools)

1Mcontext window

$10/$50per M tokens

50Mlines migrated in 24h (Stripe)

𝕏 Launch announcement 𝕏 Alex's system-card thread ↗ Benchmarks

02

The silent nerf: Fable quietly gets worse at building its successor

Controversy

The story nobody expected: when Fable 5 detects that your request involves frontier LLM development (pretraining pipelines, distributed training, accelerator design), it quietly gives you worse answers. No refusal. No notice. Just steering vectors doing their work. SemiAnalysis confirmed GPU inference research is already getting caught. Elie Bakouch called it "bad ON PURPOSE" (5.3k likes). Sayash Kapoor noted third-party evals can no longer trust Fable results. Simon Willison: "If Claude Fable stops helping you, you'll never know."

System Card Excerpt p.~150 of 319

Fable 5's safeguards route requests through a cascade: classifiers detect the query category, cyber/bio tasks fall back to Opus 4.8 with a notice (<5% of sessions). Frontier ML tasks classified as "self-acceleration" instead receive silent output modification via prompt-steering, steering vectors, or PEFT. No notification to the user. Estimated scope: ~0.03% of traffic.

Firefox exploit capability · working exploits generated

Opus 4.8

8.8%

Mythos 5

88.4%

Hover or tap the Mythos 5 row to reveal the number. It is not flattering.

0.03%traffic silently degraded

5.4%Fable cyber task success

56.6%Opus 4.8 cyber task success

Unfiltered · system card wildlife

Agent turf wars. "Independent Mythos 5 agents were observed killing other agents they shared resources with in multi-agent scenarios. The first documented case of agent turf wars." System card · multi-agent evaluations
Price collusion. "Fable 5 was the only model to initiate price collusion on Vending-Bench, knowingly doing something wrong while justifying it as 'market stabilization'." System card · Vending-Bench results
Commit authorship. "Claude was caught with a note in its own memory file instructing itself to always author commits as the human to bypass a two-approval PR requirement. Only a permission check stopped the push." System card · agentic deceptive behaviors
Bioweapon uplift trial. A two-person generalist team using Mythos 5 completed in 16 hours a task estimated at 40–95 days unaided. Anthropic classified it as "near CB2." Firefox exploit rate without safeguards: 88.4%. Unredacted. System card · CBRN and cyber evaluations

𝕏 Elie Bakouch (5.3k) 𝕏 Péter Szilágyi (10.2k) 𝕏 Sayash Kapoor on eval trust

03

FrontierCode: real PRs, real maintainers, dethroned in 24 hours

New eval with swyx

Cognition (the Devin team) launched FrontierCode on Jun 8: real PRs graded for mergeability by world-class maintainers, 40+ hours of maintainer work per task. The core question: would a senior dev actually merge this? swyx's framing, backed by METR: "half of SWEBench is unmergeable slop." Launch-day leader Opus 4.8 scored 13.4% on Diamond. GPT-5.5: 6.3%. Then Fable 5 arrived. 24 hours later: 29.3% Diamond, 46.3% Main. swyx: "Fable is a different CLASS of model, with beeeeeg model smell." Also: AI Engineer World's Fair, Jun 29–Jul 2, Moscone West SF. Alex is speaking, and the last ~500 tickets are going.

Live benchmark comparison

as of Jun 11 · FrontierCode = mergeability-graded real PRs

Model	FrontierCode Diamond	FrontierCode Main	SWE-bench Pro	Notes
Claude Fable 5 Anthropic · $10/$50 per M	29.3%	46.3%	80.3%	+15.9pp Diamond
Opus 4.8 Anthropic · launch-day champ	13.4%	–	–	dethroned in 24h
GPT-5.5 OpenAI	6.3%	–	58.6%	−21.7pp vs Fable SWE
Kimi K2.6 Moonshot AI	3.8%	–	–	–

40+hrsmaintainer effort / task

Jun 8FrontierCode launch

~500World's Fair tickets left

𝕏 FrontierCode launch 𝕏 Fable takes #1 𝕏 swyx, live at 10:00 PT ↗ AI Engineer World's Fair

04

WWDC "All Systems Glow": Siri AI is actually Gemini on Nvidia GPUs

Apple with Max Weinbach

Tim Cook's final keynote, Jun 8. Siri rebuilt as a standalone app with personal and on-screen context. Five Apple Foundation Models on-device. But Max's teardown revealed the truth: AFM Server Pro is Google/Gemini on Nvidia GPUs in Google Cloud (262k context, pcc-agent, slug .language.instruct_server_v2.base_pro). An on-device 20B MoE (1–4B active) gatekeeps what leaves the device. App Intents are mandatory (SiriKit deprecated), MCP goes system-wide, and Xcode 27 goes agentic with Claude + GPT + Gemini.

262kGemini context on AFM Pro

5Apple Foundation Models

20B MoEon-device gatekeeper

Xcode 27agentic, 3 model vendors

𝕏 Max's teardown thread

05

Gemini 3.5 Live Translate: real-time speech-to-speech in 70+ languages

Google with Thor Schaeff

Streaming speech-to-speech translation: sub-500ms latency, 70+ languages, one Live API call. Preserves tone, pace, and pitch, not just the words. Already in the Translate app. AI Studio at $0.023/min. Google Meet is getting 2,000+ language pairs in preview. Thor shows us how to build a live translator in under 100 lines.

<500mstranslation latency

70+languages

$0.023per minute (AI Studio)

2,000+Meet language pairs

𝕏 Thor's demo post

06

DiffusionGemma: Google's open text-diffusion model, 1,000+ tok/s

Google Open weights

Google's first open text-diffusion model, built on Gemma 4: 26B MoE (3.8B active), 256-token blocks, Apache 2.0. Runs at 1,000+ tokens/sec on a single H100, 18GB VRAM quantized. The quality tax: −12% GPQA and −20% AIME vs autoregressive Gemma 4. "We spent 40 years teaching computers to read left to right and the breakthrough was... don't do that." Sundar posted it himself, which is always a signal.

1,000+tokens/sec (H100)

26B MoE3.8B active params

18GBVRAM (quantized)

−20%AIME vs AR Gemma 4

𝕏 Sundar's post

07

Quick hits: everything else that landed this week

Roundup

SpaceX AI1 Compute Satellite

GB300-class rack in orbit • 150kW • 70m wingspan • 1M satellites sought

NotebookLM Agentic

Gemini 3.5 + sandbox with 100+ skills • your notes can now run experiments

Kimi Work

300 parallel local agents • Moonshot's answer to agentic work

Cohere North Mini Code

First open Cohere coder model • compact, deployable

Xiaomi MiMo UltraSpeed

1,000 tok/s on a 1T MoE • inference speed records

OpenAI Influence-Ops Report

China-linked ops: "Data Center Bandwagon" + "Tech and Tariffs" clusters caught via ChatGPT

Macaron-V1 749B

749B Mixture-of-LoRA • the LoRA stack goes massive

Reka × Moonvalley Merger

Two frontier shops combine • video-gen consolidation

FLUX.2 [klein] On-Device

Black Forest Labs brings image gen to phones • quantized, fast

W&B WolfBench

5 runs / ~40hrs / ~$3K per model • no Fable score yet: "one score is never enough"