Major Features & Updates
Claude Opus 4.6 (1M context)
Anthropic makes Opus 4.6 1M context the default in Claude Code, same price
Anthropic made the 1M-token context window the default for Opus 4.6 in Claude Code at the same price, turning what was previously an experimental, expensive option into the standard. MRCR benchmark performance is 93% at 256K tokens and 76% at 1M. For agent users this means far less compaction and longer uninterrupted sessions, though auto-compaction still triggers at around 170K tokens unless the threshold is manually raised.
1M Opus 4.6 context default
New Models
MiniMax M2.7
MiniMax M2.7: first self-evolving model hits 56% on SWE-Bench Pro
MiniMax released M2.7, billed as the first self-evolving model: it ran 100+ autonomous RL optimization loops and wrote its own agent scaffolding, which was built by one engineer over four days with zero lines of human-written code. It scores 56.22% on SWE-Bench Pro, about one point behind Opus 4.6's 57.3%, and WolfBench shows it roughly matching Sonnet 4.6 on OpenClaw agent tasks. The weights are not yet open, though rumors suggest a release is coming.
56% MiniMax M2.7 SWE-Bench Pro
New Models, Open weights
Mistral Small 4
Mistral Small 4: 119B MoE with 6B active unifies vision, coding, reasoning
Mistral returned to open source with Small 4, a 119B-parameter MoE with 128 experts and only 6B active per token, released under Apache 2.0. It unifies the previous Pixtral (vision), Devstral (coding), and Magistral (reasoning) lines into one model and can fit on a single H100 when compressed. Early WolfBench results are sobering at ~17% on OpenClaw agent tasks, roughly on par with similarly sized Nemotron.
119B Mistral Small 4 total params
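As a rough illustration of what "6B active" out of 119B means, the sketch below computes the share of parameters used per token. The helper name `active_fraction` is hypothetical, and the calculation uses only the headline parameter counts quoted in this digest (Nemotron 3 Super's 12B of 120B is included for comparison); it says nothing about memory footprint, since all experts still have to be stored.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the fraction of a MoE model's parameters used per token.

    Both arguments are in billions of parameters. This is a hypothetical
    helper for illustration, not part of any vendor tooling.
    """
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


# Headline figures quoted in this digest (billions of parameters).
models = {
    "Mistral Small 4": (119, 6),
    "Nemotron 3 Super": (120, 12),
}

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} active per token")
# Mistral Small 4: 5.0% active per token
# Nemotron 3 Super: 10.0% active per token
```

Per-token compute scales roughly with the active count, which is why a 119B model can be pitched as cheap to run even though its full weights must still fit in memory.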
Papers & Research, Open weights
Mamba-3
Mamba-3 lands with three SSM innovations for inference-first linear models
Mamba-3 dropped with three SSM-centric innovations: trapezoidal discretization, complex-valued states, and a MIMO formulation aimed at inference-first linear models. It extends the state-space model line that underpins the growing wave of hybrid SSM architectures for long-context and agentic workloads.
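For readers unfamiliar with the term, here is a minimal sketch of what trapezoidal discretization means for a generic scalar linear state-space system x'(t) = a·x(t) + b·u(t). This is the textbook trapezoidal (bilinear) rule compared against forward Euler, not Mamba-3's actual parameterization, which the summary above does not spell out; all function names are hypothetical.

```python
import math


def euler_step(x, u_now, a, b, dt):
    """One forward-Euler step for x' = a*x + b*u."""
    return x + dt * (a * x + b * u_now)


def trapezoidal_step(x, u_now, u_next, a, b, dt):
    """One trapezoidal-rule step for x' = a*x + b*u.

    Averages the derivative at the start and end of the step and
    solves the resulting implicit equation for the next state.
    """
    numerator = (1 + dt * a / 2) * x + (dt * b / 2) * (u_now + u_next)
    return numerator / (1 - dt * a / 2)


def simulate_decay(steps, dt, a=-1.0):
    """Integrate x' = a*x from x(0) = 1 with zero input, using both rules."""
    x_euler = x_trap = 1.0
    for _ in range(steps):
        x_euler = euler_step(x_euler, 0.0, a, 0.0, dt)
        x_trap = trapezoidal_step(x_trap, 0.0, 0.0, a, 0.0, dt)
    return x_euler, x_trap


x_euler, x_trap = simulate_decay(steps=10, dt=0.1)
exact = math.exp(-1.0)  # true solution at t = 1
print(f"Euler error:       {abs(x_euler - exact):.5f}")
print(f"Trapezoidal error: {abs(x_trap - exact):.5f}")
```

At the same step size the trapezoidal rule tracks the exact solution more closely than Euler, which is the general motivation for using a higher-order discretization.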
New Models, Open weights
Nemotron 3 Super 120B
NVIDIA releases Nemotron 3 Super 120B with $26B open-source bet
NVIDIA launched Nemotron 3 Super, a 120B hybrid Mamba-Transformer MoE model with 12B active parameters, a 1M-token context window, and 450 tok/s throughput. It shipped with BF16/FP8/NVFP4 weights, a base checkpoint, SFT and pre-training data, and the full training recipe, alongside a five-year, $26B open-source commitment. It is available on W&B Inference at $0.20 per million input tokens and $0.80 per million output tokens.
120B Nemotron 3 Super total parameters
12B Nemotron 3 Super active parameters (MoE)
1M Nemotron 3 Super context window (tokens)
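To make the quoted W&B Inference pricing concrete, the sketch below estimates the cost of a single request from the $0.20/M input and $0.80/M output rates above. The function name and the example workload (a full 1M-token prompt with a 100K-token response) are hypothetical, and the calculation ignores any caching discounts or minimum charges a provider might apply.

```python
# Rates quoted in this digest for Nemotron 3 Super on W&B Inference.
INPUT_USD_PER_M_TOKENS = 0.20
OUTPUT_USD_PER_M_TOKENS = 0.80


def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_rate: float = INPUT_USD_PER_M_TOKENS,
    output_rate: float = OUTPUT_USD_PER_M_TOKENS,
) -> float:
    """Estimate request cost in USD from per-million-token rates."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens / 1_000_000) * input_rate + (
        output_tokens / 1_000_000
    ) * output_rate


# Hypothetical workload: fill the 1M-token context, generate 100K tokens.
cost = request_cost_usd(input_tokens=1_000_000, output_tokens=100_000)
print(f"${cost:.2f}")  # $0.28
```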
Papers & Research
Self-Flow
Black Forest Labs introduces Self-Flow
Black Forest Labs, the makers of FLUX, published Self-Flow, new research in the AI art and diffusion space. It was covered in the week's AI Art & Diffusion roundup.
New Models
Gemini 3.1 Flash-Lite
Google launches Gemini 3.1 Flash-Lite with 1M context at 360 tok/s
Google launched Gemini 3.1 Flash-Lite, a fast and cheap model with 1M token context aimed at the instant/fast tier, running around 360 tokens per second. The panel flagged a material pricing jump versus the prior Flash-Lite generation but saw it as well suited for judge, guardrail, and orchestration workloads in agent systems.
360 tokens/sec Gemini 3.1 Flash-Lite speed
New Models
GPT-5.4
OpenAI drops GPT-5.4 Thinking and GPT-5.4 Pro live during the show
OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro mid-show. GPT-5.4 is a frontier general model that folds Codex-level coding into a unified reasoning model. It ships with a 1M-token context window, a /fast mode, and mid-reasoning steering, posting 83.3% on ARC-AGI 2 (Pro) and roughly 75% on OS World computer use. The panel tested it live in Codex and called it a major general-model jump, while noting that input pricing rose about 50% versus GPT-5.2.
83.3% ARC-AGI 2 (GPT-5.4 Pro)
75% OS World / computer-use score
1M Context window