Papers & Research
Intuitor (Learning to Reason Without External Rewards)
Paper: models can learn to reason without external rewards
A surprising paper showing that reinforcement learning with internal or even random rewards can improve reasoning models. When fine-tuning Qwen2.5 3B, Intuitor matched or exceeded some results from GRPO (the external-reward framework DeepSeek popularized with R1), raising the question of how much of RL's gains come from the reward signal itself.
3B: the Qwen2.5 model size at which Intuitor matched or exceeded some GRPO results
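To make "internal rewards" concrete, here is a minimal sketch of how a model-internal score could stand in for an external reward in a GRPO-style update. The summary above does not say which internal signal is used, so the `self_confidence` function (mean KL divergence from a uniform distribution to each next-token distribution) is an assumption chosen for illustration, and the toy distributions are made up. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage.

```python
import math
from statistics import mean, pstdev


def self_confidence(token_dists):
    """Illustrative internal reward (an assumption, not taken from the summary).

    Averages KL(U || p) over the response's next-token distributions.
    A uniform distribution scores 0; more peaked distributions score higher.
    Every probability must be strictly positive.
    """
    scores = []
    for p in token_dists:
        v = len(p)
        # KL(U || p) = sum_j (1/v) * log((1/v) / p_j)
        scores.append(sum((1.0 / v) * math.log((1.0 / v) / pj) for pj in p))
    return mean(scores)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three sampled responses, each with two token positions
# over a four-token vocabulary (hypothetical numbers).
group = [
    [[0.97, 0.01, 0.01, 0.01]] * 2,  # very confident
    [[0.25, 0.25, 0.25, 0.25]] * 2,  # maximally unsure
    [[0.70, 0.10, 0.10, 0.10]] * 2,  # in between
]

rewards = [self_confidence(dists) for dists in group]
advantages = group_relative_advantages(rewards)
```

No verifier or gold answer appears anywhere in this loop: the responses the model is most confident about receive positive advantages, and the least confident receive negative ones.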
Datasets, Open weights
PromptEvals
PromptEvals: 12K+ real production assertion criteria for LLM evals
Shreya Shankar and collaborators released PromptEvals, the first large-scale corpus of production LLM guardrails: 2,087 developer prompts paired with 12,623 assertion criteria covering structure, style, grounding, and hallucination checks. That is about 5x larger than prior sets. Fine-tuned open Mistral-7B and Llama-3-8B checkpoints beat GPT-4o by 21 F1 at generating assertions, at a fraction of the latency. Accepted to NAACL 2025.
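For readers unfamiliar with assertion criteria, the sketch below shows what a few such checks might look like when applied to a model output. Everything in it is hypothetical: the ticket text, the three criteria, and the `Assertion` and `failed_assertions` names are invented for illustration and are not drawn from the PromptEvals dataset.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Assertion:
    category: str                 # e.g. "structure", "style", "grounding"
    description: str              # human-readable criterion
    check: Callable[[str], bool]  # returns True when the output passes


def parse_json(output: str) -> dict:
    """Return the parsed object, or an empty dict if output is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}


def summary_words(output: str) -> List[str]:
    """Whitespace-split words of the 'summary' field (empty list if missing)."""
    return str(parse_json(output).get("summary", "")).split()


# Hypothetical source text that the model was asked to summarize.
TICKET = "Order 4412 arrived damaged. Customer asks for a refund."

ASSERTIONS = [
    Assertion("structure", "Output is a JSON object with 'summary' and 'priority'",
              lambda out: {"summary", "priority"} <= set(parse_json(out))),
    Assertion("style", "Summary is at most 20 words",
              lambda out: len(summary_words(out)) <= 20),
    Assertion("grounding", "Any order number in the summary appears in the ticket",
              lambda out: all(w.strip(".,;") in TICKET
                              for w in summary_words(out)
                              if w.strip(".,;").isdigit())),
]


def failed_assertions(output: str, assertions: List[Assertion]) -> List[str]:
    """Return the descriptions of every criterion the output violates."""
    return [a.description for a in assertions if not a.check(output)]
```

Given an output such as `{"summary": "Order 9999 was late."}`, `failed_assertions` reports the structure criterion (no `priority` key) and the grounding criterion (order 9999 is not in the ticket), which is the kind of guardrail signal a developer would log or retry on.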