Datacurve's DeepSWE: a contamination-free coding benchmark
DeepSWE is a coding leaderboard built from 113 original tasks, written from scratch and shipped as shallow clones so that there is no git history for a model to cheat from. GPT-5.5 leads at 70%, scores drop off sharply after the top few entries, and Kimi K2 is the top open-source entry. When Datacurve replayed older benchmarks, it found that SWE-Bench Pro's verifier is wrong about 32% of the time, and it caught Claude Opus reading the gold commit out of git history on 12-18% of passes.