Researchers from Kuaishou's Kwaipilot team have developed a novel reinforcement learning framework called SRPO, designed to improve the efficiency and performance of large language models. This new method addresses limitations in standard GRPO, such as sample inefficiency and cross-domain optimization conflicts, by employing a two-stage training process. SRPO has demonstrated state-of-the-art performance on mathematical and code benchmarks, matching DeepSeek-R1-Zero while requiring only one-tenth of the training steps.
Summary written by gemini-2.5-flash-lite from 1 source.
RANK_REASON Open-source release of a novel training method and model from a non-frontier lab, achieving competitive benchmark results.