
New methods enhance on-policy distillation for LLMs

Researchers have developed new methods to improve the efficiency and stability of on-policy distillation (OPD) for large language models. One approach, vOPD, uses a control-variate baseline derived from the reverse KL divergence to reduce gradient variance without significant computational overhead. Another, ROPD, enables rubric-based distillation using only teacher-generated responses, offering a black-box-compatible alternative to logit-based OPD. A third, Near-Policy Distillation (NPD), accelerates training through asynchronous generation and selective packing, achieving substantial speedups and outperforming standard fine-tuning.

Summary written by gemini-2.5-flash-lite from 5 sources.

IMPACT These advancements offer more efficient and stable methods for aligning LLMs, potentially accelerating their deployment in complex reasoning tasks.

RANK_REASON Multiple arXiv papers introduce novel methods for improving on-policy distillation techniques in LLMs.
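
The three methods above share one core objective: sample responses from the student, then push the student's distribution toward the teacher's under a reverse KL divergence. As a reference point, here is a minimal, self-contained sketch of that objective, assuming white-box access to teacher logits; the tensor shapes and names are illustrative placeholders, not code from any of the covered papers.

```python
# Minimal sketch of the objective shared by these OPD methods: on tokens
# the STUDENT generated, minimize the reverse KL, KL(student || teacher).
# Shapes and names here are illustrative assumptions, not paper code.
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(p_s || p_t), averaged over all positions.

    Both inputs have shape (batch, seq_len, vocab); logits come from
    running student and teacher on the student's *own* samples.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return kl.mean()

# Toy usage with random tensors standing in for real forward passes.
student_logits = torch.randn(2, 8, 32, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32)  # frozen teacher, no grad
loss = reverse_kl_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

Reverse KL is mode-seeking, which is one reason OPD recipes favor it for reasoning-style post-training: the student concentrates on modes the teacher assigns high probability rather than spreading mass across all of them.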


COVERAGE [5]

  1. arXiv cs.CL TIER_1 · Tomas Pfister

    RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

    Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisi…

  2. arXiv cs.AI TIER_1 · Yohan Jo

    KL for a KL: On-Policy Distillation with Control Variate Baseline

    On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stabl…
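
    The instability described here comes from estimating the reverse-KL gradient from a single sampled sequence, which is a REINFORCE-style score-function estimator; subtracting any sample-independent baseline leaves it unbiased while shrinking variance. A hedged sketch follows, using a generic batch-mean baseline as the control variate; the specific baseline the paper derives is not reproduced here.

```python
# Hedged sketch of the variance problem this paper targets: the
# single-sample Monte Carlo estimator of the reverse-KL gradient is a
# score-function (REINFORCE) estimator, so subtracting a baseline b keeps
# it unbiased (E[b * grad log p_s] = 0) while reducing variance. The
# batch-mean baseline below is a generic control variate, not the
# paper's derived baseline.
import torch

def mc_reverse_kl_surrogate(logp_student: torch.Tensor,
                            logp_teacher: torch.Tensor,
                            baseline: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient equals the single-sample reverse-KL
    score-function estimator with a control-variate baseline.

    logp_student: log p_s(y) of sampled sequences y (requires grad).
    logp_teacher: log p_t(y) of the same sequences (no grad needed).
    """
    log_ratio = (logp_student - logp_teacher).detach()  # the "reward" term
    return ((log_ratio - baseline) * logp_student).mean()

# Toy usage with per-sequence log-probabilities.
logp_s = torch.randn(16, requires_grad=True)
logp_t = torch.randn(16)
baseline = (logp_s - logp_t).detach().mean()  # simple batch-mean baseline
loss = mc_reverse_kl_surrogate(logp_s, logp_t, baseline)
loss.backward()
```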

  3. arXiv cs.LG TIER_1 · Tat-Seng Chua

    Rubric-based On-policy Distillation

    On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only…
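
    To make the black-box setting concrete, here is a toy sketch of rubric-as-reward training: student samples are scored against structured rubric items, and the score drives a policy-gradient-style update with no teacher logits involved. The rubric items and scoring function below are invented for illustration and are not the paper's rubrics.

```python
# Toy sketch of rubric-based distillation under a black-box teacher:
# score each student sample against structured rubric items, then use the
# score as a reward in a REINFORCE-style update. The rubric items here
# are invented placeholders, not the paper's actual rubrics.
import torch

rubric = [
    ("states the final answer", lambda text: "answer:" in text.lower()),
    ("shows intermediate steps", lambda text: "step" in text.lower()),
    ("stays under length budget", lambda text: len(text.split()) < 200),
]

def rubric_score(text: str) -> float:
    """Fraction of rubric items the response satisfies, in [0, 1]."""
    return sum(check(text) for _, check in rubric) / len(rubric)

def rubric_pg_loss(logp_student: torch.Tensor,
                   texts: list[str]) -> torch.Tensor:
    """REINFORCE-style surrogate: raise log-prob of high-scoring samples.

    logp_student: per-sequence log p_s(y) of student samples (requires grad).
    """
    scores = torch.tensor([rubric_score(t) for t in texts])
    advantages = scores - scores.mean()  # mean baseline reduces variance
    return -(advantages * logp_student).mean()

# Toy usage.
texts = ["Step 1 ... Answer: 42", "just a guess", "Step 1, step 2. Answer: 7"]
logp_s = torch.randn(len(texts), requires_grad=True)
rubric_pg_loss(logp_s, texts).backward()
```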

  4. arXiv cs.LG TIER_1 · Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen

    Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement …

  5. arXiv cs.CL TIER_1 · Hanting Chen

    Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency,…
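
    The two systems ideas in this paper's title can be sketched independently of any training loop: a rollout worker generates asynchronously from a possibly slightly stale student snapshot (hence "near-policy"), while the trainer greedily packs variable-length rollouts into a fixed token budget. The queue size, token budget, and threading layout below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of asynchronous generation plus selective packing. A producer
# thread streams student rollouts into a queue while the consumer packs
# them into fixed-token-budget batches. All constants are assumptions.
import queue
import random
import threading

ROLLOUT_QUEUE: "queue.Queue[list[int]]" = queue.Queue(maxsize=64)
TOKEN_BUDGET = 256  # max tokens per packed training batch (assumed)

def generator_worker(n_rollouts: int) -> None:
    """Asynchronously produce student rollouts of varying length."""
    for _ in range(n_rollouts):
        length = random.randint(10, 120)
        ROLLOUT_QUEUE.put([random.randrange(32_000) for _ in range(length)])

def pack_batch() -> list[list[int]]:
    """Greedily pack rollouts until the token budget is filled."""
    batch, used = [], 0
    while used < TOKEN_BUDGET:
        seq = ROLLOUT_QUEUE.get()
        if used + len(seq) > TOKEN_BUDGET and batch:
            ROLLOUT_QUEUE.put(seq)  # defer the oversized rollout
            break
        batch.append(seq)
        used += len(seq)
    return batch

threading.Thread(target=generator_worker, args=(32,), daemon=True).start()
for step in range(4):
    batch = pack_batch()
    print(f"step {step}: {len(batch)} seqs, {sum(map(len, batch))} tokens")
    # ... the distillation update on the packed batch would go here ...
```

    Packing matters because rollout lengths vary widely; filling each batch close to the token budget keeps GPU utilization high without padding waste, and the asynchronous queue keeps the trainer from idling while generation runs.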