
New methods enhance on-policy distillation for LLMs

Researchers have developed new methods to improve the efficiency and stability of on-policy distillation (OPD) for large language models. One approach, vOPD, uses a control-variate baseline derived from the reverse KL divergence to reduce gradient variance without significant computational overhead. Another, ROPD, enables rubric-based distillation using only teacher-generated responses, offering a black-box-compatible alternative to logit-based OPD. A third, Near-Policy Distillation (NPD), accelerates training through asynchronous generation and selective packing, achieving substantial speedups and outperforming standard fine-tuning.

Summary written by gemini-2.5-flash-lite from 5 sources.

IMPACT These advancements offer more efficient and stable methods for aligning LLMs, potentially accelerating their deployment in complex reasoning tasks.

RANK_REASON Multiple arXiv papers introduce novel methods for improving on-policy distillation techniques in LLMs.
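
The three methods above share one core objective: sample responses from the student, then push the student's distribution toward the teacher's under a reverse KL divergence. As a reference point, here is a minimal, self-contained sketch of that objective, assuming white-box access to teacher logits; the tensor shapes and names are illustrative placeholders, not code from any of the covered papers.

```python
# Minimal sketch of the objective shared by these OPD methods: on tokens
# the STUDENT generated, minimize the reverse KL, KL(student || teacher).
# Shapes and names here are illustrative assumptions, not paper code.
import torch
import torch.nn.functional as F

def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """Per-token reverse KL, KL(p_s || p_t), averaged over all positions.

    Both inputs have shape (batch, seq_len, vocab); logits come from
    running student and teacher on the student's *own* samples.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return kl.mean()

# Toy usage with random tensors standing in for real forward passes.
student_logits = torch.randn(2, 8, 32, requires_grad=True)
teacher_logits = torch.randn(2, 8, 32)  # frozen teacher, no grad
loss = reverse_kl_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow only into the student
```

Reverse KL is mode-seeking, which is one reason OPD recipes favor it for reasoning-style post-training: the student concentrates on modes the teacher assigns high probability rather than spreading mass across all of them.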


COVERAGE [5]

  1. arXiv cs.CL TIER_1 · Tomas Pfister

    RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

    Training deep research agents, namely systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards. Their outputs lack ground-truth answers, their trajectories span many tool-augmented decisi…

  2. arXiv cs.AI TIER_1 · Yohan Jo

    KL for a KL: On-Policy Distillation with Control Variate Baseline

    On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stabl…
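
    The instability described here comes from estimating the reverse-KL gradient from a single sampled sequence, which is a REINFORCE-style score-function estimator; subtracting any sample-independent baseline leaves it unbiased while shrinking variance. A hedged sketch follows, using a generic batch-mean baseline as the control variate; the specific baseline the paper derives is not reproduced here.

```python
# Hedged sketch of the variance problem this paper targets: the
# single-sample Monte Carlo estimator of the reverse-KL gradient is a
# score-function (REINFORCE) estimator, so subtracting a baseline b keeps
# it unbiased (E[b * grad log p_s] = 0) while reducing variance. The
# batch-mean baseline below is a generic control variate, not the
# paper's derived baseline.
import torch

def mc_reverse_kl_surrogate(logp_student: torch.Tensor,
                            logp_teacher: torch.Tensor,
                            baseline: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient equals the single-sample reverse-KL
    score-function estimator with a control-variate baseline.

    logp_student: log p_s(y) of sampled sequences y (requires grad).
    logp_teacher: log p_t(y) of the same sequences (no grad needed).
    """
    log_ratio = (logp_student - logp_teacher).detach()  # the "reward" term
    return ((log_ratio - baseline) * logp_student).mean()

# Toy usage with per-sequence log-probabilities.
logp_s = torch.randn(16, requires_grad=True)
logp_t = torch.randn(16)
baseline = (logp_s - logp_t).detach().mean()  # simple batch-mean baseline
loss = mc_reverse_kl_surrogate(logp_s, logp_t, baseline)
loss.backward()
```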

  3. arXiv cs.LG TIER_1 · Tat-Seng Chua

    Rubric-based On-policy Distillation

    On-policy distillation (OPD) is a powerful paradigm for model alignment, yet its reliance on teacher logits restricts its application to white-box scenarios. We contend that structured semantic rubrics can serve as a scalable alternative to teacher logits, enabling OPD using only…
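
    To make the black-box setting concrete, here is a toy sketch of rubric-as-reward training: student samples are scored against structured rubric items, and the score drives a policy-gradient-style update with no teacher logits involved. The rubric items and scoring function below are invented for illustration and are not the paper's rubrics.

```python
# Toy sketch of rubric-based distillation under a black-box teacher:
# score each student sample against structured rubric items, then use the
# score as a reward in a REINFORCE-style update. The rubric items here
# are invented placeholders, not the paper's actual rubrics.
import torch

rubric = [
    ("states the final answer", lambda text: "answer:" in text.lower()),
    ("shows intermediate steps", lambda text: "step" in text.lower()),
    ("stays under length budget", lambda text: len(text.split()) < 200),
]

def rubric_score(text: str) -> float:
    """Fraction of rubric items the response satisfies, in [0, 1]."""
    return sum(check(text) for _, check in rubric) / len(rubric)

def rubric_pg_loss(logp_student: torch.Tensor,
                   texts: list[str]) -> torch.Tensor:
    """REINFORCE-style surrogate: raise log-prob of high-scoring samples.

    logp_student: per-sequence log p_s(y) of student samples (requires grad).
    """
    scores = torch.tensor([rubric_score(t) for t in texts])
    advantages = scores - scores.mean()  # mean baseline reduces variance
    return -(advantages * logp_student).mean()

# Toy usage.
texts = ["Step 1 ... Answer: 42", "just a guess", "Step 1, step 2. Answer: 7"]
logp_s = torch.randn(len(texts), requires_grad=True)
rubric_pg_loss(logp_s, texts).backward()
```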

  4. arXiv cs.LG TIER_1 · Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen

    Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement …

  5. arXiv cs.CL TIER_1 · Hanting Chen

    Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

    Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency,…
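
    The two systems ideas in this paper's title can be sketched independently of any training loop: a rollout worker generates asynchronously from a possibly slightly stale student snapshot (hence "near-policy"), while the trainer greedily packs variable-length rollouts into a fixed token budget. The queue size, token budget, and threading layout below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of asynchronous generation plus selective packing. A producer
# thread streams student rollouts into a queue while the consumer packs
# them into fixed-token-budget batches. All constants are assumptions.
import queue
import random
import threading

ROLLOUT_QUEUE: "queue.Queue[list[int]]" = queue.Queue(maxsize=64)
TOKEN_BUDGET = 256  # max tokens per packed training batch (assumed)

def generator_worker(n_rollouts: int) -> None:
    """Asynchronously produce student rollouts of varying length."""
    for _ in range(n_rollouts):
        length = random.randint(10, 120)
        ROLLOUT_QUEUE.put([random.randrange(32_000) for _ in range(length)])

def pack_batch() -> list[list[int]]:
    """Greedily pack rollouts until the token budget is filled."""
    batch, used = [], 0
    while used < TOKEN_BUDGET:
        seq = ROLLOUT_QUEUE.get()
        if used + len(seq) > TOKEN_BUDGET and batch:
            ROLLOUT_QUEUE.put(seq)  # defer the oversized rollout
            break
        batch.append(seq)
        used += len(seq)
    return batch

threading.Thread(target=generator_worker, args=(32,), daemon=True).start()
for step in range(4):
    batch = pack_batch()
    print(f"step {step}: {len(batch)} seqs, {sum(map(len, batch))} tokens")
    # ... the distillation update on the packed batch would go here ...
```

    Packing matters because rollout lengths vary widely; filling each batch close to the token budget keeps GPU utilization high without padding waste, and the asynchronous queue keeps the trainer from idling while generation runs.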