OpenAI advances reinforcement learning with new benchmarks and methods
By PulseAugur Editorial·
Summary by gemini-2.5-flash-lite
from 283 sources
OpenAI has published a series of research papers detailing advancements in reinforcement learning (RL). These include achieving superhuman performance in the game Dota 2 using large-scale deep RL, developing benchmarks for safe exploration in RL environments, and quantifying generalization capabilities with a new environment called CoinRun. The research also explores novel methods like Random Network Distillation for curiosity-driven exploration, Evolved Policy Gradients for faster learning on new tasks, and variance reduction techniques for policy gradients. Additionally, OpenAI is investigating policy representations in multiagent systems and the theoretical equivalence between policy gradients and soft Q-learning.
AI
IMPACT
These advancements in reinforcement learning, particularly in generalization, safety, and exploration, could accelerate the development of more capable AI agents for complex real-world tasks.
RANK_REASON
The cluster consists of multiple research papers and technical reports from OpenAI detailing advancements in reinforcement learning.
We’re releasing CoinRun, a training environment which provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the env…
We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first time exceeds average human performance on Montezuma’s Revenge.
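The RND recipe can be sketched in a few lines: a fixed, randomly initialized target network defines features, and the intrinsic reward is a trained predictor's error in matching them, which decays for familiar states and stays high for novel ones. This is a minimal numpy illustration of the idea, not OpenAI's implementation; the linear networks, sizes, and learning rate are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, EMB = 8, 4  # observation and embedding sizes (illustrative)

# Fixed, randomly initialized target network (never trained).
W_target = rng.normal(size=(OBS, EMB))

def target(s):
    return np.tanh(s @ W_target)

# Trainable predictor tries to match the target's output.
W_pred = np.zeros((OBS, EMB))

def intrinsic_reward(s):
    # Prediction error is high for states unlike those trained on.
    return float(np.sum((s @ W_pred - target(s)) ** 2))

def train_predictor(s, lr=0.05):
    global W_pred
    err = s @ W_pred - target(s)      # (EMB,)
    W_pred -= lr * np.outer(s, err)   # gradient of 0.5 * ||err||^2

# A familiar state: its intrinsic reward shrinks as the predictor learns it,
# while a never-seen state keeps a high prediction error.
s_seen = rng.normal(size=OBS)
before = intrinsic_reward(s_seen)
for _ in range(200):
    train_predictor(s_seen)
after = intrinsic_reward(s_seen)
s_novel = rng.normal(size=OBS)
print(before > after, intrinsic_reward(s_novel) > after)
```

In the full method the intrinsic reward is added to the environment reward, and the predictor trains online on the agent's observation stream.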
We’re releasing an experimental metalearning approach called Evolved Policy Gradients, a method that evolves the loss function of learning agents, which can enable fast training on novel tasks. Agents trained with EPG can succeed at basic tasks at test time that were outside thei…
Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents using outcome-only rewards suffers from credit-assignment ambiguity, obscuring which…
Exploitation versus exploration is a critical topic in reinforcement learning. This post introduces several common approaches for better exploration in Deep RL. [Updated on 2020-06-17: Add “exploration…
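For reference, the simplest exploration baseline such surveys start from is ε-greedy: act randomly with probability ε, otherwise exploit the current value estimates. A toy multi-armed-bandit sketch; the arm means, ε, and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_greedy_bandit(true_means, steps=5000, eps=0.1):
    # Track an incremental empirical mean reward per arm.
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    for _ in range(steps):
        if rng.random() < eps:            # explore: uniform random arm
            a = int(rng.integers(n_arms))
        else:                             # exploit: best current estimate
            a = int(np.argmax(values))
        r = rng.normal(true_means[a], 1.0)
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]  # running mean update
    return int(np.argmax(values))

best = eps_greedy_bandit([0.1, 0.5, 0.9])
print(best)  # usually identifies the best arm (index 2)
```

Curiosity-driven methods like RND replace this undirected dithering with a learned signal for *where* to explore.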
A curriculum is an efficient tool for humans to progressively learn from simple concepts to hard problems. It breaks down complex knowledge by providing a sequence of learning steps of increasing difficulty. In this post, we will examine how the idea of curriculum can help r…
Meta-RL is meta-learning on reinforcement learning tasks. After being trained over a distribution of tasks, the agent is able to solve a new task by developing a new RL algorithm with its internal activity dynamics. This post starts with the origin of meta-RL and then dives into t…
Let's see how to implement a number of classic deep reinforcement learning models in code. The full implementation is available in lilianweng/deep-reinforcement-learning-gym (https://github.com/lilianweng/deep-reinforcement-learning-gym). In the prev…
Abstract: In this post, we are going to look deep into policy gradient, why it works, and many new policy gradient algorithms proposed in recent years: vanilla policy gradient, actor-critic, off-policy actor-critic, A3C, A2C, DPG, DDPG, D4PG, MADDPG, TRPO, PPO, ACER, ACTKR, …
In this post, we are going to briefly go over the field of Reinforcement Learning (RL), from fundamental concepts to classic algorithms. Hopefully, this review is helpful enough that newbies will not get lost in specialized terms and jargon while starting. [WARNING] This i…
Trained for ~8000 episodes, each episode = ~30 games. Updates were done in batches of 10 episodes, so ~800 updates total. Policy network is a 2-layer neural net connected to raw pixels, with 200 hidden units. Trained with RMSProp and learning rate 1e-4. The final agent does not b…
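The setup described above (a 2-layer network with 200 hidden units mapping raw pixels to an action probability) can be sketched as below. The 80×80 flattened input and Gym's UP/DOWN action ids are assumptions based on the standard "Pong from Pixels" recipe; the training loop (RMSProp at learning rate 1e-4) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 80 * 80, 200  # flattened pixel input and hidden units (per the description)

# 2-layer policy network: pixels -> ReLU hidden layer -> P(action = UP).
model = {
    "W1": rng.normal(size=(H, D)) / np.sqrt(D),  # scaled random init
    "W2": rng.normal(size=H) / np.sqrt(H),
}

def policy_forward(x):
    h = np.maximum(0.0, model["W1"] @ x)        # hidden activations
    logit = model["W2"] @ h
    p_up = 1.0 / (1.0 + np.exp(-logit))         # sigmoid over the two actions
    return p_up, h

x = rng.random(D)                               # stand-in for a preprocessed frame
p_up, _ = policy_forward(x)
action = 2 if rng.random() < p_up else 3        # Atari Gym action ids: 2=UP, 3=DOWN
print(0.0 < p_up < 1.0)
```

Sampling the action (rather than taking the argmax) is what makes the policy-gradient update well-defined: the log-probability of the sampled action is what gets reinforced.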
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verif…
Effective multi-agent cooperation requires agents to adopt diverse behaviors as task conditions evolve, and to do so at the right moment. Yet, current Multi-Agent Reinforcement Learning (MARL) frameworks that facilitate this diversity are still limited by the fact that they bind f…
Reinforcement learning is structurally harder than supervised learning because the policy changes the data distribution it learns from. The resulting fragility is especially visible in large-model training, where the training and rollout systems differ in numerical precision, sam…
Many reinforcement learning (RL) tasks have discrete action spaces, but most generative policy methods based on diffusion and flow matching are designed for continuous control. Meanwhile, generative policies usually rely heavily on offline datasets and offline-to-online RL is its…
Random delays weaken the temporal correspondence between actions and subsequent state feedback, making it difficult for agents to identify the true propagation process of action effects. In cross-task scenarios, changes in task objectives and reward formulations further reduce th…
Many real-world tasks involve delayed effects, where the outcomes of actions emerge after varying time lags. Existing delay-aware reinforcement learning methods often rely on state augmentation, prior knowledge of delay distributions, or access to non-delayed data, limiting their…
Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a uni…
Advancements in reinforcement learning have produced a variety of complex and useful intrinsic driving forces; crucially, these drivers operate under a direct conditioning paradigm. This form of conditioning limits our agents' capacity by restricting how they learn from the envir…
In reinforcement learning (RL), agents acting in partially observable Markov decision processes (POMDPs) must rely on memory, typically encoded in a recurrent neural network (RNN), to integrate information from past observations. Long-horizon POMDPs, in which the relevant observa…
Skill libraries enable large language model agents to reuse experience from past interactions, but most existing libraries store skills as isolated entries and retrieve them only by semantic similarity. This leads to two key challenges for compositional tasks. Firstly, an agent m…
Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional traini…
Policy entropy has emerged as a fundamental measure for understanding and controlling exploration in reinforcement learning with verifiable rewards (RLVR) for LLMs. However, existing entropy-aware methods mainly regulate entropy through global objectives, while the token-level me…
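As background for entropy-aware RLVR methods like the one above, per-token policy entropy is computed from the model's logits; this sketch shows the token-level quantity such methods aggregate or regulate. The vocabulary size and logits are illustrative.

```python
import numpy as np

def token_entropies(logits):
    # logits: (seq_len, vocab) per-token logits from the policy.
    z = logits - logits.max(axis=-1, keepdims=True)   # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)      # entropy H_t for each token

peaked = np.zeros((1, 8)); peaked[0, 0] = 10.0   # near-deterministic next token
uniform = np.zeros((1, 8))                       # maximally uncertain next token
# A uniform distribution over 8 tokens attains the maximum entropy log(8) ~= 2.079.
print(token_entropies(peaked)[0], token_entropies(uniform)[0])
```

Global methods regulate the average of these values over a rollout; the abstract's point is that the token-level profile carries information the average discards.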
Large language model agents increasingly rely on external skills to solve complex tasks, where skills act as modular units that extend their capabilities beyond what parametric memory alone supports. Existing methods assume external skills either accumulate as persistent guidance…
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provid…
For reinforcement learning in the real world, online exploration is expensive. A common practice in robotic reinforcement learning is to incorporate additional data to improve sample efficiency. Expert demonstration data is often crucial for solving hard exploration tasks with spars…
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a…
Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed…
Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied i…
Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate …
We propose Drifting Field Policy (DFP), a non-ODE one-step generative policy built on the drifting model paradigm. We frame the policy update as a reverse-KL Wasserstein-2 gradient flow toward a soft target policy, so that each DFP update corresponds to a gradient step in probabi…
Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its em…
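The empirical baseline the abstract contrasts with PPO's critic is GRPO's group-relative advantage: rewards from several rollouts of the same prompt are normalized by the group's mean and standard deviation, so no learned value model is needed. A minimal sketch (the small ε term is an illustrative stabilizer):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # rewards: scalar rewards for one prompt's group of rollouts.
    r = np.asarray(rewards, dtype=float)
    # Group mean serves as the baseline; std rescales the advantages.
    return (r - r.mean()) / (r.std() + eps)

adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # 4 rollouts of the same prompt
print(adv)
```

The cost the abstract highlights is visible here: the baseline quality depends on the number of rollouts per prompt, so small groups make the estimate noisy.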
Developing lightweight, on-device vision-language GUI agents is essential for efficient cross-platform automated interaction. However, current on-device agents are constrained by limited model capacity, and further performance improvements remain urgently needed. Traditional Supe…
Instruction following is a fundamental capability of large language models (LLMs), yet continuously improving this capability remains challenging. Existing methods typically rely either on costly external supervision from humans or strong teacher models, or on self-play training …
In-context reinforcement learning (ICRL) studies agents that, after pretraining, adapt to new tasks by conditioning on additional context without parameter updates. Existing theoretical analyses of ICRL largely rely on linear attention, which replaces the softmax function in the …
We introduce Mutual Reinforcement Learning, a framework for concurrent RL post-training in which heterogeneous LLM policies exchange typed experience while keeping separate parameters, objectives, and tokenizers. The framework combines a Shared Experience Exchange (SEE), Multi-Wo…
arXiv cs.LG
TIER_1·Tim Walter, Hannah Markgraf, Jonathan Külz, Matthias Althoff·
arXiv:2506.01665v4 Announce Type: replace Abstract: The deployment of autonomous robots in safety-critical applications requires safety guarantees. Provably safe reinforcement learning is an active field of research that aims to provide such guarantees using safeguards. These saf…
arXiv cs.LG
TIER_1·David Leeftink, Max Hinne, Marcel van Gerven·
arXiv:2605.05373v1 Announce Type: new Abstract: A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement le…
arXiv:2605.05481v1 Announce Type: new Abstract: We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is u…
arXiv cs.LG
TIER_1·Nandiraju Gireesh, Yuanliang Ju, He Wang·
arXiv:2605.05544v1 Announce Type: new Abstract: Offline-to-online reinforcement learning with action chunking eliminates multi-step off-policy bias and enables temporally coherent exploration, but all existing methods use a fixed chunk size across every state. This is suboptimal:…
arXiv cs.LG
TIER_1·Cristiano da Costa Cunha, Ajmal Mian, Tim French, Wei Liu·
arXiv:2605.06066v1 Announce Type: new Abstract: Causal reinforcement learning (RL) lacks benchmarks for complex systems that combine sequential decision making, hidden information, large masked action spaces, and explicit causal structure. We introduce MTG-Causal-RL, a Gymnasium …
arXiv cs.LG
TIER_1·Alireza Modirshanechi, Benjamin Eysenbach, Peter Dayan, Eric Schulz·
arXiv:2605.06145v1 Announce Type: new Abstract: Unsupervised pretraining has driven empirical advances in goal-conditioned reinforcement learning (GCRL), but its theoretical foundations remain poorly understood. In particular, an influential class of methods, mutual information s…
arXiv:2605.06149v1 Announce Type: new Abstract: The discount factor in reinforcement learning controls both the effective planning horizon and the strength of bootstrapping, yet most deep RL methods use a single fixed value across all states. While state-dependent discounting is …
arXiv:2605.06228v1 Announce Type: new Abstract: Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in pra…
arXiv:2605.06500v1 Announce Type: new Abstract: Reinforcement learning (RL) with continuous time and state/action spaces is often data-intensive and brittle under nuisance variability and shift, motivating methods that exploit value-preserving structures to stabilize and improve …
arXiv cs.LG
TIER_1·Hao Ye, Jisheng Dang, Junfeng Fang, Bimei Wang, Yizhou Zhang, Ning Lv, Wencan Zhang, Hong Peng, Bin Hu, Tat-Seng Chua·
arXiv:2605.06523v1 Announce Type: new Abstract: Recent extensive research has demonstrated that the enhanced reasoning capabilities acquired by models through Reinforcement Learning with Verifiable Rewards (RLVR) are primarily concentrated within the rank-1 components. Predicated…
arXiv:2605.06570v1 Announce Type: new Abstract: Many real-world problems require sequential decisions under uncertainty: when to inject or withdraw gas from storage, how to rebalance a pension portfolio each month, what temperature profile to run through a pharmaceutical reactor …
arXiv:2605.05262v1 Announce Type: cross Abstract: We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnos…
arXiv:2605.05755v1 Announce Type: cross Abstract: We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-at…
arXiv cs.LG
TIER_1·Maria Ana Cardei, Matthew Landers, Afsaneh Doryab·
arXiv:2605.06557v1 Announce Type: cross Abstract: Cooperative multi-agent reinforcement learning (MARL) benchmarks commonly emphasize aggregate outcomes such as return, success rate, or completion time. While essential, these metrics often fail to reveal how agents coordinate, pa…
arXiv:2605.06593v1 Announce Type: cross Abstract: Retargeting human kinematic reference motion onto a robot's morphology remains a formidable challenge. Existing methods often produce physical inconsistencies, such as foot sliding, self-collisions, or dynamically infeasible motio…
arXiv cs.LG
TIER_1·Shuo Liu, Xinzichen Li, Christopher Amato·
arXiv:2605.06595v1 Announce Type: cross Abstract: Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal…
arXiv:2602.07906v5 Announce Type: replace Abstract: Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons. While recent LLM-based agents show promise, current prompt-based agents for MLE suffer from behaviora…
arXiv:2603.15646v2 Announce Type: replace Abstract: Reinforcement Learning with Rubric Rewards (RLRR) is a framework that extends conventional reinforcement learning from human feedback (RLHF) and verifiable rewards (RLVR) by replacing scalar preference signals with structured, m…
arXiv cs.LG
TIER_1·Jiaxin Liu, Anzhe Cheng, Paul Bogdan·
arXiv:2603.18257v2 Announce Type: replace Abstract: When an RL agent's observations contain distractors driven by the same confounders as its true state, observational data alone cannot identify which dimensions the agent controls. In our benchmarks, even state-conditioned observ…
arXiv:2604.18978v2 Announce Type: replace Abstract: Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training…
arXiv:2605.06078v1 Announce Type: new Abstract: While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where corr…
arXiv cs.CL
TIER_1·Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li, Chengming Li, Peng Chen, Jie Jiang·
arXiv:2605.06200v1 Announce Type: new Abstract: Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions.…
arXiv cs.CL
TIER_1·Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin·
arXiv:2605.06642v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and…
arXiv:2605.06650v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), due to the deterministic verification, becomes a dominant paradigm for enhancing the reasoning ability of large language models (LLMs). The community witnesses the rapid change …
arXiv:2605.05977v1 Announce Type: new Abstract: Backdoor attacks pose a serious threat to deep reinforcement learning (DRL). Current defenses typically rely on reward anomalies to reverse-engineer triggers and model finetuning to remove backdoors. However, complex trigger pattern…
arXiv:2605.06130v1 Announce Type: new Abstract: A persistent skill library allows language model agents to reuse successful strategies across tasks. Maintaining such a library requires three coupled capabilities. The agent selects a relevant skill, utilizes it during execution, a…
arXiv:2605.06516v1 Announce Type: cross Abstract: Benders decomposition (BD) is a widely used solution approach for solving two-stage stochastic programs arising in real-world decision-making under uncertainty. However, it often suffers from slow convergence as the master problem…
arXiv cs.AI
TIER_1·Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar·
arXiv:2510.01857v4 Announce Type: replace Abstract: Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or definin…
arXiv:2512.15146v4 Announce Type: replace Abstract: Test-time reinforcement learning mitigates the reliance on annotated data by using majority voting results as pseudo-labels, emerging as a complementary direction to reinforcement learning with verifiable rewards (RLVR) for impr…
arXiv cs.LG
TIER_1·Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka·
arXiv:2605.04880v1 Announce Type: new Abstract: Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular i…
arXiv:2605.04712v1 Announce Type: new Abstract: In deep reinforcement learning (DRL), an agent is trained from a stream of experience. In a continual learning setting, such agents can suffer from plasticity loss: their ability to learn new skills from new experiences diminishes o…
arXiv:2605.04477v1 Announce Type: new Abstract: Online reinforcement learning from human feedback (RLHF) has emerged as a promising paradigm for aligning large language models (LLMs) by continuously collecting new preference feedback during training. A foundational challenge in t…
arXiv:2605.04470v1 Announce Type: new Abstract: Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade…
arXiv cs.LG
TIER_1·Senne Deproost, Mehrdad Asadi, Ann Nowé·
arXiv:2605.04254v1 Announce Type: new Abstract: We introduce State Vector Space Partitioning (SVSP), a novel method to mimic a black box reinforcement learning policy using a set of human-interpretable subpolicies. By partitioning a distillation dataset of state action pairs with…
arXiv:2605.04185v1 Announce Type: new Abstract: When deploying reinforcement learning policies to physical robots, actuator rate constraints -- hard limits on how fast each joint can move per control step -- are unavoidable. These limits vary substantially across joints due to di…
arXiv:2605.04068v1 Announce Type: new Abstract: The use of artificial intelligence in supply chain forecasting has attracted many scientific studies for several decades. However, the process of selecting an appropriate forecasting solution becomes a daunting task. This complexity…
arXiv:2605.04979v1 Announce Type: cross Abstract: A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of de…
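Because every state in a T-MDP is reached by exactly one trajectory, optimal values follow from plain backward induction over the tree, with no state aliasing to worry about. A toy sketch using our own node encoding, not the paper's formalism:

```python
def tree_value(node):
    # node: dict mapping action -> (reward, child_node); a leaf is an empty dict.
    # Optimal value = best immediate reward plus the child's optimal value.
    if not node:
        return 0.0
    return max(r + tree_value(child) for r, child in node.values())

leaf = {}
tree = {
    "L": (1.0, {"L": (0.0, leaf), "R": (5.0, leaf)}),
    "R": (2.0, {"L": (1.0, leaf), "R": (1.0, leaf)}),
}
print(tree_value(tree))  # 6.0: take L (reward 1), then R (reward 5)
```

Each node is visited exactly once, so the computation is linear in the number of states, which is what makes T-MDPs a convenient abstraction for sequential games with perfect recall.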
arXiv cs.LG
TIER_1·Björn Hoppmann, Christoph Scholz·
arXiv:2602.19837v3 Announce Type: replace-cross Abstract: Humans are highly effective at utilizing prior knowledge to adapt to novel tasks, a capability that standard machine learning models struggle to replicate due to their reliance on task-specific training. Meta-learning over…
arXiv cs.LG
TIER_1·Xueyan Niu, Bo Bai, Wei Han, Weixi Zhang·
arXiv:2601.07389v2 Announce Type: replace Abstract: Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). These two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs …
arXiv:2412.08893v3 Announce Type: replace Abstract: Optimal control and sequential decision making are widely used in many complex tasks. Optimal control over a sequence of natural images is a first step towards understanding the role of vision in control. Here, we formalize this…
arXiv cs.AI
TIER_1·Karthik Soma, Yann Bouteiller, Heiko Hamann, Giovanni Beltrame·
arXiv:2410.17517v5 Announce Type: replace-cross Abstract: Decision-making is an essential attribute of any intelligent agent or group. Natural systems are known to converge to effective strategies through at least two distinct mechanisms: collective decision-making via imitation …
arXiv:2605.05123v1 Announce Type: new Abstract: In offline-to-online reinforcement learning (O2O-RL), policies are first safely trained offline using previously collected datasets and then further fine-tuned for tasks via limited online interactions. In a typical O2O-RL pipeline,…
arXiv:2605.05020v1 Announce Type: new Abstract: System Neural Diversity (SND) measures behavioral heterogeneity in multi-agent reinforcement learning by averaging pairwise distances over all $\binom{n}{2}$ agent pairs, making each call quadratic in team size. We introduce Graph-S…
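The quadratic cost the abstract mentions is visible in a direct implementation: SND averages a behavioral distance over all $\binom{n}{2}$ agent pairs. A sketch with illustrative per-agent embeddings and Euclidean distance standing in for the behavioral metric:

```python
import numpy as np
from itertools import combinations

def snd(policies):
    # policies: (n, d) array of per-agent behavior embeddings.
    # Complete-graph average over all n-choose-2 pairs: O(n^2) per call.
    pairs = combinations(range(len(policies)), 2)
    dists = [np.linalg.norm(policies[i] - policies[j]) for i, j in pairs]
    return float(np.mean(dists))

team = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(round(snd(team), 3))  # mean of {1, 1, sqrt(2)} ~= 1.138
```

Graph-SND's proposal, per the abstract, is to replace this complete-graph average with a sparser graph so the cost no longer grows quadratically with team size.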
arXiv cs.LG
TIER_1·Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang·
arXiv:2605.04960v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity …
arXiv:2605.04920v1 Announce Type: new Abstract: Compositional generalization refers to correctly interpreting novel combinations of known primitives, which remains a major challenge. Existing approaches often rely on supervised fine-tuning, which encourages models to imitate target …
Behavior Cloning (BC) has emerged as a highly effective paradigm for robot learning. However, BC lacks a self-guided mechanism for online improvement after demonstrations have been collected. Existing offline-to-online learning methods often cause policies to replace previously l…
Designing reward functions for agile robotic maneuvers in reinforcement learning remains difficult, and demonstration-based approaches often require reference motions that are unavailable for novel platforms or extreme stunts. We present LineRides, a line-guided learning framewor…
A cooperative robot swarm is a collective of computationally-limited robots that share a common goal. Each robot can only interact with a small subset of its peers, without knowing how this affects the collective utility. Recent advances in distributed multi-agent reinforcement l…
arXiv:2602.20078v3 Announce Type: replace-cross Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on …
arXiv:2605.02178v1 Announce Type: new Abstract: Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and …
arXiv cs.AI
TIER_1·Dahyun Oh, Minhyuk Yoon, H. Jin Kim·
arXiv:2605.02913v1 Announce Type: new Abstract: Reinforcement learning (RL) has become a central post-training tool for improving the reasoning abilities of large language models (LLMs). In these systems, the rollout, the trajectory sampled from a prompt to termination, including…
arXiv:2605.03125v1 Announce Type: new Abstract: Multi-agent reinforcement learning (MARL) holds great potential but faces robustness challenges due to environmental uncertainty. To address this, distributionally robust Markov games (RMGs) optimize worst-case performance when the …
arXiv:2605.03921v1 Announce Type: new Abstract: We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer fr…
arXiv:2512.04277v3 Announce Type: replace Abstract: Post-training with reinforcement learning (RL) typically optimizes a single scalar objective and ignores structure in how solutions are produced. We ask whether a scalar hint toward a canonical solver ordering, used only during …
arXiv cs.LG
TIER_1·Yuxin Bai, Aranyak Acharyya, Ashwin De Silva, Zeyu Shen, James Hassett, Joshua T. Vogelstein·
arXiv:2511.08717v4 Announce Type: replace-cross Abstract: Optimal control of the future is the next frontier for AI. Current approaches to this problem are typically rooted in reinforcement learning (RL). RL is mathematically distinct from supervised learning, which has been the …
We study the $(\varepsilon, \delta)$-PAC policy identification problem in finite-horizon episodic Markov Decision Processes. Existing approaches provide finite-time guarantees for approximate settings ($\varepsilon>0$) but suffer from high computational cost, rendering them hard to im…
arXiv:2605.01823v1 Announce Type: new Abstract: Recently, Reinforcement Learning from Verifiable Rewards (RLVR) has been established as a highly effective technique for augmenting the math reasoning skills of Large Language Models (LLMs) based on a single instance. Current state-…
arXiv:2605.02159v1 Announce Type: new Abstract: Deep reinforcement learning (DRL) has delivered strong results in domains such as Atari and Go, but it still suffers from high sample cost and weak transfer beyond the training setting. A common response is to reuse information from…
arXiv cs.LG
TIER_1·Sanjiv R. Das, Harshad Khadilkar, Sukrit Mittal, Daniel Ostrov, Deep Srivastav, Hungjen Wang·
arXiv:2605.02300v1 Announce Type: new Abstract: Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) …
arXiv:2605.02375v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improve…
arXiv:2605.01327v1 Announce Type: cross Abstract: Existing reinforcement learning approaches for Large Language Models typically perform policy optimization at the granularity of individual tokens or entire response sequences. However, such formulations often misalign with the na…
arXiv:2605.01805v1 Announce Type: cross Abstract: A key challenge in multi-agent reinforcement learning (MARL) lies in designing learning signals that effectively promote coordination among agents. Designing such signals necessitates the ability to quantify the true, long-term ca…
arXiv:2605.02320v1 Announce Type: cross Abstract: Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clippin…
arXiv cs.LG
TIER_1·Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner·
arXiv:2605.02528v1 Announce Type: cross Abstract: Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. Wh…
arXiv cs.LG
TIER_1·Juan Sebastian Rojas, Chi-Guhn Lee·
arXiv:2510.02945v3 Announce Type: replace Abstract: Continual reinforcement learning (continual RL) seeks to formalize the notions of lifelong learning and endless adaptation in RL. In particular, the aim of continual RL is to develop RL agents that can maintain a careful balance…
arXiv:2511.03828v2 Announce Type: replace Abstract: Offline-to-online reinforcement learning (O2O RL) faces a central challenge between retaining offline conservatism and adapting to online feedback under distribution shift. This challenge arises because data behavior evolves dur…
arXiv:2605.00425v1 Announce Type: new Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only re…
arXiv cs.LG
TIER_1·Jongsoo Lee, Jangwon Kim, Soohee Han·
arXiv:2604.03641v2 Announce Type: replace Abstract: Reinforcement learning in real-world systems often involves delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical augmentation-based approaches cause state-space explosion, which i…
arXiv cs.LG
TIER_1·Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang, Stefano V. Albrecht·
arXiv:2502.03506v2 Announce Type: replace-cross Abstract: The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, conventional methods based on CTDE can suffer from value underestimation and …
arXiv:2602.04737v2 Announce Type: replace Abstract: This paper proposes a suite of rationality measures and associated theory for reinforcement learning agents; rationality is a property increasingly critical yet rarely explored. We define an action in deployment to be perfectly rational if it …
arXiv:2602.10437v3 Announce Type: replace-cross Abstract: Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We introduce Control Rei…
arXiv:2510.22907v2 Announce Type: replace Abstract: Coding agents fail when text-level guesses outrun program facts: they hallucinate APIs, drift to the wrong symbol, and apply edits without evidence that the workspace remains valid. Compilers, type checkers, and language servers…
arXiv:2605.01567v1 Announce Type: cross Abstract: Large language model (LLM) coding agents increasingly operate over repositories, terminals, tests, and execution traces across long software-engineering episodes. Persistent memory is useful, but static vector stores or generic re…
Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable dive…
Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with model-free RL, extracting small feature graphs f…
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes fal…
Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gr…
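The "hard clipping" dilemma can be made concrete with the standard PPO clipped surrogate; a minimal NumPy sketch (illustrative of the general mechanism, not of any one paper's variant):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Once the probability ratio r leaves [1-eps, 1+eps] on the
    unfavorable side, the objective is flat in r, so that sample
    contributes no gradient -- the information loss the abstract mentions."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

A = 1.0
print(ppo_clip_objective(1.5, A))  # capped at (1 + eps) * A = 1.2
print(ppo_clip_objective(1.1, A))  # inside the trust region: 1.1 * A
```

Removing the clip would restore gradients from such outliers but, as the abstract notes, exposes optimization to unbounded updates.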
Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multiple …
arXiv:2605.00347v1 Announce Type: new Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-…
arXiv cs.LG
TIER_1·Yikai Wang, Shang Liu, Jose Blanchet·
arXiv:2605.00155v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations researc…
arXiv:2605.00667v1 Announce Type: new Abstract: Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires …
arXiv cs.LG
TIER_1·Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rong Luo, Jing Gao·
arXiv:2510.26020v2 Announce Type: replace-cross Abstract: Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents from outcome-only rewards …
arXiv:2412.02125v2 Announce Type: replace-cross Abstract: Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass …
arXiv cs.LG
TIER_1·Haichen Hu, Jian Qian, David Simchi-Levi·
arXiv:2605.00393v1 Announce Type: new Abstract: Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. Whi…
arXiv cs.LG
TIER_1·Preston Rozwood, Edward Mehrez, Ludger Paehler, Wen Sun, Steven L. Brunton·
arXiv:2403.02290v2 Announce Type: replace-cross Abstract: The Bellman equation and its continuous form, the Hamilton-Jacobi-Bellman equation, are ubiquitous in reinforcement learning and control theory. However, these equations become intractable for high-dimensional or nonlinear…
arXiv:2604.07669v2 Announce Type: replace Abstract: Lead optimization in drug discovery requires improving therapeutic properties while ensuring that molecular modifications correspond to feasible synthetic routes. Existing approaches either prioritize property scores without enf…
arXiv:2512.04341v3 Announce Type: replace Abstract: Popular offline reinforcement learning (RL) methods rely on explicit conservatism, penalizing out-of-dataset actions or restricting rollout horizons. We question the universality of this principle and revisit a complementary Bay…
arXiv:2408.11513v2 Announce Type: replace Abstract: This paper focuses on learning a Constrained Markov Decision Process (CMDP) via general parameterized policies. We propose a Primal-Dual based Regularized Accelerated Natural Policy Gradient (PDR-ANPG) algorithm that uses entrop…
arXiv:2605.00365v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collaps…
arXiv:2605.00654v1 Announce Type: new Abstract: For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the …
arXiv cs.CL
TIER_1·Zhichao Wang, Kiran Ramnath, Bin Bi, Shiva Kumar Pentyala, Sougata Chaudhuri, Shubham Mehrotra, Zixu (James) Zhu, Xiang-Bo Mao, Sitaram Asur, Na (Claire) Cheng·
arXiv:2407.16216v3 Announce Type: replace Abstract: Large language models (LLMs) trained via pretraining and supervised fine-tuning (SFT) can still produce harmful and misaligned outputs, or struggle in domains like math and coding. Reinforcement learning (RL)-based post-training…
Recent progress in multi-turn reinforcement learning (RL) has significantly improved reasoning LLMs' performances on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervas…
Safety is a primary challenge in real-world reinforcement learning (RL). Formulating safety requirements as state-wise constraints has become a prominent paradigm. Handling state-wise constraints with the Lagrangian method requires a distinct multiplier for every state, necessita…
Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to indi…
Reinforcement learning (RL) in large environments often suffers from severe computational bottlenecks, as conventional regret minimization algorithms require repeated, costly calls to planning and statistical estimation oracles. While recent advances have explored offline oracle-…
arXiv cs.LG
TIER_1·Eason Yu, Tzu Hao Liu, Clément L. Canonne, Yunke Wang, Chang Xu, Nguyen H. Tran, Stefano V. Albrecht·
arXiv:2604.28123v1 Announce Type: cross Abstract: The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distrib…
arXiv:2506.17792v2 Announce Type: replace Abstract: Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Despite the usefulness of conventional pol…
arXiv cs.AI
TIER_1·Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati·
arXiv:2306.10407v3 Announce Type: replace-cross Abstract: Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision pro…
arXiv cs.AI
TIER_1·Perry Dong, Qiyang Li, Dorsa Sadigh, Chelsea Finn·
arXiv:2507.07986v3 Announce Type: replace-cross Abstract: We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL present a unique challenge of stable …
arXiv cs.AI
TIER_1·Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun·
arXiv:2603.09117v2 Announce Type: replace-cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in inco…
arXiv cs.AI
TIER_1·Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, Philipp Fürnstahl, Bernhard Schölkopf, Andreas Krause·
arXiv:2604.18578v3 Announce Type: replace-cross Abstract: Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect betwee…
arXiv:2604.27083v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capability loss in different ways: mi…
arXiv:2604.27411v1 Announce Type: new Abstract: Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift has occurred is often the easier …
arXiv cs.LG
TIER_1·Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko·
arXiv:2604.27563v1 Announce Type: new Abstract: Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, …
arXiv:2604.27667v1 Announce Type: cross Abstract: Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good perfo…
Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human traj…
Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen initial guesses for good performance, whereas more global and less initializatio…
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techniques to estimate the gradient, which tend to have high variance, requiring many…
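The Monte-Carlo (REINFORCE-style) gradient estimate and its sampling noise can be sketched as follows; this is a toy illustration with synthetic scalar "gradients" and rewards, not any paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_grad_estimate(rewards, grad_logps, gamma=0.99):
    """Monte-Carlo policy gradient estimate for one trajectory:
    sum_t grad log pi(a_t|s_t) * G_t, with G_t the sampled
    discounted return-to-go (computed by a backward pass)."""
    G, grad = 0.0, 0.0
    for r, g in zip(reversed(rewards), reversed(grad_logps)):
        G = r + gamma * G       # return-to-go
        grad += g * G           # score-function term weighted by return
    return grad

# Repeated rollouts of the same policy yield a noisy estimate,
# illustrating the high variance the abstract refers to.
estimates = [reinforce_grad_estimate(rng.normal(1.0, 1.0, 10), np.ones(10))
             for _ in range(100)]
print(np.mean(estimates), np.std(estimates))
```

Averaging over many trajectories shrinks this variance, which is why Monte-Carlo estimators need many samples per update.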
arXiv:2505.17342v2 Announce Type: replace Abstract: Safe Reinforcement Learning (SafeRL) is the subfield of reinforcement learning that explicitly deals with safety constraints during the learning and deployment of agents. This survey provides a mathematically rigorous overview o…
arXiv:2508.19900v2 Announce Type: replace Abstract: Offline reinforcement learning (RL) enables learning effective policies from fixed datasets without any environment interaction. Existing methods typically employ policy constraints to mitigate the distribution shift encountered…
arXiv:2510.04214v3 Announce Type: replace Abstract: We deploy large language models (LLMs) as business development (BD) agents for persuasive price negotiation in online travel agencies (OTAs). The agent must follow a multi-stage Standard Operating Procedure (SOP) and strict guar…
arXiv:2509.16591v2 Announce Type: replace Abstract: Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regu…
arXiv:2602.21720v2 Announce Type: replace Abstract: Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular. Following prior work that relates cross-linguistic tendencies to biases in learn…
arXiv cs.AI
TIER_1·Seungyub Han, Hyungjin Kim, Jungwoo Lee·
arXiv:2604.26516v1 Announce Type: cross Abstract: Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-b…
Offline reinforcement learning (RL) agents often fail when deployed, as the gap between training datasets and real environments leads to unsafe behavior. To address this, we present SAS (Self-Alignment for Safety), a transformer-based framework that enables test-time adaptation i…
arXiv:2604.00860v2 Announce Type: replace Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a central post-training paradigm for improving the reasoning capabilities of large language models. Yet existing methods share a common blind spot: they optimize p…
arXiv cs.LG
TIER_1·Ali Al Housseini, Cristina Rottondi, Omran Ayoub·
arXiv:2512.05207v2 Announce Type: replace-cross Abstract: Virtual Network Embedding (VNE) is a key enabler of network slicing, yet most formulations assume that each Virtual Network Request (VNR) has a fixed topology. Recently, VNE with Alternative topologies (VNEAP) was introduc…
arXiv cs.LG
TIER_1·Ihor Vitenko, Noha Ibrahim, Sihem Amer-Yahia·
arXiv:2604.20174v2 Announce Type: replace Abstract: Reinforcement learning (RL) policies are typically trained for fixed objectives, making reuse difficult when task requirements change. We study inference-time policy reuse: given a library of pre-trained policies and a new compo…
arXiv cs.LG
TIER_1·Alexandru Cioba, Aya Kayal, Laura Toni, Sattar Vakili, Alberto Bernacchia·
arXiv:2511.03473v2 Announce Type: replace Abstract: In many real-world reinforcement learning (RL) problems, the environment exhibits inherent symmetries that can be exploited to improve learning efficiency. This paper develops a theoretical and algorithmic framework for incorpor…
arXiv:2604.25898v1 Announce Type: new Abstract: Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise …
arXiv cs.LG
TIER_1·Artur Eisele, Bernd Frauenknecht, Friedrich Solowjow, Sebastian Trimpe·
arXiv:2604.25508v1 Announce Type: new Abstract: Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dy…
Continual offline reinforcement learning (CORL) aims to learn a sequence of tasks from datasets collected over time while preserving performance on previously learned tasks. This setting corresponds to domains where new tasks arise over time, but adapting the model in live enviro…
Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a neuro-symbolic extension of Proximal Policy Optimization (PPO) that transfers pa…
Safety remains an open problem in reinforcement learning (RL), especially during training. While safety filters are promising to address safe exploration, they are generally poorly suited for high-dimensional systems with unknown dynamics. We propose Dyna-style Safety Augmented R…
Over the past few decades, machine learning has been widely used to learn complex tasks. Reinforcement Learning (RL), inspired by human behavior, is a prominent example, as it involves developing specific behaviors for specific tasks. To further challenge algorithms, Multi-Task RL (…
arXiv cs.LG
TIER_1·Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh·
arXiv:2509.25424v5 Announce Type: replace Abstract: Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising b…
arXiv:2604.24320v1 Announce Type: new Abstract: Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental und…
arXiv:2604.23056v1 Announce Type: new Abstract: We propose a simple yet effective alternative to reward normalization in policy gradient reinforcement learning by integrating a 1D Kalman filter for online reward estimation. Instead of relying on fixed heuristics, our method recur…
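A generic 1-D Kalman filter for tracking a scalar reward statistic online might look like the following sketch; the paper's exact update equations are not reproduced here, and all names and noise settings are illustrative:

```python
class RewardKalman1D:
    """Minimal 1-D Kalman filter tracking a scalar reward baseline.
    State: estimated mean reward m with posterior variance p;
    q is process noise, r is observation noise (both assumed)."""

    def __init__(self, m0=0.0, p0=1.0, process_var=1e-3, obs_var=1.0):
        self.m, self.p = m0, p0
        self.q, self.r = process_var, obs_var

    def update(self, reward):
        self.p += self.q                   # predict: uncertainty grows
        k = self.p / (self.p + self.r)     # Kalman gain
        self.m += k * (reward - self.m)    # correct toward the observation
        self.p *= (1.0 - k)                # posterior variance shrinks
        return self.m

kf = RewardKalman1D()
for r in [1.0, 1.2, 0.8, 1.1]:
    baseline = kf.update(r)
print(baseline)
```

Unlike a fixed running-mean heuristic, the gain `k` adapts automatically: it is large while the estimate is uncertain and decays as evidence accumulates.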
arXiv:2604.23576v1 Announce Type: new Abstract: Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to s…
arXiv:2604.24005v1 Announce Type: new Abstract: On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent se…
arXiv cs.LG
TIER_1·Atahan Cilan, Mahir Demir, Özgün Can Yürütken, Seyyid Osman Sevgili, Ümit Can Bekar·
arXiv:2604.24338v1 Announce Type: new Abstract: This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A m…
arXiv:2604.24532v1 Announce Type: new Abstract: Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach ad…
arXiv cs.LG
TIER_1·Zijian Guo, İlker Işık, H. M. Sabbir Ahmad, Wenchao Li·
arXiv:2604.24729v1 Announce Type: new Abstract: Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promis…
arXiv:2506.11480v4 Announce Type: replace Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLMs' reasoning abilities, yet its data inefficiency remains a major bottleneck. To address this critical yet challenging issue, we p…
arXiv:2604.17457v3 Announce Type: replace-cross Abstract: Q-value iteration (Q-VI) is usually analyzed through the \(\gamma\)-contraction of the Bellman operator. This argument proves convergence to \(Q^*\), but it gives only a coarse account of when the induced greedy policy bec…
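The \(\gamma\)-contraction the abstract starts from can be checked numerically on a toy MDP; a minimal sketch (the two-state example MDP is invented for illustration):

```python
import numpy as np

# Toy two-state, two-action MDP.
gamma = 0.9
P = np.array([[[1.0, 0.0], [0.0, 1.0]],   # P[s, a, s'] transition kernel
              [[0.5, 0.5], [1.0, 0.0]]])
R = np.array([[1.0, 0.0],                 # R[s, a] rewards
              [0.0, 2.0]])

def bellman(Q):
    """Optimal Bellman operator: (TQ)(s,a) = R(s,a) + gamma * E[max_a' Q(s',a')]."""
    return R + gamma * P @ Q.max(axis=1)

# Successive Q-VI iterates: the sup-norm gap between iterates shrinks
# by at least a factor gamma each step (the contraction property).
Q = np.zeros((2, 2))
prev_gap = None
for _ in range(5):
    Q_next = bellman(Q)
    gap = np.abs(Q_next - Q).max()
    if prev_gap is not None:
        assert gap <= gamma * prev_gap + 1e-12
    Q, prev_gap = Q_next, gap
```

As the abstract notes, this argument bounds convergence to \(Q^*\) but says little by itself about when the induced greedy policy becomes optimal.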
arXiv cs.LG
TIER_1·Elias Hossain, Mohammad Jahid Ibna Basher, Ivan Garibay, Ozlem Garibay, Niloofar Yousefi·
arXiv:2604.22873v1 Announce Type: new Abstract: Offline reinforcement learning (RL) can learn effective policies from fixed datasets, but deployment objectives may change after training, and in many applications the trained actor cannot be retrained because of data, cost, or gove…
arXiv:2604.22785v1 Announce Type: new Abstract: Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal…
arXiv:2602.08377v2 Announce Type: replace-cross Abstract: Addressing the critical need for robust safety in Large Language Models (LLMs), particularly against adversarial attacks and in-distribution errors, we introduce Reinforcement Learning with Backtracking Feedback (RLBF). Th…
Specification-guided reinforcement learning (RL) provides a principled framework for encoding complex, temporally extended tasks using formal specifications such as linear temporal logic (LTL). While recent methods have shown promising results, their ability to generalize across …
Many sequential decision-making tasks involve optimizing multiple conflicting objectives, requiring policies that adapt to different user preferences. In multi-objective reinforcement learning (MORL), one widely studied approach addresses this by training a single policy network…
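One common way a single policy can be conditioned on user preferences is linear scalarization of the vector reward; a minimal sketch (the weights and rewards are illustrative, and the paper may use a different scalarization):

```python
import numpy as np

def scalarize(reward_vec, preference):
    """Linear scalarization used in much of MORL: collapse a vector
    reward into a scalar via a preference weight w on the simplex
    (w >= 0, sum w = 1); different w trade the objectives differently."""
    w = np.asarray(preference, dtype=float)
    assert np.all(w >= 0) and abs(w.sum() - 1.0) < 1e-9
    return float(np.dot(reward_vec, w))

# Two conflicting objectives (e.g. speed vs. energy), two user preferences:
r = np.array([1.0, -0.5])
print(scalarize(r, [0.8, 0.2]))  # speed-focused:  0.8*1.0 + 0.2*(-0.5)
print(scalarize(r, [0.2, 0.8]))  # energy-focused: 0.2*1.0 + 0.8*(-0.5)
```

A preference-conditioned policy network takes `w` as an additional input, so one set of weights can serve the whole preference simplex.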
This paper evaluates an advanced jet trainer's utilization of artificial intelligence (AI)-based aircraft aerobatic maneuvers with the intention of developing an AI-assisted pilot training module for specific aircraft maneuvers. A multitude of aircraft maneuvers have been simulat…
Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single …
arXiv:2604.22558v1 Announce Type: new Abstract: As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GU…
arXiv:2604.22081v1 Announce Type: new Abstract: Most reinforcement-learning (RL) controllers used in continuous control are architecturally centralized: observations are compressed into a single latent state from which both value estimates and actions are produced. Biological con…
arXiv cs.LG
TIER_1·Zhancun Mu, Guangyu Zhao, Yiwu Zhong, Chi Zhang·
arXiv:2604.22229v1 Announce Type: new Abstract: One-step offline RL actors are attractive because they avoid backpropagating through long iterative samplers and keep inference cheap, but they still have to improve under a critic without drifting away from actions that the dataset…
arXiv cs.LG
TIER_1·Rashmeet Kaur Nayyar, Naman Shah, Siddharth Srivastava·
arXiv:2512.20831v2 Announce Type: replace-cross Abstract: Real-world sequential decision-making often involves parameterized action spaces that require both, decisions regarding discrete actions and decisions about continuous action parameters governing how an action is executed.…
arXiv:2508.06165v4 Announce Type: replace Abstract: Large Language Models (LLMs) have shown strong capabilities through two complementary paradigms: Retrieval-Augmented Generation (RAG) for knowledge grounding and Reinforcement Learning from Verifiable Rewards (RLVR) for complex …
arXiv:2604.22169v1 Announce Type: new Abstract: Generic group-based RL assumes that sampled rollout groups are already usable learning signals. We show that this assumption breaks down in sparse-hit generative recommendation, where many sampled groups never become learnable at al…
Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or …
Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in grou…
Combining the benefits of RL and SFT with on-policy distillation, a promising approach for training small models for domain performance and continual learning. Thinking Machines: Our latest post explores on-policy distillation, a training appr…
arXiv stat.ML
TIER_1·Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli·
arXiv:2604.16684v2 Announce Type: replace-cross Abstract: We study model-free reinforcement learning (RL) in non-stationary finite-horizon episodic Markov decision processes (MDPs) without prior knowledge of the non-stationarity. We focus on the piecewise stationary (PS) setting,…
arXiv stat.ML
TIER_1·Aidan Gleich, Eric Laber, Alexander Volfovsky·
arXiv:2605.11191v1 Announce Type: new Abstract: Adaptive experimentation under unknown network interference requires solving two coupled problems: (i) learning the underlying dynamics of interference among units and (ii) using these dynamics to inform treatment allocation in orde…
arXiv:2605.11473v1 Announce Type: cross Abstract: Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diag…
arXiv:2505.17506v2 Announce Type: replace Abstract: We study offline constrained reinforcement learning with general function approximation in discounted constrained Markov decision processes. Prior methods either require full data coverage for evaluating intermediate policies, l…
arXiv:2506.10664v2 Announce Type: replace Abstract: Off-policy learning enables training policies from logged interaction data. Most prior work considers the batch setting, where a policy is learned from data generated by a single behavior policy. In real systems, however, polici…
arXiv:2512.24768v3 Announce Type: replace Abstract: We investigate robustness to strong data corruption in offline sparse reinforcement learning (RL). In our setting, an adversary may arbitrarily perturb a fraction of the collected trajectories from a high-dimensional but sparse …
This work revisits standard policy gradient methods used on restricted policy classes, which are known to get stuck in suboptimal critical points. We identify an important cause of this phenomenon: the policy gradient is itself fundamentally myopic, i.e. it only improv…
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in whi…
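One classical way to see such an equivalence, offered here for intuition rather than as the paper's exact DSPI operator: under a softmax parameterization, the natural policy gradient update with step size $\eta$ multiplies the current policy by an exponentiated value term,

```latex
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)\,\exp\!\big(\eta\, Q^{\pi_k}(s,a)\big),
```

so a finite $\eta$ gives a smoothed policy-improvement step, and $\eta \to \infty$ recovers greedy policy iteration.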
arXiv stat.ML
TIER_1·Lars van der Laan, Nathan Kallus·
arXiv:2512.23694v2 Announce Type: replace Abstract: Reliable long-horizon value prediction is difficult in offline reinforcement learning because fitted value methods combine bootstrapping, function approximation, and distribution shift, while standard guarantees often require Be…
arXiv stat.ML
TIER_1·Lars van der Laan, Nathan Kallus, Aurelien Bibaut·
arXiv:2509.21172v2 Announce Type: replace-cross Abstract: Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward-value pairs can rationalize the same actions. Meaningful reward recovery…
arXiv stat.ML
TIER_1·Yuyang Zhang, Haldun Balim, Na Li·
arXiv:2605.07104v1 Announce Type: cross Abstract: Establishing almost sure convergence rates for stochastic approximation and reinforcement learning under Markovian noise is a fundamental theoretical challenge. We make progress towards this challenge for a class of stochastic app…
arXiv:2605.07218v1 Announce Type: cross Abstract: For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive …
In short: the *transformer* architecture brought massive scale to AI, and *also* provided partial guarantees of ‘reasoning out loud’, an unprecedentedly interpretable situation for AI. Reinforcement learning (…
Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action val…
Cooperative multi-agent reinforcement learning (MARL) involves complex agent interactions and requires effective exploration strategies. A prominent class of MARL algorithms, decentralized softmax policy gradient (DecSPG), addresses this through energy-based policy updates. In pr…
We investigate the ability of transformers to perform in-context reinforcement learning (ICRL), where a model must infer and execute learning algorithms from trajectory data without parameter updates. We show that a linear self-attention transformer block can provably implement p…
We formalize Rollout Informativeness under a Fixed Budget (RIFB) as the expected non-vanishing policy-gradient mass that a tool-use rollout set injects into Group Relative Policy Optimization (GRPO). We prove that any budget-agnostic independent sampler suffers a collapse rate bo…
arXiv stat.ML
TIER_1·Onno Eberhard, Thibaut Cuvelier, Michal Valko, Bruno De Backer·
arXiv:2605.02461v1 Announce Type: new Abstract: Middle-mile logistics describes the problem of routing parcels through a network of hubs linked by trucks with finite capacity. We rephrase this as a multi-object goal-conditioned MDP. Our method combines graph neural networks with …
For a risk-averse finite-horizon Markov Decision Problem, we introduce a special class of Markov coherent risk measures, called mini-batch measures. We also define the class of multipattern risk-averse problems that generalizes the class of linear systems. We use both concepts in…
arXiv stat.ML
TIER_1·Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang·
arXiv:2604.11119v2 Announce Type: replace Abstract: Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate deci…
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degra…
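As background on the Pass@K metric discussed here, the standard unbiased estimator (the one popularized by the HumanEval evaluation, not specific to this paper) computes, from n samples of which c are correct, the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the probability that k samples drawn without
    replacement from n are all incorrect."""
    if n - c < k:
        # Too few incorrect samples to fill a draw of size k:
        # at least one draw must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Diversity collapse shows up as Pass@K barely exceeding Pass@1: the samples succeed or fail together, so extra draws add little coverage.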
Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem u…
The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). However, SFT introduces distributional drift that neither preserves the model's o…
arXiv:2505.12202v3 Announce Type: replace-cross Abstract: Distributionally robust reinforcement learning (DR-RL) has recently gained significant attention as a principled approach that addresses discrepancies between training and testing environments. To balance robustness, conse…
arXiv:2604.25872v1 Announce Type: cross Abstract: Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metrics for assessing the quality o…
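A common way to assess a proxy reward against a gold signal is pairwise ranking accuracy: the fraction of output pairs that the proxy orders the same way as the ground truth. A minimal sketch (function and variable names are illustrative, not the paper's metric):

```python
from itertools import combinations

def ranking_accuracy(proxy, gold):
    """Fraction of index pairs on which the proxy reward and the
    gold reward agree about which output scores higher."""
    pairs = list(combinations(range(len(gold)), 2))
    agree = sum(
        (proxy[i] - proxy[j]) * (gold[i] - gold[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)
```

Note this treats every pair equally, which is precisely the kind of uniform weighting the abstract suggests can be misleading when some errors matter more than others.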
arXiv stat.ML
TIER_1·Marcel Hedman, Kale-ab Abebe Tessera, Juan Claude Formanek, Anya Sims, Riccardo Zamboni, Trevor McInroe, John Torr, Elliot Fosong·
arXiv:2604.23308v1 Announce Type: cross Abstract: Offline multi-agent reinforcement learning (MARL) enables policy learning from fixed datasets, but is prone to coordination failure: agents trained on static, off-policy data converge to suboptimal joint behaviours because they ca…
**Prime Intellect** released **INTELLECT-2**, a decentralized GPU training and RL framework with a vision for distributed AI training overcoming colocation limits. **ByteDance** launched **DreamO**, a unified image customization model on Hugging Face. **Qwen** released models opt…
**Implicit Process Reward Models (PRIME)** have been highlighted as a significant advancement in online reinforcement learning, trained on a **7B model** with impressive results compared to **gpt-4o**. The approach builds on the importance of process reward models established by …
In this post, you will learn how to implement reinforcement learning with verifiable rewards (RLVR), which introduces verification and transparency into reward signals to improve training performance. This approach works best when outputs can be objectively verified for correctness, s…
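A minimal sketch of what a verifiable reward can look like in practice, assuming an exact-match check on a tagged final answer; the `<answer>` tag convention and function name are illustrative, not a fixed RLVR standard:

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the model's tagged final answer
    exactly matches the reference, else 0.0. Unlike a learned reward
    model, this check is transparent and reproducible."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        return 0.0  # Malformed output earns no reward.
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0
```

The reward is then plugged into an RL loop in place of a learned reward model wherever correctness can be checked programmatically (math answers, unit tests, format constraints).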
In addition to being a Developer Advocate at Hugging Face, Thomas Simonini is building next-gen AI in games that can talk and have smart interactions with the player using Deep Reinforcement Learning (DRL) and Natural Language Processing (NLP). He also created a Deep Reinforce…
Hamish from Sajari blows our mind with a great discussion about AI in search. In particular, he talks about Sajari’s quest for performant AI implementations and extensive use of Reinforcement Learning (RL). We’ve been wanting to make this one happen for a while, and it was wel…
Daniel and Chris have a fascinating discussion with Anna Goldie and Azalia Mirhoseini from Google Brain about the use of reinforcement learning for chip floor planning - or placement - in which many new designs are generated, and then evaluated, to find an optimal component la…
While attending the NVIDIA GPU Technology Conference in Silicon Valley, Chris met up with Adam Stooke, a speaker and PhD student at UC Berkeley who is doing groundbreaking work in large-scale deep reinforcement learning and robotics. Adam took Chris on a tour of deep reinforce…
Leslie Kaelbling is a roboticist and professor at MIT. She is recognized for her work in reinforcement learning, planning, robot navigation, and several other topics in AI. She won the IJCAI Computers and Thought Award and was the editor-in-chief of the prestigious Journal of …
Pieter Abbeel is a professor at UC Berkeley, director of the Berkeley Robot Learning Lab, and is one of the top researchers in the world working on how to make robots understand and interact with the world around them, especially through imitation and deep reinforcement learni…
📰 2026 Breakthrough: OpenAI Eliminates Parameter Updates in Reinforcement Learning with Python Scripts A groundbreaking reinforcement learning paradigm developed by OpenAI researcher Jia-Yi Weng eliminates the need for parameter updates, enabling AI agents to make decisions by ge…
📰 New Learning Method: Reinforcement Learning Without Parameter Updates OpenAI researchers introduced a new reinforcement learning paradigm that enables AI to make decisions on its own without updating parameters. The method has the AI learn by writing a .py file.…