Paper Title
Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning
Paper Authors
Paper Abstract
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the algorithms designed within the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy-gradient-based approaches in both exact-gradient and sample-based scenarios.
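To make the multi-objective setup concrete, below is a minimal sketch, assuming a randomly generated tabular MDP with two reward functions, a proportional-fairness scalarization ($\log V_1 + \log V_2$), and exact Q-functions. It shows only a plain natural-policy-gradient step on the scalarized objective; it is not the paper's ARNPG update, and omits the regularization and anchor-changing mechanism the framework adds on top of such a step. All names (`P`, `R`, `rho`, `eta`, state/action sizes) are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch only (not the paper's ARNPG algorithm): a plain natural-policy-
# gradient step on a smooth concave scalarization (proportional fairness: log V1 + log V2)
# of two reward value functions in a small tabular MDP, using exact Q-functions and the
# standard multiplicative softmax update.

np.random.seed(0)
S, A, gamma, eta = 3, 2, 0.9, 1.0                  # states, actions, discount, step size
P = np.random.dirichlet(np.ones(S), size=(S, A))   # transition kernel, P[s, a, :] sums to 1
R = np.random.rand(2, S, A)                        # two reward functions R[i, s, a]
rho = np.ones(S) / S                               # initial state distribution

def q_values(pi, r):
    """Exact Q^pi for reward r via the Bellman equation (discounted, tabular)."""
    P_pi = np.einsum('sap,pb->sapb', P, pi).reshape(S * A, S * A)  # (s,a) -> (s',a') kernel
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A)).reshape(S, A)

def value(pi, r):
    """V^pi(rho) = E_{s~rho, a~pi}[Q^pi(s, a)]."""
    return rho @ np.sum(pi * q_values(pi, r), axis=1)

pi = np.ones((S, A)) / A                           # start from the uniform policy
for _ in range(200):
    V = np.array([value(pi, R[i]) for i in range(2)])
    w = 1.0 / np.maximum(V, 1e-8)                  # dF/dV_i for F = sum_i log V_i
    Q_mix = sum(w[i] * q_values(pi, R[i]) for i in range(2))
    logits = eta * (Q_mix - Q_mix.max(axis=1, keepdims=True))  # shift for numerical stability
    pi = pi * np.exp(logits)                       # NPG-style multiplicative update
    pi /= pi.sum(axis=1, keepdims=True)

print('log V1 + log V2 after training:', sum(np.log(value(pi, R[i])) for i in range(2)))
```

The per-iteration step combines the exact Q-functions of the individual rewards with weights given by the gradient of the scalarization, which is the generic way an NPG-type method handles a smooth concave objective over multiple value functions; the constrained and max-min criteria mentioned in the abstract would require different weightings or dual variables.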