Paper Title
A Parametric Class of Approximate Gradient Updates for Policy Optimization
Authors
Abstract
Approaches to policy optimization have been motivated from diverse principles, based on how the parametric model is interpreted (e.g. value versus policy representation) or how the learning objective is formulated, yet they share a common goal of maximizing expected return. To better capture the commonalities and identify key differences between policy optimization methods, we develop a unified perspective that re-expresses the underlying updates in terms of a limited choice of gradient form and scaling function. In particular, we identify a parameterized space of approximate gradient updates for policy optimization that is highly structured, yet covers both classical and recent examples, including PPO. As a result, we obtain novel yet well-motivated updates that generalize existing algorithms in a way that can deliver benefits both in terms of convergence speed and final result quality. An experimental investigation demonstrates that the additional degrees of freedom provided in the parameterized family of updates can be leveraged to obtain non-trivial improvements both in synthetic domains and on popular deep RL benchmarks.
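Since the abstract frames policy-optimization updates as a choice of gradient form and scaling function, the following is a minimal illustrative sketch, not the paper's actual parameterization: it only shows how an on-policy policy gradient and a PPO-style clipped update can both be written as an advantage-weighted score-function step with a pluggable scaling term. All function names and the specific scaling choices below are assumptions made for illustration.

```python
import numpy as np

# Sketch (assumed, not the paper's exact formulation): write the per-sample
# update as  scale(rho, A) * grad_log_pi(a|s), where rho = pi_theta(a|s)/pi_old(a|s)
# is the importance ratio and A the advantage. Different scaling functions
# recover different algorithms.

def scale_vanilla_pg(rho, adv):
    # On-policy policy gradient: the scale is just the advantage (rho = 1 on-policy).
    return adv

def scale_ppo_clip(rho, adv, eps=0.2):
    # PPO-style clipping: the gradient of min(rho*A, clip(rho, 1-eps, 1+eps)*A)
    # w.r.t. log pi is rho*A when the unclipped branch is active, and 0 when the
    # clipped (constant-in-theta) branch is active.
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return np.where(unclipped <= clipped, rho * adv, 0.0)

def approximate_update(grad_log_pi, rho, adv, scale_fn, lr=1e-2):
    # grad_log_pi: (batch, n_params) array of score vectors; returns a parameter step
    # averaged over the batch.
    scales = scale_fn(rho, adv)
    return lr * np.mean(scales[:, None] * grad_log_pi, axis=0)
```

Under this framing, varying the scaling function (and, more generally, the gradient form) spans a family of updates; the paper's contribution is a structured parameterization of that family, of which the two cases above are only familiar special cases.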