Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning

Paper Title

Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning

Authors

Shentao Yang, Yihao Feng, Shujian Zhang, Mingyuan Zhou

Abstract

Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets, without interacting with the underlying environment during the learning process. A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy. To avoid the detrimental impact of distribution mismatch, we regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process. Further, we train a dynamics model to both implement this regularization and better estimate the stationary distribution of the current policy, reducing the error induced by distribution mismatch. On a wide range of continuous-control offline RL datasets, our method indicates competitive performance, which validates our algorithm. The code is publicly available.
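
To make the notion of distribution mismatch in the abstract concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes a hypothetical 2-state, 2-action tabular MDP with made-up transition probabilities, computes the undiscounted stationary state-action distribution of a fixed policy by power iteration, and measures its distance to a hypothetical offline-data distribution. Total variation distance is used here purely for simplicity; the names `P`, `pi`, `d_data`, `stationary_state_action`, and `total_variation` are illustrative assumptions and do not come from the paper.

```python
# Toy illustration only: a 2-state, 2-action tabular MDP (all numbers are made up).
# P[s][a][s2] = probability of moving to state s2 after taking action a in state s.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.1, 0.9]],
]
# pi[s][a] = probability that the policy picks action a in state s.
pi = [[0.5, 0.5], [0.2, 0.8]]


def stationary_state_action(P, pi, iters=1000):
    """Undiscounted stationary state-action distribution via power iteration."""
    n_s, n_a = len(P), len(P[0])
    d = [1.0 / n_s] * n_s  # start from a uniform state distribution
    for _ in range(iters):
        # One step of the state Markov chain induced by following pi.
        d = [
            sum(d[s] * pi[s][a] * P[s][a][s2] for s in range(n_s) for a in range(n_a))
            for s2 in range(n_s)
        ]
    # Joint distribution over (state, action): d(s) * pi(a | s).
    return [[d[s] * pi[s][a] for a in range(n_a)] for s in range(n_s)]


def total_variation(p, q):
    """Total variation distance between two state-action distributions."""
    return 0.5 * sum(abs(x - y) for rp, rq in zip(p, q) for x, y in zip(rp, rq))


# Hypothetical empirical state-action frequencies of an offline dataset.
d_data = [[0.4, 0.1], [0.1, 0.4]]

d_pi = stationary_state_action(P, pi)
mismatch = total_variation(d_pi, d_data)
print(f"mismatch (TV distance): {mismatch:.3f}")
```

In this toy setting, a regularizer in the spirit of the abstract would add a penalty on `mismatch` to the policy objective, so that policy updates which move `d_pi` far from `d_data` are discouraged. In continuous-control problems the stationary distribution cannot be enumerated this way, which is where a learned dynamics model for generating samples from the current policy would come in.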
