Robust Reinforcement Learning using Offline Data

Paper Title

Robust Reinforcement Learning using Offline Data

Authors

Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh

Abstract

The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
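The max-min formulation mentioned in the abstract can be illustrated with a small tabular sketch. This is not the RFQI algorithm: it assumes the transition models are known, the uncertainty set is a finite list of models, and the worst case is chosen independently for each state-action pair. RFQI itself works from offline data only. All names and numbers below (`robust_value_iteration`, `nominal`, `perturbed`, the reward table) are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# model[s][a] is a list of next-state probabilities.
# An uncertainty set is simply a list of such models (an assumption of this sketch).
TransitionModel = List[List[List[float]]]


def robust_value_iteration(
    models: List[TransitionModel],
    rewards: List[List[float]],
    gamma: float = 0.9,
    iters: int = 500,
) -> Tuple[List[float], List[int]]:
    """Tabular max-min value iteration over a finite set of known models."""
    n_states, n_actions = len(rewards), len(rewards[0])

    def robust_q(v: List[float], s: int, a: int) -> float:
        # Inner minimization: worst-case expected next value over the set.
        worst = min(
            sum(p * v[s2] for s2, p in enumerate(model[s][a]))
            for model in models
        )
        return rewards[s][a] + gamma * worst

    v = [0.0] * n_states
    for _ in range(iters):
        # Outer maximization over actions (robust Bellman backup).
        v = [
            max(robust_q(v, s, a) for a in range(n_actions))
            for s in range(n_states)
        ]

    # Greedy policy with respect to the final robust Q-values.
    policy = [
        max(range(n_actions), key=lambda a, s=s: robust_q(v, s, a))
        for s in range(n_states)
    ]
    return v, policy


# Hypothetical 2-state, 2-action problem: being in state 1 pays reward 1.
rewards = [[0.0, 0.0], [1.0, 1.0]]
nominal = [
    [[0.1, 0.9], [0.5, 0.5]],  # from state 0: action 0, action 1
    [[0.1, 0.9], [0.5, 0.5]],  # from state 1: action 0, action 1
]
perturbed = [
    [[0.8, 0.2], [0.5, 0.5]],  # action 0 is far less reliable in this model
    [[0.8, 0.2], [0.5, 0.5]],
]

v_nominal, pi_nominal = robust_value_iteration([nominal], rewards)
v_robust, pi_robust = robust_value_iteration([nominal, perturbed], rewards)
```

In this toy problem the policy computed on the nominal model alone prefers action 0, while the robust policy switches to action 1, whose outcome is the same under both models. The robust value is lower than the nominal one in every state, which is the price of guarding against the worst model in the set.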

Paper Title