Paper Title

Bridging the Gap Between Offline and Online Reinforcement Learning Evaluation Methodologies

Paper Authors

Shivakanth Sujit, Pedro H. M. Braga, Jorg Bornschein, Samira Ebrahimi Kahou

Paper Abstract

Reinforcement learning (RL) has shown great promise with algorithms learning in environments with large state and action spaces purely from scalar reward signals. A crucial challenge for current deep RL algorithms is that they require a tremendous amount of environment interactions for learning. This can be infeasible in situations where such interactions are expensive, such as in robotics. Offline RL algorithms try to address this issue by bootstrapping the learning process from existing logged data without needing to interact with the environment from the very beginning. While online RL algorithms are typically evaluated as a function of the number of environment interactions, there exists no single established protocol for evaluating offline RL methods. In this paper, we propose a sequential approach to evaluate offline RL algorithms as a function of the training set size and thus by their data efficiency. Sequential evaluation provides valuable insights into the data efficiency of the learning process and the robustness of algorithms to distribution changes in the dataset while also harmonizing the visualization of the offline and online learning phases. Our approach is generally applicable and easy to implement. We compare several existing offline RL algorithms using this approach and present insights from a variety of tasks and offline datasets.
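To make the idea of sequential evaluation concrete, the sketch below illustrates one way to record an offline RL algorithm's performance as a function of the number of logged samples it has consumed, producing a learning curve comparable to an online one. This is only an illustrative reading of the protocol described in the abstract, not the paper's implementation: the agent interface (`agent.update`, `agent.act`), the `rollout_return` helper, and the Gymnasium-style environment API are assumptions introduced for the example.

```python
import numpy as np

def sequential_evaluation(agent, dataset, env, eval_every=1000, eval_episodes=10):
    """Feed the logged dataset to the agent in order and periodically evaluate
    the current policy, yielding performance as a function of training set size.

    `agent` and `dataset` are hypothetical placeholders for any concrete
    offline RL implementation and its logged transitions."""
    curve = []  # (samples_seen, mean_return) pairs
    for i, transition in enumerate(dataset, start=1):
        agent.update(transition)  # update on the data seen so far
        if i % eval_every == 0:
            returns = [rollout_return(agent, env) for _ in range(eval_episodes)]
            curve.append((i, float(np.mean(returns))))
    return curve

def rollout_return(agent, env):
    """Run one evaluation episode with the current policy
    (assumes a Gymnasium-style environment interface)."""
    obs, _ = env.reset()
    done, total = False, 0.0
    while not done:
        action = agent.act(obs, greedy=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        done = terminated or truncated
    return total
```

The resulting curve can be plotted against the number of samples consumed, so offline training and any subsequent online fine-tuning appear on the same axis.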
