Paper Title

Latent-Variable Advantage-Weighted Policy Optimization for Offline RL

Paper Authors

Xi Chen, Ali Ghadirzadeh, Tianhe Yu, Yuan Gao, Jianhao Wang, Wenzhe Li, Bin Liang, Chelsea Finn, Chongjie Zhang

Paper Abstract

Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions. This setting is particularly well-suited for continuous control robotic applications for which online data collection based on trial-and-error is costly and potentially unsafe. In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios, such as data from several human demonstrators or from policies that act with different purposes. Unfortunately, such datasets can exacerbate the distribution shift between the behavior policy underlying the data and the optimal policy to be learned, leading to poor performance. To address this challenge, we propose to leverage latent-variable policies that can represent a broader class of policy distributions, leading to better adherence to the training data distribution while maximizing reward via a policy over the latent variable. As we empirically show on a range of simulated locomotion, navigation, and manipulation tasks, our method, referred to as latent-variable advantage-weighted policy optimization (LAPO), improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets, and by 8% on datasets with narrow and biased distributions.
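To make the high-level idea in the abstract more concrete, below is a minimal PyTorch sketch of advantage-weighted optimization with a latent-variable policy: a decoder maps a state and a latent code to an action, a latent policy proposes the code, and the reconstruction loss on dataset actions is weighted by the exponentiated advantage. This is an illustrative sketch under assumptions, not the paper's actual implementation; all names (Decoder, LatentPolicy, advantage_weighted_loss), dimensions, and constants are hypothetical.

# Illustrative sketch of advantage-weighted optimization with a latent-variable policy.
# All module/function names and hyperparameters are assumptions, not taken from the LAPO paper.
import torch
import torch.nn as nn

LATENT_DIM, STATE_DIM, ACTION_DIM = 8, 17, 6  # assumed dimensions


class Decoder(nn.Module):
    """Maps a state and a latent code to an action (stays close to the data distribution)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh(),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))


class LatentPolicy(nn.Module):
    """High-level policy that outputs a latent code; reward is maximized in latent space."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, LATENT_DIM),
        )

    def forward(self, state):
        return self.net(state)


def advantage_weighted_loss(decoder, latent_policy, state, action, advantage, beta=1.0):
    """Weight the (state, action) reconstruction by exp(advantage / beta), so training
    favors high-advantage dataset actions while remaining anchored to the data."""
    z = latent_policy(state)
    pred_action = decoder(state, z)
    weights = torch.exp(advantage / beta).clamp(max=100.0)  # clipped for numerical stability
    return (weights * ((pred_action - action) ** 2).sum(dim=-1)).mean()


if __name__ == "__main__":
    decoder, latent_policy = Decoder(), LatentPolicy()
    state = torch.randn(32, STATE_DIM)
    action = torch.randn(32, ACTION_DIM).tanh()
    advantage = torch.randn(32)  # in practice this would come from a learned critic
    loss = advantage_weighted_loss(decoder, latent_policy, state, action, advantage)
    loss.backward()
    print(float(loss))

In this sketch, the exponential advantage weighting plays the role of keeping the learned policy within the support of the dataset while still preferring higher-return behavior; the latent code is where reward maximization happens, as described in the abstract.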
