Paper Title
Efficient Planning in a Compact Latent Action Space
Paper Authors
Paper Abstract
Planning-based reinforcement learning has shown strong performance in tasks with discrete and low-dimensional continuous action spaces. However, planning usually adds significant computational overhead to decision-making, and scaling such methods to high-dimensional action spaces remains challenging. To advance efficient planning for high-dimensional continuous control, we propose the Trajectory Autoencoding Planner (TAP), which learns low-dimensional latent action codes with a state-conditional VQ-VAE. The decoder of the VQ-VAE thus serves as a novel dynamics model that takes latent actions and the current state as input and reconstructs long-horizon trajectories. At inference time, given a starting state, TAP searches over the discrete latent actions to find trajectories that have both high probability under the training distribution and high predicted cumulative reward. Empirical evaluation in the offline RL setting demonstrates low decision latency that is essentially independent of the raw action dimensionality. For Adroit robotic hand manipulation tasks with high-dimensional continuous action spaces, TAP surpasses existing model-based methods by a large margin and also beats strong model-free actor-critic baselines.
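To make the architecture concrete, below is a minimal sketch of a state-conditional VQ-VAE in PyTorch. It is not the authors' released code: a single latent code per trajectory, MLP encoder/decoder, and the sizes `num_codes=512` and `code_dim=64` are illustrative assumptions standing in for TAP's sequence of latent codes and Transformer backbone.

```python
# A minimal sketch of a state-conditional VQ-VAE, NOT the authors' code.
# Hypothetical simplifications: one latent code per trajectory and MLPs
# in place of TAP's sequence of codes and Transformer encoder/decoder.
import torch
import torch.nn as nn


class StateConditionalVQVAE(nn.Module):
    def __init__(self, state_dim, traj_dim, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        # Encoder compresses (initial state, flattened trajectory) to a latent.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + traj_dim, 256), nn.ReLU(),
            nn.Linear(256, code_dim))
        # Decoder doubles as the dynamics model: (state, latent action code)
        # -> reconstructed long-horizon trajectory.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + code_dim, 256), nn.ReLU(),
            nn.Linear(256, traj_dim))

    def quantize(self, z):
        # Snap each continuous latent to its nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)   # (B, num_codes)
        idx = dists.argmin(dim=-1)                     # (B,)
        z_q = self.codebook(idx)                       # (B, code_dim)
        # Straight-through estimator: gradients skip the argmin.
        z_st = z + (z_q - z).detach()
        return z_st, z_q, idx

    def forward(self, state, traj):
        z = self.encoder(torch.cat([state, traj], dim=-1))
        z_st, z_q, idx = self.quantize(z)
        recon = self.decoder(torch.cat([state, z_st], dim=-1))
        return recon, z, z_q, idx

    def loss(self, state, traj, beta=0.25):
        # Standard VQ-VAE objective: reconstruction + codebook + commitment.
        recon, z, z_q, _ = self(state, traj)
        recon_loss = (recon - traj).pow(2).mean()
        codebook_loss = (z.detach() - z_q).pow(2).mean()
        commit_loss = (z - z_q.detach()).pow(2).mean()
        return recon_loss + codebook_loss + beta * commit_loss
```

The straight-through estimator in `quantize` is the standard VQ-VAE trick that lets gradients bypass the non-differentiable nearest-neighbour lookup during training.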
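The planning step described in the abstract, searching the discrete latent space for trajectories that score well on both data likelihood and predicted return, can be sketched as a simple random-shooting planner. TAP itself performs beam search under a learned prior over latent codes; here `prior_logp` and `predicted_return` are hypothetical stand-in callables that each return one score per candidate, and `state` is assumed to have shape `(1, state_dim)`.

```python
# Hypothetical planning sketch. TAP performs beam search under a learned
# prior over latent codes; random shooting is a simplified stand-in.
import torch


@torch.no_grad()
def plan(model, prior_logp, predicted_return, state, num_candidates=256):
    n = model.codebook.num_embeddings
    idx = torch.randint(0, n, (num_candidates,))             # candidate codes
    z_q = model.codebook(idx)                                # (N, code_dim)
    states = state.expand(num_candidates, -1)                # broadcast state
    trajs = model.decoder(torch.cat([states, z_q], dim=-1))  # decode each one
    # Score candidates by likelihood under the training distribution plus
    # predicted cumulative reward, per the objective in the abstract.
    scores = prior_logp(idx, states) + predicted_return(trajs)
    return trajs[scores.argmax()]
```

Because the search runs over a small discrete codebook rather than the raw continuous action space, the cost of a decision step depends on the codebook size and planning horizon rather than the raw action dimensionality, which is the source of the dimension-independent decision latency the abstract reports.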