Paper Title
Mixture of Step Returns in Bootstrapped DQN
Paper Authors
Paper Abstract
The concept of utilizing multi-step returns for updating value functions has been adopted in deep reinforcement learning (DRL) for a number of years. Updating value functions with different backup lengths provides advantages in different aspects, including the bias and variance of value estimates, convergence speed, and the exploration behavior of the agent. Conventional methods such as TD(λ) leverage these advantages by using a target value equivalent to an exponentially weighted average of different step returns. Nevertheless, integrating step returns into a single target sacrifices the diversity of the advantages offered by the individual step-return targets. To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN), which is built on top of bootstrapped DQN and uses different backup lengths for different bootstrapped heads. MB-DQN enables a heterogeneity of target values that is unavailable in approaches relying on only a single target value. As a result, it is able to maintain the advantages offered by different backup lengths. In this paper, we first discuss the motivational insights through a simple maze environment. To validate the effectiveness of MB-DQN, we perform experiments on the Atari 2600 benchmark environments and demonstrate the performance improvement of MB-DQN over a number of baseline methods. We further provide a set of ablation studies to examine the impacts of different design configurations of MB-DQN.
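The core mechanism described in the abstract, assigning a different backup length (n-step return) to each bootstrapped head, can be illustrated with a minimal sketch. The helper `n_step_target`, the particular backup lengths, and the placeholder reward and value numbers below are illustrative assumptions for exposition only, not the authors' implementation.

```python
import numpy as np

def n_step_target(rewards, bootstrap_value, gamma, n):
    """Compute an n-step return target:
    G = r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * V(s_{t+n}).
    `rewards` must contain at least n rewards starting at time t;
    `bootstrap_value` is the value estimate of the state reached after n steps."""
    g = 0.0
    for k in range(n):
        g += (gamma ** k) * rewards[k]
    return g + (gamma ** n) * bootstrap_value

# Hypothetical mixture of backup lengths, one per bootstrapped head.
backup_lengths = [1, 3, 5, 10]
gamma = 0.99

# Toy transition data: rewards r_t .. r_{t+9} and, for each head,
# a placeholder value estimate of the state reached after that head's
# own backup length.
rewards = np.ones(10)                         # placeholder rewards
head_bootstrap_values = [0.5, 0.4, 0.3, 0.2]  # placeholder V_k(s_{t+n_k})

# Each head k regresses toward its own n_k-step target, so the ensemble
# sees heterogeneous targets rather than a single averaged one.
targets = [
    n_step_target(rewards[:n], v, gamma, n)
    for n, v in zip(backup_lengths, head_bootstrap_values)
]
print(dict(zip(backup_lengths, np.round(targets, 3))))
```

In this sketch the heterogeneity comes from keeping the per-head targets separate, in contrast to a TD(λ)-style approach that would collapse the different step returns into a single exponentially weighted average.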