The Value-Improvement Path: Towards Better Representations for Reinforcement Learning

Paper title

The Value-Improvement Path: Towards Better Representations for Reinforcement Learning

Authors

Will Dabney, André Barreto, Mark Rowland, Robert Dadashi, John Quan, Marc G. Bellemare, David Silver

Abstract

In value-based reinforcement learning (RL), unlike in supervised learning, the agent faces not a single, stationary, approximation problem, but a sequence of value prediction problems. Each time the policy improves, the nature of the problem changes, shifting both the distribution of states and their values. In this paper we take a novel perspective, arguing that the value prediction problems faced by an RL agent should not be addressed in isolation, but rather as a single, holistic, prediction problem. An RL algorithm generates a sequence of policies that, at least approximately, improve towards the optimal policy. We explicitly characterize the associated sequence of value functions and call it the value-improvement path. Our main idea is to approximate the value-improvement path holistically, rather than to solely track the value function of the current policy. Specifically, we discuss the impact that this holistic view of RL has on representation learning. We demonstrate that a representation that spans the past value-improvement path will also provide an accurate value approximation for future policy improvements. We use this insight to better understand existing approaches to auxiliary tasks and to propose new ones. To test our hypothesis empirically, we augmented a standard deep RL agent with an auxiliary task of learning the value-improvement path. In a study of Atari 2600 games, the augmented agent achieved approximately double the mean and median performance of the baseline agent.
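To make the notion of a value-improvement path concrete, the following is a minimal sketch that runs tabular policy iteration and records the value function of each successive policy. It is an illustration only, not the deep RL agent or auxiliary task from the paper: the four-state chain MDP, the discount factor, and every function name here are assumptions chosen for the example.

```python
# Hypothetical toy MDP (an assumption for illustration, not from the paper):
# a 4-state chain where the last state is an absorbing goal.
GAMMA = 0.9
N_STATES = 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic toy dynamics: returns (next_state, reward)."""
    if state == N_STATES - 1:
        return state, 0.0  # absorbing goal state, no further reward
    nxt = max(state - 1, 0) if action == 0 else state + 1
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) <- r + gamma * V(s')."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            nxt, r = step(s, policy[s])
            v[s] = r + GAMMA * v[nxt]
    return v


def improve(v):
    """Greedy policy with respect to v (ties go to the first action)."""
    policy = []
    for s in range(N_STATES):
        def q(a, s=s):
            nxt, r = step(s, a)
            return r + GAMMA * v[nxt]
        policy.append(max(ACTIONS, key=q))
    return policy


def value_improvement_path(initial_policy, max_iters=20):
    """Return the list of value functions visited by policy iteration."""
    policy, path = list(initial_policy), []
    for _ in range(max_iters):
        v = evaluate(policy)
        path.append(v)
        new_policy = improve(v)
        if new_policy == policy:  # policy is stable, so stop
            break
        policy = new_policy
    return path


if __name__ == "__main__":
    for k, v in enumerate(value_improvement_path([0, 0, 0, 0])):
        print(k, [round(x, 3) for x in v])
```

Each entry of the returned list is one point on the path. In this toy setting every value function is at least as large, state by state, as the one before it, which is the sequence the abstract proposes to approximate as a whole rather than tracking only its latest element.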
