Paper Title
Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards
Paper Authors
Paper Abstract
A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.
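To make the idea of a locally self-avoiding walk concrete, below is a minimal Python sketch of an exploratory walk in a 2D state space, in the spirit of intuitions (1) and (2) above. It is not the paper's algorithm: the repulsion score, the exponential decay over the recent trajectory, and all names and parameters (locally_self_avoiding_walk, memory, decay, n_candidates) are illustrative assumptions. At each step, the walker greedily picks, among random candidate directions, the one that maximizes a decay-weighted distance to recently visited states, so the trajectory carries a short-term memory that fades over time.

import numpy as np

def locally_self_avoiding_walk(n_steps=500, memory=50, decay=0.9,
                               step_size=0.1, n_candidates=16, seed=0):
    """Toy 2D walk that repels from recently visited states.

    Illustrative sketch only (not the paper's method): each step samples
    candidate directions and keeps the one maximizing a decay-weighted
    distance to the recent trajectory, yielding a persistent, locally
    self-avoiding path with decaying temporal correlations.
    """
    rng = np.random.default_rng(seed)
    pos = np.zeros(2)
    trajectory = [pos.copy()]
    for _ in range(n_steps):
        recent = np.array(trajectory[-memory:])          # short-term memory window
        weights = decay ** np.arange(len(recent))[::-1]  # older states weigh less
        angles = rng.uniform(0.0, 2.0 * np.pi, n_candidates)
        best_score, best_step = -np.inf, None
        for a in angles:
            step = step_size * np.array([np.cos(a), np.sin(a)])
            cand = pos + step
            # decay-weighted spread of the candidate from recent states
            score = np.sum(weights * np.linalg.norm(recent - cand, axis=1))
            if score > best_score:
                best_score, best_step = score, step
        pos = pos + best_step
        trajectory.append(pos.copy())
    return np.array(trajectory)

if __name__ == "__main__":
    traj = locally_self_avoiding_walk()
    print("net displacement:", np.linalg.norm(traj[-1] - traj[0]))

Under these assumptions, shrinking decay toward 0 recovers a nearly memoryless random walk, while values near 1 make the walk spread out more aggressively, which is one way to picture the trade-off the abstract describes.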