Paper Title

One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning

Paper Authors

Pedro Cisneros-Velarde, Boxiang Lyu, Sanmi Koyejo, Mladen Kolar

Paper Abstract

Although parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
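
To make the exploration scheme described in the abstract concrete, below is a minimal sketch of reward-free parallel data collection in which every agent follows one and the same exploration policy. This is not the paper's algorithm (which is stated for linear MDPs and Markov games); the environment `ToyChainEnv`, the policy `uniform_policy`, and the routine `parallel_reward_free_exploration` are hypothetical names used only to illustrate the pattern: many agents run a single shared policy, their reward-free transitions are pooled, and planning happens later once rewards are revealed.

```python
"""Conceptual sketch only (hypothetical names, not the paper's algorithm):
reward-free parallel exploration where all agents share a single policy."""

import numpy as np


class ToyChainEnv:
    """Tiny tabular chain MDP, included only to make the sketch runnable; it emits no rewards."""

    def __init__(self, num_states=5, seed=0):
        self.num_states = num_states
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves right, action 0 moves left, with a small chance of slipping.
        move = 1 if action == 1 else -1
        if self.rng.random() < 0.1:
            move = -move
        self.state = int(np.clip(self.state + move, 0, self.num_states - 1))
        return self.state


def uniform_policy(state, rng):
    """The single shared exploration policy; here simply uniform over two actions."""
    return int(rng.integers(2))


def parallel_reward_free_exploration(num_agents=4, horizon=10, episodes_per_agent=5, seed=0):
    """Every parallel agent executes the SAME policy; transitions are pooled into
    one reward-free dataset that a later planning phase can reuse for any reward."""
    rng = np.random.default_rng(seed)
    dataset = []  # (state, action, next_state) tuples; rewards are never observed here
    for agent_id in range(num_agents):
        env = ToyChainEnv(seed=seed + agent_id)  # each agent runs its own copy of the environment
        for _ in range(episodes_per_agent):
            state = env.reset()
            for _ in range(horizon):
                action = uniform_policy(state, rng)
                next_state = env.step(action)
                dataset.append((state, action, next_state))
                state = next_state
    return dataset


if __name__ == "__main__":
    data = parallel_reward_free_exploration()
    print(f"Collected {len(data)} reward-free transitions from agents sharing one policy.")
```

The point the abstract makes is that this shared-policy scheme is already enough: it achieves an almost-linear speedup over a fully sequential agent and is near-minimax optimal for reward-free linear MDPs, so no mechanism for enforcing policy diversity across agents is needed during exploration.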
