Paper Title
Invariant Policy Optimization: Towards Stronger Generalization in Reinforcement Learning
Paper Authors
Paper Abstract
A fundamental challenge in reinforcement learning is to learn policies that generalize beyond the operating domains experienced during training. In this paper, we approach this challenge through the following invariance principle: an agent must find a representation such that there exists an action-predictor built on top of this representation that is simultaneously optimal across all training domains. Intuitively, the resulting invariant policy enhances generalization by finding causes of successful actions. We propose a novel learning algorithm, Invariant Policy Optimization (IPO), that implements this principle and learns an invariant policy during training. We compare our approach with standard policy gradient methods and demonstrate significant improvements in generalization performance on unseen domains for linear quadratic regulator and grid-world problems, and an example where a robot must learn to open doors with varying physical properties.
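To make the invariance principle more concrete, the following is a minimal, hypothetical sketch of how such a principle can be attached to per-domain policy-gradient losses. It uses the "dummy classifier" gradient penalty popularized by Invariant Risk Minimization; the paper's own IPO algorithm may instantiate the principle differently, and the names `policy`, `rollouts_per_domain`, and `penalty_weight` are illustrative assumptions rather than the authors' API.

import torch

def domain_pg_loss(policy, rollout, dummy_w):
    """REINFORCE-style surrogate loss for a single training domain.
    `rollout` is a list of (observation, action, return-to-go) tuples,
    where `policy(obs)` is assumed to return action logits."""
    log_probs, returns = [], []
    for obs, act, ret in rollout:
        # Scale the logits by the scalar dummy predictor w (fixed at 1.0).
        dist = torch.distributions.Categorical(logits=dummy_w * policy(obs))
        log_probs.append(dist.log_prob(torch.as_tensor(act)))
        returns.append(float(ret))
    return -(torch.stack(log_probs) * torch.tensor(returns)).mean()

def invariant_policy_loss(policy, rollouts_per_domain, penalty_weight=1.0):
    """Sum of per-domain policy-gradient losses plus an IRM-style penalty:
    the squared gradient of each domain's loss with respect to the fixed
    scalar predictor w = 1.0. The penalty vanishes only when the same
    action-predictor is simultaneously optimal in every training domain."""
    total = torch.zeros(())
    for rollout in rollouts_per_domain:
        dummy_w = torch.ones((), requires_grad=True)
        loss = domain_pg_loss(policy, rollout, dummy_w)
        grad, = torch.autograd.grad(loss, dummy_w, create_graph=True)
        total = total + loss + penalty_weight * grad.pow(2)
    return total

In a training loop one would call backward() on this total loss and step an optimizer over the policy parameters, with penalty_weight trading off per-domain return against invariance across the training domains.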