Paper Title
Lifelong Incremental Reinforcement Learning with Online Bayesian Inference
Paper Authors
Paper Abstract
A central capability of a long-lived reinforcement learning (RL) agent is to incrementally adapt its behavior as its environment changes, and to incrementally build upon previous experiences to facilitate future learning in real-world scenarios. In this paper, we propose LifeLong Incremental Reinforcement Learning (LLIRL), a new incremental algorithm for efficient lifelong adaptation to dynamic environments. We develop and maintain a library that contains an infinite mixture of parameterized environment models, which is equivalent to clustering environment parameters in a latent space. The prior distribution over the mixture is formulated as a Chinese restaurant process (CRP), which incrementally instantiates new environment models without any external information to signal environmental changes in advance. During lifelong learning, we employ the expectation maximization (EM) algorithm with online Bayesian inference to update the mixture in a fully incremental manner. In EM, the E-step involves estimating the posterior expectation of environment-to-cluster assignments, while the M-step updates the environment parameters for future learning. This method allows for all environment models to be adapted as necessary, with new models instantiated for environmental changes and old models retrieved when previously seen environments are encountered again. Experiments demonstrate that LLIRL outperforms relevant existing methods, and enables effective incremental adaptation to various dynamic environments for lifelong learning.
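To make the abstract's CRP-based incremental clustering concrete, below is a minimal, illustrative Python sketch of the E-step/M-step cycle over a growing library of environment models. The 1-D Gaussian "environment models", the concentration value `ALPHA`, the broad-variance prior-predictive stand-in `PRIOR_VAR`, and helper names such as `e_step` and `m_step` are simplifying assumptions for illustration, not the paper's actual models or API.

```python
import numpy as np

# Illustrative sketch only: the paper's environment models are parameterized
# RL models updated by online Bayesian inference, not the 1-D Gaussians used
# here for brevity. ALPHA and PRIOR_VAR are made-up values.
ALPHA = 1.0        # CRP concentration: propensity to open a new cluster
PRIOR_VAR = 25.0   # broad variance standing in for a prior predictive


def gaussian_log_lik(data, mean, var):
    """Log-likelihood of i.i.d. 1-D data under a Gaussian model."""
    n = data.size
    return -0.5 * (np.sum((data - mean) ** 2) / var + n * np.log(2 * np.pi * var))


class EnvModel:
    """One cluster in the library: a toy stand-in for an environment model."""

    def __init__(self, mean):
        self.mean = mean   # 'environment parameters' (placeholder)
        self.count = 1     # number of environments assigned to this cluster

    def log_likelihood(self, data):
        return gaussian_log_lik(data, self.mean, 1.0)


def e_step(library, data):
    """Posterior responsibilities over existing clusters plus a new one (CRP prior)."""
    prior = np.array([m.count for m in library] + [ALPHA], dtype=float)
    prior /= prior.sum()
    log_lik = np.array([m.log_likelihood(data) for m in library]
                       + [gaussian_log_lik(data, 0.0, PRIOR_VAR)])
    log_post = np.log(prior) + log_lik
    post = np.exp(log_post - log_post.max())
    return post / post.sum()


def m_step(library, data, post, lr=0.5):
    """Instantiate a new model or adapt the best-matching existing one."""
    if np.argmax(post) == len(library):          # the CRP's 'new table' wins
        library.append(EnvModel(mean=np.mean(data)))
    else:
        k = int(np.argmax(post))
        library[k].mean += lr * post[k] * (np.mean(data) - library[k].mean)
        library[k].count += 1


# Usage: a stream of environments arriving over the agent's lifetime.
library = [EnvModel(mean=0.0)]
for data in [np.random.normal(0.0, 1.0, 20), np.random.normal(5.0, 1.0, 20)]:
    post = e_step(library, data)
    m_step(library, data, post)
print(f"library size after two environments: {len(library)}")
```

In this sketch the last entry of the responsibility vector plays the role of the CRP's "new table": when it dominates, a new model is instantiated, mirroring how the library grows when an unseen environment appears, while a dominant existing entry leads to that model being retrieved and adapted.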