Paper Title
Model-based Reinforcement Learning for Continuous Control with Posterior Sampling
Paper Authors
Paper Abstract
Balancing exploration and exploitation is crucial in reinforcement learning (RL). In this paper, we study model-based posterior sampling for reinforcement learning (PSRL) in continuous state-action spaces theoretically and empirically. First, we show, to the best of our knowledge, the first regret bound of PSRL in continuous spaces that is polynomial in the episode length. Assuming that the reward and transition functions can be modeled by Bayesian linear regression, we develop a regret bound of $\tilde{O}(H^{3/2}d\sqrt{T})$, where $H$ is the episode length, $d$ is the dimension of the state-action space, and $T$ is the total number of time steps. This result matches the best-known regret bound of non-PSRL methods in linear MDPs. Our bound can also be extended to nonlinear cases via feature embedding: using linear kernels on the feature representation $\phi$, the regret bound becomes $\tilde{O}(H^{3/2}d_{\phi}\sqrt{T})$, where $d_{\phi}$ is the dimension of the representation space. Moreover, we present MPC-PSRL, a model-based posterior sampling algorithm with model predictive control for action selection. To capture the uncertainty in models, we use Bayesian linear regression on the penultimate layer (the feature representation layer $\phi$) of neural networks. Empirical results show that our algorithm achieves state-of-the-art sample efficiency on benchmark continuous control tasks compared to prior model-based algorithms, and matches the asymptotic performance of model-free algorithms.
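As a rough illustration of the mechanism the abstract describes (posterior sampling via Bayesian linear regression on a feature representation $\phi$), the sketch below fits a Gaussian posterior over linear model weights and draws one sample from it, as PSRL would at the start of an episode. The feature matrix, Gaussian prior, and noise variance are placeholder assumptions for illustration, not the paper's implementation; in MPC-PSRL the features would come from the penultimate layer of a trained neural network and the sampled model would be handed to an MPC planner.

```python
# Minimal sketch (not the authors' code): Bayesian linear regression on fixed
# features phi, with a Gaussian prior N(0, I / prior_precision) on the weights
# and known noise variance sigma2. All inputs below are synthetic stand-ins.
import numpy as np

def blr_posterior(Phi, y, sigma2=1.0, prior_precision=1.0):
    """Gaussian posterior over linear weights given features Phi (N x d) and targets y (N,)."""
    d = Phi.shape[1]
    precision = prior_precision * np.eye(d) + Phi.T @ Phi / sigma2  # posterior precision matrix
    cov = np.linalg.inv(precision)                                  # posterior covariance
    mean = cov @ (Phi.T @ y) / sigma2                               # posterior mean
    return mean, cov

def sample_model(mean, cov, rng):
    """PSRL step: draw one weight vector from the posterior (one sampled model hypothesis)."""
    return rng.multivariate_normal(mean, cov)

# Hypothetical usage: Phi stands in for phi(s, a), the penultimate-layer features
# of observed transitions; y stands in for the regression targets (e.g. rewards).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 8))
y = Phi @ rng.normal(size=8) + 0.1 * rng.normal(size=100)
mean, cov = blr_posterior(Phi, y)
w_sampled = sample_model(mean, cov, rng)  # sampled model used for planning in the next episode
```

A sampled weight vector of this kind is typically held fixed for an entire episode rather than resampled at every step, which is what gives posterior sampling its exploration behavior.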