Paper Title

Finite-Time Analysis of Natural Actor-Critic for POMDPs

Paper Authors

Semih Cayci, Niao He, R. Srikant

Paper Abstract

We consider the reinforcement learning problem for partially observed Markov decision processes (POMDPs) with large or even countably infinite state spaces, where the controller has access to only noisy observations of the underlying controlled Markov chain. We consider a natural actor-critic method that employs a finite internal memory for policy parameterization, and a multi-step temporal difference learning algorithm for policy evaluation. We establish, to the best of our knowledge, the first non-asymptotic global convergence of actor-critic methods for partially observed systems under function approximation. In particular, in addition to the function approximation and statistical errors that also arise in MDPs, we explicitly characterize the error due to the use of finite-state controllers. This additional error is stated in terms of the total variation distance between the traditional belief state in POMDPs and the posterior distribution of the hidden state when using a finite-state controller. Further, we show that this error can be made small in the case of sliding-block controllers by using larger block sizes.
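
The abstract describes a natural actor-critic scheme in which the policy is parameterized by a finite internal memory (e.g., a sliding block of recent observations) and evaluated by multi-step temporal difference learning. The following minimal sketch illustrates that structure on a toy two-state POMDP. Everything concrete here is an assumption for illustration: the dynamics, rewards, observation noise, step sizes, and the tabular softmax parameterization are our own, not the paper's setting; the natural-gradient actor step shown (adding a scaled Q-estimate to the softmax logits) is the standard simplification for tabular softmax policies, not the paper's exact update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-state POMDP (illustrative assumption, not the paper's setting):
# P[a, s, s'] are transition kernels, R[a, s] rewards, and the controller
# sees the hidden state through a channel that flips the bit w.p. 0.15.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])
GAMMA, BLOCK, N_OBS, N_ACT, N_STEP = 0.95, 3, 2, 2, 4

def observe(s):
    return s if rng.random() > 0.15 else 1 - s

def memory_index(window):
    # The sliding block of the last BLOCK observations is the finite
    # internal memory; encode it as a single integer index.
    return sum(o * (N_OBS ** i) for i, o in enumerate(window))

n_mem = N_OBS ** BLOCK
theta = np.zeros((n_mem, N_ACT))   # actor: softmax logits over memory states
w = np.zeros((n_mem, N_ACT))       # critic: Q-estimates over memory states

def policy(m):
    z = np.exp(theta[m] - theta[m].max())
    return z / z.sum()

def critic_step(buf, m_next, alpha=0.05):
    # Multi-step TD: an N_STEP-step return, bootstrapped from the critic's
    # value at the memory state reached after the block of transitions.
    m0, a0, _ = buf[0]
    G = sum(GAMMA ** i * r for i, (_, _, r) in enumerate(buf))
    G += GAMMA ** len(buf) * policy(m_next) @ w[m_next]
    w[m0, a0] += alpha * (G - w[m0, a0])

for k in range(50):                # outer natural actor-critic iterations
    s = rng.integers(2)
    window = [observe(s)] * BLOCK
    buf = []
    for t in range(5000):          # critic phase: evaluate current policy
        m = memory_index(window)
        a = rng.choice(N_ACT, p=policy(m))
        r = R[a, s]
        s = rng.choice(2, p=P[a, s])
        window = window[1:] + [observe(s)]
        buf.append((m, a, r))
        if len(buf) == N_STEP:
            critic_step(buf, memory_index(window))
            buf.pop(0)
    # Actor phase: for a tabular softmax parameterization, the natural
    # policy gradient step reduces to adding a scaled Q-estimate to the logits.
    theta += 0.5 * w

print("greedy action per memory state:", w.argmax(axis=1))
```

The sketch makes the abstract's error decomposition concrete: the controller conditions only on the length-BLOCK observation window rather than the full belief state, so increasing BLOCK shrinks the finite-memory error at the cost of a larger memory-state space, which is exactly the trade-off the sliding-block result quantifies.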
