Paper Title
First-order Policy Optimization for Robust Markov Decision Process
Paper Authors
Paper Abstract
We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity for finding an $\epsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
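To give a concrete picture of the RPMD iteration described in the abstract, the following is a minimal sketch on a toy robust MDP, not the authors' implementation. The 3-state/2-action problem data, the use of a finite family of candidate transition kernels as the $(\mathbf{s},\mathbf{a})$-rectangular uncertainty set, the KL divergence as the Bregman distance, and the linearly increasing stepsize schedule `eta0 * (k + 1)` are all illustrative assumptions.

```python
# Hypothetical sketch of robust policy mirror descent (RPMD) on a toy robust MDP.
# The problem data, the finite candidate-kernel uncertainty sets, the KL Bregman
# divergence, and the stepsize schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_kernels = 3, 2, 4   # tiny problem for illustration
gamma = 0.9                                # discount factor

# Cost c(s, a) and an (s, a)-rectangular uncertainty set, modeled here as a
# finite family of candidate transition distributions per state-action pair.
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
kernels = rng.uniform(0.1, 1.0, size=(n_states, n_actions, n_kernels, n_states))
kernels /= kernels.sum(axis=-1, keepdims=True)   # normalize onto the simplex


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


def robust_policy_evaluation(pi, tol=1e-10, max_iter=10_000):
    """Worst-case value of policy pi: fixed point of the robust Bellman operator.

    For each (s, a), the adversary picks the candidate transition distribution
    that maximizes the discounted cost-to-go.
    """
    V = np.zeros(n_states)
    for _ in range(max_iter):
        worst_next = (kernels @ V).max(axis=-1)      # shape (S, A)
        Q = cost + gamma * worst_next                # robust Q^pi
        V_new = (pi * Q).sum(axis=-1)                # robust V^pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
    return V, Q


def rpmd(num_iters=50, eta0=1.0):
    """Robust policy mirror descent with KL divergence and increasing stepsizes."""
    log_w = np.zeros((n_states, n_actions))          # uniform initial policy
    pi = softmax(log_w)
    for k in range(num_iters):
        _, Q = robust_policy_evaluation(pi)
        eta_k = eta0 * (k + 1)                       # linearly increasing stepsize
        # KL-proximal step: pi_{k+1}(.|s) is proportional to
        # pi_k(.|s) * exp(-eta_k * Q(s, .)), i.e. a per-state exponentiated update.
        log_w -= eta_k * Q
        pi = softmax(log_w)
    return pi


if __name__ == "__main__":
    pi_hat = rpmd()
    V_hat, _ = robust_policy_evaluation(pi_hat)
    print("robust policy:\n", np.round(pi_hat, 3))
    print("worst-case values:", np.round(V_hat, 3))
```

Running the script prints the returned policy and its worst-case values. Note that with a single candidate kernel per state-action pair the uncertainty set is a singleton and the same loop reduces to ordinary (non-robust) policy mirror descent, mirroring the abstract's remark that standard MDP planning is recovered as a special case.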