Paper Title
First-order Policy Optimization for Robust Markov Decision Process
Paper Authors
Paper Abstract
We consider the problem of solving a robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitate the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity for finding an $\epsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, for the setting where first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
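To give a concrete picture of the RPMD iteration described in the abstract, the following is a minimal sketch on a toy robust MDP, not the authors' implementation. The 3-state/2-action problem data, the use of a finite family of candidate transition kernels as the $(\mathbf{s},\mathbf{a})$-rectangular uncertainty set, the KL divergence as the Bregman distance, and the linearly increasing stepsize schedule `eta0 * (k + 1)` are all illustrative assumptions.

```python
# Hypothetical sketch of robust policy mirror descent (RPMD) on a toy robust MDP.
# The problem data, the finite candidate-kernel uncertainty sets, the KL Bregman
# divergence, and the stepsize schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_kernels = 3, 2, 4   # tiny problem for illustration
gamma = 0.9                                # discount factor

# Cost c(s, a) and an (s, a)-rectangular uncertainty set, modeled here as a
# finite family of candidate transition distributions per state-action pair.
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
kernels = rng.uniform(0.1, 1.0, size=(n_states, n_actions, n_kernels, n_states))
kernels /= kernels.sum(axis=-1, keepdims=True)   # normalize onto the simplex


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)


def robust_policy_evaluation(pi, tol=1e-10, max_iter=10_000):
    """Worst-case value of policy pi: fixed point of the robust Bellman operator.

    For each (s, a), the adversary picks the candidate transition distribution
    that maximizes the discounted cost-to-go.
    """
    V = np.zeros(n_states)
    for _ in range(max_iter):
        worst_next = (kernels @ V).max(axis=-1)      # shape (S, A)
        Q = cost + gamma * worst_next                # robust Q^pi
        V_new = (pi * Q).sum(axis=-1)                # robust V^pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
    return V, Q


def rpmd(num_iters=50, eta0=1.0):
    """Robust policy mirror descent with KL divergence and increasing stepsizes."""
    log_w = np.zeros((n_states, n_actions))          # uniform initial policy
    pi = softmax(log_w)
    for k in range(num_iters):
        _, Q = robust_policy_evaluation(pi)
        eta_k = eta0 * (k + 1)                       # linearly increasing stepsize
        # KL-proximal step: pi_{k+1}(.|s) is proportional to
        # pi_k(.|s) * exp(-eta_k * Q(s, .)), i.e. a per-state exponentiated update.
        log_w -= eta_k * Q
        pi = softmax(log_w)
    return pi


if __name__ == "__main__":
    pi_hat = rpmd()
    V_hat, _ = robust_policy_evaluation(pi_hat)
    print("robust policy:\n", np.round(pi_hat, 3))
    print("worst-case values:", np.round(V_hat, 3))
```

Running the script prints the returned policy and its worst-case values. Note that with a single candidate kernel per state-action pair the uncertainty set is a singleton and the same loop reduces to ordinary (non-robust) policy mirror descent, mirroring the abstract's remark that standard MDP planning is recovered as a special case.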