Paper Title

A Relation Analysis of Markov Decision Process Frameworks

Authors

Tien Mai, Patrick Jaillet

Abstract

We study the relation between different Markov Decision Process (MDP) frameworks in the machine learning and econometrics literatures, including the standard MDP, the entropy-regularized and general regularized MDP, and the stochastic MDP, where the latter is based on the assumption that the reward function is stochastic and follows a given distribution. We show that the entropy-regularized MDP is equivalent to a stochastic MDP model, and is strictly subsumed by the general regularized MDP. Moreover, we propose a distributional stochastic MDP framework by assuming that the distribution of the reward function is ambiguous. We further show that the distributional stochastic MDP is equivalent to the regularized MDP, in the sense that they always yield the same optimal policies. We also provide a connection between stochastic/regularized MDP and constrained MDP. Our work gives a unified view on several important MDP frameworks, which would lead to new ways to interpret the (entropy/general) regularized MDP frameworks through the lens of stochastic rewards, and vice versa. Given the recent popularity of regularized MDP in (deep) reinforcement learning, our work brings new understanding of how such algorithmic schemes work and suggests ideas for developing new ones.
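As a rough illustration of the entropy/stochastic-reward correspondence mentioned in the abstract (a minimal sketch in generic notation, not taken from the paper; the symbols Q, V, the temperature τ, and the Gumbel noise ε are our own illustrative choices), the entropy-regularized Bellman backup over a finite action set admits the closed form

$$
V(s) \;=\; \max_{\pi(\cdot\mid s)} \sum_{a} \pi(a\mid s)\bigl[\,Q(s,a) - \tau \log \pi(a\mid s)\,\bigr] \;=\; \tau \log \sum_{a} \exp\!\bigl(Q(s,a)/\tau\bigr),
$$

and the same log-sum-exp value can also be read as the expected payoff of a greedy choice under randomly perturbed rewards,

$$
\tau \log \sum_{a} \exp\!\bigl(Q(s,a)/\tau\bigr) \;=\; \mathbb{E}_{\epsilon}\Bigl[\max_{a}\bigl(Q(s,a) + \epsilon_{a}\bigr)\Bigr],
$$

where the ε_a are i.i.d. zero-mean Gumbel variables with scale τ. This is the standard log-sum-exp/Gumbel identity; the paper's equivalence and subsumption results concern general regularizers and reward distributions beyond this special case.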
