Paper Title

Recursive Reinforcement Learning

Authors

Ernst Moritz Hahn, Mateo Perez, Sven Schewe, Fabio Somenzi, Ashutosh Trivedi, Dominik Wojtczak

Abstract

Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with call-stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning -- a model-free RL algorithm for RMDPs -- and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
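To make the setting concrete, below is a minimal, illustrative Python sketch of how an RMDP might be represented and simulated, with an explicit call stack playing the role of the pushdown stack mentioned in the abstract. The names (Component, Box, run_episode) and the toy single-component example are assumptions introduced here for illustration only; they are not the paper's formalization, and the sketch does not implement Recursive Q-learning itself.

import random
from dataclasses import dataclass

# Illustrative sketch only: a simplified RMDP representation and simulator,
# not the paper's construction or its Recursive Q-learning algorithm.

@dataclass
class Box:
    """A call site inside a component that invokes another component."""
    callee: str          # name of the component being invoked
    return_node: str     # node where control resumes once the callee exits

@dataclass
class Component:
    """One constituent MDP: nodes with actions, plus boxes calling other components."""
    entry: str
    exit: str
    # transitions[node][action] = list of (probability, successor) pairs;
    # a successor is either an ordinary node or a Box (a recursive invocation)
    transitions: dict

def run_episode(components, main, rng=random.Random(0), max_steps=200):
    """Simulate the RMDP; the explicit stack acts as the pushdown (call) stack."""
    stack = []                        # frames of (component_name, return_node)
    comp, node = main, components[main].entry
    for _ in range(max_steps):
        c = components[comp]
        if node == c.exit:            # reached the exit of the current component
            if not stack:
                return "terminated"
            comp, node = stack.pop()  # pop a frame and resume the caller
            continue
        action = rng.choice(list(c.transitions[node]))   # placeholder uniform policy
        outcomes = c.transitions[node][action]
        probs = [p for p, _ in outcomes]
        succs = [s for _, s in outcomes]
        succ = rng.choices(succs, weights=probs, k=1)[0]
        if isinstance(succ, Box):     # recursive invocation: push a frame
            stack.append((comp, succ.return_node))
            comp, node = succ.callee, components[succ.callee].entry
        else:
            node = succ
    return "step limit reached"

# Toy example: component "A" either exits immediately or recursively calls itself.
components = {
    "A": Component(
        entry="in", exit="out",
        transitions={
            "in": {
                "recurse": [(0.5, Box(callee="A", return_node="out")),
                            (0.5, "out")],
                "stop":    [(1.0, "out")],
            },
        },
    ),
}
print(run_episode(components, "A"))

In this toy example an episode terminates with probability 1, since each recursive call either bottoms out or exits. Recursive Q-learning, as introduced in the paper, learns action values over exactly this kind of stack-structured dynamics instead of requiring a hand-crafted "flat" encoding of the stack.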
