Value Iteration is Optic Composition

Title

Value Iteration is Optic Composition

Authors

Jules Hedges, Riu Rodríguez Sakamoto

Abstract

Dynamic programming is a class of algorithms used to compute optimal control policies for Markov decision processes. Dynamic programming is ubiquitous in control theory, and is also the foundation of reinforcement learning. In this paper, we show that value improvement, one of the main steps of dynamic programming, can be naturally seen as composition in a category of optics, and intuitively, the optimal value function is the limit of a chain of optic compositions. We illustrate this with three classic examples: the gridworld, the inverted pendulum and the savings problem. This is a first step towards a complete account of reinforcement learning in terms of parametrised optics.
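
As a rough illustration of the abstract's central claim, the sketch below treats one step of value improvement as precomposing a value function with a lens, then iterates that step until the values stop changing. It is a simplification under stated assumptions, not the paper's construction: transitions are deterministic, plain lenses (a forward and a backward map) stand in for general optics, the maximisation over actions is done by hand rather than through parametrised optics, and the corridor MDP and every name in the code (`step`, `reward`, `bellman_lens`, `improve`, `value_iteration`) are hypothetical.

```python
from typing import Callable, List, Tuple

State = int
Action = int
Value = Callable[[State], float]

# Hypothetical toy MDP (an assumption for illustration, not taken from the paper):
# a corridor with states 0..3, where state 3 is an absorbing goal.
N_STATES = 4
GOAL = 3
ACTIONS = (-1, +1)
GAMMA = 0.9


def step(s: State, a: Action) -> State:
    """Deterministic transition: move left or right, clipped to the corridor."""
    if s == GOAL:
        return s
    return min(max(s + a, 0), N_STATES - 1)


def reward(s: State, a: Action) -> float:
    """Reward 1 for entering the goal, 0 otherwise."""
    return 1.0 if s != GOAL and step(s, a) == GOAL else 0.0


def bellman_lens(a: Action) -> Tuple[Callable[[State], State],
                                     Callable[[State, float], float]]:
    """Lens for a fixed action: the forward map sends a state to its successor,
    the backward map turns the successor's value into r + gamma * v."""
    def forward(s: State) -> State:
        return step(s, a)

    def backward(s: State, v: float) -> float:
        return reward(s, a) + GAMMA * v

    return forward, backward


def improve(value: Value) -> Value:
    """One value-improvement step: compose `value` after each action's lens,
    then take the best action."""
    lenses = [bellman_lens(a) for a in ACTIONS]

    def new_value(s: State) -> float:
        return max(bwd(s, value(fwd(s))) for fwd, bwd in lenses)

    return new_value


def value_iteration(n_steps: int) -> List[float]:
    """Apply `improve` n_steps times, starting from the zero value function.
    The result is tabulated after every step to avoid nested recomputation."""
    table = [0.0] * N_STATES
    for _ in range(n_steps):
        frozen = table
        improved = improve(lambda s, t=frozen: t[s])
        table = [improved(s) for s in range(N_STATES)]
    return table
```

In this toy problem the chain of compositions stabilises after three steps, so `value_iteration(3)` and `value_iteration(10)` return the same table. In general the optimal value function is only reached in the limit, as the abstract says.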
