Paper title
Value Iteration is Optic Composition
Paper authors
Paper abstract
Dynamic programming is a class of algorithms used to compute optimal control policies for Markov decision processes. Dynamic programming is ubiquitous in control theory, and is also the foundation of reinforcement learning. In this paper, we show that value improvement, one of the main steps of dynamic programming, can be naturally seen as composition in a category of optics, and intuitively, the optimal value function is the limit of a chain of optic compositions. We illustrate this with three classic examples: the gridworld, the inverted pendulum and the savings problem. This is a first step towards a complete account of reinforcement learning in terms of parametrised optics.
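To make the notion of value improvement concrete, here is a minimal sketch of plain value iteration on a tiny one-dimensional gridworld. It is not the paper's optic construction: the five-state layout, the reward of 1 for entering the goal, the discount factor 0.9, and all function names are assumptions chosen for illustration. Each call to `value_improvement` applies one Bellman optimality backup, and `value_iteration` repeats that step until the values stop changing, which mirrors the idea of the optimal value function as the limit of a chain of repeated steps.

```python
GAMMA = 0.9           # assumed discount factor
N_STATES = 5          # assumed layout: states 0..4 on a line, state 4 is the goal
ACTIONS = (-1, +1)    # move left or move right


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    if state == N_STATES - 1:
        # The goal is absorbing and yields no further reward.
        return state, 0.0
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def value_improvement(values):
    """Apply one Bellman optimality backup to every state."""
    improved = []
    for s in range(N_STATES):
        outcomes = (step(s, a) for a in ACTIONS)
        improved.append(max(r + GAMMA * values[n] for n, r in outcomes))
    return improved


def value_iteration(tol=1e-9, max_iters=1000):
    """Repeat value improvement until successive value functions agree."""
    values = [0.0] * N_STATES
    for _ in range(max_iters):
        improved = value_improvement(values)
        if max(abs(a - b) for a, b in zip(improved, values)) < tol:
            return improved
        values = improved
    return values


optimal_values = value_iteration()
```

In this toy setting the result is a fixed point of `value_improvement`: applying the step once more leaves the values unchanged up to the tolerance.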