Paper Title
Hamilton-Jacobi-Bellman Equations for Maximum Entropy Optimal Control
Paper Authors
Paper Abstract
Maximum entropy reinforcement learning (RL) methods have been successfully applied to a range of challenging sequential decision-making and control tasks. However, most existing techniques are designed for discrete-time systems. As a first step toward their extension to continuous-time systems, this paper considers continuous-time deterministic optimal control problems with entropy regularization. Applying the dynamic programming principle, we derive a novel class of Hamilton-Jacobi-Bellman (HJB) equations and prove that the optimal value function of the maximum entropy control problem corresponds to the unique viscosity solution of the HJB equation. Our maximum entropy formulation is shown to enhance the regularity of the viscosity solution and to be asymptotically consistent as the effect of entropy regularization diminishes. A salient feature of the HJB equations is computational tractability. Generalized Hopf-Lax formulas can be used to solve the HJB equations in a tractable, grid-free manner without the need to numerically optimize the Hamiltonian. We further show that the optimal control is uniquely characterized as Gaussian in the case of control-affine systems and that, for linear-quadratic problems, the HJB equation reduces to a Riccati equation, which can be used to obtain an explicit expression for the optimal control. Lastly, we discuss how to extend our results to continuous-time model-free RL by taking an adaptive dynamic programming approach. To our knowledge, the resulting algorithms are the first data-driven control methods that use an information-theoretic exploration mechanism in continuous time.
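The linear-quadratic claim in the abstract can be illustrated with a minimal scalar sketch. It is not taken from the paper; it rests on the following assumptions: dynamics x' = a x + b u, running cost (q x^2 + r u^2)/2 plus a temperature alpha times the negative entropy of the control density, terminal cost q_f x^2 / 2, and an unconstrained control set. Under these assumptions the entropy term only shifts the value function by a state-independent amount, so the quadratic coefficient p(t) obeys the usual Riccati equation, and the optimal control density is Gaussian with mean -(b p / r) x and variance alpha / r. All function and parameter names below are hypothetical.

```python
import math


def riccati_rhs(p, a, b, q, r):
    """Scalar Riccati right-hand side in reversed time s = T - t."""
    return 2.0 * a * p + q - (b * p) ** 2 / r


def solve_riccati(a, b, q, r, q_f, horizon, steps=1000):
    """Integrate dp/ds = 2 a p + q - b^2 p^2 / r from p(s=0) = q_f.

    Uses classical fourth-order Runge-Kutta and returns p at s = horizon,
    i.e. the value-function coefficient at the initial time t = 0.
    """
    h = horizon / steps
    p = q_f
    for _ in range(steps):
        k1 = riccati_rhs(p, a, b, q, r)
        k2 = riccati_rhs(p + 0.5 * h * k1, a, b, q, r)
        k3 = riccati_rhs(p + 0.5 * h * k2, a, b, q, r)
        k4 = riccati_rhs(p + h * k3, a, b, q, r)
        p += (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return p


def gaussian_policy(p, x, b, r, alpha):
    """Mean and variance of the assumed Gaussian control density at state x."""
    mean = -(b * p / r) * x   # same feedback gain as standard LQR
    variance = alpha / r      # exploration noise scales with the temperature
    return mean, variance


if __name__ == "__main__":
    # With a = 0, b = q = r = 1 and q_f = 0 the exact solution is p = tanh(T - t).
    p0 = solve_riccati(a=0.0, b=1.0, q=1.0, r=1.0, q_f=0.0, horizon=2.0)
    mean, variance = gaussian_policy(p0, x=1.5, b=1.0, r=1.0, alpha=0.1)
    print(p0, math.tanh(2.0), mean, variance)
```

In this sketch, letting alpha tend to zero collapses the Gaussian onto its mean, which is the deterministic LQR feedback; this is one way to read the asymptotic consistency mentioned in the abstract, though the paper's precise statement may differ.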