Paper Title
Finite-sample Analysis of Greedy-GQ with Linear Function Approximation under Markovian Noise
Paper Authors
Paper Abstract
Greedy-GQ is an off-policy two-timescale algorithm for optimal control in reinforcement learning. This paper develops the first finite-sample analysis of the Greedy-GQ algorithm with linear function approximation under Markovian noise. Our finite-sample analysis provides theoretical justification for choosing the stepsizes of this two-timescale algorithm to achieve faster convergence in practice, and suggests a trade-off between the convergence rate and the quality of the obtained policy. Our paper extends the finite-sample analysis of two-timescale reinforcement learning algorithms from policy evaluation to optimal control, which is of greater practical interest. Specifically, in contrast to existing finite-sample analyses of two-timescale methods, e.g., GTD, GTD2, and TDC, whose objective functions are convex, the objective function of the Greedy-GQ algorithm is non-convex. Moreover, the Greedy-GQ algorithm is not a linear two-timescale stochastic approximation algorithm. Our techniques in this paper provide a general framework for the finite-sample analysis of non-convex value-based reinforcement learning algorithms for optimal control.
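To make the algorithmic setting concrete, below is a minimal sketch of one Greedy-GQ update with linear features in its standard two-timescale form: a slow update for the value weights theta and a faster update for the auxiliary weights omega. The function name greedy_gq_step, the feature map phi, and the stepsize values are illustrative assumptions, not code or constants from the paper.

    import numpy as np

    def greedy_gq_step(theta, omega, phi, s, a, r, s_next, actions,
                       gamma=0.99, alpha=0.01, beta=0.05):
        """One Greedy-GQ update with linear features (illustrative sketch).

        theta : value weight vector (slow timescale, stepsize alpha)
        omega : auxiliary weight vector (fast timescale, stepsize beta)
        phi   : feature map, phi(state, action) -> np.ndarray
        """
        phi_sa = phi(s, a)
        # Greedy (target-policy) action at the next state under the current theta.
        a_star = max(actions, key=lambda b: theta @ phi(s_next, b))
        phi_next = phi(s_next, a_star)

        # TD error toward the greedy target.
        delta = r + gamma * (theta @ phi_next) - theta @ phi_sa

        # Slow-timescale update of the value weights, including the gradient-correction
        # term that distinguishes Greedy-GQ from plain Q-learning.
        theta = theta + alpha * (delta * phi_sa - gamma * (omega @ phi_sa) * phi_next)
        # Fast-timescale update of the auxiliary weights.
        omega = omega + beta * (delta - omega @ phi_sa) * phi_sa
        return theta, omega

    # Example usage on a toy problem with fixed random features (purely illustrative).
    rng = np.random.default_rng(0)
    n_states, actions, d = 5, [0, 1], 8
    features = rng.standard_normal((n_states, len(actions), d))
    phi = lambda s, a: features[s, a]
    theta, omega = np.zeros(d), np.zeros(d)
    theta, omega = greedy_gq_step(theta, omega, phi, s=0, a=1, r=1.0, s_next=2, actions=actions)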