Paper Title

Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity

Paper Authors

Laixi Shi, Gen Li, Yuting Wei, Yuxin Chen, Yuejie Chi

Paper Abstract

Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts -- which do not require explicit model estimation -- have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption which does not require the full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.
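For intuition, the sketch below shows how a pessimism penalty (a lower-confidence-bound term subtracted from the temporal-difference target) can be folded into a tabular Q-learning update run over an offline dataset of finite-horizon episodes. This is a minimal illustration only, not the paper's algorithm: the function name lcb_q_learning, the learning-rate schedule, and the penalty constant are placeholder assumptions, and the variance-reduced variant mentioned in the abstract is not shown.

```python
import numpy as np

def lcb_q_learning(dataset, S, A, H, c_b=1.0, delta=0.01):
    """Illustrative pessimism-penalized (LCB-style) Q-learning pass over an
    offline dataset. `dataset` is a list of episodes, each a list of
    (h, s, a, r, s_next) tuples with 0-indexed step h. The penalty constant
    c_b and the learning-rate schedule are placeholders, not tuned values."""
    Q = np.zeros((H, S, A))             # pessimistic Q estimates per step
    V = np.zeros((H + 1, S))            # V[H] = 0 at the terminal step
    N = np.zeros((H, S, A), dtype=int)  # visitation counts of (h, s, a)

    for episode in dataset:
        for (h, s, a, r, s_next) in episode:
            N[h, s, a] += 1
            n = N[h, s, a]
            eta = (H + 1) / (H + n)     # rescaled linear learning rate (assumed)
            # lower-confidence-bound penalty: shrinks with more visits
            b = c_b * np.sqrt(H**3 * np.log(S * A * H / delta) / n)
            target = r + V[h + 1, s_next] - b   # pessimistic TD target
            Q[h, s, a] = (1 - eta) * Q[h, s, a] + eta * target
            # keep the value estimate monotone and non-negative
            V[h, s] = max(V[h, s], Q[h, s].max(), 0.0)

    policy = Q.argmax(axis=2)  # greedy policy w.r.t. the pessimistic Q-function
    return Q, V, policy
```

The key design idea reflected here is that subtracting the penalty b deflates the estimated value of state-action pairs that appear rarely in the dataset, so the learned policy avoids poorly covered regions; this is what allows guarantees under single-policy concentrability rather than full coverage of the state-action space.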
