Paper Title


Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity

Paper Authors

Laixi Shi, Yuejie Chi

Paper Abstract


This paper concerns the central issues of model robustness and sample efficiency in offline reinforcement learning (RL), which aims to learn to perform decision making from history data without active exploration. Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy -- with as few samples as possible -- that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset. We consider a distributionally robust formulation of offline RL, focusing on tabular robust Markov decision processes (RMDPs) with an uncertainty set specified by the Kullback-Leibler (KL) divergence, in both finite-horizon and infinite-horizon settings. To combat sample scarcity, we propose a model-based algorithm that combines distributionally robust value iteration with the principle of pessimism in the face of uncertainty, penalizing the robust value estimates with a carefully designed data-driven penalty term. Under a mild and tailored assumption on the history dataset that measures distribution shift without requiring full coverage of the state-action space, we establish the finite-sample complexity of the proposed algorithms. We further develop an information-theoretic lower bound, which suggests that learning RMDPs is at least as hard as learning standard MDPs when the uncertainty level is sufficiently small, and corroborates the tightness of our upper bound up to polynomial factors of the (effective) horizon length for a range of uncertainty levels. To the best of our knowledge, this provides the first provably near-optimal robust offline RL algorithm that learns under model uncertainty and partial coverage.
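The key algorithmic ingredient described above is a distributionally robust value iteration step over a KL-divergence uncertainty set, with a data-driven pessimism penalty subtracted from the robust value estimate. The sketch below illustrates this idea using the standard dual form of the KL-constrained worst-case expectation, inf over the KL ball of radius sigma = sup_{lam>0} { -lam * log E_nominal[exp(-V/lam)] - lam * sigma }. It is a minimal illustration under assumed inputs: the function names (kl_robust_expectation, pessimistic_robust_q_update) and the specific form of the penalty array are hypothetical and not the paper's exact construction.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def kl_robust_expectation(nominal_probs, values, sigma):
    """Worst-case expectation of `values` over all distributions within
    KL divergence `sigma` of `nominal_probs`, computed via the dual form
    sup_{lam > 0} { -lam * log E_nominal[exp(-V/lam)] - lam * sigma }."""
    def neg_dual(lam):
        # Numerically stable log of E_nominal[exp(-values / lam)].
        z = -values / lam
        zmax = z.max()
        log_mgf = zmax + np.log(np.dot(nominal_probs, np.exp(z - zmax)))
        return -(-lam * log_mgf - lam * sigma)  # negate for minimization

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e6), method="bounded")
    return -res.fun


def pessimistic_robust_q_update(P_hat, r, V, sigma, penalty, gamma=0.95):
    """One penalized robust Bellman update (illustrative sketch):
    Q(s, a) = max{ r(s, a) + gamma * inf_{P in KL ball} <P, V> - b(s, a), 0 },
    where P_hat is the empirical transition model of shape (S, A, S),
    r and penalty have shape (S, A), and V has shape (S,)."""
    S, A, _ = P_hat.shape
    Q = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            robust_v = kl_robust_expectation(P_hat[s, a], V, sigma)
            Q[s, a] = max(r[s, a] + gamma * robust_v - penalty[s, a], 0.0)
    return Q
```

In the infinite-horizon setting, one would iterate this update to (approximate) convergence and return the greedy policy with respect to the resulting Q; in the finite-horizon setting, the analogous update is applied backward over the horizon. The penalty term here is a placeholder for the paper's data-driven bonus, which shrinks with the number of visits to each state-action pair.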
