Paper Title
Offline Reinforcement Learning with Realizability and Single-policy Concentrability
Paper Authors
Paper Abstract
Sample-efficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellman-completeness) and the data coverage (e.g., all-policy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sample-efficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primal-dual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a density-ratio function against offline data. With proper regularization, we show that the algorithm enjoys polynomial sample complexity, under only realizability and single-policy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primal-dual algorithms for offline RL.
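For intuition, here is a minimal sketch of the kind of regularized primal-dual objective the abstract describes; the notation below is our reconstruction, not taken from the paper. Writing \(d^D\) for the offline data distribution, \(\mu_0\) for the initial-state distribution, \(\gamma\) for the discount factor, \(w(s,a)\) for the density-ratio dual variable (modeling the discounted occupancy relative to \(d^D\)), \(v\) for the primal value function, and \(f\), \(\alpha > 0\) for an assumed convex regularizer and its weight, the Lagrangian of the occupancy-measure LP, estimated from offline transitions \((s,a,r,s') \sim d^D\), takes the form
\[
\max_{w \ge 0} \; \min_{v} \;\; (1-\gamma)\,\mathbb{E}_{s \sim \mu_0}[v(s)]
\;+\; \mathbb{E}_{(s,a,r,s') \sim d^D}\big[\, w(s,a)\,\big(r + \gamma\, v(s') - v(s)\big) \,\big]
\;-\; \alpha\, \mathbb{E}_{(s,a) \sim d^D}\big[ f\big(w(s,a)\big) \big].
\]
In this reading, realizability only asks that the chosen function classes contain the relevant \(w\) and \(v\) targets, and single-policy concentrability only asks that the density ratio of a single comparator policy against \(d^D\) be bounded, rather than requiring a uniform bound over all policies.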