Paper Title

Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian

Authors

Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, Jiantao Jiao

Abstract

Offline reinforcement learning (RL), which refers to decision-making from a previously collected dataset of interactions, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as "enforcers of occupancy validity" rather than "promoters of conservatism."
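As background for the abstract, the sketch below is a minimal, generic illustration of the occupancy-based linear program behind MIS and of how an augmented Lagrangian penalizes its validity constraints; it is not the paper's exact objective, and the symbols $d$, $w$, $v$, $\mu$, $\rho$, and $\lambda$ (and the quadratic penalty form) are illustrative assumptions.

The dual linear program over state-action occupancies $d \ge 0$ maximizes return subject to the Bellman flow (occupancy validity) constraints:

$$\max_{d \ge 0} \; \sum_{s,a} d(s,a)\, r(s,a) \quad \text{s.t.} \quad \sum_a d(s,a) = (1-\gamma)\rho(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') \;\; \forall s.$$

MIS reparameterizes $d(s,a) = w(s,a)\,\mu(s,a)$, where $\mu$ is the offline data distribution and $w$ is the marginalized importance weight, so that expectations under $d$ can be estimated from offline samples. Writing the flow residual as

$$e(s) := (1-\gamma)\rho(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d(s',a') - \sum_a d(s,a),$$

a generic augmented Lagrangian with multipliers $v(s)$ and penalty parameter $\lambda > 0$ is

$$\mathcal{L}_\lambda(d, v) = \sum_{s,a} d(s,a)\, r(s,a) + \sum_s v(s)\, e(s) - \frac{\lambda}{2} \sum_s e(s)^2.$$

Relative to the plain Lagrangian, the quadratic term drives sample-based solutions toward near-satisfaction of the validity constraints, which is one way to read the abstract's description of regularizers as "enforcers of occupancy validity."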
