Paper Title

Provably Efficient Model-Free Constrained RL with Linear Function Approximation

Paper Authors

Arnob Ghosh, Xingyu Zhou, Ness Shroff

Paper Abstract

We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a 'simulator', we aim to develop the first model-free, simulator-free algorithm that achieves sublinear regret and sublinear constraint violation even in large-scale systems. To this end, we consider episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as linear functions of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithm. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve zero constraint violation while still maintaining the same order of regret with respect to $T$.
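The abstract highlights two algorithmic ideas: combining reward and utility Q-estimates through a dual (Lagrangian) weight inside an LSVI-UCB-style least-squares update, and acting with a soft-max policy over the composite Q-values instead of the usual greedy rule. The sketch below is only an illustration of how those pieces could fit together on a toy problem, not the paper's algorithm; the feature map phi, the toy environment step, and all hyperparameters (lam, beta, eta, alpha, budget b) are hypothetical choices made for the example.

```python
# Minimal illustrative sketch (not the authors' exact method): LSVI-UCB-style
# least-squares backups for reward and utility, a dual variable Y weighting the
# utility Q-values, and a soft-max policy over the composite Q in place of greedy.
import numpy as np

rng = np.random.default_rng(0)

d, H, A, K = 4, 5, 3, 200                    # feature dim, horizon, actions, episodes (toy sizes)
lam, beta, eta, alpha = 1.0, 1.0, 0.1, 5.0   # ridge, bonus scale, dual step, soft-max temperature
b = 0.5 * H                                  # hypothetical utility budget: want total utility >= b

def phi(s, a):
    """Hypothetical feature map for a toy scalar state s in [0, 1]."""
    v = np.array([1.0, s, np.cos(np.pi * s * (a + 1)), float(a) / A])
    return v / np.linalg.norm(v)

def step(s, a):
    """Toy dynamics with a reward signal and a separate utility signal."""
    s_next = np.clip(s + 0.1 * (a - 1) + 0.05 * rng.standard_normal(), 0.0, 1.0)
    r = s_next                      # reward favours large states
    g = 1.0 - abs(a - 1) * 0.5      # utility favours the "safe" middle action
    return s_next, r, g

def softmax_policy(q_row):
    z = alpha * (q_row - q_row.max())
    p = np.exp(z)
    return p / p.sum()

Y = 0.0                              # dual variable for the utility constraint
data = [[] for _ in range(H)]        # per-step transitions: (s, a, r, g, s_next)

for k in range(K):
    # --- Least-squares value iteration, backward in h, for reward and utility weights ---
    w_r, w_g = np.zeros((H, d)), np.zeros((H, d))
    Lam_inv = [np.eye(d) / lam] * H
    for h in reversed(range(H)):
        if data[h]:
            Phi = np.array([phi(s, a) for (s, a, _, _, _) in data[h]])
            Lam_inv[h] = np.linalg.inv(lam * np.eye(d) + Phi.T @ Phi)
            tr, tg = [], []
            for (s, a, r, g, s_next) in data[h]:
                if h + 1 < H:
                    q_r = np.array([w_r[h + 1] @ phi(s_next, a2) for a2 in range(A)])
                    q_g = np.array([w_g[h + 1] @ phi(s_next, a2) for a2 in range(A)])
                    p = softmax_policy(q_r + Y * q_g)        # soft-max w.r.t. composite Q
                    v_r, v_g = min(p @ q_r, H), min(p @ q_g, H)
                else:
                    v_r = v_g = 0.0
                tr.append(r + v_r)
                tg.append(g + v_g)
            w_r[h] = Lam_inv[h] @ Phi.T @ np.array(tr)
            w_g[h] = Lam_inv[h] @ Phi.T @ np.array(tg)

    # --- Roll out one episode with the soft-max policy on optimistic Q_r + Y * Q_g ---
    s, ep_utility = 0.5, 0.0
    for h in range(H):
        feats = np.array([phi(s, a) for a in range(A)])
        bonus = beta * np.sqrt(np.einsum('ad,dc,ac->a', feats, Lam_inv[h], feats))
        q_r = feats @ w_r[h] + bonus
        q_g = feats @ w_g[h] + bonus
        a = rng.choice(A, p=softmax_policy(q_r + Y * q_g))
        s_next, r, g = step(s, a)
        data[h].append((s, a, r, g, s_next))
        ep_utility += g
        s = s_next

    # --- Dual update: increase Y when the episode's utility falls short of the budget ---
    Y = max(0.0, Y + eta * (b - ep_utility))

print(f"final dual variable Y = {Y:.3f}")
```

In this sketch the soft-max temperature alpha plays the role of the approximation-smoothness trade-off mentioned in the abstract: larger alpha brings the policy closer to the greedy choice, while smaller alpha yields a smoother policy, which is what makes uniform-concentration arguments tractable in the constrained setting.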
