Paper Title

Toward the Fundamental Limits of Imitation Learning

Paper Authors

Nived Rajaraman, Lin F. Yang, Jiantao Jiao, Kannan Ramchandran

Paper Abstract

Imitation learning (IL) aims to mimic the behavior of an expert policy in a sequential decision-making problem given only demonstrations. In this paper, we focus on understanding the minimax statistical limits of IL in episodic Markov Decision Processes (MDPs). We first consider the setting where the learner is provided a dataset of $N$ expert trajectories ahead of time, and cannot interact with the MDP. Here, we show that the policy which mimics the expert whenever possible is in expectation $\lesssim \frac{|\mathcal{S}| H^2 \log (N)}{N}$ suboptimal compared to the value of the expert, even when the expert follows an arbitrary stochastic policy. Here $\mathcal{S}$ is the state space, and $H$ is the length of the episode. Furthermore, we establish a suboptimality lower bound of $\gtrsim |\mathcal{S}| H^2 / N$ which applies even if the expert is constrained to be deterministic, or if the learner is allowed to actively query the expert at visited states while interacting with the MDP for $N$ episodes. To our knowledge, this is the first algorithm with suboptimality having no dependence on the number of actions, under no additional assumptions. We then propose a novel algorithm based on minimum-distance functionals in the setting where the transition model is given and the expert is deterministic. The algorithm is suboptimal by $\lesssim \min \{ H \sqrt{|\mathcal{S}| / N} ,\ |\mathcal{S}| H^{3/2} / N \}$, showing that knowledge of transition improves the minimax rate by at least a $\sqrt{H}$ factor.
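
As a concrete illustration of the "mimics the expert whenever possible" policy analyzed in the no-interaction setting above, the following minimal Python sketch (not from the paper's code; the `fit_mimic_policy` helper and the toy states/actions are hypothetical) builds a tabular, non-stationary policy from $N$ expert trajectories: at every (timestep, state) pair seen in the demonstrations it replays the most frequently demonstrated expert action, and at states never demonstrated it falls back to an arbitrary default action.

```python
# Minimal sketch, assuming a tabular episodic MDP and expert demonstrations
# given as length-H lists of (state, action) pairs indexed by timestep.
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable
Trajectory = Sequence[Tuple[State, Action]]  # one episode of (state, action) pairs


def fit_mimic_policy(trajectories: List[Trajectory], default_action: Action):
    """Estimate a non-stationary tabular policy from N expert trajectories,
    with no interaction with the MDP."""
    counts = defaultdict(Counter)  # (timestep, state) -> Counter of expert actions
    for traj in trajectories:
        for h, (state, action) in enumerate(traj):
            counts[(h, state)][action] += 1

    def policy(h: int, state: State) -> Action:
        key = (h, state)
        if key in counts:
            # Mimic the expert: most frequently demonstrated action here.
            return counts[key].most_common(1)[0][0]
        # State never demonstrated at this timestep: act arbitrarily.
        return default_action

    return policy


# Toy usage with hypothetical states/actions:
demos = [[("s0", "a1"), ("s1", "a0")], [("s0", "a1"), ("s2", "a1")]]
pi = fit_mimic_policy(demos, default_action="a0")
print(pi(0, "s0"), pi(1, "s_unseen"))  # -> a1 a0
```

Intuitively, a policy of this form can only err on (timestep, state) pairs the dataset never covers, which is the source of the $|\mathcal{S}| H^2$-type suboptimality discussed in the abstract.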
