Paper Title

Tight Performance Guarantees of Imitator Policies with Continuous Actions

Paper Authors

Maran, Davide, Metelli, Alberto Maria, Restelli, Marcello

Paper Abstract

Behavioral Cloning (BC) aims at learning a policy that mimics the behavior demonstrated by an expert. The current theoretical understanding of BC is limited to the case of finite actions. In this paper, we study BC with the goal of providing theoretical guarantees on the performance of the imitator policy in the case of continuous actions. We start by deriving a novel bound on the performance gap based on the Wasserstein distance, applicable to continuous-action experts, holding under the assumption that the value function is Lipschitz continuous. Since this latter condition is hardly fulfilled in practice, even for Lipschitz Markov Decision Processes and policies, we propose a relaxed setting, proving that the value function is always Hölder continuous. This result is of independent interest and allows obtaining a general bound in BC for the performance of the imitator policy. Finally, we analyze noise injection, a common practice in which the expert action is executed in the environment after the application of a noise kernel. We show that this practice allows deriving stronger performance guarantees, at the price of a bias due to the noise addition.
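The performance bound above is stated in terms of the Wasserstein distance between the expert's and the imitator's action distributions. As a minimal illustrative sketch (not the paper's method), the empirical 1-Wasserstein distance between two sampled 1-D action distributions can be computed with SciPy; the Gaussian policies and the mean shift of 0.1 here are hypothetical choices for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D continuous-action samples: an expert policy and an
# imitator policy whose mean is shifted by 0.1 (illustrative values only).
expert_actions = rng.normal(loc=0.0, scale=1.0, size=10_000)
imitator_actions = rng.normal(loc=0.1, scale=1.0, size=10_000)

# Empirical 1-Wasserstein distance between the two action distributions.
# For two Gaussians with equal scale, the true W1 equals the mean shift.
w1 = wasserstein_distance(expert_actions, imitator_actions)
print(w1)
```

For equal-variance Gaussians, the true 1-Wasserstein distance is exactly the difference of the means, so the empirical value here should be close to 0.1; in the paper's setting, bounds of this kind on the action distributions translate into bounds on the performance gap under Lipschitz (or Hölder) continuity of the value function.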
