Paper Title
The Importance of Non-Markovianity in Maximum State Entropy Exploration
Paper Authors
Paper Abstract
In the maximum state entropy exploration framework, an agent interacts with a reward-free environment to learn a policy that maximizes the entropy of the expected state visitations it is inducing. Hazan et al. (2019) noted that the class of Markovian stochastic policies is sufficient for the maximum state entropy objective, and exploiting non-Markovianity is generally considered pointless in this setting. In this paper, we argue that non-Markovianity is instead paramount for maximum state entropy exploration in a finite-sample regime. In particular, we recast the objective to target the expected entropy of the induced state visitations in a single trial. Then, we show that the class of non-Markovian deterministic policies is sufficient for the introduced objective, while Markovian policies suffer non-zero regret in general. However, we prove that the problem of finding an optimal non-Markovian policy is NP-hard. Despite this negative result, we discuss avenues to address the problem in a tractable way and how non-Markovian exploration could benefit the sample efficiency of online reinforcement learning in future work.
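To make the recast objective concrete, the following is a hedged sketch in assumed notation (the symbols $d_T^{\pi}$, $T$, and the trajectory expectation are illustrative and not taken from the abstract). Let $d_T^{\pi}(s) = \frac{1}{T}\sum_{t=1}^{T} \mathbb{1}(s_t = s)$ denote the empirical distribution of states visited by policy $\pi$ in a single $T$-step trial. The classical objective considered by Hazan et al. (2019) maximizes the entropy of the expected visitations, whereas the single-trial objective described above maximizes the expected entropy of the visitations induced within one trial:

\[ \max_{\pi} \; H\big(\mathbb{E}_{\pi}[\, d_T^{\pi} \,]\big) \qquad \text{(entropy of expected visitations)} \]
\[ \max_{\pi} \; \mathbb{E}_{\pi}\big[\, H(d_T^{\pi}) \,\big] \qquad \text{(expected entropy of single-trial visitations)} \]

Since entropy is concave, Jensen's inequality gives $\mathbb{E}_{\pi}[H(d_T^{\pi})] \le H(\mathbb{E}_{\pi}[d_T^{\pi}])$, which suggests why optimizing the classical objective need not optimize the single-trial one in a finite-sample regime.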