Paper Title
Policy Optimization as Online Learning with Mediator Feedback
Paper Authors
Paper Abstract
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback, which frames PO as an online learning problem over the policy space. Compared to standard bandit feedback, the additional available information allows reusing the samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, which employs a randomized exploration strategy, in contrast to existing optimistic approaches. When the policy space is finite, we show that under certain circumstances it is possible to achieve constant regret, while logarithmic regret is always guaranteed. We also derive problem-dependent regret lower bounds. We then extend RANDOMIST to compact policy spaces. Finally, we provide numerical simulations on finite and compact policy spaces, comparing against PO and bandit baselines.
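The sample-reuse mechanism described in the abstract, estimating one policy's performance from samples collected under other policies via multiple importance sampling with truncated weights, can be illustrated with a minimal sketch. The listing below assumes Gaussian policies over a scalar action and uses the balance heuristic with a fixed truncation threshold; the function names, the toy reward, and the specific truncation rule are illustrative assumptions, not the paper's exact estimator or truncation schedule.

import numpy as np

def gaussian_pdf(x, mean, std):
    # Density of a univariate Gaussian, used here as a toy policy over a scalar action.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def truncated_mis_estimate(target_pdf, behavior_pdfs, counts, actions, rewards, trunc):
    # Balance-heuristic multiple importance sampling with truncated weights.
    n = len(actions)
    # Mixture density of the behavioral policies, weighted by their sample counts.
    mixture = sum((c / n) * pdf(actions) for c, pdf in zip(counts, behavior_pdfs))
    # Importance weights of the target policy against the behavioral mixture.
    weights = target_pdf(actions) / mixture
    # Truncating the weights keeps the variance of the estimator under control.
    weights = np.minimum(weights, trunc)
    return np.mean(weights * rewards)

# Hypothetical usage: reuse samples from two behavioral policies to evaluate a third one.
rng = np.random.default_rng(0)
means, counts = [0.0, 1.0], [50, 50]
actions = np.concatenate([rng.normal(m, 1.0, size=c) for m, c in zip(means, counts)])
rewards = -(actions - 0.5) ** 2  # toy reward peaking at action 0.5
behavior_pdfs = [lambda x, m=m: gaussian_pdf(x, m, 1.0) for m in means]
target_pdf = lambda x: gaussian_pdf(x, 0.5, 1.0)
print(truncated_mis_estimate(target_pdf, behavior_pdfs, counts, actions, rewards, trunc=10.0))

In this sketch, all pooled samples contribute to the evaluation of the target policy, which is the extra information that mediator feedback provides over standard bandit feedback, where only samples drawn from the evaluated policy itself could be used.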