Paper Title

Truly Deterministic Policy Optimization

Authors

Ehsan Saleh, Saba Ghaffari, Timothy Bretl, Matthew West

Abstract

In this paper, we present a policy gradient method that avoids exploratory noise injection and performs policy search over the deterministic landscape. By avoiding noise injection, all sources of estimation variance can be eliminated in systems with deterministic dynamics (up to the initial state distribution). Since deterministic policy regularization is impossible using traditional non-metric measures such as the KL divergence, we derive a Wasserstein-based quadratic model for our purposes. We state conditions on the system model under which it is possible to establish a monotonic policy improvement guarantee, propose a surrogate function for policy gradient estimation, and show that it is possible to compute exact advantage estimates if both the state transition model and the policy are deterministic. Finally, we describe two novel robotic control environments -- one with non-local rewards in the frequency domain and the other with a long horizon (8000 time-steps) -- for which our policy gradient method (TDPO) significantly outperforms existing methods (PPO, TRPO, DDPG, and TD3). Our implementation with all the experimental settings is available at https://github.com/ehsansaleh/code_tdpo
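Two claims in the abstract can be made concrete with a small numerical sketch: (i) two deterministic policies are Dirac measures, so their KL divergence is either zero or infinite and gives no useful regularization signal, whereas the 2-Wasserstein distance between Dirac measures reduces to the distance between the actions they select; and (ii) when both the transition model and the policy are deterministic, returns and hence advantages can be computed exactly by a single rollout, with no Monte Carlo variance. The toy linear system, the gain parameter k, and all function names below are hypothetical illustrations for these two points only; they are not the paper's TDPO implementation or its environments.

```python
# Minimal sketch, assuming a hypothetical 1-D deterministic system;
# not the paper's code or environments.
A, B, GAMMA, HORIZON = 0.9, 0.5, 0.99, 200

def step(s, a):
    """Deterministic transition and reward."""
    s_next = A * s + B * a
    reward = -(s ** 2 + 0.1 * a ** 2)
    return s_next, reward

def policy(s, k):
    """Deterministic linear policy a = -k * s (stand-in for a learned policy)."""
    return -k * s

def rollout_return(s0, k, first_action=None):
    """Exact discounted return from s0: no sampling, hence zero estimation variance."""
    s, total, discount = s0, 0.0, 1.0
    for t in range(HORIZON):
        a = first_action if (t == 0 and first_action is not None) else policy(s, k)
        s, r = step(s, a)
        total += discount * r
        discount *= GAMMA
    return total

# (ii) Exact advantage A(s0, a) = Q(s0, a) - V(s0): both terms are single
# deterministic rollouts, so the estimate carries no Monte Carlo noise.
s0, k = 1.0, 0.4
V = rollout_return(s0, k)
Q = rollout_return(s0, k, first_action=0.0)
print("exact advantage of a = 0 at s0:", Q - V)

# (i) Why a metric trust region: the KL divergence between two Dirac policies
# is 0 (same action) or infinite (different actions), while the 2-Wasserstein
# distance between them is simply the distance between the selected actions.
a1, a2 = policy(s0, k), policy(s0, k + 0.05)
print("W2 between the two deterministic policies at s0:", abs(a1 - a2))
```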
