Paper Title

Off-policy Maximum Entropy Reinforcement Learning: Soft Actor-Critic with Advantage Weighted Mixture Policy (SAC-AWMP)

Paper Authors

Zhimin Hou, Kuangen Zhang, Yi Wan, Dongyu Li, Chenglong Fu, Haoyong Yu

Paper Abstract

The optimal policy of a reinforcement learning problem is often discontinuous and non-smooth: for two states with similar representations, the optimal policies can be significantly different. In this case, representing the entire policy with a function approximator (FA) that shares parameters across all states may not be desirable, because the generalization induced by parameter sharing makes discontinuous, non-smooth policies difficult to represent. A common way to address this problem, known as Mixture-of-Experts, is to represent the policy as a weighted sum of multiple components, where different components perform well on different parts of the state space. Following this idea, and inspired by a recent work called advantage-weighted information maximization, we propose to learn, for each state, the weights of these components so that they encode information about both the state itself and the action preferred so far for that state. The action preference is characterized via the advantage function. As a result, the weight of each component is large only for groups of states whose representations are similar and whose preferred-action representations are also similar, so each component is easy to represent. We call a policy parameterized in this way an Advantage Weighted Mixture Policy (AWMP) and apply this idea to improve Soft Actor-Critic (SAC), one of the most competitive continuous control algorithms. Experimental results demonstrate that SAC with AWMP clearly outperforms SAC on four commonly used continuous control tasks and achieves stable performance across different random seeds.
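The abstract only describes AWMP structurally, so the following is a minimal illustrative sketch, not the authors' implementation: several Gaussian policy components share a state encoder, and a gating network produces per-state softmax weights over them. The class name AWMPPolicy, the network sizes, the example dimensions, and the choice to mix component means directly are all assumptions for illustration; the advantage-weighted information-maximization objective that trains the gating weights in the paper is omitted here.

```python
import torch
import torch.nn as nn


class AWMPPolicy(nn.Module):
    """Sketch of a mixture policy head with per-state component weights."""

    def __init__(self, state_dim: int, action_dim: int,
                 num_components: int = 4, hidden: int = 256):
        super().__init__()
        # Shared state encoder.
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # One Gaussian head (mean and log-std) per mixture component.
        self.mean_heads = nn.ModuleList(
            [nn.Linear(hidden, action_dim) for _ in range(num_components)])
        self.log_std_heads = nn.ModuleList(
            [nn.Linear(hidden, action_dim) for _ in range(num_components)])
        # Gating network: per-state softmax weights over the components.
        self.gate = nn.Linear(hidden, num_components)

    def forward(self, state: torch.Tensor):
        h = self.encoder(state)
        weights = torch.softmax(self.gate(h), dim=-1)                 # (batch, K)
        means = torch.stack([m(h) for m in self.mean_heads], dim=1)   # (batch, K, act)
        log_stds = torch.stack(
            [s(h) for s in self.log_std_heads], dim=1).clamp(-20, 2)
        # Mixture policy: weighted sum of the components, so each component
        # only needs to fit the states where its gate weight is large.
        mixed_mean = (weights.unsqueeze(-1) * means).sum(dim=1)
        mixed_log_std = (weights.unsqueeze(-1) * log_stds).sum(dim=1)
        return mixed_mean, mixed_log_std


# Usage example with hypothetical dimensions: sample a tanh-squashed action, as in SAC.
policy = AWMPPolicy(state_dim=17, action_dim=6)
mean, log_std = policy(torch.randn(1, 17))
action = torch.tanh(mean + log_std.exp() * torch.randn_like(mean))
```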
