Paper Title
ERL-Re$^2$: Efficient Evolutionary Reinforcement Learning with Shared State Representation and Individual Policy Representation
Paper Authors
Paper Abstract
Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithms (EA) are two major paradigms of policy optimization with distinct learning principles, i.e., gradient-based vs. gradient-free. An appealing research direction is integrating Deep RL and EA to devise new methods by fusing their complementary advantages. However, existing works on combining Deep RL and EA have two common drawbacks: 1) the RL agent and EA agents learn their policies individually, neglecting efficient sharing of useful common knowledge; 2) parameter-level policy optimization offers no guarantee of semantic-level behavior evolution on the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re$^2$), a novel solution to the aforementioned two drawbacks. The key idea of ERL-Re$^2$ is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned by all the agents collectively; the linear policy representation provides a favorable space for efficient policy optimization, where novel behavior-level crossover and mutation operations can be performed. Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of the Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. Experiments on a range of continuous control tasks show that ERL-Re$^2$ consistently outperforms advanced baselines and achieves state-of-the-art (SOTA) performance. Our code is available at https://github.com/yeshenpy/ERL-Re2.
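To make the two-scale representation concrete, below is a minimal sketch in PyTorch, not the authors' released code. All class names, network sizes, and the per-dimension crossover rule are illustrative assumptions; the paper's actual behavior-level operators and architectures may differ. The sketch shows a nonlinear state-representation network shared by all EA and RL policies, individual linear policy heads built on top of it, and a simple crossover that recombines parents directly in the linear policy space.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two-scale representation (not the authors' released code).
# All agents share one nonlinear state-representation network; each agent owns only a
# small linear policy head on top of the shared features.

class SharedStateRepresentation(nn.Module):
    """Nonlinear state representation f(s), shared by all EA and RL policies."""
    def __init__(self, state_dim: int, feature_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim), nn.ReLU(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class LinearPolicy(nn.Module):
    """Individual linear policy head: action = tanh(W f(s) + b)."""
    def __init__(self, feature_dim: int, action_dim: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, action_dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.head(features))


def crossover_in_policy_space(parent_a: LinearPolicy, parent_b: LinearPolicy) -> LinearPolicy:
    """Illustrative crossover in the linear policy space: the child inherits each
    action dimension's linear parameters from one of the two parents. The paper's
    behavior-level crossover operates in this space but is defined differently."""
    child = LinearPolicy(parent_a.head.in_features, parent_a.head.out_features)
    with torch.no_grad():
        for dim in range(parent_a.head.out_features):
            src = parent_a if torch.rand(1).item() < 0.5 else parent_b
            child.head.weight[dim] = src.head.weight[dim]
            child.head.bias[dim] = src.head.bias[dim]
    return child


# Usage: the shared features feed every individual's linear head.
shared = SharedStateRepresentation(state_dim=17)          # e.g., a MuJoCo-style state
population = [LinearPolicy(256, 6) for _ in range(5)]     # EA population plus the RL policy
state = torch.randn(1, 17)
actions = [policy(shared(state)) for policy in population]
child = crossover_in_policy_space(population[0], population[1])
```

Because each individual is just a small linear map over the shared features, evolutionary operators and fitness generalization (e.g., via PeVFA) act on a compact, structured policy representation rather than on the full network's parameters.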