Paper Title


Learning Implicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning

Paper Authors

Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, Yuk Ying Chung

Paper Abstract


We present a multi-agent actor-critic method that aims to implicitly address the credit assignment problem under fully cooperative settings. Our key motivation is that credit assignment among agents may not require an explicit formulation as long as (1) the policy gradients derived from a centralized critic carry sufficient information for the decentralized agents to maximize their joint action value through optimal cooperation and (2) a sustained level of exploration is enforced throughout training. Under the centralized training with decentralized execution (CTDE) paradigm, we achieve the former by formulating the centralized critic as a hypernetwork such that a latent state representation is integrated into the policy gradients through its multiplicative association with the stochastic policies; to achieve the latter, we derive a simple technique called adaptive entropy regularization where magnitudes of the entropy gradients are dynamically rescaled based on the current policy stochasticity to encourage consistent levels of exploration. Our algorithm, referred to as LICA, is evaluated on several benchmarks including the multi-agent particle environments and a set of challenging StarCraft II micromanagement tasks, and we show that LICA significantly outperforms previous methods.
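The adaptive entropy regularization described above can be read as rescaling the entropy gradient by the current policy entropy itself. Below is a minimal PyTorch sketch of that reading; the coefficient `lambda_ent`, the tensor shapes, and the choice to normalize by the detached per-agent entropy are illustrative assumptions, not the authors' released implementation.

```python
import torch

def adaptive_entropy_bonus(action_probs, lambda_ent=0.03, eps=1e-8):
    """Illustrative entropy bonus whose gradient magnitude is rescaled
    by the current policy stochasticity, so the exploration pressure
    stays roughly constant as the policy becomes more deterministic.

    action_probs: (batch, n_agents, n_actions) per-agent action distributions.
    """
    # Per-agent policy entropy H(pi) = -sum_a pi(a) * log pi(a)
    entropy = -(action_probs * torch.log(action_probs + eps)).sum(dim=-1)
    # Detach the entropy used for rescaling so it only affects the
    # magnitude of the gradient, not its direction.
    scale = entropy.detach() + eps
    # d/dtheta (H / scale) ~ grad(H) / H: amplified when the policy is
    # nearly deterministic (low H), damped when it is already stochastic.
    return lambda_ent * (entropy / scale).mean()
```

In a training loop, this bonus would be added to the policy objective (or subtracted from the loss) in place of a fixed-coefficient entropy term, which is one way to keep a consistent level of exploration throughout training.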
