Paper Title
Q-value Path Decomposition for Deep Multiagent Reinforcement Learning
Paper Authors
Paper Abstract
Recently, deep multiagent reinforcement learning (MARL) has become a highly active research area, as many real-world problems can be inherently viewed as multiagent systems. A particularly interesting and widely applicable class of problems is the partially observable cooperative multiagent setting, in which a team of agents learns to coordinate its behavior conditioned on private observations and a commonly shared global reward signal. One natural solution is to resort to the centralized training with decentralized execution paradigm. During centralized training, a key challenge is multiagent credit assignment: how to allocate the global reward among individual agent policies so that they coordinate better toward maximizing system-level benefits. In this paper, we propose a new method called Q-value Path Decomposition (QPD) that decomposes the system's global Q-values into individual agents' Q-values. Unlike previous works that restrict the representational relationship between the individual Q-values and the global one, we introduce the integrated gradients attribution technique into deep MARL to directly decompose global Q-values along trajectory paths and assign credits to agents. We evaluate QPD on the challenging StarCraft II micromanagement tasks and show that it achieves state-of-the-art performance in both homogeneous and heterogeneous multiagent scenarios compared with existing cooperative MARL algorithms.
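For readers unfamiliar with the attribution technique named in the abstract, the following is a minimal sketch of the standard integrated gradients formula (Sundararajan et al., 2017); how QPD instantiates it (what plays the role of the function F, the input x, and the baseline x') is an assumption inferred from the abstract, not the paper's exact construction. Integrated gradients assigns the i-th input feature the credit

IG_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha,

which satisfies the completeness property \sum_i IG_i(x) = F(x) - F(x'). Reading the abstract, QPD plausibly takes F to be a centralized critic's global Q-value and attributes its output to per-agent features along trajectory paths, so the resulting attributions act as the individual agents' credits.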