Paper Title
Towards Global Optimality in Cooperative MARL with the Transformation And Distillation Framework
Paper Authors
Paper Abstract
Decentralized execution is a core requirement in cooperative multi-agent reinforcement learning (MARL). Recently, most popular MARL algorithms have adopted decentralized policies to enable decentralized execution and use gradient descent as their optimizer. However, there is hardly any theoretical analysis of these algorithms that takes the optimization method into account, and we find that various popular MARL algorithms with decentralized policies are suboptimal even on toy tasks when gradient descent is chosen as the optimization method. In this paper, we theoretically analyze two common classes of algorithms with decentralized policies -- multi-agent policy gradient methods and value-decomposition methods -- and prove their suboptimality when gradient descent is used. In addition, we propose the Transformation And Distillation (TAD) framework, which reformulates a multi-agent MDP as a special single-agent MDP with a sequential structure and enables decentralized execution by distilling the policy learned on the derived "single-agent" MDP. This two-stage learning paradigm addresses the optimization problem in cooperative MARL while maintaining its performance guarantee. Empirically, we implement TAD-PPO based on PPO, which can provably perform optimal policy learning in finite multi-agent MDPs and shows significantly better performance on a large set of cooperative multi-agent tasks.
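The abstract describes a two-stage paradigm: first transform the multi-agent decision problem into a sequential "single-agent" one, then distill the result into decentralized per-agent policies. The following is a minimal illustrative sketch of that idea on a hypothetical one-step cooperative matrix game; the payoff matrix, variable names, and the exhaustive search used here are assumptions for illustration only and are not the paper's TAD-PPO implementation, which instead runs PPO on the transformed MDP.

```python
# Illustrative sketch (not the authors' implementation) of the Transformation
# And Distillation (TAD) idea on a hypothetical one-step cooperative matrix
# game with two agents.  Both agents receive the common reward payoff[a1, a2].
import numpy as np

payoff = np.array([[8.0, -12.0, -12.0],
                   [-12.0, 0.0, 0.0],
                   [-12.0, 0.0, 6.0]])

# --- Stage 1: Transformation -------------------------------------------------
# Reformulate the joint decision as a sequential "single-agent" problem:
# first choose a1, then choose a2 conditioned on a1.  In this one-step game the
# sequential problem can be solved exactly by enumeration, which recovers the
# globally optimal joint action.
best_a2_given_a1 = payoff.argmax(axis=1)              # inner step: a2 given a1
values_of_a1 = payoff[np.arange(3), best_a2_given_a1]
a1_star = int(values_of_a1.argmax())                  # outer step: a1
a2_star = int(best_a2_given_a1[a1_star])
print("sequential (centralized) optimum:", (a1_star, a2_star),
      "return:", payoff[a1_star, a2_star])

# --- Stage 2: Distillation ---------------------------------------------------
# Distill the sequential teacher into decentralized policies: each agent keeps
# only its own part of the teacher's joint action, so at execution time it acts
# independently, without observing the other agent's choice.
pi1 = np.zeros(3); pi1[a1_star] = 1.0                 # agent 1's decentralized policy
pi2 = np.zeros(3); pi2[a2_star] = 1.0                 # agent 2's decentralized policy
a1, a2 = int(pi1.argmax()), int(pi2.argmax())         # decentralized execution
print("decentralized execution:", (a1, a2), "return:", payoff[a1, a2])
```

In this single-state toy setting the distilled decentralized policies trivially reproduce the centralized optimum; the abstract's claim is that the TAD framework retains such a performance guarantee in general finite multi-agent MDPs.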