Paper Title


CURO: Curriculum Learning for Relative Overgeneralization

Authors

Lin Shi, Qiyuan Liu, Bei Peng

Abstract


Relative overgeneralization (RO) is a pathology that can arise in cooperative multi-agent tasks when the optimal joint action's utility falls below that of a sub-optimal joint action. RO can cause the agents to get stuck in local optima or fail to solve cooperative tasks requiring significant coordination between agents within a given timestep. In this work, we empirically find that, in multi-agent reinforcement learning (MARL), both value-based and policy gradient MARL algorithms can suffer from RO and fail to learn effective coordination policies. To better overcome RO, we propose a novel approach called curriculum learning for relative overgeneralization (CURO). To solve a target task that exhibits strong RO, in CURO, we first fine-tune the reward function of the target task to generate source tasks to train the agents. Then, to effectively transfer the knowledge acquired in one task to the next, we use a transfer learning method that combines value function transfer with buffer transfer, which enables more efficient exploration in the target task. CURO is general and can be applied to both value-based and policy gradient MARL methods. We demonstrate that, when applied to QMIX, HAPPO, and HATRPO, CURO can successfully overcome severe RO, achieve improved performance, and outperform baseline methods in a variety of challenging cooperative multi-agent tasks.
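As a rough illustration of the curriculum idea described in the abstract, the self-contained Python sketch below applies it to the classic climbing game, a standard matrix-game example of relative overgeneralization; it is not the paper's method or tasks. Here "fine-tuning the reward function" is approximated by scaling down the miscoordination penalties to generate source tasks, and value function transfer is approximated by carrying the agents' Q-tables across curriculum stages; buffer transfer and the QMIX/HAPPO/HATRPO learners are omitted, and all names and hyperparameters are illustrative assumptions.

```python
# Toy sketch of a curriculum over reward-modified source tasks with value transfer.
# The climbing-game payoff has its optimal joint action (0, 0) surrounded by large
# penalties, so independent learners trained directly on it tend to settle on the
# safe but suboptimal (2, 2) -- the relative overgeneralization pathology.
import numpy as np

BASE = np.array([[11., -30., 0.],
                 [-30.,  7., 0.],
                 [  0.,  6., 5.]])

def make_source_task(penalty_scale):
    """Generate a source task by shrinking the negative payoffs (reward fine-tuning)."""
    payoff = BASE.copy()
    payoff[payoff < 0] *= penalty_scale
    return payoff

def train_stage(payoff, q1, q2, rng, episodes=5000, eps=0.2, lr=0.1):
    """Independent Q-learning for two agents on a one-shot matrix game."""
    for _ in range(episodes):
        a1 = rng.integers(3) if rng.random() < eps else int(q1.argmax())
        a2 = rng.integers(3) if rng.random() < eps else int(q2.argmax())
        r = payoff[a1, a2]
        q1[a1] += lr * (r - q1[a1])
        q2[a2] += lr * (r - q2[a2])
    return q1, q2

q1, q2 = np.zeros(3), np.zeros(3)
rng = np.random.default_rng(0)
for scale in (0.0, 0.5, 1.0):  # curriculum: easy source tasks -> original target task
    payoff = make_source_task(scale)
    q1, q2 = train_stage(payoff, q1, q2, rng)  # Q-tables carried over (value transfer)
print("greedy joint action:", int(q1.argmax()), int(q2.argmax()))  # ideally (0, 0)
```

Starting from the penalty-free source task lets the agents lock onto the high-payoff joint action before the penalties are gradually reintroduced, whereas training from scratch on the full-penalty game typically converges to the suboptimal joint action.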
