Paper Title
Centralized Training with Hybrid Execution in Multi-Agent Reinforcement Learning
Paper Authors
Paper Abstract
We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully complete cooperative tasks with arbitrary communication levels at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized), to a setting featuring full communication (fully centralized), but the agents do not know beforehand which communication level they will encounter at execution time. To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly model a communication process between the agents. We contribute MARO, an approach that makes use of an auto-regressive predictive model, trained in a centralized manner, to estimate missing agents' observations at execution time. We evaluate MARO on standard scenarios and extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms relevant baselines, allowing agents to act with faulty communication while successfully exploiting shared information.
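The abstract describes MARO's core mechanism only at a high level. Below is a minimal, hypothetical sketch (not the authors' implementation) of how an auto-regressive predictive model could fill in observations that are dropped by the communication process at execution time; all names and shapes (`ObservationPredictor`, `fill_missing_obs`, `n_agents`, `obs_dim`) are illustrative assumptions.

```python
# Hypothetical sketch: an auto-regressive model predicts missing agent
# observations when communication drops them at execution time.
# This is NOT the authors' MARO code; names, shapes, and architecture are assumed.
import torch
import torch.nn as nn


class ObservationPredictor(nn.Module):
    """Recurrent model that predicts the next joint observation from past ones."""

    def __init__(self, n_agents: int, obs_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.joint_dim = n_agents * obs_dim
        self.rnn = nn.GRUCell(self.joint_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, self.joint_dim)

    def forward(self, joint_obs: torch.Tensor, hidden: torch.Tensor):
        hidden = self.rnn(joint_obs, hidden)
        return self.head(hidden), hidden  # predicted joint observation, new state


def fill_missing_obs(predictor, joint_obs, comm_mask, hidden):
    """Replace observations of non-communicating agents with model predictions.

    joint_obs: (batch, n_agents * obs_dim) observations actually received
    comm_mask: (batch, n_agents * obs_dim) 1 where an observation was shared, 0 otherwise
    """
    predicted, hidden = predictor(joint_obs * comm_mask, hidden)
    # Keep real observations where available, fall back to predictions otherwise.
    completed = comm_mask * joint_obs + (1.0 - comm_mask) * predicted
    return completed, hidden
```

In such a scheme the predictor would be trained centrally (with access to all agents' observations) and queried at execution time whatever communication level the agents happen to face, which is the behaviour the abstract attributes to MARO.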