Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity

Paper Title

Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity

Authors

Arrasy Rahman, Elliot Fosong, Ignacio Carlucho, Stefano V. Albrecht

Abstract

Ad hoc teamwork (AHT) is the challenge of designing a robust learner agent that effectively collaborates with unknown teammates without prior coordination mechanisms. Early approaches address the AHT challenge by training the learner with a diverse set of handcrafted teammate policies, usually designed based on an expert's domain knowledge about the policies the learner may encounter. However, implementing teammate policies for training based on domain knowledge is not always feasible. In such cases, recent approaches attempted to improve the robustness of the learner by training it with teammate policies generated by optimising information-theoretic diversity metrics. The problem with optimising existing information-theoretic diversity metrics for teammate policy generation is the emergence of superficially different teammates. When used for AHT training, superficially different teammate behaviours may not improve a learner's robustness during collaboration with unknown teammates. In this paper, we present an automated teammate policy generation method optimising the Best-Response Diversity (BRDiv) metric, which measures diversity based on the compatibility of teammate policies in terms of returns. We evaluate our approach in environments with multiple valid coordination strategies, comparing against methods optimising information-theoretic diversity metrics and an ablation not optimising any diversity metric. Our experiments indicate that optimising BRDiv yields a diverse set of training teammate policies that improve the learner's performance relative to previous teammate generation approaches when collaborating with near-optimal previously unseen teammate policies.
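
The abstract's contrast between superficially different teammates and teammates that differ in terms of returns can be illustrated with a small sketch. This is not the BRDiv objective itself, which the abstract does not spell out. It assumes a hypothetical cross-play matrix `returns[i][j]` holding the expected return when teammate policy `i` is paired with the best response trained for teammate policy `j`. The function names and the example numbers are made up for illustration.

```python
def compatibility_gaps(returns):
    """Return the gap between matched and mismatched pairings.

    returns[i][j] is assumed (hypothetically) to be the expected return
    when teammate policy i is paired with the best response for policy j.
    """
    n = len(returns)
    gaps = {}
    for i in range(n):
        for j in range(n):
            if i != j:
                # How much return teammate i loses when paired with the
                # best response that was trained for teammate j instead.
                gaps[(i, j)] = returns[i][i] - returns[i][j]
    return gaps


def min_gap(returns):
    """Smallest gap over all mismatched pairs (illustrative score only)."""
    return min(compatibility_gaps(returns).values())


# Two teammates that may act differently but are served equally well by
# each other's best response: no difference in terms of returns.
superficial = [[10.0, 10.0],
               [10.0, 10.0]]

# Two teammates that each require their own best response.
distinct = [[10.0, 2.0],
            [1.0, 10.0]]

print(min_gap(superficial))
print(min_gap(distinct))
```

In this toy setting a gap of zero means the two teammate policies are interchangeable from the learner's point of view, however different their actions look, while a large gap means each one demands a different response from the learner.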
