Paper Title
Beyond Rewards: a Hierarchical Perspective on Offline Multiagent Behavioral Analysis
Paper Authors
Paper Abstract
Each year, expert-level performance is attained in increasingly complex multiagent domains, with notable examples including Go, Poker, and StarCraft II. This rapid progression is accompanied by a commensurate need to better understand how such agents attain this performance, in order to enable their safe deployment, identify limitations, and reveal potential means of improving them. In this paper we take a step back from performance-focused multiagent learning and instead turn our attention towards agent behavior analysis. We introduce a model-agnostic method for discovering behavior clusters in multiagent domains, using variational inference to learn a hierarchy of behaviors at the joint and local agent levels. Our framework makes no assumptions about agents' underlying learning algorithms, does not require access to their latent states or policies, and is trained using only offline observational data. We illustrate the effectiveness of our method for enabling a coupled understanding of behaviors at the joint and local agent levels, detecting behavior changepoints throughout training, and discovering core behavioral concepts; we also demonstrate the approach's scalability to a high-dimensional multiagent MuJoCo control domain and its ability to disentangle previously trained policies in OpenAI's hide-and-seek domain.
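To make the abstract's central idea concrete, the sketch below shows one way a hierarchical latent-variable model of this kind could be set up: a joint-level latent summarizing a full multiagent trajectory, per-agent local latents conditioned on it, and a variational (ELBO) objective trained on offline observations only. This is a minimal sketch under assumed design choices, not the paper's actual model; the class, parameter names (e.g. HierarchicalBehaviorVAE, z_joint_dim, z_local_dim), architecture, and the crude Gaussian reconstruction term are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's model): a hierarchical VAE over offline
# multiagent trajectories, with one joint-level latent shared by all agents and
# one local latent per agent. All names and dimensions are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class HierarchicalBehaviorVAE(nn.Module):
    def __init__(self, obs_dim, n_agents, z_joint_dim=8, z_local_dim=4, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        # Joint encoder: reads the concatenated observations of all agents.
        self.joint_rnn = nn.GRU(obs_dim * n_agents, hidden, batch_first=True)
        self.joint_head = nn.Linear(hidden, 2 * z_joint_dim)
        # Local encoder: per-agent observations, conditioned on the joint latent.
        self.local_rnn = nn.GRU(obs_dim, hidden, batch_first=True)
        self.local_head = nn.Linear(hidden + z_joint_dim, 2 * z_local_dim)
        # Decoder: reconstructs an agent's observation summary from both latents.
        self.decoder = nn.Sequential(
            nn.Linear(z_joint_dim + z_local_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    @staticmethod
    def _gaussian(params):
        mu, log_sigma = params.chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

    def forward(self, obs):
        # obs: (batch, time, n_agents, obs_dim) -- offline observations only;
        # no rewards, agent policies, or internal agent states are used.
        b, t, n, d = obs.shape
        _, h_joint = self.joint_rnn(obs.reshape(b, t, n * d))
        q_joint = self._gaussian(self.joint_head(h_joint[-1]))
        z_joint = q_joint.rsample()

        kl = kl_divergence(q_joint, Normal(0.0, 1.0)).sum(-1)
        recon_nll = torch.zeros_like(kl)
        for i in range(n):
            _, h_local = self.local_rnn(obs[:, :, i, :])
            q_local = self._gaussian(
                self.local_head(torch.cat([h_local[-1], z_joint], dim=-1)))
            z_local = q_local.rsample()
            kl = kl + kl_divergence(q_local, Normal(0.0, 1.0)).sum(-1)
            mean_obs = self.decoder(torch.cat([z_joint, z_local], dim=-1))
            # Gaussian likelihood on the agent's time-averaged observation --
            # deliberately crude, just to make the ELBO concrete.
            recon_nll = recon_nll - Normal(mean_obs, 1.0).log_prob(
                obs[:, :, i, :].mean(dim=1)).sum(-1)
        return (recon_nll + kl).mean()  # negative ELBO to minimize
```

Under this kind of setup, behavior clusters could then be obtained by, for example, fitting a mixture model over the inferred joint-level and agent-level latents of held-out trajectories, yielding the coupled joint/local view the abstract describes.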