Paper Title
Learning Generalizable Risk-Sensitive Policies to Coordinate in Decentralized Multi-Agent General-Sum Games
Paper Authors
Paper Abstract
While various multi-agent reinforcement learning methods have been proposed for cooperative settings, few works investigate how self-interested learning agents achieve mutual coordination in decentralized general-sum games and generalize pre-trained policies to non-cooperative opponents during execution. In this paper, we present the Generalizable Risk-Sensitive Policy (GRSP). GRSP learns distributions over agents' returns and estimates a dynamic risk-seeking bonus to discover risky coordination strategies. Furthermore, to avoid overfitting to training opponents, GRSP learns an auxiliary opponent-modeling task to infer opponents' types and dynamically alters the corresponding strategy during execution. Empirically, agents trained via GRSP stably achieve mutual coordination during training and avoid being exploited by non-cooperative opponents during execution. To the best of our knowledge, this is the first method to learn coordination strategies between agents in both the iterated prisoner's dilemma (IPD) and the iterated stag hunt (ISH) without shaping opponents or rewards, and the first to consider generalization during execution. Furthermore, we show that GRSP scales to high-dimensional settings.
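The abstract does not spell out how the return distribution or the risk-seeking bonus is computed. As a rough, hypothetical sketch only (not the authors' implementation), the snippet below assumes a quantile-regression return head in the spirit of distributional RL and defines the bonus as the mean of the upper quantiles minus the overall mean, so that actions with heavy upper tails (risky but potentially cooperative strategies) receive extra credit. Names such as `QuantileReturnHead`, `risk_seeking_bonus`, and the `tail` parameter are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a quantile-based return distribution with a
# risk-seeking bonus; not the GRSP reference implementation.
import torch
import torch.nn as nn


class QuantileReturnHead(nn.Module):
    """Predicts n_quantiles samples of the return Z(s, a) per action."""

    def __init__(self, obs_dim: int, n_actions: int, n_quantiles: int = 32):
        super().__init__()
        self.n_actions = n_actions
        self.n_quantiles = n_quantiles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions * n_quantiles),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Output shape: (batch, n_actions, n_quantiles)
        return self.net(obs).view(-1, self.n_actions, self.n_quantiles)


def risk_seeking_bonus(quantiles: torch.Tensor, tail: float = 0.25) -> torch.Tensor:
    """Optimistic bonus: mean of the top `tail` fraction of quantiles minus
    the overall mean, rewarding actions whose return distribution has a
    heavy upper tail."""
    k = max(1, int(tail * quantiles.shape[-1]))
    top, _ = quantiles.topk(k, dim=-1)
    return top.mean(dim=-1) - quantiles.mean(dim=-1)


if __name__ == "__main__":
    head = QuantileReturnHead(obs_dim=8, n_actions=2)
    obs = torch.randn(4, 8)
    q = head(obs)                      # (4, 2, 32)
    bonus = risk_seeking_bonus(q)      # (4, 2)
    # Greedy action selection could mix the mean return with the bonus;
    # the 0.5 weight here is an arbitrary illustrative coefficient.
    scores = q.mean(dim=-1) + 0.5 * bonus
    action = scores.argmax(dim=-1)
    print(action.shape)                # torch.Size([4])
```

In practice, the paper describes the bonus as dynamic and pairs it with an auxiliary opponent-modeling task; how those pieces interact during training is beyond what the abstract specifies, so this sketch only illustrates the general idea of extracting a risk-seeking signal from a learned return distribution.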