Reinforcement Learning for Unified Allocation and Patrolling in Signaling Games with Uncertainty

Paper Title

Reinforcement Learning for Unified Allocation and Patrolling in Signaling Games with Uncertainty

Authors

Venugopal, Aravind; Bondi, Elizabeth; Kamarthi, Harshavardhan; Dholakia, Keval; Ravindran, Balaraman; Tambe, Milind

Abstract

Green Security Games (GSGs) have been successfully used in the protection of valuable resources such as fisheries, forests and wildlife. While real-world deployment involves both resource allocation and subsequent coordinated patrolling with communication and real-time, uncertain information, previous game models do not fully address both of these stages simultaneously. Furthermore, adopting existing solution strategies is difficult since they do not scale well for larger, more complex variants of the game models. We therefore first propose a novel GSG model that combines defender allocation, patrolling, real-time drone notification to human patrollers, and drones sending warning signals to attackers. The model further incorporates uncertainty for real-time decision-making within a team of drones and human patrollers. Second, we present CombSGPO, a novel and scalable algorithm based on reinforcement learning, to compute a defender strategy for this game model. CombSGPO performs policy search over a multi-dimensional, discrete action space to compute an allocation strategy that is best suited to a best-response patrolling strategy for the defender, learnt by training a multi-agent Deep Q-Network. We show via experiments that CombSGPO converges to better strategies and is more scalable than comparable approaches. Third, we provide a detailed analysis of the coordination and signaling behavior learnt by CombSGPO, showing group formation between defender resources and patrolling formations based on signaling and notifications between resources. Importantly, we find that strategic signaling emerges in the final learnt strategy. Finally, we perform experiments to evaluate these strategies under different levels of uncertainty.
