Distributed Cooperative Decision Making in Multi-agent Multi-armed Bandits

Paper Title

Distributed Cooperative Decision Making in Multi-agent Multi-armed Bandits

Authors

Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard

Abstract

We study a distributed decision-making problem in which multiple agents face the same multi-armed bandit (MAB), and each agent makes sequential choices among arms to maximize its own individual reward. The agents cooperate by sharing their estimates over a fixed communication graph. We consider two reward models: an unconstrained model, in which two or more agents can choose the same arm and collect independent rewards, and a constrained model, in which agents that choose the same arm at the same time receive no reward. We design a dynamic, consensus-based, distributed estimation algorithm for cooperative estimation of the mean reward at each arm. We leverage the estimates from this algorithm to develop two distributed algorithms, coop-UCB2 and coop-UCB2-selective-learning, for the unconstrained and constrained reward models, respectively. We show that both algorithms achieve group performance close to that of a centralized fusion center. Further, we investigate the influence of the communication graph structure on performance. We propose a novel graph explore-exploit index that predicts the relative performance of groups in terms of the communication graph, and a novel nodal explore-exploit centrality index that predicts the relative performance of agents in terms of their locations in the communication graph.
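To make the idea of consensus-based cooperative estimation concrete, the following is a minimal sketch for the unconstrained reward model only. It is not the paper's coop-UCB2: the mixing matrix `P = I - (kappa / d_max) * L`, the step size `kappa`, the Gaussian rewards, and the UCB1-style exploration bonus are all assumptions chosen for illustration, and the function names `consensus_matrix` and `run_cooperative_ucb` are hypothetical. Each agent keeps local estimates of how often every arm has been pulled and how much reward it has produced, and mixes those estimates with its graph neighbors at every step.

```python
import numpy as np


def consensus_matrix(adjacency, kappa=0.5):
    """Mixing matrix P = I - (kappa / d_max) * L for a connected undirected graph.

    Assumption for illustration: with 0 < kappa <= 1, P is symmetric with
    non-negative entries, and its rows and columns each sum to 1.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    return np.eye(len(adjacency)) - (kappa / degrees.max()) * laplacian


def run_cooperative_ucb(adjacency, arm_means, horizon=500, noise_std=1.0, seed=0):
    """Simulate agents that share running estimates over a fixed graph.

    Returns (pulls, n_hat): the true per-agent pull counts and each agent's
    consensus estimate of the pull counts.
    """
    rng = np.random.default_rng(seed)
    arm_means = np.asarray(arm_means, dtype=float)
    n_agents, n_arms = len(adjacency), len(arm_means)
    P = consensus_matrix(adjacency)
    agents = np.arange(n_agents)

    n_hat = np.zeros((n_agents, n_arms))  # estimated pull counts per arm
    s_hat = np.zeros((n_agents, n_arms))  # estimated total reward per arm
    pulls = np.zeros((n_agents, n_arms), dtype=int)

    for t in range(1, horizon + 1):
        if t <= n_arms:
            # Initialization: every agent samples each arm once.
            choices = np.full(n_agents, t - 1)
        else:
            mean_est = s_hat / n_hat
            # Standard UCB1-style bonus, swapped in for the paper's index.
            bonus = np.sqrt(2.0 * np.log(t) / n_hat)
            choices = np.argmax(mean_est + bonus, axis=1)

        rewards = rng.normal(arm_means[choices], noise_std)

        new_counts = np.zeros((n_agents, n_arms))
        new_rewards = np.zeros((n_agents, n_arms))
        new_counts[agents, choices] = 1.0
        new_rewards[agents, choices] = rewards

        # Running consensus: add the new local data, then mix with neighbors.
        n_hat = P @ (n_hat + new_counts)
        s_hat = P @ (s_hat + new_rewards)
        pulls[agents, choices] += 1

    return pulls, n_hat
```

Calling `run_cooperative_ucb` with the adjacency matrix of, say, a three-agent line graph and a short list of arm means returns the per-agent pull counts together with each agent's consensus estimate of those counts. Because the assumed `P` has columns that sum to 1, the estimates summed over agents match the group's true pull totals for each arm, which is the property that lets each agent act on group-level information using only neighbor-to-neighbor exchange.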
