论文标题
实例依赖的帕累托前沿保证了多武器的强盗,没有通信
The Pareto Frontier of Instance-Dependent Guarantees in Multi-Player Multi-Armed Bandits with no Communication
论文作者
论文摘要
我们研究了随机的多人多军匪徒问题。在这个问题中,$ m $ $ players合作,从$ k> m $ harm中最大化他们的总奖励。但是,如果玩家同时拉着同一支手臂,则无法交流和受到惩罚(例如,没有获得奖励)。我们询问是否有可能获得最佳实例的遗憾$ \ tilde {o}(1/δ)$,其中$δ$是$ m $ th和$ m $ th和$ m+1 $ 1 $ -st最佳武器之间的差距。最近在一个模型中实现了这种保证,允许玩家通过故意碰撞隐式交流。 令人惊讶的是,我们表明,没有任何沟通,这种保证是无法实现的。实际上,对于某些$δ$的某些值,获得最佳$ \ tilde {o}(1/δ)$遗憾必然意味着在其他制度中严格的次优遗憾。我们的主要结果是对帕累托最佳实例依赖性权衡的完整表征,这些权衡是在没有通信的情况下进行的。我们的算法概括了Bubeck,Budzinski和第二作者的算法。在那儿,即使有强大的无碰撞属性,自适应对手可能会破坏对碰撞的反馈,即使对碰撞的反馈也会成功。我们的下限基于多个尺度的拓扑障碍,并且是全新的。
We study the stochastic multi-player multi-armed bandit problem. In this problem, $m$ players cooperate to maximize their total reward from $K > m$ arms. However the players cannot communicate and are penalized (e.g. receive no reward) if they pull the same arm at the same time. We ask whether it is possible to obtain optimal instance-dependent regret $\tilde{O}(1/Δ)$ where $Δ$ is the gap between the $m$-th and $m+1$-st best arms. Such guarantees were recently achieved in a model allowing the players to implicitly communicate through intentional collisions. Surprisingly, we show that with no communication at all, such guarantees are not achievable. In fact, obtaining the optimal $\tilde{O}(1/Δ)$ regret for some values of $Δ$ necessarily implies strictly sub-optimal regret in other regimes. Our main result is a complete characterization of the Pareto optimal instance-dependent trade-offs that are possible with no communication. Our algorithm generalizes that of Bubeck, Budzinski, and the second author. As there, our algorithm succeeds even when feedback upon collision can be corrupted by an adaptive adversary, thanks to a strong no-collision property. Our lower bound is based on topological obstructions at multiple scales and is completely new.