先知不平等和潘多拉盒子的强盗算法

论文标题

先知不平等和潘多拉盒子的强盗算法

Bandit Algorithms for Prophet Inequality and Pandora's Box

论文作者

Gatmiry, Khashayar, Kesselheim, Thomas, Singla, Sahil, Wang, Yifan

论文摘要

先知不平等和潘多拉的盒子问题是机制设计，在线算法，随机优化，最佳停止和操作研究中应用的基本随机问题。这些作品中通常的假设是，将$ n $的随机变量的概率分布作为算法输入。由于在实践中需要学习这些分布，因此我们启动了多军匪徒模型中此类随机问题的研究。在多武器匪徒模型中，我们与$ n $未知分布在$ t $ rounds上进行交互：在$ t $中，我们播放polition $ x^{（t）} $，并收到有关$ x^{（t）} $的性能的部分（强盗）反馈。目的是最大程度地减少遗憾，这是知道分布与我们算法的总价值的最佳算法总值超过$ t $ rounds的差异，该算法的总价值从部分反馈中学习了分布。我们的主要结果给出了近乎最佳的$ \ tilde {o}（\ Mathsf {poly}（n）\ sqrt {t}）$对于先知不平等和潘多拉的盒子的总遗憾算法。我们的证明是通过维持最佳政策未知指数的置信区间来进行的。探索 - 探索折衷的折衷使我们无法直接完善这些置信区间，因此主要技术是设计一种遗憾的上限，在播放低纤维的强盗政策时是可以学习的。

The Prophet Inequality and Pandora's Box problems are fundamental stochastic problem with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A usual assumption in these works is that the probability distributions of the $n$ underlying random variables are given as input to the algorithm. Since in practice these distributions need to be learned, we initiate the study of such stochastic problems in the Multi-Armed Bandits model. In the Multi-Armed Bandits model we interact with $n$ unknown distributions over $T$ rounds: in round $t$ we play a policy $x^{(t)}$ and receive a partial (bandit) feedback on the performance of $x^{(t)}$. The goal is to minimize the regret, which is the difference over $T$ rounds in the total value of the optimal algorithm that knows the distributions vs. the total value of our algorithm that learns the distributions from the partial feedback. Our main results give near-optimal $\tilde{O}(\mathsf{poly}(n)\sqrt{T})$ total regret algorithms for both Prophet Inequality and Pandora's Box. Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound that is learnable while playing low-regret Bandit policies.

下载PDF全文

下载文献需遵守相关版权规定

论文标题