批处理的决斗土匪

论文标题

批处理的决斗土匪

Batched Dueling Bandits

论文作者

Agarwal, Arpit, Ghuge, Rohan, Nagarajan, Viswanath

论文摘要

$ k $武装的决斗匪徒问题是反馈以嘈杂的成对比较的形式进行了广泛研究。以前的工作仅集中在每次比较后策略适应的顺序设置上。但是，在许多应用程序（例如搜索排名和推荐系统）中，最好在有限数量的并行批处理中进行比较。我们在两个标准设置下研究了批次的$ k $武装的决斗匪徒问题：（i）condorcet赢家的存在，以及（ii）强烈的随机传递性和随机三角形不平等。对于这两种设置，我们都会获得批次数量和遗憾之间平稳折衷的算法。我们的遗憾界限仅使用对数数量的批次数量，符合最著名的顺序遗憾界限（达到多种含量因素）。我们通过几乎匹配的下限对遗憾分析进行补充。最后，我们还通过实验合成和真实数据来验证我们的理论结果。

The $K$-armed dueling bandit problem, where the feedback is in the form of noisy pairwise comparisons, has been widely studied. Previous works have only focused on the sequential setting where the policy adapts after every comparison. However, in many applications such as search ranking and recommendation systems, it is preferable to perform comparisons in a limited number of parallel batches. We study the batched $K$-armed dueling bandit problem under two standard settings: (i) existence of a Condorcet winner, and (ii) strong stochastic transitivity and stochastic triangle inequality. For both settings, we obtain algorithms with a smooth trade-off between the number of batches and regret. Our regret bounds match the best known sequential regret bounds (up to poly-logarithmic factors), using only a logarithmic number of batches. We complement our regret analysis with a nearly-matching lower bound. Finally, we also validate our theoretical results via experiments on synthetic and real data.

下载PDF全文

下载文献需遵守相关版权规定

论文标题