Paper Title
Predictive Bandits
Paper Authors
Paper Abstract
We introduce and study a new class of stochastic bandit problems, referred to as predictive bandits. In each round, the decision maker first decides whether to gather information about the rewards of particular arms (so that their rewards in this round can be predicted). These measurements are costly and may be corrupted by noise. The decision maker then selects an arm to be actually played in the round. Predictive bandits find applications in many areas; e.g., they can be applied to channel selection problems in radio communication systems. In this paper, we provide the first theoretical results about predictive bandits, and focus on scenarios where the decision maker is allowed to measure at most one arm per round. We derive asymptotic instance-specific regret lower bounds for these problems, and develop algorithms whose regret matches these fundamental limits. We illustrate the performance of our algorithms through numerical experiments. In particular, we highlight the gains that can be achieved by using reward predictions, and investigate the impact of noise in the corresponding measurements.
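The setting can be summarized as a simple interaction loop: in each round the learner may pay to obtain a noisy measurement of one arm's reward before committing to the arm it actually plays. The sketch below is a minimal illustration of this loop, assuming Bernoulli rewards and Gaussian measurement noise; the class and function names, the measurement cost, and the naive "measure a random arm, then play the best-looking arm" rule are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch of a predictive-bandit interaction loop.
# Names and the decision rule are illustrative assumptions.

class PredictiveBanditEnv:
    def __init__(self, mean_rewards, noise_std=0.5, measurement_cost=0.1, seed=None):
        self.mean_rewards = np.asarray(mean_rewards, dtype=float)
        self.noise_std = noise_std              # std. dev. of measurement noise
        self.measurement_cost = measurement_cost
        self.rng = np.random.default_rng(seed)

    def new_round(self):
        # Fresh per-round rewards (here: Bernoulli, one possible instantiation).
        self.rewards = self.rng.binomial(1, self.mean_rewards).astype(float)

    def measure(self, arm):
        # Costly, noisy prediction of this round's reward of `arm`.
        return self.rewards[arm] + self.rng.normal(0.0, self.noise_std)

    def play(self, arm):
        # Reward actually collected by playing `arm` this round.
        return self.rewards[arm]


def run(env, horizon=1000):
    """Naive baseline: measure one uniformly chosen arm (at most one per
    round), then play the arm whose prediction or empirical mean is largest."""
    n_arms = len(env.mean_rewards)
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    total = 0.0
    for _ in range(horizon):
        env.new_round()
        measured = env.rng.integers(n_arms)       # arm chosen for measurement
        prediction = env.measure(measured)        # noisy reward prediction
        estimates = means.copy()
        estimates[measured] = prediction          # use the fresh prediction
        arm = int(np.argmax(estimates))
        reward = env.play(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward - env.measurement_cost    # measurements are paid for
    return total


if __name__ == "__main__":
    env = PredictiveBanditEnv([0.3, 0.5, 0.7], noise_std=0.3, seed=0)
    print(run(env, horizon=2000))
```

This baseline ignores the trade-off the paper studies (when a measurement is worth its cost and which arm to measure); it only illustrates the round structure in which such an algorithm operates.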