Paper Title
Lenient Regret for Multi-Armed Bandits
Paper Authors
Paper Abstract
We consider the Multi-Armed Bandit (MAB) problem, where an agent sequentially chooses actions and observes rewards for the actions it took. While the majority of algorithms try to minimize the regret, i.e., the cumulative difference between the reward of the best action and that of the agent's actions, this criterion might lead to undesirable results. For example, in large problems, or when the interaction with the environment is brief, finding an optimal arm is infeasible, and regret-minimizing algorithms tend to over-explore. To overcome this issue, algorithms for such settings should instead focus on playing near-optimal arms. To this end, we suggest a new, more lenient, regret criterion that ignores suboptimality gaps smaller than some $ε$. We then present a variant of the Thompson Sampling (TS) algorithm, called $ε$-TS, and prove its asymptotic optimality in terms of the lenient regret. Importantly, we show that when the mean of the optimal arm is high enough, the lenient regret of $ε$-TS is bounded by a constant. Finally, we show that $ε$-TS can be applied to improve performance when the agent knows a lower bound on the suboptimality gaps.
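
To make the new criterion concrete, one plausible formalization, sketched from the abstract's description rather than taken from the paper (which may use a more general gap-dependent formulation), forgives every suboptimality gap of at most $ε$:

$$R_ε(T) \;=\; \sum_{t=1}^{T} \Delta_{a_t}\, \mathbf{1}\{\Delta_{a_t} > ε\}, \qquad \Delta_a = \mu^\star - \mu_a,$$

where $\mu^\star$ is the mean of the optimal arm, $\mu_a$ the mean of arm $a$, $a_t$ the arm played at round $t$, and $\mathbf{1}\{\cdot\}$ the indicator function. Under this definition, an agent that eventually settles on any arm whose gap is at most $ε$ stops accumulating lenient regret, which is what makes a constant bound attainable.

The following minimal sketch illustrates the criterion on Bernoulli rewards, using standard Thompson Sampling as a stand-in; it is not the paper's $ε$-TS, whose modified sampling rule is not reproduced here, and the arm means, horizon, and $ε$ are illustrative choices only.

```python
import numpy as np

def lenient_regret_of_ts(means, epsilon, horizon, rng=None):
    """Run standard Bernoulli Thompson Sampling (a stand-in for the paper's
    epsilon-TS) and return its lenient regret, which only counts rounds whose
    suboptimality gap exceeds epsilon."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means               # suboptimality gap of each arm
    successes = np.ones(len(means))          # Beta(1, 1) priors
    failures = np.ones(len(means))
    lenient_regret = 0.0
    for _ in range(horizon):
        samples = rng.beta(successes, failures)  # one posterior sample per arm
        arm = int(np.argmax(samples))            # play the arm with the largest sample
        reward = rng.binomial(1, means[arm])     # Bernoulli reward
        successes[arm] += reward
        failures[arm] += 1 - reward
        if gaps[arm] > epsilon:                  # gaps of at most epsilon are forgiven
            lenient_regret += gaps[arm]
    return lenient_regret

# Example: the second arm is within epsilon of the best, so playing it is not penalized.
print(lenient_regret_of_ts([0.9, 0.87, 0.5], epsilon=0.05, horizon=5000))
```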