Title


Thompson Sampling with Virtual Helping Agents

Authors

Kartik Anand Pant, Amod Hegde, K. V. Srinivas

Abstract


We address the problem of online sequential decision making, i.e., balancing the trade-off between exploiting current knowledge to maximize immediate performance and exploring new information to gain long-term benefits, using the multi-armed bandit framework. Thompson sampling is one of the heuristics for choosing actions that addresses this exploration-exploitation dilemma. We first propose a general framework that helps heuristically tune the exploration-versus-exploitation trade-off in Thompson sampling using multiple samples from the posterior distribution. Utilizing this framework, we propose two algorithms for the multi-armed bandit problem and provide theoretical bounds on the cumulative regret. Next, we demonstrate the empirical improvement in the cumulative regret performance of the proposed algorithms over Thompson sampling. We also show the effectiveness of the proposed algorithms on real-world datasets. In contrast to existing methods, our framework provides a mechanism to vary the amount of exploration/exploitation based on the task at hand. Toward this end, we extend our framework to two additional problems, i.e., best arm identification and time-sensitive learning in bandits, and compare our algorithms with existing methods.
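To make the multiple-posterior-sample idea concrete, the sketch below implements standard Beta-Bernoulli Thompson sampling and a hedged variant that draws `k` posterior samples per arm and scores each arm by the maximum sample, which heuristically increases exploration (with `k = 1` it reduces to vanilla Thompson sampling). This is an illustrative assumption about how multiple samples could tune the trade-off; the paper's actual algorithms and regret bounds are not reproduced here, and the function and parameter names are our own.

```python
import numpy as np

def multi_sample_thompson(true_means, horizon, k=1, seed=None):
    """Beta-Bernoulli Thompson sampling with k posterior samples per arm.

    Illustrative sketch only: scoring each arm by the max of k posterior
    samples is one plausible way to dial up exploration, not necessarily
    the algorithm proposed in the paper. Returns the cumulative regret.
    """
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    # Beta(alpha, beta) posterior for each arm, starting from a uniform prior.
    alpha = np.ones(n_arms)
    beta = np.ones(n_arms)
    best_mean = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Draw k samples from each arm's posterior; score by the maximum.
        samples = rng.beta(alpha[:, None], beta[:, None], size=(n_arms, k))
        arm = int(np.argmax(samples.max(axis=1)))
        # Observe a Bernoulli reward and update the chosen arm's posterior.
        reward = int(rng.random() < true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best_mean - true_means[arm]
    return regret
```

For example, `multi_sample_thompson([0.9, 0.5, 0.1], horizon=1000, k=3)` runs the exploration-heavy variant; comparing its regret against `k = 1` over many seeds is one way to probe the trade-off empirically.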
