Paper Title
Model Selection in Contextual Stochastic Bandit Problems
Paper Authors
Paper Abstract
We study bandit model selection in stochastic environments. Our approach relies on a meta-algorithm that selects between candidate base algorithms. We develop a meta-algorithm-base algorithm abstraction that can work with general classes of base algorithms and different types of adversarial meta-algorithms. Our methods rely on a novel and generic smoothing transformation for bandit algorithms that permits us to obtain optimal $O(\sqrt{T})$ model selection guarantees for stochastic contextual bandit problems as long as the optimal base algorithm satisfies a high-probability regret guarantee. We show through a lower bound that even when one of the base algorithms has $O(\log T)$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret in model selection, even asymptotically. Using our techniques, we address model selection in a variety of problems such as misspecified linear contextual bandits, linear bandits with unknown dimension, and reinforcement learning with unknown feature maps. Our algorithm requires knowledge of the optimal base regret to adjust the meta-algorithm learning rate. We show that without such prior knowledge, any meta-algorithm can suffer regret larger than the optimal base regret.
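To make the meta-algorithm/base-algorithm abstraction concrete, below is a minimal illustrative sketch (not the paper's algorithm) of an EXP3-style adversarial meta-algorithm that samples among candidate base bandit algorithms and feeds back importance-weighted rewards. All names here (`EpsGreedyBase`, `corral_style_meta`) are hypothetical stand-ins; the paper's actual method additionally applies a smoothing transformation to the base algorithms and tunes the learning rate using the optimal base regret, both of which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)


class EpsGreedyBase:
    """Toy base algorithm: epsilon-greedy over K arms (stand-in for any base bandit)."""

    def __init__(self, n_arms, eps=0.1):
        self.n_arms = n_arms
        self.eps = eps
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if rng.random() < self.eps:
            return int(rng.integers(self.n_arms))
        return int(np.argmax(self.values))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def corral_style_meta(bases, pull_arm, T, eta=None):
    """EXP3-style meta-algorithm choosing among candidate base algorithms.

    Each round: sample a base from an exponential-weights distribution, play the
    arm it proposes, give the observed reward to that base, and update the meta
    weights with an importance-weighted reward estimate.
    """
    M = len(bases)
    # Illustrative learning rate; the paper tunes it using the optimal base regret.
    if eta is None:
        eta = np.sqrt(np.log(M) / (M * T))
    log_w = np.zeros(M)
    total_reward = 0.0
    for _ in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i = rng.choice(M, p=p)            # sample a base algorithm
        arm = bases[i].select_arm()       # base proposes an arm
        r = pull_arm(arm)                 # environment returns a reward in [0, 1]
        bases[i].update(arm, r)           # only the sampled base is updated
        log_w[i] += eta * r / p[i]        # importance-weighted exponential update
        total_reward += r
    return total_reward


# Usage: two bases with different exploration rates on a 5-armed Bernoulli bandit.
means = np.array([0.2, 0.4, 0.5, 0.6, 0.8])
env = lambda a: float(rng.random() < means[a])
bases = [EpsGreedyBase(5, eps=0.05), EpsGreedyBase(5, eps=0.3)]
print("cumulative reward:", corral_style_meta(bases, env, T=5000))
```

The design choice this sketch highlights is the abstraction boundary: the meta-algorithm never inspects the bases' internals, it only samples one per round and observes the resulting reward, which is why the abstraction can accommodate general classes of base algorithms.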