Paper Title


Contexts can be Cheap: Solving Stochastic Contextual Bandits with Linear Bandit Algorithms

Authors

Osama A. Hanna, Lin F. Yang, Christina Fragouli

Abstract


In this paper, we address the stochastic contextual linear bandit problem, where a decision maker is provided a context (a random set of actions drawn from a distribution). The expected reward of each action is specified by the inner product of the action and an unknown parameter. The goal is to design an algorithm that learns to play as close as possible to the unknown optimal policy after a number of action plays. This problem is considered more challenging than the linear bandit problem, which can be viewed as a contextual bandit problem with a \emph{fixed} context. Surprisingly, in this paper, we show that the stochastic contextual problem can be solved as if it were a linear bandit problem. In particular, we establish a novel reduction framework that converts every stochastic contextual linear bandit instance to a linear bandit instance, when the context distribution is known. When the context distribution is unknown, we establish an algorithm that reduces the stochastic contextual instance to a sequence of linear bandit instances with small misspecifications and achieves nearly the same worst-case regret bound as the algorithm that solves the misspecified linear bandit instances. As a consequence, our results imply an $O(d\sqrt{T\log T})$ high-probability regret bound for contextual linear bandits, making progress toward resolving an open problem of (Li et al., 2019) and (Li et al., 2021). Our reduction framework opens up a new way to approach stochastic contextual linear bandit problems, and enables improved regret bounds in a number of instances, including the batch setting, contextual bandits with misspecifications, contextual bandits with sparse unknown parameters, and contextual bandits with adversarial corruption.
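To make the known-distribution reduction concrete, here is a minimal, hypothetical sketch (not the authors' implementation). It assumes an illustrative context distribution (a few random unit vectors in $\mathbb{R}^d$) and illustrates one natural form such a reduction can take: map each candidate parameter $\theta$ to the Monte-Carlo estimate of $g(\theta) = \mathbb{E}_C[\arg\max_{a \in C} \langle a, \theta \rangle]$, the expected action of the greedy policy for $\theta$, so that $\langle g(\theta), \theta^* \rangle$ is that policy's expected reward in an ordinary linear bandit sense. The function names and distribution are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_context(rng, k=5, d=3):
    """Hypothetical context distribution for illustration:
    k actions drawn uniformly from the unit sphere in R^d."""
    a = rng.normal(size=(k, d))
    return a / np.linalg.norm(a, axis=1, keepdims=True)

def g(theta, n_samples=2000, rng=rng):
    """Monte-Carlo estimate of g(theta) = E_C[ argmax_{a in C} <a, theta> ],
    the expected action played by the greedy policy for parameter theta.
    Replacing the contextual 'arm' theta by the single fixed vector g(theta)
    turns the contextual problem into a (context-free) linear bandit."""
    d = len(theta)
    total = np.zeros(d)
    for _ in range(n_samples):
        ctx = sample_context(rng, d=d)
        total += ctx[np.argmax(ctx @ theta)]  # greedy action for this context
    return total / n_samples

theta_star = np.array([1.0, 0.0, 0.0])  # assumed true parameter
# Expected reward of the greedy policy for theta_star, expressed as a
# linear bandit reward <g(theta), theta_star>:
print(g(theta_star) @ theta_star)
```

A learner can then run any linear bandit algorithm over the induced arms $\{g(\theta)\}$; pulling arm $g(\theta)$ corresponds to playing greedily with respect to $\theta$ for one round, so regret guarantees transfer.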
