Title
Sequential Batch Learning in Finite-Action Linear Contextual Bandits
Authors
Abstract
We study the sequential batch learning problem in linear contextual bandits with finite action sets, where the decision maker is constrained to split incoming individuals into (at most) a fixed number of batches and can only observe outcomes for the individuals within a batch at the batch's end. Compared to both standard online contextual bandit learning and offline policy learning in contextual bandits, this sequential batch learning problem provides a finer-grained formulation of many personalized sequential decision-making problems in practical applications, including medical treatment in clinical trials, product recommendation in e-commerce, and adaptive experiment design in crowdsourcing. We study two settings of the problem: one where the contexts are arbitrarily generated and the other where the contexts are drawn \textit{iid} from some distribution. In each setting, we establish a regret lower bound and provide an algorithm whose regret upper bound nearly matches the lower bound. As an important insight revealed therefrom, in the former setting we show that the number of batches required to achieve the fully online performance is polynomial in the time horizon, while in the latter setting a pure-exploitation algorithm with a judicious batch partition scheme achieves the fully online performance even when the number of batches is less than logarithmic in the time horizon. Together, our results provide a near-complete characterization of sequential decision making in linear contextual bandits when batch constraints are present.
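To make the batch constraint concrete, the following is a minimal simulation sketch of the setting described above: the horizon is split into a fixed number of batches, actions within a batch are chosen by pure exploitation of a ridge estimate frozen at the batch boundary, and rewards are revealed only when the batch ends. The geometric batch grid, the Gaussian context model, the noise level, and all function names here are illustrative assumptions, not the paper's exact scheme or notation.

```python
import numpy as np

def geometric_batch_grid(T, M):
    """Batch endpoints t_1 < ... < t_M = T growing geometrically in t.
    An illustrative partition choice; the paper's judicious scheme may differ."""
    a = T ** (1.0 / M)
    grid, prev = [], 0
    for m in range(1, M + 1):
        end = min(max(int(round(a ** m)), prev + 1), T)
        grid.append(end)
        prev = end
    grid[-1] = T
    return grid

def sequential_batch_greedy(T, M, d, K, theta, rng, lam=1.0, noise=0.1):
    """Pure-exploitation play under the batch constraint: the ridge estimate
    is updated only at batch ends, when that batch's rewards are observed.
    Returns cumulative regret against the best action per round."""
    grid = geometric_batch_grid(T, M)
    V = lam * np.eye(d)        # regularized Gram matrix of observed contexts
    b = np.zeros(d)            # sum of x * r over completed batches
    pending = []               # (chosen context, reward) hidden until batch end
    regret, start = 0.0, 0
    for end in grid:
        theta_hat = np.linalg.solve(V, b)  # estimate frozen for this batch
        for _ in range(start, end):
            X = rng.standard_normal((K, d)) / np.sqrt(d)  # iid contexts
            arm = int(np.argmax(X @ theta_hat))           # greedy action
            r = X[arm] @ theta + noise * rng.standard_normal()
            regret += np.max(X @ theta) - X[arm] @ theta
            pending.append((X[arm], r))
        for x, r in pending:   # batch ends: outcomes revealed, update stats
            V += np.outer(x, x)
            b += r * x
        pending, start = [], end
    return regret
```

In this sketch the per-batch policy never explores; any learning comes purely from the contexts the greedy rule happens to select, which is the flavor of the pure-exploitation result in the iid setting.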