Paper Title
Delayed Feedback in Generalised Linear Bandits Revisited
Paper Authors
Paper Abstract
The stochastic generalised linear bandit is a well-understood model for sequential decision-making problems, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, the stringent requirement for immediate rewards is unmet in many real-world applications where the reward is almost always delayed. We study the phenomenon of delayed rewards in generalised linear bandits in a theoretical manner. We show that a natural adaptation of an optimistic algorithm to the delayed feedback achieves a regret bound where the penalty for the delays is independent of the horizon. This result significantly improves upon existing work, where the best known regret bound has the delay penalty increasing with the horizon. We verify our theoretical results through experiments on simulated data.
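To make the delayed-feedback setting concrete, the following is a minimal sketch of the bookkeeping a learner needs when rewards arrive late. It is not the paper's algorithm: the class `DelayedFeedbackBuffer`, its method names, and the convention that a reward generated in round s with delay d becomes observable in round s + d are all assumptions made for illustration. An optimistic algorithm adapted to delays would update its estimates only from the observations this buffer has released.

```python
import heapq


class DelayedFeedbackBuffer:
    """Hypothetical helper: holds rewards until their delay has elapsed."""

    def __init__(self):
        # Min-heap of (arrival_round, play_round, action, reward).
        self._pending = []

    def push(self, play_round, action, reward, delay):
        # Assumed convention: the reward becomes observable in round
        # play_round + delay.
        heapq.heappush(
            self._pending, (play_round + delay, play_round, action, reward)
        )

    def pop_arrived(self, current_round):
        # Release every observation whose arrival round has been reached.
        arrived = []
        while self._pending and self._pending[0][0] <= current_round:
            _, play_round, action, reward = heapq.heappop(self._pending)
            arrived.append((play_round, action, reward))
        return arrived

    def __len__(self):
        # Number of rewards still waiting to be observed.
        return len(self._pending)


buffer = DelayedFeedbackBuffer()
buffer.push(play_round=1, action=0, reward=1.0, delay=3)  # observable in round 4
buffer.push(play_round=2, action=1, reward=0.0, delay=1)  # observable in round 3
arrived_by_3 = buffer.pop_arrived(current_round=3)
arrived_by_4 = buffer.pop_arrived(current_round=4)
```

In this toy run the reward from round 2 is released before the reward from round 1, which shows that observations can arrive out of order under delays.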