Paper Title
Bias no more: high-probability data-dependent regret bounds for adversarial bandits and MDPs
Paper Authors
Paper Abstract
We develop a new approach to obtaining high probability regret bounds for online learning with bandit feedback against an adaptive adversary. While existing approaches all require carefully constructing optimistic and biased loss estimators, our approach uses standard unbiased estimators and relies on a simple increasing learning rate schedule, together with the help of logarithmically homogeneous self-concordant barriers and a strengthened Freedman's inequality. Besides its simplicity, our approach enjoys several advantages. First, the obtained high-probability regret bounds are data-dependent and could be much smaller than the worst-case bounds, which resolves an open problem asked by Neu (2015). Second, resolving another open problem of Bartlett et al. (2008) and Abernethy and Rakhlin (2009), our approach leads to the first general and efficient algorithm with a high-probability regret bound for adversarial linear bandits, while previous methods are either inefficient or only applicable to specific action sets. Finally, our approach can also be applied to learning adversarial Markov Decision Processes and provides the first algorithm with a high-probability small-loss bound for this problem.
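For concreteness, here is a minimal LaTeX sketch, in generic notation not taken from the paper, of the standard unbiased importance-weighted loss estimator the abstract refers to, together with the typical shape of a data-dependent (small-loss) regret bound:

% Standard unbiased importance-weighted loss estimator for a K-armed bandit
% (textbook construction; the paper's exact algorithm may differ).
% At round t the learner samples arm a_t from distribution p_t and observes only \ell_t(a_t).
\[
  \hat{\ell}_t(a) \;=\; \frac{\ell_t(a)\,\mathbf{1}\{a_t = a\}}{p_t(a)},
  \qquad
  \mathbb{E}_{a_t \sim p_t}\!\left[\hat{\ell}_t(a)\right] \;=\; \ell_t(a).
\]
% A data-dependent "small-loss" bound is typically of the form
\[
  \mathrm{Reg}_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;-\; \min_{a}\sum_{t=1}^{T} \ell_t(a)
  \;=\; \tilde{\mathcal{O}}\!\left(\sqrt{K L^*}\right),
  \qquad
  L^* \;=\; \min_{a}\sum_{t=1}^{T} \ell_t(a),
\]
% which can be much smaller than the worst-case \tilde{O}(\sqrt{KT}) rate whenever
% the best arm's cumulative loss L^* is small.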