论文标题
索引性不足以容纳惠特:改进,近乎最佳的算法
Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits
论文作者
论文摘要
我们研究了采用多个动作计划不安定的多臂土匪(RMAB)的问题。这是用于多机构系统的流行模型,其应用程序包括多渠道通信,监视和机器维护任务以及医疗保健。基于拉格朗日放松的质指数策略,由于其简单性和在某些条件下的近乎罕见性,因此在这些环境中广泛使用。在这项工作中,我们首先表明Whittle Index策略即使RMAB是可索引的,也可以在简单且实际上相关的RMAB设置中失败。我们讨论为什么最佳保证失败的原因以及为什么渐近最优性可能无法很好地转化为实际相关的计划范围。 然后,我们根据平均场方法提出了一种替代计划算法,该算法可以证明,该算法可以证明,该算法可以通过大量武器获得大量武器的近乎最佳策略,而没有惠特尔指数策略要求的严格结构假设。这使现有研究的想法得到了一些改进:我们的方法是无参数的,我们提供了改进的非症状分析,该分析具有:(a)不需要外源性超参数和对已知问题参数的多项式依赖性; (b)高概率范围,表明政策的奖励是可靠的; (c)相对于武器数量,该算法的次数下限匹配,从而证明了我们边界的紧密度。我们广泛的实验分析表明,平均场方法与其他基准相匹配或优于其他基线。
We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, even when the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.