Paper title
Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms
Paper authors
Paper abstract
We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish an "uncertainty principle" of strongly convex optimization: for two-step momentum methods with a linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models.
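As a concrete illustration of the setup the abstract describes, the following minimal Python sketch (not the authors' code) simulates a two-step momentum method with additive gradient noise on a strongly convex quadratic and estimates the steady-state variance of the error by time-averaging. The diagonal quadratic, the noise magnitude sigma, and the parameter names alpha, beta, gamma are illustrative assumptions; in the common parameterization of this family, gamma = 0 recovers Polyak's heavy-ball method and gamma = beta recovers Nesterov's accelerated method.

# Minimal sketch: two-step momentum iteration with gradient noise on a
# strongly convex quadratic f(x) = sum_i q_i x_i^2 / 2 (minimizer x* = 0).
# All parameter choices below are illustrative assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

# Eigenvalues of the quadratic span [m, L]; condition number kappa = L/m.
m, L = 1.0, 100.0
q = np.linspace(m, L, 50)          # diagonal of the Hessian

def grad(x):
    return q * x                   # gradient of the diagonal quadratic

# Standard heavy-ball parameters for quadratics with known m and L.
kappa = L / m
alpha = 4.0 / (np.sqrt(L) + np.sqrt(m)) ** 2
beta = ((np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)) ** 2
gamma = 0.0                        # 0 = heavy-ball, beta = Nesterov
sigma = 1e-2                       # assumed noise magnitude

x_prev = rng.standard_normal(q.size)
x_curr = x_prev.copy()

burn_in, steps = 20_000, 100_000
sq_err = 0.0
for t in range(burn_in + steps):
    y = x_curr + gamma * (x_curr - x_prev)
    noise = sigma * rng.standard_normal(q.size)   # gradient noise model
    x_next = x_curr + beta * (x_curr - x_prev) - alpha * (grad(y) + noise)
    x_prev, x_curr = x_curr, x_next
    if t >= burn_in:
        sq_err += np.sum(x_curr ** 2)             # error relative to x* = 0

print("empirical steady-state variance:", sq_err / steps)

Moving the noise term out of the gradient and adding it directly to the iterate update would give the iterate noise model that the abstract contrasts with the gradient noise model.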