Paper Title
Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime
Paper Authors
Paper Abstract
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are made for single-pass SGD, which is a less practical variant compared to the commonly used multi-pass SGD. Besides, theoretical analyses of multi-pass SGD often concern a worst-case instance in a class of problems, which may be too pessimistic to explain the superior generalization ability on some particular problem instances. The goal of this paper is to sharply characterize the generalization of multi-pass SGD by developing an instance-dependent excess risk bound for least squares in the interpolation regime, expressed as a function of the iteration number, stepsize, and data covariance. We show that the excess risk of SGD can be exactly decomposed into the excess risk of GD and a positive fluctuation error, suggesting that, instance-wise, SGD always generalizes worse than GD. On the other hand, we show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer stochastic gradient evaluations, and is therefore preferable in terms of computation time.
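To make the GD-vs-SGD comparison concrete, below is a minimal, self-contained Python sketch (our own illustration, not the paper's code or experiments) that runs full-batch GD and multi-pass SGD on a noiseless overparameterized least-squares problem, where the training data can be interpolated. All problem parameters here (n, d, the covariance spectrum, the stepsize, and the pass count) are assumptions chosen for illustration.

```python
# Sketch: multi-pass SGD vs. full-batch GD on noiseless overparameterized
# least squares (interpolation regime). Illustrative setup only; the
# dimensions, spectrum, and stepsize below are our assumptions, not the
# paper's experimental configuration.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                        # n < d: the data can be interpolated
eigs = 1.0 / np.arange(1, d + 1)      # assumed decaying data-covariance spectrum
X = rng.standard_normal((n, d)) * np.sqrt(eigs)  # rows ~ N(0, diag(eigs))
w_star = rng.standard_normal(d)       # ground-truth parameter
y = X @ w_star                        # noiseless labels: zero training risk attainable

def excess_risk(w):
    # Population excess risk E[(x^T (w - w_star))^2] under the diagonal covariance.
    return float(np.sum(eigs * (w - w_star) ** 2))

stepsize, n_passes = 0.05, 500

# Full-batch GD: each iteration evaluates all n per-example gradients.
w_gd = np.zeros(d)
for _ in range(n_passes):
    w_gd -= stepsize * X.T @ (X @ w_gd - y) / n

# Multi-pass SGD: one example per update; a full pass over the data costs
# the same n stochastic-gradient evaluations as a single GD iteration.
w_sgd = np.zeros(d)
for _ in range(n_passes):
    for i in rng.permutation(n):
        w_sgd -= stepsize * (X[i] @ w_sgd - y[i]) * X[i]

print(f"GD  excess risk: {excess_risk(w_gd):.3e}")
print(f"SGD excess risk: {excess_risk(w_sgd):.3e}")
```

Varying n_passes or the stepsize in this sketch traces out each method's excess risk as a function of the number of stochastic-gradient evaluations, which is the cost measure under which the abstract argues SGD is preferable despite its additional fluctuation error.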