Paper Title

From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent

Authors

Satyen Kale, Jason D. Lee, Chris De Sa, Ayush Sekhari, Karthik Sridharan

Abstract

Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. The overarching theme of our paper is to provide general conditions under which SGD converges, assuming that GF on the population loss converges. Our main tool for establishing this connection is a general converse-Lyapunov-like theorem, which implies the existence of a Lyapunov potential under mild assumptions on the rate of convergence of GF. In fact, using these potentials, we show a one-to-one correspondence between rates of convergence of GF and geometric properties of the underlying objective. When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide convergence guarantees for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It turns out that these self-bounding assumptions are, in a sense, also necessary for GD/SGD to work. Using our framework, we provide a unified analysis of GD/SGD not only for classical settings like convex losses or objectives satisfying PL/KL properties, but also for more complex problems including phase retrieval and matrix square root, extending the results of the recent work of Chatterjee (2022).
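To make the GF-versus-SGD contrast concrete, the following toy sketch (not the paper's construction; the objective, step size, and noise model below are illustrative choices of ours) runs SGD on a simple strongly convex loss f(w) = w², which satisfies the PL inequality, so f itself serves as a Lyapunov potential that decreases in expectation along the SGD trajectory:

```python
import random

def grad_population(w, a=2.0):
    # Gradient of the population loss f(w) = (a/2) * w^2,
    # a strongly convex (hence PL) objective.
    return a * w

def grad_stochastic(w, a=2.0, noise=0.5):
    # Unbiased stochastic gradient: population gradient plus zero-mean noise,
    # mimicking a single-sample gradient estimate.
    return grad_population(w, a) + random.gauss(0.0, noise)

def sgd(w0, lr, steps, seed=0):
    # Plain SGD with constant step size lr; returns the final iterate.
    random.seed(seed)
    w = w0
    for _ in range(steps):
        w -= lr * grad_stochastic(w)
    return w

w_final = sgd(w0=5.0, lr=0.1, steps=2000)
# The iterate concentrates near the minimizer w* = 0 (up to noise-level
# fluctuations), even though individual SGD paths wander far from the GF path.
```

Here convergence follows because the potential contracts at each step in expectation; the paper's point is that such a potential can be extracted from a GF convergence rate alone, under self-bounding conditions, rather than assumed for a specific objective as in this sketch.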
