Paper Title

Efficiency Ordering of Stochastic Gradient Descent

Paper Authors

Jie Hu, Vishwaraj Doshi, Do Young Eun

Abstract

We consider the stochastic gradient descent (SGD) algorithm driven by a general stochastic sequence, including i.i.d. noise and random walk on an arbitrary graph, among others, and analyze it in the asymptotic sense. Specifically, we employ the notion of 'efficiency ordering', a well-analyzed tool for comparing the performance of Markov Chain Monte Carlo (MCMC) samplers, for SGD algorithms in the form of the Loewner ordering of covariance matrices associated with the scaled iterate errors in the long term. Using this ordering, we show that input sequences that are more efficient for MCMC sampling also lead to smaller covariance of the errors for SGD algorithms in the limit. This also suggests that an arbitrarily weighted MSE of SGD iterates in the limit becomes smaller when driven by more efficient chains. Our finding is of particular interest in applications such as decentralized optimization and swarm learning, where SGD is implemented in a random walk fashion on the underlying communication graph for cost issues and/or data privacy. We demonstrate how certain non-Markovian processes, for which typical mixing-time-based non-asymptotic bounds are intractable, can outperform their Markovian counterparts in the sense of efficiency ordering for SGD. We show the utility of our method by applying it to gradient descent with shuffling and mini-batch gradient descent, reaffirming key results from existing literature under a unified framework. Empirically, we also observe efficiency ordering for variants of SGD such as accelerated SGD and Adam, opening up the possibility of extending our notion of efficiency ordering to a broader family of stochastic optimization algorithms.
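
For readers unfamiliar with the term, the 'efficiency ordering' invoked in the abstract can be written compactly as a short math sketch. The notation here (V_X, V_Y for the limiting covariance matrices of the scaled iterate errors under input sequences X and Y, and w for an arbitrary weight vector) is introduced for illustration and is not taken from the paper:

    % Efficiency ordering via the Loewner order of asymptotic covariances:
    % X is at least as efficient as Y when V_Y - V_X is positive semidefinite.
    \[
        X \succeq_{\mathrm{eff}} Y
        \;\iff\;
        V_X \preceq V_Y
        \;\iff\;
        w^{\top} V_X\, w \;\le\; w^{\top} V_Y\, w
        \quad \text{for all } w \in \mathbb{R}^{d}.
    \]

The rightmost characterization is what connects the ordering to the abstract's MSE claim: since any weighted mean squared error of the limiting iterates is a quadratic form w^T V w, a more efficient input sequence yields a weighted MSE that is no larger for every choice of weights.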
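The 'random walk fashion' of running SGD on a communication graph, as in the decentralized-optimization setting the abstract mentions, can be illustrated with a minimal Python sketch. The ring graph, quadratic local losses, and uniform neighbor selection below are toy assumptions chosen for illustration, not the paper's actual setup:

    import numpy as np

    # Minimal sketch of random-walk-driven SGD on a communication graph.
    # Toy assumptions: a 10-node ring graph, one data point per node,
    # quadratic local losses f_i(theta) = 0.5 * ||theta - data[i]||^2,
    # and a simple random walk picking the next node uniformly at random
    # among the current node's neighbors.

    rng = np.random.default_rng(0)

    n_nodes, dim = 10, 3
    neighbors = {i: [(i - 1) % n_nodes, (i + 1) % n_nodes]
                 for i in range(n_nodes)}            # ring graph
    data = rng.normal(size=(n_nodes, dim))           # local data at each node

    def local_grad(theta, i):
        """Gradient of node i's local quadratic loss."""
        return theta - data[i]

    theta = np.zeros(dim)
    node = 0
    for k in range(1, 100_001):
        step = 1.0 / k                               # decreasing step size
        theta -= step * local_grad(theta, node)      # update with visited node's gradient
        node = rng.choice(neighbors[node])           # walker moves to a uniform neighbor

    print("final iterate:   ", theta)
    print("global minimizer:", data.mean(axis=0))

Because the simple random walk on a regular graph has a uniform stationary distribution, the walker samples every node's gradient equally often in the long run, and the iterate converges to the minimizer of the average loss; the paper's efficiency ordering then compares how quickly different driving sequences shrink the fluctuation around that limit.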
