Paper Title
AdaS: Adaptive Scheduling of Stochastic Gradients
Authors
Abstract
The step-size used in Stochastic Gradient Descent (SGD) optimization is selected empirically in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to tune the step-size requires extensive practical experience, offers limited insight into how the parameters update, and is not consistent across applications. This work attempts to answer a question of interest to both researchers and practitioners, namely \textit{"how much knowledge is gained in iterative training of deep neural networks?"} Answering this question introduces two useful metrics derived from the singular values of the low-rank factorization of convolution layers in deep neural networks. We introduce the notions of \textit{"knowledge gain"} and \textit{"mapping condition"} and propose a new algorithm called Adaptive Scheduling (AdaS) that utilizes these derived metrics to adapt the SGD learning rate proportionally to the rate of change in knowledge gain over successive iterations. Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) a lack of dependence on a validation set to determine when to stop training. Code is available at \url{https://github.com/mahdihosseini/AdaS}.
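To make the abstract's idea concrete, below is a minimal illustrative sketch, not the paper's exact formulation: it computes a singular-value-based, low-rank proxy for "knowledge gain" from a convolution layer's unfolded weight tensor and scales the SGD step-size by the change in that proxy across epochs. The function names (`knowledge_gain`, `adapted_lr`) and the parameters `energy` and `beta` are hypothetical; the actual AdaS metrics and update rule are defined in the paper and the linked repository.

```python
# Illustrative sketch only: a singular-value-based proxy for "knowledge gain"
# and a step-size update driven by its change across epochs. Not the authors'
# exact AdaS formulation.
import numpy as np

def knowledge_gain(weight, energy=0.9):
    """Proxy metric: fraction of singular values needed to capture `energy`
    of the spectral mass of the unfolded convolution kernel (hypothetical)."""
    out_ch = weight.shape[0]
    mat = weight.reshape(out_ch, -1)          # unfold to (out_channels, rest)
    s = np.linalg.svd(mat, compute_uv=False)  # singular values, descending
    cum = np.cumsum(s) / s.sum()
    rank = np.searchsorted(cum, energy) + 1   # effective (low) rank
    return rank / min(mat.shape)              # normalize to [0, 1]

def adapted_lr(base_lr, gain_now, gain_prev, beta=0.8):
    """Scale the step-size by the (non-negative) change in the metric,
    smoothed by a momentum-like factor `beta` (hypothetical)."""
    delta = max(gain_now - gain_prev, 0.0)
    return beta * base_lr + (1.0 - beta) * base_lr * delta

# Toy usage on a random 3x3 conv kernel with 64 output / 32 input channels.
w_prev = np.random.randn(64, 32, 3, 3)
w_now = w_prev + 0.01 * np.random.randn(64, 32, 3, 3)
lr = adapted_lr(0.1, knowledge_gain(w_now), knowledge_gain(w_prev))
print(f"adapted learning rate: {lr:.4f}")
```

The sketch uses plain NumPy so it runs standalone; in practice such a metric would be evaluated on the network's trained convolution weights at the end of each epoch, per layer.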