Paper Title
PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks
Paper Authors
Paper Abstract
Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional to the sum of squared weights. This paper argues that stochastic gradient descent (SGD) may be an inefficient algorithm for this objective. For neural networks with ReLU activations, solutions to the weight decay objective are equivalent to those of a different objective in which the regularization term is instead a sum of products of $\ell_2$ (not squared) norms of the input and output weights associated with each ReLU neuron. This alternative (and effectively equivalent) regularization suggests a novel proximal gradient algorithm for network training. Theory and experiments support the new training approach, showing that it can converge much faster to the sparse solutions it shares with standard weight decay training.
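
For concreteness, the equivalence the abstract refers to can be written out for a two-layer ReLU network $f(x) = \sum_k v_k \, \mathrm{relu}(w_k^\top x)$. Because $\mathrm{relu}$ is positively homogeneous, the rescaling $(w_k, v_k) \mapsto (\alpha w_k, v_k / \alpha)$ for any $\alpha > 0$ leaves $f$ unchanged, and by the AM-GM inequality

$$\frac{\lambda}{2} \sum_k \left( \|w_k\|_2^2 + v_k^2 \right) \;\geq\; \lambda \sum_k \|w_k\|_2 \, |v_k|,$$

with equality at the balanced scaling $\|w_k\|_2 = |v_k|$. Minimizing over the free rescaling therefore shows the two regularized objectives share the same solutions.

Below is a minimal PyTorch sketch of a proximal gradient loop in this spirit: a gradient step on the data loss alone, followed by a prox step on the regularizer. As a simplification, it applies group soft-thresholding independently to each neuron's input and output weight groups (the prox of a sum of group norms), rather than the paper's joint prox for the product-of-norms penalty; all names, shapes, and hyperparameters here are illustrative assumptions, not the paper's implementation.

import torch

def group_soft_threshold(u, thresh):
    # Prox of thresh * ||u||_2: shrink the whole group toward zero,
    # zeroing it entirely once its norm falls below thresh.
    norm = u.norm()
    return torch.clamp(1.0 - thresh / (norm + 1e-12), min=0.0) * u

# Toy two-layer ReLU network f(x) = W_out @ relu(W_in @ x) on random data
# (a hypothetical setup for illustration only).
torch.manual_seed(0)
X, y = torch.randn(64, 3), torch.randn(64, 1)
W_in = torch.randn(5, 3, requires_grad=True)   # rows are input weights w_k
W_out = torch.randn(1, 5, requires_grad=True)  # columns are output weights v_k
lr, lam = 0.05, 0.1

for _ in range(200):
    loss = ((torch.relu(X @ W_in.T) @ W_out.T - y) ** 2).mean()
    g_in, g_out = torch.autograd.grad(loss, (W_in, W_out))
    with torch.no_grad():
        W_in -= lr * g_in                       # gradient step: data loss only
        W_out -= lr * g_out
        for k in range(W_in.shape[0]):          # prox step: one group per neuron
            W_in[k] = group_soft_threshold(W_in[k], lr * lam)
            W_out[:, k] = group_soft_threshold(W_out[:, k], lr * lam)

The exact zeroing behavior of the group prox helps explain the abstract's claim: a proximal method can land on sparse, neuron-pruned solutions directly, whereas SGD on the squared-weight-decay objective only shrinks weights gradually toward them.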