Paper Title
Robust Training of Neural Networks Using Scale Invariant Architectures
Paper Authors
Paper Abstract
In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises the fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture to make it scale invariant, i.e., the scale of the parameters does not affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm proportionally to the weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is the learning rate and $\lambda$ is the weight decay. We show that this general approach is robust to rescaling of the parameters and loss by proving that its convergence depends only logarithmically on the scale of initialization and loss, whereas standard SGD might not even converge for many initializations. Following our recipe, we design a scale-invariant version of BERT, called SIBERT, which, when trained simply by vanilla SGD, achieves performance on downstream tasks comparable to BERT trained by adaptive methods like Adam.
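Below is a minimal PyTorch-style sketch of steps (2) and (3) of the recipe: an SGD-with-weight-decay update preceded by the optional clipping of the global gradient norm at the weight norm times $\sqrt{2\lambda/\eta}$. The function name `clipped_sgd_step` and its arguments (`model`, `loss`, `lr`, `weight_decay`) are illustrative assumptions, not code from the paper, and step (1) is assumed to already hold for `model` (i.e., it is scale invariant).

```python
import torch


def clipped_sgd_step(model, loss, lr, weight_decay):
    """One SGD step following the recipe sketched in the abstract.

    Assumes `loss` was computed from `model` on the current batch.
    The global gradient norm is clipped at ||w|| * sqrt(2 * weight_decay / lr)
    before a plain SGD-with-weight-decay update.
    """
    model.zero_grad()
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]

    # Global gradient norm and global weight norm over all parameters.
    grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
    weight_norm = torch.norm(torch.stack([p.detach().norm() for p in params]))

    # Optional step (3): clip at ||w|| * sqrt(2 * lambda / eta).
    threshold = weight_norm * (2.0 * weight_decay / lr) ** 0.5
    if grad_norm > threshold:
        scale = threshold / grad_norm
        for p in params:
            p.grad.mul_(scale)

    # Step (2): vanilla SGD with weight decay, p <- p - lr * (grad + lambda * p).
    with torch.no_grad():
        for p in params:
            p.add_(p.grad + weight_decay * p, alpha=-lr)
```

In this sketch the clipping threshold scales with the current weight norm, so the maximum step size tracks the parameter scale; this is the mechanism behind the claimed insensitivity to the scale of initialization and loss. A typical usage would be `clipped_sgd_step(model, criterion(model(batch), labels), lr, weight_decay)` inside the training loop (names hypothetical).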