Paper Title

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

Paper Authors

David Peer, Bart Keulen, Sebastian Stabinger, Justus Piater, Antonio Rodríguez-Sánchez

Paper Abstract

Training deep neural networks is a very demanding task, especially challenging is how to adapt architectures to improve the performance of trained models. We can find that sometimes, shallow networks generalize better than deep networks, and the addition of more layers results in higher training and test errors. The deep residual learning framework addresses this degradation problem by adding skip connections to several neural network layers. It would at first seem counter-intuitive that such skip connections are needed to train deep networks successfully as the expressivity of a network would grow exponentially with depth. In this paper, we first analyze the flow of information through neural networks. We introduce and evaluate the batch-entropy which quantifies the flow of information through each layer of a neural network. We prove empirically and theoretically that a positive batch-entropy is required for gradient descent-based training approaches to optimize a given loss function successfully. Based on those insights, we introduce batch-entropy regularization to enable gradient descent-based training algorithms to optimize the flow of information through each hidden layer individually. With batch-entropy regularization, gradient descent optimizers can transform untrainable networks into trainable networks. We show empirically that we can therefore train a "vanilla" fully connected network and convolutional neural network -- no skip connections, batch normalization, dropout, or any other architectural tweak -- with 500 layers by simply adding the batch-entropy regularization term to the loss function. The effect of batch-entropy regularization is not only evaluated on vanilla neural networks, but also on residual networks, autoencoders, and also transformer models over a wide range of computer vision as well as natural language processing tasks.
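To make the core idea of the abstract concrete, here is a rough, non-authoritative sketch of a batch-entropy-style regularizer. It assumes that the batch-entropy of a layer is approximated per neuron by the Gaussian differential entropy of its activations across the batch dimension, and that layers whose estimated entropy falls below a target are penalized; the names `batch_entropy` and `RegularizedMLP`, the target value, and the weight `alpha` are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a batch-entropy-style regularizer (PyTorch).
# Assumption: batch-entropy is approximated per neuron by the Gaussian
# differential entropy 0.5 * log(2*pi*e * var), with the variance taken
# across the batch dimension and the result averaged over neurons.
import math
import torch
import torch.nn as nn


def batch_entropy(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Estimate the entropy of activations `x` (batch, features) over the batch."""
    var = x.var(dim=0, unbiased=False) + eps           # per-neuron variance across the batch
    h = 0.5 * torch.log(2 * math.pi * math.e * var)    # Gaussian entropy estimate per neuron
    return h.mean()                                    # average over neurons


class RegularizedMLP(nn.Module):
    """Plain fully connected network that records the batch-entropy of every hidden layer."""

    def __init__(self, width: int = 128, depth: int = 20, num_classes: int = 10):
        super().__init__()
        self.hidden = nn.ModuleList(
            [nn.Linear(width if i > 0 else 784, width) for i in range(depth)]
        )
        self.head = nn.Linear(width, num_classes)

    def forward(self, x):
        entropies = []
        for layer in self.hidden:
            x = torch.relu(layer(x))
            entropies.append(batch_entropy(x))
        return self.head(x), torch.stack(entropies)


# Usage: keep each layer's batch-entropy above an (assumed) target by adding a penalty
# term to the task loss, so the optimizer also maintains information flow per layer.
model = RegularizedMLP()
criterion = nn.CrossEntropyLoss()
alpha, target = 0.1, 0.5                               # illustrative hyperparameters
inputs = torch.randn(64, 784)
labels = torch.randint(0, 10, (64,))
logits, entropies = model(inputs)
penalty = torch.relu(target - entropies).mean()        # only penalize layers below the target
loss = criterion(logits, labels) + alpha * penalty
loss.backward()
```

The design choice of penalizing only layers whose estimated entropy drops below a threshold (rather than maximizing entropy everywhere) reflects the abstract's claim that a positive batch-entropy is what gradient descent needs; the exact per-layer formulation and weighting used in the paper may differ.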
