Paper Title

Understanding the Failure of Batch Normalization for Transformers in NLP

Paper Authors

Jiaxi Wang, Ji Wu, Lei Huang

Paper Abstract

Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we try to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between training and inference of BN is the leading cause of BN's failure in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and reveal that TID can indicate BN's performance, supported by extensive experiments, including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID stays small throughout training. To suppress the explosion of TID, we propose Regularized BN (RBN), which adds a simple regularization term to narrow the gap between the batch statistics and population statistics of BN. RBN consistently improves the performance of BN and outperforms or is on par with LN on 17 out of 20 settings, involving ten datasets and two common variants of the Transformer. Our code is available at https://github.com/wjxts/RegularizedBN.
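As a rough illustration of the idea summarized above, the sketch below shows how a BatchNorm layer could add a penalty on the gap between its batch statistics and its running (population) statistics. This is a minimal PyTorch sketch under our own assumptions: the class name RegularizedBN1d, the hyperparameter reg_weight, and the squared-error form of the penalty are illustrative choices, not the paper's exact formulation; see the authors' repository at https://github.com/wjxts/RegularizedBN for the actual implementation.

```python
import torch
import torch.nn as nn

class RegularizedBN1d(nn.BatchNorm1d):
    """Minimal sketch: BatchNorm1d plus a penalty that pulls batch statistics
    toward the running (population) statistics. Illustrative only; the exact
    regularizer used in the paper may differ."""

    def __init__(self, num_features, reg_weight=0.1, **kwargs):
        super().__init__(num_features, **kwargs)
        self.reg_weight = reg_weight        # hypothetical penalty strength
        self.reg_loss = torch.tensor(0.0)   # read by the training loop

    def forward(self, x):
        if self.training and self.track_running_stats:
            # Batch statistics over (N,) for 2-D input or (N, L) for 3-D input.
            dims = [0] if x.dim() == 2 else [0, 2]
            batch_mean = x.mean(dim=dims)
            batch_var = x.var(dim=dims, unbiased=False)
            # Squared gap between batch and population statistics
            # (running_mean / running_var are non-trainable buffers).
            self.reg_loss = self.reg_weight * (
                (batch_mean - self.running_mean).pow(2).mean()
                + (batch_var - self.running_var).pow(2).mean()
            )
        return super().forward(x)

# Usage sketch: add the per-layer penalties to the task loss during training.
# loss = task_loss + sum(m.reg_loss for m in model.modules()
#                        if isinstance(m, RegularizedBN1d))
```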
