Unified Normalization for Accelerating and Stabilizing Transformers

Paper title

Unified Normalization for Accelerating and Stabilizing Transformers

Authors

Qiming Yang, Kai Zhang, Chaoxiang Lan, Zhi Yang, Zheyang Li, Wenming Tan, Jun Xiao, Shiliang Pu

Abstract

Solid results from Transformers have made them the prevailing architectures in various natural language and vision tasks. As a default component of Transformers, Layer Normalization (LN) normalizes the activations within each token to improve robustness. However, LN must compute statistics on the fly at inference time and relies on division and square-root operations, which makes it inefficient on hardware. Moreover, replacing LN with more hardware-efficient normalization schemes (e.g., Batch Normalization) yields inferior performance and can even cause training to collapse. We find that this dilemma is caused by abnormal behavior of the activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which speeds up inference by being fused with other linear operations while achieving performance on par with LN. UN boosts performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training; its effectiveness is theoretically proved and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN through extensive experiments on language and vision tasks. We also evaluate the efficiency of our method on GPU: Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at https://github.com/hikvision-research/Unified-Normalization.
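The abstract's claim that a normalization layer can be "fused with other linear operations" can be illustrated with a small sketch. It is not the authors' UN implementation. It only shows the general idea, under the assumption that the normalization uses fixed per-feature statistics at inference time (as Batch Normalization does), so that it reduces to an affine map that folds into the following linear layer. The function names `fixed_stat_norm` and `fold_norm_into_linear`, the parameter names, and the `eps` value are all hypothetical choices made for this illustration.

```python
import numpy as np


def fixed_stat_norm(x, mean, var, gamma, beta, eps=1e-5):
    """Normalize with fixed (precomputed) per-feature statistics.

    Assumption for this sketch: the statistics do not depend on the
    input at inference time, unlike LN, which recomputes them per token.
    """
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


def fold_norm_into_linear(mean, var, gamma, beta, W, b, eps=1e-5):
    """Fold the normalization above into a following linear layer.

    The original computation is  y = norm(x) @ W.T + b  with W of shape
    (d_out, d_in). Because norm(x) = scale * x + shift is affine, the
    pair collapses into one linear layer with new weights and bias.
    """
    scale = gamma / np.sqrt(var + eps)   # per-feature scale, shape (d_in,)
    shift = beta - mean * scale          # per-feature shift, shape (d_in,)
    W_fused = W * scale                  # scales column j of W by scale[j]
    b_fused = W @ shift + b              # absorbs the shift into the bias
    return W_fused, b_fused


# Usage: the fused layer reproduces norm + linear with no division or
# square root left in the inference path.
rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=(3, d_in))
mean, var = rng.normal(size=d_in), rng.uniform(0.5, 2.0, size=d_in)
gamma, beta = rng.normal(size=d_in), rng.normal(size=d_in)
W, b = rng.normal(size=(d_out, d_in)), rng.normal(size=d_out)

reference = fixed_stat_norm(x, mean, var, gamma, beta) @ W.T + b
W_fused, b_fused = fold_norm_into_linear(mean, var, gamma, beta, W, b)
fused = x @ W_fused.T + b_fused
print(np.allclose(reference, fused))
```

The same folding is not available for LN, because its mean and variance are computed from each token at inference time and therefore cannot be absorbed into constant weights.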
