Paper Title
Exploring Extreme Parameter Compression for Pre-trained Language Models
Paper Authors
Paper Abstract
Recent work has explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs and carbon emissions. Compressing PLMs like BERT with negligible performance loss for faster inference and cheaper deployment has attracted much attention. In this work, we aim to explore larger compression ratios for PLMs, for which tensor decomposition is a potential but under-investigated approach. Two decomposition and reconstruction protocols are further proposed to improve effectiveness and efficiency during compression. Our compressed BERT, with $1/7$ of the parameters in the Transformer layers, performs on par with, and sometimes slightly better than, the original BERT on the GLUE benchmark. A tiny version achieves $96.7\%$ of the performance of BERT-base with $1/48$ of the encoder parameters (i.e., fewer than 2M parameters excluding the embedding layer) and is $2.7\times$ faster at inference. To show that the proposed method is orthogonal to existing compression methods such as knowledge distillation, we also explore the benefit of the proposed method on a distilled BERT.
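The abstract does not spell out the decomposition and reconstruction protocols, so the following is only a minimal sketch of the general idea behind decomposition-based parameter compression: factor a large Transformer weight matrix into two smaller factors via truncated SVD and compare parameter counts. The shapes, the rank choice, and the function name `low_rank_compress` are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: truncated-SVD low-rank factorization of one
# Transformer weight matrix. The paper's actual decomposition/reconstruction
# protocols (and any cross-layer treatment) are not reproduced here.
import numpy as np

def low_rank_compress(W: np.ndarray, rank: int):
    """Factor W (m x n) into A (m x rank) @ B (rank x n) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # absorb singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# A BERT-base-like feed-forward weight: 768 x 3072 (hypothetical example).
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072)).astype(np.float32)

rank = 64                         # hypothetical rank; controls the trade-off
A, B = low_rank_compress(W, rank)

original = W.size                 # 768 * 3072 parameters
compressed = A.size + B.size      # (768 + 3072) * rank parameters
print(f"compression ratio ~ 1/{original / compressed:.1f}")
print("relative reconstruction error:",
      np.linalg.norm(W - A @ B) / np.linalg.norm(W))
```

In practice, the achievable ratio depends on which weights are decomposed, the chosen rank, and how the factors are shared or reconstructed; the $1/7$ and $1/48$ figures reported in the abstract refer to the paper's own protocols, not to this sketch.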