An Automatic and Efficient BERT Pruning for Edge AI Systems

Paper Title

An Automatic and Efficient BERT Pruning for Edge AI Systems

Authors

Shaoyi Huang, Ning Liu, Yueying Liang, Hongwu Peng, Hongjia Li, Dongkuan Xu, Mimi Xie, Caiwen Ding

Abstract

With the yearning for deep learning democratization, there is increasing demand to implement Transformer-based natural language processing (NLP) models on resource-constrained devices with low latency and high accuracy. Existing BERT pruning methods require domain experts to heuristically handcraft hyperparameters to strike a balance among model size, latency, and accuracy. In this work, we propose AE-BERT, an automatic and efficient BERT pruning framework with efficient evaluation to select a "good" sub-network candidate (with high accuracy) given the overall pruning ratio constraints. Our proposed method requires no human expert experience and achieves better accuracy on many NLP tasks. Our experimental results on the General Language Understanding Evaluation (GLUE) benchmark show that AE-BERT outperforms the state-of-the-art (SOTA) hand-crafted pruning methods on BERT$_{\mathrm{BASE}}$. On QNLI and RTE, we obtain 75\% and 42.8\% higher overall pruning ratios while achieving higher accuracy. On MRPC, we obtain a score 4.6 higher than the SOTA at the same overall pruning ratio of 0.5. On STS-B, we achieve a 40\% higher pruning ratio with a very small loss in Spearman correlation compared to SOTA hand-crafted pruning methods. Experimental results also show that, after model compression, the inference time of a single BERT$_{\mathrm{BASE}}$ encoder on a Xilinx Alveo U200 FPGA board has a 1.83$\times$ speedup compared to an Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU, which shows that it is reasonable to deploy the subnets of the BERT$_{\mathrm{BASE}}$ model generated by the proposed method on computation-restricted devices.
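
To make the notion of an "overall pruning ratio constraint" concrete, here is a minimal Python sketch. It is not AE-BERT's procedure: the abstract does not describe how candidates are generated or evaluated, so plain global magnitude pruning is used as a stand-in, and the function names and toy weights are hypothetical. It only illustrates that a fixed overall ratio (0.5 here) can be met with different sparsity in each layer.

```python
def overall_pruning_ratio(layers):
    """Fraction of weights that are exactly zero across all layers."""
    total = sum(len(w) for w in layers)
    zeros = sum(1 for w in layers for x in w if x == 0.0)
    return zeros / total


def magnitude_prune(layers, ratio):
    """Zero the smallest-magnitude weights globally (stand-in technique,
    not the method from the paper) so that `ratio` of all weights are removed."""
    flat = sorted(abs(x) for w in layers for x in w)
    k = int(ratio * len(flat))  # number of weights to remove
    if k == 0:
        return [list(w) for w in layers]
    threshold = flat[k - 1]  # magnitude of the k-th smallest weight
    pruned, removed = [], 0
    for w in layers:
        new_w = []
        for x in w:
            # The `removed < k` guard keeps ties from over-pruning.
            if abs(x) <= threshold and removed < k:
                new_w.append(0.0)
                removed += 1
            else:
                new_w.append(x)
        pruned.append(new_w)
    return pruned


# Toy "model": two layers with four weights each (made-up values).
layers = [[0.1, -0.9, 0.3, 0.05], [0.7, -0.2, 0.6, -0.4]]
pruned = magnitude_prune(layers, ratio=0.5)
print(pruned)
print(overall_pruning_ratio(pruned))
```

In this toy case the first layer ends up 75% sparse and the second 25% sparse, while the overall ratio is exactly 0.5. Choosing how to distribute sparsity under such a constraint is the kind of decision the abstract says is usually hand-tuned by experts.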
