Paper Title
SDBERT: SparseDistilBERT, a faster and smaller BERT model
Paper Authors
Paper Abstract
In this work, we introduce a new transformer architecture called SparseDistilBERT (SDBERT), which combines sparse attention with knowledge distillation (KD). We implemented a sparse attention mechanism to reduce the quadratic dependency on input length to a linear one. In addition to lowering the computational complexity of the model, we applied knowledge distillation (KD). We were able to reduce the size of the BERT model by 60% while retaining 97% of its performance, and training required only 40% of the time.
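The abstract does not specify which sparsity pattern SDBERT uses, so the sketch below only illustrates one common way to obtain linear cost: sliding-window (local) attention, where each token attends to a fixed-size neighbourhood rather than to every other token. The function name `sliding_window_attention` and the window parameter `w` are illustrative assumptions, not names taken from the paper.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, w: int):
    """Local self-attention sketch: each token attends only to the
    2*w+1 positions around it, so compute and memory grow linearly
    with sequence length instead of quadratically.
    q, k, v: (batch, seq_len, dim)."""
    b, n, d = q.shape
    # Pad keys/values so every position has a full window.
    k_pad = F.pad(k, (0, 0, w, w))                      # (b, n+2w, d)
    v_pad = F.pad(v, (0, 0, w, w))
    k_win = k_pad.unfold(1, 2 * w + 1, 1)               # (b, n, d, 2w+1)
    v_win = v_pad.unfold(1, 2 * w + 1, 1)               # (b, n, d, 2w+1)
    # Scaled dot-product scores against only the local window.
    scores = torch.einsum("bnd,bndw->bnw", q, k_win) / d ** 0.5
    # Mask window slots that fall outside the real sequence (padding).
    pos = (torch.arange(n, device=q.device).unsqueeze(1)
           + torch.arange(2 * w + 1, device=q.device) - w)
    valid = (pos >= 0) & (pos < n)                      # (n, 2w+1)
    scores = scores.masked_fill(~valid, float("-inf"))
    attn = scores.softmax(dim=-1)
    return torch.einsum("bnw,bndw->bnd", attn, v_win)   # (b, n, d)
```

For example, `sliding_window_attention(q, k, v, w=16)` on 128-token sequences lets each query touch only 33 keys instead of all 128, and that per-token cost stays constant as the sequence grows.

The exact distillation objective is likewise not given in the abstract; a common formulation for DistilBERT-style training mixes a temperature-scaled KL term against the teacher's soft targets with the usual hard-label cross-entropy. The hyperparameters `temperature` and `alpha` below are assumed defaults, not values reported by the authors.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Generic KD loss sketch: soft-target KL (teacher -> student)
    blended with hard-label cross-entropy on the ground truth."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2          # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the student's logits are computed with its own (sparse-attention) forward pass while the teacher runs in inference mode; only the student's parameters are updated against this combined loss.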