Paper Title
Transkimmer: Transformer Learns to Layer-wise Skim
Paper Authors
Paper Abstract
The Transformer architecture has become the de-facto model for many machine learning tasks in natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend the same amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack an effective, end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden-state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.
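The sketch below illustrates the mechanism the abstract describes, not the authors' released implementation: a small per-layer predictor emits a discrete keep/skim decision per token via Gumbel-softmax reparameterization, skimmed tokens bypass the layer, and a skim loss regularizes the fraction of kept tokens. Module names (`SkimPredictor`, `skim_layer`, `skim_loss`) and the predictor shape are assumptions; for clarity the sketch masks skimmed tokens rather than physically gathering kept ones, so it shows the training signal but not the inference-time speedup.

```python
# Minimal, hypothetical sketch of layer-wise skimming in the Transkimmer style.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SkimPredictor(nn.Module):
    """Small MLP placed before a Transformer layer; outputs keep/skim logits per token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),
            nn.GELU(),
            nn.Linear(hidden_size // 2, 2),  # [skim, keep] logits
        )

    def forward(self, hidden_states: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.mlp(hidden_states)  # (batch, seq, 2)
        # Gumbel-softmax reparameterization: discrete one-hot decisions in the
        # forward pass, soft differentiable gradients in the backward pass.
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)
        return decisions[..., 1]  # keep mask, (batch, seq)


def skim_layer(layer: nn.Module, hidden_states: torch.Tensor,
               predictor: SkimPredictor):
    """Apply one Transformer layer only to kept tokens; skimmed tokens bypass it."""
    keep_mask = predictor(hidden_states)  # 1 = keep, 0 = skim
    updated = layer(hidden_states)
    # Skimmed tokens retain their current hidden state (conceptually forwarded
    # straight to the final output); kept tokens receive the layer update.
    mask = keep_mask.unsqueeze(-1)
    return mask * updated + (1.0 - mask) * hidden_states, keep_mask


def skim_loss(keep_masks: list) -> torch.Tensor:
    """Regularizer that encourages skimming: mean fraction of kept tokens per layer."""
    return torch.stack([m.mean() for m in keep_masks]).mean()


if __name__ == "__main__":
    hidden, batch, seq = 768, 2, 16
    layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
    predictor = SkimPredictor(hidden)
    x = torch.randn(batch, seq, hidden)
    x, mask = skim_layer(layer, x, predictor)
    # The skim loss would be weighted and added to the downstream task loss,
    # e.g. total_loss = task_loss + lambda_skim * skim_loss([...]).
    print(x.shape, skim_loss([mask]).item())
```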