Paper Title
Fast Monte-Carlo Approximation of the Attention Mechanism
Paper Authors
Paper Abstract
We introduce Monte-Carlo Attention (MCA), a randomized approximation method for reducing the computational cost of the self-attention mechanism in Transformer architectures. MCA exploits the fact that the importance of each token in an input sequence varies with its attention score; thus, some degree of error can be tolerated when encoding tokens with low attention. Using approximate matrix multiplication, MCA applies different error bounds when encoding input tokens, so that those with low attention scores are computed with relaxed precision while the error on salient elements is minimized. MCA can operate in parallel with other attention optimization schemes and requires no model modification. We study the theoretical error bounds and demonstrate that MCA reduces the attention complexity (in FLOPs) of various Transformer models by up to 11$\times$ on the GLUE benchmark without compromising model accuracy.
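The abstract does not include code, but the primitive MCA builds on, Monte-Carlo approximate matrix multiplication, is straightforward to illustrate. Below is a minimal sketch of norm-proportional column/row sampling in the style of classical randomized matrix multiplication; the function name `approx_matmul` and the `num_samples` parameter (which stands in for MCA's per-token error budget) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def approx_matmul(A, B, num_samples, rng=None):
    """Monte-Carlo approximation of A @ B via column/row sampling.

    Samples `num_samples` column-of-A / row-of-B outer products with
    probabilities proportional to their norms, then rescales so the
    estimator is unbiased: E[approx] = A @ B.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = A.shape[1]
    # Sampling probabilities: p_i proportional to ||A[:, i]|| * ||B[i, :]||
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(k, size=num_samples, p=p)
    # Rescale each sampled outer product by 1 / (num_samples * p_i)
    scale = 1.0 / (num_samples * p[idx])
    return (A[:, idx] * scale) @ B[idx, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 256))
    B = rng.standard_normal((256, 64))
    exact = A @ B
    for s in (32, 128, 256):  # fewer samples: cheaper, looser error bound
        approx = approx_matmul(A, B, s, rng)
        err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
        print(f"samples={s:4d}  relative Frobenius error={err:.3f}")
```

The sample count controls the accuracy/cost trade-off: fewer samples yield a cheaper but noisier product. In MCA's terms, tokens with low attention scores would be assigned a smaller sample budget, while salient tokens would be computed with more samples and hence tighter error.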