Paper Title

Attention Enables Zero Approximation Error

Paper Authors

Zhiying Fang, Yidong Ouyang, Ding-Xuan Zhou, Guang Cheng

Paper Abstract

Deep learning models have been widely applied in various aspects of daily life, and many variant models built on deep learning structures have achieved even better performance. Attention-based architectures have become almost ubiquitous among these structures. In particular, the transformer model has now surpassed convolutional neural networks on image classification tasks and become one of the most widely used tools. However, the theoretical properties of attention-based models are seldom considered. In this work, we show that, with suitable adaptations, a single-head self-attention transformer with a fixed number of transformer encoder blocks and free parameters can generate any desired polynomial of the input with no error. The number of transformer encoder blocks equals the degree of the target polynomial. Even more strikingly, we find that the transformer encoder blocks in this model do not need to be trained. As a direct consequence, we show that the single-head self-attention transformer with an increasing number of free parameters is universal. These surprising theoretical results clearly explain the outstanding performance of the transformer model and may shed light on future modifications in real applications. We also provide experiments to verify our theoretical results.
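To make the "polynomial with no error" claim concrete, the sketch below is a toy illustration in NumPy and is not the adaptation constructed in the paper: it drops the softmax and hand-sets the query, key, and value matrices (these simplifications, the token layout, and the helper name `linear_attention` are all assumptions made for illustration) so that a single untrained single-head attention layer outputs the product x1 * x2 exactly.

```python
import numpy as np

# Minimal illustrative sketch, NOT the paper's construction: a softmax-free
# ("linear") single-head self-attention layer with hand-set, untrained weights
# whose output is exactly the monomial x1 * x2. It only illustrates how the
# bilinear query-key score lets attention represent products of inputs exactly
# instead of approximating them.

def linear_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention with the softmax nonlinearity omitted."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return (Q @ K.T) @ V  # attention scores times values

x1, x2 = 0.7, -1.3  # arbitrary scalar inputs

# Two tokens with three features each: (x1, 0, 1) and (0, x2, 1).
# The third feature is a constant that the value projection reads out.
X = np.array([[x1, 0.0, 1.0],
              [0.0, x2, 1.0]])

W_Q = np.array([[1.0], [0.0], [0.0]])  # query extracts x1 (token 1)
W_K = np.array([[0.0], [1.0], [0.0]])  # key extracts x2 (token 2)
W_V = np.array([[0.0], [0.0], [1.0]])  # value extracts the constant 1

out = linear_attention(X, W_Q, W_K, W_V)
# Algebraically, out[0, 0] equals x1 * x2 with zero approximation error.
print(out[0, 0], x1 * x2)
assert np.isclose(out[0, 0], x1 * x2)
```

Note the flavour of the example: the product is represented exactly by fixed weights rather than approximated by trained ones, which is the kind of zero-error, training-free statement the abstract describes for polynomials of the input.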
