Paper title
Not all parameters are born equal: Attention is mostly what you need
Paper authors
Paper abstract
Transformers are widely used in state-of-the-art machine translation, but the key to their success is still unknown. To gain insight into this, we consider three groups of parameters: embeddings, attention, and feed forward neural network (FFN) layers. We examine the relative importance of each by performing an ablation study where we initialise them at random and freeze them, so that their weights do not change over the course of the training. Through this, we show that the attention and FFN are equally important and fulfil the same functionality in a model. We show that the decision about whether a component is frozen or allowed to train is at least as important for the final model performance as its number of parameters. At the same time, the number of parameters alone is not indicative of a component's importance. Finally, while the embedding layer is the least essential for machine translation tasks, it is the most important component for language modelling tasks.
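The abstract's ablation amounts to randomly initialising a parameter group and then excluding it from gradient updates. The following is a minimal sketch of that idea, not the authors' code: it assumes PyTorch and uses the stock nn.TransformerEncoderLayer, freezing its self-attention block while leaving the embeddings and FFN trainable.

```python
import torch.nn as nn

# Hypothetical example of the freezing ablation: one encoder layer whose
# self-attention keeps its random initialisation and never receives updates.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)

# Freeze the attention parameters; the FFN (linear1/linear2) and layer norms still train.
for p in layer.self_attn.parameters():
    p.requires_grad = False

# Only the remaining trainable parameters would be passed to the optimiser.
trainable = [p for p in layer.parameters() if p.requires_grad]
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {sum(p.numel() for p in trainable)} / total: {total}")
```

The same pattern applies to the other two groups in the study (embeddings, FFN) by iterating over the corresponding submodules instead of `self_attn`.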