Paper Title
Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer
Paper Authors
Paper Abstract
Hardware accelerators for deep neural networks (DNNs) are in strong demand. Nonetheless, most existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model has been replacing RNNs in the natural language processing (NLP) area. However, because it involves intensive matrix computations and a complicated data flow, a hardware design for the Transformer model has never been reported. In this paper, we propose the first hardware accelerator for two key components, i.e., the multi-head attention (MHA) ResBlock and the position-wise feed-forward network (FFN) ResBlock, which are the two most complex layers in the Transformer. First, an efficient method is introduced to partition the huge matrices in the Transformer, allowing the two ResBlocks to share most of the hardware resources. Second, the computation flow is well designed to ensure high utilization of the systolic array, which is the biggest module in our design. Third, complicated nonlinear functions are highly optimized to further reduce the hardware complexity and the latency of the entire system. Our design is coded in a hardware description language (HDL) and evaluated on a Xilinx FPGA. Compared with a GPU implementation under the same setting, the proposed design achieves speed-ups of 14.6x for the MHA ResBlock and 3.4x for the FFN ResBlock, respectively. This work therefore lays a good foundation for building efficient hardware accelerators for multiple Transformer networks.
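To make the two ResBlocks concrete, the sketch below gives a minimal NumPy reference of the computations the accelerator targets: the MHA ResBlock (Q/K/V/output projections, per-head scaled dot-product attention, residual add, layer normalization) and the position-wise FFN ResBlock (two matrix multiplications with a ReLU in between). The function names, the base-Transformer sizes (d_model = 512, h = 8, d_ff = 2048), and the simplified LayerNorm are assumptions taken from the original Transformer formulation, not from the paper's hardware configuration; this is an illustrative reference model only, not the accelerator's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # simplified LayerNorm without learnable scale/shift, for brevity
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mha_resblock(x, Wq, Wk, Wv, Wo, h=8):
    # x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)
    seq_len, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # split each projection into h heads: (h, seq_len, d_k)
    split = lambda m: m.reshape(seq_len, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, seq_len, seq_len)
    heads = softmax(scores) @ Vh                          # (h, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return layer_norm(x + concat @ Wo)                    # residual add + LayerNorm

def ffn_resblock(x, W1, b1, W2, b2):
    # position-wise FFN: (d_model -> d_ff -> d_model) with ReLU, then residual + LayerNorm
    return layer_norm(x + np.maximum(0, x @ W1 + b1) @ W2 + b2)

# Example usage with assumed base-Transformer sizes
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 512, 2048, 64
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
y = ffn_resblock(mha_resblock(x, Wq, Wk, Wv, Wo), W1, np.zeros(d_ff), W2, np.zeros(d_model))
```

As the sketch shows, both ResBlocks are dominated by large matrix-matrix products of compatible shapes (d_model x d_model versus d_model x d_ff), which is what makes it plausible to partition them into tiles that share one systolic array, with softmax and LayerNorm as the nonlinear functions the abstract refers to.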