Paper Title

Answer Fast: Accelerating BERT on the Tensor Streaming Processor

Paper Authors

Ibrahim Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, Dennis Abts

Paper Abstract

Transformers have become a predominant machine learning workload: they are not only the de facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems, such as machine translation and web search, and these real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several non-linear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the Tensor Streaming Processor. By carefully fusing all the non-linear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 $\mu$s for a batch-1 inference through BERT-base, which is 6X faster than the current state-of-the-art.
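
The fusion claim is easiest to see in the structure of a BERT encoder layer: every block of matrix multiplications is separated by a non-linear step (softmax inside attention, GELU inside the feed-forward network, and layer normalization after each sub-block), so on hardware where a separate unit handles these steps, each one becomes a serialization point that idles the matmul units. The sketch below is a minimal, single-head NumPy rendering of that compute pattern; the shapes follow BERT-base, but the random weights, the absence of biases, masking, and multi-head splitting are simplifying assumptions, and this is not Groq's fused TSP implementation.

```python
# Minimal sketch of one BERT encoder layer's compute pattern (illustrative
# assumptions: single attention head, no biases/mask, random weights, and
# layer norm without learned scale/shift). Not Groq's fused TSP kernels --
# it only shows how non-linear ops interleave with the matmuls.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU, as used in BERT
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-12):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, Wq, Wk, Wv, Wo, W1, W2):
    # Attention: four matmuls (Q, K, V, output projection) plus the
    # score matmul, with softmax sitting between them.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # non-linear step
    attn = layer_norm(x + (scores @ v) @ Wo)          # non-linear step
    # Feed-forward: two matmuls with GELU wedged between them.
    return layer_norm(attn + gelu(attn @ W1) @ W2)    # non-linear steps

# BERT-base shapes (hidden = 768, FFN = 3072), sequence length 128
rng = np.random.default_rng(0)
d, f, s = 768, 3072, 128
x = rng.standard_normal((s, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
W1, W2 = rng.standard_normal((d, f)) * 0.02, rng.standard_normal((f, d)) * 0.02
print(encoder_layer(x, Wq, Wk, Wv, Wo, W1, W2).shape)  # (128, 768)
```

Counting the calls above, every matmul pair is bridged by a softmax, GELU, or layer norm; fusing those steps into the matmul pipeline, as the abstract describes, is what removes them as standalone bottlenecks.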
