Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

Paper Title

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

Authors

Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, Zhifang Sui

Abstract

We propose Speculative Decoding (SpecDec), for the first time ever, to formally study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter -- an independent model specially optimized for efficient and accurate drafting -- and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently in the decoding paradigm. Experimental results on various seq2seq tasks including machine translation and abstractive summarization show our approach can achieve around $5\times$ speedup for the popular Transformer architectures with comparable generation quality to beam search decoding, refreshing the impression that the draft-then-verify paradigm introduces only $1.4\times$$\sim$$2\times$ speedup. In addition to the remarkable speedup, we also demonstrate 3 additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and codes are available at https://github.com/hemingkx/SpecDec.
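
The draft-then-verify paradigm mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy `target` and `drafter` callables, the `block_size` parameter, and the strict "accept only if the draft equals the target's greedy token" rule. It is not the paper's Spec-Drafter or Spec-Verification, and the abstract does not specify their details. In a real Transformer the verification of a whole drafted block would be a single parallel forward pass, which is where the speedup comes from; here it is written as a loop for clarity.

```python
from typing import Callable, List

Token = int
# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prefix: List[Token], max_new_tokens: int) -> List[Token]:
    """Plain autoregressive decoding: one target call per generated token."""
    out = list(prefix)
    for _ in range(max_new_tokens):
        out.append(target(out))
    return out


def draft_then_verify(
    target: NextToken,
    drafter: NextToken,
    prefix: List[Token],
    block_size: int,
    max_new_tokens: int,
) -> List[Token]:
    """Draft a block of tokens cheaply, then keep the longest verified prefix."""
    out = list(prefix)
    limit = len(prefix) + max_new_tokens
    while len(out) < limit:
        # 1) Draft: the cheap model proposes block_size tokens.
        draft: List[Token] = []
        for _ in range(block_size):
            draft.append(drafter(out + draft))

        # 2) Verify: compare each drafted token with the target's own choice.
        #    (Assumed strict rule; a real system would score all positions in parallel.)
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: substitute the target's token and discard the rest.
                accepted.append(expected)
                break

        # Each round adds at least one token, so the loop always terminates.
        out.extend(accepted)
    return out[:limit]


# Toy models (hypothetical): the target counts upward modulo 10;
# the drafter agrees except after tokens 4 and 9, where it guesses 0.
def toy_target(prefix: List[Token]) -> Token:
    return (prefix[-1] + 1) % 10


def toy_drafter(prefix: List[Token]) -> Token:
    return 0 if prefix[-1] % 5 == 4 else (prefix[-1] + 1) % 10


result = draft_then_verify(toy_target, toy_drafter, [0], block_size=4, max_new_tokens=12)
reference = greedy_decode(toy_target, [0], max_new_tokens=12)
```

Under this strict acceptance rule, `result` is identical to `reference`, so the output matches ordinary greedy decoding while most tokens come from accepted drafts. A looser acceptance rule would trade exactness for longer accepted blocks.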
