Paper Title
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation
Authors
Abstract
We propose Speculative Decoding (SpecDec), the first formal study of exploiting speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for efficiently verifying the drafted tokens in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach achieves around $5\times$ speedup for the popular Transformer architectures with generation quality comparable to beam search decoding, challenging the impression that the draft-then-verify paradigm yields only $1.4\times \sim 2\times$ speedup. In addition to the remarkable speedup, we demonstrate three additional advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.
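The draft-then-verify paradigm mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the SpecDec implementation: `drafter` and `target_next` are hypothetical callables standing in for a drafting model and the target AR model, and the acceptance rule is a strict greedy match chosen for simplicity, which may differ from the paper's Spec-Verification criterion. A practical system would also score all drafted positions in one parallel forward pass; the sequential loop here only makes the logic visible.

```python
from typing import Callable, List

# Hypothetical interfaces (assumptions, not taken from the SpecDec repository):
#   drafter(prefix, k)  -> list of k proposed next tokens, produced in one shot
#   target_next(prefix) -> the target AR model's greedy next token
Drafter = Callable[[List[int], int], List[int]]
TargetNext = Callable[[List[int]], int]


def draft_then_verify(prefix: List[int], drafter: Drafter,
                      target_next: TargetNext, k: int,
                      max_len: int, eos: int) -> List[int]:
    """Toy draft-then-verify loop with strict greedy-match verification."""
    out = list(prefix)
    while len(out) < max_len and (not out or out[-1] != eos):
        draft = drafter(out, k)
        accepted: List[int] = []
        for tok in draft:
            # Verification: compare each drafted token with the target's choice.
            expected = target_next(out + accepted)
            if tok != expected:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
            accepted.append(tok)
            if tok == eos:
                break
        if not accepted:
            # Empty draft: fall back to one ordinary AR step.
            accepted.append(target_next(out))
        out.extend(accepted)
    return out[:max_len]
```

Under this strict rule the output matches plain greedy decoding of the target model, and each iteration adds at least one token. Any speedup therefore depends on how many drafted tokens are accepted per verification step.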