Paper Title

Confident Adaptive Language Modeling

Authors

Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler

Abstract

Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
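The early-exit decoding described in the abstract can be sketched as a per-token loop over the model's layers that stops as soon as an intermediate prediction is confident enough. This is a minimal illustration under stated assumptions, not the CALM implementation: `layers`, `classifier`, and the fixed `threshold` are hypothetical stand-ins for the Transformer blocks, the shared output head, and the calibrated confidence measure the paper develops.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def early_exit_decode_step(hidden, layers, classifier, threshold=0.9):
    """Run layers one at a time; exit as soon as the top-1 softmax
    probability of the intermediate prediction clears `threshold`.
    Returns the predicted token id and the number of layers used."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(classifier(hidden))
        if probs.max() >= threshold:
            # Confident enough: skip the remaining layers for this token.
            return int(probs.argmax()), depth
    return int(probs.argmax()), len(layers)
```

A higher threshold forces more layers to run before exiting; in the actual framework the threshold is not hand-picked but connected to sequence-level constraints so that overall generation quality is provably maintained, and exited hidden states must still be made available for later tokens to attend to.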
