Paper Title

Transcending Scaling Laws with 0.1% Extra Compute

Authors

Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani

Abstract

Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving $\sim$4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.
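The key idea described in the abstract, continuing to train an existing language model for a small number of extra steps on UL2's mixture-of-denoisers objective, can be sketched in a few lines. The snippet below is a minimal illustration only, not the authors' implementation: the mode tags ([R], [S], [X]), the T5-style sentinel scheme, and the span-length and corruption-rate settings are assumptions loosely based on the UL2 paper, while the actual U-PaLM training runs on PaLM's infrastructure at far larger scale.

```python
# Minimal sketch (not the authors' code): building UL2-style
# "mixture-of-denoisers" training examples for continued pretraining.
# Denoiser settings below are illustrative assumptions, not the exact
# configuration used for U-PaLM.
import random

SENTINELS = [f"<extra_id_{i}>" for i in range(100)]  # T5-style sentinel tokens

# (mode_tag, mean_span_length, corruption_rate) -- assumed values
DENOISERS = [
    ("[R]", 3, 0.15),     # R-denoiser: short spans, low corruption (T5-like)
    ("[X]", 32, 0.50),    # X-denoiser: long spans / heavy corruption (extreme)
    ("[S]", None, None),  # S-denoiser: prefix-LM style sequential denoising
]

def span_corrupt(tokens, mean_span, rate):
    """Mask random spans; return (corrupted_input, target) as token lists."""
    n_to_mask = max(1, int(len(tokens) * rate))
    masked = [False] * len(tokens)
    while sum(masked) < n_to_mask:
        span = max(1, int(random.expovariate(1.0 / mean_span)))
        start = random.randrange(len(tokens))
        for i in range(start, min(start + span, len(tokens))):
            masked[i] = True
    inputs, targets, sent = [], [], 0
    i = 0
    while i < len(tokens):
        if masked[i]:
            # Replace each masked span with a sentinel in the input;
            # the target spells out the sentinel followed by the span.
            inputs.append(SENTINELS[sent])
            targets.append(SENTINELS[sent])
            while i < len(tokens) and masked[i]:
                targets.append(tokens[i])
                i += 1
            sent += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def make_example(tokens):
    """Sample one denoiser and build a (prompt, target) pair with a mode tag."""
    tag, mean_span, rate = random.choice(DENOISERS)
    if tag == "[S]":  # prefix-LM: condition on a prefix, predict the suffix
        cut = random.randrange(1, len(tokens))
        return [tag] + tokens[:cut], tokens[cut:]
    corrupted, target = span_corrupt(tokens, mean_span, rate)
    return [tag] + corrupted, target

if __name__ == "__main__":
    doc = "scaling language models improves performance but costs compute".split()
    prompt, target = make_example(doc)
    print("PROMPT:", " ".join(prompt))
    print("TARGET:", " ".join(target))
```

In UL2R, batches built along these lines are used to continue training from the existing PaLM checkpoint for a relatively tiny number of steps (roughly 0.1% of the original pretraining compute), which is what yields the improved scaling curves and downstream gains reported in the abstract.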
