Paper Title

Measuring and Narrowing the Compositionality Gap in Language Models

Authors

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, Mike Lewis

Abstract

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.
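To make the self-ask procedure concrete, below is a minimal sketch of how its structured prompting can be wired to a search engine. The prompt markers ("Follow up:", "Intermediate answer:", "So the final answer is:") follow the pattern described in the paper, but the demonstration question, the llm() and search() stubs, and the self_ask_with_search() helper are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative sketch of self-ask prompting with a pluggable search engine.
# llm() and search() are stand-in stubs (assumptions), not a specific API;
# the one-shot demonstration inside the prompt is likewise illustrative.

SELF_ASK_PROMPT = """Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here: Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft.

Question: {question}
Are follow up questions needed here:"""


def llm(prompt: str) -> str:
    """Stub: return the model's continuation of `prompt` (assumption)."""
    raise NotImplementedError


def search(query: str) -> str:
    """Stub: return a short answer to `query` from a search engine (assumption)."""
    raise NotImplementedError


def self_ask_with_search(question: str, max_hops: int = 5) -> str:
    """Let the model decompose a multi-hop question into follow-up questions,
    answering each follow-up with the search engine rather than the model."""
    prompt = SELF_ASK_PROMPT.format(question=question)
    for _ in range(max_hops):
        continuation = llm(prompt)
        if "So the final answer is:" in continuation:
            return continuation.split("So the final answer is:")[-1].strip()
        if "Follow up:" in continuation:
            # Keep the model's text up to its first proposed intermediate answer,
            # then substitute the search engine's answer for that follow-up.
            head, _, _ = continuation.partition("Intermediate answer:")
            follow_up = head.split("Follow up:")[-1].strip()
            prompt += head + "Intermediate answer: " + search(follow_up) + "\n"
        else:
            prompt += continuation
    return ""
```

In this sketch the loop ends when the model emits the final-answer marker or after max_hops iterations; replacing search() with another llm() call would recover a retrieval-free self-ask variant in which the model answers its own follow-up questions.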
