Title


Learning Math Reasoning from Self-Sampled Correct and Partially-Correct Solutions

Authors

Ansong Ni, Jeevana Priya Inala, Chenglong Wang, Oleksandr Polozov, Christopher Meek, Dragomir Radev, Jianfeng Gao

Abstract


Pretrained language models have shown superior performance on many natural language processing tasks, yet they still struggle at multi-step formal reasoning tasks like grade school math problems. One key challenge of finetuning them to solve such math reasoning problems is that many existing datasets only contain one reference solution for each problem, despite the fact that there are often alternative solutions resembling different reasoning paths to the final answer. This way, the finetuned models are biased towards the limited reference solutions, which limits their generalization to unseen examples. To mitigate this issue, we propose to let the model perform sampling during training and learn from both self-sampled fully-correct solutions, which yield the correct answer upon execution, and partially-correct solutions, whose intermediate state matches an intermediate state of a known correct solution. We show that our use of self-sampled correct and partially-correct solutions can benefit learning and help guide the sampling process, leading to more efficient exploration of the solution space. Additionally, we explore various training objectives to support learning from multiple solutions per example and find they greatly affect the performance. Experiments on two math reasoning datasets show the effectiveness of our method compared to learning from a single reference solution with MLE, where we improve PASS@100 from 35.5% to 44.5% for GSM8K, and 27.6% to 36.2% PASS@80 for MathQA. Such improvements are also consistent across different model sizes. Our code is available at https://github.com/microsoft/TraceCodegen.
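The filtering step described in the abstract can be illustrated with a small sketch: a self-sampled solution (a sequence of program statements) is kept as fully correct if executing it yields the gold answer, or as partially correct if some prefix of it reaches an intermediate variable state that also occurs in a known correct solution. The helper names below (`exec_lines`, `filter_samples`, the `answer` variable convention) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the solution-filtering idea from the abstract.
# A "solution" is a list of Python statements; "state" after each line is
# the set of (variable, value) pairs, used to detect partial correctness.

def exec_lines(lines):
    """Execute solution lines one at a time, recording the variable
    state (as a frozenset of (name, value) pairs) after each line."""
    env, states = {}, []
    for line in lines:
        try:
            exec(line, {}, env)
        except Exception:
            break  # stop at the first failing line
        states.append(frozenset(env.items()))
    return env, states

def filter_samples(samples, reference, gold_answer, answer_var="answer"):
    """Split sampled solutions into fully- and partially-correct sets,
    comparing intermediate states against a known correct reference."""
    _, ref_states = exec_lines(reference)
    ref_state_set = set(ref_states)
    fully, partial = [], []
    for sample in samples:
        env, states = exec_lines(sample)
        if env.get(answer_var) == gold_answer:
            fully.append(sample)  # executes to the correct final answer
        elif any(s in ref_state_set for s in states):
            # keep the longest prefix whose state matches the reference
            k = max(i for i, s in enumerate(states) if s in ref_state_set) + 1
            partial.append(sample[:k])
    return fully, partial
```

For example, with reference `["a = 2 + 3", "answer = a * 4"]` and gold answer `20`, a sample that computes `answer = a * 4` from `a = 5` is fully correct, while a sample that starts with `a = 5` but finishes incorrectly is kept only up to its matching prefix.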
