Paper Title
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning
Paper Authors
Paper Abstract
Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. This instability is even more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples.
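To make the policy-gradient example-selection idea concrete, below is a minimal, hypothetical PyTorch sketch of REINFORCE-style in-context example selection in the spirit of PromptPG: a small policy network scores candidate training examples for a given test problem, a few examples are sampled to build the few-shot prompt, and the correctness of the downstream model's answer serves as the reward. This is not the authors' released implementation; the embeddings, the `dummy_reward` function, and all sizes are placeholder assumptions, and in the actual paper the reward comes from whether GPT-3 answers the TabMWP training problem correctly given the constructed prompt.

```python
# Minimal sketch of REINFORCE-based in-context example selection.
# All data below (embeddings, reward, sizes) are placeholders, not the
# PromptPG implementation; only the control flow mirrors the described idea.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM, N_CANDIDATES, K_SHOTS = 64, 20, 2  # hypothetical sizes


class SelectionPolicy(nn.Module):
    """Scores each candidate example conditioned on the test problem."""

    def __init__(self, emb_dim):
        super().__init__()
        self.proj = nn.Linear(emb_dim, emb_dim, bias=False)

    def forward(self, problem_emb, cand_embs):
        # Bilinear-style compatibility between the problem and each candidate.
        scores = cand_embs @ self.proj(problem_emb)   # shape: (N_CANDIDATES,)
        return F.log_softmax(scores, dim=-1)          # log-probs over candidates


def dummy_reward(selected_idx):
    """Placeholder reward: in PromptPG this would be +1 if the LLM answers the
    problem correctly with the constructed prompt, and a penalty otherwise."""
    return 1.0 if sum(selected_idx) % 2 == 0 else -1.0


policy = SelectionPolicy(EMB_DIM)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):                                # toy training loop
    problem_emb = torch.randn(EMB_DIM)                 # stand-in problem encoding
    cand_embs = torch.randn(N_CANDIDATES, EMB_DIM)     # stand-in candidate encodings

    log_probs = policy(problem_emb, cand_embs)
    # Sample K in-context examples from the policy (without replacement;
    # a simplification of the joint selection probability).
    selected = torch.multinomial(log_probs.exp(), K_SHOTS, replacement=False)

    reward = dummy_reward(selected.tolist())
    # REINFORCE: maximize expected reward <=> minimize -reward * log pi(selection).
    loss = -(reward * log_probs[selected].sum())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```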