Paper Title
Lila: A Unified Benchmark for Mathematical Reasoning
Paper Authors
Paper Abstract
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks ranging from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct the benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answers. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40% F1, indicating substantial room for improvement in general mathematical reasoning and understanding.
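To make the program-form solutions concrete, below is a minimal sketch of what a LILA-style instance could look like. The word problem, variable names, and `solution` function are hypothetical illustrations of the format, not drawn from the actual dataset.

```python
# Hypothetical LILA-style instance: a word problem paired with a Python
# program whose return value is the answer, so every reasoning step is
# an explicit, executable statement rather than a free-form rationale.
question = (
    "A grocery store sells apples at $0.50 each. "
    "How much do 12 apples cost?"
)

def solution():
    price_per_apple = 0.50   # unit price stated in the problem
    num_apples = 12          # quantity stated in the problem
    total_cost = price_per_apple * num_apples
    return total_cost

# The program's output is compared against the gold answer.
print(solution())  # 6.0
```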
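The abstract does not spell out how the 21.83% figure is computed; a natural reading, which is our assumption rather than something stated in the paper, is a per-task relative gain averaged over all tasks:

$$\text{avg. relative improvement} = \frac{1}{T}\sum_{t=1}^{T} \frac{\mathrm{F1}^{\text{multi}}_{t} - \mathrm{F1}^{\text{single}}_{t}}{\mathrm{F1}^{\text{single}}_{t}} \times 100\%,$$

where $T = 23$ is the number of tasks, and $\mathrm{F1}^{\text{multi}}_{t}$ and $\mathrm{F1}^{\text{single}}_{t}$ denote the multi-task and single-task F1 scores on task $t$.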