Paper Title

PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change

Paper Authors

Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, Subbarao Kambhampati

Paper Abstract

Generating plans of action, and reasoning about change have long been considered a core competence of intelligent agents. It is thus no surprise that evaluating the planning and reasoning capabilities of large language models (LLMs) has become a hot topic of research. Most claims about LLM planning capabilities are, however, based on common sense tasks, where it becomes hard to tell whether LLMs are planning or merely retrieving from their vast world knowledge. There is a strong need for systematic and extensible planning benchmarks with sufficient diversity to evaluate whether LLMs have innate planning capabilities. Motivated by this, we propose PlanBench, an extensible benchmark suite based on the kinds of domains used in the automated planning community, especially in the International Planning Competition, to test the capabilities of LLMs in planning or reasoning about actions and change. PlanBench provides sufficient diversity in both the task domains and the specific planning capabilities. Our studies also show that on many critical capabilities, including plan generation, LLM performance falls quite short, even with the SOTA models. PlanBench can thus function as a useful marker of progress of LLMs in planning and reasoning.
