DocuT5: Seq2seq SQL Generation with Table Documentation

Title

DocuT5: Seq2seq SQL Generation with Table Documentation

Authors

Elena Soare, Iain Mackie, Jeffrey Dalton

Abstract

Current SQL generators based on pre-trained language models struggle to answer complex questions that require domain context or an understanding of fine-grained table structure. Humans would deal with these unknowns by reasoning over the documentation of the tables. Based on this hypothesis, we propose DocuT5, which uses an off-the-shelf language model architecture and injects knowledge from external "documentation" to improve domain generalization. We perform experiments on the Spider family of datasets, which contain complex cross-domain, multi-table questions. Specifically, we develop a new text-to-SQL failure taxonomy and find that 19.6% of errors are due to foreign key mistakes and 49.2% are due to a lack of domain knowledge. DocuT5 captures knowledge from (1) the table structure context of foreign keys and (2) domain knowledge obtained by contextualizing tables and columns. Both types of knowledge improve over state-of-the-art T5 with constrained decoding on Spider, and domain knowledge yields effectiveness comparable to the state of the art on the Spider-DK and Spider-Syn datasets.
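
The abstract does not give the exact input format, so the following is only a minimal sketch of how the two knowledge sources might be injected into the text fed to a seq2seq model. The function name `serialize_input`, the ` | ` and ` : ` separators, the `foreign key :` marker, and the example schema and descriptions are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple


def serialize_input(
    question: str,
    db_id: str,
    tables: Dict[str, List[str]],
    foreign_keys: Optional[List[Tuple[str, str]]] = None,
    descriptions: Optional[Dict[str, str]] = None,
) -> str:
    """Flatten a question and schema into one string (hypothetical format).

    foreign_keys: pairs of "table.column" strings (structure knowledge).
    descriptions: "table.column" -> free-text documentation (domain knowledge).
    """
    foreign_keys = foreign_keys or []
    descriptions = descriptions or {}

    parts = [question, db_id]
    for table, columns in tables.items():
        rendered = []
        for column in columns:
            text = column
            key = f"{table}.{column}"
            # Append the column's documentation, if any, in parentheses.
            if key in descriptions:
                text += f" ( {descriptions[key]} )"
            rendered.append(text)
        parts.append(f"{table} : " + " , ".join(rendered))

    # List foreign-key links after the tables so joins are explicit.
    for source, target in foreign_keys:
        parts.append(f"foreign key : {source} = {target}")

    return " | ".join(parts)


# Illustrative schema; the names below are invented for this example.
example = serialize_input(
    question="How many concerts did each singer give?",
    db_id="concert_db",
    tables={
        "singer": ["singer_id", "name"],
        "concert": ["concert_id", "singer_id"],
    },
    foreign_keys=[("concert.singer_id", "singer.singer_id")],
    descriptions={"singer.name": "stage name of the singer"},
)
print(example)
```

The resulting string could then be tokenized and passed to any off-the-shelf encoder-decoder model; omitting `foreign_keys` and `descriptions` gives a plain schema serialization to compare against.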
