Paper title
Semantic Evaluation for Text-to-SQL with Distilled Test Suites
Paper authors
Paper abstract
We propose test suite accuracy to approximate semantic accuracy for Text-to-SQL models. Our method distills, from a large number of randomly generated databases, a small test suite of databases that achieves high code coverage for the gold query. At evaluation time, it computes the denotation accuracy of the predicted queries on the distilled test suite, thereby efficiently calculating a tight upper bound on semantic accuracy. We use our proposed method to evaluate 21 models submitted to the Spider leaderboard and manually verify that our method is always correct on 100 examples. In contrast, the current Spider metric leads to a false negative rate of 2.5% on average and 8.1% in the worst case, indicating that test suite accuracy is needed. Our implementation, along with distilled test suites for eleven Text-to-SQL datasets, is publicly available.
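The evaluation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `singer` table, the helper names `make_db`, `denotation`, and `suite_match`, and the two hand-written databases are all assumptions made for illustration, and the distillation step that selects databases by code coverage is omitted. A predicted query counts as correct only if its denotation matches the gold query's on every database in the suite. Matching on all databases is necessary for semantic equivalence but not sufficient, which is why the resulting accuracy is an upper bound on semantic accuracy.

```python
import sqlite3
from collections import Counter


def make_db(rows):
    """Build an in-memory database with a hypothetical `singer` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", rows)
    return conn


def denotation(conn, sql):
    """Return the query result as a multiset of rows, or None if it fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def suite_match(gold_sql, pred_sql, suite):
    """True only if both queries return the same rows on every database.

    Simplification: rows are compared as multisets, so ORDER BY is ignored.
    """
    for conn in suite:
        gold = denotation(conn, gold_sql)
        pred = denotation(conn, pred_sql)
        if gold is None or pred is None or gold != pred:
            return False
    return True


GOLD = "SELECT name FROM singer WHERE age > 30"
PRED = "SELECT name FROM singer WHERE age >= 30"  # subtly wrong prediction

db_a = make_db([("Ann", 25), ("Bob", 40)])  # cannot tell > from >=
db_b = make_db([("Cy", 30)])                # boundary value exposes the bug

print(suite_match(GOLD, PRED, [db_a]))        # True: a single database is fooled
print(suite_match(GOLD, PRED, [db_a, db_b]))  # False: the suite catches the error
```

In this toy case a single database accepts the wrong prediction, while adding one database with a boundary value rejects it. This is the kind of distinguishing database a distilled test suite is meant to contain.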