Paper Title
Optimization Techniques for Unsupervised Complex Table Reasoning via Self-Training Framework
Authors
Abstract
Structured tabular data is a fundamental data type in numerous fields, and the capacity to reason over tables is crucial for answering questions and validating hypotheses. However, constructing labeled data for complex reasoning tasks is labor-intensive, and the quantity of annotated data remains insufficient to support the intricate demands of real-world applications. To address this annotation shortage, we present a self-training framework for unsupervised complex tabular reasoning (UCTR-ST) that generates diverse synthetic data with complex logic. Specifically, UCTR-ST incorporates several essential techniques: a "Program-Management" component aggregates diverse programs and executes them on tables, and a "Program-Transformation" module bridges the gap between programs and text by generating natural language sentences with complex logic. Furthermore, a "Table-Text Manipulator" optimizes the procedure to handle joint table-text reasoning scenarios. The entire framework uses self-training to leverage unlabeled training data, which yields significant performance improvements when tested on real-world data. Experimental results demonstrate that UCTR-ST achieves above 90% of supervised model performance across different tasks and domains, reducing the dependence on manual annotation. Additionally, our approach can serve as a data augmentation technique, significantly boosting the performance of supervised models in low-resource domains.
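
The general idea of executing programs on a table and converting them into labeled sentences can be illustrated with a toy sketch. This is not the UCTR-ST implementation: the names `run_program`, `synthesize`, and `TEMPLATES`, the three aggregation operators, the fixed sentence templates (standing in for a learned program-to-text module), and the example table are all assumptions made for illustration. The self-training step and the joint table-text handling are not covered here.

```python
from typing import Dict, List, Tuple, Union

Row = Dict[str, Union[str, float]]
Table = List[Row]


def run_program(table: Table, op: str, column: str) -> float:
    """Execute a tiny 'program' (one aggregation over one column) on a table."""
    values = [float(row[column]) for row in table]
    if op == "sum":
        return sum(values)
    if op == "max":
        return max(values)
    if op == "count":
        return float(len(values))
    raise ValueError(f"unknown op: {op}")


# Hypothetical templates; a real system would use a learned generator instead.
TEMPLATES = {
    "sum": "The total {column} is {value:g}.",
    "max": "The highest {column} is {value:g}.",
    "count": "There are {value:g} rows with a {column} entry.",
}


def synthesize(table: Table, programs: List[Tuple[str, str]]) -> List[Tuple[str, bool]]:
    """Turn each program into one supported claim and one refuted claim."""
    samples: List[Tuple[str, bool]] = []
    for op, column in programs:
        value = run_program(table, op, column)
        # The executed result gives a claim that the table supports.
        samples.append((TEMPLATES[op].format(column=column, value=value), True))
        # Perturbing the result gives a claim that the table refutes.
        samples.append((TEMPLATES[op].format(column=column, value=value + 1), False))
    return samples


# Made-up example table, used only to exercise the sketch.
table = [
    {"city": "Oslo", "population": 700.0},
    {"city": "Bergen", "population": 290.0},
]
samples = synthesize(table, [("sum", "population"), ("max", "population")])
for sentence, label in samples:
    print(label, sentence)
```

Each program yields a (sentence, label) pair without any manual annotation, which is the property that makes this kind of synthetic data usable for training a verification or question-answering model.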