Paper Title
Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering
Paper Authors
Paper Abstract
Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework in which human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question, and answer). However, data curation for document QA is uniquely challenging because the context (i.e., the answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from the extracted text to form well-posed contexts; (3) QA to extract knowledge from contexts and return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our Detect-Retrieve-Comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
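The three-stage pipeline from the abstract can be sketched as follows. This is a minimal illustrative sketch only: the stage names follow the abstract, but the implementations (paragraph splitting, word-overlap retrieval, an overlap-based answer stub) are toy stand-ins invented here, not the models actually used in the DRC system.

```python
import re

def _tokens(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def detect(pdf_pages):
    """Stage 1 (Detect): text extraction. A real system parses PDF layout;
    here we just split pre-extracted page strings into paragraphs."""
    return [p.strip() for page in pdf_pages
            for p in page.split("\n\n") if p.strip()]

def retrieve(paragraphs, question, top_k=1):
    """Stage 2 (Retrieve): rank paragraphs as candidate evidence contexts,
    using simple word overlap as a toy relevance score."""
    q = _tokens(question)
    ranked = sorted(paragraphs, key=lambda p: len(q & _tokens(p)), reverse=True)
    return ranked[:top_k]

def comprehend(context, question):
    """Stage 3 (Comprehend): answer from the retrieved context. A real reader
    model would go here; this stub returns the best-overlapping sentence."""
    q = _tokens(question)
    sentences = [s.strip() for s in context.replace("\n", " ").split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q & _tokens(s)))

if __name__ == "__main__":
    pages = ["Background on scholarly documents.\n\n"
             "We evaluate on the QASPER dataset of questions over NLP papers."]
    paragraphs = detect(pages)
    context = retrieve(paragraphs, "Which dataset is used for evaluation?")[0]
    print(comprehend(context, "Which dataset is used for evaluation?"))
```

The key design point the abstract emphasizes is that retrieval forms a well-posed context before any reading happens, so the QA stage never sees the full, ill-formatted document.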