Data-Efficient Information Extraction from Form-Like Documents

Paper Title

Data-Efficient Information Extraction from Form-Like Documents

Authors

Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie

Abstract

Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data efficiency, and (2) the ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, which is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document types, and learning good representations is critical to accomplishing this.
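
The improvements in the abstract are stated in F1 points. As a minimal sketch of what such a figure measures, the Python below computes F1 from raw extraction counts and reports the gap between two models on a 0-100 scale. It assumes the usual convention that an "F1 point" is an absolute difference of 0.01 in F1; the abstract does not say how scores are averaged across fields, and the function names and counts here are hypothetical illustrations, not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        # No correct extractions: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_point_gain(baseline: float, improved: float) -> float:
    """Absolute difference between two F1 scores, expressed on a 0-100 scale."""
    return 100.0 * (improved - baseline)


# Hypothetical counts for illustration only; NOT taken from the paper.
baseline_f1 = f1_score(tp=50, fp=50, fn=50)  # model trained only on the small target corpus
transfer_f1 = f1_score(tp=80, fp=20, fn=20)  # model initialized via transfer learning

print(f"baseline F1 = {baseline_f1:.2f}, transfer F1 = {transfer_f1:.2f}")
print(f"gain = {f1_point_gain(baseline_f1, transfer_f1):.1f} F1 points")
```

Reporting the absolute gap rather than a relative percentage keeps gains comparable across document types whose baseline scores differ widely.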
