Paper Title

TableParser: Automatic Table Parsing with Weak Supervision from Spreadsheets

Authors

Susie Xi Rao, Johannes Rausch, Peter Egger, Ce Zhang

Abstract

Tables have been an ever-existing structure to store data. There exist now different approaches to store tabular data physically. PDFs, images, spreadsheets, and CSVs are leading examples. Being able to parse table structures and extract content bounded by these structures is of high importance in many applications. In this paper, we devise TableParser, a system capable of parsing tables in both native PDFs and scanned images with high precision. We have conducted extensive experiments to show the efficacy of domain adaptation in developing such a tool. Moreover, we create TableAnnotator and ExcelAnnotator, which constitute a spreadsheet-based weak supervision mechanism and a pipeline to enable table parsing. We share these resources with the research community to facilitate further research in this interesting direction.
