Paper Title
A Survey of Historical Document Image Datasets
Paper Authors
Paper Abstract
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to PRISMA guidelines), we select 65 studies that are chosen based on different factors, such as the year of publication, number of methods implemented in the article, reliability of the chosen algorithms, dataset size, and journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or content analysis. We present the statistics, document type, language, tasks, input visual aspects, and ground truth information for every dataset. In addition, we provide the benchmark tasks and results from these papers or recent competitions. We further discuss gaps and challenges in this domain. We advocate for providing conversion tools to common formats (e.g., COCO format for computer vision tasks) and always providing a set of evaluation metrics, instead of just one, to make results comparable across studies.
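The abstract advocates providing conversion tools to common formats such as COCO. As an illustration only (this sketch is not from the paper), the minimal Python example below writes a historical-document layout annotation in the standard COCO object-detection schema; the file name, region categories, and box coordinates are hypothetical placeholders.

```python
import json

# Minimal COCO-style annotation file: "images", "categories", and
# "annotations" are the standard top-level keys of the COCO detection format.
# All concrete values below are made up for illustration.
coco = {
    "images": [
        {"id": 1, "file_name": "manuscript_001.jpg", "width": 2480, "height": 3508}
    ],
    "categories": [
        {"id": 1, "name": "text_region"},
        {"id": 2, "name": "decoration"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [120, 200, 1800, 400],  # [x, y, width, height] in pixels
            "area": 1800 * 400,
            "iscrowd": 0,
        }
    ],
}

with open("annotations_coco.json", "w") as f:
    json.dump(coco, f, indent=2)
```

Emitting this structure from a dataset's native label format (e.g., PAGE XML) is the kind of conversion tool the authors call for, since it lets standard computer-vision tooling and evaluation code consume historical-document datasets directly.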