Paper Title
Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents
Paper Authors
Paper Abstract
Page segmentation is a crucial stage in the automatic analysis of documents with complex layouts. It has traditionally been carried out on uncompressed documents, although most real-life documents exist in compressed form to make storage and transfer efficient. Carrying out page segmentation directly on compressed documents, without a decompression stage, is a challenging goal. This paper sets out to demonstrate the possibility of performing page segmentation directly on the run-length data of CCITT Group 3 compressed text documents, which may be single- or multi-columned and may even contain text regions in inverted text color mode. Therefore, before the text document is segmented into columns, each column into paragraphs, each paragraph into text lines, each line into words, and finally each word into characters, the document needs to be pre-processed. The pre-processing stage identifies the normal and inverted text regions and toggles the inverted text regions to the normal mode. Next, to initiate column separation, a new strategy of incremental assimilation of white-space runs in the vertical direction is proposed, together with the auto-estimation of certain related parameters. A procedure that realizes column segmentation using these extracted parameters has been devised. A two-level horizontal row separation process then segments every column into paragraphs and, in turn, into text lines. Finally, a two-level vertical column separation process completes the separation into words and characters.
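To make the idea of working on run-length data concrete, here is a minimal Python sketch of locating vertical white strips (candidate column gutters) without reconstructing any pixels. It is not the authors' procedure: the abstract does not give the details of the incremental assimilation strategy or its auto-estimated parameters. The sketch assumes each row is already available as a list of alternating white and black run lengths that starts with a (possibly zero-length) white run, that the text is in normal mode (black on white), and that a fixed, hypothetical `min_gap` threshold stands in for the estimated parameters. The function names are illustrative only.

```python
def white_intervals(runs):
    """Return [start, end) intervals of white pixels for one row.

    `runs` alternates white, black, white, ... run lengths and starts
    with a white run, which may have length zero (assumed input format).
    """
    intervals = []
    pos = 0
    for i, length in enumerate(runs):
        if i % 2 == 0 and length > 0:  # even positions hold white runs
            intervals.append((pos, pos + length))
        pos += length
    return intervals


def intersect(a, b):
    """Intersect two sorted lists of disjoint [start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def column_gaps(rows, min_gap):
    """Return white strips at least `min_gap` wide that span every row.

    `min_gap` is a hypothetical fixed threshold used for illustration.
    """
    if not rows:
        return []
    common = white_intervals(rows[0])
    for runs in rows[1:]:
        # Narrow the shared white space row by row, on runs only.
        common = intersect(common, white_intervals(runs))
    return [(s, e) for s, e in common if e - s >= min_gap]


# Three rows of a 20-pixel-wide toy page with two text columns.
rows = [[2, 6, 4, 6, 2], [1, 8, 2, 7, 2], [20]]
print(column_gaps(rows, min_gap=2))
```

In this toy input the returned strips include the right page margin as well as the interior gutter, so a fuller version would discard strips that touch the page border. Working on intervals keeps the cost proportional to the number of runs rather than the number of pixels.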