Reducing a Set of Regular Expressions and Analyzing Differences of Domain-specific Statistic Reporting

Paper Title

Reducing a Set of Regular Expressions and Analyzing Differences of Domain-specific Statistic Reporting

Authors

Tobias Kalmbach, Marcel Hoffmann, Nicolas Lell, Ansgar Scherp

Abstract

Because so many scientific publications appear every day, it is impossible to review each one manually, so automatic extraction of key information is desirable. In this paper, we examine STEREO, a tool that uses regular expressions to extract statistics from scientific papers. By adapting an existing regular expression inclusion algorithm to our use case, we reduce the number of regular expressions used in STEREO by about $33.8\%$. We reveal common patterns in the condensed rule set that can be used to create new rules. We also apply STEREO, which was previously trained on the life-sciences and medical domain, to a new scientific domain, namely Human-Computer Interaction (HCI), and re-evaluate it. According to our research, statistics in the HCI domain are similar to those in the medical domain, although a higher percentage of APA-conformant statistics was found in the HCI domain. Additionally, we compare extraction from PDF and LaTeX source files and find LaTeX to be more reliable for extraction.
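
To make the idea of regex-based statistic extraction concrete, the following is a minimal sketch of how a single APA-style report such as "t(24) = 2.31, p < .05" might be matched. The pattern `APA_T_TEST` and the helper `extract_t_tests` are hypothetical names chosen for illustration; they are not taken from STEREO's rule set, and the tool's actual rules may look quite different.

```python
import re

# Hypothetical pattern for an APA-style t-test report, e.g. "t(24) = 2.31, p < .05".
# Illustration only: this is an assumption, not one of STEREO's actual rules.
APA_T_TEST = re.compile(
    r"\bt\((?P<df>\d+(?:\.\d+)?)\)\s*=\s*(?P<t>-?\d*\.?\d+),\s*"
    r"p\s*(?P<op>[<>=])\s*(?P<p>\d*\.\d+)"
)


def extract_t_tests(text):
    """Return one dict of named groups per t-test report found in text."""
    return [m.groupdict() for m in APA_T_TEST.finditer(text)]


sentence = "The effect was significant, t(24) = 2.31, p < .05."
print(extract_t_tests(sentence))
```

A rule set built this way tends to accumulate many near-identical patterns, which is the kind of redundancy that a regular expression inclusion check, as mentioned in the abstract, could help remove.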
