Paper Title

Graph-based Semantical Extractive Text Analysis

Author

Samizadeh, Mina

Abstract

In the past few decades, the amount of available data produced by various sources on different topics has exploded. The availability of this enormous volume of data requires us to adopt effective computational tools to explore it. This has led to intense and growing interest in the research community in developing computational methods for processing text data. One line of study focuses on condensing text so that we can reach a higher level of understanding in a shorter time. Two important tasks for this purpose are keyword extraction and text summarization. In keyword extraction, we are interested in finding the key words of a text, which familiarizes us with its general topic. In text summarization, we are interested in producing a short text that includes the important information in the document. The TextRank algorithm, an unsupervised learning method that extends PageRank (the base algorithm of the Google search engine for searching and ranking pages), has shown its efficacy in large-scale text mining, especially for text summarization and keyword extraction. This algorithm can automatically extract the important parts of a text (keywords or sentences) and return them as the result. However, it neglects the semantic similarity between the different parts. In this work, we improve the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text. Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework, which can be used on its own or as part of summary generation to overcome coverage problems.
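To make the ranking idea concrete, here is a minimal sketch of a TextRank-style, weighted PageRank iteration over a sentence similarity graph. It is not the author's implementation: the function names `rank_nodes` and `word_overlap`, the damping value, and the iteration count are all assumptions chosen for illustration. The `similarity` argument is a hypothetical hook where a semantic measure could replace the toy word-overlap measure used here; the abstract does not specify which semantic measure the work uses.

```python
from itertools import combinations


def rank_nodes(items, similarity, damping=0.85, iterations=50):
    """Score items with a weighted PageRank-style iteration (TextRank-like).

    `similarity(a, b)` is a hypothetical, caller-supplied function that
    returns a non-negative edge weight between two items.
    """
    n = len(items)
    # Build a symmetric weight matrix over all pairs of items.
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w = similarity(items[i], items[j])
        weights[i][j] = weights[j][i] = w

    # Total edge weight leaving each node, used to normalise contributions.
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        scores = new_scores
    return scores


def word_overlap(a, b):
    """Toy lexical similarity: Jaccard overlap of lowercase word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


# Illustrative input only; these sentences are not from the paper.
sentences = [
    "graph ranking finds important sentences",
    "important sentences form the summary",
    "the weather is nice today",
]
scores = rank_nodes(sentences, word_overlap)
best = sentences[scores.index(max(scores))]
```

In this sketch the sentence with the highest score would be taken as the most central one. Swapping `word_overlap` for a semantic similarity function changes only the edge weights, which is one plausible way to read "incorporating the semantic similarity between parts of the text".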
