Using virtual edges to extract keywords from texts modeled as complex networks

Paper title

Using virtual edges to extract keywords from texts modeled as complex networks

Authors

Tohalino, Jorge A. V., Silva, Thiago C., Amancio, Diego R.

Abstract

Detecting keywords in texts is important for many text mining applications. Graph-based methods are commonly used to automatically find the key concepts in texts; however, the relevant information provided by embeddings has not been widely used to enrich the graph structure. Here we modeled texts as co-occurrence networks, where nodes are words and edges are established by either contextual or semantic similarity. We compared two embedding approaches, Word2vec and BERT, to check whether edges created via word embeddings can improve the quality of the keyword extraction method. We found that the use of virtual edges can indeed improve the discriminability of co-occurrence networks. The best performance was obtained when only a low percentage of virtual (embedding) edges was added. A comparative analysis of structural and dynamical network metrics revealed that degree, PageRank, and accessibility are the metrics displaying the best performance in the model enriched with virtual edges.
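
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the window size, the two-dimensional toy vectors (stand-ins for Word2vec or BERT embeddings), the rule of keeping a fixed fraction of the most similar non-adjacent pairs, and all function names are assumptions made for illustration only.

```python
import math
from itertools import combinations


def cooccurrence_edges(tokens, window=1):
    """Link each word to the words that follow it within `window` positions."""
    edges = set()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def virtual_edges(embeddings, existing, fraction):
    """Keep the top `fraction` most similar word pairs that are not yet linked.

    Assumption: `fraction` is taken over the candidate (non-adjacent) pairs.
    """
    candidates = [
        (cosine(embeddings[a], embeddings[b]), a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if frozenset((a, b)) not in existing
    ]
    candidates.sort(reverse=True)
    keep = int(round(fraction * len(candidates)))
    return {frozenset((a, b)) for _, a, b in candidates[:keep]}


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on an undirected graph.

    Simplification: rank held by isolated nodes is not redistributed.
    """
    neighbors = {n: set() for n in nodes}
    for edge in edges:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        rank = {
            n: (1 - damping) / len(nodes)
            + damping * sum(rank[m] / len(neighbors[m]) for m in neighbors[n])
            for n in nodes
        }
    return rank


# Toy text and hypothetical 2-d embeddings (real vectors would come from
# a trained Word2vec or BERT model).
tokens = ["graph", "methods", "rank", "words", "graph", "metrics", "rank", "keywords"]
embeddings = {
    "graph": [0.1, 1.0],
    "methods": [0.4, 0.6],
    "rank": [0.7, 0.7],
    "words": [1.0, 0.1],
    "metrics": [0.5, 0.5],
    "keywords": [0.9, 0.2],
}

real = cooccurrence_edges(tokens, window=1)
virtual = virtual_edges(embeddings, real, fraction=0.125)
scores = pagerank(set(tokens), real | virtual)

# Words sorted by PageRank; the highest-ranked ones are keyword candidates.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Keeping `fraction` small mirrors the abstract's observation that a low percentage of virtual edges worked best; swapping `pagerank` for node degree or another centrality measure only changes the final scoring step.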
