A Data-driven Latent Semantic Analysis for Automatic Text Summarization using LDA Topic Modelling

Title

A Data-driven Latent Semantic Analysis for Automatic Text Summarization using LDA Topic Modelling

Authors

Onah, Daniel F. O., Pang, Elaine L. L., El-Haj, Mahmoud

Abstract

With the advent and popularity of big data mining and large-scale text analysis, automated text summarization has become prominent for extracting and retrieving important information from documents. This research investigates aspects of automatic text summarization from the perspectives of single and multiple documents. Summarization is the task of condensing large text articles into short, summarized versions. The text is reduced in size while preserving key information and retaining the meaning of the original document. This study presents the Latent Dirichlet Allocation (LDA) approach used to perform topic modelling from summarised medical science journal articles with topics related to genes and diseases. In this study, the web-based interactive visualization tool PyLDAvis was used to visualise the selected topics. The visualisation provides an overarching view of the main topics while allowing deeper meaning to be attributed to the prevalence of individual topics. This study presents a novel approach to the summarization of single and multiple documents. The results suggest that the terms are ranked purely by their probability of topic prevalence within the processed document, using an extractive summarization technique. The PyLDAvis visualization illustrates the flexibility of exploring how the terms of each topic are associated with the fitted LDA model. The topic modelling result shows prevalence within topics 1 and 2. This association reveals similarity between the terms in topics 1 and 2 in this study. The efficacy of the LDA and extractive summarization methods was measured using Latent Semantic Analysis (LSA) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metrics to evaluate the reliability and validity of the model.
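
The ROUGE evaluation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which ROUGE variant or tooling the authors used, so the code below simply assumes ROUGE-N recall with whitespace tokenization; the function names and the two example sentences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: clipped n-gram overlap divided by the reference n-gram count.

    Assumption: lowercased whitespace tokenization, no stemming or stop-word removal.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is credited at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Hypothetical sentences for illustration only (not the paper's data).
reference = "the gene is linked to the disease"
candidate = "the gene is associated with the disease"

print(rouge_n_recall(candidate, reference, n=1))  # 5 of 7 reference unigrams match
print(rouge_n_recall(candidate, reference, n=2))  # 3 of 6 reference bigrams match
```

Recall is used here because ROUGE is recall-oriented by design: it measures how much of the reference summary is recovered by the generated summary, rather than how concise the generated summary is.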
