Title

Combining Word Embeddings and N-grams for Unsupervised Document Summarization

Authors

Zhuolin Jiang, Manaj Srivastava, Sanjay Krishna, David Akodes, Richard Schwartz

Abstract

Graph-based extractive document summarization relies on the quality of the sentence similarity graph. Bag-of-words or tf-idf-based sentence similarity uses exact word matching, but fails to measure the semantic similarity between individual words or to consider the semantic structure of sentences. In order to improve the similarity measure between sentences, we employ off-the-shelf deep embedding features and tf-idf features, and introduce a new text similarity metric. An improved sentence similarity graph is built and used in a submodular objective function for extractive summarization, which consists of a weighted coverage term and a diversity term. A Transformer-based compression model is developed for sentence compression to aid in document summarization. Our summarization approach is extractive and unsupervised. Experiments demonstrate that our approach can outperform the tf-idf based approach and achieve state-of-the-art performance on the DUC04 dataset, and comparable performance to the fully supervised learning methods on the CNN/DM and NYT datasets.
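The submodular objective described above (a weighted coverage term plus a diversity term, greedily maximized over a sentence similarity graph) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses plain tf-idf cosine similarity only (the deep embedding features and the new similarity metric from the paper are omitted), and the hyperparameters `alpha`, `lam`, and `k` are illustrative placeholders.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    # Simple tf-idf features per sentence (a stand-in for the paper's
    # tf-idf features; the deep embedding features are not modeled here).
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: tf[w] * math.log((1 + n) / (1 + df[w])) + tf[w]
                     for w in tf})
    return vecs

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def summarize(sentences, k=2, alpha=0.75, lam=0.5):
    # Greedy maximization of a submodular objective of the form
    # coverage(S) + lam * diversity(S), in the spirit of the weighted
    # coverage + diversity formulation the abstract describes.
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    def coverage(S):
        # Saturated coverage: each sentence i can be "covered" by the
        # summary S only up to a fraction alpha of its total similarity mass.
        total = 0.0
        for i in range(n):
            cov = sum(sim[i][j] for j in S)
            cap = alpha * sum(sim[i][j] for j in range(n))
            total += min(cov, cap)
        return total

    def diversity(S):
        # Penalize redundancy among the selected sentences.
        return -sum(sim[i][j] for i in S for j in S if i < j)

    selected = []
    while len(selected) < min(k, n):
        best = max((i for i in range(n) if i not in selected),
                   key=lambda i: coverage(selected + [i])
                   + lam * diversity(selected + [i]))
        selected.append(best)
    return [sentences[i] for i in sorted(selected)]
```

Swapping the tf-idf cosine in `sim` for a similarity computed from sentence embeddings (or a combination of both, as the paper proposes) changes only the graph construction; the greedy submodular selection is unchanged.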
