Topic Modeling with Contextualized Word Representation Clusters

Paper title

Topic Modeling with Contextualized Word Representation Clusters

Authors

Laure Thompson, David Mimno

Abstract

Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections. Unlike clusterings of vocabulary-level word embeddings, the resulting models more naturally capture polysemy and can be used as a way of organizing documents. We evaluate token clusterings trained from several different output layers of popular contextualized language models. We find that BERT and GPT-2 produce high quality clusterings, but RoBERTa does not. These cluster models are simple, reliable, and can perform as well as, if not better than, LDA topic models, maintaining high topic quality even when the number of topics is large relative to the size of the local collection.
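
The core idea in the abstract, clustering one vector per token occurrence and reading each cluster as a topic, can be illustrated with a minimal sketch. It is not the authors' implementation: the token vectors are synthetic stand-ins for contextualized model outputs, and the helper names (farthest_point_init, kmeans, top_words) and the toy vocabulary are assumptions made for illustration. The k-means routine is plain Lloyd's algorithm with a deterministic farthest-point initialization, chosen only to keep the example reproducible.

```python
import numpy as np
from collections import Counter


def farthest_point_init(X, k):
    """Pick k starting centers: the first row, then repeatedly the farthest row."""
    centers = [X[0]]
    for _ in range(k - 1):
        # Squared distance from every row to its nearest chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)


def kmeans(X, k, n_iter=10):
    """Plain Lloyd's k-means; returns one cluster id per row of X."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def top_words(words, labels, n=3):
    """Most frequent word types among the tokens assigned to each cluster."""
    topics = {}
    for j in sorted(set(labels.tolist())):
        counts = Counter(w for w, l in zip(words, labels) if l == j)
        topics[j] = [w for w, _ in counts.most_common(n)]
    return topics


# Toy corpus: one entry per token occurrence. "bank" occurs in two senses.
words = ["bank", "loan", "money", "bank", "credit",
         "bank", "river", "water", "bank", "shore"]

# Synthetic stand-ins for contextualized vectors (assumption): each token
# sits near the center of its context, so the two senses of "bank" differ.
rng = np.random.default_rng(0)
context_centers = np.zeros((2, 8))
context_centers[0, 0] = 5.0   # "finance" context
context_centers[1, 1] = 5.0   # "river" context
context = np.array([0] * 5 + [1] * 5)
X = context_centers[context] + 0.1 * rng.normal(size=(10, 8))

labels = kmeans(X, k=2)
topics = top_words(words, labels)
print(topics)
```

Because every occurrence of a word gets its own vector, the same word type can land in more than one cluster, which is how token-level clustering accommodates polysemy where vocabulary-level clustering assigns each word to a single cluster. In practice the rows of X would come from a chosen output layer of a model such as BERT or GPT-2 rather than from synthetic data.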
