Paper Title

Hierarchical Topic Mining via Joint Spherical Tree and Text Embedding

Authors

Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Chao Zhang, Jiawei Han

Abstract

Mining a set of meaningful topics organized into a hierarchy is intuitively appealing since topic correlations are ubiquitous in massive text corpora. To account for potential hierarchical topic structures, hierarchical topic models generalize flat topic models by incorporating latent topic hierarchies into their generative modeling process. However, due to their purely unsupervised nature, the learned topic hierarchy often deviates from users' particular needs or interests. To guide the hierarchical topic discovery process with minimal user supervision, we propose a new task, Hierarchical Topic Mining, which takes a category tree described by category names only, and aims to mine a set of representative terms for each category from a text corpus to help a user comprehend his/her interested topics. We develop a novel joint tree and text embedding method along with a principled optimization procedure that allows simultaneous modeling of the category tree structure and the corpus generative process in the spherical space for effective category-representative term discovery. Our comprehensive experiments show that our model, named JoSH, mines a high-quality set of hierarchical topics with high efficiency and benefits weakly-supervised hierarchical text classification tasks.
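
To make the task's input and output concrete, the following is a minimal sketch of category-representative term retrieval on the unit sphere. It is not the JoSH training procedure: the category tree, the toy 3-dimensional vectors, and the helper names (`normalize`, `cosine`, `representative_terms`) are all hypothetical and chosen only for illustration. It assumes embeddings have already been learned and simply ranks candidate terms by cosine similarity to each category name's vector.

```python
import math

def normalize(vec):
    """Project a vector onto the unit sphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity of two unit vectors, which is just their dot product."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical category tree given by names only: parent -> list of children.
category_tree = {"sports": ["tennis", "soccer"], "tennis": [], "soccer": []}

# Toy embeddings (made-up values, not learned from any corpus).
raw_term_vectors = {
    "sports": [1.0, 1.0, 0.0],
    "tennis": [1.0, 0.2, 0.0],
    "soccer": [0.2, 1.0, 0.0],
    "racket": [0.9, 0.1, 0.1],
    "goalkeeper": [0.1, 0.9, 0.1],
    "election": [0.0, 0.1, 1.0],
}
term_vectors = {term: normalize(vec) for term, vec in raw_term_vectors.items()}

def representative_terms(category, k=2):
    """Return the k non-category terms closest to the category on the sphere."""
    center = term_vectors[category]
    candidates = [term for term in term_vectors if term not in category_tree]
    ranked = sorted(
        candidates,
        key=lambda term: cosine(center, term_vectors[term]),
        reverse=True,
    )
    return ranked[:k]

for category in category_tree:
    print(category, representative_terms(category))
```

Because every vector has unit length, ranking by dot product is the same as ranking by angle, which is the sense in which similarity is measured "in the spherical space" here. Modeling the tree structure jointly with the corpus, as the abstract describes, is beyond what this sketch attempts.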
