Paper title

A new LDA formulation with covariates

Authors

Gilson Shimizu, Rafael Izbicki, Denis Valle

Abstract

The Latent Dirichlet Allocation (LDA) model is a popular method for creating mixed-membership clusters. Although it was originally developed for text analysis, LDA has been used in a wide range of other applications. We propose a new formulation of the LDA model that incorporates covariates. In this model, a negative binomial regression is embedded within LDA, which enables straightforward interpretation of the regression coefficients and analysis of the quantity of cluster-specific elements in each sampling unit (rather than modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate the model parameters. We use simulations to show that our algorithm successfully recovers the true parameter values and can predict the abundance matrix from the information given by the covariates. The model is illustrated using real data sets from three different areas: text mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
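
To make the idea of embedding a negative binomial regression in LDA more concrete, here is a minimal simulation sketch of one possible reading of the abstract. It is not the authors' specification: the log link, the fixed dispersion, the Dirichlet prior on the cluster profiles, and every name (`simulate_lda_with_covariates`, `beta`, `phi`, `dispersion`) are assumptions made for illustration. Covariates drive the number of cluster-specific elements in each sampling unit, and those elements are then spread over categories (words, products, species) according to each cluster's profile.

```python
import numpy as np


def simulate_lda_with_covariates(n_units=50, n_clusters=3, n_categories=20,
                                 n_covariates=2, dispersion=5.0, seed=0):
    """Simulate count data from an assumed covariate-driven LDA-style model."""
    rng = np.random.default_rng(seed)

    # Design matrix: intercept column plus standard-normal covariates.
    X = np.column_stack([np.ones(n_units),
                         rng.normal(size=(n_units, n_covariates))])

    # Hypothetical regression coefficients, one column per cluster.
    beta = rng.normal(scale=0.5, size=(n_covariates + 1, n_clusters))

    # Assumed log link: expected number of cluster-k elements in unit i.
    mu = np.exp(X @ beta)

    # NumPy's negative_binomial(n, p) has mean n * (1 - p) / p,
    # so p = n / (n + mu) yields counts with mean mu.
    p = dispersion / (dispersion + mu)
    cluster_counts = rng.negative_binomial(dispersion, p)

    # Each cluster has its own probability profile over categories.
    phi = rng.dirichlet(np.full(n_categories, 0.5), size=n_clusters)

    # Spread each unit's cluster-k elements over categories, then sum over k.
    Y = np.zeros((n_units, n_categories), dtype=np.int64)
    for i in range(n_units):
        for k in range(n_clusters):
            Y[i] += rng.multinomial(cluster_counts[i, k], phi[k])

    return X, beta, cluster_counts, phi, Y


if __name__ == "__main__":
    X, beta, cluster_counts, phi, Y = simulate_lda_with_covariates()
    print("abundance matrix shape:", Y.shape)
    print("cluster element counts for the first unit:", cluster_counts[0])
```

Under these assumptions, a coefficient in `beta` acts on the log of the expected count of cluster-specific elements, which is the kind of direct interpretation the abstract contrasts with modeling cluster proportions.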
