Paper Title

Probabilistic Missing Value Imputation for Mixed Categorical and Ordered Data

Authors

Yuxuan Zhao, Alex Townsend, Madeleine Udell

Abstract

Many real-world datasets contain missing entries and mixed data types, including categorical and ordered (e.g. continuous and ordinal) variables. Imputing the missing entries is necessary, since many data analysis pipelines require complete data, but this is especially challenging for mixed data. This paper proposes a probabilistic imputation method using an extended Gaussian copula model that supports both single and multiple imputation. The method models mixed categorical and ordered data using a latent Gaussian distribution. The unordered characteristics of categorical variables are explicitly modeled using the argmax operator. The method makes no assumptions about the data marginals, nor does it require tuning any hyperparameters. Experimental results on synthetic and real datasets show that imputation with the extended Gaussian copula outperforms the current state-of-the-art for both categorical and ordered variables in mixed data.
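
The argmax idea mentioned in the abstract can be illustrated with a minimal sketch: a categorical variable with K levels is generated as the index of the largest coordinate of a K-dimensional latent Gaussian vector, so no ordering among the levels is implied. The abstract does not give the paper's exact parameterization, so the function name `sample_categorical_via_argmax`, the identity covariance, and the use of a mean shift to control category frequencies are all assumptions made for illustration only.

```python
import numpy as np

def sample_categorical_via_argmax(mean, n_samples, seed=0):
    """Draw categorical labels as the argmax of a latent Gaussian vector.

    Hypothetical helper: the latent vector has one coordinate per category,
    with an assumed identity covariance and a per-category mean shift.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(mean, dtype=float)
    # One latent Gaussian coordinate per category for every sample.
    z = rng.standard_normal((n_samples, mean.size)) + mean
    # The observed category is the index of the largest latent coordinate.
    return np.argmax(z, axis=1)

# A larger mean shift makes a category more likely; labels carry no order.
labels = sample_categorical_via_argmax([0.0, 0.5, -1.0], n_samples=10000)
freqs = np.bincount(labels, minlength=3) / labels.size
print(freqs)
```

In a sketch like this, permuting the entries of `mean` only relabels the categories, which is one way to see that the argmax construction treats the levels as unordered.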
