AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models

Paper Title

AugmentedPCA: A Python Package of Supervised and Adversarial Linear Factor Models

Authors

William E. Carson IV, Austin Talbot, David Carlson

Abstract

Deep autoencoders are often extended with a supervised or adversarial loss to learn latent representations with desirable properties, such as greater predictivity of labels and outcomes or fairness with respect to a sensitive variable. Despite the ubiquity of supervised and adversarial deep latent factor models, these methods should demonstrate improvement over simpler linear approaches to be preferred in practice. This necessitates a reproducible linear analog that still adheres to an augmenting supervised or adversarial objective. We address this methodological gap by presenting methods that augment the principal component analysis (PCA) objective with either a supervised or an adversarial objective and provide analytic and reproducible solutions. We implement these methods in an open-source Python package, AugmentedPCA, that can produce excellent real-world baselines. We demonstrate the utility of these factor models on an open-source RNA-seq cancer gene expression dataset, showing that augmenting with a supervised objective improves downstream classification performance, produces principal components with greater class fidelity, and facilitates identification of genes aligned with the principal axes of data variance, with implications for the development of specific types of cancer.
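As a rough illustration of what augmenting the PCA objective with a supervised term can look like, the sketch below uses a simple stand-in technique: it stacks the centered features with supervision targets scaled by sqrt(mu) and takes an SVD of the stacked matrix. This is not the AugmentedPCA package's API and not the authors' formulation; the function name `supervised_pca_sketch`, the weight `mu`, and the stacking approach are all assumptions made for illustration. Unlike a practical supervised factor model, the scores here are computed from both features and targets, so the sketch only illustrates the training-time trade-off between reconstruction and supervision.

```python
import numpy as np


def supervised_pca_sketch(X, Y, n_components=2, mu=1.0):
    """Hypothetical stand-in for a supervision-augmented PCA.

    Assumption: the supervised term is imitated by jointly factorizing
    [X, sqrt(mu) * Y]. This is NOT the AugmentedPCA implementation.
    """
    # Center both blocks so the SVD captures variance, not the mean.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Stack features with weighted targets; mu controls supervision strength.
    Z = np.hstack([Xc, np.sqrt(mu) * Yc])

    # Rows of Vt are the principal axes of the stacked matrix.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T          # shape (p + q, k)

    scores = Z @ V                   # component scores, shape (n, k)
    W = V[: X.shape[1]]              # loadings restricted to the features
    return scores, W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))     # synthetic features
    Y = rng.normal(size=(50, 1))     # synthetic supervision target
    scores, W = supervised_pca_sketch(X, Y, n_components=2, mu=1.0)
    print(scores.shape, W.shape)
```

With `mu=0` the stacked matrix carries no target information and the sketch reduces to ordinary PCA on the features; as `mu` grows, the leading component is pulled toward the direction of the targets.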
