Paper Title

Minority Class Oversampling for Tabular Data with Deep Generative Models

Paper Authors

Ramiro Camino, Christian Hammerschmidt, Radu State

Paper Abstract

In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead practitioners about the model's performance. A common method to treat imbalanced datasets is under- and oversampling. In this process, samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of deep generative models, including our own, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that all of the new methods tend to perform better than simple baseline methods such as SMOTE, but require different under- and oversampling ratios to do so. Our experiments show that the method of sampling does not affect quality, but runtime varies widely. We also observe that the improvements in terms of performance metrics, while shown to be significant when ranking the methods, are often minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling. We make our code and testing framework available.
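The under- and oversampling pipeline the abstract describes is easy to reproduce with off-the-shelf tools. Below is a minimal sketch using the imbalanced-learn library, with SMOTE as the baseline oversampler named in the abstract; the synthetic dataset and the 0.5/0.8 sampling ratios are illustrative assumptions, not the ratios tuned in the paper.

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Build a synthetic 9:1 imbalanced binary dataset (a hypothetical stand-in
# for the paper's tabular benchmarks).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
print("original class counts:", Counter(y))

# Oversample the minority class until it reaches half the majority size ...
X_over, y_over = SMOTE(sampling_strategy=0.5,
                       random_state=0).fit_resample(X, y)

# ... then undersample the majority class down to a 4:5 minority-to-majority
# ratio. Both ratios here are arbitrary choices for illustration.
X_res, y_res = RandomUnderSampler(sampling_strategy=0.8,
                                  random_state=0).fit_resample(X_over, y_over)
print("resampled class counts:", Counter(y_res))
```

In the study's setting, the SMOTE step is what the deep generative models replace: synthetic minority-class rows are drawn from a trained generative model rather than interpolated between nearest neighbors as SMOTE does.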
