Paper Title

Nearest Neighbor Classification based on Imbalanced Data: A Statistical Approach

Authors

Anvit Garg, Anil K. Ghosh, Soham Sarkar

Abstract

When the competing classes in a classification problem are not of comparable size, many popular classifiers exhibit a bias towards larger classes, and the nearest neighbor classifier is no exception. To take care of this problem, we develop a statistical method for nearest neighbor classification based on such imbalanced data sets. First, we construct a classifier for the binary classification problem and then extend it for classification problems involving more than two classes. Unlike the existing oversampling or undersampling methods, our proposed classifiers do not need to generate any pseudo observations or remove any existing observations, hence the results are exactly reproducible. We establish the Bayes risk consistency of these classifiers under appropriate regularity conditions. Their superior performance over the existing methods is amply demonstrated by analyzing several benchmark data sets.
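The bias described in the abstract can be seen on a tiny toy example. The sketch below is only an illustration under stated assumptions: the one-dimensional data, the function names, and the size-adjusted vote (dividing each class's neighbor count by its training-set size) are all hypothetical choices made here. It is a generic, well-known correction and is not the classifier proposed in the paper, whose construction the abstract does not spell out.

```python
from collections import Counter

# Hypothetical 1-D training set: 20 points of majority class "A" on an
# even grid, and 3 points of minority class "B" scattered among them.
train = [(x, "A") for x in range(0, 40, 2)] + [(15, "B"), (21, "B"), (27, "B")]


def knn_votes(train, query, k):
    """Count class labels among the k training points nearest to query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest)


def majority_vote(train, query, k):
    """Plain k-NN rule: predict the class with the most neighbors."""
    votes = knn_votes(train, query, k)
    return max(votes, key=votes.get)


def size_adjusted_vote(train, query, k):
    """Illustrative correction (an assumption, not the paper's method):
    compare neighbor counts relative to each class's training-set size."""
    votes = knn_votes(train, query, k)
    sizes = Counter(label for _, label in train)
    return max(votes, key=lambda c: votes[c] / sizes[c])


if __name__ == "__main__":
    query, k = 21, 5
    print(knn_votes(train, query, k))           # neighbor counts per class
    print(majority_vote(train, query, k))       # plain k-NN prediction
    print(size_adjusted_vote(train, query, k))  # size-adjusted prediction
```

With this data the query sits exactly on a minority-class point, yet four of its five nearest neighbors belong to the larger class simply because that class has more observations everywhere. The plain vote therefore favors the larger class, while the size-adjusted vote favors the minority class. Like the approach described in the abstract, this adjustment neither generates pseudo observations nor removes existing ones, so its output is deterministic.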
