Paper Title

When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method

Authors

Manyi Zhang, Xuyang Zhao, Jun Yao, Chun Yuan, Weiran Huang

Abstract

Real-world large-scale datasets are both noisily labeled and class-imbalanced. These issues seriously hurt the generalization of trained models. It is hence significant to address simultaneous incorrect labeling and class imbalance, i.e., the problem of learning with noisy labels on long-tailed data. Previous works develop several methods for the problem. However, they always rely on strong assumptions that are invalid or hard to check in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method, RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that without incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, which is much milder and easier to check. Based on this assumption, we recover underlying representation distributions from the polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of the representation robustness brought by contrastive learning, which further improves classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method.
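The core idea in the abstract — fit a multivariate Gaussian to each class's contrastive representations, then sample additional points from the recovered distributions to augment tail classes — can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and RCAL's actual calibration additionally corrects the estimated statistics for label noise and class imbalance (e.g., with the help of head-class statistics), which is omitted here.

```python
import numpy as np

def fit_class_gaussians(features, labels):
    """Estimate a per-class multivariate Gaussian (mean, covariance)
    from representation vectors. Simplified: no noise/imbalance
    correction is applied to the estimated statistics."""
    stats = {}
    for c in np.unique(labels):
        x = features[labels == c]
        mu = x.mean(axis=0)
        # Small diagonal regularizer keeps the covariance well-conditioned
        # for tail classes with few samples.
        cov = np.cov(x, rowvar=False) + 1e-4 * np.eye(x.shape[1])
        stats[c] = (mu, cov)
    return stats

def sample_extra_points(stats, n_per_class, seed=0):
    """Draw additional representation points per class from the
    recovered Gaussians to help generalization on tail classes."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for c, (mu, cov) in stats.items():
        xs.append(rng.multivariate_normal(mu, cov, size=n_per_class))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)
```

In practice the sampled representations would be fed, together with the real ones, into classifier training on top of the (frozen or fine-tuned) contrastive encoder.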
