A Survey of Dataset Refinement for Problems in Computer Vision Datasets

Paper Title

A Survey of Dataset Refinement for Problems in Computer Vision Datasets

Authors

Wan, Zhijing; Wang, Zhixiang; Chung, CheukTing; Wang, Zheng

Abstract

Large-scale datasets have played a crucial role in the advancement of computer vision. However, they often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs, which can inhibit model performance and reduce trustworthiness. With the advocacy of data-centric research, various data-centric solutions have been proposed to solve the dataset problems mentioned above. They improve the quality of datasets by re-organizing them, which we call dataset refinement. In this survey, we provide a comprehensive and structured overview of recent advances in dataset refinement for problematic computer vision datasets. Firstly, we summarize and analyze the various problems encountered in large-scale computer vision datasets. Then, we classify the dataset refinement algorithms into three categories based on the refinement process: data sampling, data subset selection, and active learning. In addition, we organize these dataset refinement methods according to the addressed data problems and provide a systematic comparative description. We point out that these three types of dataset refinement have distinct advantages and disadvantages for dataset problems, which informs the choice of the data-centric method appropriate to a particular research objective. Finally, we summarize the current literature and propose potential future research topics.
