论文标题

通过数据质量分析的自动可行性研究ML:标签噪声的案例研究

Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise

论文作者

Renggli, Cedric, Rimanic, Luka, Kolar, Luka, Wu, Wentao, Zhang, Ce

论文摘要

在我们与使用当今汽车系统的域专家合作的经验中,我们遇到的一个常见问题是我们所说的“不现实的期望” - 当用户通过嘈杂的数据获取过程面临非常具有挑战性的任务时,同时期待通过机器学习(ML)实现令人震惊的高精度。其中许多是从一开始就被预定为失败的。在传统软件工程中,通过可行性研究解决了此问题,这是开发任何软件系统之前必不可少的一步。在本文中,我们介绍了Snoopy,目的是支持数据科学家和机器学习工程师在构建ML应用之前进行系统和理论上建立的可行性研究。我们通过估计基本任务的不可还原错误(也称为贝叶斯错误率(BER))来解决此问题,这源于用于训练或评估ML模型工件的数据集中的数据质量问题。我们设计了一个实用的贝叶斯误差估计器,该估计值与计算机视觉和自然语言处理中的6个数据集(具有不同级别的其他真实和合成噪声)上的基线可行性研究候选者进行了比较。此外,通过将我们的系统可行性研究和其他信号包括在迭代标签清洁过程中,我们在端到端实验中证明了用户如何能够节省大量标签的时间和货币努力。

In our experience of working with domain experts who are using today's AutoML systems, a common problem we encountered is what we call "unrealistic expectations" -- when users are facing a very challenging task with a noisy data acquisition process, while being expected to achieve startlingly high accuracy with machine learning (ML). Many of these are predestined to fail from the beginning. In traditional software engineering, this problem is addressed via a feasibility study, an indispensable step before developing any software system. In this paper, we present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study before building ML applications. We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER), which stems from data quality issues in datasets used to train or evaluate ML model artifacts. We design a practical Bayes error estimator that is compared against baseline feasibility study candidates on 6 datasets (with additional real and synthetic noise of different levels) in computer vision and natural language processing. Furthermore, by including our systematic feasibility study with additional signals into the iterative label cleaning process, we demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源