Paper title
Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees
Paper authors
Paper abstract
Real-world datasets contain incorrectly labeled instances that hamper model performance and, in particular, the ability to generalize out of distribution. Moreover, each example may contribute differently to learning. This motivates studies toward a better understanding of the role individual data instances play in a model's metrics. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which ensembles of decision trees are still the state of the art in terms of performance. Our method achieved the best results overall when compared with confident learning, direct heuristics, and a robust boosting algorithm. We show results on detecting noisy labels in order to clean datasets, on improving model metrics in synthetic and real public datasets, and on an industry case in which we deployed a model based on the proposed solution.
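The abstract does not specify which training-dynamics metrics the authors compute, so the following is only a minimal sketch of the general idea: track, at every boosting iteration, the probability a GBDT assigns to each example's given label, then flag low-confidence examples as candidate label errors. The use of scikit-learn's `GradientBoostingClassifier` and the mean/std summary statistics are assumptions for illustration, not the paper's exact method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy binary dataset with a few labels flipped to simulate annotation noise.
rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
noisy = rng.choice(len(y), size=25, replace=False)
y_noisy = y.copy()
y_noisy[noisy] ^= 1  # flip the labels of the selected examples

gbdt = GradientBoostingClassifier(n_estimators=100, random_state=0)
gbdt.fit(X, y_noisy)

# Training dynamics: probability assigned to each example's (possibly noisy)
# label at every boosting stage. Shape: (n_stages, n_samples).
stage_probs = np.array(
    [p[np.arange(len(y_noisy)), y_noisy] for p in gbdt.staged_predict_proba(X)]
)

confidence = stage_probs.mean(axis=0)   # mean probability of the given label
variability = stage_probs.std(axis=0)   # spread across boosting iterations

# Examples with the lowest confidence are candidate label errors.
suspects = np.argsort(confidence)[: len(noisy)]
```

In this sketch, mislabeled examples tend to receive low probability for their (flipped) label throughout boosting, so ranking by `confidence` surfaces them for review or removal before retraining on the cleaned data.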