Paper Title

Data Budgeting for Machine Learning

Authors

Xinyi Zhao, Weixin Liang, James Zou

Abstract

Data is the fuel that powers AI and creates tremendous value in many domains. However, collecting datasets for AI is a time-consuming, expensive, and complicated endeavor. For practitioners, data investment remains a leap of faith. In this work, we study the data budgeting problem and formulate it as two sub-problems: predicting (1) the saturating performance that would be reached given enough data, and (2) how many data points are needed to get close to that saturating performance. Unlike traditional dataset-independent methods such as PowerLaw, we propose a learning-based method for data budgeting. To support and systematically evaluate learning-based data budgeting, we curate a large collection of 383 tabular ML datasets, along with their data vs. performance curves. Our empirical evaluation shows that data budgeting is possible given a small pilot-study dataset with as few as $50$ data points.
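To make the two sub-problems concrete, here is a minimal sketch of the kind of dataset-independent power-law baseline the abstract contrasts against. It is not the authors' learning-based method. The functional form `score(n) ≈ a - b * n**(-c)`, the grid over `c`, the tolerance `eps`, and the function names `fit_power_law` and `points_needed` are all assumptions made for illustration. Under that assumed form, `a` plays the role of the saturating performance, and the data requirement is the smallest `n` whose predicted score is within `eps` of `a`.

```python
import math


def fit_power_law(sizes, scores, c_grid=None):
    """Fit score(n) ~= a - b * n**(-c) (an assumed functional form).

    For each candidate exponent c, the model is linear in (a, b), so we
    solve ordinary least squares in closed form and keep the c with the
    smallest squared error. Returns (a, b, c).
    """
    if c_grid is None:
        # Hypothetical search grid for the exponent: 0.05, 0.10, ..., 2.0
        c_grid = [0.05 * k for k in range(1, 41)]
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, scores))
        slope = sxy / sxx              # slope of score against n**(-c), equals -b
        a = y_mean - slope * x_mean    # intercept: predicted score as n grows
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


def points_needed(a, b, c, eps=0.01):
    """Smallest n such that a - score(n) <= eps under the fitted curve."""
    if b <= 0:
        return 1  # fitted curve is already at or above its asymptote
    return math.ceil((b / eps) ** (1.0 / c))


if __name__ == "__main__":
    # Synthetic pilot-study curve (made-up numbers, not from the paper).
    sizes = [10, 20, 30, 40, 50]
    scores = [0.9 - 0.5 * n ** (-0.5) for n in sizes]
    a, b, c = fit_power_law(sizes, scores)
    print("estimated saturating performance:", a)
    print("estimated points needed:", points_needed(a, b, c, eps=0.01))
```

The sketch needs at least two distinct training-set sizes, and it extrapolates from the pilot curve alone, which is the limitation a method learned across many datasets is meant to address.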
