Paper Title

Striving for data-model efficiency: Identifying data externalities on group performance

Paper Authors

Esther Rolf, Ben Packer, Alex Beutel, Fernando Diaz

Abstract

Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance. In this work, we seek to better understand how we might characterize, detect, and design for data-model synergies. We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population, a phenomenon we refer to as negative data externalities on group performance. Such externalities can arise in standard learning settings and can manifest differently depending on conditions between training set size and model size. Data externalities directly imply a lower bound on feasible model improvements, yet improving models efficiently requires understanding the underlying data-model tensions. From a broader perspective, our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
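As a rough illustration of the phenomenon described in the abstract (not the authors' experimental setup), the sketch below trains the same classifier with and without data from an additional source and compares accuracy on a held-out sub-group. All data, model choices, and the `make_source` helper are hypothetical; a negative difference corresponds to a negative data externality on that group.

```python
# Hypothetical sketch: detecting a negative data externality on a sub-group.
# Train the same model with and without an extra data source, then compare
# accuracy on a sub-group of interest. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_source(n, shift):
    # Synthetic binary-classification data; `shift` controls how much this
    # source's feature distribution and labeling rule differ from the sub-group's.
    X = rng.normal(loc=shift, scale=1.0, size=(n, 5))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > shift).astype(int)
    return X, y

# Base training data and an additional source whose distribution differs.
X_base, y_base = make_source(2000, shift=0.0)
X_extra, y_extra = make_source(2000, shift=2.0)

# Held-out evaluation data for the sub-group of interest (drawn like the base source).
X_eval, y_eval = make_source(1000, shift=0.0)

def subgroup_accuracy(X_train, y_train):
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_eval, y_eval)

acc_without = subgroup_accuracy(X_base, y_base)
acc_with = subgroup_accuracy(
    np.vstack([X_base, X_extra]), np.concatenate([y_base, y_extra])
)

# A negative value means adding the extra source lowered sub-group performance.
print(f"sub-group accuracy without extra source: {acc_without:.3f}")
print(f"sub-group accuracy with extra source:    {acc_with:.3f}")
print(f"externality (with - without):            {acc_with - acc_without:.3f}")
```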
