A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments

Paper Title

A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments

Authors

Randall Balestriero, Ishan Misra, Yann LeCun

Abstract

Data-Augmentation (DA) is known to improve performance across tasks and datasets. We propose a method to theoretically analyze the effect of DA and study questions such as: how many augmented samples are needed to correctly estimate the information encoded by that DA? How does the augmentation policy impact the final parameters of a model? We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model's output under a given DA distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge. We show that for a training loss to be stable under DA sampling, the model's saliency map (gradient of the loss with respect to the model's input) must align with the smallest eigenvector of the sample variance under the considered DA, hinting at a possible explanation of why models tend to shift their focus from edges to textures.
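
To make the sample-count question concrete, here is a minimal sketch that compares a closed-form expected image with its Monte Carlo estimate. It assumes a toy augmentation (a circular horizontal shift drawn uniformly from a small range) chosen only for illustration; it is not the augmentation family or the derivation used in the paper, and the function names are hypothetical.

```python
import numpy as np

def expected_shifted_image(image, max_shift):
    """Exact mean of the image under uniform circular horizontal shifts.

    Assumption: the augmentation is a shift s drawn uniformly from
    {-max_shift, ..., max_shift}, so the expectation is a plain average.
    """
    shifts = range(-max_shift, max_shift + 1)
    return np.mean([np.roll(image, s, axis=1) for s in shifts], axis=0)

def monte_carlo_mean(image, max_shift, n_samples, rng):
    """Empirical mean over n_samples randomly shifted copies of the image."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=n_samples)
    total = np.zeros_like(image, dtype=float)
    for s in shifts:
        total += np.roll(image, int(s), axis=1)
    return total / n_samples

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # random stand-in for a real image
exact = expected_shifted_image(image, max_shift=2)

# Distance between the sampled mean and the exact mean for growing budgets.
errors = {
    n: float(np.linalg.norm(monte_carlo_mean(image, 2, n, rng) - exact))
    for n in (10, 100, 10000)
}
for n, err in errors.items():
    print(f"n={n:>6}  error={err:.4f}")
```

The sampled estimate approaches the exact mean only at the usual Monte Carlo rate of roughly 1/sqrt(n), which is one way to see why many augmented samples may be needed when the expectation is not available analytically.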
