Paper Title

Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation

Authors

Jiawei Du, Yidi Jiang, Vincent Y. F. Tan, Joey Tianyi Zhou, Haizhou Li

Abstract

Model-based deep learning has achieved astounding successes, due in part to the availability of large-scale real-world data. However, processing such massive amounts of data comes at a considerable cost in terms of computation, storage, training, and the search for good neural architectures. Dataset distillation has thus recently come to the fore. This paradigm involves distilling information from large real-world datasets into tiny and compact synthetic datasets such that processing the latter ideally yields performance similar to the former. State-of-the-art methods primarily rely on learning the synthetic dataset by matching the gradients obtained during training between the real and synthetic data. However, these gradient-matching methods suffer from the so-called accumulated trajectory error caused by the discrepancy between the distillation and the subsequent evaluation. To mitigate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. We show that the weights trained on synthetic data are robust against perturbations from the accumulated error when training is regularized towards a flat trajectory. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7% on a subset of the ImageNet dataset with higher-resolution images. We also validate the effectiveness and generalizability of our method on datasets of different resolutions and demonstrate its applicability to neural architecture search. Code is available at https://github.com/AngusDujw/FTD-distillation.
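
The abstract describes the trajectory-matching recipe that gradient-matching distillation methods (including FTD) build on only at a high level. The sketch below makes the matching objective concrete: synthetic images are optimized so that a few training steps on them move a network's weights along the same path as an "expert" trajectory recorded on real data. It is a minimal illustration under assumed names (SimpleNet, distill_step) and hyperparameters, not the authors' released implementation.

```python
# Minimal trajectory-matching sketch (illustrative; names and values are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call


class SimpleNet(nn.Module):
    """Tiny ConvNet standing in for the distillation backbone."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))


def distill_step(syn_images, syn_labels, expert_start, expert_end,
                 inner_steps: int = 5, inner_lr: float = 0.01):
    """One outer update of the synthetic set.

    expert_start / expert_end are parameter snapshots (state_dicts) taken from
    a trajectory trained on real data; the synthetic images are optimized so
    that a few SGD steps on them carry a student from expert_start towards
    expert_end.
    """
    student = SimpleNet()
    student.load_state_dict(expert_start)

    # Inner loop: differentiate through a few SGD steps on the synthetic data,
    # so gradients flow back to syn_images via create_graph=True.
    params = {name: p.clone() for name, p in student.named_parameters()}
    for _ in range(inner_steps):
        logits = functional_call(student, params, (syn_images,))
        loss = F.cross_entropy(logits, syn_labels)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

    # Trajectory-matching loss: distance between the student endpoint and the
    # expert endpoint, normalized by how far the expert itself moved.
    student_vec = torch.cat([p.reshape(-1) for p in params.values()])
    start_vec = torch.cat([p.reshape(-1) for p in expert_start.values()])
    end_vec = torch.cat([p.reshape(-1) for p in expert_end.values()])
    return F.mse_loss(student_vec, end_vec) / F.mse_loss(start_vec, end_vec)
```

In a full pipeline, syn_images would be an nn.Parameter updated by an outer optimizer across many expert segments. FTD's addition, per the abstract, is to regularize the expert trajectory on real data towards flat regions of the loss landscape, so that the small per-segment matching errors do not accumulate when the distilled set is later used to train a network from scratch.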
