Paper Title
Data augmentation on-the-fly and active learning in data stream classification
Paper Authors
Paper Abstract
There is an emerging need for predictive models to be trained on-the-fly, since in numerous machine learning applications data arrive in an online fashion. A critical challenge encountered is the limited availability of ground truth information (e.g., labels in classification tasks) as new data are observed one-by-one online, while another significant challenge is class imbalance. This work introduces the novel Augmented Queues method, which addresses this dual problem by combining, in a synergistic manner, online active learning, data augmentation, and a multi-queue memory that maintains separate and balanced queues for each class. We perform an extensive experimental study using image and time-series augmentations, in which we examine the roles of the active learning budget, memory size, imbalance level, and neural network type. We demonstrate two major advantages of Augmented Queues. First, it does not reserve additional memory space, as the generation of synthetic data occurs only at training time. Second, learning models have access to more labelled data without the need to increase the active learning budget and/or the original memory size. Learning on-the-fly poses major challenges which typically hinder the deployment of learning models. Augmented Queues significantly improves performance in terms of learning quality and speed. Our code is made publicly available.
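The abstract's core data structure, a class-balanced multi-queue memory whose augmentation is applied only when a training batch is drawn, can be sketched as follows. This is a minimal illustrative sketch, not the authors' released implementation; the class name `MultiQueueMemory`, the even per-class capacity split, and the `augment` callable are all assumptions made for the example.

```python
import random
from collections import deque

class MultiQueueMemory:
    """Hypothetical sketch of a multi-queue memory: one bounded FIFO
    queue per class, so the stored data stay balanced across classes."""

    def __init__(self, num_classes, memory_size):
        # Assumption: total capacity is split evenly across classes.
        per_class = max(1, memory_size // num_classes)
        self.queues = {c: deque(maxlen=per_class) for c in range(num_classes)}

    def add(self, x, y):
        # When a class queue is full, the oldest example is evicted,
        # keeping per-class counts bounded and balanced.
        self.queues[y].append(x)

    def sample_batch(self, batch_size, augment):
        # Augmentation is applied here, at training time only, so no
        # synthetic example ever occupies memory space.
        non_empty = [c for c, q in self.queues.items() if q]
        batch = []
        for _ in range(batch_size):
            c = random.choice(non_empty)
            x = random.choice(list(self.queues[c]))
            batch.append((augment(x), c))
        return batch
```

In an online setting, an active learning strategy would decide (within its budget) which arriving examples to label and `add`; each training step then calls `sample_batch` with a stochastic augmentation, so the model effectively sees more labelled data than the memory physically holds.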