Paper Title

Learning from Data Streams: An Overview and Update

Paper Authors

Jesse Read, Indrė Žliobaitė

Abstract

The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
