Rare Yet Popular: Evidence and Implications from Labeled Datasets for Network Anomaly Detection

Paper Title

Rare Yet Popular: Evidence and Implications from Labeled Datasets for Network Anomaly Detection

Authors

Jose Manuel Navarro, Alexis Huet, Dario Rossi

Abstract

Anomaly detection research works generally propose algorithms or end-to-end systems that are designed to automatically discover outliers in a dataset or a stream. While literature abounds concerning algorithms or the definition of metrics for better evaluation, the quality of the ground truth against which they are evaluated is seldom questioned. In this paper, we present a systematic analysis of available public (and additionally our private) ground truth for anomaly detection in the context of network environments, where data is intrinsically temporal, multivariate and, in particular, exhibits spatial properties, which, to the best of our knowledge, we are the first to explore. Our analysis reveals that, while anomalies are, by definition, temporally rare events, their spatial characterization clearly shows that some types of anomalies are significantly more popular than others. We find that simple clustering can reduce the need for human labeling by a factor of 2x-10x, a reduction that we are the first to quantitatively analyze in the wild.
