Paper title
Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection
Paper authors
Abstract
Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called the contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage the outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically, on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.
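The thresholding step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's estimation method: it only shows how a given contamination factor (for example, a posterior mean obtained elsewhere) could be turned into a score threshold. The function name `threshold_by_contamination`, the synthetic scores, and the value 0.05 are all assumptions made for illustration.

```python
import random


def threshold_by_contamination(scores, contamination):
    """Flag roughly a `contamination` fraction of the highest scores.

    Hypothetical helper for illustration; higher score = more anomalous.
    """
    n = len(scores)
    k = round(contamination * n)  # number of examples to flag
    if k == 0:
        # Nothing should be flagged: no finite score reaches the threshold.
        return float("inf"), [0] * n
    ranked = sorted(scores, reverse=True)
    threshold = ranked[k - 1]  # the k-th largest score
    # Ties at the threshold are all flagged, so slightly more than k
    # examples can be marked when scores repeat.
    labels = [1 if s >= threshold else 0 for s in scores]
    return threshold, labels


# Synthetic anomaly scores (assumed data): 950 normal and 50 anomalous examples.
random.seed(0)
scores = [random.gauss(0.0, 1.0) for _ in range(950)]
scores += [random.gauss(6.0, 1.0) for _ in range(50)]

# 0.05 stands in for an estimated contamination factor.
threshold, labels = threshold_by_contamination(scores, contamination=0.05)
print(f"threshold={threshold:.3f}, flagged={sum(labels)} of {len(labels)}")
```

Ranking the scores and cutting at the k-th largest keeps the flagged proportion equal to the contamination factor whenever the scores are distinct, which is the property the abstract describes.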