Paper Title

LOSDD: Leave-Out Support Vector Data Description for Outlier Detection

Paper Authors

Daniel Boiar, Thomas Liebig, Erich Schubert

Paper Abstract

Support Vector Machines have been used successfully for one-class classification (OCSVM, SVDD) when trained on clean data, but they perform much worse on dirty data: outliers present in the training data tend to become support vectors and are hence considered "normal". In this article, we improve the effectiveness of detecting outliers in dirty training data with a leave-out strategy: by temporarily omitting one candidate at a time, each point can be judged using only the remaining data. We show that this scores the outlierness of points more effectively than using the slack term of existing SVM-based approaches. Identified outliers can then be removed from the data, so that outliers hidden by other outliers become detectable, reducing the problem of masking. Naively, this approach would require training $N$ individual SVMs (and training $O(N^2)$ SVMs when iteratively removing the worst outliers one at a time), which is prohibitively expensive. We discuss how only support vectors need to be considered in each step, and how this incremental retraining can be accelerated substantially by reusing SVM parameters and weights. By removing candidates in batches, we can further reduce the processing time, although it obviously remains more costly than training a single SVM.
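To make the leave-out strategy concrete, here is a minimal sketch in Python, using scikit-learn's OneClassSVM as a stand-in for SVDD (with an RBF kernel the two formulations are closely related). Everything below is an illustrative assumption rather than the authors' reference implementation: the function names leave_out_scores and remove_outliers_in_batches are invented for this example, and scikit-learn refits each reduced model from scratch, so it does not realize the warm-start reuse of SVM parameters and weights that the paper relies on for acceleration.

import numpy as np
from sklearn.svm import OneClassSVM


def leave_out_scores(X, nu=0.1, gamma="scale"):
    # Fit the full model once. A non-support vector does not define the
    # decision boundary, so removing it leaves the model unchanged and
    # its ordinary decision value can be kept as its score.
    full = OneClassSVM(nu=nu, gamma=gamma).fit(X)
    scores = full.decision_function(X)  # higher = more "normal"
    # Only the support vectors need to be rescored by leave-out.
    for i in full.support_:
        mask = np.ones(len(X), dtype=bool)
        mask[i] = False
        reduced = OneClassSVM(nu=nu, gamma=gamma).fit(X[mask])
        scores[i] = reduced.decision_function(X[i:i + 1])[0]
    return scores


def remove_outliers_in_batches(X, n_rounds=5, batch_size=10, nu=0.1):
    # Iteratively drop the worst-scoring candidates in batches, so that
    # outliers masked by already-removed outliers can surface in later rounds.
    keep = np.arange(len(X))
    for _ in range(n_rounds):
        scores = leave_out_scores(X[keep], nu=nu)
        worst = np.argsort(scores)[:batch_size]  # lowest score = most outlying
        keep = np.delete(keep, worst)
    return keep  # indices of the points retained as inliers

Restricting the leave-out refits to support vectors and removing candidates in batches keeps the number of SVM trainings to roughly n_rounds times the number of support vectors per round, rather than the naive $O(N^2)$, though it clearly remains more expensive than a single fit.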
