Adapting $k$-means algorithms for outliers

Title

Adapting $k$-means algorithms for outliers

Authors

Christoph Grunau, Václav Rozhoň

Abstract

This paper shows how to adapt several simple and classical sampling-based algorithms for the $k$-means problem to the setting with outliers. Recently, Bhaskara et al. (NeurIPS 2019) showed how to adapt the classical $k$-means++ algorithm to the setting with outliers. However, their algorithm needs to output $O(\log (k) \cdot z)$ outliers, where $z$ is the number of true outliers, to match the $O(\log k)$-approximation guarantee of $k$-means++. In this paper, we build on their ideas and show how to adapt several sequential and distributed $k$-means algorithms to the setting with outliers, but with substantially stronger theoretical guarantees: our algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1 / \varepsilon)$-approximation to the objective function. In the sequential world, we achieve this by adapting a recent algorithm of Lattanzi and Sohler (ICML 2019). In the distributed setting, we adapt a simple algorithm of Guha et al. (IEEE Trans. Know. and Data Engineering 2003) and the popular $k$-means$\|$ of Bahmani et al. (PVLDB 2012). A theoretical application of our techniques is an algorithm with running time $\tilde{O}(nk^2/z)$ that achieves an $O(1)$-approximation to the objective function while outputting $O(z)$ outliers, assuming $k \ll z \ll n$. This is complemented with a matching lower bound of $\Omega(nk^2/z)$ for this problem in the oracle model.
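The abstract refers to "the objective function" without writing it out. As a minimal sketch, assuming the standard formulation of $k$-means with outliers (pick $k$ centers, discard a budget of points, and sum the squared distances of the remaining points to their nearest center), the cost of a candidate solution could be evaluated as below. The function names `sq_dist` and `kmeans_outlier_cost` are hypothetical and not taken from the paper, and this only evaluates a given set of centers; it does not implement any of the paper's algorithms.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_outlier_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    num_outliers: int,
) -> Tuple[float, List[int]]:
    """Evaluate a candidate solution for k-means with outliers.

    Assumed objective (standard formulation, not quoted from the paper):
    discard the `num_outliers` points farthest from their nearest center,
    then sum the squared distances of the remaining points.
    Returns (cost of the kept points, sorted indices of discarded points).
    """
    # Squared distance from every point to its nearest center.
    nearest = [min(sq_dist(p, c) for c in centers) for p in points]
    # Indices ordered from closest to farthest.
    order = sorted(range(len(points)), key=lambda i: nearest[i])
    n_keep = max(len(points) - num_outliers, 0)
    cost = sum(nearest[i] for i in order[:n_keep])
    return cost, sorted(order[n_keep:])


if __name__ == "__main__":
    pts = [[0, 0], [1, 0], [0, 1], [100, 100]]
    cost, discarded = kmeans_outlier_cost(pts, centers=[[0, 0]], num_outliers=1)
    print(cost, discarded)
```

Under this reading, the guarantee stated in the abstract compares the cost obtained with a discard budget of $(1+\varepsilon)z$ against the optimal cost with budget $z$; calling the function with `num_outliers` set to each budget shows how much the larger budget helps on a given instance.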
