论文标题
查询有效的相关聚类
Query-Efficient Correlation Clustering
论文作者
论文摘要
相关聚类可以说是聚类最自然的表述。给定n个对象和成对的相似性度量,目标是将对象聚类,以便在最佳范围内将相似的对象放在相同的群集中,而不同的对象则将不同的对象放在不同的群集中。 相关聚类的主要缺点是它需要输入$θ(n^2)$成对相似性。这通常是不可行的,甚至只是存储。在本文中,我们研究\ emph {Query效率}算法用于相关聚类。具体来说,我们设计了一种相关聚类算法,鉴于$ q $ Queries的预算,它达到了一种解决方案,该解决方案的预期分歧数量最多为$ 3 \ cdot opt + o(\ frac {n^3} {q} {q})$,在$ opt $的情况下,$ opt $是该实例的最佳成本。它的运行时间为$ O(Q)$,可以轻松地使非自适应(意味着它可以在一开始指定其所有查询,并并行地指定),并具有相同的保证。最终,我们的算法在查询$ Q $的数量与最严重的案例错误之间,即使对于自适应算法也可以实现最佳的权衡。 最后,我们对合成和真实数据的提议方法进行了一项实验研究,显示了算法的可扩展性和准确性。
Correlation clustering is arguably the most natural formulation of clustering. Given n objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. A main drawback of correlation clustering is that it requires as input the $Θ(n^2)$ pairwise similarities. This is often infeasible to compute or even just to store. In this paper we study \emph{query-efficient} algorithms for correlation clustering. Specifically, we devise a correlation clustering algorithm that, given a budget of $Q$ queries, attains a solution whose expected number of disagreements is at most $3\cdot OPT + O(\frac{n^3}{Q})$, where $OPT$ is the optimal cost for the instance. Its running time is $O(Q)$, and can be easily made non-adaptive (meaning it can specify all its queries at the outset and make them in parallel) with the same guarantees. Up to constant factors, our algorithm yields a provably optimal trade-off between the number of queries $Q$ and the worst-case error attained, even for adaptive algorithms. Finally, we perform an experimental study of our proposed method on both synthetic and real data, showing the scalability and the accuracy of our algorithm.