用于分层群集的司级算法

论文标题

用于分层群集的司级算法

Sublinear Algorithms for Hierarchical Clustering

论文作者

Agarwal, Arpit, Khanna, Sanjeev, Li, Huan, Patil, Prathamesh

论文摘要

图形上的分层聚类是数据挖掘和机器学习中的一项基本任务，并在系统发育学，社交网络分析和信息检索等领域中进行了应用。具体而言，我们考虑了由于Dasgupta引起的层次聚类的最近普及的目标函数。以前（大约）最小化此目标函数的算法需要线性时间/空间复杂性。在许多应用程序中，底层图的大小可能是巨大的，即使使用线性时间/空间算法，也可以在计算上具有挑战性。结果，人们对设计只能使用sublinear资源执行全局计算的算法有浓厚的兴趣。这项工作的重点是在三个精心绘制的肌图计算模型下研究层次聚类，分别集中于空间，时间和沟通，作为要优化的主要资源：（1）（动态）流式模型，其中将边缘作为流作为流的图表，（2）使用邻居和图形的图形，该图形是Query and neighbor的，该图形是QUERED的，该图形是Queries at ew sypriping and Queries（3）模型（3）模型（3）。通过通信渠道。我们在上面的所有三个模型中设计用于层次聚类的sublinear算法。算法结果的核心是图表中的剪切方面的视图，这使我们能够使用宽松的剪刀示意图进行分层聚类，同时仅引入目标函数中的较小失真。然后，我们的主要算法贡献是如何在查询模型和MPC模型中有效地构建所需形式的剪切稀疏器。我们通过建立几乎匹配的下限来补充我们的算法结果，这些下限排除了在每个模型中设计更好的算法的可能性。

Hierarchical clustering over graphs is a fundamental task in data mining and machine learning with applications in domains such as phylogenetics, social network analysis, and information retrieval. Specifically, we consider the recently popularized objective function for hierarchical clustering due to Dasgupta. Previous algorithms for (approximately) minimizing this objective function require linear time/space complexity. In many applications the underlying graph can be massive in size making it computationally challenging to process the graph even using a linear time/space algorithm. As a result, there is a strong interest in designing algorithms that can perform global computation using only sublinear resources. The focus of this work is to study hierarchical clustering for massive graphs under three well-studied models of sublinear computation which focus on space, time, and communication, respectively, as the primary resources to optimize: (1) (dynamic) streaming model where edges are presented as a stream, (2) query model where the graph is queried using neighbor and degree queries, (3) MPC model where the graph edges are partitioned over several machines connected via a communication channel. We design sublinear algorithms for hierarchical clustering in all three models above. At the heart of our algorithmic results is a view of the objective in terms of cuts in the graph, which allows us to use a relaxed notion of cut sparsifiers to do hierarchical clustering while introducing only a small distortion in the objective function. Our main algorithmic contributions are then to show how cut sparsifiers of the desired form can be efficiently constructed in the query model and the MPC model. We complement our algorithmic results by establishing nearly matching lower bounds that rule out the possibility of designing better algorithms in each of these models.

下载PDF全文

下载文献需遵守相关版权规定

论文标题