论文标题

强烈的局部P-norm切割算法,用于半监督学习和本地图聚类

Strongly local p-norm-cut algorithms for semi-supervised learning and local graph clustering

论文作者

Liu, Meng, Gleich, David F.

论文摘要

基于图形的半监督学习是在给定一些示例节点(通常称为种子)的图形节点学习标签函数的问题,通常是在图表的边缘表示标记的相似性的假设下。这与当地的图形聚类或社区检测问题密切相关,即在给定种子周围找到群集或节点社区。对于这个问题,我们提出了文献中随机行走,扩散或平滑函数方法的新颖概括,以凸出p-norm剪切功能。对我们的p-norm方法的需求是,在我们对现有方法的研究中,我们发现基于特征向量,光谱,随机步行或线性系统的原则方法通常难以捕获目标标签或目标群集的正确边界。相比之下,基于1-norm或MaxFlow-Mincut方法捕获边界,但不能从小种子集中生长。使用两者都难以设置参数的混合过程。在本文中,我们提出了涉及p-norms的这些方法背后的目标函数的概括。为了解决P-norm剪切问题,我们给出了一种强烈的本地算法,该算法的运行时取决于输出的大小,而不是图形的大小。我们的方法可以被认为是对安德森 - 钟朗推动程序的非线性概括,可以有效地近似个性化的pagerank vector。我们的程序是一般的,可以解决其他类型的非线性目标函数,例如Huber损失的P-Norm变体。我们通过我们的方法提供了对发现种植的目标簇的理论分析,并表明p-norm削减功能改善了随机步行和光谱方法的标准cheeger不平等现象。最后,我们证明了合成和现实世界数据集中新方法的速度和准确性。我们的代码可在http://github.com/mengliupurdue/slq上找到。

Graph based semi-supervised learning is the problem of learning a labeling function for the graph nodes given a few example nodes, often called seeds, usually under the assumption that the graph's edges indicate similarity of labels. This is closely related to the local graph clustering or community detection problem of finding a cluster or community of nodes around a given seed. For this problem, we propose a novel generalization of random walk, diffusion, or smooth function methods in the literature to a convex p-norm cut function. The need for our p-norm methods is that, in our study of existing methods, we find those principled methods based on eigenvector, spectral, random walk, or linear system often have difficulty capturing the correct boundary of a target label or target cluster. In contrast, 1-norm or maxflow-mincut based methods capture the boundary, but cannot grow from small seed set; hybrid procedures that use both have many hard to set parameters. In this paper, we propose a generalization of the objective function behind these methods involving p-norms. To solve the p-norm cut problem we give a strongly local algorithm -- one whose runtime depends on the size of the output rather than the size of the graph. Our method can be thought as a nonlinear generalization of the Anderson-Chung-Lang push procedure to approximate a personalized PageRank vector efficiently. Our procedure is general and can solve other types of nonlinear objective functions, such as p-norm variants of Huber losses. We provide a theoretical analysis of finding planted target clusters with our method and show that the p-norm cut functions improve on the standard Cheeger inequalities for random walk and spectral methods. Finally, we demonstrate the speed and accuracy of our new method in synthetic and real world datasets. Our code is available at http://github.com/MengLiuPurdue/SLQ.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源