Paper Title
Optimizing High Performance Markov Clustering for Pre-Exascale Architectures
Paper Authors
Paper Abstract
HipMCL is a high-performance distributed-memory implementation of the popular Markov Cluster Algorithm (MCL) that can cluster large-scale networks within hours using a few thousand CPU-equipped nodes. It relies on sparse matrix computations and makes heavy use of the sparse matrix-sparse matrix multiplication kernel (SpGEMM). The existing parallel algorithms in HipMCL do not scale to exascale architectures, both because their communication costs dominate the runtime at large concurrencies and because they cannot take advantage of increasingly popular accelerators. In this work, we systematically remove the scalability and performance bottlenecks of HipMCL. We enable GPU support by performing the expensive expansion phase of the MCL algorithm on GPUs. We propose a joint CPU-GPU distributed SpGEMM algorithm called pipelined Sparse SUMMA and integrate a fast and accurate probabilistic memory requirement estimator. We develop a new merging algorithm for incrementally processing the partial results produced by the GPUs, which improves overlap efficiency and reduces peak memory usage. We also integrate a recent, faster algorithm for performing SpGEMM on CPUs. We validate our new algorithms and optimizations with extensive evaluations. With GPU support enabled and the new algorithms integrated, HipMCL is up to 12.4x faster and can cluster a network with 70 million proteins and 68 billion connections in just under 15 minutes using 1024 nodes of ORNL's Summit supercomputer.
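
For context, the MCL iteration that HipMCL parallelizes alternates an expansion step, which is exactly the SpGEMM kernel the paper accelerates, with an inflation step. Below is a minimal single-node sketch of that iteration using SciPy sparse matrices; it is purely illustrative and is not HipMCL's distributed CPU-GPU implementation. The function name, inflation parameter, and pruning threshold are assumptions chosen for the example.

    # A minimal sketch of one MCL iteration (illustrative only; HipMCL runs
    # this distributed across nodes and offloads expansion to GPUs).
    import scipy.sparse as sp

    def mcl_iteration(A, inflation=2.0, prune_threshold=1e-4):
        """One expansion + inflation step on a column-stochastic matrix A.

        Assumes every column of A has at least one nonzero entry (MCL adds
        self-loops, so column sums stay positive).
        """
        # Expansion: the SpGEMM kernel that dominates MCL's runtime.
        A = A @ A
        # Inflation: elementwise power, then renormalize each column to sum to 1.
        A = A.power(inflation)
        A = sp.csr_matrix(A.multiply(1.0 / A.sum(axis=0)))
        # Pruning: drop tiny entries to keep the iterate sparse.
        A.data[A.data < prune_threshold] = 0.0
        A.eliminate_zeros()
        return A

In this sketch, the A @ A line corresponds to the step where the paper's pipelined Sparse SUMMA and GPU offload would apply; the inflation and pruning steps are the standard MCL operations that keep the iterate sparse between expansions.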