Paper Title
Development of an Equation-based Parallelization Method for Multiphase Particle-in-Cell Simulations
Paper Authors
Paper Abstract
Manufacturers have been developing new graphics processing unit (GPU) nodes with large-capacity, high-bandwidth memory and very high-bandwidth intra-node interconnects. This makes it inexpensive to move large amounts of data between GPUs on the same node. However, small-packet bandwidth and latency have not improved, which makes global dot products expensive. These characteristics favor a new kind of problem decomposition, called "equation decomposition," over traditional domain decomposition. In this approach, each GPU is assigned one equation set to solve in parallel, eliminating the frequent and expensive dot-product synchronization points of traditional distributed linear solvers. In exchange, the method involves frequent movement of state variables over the high-bandwidth intra-node interconnects. To test this theory, our flagship code, Multiphase Flow with Interphase eXchanges (MFiX), was ported to TensorFlow. This new product, known as MFiX-AI, produces near-identical results to the original version of MFiX with significant acceleration in multiphase particle-in-cell (MP-PIC) simulations. The performance of a single node with 4 NVIDIA A100s connected over NVLink 2.0 was shown to be competitive with 1,000 CPU cores (25 nodes) on the JOULE 2.0 supercomputer, yielding energy savings of up to 90%. This is a substantial performance benefit for small- to intermediate-sized problems, and the benefit is expected to grow as GPU nodes become more powerful. Further, MFiX-AI is poised to accept native artificial intelligence/machine learning models for further acceleration and development.
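The contrast between the two decompositions can be illustrated with a minimal sketch (a hypothetical toy, not MFiX-AI code): each worker owns one *equation* over the whole domain rather than one sub-domain of every equation, so its inner solve needs no global reductions, and workers only exchange full state fields between outer iterations. The coupled system and the `solve_equation_*` helpers below are invented for illustration.

```python
import numpy as np

# Toy "equation decomposition": two workers (standing in for GPUs)
# each own one equation of a coupled fixed-point system
#   u = 0.5*v + 1,   v = 0.5*u + 1
# Inner solves involve no dot products or synchronization; the only
# inter-worker communication is the state exchange between outer
# iterations, which on real hardware would ride the fast intra-node
# interconnect.

def solve_equation_u(v):
    # Worker 0: solves its equation for u given the latest v field.
    return 0.5 * v + 1.0

def solve_equation_v(u):
    # Worker 1: solves its equation for v given the latest u field.
    return 0.5 * u + 1.0

def equation_decomposition_solve(n=4, outer_iters=40):
    u = np.zeros(n)
    v = np.zeros(n)
    for _ in range(outer_iters):
        # Both solves are independent and could run concurrently.
        u_new = solve_equation_u(v)
        v_new = solve_equation_v(u)
        # State exchange: the sole communication step per iteration.
        u, v = u_new, v_new
    return u, v

u, v = equation_decomposition_solve()
# The outer (Picard-style) iteration converges to the fixed point
# u = v = 2 everywhere on the grid.
```

Domain decomposition would instead split the grid across workers and require a global dot product inside every Krylov iteration of each solve; the sketch above trades those synchronizations for bulk state movement, which is the regime the abstract argues modern GPU nodes favor.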