Title

Parallel Algorithms for Successive Convolution

Authors

Christlieb, Andrew J., Guthrey, Pierson T., Sands, William A., Thavappiragasm, Mathialakan

Abstract

In this work, we consider alternative discretizations for PDEs that use expansions involving integral operators to approximate spatial derivatives. These constructions use explicit information within the integral terms, but treat boundary data implicitly, which contributes to the overall speed of the method. This approach is provably unconditionally stable for linear problems, and stability has been demonstrated experimentally for nonlinear problems. Additionally, it is matrix-free in the sense that it is not necessary to invert linear systems and iteration is not required for nonlinear terms. Moreover, the scheme employs a fast summation algorithm that yields a method with a computational complexity of $\mathcal{O}(N)$, where $N$ is the number of mesh points along a direction. While much work has been done to explore the theory behind these methods, their practicality in large-scale computing environments is a largely unexplored topic. In this work, we explore the performance of these methods by developing a domain decomposition algorithm suitable for distributed memory systems along with shared memory algorithms. As a first pass, we derive an artificial CFL condition that enforces a nearest-neighbor communication pattern and briefly discuss possible generalizations. We also analyze several approaches for implementing the parallel algorithms by optimizing predominant loop structures and maximizing data reuse. Using a hybrid design that employs MPI and Kokkos for the distributed and shared memory components of the algorithms, respectively, we show that our methods are efficient and can sustain an update rate $> 1\times10^8$ DOF/node/s. We provide results that demonstrate the scalability and versatility of our algorithms using several different PDE test problems, including a nonlinear example that employs an adaptive time-stepping rule.
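The $\mathcal{O}(N)$ fast summation mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the kernel or the quadrature, so everything below is an assumption made for illustration: a one-sided exponential kernel $\alpha e^{-\alpha(x-y)}$, a uniform grid, and a trapezoid rule on each cell. The function names `left_convolution` and `left_convolution_direct` are hypothetical and are not taken from the paper. The point is only that an exponential kernel lets the integral up to $x_i$ be updated from the integral up to $x_{i-1}$ with one multiplication and one local term, so a full sweep costs $\mathcal{O}(N)$ rather than $\mathcal{O}(N^2)$.

```python
import math


def left_convolution(v, dx, alpha):
    """Approximate I[i] = alpha * integral_{x_0}^{x_i} exp(-alpha (x_i - y)) v(y) dy.

    Assumed setup (not from the paper): uniform spacing dx, trapezoid rule per cell.
    Uses the recurrence I[i] = exp(-alpha dx) * I[i-1] + local_i, so cost is O(N).
    """
    decay = math.exp(-alpha * dx)
    out = [0.0] * len(v)
    for i in range(1, len(v)):
        # Trapezoid rule on the single cell [x_{i-1}, x_i]; the kernel equals
        # `decay` at the left endpoint and 1 at the right endpoint.
        local = 0.5 * alpha * dx * (decay * v[i - 1] + v[i])
        out[i] = decay * out[i - 1] + local
    return out


def left_convolution_direct(v, dx, alpha):
    """O(N^2) reference: the same quadrature, summed from scratch at every point."""
    n = len(v)
    out = [0.0] * n
    for i in range(n):
        total = 0.0
        for j in range(1, i + 1):
            w_left = math.exp(-alpha * dx * (i - (j - 1)))
            w_right = math.exp(-alpha * dx * (i - j))
            total += 0.5 * alpha * dx * (w_left * v[j - 1] + w_right * v[j])
        out[i] = total
    return out


if __name__ == "__main__":
    samples = [math.sin(0.1 * i) for i in range(50)]
    fast = left_convolution(samples, 0.1, 3.0)
    slow = left_convolution_direct(samples, 0.1, 3.0)
    # Largest difference between the recursive and the direct evaluation.
    print(max(abs(a - b) for a, b in zip(fast, slow)))
```

Both functions apply the same quadrature, so they agree to rounding error; only the operation count differs. A right-going sweep would follow the same pattern in the opposite direction.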
