Paper Title
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Paper Authors
Paper Abstract
Graphics processing units (GPUs) are now considered the leading hardware for accelerating general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitectural features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and to build more efficient workloads and applications. Many works have studied recent Nvidia architectures, such as Volta and Turing, and compared them to their successor, Ampere. However, some microarchitectural features, such as the clock cycles required by different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles of PTX ISA instructions and of their SASS ISA counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and its throughput for different data types and input shapes. The results of this work should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict hardware performance.
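The abstract does not spell out how a per-instruction cycle count is derived from a microbenchmark, so the following is only a minimal sketch of the usual arithmetic, not the authors' method. It assumes a benchmark that reads the GPU clock counter before and after a chain of dependent instructions, plus a baseline run with an empty chain to estimate the cost of reading the clock. The function names and parameters (`clock_overhead`, `cycles_per_instruction`, `overhead_cycles`) are hypothetical.

```python
def clock_overhead(baseline_start, baseline_stop):
    """Cycles spent between two back-to-back clock reads with no work in between.

    Assumption: the baseline run uses the same timing code as the real
    benchmark, only with an empty instruction chain.
    """
    return baseline_stop - baseline_start


def cycles_per_instruction(start, stop, n_instructions, overhead_cycles=0):
    """Average clock cycles per instruction in a chain of dependent instructions.

    start, stop      -- clock counter values read before and after the chain
    n_instructions   -- number of instructions in the chain
    overhead_cycles  -- measurement overhead to subtract (see clock_overhead)
    """
    if n_instructions <= 0:
        raise ValueError("n_instructions must be positive")
    elapsed = stop - start - overhead_cycles
    if elapsed < 0:
        raise ValueError("overhead exceeds the measured interval")
    return elapsed / n_instructions


if __name__ == "__main__":
    # Invented counter values for illustration only; not results from the paper.
    overhead = clock_overhead(baseline_start=500, baseline_stop=502)
    print(cycles_per_instruction(1000, 1514, 128, overhead_cycles=overhead))
```

Making each instruction in the chain depend on the previous one means the quotient approximates latency; a chain of independent instructions would instead approximate issue throughput.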