Title
High-Throughput GPU Implementation of Dilithium Post-Quantum Digital Signature
Authors
Abstract
Digital signatures are fundamental building blocks in various protocols to provide integrity and authenticity. The development of quantum computing has raised concerns about the security guarantees afforded by classical signature schemes. CRYSTALS-Dilithium is an efficient post-quantum digital signature scheme based on lattice cryptography and has been selected as the primary algorithm for standardization by the National Institute of Standards and Technology. In this work, we present a high-throughput GPU implementation of Dilithium. For individual operations, we employ a range of computational and memory optimizations to overcome sequential constraints, reduce memory usage and IO latency, address bank conflicts, and mitigate pipeline stalls. This results in high and balanced compute throughput and memory throughput for each operation. In terms of concurrent task processing, we leverage task-level batching to fully utilize parallelism and implement a memory pool mechanism for rapid memory access. Considering the impact of varying repetition numbers in Dilithium on the overall execution time and hardware utilization, we propose a dynamic task scheduling mechanism to improve multiprocessor occupancy and significantly reduce execution time. Furthermore, we apply asynchronous computing and launch multiple streams to hide data transfer latencies and maximize the computing capabilities of both the CPU and GPU. Across all three security levels, our GPU implementation can concurrently compute ten thousand tasks in less than 32 milliseconds for signing and 15 milliseconds for verification on both commercial and server-grade GPUs. This achieves microsecond-level amortized execution times for each task, offering a high-throughput and quantum-resistant solution suitable for a wide array of applications in real systems.
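The motivation behind the dynamic task scheduling mentioned above is that Dilithium signing uses rejection sampling, so different tasks in a batch need different numbers of repetitions: if each batch must run until its slowest task succeeds, finished lanes sit idle and multiprocessor occupancy drops. The CPU-side simulation below is a minimal sketch of that effect, not the paper's actual GPU implementation; the batch size, the 0.25 acceptance probability, and the function names are illustrative assumptions. Compacting unfinished tasks into fresh, fully occupied batches after every round reduces the total number of batch launches.

```python
import random

def sign_attempt(task_id, rng):
    # Hypothetical stand-in for one Dilithium signing attempt:
    # rejection sampling accepts with probability ~0.25, i.e. about
    # four expected repetitions per signature (illustrative value).
    return rng.random() < 0.25

def static_batches(num_tasks, rng, batch_size=256):
    """Baseline: each batch of tasks runs round after round until its
    slowest task succeeds. Every round costs one full batch launch even
    when most lanes have already finished, modeling wasted occupancy."""
    launches = 0
    for start in range(0, num_tasks, batch_size):
        pending = set(range(start, min(start + batch_size, num_tasks)))
        while pending:
            launches += 1  # one batch launch, mostly idle in late rounds
            pending = {t for t in pending if not sign_attempt(t, rng)}
    return launches

def dynamic_schedule(num_tasks, rng, batch_size=256):
    """Dynamic scheduling sketch: after each round, unfinished tasks
    from all batches are compacted together, so almost every launched
    batch is fully occupied."""
    pending = list(range(num_tasks))
    launches = 0
    while pending:
        still_pending = []
        for start in range(0, len(pending), batch_size):
            launches += 1  # one (re-packed) batch launch
            batch = pending[start:start + batch_size]
            still_pending.extend(t for t in batch
                                 if not sign_attempt(t, rng))
        pending = still_pending
    return launches
```

With ten thousand simulated tasks, the compacting scheduler needs far fewer batch launches than the static baseline, because the baseline pays roughly one launch per batch per repetition of its slowest task, while the dynamic variant pays roughly one launch per batch-worth of actual attempts.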