Faster multiplication over $\mathbb{F}_2[X]$ using AVX512 instruction set and VPCLMULQDQ instruction

Paper title

Faster multiplication over $\mathbb{F}_2[X]$ using AVX512 instruction set and VPCLMULQDQ instruction

Authors

Jean-Marc Robert, Pascal Véron

Abstract

Code-based cryptography is one of the main propositions for the post-quantum cryptographic context, and several protocols of this kind have been submitted to the NIST standardization process. Among them, BIKE and HQC are two of the five alternate candidates selected in the third round of that process in the KEM category. Both schemes rely on multiplication of large polynomials over binary rings, and because of the polynomial size (from 10,000 to 60,000 bits), this operation is one of the costliest in the key generation, encapsulation, and decapsulation mechanisms. In this work, we revisit the existing constant-time algorithms for arbitrary polynomial multiplication. We explore the different Karatsuba and Toom-Cook constructions in order to determine the best combination for each polynomial degree range, in the context of the AVX2 and AVX512 instruction sets. This leads to different kernels and constructions in each case. In particular, in the context of AVX512, we use the VPCLMULQDQ instruction, which is a vectorized binary polynomial multiplication instruction. This instruction performs up to four multiplications of polynomials of degree up to 63, the four results being stored in a single 512-bit word. Compared with the AVX2 implementations, this divides the number of retired instructions by roughly 3, while the speedup is up to 39% in terms of processor clock cycles. These results differ from the ones estimated in Drucker (Fast multiplication of binary polynomials with the forthcoming vectorized vpclmulqdq instruction, 2018). To illustrate the benefit of the new VPCLMULQDQ instruction, we used the HQC code to evaluate our approaches. When implemented in the HQC protocol, for the security levels 128, 192, and 256, our approaches provide up to 12% speedup for key pair generation.
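To make the operations named in the abstract concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The function names (`clmul`, `vpclmulqdq_model`, `karatsuba_gf2`) are hypothetical, the 64-bit base case is chosen only to mirror the operand width of one carry-less multiply, and the code is neither constant-time nor vectorized. It shows a carry-less product, a software model of how the 512-bit form of VPCLMULQDQ produces four independent 128-bit products, and a plain recursive Karatsuba over $\mathbb{F}_2[X]$, where addition and subtraction are both XOR.

```python
MASK64 = (1 << 64) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product of two binary polynomials stored as integers.

    Bit i of an integer is the coefficient of X^i; addition is XOR.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def vpclmulqdq_model(a_qwords, b_qwords, imm8=0x00):
    """Software model (an illustration, not the intrinsic) of the 512-bit form.

    a_qwords and b_qwords each hold eight 64-bit values, two per 128-bit lane.
    Bit 0 of imm8 selects the low or high qword of a in every lane, bit 4
    does the same for b. Returns four 128-bit products, one per lane.
    """
    sel_a = imm8 & 1
    sel_b = (imm8 >> 4) & 1
    products = []
    for lane in range(4):
        x = a_qwords[2 * lane + sel_a] & MASK64
        y = b_qwords[2 * lane + sel_b] & MASK64
        products.append(clmul(x, y))
    return products


def karatsuba_gf2(a: int, b: int, nbits: int) -> int:
    """Recursive Karatsuba over F_2[X] for operands of at most nbits bits."""
    if nbits <= 64:
        # Base case: one 64x64 -> 128-bit carry-less multiplication.
        return clmul(a, b)
    half = nbits // 2
    mask = (1 << half) - 1
    a0, a1 = a & mask, a >> half
    b0, b1 = b & mask, b >> half
    lo = karatsuba_gf2(a0, b0, half)
    hi = karatsuba_gf2(a1, b1, half)
    # Three recursive products instead of four; in characteristic 2 the
    # subtractions of the usual Karatsuba formula become XORs.
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, half) ^ lo ^ hi
    return lo ^ (mid << half) ^ (hi << (2 * half))


if __name__ == "__main__":
    # (X + 1)^2 = X^2 + 1 over F_2, so 0b11 * 0b11 = 0b101.
    print(bin(clmul(0b11, 0b11)))  # 0b101
```

In this model, a 256-bit Karatsuba product ends in nine 64-bit base multiplications. The abstract's point is that such independent base products can be issued four at a time with a single VPCLMULQDQ instruction.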
