CN118214541B - SM3 parallel data encryption method based on ARM platform - Google Patents
- Publication number: CN118214541B (application CN202410622101.6A)
- Authority
- CN
- China
- Legal status: Active (the status is an assumption and is not a legal conclusion)
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/06—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols the encryption apparatus using shift registers or memories for block-wise or stream coding, e.g. DES systems or RC4; Hash functions; Pseudorandom sequence generators
- H04L9/0643—Hash functions, e.g. MD5, SHA, HMAC or f9 MAC
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Abstract
The invention belongs to the field of security cryptography applications and discloses an SM3 parallel data encryption method based on the ARM platform. First, multithreading is used to fully exploit the characteristics of a multi-core CPU: multiple threads process multiple groups of plaintext simultaneously, achieving parallelization at the software level. The message expansion part is implemented in parallel with the NEON instruction set: four adjacent words are loaded into a register at once and computed simultaneously with parallel instructions. For the round function of the compression function, the concept of an Ultra Round is proposed: by switching the input positions of the words each round, the round function is reduced from eight assignment operations at a time to four, improving the performance of the compression function.
Description
Technical Field
The invention belongs to the field of security cryptography applications and in particular relates to an SM3 parallel data encryption method based on the ARM platform.
Background Art
SM3 is a cryptographic hash algorithm standard formulated by the China National Cryptography Administration. It belongs to the symmetric cryptographic family and is widely used in security applications such as digital signatures, data integrity verification, and identity authentication. The SM3 algorithm plays a key role in many fields and provides reliable basic technical support for information security; its security and efficiency are equivalent to SHA-256. Most domestic Chinese CPUs use the ARM architecture while most foreign CPUs use x86, and a growing number of mobile devices, embedded systems, and servers adopt ARM processors; in addition, ARM processors are well known for their low power consumption, especially in mobile devices. Optimizing the SM3 algorithm to run efficiently on the ARM architecture therefore makes it more competitive in power-sensitive scenarios such as mobile devices, so designing an efficient SM3 encryption method for the ARM architecture is of great significance.
The original SM3 message expansion mainly comprises the following steps:
① Divide the message block B^(i) into 16 words W_0, W_1, …, W_15;
② Expand these 16 words to generate the words W_16, …, W_67, i.e. W_j = P_1(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15)) ⊕ (W_{j-13} <<< 7) ⊕ W_{j-6}, 16 ≤ j ≤ 67;
③ Use the words obtained above to generate the remaining 64 words W'_0, …, W'_63, where W'_j = W_j ⊕ W_{j+4}, 0 ≤ j ≤ 63.
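Steps ①–③ above can be sketched directly in C (a minimal sketch; the helper names `rotl32`, `P1`, and `sm3_expand` are ours, with P_1(x) = x ⊕ (x <<< 15) ⊕ (x <<< 23) as in the SM3 standard):

```c
#include <stdint.h>

/* Rotate left; n must be in 1..31 here (15, 23 and 7 are used below). */
static uint32_t rotl32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Permutation P_1 from the SM3 standard: P_1(x) = x ^ (x <<< 15) ^ (x <<< 23). */
static uint32_t P1(uint32_t x) { return x ^ rotl32(x, 15) ^ rotl32(x, 23); }

/* Expand the 16 block words W[0..15] into W[0..67] and W1[0..63] (= W'). */
void sm3_expand(uint32_t W[68], uint32_t W1[64]) {
    for (int j = 16; j < 68; j++)                       /* step (2) */
        W[j] = P1(W[j - 16] ^ W[j - 9] ^ rotl32(W[j - 3], 15))
             ^ rotl32(W[j - 13], 7) ^ W[j - 6];
    for (int j = 0; j < 64; j++)                        /* step (3) */
        W1[j] = W[j] ^ W[j + 4];
}
```

This is the scalar baseline that the NEON optimization described later vectorizes four words at a time.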
The original SM3 compression function mainly comprises the following steps:
① Let A, B, C, D, E, F, G, H be word registers and SS1, SS2, TT1, TT2 be intermediate variables;
② Apply the compression function V^(i+1) = CF(V^(i), B^(i)), 0 ≤ i ≤ n−1, where V^(i) initially holds the values in registers A, B, C, D, E, F, G, H.
The detailed flow of the CF function is shown in Figure 4.
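As an illustration of the flow in Figure 4, one round of CF can be sketched in C (our own simplified sketch: FF/GG are shown only in the X ⊕ Y ⊕ Z form used in the early rounds, and the caller supplies T_j, W_j and W'_j):

```c
#include <stdint.h>

/* Safe rotate-left (handles n % 32 == 0 without undefined behavior). */
static uint32_t rotl(uint32_t x, int n) {
    n &= 31;
    return n ? (x << n) | (x >> (32 - n)) : x;
}

/* Permutation P_0 from the SM3 standard: P_0(x) = x ^ (x <<< 9) ^ (x <<< 17). */
static uint32_t P0(uint32_t x) { return x ^ rotl(x, 9) ^ rotl(x, 17); }

/* One round j of CF as in Figure 4. V holds A..H in order. */
void sm3_round(uint32_t V[8], uint32_t Wj, uint32_t W1j, uint32_t Tj, int j) {
    uint32_t A = V[0], B = V[1], C = V[2], D = V[3];
    uint32_t E = V[4], F = V[5], G = V[6], H = V[7];
    uint32_t SS1 = rotl(rotl(A, 12) + E + rotl(Tj, j), 7);
    uint32_t SS2 = SS1 ^ rotl(A, 12);
    uint32_t TT1 = (A ^ B ^ C) + D + SS2 + W1j;   /* FF in early-round form */
    uint32_t TT2 = (E ^ F ^ G) + H + SS1 + Wj;    /* GG in early-round form */
    V[3] = C; V[2] = rotl(B, 9); V[1] = A; V[0] = TT1;       /* A..D shift */
    V[7] = G; V[6] = rotl(F, 19); V[5] = E; V[4] = P0(TT2);  /* E..H shift */
}
```

The register shifting at the end of the round is exactly what the Ultra Round optimization described later removes.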
From the above, the SM3 algorithm suffers from high memory usage and high code complexity, and researchers have studied this. For example, patent application CN113794552A discloses a SIMD-based SM3 parallel data encryption method and system that achieves multi-way parallel computation while preserving the hash dependency of the data. However, it defines a large number of integer arrays and temporary variables, whose repeated definition accumulates memory usage, and its data-length checks and conditional branches incur additional memory overhead, making it unsuitable for the ARM architecture where memory is costly. Moreover, its extra-judgment technique must split the data into equal-length and unequal-length cases for separate handling, which adds judgment cost and introduces conditional branches and extra computation, hurting performance and creating security risks.
Patent application CN114422110A discloses a fast implementation of SM3 hash-function message processing for long instruction words, which unrolls the loops of the SM3 cryptographic hash algorithm before computing. Since SM3 has up to 68 loop iterations, each with multiple operations, this greatly increases code complexity, reduces readability and maintainability, and enlarges the code size; meanwhile it still needs 17 iterations of the W_j formula, which remains a large iteration count. That technique simply feeds the data to be encrypted directly to the processor: although it uses NEON parallel instructions and achieves some hardware-level parallelism, it does not fully exploit the potential performance of multi-core CPUs and cannot maximize the encryption speedup.
Patent application CN117938401A discloses a data encryption method based on a parallel SM3 algorithm: it uses the AVX2 instruction set to run 8 groups of messages in parallel, taking 8 or 16 groups of input at a time and completing SM3 encryption with AVX2 instructions. However, that technique restricts the number of plaintexts per input, offers poor flexibility, and accepts at most 16 groups at a time. Moreover, it likewise only parallelizes multiple groups of plaintext at the hardware level with SIMD instructions, without software-level optimization, so it cannot fully exploit the potential performance of multi-core CPUs or maximize the encryption speedup.
Summary of the Invention
To solve the above technical problems, the present invention provides an SM3 parallel data encryption method based on the ARM platform. It can be configured according to the number of CPU cores: on machines with more cores it can process more groups of plaintext simultaneously, improving the overall performance of the SM3 encryption algorithm.
The SM3 parallel data encryption method based on the ARM platform according to the present invention comprises the following steps:
Step 1: obtain as many plaintexts to be encrypted as there are CPU cores;
Step 2: create as many threads as there are CPU cores, so that each thread encrypts one group of plaintext;
Step 3: perform message padding on each group of plaintext handled by each thread;
Step 4: feed the padded plaintext into the parallel message expansion module, where each group of data is processed in parallel with the NEON instruction set;
Step 5: feed the expanded plaintext into the optimized CF compression function module, which reduces assignment operations by continually changing the input positions of the words while also lowering the loop count, finally yielding the encrypted hash value.
Further, in step 3 the plaintext is processed as follows:
Divide the plaintext into 16 words W_0, …, W_15 and use these 16 words to generate the subsequent 116 words, i.e. W_16, …, W_67 and W'_0, …, W'_63;
Replace W_{j-16} in the formula W_j = P_1(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15)) ⊕ (W_{j-13} <<< 7) ⊕ W_{j-6}, 16 ≤ j ≤ 67, with tmp1 of type uint32x4, where tmp1 = {W_{j-16}, W_{j-15}, W_{j-14}, W_{j-13}};
Replace W_{j-13} with tmp2 of type uint32x4, where tmp2 = {W_{j-13}, W_{j-12}, W_{j-11}, W_{j-10}};
Replace W_{j-9} with tmp3 of type uint32x4, where tmp3 = {W_{j-9}, W_{j-8}, W_{j-7}, W_{j-6}};
Replace W_{j-6} with tmp4 of type uint32x4, where tmp4 = {W_{j-6}, W_{j-5}, W_{j-4}, W_{j-3}};
Replace W_{j-3} with tmp5 of type uint32x4, where tmp5 = {W_{j-3}, W_{j-2}, W_{j-1}, W_j}.
Further, in step 4 each group of data is processed in parallel with the NEON instruction set, specifically:
Set j = 4 and perform 13 rounds of iteration;
In each iteration, use the vld1q_u32 instruction of the NEON instruction set to load 4 words at once from W[j*4-16] into tmp1;
load 4 words from W[j*4-13] into tmp2;
load 4 words from W[j*4-9] into tmp3;
load 4 words from W[j*4-6] into tmp4;
then use NEON's veorq_u32 instruction to XOR tmp1 and tmp3 in parallel, use the rotate-left instruction to rotate tmp4 left by 15 bits, and finally use the vst1q_u32 instruction to store the processed data into the W[j*4] array.
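Since NEON intrinsics compile only on ARM targets, the data movement of this 4-lane step can be modeled in portable C (our own sketch: each `vld1q_u32`/`veorq_u32`/`vst1q_u32` becomes a 4-wide loop; the remaining terms of the W[j*4] recurrence, such as P_1 and the tmp2/tmp5 contributions, are omitted for brevity):

```c
#include <stdint.h>

static uint32_t rotl15(uint32_t x) { return (x << 15) | (x >> 17); }

/* Portable model of one 4-lane step: tmp1/tmp3/tmp4 stand for uint32x4
   vectors loaded with vld1q_u32; the XOR models veorq_u32 and the store
   models vst1q_u32. Only the tmp1 ^ tmp3 ^ (tmp4 <<< 15) part is shown. */
void expand_step_model(uint32_t W[], int j) {
    uint32_t tmp1[4], tmp3[4], tmp4[4];
    for (int k = 0; k < 4; k++) {           /* vld1q_u32 loads */
        tmp1[k] = W[j * 4 - 16 + k];
        tmp3[k] = W[j * 4 - 9 + k];
        tmp4[k] = W[j * 4 - 6 + k];
    }
    for (int k = 0; k < 4; k++)             /* veorq_u32 + rotate, then store */
        W[j * 4 + k] = tmp1[k] ^ tmp3[k] ^ rotl15(tmp4[k]);
}
```

On ARM, each of these 4-wide loops collapses into a single NEON instruction, which is the source of the speedup claimed in the text.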
Further, the CF compression function module comprises registers A, B, C, D, E, F, G, H storing the initial values and registers SS1, SS2, TT1, TT2 storing the intermediate variables of the computation. It executes by looping over an Ultra Round function, in which the concrete compression-function body is implemented. The Boolean functions are also split: the first 4 loop iterations use Boolean function 1 and the last 15 use Boolean function 2, reducing branch judgments. Boolean function 1 comprises FF1(X, Y, Z) = X ⊕ Y ⊕ Z and GG1(X, Y, Z) = X ⊕ Y ⊕ Z; Boolean function 2 comprises FF2(X, Y, Z) = (X ∧ Y) ∨ (X ∧ Z) ∨ (Y ∧ Z) and GG2(X, Y, Z) = (X ∧ Y) ∨ (¬X ∧ Z), where X, Y, Z are values in the registers to be computed, ⊕ denotes XOR, ∧ denotes AND, ∨ denotes OR, and ¬ denotes NOT.
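Transcribed directly into C, the two Boolean-function pairs above read (names match the text; the bitwise operators &, |, ~, ^ implement ∧, ∨, ¬, ⊕):

```c
#include <stdint.h>

/* Boolean function pair 1 (used in the first iterations). */
static uint32_t FF1(uint32_t X, uint32_t Y, uint32_t Z) { return X ^ Y ^ Z; }
static uint32_t GG1(uint32_t X, uint32_t Y, uint32_t Z) { return X ^ Y ^ Z; }

/* Boolean function pair 2 (used in the later iterations). */
static uint32_t FF2(uint32_t X, uint32_t Y, uint32_t Z) {
    return (X & Y) | (X & Z) | (Y & Z);   /* bitwise majority */
}
static uint32_t GG2(uint32_t X, uint32_t Y, uint32_t Z) {
    return (X & Y) | (~X & Z);            /* bitwise select: X ? Y : Z */
}
```

Splitting the loop so that each half calls one fixed pair is what removes the per-round branch on the round index.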
Further, the concrete implementation steps of the Ultra Round are:
First compute the values of registers SS1, SS2, TT1, TT2 following the flow of the original compression function; then assign TT1 to register D, assign TT2 to register H after applying the P_0 function, rotate the value of B left by 9 bits and assign it back to register B, and rotate the value of F left by 19 bits and assign it back to register F, finally obtaining the encrypted hash value.
The beneficial effects of the present invention are as follows. The method uses multithreading to fully exploit the characteristics of a multi-core CPU: multiple threads process multiple groups of plaintext simultaneously, achieving software-level parallelization and avoiding idle CPU time during the execution of a single-threaded program. The message expansion part is implemented with NEON instruction-level parallelism: four adjacent words are loaded into a register at once and computed simultaneously with the parallel instruction set; the principle is simple, the code runs efficiently, and there are no redundant operations such as loop unrolling, which greatly reduces the computational complexity of the message expansion part and achieves hardware-level parallelization. Moreover, each encryption needs only five registers, greatly saving memory resources and making the method well suited to the ARM architecture, where memory resources are scarce. For the round function of the compression function, the concept of an Ultra Round is proposed: by switching the input positions of the words each round, the round function is reduced from eight assignment operations at a time to four, while the loop count is also lowered, improving the performance of the compression function.
Brief Description of the Drawings
Figure 1 is a flow chart of the original SM3 message expansion;
Figure 2 is a flow chart of the method of the present invention;
Figure 3 is a flow chart of the message expansion part used in the present invention;
Figure 4 is an illustration of the original SM3 compression function;
Figure 5 is a schematic diagram of the randomly generated plaintext;
Figure 6 is a schematic comparison of the throughput computed over ten encryption rounds for the present invention and the other schemes;
Figure 7 is a schematic comparison of the latency computed over ten encryption rounds for the present invention and the other schemes.
Detailed Description
To make the content of the present invention easier to understand clearly, the present invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
As shown in Figure 2, the SM3 parallel data encryption method based on the ARM platform according to the present invention comprises the following steps:
Step 1: obtain as many groups of plaintext as there are CPU cores (4 cores are taken as an example here), to be encrypted with SM3;
Step 2: create as many threads as there are CPU cores via the pthread_create() function of the C pthread.h library, then have each thread call a work function so that one thread encrypts one group of plaintext, achieving software-level parallelism;
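Step 2 can be sketched as follows (a minimal sketch of our own: `hash_one_group` is a placeholder for the real SM3 work function, and the core count is fixed at 4 for illustration):

```c
#include <pthread.h>
#include <stdint.h>

/* One thread per CPU core, each encrypting one group of plaintext.
   hash_one_group() is a stand-in; the real work function would run the
   SM3 padding, expansion, and compression steps on its group. */
#define NCORES 4

static unsigned long results[NCORES];

static void *hash_one_group(void *arg) {
    long id = (long)(intptr_t)arg;            /* which plaintext group */
    results[id] = 0x100 + (unsigned long)id;  /* placeholder for the hash */
    return NULL;
}

int run_parallel(void) {
    pthread_t tid[NCORES];
    for (long i = 0; i < NCORES; i++)
        pthread_create(&tid[i], NULL, hash_one_group, (void *)(intptr_t)i);
    for (int i = 0; i < NCORES; i++)
        pthread_join(tid[i], NULL);           /* wait for every group */
    return 0;
}
```

Because the groups are independent (no hash chaining between them), no locking is needed beyond the final join.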
Step 3: each group of plaintext handled by each thread first enters the message padding module of the SM3 encryption algorithm. Suppose the message m is l bits long. First append the bit "1" to the end of the message, then append k "0" bits, where k is the smallest non-negative integer satisfying l + 1 + k ≡ 448 mod 512; then append a 64-bit string that is the binary representation of the length l. The bit length of the padded message m′ is a multiple of 512;
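The padding rule of step 3 can be sketched in C for byte-aligned messages (a sketch of our own; `sm3_pad` is not a name from the patent, and `out` must be large enough to hold the padded result):

```c
#include <stdint.h>
#include <string.h>

/* Append bit '1', then k zero bits with l+1+k = 448 (mod 512), then the
   64-bit big-endian length l. Byte-aligned messages assumed. Returns the
   padded length in bytes, always a multiple of 64 (= 512 bits). */
size_t sm3_pad(const uint8_t *msg, size_t len, uint8_t *out) {
    uint64_t lbits = (uint64_t)len * 8;
    size_t padded = ((len + 9 + 63) / 64) * 64;   /* next multiple of 64 */
    memcpy(out, msg, len);
    out[len] = 0x80;                              /* the '1' bit */
    memset(out + len + 1, 0, padded - len - 9);   /* the k '0' bits */
    for (int i = 0; i < 8; i++)                   /* big-endian length */
        out[padded - 1 - i] = (uint8_t)(lbits >> (8 * i));
    return padded;
}
```

For example, a 3-byte message pads to a single 64-byte block whose last byte encodes the length 24 bits.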
Step 4: feed the padded plaintext into the parallel message expansion module, where each group of data is processed in parallel with the NEON instruction set, achieving hardware-level parallelism;
Step 5: feed the plaintext processed by the message expansion function into the CF compression function part, which reduces assignment operations by continually changing the input positions of the words while also lowering the loop count.
In step 3, the original message expansion flow is shown in Figure 1: the words W_16, …, W_67 are generated from W_0, …, W_15, which are obtained by dividing the original message block, via the formula W_j = P_1(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15)) ⊕ (W_{j-13} <<< 7) ⊕ W_{j-6}, 16 ≤ j ≤ 67.
To fully exploit the multi-lane parallel computation of the NEON instruction set, the present invention uses uint32x4 variables to load four W words into a register at once, computes four groups of data simultaneously with NEON parallel instructions, and stores the finished results back into W in one operation. Specifically, W_{j-16} in the above formula is replaced with tmp1 of type uint32x4, where tmp1 = {W_{j-16}, W_{j-15}, W_{j-14}, W_{j-13}}, and W_{j-13} is replaced with tmp2 of type uint32x4, where tmp2 = {W_{j-13}, W_{j-12}, W_{j-11}, W_{j-10}}; the replacements for W_{j-9}, W_{j-6}, and W_{j-3} are completed in the same way. This makes full use of NEON parallel instructions such as veorq(tmp1, tmp2) for hardware-level parallel computation, so that the same instruction computes four groups of data at once, greatly improving the performance of the message expansion part. The original loop computing W_j must traverse j from 16 to 67, 52 rounds in total, whereas after the optimization of the present invention the loop only traverses j from 4 to 16, 13 rounds in total; besides the parallel computation, the loop count is also reduced. The optimized message expansion scheme of the present invention is shown in Figure 3.
In step 5 the compression function is optimized; the round function of the original SM3 compression function is shown in Figure 4. From the structure diagram, A of round j is obtained by computation from D of round j−1, B of round j comes from A of the previous round, C of round j is obtained by computation from B of the previous round, and D of round j comes from C of the previous round. After 4 rounds of computation, A, B, C, D and E, F, G, H all return to their initial positions, so the present invention defines these 4 rounds as one Ultra Round: by switching the input positions of the words in each round, assignments to intermediate variables are reduced and the loop count is lowered.
The concrete implementation of the Ultra Round is:
uint32_t Temp = rotate_left(*A, 12);
SS1 = rotate_left((Temp + *E + K[i]), 7);
SS2 = SS1 ^ Temp;
TT1 = (FF1(*A, *B, *C) + *D + SS2 + (W[i] ^ W[i + 4]));  /* W'[i] computed on the fly */
TT2 = (GG1(*E, *F, *G) + *H + SS1 + W[i]);
*D = TT1;          /* D takes the role of A in the next round */
*H = P_0(TT2);     /* H takes the role of E in the next round */
*B = rotate_left(*B, 9);
*F = rotate_left(*F, 19);
Compared with the original SM3 round function, four assignment operations are saved per round, improving the performance of the compression function.
The original compression function becomes:
The UltraRound function is called repeatedly in a loop of 19 iterations, permuting the register positions of A, B, C, D, E, F, G, H on each call. Let j be the loop counter: the first call passes the arguments A, B, C, D, E, F, G, H, j into UltraRound; the second passes D, A, B, C, H, E, F, G, j+1; the third passes C, D, A, B, G, H, E, F, j+2; and the fourth passes B, C, D, A, F, G, H, E, j+3.
The original SM3 compression function requires 64 loop iterations in total; after the intermediate variables are eliminated through round unrolling, the loop count drops to 19. Meanwhile, the present invention splits the Boolean functions, reducing branch-judgment operations, which improves running efficiency while lowering the risk of side-channel attacks.
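The argument rotation described above can be sketched with a stub in place of the real Ultra Round body (illustration of the calling pattern only; the stub merely records which register plays the role of A on each call):

```c
#include <stdint.h>

/* Stub standing in for the real Ultra Round body: it only records which
   register currently plays the role of A. No compression is performed. */
static uint32_t trace[4];
static int calls = 0;

static void UltraRound(uint32_t *A, uint32_t *B, uint32_t *C, uint32_t *D,
                       uint32_t *E, uint32_t *F, uint32_t *G, uint32_t *H,
                       int j) {
    (void)B; (void)C; (void)D; (void)E; (void)F; (void)G; (void)H; (void)j;
    if (calls < 4) trace[calls++] = *A;
}

/* One group of four calls with the rotated argument order from the text;
   R = {A, B, C, D, E, F, G, H}. After these four calls every register is
   back in its initial role, so no per-round register shuffling is needed. */
void one_group(uint32_t R[8], int j) {
    UltraRound(&R[0], &R[1], &R[2], &R[3], &R[4], &R[5], &R[6], &R[7], j);
    UltraRound(&R[3], &R[0], &R[1], &R[2], &R[7], &R[4], &R[5], &R[6], j + 1);
    UltraRound(&R[2], &R[3], &R[0], &R[1], &R[6], &R[7], &R[4], &R[5], j + 2);
    UltraRound(&R[1], &R[2], &R[3], &R[0], &R[5], &R[6], &R[7], &R[4], j + 3);
}
```

Passing rotated pointers instead of moving values is what replaces the eight per-round assignments of the original CF with four.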
The method of the present invention is described below in connection with experiments.
The SM3 parallel acceleration algorithm implemented by the present invention encrypts randomly generated plaintext, shown in Figure 5. During the computation, a group of plaintexts is randomly selected for encryption, and the result is compared against the pure-C SM3 implementations of the OpenSSL library and the GmSSL library. The present invention mainly tests two indicators: throughput and latency.
In the implementation and optimization of cryptographic algorithms, throughput is a crucial performance indicator: it directly reflects the amount of computation an algorithm can complete per unit time and is therefore a key parameter for evaluating computational efficiency. In the present invention, the SM3 encryption algorithm is optimized through parallel computation; by comparing the amount of data the same algorithm encrypts per unit time before and after optimization, the throughput indicator clearly shows the performance gain. The throughput computed over ten encryption rounds for the present invention and the other schemes is shown in Figure 6.
As can be seen from Figure 6, the method proposed by the present invention improves throughput over both GmSSL and OpenSSL; higher throughput means the algorithm can encrypt more data per unit time.
In the performance optimization of cryptographic algorithms, latency refers to the execution time of a single encryption, decryption, or other cryptographic operation, i.e. the time elapsed from the start of the operation to its completion. Latency is a key performance indicator because it directly affects the responsiveness and efficiency of the algorithm, and reducing it is generally regarded as one of the goals of improving system efficiency and response speed. Specifically, in this study the proposed SM3 parallel encryption algorithm and the algorithms of GmSSL and OpenSSL were each run for 10,000 encryption rounds and their latencies computed. The latencies computed over ten encryptions for the present invention and the other schemes are shown in Figure 7.
The results show that the latency of the proposed method is significantly lower than that of the GmSSL and OpenSSL schemes, proving that the present invention improves the performance of the SM3 algorithm. Lower latency means encryption takes less time, which matters for improving encryption response speed and the user experience.
Based on the above experimental results, compared with the pure-C implementations of the other versions, the proposed method shows effective improvement on the main indicators for evaluating cryptographic algorithms: the parallel encryption scheme on the ARM platform significantly reduces the latency of the SM3 algorithm and raises its throughput per unit time, improving the overall performance of the SM3 algorithm.
The above is only a preferred embodiment of the present invention and is not a further limitation of it; all equivalent changes made using the contents of this specification and the accompanying drawings fall within the protection scope of the present invention.
Claims (1)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410622101.6A CN118214541B (en) | 2024-05-20 | 2024-05-20 | SM3 parallel data encryption method based on ARM platform |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118214541A CN118214541A (en) | 2024-06-18 |
| CN118214541B true CN118214541B (en) | 2024-09-10 |
Family
ID=91450867
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410622101.6A Active CN118214541B (en) | 2024-05-20 | 2024-05-20 | SM3 parallel data encryption method based on ARM platform |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118214541B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120068169A (en) * | 2025-04-30 | 2025-05-30 | 山东云海国创云计算装备产业创新中心有限公司 | Data processing method and device, storage medium and electronic equipment |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107222310A (en) * | 2017-08-01 | 2017-09-29 | 成都大学 | A kind of parallelization processing method of the Ciphertext policy cloud encryption based on encryption attribute |
| CN110086602A (en) * | 2019-04-16 | 2019-08-02 | 上海交通大学 | The Fast implementation of SM3 cryptographic Hash algorithms based on GPU |
Family Cites Families (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9658854B2 (en) * | 2014-09-26 | 2017-05-23 | Intel Corporation | Instructions and logic to provide SIMD SM3 cryptographic hashing functionality |
| KR102307105B1 (en) * | 2015-08-19 | 2021-09-29 | 인텔 코포레이션 | Instructions and logic to provide simd sm3 cryptographic hashing functionality |
| US10705842B2 (en) * | 2018-04-02 | 2020-07-07 | Intel Corporation | Hardware accelerators and methods for high-performance authenticated encryption |
| US10785028B2 (en) * | 2018-06-29 | 2020-09-22 | Intel Corporation | Protection of keys and sensitive data from attack within microprocessor architecture |
| CN113282947A (en) * | 2021-07-21 | 2021-08-20 | 杭州安恒信息技术股份有限公司 | Data encryption method and device based on SM4 algorithm and computer platform |
| CN114422110B (en) * | 2022-03-30 | 2022-08-23 | 北京大学 | Rapid implementation method for SM3 hash function message processing for long instruction words |
| CN116155481B (en) * | 2023-02-24 | 2025-05-13 | 长沙理工大学 | SM3 algorithm data encryption realization method and device |
- 2024-05-20: application CN202410622101.6A filed; granted as CN118214541B (active)
Legal Events
| Date | Code | Title |
|---|---|---|
| | PB01 | Publication |
| | SE01 | Entry into force of request for substantive examination |
| | GR01 | Patent grant |