CN118034643B

CN118034643B - Carry-free multiplication and calculation array based on SRAM

Info

Publication number: CN118034643B
Application number: CN202410087862.6A
Authority: CN
Inventors: 王楚惠; 黄科杰; 沈海斌
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2024-01-22
Filing date: 2024-01-22
Publication date: 2024-09-13
Anticipated expiration: 2044-01-22
Also published as: CN118034643A

Abstract

The present invention provides a carry-free multiplication storage and calculation array based on SRAM, including m×m storage and calculation units based on SRAM structure and N×m XOR trees; the single-bit streams generated by the multipliers input by the storage and calculation units in the same row are the same, each row of storage and calculation units simultaneously inputs N single-bit streams generated by the multipliers, and each column of storage and calculation units outputs N×m calculation result values in one calculation cycle, and the N×m calculation result values are used as input data of N m-input XOR trees, and N single-bit calculation results are finally obtained through the calculation of the XOR trees, and the carry-free multiplication storage and calculation array finally obtains N×m single-bit data calculation results. The present invention greatly improves the efficiency of carry-free multiplication calculation by combining in-memory calculation technology; and uses NAND gates instead of AND gates to complete the AND operation required for carry-free multiplication, further reducing the number of transistors required, and reducing the power consumption and resource usage required for calculation.

Description

A carry-less multiplication storage array based on SRAM

技术领域Technical Field

本发明属于存算一体化以及类脑计算领域，具体涉及一种基于SRAM(静态随机存储器)的无进位乘法存算阵列。The present invention belongs to the field of storage-computing integration and brain-like computing, and specifically relates to a carry-free multiplication storage-computing array based on SRAM (static random access memory).

背景技术Background Art

当今社会对数据、算力、芯片性能的要求越来越高，提升计算能力与数据效率迫在眉睫。传统计算机运行的冯·诺依曼体系包括存储单元和计算单元两部分，即存算分离架构。因此，传统计算机实施运算时需要先把数据存入主存储器，再按顺序从主存储器中取出指令按序执行，数据需要在处理器与存储器之间进行频繁迁移，且读写一次内存的数据能量比计算一次数据的能量多消耗几百倍。传统冯·诺依曼架构导致的高延迟和高耗能的问题成为急需解决的问题，其中的短板存储器成为了制约数据处理速度提高的主要瓶颈。存算一体是在存储器中嵌入计算能力，以新的运算架构进行二维和三维矩阵乘法/加法运算，即利用存储器对数据进行计算，从而避免数据搬运产生的“存储墙”和“功耗墙”，提高数据的并行和效率。现有计算无进位乘法的技术主要通过在处理器中添加额外指令或利用FPGA实现定制乘法器对无进位乘法操作进行加速。在处理器中添加额外指令对无进位乘法计算效率的提升不大，例如在使用到无进位乘法的BIKE(比特翻转密钥封装技术)算法中，BIKE团队针对x86机器进行了一些优化实现，但其完成无进位乘法的速度严重依赖于无进位乘法指令(如pclmulqdq)的使用，并且其完成单次无进位乘法计算仍需耗费上万个周期。利用FPGA实现定制乘法器虽然减少了计算周期数，但与此同时花费了大量的资源。Today's society has higher and higher requirements for data, computing power, and chip performance, and it is urgent to improve computing power and data efficiency. The von Neumann system running on traditional computers includes two parts: storage units and computing units, that is, storage and computing separation architecture. Therefore, when traditional computers implement operations, they need to first store data in the main memory, and then take out instructions from the main memory in order to execute them in order. Data needs to be frequently migrated between the processor and the memory, and the energy consumed by reading and writing data in the memory once is hundreds of times more than the energy consumed by calculating data once. The high latency and high energy consumption caused by the traditional von Neumann architecture have become problems that need to be solved urgently, and the short board memory has become the main bottleneck restricting the improvement of data processing speed. Storage and computing integration is to embed computing power in the memory, and perform two-dimensional and three-dimensional matrix multiplication/addition operations with a new computing architecture, that is, to use the memory to calculate the data, thereby avoiding the "storage wall" and "power consumption wall" generated by data transfer, and improving data parallelism and efficiency. The existing technology for calculating carry-less multiplication mainly accelerates the carry-less multiplication operation by adding additional instructions to the processor or using FPGA to implement a customized multiplier. Adding additional instructions to the processor does not improve the efficiency of carry-less multiplication calculations much. For example, in the BIKE (bit flip key encapsulation technology) algorithm that uses carry-less multiplication, the BIKE team has made some optimizations for x86 machines, but the speed of completing carry-less multiplication is heavily dependent on the use of carry-less multiplication instructions (such as pclmulqdq), and it still takes tens of thousands of cycles to complete a single carry-less multiplication calculation. Although using FPGA to implement a custom multiplier reduces the number of calculation cycles, it also consumes a lot of resources.

发明内容Summary of the invention

本发明提供了一种基于SRAM的无进位乘法存算阵列，包括m列基于SRAM构造的存算单元组和m列异或树组；所述存算单元组由m个存算单元组成，所述异或树组由N个异或树组成；m为大于等于2的整数，N＝1,2,4,8,…2^q，q为自然数；The present invention provides a carry-less multiplication storage and calculation array based on SRAM, comprising m columns of storage and calculation unit groups constructed based on SRAM and m columns of XOR tree groups; the storage and calculation unit group is composed of m storage and calculation units, and the XOR tree group is composed of N XOR trees; m is an integer greater than or equal to 2, N=1, 2, 4, 8, ... 2 ^q , q is a natural number;

每个所述存算单元由一个SRAM与N个二输入与非门组成，所述SRAM输出所储存的被乘数的一比特信息；所述SRAM的输出分别作为N个二输入与非门的其中一个输入端的输入，N个二输入与非门的另一个输入端分别输入由乘数生成的N个单比特流；N个二输入与非门分别输出两个输入相与非的结果，即输出N个单比特二进制数；Each of the storage and calculation units is composed of an SRAM and N two-input NAND gates. The SRAM outputs one bit of information of the stored multiplicand. The output of the SRAM is used as the input of one of the input ends of the N two-input NAND gates. The other input ends of the N two-input NAND gates are respectively input with N single-bit streams generated by the multiplier. The N two-input NAND gates respectively output the result of the NAND of the two inputs, that is, output N single-bit binary numbers.

第j列存算单元组的每个存算单元输出的N个单比特二进制数分别输入到第j列异或树组的每个异或树中，j＝1,2,…,m，每个异或树共接收m个单比特二进制数，输出为所述m个单比特二进制数相互异或的单比特二进制数结果；The N single-bit binary numbers output by each storage unit of the j-th column storage unit group are respectively input into each XOR tree of the j-th column XOR tree group, j=1, 2, ..., m, each XOR tree receives a total of m single-bit binary numbers, and outputs a single-bit binary number result of the XOR of the m single-bit binary numbers;

每一列存算单元在一个计算周期内输出N×m个单比特二进制数，所述N×m个单比特二进制数作为N个m输入的异或树的输入数据，通过异或树的计算最终得到N个单比特计算结果，m列所述存算单元组共得到N×m个单比特计算结果；在一个计算周期内，所述无进位乘法存算阵列最终得到N×m个单比特数据计算结果。Each column of storage and computing units outputs N×m single-bit binary numbers in one calculation cycle. The N×m single-bit binary numbers are used as input data of N m-input XOR trees. N single-bit calculation results are finally obtained through the calculation of the XOR tree. The m columns of the storage and computing unit group obtain a total of N×m single-bit calculation results. In one calculation cycle, the carry-less multiplication storage and computing array finally obtains N×m single-bit data calculation results.

进一步地，被乘数以单个比特信息的形式被分别存储在所述存算阵列的SRAM中，每一个SRAM输出同一个被乘数不同的一比特信息，一个二输入与非门计算由乘数生成的单比特流的单个比特与被乘数的一比特信息的与非结果，完成无进位乘法中单个比特之间求积的运算；异或树是对二输入与非门输出的积进一步进行异或运算，完成无进位乘法中无进位操作的运算，最终输出部分积结果。Furthermore, the multiplicand is stored in the SRAM of the storage and calculation array in the form of single-bit information, each SRAM outputs different one-bit information of the same multiplicand, a two-input NAND gate calculates the NAND result of a single bit of the single-bit stream generated by the multiplier and the one-bit information of the multiplicand, and completes the product operation between single bits in the carry-less multiplication; the XOR tree further performs an XOR operation on the product output by the two-input NAND gate, completes the carry-less operation in the carry-less multiplication, and finally outputs a partial product result.

进一步地，同一行所述存算单元输入的由乘数生成的N个单比特流是相同的，每一行存算单元同时输入由乘数生成的N个单比特流，所述由乘数生成的N个单比特流分别输入到该行内每一个存算单元的N个二输入与非门中。Furthermore, the N single-bit streams generated by the multipliers input into the storage and computing units in the same row are the same, and each row of storage and computing units simultaneously inputs the N single-bit streams generated by the multipliers, and the N single-bit streams generated by the multipliers are respectively input into the N two-input NAND gates of each storage and computing unit in the row.

进一步地，所述乘数的二进制数表示形式为a_n-1a_n-2a_n-3…a₂a₁a₀，所述无进位乘法存算阵列的第k行，对乘数的二进制数表示形式循环左移m*(k-1)比特，得到一个新的二进制数，k＝1,2,3,…,m；对所述新的二进制数高位补零，使其变成一个长度为g*N的二进制数x_g*N-1x_g*N-2x_g*N-3…x₂x₁x₀，其中，g＝ceil(n/N)，ceil()函数是向上取整函数；所述无进位乘法存算阵列的第k行存算单元的第i个二输入与非门，i＝1,2,3,…,N，其输入的单比特流为x_i-1,x_i-1+N,x_i-1+2*N,x_i-1+3*N,…,x_i-1+(g-1)*N。Furthermore, the binary representation of the multiplier is a _n-1 a _n-2 a _n-3 …a ₂ a ₁ a ₀ , and the kth row of the carry-less multiplication storage array cyclically shifts the binary representation of the multiplier left by m*(k-1) bits to obtain a new binary number, k=1,2,3,…,m; the high bits of the new binary number are padded with zeros to make it a binary number x _g*N-1 x _g*N-2 x _g*N-3 …x ₂ x ₁ x ₀ of length g*N, wherein g=ceil(n/N) and the ceil() function is a rounding function; the i-th two-input NAND gate of the k-th row storage unit of the carry-less multiplication storage array, i=1,2,3,…,N, has an input single-bit stream of x _i-1 ,xi _-1+N , _xi-1+2*N , _xi-1+3*N ,…,x _i-1+(g-1)*N .

进一步地，每一列的全部所述SRAM分别输出所存储的同一个被乘数不同的一比特信息，每一个SRAM输出所储存的被乘数的一比特信息至其所在的存算单元内的N个二输入与非门，所述N个二输入与非门的另一个输入端分别输入由乘数生成的N个单比特流，每一列存算单元在一个计算周期内输出N×m个单比特二进制数至N个m输入的异或树中，N个所述异或树输出N个单比特计算结果，在一个计算周期结束后，所述的基于SRAM的无进位乘法存算阵列输出N×m个单比特计算结果，所述N×m个单比特计算结果用大小为N×m的矩阵O表示；在一列存算单元中，每个存算单元第i个二输入与非门计算后结果会输出到同一个异或树中，i＝1,2,…,N，该列存算单元为第j列，j＝1,2,…,m，则此时异或树输出的单比特计算结果即为矩阵O中的元素O_ij所对应的数值。Furthermore, all the SRAMs in each column respectively output different one-bit information of the same multiplicand stored therein, and each SRAM outputs the one-bit information of the stored multiplicand to the N two-input NAND gates in the storage unit where it is located, and the other input ends of the N two-input NAND gates respectively input N single-bit streams generated by the multiplier, and each column of storage units outputs N×m single-bit binary numbers to N m-input XOR trees in one calculation cycle, and the N XOR trees output N single-bit calculation results. After one calculation cycle, the SRAM-based carry-free multiplication storage array outputs N×m single-bit calculation results, and the N×m single-bit calculation results are represented by a matrix O of size N×m; in a column of storage units, the result of the calculation of the i-th two-input NAND gate of each storage unit will be output to the same XOR tree, i=1,2,…,N, and the column of storage units is the j-th column, j=1,2,…,m, then the single-bit calculation result output by the XOR tree at this time is the element O in the matrix O. The value corresponding to _ij .

本发明还提供一种采用所述的基于SRAM的存算阵列的无进位乘法运算方法，计算A×B的无进位乘法，被乘数B以单个比特信息的形式被分别存储在所述存算阵列的SRAM中，SRAM输出被乘数B的一比特信息，在存算阵列的每一行中，乘数A所生成的N个单比特流与SRAM的输出共同输入到所述SRAM所在的存算单元的二输入与非门中；在一个计算周期中，二输入与非门输出的结果即为该SRAM存储的B的单个比特信息与该周期内输入的乘数A所生成的单比特流内的单比特数据的积，全部所述存算单元的二输入与非门输出的结果被输入到对应的异或树中，进行相互异或运算，异或树最终输出的结果即为A×B无进位乘法计算的部分积结果。The present invention also provides a carry-free multiplication operation method using the SRAM-based storage and calculation array to calculate the carry-free multiplication of A×B, the multiplicand B is stored in the SRAM of the storage and calculation array in the form of single-bit information, the SRAM outputs one-bit information of the multiplicand B, in each row of the storage and calculation array, N single-bit streams generated by the multiplier A and the output of the SRAM are input into the two-input NAND gate of the storage and calculation unit where the SRAM is located; in a calculation cycle, the result output by the two-input NAND gate is the product of the single-bit information of B stored in the SRAM and the single-bit data in the single-bit stream generated by the multiplier A input in the cycle, the results output by the two-input NAND gates of all the storage and calculation units are input into the corresponding XOR tree to perform mutual XOR operation, and the result finally output by the XOR tree is the partial product result of the carry-free multiplication calculation of A×B.

本发明的优点如下：The advantages of the present invention are as follows:

本发明通过结合存内计算技术，大幅度提升了无进位乘法计算的效率；利用与非门代替与门完成无进位乘法所需要的相与操作，进一步减少了所需要使用的晶体管数量，降低了计算所需的功耗与资源使用；同时，利用单个存算单元使用N个二输入与非门进行并行计算的方式，可以大幅度降低计算所需周期数；可配置参数阵列大小m与并行计算数N也使得其应用场景更加广泛。The present invention greatly improves the efficiency of carry-less multiplication calculation by combining in-memory computing technology; uses NAND gates instead of AND gates to complete the AND operation required for carry-less multiplication, further reducing the number of transistors required, and reducing the power consumption and resource usage required for calculation; at the same time, a single memory computing unit is used to use N two-input NAND gates for parallel calculation, which can greatly reduce the number of cycles required for calculation; the configurable parameter array size m and the number of parallel calculations N also make its application scenarios more extensive.

(1)针对无进位乘法运算，提出了基于SRAM的无进位乘法存算阵列，该存算阵列利用与非门代替与门完成无进位乘法所需要的相与操作，且使用的门电路都使用了较少的晶体管数，有效缩小了电路的规模，降低电路的功耗和面积，并提高能效比与计算密度。(1) For carry-less multiplication operations, a carry-less multiplication storage and calculation array based on SRAM is proposed. The storage and calculation array uses NAND gates instead of AND gates to complete the AND operation required for carry-less multiplication. The gate circuits used use a smaller number of transistors, which effectively reduces the scale of the circuit, reduces the power consumption and area of the circuit, and improves the energy efficiency and computing density.

(2)通过令单个存算单元使用N个二输入与非门进行并行计算的方式，显著提高了单个计算周期的计算量，能够有效降低运算的复杂度，减少计算周期数。(2) By making a single storage and computing unit use N two-input NAND gates for parallel computing, the computational complexity of a single computing cycle is significantly improved, which can effectively reduce the complexity of the operation and the number of computing cycles.

(3)所述基于SRAM的无进位乘法存算阵列大小中的参数m与并行计算量的参数N为可配置变量，两者均可根据具体需求来选择相应的数值，因此该阵列能够针对不同的应用场景下的无进位乘法的计算要求进行配置，灵活性更高，应用范围更广。(3) The parameter m in the size of the SRAM-based carry-less multiplication storage array and the parameter N of the parallel computing amount are configurable variables, and both can select corresponding values according to specific needs. Therefore, the array can be configured according to the calculation requirements of carry-less multiplication in different application scenarios, with higher flexibility and wider application range.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1 128输入的异或树的结构示意图；Figure 1 is a schematic diagram of the structure of the XOR tree of 128 inputs;

图2无进位乘法存算阵列的第j列结构图；Fig. 2 is a structural diagram of the jth column of the carry-less multiplication storage array;

图3无进位乘法存算阵列的整体架构图；Figure 3 is a diagram of the overall architecture of the carry-less multiplication storage and calculation array;

图4 BIKE算法中多项式乘法计算的架构示意图。Figure 4 Schematic diagram of the architecture of polynomial multiplication calculation in the BIKE algorithm.

具体实施方式DETAILED DESCRIPTION

下面结合具体实施方式对本发明做进一步阐述和说明。所述实施例仅是本公开内容的示范且不圈定限制范围。本发明中各个实施方式的技术特征在没有相互冲突的前提下，均可进行相应组合。The present invention is further described and illustrated below in conjunction with specific embodiments. The embodiments are merely exemplary of the present disclosure and do not define the scope of limitation. The technical features of each embodiment of the present invention may be combined accordingly without conflicting with each other.

实现存内计算的方法中，SRAM的存储逻辑简单清晰，和现在的数字处理器技术更容易结合，同时，SRAM离CPU近读写性能优势较大。无进位乘法，即取消进位操作的乘法，数学上将其定义为伽罗华域GF(2ⁿ)的乘法，定义多项式其中a_i,b_i∈{0,1}， c_i＝Σ_j+k＝ia_jb_kmod 2∈{0,1}。在存内计算的设计中，每个存算单元既对数据进行存储也对数据进行运算操作，存算阵列主要完成无进位乘法的部分积运算，全展开形式的存算阵列可以一次性完成无进位乘法的部分积计算并输出相应的计算结果，对于整个器件阵列拥有速度快、并行度高、能效比好的特点，适用于需要进行大量无进位乘法运算的应用如后量子密码BIKE(比特翻转密钥封装技术)、HQC(汉明准循环密钥封装技术)等算法中。Among the methods for implementing in-memory computing, SRAM has a simple and clear storage logic and is easier to integrate with current digital processor technology. At the same time, SRAM is close to the CPU and has a greater read and write performance advantage. Carry-free multiplication, that is, multiplication without carry operation, is mathematically defined as multiplication of Galois Field GF(2 ⁿ ), and the polynomial is defined as where a _i ,b _i ∈{0,1}, c _i = Σ _{j + k = i} a _j b _k mod 2∈{0,1}. In the design of in-memory computing, each storage and computing unit not only stores data but also performs operations on data. The storage and computing array mainly completes the partial product operation of carry-less multiplication. The fully expanded storage and computing array can complete the partial product calculation of carry-less multiplication at one time and output the corresponding calculation results. The entire device array has the characteristics of fast speed, high parallelism and good energy efficiency. It is suitable for applications that require a large number of carry-less multiplication operations, such as post-quantum cryptography BIKE (bit flip key encapsulation technology), HQC (Hamming quasi-cyclic key encapsulation technology) and other algorithms.

本发明提供了一种基于SRAM的无进位乘法存算阵列，适用于伽罗华域GF(2ⁿ)上的无进位乘法运算。所述无进位乘法存算阵列由基于SRAM构造的存算单元与异或树组成。所述无进位乘法存算阵列大小为m×m(m＝16,32,64,128,…)，即由m×m个存算单元与异或树组成，特别的，为了将乘数全部存储在存算单元中，要求m×m≥n。以无进位乘法C＝A×B的计算为例，其中a_i,b_i∈{0,1}，c_i＝Σ_j+k＝ia_jb_kmod 2∈{0,1}。所述无进位乘法存算阵列的输入为乘数A生成的单个比特流，存算单元储存的数据为被乘数B。具体阵列架构设计介绍如下：The present invention provides a carry-less multiplication storage array based on SRAM, which is suitable for carry-less multiplication operations on Galois Field GF(2 ⁿ ). The carry-less multiplication storage array is composed of storage units and XOR trees constructed based on SRAM. The size of the carry-less multiplication storage array is m×m (m=16, 32, 64, 128, ...), that is, it is composed of m×m storage units and XOR trees. In particular, in order to store all multipliers in the storage units, m×m≥n is required. Taking the calculation of carry-less multiplication C=A×B as an example, where a _i ,b _i ∈{0,1}, c _i =Σ _{j+k = i} a _j b _k mod 2∈{0,1}. The input of the carry-less multiplication storage array is a single bit stream generated by the multiplier A, and the data stored in the storage unit is the multiplicand B. The specific array architecture design is introduced as follows:

一个存算单元由一个SRAM与N(N＝1,2,4,8,…)个二输入与非门组成。异或树即为由二输入异或门搭建而成的树状结构，其输入为m个单比特二进制数，输出为这m个单比特二进制数相互异或的单比特二进制数结果。一个m输入的异或树由多个上述二输入异或门组成，以m＝128为例，其树状结构如图1所示。计算m个单比特二进制数相互异或的结果，需要ceil(log₂(m))级异或树，ceil()函数是向上取整函数；所述m个单比特二进制数被输入到异或树第一级的二输入异或门中，进行两两相互异或的计算，第二级开始的每一级二输入异或门的输入依赖于前一级二输入异或门的输出，即后一级的一个二输入异或门的两个输入为前一级两个二输入异或门的输出结果，依次类推，最后一级二输入异或门输出的结果即为整个异或树的输出结果，即m个单比特二进制数输入的相互异或的结果。A storage unit consists of an SRAM and N (N = 1, 2, 4, 8, ...) two-input NAND gates. An XOR tree is a tree structure built by two-input XOR gates. Its input is m single-bit binary numbers, and its output is the single-bit binary result of the XOR of these m single-bit binary numbers. An m-input XOR tree consists of multiple two-input XOR gates. Taking m = 128 as an example, its tree structure is shown in Figure 1. To calculate the result of mutual XOR of m single-bit binary numbers, ceil(log ₂ (m)) levels of XOR trees are required, and the ceil() function is a rounding-up function; the m single-bit binary numbers are input into the two-input XOR gates of the first level of the XOR tree, and two-by-two XOR calculations are performed, and the inputs of each level of two-input XOR gates starting from the second level depend on the outputs of the two-input XOR gates of the previous level, that is, the two inputs of a two-input XOR gate of the next level are the output results of the two two-input XOR gates of the previous level, and so on. The output result of the two-input XOR gate of the last level is the output result of the entire XOR tree, that is, the result of mutual XOR of the m single-bit binary number inputs.

SRAM输出的数据被输入到N个二输入与非门的其中一个输入端上，与非门的另一个输入端输入的数据为乘数A生成的单个比特流。特别的，所述无进位乘法存算阵列的同一行存算单元，其输入的乘数A生成的单个比特流是相同的，每一行存算单元同时输入N个乘数A生成的单个比特流。所述无进位乘法存算阵列大小为m×m，即每一列存算单元在一个计算周期内输出N×m个计算结果值。这N×m个计算结果值将作为N个m输入的异或树的输入数据，通过异或树的计算最终得到N个单比特数据计算结果，所述无进位乘法存算阵列的第j(j＝1,2,…,m)列结构(即单列结构)如图2所示。特别的，一个计算周期内，一列存算单元中，每个存算单元第i(i＝1,2,…,N)个二输入与非门计算后结果会输出到同一个异或树中。一个大小为m×m的无进位乘法存算阵列共有m列存算单元，因此所述无进位乘法存算阵列共有N×m个m输入的异或树。在一个计算周期内，所述无进位乘法存算阵列最终得到N×m个单比特数据计算结果。所述无进位乘法存算阵列的整体架构如图3所示，其最终输出结果可以用大小为N×m的矩阵O表示，其中，在一列存算单元中，由于每个存算单元第i(i＝1,2,…,N)个二输入与非门计算后结果会输出到同一个异或树中，若该列存算单元为第j(j＝1,2,…,m)列，则此时异或树输出的单比特数据结果即为矩阵O中的元素O_ij所对应的数值。The data output by the SRAM is input to one of the input ends of the N two-input NAND gates, and the data input to the other input end of the NAND gate is a single bit stream generated by the multiplier A. In particular, the single bit stream generated by the input multiplier A is the same for the storage units in the same row of the carry-free multiplication storage array, and each row of storage units simultaneously inputs a single bit stream generated by N multipliers A. The size of the carry-free multiplication storage array is m×m, that is, each column of storage units outputs N×m calculation result values in one calculation cycle. These N×m calculation result values will be used as input data for N m-input XOR trees, and N single-bit data calculation results will eventually be obtained through the calculation of the XOR tree. The jth (j＝1,2,…,m) column structure (i.e., single column structure) of the carry-free multiplication storage array is shown in Figure 2. In particular, in one calculation cycle, in a column of storage units, the result of the calculation of the i-th (i＝1,2,…,N) two-input NAND gate of each storage unit will be output to the same XOR tree. A carry-less multiplication storage array of size m×m has a total of m columns of storage units, so the carry-less multiplication storage array has a total of N×m m-input XOR trees. In one calculation cycle, the carry-less multiplication storage array finally obtains N×m single-bit data calculation results. The overall architecture of the carry-less multiplication storage array is shown in Figure 3, and its final output result can be represented by a matrix O of size N×m, wherein, in a column of storage units, since the result of the calculation of the i-th (i＝1,2,…,N) two-input NAND gate of each storage unit will be output to the same XOR tree, if the column of storage units is the j-th (j＝1,2,…,m) column, the single-bit data result output by the XOR tree at this time is the value corresponding to the element O _ij in the matrix O.

BIKE是一种基于QC-MDPC(准循环中等密度奇偶校验)码的基于编码的密钥封装机制。BIKE算法中使用的准循环码允许将所有矩阵操作视为多项式运算。根据性能分析结果，多项式乘法是BIKE算法中最昂贵的计算操作之一。BIKE算法中的多项式乘法是在循环多项式环F₂[x]/(x^r-1)上的计算，其中F₂代表二元有限域。因此，假设A＝a₀+a₁x+…+a_r-1x^r-1，B＝b₀+b₁x+…+b_r-1x^r-1，其中a_i,b_i∈{0,1}，BIKE算法中的多项式乘法为C＝(A×B)％(x^r–1)，其中A×B部分即为无进位乘法运算，C即为A×B进行无进位乘法运算后与不可约多项式(x^r–1)的约化模计算结果。该约简操作即为将A×B无进位乘法运算结果中次数大于等于r的项的部分移动到较低的次数的部分，并将两个部分进行异或操作，所输出的计算结果即为BIKE算法中多项式乘法C＝(A×B)％(x^r–1)的结果值。BIKE is a code-based key encapsulation mechanism based on QC-MDPC (quasi-cyclic medium-density parity check) codes. The quasi-cyclic code used in the BIKE algorithm allows all matrix operations to be considered as polynomial operations. According to the performance analysis results, polynomial multiplication is one of the most expensive computational operations in the BIKE algorithm. The polynomial multiplication in the BIKE algorithm is calculated on the cyclic polynomial ring F ₂ [x]/(x ^r -1), where F ₂ represents a binary finite field. Therefore, assuming A = a ₀ + a ₁ x + … + a _r-1 x ^r-1 , B = b ₀ + b ₁ x + … + b _r-1 x ^r-1 , where a _i , b _i ∈ {0,1}, the polynomial multiplication in the BIKE algorithm is C = (A×B)% (x ^r –1), where the A×B part is the carry-less multiplication operation, and C is the reduced modulus calculation result of the carry-less multiplication of A×B and the irreducible polynomial (x ^r –1). The reduction operation is to move the part of the terms with a degree greater than or equal to r in the result of the A×B carry-less multiplication operation to the part with a lower degree, and perform an XOR operation on the two parts. The output calculation result is the result value of the polynomial multiplication C=(A×B)%(x ^r –1) in the BIKE algorithm.

以BIKE算法中的无进位乘法计算操作为例对上述方案进一步说明。由于BIKE算法可以用三个不同的参数集实例化以满足NIST(美国国家标准与技术研究院)建议的不同安全级别，本实施例将对安全级别为1时BIKE算法所使用的无进位乘法进行计算，即参数r＝12323。因此，假设A＝a₀+a₁x+…+a₁₂₃₂₂x¹²³²²，B＝b₀+b₁x+…+b₁₂₃₂₂x¹²³²²，其中a_i,b_i∈{0,1}。以下将用二进制比特数形式来表示A、B与C的具体数值，即以a₁₂₃₂₂ a₁₂₃₂₁…a₁a₀表示乘数A，以b₁₂₃₂₂ b₁₂₃₂₁…b₁b₀表示被乘数B，以c₁₂₃₂₂ c₁₂₃₂₁…c₁c₀表示乘法结果C。本实施例选取的存算阵列大小的参数m＝128，并行计算数的参数N＝4，完成一次多项式乘法计算所述的基于SRAM的无进位乘法存算阵列仅需g＝ceil(r/N)＝ceil(12323/4)＝3081个计算周期，这里的ceil()函数是向上取整函数，它返回的是大于或等于函数参数并且与之最接近的整数。利用本发明所述的基于SRAM的无进位乘法存算阵列，计算BIKE算法中多项式乘法C＝(A×B)％(x^r–1)的整体架构如图4所示，计算操作方法如下所述。The above scheme is further explained by taking the carry-less multiplication operation in the BIKE algorithm as an example. Since the BIKE algorithm can be instantiated with three different parameter sets to meet the different security levels recommended by NIST (National Institute of Standards and Technology), this embodiment will calculate the carry-less multiplication used by the BIKE algorithm when the security level is 1, that is, the parameter r=12323. Therefore, it is assumed that A=a ₀ +a ₁ x+…+a ₁₂₃₂₂ x ¹²³²² , B=b ₀ +b ₁ x+…+b ₁₂₃₂₂ x ¹²³²² , where a _i ,b _i ∈{0,1}. The specific values of A, B and C are represented in the form of binary bits, that is, a ₁₂₃₂₂ a ₁₂₃₂₁ ... a ₁ a ₀ represents the multiplier A, b ₁₂₃₂₂ b ₁₂₃₂₁ ... b ₁ b ₀ represents the multiplicand B, and c ₁₂₃₂₂ c ₁₂₃₂₁ ... c ₁ c ₀ represents the multiplication result C. In this embodiment, the parameter m of the storage array size selected is 128, and the parameter N of the number of parallel calculations is 4. The SRAM-based carry-free multiplication storage array only needs g = ceil (r / N) = ceil (12323 / 4) = 3081 calculation cycles to complete a polynomial multiplication calculation. The ceil () function here is a rounding function, which returns an integer greater than or equal to the function parameter and closest to it. The overall architecture of calculating the polynomial multiplication C=(A×B)%(x ^r –1) in the BIKE algorithm using the SRAM-based carry-less multiplication storage array of the present invention is shown in FIG4 , and the calculation operation method is described as follows.

本实施例中，根据被乘数B的数值构成存算阵列。存算阵列储存B的具体形式为：将B的每97比特(不足97比特则用0补齐高位)储存在存算阵列的同一行中，高位的比特用0补齐，依次完成共128行的储存操作。对于存算阵列的第k(k＝1,2,3,…,126,127)行存算单元，其储存的128个数据分别为0,0,…,0,0,b_97*(k-1)+96,b_97*(k-1)+95,…,b_97*(k-1)+2,b_97*(k-1)+1,b_97*(k-1)；对于存算阵列的最后一行(k＝128)存算单元，其储存的128个数据分别为0,0,…,0,0,b₁₂₃₂₂,…,b_97*127+1,b_97*127。即存算阵列第一行128个存算单元储存的数据分别为0,0,…,0,0,b₉₆,b₉₅,…,b₂,b₁,b₀，第二行128个存算单元储存的数据分别为0,0,…,0,0,b₁₉₄,b₁₉₃,…,b₉₉,b₉₈,b₉₇，以此类推，最后一行128个存算单元储存的数据分别为0,0,…,0,0,0,b₁₂₃₂₂,b₁₂₃₂₁,b₁₂₃₂₀,b₁₂₃₁₉。In this embodiment, a storage and calculation array is formed according to the value of the multiplicand B. The specific form of storing B in the storage and calculation array is: every 97 bits of B (if less than 97 bits, the high bits are filled with 0) are stored in the same row of the storage and calculation array, and the high bits are filled with 0, and the storage operation of a total of 128 rows is completed in sequence. For the kth (k＝1,2,3,…,126,127) row of storage and computing units of the storage and computing array, the 128 data stored are 0, 0, …, 0, 0, b _97*(k-1)+96 , b _97*(k-1)+95 , …, b _97*(k-1)+2 , b _97*(k-1)+1 , b _97*(k-1) ; for the last row (k＝128) of storage and computing units of the storage and computing array, the 128 data stored are 0, 0, …, 0, 0, b ₁₂₃₂₂ , …, b _97*127+1 , b _97*127 . That is, the data stored in the first row of 128 storage units of the storage array are 0, 0, …, 0, 0, b ₉₆ , b ₉₅ , …, b ₂ , b ₁ , b ₀ , the data stored in the second row of 128 storage units are 0, 0, …, 0, 0, b ₁₉₄ , b ₁₉₃ , …, b ₉₉ , b ₉₈ , b ₉₇ , and so on. The data stored in the last row of 128 storage units are 0, 0, …, 0, 0, b ₁₂₃₂₂ , b ₁₂₃₂₁ , b ₁₂₃₂₀ , b ₁₂₃₁₉ .

本实施例中，根据乘数A的数值构成存算阵列的输入。对于存算阵列的第k(k＝1,2,3,…,126,127,128)行存算单元，其输入的数据构成方式如下所述。以二进制数a₁₂₃₂₂a₁₂₃₂₁…a₁a₀表示乘数A，对该二进制数循环左移97*(k-1)比特得到一个新的二进制数In this embodiment, the input of the storage and calculation array is formed according to the value of the multiplier A. For the storage and calculation unit of the kth row (k = 1, 2, 3, ..., 126, 127, 128) of the storage and calculation array, the input data is formed as follows. The binary number a ₁₂₃₂₂ a ₁₂₃₂₁ ... a ₁ a ₀ represents the multiplier A, and the binary number is cyclically shifted left by 97*(k-1) bits to obtain a new binary number

A_i＝a_{12323-97*(k-1)-1}a_{12323-97*(k-1)-2}…a₂a₁a₀a₁₂₃₂₂a₁₂₃₂₁…a_{12323-97*(k-1)+2}a_{12323-97*(k-1)+} ₁a_{12323-97*(k-1)}，为方便描述，以下用二进制数x₁₂₃₂₂x₁₂₃₂₁…x₁x₀表示乘数A的二进制数循环左移后所得到的新的二进制数，对该二进制数的高位补零，使其变成一个长度为g*N的二进制数A_i＝x_g*N-1x_g*N-2…x₁x₀。并行计算数为N，因此一行存算单元有N个输入，这N个输入的构成方式为：对于第i(i＝1,2,3,…,N)个输入，其输入的比特流为x_i-1,x_i-1+N,x_i-1+2*N,x_i-1+3*N,…,x_i-1+(g-1)*N，共g＝3081个比特数组成。本实施例中，并行计算数的参数N＝4，一行存算单元有N＝4个输入，因此该行存算单元第i(i＝1,2,3,4)个输入的比特流由x_i-1,x_i+3,x_i+7,…,x_i+12311,x_i+12315,x_i+12319所构成，并被输入至存算阵列相应的输入端。A _i =a _{12323-97*(k-1)-1} a _{12323-97*(k-1)-2} …a ₂ a ₁ a ₀ a ₁₂₃₂₂ a ₁₂₃₂₁ …a _{12323-97*(k-1)+2} a _{12323-97*(k-1)+} ₁ a _{12323-97*(k-1)} . For the convenience of description, the binary number x ₁₂₃₂₂ x ₁₂₃₂₁ …x ₁ x ₀ is used to represent the new binary number obtained by cyclically shifting the binary number of the multiplier A to the left, and the high bits of the binary number are padded with zeros to make it a binary number A _i =x _g*N-1 x _g*N-2 …x ₁ x ₀ with a length of g*N. The number of parallel calculations is N, so a row of storage and calculation units has N inputs, and the composition of these N inputs is: for the i-th (i=1,2,3,…,N) input, its input bit stream is x _i-1 , x _i-1+N , x _i-1+2*N , x _i-1+3*N ,…, x _i-1+(g-1)*N , which is composed of g=3081 bits in total. In this embodiment, the parameter of the number of parallel calculations is N=4, and a row of storage and calculation units has N=4 inputs, so the bit stream of the i-th (i=1,2,3,4) input of the row of storage and calculation units is composed of x _i-1 , x _i+3 , x _i+7 ,…, x _i+12311 , x _i+12315 , x _i+12319 , and is input to the corresponding input end of the storage and calculation array.

本实施例中，存算阵列的每个存算单元对输入进行运算，存算单元的计算结果将被输入到异或树中进行异或运算。最终，在一个计算周期结束后，本发明所述的基于SRAM的无进位乘法存算阵列将输出4×128个单比特数据计算结果，可以用大小为4×128的矩阵O表示。In this embodiment, each storage unit of the storage unit array performs operations on the input, and the calculation results of the storage unit will be input into the XOR tree for XOR operation. Finally, after a calculation cycle, the SRAM-based carry-less multiplication storage unit array of the present invention will output 4×128 single-bit data calculation results, which can be represented by a matrix O of size 4×128.

本实施例中，将矩阵O的行向量用二进制数表示，考虑到本实施例中的存算阵列仅有后97列储存了数据，因此为减少计算量便于计算，将四个行向量的数值分别表示为100比特的二进制数O₁,O₂,O₃和O₄，其中低97比特位为矩阵O每行的后97位数值，剩余的高3比特位补零，并将这四个二进制数输入到移位异或模块中。在移位异或模块，对O₂左移1位，对O₃左移2位，对O₄左移3位，低位补零，新得到的三个二进制数与O₁一起进行异或，最终输出一个100比特的二进制数结果P。In this embodiment, the row vectors of the matrix O are represented by binary numbers. Considering that only the last 97 columns of the storage and calculation array in this embodiment store data, in order to reduce the amount of calculation and facilitate calculation, the values of the four row vectors are represented as 100-bit binary numbers O ₁ , O ₂ , O ₃ and O ₄ , respectively, where the lower 97 bits are the last 97 bits of the values of each row of the matrix O, and the remaining upper 3 bits are filled with zeros, and these four binary numbers are input into the shift XOR module. In the shift XOR module, O ₂ is shifted left by 1 bit, O ₃ is shifted left by 2 bits, O ₄ is shifted left by 3 bits, and the lower bits are filled with zeros. The newly obtained three binary numbers are XORed together with O ₁ , and finally a 100-bit binary number result P is output.

本实施例中，将P输入到寄存器模块中，寄存器模块的长度为100+4*C_n比特，这里的C_n表示每经过C_n个计算周期会对存储器进行一次写入操作。该寄存器模块主要实现了“计算+存储”的功能，它先将上一个计算周期寄存器模块的输出结果R_last进行逻辑右移N＝4比特，并将移位后的R_last的高100比特与这一个计算周期的输出结果P进行异或运算，最后得到一个新的寄存器输出结果R_now。R_now的低位为计算完成但未写入存储器的数据，高位为尚未计算完成的数据。In this embodiment, P is input into the register module, and the length of the register module is 100+4*C _n bits, where C _n means that a write operation is performed on the memory every C _n calculation cycles. The register module mainly implements the "calculation + storage" function, which first performs a logical right shift of N=4 bits on the output result R _last of the register module in the previous calculation cycle, and performs an XOR operation on the high 100 bits of the shifted R _last and the output result P of this calculation cycle, and finally obtains a new register output result R _now . The low bits of R _now are the data that have been calculated but not written into the memory, and the high bits are the data that have not been calculated yet.

本实施例中，存储器读写长度为4*Cn比特，每经过C_n个计算周期将寄存器模块保存的低4*Cn位数据写入存储器中，本实施例将以参数C_n＝32为例进行描述，即存储器读写长度为128比特，寄存器模块保存的数据R_last大小为100+4*C_n＝228比特。由于BIKE算法中的多项式乘法最低位结果的计算还与最后一次移位异或模块的计算结果相关，因此，最后需从存储器中重新读取之前计算的最低位128比特结果单独进行一次异或计算再写回。存储器中重新读取的128比特数据将与寄存器模块保存的数据R_last的中间128比特(即R_last[227:0]的[162:35]部分，共128比特)进行异或，并将输出结果写回到存储器中原始最低位128比特计算结果所存储的对应位置。最终，存储器内储存的结果即为BIKE算法中多项式乘法C＝A×B的结果值。In this embodiment, the memory read/write length is 4*Cn bits, and the lower 4*Cn bits of data stored in the register module are written into the memory after every _Cn calculation cycles. This embodiment will be described by taking parameter _Cn =32 as an example, that is, the memory read/write length is 128 bits, and the size of the data R _last stored in the register module is 100+4* _Cn =228 bits. Since the calculation of the lowest bit result of the polynomial multiplication in the BIKE algorithm is also related to the calculation result of the last shift XOR module, it is necessary to re-read the lowest bit 128 bits of the previously calculated result from the memory, perform a separate XOR calculation, and then write it back. The 128 bits of data re-read from the memory will be XORed with the middle 128 bits of the data R _last stored in the register module (i.e., the [162:35] part of R _last [227:0], a total of 128 bits), and the output result will be written back to the corresponding position where the original lowest bit 128 bits of the calculation result are stored in the memory. Finally, the result stored in the memory is the result value of the polynomial multiplication C=A×B in the BIKE algorithm.

本实施例中，本发明将BIKE算法中多项式乘法的计算周期数缩减到了3081个周期，相比于RISCV硬件处理器完成该计算所需要花费的上百万个周期而言，极大地减少了计算量，提升了计算效率。同时，本发明的存算阵列规模较小，电路的功耗低、面积小，能效比与计算密度更高。In this embodiment, the present invention reduces the number of calculation cycles of polynomial multiplication in the BIKE algorithm to 3081 cycles, which greatly reduces the amount of calculation and improves the calculation efficiency compared to the millions of cycles required for the RISCV hardware processor to complete the calculation. At the same time, the storage and calculation array of the present invention is small in scale, the circuit has low power consumption and small area, and has higher energy efficiency and calculation density.

以上所述实施例仅表达了本发明的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对本发明专利范围的限制。对于本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进，这些都属于本发明的保护范围。The above-mentioned embodiments only express several implementation methods of the present invention, and the description is relatively specific and detailed, but it cannot be understood as limiting the scope of the present invention. For ordinary technicians in this field, several modifications and improvements can be made without departing from the concept of the present invention, which all belong to the protection scope of the present invention.

Claims

1. The carry-free multiplication memory array based on the SRAM is characterized by comprising m columns of memory cell groups based on the SRAM structure and m columns of exclusive OR tree groups; the storage unit group consists of m storage units, and the exclusive-or tree group consists of N exclusive-or trees; m is an integer greater than or equal to 2, n=1, 2,4,8, …,2, ^q, q is a natural number;

Each storage unit consists of an SRAM and N two-input NAND gates, and the SRAM outputs one bit of the stored multiplicand information; the output of the SRAM is respectively used as the input of one input end of N two-input NAND gates, and the other input end of the N two-input NAND gates is respectively used for inputting N single bit streams generated by a multiplier; n two-input NAND gates respectively output two-input phase NAND results, namely N single-bit binary numbers;

N single-bit binary numbers output by each storage unit of the jth column of storage unit group are respectively input into each exclusive-or tree of the jth column of exclusive-or tree group, j=1, 2, …, m, each exclusive-or tree receives m single-bit binary numbers in total, and a single-bit binary number result which is formed by mutually exclusive-or of the m single-bit binary numbers is output;

Each column of storage and calculation units outputs N multiplied by m single-bit binary numbers in a calculation period, the N multiplied by m single-bit binary numbers are used as input data of N input exclusive-OR trees, N single-bit calculation results are finally obtained through calculation of the exclusive-OR trees, and N multiplied by m single-bit calculation results are obtained through m columns of storage and calculation unit groups; and in one calculation period, the carryless multiplication and calculation array finally obtains N multiplied by m single-bit data calculation results.

2. The SRAM-based carryless multiplication memory array of claim 1 wherein the multiplicand is stored in the SRAM of said memory array in the form of single bit information, each SRAM outputting the same multiplicand with different bit information, a two-input nand gate calculating the nand result of the single bit stream generated by the multiplier and the one bit information of the multiplicand, performing the operation of the product between the single bits in the carryless multiplication; the exclusive-or tree is used for further exclusive-or operation of the product output by the two-input NAND gate to finish the operation of no-carry operation in no-carry multiplication and finally output partial product result.

3. The SRAM-based carryless multiplication memory array of claim 1, wherein calculating the result of the exclusive or of m single bit binary numbers requires ceil (log ₂ (m)) level exclusive or trees, the ceil () function being a round-up function; the m single-bit binary numbers are input into two-input exclusive-OR gates of a first stage of an exclusive-OR tree to perform two-by-two exclusive-OR calculation, the input of each stage of two-input exclusive-OR gate started by a second stage depends on the output of the two-input exclusive-OR gate of the previous stage, namely, the two inputs of one two-input exclusive-OR gate of the subsequent stage are the output results of the two-input exclusive-OR gates of the previous stage, and so on, and the output result of the two-input exclusive-OR gate of the last stage is the output result of the whole exclusive-OR tree, namely, the result of the mutual exclusive-OR of the m single-bit binary number inputs.

4. The SRAM-based carryless multiplication memory array of claim 1 wherein the N single bit streams generated by the multipliers input by the memory cells in the same row are identical, each row of memory cells simultaneously inputting N single bit streams generated by the multipliers, said N single bit streams generated by the multipliers being input to N two-input nand gates of each memory cell within the row, respectively.

5. The SRAM-based carryless multiplication array of claim 1, wherein said binary representation of the multiplier is a _n-1a_n-2a _n-3…a₂a₁a₀, and wherein the k-th row of said carryless multiplication array is shifted left by m x (k-1) bits cyclically for the binary representation of the multiplier to obtain a new binary number, k = 1,2,3, …, m; the new binary number is subjected to high-order zero padding to be changed into a binary number x _g*N-1x_g*N-2x _g*N-3…x₂x₁x₀ with the length of g x N, wherein g=ceil (N/N), and the ceil () function is an upward rounding function; the ith two-input NAND gate of the kth row of the carryless multiplication and calculation array, i=1, 2,3, …, N, has the input single bit stream of x _i-1,x _i-1+N,x_i-1+2*N,x_i-1+3*N,…,x_i-1+(g-1)*N.

6. The SRAM-based carryless multiplication array of claim 1, wherein all of said SRAMs of each column respectively output stored identical multiplicand different bit information, each SRAM outputs stored multiplicand bit information to N two-input nand gates in a memory cell where it is located, another input terminal of said N two-input nand gates respectively inputs N single bit streams generated by multipliers, each column memory cell outputs N x m single bit binary numbers to N m input xor trees in one calculation cycle, N said xor trees output N single bit calculation results, after one calculation cycle is finished, said SRAM-based carryless multiplication array outputs N x m single bit calculation results, said N x m single bit calculation results are represented by matrix O of size N x m; in a column of memory units, the calculated result of the ith two-input nand gate of each memory unit is output to the same exclusive-or tree, i=1, 2, …, N, the column of memory units is the jth column, j=1, 2, …, m, and then the single-bit calculated result output by the exclusive-or tree is the value corresponding to the element O _ij in the matrix O.

7. A method of carryless multiplication using the SRAM-based storage array of claim 2, wherein the carryless multiplication of a x B is calculated, the multiplicand B is stored in the SRAM of the storage array in the form of single bit information, the SRAM outputs one bit of the multiplicand B, and in each row of the storage array, N single bit streams generated by the multiplier a are input together with the output of the SRAM into the two-input nand gate of the storage unit in which the SRAM is located; in a calculation period, the output result of the two-input NAND gate is the product of the single bit information of B stored in the SRAM and the single bit data in the single bit stream generated by the multiplier A input in the period, the output result of the two-input NAND gate of all the storage units is input into the corresponding exclusive-OR tree for mutual exclusive-OR operation, and the final output result of the exclusive-OR tree is the partial product result of the A multiplied by B without carry.