CN102122241A

CN102122241A - Analog multiplier/divider applicable to prime field and polynomial field

Info

Publication number: CN102122241A
Application number: CN2010100226476A
Authority: CN
Inventors: 韩军; 黄伟; 曹丹; 曾晓洋
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2010-01-08
Filing date: 2010-01-08
Publication date: 2011-07-13

Abstract

The invention relates to a dual-domain modular multiplication modulus divider for ECC (elliptic curve cryptography) algorithm that is suitable for high-speed network applications and portable mobile device applications. It consists of 4 PE computing units, 5 register files Regfile, and Booth encoding Unit, input register Load file, control module control and 17 multiplexers, which are used to change the connection between the four PE operation units and the reading position of the data to complete modular division or modular multiplication Operation, with scalability, can support up to 480-bit modular multiplication and division operations. Modular multiplication and modular division operations can share hardware units to reduce hardware area; addition, subtraction and shift operations of long operands in the algorithm All are carried out in units of words, which greatly accelerates the convergence speed of the algorithm, thus doubling the operation speed.

Description

A Modular Multiplication and Modulo Divider Applicable to Prime Fields and Polynomial Fields

技术领域technical field

本发明属于集成电路设计技术领域，具体涉及一种适用于高速网络应用以及便携式移动设备应用需要的针对椭圆曲线密码(ECC)算法的双域模乘模除器。The invention belongs to the technical field of integrated circuit design, and in particular relates to a dual-domain modulus multiplier and divider for elliptic curve cryptography (ECC) algorithms, which is suitable for high-speed network applications and portable mobile device applications.

背景技术Background technique

在当代，随着信息化的不断深入，越来越多的信息将暴露在公开的媒介中。为了保护那些敏感信息，各种密码算法被应用到无线网络通信领域中。然而，通信设备尤其是便携式设备相对有限的处理能力已无法满足日益增大的数据量的需求。In contemporary times, with the deepening of informatization, more and more information will be exposed in public media. In order to protect those sensitive information, various cryptographic algorithms are applied to the field of wireless network communication. However, the relatively limited processing capabilities of communication devices, especially portable devices, cannot meet the demand for increasing data volume.

密码体制一般可以划分为两种类型：对称密码体制和公钥密码体制。对称密码体制的最大优点是效率高，但是主要缺点却是比较明显：密钥分发问题，即要求分发密钥的信道既是保密的又是保真的；另一个缺点是密钥管理问题，即在有N个实体的网络中，每个实体都必须存取N-1个实体的密钥。而公钥密码体制却没有这些缺点，公钥密码仅要求密钥的交换是保真的，而不用保密，且提供了秘密性和不可否认性。Cryptosystems can generally be divided into two types: symmetric cryptosystems and public key cryptosystems. The biggest advantage of the symmetric cryptosystem is high efficiency, but the main disadvantage is quite obvious: the key distribution problem, that is, the channel for distributing the key is required to be both confidential and authentic; another shortcoming is the key management problem, that is, in the In a network with N entities, each entity must access the keys of N-1 entities. However, public key cryptography does not have these shortcomings. Public key cryptography only requires the exchange of keys to be authentic, not confidential, and provides confidentiality and non-repudiation.

目前，椭圆曲线密码(ECC)已成为除RSA密码之外呼声最高的公钥密码之一，它可以提供同RSA密码体制同样的功能。它的安全性建立在椭圆曲线离散对数问题(ECDLP)的困难性之上。普遍认为160位椭圆曲线密码可提供相当于1024位RSA密码的安全程度。由于密钥短，所以在实际应用中加解密速度较快，并且可节省功耗、带宽和存储空间。ECC算法的核心运算是模乘和求逆，模乘和求逆的性能关系到整个ECC芯片的性能。另外，ECC算法时间上的并行性比较明显。一个高性能且功能齐全的模乘求逆器，并充分发掘算法时间上的并行性，则可以仅用较小的硬件代价，获得较高的加密速度，适应当前的高速网络应用的需要。另外由于硬件消耗小、功耗低，它也可以适应便携式移动设备的应用需要.随着计算机技术的发展，运算速度也在飞速提高。我们有必要增加数据长度以加强加密算法的安全性。本设计在硬件结构上具有可扩展性，可以方便地扩展数据宽度。At present, Elliptic Curve Cryptography (ECC) has become one of the most popular public key ciphers besides RSA cryptography, and it can provide the same function as RSA cryptosystem. Its security is built on the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP). It is generally accepted that 160-bit elliptic curve ciphers provide a level of security equivalent to 1024-bit RSA ciphers. Because the key is short, the encryption and decryption speed is faster in practical applications, and power consumption, bandwidth and storage space can be saved. The core operations of the ECC algorithm are modular multiplication and inversion, and the performance of modular multiplication and inversion is related to the performance of the entire ECC chip. In addition, the time parallelism of the ECC algorithm is more obvious. A high-performance and full-featured modular multiplication invertor, and fully explore the parallelism of the algorithm time, can obtain high encryption speed with only a small hardware cost, and meet the needs of current high-speed network applications. In addition, due to the low hardware consumption and low power consumption, it can also adapt to the application needs of portable mobile devices. With the development of computer technology, the calculation speed is also increasing rapidly. It is necessary for us to increase the data length to strengthen the security of the encryption algorithm. This design has scalability in the hardware structure, and can easily expand the data width.

发明内容Contents of the invention

本发明的目的是提出一种适用于高速网络应用以及便携式移动设备应用需要的针对椭圆曲线密码ECC算法的双域模乘模除器，具有可扩展性，同时显著地降低硬件成本。The purpose of the present invention is to propose a dual-domain modulo multiplier modulo divider for the Elliptic Curve Cryptography ECC algorithm suitable for high-speed network applications and portable mobile device applications, which has scalability and significantly reduces hardware costs.

本发明的技术方案是：一种适用于素域和多项式域的模乘模除器(如图1所示)，由17个多路选择器(1～17)、4个运算单元PE(26～29)、输入寄存器(23)、5个寄存器堆(18～22)、Booth编码单元(24)和控制模块(25)组成；其中：The technical scheme of the present invention is: a kind of modulus multiplication modulus divider (as shown in Figure 1) that is applicable to prime field and polynomial field, by 17 multiplexers (1～17), 4 computing units PE (26 ～29), input register (23), 5 register files (18～22), Booth encoding unit (24) and control module (25); Wherein:

a.在第二多路选择器(2)和第八多路选择器(8)选1态、13个多路选择器(3～7、10～17)选0态时，由第一多路选择器(1)、第九多路选择器(9)、5个寄存器堆(18～22)、输入寄存器(23)、Booth编码单元(24)、控制模块(25)和4个PE运算单元(26～29)组成模除器(见图5)，模除运算采用Euclidean算法；其中：a. When the second multiplexer (2) and the eighth multiplexer (8) select 1 state, and 13 multiplexers (3～7, 10～17) select 0 state, by the first multiplexer Way selector (1), ninth multiplexer (9), 5 register files (18-22), input register (23), Booth encoding unit (24), control module (25) and 4 PE operations Units (26-29) form a modulus divider (see Figure 5), and the modulo division operation adopts the Euclidean algorithm; wherein:

第一多路选择器(1)，输入为运算单元PE0(26)的输出，由stage信号来选择输出，输出至第一寄存器堆(18)或第二寄存器堆(19)；The first multiplexer (1), the input is the output of the arithmetic unit PE0 (26), the output is selected by the stage signal, and output to the first register file (18) or the second register file (19);

第九多路选择器(9)，输入为运算单元PE2(28)的输出，由stage信号来选择输出，输出至第三寄存器堆(20)或第四寄存器堆(21)；The ninth multiplexer (9), the input is the output of the arithmetic unit PE2 (28), the output is selected by the stage signal, and the output is to the third register file (20) or the fourth register file (21);

第一寄存器堆(18)和第二寄存器堆(19)的输出均到输入寄存器(23)；The output of the first register file (18) and the second register file (19) is all to the input register (23);

第三寄存器堆(20)和第四寄存器堆(21)的输出均到PE3运算单元(29)；The outputs of the third register file (20) and the fourth register file (21) are all to the PE3 arithmetic unit (29);

第五寄存器堆(22)的输出到Booth编码器(24)；The output of the fifth register file (22) is to Booth encoder (24);

输入寄存器(23)的输出到PE0运算单元(26)；The output of the input register (23) is to the PE0 arithmetic unit (26);

Booth编码单元(24)输出到PE2运算单元(28)；Booth encoding unit (24) is output to PE2 computing unit (28);

控制模块(25)输出到PE0和PE1运算单元(26、27)；The control module (25) is output to PE0 and PE1 computing units (26, 27);

PE0运算单元(26)的输出根据stage信号的选择写入到第一寄存器堆(18)或第二寄存器堆(19)中；The output of the PE0 arithmetic unit (26) is written into the first register file (18) or the second register file (19) according to the selection of the stage signal;

PE1运算单元(27)的输出到PE2运算单元(28)；The output of PE1 arithmetic unit (27) is to PE2 arithmetic unit (28);

PE2运算单元(28)的输出根据stage信号的选择写入到第三寄存器堆(20)或第四寄存器堆(21)中；The output of the PE2 arithmetic unit (28) is written into the third register file (20) or the fourth register file (21) according to the selection of the stage signal;

PE3运算单元(29)的输出到PE1运算单元(27)；The output of PE3 arithmetic unit (29) is to PE1 arithmetic unit (27);

b.在第一多路选择器(1)和第九多路选择器(9)不工作、13个多路选择器(3～7、10～17)选1态时，由第二多路选择器(2)、第八多路选择器(8)、5个寄存器堆(18～22)、输入寄存器(23)、Booth编码单元(24)、控制模块(25)和4个PE运算单元(26～29)组成模乘器(见图6)，模乘运算采用模哥马利算法；其中：b. When the first multiplexer (1) and the ninth multiplexer (9) are not working, and 13 multiplexers (3～7, 10～17) are in 1 state, the second multiplexer Selector (2), eighth multiplexer (8), 5 register files (18-22), input register (23), Booth encoding unit (24), control module (25) and 4 PE operation units (26～29) form modular multiplier (seeing Fig. 6), and modular multiplication operation adopts Modular Gomery algorithm; Wherein:

第二多路选择器(2)，输入为第一寄存器堆(18)和第四寄存器堆(21)内存储的数据，以First_round作为选择信号，选择输出正确的数据到输入寄存器(23)内；The second multiplexer (2), the input is the data stored in the first register file (18) and the fourth register file (21), with First_round as the selection signal, selects the correct data to output in the input register (23) ;

第八多路选择器(8)，输入为第二寄存器堆(19)和第五寄存器堆(22)内存储的数据，以First_round作为选择信号，选择输出正确的数据到输入寄存器(23)内；The eighth multiplexer (8), the input is the data stored in the second register file (19) and the fifth register file (22), and uses First_round as the selection signal to select and output the correct data into the input register (23) ;

第三寄存器堆(20)输出到Booth编码器(24)中；The third register file (20) is output in the Booth encoder (24);

输入寄存器(23)的两位输出到PE0运算单元(26)；Two bits of the input register (23) are output to the PE0 arithmetic unit (26);

Booth编码单元(24)的输出分别为4个PE运算单元(26～29)提供输入；The output of the Booth coding unit (24) provides input for 4 PE computing units (26～29) respectively;

PE0运算单元(26)的输出到PE1运算单元(27)；The output of PE0 arithmetic unit (26) is to PE1 arithmetic unit (27);

PE2运算单元(28)的输出到PE3运算单元(29)；The output of PE2 computing unit (28) is to PE3 computing unit (29);

PE3运算单元(29)的输出分别到第五寄存器堆(22)和第四寄存器堆(21)。The outputs of the PE3 arithmetic unit (29) are sent to the fifth register file (22) and the fourth register file (21) respectively.

上述运算单元PE(如图2所示)由6个多路选择器(30～35)、2个寄存器(36、37)、2个反相控制器(38、39)、3个进位保留加法器(40、41、42)、PE内部控制模块(43)和PE内部移位器(44)组成；其中：The above arithmetic unit PE (as shown in Figure 2) consists of 6 multiplexers (30-35), 2 registers (36, 37), 2 inverting controllers (38, 39), 3 carry-save addition device (40, 41, 42), PE internal control module (43) and PE internal shifter (44); wherein:

第十八多路选择器(30)的输入为模乘运算时累加被乘数的倍数mul_h和模除运算时累加D寄存器值的倍数div_h，由运算函数Func_sel作为选择，选择出正确的W寄存器值的倍数选择信号，输出到第十九多路选择器(31)的选择控制端；The input of the eighteenth multiplexer (30) is the multiple mul_h of the accumulated multiplicand during the modular multiplication operation and the multiple div_h of the accumulated D register value during the modular division operation, and is selected by the operation function Func_sel to select the correct W register The multiple selection signal of the value is output to the selection control terminal of the nineteenth multiplexer (31);

第十九多路选择器(31)的输入为0、W寄存器的值和2*W寄存器的值，由第十八多路选择器(31)的输出作为选择信号，选择出正确的操作数值到第一反相控制器(38)；The input of the nineteenth multiplexer (31) is the value of 0, the W register and the value of the 2*W register, and the output of the eighteenth multiplexer (31) is used as a selection signal to select the correct operand value To the first inverting controller (38);

第二十多路选择器(32)的输入为字运算时模P的倍数double1和double2，由运算函数Func_sel作为选择，输出正确的倍数选择信号到第二十三多路选择器(35)；The input of the twentieth multiplexer (32) is the multiples double1 and double2 of the modulus P during the word operation, and is selected by the operation function Func_sel, and outputs the correct multiple selection signal to the twenty-third multiplexer (35);

第二十一多路选择器(33)的输入为字运算时模P的倍数zero1和zero2，由运算函数Func_sel作为选择，输出正确的倍数选择信号到第二十三多路选择器(35)；The input of the twenty-first multiplexer (33) is the multiples zero1 and zero2 of the modulus P during the word operation, and is selected by the operation function Func_sel, and outputs the correct multiple selection signal to the twenty-third multiplexer (35) ;

第二十二多路选择器(34)的输入为字运算时模P的倍数neg1和neg2，由运算函数Func_sel作为选择，输出正确的倍数选择信号到第二反相控制器(39)；The input of the twenty-second multiplexer (34) is the multiples neg1 and neg2 of the modulus P during the word operation, and is selected by the operation function Func_sel, and outputs the correct multiple selection signal to the second inverting controller (39);

第二十三多路选择器(35)的输入为0、模P的值和2*模P的值，由第二十、第二十一多路选择器(32、33)的输出作为选择信号，选择出正确的操作数值到第二反相控制器(39)；The input of the twenty-third multiplexer (35) is the value of 0, modulo P and 2*modulo P, and is selected by the output of the twentieth and twenty-first multiplexer (32,33) signal, select the correct operand value to the second inverting controller (39);

第七寄存器(36)用来存放运算时的模P中一个字的值；The seventh register (36) is used to store the value of a word in the modulo P during operation;

第八寄存器(37)用来存放运算时W中一个字的值；The eighth register (37) is used to store the value of a word in W when computing;

第一反相控制器(38)的输入为第十九多路选择器(31)的输出值，控制信号为第十八多路选择器(30)的输出值，输出为输入经过取反后的数值；The input of the first inverting controller (38) is the output value of the 19th multiplexer (31), and the control signal is the output value of the 18th multiplexer (30), and the output is after the input is reversed. value;

第二反相控制器(39)的输入为第二十三多路选择器(35)的输出值，控制信号为第二十二多路选择器(34)的输出值，输出为输入经过取反后的数值；The input of the second inverting controller (39) is the output value of the twenty-third multiplexer (35), and the control signal is the output value of the twenty-second multiplexer (34), and the output is the input value after taking reversed value;

第一进位保留加法器(40)的输入为保存一个操作数的U寄存器当前字的结果和进位(Uc，Us)，以及素域或多项式域的域选择信号Field，第一个字First_word，和第一反相控制器(38)的输出，输出为通过进位保留加法器的值Us_1以及进位Uc_1；The input of the first carry reserve adder (40) is the result and the carry (Uc, Us) of the current word of the U register that preserves an operand, and the field selection signal Field of the prime field or the polynomial field, the first word First_word, and The output of the first inverting controller (38), the output is the value Us_1 and the carry Uc_1 by the carry save adder;

第二进位保留加法器(41)的输入为第一进位保留加法器(40)的输出Uc_1，Us_1，素域或多项式域的域选择信号Field，第一个字First_word，和第二反相控制器(39)的输出，输出为通过进位保留加法器的值Us_2以及进位Uc_2；The input of the second carry save adder (41) is the output Uc_1 of the first carry save adder (40), Us_1, the field selection signal Field of the prime field or the polynomial field, the first word First_word, and the second inversion control The output of the device (39), the output is the value Us_2 and the carry Uc_2 of the adder saved by the carry;

第三进位保留加法器(42)的输入为PE内部移位器(44)的输入Uc_3、Us_3和carry，输出为经过进位保留加法器的值Us_out以及进位Uc_out；The input of the third carry save adder (42) is the input Uc_3, Us_3 and carry of the PE internal shifter (44), and the output is the value Us_out and the carry Uc_out through the carry save adder;

PE内部控制模块(43)的输入为第一进位保留加法器(40)的输出Us_1、Uc_1的低两位，以及模P的低两位、右移的位数shift的值，输出为确定后面所加模P的倍数的控制信号；The input of the PE internal control module (43) is the lower two bits of the output Us_1 and Uc_1 of the first carry reserve adder (40), and the lower two bits of the modulus P, the value of the number shift of the right shift, and the output is to determine the following A control signal of a multiple of the added modulo P;

PE内部移位器(44)的输入为第二进位保留加法器(41)的输出值Us_2、Uc_2的值，以及右移的位数shift的值，输出为经过右移后的Us_3、Uc_3以及右移出来的carry的值。The input of the PE internal shifter (44) is the value of the output value Us_2, Uc_2 of the second carry save adder (41), and the value of the digit shift of the right shift, and the output is Us_3, Uc_3 and Us_3 after the right shift. The value of the carry that is shifted to the right.

本发明提出的适用于高速网络应用以及便携式移动设备应用需要的针对ECC(椭圆曲线密码)算法的双域模乘模除器，最高可执行480-bit的模除运算，采用SMIC 0.18μm CMOS工艺综合，关键路径时延为4.71纳秒，最高频率达212.3MHz，完成256-bit模除用时13us，面积为37k等效门，256-bit模乘用时1.4us，面积为11.4k等效门。The dual-domain modular multiplication modulus divider for the ECC (elliptic curve cryptography) algorithm proposed by the present invention is suitable for high-speed network applications and portable mobile device applications, and can perform up to 480-bit modular division operations, using SMIC 0.18μm CMOS technology In general, the critical path delay is 4.71 nanoseconds, the highest frequency is 212.3MHz, it takes 13us to complete 256-bit module division, and the area is 37k equivalent gates, and the time for 256-bit module multiplication is 1.4us, and the area is 11.4k equivalent gates .

ECC算法是一种公钥的算法，ECDH，ECDSA中主要运算集中在点乘的运算，而点乘又是由倍点和点加组成，而倍点和点加的运算主要是有限域下的模乘和模除算法。用硬件完成有限域的模乘模除算法，上层用软件去实现不同的公钥算法是一种性能和灵活性的最好折中。现在常用来处理有限域模除的算法主要是应用Euclidean算法，具体算法实现见图3；而处理有限域模乘的算法主要是用蒙哥马利模乘算法，具体算法实现见图4。The ECC algorithm is a public key algorithm. The main operations in ECDH and ECDSA focus on the point multiplication operation, and the point multiplication is composed of point doubling and point addition, and the doubling and point addition operations are mainly under the finite field. Modular multiplication and modular division algorithms. It is the best compromise between performance and flexibility to use hardware to complete the modular multiplication and modular division algorithm of finite fields, and use software to implement different public key algorithms in the upper layer. The algorithm commonly used to deal with finite field modular division is mainly the Euclidean algorithm, and the specific algorithm implementation is shown in Figure 3; while the algorithm for dealing with finite field modular multiplication is mainly the Montgomery modular multiplication algorithm, and the specific algorithm implementation is shown in Figure 4.

本发明的优点在于能同时处理ECC运算中素域及二进制域下的模乘和模除运算，主要的运算部件由相同结构的PE运算单元组成，并能通过重新配置运算单元之间的连接关系，能相应的加快不同算法的速度；同时也能尽可能的做到硬件的复用，在达到性能要求的同时，减少了硬件面积。The present invention has the advantage that it can simultaneously handle the modular multiplication and modular division operations in the prime field and the binary field in the ECC operation. The main operation components are composed of PE operation units with the same structure, and the connection relationship between the operation units can be reconfigured. , can correspondingly accelerate the speed of different algorithms; at the same time, it can also achieve hardware multiplexing as much as possible, and reduce the hardware area while meeting the performance requirements.

本发明的优点还在于提出了一种专门处理ECC运算中模乘模除的运算单元PE单元，通过分析素域及二进制域下的运算过程，提取出了运算中的基本运算：U＝(U+mul_h*W+kP)＞＞shift操作。每个PE运算单元都能完成这一基本运算，在进行模乘模除运算时，可以通过合理分配每个PE单元进行的操作，加速模乘模除运算。The advantage of the present invention also is to propose a kind of computing unit PE unit that specially handles modular multiplication and modular division in the ECC computing, by analyzing the computing process under prime domain and binary domain, has extracted the basic operation in the computing: U=(U +mul_h*W+kP)>>shift operation. Each PE operation unit can complete this basic operation. When performing modular multiplication and modular division, the operation of each PE unit can be reasonably allocated to speed up the modular multiplication and modular division operation.

本发明的优点之三在于该模乘模除器主要运算部件由多个结构相同的PE运算单元组成，结构是可以配置的(图1中以四个PE运算单元说明)，可以根据硬件或功率的要求，相应的增加或减少PE运算单元，以达到新的要求。The third advantage of the present invention is that the main operation part of the modular multiplication module divider is made up of a plurality of PE operation units with the same structure, and the structure is configurable (illustrated with four PE operation units in Fig. 1), and can be configured according to hardware or power According to the requirements, the PE operation unit is increased or decreased accordingly to meet the new requirements.

附图说明Description of drawings

图1是本发明双域模乘模除器顶层结构图；Fig. 1 is a top-level structural diagram of a dual-domain modulus multiplier and divider of the present invention;

图2是本发明双域模乘模除器的PE运算单元结构图；Fig. 2 is the structural diagram of the PE operation unit of the dual-domain modular multiplication modulus divider of the present invention;

图3是本发明改进后的Euclidean模除算法；Fig. 3 is the improved Euclidean modular division algorithm of the present invention;

图4是本发明改进后的蒙哥马利模乘算法；Fig. 4 is the improved Montgomery modular multiplication algorithm of the present invention;

图5是本发明双域模乘模除器作模除时的等效硬件结构图；Fig. 5 is the equivalent hardware structure diagram when the double-field modulus multiplier modulo divider of the present invention is done modulo division;

图6是本发明双域模乘模除器作模乘时的等效硬件结构图；Fig. 6 is the equivalent hardware structure figure when doing modular multiplication by the dual domain modular multiplication modulus divider of the present invention;

图7a是本发明进行模除运算时数据通路的算法说明图；Fig. 7 a is the algorithm explanatory diagram of data path when the present invention carries out modulo division operation;

图7b是图5的等效说明图；Figure 7b is an equivalent explanatory diagram of Figure 5;

图8a是本发明进行模乘运算时数据通路的算法说明图；Fig. 8 a is the algorithm explanatory diagram of data path when the present invention carries out modular multiplication operation;

图8b是图6的等效说明图。FIG. 8b is an equivalent explanatory diagram of FIG. 6 .

图中标号：1为第一多路选择器，2为第二多路选择器，3为第三多路选择器，4为第四多路选择器，5为第五多路选择器，6为第六多路选择器，7为第七多路选择器，8为第八多路选择器，9为第九多路选择器，10为第十多路选择器，11为第十一多路选择器，12为第十二多路选择器，13为第十三多路选择器，14为第十四多路选择器，15为第十五多路选择器，16为第十六多路选择器，17为第十七多路选择器，18为第一寄存器堆，19为第二寄存器堆，20为第三寄存器堆，21为第四寄存器堆，22为第五寄存器堆，23为输入寄存器，24为Booth编码单元，25为控制模块，26为PE0运算单元，27为PE1运算单元，28为PE2运算单元，29为PE3运算单元，30为第十八多路选择器，31为第十九多路选择器，32为第二十多路选择器，33为第二十一多路选择器，34为第二十二多路选择器，35为第二十三多路选择器，36为第七寄存器，37为第八寄存器，38为第一反相控制器，39为第二反相控制器，40为第一进位保留加法器，41为第二进位保留加法器，42为第三进位保留加法器，43为PE内部控制模块，44为PE内部移位器。Numbers in the figure: 1 is the first multiplexer, 2 is the second multiplexer, 3 is the third multiplexer, 4 is the fourth multiplexer, 5 is the fifth multiplexer, 6 is the sixth multiplexer, 7 is the seventh multiplexer, 8 is the eighth multiplexer, 9 is the ninth multiplexer, 10 is the tenth multiplexer, and 11 is the eleventh multiplexer 12 is the twelfth multiplexer, 13 is the thirteenth multiplexer, 14 is the fourteenth multiplexer, 15 is the fifteenth multiplexer, and 16 is the sixteenth multiplexer Way selector, 17 is the seventeenth multiplexer, 18 is the first register file, 19 is the second register file, 20 is the third register file, 21 is the fourth register file, 22 is the fifth register file, 23 is the input register, 24 is the Booth coding unit, 25 is the control module, 26 is the PE0 operation unit, 27 is the PE1 operation unit, 28 is the PE2 operation unit, 29 is the PE3 operation unit, 30 is the eighteenth multiplexer, 31 For the nineteenth multiplexer, 32 for the twentieth multiplexer, 33 for the twenty-first multiplexer, 34 for the twenty-second multiplexer, and 35 for the twenty-third multiplexer 36 is the seventh register, 37 is the eighth register, 38 is the first inverting controller, 39 is the second inverting controller, 40 is the first carry save adder, 41 is the second carry save adder, 42 is the third carry reserve adder, 43 is the PE internal control module, and 44 is the PE internal shifter.

具体实施方式Detailed ways

下面结合附图进一步说明本发明。Further illustrate the present invention below in conjunction with accompanying drawing.

图1是本发明双域模乘模除器顶层结构图，主要的运算部件是PE0(26)、PE1(27)、PE2(28)和PE3(29)四个运算单元，四个运算单元结构相同(见图2)，且可加速完成模乘模除运算的基本运算。该模乘模除器能完成素域下及二进制域下的模乘模除操作，在进行模乘(模除)运算时，所用到的被乘数(被除数)、乘数(除数)、乘法结果(除法结果)及模等数据，都存储在五个寄存器堆(18～22)中。Booth编码单元用来提供模乘运算中所用到的mul_h的值，而Control单元用来提供模除运算时所用到的div_h的值。而17个多路选择器(1～17)用来改变四个PE运算单元之间的连接以及数据的读取位置以完成模除或模乘操作，整个ECC模乘模除器可以通过重新配置单元之间的连接通路，来完成不同的算法。Fig. 1 is the top-level structural diagram of the double-field modulus multiplication modulus divider of the present invention, and main computing unit is PE0 (26), PE1 (27), PE2 (28) and PE3 (29) four computing units, four computing unit structures The same (see Figure 2), and can accelerate the completion of the basic operations of modular multiplication and modular division. The modular multiplication and modulus divider can complete the modular multiplication and modulus division operations under the prime field and the binary field. Data such as result (division result) and modulus are all stored in five register files (18～22). The Booth coding unit is used to provide the value of mul_h used in the modular multiplication operation, and the Control unit is used to provide the value of div_h used in the modular division operation. The 17 multiplexers (1~17) are used to change the connection between the four PE operation units and the reading position of the data to complete the modular division or modular multiplication operation. The entire ECC modular multiplication and modulus division can be reconfigured The connection paths between units are used to complete different algorithms.

1.模除状态1. Modulus status

模除状态由第一、第九多路选择器(1、9)，五个寄存器堆Regfile(18～22)，输入寄存器(23)，Booth编码单元(24)，控制模块(25)，四个PE运算单元(26～29)组成，数据通路如附图5所示，完成的模除算法如附图7a所示，模除时的顶层说明图如附图7b所示，其中：Modulo division state is by the first, the 9th multiplexer (1,9), five register files Regfile (18～22), input register (23), Booth encoding unit (24), control module (25), four Consists of four PE computing units (26-29), the data path is as shown in accompanying drawing 5, the completed modulus division algorithm is as shown in accompanying drawing 7a, and the top-level explanatory diagram during modulus division is as shown in accompanying drawing 7b, wherein:

第一多路选择器(1)，输入为PE0中计算的结果，由stage信号来选择，将输入的结果写入到Regfile1或Regfile2中。The input of the first multiplexer (1) is the result calculated in PE0, which is selected by the stage signal, and writes the input result into Regfile1 or Regfile2.

第九多路选择器(9)，输入为PE2中计算的结果，由stage信号来选择，将输入的结果写入到Regfile3或Regfile4中。The ninth multiplexer (9), the input is the result calculated in PE2, selected by the stage signal, and writes the input result into Regfile3 or Regfile4.

第一寄存器堆(18)，在stage信号为0时，存储的是以除数初始化的C寄存器中的计算结果以及进位(Cc，Cs)，stage信号为1时，存储的是以模P初始化的D寄存器中的计算结果及进位(Dc，Ds)；输入为PE0的计算结果，输出到输入寄存器(23)中。The first register file (18), when the stage signal is 0, stores the calculation result and the carry (Cc, Cs) in the C register initialized with the divisor, and when the stage signal is 1, what stores is initialized with the modulo P The calculation result and carry (Dc, Ds) in the D register; the input is the calculation result of PE0, which is output to the input register (23).

第二寄存器堆(19)，在stage信号为0时，存储的是以模P初始化的D寄存器中的计算结果以及进位(Dc，Ds)，stage信号为1时，存储的是以除数初始化的C寄存器中的计算结果及进位(Cc，Cs)；输入为PE0的计算结果，输出到输入寄存器(23)中。The second register file (19), when the stage signal was 0, stored the calculation result and the carry (Dc, Ds) in the D register initialized with the modulo P, and when the stage signal was 1, what stored was initialized with the divisor The calculation result and the carry (Cc, Cs) in the C register; the input is the calculation result of PE0, which is output to the input register (23).

第三寄存器堆(20)，在stage信号为0时，存储的是以被除数初始化的U寄存器中的计算结果以及进位(Uc，Us)，stage信号为1时，存储的是以0初始化的W寄存器中的计算结果及进位(Wc，Ws)；输入为PE2的计算结果，输出到PE3运算单元中。The third register file (20), when the stage signal is 0, stores the calculation result and the carry (Uc, Us) in the U register initialized with the dividend, and when the stage signal is 1, stores the W initialized with 0 The calculation result and carry (Wc, Ws) in the register; the input is the calculation result of PE2, and it is output to the PE3 arithmetic unit.

第四寄存器堆(21)，在stage信号为0时，存储的是以0初始化的W寄存器中的计算结果以及进位(Wc，Ws)，stage信号为1时，存储的是以被除数初始化的U寄存器中的计算结果及进位(Uc，Us)；输入为PE2的计算结果，输出到PE3运算单元中。The fourth register file (21), when the stage signal is 0, stores the calculation result and the carry (Wc, Ws) in the W register with 0 initialization, and when the stage signal is 1, what stores is the U initialized with the dividend The calculation result and carry (Uc, Us) in the register; the input is the calculation result of PE2, and it is output to the PE3 arithmetic unit.

第五寄存器堆(22)，在stage信号为0或1时，都存储的是模(0，P)的值；输出到Booth编码器(24)中。The fifth register file (22), when the stage signal is 0 or 1, all stores the value of the modulo (0, P); it is output to the Booth encoder (24).

输入寄存器(23)，输入为第一，第二寄存堆(18，19)的输出结果；输出到PE0运算单元中。Input register (23), the input is first, the output result of the second register pile (18,19); Output in PE0 operation unit.

Booth编码单元(24)，在Func sel为模乘运算时，作为Booth编码器对乘数进行编码；而在模除运算时，只作为寄存器使用；输入为第五寄存器堆(22)的结果；输出到PE2运算单元中。Booth encoding unit (24), when Func sel is modular multiplication operation, multiplier is encoded as Booth encoder; And when modular division operation, only uses as register; Input is the result of the 5th register file (22); Output to the PE2 computing unit.

控制模块(25)为PE0、PE1运算单元提供计算所需的参数divh，输出到PE0和PE1运算单元。The control module (25) provides the calculation required parameter divh for the PE0 and PE1 calculation units, and outputs to the PE0 and PE1 calculation units.

PE0运算单元(26)，完成C＝(C+div_h*D)＞＞shift运算，输入来自输入寄存器(23)的输出(C，D)的值，以及控制模块(25)的输出divh；输出为保留进位加法器计算的结果(Cc，Cs)；根据stage信号的选择写入到第一寄存器堆(18)或第二寄存器堆(19)中。PE0 arithmetic unit (26), completes C=(C+div_h*D)>>shift operation, input the value from the output (C, D) of input register (23), and the output divh of control module (25); Output The result (Cc, Cs) calculated by the carry adder is reserved; it is written into the first register file (18) or the second register file (19) according to the selection of the stage signal.

PE1运算单元(27)，完成U＝U+div_h*W运算，输入为(U，W)的值来自PE3运算单元(29)，以及控制模块(25)的输出div_h；输出为保留进位加法器计算的结果(Uc，Us)，输出到PE2运算单元(28)。PE1 operation unit (27), completes U=U+div_h*W operation, and the value that input is (U, W) comes from PE3 operation unit (29), and the output div_h of control module (25); Output is to preserve the carry adder The calculated results (Uc, Us) are output to the PE2 arithmetic unit (28).

PE2运算单元(28)，完成U＝(U+kP)＞＞shift运算，输入为PE1运算单元(27)的输出，以及Booth编码单元的输入模P；输出为保留进位加法器计算的结果(Cc，Cs)；根据stage信号的选择写入到第三(20)或第四寄存器堆(21)中.PE2 arithmetic unit (28), completes U=(U+kP) >> shift operation, input is the output of PE1 arithmetic unit (27), and the input module P of Booth encoding unit; Output is the result ( Cc, Cs); write to the third (20) or fourth register file (21) according to the selection of the stage signal.

PE3运算单元(29)，只完成对读出的第三寄存器堆(20)或第四寄存器堆(21)中的数据进行存储；输出到PE1运算单元(27)。The PE3 computing unit (29) only finishes storing the data in the read third register file (20) or the fourth register file (21); and outputs to the PE1 computing unit (27).

当进行模除运算时，四个PE单元的连接关系经过多路选择器的选择后如附图5所示，PE0运算单元(26)完成C＝(C+div_h*D)＞＞shift操作，而PE1、PE2配合，完成对寄存器U的计算，其中PE1完成U＝(U+divh*W)的操作，PE2运算单元(28)完成U＝(U+kp)＞＞shift的操作，PE3用作数据存储器使用。When carrying out modulo division operation, the connection relationship of four PE units is as shown in accompanying drawing 5 after the selection of multiplexer, and PE0 computing unit (26) completes C=(C+div_h*D)>>shift operation, And PE1, PE2 cooperate, finish the calculation to register U, wherein PE1 finishes the operation of U=(U+divh*W), PE2 operation unit (28) finishes the operation of U=(U+kp)>>shift, PE3 uses Used as data memory.

在进行模除运算时，基本思想如下(见图3)：针对已知的X、Y、P，求Z＝X/Ymod P。使用两个等式：CX≡UY mod P和DX≡WY mod P。初始化时C＝Y、U＝X、D＝P、W＝0，然后用扩展Euclidean算法，将gcd(C，D)化为gcd(0，1)或gcd(0，-1)，在此过程中W和U做相应的素域下的线性变换，最终得到W＝X/Y mod P或W＝-X/Y mod P。When performing modulo division operation, the basic idea is as follows (see Figure 3): for known X, Y, P, find Z=X/Ymod P. Two equations are used: CX≡UY mod P and DX≡WY mod P. When initializing C=Y, U=X, D=P, W=0, then use the extended Euclidean algorithm to convert gcd(C, D) into gcd(0, 1) or gcd(0, -1), where In the process, W and U do the linear transformation under the corresponding prime domain, and finally get W=X/Y mod P or W=-X/Y mod P.

模除算法中用到的基本运算有：C＝(C+div_h*D)＞＞shift；U＝U+div_h*W；U＝(U+kP)＞＞shift，具有相似的结构，于是设计一个基本的运算单元，能够完成X＝(X+kY)＞＞shift的运算，这样就可以完成模除的运算要求。考虑到取得最大的运算并行性，四个PE运算单元(26～29)进行相应的运算，而控制模块(25)为运算单元提供运算所需的参数div_h，五个寄存器堆(18～22)存储运算所需的数据，由于考虑到可扩展性和灵活性，所以运算中多bit的模运算都分解成字级运算.于是得到了针对模除运算的硬件单元，见附图5，顶层简图如附图7b所示。The basic operations used in the modular division algorithm are: C=(C+div_h*D)>>shift; U=U+div_h*W; U=(U+kP)>>shift, which has a similar structure, so the design A basic operation unit that can complete the operation of X=(X+kY)>>shift, so that the operation requirement of modulo division can be completed. In consideration of obtaining the maximum parallelism of operations, four PE operation units (26-29) perform corresponding operations, and the control module (25) provides operation units with the required parameters div_h, five register files (18-22) To store the data required for operation, due to the consideration of scalability and flexibility, the multi-bit modulo operation in the operation is decomposed into word-level operations. Then the hardware unit for the modulo division operation is obtained, see Figure 5, the top-level simplified The figure is shown in accompanying drawing 7b.

设计的PE运算单元具有相似的运算结构，能够完成Z＝(Z+mX+nY)＞＞shift的运算，如附图1所示。这样PE运算单元也即满足了除法中的运算要求。The designed PE operation unit has a similar operation structure and can complete the operation of Z=(Z+mX+nY)>>shift, as shown in FIG. 1 . In this way, the PE operation unit also satisfies the operation requirements in the division.

PE0运算单元(26)完成C＝(C+div_h*D)＞＞shift操作，操作数C、D、div_h和shift都由外部控制模块及寄存器堆提供。每个clock完成一个字的C＝(C+div_h*D)＞＞shift的操作，为了保证能工作在一个较高的时钟频率下，所以运算采用进位保留加法器CSA，这样结果包含两部分，运算的结果和进位结果，各为32bit，将这两部分值再暂存入寄存器堆中。The PE0 operation unit (26) completes C=(C+div_h*D)>>shift operation, and the operands C, D, div_h and shift are all provided by the external control module and the register file. Each clock completes the C=(C+div_h*D)>>shift operation of a word. In order to ensure that it can work at a higher clock frequency, the operation uses the carry-save adder CSA, so the result contains two parts. The result of the operation and the carry result are each 32 bits, and these two parts of the value are temporarily stored in the register file.

PE1运算单元(27)完成U＝(U+div_h*W)的操作，操作数U、W和div_h都来自外部控制模块及其他的运算单元提供。每个clock完成一个字的U＝(U+div_h*W)的操作.运算结果包含运算的结果及进位(Uc，Us)，将结果写入到下一个运算单元PE2中，完成后续的操作，形成一个小的两级流水线。The PE1 computing unit (27) completes the operation of U=(U+div_h*W), and the operands U, W and div_h are all provided from the external control module and other computing units. Each clock completes the operation of U=(U+div_h*W) of a word. The operation result includes the result of the operation and the carry (Uc, Us), and the result is written into the next operation unit PE2 to complete the subsequent operation. Form a small two-stage pipeline.

PE2运算单元(28)完成U＝(U+kp)＞＞shift的操作，操作数U、P、k和shift来自其它的运算单元或外部寄存器堆。每个clock完成一个字的U＝(U+kp)＞＞shift的操作，运算结果包含运算的结果及进位(Uc，Us)，将这两部分结果再暂存到外部寄存器堆中。The PE2 arithmetic unit (28) completes the operation of U=(U+kp)>>shift, and the operands U, P, k and shift come from other arithmetic units or external register files. Each clock completes the operation of U=(U+kp)>>shift of a word. The operation result includes the operation result and the carry (Uc, Us), and temporarily stores the two parts of the result in the external register file.

PE3运算单元(29)这时完成数据存储功能，而不再参与运算，这样能够与算法相一致，并且不用再添加多的硬件结构来存储PE1运算单元(26)的运算数据。PE3 calculation unit (29) completes the data storage function at this moment, and no longer participates in calculation, can be consistent with algorithm like this, and does not need to add more hardware structures to store the calculation data of PE1 calculation unit (26).

这时所需要的运算操作数有C、D、U、W、P、div_h和shift，其中C、D、U和W一方面需要存储初始的值，另一方面还要处理C和D、U和W数据的交换，所以实现中采用了五个寄存器堆，为了不要添加太多的面积，所以采用stage信号进行区分，对寄存器堆在除法中进行复用。The operation operands required at this time are C, D, U, W, P, div_h and shift, among which C, D, U and W need to store the initial value on the one hand, and on the other hand, they also need to process C and D, U For the exchange of data with W, five register files are used in the implementation. In order not to add too much area, the stage signal is used to distinguish, and the register files are multiplexed in the division.

第一寄存器堆(18)有16个64-bit的寄存器，在stage信号为0时，存储的是(Cc，Cs)的数据，而当stage信号为1时，存储的是(Dc，Ds)的数据。The first register file (18) has 16 64-bit registers. When the stage signal is 0, the data of (Cc, Cs) is stored, and when the stage signal is 1, the data of (Dc, Ds) is stored. The data.

第二寄存器堆(19)有16个64-bit的寄存器，在stage信号为0时，存储的是(Dc，Ds)的数据，而当stage信号为0时，存储的是(Cc，Cs)的数据。The second register file (19) has 16 64-bit registers, when the stage signal is 0, the data of (Dc, Ds) is stored, and when the stage signal is 0, the storage is (Cc, Cs) The data.

第三寄存器堆(20)有16个64-bit的寄存器，在stage信号为0时，存储的是(Uc，Us)的数据，而当stage信号为0时，存储的是(Wc，Ws)的数据。The third register file (20) has 16 64-bit registers. When the stage signal is 0, the data of (Uc, Us) is stored, and when the stage signal is 0, the data of (Wc, Ws) is stored. The data.

第四寄存器堆(21)有16个64-bit的寄存器，在stage信号为0时，存储的是(Wc，Ws)的数据，而当stage信号为0时，存储的是(Uc，Us)的数据。The fourth register file (21) has 16 64-bit registers. When the stage signal is 0, the data of (Wc, Ws) is stored, and when the stage signal is 0, the data of (Uc, Us) is stored. The data.

第五寄存器堆(22)有16个64-bit的寄存器，在stage信号为0或1时，存储的都是模P的数据(0，P)。The fifth register file (22) has 16 64-bit registers, and when the stage signal is 0 or 1, all data (0, P) modulo P are stored.

而Booth编码单元(24)此时只提供模P的数据存储作用，这样复用模乘时的Booth编码单元，不用再为模P专门再添加暂存单元。And Booth coding unit (24) only provides the data storage effect of modulus P now, the Booth coding unit when the modular multiplication like this reuses, needn't add temporary storage unit again specially for modulus P again.

2.模乘状态2. Modular multiplication status

模乘状态由第二、第八多路选择器(2、8)、五个寄存器堆(18～22)、输入寄存器(23)、Booth编码单元(24)、四个PE运算单元(26～29)组成，数据通路如图6所示，具体实现的模乘算法如附图8a所示，模乘时的顶层说明简图如附图8b所示.其中：The modular multiplication state is composed of the second and the eighth multiplexer (2, 8), five register files (18-22), input register (23), Booth encoding unit (24), four PE operation units (26-22), 29) Composition, the data path is as shown in Figure 6, the modular multiplication algorithm of concrete realization is as shown in accompanying drawing 8a, and the top-level explanation diagram during modular multiplication is as shown in accompanying drawing 8b. Wherein:

第二多路选择器(2)，输入为第一寄存器堆(18)和第四寄存器堆(21)内存储的数据，是否为第一次运算的信号First round作为选择信号，选择出正确的数据到输入寄存器(23)内。The second multiplexer (2), the input is the data stored in the first register file (18) and the fourth register file (21), whether it is the signal First round of the first operation as the selection signal, selects the correct Data into the input register (23).

第八多路选择器(8)，输入为第二寄存器堆(19)和第五寄存器堆(22)内存储的数据，是否为第一次运算的信号First_round作为选择信号，选择出正确的数据到输入寄存器(23)内。The eighth multiplexer (8), the input is the data stored in the second register file (19) and the fifth register file (22), whether it is the signal First_round of the first operation as a selection signal, and selects the correct data into the input register (23).

第一寄存器堆(18)，存储的是被乘数W和模P的值(W，P)，按字存储在寄存器堆中；输出到第二多路选择器(2)中。The first register file (18) stores the value (W, P) of the multiplicand W and the modulo P, and is stored in the register file by word; it is output to the second multiplexer (2).

第二寄存器堆(19)，存储的是运算中的U的值及进位(Uc，Us)，也是按字存储在寄存器堆中；输出到第八多路选择器(8)中。The second register file (19) stores the value of U in the operation and the carry (Uc, Us), which is also stored in the register file by word; it is output to the eighth multiplexer (8).

第三寄存器堆(20)，存储的是运算中的乘数的值(0，C)，也是按字存储在寄存器堆中；输出到Booth编码器(24)中。The third register file (20), which stores the value (0, C) of the multiplier in the operation, is also stored in the register file by word; it is output to the Booth encoder (24).

第四寄存器堆(21)，存储的是运算中存储的被乘数和模(W，P)的值，在模乘时第四寄存器堆(21)作用是一个fifo，存储的数据也是按字存储在寄存器堆中；输出到第二多路选择器(2)中。The fourth register file (21) stores the value of the multiplicand and the modulus (W, P) stored in the operation, and the fourth register file (21) acts as a fifo during modular multiplication, and the stored data is also by word Stored in the register file; output to the second multiplexer (2).

第五寄存器堆(22)，存储的是运算中存储的U的值及进位(Uc，Us)，在模乘时第五寄存器堆(22)的作用也是一个fifo，存储的数据按字存储在寄存器堆中；输出到第八多路选择器(8)中。The fifth register file (22), stored is the value of U stored in the operation and the carry (Uc, Us), and the effect of the fifth register file (22) is also a fifo during modular multiplication, and the stored data is stored in word by word In the register file; output in the eighth multiplexer (8).

输入寄存器(23)，有两个64bit的数据输入，分别是被乘数及模P(W，P)和计算中的进位及结果(Uc，Us)；输出到PE0运算单元(26)。The input register (23) has two 64bit data inputs, which are respectively the multiplicand and the modulus P (W, P) and the carry and result (Uc, Us) in the calculation; output to the PE0 arithmetic unit (26).

Booth编码单元(24)，输入为乘数C中的数据；分别为四个PE运算单元提供乘以W的倍数mul_h0、mul_h1、mul_h2和mul_h3。Booth coding unit (24), the input is the data in the multiplier C; The multiples mul_h0, mul_h1, mul_h2 and mul_h3 that are multiplied by W are provided for the four PE operation units respectively.

PE0运算单元，输入为两个64bit的数据，分别是被乘数及模P(W，P)和计算中的进位及结果(Uc，Us)；完成运算U＝(U+mul_h*W+kP)＞＞shift；输出为计算后结果及进位(Uc，Us)和被乘数及模P(W，P)的值，输出到PE1运算单元中。PE0 operation unit, the input is two 64bit data, which are the multiplicand and the modulus P(W, P) and the carry and result (Uc, Us) in the calculation; complete the operation U=(U+mul_h*W+kP )>>shift; the output is the calculated result and the carry (Uc, Us) and the multiplicand and the value of the modulus P (W, P), which are output to the PE1 arithmetic unit.

PE1运算单元，输入为两个64bit的数据，分别是被乘数及模P(W，P)和PE0中计算中的进位及结果(Uc，Us)；完成运算U＝(U+mul_h*W+kP)＞＞shift；输出为计算后结果及进位(Uc，Us)和被乘数及模P(W，P)的值，输出到PE2运算单元中。PE1 operation unit, the input is two 64bit data, which are respectively the multiplicand and the modulus P(W, P) and the carry and result (Uc, Us) in the calculation in PE0; complete the operation U=(U+mul_h*W +kP)>>shift; the output is the calculated result and the carry (Uc, Us) and the value of the multiplicand and the modulus P (W, P), and output to the PE2 arithmetic unit.

PE2运算单元，输入为两个64bit的数据，分别是被乘数及模P(W，P)和PE1中计算中的进位及结果(Uc，Us)；完成运算U＝(U+mul_h*W+kP)＞＞shift；输出为计算后结果及进位(Uc，Us)和被乘数及模P(W，P)的值，输出到PE3运算单元中。PE2 operation unit, the input is two 64bit data, which are respectively the multiplicand and the modulus P(W, P) and the carry and result (Uc, Us) in the calculation in PE1; complete the operation U=(U+mul_h*W +kP)>>shift; the output is the calculated result and the carry (Uc, Us) and the value of the multiplicand and the modulus P (W, P), and output to the PE3 arithmetic unit.

PE3运算单元，输入为两个64bit的数据，分别是被乘数及模P(W，P)和PE2中计算中的进位及结果(Uc，Us)；完成运算U＝(U+mul_h*W+kP)＞＞shift；输出为计算后结果及进位(Uc，Us)和被乘数及模P(W，P)的值，前者输出到第五寄存器堆中(22)，后者输出到第四寄存器堆(21)中。PE3 operation unit, the input is two 64bit data, which are respectively the multiplicand and the modulus P(W, P) and the carry and result (Uc, Us) in the calculation in PE2; complete the operation U=(U+mul_h*W +kP)>>shift; the output is the calculated result and the carry (Uc, Us) and the multiplicand and the value of the modulus P (W, P), the former is output to the fifth register file (22), and the latter is output to In the fourth register file (21).

当进行模乘运算时，四个PE单元的连接关系经过多路选择器的选择后如附图8b所示，四个PE单元形成一个流水线结构，都进行U＝(U+mul_h*W+kP)＞＞shift的操作。When performing modular multiplication, the connection relationship of the four PE units is selected by the multiplexer as shown in Figure 8b, and the four PE units form a pipeline structure, and U=(U+mul_h*W+kP )>>shift operation.

对于模乘运算，对素域下，采用经过基-4的Booth编码的蒙哥马利模乘，算法中用到的基本运算有U＝(U+mul_h*W+kP)＞＞shift。为了取得更快的运算速度，所以让四个PE运算单元(26～29)都进行U＝(U+mul_h*W+kP)＞＞shift操作，形成一个四级流水线。For the modular multiplication operation, under the prime field, the Montgomery modular multiplication through the base-4 Booth code is adopted, and the basic operation used in the algorithm is U=(U+mul_h*W+kP)>>shift. In order to obtain a faster calculation speed, let the four PE calculation units (26-29) perform U=(U+mul_h*W+kP)>>shift operation to form a four-stage pipeline.

根据Field(素域还是多项式域)和C的值设置此次累加运算所需的参数值。其中mul_h表示累加时被乘数的倍数。参数shift表示此次累加后U要右移的位数。参数k的决定方式同算法1，目的也是使U+kP后值的最低shift位为0。参数h的值为乘数的位数。为了使素域下的运算结果在GF(P)内，算法对结果做了调整。而多项式域下的运算会将这一步跳过(见图4)。According to the value of Field (prime field or polynomial field) and C, set the parameter value required for this accumulation operation. Where mul_h represents the multiple of the multiplicand when accumulating. The parameter shift indicates the number of digits to be shifted right by U after this accumulation. The determination method of parameter k is the same as algorithm 1, and the purpose is to make the lowest shift bit of the value after U+kP be 0. The value of the parameter h is the number of digits of the multiplier. In order to make the operation result under the prime field within GF(P), the algorithm adjusts the result. However, operations under the polynomial domain will skip this step (see Figure 4).

可见，本算法可以支持双域下的模乘运算，而且素域下的模乘运算因为采用了booth编码，累加运算的次数可以缩减为原来的一半。需要说明的是：为了实现可扩展，并使模乘与模除运算能够共用硬件单元，算法1和算法2中的长操作数的加减法和移位运算都以字为单位进行。设操作数长度为n，字长为w，It can be seen that this algorithm can support the modular multiplication operation in the dual field, and the modular multiplication operation in the prime field can be reduced to half of the original number of accumulation operations because of the use of booth coding. It should be noted that: in order to realize scalability and enable modular multiplication and modular division operations to share hardware units, the addition, subtraction and shift operations of long operands in Algorithm 1 and Algorithm 2 are all performed in units of words. Let the operand length be n, word length be w,

则操作数以字为单位的向量表示为{0，W^(e-1)，...，W⁽¹⁾，W⁽⁰⁾}。高位上增加全零字的目的是为了防止加法运算时结果溢出，另外，它还可以作为符号字，方便加法结果右移后的符号位扩展。 Then the vector of operands in units of words is expressed as {0, W ^(e-1) , . . . , W ⁽¹⁾ , W ⁽⁰⁾ }. The purpose of adding an all-zero word on the high bit is to prevent the result from overflowing during the addition operation. In addition, it can also be used as a sign word to facilitate the sign bit expansion after the addition result is shifted to the right.

PE0运算单元(26)完成U＝(U+mul_h0*W+kP)＞＞shift操作，操作数U、W、P、mul_h0及shift都由外部控制模块及寄存器堆提供。每个clock完成一个字的U＝(U+mul_h*W+kP)＞＞shift的操作，为了保证能工作在一个较高的时钟频率下，所以运算采用进位保留加法器CSA，这样结果包含两部分，运算的结果和进位结果，各为32bit，将这两部分值再写入到下一个运算单元中，进行流水线操作。The PE0 operation unit (26) completes U=(U+mul_h0*W+kP)>>shift operation, and the operands U, W, P, mul_h0 and shift are all provided by the external control module and the register file. Each clock completes the operation of U=(U+mul_h*W+kP)>>shift of a word. In order to ensure that it can work at a higher clock frequency, the operation uses the carry-save adder CSA, so the result contains two In the part, the result of the operation and the result of the carry are each 32 bits, and the values of these two parts are written into the next operation unit for pipeline operation.

PE1运算单元(27)完成U＝(U+mul_h1*W+kP)＞＞shift操作，操作数U、W、P、mul_h1及shift都由外部控制模块及上一个PE运算单元提供。每个clock完成一个字的U＝(U+mul_h*W+kP)＞＞shift的操作，为了保证能工作在一个较高的时钟频率下，所以运算采用进位保留加法器CSA，这样结果包含两部分，运算的结果和进位结果，各为32bit，将这两部分值再写入到下一个运算单元中，进行流水线操作。PE1 arithmetic unit (27) completes U=(U+mul_h1*W+kP)>>shift operation, operands U, W, P, mul_h1 and shift are all provided by the external control module and last PE arithmetic unit. Each clock completes the operation of U=(U+mul_h*W+kP)>>shift of a word. In order to ensure that it can work at a higher clock frequency, the operation uses the carry-save adder CSA, so the result contains two In the part, the result of the operation and the result of the carry are each 32 bits, and the values of these two parts are written into the next operation unit for pipeline operation.

PE2运算单元(28)完成U＝(U+mul_h2*W+kP)＞＞shift操作，操作数U、W、P、mul_h2及shift都由外部控制模块及上一个PE运算单元提供。每个clock完成一个字的U＝(U+mul_h*W+kP)＞＞shift的操作，为了保证能工作在一个较高的时钟频率下，所以运算采用进位保留加法器CSA，这样结果包含两部分，运算的结果和进位结果，各为32bit，将这两部分值再写入到下一个运算单元中，进行流水线操作。PE2 arithmetic unit (28) completes U=(U+mul_h2*W+kP)>>shift operation, operands U, W, P, mul_h2 and shift are all provided by external control module and last PE arithmetic unit. Each clock completes the operation of U=(U+mul_h*W+kP)>>shift of a word. In order to ensure that it can work at a higher clock frequency, the operation uses the carry-save adder CSA, so the result contains two In the part, the result of the operation and the result of the carry are each 32 bits, and the values of these two parts are written into the next operation unit for pipeline operation.

PE3运算单元(29)完成U＝(U+mul_h3*W+kP)＞＞shift操作，操作数U、W、P、mul_h3及shift都由外部控制模块及上一个PE运算单元提供。每个clock完成一个字的U＝(U+mul_h*W+kP)＞＞shift的操作，为了保证能工作在一个较高的时钟频率下，所以运算采用进位保留加法器CSA，这样结果包含两部分，运算的结果和进位结果，各为32bit，将这两部分值再写入到寄存器堆单元中，进行流水线操作。PE3 arithmetic unit (29) completes U=(U+mul_h3*W+kP)>>shift operation, operands U, W, P, mul_h3 and shift are all provided by external control module and last PE arithmetic unit. Each clock completes the operation of U=(U+mul_h*W+kP)>>shift of a word. In order to ensure that it can work at a higher clock frequency, the operation uses the carry-save adder CSA, so the result contains two Part, the result of the operation and the carry result are 32 bits each, and the values of these two parts are written into the register file unit for pipeline operation.

这里需要提供U、W、P、mul_h等参数，于是设计U、W、P等参数存入到第一、第二、第三、第四和第五寄存器堆(18～22)中，而四个PE运算单元所用的mul_h0、mul_h1、mul_h2、mul_h3四个参数由Booth编码单元(24)提供。一个clock内，Booth编码单元一次提供四个运算单元所用的mul_h值，这样四个PE运算单元(26～29)组成一个四级流水线进行并行计算。其中：Parameters such as U, W, P, and mul_h need to be provided here, so parameters such as U, W, and P are designed and stored in the first, second, third, fourth, and fifth register files (18-22), and the four The four parameters mul_h0, mul_h1, mul_h2 and mul_h3 used by each PE operation unit are provided by the Booth coding unit (24). Within one clock, the Booth coding unit provides the mul_h value used by four computing units at a time, so that four PE computing units (26-29) form a four-stage pipeline for parallel computing. in:

第一寄存器堆(18)存储初始的被乘数和模P的值(W，P)，输出到PE0运算单元(26)，进行运算。The first register file (18) stores the initial multiplicand and the value (W, P) of the modulo P, and outputs to the PE0 operation unit (26) for operation.

第二寄存器堆(19)存储运算单元所计算的进位及结果(Uc，Us)，输出到PE0运算单元(27)，进行运算。The second register file (19) stores the carry and the result (Uc, Us) calculated by the operation unit, and outputs to the PE0 operation unit (27) for operation.

第三寄存器堆(20)存储乘数C的值(0，C)，输出到Booth编码单元(24)中，为四个PE运算单元(26～29)同时提供四个运算所需的参数mul_h。The third register file (20) stores the value (0, C) of the multiplier C, which is output to the Booth encoding unit (24), and provides the required parameters mul_h for four operations simultaneously for four PE operation units (26～29) .

第四寄存器堆(21)以fifo的形式存储当前运算的(W，P)的值，输出到PE0运算单元(26)进行运算。The fourth register file (21) stores the value of (W, P) of the current operation in the form of fifo, and outputs it to the PE0 operation unit (26) for operation.

第五寄存器堆(22)以fifo的形式存储当前运算的(Uc，Us)的值，输出到PE0运算单元(26)进行运算。The fifth register file (22) stores the value of (Uc, Us) of the current operation in the form of fifo, and outputs it to the PE0 operation unit (26) for operation.

Claims

1. a mould that is applicable to prime field and polynomial expression territory takes advantage of mould to remove device, it is characterized in that: it is made up of 17 MUX (1～17), 4 arithmetic element PE (26～29), input register (23), 5 register files (18～22), Booth coding unit (24) and control module (25); Wherein:

A. select 1 attitude, 13 MUX (3～7,10～17) when selecting 0 attitude in second MUX (2) and the 8th MUX (8), form mould by first MUX (1), the 9th MUX (9), 5 register files (18～22), input register (23), Booth coding unit (24), control module (25) and 4 PE arithmetic elements (26～29) and remove device, the mould division operation adopts the Euclidean algorithm;

B. do not work in first MUX (1) and the 9th MUX (9), 13 MUX (3～7,10～17) are when selecting 1 attitude, form mould by second MUX (2), the 8th MUX (8), 5 register files (18～22), input register (23), Booth coding unit (24), control module (25) and 4 PE arithmetic elements (26～29) and take advantage of device, modular multiplication adopts mould Montgomery algorithm.

2. take advantage of mould to remove device by the described mould in prime field and polynomial expression territory that is applicable to of claim 1, it is characterized in that: among the described step a:

First MUX (1) is input as the output of arithmetic element PE0 (26); Select output by the stage signal, export first register file (18) or second register file (19) to;

The 9th MUX (9) is input as the output of arithmetic element PE2 (28); Select output by the stage signal, export the 3rd register file (20) or the 4th register file (21) to;

The output of first register file (18) and second register file (19) is all to input register (23);

The output of the 3rd register file (20) and the 4th register file (21) is all to PE3 arithmetic element (29);

The 5th register file (22) output to Booth scrambler (24);

Input register (23) output to PE0 arithmetic element (26);

Booth coding unit (24) outputs to PE2 arithmetic element (28);

Control module (25) outputs to PE0 and PE1 arithmetic element (26,27);

The output of PE0 arithmetic element (26) is written to first register file (18) or second register file (19) according to the selection of stage signal;

PE1 arithmetic element (27) output to PE2 arithmetic element (28);

The output of PE2 arithmetic element (28) is written to the 3rd register file (20) or the 4th register file (21) according to the selection of stage signal;

PE3 arithmetic element (29) output to PE1 arithmetic element (27).

3. take advantage of mould to remove device by the described mould in prime field and polynomial expression territory that is applicable to of claim 1, it is characterized in that: among the described step b:

Second MUX (2) is input as the data of first register file (18) and the 4th register file (21) stored; As selecting signal, select the correct data of output with First_round in input register (23);

The 8th MUX (8) is input as the data of second register file (19) and the 5th register file (22) stored; As selecting signal, select the correct data of output with First_round in input register (23);

The 3rd register file (20) output to Booth scrambler (24);

Two of input register (23) output to PE0 arithmetic element (26);

The output of Booth coding unit (24) is respectively 4 PE arithmetic elements (26～29) input is provided;

PE0 arithmetic element (26) output to PE1 arithmetic element (27);

PE1 arithmetic element (27) output to PE2 arithmetic element (28);

PE2 arithmetic element (28) output to PE3 arithmetic element (29);

The output branch of PE3 arithmetic element (29) is clipped to the 5th register file (22) and the 4th register file (21).

4. mould according to claim 1 takes advantage of mould to remove device, it is characterized in that: described arithmetic element PE is made up of 6 MUX (30～35), 2 registers (36,37), 2 anti-phase controllers (38,39), 3 carry save adders (40,41,42), PE internal control module (43) and PE internal displacement device (44); Wherein:

The 18 MUX (30) be input as modular multiplication the time add up when adding up the multiple mul_h of multiplicand and the mould division operation multiple div_h of D register value, by operating function Func_sel as selection, select the multiple of correct W register value and select signal, output to the selection control end of the 19 MUX (31);

The 19 MUX (31) be input as 0, the value of W register and the value of 2*W register, as selecting signal, select the anti-phase controller of correct operand value to the first (38) by the output of the 18 MUX (31);

The 20 MUX (32) be input as word arithmetic the time mould P multiple double1 and double2,, export correct multiple and select signal to the 23 MUX (35) as selection by operating function Func_sel;

The 21 MUX (33) be input as word arithmetic the time mould P multiple zero1 and zero2,, export correct multiple and select signal to the 23 MUX (35) as selection by operating function Func_sel;

The 22 MUX (34) be input as word arithmetic the time mould P multiple neg1 and neg2,, export correct multiple and select the anti-phase controller of signal to the second (39) as selection by operating function Func_sel;

The 23 MUX (35) be input as 0, the value of mould P and the value of 2* mould P, by the 20, the output of the 21 MUX (32,33) is as selecting signal, selects the anti-phase controller of correct operand value to the second (39);

The value of a word among the mould P of the 7th register (36) when being used for depositing computing;

The 8th register (37) value of a word among the W when being used for depositing computing;

The output valve that is input as the 19 MUX (31) of the first anti-phase controller (38), control signal is the output valve of the 18 MUX (30), is output as input through the numerical value after the negate;

The output valve that is input as the 23 MUX (35) of the second anti-phase controller (39), control signal is the output valve of the 22 MUX (34), is output as input through the numerical value after the negate;

The result and the carry (Uc that are input as the current word of U register of preserving an operand of first carry save adder (40), Us), and signal Field is selected in the territory in prime field or polynomial expression territory, first word First_word, with the output of the first anti-phase controller (38), be output as value Us_1 and carry Uc_1 by carry save adder;

The output Uc_1 that is input as first carry save adder (40) of second carry save adder (41), Us_1, signal Field is selected in the territory in prime field or polynomial expression territory, first word First_word, with the output of the second anti-phase controller (39), be output as value Us_2 and carry Uc_2 by carry save adder;

Input Uc_3, the Us_3 and the carry that are input as PE internal displacement device (44) of the 3rd carry save adder (42) are output as value Us_out and carry Uc_out through carry save adder;

The output Us_1, the Uc_1's that are input as first carry save adder (40) of PE internal control module (43) is low two, and low two of mould P, the value of the figure place shift that moves to right, is output as and determines that the back adds the control signal of the multiple of mould P;

The output valve Us_2 that is input as second carry save adder (41) of PE internal displacement device (44), the value of Uc_2, and the value of the figure place shift that moves to right are output as the value through Us_3, Uc_3 after moving to right and the carry that moves to right out.