CN116432765A

CN116432765A - RISC-V-based special processor for post quantum cryptography algorithm

Info

Publication number: CN116432765A
Application number: CN202310059346.8A
Authority: CN
Inventors: 黄科杰; 宋瑞冰; 叶泽文; 沈海斌
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2023-01-16
Filing date: 2023-01-16
Publication date: 2023-07-14
Anticipated expiration: 2043-01-16
Also published as: CN116432765B

Abstract

The invention discloses a RISC-V-based dedicated processor for post-quantum cryptography algorithms. The processor adopts a superscalar pipeline design that separates read/write from computation: it comprises a read/write pipeline and a computing pipeline that can execute in parallel, each consisting of instruction fetch, decode, execute, and write-back parts. The parallel computing unit module in the execute part is implemented with component reorganization and operator fusion techniques, so that the same components implement the core operations of post-quantum cryptography algorithms across different computation types and bit widths, which reduces the processor's hardware resource overhead and increases its operating speed. The parallel computing unit module and the parallel register modules of the processor form a mixed-bit-width parallel storage system and data flow, which balances the load between the computing and read/write pipelines and reduces the required bandwidth.

Description

A RISC-V-based dedicated processor for post-quantum cryptography algorithms

Technical field

The invention belongs to the technical field of hardware acceleration of post-quantum cryptography algorithms and of processors with RISC-V (Reduced Instruction Set Computer-V, the fifth-generation reduced instruction set computer) instruction set extensions, and specifically relates to a RISC-V-based dedicated processor for post-quantum cryptography algorithms.

Background

Post-quantum cryptography algorithms are a new generation of public-key cryptographic algorithms that can withstand attacks by quantum computers. The traditional public-key algorithms in use today (RSA, Diffie-Hellman, elliptic curve cryptography, and so on) are built on mathematically hard problems such as large-integer factorization and the discrete logarithm problem (including its elliptic curve variant). However, with the continued development of quantum computing and the emergence of efficient quantum algorithms (such as Shor's algorithm of 1994), a sufficiently large and stable quantum computer could solve these hard problems in polynomial time, so traditional public-key algorithms will soon cease to be secure. To safeguard information security, the cryptographic community has, after years of extensive research and discussion, formulated new post-quantum cryptography standards intended to gradually replace the traditional algorithms.

However, post-quantum cryptography algorithms are difficult to deploy on existing hardware, especially on low-cost terminal devices. Compared with traditional public-key algorithms, they involve more computation and more complex forms of computation, and their larger key sizes and computation scale make it very difficult to plan the data flow and storage space during computation, so applications with strict real-time requirements are hard to support. To speed up post-quantum cryptography algorithms, academia has carried out extensive research and has optimized and accelerated them at both the software and hardware levels. These designs, however, mainly target high-performance, high-cost server-side devices and suffer from high power consumption, low circuit reuse, and large resource overhead. For the low-performance, low-cost terminal devices that dominate applications such as the Internet of Things, hardware acceleration schemes for post-quantum cryptography algorithms are still lacking.

The RISC-V instruction set is an open-source instruction set architecture based on reduced instruction set computer (RISC) principles. It is suited to small, fast, low-power designs, is highly extensible, and is widely used in the design of low-cost terminal devices. It is one of the foundational technologies adopted in this design.

Summary of the invention

To solve the problems in the prior art and fill this technical gap, the invention proposes a dedicated processor for post-quantum cryptography algorithms built as an extension of the RISC-V instruction set. The invention exploits the extensibility of the RISC-V instruction set together with the parallelism, high reuse, and low power consumption offered by techniques such as computing-component reorganization, operator fusion, and multiple instruction issue, in order to accelerate post-quantum cryptography computation at low power and low resource overhead, yielding a low-power, high-speed dedicated processor for post-quantum cryptography algorithms.

The technical solution of the invention is as follows:

The invention first provides a RISC-V-based dedicated processor for post-quantum cryptography algorithms. It adopts a superscalar pipeline design that separates read/write from computation, with a read/write pipeline and a computing pipeline that can execute in parallel. Each pipeline has four parts executed in sequence: instruction fetch, decode, execute, and write-back. The processor specifically includes:

An instruction fetch part, shared by the read/write and computing pipelines. This part fetches instructions from the instruction cache outside the processor and, after pre-analysis, distributes them to the decode parts of the two pipelines. It includes an instruction interface module that serves as the interface for exchanging data with the external instruction cache, a prefetch module that reads ahead and buffers multiple instructions, and an instruction distribution module that uses an in-order multi-issue scheme. This part issues non-read/write instructions to the computing pipeline and read/write instructions to the read/write pipeline;

A decode part, which mainly contains four register modules and two decoder modules. It processes the instructions issued by the instruction fetch part, translates them into specific control signals, reads data from the register modules, and then sends the control signals and data to the execute parts of the two pipelines. The register modules are shared by the two pipelines. The decoder modules comprise a first decoder module belonging to the computing pipeline and a second decoder module belonging to the read/write pipeline; each translates the instructions received by its own pipeline, reads the data it needs from the relevant register modules, and sends the resulting control signals and data to the execute part of its own pipeline;

An execute part, which performs the specific computation or memory access operations according to the control signals and data sent by the decode part, and then sends the results and control signals to the write-back parts of the two pipelines. The execute part of the computing pipeline includes a parallel computing unit module, which has a mixed-bit-width design and supports parallel computation on data of several bit widths up to 256 bits. The execute part of the read/write pipeline contains the first stage of a read/write control unit module; this stage accesses the data cache outside the processor according to the control signals and data sent by the decode part, while the rest of the module operates in the write-back part of the read/write pipeline;

A write-back part, which writes the results sent by the execute parts of the computing and read/write pipelines back to the corresponding register modules in the decode part, according to the control signals. The write-back part of the read/write pipeline includes the second stage of the read/write control unit module, which returns the result of the first-stage access to the external data cache to the processor and stores it in a register module.
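The in-order multi-issue behavior of the instruction distribution module can be illustrated with a minimal sketch. It assumes at most two instructions are issued per cycle (one per pipeline), ignores data hazards and stalls, and uses hypothetical mnemonics (`lv`, `sv`, `addv`) that are illustrative only and not taken from the patent's instruction list.

```python
from collections import deque

# Hypothetical read/write mnemonics, for illustration only.
READ_WRITE_OPS = {"lw", "sw", "lv", "sv"}


def is_read_write(instr):
    """Classify an instruction by its mnemonic (the first token)."""
    return instr.split()[0] in READ_WRITE_OPS


def dispatch_in_order(program):
    """Return one (compute_slot, read_write_slot) pair per cycle.

    Instructions leave the queue strictly in program order. A second
    instruction is issued in the same cycle only if it targets the
    pipeline that the first one left free. Data hazards are ignored.
    """
    queue = deque(program)
    cycles = []
    while queue:
        slots = {"compute": None, "read_write": None}
        first = queue.popleft()
        slots["read_write" if is_read_write(first) else "compute"] = first
        if queue:
            kind = "read_write" if is_read_write(queue[0]) else "compute"
            if slots[kind] is None:
                slots[kind] = queue.popleft()
        cycles.append((slots["compute"], slots["read_write"]))
    return cycles
```

When a computing instruction is followed by a read/write instruction, both issue in the same cycle; two consecutive computing instructions take two cycles, because the sketch never reorders instructions.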

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention adopts a superscalar pipeline that separates read/write from computation. The processor's computing and read/write resources are placed in two separate pipelines that can work in parallel, so data can be read and written while computation proceeds, which speeds up post-quantum cryptography algorithms. In addition, the instruction fetch part is tailored to the characteristics of post-quantum cryptography algorithms and uses an in-order multi-issue design, which saves substantial resources and power compared with the commonly used out-of-order multi-issue.

(2) The invention adopts component reorganization and operator fusion. The same basic resources are recombined to support post-quantum cryptography algorithms of different types and bit widths, which greatly reduces hardware resource overhead, and operator fusion reduces the number of cycles the algorithms require, which further increases their speed.

(3) The invention adopts a storage structure and data flow design in which the read/write and computation bit widths differ. Combined with the characteristics of the core operations of post-quantum cryptography algorithms, this balances the workload of the computing and read/write pipelines and reduces bandwidth requirements and hardware resource overhead without slowing the algorithms down.

Brief description of the drawings

Fig. 1 is a pipeline architecture diagram of the dedicated post-quantum cryptography processor of the invention;

Fig. 2 is an architecture diagram of the parallel computing unit module of the invention;

Fig. 3 is a schematic diagram of the storage structure and data flow of the invention;

Fig. 4 is a schematic diagram of how the processing modules within the parallel computing unit module are combined for the xorvna instruction given as an example in the embodiment;

Fig. 5 shows the instruction issue order for one butterfly operation when performing an NTT (fast number theoretic transform) in an embodiment of the invention;

Fig. 6 shows the assembly instruction stream for one butterfly operation when performing an NTT (fast number theoretic transform) in an embodiment of the invention.

Detailed description

The invention is further elaborated and explained below with reference to specific embodiments. The embodiments are merely examples of the disclosure and do not limit its scope. The technical features of the various embodiments of the invention may be combined with one another provided they do not conflict.

As shown in Fig. 1, an embodiment of the invention provides a RISC-V-based dedicated processor for post-quantum cryptography algorithms. The processor adopts a superscalar pipeline design that separates read/write from computation, with a read/write pipeline and a computing pipeline that can execute in parallel; each pipeline has four parts executed in sequence: instruction fetch, decode, execute, and write-back.

In a specific embodiment of the invention, the processor specifically includes:

A decode part, which mainly contains four register modules and two decoder modules. It processes the instructions issued by the instruction fetch part, translates them into specific control signals, reads data from the register modules, and then sends the control signals and data to the execute parts of the two pipelines. The register modules comprise a general-purpose register module for storing 32-bit data, a first parallel register module and a second parallel register module for storing 128-bit data, and a parameter register module for storing 32-bit parameters; these modules are shared by the two pipelines. The decoder modules comprise a first decoder module belonging to the computing pipeline and a second decoder module belonging to the read/write pipeline; each translates the instructions received by its own pipeline, reads the data it needs from the relevant register modules, and sends the resulting control signals and data to the execute part of its own pipeline;

Further, as shown in Fig. 1, the RISC-V-based dedicated processor for post-quantum cryptography algorithms also includes six register groups. The first fetch-to-decode register group buffers the signals and data that the instruction fetch part sends to the decode part of the computing pipeline, and the second fetch-to-decode register group buffers those that the instruction fetch part sends to the decode part of the read/write pipeline. The first decode-to-execute register group buffers the signals and data that the decode part of the computing pipeline sends to the execute part of the computing pipeline, and the second decode-to-execute register group buffers those that the decode part of the read/write pipeline sends to the execute part of the read/write pipeline. The first execute-to-write-back register group buffers the signals and data that the execute part of the computing pipeline sends to the write-back part, and the second execute-to-write-back register group buffers those that the execute part of the read/write pipeline sends to the write-back part.

As shown in Fig. 2, the parallel computing unit module in the execute part of the processor in this embodiment is implemented with component reorganization and operator fusion techniques, using the same components to implement the core operations of post-quantum cryptography algorithms across different computation types and bit widths. The module includes:

A preprocessing module, which transforms data before computation for the NTT (fast number theoretic transform) and Keccak algorithms;

A post-processing module, which transforms data after computation for the NTT and Keccak algorithms;

A processing module, which contains the main computing resources. It consists of one 64-bit shifter module for shift operations and eight identical subunit modules for all other computing operations. Each subunit module contains one 32-bit multiplier module for multiplication-related operations, one 32-bit carry-save adder module for modular and logic operations, and two 32-bit adder modules for addition and subtraction. The connections between these modules are selectable: the modules within the processing module can be connected in several ways, each supporting the computing operations of a different operation at a different bit width required by post-quantum cryptography algorithms. Depending on the instruction obtained by the instruction fetch part, the decode part translates it into a specific computing operation and the corresponding control signals, and the processing module in the execute part then selects the connection scheme indicated by those control signals, thereby performing different computing operations at different bit widths to obtain the result.

As shown in Fig. 3, the parallel computing unit module, the first parallel register module, and the second parallel register module in this embodiment form a mixed-bit-width parallel storage system and data flow. The first and second parallel register modules each logically contain sixteen 128-bit registers, and each 128-bit register is physically composed of two 64-bit registers. When the computing pipeline performs a parallel computing operation, the parallel computing unit module can execute up to eight groups of operations simultaneously, each group operating on two 32-bit numbers. The total input therefore consists of two 256-bit operands, input A and input B, each formed by concatenating a 128-bit value from the first parallel register module with a 128-bit value from the second parallel register module. When the read/write pipeline performs a read, the read/write control unit module reads 128 bits of data from external storage into the first or second parallel register module, or reads 32 bits of data from external storage into the parameter register module. When the read/write pipeline performs a write, the read/write control unit module writes the value of one 128-bit register of the first or second parallel register module to external storage. Using different bit widths for storage and computation improves the load balance between the computing and read/write pipelines in operations such as the NTT, which saves bandwidth and improves resource utilization.
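The mixed-bit-width data flow can be modeled in a few lines. The sketch below is only an illustration: it assumes the 128-bit value from the first parallel register module forms the low half of the 256-bit operand and the value from the second forms the high half (the text does not state the order), and it uses lane-wise 32-bit addition as a stand-in for whatever operation the processing module is configured to perform.

```python
LANE_BITS = 32      # width of one computing lane
HALF_BITS = 128     # width of one parallel register
LANE_MASK = (1 << LANE_BITS) - 1


def pack_lanes(lanes):
    """Pack a list of 32-bit values into one integer, lane 0 in the low bits."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & LANE_MASK) << (LANE_BITS * i)
    return value


def concat_operand(from_reg1, from_reg2):
    """Build a 256-bit operand from two 128-bit register values.

    Assumption: module 1 supplies the low half, module 2 the high half.
    """
    return (from_reg2 << HALF_BITS) | from_reg1


def split_lanes(operand, lanes=8):
    """Split an operand into its 32-bit lanes."""
    return [(operand >> (LANE_BITS * i)) & LANE_MASK for i in range(lanes)]


def lane_add(operand_a, operand_b):
    """Eight independent 32-bit additions, wrapping on overflow."""
    pairs = zip(split_lanes(operand_a), split_lanes(operand_b))
    return [(x + y) & LANE_MASK for x, y in pairs]
```

In this model one 128-bit transfer on the read/write pipeline fills half of one 256-bit operand, which illustrates why the two pipelines operate at different widths.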

As shown in Fig. 4, this embodiment introduces one of the dedicated post-quantum cryptography instructions supported by the invention. The instruction is named xorvna; it performs a logic operation on three 64-bit inputs A, B, and C and outputs the 64-bit result A^(~B&C). The operation is one step of Keccak, a core operation commonly used in post-quantum cryptography algorithms. Fig. 4 shows how the processing module of the parallel computing unit module is connected and combined when executing this instruction: the 32-bit carry-save adders of four of the subunit modules shown in Fig. 2 are spliced into two 64-bit carry-save adders and connected as shown in Fig. 4. The inverter module connected to B in Fig. 4 is not marked in Fig. 2; many other auxiliary circuits are likewise left unmarked, all of which are common techniques in the field.

As shown in Fig. 5 and Fig. 6, this embodiment illustrates how the proposed superscalar pipeline architecture with separate read/write and computation accelerates the NTT (fast number theoretic transform), a core operation commonly used in post-quantum cryptography algorithms. Fig. 5 shows the instruction issue order of the butterfly unit computation that must be executed repeatedly during the NTT, where IF, ID, EX, and WB correspond to the instruction fetch, decode, execute, and write-back parts shown in Fig. 1. Each butterfly unit computation multiplies input B by w and then adds the product to, and subtracts it from, input A to obtain the two outputs C and D, where C = A+(B*w) and D = A-(B*w), and w is a known constant called the twiddle factor. Each pair of adjacent, aligned rows in Fig. 5 corresponds to one computing instruction and one read/write instruction executed in parallel, such as the computing instruction mulvh in the first row and the read/write instruction S1b in the second row. In the IF stage shown in Fig. 1, the instruction distribution module sends these two instructions to the computing pipeline and the read/write pipeline respectively, and they run in parallel, so the processor executes a computing instruction and a read/write instruction at the same time. As Fig. 5 shows, each butterfly unit computation comprises 7 computing instructions and 8 read/write instructions. On a traditional single-pipeline processor these must be executed one after another, taking 15 cycles in total, whereas in the invention the two kinds run concurrently and finish in only 8 cycles, nearly doubling the speed. Furthermore, the 7 computing instructions used are mulvh, mulv, mulvm, mulvhf, subv, addmv, and submv, all of which are parallel computing instructions with a bit width of 256 bits. Compared with an ordinary 32-bit processor, which can process only 32 bits of data at a time, the invention processes eight groups of 32-bit data at a time, so each instruction provides an eightfold parallel speedup. Fig. 6 shows the specific assembly instruction stream for one butterfly unit computation in an NTT performed in this embodiment.
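The butterfly and the cycle counts quoted above can be checked with a small model. This is a sketch under stated assumptions: the modulus q = 3329 (the ML-KEM/Kyber modulus) is chosen purely for illustration, since the text names no modulus, and the cycle functions are idealized counts that ignore pipeline fill, stalls, and data hazards.

```python
Q = 3329  # assumption: illustrative modulus, not specified in the text


def butterfly(a, b, w, q=Q):
    """One NTT butterfly: C = A + B*w, D = A - B*w, both reduced mod q."""
    t = (b * w) % q
    return (a + t) % q, (a - t) % q


def butterfly_lanes(a_lanes, b_lanes, w_lanes, q=Q):
    """Apply the butterfly to every lane; the hardware handles 8 lanes at once."""
    results = [butterfly(a, b, w, q) for a, b, w in zip(a_lanes, b_lanes, w_lanes)]
    return [c for c, _ in results], [d for _, d in results]


def cycles_single_pipeline(n_compute, n_read_write):
    """Idealized count when every instruction issues alone."""
    return n_compute + n_read_write


def cycles_dual_pipeline(n_compute, n_read_write):
    """Idealized count when one instruction of each kind issues per cycle."""
    return max(n_compute, n_read_write)
```

With 7 computing and 8 read/write instructions, the two functions return 15 and 8, a ratio of 1.875, which matches the "nearly doubling" stated above.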

The embodiments described above represent only some implementations of the invention. Although described specifically and in detail, they should not be construed as limiting the scope of the patent. The invention can support a variety of post-quantum cryptography algorithms and the core operations they contain, and correspondingly has dozens of dedicated extension instructions not listed here, each with its own way of combining the components of the processing module in the parallel computing unit module. Those of ordinary skill in the art may make modifications and improvements without departing from the concept of the invention, and all of these fall within the scope of protection of the invention.

Claims

1. A RISC-V-based dedicated processor for post-quantum cryptography algorithms, characterized in that it adopts a superscalar pipeline design that separates read/write from computation and has a read/write pipeline and a computing pipeline that can execute in parallel, each pipeline having four parts executed in sequence, namely instruction fetch, decode, execute, and write-back, the processor specifically including:

An instruction fetch part, shared by the read/write and computing pipelines, which fetches instructions from the instruction cache outside the processor and, after pre-analysis, distributes them to the decode parts of the two pipelines; it includes an instruction interface module that serves as the interface for exchanging data with the external instruction cache, a prefetch module that reads ahead and buffers multiple instructions, and an instruction distribution module that uses an in-order multi-issue scheme; this part issues non-read/write instructions to the computing pipeline and read/write instructions to the read/write pipeline;

A decode part, which mainly contains four register modules and two decoder modules, and which processes the instructions issued by the instruction fetch part, translates them into specific control signals, reads data from the register modules, and then sends the control signals and data to the execute parts of the two pipelines; the register modules are shared by the two pipelines; the decoder modules comprise a first decoder module belonging to the computing pipeline and a second decoder module belonging to the read/write pipeline, each of which translates the instructions received by its own pipeline, reads the data it needs from the relevant register modules, and sends the resulting control signals and data to the execute part of its own pipeline;

An execute part, which performs the specific computation or memory access operations according to the control signals and data sent by the decode part, and then sends the results and control signals to the write-back parts of the two pipelines; the execute part of the computing pipeline includes a parallel computing unit module, which has a mixed-bit-width design and supports parallel computation on data of several bit widths up to 256 bits; the execute part of the read/write pipeline contains the first stage of a read/write control unit module, which accesses the data cache outside the processor according to the control signals and data sent by the decode part, while the rest of the module operates in the write-back part of the read/write pipeline;

A write-back part, which writes the results sent by the execute parts of the computing and read/write pipelines back to the corresponding register modules in the decode part according to the control signals, wherein the write-back part of the read/write pipeline includes the second stage of the read/write control unit module, which returns the result of the first-stage access to the external data cache to the processor and stores it in a register module.

2. The dedicated processor for post-quantum cryptography algorithms according to claim 1, characterized in that the register modules of the decode part comprise a general-purpose register module for storing 32-bit data, a first parallel register module and a second parallel register module for storing 128-bit data, and a parameter register module for storing 32-bit parameters, these modules being shared by the two pipelines.

3. The dedicated processor for post-quantum cryptography algorithms according to claim 1, characterized in that the parallel computing unit module is implemented with component reorganization and operator fusion techniques and uses the same components to implement the core operations of post-quantum cryptography algorithms across different computation types and bit widths, the module including:

A preprocessing module, which transforms data before computation for the NTT (fast number theoretic transform) and Keccak algorithms;

A post-processing module, which transforms data after computation for the NTT and Keccak algorithms;

A processing module, which contains the main computing resources and consists of one 64-bit shifter module for shift operations and eight identical subunit modules for all other computing operations; each subunit module contains one 32-bit multiplier module for multiplication-related operations, one 32-bit carry-save adder module for modular and logic operations, and two 32-bit adder modules for addition and subtraction; the modules within the processing module can be connected in several ways, each supporting the computing operations of a different operation at a different bit width required by post-quantum cryptography algorithms; depending on the instruction obtained by the instruction fetch part, the decode part translates it into a specific computing operation and the corresponding control signals, and the processing module in the execute part then selects the connection scheme indicated by those control signals, thereby performing different computing operations at different bit widths to obtain the result.

4. The dedicated processor for post-quantum cryptography algorithms according to claim 1, characterized in that the parallel computing unit module, the first parallel register module, and the second parallel register module form a mixed-bit-width parallel storage system and data flow, wherein the first parallel register module and the second parallel register module each logically contain sixteen 128-bit registers, and each 128-bit register is physically composed of two 64-bit registers; when the computing pipeline performs a parallel computing operation, the parallel computing unit module can execute up to eight groups of operations simultaneously, each group operating on two 32-bit numbers, so that the total input consists of two 256-bit operands, input A and input B, each formed by concatenating a 128-bit value from the first parallel register module with a 128-bit value from the second parallel register module; when the read/write pipeline performs a read, the read/write control unit module reads 128 bits of data from external storage into the first parallel register module or the second parallel register module, or reads 32 bits of data from external storage into the parameter register module; when the read/write pipeline performs a write, the read/write control unit module writes the value of one 128-bit register of the first parallel register module or the second parallel register module to external storage.