CN115774575A

CN115774575A - A RISC-V vector processing unit implementation method and architecture

Info

Publication number: CN115774575A
Application number: CN202211588930.4A
Authority: CN
Inventors: 周志刚; 郭旭; 王世豪; 殷先英; 薛晓娜; 赵靖宇; 程知群
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2022-12-12
Filing date: 2022-12-12
Publication date: 2023-03-10
Anticipated expiration: 2042-12-12
Also published as: CN115774575B

Abstract

The invention discloses a method for implementing a RISC-V vector processing unit, realized as a coprocessor based on the RISC-V vector extension. The vector processing unit has four ports: a main processor interaction port, a vector instruction request port, a vector instruction feedback port, and a vector instruction memory write port. The vector processing unit contains a 3-stage pipeline that supports in-order dispatch, out-of-order execution, and out-of-order write-back. The stages are decode and dispatch, execution, and write-back and commit, while instruction fetch and partial decoding are performed in the scalar processor. Decode and dispatch form the first pipeline stage: when the vector unit receives a vector instruction from the main processor, it decodes the instruction and checks for pipeline conflicts, and the instruction can be dispatched if no conflict exists. Execution forms the second pipeline stage, which executes the vector instruction. Write-back forms the third pipeline stage, which writes the result of the vector instruction back to either a scalar register or a vector register.

Description

A RISC-V vector processing unit implementation method and architecture

Technical field

The present invention relates to the technical field of the RISC-V vector extension, and specifically to a method and architecture for implementing a RISC-V vector processing unit.

Background

Data-level parallelism is widely used in multimedia, data streaming, and, more recently, machine learning and deep learning applications. The RISC-V vector extension has stabilized and is ready for ratification as version 1.0, yet relatively few implementations of it currently exist in China.

Vector extensions are currently implemented in one of two general ways. The first extends an application processor with a coprocessor. Although this complicates the interface between the scalar and vector parts, it is highly extensible, and the execution modules for vector and scalar instructions can be fully separated. Internally it uses the traditional multi-lane vector architecture: the vector registers are partitioned into blocks and connected to multiple identical lanes that execute in parallel, each lane is as wide as the largest data element, and each lane has its own arithmetic logic unit. The area and power cost of this design is very high. Early versions of the RISC-V vector extension suited the multi-lane approach, but some changes introduced in version 1.0 confront multi-lane vector units with the problem of cross-lane data handling. Solving it requires additional modules, which raises the complexity of the whole vector unit, especially as the number of lanes grows.

The second integrates the vector processing unit into the scalar processor, which avoids an interface between the scalar and vector parts. The internal arithmetic units use a SIMD pipeline, and the processor executes in order and is single-cycle. Rather than queuing multiple instructions into execution lanes, it executes one instruction at a time, which removes the need for instruction scheduling logic. An instruction may take multiple cycles depending on the operation and the vector length, and pipelining can be added to raise the maximum frequency. The internal vector register file is designed like that of the scalar processor but provides multiple data ports, and the datapath width matches the vector register length. This simplifies data transfer to the arithmetic logic units, because one register's worth of data is transferred per cycle. Power and area are relatively small, but the performance gain from adding the vector extension is relatively low.

This design combines the advantages of both approaches. First, it extends the scalar processor with vectors in the form of a coprocessor that acts as a functional unit in the scalar processor's execution stage. The running and implementation of vector instructions are defined separately from the scalar execution unit, which raises the vector instruction execution rate and reduces the dependence on the scalar processor. Whatever the scalar processor's architecture, it can be vector-extended as long as it supports the coprocessor port design. Second, the execution module uses a pipelined SIMD design, which both simplifies data transfer and avoids the extra logic that the multi-lane approach needs for cross-lane data, lowering implementation complexity and yielding a good performance improvement.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes a method and architecture for implementing a RISC-V vector processing unit, built as a coprocessor on a scalar RISC-V main processor.

To solve the above technical problems, the technical solution of the present invention is as follows:

A method for implementing a RISC-V vector processing unit, comprising the following steps:

S1. Vector instruction preparation stage, executed in the scalar processor: fetch the vector instruction, partially decode it, determine whether a scalar source operand is needed, read the scalar register, and compare against the scalar processor FIFO to check for scalar data dependencies. If the instruction can be dispatched, save in the scalar processor FIFO the destination register index of any vector instruction that writes back to a scalar register, generate a vector instruction request signal, and wait for the vector processing unit to respond;

S2. Vector instruction decode and dispatch stage: the vector unit decides, according to its own state, whether to respond to the main processor's vector request signal;

S3. Vector instruction execution stage: this stage executes the vector instruction. All elements in a register group are processed over multiple cycles; each cycle processes elements with a total width of VLEN bits, that is, one vector register length. Every execution unit is pipelined, giving a pipelined SIMD design;

S4. Vector instruction write-back and commit stage: a vector instruction writes back either to a scalar register or to a vector register. A vector instruction that writes back to a scalar register generates feedback information to the scalar processor through the vector instruction feedback port. Once the scalar processor responds, the result is written directly back to the scalar processor through the feedback port of the vector processing unit, and the instruction information saved in the scalar processor is removed, which commits the instruction. A vector instruction that writes back to a vector register is written back directly after arbitration, and the information saved for it in the dispatch stage is removed, which commits the instruction.
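The multi-cycle execution in step S3 can be illustrated with a small calculation. The sketch below assumes that each pass handles exactly one VLEN-bit slice and that the element count and element width are known when the instruction starts; the function name `vlen_passes` and its parameters are illustrative and do not come from the patent.

```python
def vlen_passes(vl: int, sew_bits: int, vlen_bits: int) -> int:
    """Return how many VLEN-wide passes are needed to cover vl elements.

    vl        -- number of elements to process (hypothetical parameter name)
    sew_bits  -- width of one element in bits
    vlen_bits -- width of one vector register in bits
    """
    if vl < 0 or sew_bits <= 0 or vlen_bits <= 0:
        raise ValueError("vl must be >= 0 and widths must be positive")
    total_bits = vl * sew_bits
    # Ceiling division: a partially filled last slice still costs one pass.
    return -(-total_bits // vlen_bits)


if __name__ == "__main__":
    # 32 elements of 32 bits with VLEN = 128 fill eight registers.
    print(vlen_passes(32, 32, 128))
```

Under these assumptions a register group that spans eight registers is processed in eight passes, while a short vector that fits in one register needs a single pass.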

Preferably, in step S2, if the vector unit responds, it receives the vector instruction and the scalar source operands, fully decodes the vector instruction (decoding the instruction format and reading the vector registers), checks for pipeline conflicts, and decides whether to dispatch the vector instruction.

Preferably, in step S3, the execution unit comprises a read-write unit, a functional unit, and a data processing unit. The read-write unit reads and writes the source operands, the functional unit performs the operation specified by the instruction, and the data processing unit processes the functional unit's output to obtain the data that is finally written back.

Preferably, the pipeline conflicts include data conflicts and resource conflicts.

Preferably, the judgment of pipeline conflicts includes:

For resource conflicts, write-after-read (WAR) dependencies, and write-after-write (WAW) dependencies: stall the instruction currently being dispatched until the conflict is resolved;

For read-after-write (RAW) dependencies, that is, true data dependencies: in case (1), the instruction blocking register holds no instruction and the vector instruction has just been received. If the required source operand is being written back, the data can be read through the data bypass in the vector register module; otherwise the instruction is saved in the vector instruction register and handled as in case (2). In case (2), the vector instruction is one held in the instruction blocking register. Once the dependency clears, its operands must be read again from the vector register module: if the data is being written back, it can be read through the data bypass in the vector register module; otherwise the value stored in the vector register module is read;

Write-back policy: vector instructions that write back to a scalar register are dispatched in order, executed in order, and written back in order, and only one such instruction can execute at a time. Vector instructions that write back to a vector register are dispatched in order but executed and written back out of order, and several of them can execute concurrently depending on the size of the instructions being executed.
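The conflict categories above can be sketched as a comparison between a newly received instruction and one that is already executing. This is a minimal illustration, not the patented logic: the `VInstr` record, the rule that two instructions needing the same execution unit form a resource conflict, and the action names returned by `action` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class VInstr:
    unit: str              # execution unit the instruction needs (assumed field)
    srcs: FrozenSet[int]   # vector source register indices
    dsts: FrozenSet[int]   # vector destination register indices


def hazards(new: VInstr, in_flight: VInstr) -> Set[str]:
    """Classify conflicts between a new instruction and an executing one."""
    found = set()
    if new.unit == in_flight.unit:
        found.add("resource")      # assumed: one instruction per unit at a time
    if new.srcs & in_flight.dsts:
        found.add("RAW")           # new reads what the older one writes
    if new.dsts & in_flight.srcs:
        found.add("WAR")           # new writes what the older one reads
    if new.dsts & in_flight.dsts:
        found.add("WAW")           # both write the same register
    return found


def action(found: Set[str]) -> str:
    """Map detected conflicts to the handling described in the text."""
    if found & {"resource", "WAR", "WAW"}:
        return "stall"
    if "RAW" in found:
        return "bypass_or_hold"    # bypass if being written back, else hold
    return "dispatch"


if __name__ == "__main__":
    older = VInstr("V_INT_ALU", frozenset({2, 3}), frozenset({1}))
    newer = VInstr("V_INT_MAC", frozenset({1, 4}), frozenset({5}))
    print(hazards(newer, older), action(hazards(newer, older)))
```

Treating RAW separately from the other three cases mirrors the text: only a true data dependency can be satisfied early through the bypass, while the others simply wait.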

The present invention also provides a RISC-V vector processing unit implementation architecture, comprising:

A vector instruction decoding module, which decodes according to the vector instruction encoding rules of the RISC-V vector extension specification version 1.0 and produces instruction type information and operand register index information;

A vector instruction dispatch module, which dispatches vector instructions. It uses an instruction blocking register group to hold vector instructions whose data dependencies have been cleared in the scalar processor but which have a pipeline conflict in the vector unit. It receives instruction information from the vector decoding module and sends it to the vector pipeline conflict judgment module to determine whether a pipeline conflict exists. If not, the vector instruction can be dispatched. If so, the vector instruction is stalled, and its decode information and the operands read from the vector registers are saved in the instruction blocking register group. Once the pipeline conflict clears, the instruction is dispatched and its entry in the instruction blocking register is removed;

A vector pipeline conflict judgment module, which detects pipeline conflicts and commits vector instructions. It uses an instruction execution register group to record the vector instructions currently executing and the order in which they execute. When it receives information about a vector instruction that the dispatch module wants to dispatch, it compares that instruction with the executing vector instructions to determine whether a pipeline conflict arises. If there is a conflict, it sends a cannot-dispatch signal to the dispatch module. If there is none, it sends a can-dispatch signal to the dispatch module and saves the instruction's information and execution order in the instruction execution register group. For commit, when it receives the signal that a vector instruction has finished, it removes that instruction's information from the instruction execution register group, which completes the commit;

A vector register module, which comprises multiple read register port groups (one general register port group plus the additional read register port groups needed to execute several instructions at once) and one write-back port group. Each read register port group contains four read ports, and the write-back port group contains two write-back ports. The module holds 32 general-purpose registers, each VLEN bits long. The read port logic normally reads data according to the input register index, but a data bypass is added at the read port: when the register index of the data being written back equals the index of the source operand to be read, the write-back data is sent directly to the output port instead of the stored register value. In the write port logic, an ordinary vector instruction needs only one write-back port. The additional write-back port serves instructions that write back a double-width result. Because their two register indices differ by 1, they never write the same register, and the write-back logic only needs to check that the two index values differ;

A vector control and status register module, added and designed according to the RISC-V vector extension specification. It executes the vector control and status configuration instructions, which change the current vector length, vector register grouping, and vector element size, and it outputs these values to the execution units;

A vector execution module, divided by vector instruction type into a vector load and store unit, a vector integer arithmetic unit, a vector integer multiply-add unit, a vector floating-point unit, a vector permutation unit, a vector reduction unit, and a vector mask unit;

A vector write-back module, which comprises a write-back scalar register module and a write-back vector register module (VWBCK_VD).

Preferably, in the vector execution module: the vector integer arithmetic unit handles vector integer add, subtract, multiply, and divide instructions; the vector integer multiply-add unit handles vector integer multiply-add instructions; the vector floating-point unit handles vector floating-point instructions; the vector permutation unit executes vector permutation instructions; the vector reduction unit executes vector reduction instructions; and the vector mask unit executes vector mask instructions. The vector execution module uses a pipelined SIMD design comprising a read-write unit, a functional unit, and a data processing unit. The read-write unit is a state machine with two states, 0 and 1. In state 0 it waits for the execution unit's enable signal, which is generated when a vector instruction selects that execution unit, and on receiving the enable signal it moves to state 1. For the write operation, the information of the instruction to be executed is saved in internal registers; this information includes the register index, vector length, vector element width, and operation object. For the read operation, the unit decides from the vector length whether the next operand needs to be read. In state 1 the state machine checks whether instruction execution has finished: it stays in state 1 while execution continues and returns to state 0 when execution ends. The functional unit is the SIMD execution component; it supports vector elements of several different bit widths, up to a maximum width of VLEN bits. The data processing unit processes the functional unit's output to obtain the final write-back data.

Preferably, in the vector write-back module: if the current vector instruction writes back to a scalar register, the result is written back directly through the feedback port of the vector processing unit. If the vector instruction writes back to a vector register and it is the only instruction executing, it writes back directly. When two concurrently executing instructions both need to write back, they are arbitrated according to the execution order recorded in the instruction execution register group: the instruction dispatched first writes back first, and the write-back data of the instruction dispatched later is held in its corresponding FIFO until no other data is being written back, at which point it is read from the FIFO and written to the vector register. After an instruction finishes executing, the instruction execution register group determines whether the instruction is complete from whether its FIFO is empty and from the state of the execution module running it. If it is complete, the instruction's information is removed and the instruction is committed.

The present invention has the following characteristics and beneficial effects:

This design extends a scalar processor with vectors in the form of a coprocessor. First, the vector unit executes only one vector instruction that writes back to a scalar register at a time. This gives strictly in-order write-back for such instructions, so no signal from the main processor is needed to arbitrate the write-back, which reduces the dependence on the scalar processor. Second, vector instructions that write back to a vector register are defined as short-cycle instructions, so writing back to a vector register requires no feedback to the main processor, and execution and commit complete entirely within the vector unit. This raises vector instruction execution speed and improves extensibility: any other RISC-V scalar processor can be vector-extended as long as it supports the coprocessor port design. Finally, the internal execution module of the vector unit is a pipelined SIMD design. Compared with a multi-lane architecture, it aims to minimize pressure on the vector registers, requires no partitioning of the vector registers, simplifies data transfer, and gives vector instructions a uniform execution method. It avoids the extra logic needed for cross-lane data, lowers implementation complexity, and improves performance.

Description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an architecture diagram of the implementation method of an embodiment of the present invention.

FIG. 2 is a schematic diagram of the decoding module in an embodiment of the present invention.

FIG. 3 is a schematic diagram of the dispatch module in an embodiment of the present invention.

FIG. 4 is a schematic diagram of the pipeline conflict judgment module in an embodiment of the present invention.

FIG. 5 is a schematic diagram of the vector register module in an embodiment of the present invention.

FIG. 6 is a schematic diagram of the vector control and status module in an embodiment of the present invention.

FIG. 7 is a schematic diagram of the execution module in an embodiment of the present invention.

FIG. 8 is a schematic diagram of the write-back module in an embodiment of the present invention.

FIG. 9 is a schematic diagram of the vector unit in an embodiment of the present invention.

Detailed description

It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with each other.

The present invention provides a RISC-V vector processing unit implementation architecture, as shown in FIG. 2 to FIG. 9. It includes a vector instruction decoding module, which decodes according to the vector instruction encoding rules of the RISC-V vector extension specification version 1.0 and produces instruction type information and operand register index information.
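As an illustration of the field extraction such a decoder performs, the sketch below splits a 32-bit vector arithmetic instruction into the fields defined by the RISC-V vector extension 1.0 encoding (major opcode OP-V = 0x57). The function name and the returned dictionary layout are assumptions for this example and are not the patent's decoder interface. Configuration instructions (funct3 = OPCFG) use a different layout for the upper bits, which this sketch does not interpret.

```python
OP_V = 0x57  # major opcode of vector arithmetic and configuration instructions

# funct3 selects the operand category in the RVV 1.0 arithmetic format.
FUNCT3_CATEGORY = {
    0b000: "OPIVV", 0b001: "OPFVV", 0b010: "OPMVV", 0b011: "OPIVI",
    0b100: "OPIVX", 0b101: "OPFVF", 0b110: "OPMVX", 0b111: "OPCFG",
}


def decode_op_v(word: int) -> dict:
    """Split a 32-bit OP-V instruction word into its encoding fields."""
    if word & 0x7F != OP_V:
        raise ValueError("not an OP-V instruction")
    return {
        "funct6": (word >> 26) & 0x3F,    # operation selector
        "vm": (word >> 25) & 0x1,         # 1 = unmasked, 0 = masked by v0
        "vs2": (word >> 20) & 0x1F,       # second vector source index
        "vs1": (word >> 15) & 0x1F,       # vs1, rs1, or immediate by category
        "category": FUNCT3_CATEGORY[(word >> 12) & 0x7],
        "vd": (word >> 7) & 0x1F,         # destination register index
    }


if __name__ == "__main__":
    # Field values of an unmasked vadd.vv v1, v2, v3, assembled by hand.
    word = (0 << 26) | (1 << 25) | (2 << 20) | (3 << 15) | (0 << 12) | (1 << 7) | OP_V
    print(hex(word), decode_op_v(word))
```

The register indices and the category together correspond to the "instruction type information" and "operand register index information" that the text says the decoding module produces.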

Vector instruction dispatch module: this module dispatches vector instructions. It uses an instruction blocking register group to hold vector instructions whose data dependencies have been cleared in the scalar processor but which have a pipeline conflict in the vector unit. It receives instruction information from the vector decoding module and sends it to the vector pipeline conflict judgment module to determine whether a pipeline conflict exists. If not, the vector instruction can be dispatched. If so, the vector instruction is stalled, and its decode information and the operands read from the vector registers are saved in the instruction blocking register group. Once the pipeline conflict clears, the instruction is dispatched and its entry in the instruction blocking register is removed.

Vector pipeline conflict judgment module: this module detects pipeline conflicts and commits vector instructions. It uses an instruction execution register group to record the vector instructions currently executing and the order in which they execute. When it receives information about a vector instruction that the dispatch module wants to dispatch, it compares that instruction with the executing vector instructions to determine whether a pipeline conflict arises. If there is a conflict, it sends a cannot-dispatch signal to the dispatch module. If there is none, it sends a can-dispatch signal to the dispatch module and saves the instruction's information and execution order in the instruction execution register group. For commit, when it receives the signal that a vector instruction has finished, it removes that instruction's information from the instruction execution register group, which completes the commit.
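The bookkeeping role of the instruction execution register group can be sketched as a small scoreboard. This is a simplified model under stated assumptions: the class name `ExecRegisterGroup`, the capacity of two entries, the integer tags, and the rule that any register overlap or shared execution unit blocks dispatch (with no bypass) are all choices made for the example, not details taken from the patent.

```python
from collections import OrderedDict


class ExecRegisterGroup:
    """Tracks executing vector instructions in dispatch order (illustrative)."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity          # assumed number of in-flight slots
        self.entries = OrderedDict()      # tag -> (unit, srcs, dsts)
        self._next_tag = 0

    def can_dispatch(self, unit, srcs, dsts) -> bool:
        """Return True when the candidate conflicts with no executing entry."""
        if len(self.entries) >= self.capacity:
            return False
        srcs, dsts = set(srcs), set(dsts)
        for busy_unit, busy_srcs, busy_dsts in self.entries.values():
            if busy_unit == unit:                 # resource conflict
                return False
            if srcs & busy_dsts:                  # RAW (no bypass modeled)
                return False
            if dsts & (busy_srcs | busy_dsts):    # WAR or WAW
                return False
        return True

    def dispatch(self, unit, srcs, dsts):
        """Record the instruction and return its tag, or None if it must stall."""
        if not self.can_dispatch(unit, srcs, dsts):
            return None
        tag = self._next_tag
        self._next_tag += 1
        self.entries[tag] = (unit, frozenset(srcs), frozenset(dsts))
        return tag

    def commit(self, tag) -> None:
        """Remove a finished instruction, which completes its commit."""
        del self.entries[tag]

    def order(self):
        """Tags of executing instructions, oldest first."""
        return list(self.entries)


if __name__ == "__main__":
    sb = ExecRegisterGroup()
    first = sb.dispatch("V_INT_ALU", {2, 3}, {1})
    print(sb.can_dispatch("V_INT_MAC", {1}, {4}))   # blocked by RAW on v1
    sb.commit(first)
    print(sb.can_dispatch("V_INT_MAC", {1}, {4}))   # allowed after commit
```

Keeping the entries in dispatch order is what later allows write-back arbitration to favor the older instruction.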

Vector register module: it has multiple read register port groups (each group contains four read ports), consisting of one general register port group and the additional read register port groups needed to execute several instructions at once, and one write-back port group (containing two write-back ports). It holds 32 general-purpose registers, each VLEN bits long. The read port logic normally reads data according to the input register index, but a data bypass is added at the read port: when the register index of the data being written back equals the index of the source operand to be read, the write-back data is sent directly to the output port instead of the stored register value. In the write port logic, an ordinary vector instruction needs only one write-back port. The additional write-back port serves instructions that write back a double-width result. Because their two register indices differ by 1, they never write the same register, and the write-back logic only needs to check that the two index values differ.
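The read bypass and the two-port write rule can be modeled in a few lines. The sketch below is a behavioral illustration only: the class name `VectorRegFile`, the use of Python integers as VLEN-bit values, and the default VLEN of 128 are assumptions for the example.

```python
class VectorRegFile:
    """Behavioral model of 32 VLEN-bit registers with a read bypass."""

    def __init__(self, vlen_bits: int = 128):
        self.mask = (1 << vlen_bits) - 1
        self.regs = [0] * 32

    def read(self, index: int, writeback=None) -> int:
        """Read a register; writeback is an optional (index, data) pair that
        is on the write-back port in the same cycle."""
        if writeback is not None and writeback[0] == index:
            return writeback[1] & self.mask   # bypass: forward in-flight data
        return self.regs[index]

    def write(self, port0, port1=None) -> None:
        """Write one or two (index, data) pairs in the same cycle.

        The second port models the double-width case, whose two destination
        indices differ by 1, so equal indices are rejected."""
        if port1 is not None and port0[0] == port1[0]:
            raise ValueError("both write ports target the same register")
        for port in (port0, port1):
            if port is not None:
                index, data = port
                self.regs[index] = data & self.mask


if __name__ == "__main__":
    vrf = VectorRegFile()
    vrf.write((4, 0x11), (5, 0x22))          # double-width result in v4, v5
    print(vrf.read(4), vrf.read(4, writeback=(4, 0x99)))
```

The bypass path lets a dependent instruction observe a value in the cycle it is written back, without waiting for the stored copy to update.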

Vector control and status register module: it is added and designed according to the RISC-V vector extension specification. It executes the vector control and status configuration instructions, which change the current vector length, vector register grouping, and vector element size, and it outputs these values to the execution units.
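The relationship between the three configured quantities can be shown with the VLMAX formula from the RISC-V vector extension, VLMAX = LMUL × VLEN / SEW. The sketch below assumes the simple policy vl = min(AVL, VLMAX), which the specification permits but does not mandate in every case; the function names are illustrative.

```python
from fractions import Fraction


def vlmax(vlen_bits: int, sew_bits: int, lmul) -> int:
    """Maximum element count for a register group: LMUL * VLEN / SEW."""
    return int(Fraction(lmul) * vlen_bits // sew_bits)


def set_vl(avl: int, vlen_bits: int, sew_bits: int, lmul) -> int:
    """Pick the vector length for a requested element count (AVL).

    Assumed policy: clamp the request to VLMAX."""
    return min(avl, vlmax(vlen_bits, sew_bits, lmul))


if __name__ == "__main__":
    # VLEN = 128, 32-bit elements, groups of four registers.
    print(vlmax(128, 32, 4), set_vl(10, 128, 32, 4), set_vl(100, 128, 32, 4))
```

`Fraction` is used so that fractional register groupings such as LMUL = 1/2 are handled exactly.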

Vector execution module: it is divided by vector instruction type into a vector load and store unit (VLSU), a vector integer arithmetic unit (V_INT_ALU), a vector integer multiply-add unit (V_INT_MAC), a vector floating-point unit (V_FP), a vector permutation unit (V_PERMUTE), a vector reduction unit (V_REDUCT), and a vector mask unit (VMASK). The vector integer arithmetic unit handles vector integer add, subtract, multiply, and divide instructions, and the vector integer multiply-add unit handles vector integer multiply-add instructions. The vector floating-point unit handles vector floating-point instructions. The vector permutation unit executes vector permutation instructions. The vector reduction unit executes vector reduction instructions. The vector mask unit executes vector mask instructions. The vector execution module uses a pipelined SIMD design comprising a read-write unit, a functional unit, and a data processing unit. The read-write unit is a state machine with two states, 0 and 1. In state 0 it waits for the execution unit's enable signal, which is generated when a vector instruction selects that execution unit, and on receiving the enable signal it moves to state 1. For the write operation, the information of the instruction to be executed (including the register index, vector length, vector element width, and operation object) is saved in internal registers. For the read operation, the unit decides from the vector length whether the next operand needs to be read. In state 1 the state machine checks whether instruction execution has finished: it stays in state 1 while execution continues and returns to state 0 when execution ends. The functional unit is the SIMD execution component; it supports vector elements of several different bit widths, up to a maximum width of VLEN bits. The data processing unit processes the functional unit's output to obtain the final write-back data.
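The two-state read-write unit can be sketched as a cycle-stepped model. The class below is an illustration under assumptions: one VLEN-wide operand slice is consumed per cycle in state 1, the enable signal carries the vector length and element width, and an enable with zero elements leaves the unit idle. The names `ReadWriteUnit`, `step`, and `remaining` are not from the patent.

```python
class ReadWriteUnit:
    """Two-state controller: 0 waits for enable, 1 streams operand slices."""

    IDLE, BUSY = 0, 1

    def __init__(self, vlen_bits: int = 128):
        self.vlen_bits = vlen_bits
        self.state = self.IDLE
        self.remaining = 0      # VLEN-wide slices still to be read

    def step(self, enable: bool = False, vl: int = 0, sew_bits: int = 0) -> bool:
        """Advance one cycle; return True if an operand slice was read."""
        if self.state == self.IDLE:
            if enable:
                # Latch the instruction information and size the job.
                total_bits = vl * sew_bits
                self.remaining = -(-total_bits // self.vlen_bits)
                if self.remaining > 0:
                    self.state = self.BUSY
            return False
        # State 1: read one slice, then check whether execution has ended.
        self.remaining -= 1
        if self.remaining == 0:
            self.state = self.IDLE
        return True


if __name__ == "__main__":
    unit = ReadWriteUnit(vlen_bits=128)
    unit.step(enable=True, vl=8, sew_bits=32)   # 256 bits, two slices
    while unit.state == ReadWriteUnit.BUSY:
        unit.step()
    print(unit.state)
```

Returning to state 0 only after the last slice matches the description that the unit stays in state 1 while execution continues.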

The vector write-back module comprises a scalar-register write-back module (VWBCK_RD) and a vector-register write-back module (VWBCK_VD). If the current vector instruction must write back to a scalar register, the result is written back directly through the feedback port of the vector processing unit. If the instruction must write back to a vector register, it can write back directly when it is the only instruction executing. When two concurrently executing instructions both need to write back, they are arbitrated according to the execution order recorded in the instruction execution register group: the instruction dispatched first writes back first, and the write-back data of the later-dispatched instruction is held in its corresponding FIFO until no other data is being written back, at which point it is read from the FIFO and written to the vector register. After an instruction finishes executing, the instruction execution register determines whether it is complete from whether the corresponding FIFO is empty and from the state of the execution module running it. If it is complete, the instruction's information is removed, which completes the commit.
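The arbitration rule above (oldest dispatched instruction writes first, the younger one parks its data in a FIFO) can be sketched as follows. This is a simplified model under stated assumptions: a single vector write port, string instruction tags, and the names `VectorWritebackArbiter`, `cycle`, and `can_commit` are all hypothetical and not taken from the patent.

```python
from collections import deque


class VectorWritebackArbiter:
    """One write port; results that lose arbitration wait in per-instruction FIFOs."""

    def __init__(self):
        self.fifos = {}  # instruction tag -> deque of (vd, data) awaiting write-back
        self.vrf = {}    # vector register index -> last value written

    def cycle(self, dispatch_order, results):
        """dispatch_order: tags, oldest first. results: {tag: (vd, data)} produced now.
        Returns the tag that used the write port this cycle, or None."""
        winner = None
        for tag in dispatch_order:
            if tag not in results:
                continue
            if winner is None and not self.fifos.get(tag):
                # Oldest instruction with fresh data and nothing queued writes directly.
                winner = tag
                vd, data = results[tag]
                self.vrf[vd] = data
            else:
                # Port already taken (or older data queued): buffer instead of stalling.
                self.fifos.setdefault(tag, deque()).append(results[tag])
        if winner is None:
            # Port is free: drain one entry from the oldest non-empty FIFO.
            for tag in dispatch_order:
                queue = self.fifos.get(tag)
                if queue:
                    vd, data = queue.popleft()
                    self.vrf[vd] = data
                    winner = tag
                    break
        return winner

    def can_commit(self, tag, unit_idle):
        """Commit condition from the text: FIFO empty and execution unit finished."""
        return unit_idle and not self.fifos.get(tag)


arb = VectorWritebackArbiter()
first = arb.cycle(["A", "B"], {"A": (1, "a0"), "B": (2, "b0")})
blocked = arb.can_commit("B", unit_idle=True)
second = arb.cycle(["A", "B"], {})
print(first, blocked, second, arb.can_commit("B", unit_idle=True))
```

In the example, instruction A was dispatched first and takes the port; B's result waits one cycle in its FIFO, and B becomes eligible to commit only after that FIFO has drained.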

Pipeline conflict judgment handles hazards as follows. For resource conflicts, write-after-read (WAR) dependencies, and write-after-write (WAW) dependencies, the instruction currently being dispatched is blocked until the conflict is resolved. For read-after-write (RAW) dependencies, which are true data dependencies, there are two cases. (1) The instruction blocking register holds no instruction and the vector instruction has just been received: if the required source operand is being written back at that moment, it can be read from the vector register module through the data bypass; otherwise the instruction is saved in the vector instruction register and handled as in case (2). (2) The vector instruction is one held in the instruction blocking register: once the dependency is resolved, its operands must be read again from the vector register module. If the data is being written back at that moment, it can be read through the data bypass in the vector register module; otherwise the value stored in the vector register module is read.
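As a rough illustration of the four conflict types named above, the following sketch compares a newly received instruction against the instructions in flight. It assumes one register per operand (register grouping is ignored) and treats two instructions needing the same execution unit as the resource conflict; `VInstr` and `hazards` are hypothetical names, not part of the patent.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class VInstr:
    unit: str             # execution unit name, e.g. "V_INT_ALU"
    srcs: FrozenSet[int]  # vector source register indices
    dst: Optional[int]    # vector destination register index, if any


def hazards(new: VInstr, in_flight: List[VInstr]) -> set:
    """Return the set of conflict kinds between `new` and the in-flight instructions."""
    found = set()
    for old in in_flight:
        if old.unit == new.unit:
            found.add("RESOURCE")  # both need the same execution unit
        if old.dst is not None and old.dst in new.srcs:
            found.add("RAW")       # new reads what old has yet to write
        if new.dst is not None and new.dst in old.srcs:
            found.add("WAR")       # new would overwrite a source old still reads
        if new.dst is not None and new.dst == old.dst:
            found.add("WAW")       # both write the same destination
    return found


# In flight: an integer add reading v1 and v2, writing v3.
add = VInstr("V_INT_ALU", frozenset({1, 2}), 3)
print(hazards(VInstr("V_FP", frozenset({3, 4}), 5), [add]))  # prints {'RAW'}
```

Any non-empty result blocks dispatch; a result containing `"RAW"` additionally means the operands must be obtained again later, either from the bypass or from the register file.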

Write-back mode: vector instructions that write back to a scalar register are dispatched in order, executed in order, and written back in order, and only one such instruction can execute at a time. This keeps changes to the scalar processor to a minimum and makes the vector processing unit as general as possible. Vector instructions that write back to a vector register are dispatched in order, executed out of order, and written back out of order; multiple instructions can execute at once, with the number depending on the instruction execution capacity.

The present invention also provides a method for implementing a RISC-V vector processing unit, based on the RISC-V vector extension and implemented as a coprocessor. The vector processing unit has four ports: a main processor interaction port, a vector instruction request port, a vector instruction feedback port, and a vector instruction memory write port. It contains a 3-stage pipeline that supports in-order dispatch, out-of-order execution, and out-of-order write-back, covering decode and dispatch, execution, and write-back and commit; instruction fetch and partial decode are performed in the scalar processor. Decode and dispatch form the first pipeline stage: when the vector unit receives a vector instruction from the main processor, it decodes the instruction and checks for pipeline conflicts, and the instruction can be dispatched if there are none. Execution forms the second pipeline stage, which executes vector instructions in the vector load and store unit (VLSU), vector integer arithmetic unit (V_INT_ALU), vector integer multiply-add unit (V_INT_MAC), vector floating-point unit (V_FP), vector permutation unit (V_PERMUTE), vector reduction unit (V_REDUCT), and vector mask unit (VMASK). Write-back forms the third pipeline stage, which writes results back to scalar registers or to vector registers.

Specifically, as shown in Figure 1, the decode unit of the main processor decodes the opcode of the fetched instruction to determine whether it is a vector instruction, and whether it needs scalar source operands or writes its result back to a scalar register. If scalar source operands are needed, they are read from the general-purpose register file. If a write-back is needed, the main processor stores the destination register index in its pipeline control module until the write-back completes. The main processor maintains data dependencies: if the vector instruction has a data dependency on an instruction in flight (a scalar instruction or a vector instruction that writes back to a scalar register), the main processor waits until the dependency is resolved before dispatching the vector instruction to the vector processing unit. The dispatch consists of the vector instruction req_inst, the values of the two scalar source operands (req_rs1, req_rs2), and the vector instruction request signal req_valid.

On the request channel, the vector processing unit decides whether to accept a vector instruction based on the signals from the dispatch module (disp_ready) and the pipeline conflict judgment module (pipc_ready). If the instruction can be accepted, req_ready is set to 1 in response to the vector instruction request signal from the main processor. The decode module then fully decodes the vector instruction, determining from the instruction format the execution unit required, the two source operands, and the destination operand. A scalar source operand is taken from the value sent by the main processor; a vector source operand must be read from the vector register file. The index of the vector register to be read (dec_idx) is sent to general-purpose port r0 of the vector register file, and the value read out (vrf_rd_r0) is sent, together with the information needed to execute the instruction (dec_info), to the dispatch module or the vector state control module.

Following the RISC-V vector extension instruction encoding, instructions are divided into three classes: vector load and store instructions, vector state-setting instructions (CSR-setting instructions), and vector arithmetic instructions. Vector state-setting instructions are sent directly to the CSR module for execution after decoding. They need no dispatch or pipeline conflict check in the vector unit because they do not interact with the vector register file (VRF): their two source operands and their destination operand are all scalar register indices, and the main processor has already checked them for pipeline conflicts. In addition, write-back to the main processor (csr_wbck_dat) can be requested immediately for these instructions. Before a vector state-setting instruction updates the control and status registers, every preceding instruction must be executing or already complete. The vector unit enforces this by checking for blocked instructions: while any instruction is blocked in the vector unit, the vector unit accepts no instruction from the main processor, CSR instructions included. A CSR-setting instruction is accepted only once the instructions previously dispatched by the main processor are executing or complete, so updating the CSR value does not affect them. All other instructions, once decoded, pass through the dispatch module before being sent to the execution module.
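The three-way classification and the acceptance condition described above can be illustrated with a small decoder sketch. The major opcodes and the `funct3`/width encodings used here come from the RISC-V vector extension version 1.0 encoding; the function names `classify` and `req_ready` and the returned class labels are assumptions made for this example, and treating req_ready as the AND of the two ready signals is one plausible reading of the text.

```python
OP_V, LOAD_FP, STORE_FP = 0x57, 0x07, 0x27    # RISC-V major opcodes (bits 6:0)
VECTOR_WIDTHS = {0b000, 0b101, 0b110, 0b111}  # width field values used by vector loads/stores
FUNCT3_CFG = 0b111                            # OP-V funct3 of the vset* configuration instructions


def classify(inst: int) -> str:
    """Sort a 32-bit instruction word into the three classes named in the text."""
    opcode = inst & 0x7F
    funct3 = (inst >> 12) & 0x7
    if opcode == OP_V:
        return "csr_set" if funct3 == FUNCT3_CFG else "arith"
    if opcode in (LOAD_FP, STORE_FP) and funct3 in VECTOR_WIDTHS:
        return "load_store"
    return "not_vector"  # scalar FP loads/stores share the opcode but use other widths


def req_ready(disp_ready: bool, pipc_ready: bool) -> bool:
    """Assumed acceptance rule: both the dispatch and conflict modules must be ready."""
    return disp_ready and pipc_ready


def word(opcode: int, funct3: int) -> int:
    """Build a test word with only the opcode and funct3 fields filled in."""
    return (funct3 << 12) | opcode


print(classify(word(OP_V, 0b111)), classify(word(OP_V, 0b000)),
      classify(word(LOAD_FP, 0b110)), classify(word(LOAD_FP, 0b010)))
# prints: csr_set arith load_store not_vector
```

In this split, only the `arith` and `load_store` classes would go through dispatch and the pipeline conflict check; the `csr_set` class goes straight to the CSR module.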

In the dispatch module, the vector processing unit first sends the vector instruction information (disp_pipc_idx, disp_pipc_info) to the pipeline conflict module to check for pipeline conflicts. If there is a conflict, the pipeline conflict module sets pipc_o_dep to 1, telling the dispatch module to block the instruction. If the conflict is a true data dependency, re_rd_rf is also set to 1 so that, once the conflict is resolved and the instruction is dispatched, the vector register file is read again to obtain the source operands.
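The re-read step can be pictured with a register file whose read port forwards a value that is being written back in the same cycle. This is a minimal sketch under assumptions: VLEN is taken as 128 bits, registers are modeled as Python integers, and `VectorRegisterFile` and `operands_at_dispatch` are hypothetical names. Only `re_rd_rf` is a signal name from the text.

```python
class VectorRegisterFile:
    def __init__(self, vlen_bits: int = 128, nregs: int = 32):
        self.mask = (1 << vlen_bits) - 1
        self.regs = [0] * nregs

    def write(self, idx: int, data: int) -> None:
        self.regs[idx] = data & self.mask

    def read(self, idx: int, wb=None) -> int:
        """wb is (index, data) for a write-back happening this cycle, or None."""
        if wb is not None and wb[0] == idx:
            return wb[1] & self.mask  # bypass: forward the value being written
        return self.regs[idx]         # otherwise use the stored register value


def operands_at_dispatch(vrf, saved, src_indices, re_rd_rf, wb=None):
    """Use the operands saved at decode unless re_rd_rf asks for a fresh read."""
    if not re_rd_rf:
        return list(saved)
    return [vrf.read(i, wb) for i in src_indices]


vrf = VectorRegisterFile()
vrf.write(3, 0xAA)
stale = [0x11]  # value captured before the producing instruction wrote v3
print(operands_at_dispatch(vrf, stale, [3], re_rd_rf=True, wb=(3, 0xBB)))  # prints [187]
```

When `re_rd_rf` is set, the stale operand captured at decode is discarded; the fresh value comes from the bypass if the producer is writing back in that cycle, and from the register file otherwise.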

In the pipeline conflict judgment module, the vector processing unit compares the vector instruction information sent by the dispatch module with the internally stored information on the vector instructions being executed, and determines whether a pipeline conflict exists. The module generates the control signals sel_rd_itag and sel_wbck_itag. sel_rd_itag selects which execution module uses the extra read ports of the vector register file and routes the dispatched instruction information to the module that must execute the instruction. sel_wbck_itag controls write-back.

The execution module receives the vector instruction information from the dispatch module (disp_exu_idx, disp_exu_info, disp_exu_dat), reads the current vector state data csr_exu_dat from the vector control state module, routes the instruction to its execution unit according to sel_rd_itag, and finally outputs the write-back data.

The vector load and store unit handles the vector unit's accesses to the memory inside the scalar processor. To execute a vector load instruction, it sets memory_read_ready and memory_read to 1 and sends the required address memory_addr and the size of the element to read, memory_size. On receiving the vector read access, the scalar processor generates memory_read_valid and sends the requested data memory_read_dat to the vector load and store unit. To execute a vector store instruction, it sets memory_write_valid to 1 and memory_read to 0 and sends the required address memory_addr and the size of the element to write, memory_size. On receiving the access, the scalar processor generates memory_read_ready, and the data to be written, memory_write_dat, is sent to the main processor's memory. Load and store instructions cannot execute in the vector load and store unit at the same time. Unlike the other execution units, this unit reads or writes one vector element per access rather than a full vector register width (VLEN). Write-back to the vector register module (for load instructions) is nevertheless VLEN wide: each loaded element is placed into a register of length VLEN, which is read out and written back once it is full or the load has finished.

The other execution units are the vector integer arithmetic unit, the vector integer multiply-add unit, the vector floating-point unit, the vector permutation unit, the vector reduction unit, and the vector mask unit. The vector integer arithmetic unit handles vector integer add, subtract, multiply, and divide instructions, and the vector integer multiply-add unit handles vector integer multiply-add instructions. The vector floating-point unit handles vector floating-point instructions. The vector permutation unit executes vector permutation instructions. The vector reduction unit executes vector reduction instructions. The vector mask unit executes vector mask instructions.

The vector execution module uses a pipelined SIMD design and comprises a read/write unit, a functional unit, and a data processing unit. The read/write unit is implemented as a state machine with two states, 0 and 1. In state 0 it waits for the enable signal of its execution unit; the enable signal is generated when a vector instruction selects that execution unit, and the state machine then jumps to state 1. Write operation: in state 0, the instruction information (disp_exu_info), index information (disp_exu_idx), and csr_data sent by the dispatch module for the first time are saved in internal registers; in state 1, the next operand that has been read is saved in a register. Read operation: the CSR values and the instruction determine whether the next operand must be read; if so, it is read through extra register port r1 or r2. In state 1 the state machine checks whether the instruction has finished executing: if execution continues it stays in state 1, and if execution has ended it returns to state 0. The functional unit is equivalent to an ALU and executes the instruction's operation. The data processing unit processes the output of the functional unit to produce the final write-back data.
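The element-by-element load with VLEN-wide write-back described for the vector load and store unit can be sketched as a packing loop. The sketch assumes that element 0 occupies the lowest bits of the register and that unused positions in a partly filled buffer are left as zero; the patent does not specify the latter. The function name `pack_loaded_elements` is hypothetical.

```python
def pack_loaded_elements(elements, sew_bits: int, vlen_bits: int):
    """Yield one VLEN-wide integer per write-back; element 0 sits in the low bits."""
    per_reg = vlen_bits // sew_bits   # elements that fit in one vector register
    mask = (1 << sew_bits) - 1
    buf, count = 0, 0
    for value in elements:            # one memory access per element
        buf |= (value & mask) << (count * sew_bits)
        count += 1
        if count == per_reg:          # buffer full: write back a whole register
            yield buf
            buf, count = 0, 0
    if count:                         # load ended with a partly filled buffer
        yield buf


# Five 8-bit elements with an illustrative 32-bit VLEN: one full register, one partial.
writes = list(pack_loaded_elements([1, 2, 3, 4, 5], sew_bits=8, vlen_bits=32))
print([hex(w) for w in writes])  # prints ['0x4030201', '0x5']
```

Each yielded value corresponds to one write-back, so a load of N elements needs N memory accesses but only about N divided by the elements-per-register count in vector register writes.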

The vector write-back and commit module sorts the output data of the functional units of the execution module (vlsu_o, v_int_alu_o, v_int_mac_o, v_fp_o, v_permute_o, v_reduct_o, vmask_o) by write-back type into scalar-register write-back (wbck_rd) and vector-register write-back (wbck_vd). Because instructions that write back to a scalar register write back in order, their results can be written directly to the scalar processor through the feedback channel of the vector processing unit, and the scalar processor commits them. Vector instructions that write back to a vector register can generally write back directly. When two concurrently executing instructions need to write back at the same time, however, they are arbitrated by dispatch order using sel_wbck_itag: the instruction dispatched first writes back first. To avoid stalling the pipeline, the write-back data of the later-dispatched instruction is buffered in a FIFO; once the earlier instruction has written back, or no data is being written back, the buffered data is read from the FIFO and written back. When the vector instruction has written back, wbck_sign is generated, and the pipeline conflict module then uses the status information exu_status of each module to determine whether the instruction is complete. If it is, the instruction's entry in the instruction holding register is cleared, which completes the commit of the instruction.

The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. For those skilled in the art, various changes, modifications, substitutions, and variations made to these embodiments, including their components, without departing from the principle and spirit of the present invention still fall within the scope of protection of the present invention.

Claims

1. A method for implementing a RISC-V vector processing unit, characterized by comprising the following steps:

S1. Vector instruction preparation stage, performed in the scalar processor: fetch the vector instruction, perform a simple decode of it, determine whether scalar source operands are needed and read the scalar registers, and check for scalar data dependencies against the scalar processor FIFO; if the instruction can be dispatched, save in the scalar processor FIFO the destination register index of a vector instruction that must write back to a scalar register, generate a vector instruction request signal, and wait for the vector processing unit to respond;

S2. Vector instruction decode and dispatch stage: decide, according to the state of the vector unit itself, whether to respond to the vector request signal of the main processor;

S3. Vector instruction execution stage: this stage executes vector instructions; all elements of a register group are processed indirectly over multiple cycles, with the elements processed each time totalling VLEN bits, that is, the length of one vector register; each execution unit is pipelined, forming a SIMD pipeline;

S4. Vector instruction write-back and commit stage: a vector instruction may write back to a scalar register or to a vector register. A vector instruction that writes back to a scalar register generates feedback information to the scalar processor through the vector instruction feedback port and waits for the scalar processor to respond; the result is written back directly to the scalar processor through the feedback port of the vector processing unit, the instruction information saved in the scalar processor is removed, and the commit of the instruction is complete. A vector instruction that writes back to a vector register is written back to the vector register through arbitration, and the instruction information saved during the dispatch stage is removed, which completes the commit of the vector instruction.

2. The method for implementing a RISC-V vector processing unit according to claim 1, characterized in that, in step S2, if the vector unit responds, it receives the vector instruction and the scalar source operands, then fully decodes the vector instruction, decoding the instruction format and reading the vector registers, checks for pipeline conflicts, and decides whether to dispatch the vector instruction.

3. The method for implementing a RISC-V vector processing unit according to claim 1, characterized in that, in step S3, the execution unit comprises a read/write unit, a functional unit, and a data processing unit; the read/write unit reads and writes the source operands, the functional unit performs the operation specified by the instruction, and the data processing unit processes the data from the functional unit to obtain the final write-back data.

4. The method for implementing a RISC-V vector processing unit according to claim 2, wherein the pipeline conflicts include data conflicts and resource conflicts.

5. The method for implementing a RISC-V vector processing unit according to claim 2, characterized in that the pipeline conflict judgment comprises:

For resource conflicts, write-after-read dependencies, and write-after-write dependencies: block the instruction currently being dispatched until the conflict is resolved;

For read-after-write dependencies, which are true data dependencies: in case (1), the instruction blocking register holds no instruction and the vector instruction has just been received; if the required source operand is being written back, it can be read from the vector register module through the data bypass, otherwise the instruction is saved in the vector instruction register and handled as in case (2) below; in case (2), the vector instruction is one held in the instruction blocking register; once the dependency is resolved, the operands must be read again from the vector register module; if the data is being written back, it can be read in the vector register module through the data bypass, otherwise the value stored in the vector register module is read;

Write-back mode: vector instructions that write back to a scalar register are dispatched in order, executed in order, and written back in order, and only one such instruction can execute at a time; vector instructions that write back to a vector register are dispatched in order, with in-order execution and out-of-order write-back, and multiple instructions are supported according to the size of instruction execution.

6. An architecture for implementing the method for implementing a RISC-V vector processing unit according to any one of claims 1-5, characterized by comprising:

The vector instruction decoding module decodes according to the vector instruction encoding rules of the RISC-V vector extension standard document 1.0, and generates different instruction type information and operand register index information;

The vector instruction dispatch module dispatches vector instructions. It uses an instruction blocking register group to hold the information of vector instructions whose data dependencies have been resolved in the scalar processor but which have pipeline conflicts in the vector unit. It receives the instruction information from the vector decode module and sends it to the vector pipeline conflict judgment module to determine whether there is a pipeline conflict. If there is none, the vector instruction can be dispatched. If there is one, the vector instruction is blocked, and its decode information and the operands read from the vector registers are saved in the instruction blocking register group; once the pipeline conflict is resolved, the instruction is dispatched and its entry in the instruction blocking register is removed;

The vector pipeline conflict judgment module is used for pipeline conflict judgment and vector instruction commit. It uses an instruction execution register group to save the information of the vector instructions being executed and their execution order. On receiving the information of a vector instruction that the dispatch module needs to dispatch, it determines whether that instruction has a pipeline conflict with the vector instructions being executed. If there is a conflict, it signals the dispatch module that the instruction cannot be dispatched; if there is none, it signals that the instruction can be dispatched and saves the instruction information and execution order in the instruction execution register group. For vector instruction commit, on receiving the signal that a vector instruction has finished, it removes that instruction's information from the instruction execution register group, which completes the commit of the instruction;

The vector register module includes multiple read-port groups, namely a general-purpose read-port group and the additional read-port groups required to execute multiple instructions simultaneously, and a write-back port group. Each read-port group includes four read ports, and the write-back port group includes two write-back ports. The module contains 32 general-purpose registers of length VLEN. The read-port logic normally reads data according to the input register index, but a data bypass is added to the read ports: when the index of the register being written back is the same as the index of the source operand to be read, the write-back data is sent directly to the output port instead of the stored register value. In the write-port logic, a vector instruction generally needs only one write-back port; the additional write-back port serves instructions that write back double-width results. Because the register indices of such an instruction differ by 1, the same register is never written twice, and the write-back logic only needs to check that the register index values differ;

The vector control and status register module is designed according to the RISC-V vector extension standard document; it executes vector control-state configuration instructions, which change the current vector length, vector register grouping, and vector element size, and outputs these values to the execution units;

The vector execution module is divided, according to vector instruction type, into a vector load and store unit, a vector integer arithmetic unit, a vector integer multiply-add unit, a vector floating-point unit, a vector permutation unit, a vector reduction unit, and a vector mask unit;

The vector write-back module includes a write-back scalar register module and a write-back vector register module (VWBCK_VD).

7. The architecture for implementing a RISC-V vector processing unit according to claim 6, characterized in that,

In the vector execution module: the vector integer arithmetic unit handles vector integer add, subtract, multiply, and divide instructions; the vector integer multiply-add unit handles vector integer multiply-add instructions; the vector floating-point unit handles vector floating-point instructions; the vector permutation unit executes vector permutation instructions; the vector reduction unit executes vector reduction instructions; the vector mask unit executes vector mask instructions. The vector execution module uses a pipelined SIMD design and comprises a read/write unit, a functional unit, and a data processing unit. The read/write unit is implemented as a state machine with two states, 0 and 1: in state 0 it waits for the enable signal of the execution unit; the enable signal is generated when a vector instruction selects the execution unit, and the state machine then jumps to state 1. For the write operation, the instruction information to be operated on is saved in internal registers, the instruction information including register indices, vector length, vector element width, and operands. For the read operation, the vector length determines whether the next operand must be read. In state 1 the state machine checks whether instruction execution has ended: if execution continues it stays in state 1, and if execution has ended it jumps to state 0. The functional unit is equivalent to a SIMD execution component and can process multiple vector elements of different bit widths, with a maximum execution width of VLEN bits. The data processing unit processes the output data of the functional unit to obtain the final write-back data.

8. The architecture for implementing a RISC-V vector processing unit according to claim 6, characterized in that,

In the vector write-back module: if the current vector instruction needs to write back to a scalar register, the result is written back directly through the feedback port of the vector processing unit; if the vector instruction needs to write back to a vector register, it can write back directly when it is the only instruction executing. When two simultaneously executing instructions need to write back, they are arbitrated according to the execution order saved in the instruction execution register group: the instruction dispatched first writes back first, and the write-back data of the later-dispatched instruction is stored in its corresponding FIFO until no other data is being written back, at which point it is read from the FIFO and written back to the vector register. After the instruction finishes executing, the instruction execution register determines whether execution is complete from whether the corresponding FIFO is empty and from the state of the execution module running the instruction; if it is complete, the instruction information is removed, which completes the commit.