CN115936128A - Vector processor supporting multi-precision calculation and dynamic configuration and processing method - Google Patents
Vector processor supporting multi-precision calculation and dynamic configuration and processing method
- Publication number: CN115936128A
- Application number: CN202211441900.0A
- Authority: CN (China)
- Prior art keywords: vector, unit, calculation, instruction, data
- Prior art date: 2022-11-17
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
- Y02: Technologies or applications for mitigation or adaptation against climate change
- Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
In the vector processor and data processing method provided by the present invention, a systolic array acceleration unit is added inside the processor channel to perform computation between vectors. The storage units of the original architecture are fully exploited and data throughput is increased, so that computation over larger amounts of vector data is supported, the acceleration offered by the systolic array accelerator is fully utilized, and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Description
Technical Field
The present application relates to the field of integrated circuits and communication technologies, and in particular to a vector processor and a processing method supporting multi-precision calculation and dynamic configuration.
Background Art
A neural network model usually includes a large number of network layers, and each layer performs a convolution between a weight matrix and an activation matrix, where the weight matrix contains a large amount of weight data and the activation matrix contains a large amount of activation data. When a convolution is performed, it is generally converted into a matrix multiplication, which is then computed by a matrix multiplication processor to obtain the result of the convolution.
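The conversion of a convolution into a matrix multiplication mentioned above is commonly done with an im2col-style rearrangement. The Python sketch below illustrates the idea for a single-channel input and a single kernel; the function name, shapes, and data layout are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def im2col(activation, kh, kw):
    """Unfold kh x kw patches of a 2-D activation map into matrix columns."""
    h, w = activation.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=activation.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = activation[i:i + kh, j:j + kw].ravel()
    return cols

# Convolution expressed as a single matrix multiplication
activation = np.arange(16, dtype=np.float32).reshape(4, 4)
weight = np.ones((3, 3), dtype=np.float32)            # one 3x3 kernel
result = weight.ravel() @ im2col(activation, 3, 3)    # length out_h * out_w
print(result.reshape(2, 2))
```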
A matrix multiplication processor usually comprises a number of basic operation units arranged as a systolic array. Multiple weight and activation values are broadcast into the systolic array under the control of a clock signal, and the whole matrix multiplication proceeds as control signals direct each basic operation unit to continuously perform multiply-accumulate operations on the weight and activation data it receives.
With the development of deep neural networks, matrix multiply-add computation has gradually become the part of the workload that processors focus on. Most matrix multiply-add computation today is implemented with arithmetic logic units, which can only process a single fixed-width operand per cycle and therefore cannot exploit the computational potential of matrix multiply-add. As a result, existing ALU-based implementations of matrix multiply-add have low computational efficiency and low data utilization. In existing solutions, the bottleneck is that the current hardware architecture can only perform serial computation between vectors and scalars, which is in essence scalar-scalar computation, so the utilization of the compute architecture cannot reach an ideal level. How to use limited resources to better design hardware units for matrix multiply-add computation has therefore become an urgent problem.
Summary of the Invention
The present application provides a vector processor and a processing method supporting multi-precision calculation and dynamic configuration, which can be used to solve the technical problem of low compute utilization.
In a first aspect, an embodiment of the present application provides a vector processor supporting multi-precision calculation and dynamic configuration, including:
a control module, configured to receive operation instructions passed in from outside, parse them to obtain vector calculation instructions and vector load/store instructions, and determine the functional unit to which a vector calculation instruction is sent;
a load-store module, configured to load data to be processed from outside according to the vector load/store instructions;
an extended channel module, which includes a channel storage unit and a systolic array acceleration unit, where the channel storage unit is configured to store the data to be processed, and the systolic array acceleration unit is configured to fetch the corresponding data from the channel storage unit according to the vector calculation instruction, perform computation between vectors, and return the results to the channel storage unit for storage; the calculation results stored in the channel storage unit are transferred to the outside through the load-store module.
With reference to the first aspect, in one implementation of the first aspect, the control module includes an instruction dispatch unit and a main sequencing unit. The instruction dispatch unit receives operation instructions passed in from outside, identifies the type of each instruction and the corresponding functional unit after processing it, and passes the instruction to the main sequencing unit; the main sequencing unit broadcasts instructions to all functional units and monitors their running status.
With reference to the first aspect, in one implementation of the first aspect, the extended channel module includes a channel instruction sequencing unit, the channel storage unit, and several calculation processing units, and the calculation processing units include the systolic array acceleration unit.
With reference to the first aspect, in one implementation of the first aspect, the channel storage unit includes a vector register file and operand queues. The vector register file supplies operands to the functional units and absorbs their results; the operand queues connect the calculation processing units to the vector register file and distribute operands to each calculation processing unit.
With reference to the first aspect, in one implementation of the first aspect, the vector register file includes several memory banks and an arbitration unit. The banks are single-ported with a width of 64 bits, each vector register file contains 8 banks for loading and delivering data to be processed and calculation results, and the arbitration unit allocates the priority of each operation instruction.
With reference to the first aspect, in one implementation of the first aspect, the channel instruction sequencing unit is configured to send operation instructions to the functional units inside the extended channel module and to initiate requests to read operands from the vector register file.
With reference to the first aspect, in one implementation of the first aspect, after receiving a request to read data from the vector register file, the systolic array acceleration unit automatically feeds the data to be processed into its input buffers in order; when loading finishes, it decides, according to the input mode, whether to start computing; the calculation results are automatically stored in an output buffer and wait for an instruction to output them back to the vector register file.
With reference to the first aspect, in one implementation of the first aspect, the operation instructions are custom instructions, all of which are vector instructions, including the vector calculation instruction, a vector load instruction, and a vector store instruction; the custom instructions contain target address information and action information, and are input into the vector processor from outside for processing.
With reference to the first aspect, in one implementation of the first aspect, the load-store module includes a data loading unit and a data storage unit; the data loading unit loads data through the AXI AR channel, and the data storage unit stores data through the AXI AW channel.
In a second aspect, an embodiment of the present application provides a data processing method supporting multi-precision calculation and dynamic configuration, including:
the control module receives operation instructions passed in from outside;
the control module parses the operation instructions to obtain vector calculation instructions and vector load/store instructions, and determines the functional unit to which a vector calculation instruction is sent;
the load-store module loads the data to be processed from outside according to the vector load/store instructions;
the systolic array acceleration unit in the extended channel module fetches the corresponding data from the channel storage unit according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit for storage; the calculation results stored in the channel storage unit are transferred to the outside through the load-store module.
In the vector processor and data processing method provided by the present invention, a systolic array acceleration unit is added inside the processor channel to perform computation between vectors. The storage units of the original architecture are fully exploited and data throughput is increased, so that computation over larger amounts of vector data is supported, the acceleration offered by the systolic array accelerator is fully utilized, and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the framework of the vector processor in Embodiment 1;
FIG. 2 is a schematic diagram of the framework of the vector processor in Embodiment 2;
FIG. 3 is a framework diagram of the extended channel module in Embodiment 2;
FIG. 4 illustrates the working principle of the systolic array acceleration unit;
FIG. 5 is a schematic comparison of memory bandwidth utilization between a traditional multiplier-tree architecture and a systolic array;
FIG. 6 is a schematic diagram of the composition of the vector calculation instruction;
FIG. 7 is a schematic diagram of the composition of the vector store and load instructions;
FIG. 8 is a flowchart of data processing during instruction execution in Embodiment 2;
FIG. 9 is a flowchart of the method in Embodiment 3.
100: control module; 200: extended channel module; 300: load-store module;
101: instruction dispatch unit; 102: main sequencing unit;
201: channel storage unit; 202: calculation processing unit; 203: channel instruction sequencing unit;
201a: vector register file; 201b: operand queue;
202a: systolic array acceleration unit; 202b: arithmetic logic unit; 202c: multiplication calculation unit; 202d: floating-point processing unit;
201a-1: memory bank; 201a-2: arbitration unit;
301: data loading unit; 302: data storage unit.
Detailed Description of the Embodiments
In order to make the purpose, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Embodiment 1
A possible system architecture to which the embodiments of the present application apply is first introduced below with reference to FIG. 1.
Please refer to FIG. 1, which exemplarily shows a schematic structural diagram of a vector processor to which the embodiments of the present application apply; it includes an extended channel module 200, a control module 100, and a load-store module 300.
The control module 100 is configured to receive operation instructions passed in from outside, parse them to obtain vector calculation instructions and vector load/store instructions, and determine the functional unit to which a vector calculation instruction is sent.
The load-store module 300 is configured to load data to be processed from outside according to the vector load/store instructions.
The extended channel module 200 includes a channel storage unit 201 and a systolic array acceleration unit 202a. The channel storage unit 201 stores the data to be processed; the systolic array acceleration unit 202a fetches the corresponding data from the channel storage unit 201 according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit 201 for storage; the calculation results stored in the channel storage unit 201 are transferred to the outside through the load-store module 300.
In the vector processor provided by this embodiment, a systolic array acceleration unit 202a is added inside the processor channel, and dedicated vector instructions are designed to invoke it for vector-to-vector computation. Whereas the original arithmetic logic unit 202b can only process a single fixed-width operand per cycle, the systolic array acceleration unit 202a makes full use of the storage units of the original architecture, increases data throughput, and supports computation over larger amounts of vector data, so that the acceleration offered by the systolic array accelerator is fully exploited and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Embodiment 2
This embodiment provides a vector processor supporting multi-precision calculation and dynamic configuration, whose architecture is shown in FIG. 2.
The control module 100 includes an instruction dispatch unit 101 and a main sequencing unit 102. The instruction dispatch unit 101 receives operation instructions passed in from outside, identifies the type of each instruction and the corresponding functional unit after processing it, and passes the instruction to the main sequencing unit 102; the main sequencing unit 102 broadcasts instructions to all functional units and monitors their running status.
Specifically, the instruction dispatch unit 101 is the decoder of the vector processor. It receives operation instructions passed in from outside (i.e., from the first-level instruction decoding unit) and identifies the vector functional unit targeted by a vector instruction, the type of the operation instruction, and other information. When identification is complete, the instruction dispatch unit 101 passes the instruction and the decoded information to the main sequencing unit 102 for subsequent processing. The instruction dispatch unit 101 can also filter instructions by type and give feedback to the first-level instruction decoder; for example, if a scalar instruction is received, an error message is returned.
The main sequencing unit 102 tracks the operation instructions running on the extended channel modules 200, dispatches them to the different functional units, and acknowledges them through the external processing unit. The main sequencing unit 102 is aware of the execution progress of the instructions on all extended channel modules 200. It also stores information about which vector instruction is accessing which vector register; this information is used to detect data hazards between instructions and thus determine the priority in which the instructions run.
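A minimal sketch of the kind of register-based hazard check described for the main sequencing unit 102 is given below. The data structures and the RAW/WAR/WAW classification are illustrative assumptions for the purpose of explanation, not the patent's exact bookkeeping.

```python
from dataclasses import dataclass, field

@dataclass
class VectorInstr:
    name: str
    reads: set = field(default_factory=set)   # vector registers read
    writes: set = field(default_factory=set)  # vector registers written

def hazards(older, newer):
    """Return the data hazards the newer instruction has against an in-flight one."""
    found = []
    if newer.reads & older.writes:
        found.append("RAW")
    if newer.writes & older.reads:
        found.append("WAR")
    if newer.writes & older.writes:
        found.append("WAW")
    return found

in_flight = [VectorInstr("vle", writes={12})]
incoming = VectorInstr("vmsam", reads={12, 13}, writes={20})
for instr in in_flight:
    print(instr.name, "->", incoming.name, hazards(instr, incoming))  # prints ['RAW']
```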
As shown in FIG. 2, the extended channel module 200 includes a channel instruction sequencing unit 203, a channel storage unit 201, and several calculation processing units 202. The channel instruction sequencing unit 203 sends operation instructions to the functional units inside the extended channel module 200; the channel storage unit 201 stores the data to be processed and the calculation results; and the calculation processing units 202 receive vector calculation instructions, read the data to be processed from the channel storage unit 201, perform the computation, and return the results to the channel storage unit 201.
As shown in FIG. 2, the vector processor also contains several other extended channel modules 200, namely channel 1 … channel N, and in this example the extended channel module 200 containing the systolic array acceleration unit 202a is channel 0. Every extended channel module 200 has the following channel structure: each channel has its own channel instruction sequencing unit 203, which tracks up to 8 parallel vector instructions; each channel also has a channel storage unit 201, containing a register file and the operand queues that coordinate access to it, and a calculation processing unit 202, containing an integer arithmetic logic unit 202b, an integer multiplier, and a floating-point computation unit. Each channel holds a portion of the overall register file and of the calculation processing units 202. In this embodiment, a systolic array acceleration unit 202a is added inside the channel structure to handle operations between vectors.
The specific structure inside the extended channel module 200 is analyzed below with reference to FIG. 3.
The channel instruction sequencing unit 203 sends operation instructions to the functional units inside the extended channel module 200 and controls their execution within the scope of a single channel. Unlike the main sequencing unit 102, the channel instruction sequencing units 203 do not store the state of running instructions, which avoids duplicating data across channels. They also initiate requests to read operands from the vector register file 201a in the channel storage unit 201.
The channel storage unit 201 in this embodiment includes a vector register file 201a and operand queues 201b. The vector register file 201a supplies operands to the functional units and absorbs their results; the operand queues 201b connect the calculation processing units 202 to the vector register file 201a and distribute operands to each calculation processing unit 202.
The vector register file 201a includes several memory banks 201a-1 and an arbitration unit 201a-2. The banks 201a-1 are single-ported with a width of 64 bits, each vector register file 201a contains 8 banks 201a-1 for loading and delivering data to be processed and calculation results, and the arbitration unit 201a-2 allocates the priority of each operation instruction.
The vector register file 201a (VRF, Vector Register File) is the core of each vector processor. Because several instructions can run in parallel, the vector register file 201a must provide enough throughput to supply operands to the functional units and absorb their results. The VRF of the vector processor consists of a set of single-ported memory banks 201a-1. The width of each bank 201a-1 is limited to the data-path width of a channel, i.e. 64 bits, to avoid sub-word selection logic errors. In steady state, five banks 201a-1 can therefore be accessed simultaneously, which sustains the maximum throughput of the scheduled multiply-add instructions. The vector register file 201a is provided with eight banks per channel, leaving some margin for accesses.
When several functional units try to access operands in the same bank, a bank conflict occurs. Each extended channel module 200 has a set of operand queues 201b between the VRF and the individual functional units to absorb such conflicts. In this embodiment there are 12 operand queues 201b: 4 are dedicated to the FPU/MUL (floating-point processing unit 202d / multiplication calculation unit 202c), 3 to the ALU (arithmetic logic unit 202b), another 3 to the VLSU (vector load/store unit), and 2 to the systolic array acceleration unit 202a. Each queue is 64 bits wide; their depths were chosen through simulation and depend on the latency and throughput of the functional units.
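The bank-conflict behaviour described above can be pictured with the following sketch, in which each single-ported bank serves at most one request per cycle and losing requests wait in the requesting unit's operand queue. The address-to-bank mapping (address modulo 8) is an illustrative assumption, not a mapping stated in the patent.

```python
from collections import defaultdict, deque

NUM_BANKS = 8   # banks per vector register file, 64 bits wide each

def schedule(requests):
    """One VRF cycle: map (unit, element_address) requests to banks, serve one
    request per bank, and defer the rest to the units' operand queues."""
    by_bank = defaultdict(deque)
    for unit, addr in requests:
        by_bank[addr % NUM_BANKS].append((unit, addr))
    served, deferred = [], []
    for bank, queue in by_bank.items():
        served.append((bank, queue.popleft()))   # single port: one access per cycle
        deferred.extend(queue)                   # bank conflict, absorbed by queues
    return served, deferred

served, deferred = schedule([("fpu", 0), ("alu", 8), ("sa", 3)])
print(served)     # addresses 0 and 8 both map to bank 0, so only one is served
print(deferred)   # the other waits in its operand queue until the next cycle
```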
As shown in FIG. 3, a calculation processing unit 202 contains an integer ALU (arithmetic logic unit 202b), an integer MUL (multiplication calculation unit 202c), and an FPU (floating-point processing unit 202d), all of which operate on a 64-bit data path and cover the basic scalar operations. The MUL and the FPU share operand queues 201b but cannot use them at the same time. The vector processor fully supports multi-precision operations, allowing data to be converted from 8 to 16 bits, from 16 to 32 bits, and from 32 to 64 bits. The FPU is configured to support FMAs (floating-point fused multiply-add), addition, multiplication, division, square root, and comparison.
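The multi-precision support mentioned above (8-, 16-, 32- and 64-bit elements on a 64-bit data path) can be sketched as sub-word unpacking followed by widening. The little-endian lane order and signed interpretation below are assumptions made only for illustration.

```python
def unpack(word64, elem_bits):
    """Split a 64-bit data-path word into its sub-word lanes (8/16/32/64 bits)."""
    assert elem_bits in (8, 16, 32, 64)
    mask = (1 << elem_bits) - 1
    return [(word64 >> (i * elem_bits)) & mask for i in range(64 // elem_bits)]

def sign_extend(value, from_bits):
    """Interpret a from_bits-wide lane as signed, e.g. when widening 8-bit to 16-bit."""
    sign = 1 << (from_bits - 1)
    return value - (1 << from_bits) if value & sign else value

word = 0x0102030405060708
print(unpack(word, 8))    # eight 8-bit lanes
print(unpack(word, 16))   # four 16-bit lanes
print([sign_extend(e, 8) for e in unpack(0xFF00000000000080, 8)])  # [-128, 0, ..., -1]
```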
The systolic array unit is dedicated to operations between vectors, which greatly improves the vector computing capability of the vector processor. FIG. 4 illustrates the working principle of the systolic array acceleration unit 202a. The systolic array acceleration unit 202a consists of multiple identically constructed processing elements (PEs) and works in a pipeline-like manner: each PE computes on the incoming data and temporarily stores a partial sum, and in the next clock cycle it broadcasts the designated data in the designated direction to the neighbouring PEs. In a systolic array adapted to matrix multiplication, a PE broadcasts its input data to the neighbouring PEs.
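A small simulation of the working principle just described is given below, assuming an output-stationary dataflow in which each PE multiply-accumulates its two inputs and forwards them to its right-hand and lower neighbours in the next cycle. The dataflow variant, the skewed operand feeding, and the array size are illustrative assumptions and may differ from the patent's array.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of C = A @ B on an output-stationary systolic array."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m), dtype=A.dtype)      # one accumulator per PE
    a_reg = np.zeros((n, m), dtype=A.dtype)    # operand registers inside the array
    b_reg = np.zeros((n, m), dtype=A.dtype)
    for t in range(k + n + m - 2):             # enough cycles for the skewed streams to drain
        new_a, new_b = np.zeros_like(a_reg), np.zeros_like(b_reg)
        for i in range(n):
            for j in range(m):
                # operands arrive from the west/north neighbour, or from the
                # skewed input streams at the array edges
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < k else 0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < k else 0)
                acc[i, j] += a_in * b_in       # multiply-accumulate into the partial sum
                new_a[i, j], new_b[i, j] = a_in, b_in
        a_reg, b_reg = new_a, new_b            # next cycle: forward operands to neighbours
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)
print(systolic_matmul(A, B))
print(A @ B)   # the two results match
```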
Since in many cases the processing capability of the overall system is limited by memory access bandwidth rather than by computing capability, as can be seen from FIG. 5, the systolic array makes better use of the limited memory bandwidth than a traditional multiplier-tree architecture without wasting computing capability. Through systolic transfer combined with data broadcasting, the systolic array reuses data quite thoroughly, which allows it to achieve a high computational throughput without a large increase in storage space.
In this embodiment, when the systolic array acceleration unit 202a receives a request from the VRF in the channel to feed data into the systolic array, the control unit inside the systolic array acceleration unit 202a automatically writes the data, in order, into buffer A and buffer B. When data loading finishes, the unit decides, according to the input mode, whether to start computing. After the computation finishes, the data are automatically stored in the output buffer and wait for an instruction to output the results to the VRF.
As shown in FIG. 2, the load-store module 300 in this embodiment contains a data loading unit 301 and a data storage unit 302, which are used respectively to load data to be processed from outside and to store calculation results from the VRF into external memory. When the start address in an operation instruction is passed to the data loading unit 301, the unit coalesces the individual element memory operations into burst requests, avoiding the need to request single elements from memory. The burst start address and burst length are then sent to the load or store unit, which are responsible for initiating the data transfers through the Advanced eXtensible Interface (AXI) of the vector processing unit. The data loading unit 301 loads data through the AXI AR channel, and the data storage unit 302 stores data through the AXI AW channel.
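The burst coalescing described for the data loading unit 301 can be sketched as follows. The 8-byte element size and the 16-beat burst limit are illustrative assumptions, not values given in the patent.

```python
def coalesce_bursts(addresses, elem_bytes=8, max_beats=16):
    """Coalesce consecutive element addresses into (burst_start, burst_length) pairs,
    so that single elements are not requested individually over the AXI AR channel."""
    bursts = []
    for addr in sorted(addresses):
        if bursts:
            start, beats = bursts[-1]
            if addr == start + beats * elem_bytes and beats < max_beats:
                bursts[-1] = (start, beats + 1)   # extend the current burst
                continue
        bursts.append((addr, 1))                  # start a new burst
    return bursts

# Ten consecutive 64-bit elements starting at 0x8000_0004 become a single burst
print(coalesce_bursts([0x80000004 + 8 * i for i in range(10)]))
# [(2147483652, 10)]
```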
The operation instructions in this embodiment are custom instructions, all of which are vector instructions, including the vector calculation instruction, a vector load instruction, and a vector store instruction; the custom instructions contain target address information and action information, and are input into the processing module from outside for processing.
The vector calculation instruction is a custom VMSAM instruction; the vector load instruction VL*E and the vector store instruction VS*E are both RISC-V instructions and also support customization.
The RISC-V instruction set is an open instruction set architecture (ISA) based on the principles of reduced instruction set computing (RISC); RISC-V is a new instruction system built on the continuous development and maturation of instruction sets. The RISC-V instruction set is fully open source, simple in design, easy to port Unix systems to, modular, and comes with a complete toolchain, as well as a large number of open-source implementations and tape-out cases.
RISC-V is a modular reduced instruction set architecture consisting of the base integer instruction set denoted by the letter I and extension instruction sets denoted by the letters M, A, F, D, C, and so on. The only instruction set that must be implemented is the I base integer instruction set, which is sufficient for a complete software compilation toolchain. The extension instruction sets add multiplication and division, atomic memory operations, floating-point operations, and other functions; all of these extensions can be selected and configured as needed.
Although RISC-V is not the first open-source instruction set architecture (ISA), it is the first one designed so that an appropriate subset can be selected for a specific scenario. Processors for many application scenarios, such as server CPUs, household-appliance CPUs, and CPUs in sensors, can be designed on the basis of the RISC-V instruction set architecture.
As shown in FIG. 6, the custom VMSAM instruction belongs to the vector-processing class of instructions. After it enters the vector processing unit, the value of rs1 is decoded to obtain the state input and calculation precision of the systolic array accelerator, which drives the module to start running; the instruction does not finish until the computation is complete.
As shown in FIG. 7, the VL*E instruction belongs to the vector-load class of instructions. After it enters the vector processing unit, the target functional unit is the VRF inside the extended channel module 200; the value of vs1 is decoded externally into the start address transferred to the AR channel of the AXI bus; rs2 identifies the way the data are to be fetched from memory, and rd is the start address at which the data are stored in the VRF.
The VS*E instruction belongs to the vector-store class of instructions. After it enters the vector processing unit, the target functional unit is the VRF inside the channel; the value of vs1 is decoded externally into the start address transferred to the AW channel of the AXI bus; rs2 identifies the way the data are to be taken out of the VRF, and rd is the start address at which the data are stored in memory.
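Since VMSAM, VL*E, and VS*E are described in terms of rs1/rs2/rd fields, a sketch of extracting the base RISC-V 32-bit field layout is given below. The example word is the VMSAM pattern given later in Embodiment 3, treated here as a binary literal; the meanings the patent assigns to the decoded values (mode, precision, addresses) are not reproduced, and the custom opcode values are not asserted.

```python
def fields(instr):
    """Split a 32-bit instruction word into the base RISC-V R-type-style fields
    that the VMSAM / VL*E / VS*E descriptions refer to."""
    return {
        "opcode": instr & 0x7F,
        "rd":     (instr >> 7)  & 0x1F,
        "funct3": (instr >> 12) & 0x07,
        "rs1":    (instr >> 15) & 0x1F,
        "rs2":    (instr >> 20) & 0x1F,
        "funct7": (instr >> 25) & 0x7F,
    }

word = 0b1010_1010_1001_0001_1010_0000_0101_0111   # VMSAM example word from Embodiment 3
print(fields(word))
```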
FIG. 8 shows the flow of data processing when instructions are run in the vector processor:
1. When the vector processor obtains a decoded VLE instruction, data are read from the AR channel of the AXI bus sequentially starting from the start address; when reading finishes, the data are loaded into the VRF in the channel. When the data transfer ends, execution of the instruction ends.
2. After the VMSAM instruction starts, the decoded rs1 specifies the start address from which data are read from the VRF, rs2 specifies the operating mode of the systolic array acceleration unit 202a and the quantization precision of its data, and rd specifies the start address at which the result is stored in the VRF. When the systolic array acceleration unit 202a receives the signal to start computing, the data are fed into the systolic array. After the computation finishes, execution of the instruction ends.
3. After the VSE instruction starts, data are loaded from the VRF in the channel onto the AW channel of the AXI bus; when one transfer of the specified length finishes, the instruction ends.
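From the software side, the three steps above form a load, compute, and store sequence. The following sketch shows only the data movement: a Python dict stands in for the VRF and a per-element square stands in for the systolic computation; the register numbers and the placeholder operation are illustrative assumptions, not the patent's semantics.

```python
def vector_kernel(mem, src, dst, n, mode):
    """Software-level view of the VLE -> VMSAM -> VSE sequence of FIG. 8."""
    vrf = {}
    vrf[12] = [mem[src + 8 * i] for i in range(n)]           # VLE: AXI AR burst into the VRF
    vrf[20] = [x * x for x in vrf[12]] if mode else vrf[12]  # VMSAM: stand-in for the systolic unit
    for i, x in enumerate(vrf[20]):                          # VSE: VRF out over the AXI AW channel
        mem[dst + 8 * i] = x

memory = {0x80000004 + 8 * i: i + 1 for i in range(4)}
vector_kernel(memory, 0x80000004, 0x80042380, 4, mode=1)
print([memory[0x80042380 + 8 * i] for i in range(4)])        # [1, 4, 9, 16]
```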
As an alternative implementation, since the design of custom instructions allows a certain flexibility, the functional units in the processor may be directed and controlled with instructions other than the three specific ones proposed in this embodiment.
Embodiment 3
Referring to FIG. 9, which exemplarily shows a flowchart of a data processing method to which the embodiments of the present application apply, the method mainly includes the following steps:
the control module 100 receives operation instructions passed in from outside;
the control module 100 parses the operation instructions to obtain vector calculation instructions and vector load/store instructions, and determines the functional unit to which a vector calculation instruction is sent;
the load-store module 300 loads the data to be processed from outside according to the vector load/store instructions;
the systolic array acceleration unit 202a in the extended channel module 200 fetches the corresponding data from the channel storage unit 201 according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit 201 for storage; the calculation results stored in the channel storage unit 201 are transferred to the outside through the load-store module 300.
Specifically, taking as an example an instruction type of a standard R-type instruction that operates on a destination register and source registers, this embodiment shows the workflow of the various instructions after they enter the vector processor, as follows:
Data flow of a VL*E instruction:
1. An instruction is input to the external decoder: 32'b 0000 1101 1010 0111 1111 1001 0101 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing. At the same time, the rs1 value is 0F, which is decoded externally to show that the value stored at that address is 32'h 8000 0004, and the rd value is 12. This information is also passed into the vector processor.
2. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data load instruction VLE. It also loads the rs1 and rs2 values and the decoded address value, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
3. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the data loading unit 301 accepts it. The data loading unit 301 issues a data load request on the AR channel of the AXI bus according to the start address and transfer length information. When loading finishes, the loading unit transfers the data into the VRF in the lane, with the rd value as the storage address. When the data are fully loaded into the VRF, execution of the VLE instruction ends.
Data flow of the VMSAM instruction:
4. An instruction is input to the external decoder: 32'b 1010 1010 1001 0001 1010 0000 0101 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing.
5. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data processing instruction VMSAM. It also loads the rs1, rs2, and rd values, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
6. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the channels and the systolic array acceleration units 202a in them accept it. Given the start address in rs1 and the mode state contained in rs2, the rs2 value is decoded to give an input precision of 64 bits. The functional unit starts fetching data from the VRF and decides, according to the input state, whether to start computing. After the computation finishes, the results are stored in the output buffer. With the rd value as the start address, the values in the output buffer are stored into the VRF. When the data are fully loaded into the VRF, execution of the VMSAM instruction ends.
Data flow of the VS*E instruction:
7. An instruction is input to the external decoder: 32'b 0000 0010 0000 1101 1111 0000 0010 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing. At the same time, the rs1 value is 1B, which is decoded in the external decoder to show that the value stored at that address is 32'h 8004 2380, and the rd value is 00. This information is also passed into the vector processor.
8. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data store instruction VSE. It also loads the rs1 and rd values and the decoded address value, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
9. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the data storage unit 302 accepts it. The data storage unit 302 issues a data store request on the AW channel of the AXI bus according to the decoded start address and transfer length information. When the store finishes, the AW channel returns a completion signal to the instruction sequencing unit. When the data are fully loaded onto the AW channel of the AXI bus, execution of the VSE instruction ends.
Those skilled in the art can clearly understand that the techniques in the embodiments of the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of the present application.
For identical or similar parts among the various embodiments in this specification, reference may be made to one another. In particular, since the service constructing apparatus and service loading apparatus embodiments are basically similar to the method embodiments, they are described relatively simply; for relevant points, refer to the descriptions in the method embodiments.
The embodiments of the present application described above do not limit the scope of protection of the present application.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211441900.0A | 2022-11-17 | 2022-11-17 | Vector processor supporting multi-precision calculation and dynamic configuration and processing method |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115936128A | 2023-04-07 |
Family
ID=86648181
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211441900.0A (Pending) | Vector processor supporting multi-precision calculation and dynamic configuration and processing method | 2022-11-17 | 2022-11-17 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN115936128A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118627565A * | 2024-08-13 | 2024-09-10 | | A configurable convolution operation acceleration device and method based on systolic array |
| CN118627565B * | 2024-08-13 | 2025-01-03 | | A configurable convolution operation acceleration device and method based on systolic array |

2022

- 2022-11-17: Application CN202211441900.0A filed in CN; patent published as CN115936128A (en); status active, Pending
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |