CN115936128A - Vector processor supporting multi-precision calculation and dynamic configuration and processing method - Google Patents
Vector processor supporting multi-precision calculation and dynamic configuration and processing method
- Publication number: CN115936128A
- Application number: CN202211441900.0A
- Authority: CN (China)
- Prior art keywords: vector, unit, calculation, instruction, data
- Prior art date: 2022-11-17
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
- Y02: Technologies or applications for mitigation or adaptation against climate change
- Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
In the vector processor and data processing method provided by the present invention, a systolic array acceleration unit is added inside the processor channel to perform computation between vectors. The storage units of the original architecture are fully exploited and data throughput is increased, so that computation over larger amounts of vector data is supported, the acceleration offered by the systolic array accelerator is fully utilized, and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Description
Technical Field
The present application relates to the field of integrated circuits and communication technologies, and in particular to a vector processor and a processing method supporting multi-precision calculation and dynamic configuration.
Background Art
A neural network model usually includes a large number of network layers, and each layer performs a convolution between a weight matrix and an activation matrix, where the weight matrix contains a large amount of weight data and the activation matrix contains a large amount of activation data. When a convolution is performed, it is generally converted into a matrix multiplication, which is then computed by a matrix multiplication processor to obtain the result of the convolution.
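The conversion of a convolution into a matrix multiplication mentioned above is commonly done with an im2col-style rearrangement. The Python sketch below illustrates the idea for a single-channel input and a single kernel; the function name, shapes, and data layout are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def im2col(activation, kh, kw):
    """Unfold kh x kw patches of a 2-D activation map into matrix columns."""
    h, w = activation.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, out_h * out_w), dtype=activation.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = activation[i:i + kh, j:j + kw].ravel()
    return cols

# Convolution expressed as a single matrix multiplication
activation = np.arange(16, dtype=np.float32).reshape(4, 4)
weight = np.ones((3, 3), dtype=np.float32)            # one 3x3 kernel
result = weight.ravel() @ im2col(activation, 3, 3)    # length out_h * out_w
print(result.reshape(2, 2))
```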
A matrix multiplication processor usually comprises a number of basic operation units arranged as a systolic array. Multiple weight and activation values are broadcast into the systolic array under the control of a clock signal, and the whole matrix multiplication proceeds as control signals direct each basic operation unit to continuously perform multiply-accumulate operations on the weight and activation data it receives.
With the development of deep neural networks, matrix multiply-add computation has gradually become the part of the workload that processors focus on. Most matrix multiply-add computation today is implemented with arithmetic logic units, which can only process a single fixed-width operand per cycle and therefore cannot exploit the computational potential of matrix multiply-add. As a result, existing ALU-based implementations of matrix multiply-add have low computational efficiency and low data utilization. In existing solutions, the bottleneck is that the current hardware architecture can only perform serial computation between vectors and scalars, which is in essence scalar-scalar computation, so the utilization of the compute architecture cannot reach an ideal level. How to use limited resources to better design hardware units for matrix multiply-add computation has therefore become an urgent problem.
Summary of the Invention
The present application provides a vector processor and a processing method supporting multi-precision calculation and dynamic configuration, which can be used to solve the technical problem of low compute utilization.
In a first aspect, an embodiment of the present application provides a vector processor supporting multi-precision calculation and dynamic configuration, including:
a control module, configured to receive operation instructions passed in from outside, parse them to obtain vector calculation instructions and vector load/store instructions, and determine the functional unit to which a vector calculation instruction is sent;
a load-store module, configured to load data to be processed from outside according to the vector load/store instructions;
an extended channel module, which includes a channel storage unit and a systolic array acceleration unit, where the channel storage unit is configured to store the data to be processed, and the systolic array acceleration unit is configured to fetch the corresponding data from the channel storage unit according to the vector calculation instruction, perform computation between vectors, and return the results to the channel storage unit for storage; the calculation results stored in the channel storage unit are transferred to the outside through the load-store module.
With reference to the first aspect, in one implementation of the first aspect, the control module includes an instruction dispatch unit and a main sequencing unit. The instruction dispatch unit receives operation instructions passed in from outside, identifies the type of each instruction and the corresponding functional unit after processing it, and passes the instruction to the main sequencing unit; the main sequencing unit broadcasts instructions to all functional units and monitors their running status.
With reference to the first aspect, in one implementation of the first aspect, the extended channel module includes a channel instruction sequencing unit, the channel storage unit, and several calculation processing units, and the calculation processing units include the systolic array acceleration unit.
With reference to the first aspect, in one implementation of the first aspect, the channel storage unit includes a vector register file and operand queues. The vector register file supplies operands to the functional units and absorbs their results; the operand queues connect the calculation processing units to the vector register file and distribute operands to each calculation processing unit.
With reference to the first aspect, in one implementation of the first aspect, the vector register file includes several memory banks and an arbitration unit. The banks are single-ported with a width of 64 bits, each vector register file contains 8 banks for loading and delivering data to be processed and calculation results, and the arbitration unit allocates the priority of each operation instruction.
With reference to the first aspect, in one implementation of the first aspect, the channel instruction sequencing unit is configured to send operation instructions to the functional units inside the extended channel module and to initiate requests to read operands from the vector register file.
With reference to the first aspect, in one implementation of the first aspect, after receiving a request to read data from the vector register file, the systolic array acceleration unit automatically feeds the data to be processed into its input buffers in order; when loading finishes, it decides, according to the input mode, whether to start computing; the calculation results are automatically stored in an output buffer and wait for an instruction to output them back to the vector register file.
With reference to the first aspect, in one implementation of the first aspect, the operation instructions are custom instructions, all of which are vector instructions, including the vector calculation instruction, a vector load instruction, and a vector store instruction; the custom instructions contain target address information and action information, and are input into the vector processor from outside for processing.
With reference to the first aspect, in one implementation of the first aspect, the load-store module includes a data loading unit and a data storage unit; the data loading unit loads data through the AXI AR channel, and the data storage unit stores data through the AXI AW channel.
In a second aspect, an embodiment of the present application provides a data processing method supporting multi-precision calculation and dynamic configuration, including:
the control module receives operation instructions passed in from outside;
the control module parses the operation instructions to obtain vector calculation instructions and vector load/store instructions, and determines the functional unit to which a vector calculation instruction is sent;
the load-store module loads the data to be processed from outside according to the vector load/store instructions;
the systolic array acceleration unit in the extended channel module fetches the corresponding data from the channel storage unit according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit for storage; the calculation results stored in the channel storage unit are transferred to the outside through the load-store module.
In the vector processor and data processing method provided by the present invention, a systolic array acceleration unit is added inside the processor channel to perform computation between vectors. The storage units of the original architecture are fully exploited and data throughput is increased, so that computation over larger amounts of vector data is supported, the acceleration offered by the systolic array accelerator is fully utilized, and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the framework of the vector processor in Embodiment 1;
FIG. 2 is a schematic diagram of the framework of the vector processor in Embodiment 2;
FIG. 3 is a framework diagram of the extended channel module in Embodiment 2;
FIG. 4 illustrates the working principle of the systolic array acceleration unit;
FIG. 5 is a schematic comparison of memory bandwidth utilization between a traditional multiplier-tree architecture and a systolic array;
FIG. 6 is a schematic diagram of the composition of the vector calculation instruction;
FIG. 7 is a schematic diagram of the composition of the vector store and load instructions;
FIG. 8 is a flowchart of data processing during instruction execution in Embodiment 2;
FIG. 9 is a flowchart of the method in Embodiment 3.
100: control module; 200: extended channel module; 300: load-store module;
101: instruction dispatch unit; 102: main sequencing unit;
201: channel storage unit; 202: calculation processing unit; 203: channel instruction sequencing unit;
201a: vector register file; 201b: operand queue;
202a: systolic array acceleration unit; 202b: arithmetic logic unit; 202c: multiplication calculation unit; 202d: floating-point processing unit;
201a-1: memory bank; 201a-2: arbitration unit;
301: data loading unit; 302: data storage unit.
Detailed Description of the Embodiments
In order to make the purpose, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Embodiment 1
A possible system architecture to which the embodiments of the present application apply is first introduced below with reference to FIG. 1.
Please refer to FIG. 1, which exemplarily shows a schematic structural diagram of a vector processor to which the embodiments of the present application apply; it includes an extended channel module 200, a control module 100, and a load-store module 300.
The control module 100 is configured to receive operation instructions passed in from outside, parse them to obtain vector calculation instructions and vector load/store instructions, and determine the functional unit to which a vector calculation instruction is sent.
The load-store module 300 is configured to load data to be processed from outside according to the vector load/store instructions.
The extended channel module 200 includes a channel storage unit 201 and a systolic array acceleration unit 202a. The channel storage unit 201 stores the data to be processed; the systolic array acceleration unit 202a fetches the corresponding data from the channel storage unit 201 according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit 201 for storage; the calculation results stored in the channel storage unit 201 are transferred to the outside through the load-store module 300.
In the vector processor provided by this embodiment, a systolic array acceleration unit 202a is added inside the processor channel, and dedicated vector instructions are designed to invoke it for vector-to-vector computation. Whereas the original arithmetic logic unit 202b can only process a single fixed-width operand per cycle, the systolic array acceleration unit 202a makes full use of the storage units of the original architecture, increases data throughput, and supports computation over larger amounts of vector data, so that the acceleration offered by the systolic array accelerator is fully exploited and compute utilization is greatly improved. The systolic array accelerator supports multi-precision and ultra-low-bit quantized computation, which improves the efficiency of vector calculations, while the parallelism and scalability of the vector processor greatly increase data computation density and thus effectively raise the available computing power.
Embodiment 2
This embodiment provides a vector processor supporting multi-precision calculation and dynamic configuration, whose architecture is shown in FIG. 2.
The control module 100 includes an instruction dispatch unit 101 and a main sequencing unit 102. The instruction dispatch unit 101 receives operation instructions passed in from outside, identifies the type of each instruction and the corresponding functional unit after processing it, and passes the instruction to the main sequencing unit 102; the main sequencing unit 102 broadcasts instructions to all functional units and monitors their running status.
Specifically, the instruction dispatch unit 101 is the decoder of the vector processor. It receives operation instructions passed in from outside (i.e., from the first-level instruction decoding unit) and identifies the vector functional unit targeted by a vector instruction, the type of the operation instruction, and other information. When identification is complete, the instruction dispatch unit 101 passes the instruction and the decoded information to the main sequencing unit 102 for subsequent processing. The instruction dispatch unit 101 can also filter instructions by type and give feedback to the first-level instruction decoder; for example, if a scalar instruction is received, an error message is returned.
The main sequencing unit 102 tracks the operation instructions running on the extended channel modules 200, dispatches them to the different functional units, and acknowledges them through the external processing unit. The main sequencing unit 102 is aware of the execution progress of the instructions on all extended channel modules 200. It also stores information about which vector instruction is accessing which vector register; this information is used to detect data hazards between instructions and thus determine the priority in which the instructions run.
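A minimal sketch of the kind of register-based hazard check described for the main sequencing unit 102 is given below. The data structures and the RAW/WAR/WAW classification are illustrative assumptions for the purpose of explanation, not the patent's exact bookkeeping.

```python
from dataclasses import dataclass, field

@dataclass
class VectorInstr:
    name: str
    reads: set = field(default_factory=set)   # vector registers read
    writes: set = field(default_factory=set)  # vector registers written

def hazards(older, newer):
    """Return the data hazards the newer instruction has against an in-flight one."""
    found = []
    if newer.reads & older.writes:
        found.append("RAW")
    if newer.writes & older.reads:
        found.append("WAR")
    if newer.writes & older.writes:
        found.append("WAW")
    return found

in_flight = [VectorInstr("vle", writes={12})]
incoming = VectorInstr("vmsam", reads={12, 13}, writes={20})
for instr in in_flight:
    print(instr.name, "->", incoming.name, hazards(instr, incoming))  # prints ['RAW']
```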
As shown in FIG. 2, the extended channel module 200 includes a channel instruction sequencing unit 203, a channel storage unit 201, and several calculation processing units 202. The channel instruction sequencing unit 203 sends operation instructions to the functional units inside the extended channel module 200; the channel storage unit 201 stores the data to be processed and the calculation results; and the calculation processing units 202 receive vector calculation instructions, read the data to be processed from the channel storage unit 201, perform the computation, and return the results to the channel storage unit 201.
As shown in FIG. 2, the vector processor also contains several other extended channel modules 200, namely channel 1 … channel N, and in this example the extended channel module 200 containing the systolic array acceleration unit 202a is channel 0. Every extended channel module 200 has the following channel structure: each channel has its own channel instruction sequencing unit 203, which tracks up to 8 parallel vector instructions; each channel also has a channel storage unit 201, containing a register file and the operand queues that coordinate access to it, and a calculation processing unit 202, containing an integer arithmetic logic unit 202b, an integer multiplier, and a floating-point computation unit. Each channel holds a portion of the overall register file and of the calculation processing units 202. In this embodiment, a systolic array acceleration unit 202a is added inside the channel structure to handle operations between vectors.
The specific structure inside the extended channel module 200 is analyzed below with reference to FIG. 3.
The channel instruction sequencing unit 203 sends operation instructions to the functional units inside the extended channel module 200 and controls their execution within the scope of a single channel. Unlike the main sequencing unit 102, the channel instruction sequencing units 203 do not store the state of running instructions, which avoids duplicating data across channels. They also initiate requests to read operands from the vector register file 201a in the channel storage unit 201.
The channel storage unit 201 in this embodiment includes a vector register file 201a and operand queues 201b. The vector register file 201a supplies operands to the functional units and absorbs their results; the operand queues 201b connect the calculation processing units 202 to the vector register file 201a and distribute operands to each calculation processing unit 202.
The vector register file 201a includes several memory banks 201a-1 and an arbitration unit 201a-2. The banks 201a-1 are single-ported with a width of 64 bits, each vector register file 201a contains 8 banks 201a-1 for loading and delivering data to be processed and calculation results, and the arbitration unit 201a-2 allocates the priority of each operation instruction.
The vector register file 201a (VRF, Vector Register File) is the core of each vector processor. Because several instructions can run in parallel, the vector register file 201a must provide enough throughput to supply operands to the functional units and absorb their results. The VRF of the vector processor consists of a set of single-ported memory banks 201a-1. The width of each bank 201a-1 is limited to the data-path width of a channel, i.e. 64 bits, to avoid sub-word selection logic errors. In steady state, five banks 201a-1 can therefore be accessed simultaneously, which sustains the maximum throughput of the scheduled multiply-add instructions. The vector register file 201a is provided with eight banks per channel, leaving some margin for accesses.
When several functional units try to access operands in the same bank, a bank conflict occurs. Each extended channel module 200 has a set of operand queues 201b between the VRF and the individual functional units to absorb such conflicts. In this embodiment there are 12 operand queues 201b: 4 are dedicated to the FPU/MUL (floating-point processing unit 202d / multiplication calculation unit 202c), 3 to the ALU (arithmetic logic unit 202b), another 3 to the VLSU (vector load/store unit), and 2 to the systolic array acceleration unit 202a. Each queue is 64 bits wide; their depths were chosen through simulation and depend on the latency and throughput of the functional units.
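The bank-conflict behaviour described above can be pictured with the following sketch, in which each single-ported bank serves at most one request per cycle and losing requests wait in the requesting unit's operand queue. The address-to-bank mapping (address modulo 8) is an illustrative assumption, not a mapping stated in the patent.

```python
from collections import defaultdict, deque

NUM_BANKS = 8   # banks per vector register file, 64 bits wide each

def schedule(requests):
    """One VRF cycle: map (unit, element_address) requests to banks, serve one
    request per bank, and defer the rest to the units' operand queues."""
    by_bank = defaultdict(deque)
    for unit, addr in requests:
        by_bank[addr % NUM_BANKS].append((unit, addr))
    served, deferred = [], []
    for bank, queue in by_bank.items():
        served.append((bank, queue.popleft()))   # single port: one access per cycle
        deferred.extend(queue)                   # bank conflict, absorbed by queues
    return served, deferred

served, deferred = schedule([("fpu", 0), ("alu", 8), ("sa", 3)])
print(served)     # addresses 0 and 8 both map to bank 0, so only one is served
print(deferred)   # the other waits in its operand queue until the next cycle
```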
As shown in FIG. 3, a calculation processing unit 202 contains an integer ALU (arithmetic logic unit 202b), an integer MUL (multiplication calculation unit 202c), and an FPU (floating-point processing unit 202d), all of which operate on a 64-bit data path and cover the basic scalar operations. The MUL and the FPU share operand queues 201b but cannot use them at the same time. The vector processor fully supports multi-precision operations, allowing data to be converted from 8 to 16 bits, from 16 to 32 bits, and from 32 to 64 bits. The FPU is configured to support FMAs (floating-point fused multiply-add), addition, multiplication, division, square root, and comparison.
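The multi-precision support mentioned above (8-, 16-, 32- and 64-bit elements on a 64-bit data path) can be sketched as sub-word unpacking followed by widening. The little-endian lane order and signed interpretation below are assumptions made only for illustration.

```python
def unpack(word64, elem_bits):
    """Split a 64-bit data-path word into its sub-word lanes (8/16/32/64 bits)."""
    assert elem_bits in (8, 16, 32, 64)
    mask = (1 << elem_bits) - 1
    return [(word64 >> (i * elem_bits)) & mask for i in range(64 // elem_bits)]

def sign_extend(value, from_bits):
    """Interpret a from_bits-wide lane as signed, e.g. when widening 8-bit to 16-bit."""
    sign = 1 << (from_bits - 1)
    return value - (1 << from_bits) if value & sign else value

word = 0x0102030405060708
print(unpack(word, 8))    # eight 8-bit lanes
print(unpack(word, 16))   # four 16-bit lanes
print([sign_extend(e, 8) for e in unpack(0xFF00000000000080, 8)])  # [-128, 0, ..., -1]
```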
The systolic array unit is dedicated to operations between vectors, which greatly improves the vector computing capability of the vector processor. FIG. 4 illustrates the working principle of the systolic array acceleration unit 202a. The systolic array acceleration unit 202a consists of multiple identically constructed processing elements (PEs) and works in a pipeline-like manner: each PE computes on the incoming data and temporarily stores a partial sum, and in the next clock cycle it broadcasts the designated data in the designated direction to the neighbouring PEs. In a systolic array adapted to matrix multiplication, a PE broadcasts its input data to the neighbouring PEs.
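A small simulation of the working principle just described is given below, assuming an output-stationary dataflow in which each PE multiply-accumulates its two inputs and forwards them to its right-hand and lower neighbours in the next cycle. The dataflow variant, the skewed operand feeding, and the array size are illustrative assumptions and may differ from the patent's array.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of C = A @ B on an output-stationary systolic array."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    acc = np.zeros((n, m), dtype=A.dtype)      # one accumulator per PE
    a_reg = np.zeros((n, m), dtype=A.dtype)    # operand registers inside the array
    b_reg = np.zeros((n, m), dtype=A.dtype)
    for t in range(k + n + m - 2):             # enough cycles for the skewed streams to drain
        new_a, new_b = np.zeros_like(a_reg), np.zeros_like(b_reg)
        for i in range(n):
            for j in range(m):
                # operands arrive from the west/north neighbour, or from the
                # skewed input streams at the array edges
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < k else 0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < k else 0)
                acc[i, j] += a_in * b_in       # multiply-accumulate into the partial sum
                new_a[i, j], new_b[i, j] = a_in, b_in
        a_reg, b_reg = new_a, new_b            # next cycle: forward operands to neighbours
    return acc

A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)
print(systolic_matmul(A, B))
print(A @ B)   # the two results match
```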
Since in many cases the processing capability of the overall system is limited by memory access bandwidth rather than by computing capability, as can be seen from FIG. 5, the systolic array makes better use of the limited memory bandwidth than a traditional multiplier-tree architecture without wasting computing capability. Through systolic transfer combined with data broadcasting, the systolic array reuses data quite thoroughly, which allows it to achieve a high computational throughput without a large increase in storage space.
In this embodiment, when the systolic array acceleration unit 202a receives a request from the VRF in the channel to feed data into the systolic array, the control unit inside the systolic array acceleration unit 202a automatically writes the data, in order, into buffer A and buffer B. When data loading finishes, the unit decides, according to the input mode, whether to start computing. After the computation finishes, the data are automatically stored in the output buffer and wait for an instruction to output the results to the VRF.
As shown in FIG. 2, the load-store module 300 in this embodiment contains a data loading unit 301 and a data storage unit 302, which are used respectively to load data to be processed from outside and to store calculation results from the VRF into external memory. When the start address in an operation instruction is passed to the data loading unit 301, the unit coalesces the individual element memory operations into burst requests, avoiding the need to request single elements from memory. The burst start address and burst length are then sent to the load or store unit, which are responsible for initiating the data transfers through the Advanced eXtensible Interface (AXI) of the vector processing unit. The data loading unit 301 loads data through the AXI AR channel, and the data storage unit 302 stores data through the AXI AW channel.
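The burst coalescing described for the data loading unit 301 can be sketched as follows. The 8-byte element size and the 16-beat burst limit are illustrative assumptions, not values given in the patent.

```python
def coalesce_bursts(addresses, elem_bytes=8, max_beats=16):
    """Coalesce consecutive element addresses into (burst_start, burst_length) pairs,
    so that single elements are not requested individually over the AXI AR channel."""
    bursts = []
    for addr in sorted(addresses):
        if bursts:
            start, beats = bursts[-1]
            if addr == start + beats * elem_bytes and beats < max_beats:
                bursts[-1] = (start, beats + 1)   # extend the current burst
                continue
        bursts.append((addr, 1))                  # start a new burst
    return bursts

# Ten consecutive 64-bit elements starting at 0x8000_0004 become a single burst
print(coalesce_bursts([0x80000004 + 8 * i for i in range(10)]))
# [(2147483652, 10)]
```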
The operation instructions in this embodiment are custom instructions, all of which are vector instructions, including the vector calculation instruction, a vector load instruction, and a vector store instruction; the custom instructions contain target address information and action information, and are input into the processing module from outside for processing.
The vector calculation instruction is a custom VMSAM instruction; the vector load instruction VL*E and the vector store instruction VS*E are both RISC-V instructions and also support customization.
The RISC-V instruction set is an open instruction set architecture (ISA) based on the principles of reduced instruction set computing (RISC); RISC-V is a new instruction system built on the continuous development and maturation of instruction sets. The RISC-V instruction set is fully open source, simple in design, easy to port Unix systems to, modular, and comes with a complete toolchain, as well as a large number of open-source implementations and tape-out cases.
RISC-V is a modular reduced instruction set architecture consisting of the base integer instruction set denoted by the letter I and extension instruction sets denoted by the letters M, A, F, D, C, and so on. The only instruction set that must be implemented is the I base integer instruction set, which is sufficient for a complete software compilation toolchain. The extension instruction sets add multiplication and division, atomic memory operations, floating-point operations, and other functions; all of these extensions can be selected and configured as needed.
Although RISC-V is not the first open-source instruction set architecture (ISA), it is the first one designed so that an appropriate subset can be selected for a specific scenario. Processors for many application scenarios, such as server CPUs, household-appliance CPUs, and CPUs in sensors, can be designed on the basis of the RISC-V instruction set architecture.
As shown in FIG. 6, the custom VMSAM instruction belongs to the vector-processing class of instructions. After it enters the vector processing unit, the value of rs1 is decoded to obtain the state input and calculation precision of the systolic array accelerator, which drives the module to start running; the instruction does not finish until the computation is complete.
As shown in FIG. 7, the VL*E instruction belongs to the vector-load class of instructions. After it enters the vector processing unit, the target functional unit is the VRF inside the extended channel module 200; the value of vs1 is decoded externally into the start address transferred to the AR channel of the AXI bus; rs2 identifies the way the data are to be fetched from memory, and rd is the start address at which the data are stored in the VRF.
The VS*E instruction belongs to the vector-store class of instructions. After it enters the vector processing unit, the target functional unit is the VRF inside the channel; the value of vs1 is decoded externally into the start address transferred to the AW channel of the AXI bus; rs2 identifies the way the data are to be taken out of the VRF, and rd is the start address at which the data are stored in memory.
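Since VMSAM, VL*E, and VS*E are described in terms of rs1/rs2/rd fields, a sketch of extracting the base RISC-V 32-bit field layout is given below. The example word is the VMSAM pattern given later in Embodiment 3, treated here as a binary literal; the meanings the patent assigns to the decoded values (mode, precision, addresses) are not reproduced, and the custom opcode values are not asserted.

```python
def fields(instr):
    """Split a 32-bit instruction word into the base RISC-V R-type-style fields
    that the VMSAM / VL*E / VS*E descriptions refer to."""
    return {
        "opcode": instr & 0x7F,
        "rd":     (instr >> 7)  & 0x1F,
        "funct3": (instr >> 12) & 0x07,
        "rs1":    (instr >> 15) & 0x1F,
        "rs2":    (instr >> 20) & 0x1F,
        "funct7": (instr >> 25) & 0x7F,
    }

word = 0b1010_1010_1001_0001_1010_0000_0101_0111   # VMSAM example word from Embodiment 3
print(fields(word))
```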
FIG. 8 shows the flow of data processing when instructions are run in the vector processor:
1. When the vector processor obtains a decoded VLE instruction, data are read from the AR channel of the AXI bus sequentially starting from the start address; when reading finishes, the data are loaded into the VRF in the channel. When the data transfer ends, execution of the instruction ends.
2. After the VMSAM instruction starts, the decoded rs1 specifies the start address from which data are read from the VRF, rs2 specifies the operating mode of the systolic array acceleration unit 202a and the quantization precision of its data, and rd specifies the start address at which the result is stored in the VRF. When the systolic array acceleration unit 202a receives the signal to start computing, the data are fed into the systolic array. After the computation finishes, execution of the instruction ends.
3. After the VSE instruction starts, data are loaded from the VRF in the channel onto the AW channel of the AXI bus; when one transfer of the specified length finishes, the instruction ends.
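From the software side, the three steps above form a load, compute, and store sequence. The following sketch shows only the data movement: a Python dict stands in for the VRF and a per-element square stands in for the systolic computation; the register numbers and the placeholder operation are illustrative assumptions, not the patent's semantics.

```python
def vector_kernel(mem, src, dst, n, mode):
    """Software-level view of the VLE -> VMSAM -> VSE sequence of FIG. 8."""
    vrf = {}
    vrf[12] = [mem[src + 8 * i] for i in range(n)]           # VLE: AXI AR burst into the VRF
    vrf[20] = [x * x for x in vrf[12]] if mode else vrf[12]  # VMSAM: stand-in for the systolic unit
    for i, x in enumerate(vrf[20]):                          # VSE: VRF out over the AXI AW channel
        mem[dst + 8 * i] = x

memory = {0x80000004 + 8 * i: i + 1 for i in range(4)}
vector_kernel(memory, 0x80000004, 0x80042380, 4, mode=1)
print([memory[0x80042380 + 8 * i] for i in range(4)])        # [1, 4, 9, 16]
```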
As an alternative implementation, since the design of custom instructions allows a certain flexibility, the functional units in the processor may be directed and controlled with instructions other than the three specific ones proposed in this embodiment.
Embodiment 3
Referring to FIG. 9, which exemplarily shows a flowchart of a data processing method to which the embodiments of the present application apply, the method mainly includes the following steps:
the control module 100 receives operation instructions passed in from outside;
the control module 100 parses the operation instructions to obtain vector calculation instructions and vector load/store instructions, and determines the functional unit to which a vector calculation instruction is sent;
the load-store module 300 loads the data to be processed from outside according to the vector load/store instructions;
the systolic array acceleration unit 202a in the extended channel module 200 fetches the corresponding data from the channel storage unit 201 according to the vector calculation instruction, performs computation between vectors, and returns the results to the channel storage unit 201 for storage; the calculation results stored in the channel storage unit 201 are transferred to the outside through the load-store module 300.
Specifically, taking as an example an instruction type of a standard R-type instruction that operates on a destination register and source registers, this embodiment shows the workflow of the various instructions after they enter the vector processor, as follows:
Data flow of a VL*E instruction:
1. An instruction is input to the external decoder: 32'b 0000 1101 1010 0111 1111 1001 0101 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing. At the same time, the rs1 value is 0F, which is decoded externally to show that the value stored at that address is 32'h 8000 0004, and the rd value is 12. This information is also passed into the vector processor.
2. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data load instruction VLE. It also loads the rs1 and rs2 values and the decoded address value, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
3. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the data loading unit 301 accepts it. The data loading unit 301 issues a data load request on the AR channel of the AXI bus according to the start address and transfer length information. When loading finishes, the loading unit transfers the data into the VRF in the lane, with the rd value as the storage address. When the data are fully loaded into the VRF, execution of the VLE instruction ends.
Data flow of the VMSAM instruction:
4. An instruction is input to the external decoder: 32'b 1010 1010 1001 0001 1010 0000 0101 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing.
5. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data processing instruction VMSAM. It also loads the rs1, rs2, and rd values, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
6. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the channels and the systolic array acceleration units 202a in them accept it. Given the start address in rs1 and the mode state contained in rs2, the rs2 value is decoded to give an input precision of 64 bits. The functional unit starts fetching data from the VRF and decides, according to the input state, whether to start computing. After the computation finishes, the results are stored in the output buffer. With the rd value as the start address, the values in the output buffer are stored into the VRF. When the data are fully loaded into the VRF, execution of the VMSAM instruction ends.
Data flow of the VS*E instruction:
7. An instruction is input to the external decoder: 32'b 0000 0010 0000 1101 1111 0000 0010 0111.
First-level decoding determines that this is a vector instruction, which should be passed to the vector processor for processing. At the same time, the rs1 value is 1B, which is decoded in the external decoder to show that the value stored at that address is 32'h 8004 2380, and the rd value is 00. This information is also passed into the vector processor.
8. When the vector processor receives this instruction and the other information, it is sent to the instruction dispatch unit 101 for second-level decoding. From the opcode of the instruction, the instruction dispatch unit 101 identifies it as the data store instruction VSE. It also loads the rs1 and rd values and the decoded address value, and sends these data together with the instruction to the main sequencing unit 102. No error is fed back to the external decoder.
9. After checking that there is no hazard between this instruction and other instructions, the main sequencing unit 102 broadcasts the instruction and its information to all functional units, but only the data storage unit 302 accepts it. The data storage unit 302 issues a data store request on the AW channel of the AXI bus according to the decoded start address and transfer length information. When the store finishes, the AW channel returns a completion signal to the instruction sequencing unit. When the data are fully loaded onto the AW channel of the AXI bus, execution of the VSE instruction ends.
Those skilled in the art can clearly understand that the techniques in the embodiments of the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments, or in certain parts of the embodiments, of the present application.
For identical or similar parts among the various embodiments in this specification, reference may be made to one another. In particular, since the service constructing apparatus and service loading apparatus embodiments are basically similar to the method embodiments, they are described relatively simply; for relevant points, refer to the descriptions in the method embodiments.
The embodiments of the present application described above do not limit the scope of protection of the present application.
Claims (10)
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211441900.0A | 2022-11-17 | 2022-11-17 | Vector processor supporting multi-precision calculation and dynamic configuration and processing method |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115936128A | 2023-04-07 |
Family
ID=86648181
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211441900.0A (Pending) | Vector processor supporting multi-precision calculation and dynamic configuration and processing method | 2022-11-17 | 2022-11-17 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN115936128A (en) |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118627565A * | 2024-08-13 | 2024-09-10 | | A configurable convolution operation acceleration device and method based on systolic array |
| CN118627565B * | 2024-08-13 | 2025-01-03 | | A configurable convolution operation acceleration device and method based on systolic array |

2022

- 2022-11-17: Application CN202211441900.0A filed in CN; patent published as CN115936128A (en); status active, Pending
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |