
CN106503797A - Neural network unit with neural memory and array of neural processing units that collectively shift a row of data received from the neural memory - Google Patents

Neural network unit with neural memory and array of neural processing units that collectively shift a row of data received from the neural memory

Info

Publication number
CN106503797A
Authority
CN
China
Prior art keywords
memory
register
column
output
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610864610.5A
Other languages
Chinese (zh)
Other versions
CN106503797B (en)
Inventor
G. Glenn Henry
Terry Parks
Kyle T. O'Brien
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhaoxin Semiconductor Co Ltd
Original Assignee
Shanghai Zhaoxin Integrated Circuit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/090,722 external-priority patent/US10671564B2/en
Priority claimed from US15/090,807 external-priority patent/US10380481B2/en
Priority claimed from US15/090,796 external-priority patent/US10228911B2/en
Priority claimed from US15/090,727 external-priority patent/US10776690B2/en
Application filed by Shanghai Zhaoxin Integrated Circuit Co Ltd
Publication of CN106503797A
Application granted granted Critical
Publication of CN106503797B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/57Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
    • G06F7/575Basic arithmetic logic units, i.e. devices selectable to perform either addition, subtraction or one of several logical operations, using, at least partially, the same circuitry
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
    • G06F9/3893Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/468Specific access rights for resources, e.g. using capability register
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Neurology (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Complex Calculations (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Executing Machine-Instructions (AREA)
  • Advance Control (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Devices For Executing Special Programs (AREA)
  • Storage Device Security (AREA)

Abstract

A neural network unit includes a first memory, a second memory, and an array of neural processing units. The first memory holds the elements of a data matrix. The second memory holds the elements of a convolution kernel. Each neural processing unit includes a multiplexed register, a register, an accumulator, and an arithmetic unit. The multiplexed register receives an element from the first memory and also receives the multiplexed-register output of an adjacent neural processing unit. The register receives an element from the second memory. The arithmetic unit receives the outputs of the register, the multiplexed register, and the accumulator, and performs a multiply-accumulate operation on them. For each sub-matrix, the arithmetic unit selectively receives an element either from the first memory or from the multiplexed register of the adjacent neural processing unit, and performs a series of multiply-accumulate operations to accumulate the result of the convolution into the accumulator.

Description

Neural network unit with neural memory and array of neural processing units that collectively shift a row of data received from the neural memory

Technical Field

The present invention relates to a processor, and more particularly to a processor that improves the computational performance and efficiency of artificial neural networks.

This application claims international priority to the following U.S. provisional applications. The entire contents of these priority applications are incorporated herein by reference.

This application is related to the following concurrently filed U.S. applications. The entire contents of these related applications are incorporated herein by reference.

Background

In recent years, artificial neural networks (ANNs) have attracted renewed attention. Such research is commonly referred to as deep learning, computer learning, and similar terms. The increase in the computing power of general-purpose processors has also fueled renewed interest in artificial neural networks decades after their inception. Recent applications of artificial neural networks include language and image recognition, among others. The demand for improving the computational performance and efficiency of artificial neural networks appears to be increasing.

Summary of the Invention

In view of this, the present invention provides a neural network unit. The neural network unit includes a first memory, a second memory, and an array of neural processing units (NPUs). The first memory holds the elements of a data matrix. The second memory holds the elements of a convolution kernel. The array of neural processing units is coupled to the first memory and the second memory. Each neural processing unit includes a multiplexed register, a register, an accumulator, and an arithmetic unit. The multiplexed register has an output; it receives a corresponding element from a row of the first memory and also receives the multiplexed-register output of an adjacent neural processing unit. The register has an output and receives a corresponding element from a row of the second memory. The accumulator has an output. The arithmetic unit receives the outputs of the register, the multiplexed register, and the accumulator, and performs a multiply-accumulate operation on them. For each of a plurality of sub-matrices of the data matrix, each arithmetic unit selectively receives either an element from the first memory or the element output by the multiplexed register of the adjacent neural processing unit, and performs a series of multiply-accumulate operations to accumulate into the accumulator the result of a convolution of the sub-matrix with the convolution kernel.

The present invention also provides a method of operating a neural network unit. The neural network unit has an array of neural processing units. Each neural processing unit includes a multiplexed register, a register, an accumulator, and an arithmetic unit. The multiplexed register has an output; it receives a corresponding element from a row of a first memory and also receives the multiplexed-register output of an adjacent neural processing unit. The register has an output and receives a corresponding element from a row of a second memory. The accumulator has an output. The arithmetic unit receives the outputs of the register, the multiplexed register, and the accumulator, and performs a multiply-accumulate operation on them. The method includes: loading the elements of a data matrix into the first memory; loading the elements of a convolution kernel into the second memory; and, for each of a plurality of sub-matrices of the data matrix: with each arithmetic unit, selectively receiving either an element from the first memory or the element output by the multiplexed register of the adjacent neural processing unit, and performing a series of multiply-accumulate operations to accumulate into the accumulator the result of a convolution of the sub-matrix with the convolution kernel.
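
For illustration only, the following sketch is a software analogy of the flow just described, not the hardware itself: the dimensions, the rotate-versus-load schedule, and the rotation direction are assumptions chosen to keep the example small. It shows how each neural processing unit's accumulator ends up holding the convolution of one K x K sub-matrix of the data matrix with the kernel, taking its operand at each step either from the first memory or from the neighbor's multiplexed-register output.

```python
# Software analogy of one pass of the NPU array (illustrative, not the patent's hardware).
def convolve_with_npu_array(data_ram, kernel, N, K):
    acc = [0] * N                          # one accumulator per NPU
    mux_reg = [0] * N                      # one multiplexed register per NPU
    for r in range(K):                     # one data-RAM row per kernel row
        for c in range(K):                 # K multiply-accumulate steps per row
            if c == 0:
                mux_reg = list(data_ram[r])          # load a whole row from the first memory
            else:
                mux_reg = mux_reg[1:] + mux_reg[:1]  # collectively shift: take the neighbor's output
            w = kernel[r][c]               # element from the second memory
            for n in range(N):             # all N NPUs operate in parallel in hardware
                acc[n] += mux_reg[n] * w   # multiply-accumulate into the accumulator
    return acc                             # acc[n]: convolution of the sub-matrix at column n

# Example with assumed toy values: a 3x3 kernel over a 3-row, 4-column data matrix.
result = convolve_with_npu_array(
    data_ram=[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    kernel=[[1, 0, -1], [1, 0, -1], [1, 0, -1]],
    N=4, K=3)
```

The point of the rotation is that after the first load, each subsequent operand comes from the adjacent neural processing unit rather than from another read of the first memory.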

The present invention also provides a computer program product encoded on at least one non-transitory computer-usable medium for use by a computing device. The computer program product includes computer-usable program code embodied in the medium for specifying a neural network unit. The computer-usable program code includes first program code, second program code, and third program code. The first program code specifies a first memory that holds the elements of a data matrix. The second program code specifies a second memory that holds the elements of a convolution kernel. The third program code specifies an array of neural processing units coupled to the first memory and the second memory. Each neural processing unit includes a multiplexed register, a register, an accumulator, and an arithmetic unit. The multiplexed register has an output; it receives a corresponding element from a row of the first memory and also receives the multiplexed-register output of an adjacent neural processing unit. The register has an output and receives a corresponding element from a row of the second memory. The accumulator has an output. The arithmetic unit receives the outputs of the register, the multiplexed register, and the accumulator, and performs a multiply-accumulate operation on them. For each of a plurality of sub-matrices of the data matrix, each arithmetic unit selectively receives either an element from the first memory or the element output by the multiplexed register of the adjacent neural processing unit, and performs a series of multiply-accumulate operations to accumulate into the accumulator the result of a convolution of the sub-matrix with the convolution kernel.

The specific embodiments adopted by the present invention are further described by way of the following embodiments and drawings.

Description of the Drawings

FIG. 1 is a block diagram showing a processor that includes a neural network unit (NNU).

FIG. 2 is a block diagram showing a neural processing unit (NPU) of FIG. 1.

FIG. 3 is a block diagram showing the use of the N multiplexed registers of the N neural processing units of the neural network unit of FIG. 1 to operate as an N-word rotator, or circular shifter, on a row of data words received from the data RAM of FIG. 1.

FIG. 4 is a table showing a program stored in the program memory of the neural network unit of FIG. 1 and executed by that neural network unit.

FIG. 5 is a timing diagram showing the neural network unit executing the program of FIG. 4.

FIG. 6A is a block diagram showing the neural network unit of FIG. 1 executing the program of FIG. 4.

FIG. 6B is a flowchart showing the processor of FIG. 1 executing an architectural program that uses the neural network unit to perform the classic multiply-accumulate-activation-function computations associated with the neurons of a hidden layer of an artificial neural network, as performed by the program of FIG. 4.

FIG. 7 is a block diagram showing another embodiment of the neural processing unit of FIG. 1.

FIG. 8 is a block diagram showing yet another embodiment of the neural processing unit of FIG. 1.

FIG. 9 is a table showing a program stored in the program memory of the neural network unit of FIG. 1 and executed by that neural network unit.

FIG. 10 is a timing diagram showing the neural network unit executing the program of FIG. 9.

FIG. 11 is a block diagram showing an embodiment of the neural network unit of FIG. 1. In the embodiment of FIG. 11, a neuron is split into two parts, an activation function unit part and an arithmetic logic unit part (which also includes the shift register part), and each activation function unit part is shared by multiple arithmetic logic unit parts.

FIG. 12 is a timing diagram showing the neural network unit of FIG. 11 executing the program of FIG. 4.

FIG. 13 is a timing diagram showing the neural network unit of FIG. 11 executing the program of FIG. 4.

FIG. 14 is a block diagram showing a move-to-neural-network (MTNN) architectural instruction and its operation with respect to portions of the neural network unit of FIG. 1.

FIG. 15 is a block diagram showing a move-from-neural-network (MFNN) architectural instruction and its operation with respect to portions of the neural network unit of FIG. 1.

FIG. 16 is a block diagram showing an embodiment of the data RAM of FIG. 1.

FIG. 17 is a block diagram showing an embodiment of the weight RAM and buffer of FIG. 1.

FIG. 18 is a block diagram showing a dynamically configurable neural processing unit of FIG. 1.

FIG. 19 is a block diagram showing, according to the embodiment of FIG. 18, the use of the 2N multiplexed registers of the N neural processing units of the neural network unit of FIG. 1 to operate as a rotator on a row of data words received from the data RAM of FIG. 1.

FIG. 20 is a table showing a program stored in the program memory of the neural network unit of FIG. 1 and executed by that neural network unit, the neural network unit having neural processing units according to the embodiment of FIG. 18.

FIG. 21 is a timing diagram showing the neural network unit executing the program of FIG. 20, the neural network unit having neural processing units of FIG. 18 operating in a narrow configuration.

FIG. 22 is a block diagram showing the neural network unit of FIG. 1 having the neural processing units of FIG. 18 to execute the program of FIG. 20.

FIG. 23 is a block diagram showing another embodiment of a dynamically configurable neural processing unit of FIG. 1.

FIG. 24 is a block diagram showing an example of a data structure used by the neural network unit of FIG. 1 to perform a convolution operation.

FIG. 25 is a flowchart showing the processor of FIG. 1 executing an architectural program that uses the neural network unit to perform a convolution of the convolution kernel with the data array of FIG. 24.

FIG. 26A is a program listing of a neural network unit program that performs a convolution of a data matrix with the convolution kernel of FIG. 24 and writes the result back to the weight RAM.

FIG. 26B is a block diagram showing an embodiment of certain fields of the control register of the neural network unit of FIG. 1.

FIG. 27 is a block diagram showing an example of the weight RAM of FIG. 1 populated with input data on which the neural network unit of FIG. 1 performs a pooling operation.

FIG. 28 is a program listing of a neural network unit program that performs a pooling operation on the input data matrix of FIG. 27 and writes the result back to the weight RAM.

FIG. 29A is a block diagram showing an embodiment of the control register of FIG. 1.

FIG. 29B is a block diagram showing another embodiment of the control register of FIG. 1.

FIG. 29C is a block diagram showing an embodiment in which the reciprocal of FIG. 29A is stored in two parts.

FIG. 30 is a block diagram showing an embodiment of the activation function unit (AFU) of FIG. 2.

FIG. 31 shows an example of the operation of the activation function unit of FIG. 30.

FIG. 32 shows a second example of the operation of the activation function unit of FIG. 30.

FIG. 33 shows a third example of the operation of the activation function unit of FIG. 30.

FIG. 34 is a block diagram showing certain details of the processor of FIG. 1 and of the neural network unit.

FIG. 35 is a block diagram showing a processor with a variable-rate neural network unit.

FIG. 36A is a timing diagram showing an example of operation of the processor with the neural network unit operating in normal mode, i.e., at the primary clock rate.

FIG. 36B is a timing diagram showing an example of operation of the processor with the neural network unit operating in relaxed mode, i.e., at a clock rate lower than the primary clock rate.

FIG. 37 is a flowchart showing the operation of the processor of FIG. 35.

FIG. 38 is a block diagram showing the sequencer of the neural network unit in more detail.

FIG. 39 is a block diagram showing certain fields of the control and status register of the neural network unit.

FIG. 40 is a block diagram showing an example of an Elman recurrent neural network (RNN).

FIG. 41 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the neural network unit as it performs the computations associated with the Elman RNN of FIG. 40.

FIG. 42 is a table showing a program stored in the program memory of the neural network unit and executed by it, using the data and weights according to the arrangement of FIG. 41, to accomplish the Elman RNN.

FIG. 43 is a block diagram showing an example of a Jordan recurrent neural network.

FIG. 44 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the neural network unit as it performs the computations associated with the Jordan RNN of FIG. 43.

FIG. 45 is a table showing a program stored in the program memory of the neural network unit and executed by it, using the data and weights according to the arrangement of FIG. 44, to accomplish the Jordan RNN.

FIG. 46 is a block diagram showing an embodiment of a long short-term memory (LSTM) cell.

FIG. 47 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the neural network unit as it performs the computations associated with a layer of the LSTM cells of FIG. 46.

FIG. 48 is a table showing a program stored in the program memory of the neural network unit and executed by it, using the data and weights according to the arrangement of FIG. 47, to accomplish the computations associated with the LSTM cell layer.

FIG. 49 is a block diagram showing an embodiment of a neural network unit with output buffer masking and feedback capability within groups of neural processing units.

FIG. 50 is a block diagram showing an example of the layout of data within the data RAM, the weight RAM, and the output buffer of the neural network unit of FIG. 49 as it performs the computations associated with a layer of the LSTM cells of FIG. 46.

FIG. 51 is a table showing a program stored in the program memory of the neural network unit and executed by the neural network unit of FIG. 49, using the data and weights according to the arrangement of FIG. 50, to accomplish the computations associated with the LSTM cell layer.

FIG. 52 is a block diagram showing an embodiment of a neural network unit with output buffer masking and feedback capability within groups of neural processing units, and with shared activation function units.

FIG. 53 is a block diagram showing another embodiment of the layout of data within the data RAM, the weight RAM, and the output buffer of the neural network unit of FIG. 49 as it performs the computations associated with a layer of the LSTM cells of FIG. 46.

FIG. 54 is a table showing a program stored in the program memory of the neural network unit and executed by the neural network unit of FIG. 49, using the data and weights according to the arrangement of FIG. 53, to accomplish the computations associated with the LSTM cell layer.

FIG. 55 is a block diagram showing portions of a neural processing unit according to another embodiment of the present invention.

FIG. 56 is a block diagram showing an example of the layout of data within the data RAM and the weight RAM of the neural network unit as it performs the computations associated with the Jordan RNN of FIG. 43 while employing the embodiment of FIG. 55.

FIG. 57 is a table showing a program stored in the program memory of the neural network unit and executed by it, using the data and weights according to the arrangement of FIG. 56, to accomplish the Jordan RNN.

Detailed Description

Processor with Architectural Neural Network Unit

FIG. 1 is a block diagram showing a processor 100 that includes a neural network unit (NNU) 121. As shown, the processor 100 includes an instruction fetch unit 101, an instruction cache 102, an instruction translator 104, a rename unit 106, reservation stations 108, media registers 118, general-purpose registers 116, execution units 112 other than the neural network unit 121, and a memory subsystem 114.

The processor 100 is an electronic device that serves as the central processing unit of an integrated circuit. The processor 100 receives digital data as input, processes the data according to instructions fetched from memory, and produces as output the results of the operations prescribed by the instructions. The processor 100 may be used in a desktop computer, mobile device, or tablet computer, and in applications such as computation, word processing, multimedia display, and web browsing. The processor 100 may also be disposed within an embedded system to control a wide variety of devices, including appliances, mobile phones, smartphones, vehicles, and industrial controllers. A central processing unit is the electronic circuitry (i.e., hardware) that executes the instructions of a computer program (also called a computer application or application) by performing operations on data, including arithmetic, logic, and input/output operations. An integrated circuit is a set of electronic circuits fabricated on a small piece of semiconductor material, typically silicon; an integrated circuit is also commonly referred to as a chip, microchip, or die.

The instruction fetch unit 101 controls the fetching of architectural instructions 103 from system memory (not shown) into the instruction cache 102. The instruction fetch unit 101 provides a fetch address to the instruction cache 102 that specifies the memory address of the cache line of architectural instruction bytes the processor 100 fetches into the instruction cache 102. The fetch address is selected based on the current value of the instruction pointer (not shown), or program counter, of the processor 100. Generally, the program counter is incremented sequentially by the size of an instruction unless a control instruction, such as a branch, call, or return, is encountered in the instruction stream, or an exception condition occurs, such as an interrupt, trap, exception, or fault, in which case the program counter is updated with a non-sequential address, such as a branch target address, a return address, or an exception vector. In short, the program counter is updated in response to the execution of instructions by the execution units 112/121. The program counter may also be updated when an exception condition is detected, for example when the instruction translator 104 encounters an instruction 103 that is not defined in the instruction set architecture of the processor 100.

The instruction cache 102 stores architectural instructions 103 fetched from a system memory coupled to the processor 100. The architectural instructions 103 include a move-to-neural-network (MTNN) instruction and a move-from-neural-network (MFNN) instruction, described in more detail below. In one embodiment, the architectural instructions 103 are instructions of the x86 instruction set architecture, with the addition of the MTNN and MFNN instructions. In the context of the present disclosure, an x86 instruction set architecture processor is understood as a processor that, when executing the same machine language instructions, produces the same results at the instruction-set-architecture level as a reference x86 processor. However, other instruction set architectures, for example the Advanced RISC Machines (ARM) architecture, Sun's Scalable Processor Architecture (SPARC), or PowerPC, may also be used in other embodiments of the present invention. The instruction cache 102 provides the architectural instructions 103 to the instruction translator 104, which translates the architectural instructions 103 into microinstructions 105.

The microinstructions 105 are provided to the rename unit 106 and are eventually executed by the execution units 112/121. The microinstructions 105 implement the architectural instructions. In a preferred embodiment, the instruction translator 104 includes a first portion that translates frequently executed and/or relatively less complex architectural instructions 103 into microinstructions 105. The instruction translator 104 also includes a second portion that has a microcode unit (not shown). The microcode unit has a microcode memory that holds microcode instructions for executing complex and/or infrequently used instructions of the architectural instruction set. The microcode unit also includes a microsequencer that provides a non-architectural micro-program counter (micro-PC) to the microcode memory. In a preferred embodiment, the microcode instructions are translated into microinstructions 105 by a microtranslator (not shown). A selector selects the microinstructions 105 from either the first portion or the second portion, depending on whether the microcode unit currently has control, and provides them to the rename unit 106.

The rename unit 106 renames the architectural registers specified by the architectural instructions 103 to physical registers of the processor 100. In a preferred embodiment, the processor 100 includes a reorder buffer (not shown). The rename unit 106 allocates entries of the reorder buffer to the microinstructions 105 in program order, which enables the processor 100 to retire the microinstructions 105, and their corresponding architectural instructions 103, in program order. In one embodiment, the media registers 118 are 256 bits wide and the general-purpose registers 116 are 64 bits wide. In one embodiment, the media registers 118 are x86 media registers, such as Advanced Vector Extensions (AVX) registers.

In one embodiment, each entry of the reorder buffer has storage for the result of a microinstruction 105. In addition, the processor 100 includes an architectural register file that has a physical register for each of the architectural registers, such as the media registers 118, the general-purpose registers 116, and other architectural registers. (In a preferred embodiment, because the media registers 118 and the general-purpose registers 116 are of different sizes, separate register files may be used for the two kinds of registers.) For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with the reorder buffer index of the newest of the older microinstructions 105 that writes to that architectural register. When an execution unit 112/121 completes execution of a microinstruction 105, it writes the result to the microinstruction's reorder buffer entry. When the microinstruction 105 is retired, a retire unit (not shown) writes the result from the microinstruction's reorder buffer entry to the register of the physical register file associated with the architectural destination register specified by the retiring microinstruction 105.

In another embodiment, the processor 100 includes a physical register file that has more physical registers than the number of architectural registers, but the processor 100 does not include an architectural register file, and the reorder buffer entries do not include result storage. (In a preferred embodiment, because the media registers 118 and the general-purpose registers 116 are of different sizes, separate register files may be used for the two kinds of registers.) The processor 100 also includes a pointer table with a corresponding pointer for each architectural register. For each destination operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the destination operand field of the microinstruction 105 with a pointer to a free register in the physical register file. If no free register exists in the physical register file, the rename unit 106 stalls the pipeline. For each source operand of a microinstruction 105 that specifies an architectural register, the rename unit populates the source operand field of the microinstruction 105 with a pointer to the register of the physical register file assigned to the newest of the older microinstructions 105 that writes to that architectural register. When an execution unit 112/121 completes execution of the microinstruction 105, it writes the result to the register of the physical register file pointed to by the destination operand field of the microinstruction 105. When the microinstruction 105 is retired, the retire unit copies the destination operand field value of the microinstruction 105 into the pointer of the pointer table associated with the architectural destination register specified by the retiring microinstruction 105.
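
As a rough sketch only, the renaming scheme of this second embodiment can be paraphrased in software as follows. The structure names (newest_writer, free_list) and the dictionary-based microinstruction representation are illustrative assumptions, and the recycling of the previous retired mapping onto the free list is an added bookkeeping detail not spelled out in the text.

```python
# Illustrative paraphrase of pointer-table renaming with a physical register file.
class RenameUnit:
    def __init__(self, num_arch_regs, num_phys_regs):
        self.pointer_table = {a: a for a in range(num_arch_regs)}   # retired mapping per arch reg
        self.newest_writer = dict(self.pointer_table)               # newest in-flight producer
        self.free_list = list(range(num_arch_regs, num_phys_regs))  # free physical registers

    def rename(self, uop):
        # each source operand reads the physical register assigned to the newest
        # older microinstruction that writes that architectural register
        uop["phys_srcs"] = [self.newest_writer[a] for a in uop["arch_srcs"]]
        if uop.get("arch_dst") is not None:
            if not self.free_list:
                raise RuntimeError("no free physical register: stall the pipeline")
            uop["phys_dst"] = self.free_list.pop()
            self.newest_writer[uop["arch_dst"]] = uop["phys_dst"]

    def retire(self, uop):
        # on retirement, the pointer-table entry for the architectural destination
        # register is updated to the retiring microinstruction's physical register
        if uop.get("arch_dst") is not None:
            self.free_list.append(self.pointer_table[uop["arch_dst"]])  # added detail: recycle
            self.pointer_table[uop["arch_dst"]] = uop["phys_dst"]
```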

The reservation stations 108 hold microinstructions 105 until they are ready to be issued to an execution unit 112/121 for execution. A microinstruction 105 is ready to issue when all of its source operands are available and an execution unit 112/121 is available to execute it. The execution units 112/121 receive register source operands from the reorder buffer or from the architectural register file described with respect to the first embodiment above, or from the physical register file described with respect to the second embodiment above. In addition, the execution units 112/121 may receive register source operands directly via a result forwarding bus (not shown). Furthermore, the execution units 112/121 may receive from the reservation stations 108 the immediate operands specified by the microinstructions 105. The MTNN and MFNN architectural instructions 103 include an immediate operand that specifies the function to be performed by the neural network unit 121, and this function is provided by the one or more microinstructions 105 into which the MTNN and MFNN architectural instructions 103 are translated, as described in more detail below.

The execution units 112 include one or more load/store units (not shown) that load data from the memory subsystem 114 and store data to it. In a preferred embodiment, the memory subsystem 114 includes a memory management unit (not shown), which may include, for example, translation lookaside buffers, a tablewalk unit, a level-1 data cache (and the instruction cache 102), a level-2 unified cache, and a bus interface unit that interfaces the processor 100 to system memory. In one embodiment, the processor 100 of FIG. 1 is representative of one of multiple processing cores of a multi-core processor that share a last-level cache. The execution units 112 may also include integer units, media units, floating-point units, and a branch unit.

The neural network unit 121 includes a weight random access memory (RAM) 124, a data RAM 122, N neural processing units (NPUs) 126, a program memory 129, a sequencer 128, and control and status registers 127. The neural processing units 126 conceptually function as the neurons of a neural network. The weight RAM 124, the data RAM 122, and the program memory 129 are each writable via the MTNN architectural instruction 103 and readable via the MFNN architectural instruction 103. The weight RAM 124 is arranged as W rows of N weight words each, and the data RAM 122 is arranged as D rows of N data words each. Each data word and each weight word is a plurality of bits, and in a preferred embodiment may be 8, 9, 12, or 16 bits. Each data word serves as the output value (sometimes referred to as an activation) of a neuron of the previous layer in the network, and each weight word serves as the weight of a connection into a neuron of the current layer of the network. Although in many uses of the neural network unit 121 the words, or operands, held in the weight RAM 124 are in fact the weights associated with connections into a neuron, it should be noted that in some uses of the neural network unit 121 the words held in the weight RAM 124 are not weights; nevertheless, because they are stored in the weight RAM 124, they are still referred to as "weight words." For example, in some uses of the neural network unit 121, such as the convolution example of FIGS. 24 through 26A or the pooling example of FIGS. 27 through 28, the weight RAM 124 may hold objects other than weights, such as elements of a data matrix (e.g., image pixel data). Likewise, although in many uses of the neural network unit 121 the words, or operands, held in the data RAM 122 are substantially the output, or activation, values of neurons, it should be noted that in some uses of the neural network unit 121 the words held in the data RAM 122 are not; nevertheless, because they are stored in the data RAM 122, they are still referred to as "data words." For example, in some uses of the neural network unit 121, such as the convolution example of FIGS. 24 through 26A, the data RAM 122 may hold non-neuron outputs, such as elements of a convolution kernel.

In one embodiment, the neural processing units 126 and the sequencer 128 comprise combinational logic, sequential logic, state machines, or a combination thereof. An architectural instruction (e.g., an MFNN instruction 1500) loads the contents of the status register 127 into one of the general-purpose registers 116 to determine the status of the neural network unit 121, for example, that the neural network unit 121 has completed a command or has completed running a program from the program memory 129, or that the neural network unit 121 is free to receive a new command or start a new neural network unit program.

The number of neural processing units 126 may be increased as needed, and the width and depth of the weight RAM 124 and the data RAM 122 may be scaled accordingly. In a preferred embodiment, the weight RAM 124 is larger than the data RAM 122, because a typical neural network layer has many connections and therefore requires more storage for the weights associated with the individual neurons. Many embodiments are disclosed herein regarding the sizes of the data and weight words, the sizes of the weight RAM 124 and the data RAM 122, and the number of neural processing units 126. In one embodiment, the neural network unit 121 has a 64 KB (8192 bits x 64 rows) data RAM 122, a 2 MB (8192 bits x 2048 rows) weight RAM 124, and 512 neural processing units 126. This neural network unit 121 is fabricated in a Taiwan Semiconductor Manufacturing Company (TSMC) 16 nm process and occupies an area of approximately 3.3 mm².
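
As a quick arithmetic check on that example configuration (using the 16-bit word width, one of the word sizes listed above):

```python
N = 512                       # NPUs, i.e., words per memory row
WORD_BITS = 16
ROW_BITS = N * WORD_BITS      # 512 * 16 = 8192 bits per row

DATA_RAM_ROWS = 64
WEIGHT_RAM_ROWS = 2048
print(DATA_RAM_ROWS * ROW_BITS // 8 // 1024, "KB")             # 64 KB data RAM
print(WEIGHT_RAM_ROWS * ROW_BITS // 8 // (1024 * 1024), "MB")  # 2 MB weight RAM
# The same 8192-bit row could instead be treated as 1024 8-bit words,
# and one row is what feeds all 512 NPUs in a single clock cycle.
```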

The sequencer 128 fetches instructions from the program memory 129 and executes them, which includes generating address and control signals to provide to the data RAM 122, the weight RAM 124, and the neural processing units 126. The sequencer 128 generates a memory address 123 and a read command to the data RAM 122 to select one of the D rows of N data words to provide to the N neural processing units 126. The sequencer 128 also generates a memory address 125 and a read command to the weight RAM 124 to select one of the W rows of N weight words to provide to the N neural processing units 126. The sequence of the addresses 123 and 125 that the sequencer 128 generates determines the "connections" among the neurons. The sequencer 128 also generates a memory address 123 and a write command to the data RAM 122 to select one of the D rows of N data words to be written by the N neural processing units 126, and a memory address 125 and a write command to the weight RAM 124 to select one of the W rows of N weight words to be written by the N neural processing units 126. The sequencer 128 also generates a memory address 131 to the program memory 129 to select the neural network unit instruction that is provided to the sequencer 128, as described in subsequent sections. The memory address 131 corresponds to a program counter (not shown) that the sequencer 128 normally increments sequentially through the locations of the program memory 129, unless the sequencer 128 encounters a control instruction, such as a loop instruction (see, e.g., FIG. 26A), in which case the sequencer 128 updates the program counter to the target address of the control instruction. The sequencer 128 also generates control signals to the neural processing units 126 that instruct them to perform various operations or functions, such as initialization, arithmetic/logic operations, rotate/shift operations, activation functions, and write-back operations; related examples are described in more detail in subsequent sections (see, e.g., the micro-operation 3418 of FIG. 34).
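
The fetch-and-dispatch behavior just described can be summarized with a small behavioral sketch. The instruction mnemonics and fields below (INITIALIZE, MULT_ACCUM, OUTPUT, LOOP, data_row, target, and so on) are illustrative assumptions, not the actual neural network unit instruction encoding.

```python
# Behavioral sketch of the sequencer's fetch/dispatch loop (illustrative only).
def run_nnu_program(program_memory, data_ram, weight_ram, npus):
    pc = 0                                   # program counter backing address 131
    while pc < len(program_memory):
        inst = program_memory[pc]
        next_pc = pc + 1                     # default: fetch sequentially
        if inst.op == "INITIALIZE":
            for npu in npus:                 # e.g., clear the accumulators
                npu.clear_accumulator()
        elif inst.op == "MULT_ACCUM":
            data_row = data_ram[inst.data_row]        # address 123 plus read command
            weight_row = weight_ram[inst.weight_row]  # address 125 plus read command
            for n, npu in enumerate(npus):   # all N NPUs operate in the same clock
                npu.mult_accum(data_row[n], weight_row[n])
        elif inst.op == "OUTPUT":
            # activation function, then write one row of N results back (address plus write command)
            data_ram[inst.out_row] = [npu.activation_output() for npu in npus]
        elif inst.op == "LOOP":
            inst.remaining -= 1              # LOOP carries a pre-initialized iteration count
            if inst.remaining > 0:
                next_pc = inst.target        # control instruction overrides the program counter
        pc = next_pc
```

A loop instruction is the only case in the sketch where the program counter is not simply incremented, mirroring the control-instruction behavior described above.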

The N neural processing units 126 generate N result words 133, which may be written back to a row of the weight RAM 124 or of the data RAM 122. In a preferred embodiment, the weight RAM 124 and the data RAM 122 are directly coupled to the N neural processing units 126. More specifically, the weight RAM 124 and the data RAM 122 are dedicated to the neural processing units 126 and are not shared with the other execution units 112 of the processor 100, and the neural processing units 126 can, on a sustained basis, consume a row from one or both of the weight RAM 124 and the data RAM 122 every clock cycle, which in a preferred embodiment is accomplished in a pipelined fashion. In one embodiment, each of the data RAM 122 and the weight RAM 124 can provide 8192 bits to the neural processing units 126 every clock cycle. The 8192 bits may be treated as 512 16-bit words or as 1024 8-bit words, as described in more detail below.

The size of the data set processed by the NNU 121 is not limited by the sizes of the weight RAM 124 and the data RAM 122, but only by the size of system memory, since data and weights may be moved between system memory and the weight RAM 124 and data RAM 122 through the use of the MTNN and MFNN instructions (e.g., through the media registers 118). In one embodiment, the data RAM 122 is dual-ported, so that data words can be written to the data RAM 122 concurrently with data words being read from or written to it. Furthermore, the large memory hierarchy of the memory subsystem 114, including the caches, provides very high data bandwidth for transfers between system memory and the NNU 121. In addition, in a preferred embodiment, the memory subsystem 114 includes hardware data prefetchers that track memory access patterns, such as neural data and weights being loaded from system memory, and perform data prefetches into the cache hierarchy to achieve high-bandwidth, low-latency transfers to the weight RAM 124 and the data RAM 122.

Although in the embodiments described herein one of the operands provided to each NPU 126 from the weight memory is referred to as a weight, a term commonly used in neural networks, it should be understood that these operands may be other types of data associated with a computation whose speed can be improved by these apparatuses.

Figure 2 is a block diagram illustrating the NPU 126 of Figure 1. As shown, the NPU 126 operates to perform many functions or operations. In particular, the NPU 126 may operate as a neuron, or node, in an artificial neural network to perform a typical multiply-accumulate function or operation. That is, generally speaking, the NPU 126 (neuron) is configured to: (1) receive an input value from each neuron having a connection to it, typically but not necessarily from the immediately preceding layer of the artificial neural network; (2) multiply each input value by the corresponding weight value associated with its connection to generate a product; (3) add all the products to generate a sum; and (4) perform an activation function on the sum to generate the output of the neuron. However, rather than performing all the multiplies associated with all the connection inputs and then summing their products, as in the conventional approach, each neuron of the present invention performs, in a given clock cycle, the weight multiply associated with one of its connection inputs and then adds (accumulates) that product to the accumulated value of the products of the connection inputs processed in clock cycles before that point. Assuming M connections to the neuron, after all M products have been accumulated (which takes approximately M clock cycles), the neuron performs the activation function on the accumulated value to generate the output, or result. The advantage of this approach is that it reduces the number of multipliers required, and only a smaller, simpler and faster adder circuit (e.g., a two-input adder) is needed within the neuron, rather than an adder capable of summing the products of all the connection inputs, or even of a subset of them. This approach also lends itself to the use of a very large number (N) of neurons (NPUs 126) in the NNU 121, so that after approximately M clock cycles the NNU 121 has generated the outputs of all N neurons. Finally, for a large number of different connection inputs, an NNU 121 constructed of such neurons performs efficiently as an artificial neural network layer. That is, if M increases or decreases in different layers, the number of clock cycles required to generate the neuron outputs increases or decreases correspondingly, and the resources (e.g., the multipliers and accumulators) remain fully utilized; by contrast, in a conventional design, portions of the multipliers and adders go unused for smaller values of M. Therefore, the embodiments described herein are both flexible and efficient with respect to the number of connection inputs of the NNU, and provide extremely high performance.
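
A software sketch (illustrative only) of the accumulation scheme described above: each simulated neuron consumes one connection input per "clock cycle" using a single multiply and a single two-input add, and applies the activation function once after all M connection inputs have been accumulated.

```python
def neuron_output(inputs, weights, activation):
    """inputs, weights: the M connection values; one multiply-accumulate per clock."""
    acc = 0.0
    for x, w in zip(inputs, weights):   # one loop iteration ~ one clock cycle
        acc += x * w                    # one multiplier and one two-input adder
    return activation(acc)              # activation applied after ~M clock cycles
```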

The NPU 126 includes a register 205, a two-input multiplexed register (mux-reg) 208, an arithmetic logic unit (ALU) 204, an accumulator 202 and an activation function unit (AFU) 212. The register 205 receives a weight word 206 from the weight RAM 124 and provides its output 203 on a subsequent clock cycle. The mux-reg 208 selects one of its two inputs 207 and 211 to store in its register and provide on its output 209 on a subsequent clock cycle. The input 207 receives a data word from the data RAM 122. The other input 211 receives the output 209 of the adjacent NPU 126. The NPU 126 shown in Figure 2 is denoted NPU J of the N NPUs of Figure 1; that is, NPU J is a representative instance of the N NPUs 126. In a preferred embodiment, the input 211 of the mux-reg 208 of instance J of the NPU 126 receives the output 209 of the mux-reg 208 of instance J-1 of the NPU 126, and the output 209 of the mux-reg 208 of NPU J is provided to the input 211 of the mux-reg 208 of instance J+1 of the NPU 126. In this way, the mux-regs 208 of the N NPUs 126 collectively operate as an N-word rotator, or circular shifter, as described in more detail below with respect to Figure 3. A control input 213 controls which of the two inputs the mux-reg 208 selects to store in its register and subsequently provide on the output 209.

The ALU 204 has three inputs. One input receives the weight word 203 from the register 205. Another input receives the output 209 of the mux-reg 208. The third input receives the output 217 of the accumulator 202. The ALU 204 performs arithmetic and/or logical operations on its inputs to generate a result provided on its output. In a preferred embodiment, the arithmetic and/or logical operations performed by the ALU 204 are specified by instructions stored in the program memory 129. For example, the multiply-accumulate instruction of Figure 4 specifies a multiply-accumulate operation, i.e., the result 215 is the sum of the accumulator 202 value 217 and the product of the weight word 203 and the data word on the output 209 of the mux-reg 208. Other operations that may be specified include, but are not limited to: the result 215 is the pass-through value of the mux-reg output 209; the result 215 is the pass-through value of the weight word 203; the result 215 is zero; the result 215 is the sum of the accumulator 202 value 217 and the weight 203; the result 215 is the sum of the accumulator 202 value 217 and the mux-reg output 209; the result 215 is the maximum of the accumulator 202 value 217 and the weight 203; the result 215 is the maximum of the accumulator 202 value 217 and the mux-reg output 209.

The ALU 204 provides its output 215 to the accumulator 202 for storage. The ALU 204 includes a multiplier 242 that multiplies the weight word 203 and the data word on the output 209 of the mux-reg 208 to generate a product 246. In one embodiment, the multiplier 242 multiplies two 16-bit operands to generate a 32-bit result. The ALU 204 also includes an adder 244 that adds the product 246 to the output 217 of the accumulator 202 to generate a sum, which is the result 215 of the accumulation stored into the accumulator 202. In one embodiment, the adder 244 adds the 32-bit result of the multiplier 242 to a 41-bit value 217 of the accumulator 202 to generate a 41-bit result. In this way, using the rotator capability of the mux-reg 208 over the course of multiple clock cycles, the NPU 126 accomplishes the sum of products of a neuron that neural networks require. The ALU 204 may also include other circuit elements to perform other arithmetic/logical operations such as those described above. In one embodiment, a second adder subtracts the weight word 203 from the data word on the output 209 of the mux-reg 208 to generate a difference, and the adder 244 then adds the difference to the output 217 of the accumulator 202 to generate a result 215, which is the accumulated result in the accumulator 202; in this way, over the course of multiple clock cycles, the NPU 126 accomplishes a sum of differences. In a preferred embodiment, although the weight word 203 and the data word 209 are the same size (in bits), they may have different binary point locations, as described below. In a preferred embodiment, the multiplier 242 and the adder 244 are an integer multiplier and an integer adder; compared with an ALU that uses floating-point arithmetic, such an ALU 204 has the advantages of low complexity, small size, speed and low power consumption. However, in other embodiments of the present invention, the ALU 204 may also perform floating-point operations.
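
A bit-width sketch of the integer datapath just described, under the stated example widths (16-bit signed operands, 32-bit product, 41-bit accumulator). Python integers are unbounded, so the widths are imposed explicitly for illustration; the helper names are assumptions, not hardware signal names.

```python
def to_signed(value, bits):
    """Wrap an integer into a two's-complement field of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac_step(acc41, data16, weight16):
    product32 = to_signed(data16 * weight16, 32)  # multiplier 242: 16b x 16b -> 32b product 246
    return to_signed(acc41 + product32, 41)       # adder 244: 41b accumulator + 32b product -> 41b

# Example use, once per clock cycle:
#   acc = mac_step(acc, data_word, weight_word)
```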

Although Figure 2 shows only a multiplier 242 and an adder 244 in the ALU 204, in a preferred embodiment the ALU 204 also includes other elements to perform the other various operations described above. For example, the ALU 204 may include a comparator (not shown) that compares the accumulator 202 with a data/weight word, and a multiplexer (not shown) that selects the larger (maximum) of the two values indicated by the comparator for storage into the accumulator 202. In another example, the ALU 204 includes selection logic (not shown) that bypasses the multiplier 242 with the data/weight word, so that the adder 244 adds the data/weight word to the accumulator 202 value 217 to generate a sum that is stored into the accumulator 202. These additional operations are described in more detail in later sections, such as those for Figures 18 through 29A, and are also useful for performing, for example, convolution and pooling operations.

The AFU 212 receives the output 217 of the accumulator 202 and performs an activation function on it to generate the result 133 of Figure 1. Generally speaking, the activation function in a neuron of an intermediate layer of an artificial neural network serves to normalize the accumulated sum of products, in particular in a non-linear fashion. To "normalize" the accumulated sum, the activation function of the current neuron produces a result value within the range of values that the other neurons connected to the current neuron expect to receive as input. (The normalized result is sometimes referred to as an "activation"; as used herein, an activation is the output of the current node, which a receiving node multiplies by the weight associated with the connection between the outputting node and the receiving node to generate a product, which is then accumulated with the products associated with the other input connections of the receiving node.) For example, where the receiving/connected neurons expect to receive as input values between 0 and 1, the outputting neuron may need to non-linearly squash and/or adjust (e.g., shift upward to convert negative values to positive values) accumulated sums that fall outside the 0-to-1 range so that they fall within the expected range. Thus, the operation performed by the AFU 212 on the accumulator 202 value 217 brings the result 133 into a known range. The results 133 of the N NPUs 126 may all be written back to the data RAM 122 or the weight RAM 124 simultaneously. In a preferred embodiment, the AFU 212 is configured to perform multiple activation functions, and an input, for example from the control register 127, selects one of them to perform on the output 217 of the accumulator 202. The activation functions may include, but are not limited to, a step function, a rectify function, a sigmoid function, a hyperbolic tangent function and a softplus function (also referred to as a smooth rectify function). The analytic form of the softplus function is f(x) = ln(1 + e^x), that is, the natural logarithm of the sum of 1 and e^x, where "e" is Euler's number and x is the input 217 to the function. In a preferred embodiment, the activation functions also include a pass-through function that passes the accumulator 202 value 217, or a portion of it, through unmodified, as described below. In one embodiment, circuitry of the AFU 212 performs the activation function in a single clock cycle. In one embodiment, the AFU 212 includes tables that receive the accumulated value and output a value that closely approximates the value the true activation function would provide, for certain activation functions such as the sigmoid, hyperbolic tangent and softplus functions.
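
A software sketch of several of the activation functions listed above, written in straightforward, non-table form for illustration (the hardware AFU 212 may approximate some of these with lookup tables, as noted):

```python
import math

def rectify(x):              # rectified linear: clamp negative values to zero
    return max(0.0, x)

def sigmoid(x):              # S-shaped squashing of the accumulated sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):   # squashes into (-1, 1)
    return math.tanh(x)

def softplus(x):             # f(x) = ln(1 + e^x), the "smooth rectify" mentioned above
    return math.log1p(math.exp(x))

def passthrough(x):          # pass the accumulator value through unmodified
    return x
```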

In a preferred embodiment, the width (in bits) of the accumulator 202 is greater than the width of the output 133 of the AFU 212. For example, in one embodiment the accumulator is 41 bits wide in order to avoid loss of precision when accumulating up to 512 32-bit products (as described in more detail below, e.g., with respect to Figure 30), whereas the result 133 is 16 bits wide. (Since 512 = 2^9, accumulating 512 products of up to 32 bits each can grow the magnitude of the sum by up to 9 additional bits, hence 32 + 9 = 41 bits.) In one embodiment, on subsequent clock cycles the AFU 212 passes through other, unprocessed portions of the accumulator 202 output 217, and these portions are written back to the data RAM 122 or the weight RAM 124, as described in more detail below with respect to Figure 8. This enables the raw accumulator 202 values to be loaded back into the media registers 118 via MFNN instructions, so that instructions executing on the other execution units 112 of the processor 100 can perform complex activation functions that the AFU 212 is unable to perform, such as the well-known softmax function, also referred to as the normalized exponential function. In one embodiment, the instruction set architecture of the processor 100 includes an instruction that performs the exponential function, commonly denoted e^x or exp(x), which may be used by the other execution units 112 of the processor 100 to speed up the performance of the softmax activation function.
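
A sketch of the softmax (normalized exponential) function mentioned above, which the AFU 212 does not compute directly; the raw accumulator values would be read back via MFNN instructions and a routine of roughly this form run on the other execution units:

```python
import math

def softmax(values):
    m = max(values)                           # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]  # the exp(x) step an ISA instruction could speed up
    total = sum(exps)
    return [e / total for e in exps]
```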

In one embodiment, the NPU 126 is pipelined. For example, the NPU 126 may include registers of the ALU 204, such as a register between the multiplier and the adder and/or between other circuits of the ALU 204, and the NPU 126 may also include a register that holds the output of the AFU 212. Other embodiments of the NPU 126 are described in later sections.

Figure 3 is a block diagram illustrating the N mux-regs 208 of the N NPUs 126 of the NNU 121 of Figure 1 operating as an N-word rotator, or circular shifter, on a row of data words 207 received from the data RAM 122 of Figure 1. In the embodiment of Figure 3, N is 512; thus the NNU 121 has 512 mux-regs 208, denoted 0 through 511, corresponding to the 512 NPUs 126. Each mux-reg 208 receives its corresponding data word 207 of one of the D rows of the data RAM 122. That is, mux-reg 0 receives data word 0 of the data RAM 122 row, mux-reg 1 receives data word 1, mux-reg 2 receives data word 2, and so on through mux-reg 511, which receives data word 511 of the data RAM 122 row. Additionally, mux-reg 1 receives on its other input 211 the output 209 of mux-reg 0, mux-reg 2 receives on its other input 211 the output 209 of mux-reg 1, mux-reg 3 receives on its other input 211 the output 209 of mux-reg 2, and so on through mux-reg 511, which receives on its other input 211 the output 209 of mux-reg 510, while mux-reg 0 receives on its other input 211 the output 209 of mux-reg 511. Each mux-reg 208 receives the control input 213, which controls whether it selects the data word 207 or the rotated input 211. In this mode of operation, during a first clock cycle the control input 213 controls each mux-reg 208 to select the data word 207 for storage into the register and subsequent provision to the ALU 204, and during subsequent clock cycles (e.g., the M-1 clock cycles described above) the control input 213 controls each mux-reg 208 to select the rotated input 211 for storage into the register and subsequent provision to the ALU 204.
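
A behavioral sketch of the 512-word rotator formed by the mux-regs 208 (N and the helper name are illustrative): in the first clock every register latches its own word of the selected data RAM row (input 207); in each later clock it latches the output of mux-reg J-1 (input 211), so the row of words circulates one position per clock.

```python
N = 512

def mux_reg_step(regs, data_row=None):
    """One clock cycle: parallel load from data_row if given, otherwise rotate by one."""
    if data_row is not None:                       # control input 213 selects input 207
        return list(data_row)                      # mux-reg j latches data word j
    return [regs[(j - 1) % N] for j in range(N)]   # mux-reg j latches mux-reg j-1's output

# regs = mux_reg_step([0] * N, data_row=row_17)   # clock 1: load row 17 of the data RAM
# regs = mux_reg_step(regs)                       # clocks 2..512: rotate one NPU per clock
```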

Although in the embodiments described with respect to Figure 3 (and Figures 7 and 19 below) the NPUs 126 are configured to rotate the mux-reg 208/705 values to the right, i.e., from NPU J toward NPU J+1, the present invention is not limited thereto; in other embodiments (e.g., those corresponding to Figures 24 through 26), the NPUs 126 are configured to rotate the mux-reg 208/705 values to the left, i.e., from NPU J toward NPU J-1. Furthermore, in other embodiments of the present invention the NPUs 126 may selectively rotate the mux-reg 208/705 values to the left or to the right, as specified, for example, by an NNU instruction.

Figure 4 is a table showing a program stored in the program memory 129 of the NNU 121 of Figure 1 and executed by that NNU 121. As mentioned above, this example program performs the computations associated with one layer of an artificial neural network. The table of Figure 4 has five rows and three columns. Each row corresponds to an address in the program memory 129 denoted in the first column. The second column specifies the corresponding instruction, and the third column indicates the number of clock cycles associated with the instruction. In a preferred embodiment, the number of clock cycles indicates the effective number of clocks per instruction, in a clocks-per-instruction sense, in a pipelined embodiment, rather than the latency of the instruction. As shown, because of the pipelined nature of the NNU 121, each instruction has an associated single clock cycle, with the exception of the instruction at address 2, which effectively repeats itself 511 times and therefore requires 511 clock cycles, as described below.

All of the NPUs 126 process each instruction of the program in parallel. That is, all N NPUs 126 execute the instruction of the first row in the same clock cycle, all N NPUs 126 execute the instruction of the second row in the same clock cycle, and so forth. However, the present invention is not limited thereto; in other embodiments described in later sections, some instructions are executed in a partially parallel, partially sequential fashion; for example, in an embodiment in which multiple NPUs 126 share an activation function unit, as described with respect to Figure 11, the activation function and output instructions at addresses 3 and 4 are executed in this fashion. The example of Figure 4 assumes a layer of 512 neurons (NPUs 126), each having 512 connection inputs from the 512 neurons of the previous layer, for a total of 256K connections. Each neuron receives a 16-bit data value from each connection input and multiplies that 16-bit data value by an appropriate 16-bit weight value.

The first row, at address 0 (although other addresses may be specified), specifies an initialize NPU instruction. The initialize instruction clears the accumulator 202 value to zero. In one embodiment, the initialize instruction may also specify loading the accumulator 202 with the corresponding word of a row of the data RAM 122 or of the weight RAM 124 addressed by the instruction. The initialize instruction also loads configuration values into the control register 127, as described in more detail below with respect to Figures 29A and 29B. For example, the widths of the data word 207 and the weight word 209 may be loaded, which are used by the ALU 204 to determine the sizes of the operations performed by its circuits and which also affect the result 215 stored in the accumulator 202. In one embodiment, the NPU 126 includes a circuit that saturates the output 215 of the ALU 204 before it is stored into the accumulator 202, and the initialize instruction loads a configuration value into that circuit that affects the saturation. In one embodiment, the accumulator 202 may also be cleared to zero by so specifying in an ALU function instruction (such as the multiply-accumulate instruction at address 1) or in an output instruction such as the write AFU output instruction at address 4.

The second row, at address 1, specifies a multiply-accumulate instruction that instructs the 512 NPUs 126 to load a corresponding data word from a row of the data RAM 122 and a corresponding weight word from a row of the weight RAM 124, and to perform a first multiply-accumulate operation on the data word input 207 and the weight word input 206, that is, adding the product to the initialized accumulator 202 zero value. More specifically, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the data word input 207. In the example of Figure 4, the specified data RAM 122 row is row 17 and the specified weight RAM 124 row is row 0, so the sequencer is instructed to output the value 17 as the data RAM address 123 and the value 0 as the weight RAM address 125. Consequently, the 512 data words from row 17 of the data RAM 122 are provided as the corresponding data inputs 207 of the 512 NPUs 126, and the 512 weight words from row 0 of the weight RAM 124 are provided as the corresponding weight inputs 206 of the 512 NPUs 126.

The third row, at address 2, specifies a multiply-accumulate rotate instruction with a count of 511, which instructs the 512 NPUs 126 to perform 511 multiply-accumulate operations. The instruction instructs the 512 NPUs 126 that, for each of the 511 multiply-accumulate operations, the data word 209 input to the ALU 204 is to be the rotated value 211 from the adjacent NPU 126; that is, the instruction instructs the sequencer 128 to generate a value on the control input 213 to select the rotated value 211. Additionally, the instruction instructs the 512 NPUs 126 to load the corresponding weight value for each of the 511 multiply-accumulate operations from the "next" row of the weight RAM 124; that is, the instruction instructs the sequencer 128 to increment the weight RAM address 125 by one relative to its value in the previous clock cycle, which in this example is row 1 in the first clock cycle of the instruction, row 2 in the next clock cycle, row 3 in the next, and so on to row 511 in the 511th clock cycle. For each of the 511 multiply-accumulate operations, the product of the rotated input 211 and the weight word input 206 is added to the previous value of the accumulator 202. The 512 NPUs 126 perform the 511 multiply-accumulate operations in 511 clock cycles, in which each NPU 126 performs a multiply-accumulate on a different data word from row 17 of the data RAM 122, namely the data word operated on by the adjacent NPU 126 in the previous clock cycle, and on the different weight word associated with that data word, which is conceptually a different connection input to the neuron. This example assumes that each NPU 126 (neuron) has 512 connection inputs, thus involving the processing of 512 data words and 512 weight words. After the last iteration of the multiply-accumulate rotate instruction at address 2, the accumulator 202 holds the sum of the products of all 512 connection inputs. In one embodiment, rather than having a separate instruction for each type of arithmetic/logic operation (such as the multiply-accumulate, the maximum of accumulator and weight, and so forth described above), the instruction set of the NPU 126 includes an "execute" instruction that instructs the ALU 204 to perform the ALU operation specified by the initialize NPU instruction, for example as specified by the ALU function 2926 of Figure 29A.

The fourth row, at address 3, specifies an activation function instruction. The activation function instruction instructs the AFU 212 to perform the specified activation function on the accumulator 202 value to generate the result 133. Activation function embodiments are described in more detail in later sections.

The fifth row, at address 4, specifies a write AFU output instruction that instructs the 512 NPUs 126 to write back their AFU 212 output as the result 133 to a row of the data RAM 122, which in this example is row 16. That is, the instruction instructs the sequencer 128 to output the value 16 as the data RAM address 123 along with a write command (as opposed to the read command specified by the multiply-accumulate instruction at address 1). In a preferred embodiment, because of the pipelined nature of execution, the write AFU output instruction may be executed concurrently with other instructions, so that the write AFU output instruction effectively executes in a single clock cycle.
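
The five instructions described above can be summarized in the following paraphrased listing; the mnemonics and operand spellings are illustrative, not the patent's actual instruction syntax.

```python
# address 0: INITIALIZE_NPU                      # clear accumulator 202 to zero
# address 1: MULT_ACCUM   data_ram_row=17, weight_ram_row=0
#            # acc += data[17][j] * weight[0][j] for every NPU j
# address 2: MULT_ACCUM_ROTATE  count=511, weight_ram_row="next"
#            # repeated 511 times: rotate the data words one NPU position and step to the
#            # next weight RAM row, then acc += rotated_data * weight
# address 3: ACTIVATION_FUNCTION                 # result 133 = activation(acc)
# address 4: WRITE_AFU_OUTPUT  data_ram_row=16   # write the 512 results to data RAM row 16
```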

In a preferred embodiment, each NPU 126 operates as a pipeline having various functional elements, for example the mux-reg 208 (and the mux-reg 705 of Figure 7), the ALU 204, the accumulator 202, the AFU 212, the multiplexer 802 (see Figure 8), the row buffer 1104 and the AFUs 1112 (see Figure 11), some of which may themselves be pipelined. In addition to the data words 207 and weight words 206, the pipeline receives the instructions from the program memory 129; the instructions flow down the pipeline and control the various functional units. In an alternative embodiment, the program does not include an activation function instruction; rather, the initialize NPU instruction specifies the activation function to be performed on the accumulator 202 value 217, and a value indicating the specified activation function is stored in a configuration register for use by the AFU 212 portion of the pipeline once the final accumulator 202 value 217 has been generated, that is, once the last iteration of the multiply-accumulate rotate instruction at address 2 has completed. In a preferred embodiment, to save power, the AFU 212 portion of the pipeline remains inactive until a write AFU output instruction reaches it, at which time the AFU 212 is powered up and performs the activation function specified by the initialize instruction on the accumulator 202 output 217.

Figure 5 is a timing diagram illustrating the execution of the program of Figure 4 by the NNU 121. Each row of the timing diagram corresponds to a successive clock cycle indicated in the first column. Each of the other columns corresponds to a different one of the 512 NPUs 126 and indicates its operation. Only the operations of NPUs 0, 1 and 511 are shown, to simplify the illustration.

At clock 0, each of the 512 NPUs 126 performs the initialize instruction of Figure 4, which is shown in Figure 5 as the assignment of a zero value to the accumulator 202.

At clock 1, each of the 512 NPUs 126 performs the multiply-accumulate instruction at address 1 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value (i.e., zero) with the product of word 0 of row 17 of the data RAM 122 and word 0 of row 0 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value (i.e., zero) with the product of word 1 of row 17 of the data RAM 122 and word 1 of row 0 of the weight RAM 124; and so on through NPU 511, which accumulates the accumulator 202 value (i.e., zero) with the product of word 511 of row 17 of the data RAM 122 and word 511 of row 0 of the weight RAM 124.

At clock 2, each of the 512 NPUs 126 performs the first iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (namely, data word 511 received from the data RAM 122) and word 0 of row 1 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (namely, data word 0 received from the data RAM 122) and word 1 of row 1 of the weight RAM 124; and so on through NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (namely, data word 510 received from the data RAM 122) and word 511 of row 1 of the weight RAM 124.

At clock 3, each of the 512 NPUs 126 performs the second iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (namely, data word 510 received from the data RAM 122) and word 0 of row 2 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (namely, data word 511 received from the data RAM 122) and word 1 of row 2 of the weight RAM 124; and so on through NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (namely, data word 509 received from the data RAM 122) and word 511 of row 2 of the weight RAM 124. As the ellipsis of Figure 5 indicates, this continues for the following 509 clock cycles, until clock 512.

At clock 512, each of the 512 NPUs 126 performs the 511th iteration of the multiply-accumulate rotate instruction at address 2 of Figure 4. As shown, NPU 0 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 511 (namely, data word 1 received from the data RAM 122) and word 0 of row 511 of the weight RAM 124; NPU 1 accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 0 (namely, data word 2 received from the data RAM 122) and word 1 of row 511 of the weight RAM 124; and so on through NPU 511, which accumulates the accumulator 202 value with the product of the rotated data word 211 received from the mux-reg 208 output 209 of NPU 510 (namely, data word 0 received from the data RAM 122) and word 511 of row 511 of the weight RAM 124. In one embodiment, multiple clock cycles are required to read the data words and weight words from the data RAM 122 and the weight RAM 124 to perform the multiply-accumulate instruction at address 1 of Figure 4; however, the data RAM 122, the weight RAM 124 and the NPUs 126 are pipelined such that once the first multiply-accumulate has begun (as shown during clock 1 of Figure 5), the subsequent multiply-accumulates (as shown during clocks 2 through 512 of Figure 5) begin in successive clock cycles. In a preferred embodiment, the NPUs 126 may stall briefly in response to an access to the data RAM 122 and/or the weight RAM 124 by an architectural instruction, such as an MTNN or MFNN instruction (described below with respect to Figures 14 and 15), or by a microinstruction into which such an architectural instruction is translated.
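
The per-clock pattern above can be compressed into a single expression; this is only a restatement of the indexing in the example (a sketch, not additional structure claimed by the patent). With d_k denoting word k of row 17 of the data RAM and w_{c,j} denoting word j of row c of the weight RAM (clock c+1 uses weight RAM row c), the value accumulated by NPU j after clock 512 is:

```latex
\mathrm{acc}_j \;=\; \sum_{c=0}^{511} d_{(j-c)\bmod 512}\; w_{c,\,j},
\qquad j = 0, 1, \dots, 511 .
```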

At clock 513, the AFU 212 of each of the 512 NPUs 126 performs the activation function at address 3 of Figure 4. Finally, at clock 514, each of the 512 NPUs 126 performs the write AFU output instruction at address 4 of Figure 4 by writing back its result 133 to its corresponding word of row 16 of the data RAM 122; that is, the result 133 of NPU 0 is written to word 0 of the data RAM 122, the result 133 of NPU 1 is written to word 1 of the data RAM 122, and so on through the result 133 of NPU 511, which is written to word 511 of the data RAM 122. A block diagram corresponding to the operation of Figure 5 described above is shown in Figure 6A.

Figure 6A is a block diagram illustrating the NNU 121 of Figure 1 executing the program of Figure 4. The NNU 121 includes the 512 NPUs 126, the data RAM 122 receiving its address input 123, and the weight RAM 124 receiving its address input 125. At clock 0, the 512 NPUs 126 execute the initialize instruction (not shown in the figure). As shown, at clock 1 the 512 16-bit data words of row 17 are read out of the data RAM 122 and provided to the 512 NPUs 126. During clocks 1 through 512, the 512 16-bit weight words of rows 0 through 511, respectively, are read out of the weight RAM 124 and provided to the 512 NPUs 126. At clock 1, the 512 NPUs 126 perform their respective multiply-accumulate operations on the loaded data words and weight words (not shown in the figure). During clocks 2 through 512, the 512 mux-regs 208 of the NPUs 126 operate as a rotator of 512 16-bit words, rotating the data words previously loaded from row 17 of the data RAM 122 to the adjacent NPUs 126, and the NPUs 126 perform multiply-accumulate operations on the respective rotated data words and the respective weight words loaded from the weight RAM 124. At clock 513, the 512 AFUs 212 perform the activation instruction (not shown in the figure). At clock 514, the 512 NPUs 126 write their respective 512 16-bit results 133 back to row 16 of the data RAM 122.

As may be seen, the number of clock cycles required to generate the result words (neuron outputs) and write them back to the data RAM 122 or the weight RAM 124 is approximately the square root of the number of data inputs (connections) received by the current layer of the neural network. For example, if the current layer has 512 neurons and each neuron has 512 connections from the previous layer, the total number of connections is 256K, and the number of clock cycles required to generate the results of the current layer is slightly more than 512. Thus, the NNU 121 provides extremely high performance for neural network computations.
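
As a concrete instance of the square-root relationship, using the layer dimensions of this example:

```latex
C \;=\; 512 \times 512 \;=\; 262\,144 \;=\; 256\mathrm{K},
\qquad \sqrt{C} \;=\; 512 ,
```

which agrees with the roughly 512 clock cycles shown in Figure 5 (514 counting the activation and write-back clocks) needed to produce the 512 neuron outputs.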

Figure 6B is a flowchart illustrating the operation of the processor 100 of Figure 1 executing an architectural program that uses the NNU 121 to perform the typical multiply-accumulate-activation-function computations of the neurons of the hidden layers of an artificial neural network, such as are performed by the program of Figure 4. The example of Figure 6B assumes four hidden layers (denoted by the variable NUM_LAYERS initialized at step 602), each having 512 neurons, each of which is connected to all 512 neurons of the previous layer (by using the program of Figure 4). However, it should be understood that these numbers of layers and of neurons are chosen to illustrate the invention, and the NNU 121 may apply similar computations in embodiments with a different number of hidden layers, with a different number of neurons in each layer, or in which the neurons are not fully connected. In one embodiment, the weight values for neurons that do not exist in a layer, or for connections to neurons that do not exist, are set to zero. In a preferred embodiment, the architectural program writes a first set of weights to the weight RAM 124 and starts the NNU 121, and while the NNU 121 is performing the computations associated with the first layer, the architectural program writes a second set of weights to the weight RAM 124, so that as soon as the NNU 121 completes the computations for the first hidden layer, the NNU 121 can begin the computations for the second layer. In this manner, the architectural program ping-pongs back and forth between the two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. The flow begins at step 602.

At step 602, the processor 100 executing the architectural program writes the input values for the current hidden layer of neurons into the data RAM 122, that is, into row 17 of the data RAM 122, as described with respect to Figure 6A. The values may also already reside in row 17 of the data RAM 122 as the results 133 of the operation of the NNU 121 on a previous layer (e.g., a convolution, pooling or input layer). The architectural program also initializes a variable N to the value 1; N denotes the current layer of the hidden layers being processed by the NNU 121. Additionally, the architectural program initializes a variable NUM_LAYERS to the value 4, since there are four hidden layers in this example. The flow then proceeds to step 604.

At step 604, the processor 100 writes the weight words for layer 1 into the weight RAM 124, for example into rows 0 through 511 as shown in Figure 6A. The flow then proceeds to step 606.

At step 606, the processor 100 writes the multiply-accumulate-activation-function program (see Figure 4) into the program memory 129 of the NNU 121 using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the NNU program using an MTNN instruction 1400 that specifies a function 1432 to begin execution of the program. The flow then proceeds to step 608.

At decision step 608, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, the flow proceeds to step 612; otherwise, it proceeds to step 614.

At step 612, the processor 100 writes the weight words for layer N+1 into the weight RAM 124, for example into rows 512 through 1023. Thus, the architectural program writes the weight words for the next layer into the weight RAM 124 while the NNU 121 is performing the hidden layer computations of the current layer, so that as soon as the computations of the current layer are complete, that is, written to the data RAM 122, the NNU 121 can immediately begin the hidden layer computations of the next layer. The flow then proceeds to step 614.

At step 614, the processor 100 determines whether the currently running NNU program (started at step 606 in the case of layer 1, and at step 618 in the case of layers 2 through 4) has completed execution. In a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the status register 127 of the NNU 121. In an alternative embodiment, the NNU 121 generates an interrupt to indicate that it has completed the multiply-accumulate-activation-function layer program. The flow then proceeds to decision step 616.

At decision step 616, the architectural program determines whether the value of the variable N is less than NUM_LAYERS. If so, the flow proceeds to step 618; otherwise, it proceeds to step 622.

At step 618, the processor 100 updates the multiply-accumulate-activation-function program so that it can perform the hidden layer computations for layer N+1. More specifically, the processor 100 updates the data RAM 122 row value of the multiply-accumulate instruction at address 1 of Figure 4 to the row of the data RAM 122 into which the previous layer wrote its results (e.g., to row 16), and it also updates the output row (e.g., to row 15). The processor 100 then restarts the updated NNU program. Alternatively, the program of Figure 4 specifies for the output instruction at address 4 the same row that is specified by the multiply-accumulate instruction at address 1 (that is, the row read from the data RAM 122). In this embodiment, the current row of input data words is overwritten (which is acceptable as long as that row of data words is not needed for some other purpose, since the row of data words has already been read into the mux-regs 208 and is being rotated among the NPUs 126 via the N-word rotator). In this case, the NNU program need not be updated at step 618, but only restarted. The flow then proceeds to step 622.

At step 622, the processor 100 reads the results of the NNU program for layer N from the data RAM 122. However, if the results are simply to be used by the next layer, the architectural program need not read them from the data RAM 122, but may instead leave them in the data RAM 122 for the next hidden layer computation. The flow then proceeds to step 624.

In decision step 624, the architectural program determines whether the value of variable N is less than NUM_LAYERS. If so, flow proceeds to step 626; otherwise, the flow ends.

In step 626, the architectural program increments N by one. Flow then returns to decision step 608.

As may be seen from the example of Figure 6B, approximately every 512 clock cycles the NPUs 126 perform one read from and one write to the data RAM 122 (as an effect of the operation of the NNU program of Figure 4). Additionally, the NPUs 126 read the weight RAM 124 approximately every clock cycle to read a row of weight words. Thus, the full bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the neural network unit 121 performs the hidden layer operations. Additionally, assuming an embodiment that includes a write-and-read buffer such as the buffer 1704 of Figure 17, the processor 100 writes the weight RAM 124 concurrently with the NPU 126 reads, such that the buffer 1704 performs a write of weight words to the weight RAM 124 approximately every 16 clock cycles. Thus, in embodiments in which the weight RAM 124 is single-ported (as described in the corresponding section of Figure 17), approximately every 16 clock cycles the NPUs 126 must be stalled from reading the weight RAM 124 so that the buffer 1704 can write the weight RAM 124. However, in embodiments in which the weight RAM 124 is dual-ported, the NPUs 126 need not be stalled.
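The access pattern above can be illustrated with a small scheduling sketch. The following Python snippet is illustrative only (the function name and the fixed 16-cycle write cadence are simplifying assumptions, not part of the design): it models a single-ported weight RAM in which the NPUs read one weight row per clock, except that roughly every sixteenth clock the buffer's pending row write takes the port and the NPU read stalls for that clock.

    def weight_ram_schedule(num_clocks=32, buffer_write_period=16):
        # Returns, per clock, who owns the single weight RAM port.
        schedule = []
        for clock in range(num_clocks):
            if clock % buffer_write_period == buffer_write_period - 1:
                schedule.append((clock, "buffer 1704 writes a row; NPU read stalled"))
            else:
                schedule.append((clock, "NPUs 126 read one weight row"))
        return schedule

    for clock, owner in weight_ram_schedule():
        print(clock, owner)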

FIG. 7 is a block diagram illustrating another embodiment of the neural processing unit 126 of Figure 1. The NPU 126 of Figure 7 is similar to the NPU 126 of Figure 2. However, the NPU 126 of Figure 7 additionally includes a second two-input mux-reg 705. The mux-reg 705 selects one of its inputs 206 or 711 to store in its register and then provides it on its output 203 on a subsequent clock cycle. The input 206 receives the weight word from the weight RAM 124. The other input 711 receives the output 203 of the second mux-reg 705 of the adjacent NPU 126. For a preferred embodiment, the mux-reg 705 input 711 of NPU J receives the mux-reg 705 output 203 of NPU 126 instance J-1, and the output 203 of NPU J is provided to the mux-reg 705 input 711 of NPU 126 instance J+1. In this way, the mux-regs 705 of the N NPUs 126 collectively operate as an N-word rotator, in a manner similar to that described above with respect to Figure 3, but for the weight words rather than the data words. A control input 213 controls which of the two inputs the mux-reg 705 selects to store in its register and subsequently provide on the output 203.
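A minimal behavioral sketch may make the weight-word rotator connectivity easier to see. The snippet below is an assumption-level model (plain Python lists standing in for the mux-regs 705), not the hardware: on each rotate, NPU J latches the word previously held by NPU J-1, so the whole row circulates one position per clock.

    def rotate_weight_row(words):
        # words[j] models the mux-reg 705 of NPU j; returns the row after one clock
        # of rotation, where NPU j takes the word held by NPU j-1 (wrapping around).
        n = len(words)
        return [words[(j - 1) % n] for j in range(n)]

    row = [10, 20, 30, 40]         # a 4-NPU example rather than N = 512
    print(rotate_weight_row(row))  # [40, 10, 20, 30]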

By using the mux-regs 208 and/or the mux-regs 705 (as well as the mux-regs of other embodiments, such as those of Figures 18 and 23) to effectively form a large rotator that rotates the data/weights of a row received from the data RAM 122 and/or weight RAM 124, the neural network unit 121 avoids the need for a very large multiplexer between the data RAM 122 and/or weight RAM 124 and the NPUs in order to provide the needed data/weight words to the appropriate neural network units.

Writing Back Accumulator Values in Addition to Activation Function Results

For some applications, it is useful for the processor 100 to receive back (e.g., into the media registers 118 via the MFNN instruction of Figure 15) the raw accumulator 202 value 217 so that instructions executing on the other execution units 112 can perform computations on it. For example, in one embodiment, the activation function unit 212 is not configured to perform a softmax activation function, in order to reduce the complexity of the activation function unit 212. Instead, the neural network unit 121 outputs the raw accumulator 202 value 217, or a subset thereof, to the data RAM 122 or weight RAM 124, from which the architectural program subsequently reads it and performs computations on the raw values. However, use of the raw accumulator 202 value 217 is not limited to performing softmax operations; other uses are also contemplated by the present invention.

FIG. 8 is a block diagram illustrating yet another embodiment of the neural processing unit 126 of Figure 1. The NPU 126 of Figure 8 is similar to the NPU 126 of Figure 2. However, the NPU 126 of Figure 8 includes a multiplexer 802 within the activation function unit 212, and the activation function unit 212 has a control input 803. The width (in bits) of the accumulator 202 is greater than the width of a data word. The multiplexer 802 has multiple inputs that receive data-word-width portions of the accumulator 202 output 217. In one embodiment, the width of the accumulator 202 is 41 bits and the NPU 126 is configured to output a 16-bit result word 133; thus, for example, the multiplexer 802 (or the multiplexer 3032 and/or multiplexer 3037 of Figure 30) has three inputs that receive bits [15:0], bits [31:16] and bits [47:32], respectively, of the accumulator 202 output 217. For a preferred embodiment, the output bits not provided by the accumulator 202 (e.g., bits [47:41]) are forced to zero.
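As a concrete illustration of the slice selection, the following sketch (a hypothetical software model, not the circuit) picks one data-word-sized slice of a 41-bit accumulator value, with the bits the accumulator does not supply reading as zero.

    def accumulator_slice(acc_value, which):
        # Model of multiplexer 802: select bits [15:0], [31:16] or [47:32] of the
        # accumulator output; bits above bit 40 do not exist and read as zero.
        acc_value &= (1 << 41) - 1            # accumulator 202 is 41 bits wide
        lo = {0: 0, 1: 16, 2: 32}[which]
        return (acc_value >> lo) & 0xFFFF

    acc = 0x1_2345_6789
    print([hex(accumulator_slice(acc, i)) for i in range(3)])  # ['0x6789', '0x2345', '0x1']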

The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the words (e.g., 16 bits) of the accumulator 202 in response to a write accumulator instruction, such as the write accumulator instructions at addresses 3 through 5 of Figure 9 described below. For a preferred embodiment, the multiplexer 802 also has one or more inputs that receive the outputs of activation function circuits (e.g., elements 3022, 3024, 3026, 3018, 3014 and 3016 of Figure 30) that produce outputs one data word in width. The sequencer 128 generates a value on the control input 803 to control the multiplexer 802 to select one of the activation function circuit outputs, rather than one of the words of the accumulator 202, in response to an activation function unit output instruction such as the one at address 4 of Figure 4.

FIG. 9 is a table showing a program stored in the program memory 129 of the neural network unit 121 of Figure 1 and executed by that neural network unit 121. The example program of Figure 9 is similar to the program of Figure 4. In particular, the instructions at addresses 0 through 2 are identical. However, the instructions at addresses 3 and 4 of Figure 4 are replaced in Figure 9 by write accumulator instructions that instruct the 512 NPUs 126 to write their accumulator 202 output 217 back as results 133 to three rows of the data RAM 122, which in this example are rows 16 through 18. That is, the write accumulator instruction instructs the sequencer 128 to output a data RAM address 123 value of 16 and a write command in the first clock cycle, a data RAM address 123 value of 17 and a write command in the second clock cycle, and a data RAM address 123 value of 18 and a write command in the third clock cycle. For a preferred embodiment, the execution of the write accumulator instruction may be overlapped with that of other instructions, so that the write accumulator instruction effectively executes in these three clock cycles, in each of which one row of the data RAM 122 is written. In one embodiment, the user specifies the values of the activation function 2934 and output command 2956 fields of the control register 127 (Figure 29A) to write the desired portion of the accumulator 202 to the data RAM 122 or weight RAM 124. Additionally, rather than writing back the entire contents of the accumulator 202, the write accumulator instruction may optionally write back a subset of the accumulator 202. In one embodiment, a canonical form of the accumulator 202 may be written back, as described in more detail below with respect to Figures 29 through 31.

FIG. 10 is a timing diagram illustrating the execution of the program of Figure 9 by the neural network unit 121. The timing diagram of Figure 10 is similar to the timing diagram of Figure 5, and clock cycles 0 through 512 are the same. However, during clock cycles 513 through 515, the activation function unit 212 of each of the 512 NPUs 126 executes one of the write accumulator instructions at addresses 3 through 5 of Figure 9. In particular, in clock cycle 513, each of the 512 NPUs 126 writes bits [15:0] of its accumulator 202 output 217 as its result 133 back to its corresponding word of row 16 of the data RAM 122; in clock cycle 514, each of the 512 NPUs 126 writes bits [31:16] of its accumulator 202 output 217 as its result 133 back to its corresponding word of row 17 of the data RAM 122; and in clock cycle 515, each of the 512 NPUs 126 writes bits [40:32] of its accumulator 202 output 217 as its result 133 back to its corresponding word of row 18 of the data RAM 122. For a preferred embodiment, bits [47:41] are forced to zero.

Shared Activation Function Units

FIG. 11 is a block diagram illustrating an embodiment of the neural network unit 121 of Figure 1. In the embodiment of Figure 11, a neuron is split into two portions, an activation function unit portion and an arithmetic logic unit portion (which also includes the shift register portion), and each activation function unit portion is shared by multiple arithmetic logic unit portions. In Figure 11, the arithmetic logic unit portions are referred to as the NPUs 126, and the shared activation function unit portions are referred to as the activation function units 1112. This is in contrast to the embodiment of Figure 2, for example, in which each neuron includes its own activation function unit 212. Hence, in an example of the Figure 11 embodiment, the NPU 126 (arithmetic logic unit portion) may include the accumulator 202, arithmetic logic unit 204, mux-reg 208 and register 205 of Figure 2, but not the activation function unit 212. In the embodiment of Figure 11, the neural network unit 121 includes 512 NPUs 126, although the present invention is not limited to this number. In the example of Figure 11, the 512 NPUs 126 are grouped into 64 groups, referred to as groups 0 through 63 in Figure 11, of eight NPUs 126 each.

The neural network unit 121 also includes a row buffer 1104 and a plurality of shared activation function units 1112 coupled between the NPUs 126 and the row buffer 1104. The width (in bits) of the row buffer 1104 is the same as a row of the data RAM 122 or weight RAM 124, e.g., 512 words. There is one activation function unit 1112 per group of NPUs 126, i.e., each activation function unit 1112 corresponds to a group of NPUs 126; thus, in the embodiment of Figure 11 there are 64 activation function units 1112 corresponding to the 64 groups of NPUs 126. The eight NPUs 126 of a group share the activation function unit 1112 corresponding to that group. Embodiments with different numbers of activation function units and different numbers of NPUs per group are also contemplated. For example, embodiments are contemplated in which two, four or sixteen NPUs 126 in a group share an activation function unit 1112.

Sharing the activation function units 1112 helps to reduce the size of the neural network unit 121. The size reduction comes at the cost of performance. That is, depending on the sharing ratio, additional clock cycles are required to generate the results 133 for the entire array of NPUs 126; for example, as shown in Figure 12 below, seven additional clock cycles are required with an 8:1 sharing ratio. However, generally speaking, the additional number of clock cycles (e.g., 7) is relatively small compared to the number of clock cycles required to generate the accumulated sums (e.g., 512 clock cycles for a layer having 512 connections per neuron). Hence, the impact of sharing the activation function units on performance is very small (e.g., an increase in computation time of approximately one percent), which may be a worthwhile cost for the reduction in size of the neural network unit 121 that it affords.
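The cost estimate can be checked with simple arithmetic. The snippet below is only a back-of-the-envelope illustration under the numbers stated above (512 accumulation clocks, an 8:1 sharing ratio adding seven clocks); it is not part of the design.

    accumulate_clocks = 512       # clocks spent producing the accumulated sums
    extra_clocks = 8 - 1          # one extra clock per additional NPU in a group
    overhead = extra_clocks / accumulate_clocks
    print(f"added time ≈ {overhead:.1%}")   # ≈ 1.4%, i.e. on the order of one percent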

In one embodiment, each NPU 126 includes an activation function unit 212 that performs relatively simple activation functions; these simple activation function units 212 are small enough to be included in each NPU 126, whereas the shared complex activation function units 1112 perform relatively complex activation functions and are significantly larger than the simple activation function units 212. In such an embodiment, the additional clock cycles are required only when a complex activation function is specified that must be performed by a shared complex activation function unit 1112; they are not required when the specified activation function can be performed by a simple activation function unit 212.

FIGS. 12 and 13 are timing diagrams illustrating the execution of the program of Figure 4 by the neural network unit 121 of Figure 11. The timing diagram of Figure 12 is similar to the timing diagram of Figure 5, and clock cycles 0 through 512 are the same. However, the operation at clock cycle 513 is different because the NPUs 126 of Figure 11 share the activation function units 1112; that is, the NPUs 126 of a group share the activation function unit 1112 associated with that group, and Figure 11 illustrates this sharing.

Each row of the timing diagram of Figure 13 corresponds to a successive clock cycle indicated in the first column. The other columns correspond to different ones of the 64 activation function units 1112 and indicate their operation. Only the operations of activation function units 0, 1 and 63 are shown for simplicity of illustration. The clock cycles of Figure 13 correspond to the clock cycles of Figure 12, but illustrate in a different manner the sharing of the activation function units 1112 by the NPUs 126. As shown in Figure 13, during clock cycles 0 through 512 all 64 activation function units 1112 are inactive, while the NPUs 126 execute the initialize NPU, multiply-accumulate and multiply-accumulate rotate instructions.

As shown in Figures 12 and 13, in clock cycle 513, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 0, which is the first NPU 126 in group 0, and the output of the activation function unit 1112 will be stored to word 0 of the row buffer 1104. Also in clock cycle 513, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the first NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, in clock cycle 513, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of NPU 0 to generate a result that will be stored to word 0 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of NPU 8 to generate a result that will be stored to word 8 of the row buffer 1104; and so forth, until activation function unit 63 begins to perform the specified activation function on the accumulator 202 of NPU 504 to generate a result that will be stored to word 504 of the row buffer 1104.

In clock cycle 514, activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 1, which is the second NPU 126 in group 0, and the output of the activation function unit 1112 will be stored to word 1 of the row buffer 1104. Also in clock cycle 514, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the second NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, in clock cycle 514, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of NPU 1 to generate a result that will be stored to word 1 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of NPU 9 to generate a result that will be stored to word 9 of the row buffer 1104; and so forth, until activation function unit 63 begins to perform the specified activation function on the accumulator 202 of NPU 505 to generate a result that will be stored to word 505 of the row buffer 1104. This pattern continues until clock cycle 520, in which activation function unit 0 (the activation function unit 1112 associated with group 0) begins to perform the specified activation function on the accumulator 202 value 217 of NPU 7, which is the eighth (and last) NPU 126 in group 0, and the output of the activation function unit 1112 will be stored to word 7 of the row buffer 1104. Also in clock cycle 520, each of the activation function units 1112 begins to perform the specified activation function on the accumulator 202 value 217 of the eighth NPU 126 in its corresponding group of NPUs 126. Thus, as shown in Figure 13, in clock cycle 520, activation function unit 0 begins to perform the specified activation function on the accumulator 202 of NPU 7 to generate a result that will be stored to word 7 of the row buffer 1104; activation function unit 1 begins to perform the specified activation function on the accumulator 202 of NPU 15 to generate a result that will be stored to word 15 of the row buffer 1104; and so forth, until activation function unit 63 begins to perform the specified activation function on the accumulator 202 of NPU 511 to generate a result that will be stored to word 511 of the row buffer 1104.
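The per-clock assignment described above follows a simple pattern, which the following illustrative sketch (an assumption-level model, not the hardware) makes explicit: in clock 513+k, activation function unit g operates on NPU 8*g+k and the result is destined for word 8*g+k of the row buffer 1104.

    def shared_afu_schedule(num_afus=64, group_size=8, first_clock=513):
        # schedule[clock] is a list of (afu, npu) pairs active in that clock.
        schedule = {}
        for k in range(group_size):                       # clocks 513 .. 520
            clock = first_clock + k
            schedule[clock] = [(g, g * group_size + k) for g in range(num_afus)]
        return schedule

    sched = shared_afu_schedule()
    print(sched[513][:2])    # [(0, 0), (1, 8)]   unit 0 -> NPU 0, unit 1 -> NPU 8
    print(sched[520][-1])    # (63, 511)          last clock: unit 63 -> NPU 511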

In clock cycle 521, once all 512 results of the 512 NPUs 126 have been generated and written to the row buffer 1104, the row buffer 1104 begins to write its contents to the data RAM 122 or weight RAM 124. In this way, the activation function unit 1112 of each group of NPUs 126 performs a portion of the activation function instruction at address 3 of Figure 4.

Embodiments such as that of Figure 11, in which an activation function unit 1112 is shared among a group of arithmetic logic units 204, are particularly advantageous in conjunction with integer arithmetic logic units 204, as described in more detail below, e.g., with respect to Figures 29A through 33.

MTNN and MFNN Architectural Instructions

FIG. 14 is a block diagram illustrating a move to neural network (MTNN) architectural instruction 1400 and its operation with respect to portions of the neural network unit 121 of Figure 1. The MTNN instruction 1400 includes an opcode field 1402, a src1 field 1404, a src2 field 1406, a gpr field 1408 and an immediate field 1412. The MTNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. For a preferred embodiment, the instruction set architecture associates a predetermined value of the opcode field 1402 with the MTNN instruction 1400 to distinguish it from the other instructions of the instruction set architecture. The opcode 1402 of the MTNN instruction 1400 may or may not include a prefix, such as is common in the x86 architecture.

The immediate field 1412 provides a value that specifies a function 1432 to control logic 1434 of the neural network unit 121. For a preferred embodiment, the function 1432 is provided as an immediate operand of the microinstruction 105 of Figure 1. The functions 1432 that may be performed by the neural network unit 121 include, but are not limited to, writing to the data RAM 122, writing to the weight RAM 124, writing to the program memory 129, writing to the control register 127, starting execution of a program in the program memory 129, pausing the execution of a program in the program memory 129, requesting notification (e.g., an interrupt) of completion of the execution of a program in the program memory 129, and resetting the neural network unit 121. For a preferred embodiment, the NNU instruction set includes an instruction whose result indicates that the NNU program has completed. Alternatively, the NNU instruction set includes an explicit generate interrupt instruction. For a preferred embodiment, resetting the neural network unit 121 effectively forces the neural network unit 121 back to a reset state (e.g., internal state machines are cleared and set to an idle state), except that the contents of the data RAM 122, weight RAM 124 and program memory 129 remain intact. Additionally, internal registers such as the accumulators 202 are not affected by the reset function and must be cleared explicitly, e.g., by the initialize NPU instruction at address 0 of Figure 4. In one embodiment, the function 1432 may include a direct execution function in which the first source register contains a micro-operation (see, for example, micro-operation 3418 of Figure 34). The direct execution function instructs the neural network unit 121 to directly execute the specified micro-operation. This enables an architectural program to directly control the neural network unit 121 to perform operations, rather than writing instructions to the program memory 129 and subsequently instructing the neural network unit 121 to execute the instructions in the program memory 129, or rather than doing so through the execution of MTNN instructions 1400 (or MFNN instructions 1500 of Figure 15). Figure 14 illustrates an example of the function of writing to the data RAM 122.

The gpr field specifies one of the general purpose registers in the general purpose register file 116. In one embodiment, each general purpose register is 64 bits. The general purpose register file 116 provides the value from the selected general purpose register to the neural network unit 121, as shown, which the neural network unit 121 uses as an address 1422. The address 1422 selects a row of the memory specified in the function 1432. In the case of the data RAM 122 or weight RAM 124, the address 1422 additionally selects a chunk within the selected row that is twice the size of a media register location (e.g., 512 bits). For a preferred embodiment, the location is on a 512-bit boundary. In one embodiment, a multiplexer selects either the address 1422 (or the address 1522 in the case of the MFNN instruction 1500 described below) or the address 123/125/131 from the sequencer 128 for provision to the data RAM 122/weight RAM 124/program memory 129. In one embodiment, the data RAM 122 is dual-ported to allow the NPUs 126 to read/write the data RAM 122 concurrently with the media registers 118 reading/writing the data RAM 122. In one embodiment, the weight RAM 124 is also dual-ported for a similar purpose.

The src1 field 1404 and src2 field 1406 each specify a media register in the media register file 118. In one embodiment, each media register 118 is 256 bits. The media register file 118 provides the concatenated data (e.g., 512 bits) from the selected media registers to the data RAM 122 (or weight RAM 124 or program memory 129) for writing into the selected row 1428 specified by the address 1422 and into the location within the selected row 1428 specified by the address 1422, as shown. Advantageously, by executing a series of MTNN instructions 1400 (and MFNN instructions 1500 described below), an architectural program executing on the processor 100 can populate rows of the data RAM 122 and rows of the weight RAM 124 and write a program to the program memory 129, such as the programs described herein (e.g., those of Figures 4 and 9), to cause the neural network unit 121 to perform operations on the data and weights at extremely high speeds in order to accomplish the artificial neural network. In one embodiment, the architectural program directly controls the neural network unit 121 rather than writing a program into the program memory 129.

In one embodiment, rather than specifying two source registers (as by fields 1404 and 1406), the MTNN instruction 1400 specifies a start source register and a number of source registers, Q. This form of the MTNN instruction 1400 instructs the processor 100 to write the media register 118 specified as the start source register and the next Q-1 sequential media registers 118 to the neural network unit 121, i.e., to the specified data RAM 122 or weight RAM 124. For a preferred embodiment, the instruction translator 104 translates the MTNN instruction 1400 into as many microinstructions as are needed to write all Q of the specified media registers 118. For example, in one embodiment, when the MTNN instruction 1400 specifies register MR4 as the start source register and Q is 8, the instruction translator 104 translates the MTNN instruction 1400 into four microinstructions, the first of which writes registers MR4 and MR5, the second of which writes registers MR6 and MR7, the third of which writes registers MR8 and MR9, and the fourth of which writes registers MR10 and MR11. In an alternative embodiment in which the data path from the media registers 118 to the neural network unit 121 is 1024 bits rather than 512 bits, the instruction translator 104 translates the MTNN instruction 1400 into two microinstructions, the first of which writes registers MR4 through MR7 and the second of which writes registers MR8 through MR11. A similar embodiment is contemplated in which the MFNN instruction 1500 specifies a start destination register and a number of destination registers, so that each MFNN instruction 1500 can read a chunk of a row of the data RAM 122 or weight RAM 124 that is larger than a single media register 118.
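The expansion into microinstructions can be sketched as follows. The helper below is purely illustrative (its name and interface are not from the patent); it assumes each microinstruction carries as many 256-bit media registers as fit in the NNU data path (two for a 512-bit path, four for a 1024-bit path).

    def expand_mtnn(start_reg=4, q=8, regs_per_uop=2):
        # Group the Q sequential media registers into one tuple per microinstruction.
        assert q % regs_per_uop == 0
        return [tuple(f"MR{start_reg + i + k}" for k in range(regs_per_uop))
                for i in range(0, q, regs_per_uop)]

    print(expand_mtnn())                  # [('MR4','MR5'), ('MR6','MR7'), ('MR8','MR9'), ('MR10','MR11')]
    print(expand_mtnn(regs_per_uop=4))    # 1024-bit path: [('MR4','MR5','MR6','MR7'), ('MR8','MR9','MR10','MR11')]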

FIG. 15 is a block diagram illustrating a move from neural network (MFNN) architectural instruction 1500 and its operation with respect to portions of the neural network unit 121 of Figure 1. The MFNN instruction 1500 includes an opcode field 1502, a dst field 1504, a gpr field 1508 and an immediate field 1512. The MFNN instruction is an architectural instruction, i.e., it is included in the instruction set architecture of the processor 100. For a preferred embodiment, the instruction set architecture associates a predetermined value of the opcode field 1502 with the MFNN instruction 1500 to distinguish it from the other instructions of the instruction set architecture. The opcode 1502 of the MFNN instruction 1500 may or may not include a prefix, such as is common in the x86 architecture.

The immediate field 1512 provides a value that specifies a function 1532 to the control logic 1434 of the neural network unit 121. For a preferred embodiment, the function 1532 is provided as an immediate operand of the microinstruction 105 of Figure 1. The functions 1532 that the neural network unit 121 may perform include, but are not limited to, reading from the data RAM 122, reading from the weight RAM 124, reading from the program memory 129, and reading from the status register 127. The example of Figure 15 illustrates the function 1532 of reading from the data RAM 122.

The gpr field 1508 specifies one of the general purpose registers in the general purpose register file 116. The general purpose register file 116 provides the value from the selected general purpose register to the neural network unit 121, as shown, which the neural network unit 121 uses as an address 1522 that operates in a manner similar to the address 1422 of Figure 14 to select a row of the memory specified in the function 1532. In the case of the data RAM 122 or weight RAM 124, the address 1522 additionally selects a chunk within the selected row whose size is that of a media register location (e.g., 256 bits). For a preferred embodiment, the location is on a 256-bit boundary.

The dst field 1504 specifies a media register in the media register file 118. As shown, the media register file 118 receives the data (e.g., 256 bits) from the data RAM 122 (or weight RAM 124 or program memory 129) into the selected media register; the data is read from the selected row 1528 specified by the address 1522 and from the location within the selected row 1528 specified by the address 1522.

Port Configuration of the Neural Network Unit Internal RAMs

FIG. 16 is a block diagram illustrating an embodiment of the data RAM 122 of Figure 1. The data RAM 122 includes a memory array 1606, a read port 1602 and a write port 1604. The memory array 1606 holds the data words and, for a preferred embodiment, is arranged as D rows of N words each, as described above. In one embodiment, the memory array 1606 comprises an array of 64 horizontally arranged static RAM cells, each of which is 128 bits wide and 64 tall, providing a 64 KB data RAM 122 that is 8192 bits wide and has 64 rows, and that occupies approximately 0.2 square millimeters of die area. However, the present invention is not limited to these dimensions.

For a preferred embodiment, the read port 1602 is coupled, preferably in a multiplexed fashion, to the NPUs 126 and to the media registers 118. More specifically, the media registers 118 may be coupled to the read port via a result bus that is also used to provide data to the reorder buffer and/or to the result forwarding buses to the other execution units 112. The NPUs 126 and the media registers 118 share the read port 1602 to read the data RAM 122. Also, for a preferred embodiment, the write port 1604 is coupled, likewise preferably in a multiplexed fashion, to the NPUs 126 and to the media registers 118. The NPUs 126 and the media registers 118 share the write port 1604 to write the data RAM 122. Thus, advantageously, the media registers 118 can write to the data RAM 122 while the NPUs 126 are reading from the data RAM 122, and the NPUs 126 can write to the data RAM 122 while the media registers 118 are reading from the data RAM 122. This may provide improved performance. For example, the NPUs 126 can read the data RAM 122 (e.g., to continue performing computations) while the media registers 118 write more data words into the data RAM 122. As another example, the NPUs 126 can write computation results to the data RAM 122 while the media registers 118 read computation results from the data RAM 122. In one embodiment, an NPU 126 can write a row of computation results to the data RAM 122 while also reading a row of data words from the data RAM 122. In one embodiment, the memory array 1606 is configured into banks. When the NPUs 126 access the data RAM 122, all the banks are enabled to access an entire row of the memory array 1606; whereas when the media registers 118 access the data RAM 122, only the specified banks are enabled. In one embodiment, each bank is 128 bits wide and the media registers 118 are 256 bits wide; hence, for example, two banks are enabled per media register 118 access. In one embodiment, one of the ports 1602/1604 is a read/write port. In one embodiment, both of the ports 1602/1604 are read/write ports.
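A small sketch may clarify the banked access policy. The snippet below is an illustrative model only; the bank count and helper name are assumptions derived from the example dimensions above (an 8192-bit row split into 64 banks of 128 bits, with a 256-bit media register access touching exactly two banks).

    def banks_enabled(accessor, start_bit=0, bank_bits=128, row_bits=8192):
        # NPU accesses enable every bank (a whole row); a 256-bit media register
        # access enables only the two banks covering its naturally aligned chunk.
        if accessor == "npu":
            return list(range(row_bits // bank_bits))
        first = start_bit // bank_bits
        return [first, first + 1]

    print(len(banks_enabled("npu")))       # 64
    print(banks_enabled("media", 512))     # [4, 5]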

An advantage of endowing the NPUs 126 with the rotator capability described herein is that it helps reduce the number of rows of the memory array 1606 of the data RAM 122, and therefore its size, relative to the memory array that would otherwise be needed to keep the NPUs 126 highly utilized, which would require the architectural program (via the media registers 118) to continually supply data to the data RAM 122 and retrieve results from it while the NPUs 126 are performing their computations.

Internal Random Access Memory Buffer

FIG. 17 is a block diagram illustrating an embodiment of the weight RAM 124 and buffer 1704 of Figure 1. The weight RAM 124 includes a memory array 1706 and a port 1702. The memory array 1706 holds the weight words and, for a preferred embodiment, is arranged as W rows of N words each, as described above. In one embodiment, the memory array 1706 comprises an array of 128 horizontally arranged static RAM cells, each of which is 64 bits wide and 2048 tall, providing a 2 MB weight RAM 124 that is 8192 bits wide and has 2048 rows, and that occupies approximately 2.4 square millimeters of die area. However, the present invention is not limited to these dimensions.

For a preferred embodiment, the port 1702 is coupled, preferably in a multiplexed fashion, to the NPUs 126 and to the buffer 1704. The NPUs 126 and the buffer 1704 read from and write to the weight RAM 124 through the port 1702. The buffer 1704 is also coupled to the media registers 118 of Figure 1, so that the media registers 118 read from and write to the weight RAM 124 through the buffer 1704. Advantageously, the media registers 118 can be writing to or reading from the buffer 1704 while the NPUs 126 are reading from or writing to the weight RAM 124 (although, if the NPUs 126 are currently executing, they are preferably stalled to avoid accessing the weight RAM 124 while the buffer 1704 is accessing the weight RAM 124). This may improve performance, particularly because the reads and writes of the weight RAM 124 by the media registers 118 are relatively much smaller than the reads and writes of the weight RAM 124 by the NPUs 126. For example, in one embodiment, the NPUs 126 read/write 8192 bits (one row) at a time, whereas the media registers 118 are 256 bits wide and each MTNN instruction 1400 writes only two media registers 118, i.e., 512 bits. Thus, in a case in which the architectural program executes sixteen MTNN instructions 1400 to fill the buffer 1704, a conflict between the NPUs 126 and the architectural program for access to the weight RAM 124 occurs less than approximately six percent of the time. In an alternative embodiment, the instruction translator 104 translates an MTNN instruction 1400 into two microinstructions 105, each of which writes a single media register 118 to the buffer 1704, in which case conflicts between the NPUs 126 and the architectural program for access to the weight RAM 124 are reduced even further.
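The six percent figure follows directly from the sizes involved. The snippet below is only a rough check under the stated assumptions (an 8192-bit buffer filled by sixteen 512-bit MTNN writes, with only the final transfer into the weight RAM able to collide with an NPU access).

    mtnn_writes_per_row = 8192 // 512     # sixteen MTNN instructions 1400 fill buffer 1704
    ram_accesses_per_row = 1              # one write of the buffer into the weight RAM 124
    conflict_share = ram_accesses_per_row / mtnn_writes_per_row
    print(f"worst-case conflict share ≈ {conflict_share:.1%}")   # ≈ 6.3%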

In embodiments that include the buffer 1704, writing to the weight RAM 124 by an architectural program requires multiple MTNN instructions 1400. One or more MTNN instructions 1400 specify a function 1432 to write specified chunks of the buffer 1704, followed by an MTNN instruction 1400 that specifies a function 1432 instructing the neural network unit 121 to write the contents of the buffer 1704 to a selected row of the weight RAM 124. The size of a single chunk is twice the number of bits of a media register 118, and the chunks are naturally aligned within the buffer 1704. In one embodiment, each MTNN instruction 1400 that specifies a function 1432 to write specified chunks of the buffer 1704 includes a bitmask having a bit corresponding to each chunk of the buffer 1704. The data from the two specified source registers 118 is written to each chunk of the buffer 1704 whose corresponding bit in the bitmask is set. This embodiment is useful when repeated data values occur within a row of the weight RAM 124. For example, to zero out the buffer 1704 (and subsequently a row of the weight RAM 124), the programmer can load the source registers with zero and set all the bits of the bitmask. Additionally, the bitmask enables the programmer to write only selected chunks of the buffer 1704, leaving the other chunks with their previous data values.
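The bitmask semantics can be modeled in a few lines. The sketch below is illustrative only (the helper and its arguments are assumptions, not the instruction encoding): each masked write replaces exactly those chunks of the buffer whose mask bit is set and leaves the rest untouched.

    def masked_buffer_write(buffer_chunks, data, bitmask):
        # Write 'data' into every chunk whose bit in 'bitmask' is set.
        return [data if (bitmask >> i) & 1 else old
                for i, old in enumerate(buffer_chunks)]

    buf = ["old"] * 16                              # 16 chunks of 512 bits = one 8192-bit row
    buf = masked_buffer_write(buf, 0, 0xFFFF)       # zero the whole buffer with one write
    buf = masked_buffer_write(buf, "new", 0b0101)   # then update only chunks 0 and 2
    print(buf[:4])                                  # ['new', 0, 'new', 0]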

In embodiments that include the buffer 1704, reading the weight RAM 124 by an architectural program requires multiple MFNN instructions 1500. An initial MFNN instruction 1500 specifies a function 1532 to load a specified row of the weight RAM 124 into the buffer 1704, followed by one or more MFNN instructions 1500 that specify a function 1532 to read a specified chunk of the buffer 1704 into a destination register. The size of a single chunk is the number of bits of a media register 118, and the chunks are naturally aligned within the buffer 1704. Other embodiments are also contemplated, such as ones in which the weight RAM 124 has multiple buffers 1704 in order to further reduce conflicts between the NPUs 126 and the architectural program for access to the weight RAM 124 by increasing the number of accesses the architectural program can make while the NPUs 126 are executing, which may increase the likelihood that the buffer accesses can be performed during clock cycles in which the NPUs 126 do not need to access the weight RAM 124.

Although Figure 16 describes a dual-ported data RAM 122, the present invention is not limited thereto. Other embodiments are also contemplated in which the weight RAM 124 is likewise dual-ported. Furthermore, although Figure 17 describes a buffer used in conjunction with the weight RAM 124, the present invention is not limited thereto; embodiments are also contemplated in which the data RAM 122 has a corresponding buffer similar to the buffer 1704.

Dynamically Configurable Neural Processing Units

FIG. 18 is a block diagram illustrating a dynamically configurable neural processing unit 126 of Figure 1. The NPU 126 of Figure 18 is similar to the NPU 126 of Figure 2. However, the NPU 126 of Figure 18 is dynamically configurable to operate in one of two different configurations. In the first configuration, the NPU 126 of Figure 18 operates similarly to the NPU 126 of Figure 2. That is, in the first configuration, referred to herein as the "wide" or "single" configuration, the arithmetic logic unit 204 of the NPU 126 performs operations on a single wide data word and a single wide weight word (e.g., 16 bits) to generate a single wide result. In contrast, in the second configuration, referred to herein as the "narrow" or "dual" configuration, the NPU 126 performs operations on two narrow data words and two respective narrow weight words (e.g., 8 bits) to generate two respective narrow results. In one embodiment, the configuration (wide or narrow) of the NPU 126 is established by the initialize NPU instruction (e.g., the instruction at address 0 of Figure 20, described below). Alternatively, the configuration may be established by an MTNN instruction whose function 1432 specifies setting the NPU configuration (wide or narrow). For a preferred embodiment, configuration registers are populated by the program memory 129 instruction or the MTNN instruction that determines the configuration (wide or narrow). For example, the configuration register outputs are provided to the arithmetic logic unit 204, to the activation function unit 212, and to the logic that generates the mux-reg control signal 213. Generally speaking, the elements of the NPU 126 of Figure 18 perform functions similar to their like-numbered elements of Figure 2, and reference should be made thereto for an understanding of the embodiment of Figure 18. The embodiment of Figure 18, including its differences from Figure 2, is described below.

The NPU 126 of Figure 18 includes two registers 205A and 205B, two three-input mux-regs 208A and 208B, an arithmetic logic unit 204, two accumulators 202A and 202B, and two activation function units 212A and 212B. The registers 205A/205B are each half the width (e.g., 8 bits) of the register 205 of Figure 2. The registers 205A/205B each receive a respective narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124 and provide their outputs 203A/203B on a subsequent clock cycle to the operand selection logic 1898 of the arithmetic logic unit 204. When the NPU 126 is in the wide configuration, the registers 205A/205B effectively operate together to receive a wide weight word 206A/206B (e.g., 16 bits) from the weight RAM 124, similar to the register 205 of the embodiment of Figure 2; whereas when the NPU 126 is in the narrow configuration, the registers 205A/205B effectively operate separately, each receiving a narrow weight word 206A/206B (e.g., 8 bits) from the weight RAM 124, such that the NPU 126 is effectively two separate narrow NPUs operating independently. Nevertheless, the same output bits of the weight RAM 124 are coupled and provided to the registers 205A/205B regardless of the configuration of the NPU 126. For example, the register 205A of NPU 0 receives byte 0, the register 205B of NPU 0 receives byte 1, the register 205A of NPU 1 receives byte 2, the register 205B of NPU 1 receives byte 3, and so forth, such that the register 205B of NPU 511 receives byte 1023.

The mux-regs 208A/208B are each half the width (e.g., 8 bits) of the register 208 of Figure 2. The mux-reg 208A selects one of its inputs 207A, 211A and 1811A to store in its register and provide it on its output 209A on a subsequent clock cycle, and the mux-reg 208B selects one of its inputs 207B, 211B and 1811B to store in its register and provide it on its output 209B on a subsequent clock cycle to the operand selection logic 1898. The input 207A receives a narrow data word (e.g., 8 bits) from the data RAM 122, and the input 207B receives a narrow data word from the data RAM 122. When the NPU 126 is in the wide configuration, the mux-regs 208A/208B effectively operate together to receive a wide data word 207A/207B (e.g., 16 bits) from the data RAM 122, similar to the mux-reg 208 of the embodiment of Figure 2; whereas when the NPU 126 is in the narrow configuration, the mux-regs 208A/208B effectively operate separately, each receiving a narrow data word 207A/207B (e.g., 8 bits) from the data RAM 122, such that the NPU 126 is effectively two separate narrow NPUs operating independently. Nevertheless, the same output bits of the data RAM 122 are coupled and provided to the mux-regs 208A/208B regardless of the configuration of the NPU 126. For example, the mux-reg 208A of NPU 0 receives byte 0, the mux-reg 208B of NPU 0 receives byte 1, the mux-reg 208A of NPU 1 receives byte 2, the mux-reg 208B of NPU 1 receives byte 3, and so forth, such that the mux-reg 208B of NPU 511 receives byte 1023.
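The fixed byte-to-register wiring described above amounts to a simple even/odd mapping, shown in the illustrative helper below (an assumption-level restatement of the text, not hardware).

    def mux_reg_for_byte(b):
        # Byte b of a data RAM row feeds NPU b//2: mux-reg 208A for even bytes,
        # mux-reg 208B for odd bytes, in both the wide and narrow configurations.
        return (b // 2, "208A" if b % 2 == 0 else "208B")

    print(mux_reg_for_byte(0))      # (0, '208A')
    print(mux_reg_for_byte(3))      # (1, '208B')
    print(mux_reg_for_byte(1023))   # (511, '208B')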

The input 211A receives the output 209A of the mux-reg 208A of the adjacent NPU 126, and the input 211B receives the output 209B of the mux-reg 208B of the adjacent NPU 126. The input 1811A receives the output 209B of the mux-reg 208B of the adjacent NPU 126, and the input 1811B receives the output 209A of the mux-reg 208A of the adjacent NPU 126. The NPU 126 shown in FIG. 18 is one of the N NPUs 126 of FIG. 1 and is denoted NPU J; that is, NPU J is a representative instance of the N NPUs. In a preferred embodiment, the mux-reg 208A input 211A of NPU J receives the mux-reg 208A output 209A of NPU 126 instance J-1, the mux-reg 208A input 1811A of NPU J receives the mux-reg 208B output 209B of NPU 126 instance J-1, and the mux-reg 208A output 209A of NPU J is provided both to the mux-reg 208A input 211A of NPU 126 instance J+1 and to the mux-reg 208B input 1811B of NPU 126 instance J; the mux-reg 208B input 211B of NPU J receives the mux-reg 208B output 209B of NPU 126 instance J-1, the mux-reg 208B input 1811B of NPU J receives the mux-reg 208A output 209A of NPU 126 instance J, and the mux-reg 208B output 209B of NPU J is provided both to the mux-reg 208A input 1811A of NPU 126 instance J+1 and to the mux-reg 208B input 211B of NPU 126 instance J+1.
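
To make the ring structure of this wiring easier to follow, the sketch below lists, for a generic NPU J out of N, where each mux-reg input described above is driven from. The helper name and the tuple encoding are mine, not the patent's; it is only a restatement of the connections in the preceding paragraph.

def muxreg_sources(j, n=512):
    prev = (j - 1) % n   # instance J-1, wrapping so NPU 0 connects back to NPU N-1
    return {
        ("208A", "211A"):  (prev, "208A", "209A"),   # wide-rotate path, A lane
        ("208A", "1811A"): (prev, "208B", "209B"),   # narrow-rotate path into A
        ("208B", "211B"):  (prev, "208B", "209B"),   # wide-rotate path, B lane
        ("208B", "1811B"): (j,    "208A", "209A"),   # narrow-rotate path into B
    }

# NPU 1: its mux-reg 208A input 1811A comes from NPU 0's mux-reg 208B output 209B, etc.
print(muxreg_sources(1))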

The control input 213 controls each of the mux-regs 208A/208B to select one of its three inputs to store in its register and subsequently provide on its corresponding output 209A/209B. When the NPU 126 is instructed to load a row from the data RAM 122 (e.g., by the multiply-accumulate instruction at address 1 of FIG. 20, described below), regardless of whether the NPU 126 is in the wide or the narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select its corresponding narrow data word 207A/207B (e.g., 8 bits) from the corresponding narrow word of the selected row of the data RAM 122.

When the NPU 126 receives an indication that it needs to rotate the previously received data row values (e.g., by the multiply-accumulate rotate instruction at address 2 of FIG. 20, described below), if the NPU 126 is in the narrow configuration, the control input 213 controls each of the mux-regs 208A/208B to select its corresponding input 1811A/1811B. In that case, the mux-regs 208A/208B effectively operate independently, so that the NPU 126 effectively behaves as two independent narrow NPUs. In this way, the mux-regs 208A and 208B of the N NPUs 126 collectively operate as a 2N-narrow-word rotator, as described in more detail below with respect to FIG. 19.

When the NPU 126 receives an indication that it needs to rotate the previously received data row values, if the NPU 126 is in the wide configuration, the control input 213 controls each of the mux-regs 208A/208B to select its corresponding input 211A/211B. In that case, the mux-regs 208A/208B operate together, so that the NPU 126 effectively behaves as a single wide NPU 126. In this way, the mux-regs 208A and 208B of the N NPUs 126 collectively operate as an N-wide-word rotator, in a manner similar to that described with respect to FIG. 3.
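
A small simulation, offered only for intuition under the assumptions stated in the two preceding paragraphs, of one rotate step of the 2N narrow mux-regs, flattened as positions 0-A, 0-B, 1-A, 1-B, ..., (N-1)-B: in the narrow configuration every narrow word advances one position (inputs 1811A/1811B), while in the wide configuration words advance two positions, i.e., one wide word (inputs 211A/211B).

def rotate_step(words, narrow):
    # narrow config: rotate by one narrow word; wide config: rotate by two narrow words
    k = 1 if narrow else 2
    return words[-k:] + words[:-k]

N = 512
words = list(range(2 * N))                 # word j was loaded from data-RAM column j
after = rotate_step(words, narrow=True)
assert after[0] == 1023 and after[1] == 0  # matches NPU 0-A / 0-B at clock 2 of FIG. 21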

The ALU 204 includes the operand selection logic 1898, a wide multiplier 242A, a narrow multiplier 242B, a wide 2-input multiplexer 1896A, a narrow 2-input multiplexer 1896B, a wide adder 244A and a narrow adder 244B. Effectively, the ALU 204 may be regarded as comprising the operand selection logic, a wide ALU 204A (comprising the wide multiplier 242A, the wide multiplexer 1896A and the wide adder 244A) and a narrow ALU 204B (comprising the narrow multiplier 242B, the narrow multiplexer 1896B and the narrow adder 244B). In a preferred embodiment, the wide multiplier 242A multiplies two wide words, similar to the multiplier 242 of FIG. 2, e.g., a 16-bit by 16-bit multiplier. The narrow multiplier 242B multiplies two narrow words, e.g., an 8-bit by 8-bit multiplier that produces a 16-bit result. When the NPU 126 is in the narrow configuration, the wide multiplier 242A, with the help of the operand selection logic 1898, is fully utilized as a narrow multiplier to multiply two narrow words, so that the NPU 126 functions as two effective narrow NPUs. In a preferred embodiment, the wide adder 244A adds the output of the wide multiplexer 1896A and the wide accumulator 202A output 217A to produce a sum 215A for the wide accumulator 202A, and its operation is similar to that of the adder 244 of FIG. 2. The narrow adder 244B adds the output of the narrow multiplexer 1896B and the narrow accumulator 202B output 217B to produce a sum 215B for the narrow accumulator 202B. In one embodiment, the narrow accumulator 202B is 28 bits wide to avoid loss of precision when accumulating up to 1024 16-bit products. When the NPU 126 is in the wide configuration, the narrow multiplier 242B, the narrow accumulator 202B and the narrow AFU 212B are preferably inactive to reduce power consumption.

The operand selection logic 1898 selects operands from 209A, 209B, 203A and 203B to provide to the other elements of the ALU 204, as described in more detail below. In a preferred embodiment, the operand selection logic 1898 also performs other functions, such as sign extension of signed data words and weight words. For example, if the NPU 126 is in the narrow configuration, the operand selection logic 1898 sign-extends the narrow data word and the narrow weight word to the width of a wide word before providing them to the wide multiplier 242A. Similarly, if the ALU 204 is instructed to pass through a narrow data/weight word (bypassing the wide multiplier 242A via the wide multiplexer 1896A), the operand selection logic 1898 sign-extends the narrow data/weight word to the width of a wide word before providing it to the wide adder 244A. In a preferred embodiment, logic for performing this sign-extension function is also present in the ALU 204 of the NPU 126 of FIG. 2.

The wide multiplexer 1896A receives the output of the wide multiplier 242A and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the wide adder 244A, and the narrow multiplexer 1896B receives the output of the narrow multiplier 242B and an operand from the operand selection logic 1898 and selects one of these inputs to provide to the narrow adder 244B.

The operand selection logic 1898 provides operands depending on the configuration of the NPU 126 and on the arithmetic and/or logical operation to be performed by the ALU 204, which is determined by the function specified by the instruction being executed by the NPU 126. For example, if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in the wide configuration, the operand selection logic 1898 provides a wide word that is the concatenation of outputs 209A and 209B to one input of the wide multiplier 242A and a wide word that is the concatenation of outputs 203A and 203B to the other input, and the narrow multiplier 242B is inactive, so that the NPU 126 operates as a single wide NPU 126 similar to the NPU 126 of FIG. 2. However, if the instruction instructs the ALU 204 to perform a multiply-accumulate and the NPU 126 is in the narrow configuration, the operand selection logic 1898 provides an extended, or widened, version of the narrow data word 209A to one input of the wide multiplier 242A and an extended version of the narrow weight word 203A to the other input; additionally, the operand selection logic 1898 provides the narrow data word 209B to one input of the narrow multiplier 242B and the narrow weight word 203B to the other input. To extend, or widen, a narrow word as described above, if the narrow word is signed the operand selection logic 1898 sign-extends it, whereas if the narrow word is unsigned the operand selection logic 1898 pads it with upper zero-valued bits.
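
A minimal sketch of the operand-formation choices just described, in my own phrasing rather than as the actual operand selection logic 1898: in the wide configuration two narrow words are concatenated into a wide word, and in the narrow configuration a narrow word is sign-extended (if signed) or zero-extended (if unsigned) before feeding the wide path. Which byte is the low byte of the concatenation is an assumption of this sketch.

def to_wide(narrow_a, narrow_b):
    """Concatenate two 8-bit words into one 16-bit word (A taken as the low byte here)."""
    return ((narrow_b & 0xFF) << 8) | (narrow_a & 0xFF)

def extend(narrow, signed):
    """Extend an 8-bit word to 16 bits for use by the wide multiplier/adder."""
    if signed and (narrow & 0x80):
        return narrow | 0xFF00        # sign-extend: replicate the sign bit upward
    return narrow & 0x00FF            # zero-extend: pad with upper zero bits

assert extend(0xFE, signed=True) == 0xFFFE    # -2 remains -2 as a 16-bit value
assert extend(0xFE, signed=False) == 0x00FE   # 254 remains 254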

For another example, if the NPU 126 is in the wide configuration and the instruction instructs the ALU 204 to perform a weight word accumulate, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 203A and 203B to the wide multiplexer 1896A for provision to the wide adder 244A. However, if the NPU 126 is in the narrow configuration and the instruction instructs the ALU 204 to perform a weight word accumulate, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 203A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 203B to the narrow multiplexer 1896B for provision to the narrow adder 244B.

For another example, if the NPU 126 is in the wide configuration and the instruction instructs the ALU 204 to perform a data word accumulate, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides the concatenation of outputs 209A and 209B to the wide multiplexer 1896A for provision to the wide adder 244A. However, if the NPU 126 is in the narrow configuration and the instruction instructs the ALU 204 to perform a data word accumulate, the wide multiplier 242A is bypassed and the operand selection logic 1898 provides an extended version of the output 209A to the wide multiplexer 1896A for provision to the wide adder 244A; additionally, the narrow multiplier 242B is bypassed and the operand selection logic 1898 provides an extended version of the output 209B to the narrow multiplexer 1896B for provision to the narrow adder 244B. The accumulation of weight/data words is useful for performing averaging operations, which may be used in the pooling layers of some artificial neural network applications, such as image processing.

In a preferred embodiment, the NPU 126 also includes a second wide multiplexer (not shown) for bypassing the wide adder 244A, to facilitate loading the wide accumulator 202A with a wide data/weight word in the wide configuration or an extended narrow data/weight word in the narrow configuration, and a second narrow multiplexer (not shown) for bypassing the narrow adder 244B, to facilitate loading the narrow accumulator 202B with a narrow data/weight word in the narrow configuration. In a preferred embodiment, the ALU 204 also includes wide and narrow comparator/multiplexer combinations (not shown) that receive the corresponding accumulator value 217A/217B and the corresponding multiplexer 1896A/1896B output in order to select the maximum of the accumulator value 217A/217B and a data/weight word 209A/209B/203A/203B, an operation used by the pooling layers of some artificial neural network applications, as described in more detail below, e.g., with respect to FIGS. 27 and 28. Additionally, the operand selection logic 1898 is configured to provide zero-valued operands (for addition with zero or for clearing the accumulator) and one-valued operands (for multiplication by one).

The narrow AFU 212B receives the output 217B of the narrow accumulator 202B and performs an activation function on it to produce a narrow result 133B, and the wide AFU 212A receives the output 217A of the wide accumulator 202A and performs an activation function on it to produce a wide result 133A. When the NPU 126 is in the narrow configuration, the wide AFU 212A interprets the output 217A of the accumulator 202A accordingly and performs an activation function on it to produce a narrow result, e.g., 8 bits, as described in more detail below, e.g., with respect to FIGS. 29A through 30.

As described above, a single NPU 126 in the narrow configuration effectively operates as two narrow NPUs and can therefore, for smaller words, provide roughly up to twice the throughput of the wide configuration. For example, assume a neural network layer with 1024 neurons, each receiving 1024 narrow inputs from the previous layer (and having narrow weight words), yielding one million connections. For an NNU 121 with 512 NPUs 126, in the narrow configuration (corresponding to 1024 narrow NPUs), although narrow rather than wide words are processed, four times the number of connections of the wide configuration can be handled (one million connections versus 256K connections) in approximately twice the time (approximately 1026 clock cycles versus 514 clock cycles).
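
A back-of-the-envelope check of the figures quoted above, offered purely as an illustration (the cycle counts are the approximate values from the text, not measured numbers):

npus = 512
wide_connections   = 512 * 512          # 512 wide neurons x 512 wide inputs = 256K links
narrow_connections = 1024 * 1024        # 1024 narrow neurons x 1024 narrow inputs = 1M links
wide_clocks, narrow_clocks = 514, 1026  # approximate cycle counts from the text

assert narrow_connections == 4 * wide_connections
print(narrow_connections / narrow_clocks)   # roughly 1022 narrow MACs per clock
print(wide_connections / wide_clocks)       # roughly 510 wide MACs per clock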

In one embodiment, the dynamically configurable NPU 126 of FIG. 18 includes 3-input mux-regs similar to the mux-regs 208A and 208B in place of the registers 205A and 205B, to form a rotator that handles rows of weight words received from the weight RAM 124, in a manner partly similar to that described for the embodiment of FIG. 7 but applied to the dynamic configuration described with respect to FIG. 18.

FIG. 19 is a block diagram illustrating, according to the embodiment of FIG. 18, the use of the 2N mux-regs 208A/208B of the N NPUs 126 of the NNU 121 of FIG. 1 to operate as a rotator on a row of data words 207 obtained from the data RAM 122 of FIG. 1. In the embodiment of FIG. 19, N is 512, and the NNU 121 has 1024 mux-regs 208A/208B, corresponding to the 512 NPUs 126, which are denoted 0 through 511, and effectively to 1024 narrow NPUs. The two narrow NPUs within an NPU 126 are denoted A and B, and for each of the mux-regs 208 the corresponding narrow NPU is also indicated. More specifically, the mux-reg 208A of the NPU 126 denoted 0 is denoted 0-A, the mux-reg 208B of the NPU 126 denoted 0 is denoted 0-B, the mux-reg 208A of the NPU 126 denoted 1 is denoted 1-A, the mux-reg 208B of the NPU 126 denoted 1 is denoted 1-B, the mux-reg 208A of the NPU 126 denoted 511 is denoted 511-A, and the mux-reg 208B of the NPU 126 denoted 511 is denoted 511-B, which values also correspond to the narrow NPUs described below with respect to FIG. 21.

Each mux-reg 208A receives its corresponding narrow data word 207A of one of the D rows of the data RAM 122, and each mux-reg 208B receives its corresponding narrow data word 207B of one of the D rows of the data RAM 122. That is, mux-reg 0-A receives narrow data word 0 of the data RAM 122 row, mux-reg 0-B receives narrow data word 1 of the data RAM 122 row, mux-reg 1-A receives narrow data word 2 of the data RAM 122 row, mux-reg 1-B receives narrow data word 3 of the data RAM 122 row, and so forth, until mux-reg 511-A receives narrow data word 1022 of the data RAM 122 row and mux-reg 511-B receives narrow data word 1023 of the data RAM 122 row. Additionally, mux-reg 1-A receives the output 209A of mux-reg 0-A on its input 211A, mux-reg 1-B receives the output 209B of mux-reg 0-B on its input 211B, and so forth, until mux-reg 511-A receives the output 209A of mux-reg 510-A on its input 211A and mux-reg 511-B receives the output 209B of mux-reg 510-B on its input 211B, and mux-reg 0-A receives the output 209A of mux-reg 511-A on its input 211A and mux-reg 0-B receives the output 209B of mux-reg 511-B on its input 211B. Finally, mux-reg 1-A receives the output 209B of mux-reg 0-B on its input 1811A, mux-reg 1-B receives the output 209A of mux-reg 1-A on its input 1811B, and so forth, until mux-reg 511-A receives the output 209B of mux-reg 510-B on its input 1811A and mux-reg 511-B receives the output 209A of mux-reg 511-A on its input 1811B, and mux-reg 0-A receives the output 209B of mux-reg 511-B on its input 1811A and mux-reg 0-B receives the output 209A of mux-reg 0-A on its input 1811B. Each of the mux-regs 208A/208B receives the control input 213, which controls whether it selects its data word 207A/207B, its rotated input 211A/211B, or its rotated input 1811A/1811B. In one mode of operation, on a first clock cycle the control input 213 controls each of the mux-regs 208A/208B to select its data word 207A/207B for storage in the register and subsequent provision to the ALU 204; and on subsequent clock cycles (e.g., the M-1 clock cycles described above), the control input 213 controls each of the mux-regs 208A/208B to select its rotated input 1811A/1811B for storage in the register and subsequent provision to the ALU 204, as described in more detail below.

FIG. 20 is a table illustrating a program for storage in the program memory 129 of, and execution by, the NNU 121 of FIG. 1 having NPUs 126 according to the embodiment of FIG. 18. The example program of FIG. 20 is similar to the program of FIG. 4; the differences are described below. The initialize NPU instruction at address 0 specifies that the NPUs 126 are to be in the narrow configuration. Additionally, as shown, the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and requires 1023 clock cycles. This is because the example of FIG. 20 assumes a layer of effectively 1024 narrow (e.g., 8-bit) neurons (i.e., narrow NPUs), each having 1024 connection inputs from the 1024 neurons of the previous layer, for a total of 1024K connections. Each neuron receives an 8-bit data value from each connection input and multiplies the 8-bit data value by an appropriate 8-bit weight value.

FIG. 21 is a timing diagram illustrating the execution of the program of FIG. 20 by the NNU 121, which includes NPUs 126 of FIG. 18 operating in the narrow configuration. The timing diagram of FIG. 21 is similar to that of FIG. 5; the differences are described below.

In the timing diagram of FIG. 21, the NPUs 126 are in the narrow configuration because the initialize NPU instruction at address 0 initializes them to the narrow configuration. Consequently, the 512 NPUs 126 effectively operate as 1024 narrow NPUs (or neurons), which are designated in the columns as NPU 0-A and NPU 0-B (the two narrow NPUs of the NPU 126 denoted 0), NPU 1-A and NPU 1-B (the two narrow NPUs of the NPU 126 denoted 1), and so forth through NPU 511-A and NPU 511-B (the two narrow NPUs of the NPU 126 denoted 511). For simplicity of illustration, only the operations of narrow NPUs 0-A, 0-B and 511-B are shown. Because the multiply-accumulate rotate instruction at address 2 specifies a count of 1023 and thus requires 1023 clock cycles to operate, the rows of the timing diagram of FIG. 21 extend to clock cycle 1026.

At clock 0, each of the 1024 narrow NPUs performs the initialization instruction of FIG. 4, i.e., the assignment of a zero value to the accumulator 202 illustrated in FIG. 5.

At clock 1, each of the 1024 narrow NPUs performs the multiply-accumulate instruction at address 1 of FIG. 20. As shown, narrow NPU 0-A accumulates into the accumulator 202A value (i.e., zero) the product of row 17 narrow word 0 of the data RAM 122 and row 0 narrow word 0 of the weight RAM 124; narrow NPU 0-B accumulates into the accumulator 202B value (i.e., zero) the product of row 17 narrow word 1 of the data RAM 122 and row 0 narrow word 1 of the weight RAM 124; and so forth, until narrow NPU 511-B accumulates into the accumulator 202B value (i.e., zero) the product of row 17 narrow word 1023 of the data RAM 122 and row 0 narrow word 1023 of the weight RAM 124.

At clock 2, each of the 1024 narrow NPUs performs the first iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the mux-reg 208B output 209B of narrow NPU 511-B (i.e., narrow data word 1023 received from the data RAM 122) and row 1 narrow word 0 of the weight RAM 124; narrow NPU 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 0-A (i.e., narrow data word 0 received from the data RAM 122) and row 1 narrow word 1 of the weight RAM 124; and so forth, until narrow NPU 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 511-A (i.e., narrow data word 1022 received from the data RAM 122) and row 1 narrow word 1023 of the weight RAM 124.

At clock 3, each of the 1024 narrow NPUs performs the second iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the mux-reg 208B output 209B of narrow NPU 511-B (i.e., narrow data word 1022 received from the data RAM 122) and row 2 narrow word 0 of the weight RAM 124; narrow NPU 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 0-A (i.e., narrow data word 1023 received from the data RAM 122) and row 2 narrow word 1 of the weight RAM 124; and so forth, until narrow NPU 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 511-A (i.e., narrow data word 1021 received from the data RAM 122) and row 2 narrow word 1023 of the weight RAM 124. As indicated in FIG. 21, this operation continues for the following 1021 clock cycles, up to clock 1024 described below.

At clock 1024, each of the 1024 narrow NPUs performs the 1023rd iteration of the multiply-accumulate rotate instruction at address 2 of FIG. 20. As shown, narrow NPU 0-A accumulates into the accumulator 202A value 217A the product of the rotated narrow data word 1811A received from the mux-reg 208B output 209B of narrow NPU 511-B (i.e., narrow data word 1 received from the data RAM 122) and row 1023 narrow word 0 of the weight RAM 124; narrow NPU 0-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 0-A (i.e., narrow data word 2 received from the data RAM 122) and row 1023 narrow word 1 of the weight RAM 124; and so forth, until narrow NPU 511-B accumulates into the accumulator 202B value 217B the product of the rotated narrow data word 1811B received from the mux-reg 208A output 209A of narrow NPU 511-A (i.e., narrow data word 0 received from the data RAM 122) and row 1023 narrow word 1023 of the weight RAM 124.
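
The following compact software model, mine rather than the patent's, summarizes what clocks 1 through 1024 of FIG. 21 compute: after the initial load and the 1023 rotate-and-accumulate iterations, narrow NPU i has accumulated the products of the rotated words of data RAM row 17 with word i of weight RAM rows 0 through 1023.

def narrow_layer(data_row, weight_rows):
    n = len(data_row)                     # 1024 narrow data words in the full example
    acc = [0] * n
    for c in range(n):                    # clock 1 (c = 0), then the 1023 rotations
        for i in range(n):
            acc[i] += data_row[(i - c) % n] * weight_rows[c][i]
    return acc

# Tiny check with n = 4 instead of 1024: NPU i sums data[(i-c) % 4] * w[c][i].
d = [1, 2, 3, 4]
w = [[1] * 4 for _ in range(4)]
assert narrow_layer(d, w) == [10, 10, 10, 10]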

At clock 1025, the AFU 212A/212B of each of the 1024 narrow NPUs performs the activation function instruction at address 3 of FIG. 20. Finally, at clock 1026, each of the 1024 narrow NPUs writes its narrow result 133A/133B back to its corresponding narrow word of row 16 of the data RAM 122 to perform the write AFU output instruction at address 4 of FIG. 20. That is, the narrow result 133A of NPU 0-A is written to narrow word 0 of the data RAM 122, the narrow result 133B of NPU 0-B is written to narrow word 1 of the data RAM 122, and so forth, until the narrow result 133B of NPU 511-B is written to narrow word 1023 of the data RAM 122. The operation just described with respect to FIG. 21 is also shown in block diagram form in FIG. 22.

FIG. 22 is a block diagram illustrating the NNU 121 of FIG. 1 including the NPUs 126 of FIG. 18 to execute the program of FIG. 20. The NNU 121 includes the 512 NPUs 126, i.e., 1024 narrow NPUs, the data RAM 122, which receives its address input 123, and the weight RAM 124, which receives its address input 125. Although not shown, at clock 0 the 1024 narrow NPUs perform the initialization instruction of FIG. 20. As shown, at clock 1 the 1024 8-bit data words of row 17 are read out of the data RAM 122 and provided to the 1024 narrow NPUs. At clocks 1 through 1024, the 1024 8-bit weight words of rows 0 through 1023, respectively, are read out of the weight RAM 124 and provided to the 1024 narrow NPUs. Although not shown, at clock 1 the 1024 narrow NPUs perform their respective multiply-accumulate operations on the loaded data words and weight words. At clocks 2 through 1024, the mux-regs 208A/208B of the 1024 narrow NPUs operate as a 1024 8-bit word rotator that rotates the previously loaded data words of row 17 of the data RAM 122 to the adjacent narrow NPU, and the narrow NPUs perform the multiply-accumulate operation on the respective rotated data word and the respective narrow weight word loaded from the weight RAM 124. Although not shown, at clock 1025 the 1024 narrow AFUs 212A/212B perform the activation instruction. At clock 1026, the 1024 narrow NPUs write their respective 1024 8-bit results 133A/133B back to row 16 of the data RAM 122.

As may be observed, the embodiment of FIG. 18, in comparison with that of FIG. 2, gives the programmer the flexibility to perform computations using either wide data and weight words (e.g., 16 bits) or narrow data and weight words (e.g., 8 bits), depending on the accuracy required by a particular application. From one perspective, for narrow-data applications the embodiment of FIG. 18 provides twice the performance of the embodiment of FIG. 2, at the cost of the additional narrow elements (e.g., the mux-reg 208B, the register 205B, the narrow ALU 204B, the narrow accumulator 202B and the narrow AFU 212B), which increase the area of the NPU 126 by approximately 50%.

Tri-modal Neural Processing Units

FIG. 23 is a block diagram illustrating another embodiment of the dynamically configurable NPU 126 of FIG. 1. The NPU 126 of FIG. 23 is configurable not only in the wide and narrow configurations, but also in a third configuration, referred to herein as a "funnel" configuration. The NPU 126 of FIG. 23 is similar to the NPU 126 of FIG. 18. However, the wide adder 244A of FIG. 18 is replaced in the NPU 126 of FIG. 23 by a 3-input wide adder 2344A that receives a third addend 2399, which is an extended version of the output of the narrow multiplexer 1896B. A program for operating an NNU having the NPUs of FIG. 23 is similar to the program of FIG. 20, except that the initialize NPU instruction at address 0 initializes the NPUs 126 to the funnel configuration rather than the narrow configuration, and the multiply-accumulate rotate instruction at address 2 specifies a count of 511 rather than 1023.

When in the funnel configuration, the NPU 126 operates similarly to the narrow configuration: when performing a multiply-accumulate instruction such as that at address 1 of FIG. 20, the NPU 126 receives two narrow data words 207A/207B and two narrow weight words 206A/206B; the wide multiplier 242A multiplies the data word 209A and the weight word 203A to produce a product 246A selected by the wide multiplexer 1896A; and the narrow multiplier 242B multiplies the data word 209B and the weight word 203B to produce a product 246B selected by the narrow multiplexer 1896B. However, the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the narrow multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B are inactive. Furthermore, when in the funnel configuration and performing a multiply-accumulate rotate instruction such as that at address 2 of FIG. 20, the control input 213 causes the mux-regs 208A/208B to rotate by two narrow words (e.g., 16 bits); that is, the mux-regs 208A/208B select their respective inputs 211A/211B, as in the wide configuration. However, the wide multiplier 242A multiplies the data word 209A and the weight word 203A to produce the product 246A selected by the wide multiplexer 1896A, the narrow multiplier 242B multiplies the data word 209B and the weight word 203B to produce the product 246B selected by the narrow multiplexer 1896B, and the wide adder 2344A adds both the product 246A (selected by the wide multiplexer 1896A) and the product 246B/2399 (selected by the narrow multiplexer 1896B) to the wide accumulator 202A output 217A, while the narrow adder 244B and the narrow accumulator 202B remain inactive, as described above. Finally, when in the funnel configuration and performing an activation function instruction such as that at address 3 of FIG. 20, the wide AFU 212A performs the activation function on the resulting sum 215A to produce a narrow result 133A, while the narrow AFU 212B is inactive. Consequently, only the narrow NPUs denoted A produce a narrow result 133A, and the narrow results 133B produced by the narrow NPUs denoted B are invalid. Therefore, the row of results written back (e.g., row 16, as indicated by the instruction at address 4 of FIG. 20) contains holes, because only the narrow results 133A are valid and the narrow results 133B are invalid. Thus, conceptually, on each clock cycle each neuron (the NPU of FIG. 23) processes two connection data inputs, i.e., it multiplies two narrow data words by their respective weights and adds the two products, in contrast with the embodiments of FIGS. 2 and 18, which each process only a single connection data input per clock cycle.
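
Under my reading of the funnel configuration above, the per-clock work of one NPU can be sketched as follows: both narrow products feed the single wide accumulator, i.e., two connections are consumed per clock. The function name is mine and this is an arithmetic illustration only, not the adder 2344A itself.

def funnel_mac(acc, data_a, weight_a, data_b, weight_b):
    # Two connection inputs accumulated into one wide accumulator per clock.
    return acc + data_a * weight_a + data_b * weight_b

acc = 0
acc = funnel_mac(acc, 3, 2, 5, 4)    # 3*2 + 5*4 = 26
assert acc == 26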

It may be observed that in the embodiment of FIG. 23, the number of result words (neuron outputs) produced and written back to the data RAM 122 or the weight RAM 124 is half the square root of the number of data inputs (connections) received, and the written-back row of results has holes, i.e., every other narrow word result is invalid; more precisely, the results of the narrow NPUs denoted B are not meaningful. Consequently, the embodiment of FIG. 23 is particularly efficient for neural networks with two consecutive layers in which, for example, the first layer has twice as many neurons as the second layer (e.g., a first layer of 1024 neurons fully connected to a second layer of 512 neurons). Moreover, the other execution units 112 (e.g., media units, such as x86 AVX units) may, if necessary, perform a pack operation on a sparse (i.e., hole-containing) row of results to make it dense (i.e., without holes). The packed data row may then be used in subsequent computations while the NNU 121 performs other computations associated with other rows of the data RAM 122 and/or the weight RAM 124.

Hybrid Neural Network Unit Operation: Convolution and Pooling Capabilities

An advantage of the NNU 121 according to embodiments described herein is that it can concurrently operate in a manner resembling a coprocessor, executing its own internal program, and in a manner resembling an execution unit of a processor, executing architectural instructions issued to it (or microinstructions translated from architectural instructions). The architectural instructions are part of an architectural program being executed by the processor that includes the NNU 121. In this way, the NNU 121 operates in a hybrid fashion, which allows it to remain highly utilized. For example, FIGS. 24 through 26 illustrate the operation of the NNU 121 performing a convolution operation, in which the NNU is highly utilized, and FIGS. 27 through 28 illustrate the operation of the NNU 121 performing a pooling operation. Convolution layers, pooling layers and other digital data computing applications, such as image processing (e.g., edge detection, sharpening, blurring, recognition/classification), require these operations. Nevertheless, the hybrid operation of the NNU 121 is not limited to performing convolution or pooling operations; the hybrid feature may also be used to perform other operations, such as the classic neural network multiply-accumulate and activation function operations described above with respect to FIGS. 4 through 13. That is, the processor 100 (more precisely, the reservation stations 108) issues MTNN instructions 1400 and MFNN instructions 1500 to the NNU 121, in response to which the NNU 121 writes data to the memories 122/124/129 and reads results from the memories 122/124 that the NNU 121 has written to, while at the same time the NNU 121 reads and writes the memories 122/124/129 in order to execute the program written into the program memory 129 by the processor 100 (via MTNN instructions 1400).

FIG. 24 is a block diagram illustrating an example of data structures used by the NNU 121 of FIG. 1 to perform a convolution operation. The block diagram includes a convolution kernel 2402, a data array 2404, and the data RAM 122 and weight RAM 124 of FIG. 1. In a preferred embodiment, the data array 2404 (e.g., of image pixels) is held in system memory (not shown) attached to the processor 100 and is loaded by the processor 100 into the weight RAM 124 of the NNU 121 by executing MTNN instructions 1400. A convolution operation convolves a first array with a second array, the second array being the convolution kernel described herein. As described herein, a convolution kernel is a matrix of coefficients, which may also be referred to as weights, parameters, elements or values. In a preferred embodiment, the convolution kernel 2402 is static data of the architectural program being executed by the processor 100.

The data array 2404 is a two-dimensional array of data values, each data value (e.g., an image pixel value) being the size of a word of the data RAM 122 or the weight RAM 124 (e.g., 16 bits or 8 bits). In the example, the data values are 16-bit words and the NNU 121 is configured as 512 wide-configuration NPUs 126. Additionally, in this embodiment, the NPUs 126 include mux-regs, such as the mux-reg 705 of FIG. 7, for receiving the weight words 206 from the weight RAM 124, in order to perform the collective rotator operation on a row of data values received from the weight RAM 124, as described in more detail below. In the example, the data array 2404 is a 2560-column by 1600-row pixel array. As shown, when the architectural program convolves the data array 2404 with the convolution kernel 2402, it divides the data array 2404 into 20 chunks, each chunk being a 512 x 400 data matrix 2406.

In the example, the convolution kernel 2402 is a 3x3 array of coefficients, weights, parameters, or elements. The first row of coefficients is denoted C0,0; C0,1; and C0,2; the second row of coefficients is denoted C1,0; C1,1; and C1,2; and the third row of coefficients is denoted C2,0; C2,1; and C2,2. For example, a convolution kernel that may be used to perform edge detection has the following coefficients: 0, 1, 0, 1, -4, 1, 0, 1, 0. For another example, a convolution kernel that may be used to perform a Gaussian blur has the following coefficients: 1, 2, 1, 2, 4, 2, 1, 2, 1. In this case, a divide is typically also performed on the final accumulated value, where the divisor is the sum of the absolute values of the elements of the convolution kernel 2402, which in this example is 16. For another example, the divisor may be the number of elements of the convolution kernel 2402. For yet another example, the divisor may be a value used to compress the results of the convolution back into a desired range of values, the divisor being determined from the values of the elements of the convolution kernel 2402, the desired range, and the range of the input values of the array being convolved.
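
A worked example, offered for illustration only and not as the NNU program itself, of convolving a small image with a 3x3 kernel of the kind mentioned above, including the final divide by the sum of the absolute kernel values:

def convolve3x3(image, kernel, divisor):
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            s = sum(image[y + r][x + c] * kernel[r][c]
                    for r in range(3) for c in range(3))
            out[y][x] = s // divisor          # normalize the accumulated value
    return out

gaussian = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]   # divisor 16 = sum of |elements|
flat = [[8] * 5 for _ in range(5)]             # a constant 5x5 test image
assert convolve3x3(flat, gaussian, 16) == [[8] * 3 for _ in range(3)]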

Referring to FIG. 24, and to FIG. 25 which describes the details, the architectural program writes the coefficients of the convolution kernel 2402 into the data RAM 122. In a preferred embodiment, all the words of each of nine (the number of elements of the convolution kernel 2402) consecutive rows of the data RAM 122 are written, in row-major order, with a different element of the convolution kernel 2402. That is, as shown, each word of one row is written with the first coefficient C0,0; the next row is written with the second coefficient C0,1; the next row is written with the third coefficient C0,2; the next row is written with the fourth coefficient C1,0; and so forth, until each word of the ninth row is written with the ninth coefficient C2,2. To convolve the data matrices 2406 of the chunks into which the data array 2404 is divided, the NPUs 126 repeatedly read, in order, the nine rows of the data RAM 122 that hold the coefficients of the convolution kernel 2402, as described in more detail below, particularly with respect to FIG. 26A.

Referring to FIG. 24, and to FIG. 25 which describes the details, the architectural program writes the values of a data matrix 2406 into the weight RAM 124. As the NNU program performs the convolution, it writes the result array back to the weight RAM 124. In a preferred embodiment, the architectural program writes a first data matrix 2406 into the weight RAM 124 and starts the NNU 121, and while the NNU 121 is convolving the first data matrix 2406 with the convolution kernel 2402, the architectural program writes a second data matrix 2406 into the weight RAM 124, so that the NNU 121 can begin convolving the second data matrix 2406 as soon as it has completed the convolution of the first data matrix 2406, as described in more detail below with respect to FIG. 25. In this manner, the architectural program ping-pongs back and forth between the two regions of the weight RAM 124 in order to keep the NNU 121 fully utilized. Thus, the example of FIG. 24 shows a first data matrix 2406A, corresponding to a first chunk occupying rows 0 through 399 of the weight RAM 124, and a second data matrix 2406B, corresponding to a second chunk occupying rows 500 through 899 of the weight RAM 124. Furthermore, as shown, the NNU 121 writes the results of the convolutions back to rows 900-1299 and 1300-1699 of the weight RAM 124, which the architectural program subsequently reads out of the weight RAM 124. The data values of the data matrix 2406 held in the weight RAM 124 are denoted "Dx,y", where "x" is the weight RAM 124 row number and "y" is the word, or column, number of the weight RAM. Thus, for example, data word 511 in row 399 is denoted D399,511 in FIG. 24 and is received by the mux-reg 705 of NPU 511.

FIG. 25 is a flowchart illustrating the operation of the processor 100 of FIG. 1 as it executes an architectural program that uses the NNU 121 to perform a convolution of the convolution kernel 2402 with the data array 2404 of FIG. 24. Flow begins at block 2502.

At block 2502, the processor 100, i.e., the architectural program running on the processor 100, writes the convolution kernel 2402 of FIG. 24 into the data RAM 122 in the manner shown and described with respect to FIG. 24. Additionally, the architectural program initializes a variable N to a value of 1. The variable N denotes the chunk of the data array 2404 currently being processed by the NNU 121. Additionally, the architectural program initializes a variable NUM_CHUNKS to a value of 20. Flow proceeds to block 2504.

At step 2504, the processor 100 writes the data matrix 2406 for chunk 1 into the weight RAM 124, as shown in FIG. 24 (e.g., data matrix 2406A of chunk 1). Flow proceeds to step 2506.

At step 2506, the processor 100 writes the convolution program into the NNU 121 program memory 129, using MTNN instructions 1400 that specify a function 1432 to write the program memory 129. The processor 100 then starts the NNU convolution program using an MTNN instruction 1400 that specifies a function 1432 to begin execution of the program. An example of the NNU convolution program is described in more detail with respect to FIG. 26A. Flow proceeds to step 2508.

At decision step 2508, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2512; otherwise, flow proceeds to step 2514.

At step 2512, the processor 100 writes the data matrix 2406 for chunk N+1 into the weight RAM 124, as shown in FIG. 24 (e.g., data matrix 2406B of chunk 2). Thus, while the NNU 121 is performing the convolution on the current chunk, the architectural program writes the data matrix 2406 of the next chunk into the weight RAM 124, so that once the convolution of the current chunk is complete, i.e., once the results have been written to the weight RAM 124, the NNU 121 can immediately begin performing the convolution on the next chunk.

At step 2514, the processor 100 determines whether the currently running NNU program (started at step 2506 in the case of chunk 1, and at step 2518 in the case of chunks 2 through 20) has completed execution. In a preferred embodiment, the processor 100 determines this by executing an MFNN instruction 1500 to read the NNU 121 status register 127. In an alternative embodiment, the NNU 121 generates an interrupt to indicate that it has completed the convolution program. Flow proceeds to decision step 2516.

At decision step 2516, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2518; otherwise, flow proceeds to step 2522.

At step 2518, the processor 100 updates the convolution program so that it can convolve chunk N+1. More precisely, the processor 100 updates the weight RAM 124 column value of the initialize NPU instruction at address 0 to the first column of the data matrix 2406 (e.g., to column 0 of data matrix 2406A or to column 500 of data matrix 2406B) and also updates the output column (e.g., to column 900 or 1300). The processor 100 then starts the updated NNU convolution program. Flow proceeds to step 2522.

At step 2522, the processor 100 reads the results of the NNU convolution program for chunk N from the weight RAM 124. Flow proceeds to decision step 2524.

At decision step 2524, the architectural program determines whether the value of the variable N is less than NUM_CHUNKS. If so, flow proceeds to step 2526; otherwise, flow ends.

At step 2526, the architectural program increments N by one. Flow returns to decision step 2508.
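The flow of FIG. 25 can be summarized with a short host-side sketch. The following is a hypothetical illustration in plain Python, not the actual architectural program; the NnuStub class and its method names merely stand in for the MTNN/MFNN-based program-memory writes, program starts, status polling and result reads described above.

```python
# Hypothetical host-side model of the FIG. 25 driver loop: double-buffer chunks
# of the data array through the two weight RAM regions (columns 0/500 for input,
# 900/1300 for output) so the NNU stays busy while the next chunk is loaded.
NUM_CHUNKS = 20

class NnuStub:
    """Stand-in for the MTNN/MFNN-accessed NNU; it only records the calls."""
    def __init__(self):
        self.log = []
    def write_data_ram(self, what):             self.log.append(("data_ram", what))
    def write_weight_ram(self, chunk, column):  self.log.append(("weight_ram", column))
    def start_program(self, in_col, out_col):   self.log.append(("start", in_col, out_col))
    def done(self):                             return True    # pretend execution finished
    def read_results(self, column):             return ("results", column)

def convolve_data_array(nnu, chunks, kernel):
    nnu.write_data_ram(kernel)                          # step 2502: kernel -> data RAM
    nnu.write_weight_ram(chunks[0], column=0)           # step 2504: chunk 1 -> weight RAM
    nnu.start_program(in_col=0, out_col=900)            # step 2506: start NNU program
    results = []
    for n in range(1, len(chunks) + 1):                 # N = 1 .. NUM_CHUNKS
        if n < len(chunks):                             # step 2512: stage chunk N+1
            nnu.write_weight_ram(chunks[n], column=500 if n % 2 else 0)
        while not nnu.done():                           # step 2514: poll status register 127
            pass
        if n < len(chunks):                             # step 2518: retarget and restart
            nnu.start_program(in_col=500 if n % 2 else 0,
                              out_col=1300 if n % 2 else 900)
        results.append(nnu.read_results(column=900 if n % 2 else 1300))  # step 2522
    return results                                       # step 2526: N incremented by loop

results = convolve_data_array(NnuStub(), [f"chunk{i+1}" for i in range(NUM_CHUNKS)], "kernel")
```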

FIG. 26A is a program listing of an NNU program that performs a convolution of a data matrix 2406 with the convolution kernel 2402 of FIG. 24 and writes it back to the weight RAM 124. The program loops a number of times through a loop body made up of the instructions at addresses 1 through 9. An initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body, which in the example of FIG. 26A is a loop count value of 400, corresponding to the number of columns in the data matrix 2406 of FIG. 24, and the loop instruction at the end of the loop (at address 10) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 10 also clears the accumulator 202 to zero. Alternatively, as described above, the multiply-accumulate instruction at address 1 may specify that the accumulator 202 be cleared to zero.

For each execution of the loop body of the program, the 512 NPUs 126 concurrently perform 512 convolutions of the 3x3 convolution kernel with 512 respective 3x3 sub-matrices of the data matrix 2406. The convolution is the sum of the nine products of the elements of the convolution kernel 2402 and their corresponding elements of the respective sub-matrix. In the embodiment of FIG. 26A, the origin (center element) of each of the 512 respective 3x3 sub-matrices is the data word Dx+1,y+1 of FIG. 24, where y (the word number) is the NPU 126 number and x (the column number) is the weight RAM 124 column number currently read by the multiply-accumulate instruction at address 1 of the program of FIG. 26A (this column number is also initialized by the initialize NPU instruction at address 0, is incremented when the multiply-accumulate instructions at addresses 3 and 5 are executed, and is updated by the decrement instruction at address 9). Thus, for each loop of the program, the 512 NPUs 126 compute 512 convolutions and write the 512 convolution results back to the instructed column of the weight RAM 124. Edge handling is omitted here to simplify the description, although it should be noted that the use of the collective rotating feature of the NPUs 126 causes two of the columns of data of the data matrix 2406 (i.e., of the image, in the case of image processing) to wrap from one of its vertical edges to the other (e.g., from the left edge to the right edge, or vice versa). The loop body will now be described.

Address 1 holds a multiply-accumulate instruction that specifies column 0 of the data RAM 122 and implicitly uses the current weight RAM 124 column, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). That is, the instruction at address 1 causes each NPU 126 to read its corresponding word from column 0 of the data RAM 122, to read its corresponding word from the current weight RAM 124 column, and to perform a multiply-accumulate operation on the two words. Thus, for example, NPU 5 multiplies C0,0 by Dx,5 (where "x" is the current weight RAM 124 column), adds the result to the accumulator 202 value 217, and writes the sum back to the accumulator 202.

Address 2 holds a multiply-accumulate instruction that specifies that the data RAM 122 column be incremented (i.e., to 1) and that the column then be read from the incremented address of the data RAM 122. The instruction also specifies that the value in the mux-reg 705 of each NPU 126 be rotated to the adjacent NPU 126, which in this case is the column of data matrix 2406 values read from the weight RAM 124 in response to the instruction at address 1. In the embodiments of FIGS. 24 through 26, the NPUs 126 rotate the mux-reg 705 values to the left, i.e., from NPU J to NPU J-1, rather than from NPU J to NPU J+1 as described above with respect to FIGS. 3, 7 and 19. It should be noted that in an embodiment in which the NPUs 126 rotate to the right, the architectural program writes the convolution kernel 2402 coefficient values into the data RAM 122 in a different order (e.g., rotated about its central column) in order to accomplish a similar convolution result. Furthermore, the architectural program may perform additional pre-processing of the convolution kernel (e.g., transposition) as needed. Additionally, the instruction specifies a count value of 2. Thus, the instruction at address 2 causes each NPU 126 to read its corresponding word from column 1 of the data RAM 122, receive the rotated word into its mux-reg 705, and perform a multiply-accumulate operation on the two words. Because of the count value of 2, the instruction also causes each NPU 126 to repeat the operation just described; that is, the sequencer 128 increments the data RAM 122 column address 123 (i.e., to 2), and each NPU 126 reads its corresponding word from column 2 of the data RAM 122, receives the rotated word into its mux-reg 705, and performs a multiply-accumulate operation on the two words. Thus, for example, assuming the current weight RAM 124 column is 27, after executing the instruction at address 2, NPU 5 will have accumulated into its accumulator 202 the product of C0,1 and D27,6 and the product of C0,2 and D27,7. Thus, after completion of the instructions at addresses 1 and 2, the product of C0,0 and D27,5, the product of C0,1 and D27,6, and the product of C0,2 and D27,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.

The instructions at addresses 3 and 4 perform operations similar to the instructions at addresses 1 and 2, but operate on the next column of the weight RAM 124, by virtue of the weight RAM 124 column increment indicator, and on the following three columns of the data RAM 122, namely columns 3 through 5. That is, with respect to NPU 5, for example, after completion of the instructions at addresses 1 through 4, the products of C0,0 and D27,5, C0,1 and D27,6, C0,2 and D27,7, C1,0 and D28,5, C1,1 and D28,6, and C1,2 and D28,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body.

The instructions at addresses 5 and 6 perform operations similar to the instructions at addresses 3 and 4, but operate on the next column of the weight RAM 124 and on the following three columns of the data RAM 122, namely columns 6 through 8. That is, with respect to NPU 5, for example, after completion of the instructions at addresses 1 through 6, the products of C0,0 and D27,5, C0,1 and D27,6, C0,2 and D27,7, C1,0 and D28,5, C1,1 and D28,6, C1,2 and D28,7, C2,0 and D29,5, C2,1 and D29,6, and C2,2 and D29,7 will have been accumulated into the accumulator 202, along with all the other accumulated values from previous passes through the loop body. That is, after completion of the instructions at addresses 1 through 6, and assuming the weight RAM 124 column at the beginning of the loop body was 27, NPU 5, for example, will have convolved the convolution kernel 2402 with the following 3x3 sub-matrix:
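Collecting the data words named in the products above, this sub-matrix can be written as

$$\begin{bmatrix} D_{27,5} & D_{27,6} & D_{27,7} \\ D_{28,5} & D_{28,6} & D_{28,7} \\ D_{29,5} & D_{29,6} & D_{29,7} \end{bmatrix}$$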

More generally, after completion of the instructions at addresses 1 through 6, each of the 512 NPUs 126 will have convolved the convolution kernel 2402 with the following 3x3 sub-matrix:
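Following the indexing pattern of the NPU 5 example above, this sub-matrix is presumably

$$\begin{bmatrix} D_{r,n} & D_{r,n+1} & D_{r,n+2} \\ D_{r+1,n} & D_{r+1,n+1} & D_{r+1,n+2} \\ D_{r+2,n} & D_{r+2,n+1} & D_{r+2,n+2} \end{bmatrix}$$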

where r is the weight RAM 124 column address value at the beginning of the loop body and n is the NPU 126 number.

The instruction at address 7 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes a word whose size (in bits) is equal to that of the words read from the data RAM 122 and the weight RAM 124 (i.e., 16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below. Alternatively, rather than specifying a pass-through activation function, the instruction may specify a divide activation function that divides the accumulator 202 value 217 by a divisor, as described herein with respect to FIGS. 29A and 30, e.g., using one of the "dividers" 3014/3016 of FIG. 30. For example, in the case of a convolution kernel 2402 with a coefficient, such as the aforementioned Gaussian blur kernel with a one-sixteenth coefficient, the instruction at address 7 would specify a divide activation function (e.g., divide by 16) rather than a pass-through function. Alternatively, the architectural program may perform the divide-by-16 on the convolution kernel 2402 coefficients before writing them to the data RAM 122 and adjust the location of the binary point for the convolution kernel 2402 values accordingly, e.g., using the data binary point 2922 of FIG. 29 described below.
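To make the two alternatives concrete, the following is a minimal numeric sketch in plain Python, not NNU code; the 3x3 Gaussian blur coefficients 1, 2, 1 / 2, 4, 2 / 1, 2, 1 with the one-sixteenth factor are assumed here only for illustration.

```python
# Minimal sketch (plain Python, not NNU code) of the two equivalent options
# described above for a Gaussian blur kernel with a 1/16 coefficient:
#  (a) accumulate with integer coefficients, then divide the accumulator by 16
#      (the divide activation function), or
#  (b) pre-divide the kernel coefficients by 16 (adjusting the binary point)
#      and use a pass-through activation function.
kernel = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]          # assumed 3x3 Gaussian blur coefficients
patch  = [[10, 12, 14],
          [11, 13, 15],
          [12, 14, 16]]       # arbitrary 3x3 image sub-matrix

acc = sum(kernel[i][j] * patch[i][j] for i in range(3) for j in range(3))
option_a = acc / 16                                        # divide activation function
option_b = sum((kernel[i][j] / 16) * patch[i][j]           # pre-scaled coefficients,
               for i in range(3) for j in range(3))        # pass-through activation
assert abs(option_a - option_b) < 1e-9
```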

The instruction at address 8 writes the output of the activation function unit 212 to the column of the weight RAM 124 specified by the current value of the output column register, which was initialized by the instruction at address 0 and which is incremented each pass through the loop by virtue of an increment indicator in the instruction.

As may be determined from the example of FIGS. 24 through 26 with a 3x3 convolution kernel 2402, the NPUs 126 read the weight RAM 124 approximately every three clock cycles to read a column of the data matrix 2406, and write the convolution result matrix to the weight RAM 124 approximately every twelve clock cycles. Additionally, assuming an embodiment that includes a write and read buffer such as the buffer 1704 of FIG. 17, concurrently with the NPU 126 reads and writes, the processor 100 can read and write the weight RAM 124 such that the buffer 1704 performs approximately one write and one read of the weight RAM 124 every sixteen clock cycles, to write the data matrices and to read the convolution result matrices, respectively. Thus, approximately half the bandwidth of the weight RAM 124 is consumed by the hybrid manner in which the NNU 121 performs the convolution kernel operation. Although this example includes a 3x3 convolution kernel 2402, the invention is not limited thereto; convolution kernels of other sizes, such as 2x2, 4x4, 5x5, 6x6, 7x7, 8x8, and so forth, may be employed with different NNU programs. In the case of a larger convolution kernel, the NPUs 126 read the weight RAM 124 a smaller percentage of the time because the counts of the rotating versions of the multiply-accumulate instruction (e.g., the instructions at addresses 2, 4 and 6 of FIG. 26A, and the additional such instructions a larger convolution kernel would require) are larger; consequently, a smaller percentage of the weight RAM 124 bandwidth is consumed.

Alternatively, rather than having the NNU program write the results of the convolutions back to different columns of the weight RAM 124 (e.g., columns 900-1299 and 1300-1699), the architectural program may configure the NNU program to overwrite columns of the input data matrix 2406 after they are no longer needed. For example, in the case of a 3x3 convolution kernel, rather than writing the data matrix 2406 to columns 0-399 of the weight RAM 124, the architectural program writes it to columns 2-401, and the NNU program writes the convolution results starting at column 0 of the weight RAM 124 and increments the output column each pass through the loop body. In this manner, the NNU program overwrites only columns that are no longer needed. For example, after the first pass through the loop body (or more precisely, after the execution of the instruction at address 1, which loads in column 0 of the weight RAM 124), the data in column 0 can be overwritten, although the data in columns 1-3 is needed for the second pass through the loop body and must not be overwritten; likewise, after the second pass through the loop body, the data in column 1 can be overwritten, although the data in columns 2-4 is needed for the third pass through the loop body and must not be overwritten; and so forth. In such an embodiment, the height of each data matrix 2406 (chunk) may be increased (e.g., to 800 columns), so that fewer chunks are required.

Alternatively, rather than writing the convolution results back to the weight RAM 124, the architectural program may configure the NNU program to write them back to columns of the data RAM 122 above the convolution kernel 2402 (e.g., above column 8), and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them (e.g., using the most-recently-written data RAM 122 column 2606 address of FIG. 26B). This alternative may be advantageous in an embodiment in which the weight RAM 124 is single-ported and the data RAM 122 is dual-ported.

As may be observed from the operation of the NNU 121 according to the embodiment of FIGS. 24 through 26A, each execution of the program of FIG. 26A takes approximately 5,000 clock cycles; consequently, the convolution of the entire 2560x1600 data array 2404 of FIG. 24 takes approximately 100,000 clock cycles, considerably fewer than the number of clock cycles required to perform the same task by conventional methods.

FIG. 26B is a block diagram illustrating certain fields of the control and status register 127 of the NNU 121 of FIG. 1 according to one embodiment. The status register 127 includes a field 2602 that indicates the address of the column of the weight RAM 124 most recently written by the NPUs 126; a field 2606 that indicates the address of the column of the data RAM 122 most recently written by the NPUs 126; a field 2604 that indicates the address of the column of the weight RAM 124 most recently read by the NPUs 126; and a field 2608 that indicates the address of the column of the data RAM 122 most recently read by the NPUs 126. This enables the architectural program executing on the processor 100 to determine the progress of the NNU 121 as it reads from and/or writes to the data RAM 122 and/or the weight RAM 124. Employing this capability, along with the choice to overwrite the input data matrix as described above (or to write the results to the data RAM 122 as described above), the data array 2404 of FIG. 24 may be processed as 5 chunks of 512x1600 rather than 20 chunks of 512x400, as in the following example. The processor 100 writes the first 512x1600 chunk into the weight RAM 124 starting at column 2 and starts the NNU program (which has a loop count of 1600 and an initialized weight RAM 124 output column of 0). As the NNU 121 executes the NNU program, the processor 100 monitors the location/address of the weight RAM 124 output in order to (1) read (using MFNN instructions 1500) the columns of the weight RAM 124 that contain valid convolution results written by the NNU 121 (beginning at column 0), and (2) overwrite the valid convolution results, once they have been read, with the second 512x1600 data matrix 2406 (beginning at column 2), so that when the NNU 121 completes the NNU program on the first 512x1600 chunk, the processor 100 can, if necessary, immediately update the NNU program and start it again to process the second 512x1600 chunk. This procedure is repeated three more times for the remaining three 512x1600 chunks so as to keep the NNU 121 highly utilized.
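The 5-chunk scheme just described can be sketched in software as follows. This is a hypothetical host-side model in plain Python, not NNU code; the Nnu class below only models the bookkeeping (column addresses and the most-recently-written status field), not the convolution arithmetic.

```python
# Hypothetical model of the 5-chunk scheme: the input chunk occupies weight RAM
# columns 2-1601, results are written starting at column 0, and the host reads
# each result column as soon as status field 2602 reports it written, then
# reuses the freed input column for the next chunk.
CHUNK_ROWS = 1600

class Nnu:
    def __init__(self):
        self.weight_ram = {}
        self.last_written = -1               # models status field 2602
    def write_column(self, col, data):
        self.weight_ram[col] = data
    def read_column(self, col):
        return self.weight_ram.get(col)
    def start_program(self):
        self.last_written = -1
    def step(self):                          # pretend one pass of the loop body completed
        self.last_written += 1
        self.weight_ram[self.last_written] = ("result", self.last_written)

def stream_chunks(nnu, chunks):
    results = []
    for col, row in enumerate(chunks[0]):
        nnu.write_column(col + 2, row)               # first chunk -> columns 2..1601
    for i in range(len(chunks)):
        nnu.start_program()                           # loop count 1600, output column 0
        next_chunk = chunks[i + 1] if i + 1 < len(chunks) else None
        for consumed in range(CHUNK_ROWS):
            while nnu.last_written < consumed:        # poll status field 2602
                nnu.step()
            results.append(nnu.read_column(consumed))         # valid result column
            if next_chunk is not None:                         # its input column is free now
                nnu.write_column(consumed + 2, next_chunk[consumed])
    return results

out = stream_chunks(Nnu(), [[(c, r) for r in range(CHUNK_ROWS)] for c in range(5)])
```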

In one embodiment, the activation function unit 212 includes the capability to efficiently perform an effective division of the accumulator 202 value 217, as described in more detail below, particularly with respect to FIGS. 29A, 29B and 30. For example, an activation function NNU instruction that divides the accumulator 202 value by 16 may be used for the Gaussian blur matrix described herein.

Although the convolution kernel 2402 used in the example of FIG. 24 is a small static kernel applied to the entire data matrix 2404, the invention is not limited thereto; the kernel may instead be a large matrix whose weights are specific to the different data values of the data array 2404, such as the convolution kernels commonly found in convolutional neural networks. When the NNU 121 is used in this manner, the architectural program may swap the locations of the data matrix and the convolution kernel, i.e., place the data matrix in the data RAM 122 and the convolution kernel in the weight RAM 124, and the number of columns that must be processed by an execution of the NNU program will be correspondingly smaller.

FIG. 27 is a block diagram illustrating an example of the weight RAM 124 of FIG. 1 populated with input data upon which a pooling operation is performed by the NNU 121 of FIG. 1. A pooling operation, performed by a pooling layer of an artificial neural network, reduces the dimensions of a matrix of input data (e.g., an image, or a convolved image) by taking sub-regions, or sub-matrices, of the input matrix and computing the maximum or average value of each sub-matrix; these maximum or average values form a result matrix, or pooled matrix. In the examples of FIGS. 27 and 28, the pooling operation computes the maximum value of each sub-matrix. Pooling operations are particularly useful in artificial neural networks that perform, for example, object classification or detection. Generally, a pooling operation effectively reduces the input matrix by a factor of the number of elements in the examined sub-matrix; in particular, it reduces the input matrix in each dimension by the number of elements in the corresponding dimension of the sub-matrix. In the example of FIG. 27, the input data is a 512x1600 matrix of wide words (e.g., 16 bits) stored in columns 0 through 1599 of the weight RAM 124. In FIG. 27, the words are denoted by their column and word location: the word at column 0, word 0 is denoted D0,0; the word at column 0, word 1 is denoted D0,1; the word at column 0, word 2 is denoted D0,2; and so forth, such that the word at column 0, word 511 is denoted D0,511. Similarly, the word at column 1, word 0 is denoted D1,0; the word at column 1, word 1 is denoted D1,1; the word at column 1, word 2 is denoted D1,2; and so forth, such that the word at column 1, word 511 is denoted D1,511; and so forth, such that the word at column 1599, word 0 is denoted D1599,0; the word at column 1599, word 1 is denoted D1599,1; the word at column 1599, word 2 is denoted D1599,2; and so forth, such that the word at column 1599, word 511 is denoted D1599,511.

FIG. 28 is a program listing of an NNU program that performs a pooling operation of the input data matrix of FIG. 27 and writes it back to the weight RAM 124. In the example of FIG. 28, the pooling operation computes the maximum value of respective 4x4 sub-matrices of the input data matrix. The program loops a number of times through a loop body made up of the instructions at addresses 1 through 10. An initialize NPU instruction at address 0 specifies the number of times each NPU 126 executes the loop body, which in the example of FIG. 28 has a loop count value of 400, and the loop instruction at the end of the loop (at address 11) decrements the current loop count value and, if the result is non-zero, causes control to return to the top of the loop body (i.e., to the instruction at address 1). The input data matrix in the weight RAM 124 is effectively treated by the NNU program as 400 mutually exclusive groups of four adjacent columns, namely columns 0-3, columns 4-7, columns 8-11, and so forth, up to columns 1596-1599. Each group of four adjacent columns includes 128 4x4 sub-matrices, namely the 4x4 sub-matrices formed by the elements at the intersection of the four columns of the group and four adjacent words, namely words 0-3, words 4-7, words 8-11, and so forth, up to words 508-511. Of the 512 NPUs 126, every fourth NPU 126 (i.e., 128 of them in total) performs a pooling operation on a respective 4x4 sub-matrix, and the other three out of every four NPUs 126 are unused. More specifically, NPUs 0, 4, 8, and so forth, up to NPU 508, each perform a pooling operation on their respective 4x4 sub-matrix whose leftmost word number corresponds to the NPU number and whose lower column corresponds to the current weight RAM 124 column value, which is initialized to zero by the initialize instruction at address 0 and is incremented by 4 upon each iteration of the loop body, as described in more detail below. The 400 iterations of the loop body correspond to the number of groups of 4x4 sub-matrices of the input data matrix of FIG. 27 (i.e., the 1600 columns of the input data matrix divided by 4). The initialize NPU instruction also clears the accumulator 202 to zero. Preferably, the loop instruction at address 11 also clears the accumulator 202 to zero. Alternatively, the maxwacc instruction at address 1 specifies that the accumulator 202 be cleared to zero.

Each time through the loop body of the program, the 128 used NPUs 126 concurrently perform 128 pooling operations on the 128 respective 4x4 sub-matrices of the current four-column group of the input data matrix. More specifically, the pooling operation determines the maximum-valued element of the sixteen elements of the 4x4 sub-matrix. In the embodiment of FIG. 28, for each NPU y of the 128 used NPUs 126, the lower left element of the 4x4 sub-matrix is the element Dx,y of FIG. 27, where x is the current weight RAM 124 column number at the beginning of the loop body, which is read by the maxwacc instruction at address 1 of the program of FIG. 28 (this column number is also initialized by the initialize NPU instruction at address 0 and is incremented each time the maxwacc instructions at addresses 3, 5 and 7 are executed). Thus, for each loop of the program, the 128 used NPUs 126 write back to the specified column of the weight RAM 124 the maximum-valued elements of the respective 128 4x4 sub-matrices of the current group of columns. The loop body will now be described.

The maxwacc instruction at address 1 implicitly uses the current weight RAM 124 column, which is preferably held in the sequencer 128 (and which is initialized to zero by the instruction at address 0 for the first pass through the loop body). The instruction at address 1 causes each NPU 126 to read its corresponding word from the current column of the weight RAM 124, compare the word with the accumulator 202 value 217, and store in the accumulator 202 the maximum of the two values. Thus, for example, NPU 8 determines the maximum value of the accumulator 202 value 217 and data word Dx,8 (where "x" is the current weight RAM 124 column) and writes the maximum value back to the accumulator 202.

Address 2 holds a maxwacc instruction that specifies that the value in the mux-reg 705 of each NPU 126 be rotated to the adjacent NPU 126, which in this case is the column of input data matrix values just read from the weight RAM 124 in response to the instruction at address 1. In the embodiment of FIGS. 27 through 28, the NPUs 126 rotate the mux-reg 705 values to the left, i.e., from NPU J to NPU J-1, as described above with respect to FIGS. 24 through 26. Additionally, the instruction specifies a count value of 3. Thus, the instruction at address 2 causes each NPU 126 to receive the rotated word into its mux-reg 705 and determine the maximum of the rotated word and the accumulator 202 value, and then to repeat this operation two more times. That is, each NPU 126 performs three times the operation of receiving the rotated word into its mux-reg 705 and determining the maximum of the rotated word and the accumulator 202 value. Thus, for example, assuming the current weight RAM 124 column at the beginning of the loop body is 36, with respect to NPU 8 for example, after the execution of the instructions at addresses 1 and 2, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop body and the four weight RAM 124 words D36,8, D36,9, D36,10 and D36,11.

The maxwacc instructions at addresses 3 and 4 perform operations similar to the instructions at addresses 1 and 2, but operate on the next column of the weight RAM 124 by virtue of the weight RAM 124 column increment indicator. That is, assuming the current weight RAM 124 column at the beginning of the loop body is 36, with respect to NPU 8 for example, after completion of the instructions at addresses 1 through 4, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop body and the eight weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10 and D37,11.

The maxwacc instructions at addresses 5 through 8 perform operations similar to the instructions at addresses 1 through 4, but operate on the next two columns of the weight RAM 124. That is, assuming the current weight RAM 124 column at the beginning of the loop body is 36, with respect to NPU 8 for example, after completion of the instructions at addresses 1 through 8, NPU 8 will have stored in its accumulator 202 the maximum of the accumulator 202 at the start of the loop body and the sixteen weight RAM 124 words D36,8, D36,9, D36,10, D36,11, D37,8, D37,9, D37,10, D37,11, D38,8, D38,9, D38,10, D38,11, D39,8, D39,9, D39,10 and D39,11. That is, assuming the weight RAM 124 column at the beginning of the loop body was 36, NPU 8, for example, will have determined the maximum of the following 4x4 sub-matrix after completing the instructions at addresses 1 through 8:
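Collecting the sixteen words named above, this sub-matrix can be written as

$$\begin{bmatrix} D_{36,8} & D_{36,9} & D_{36,10} & D_{36,11} \\ D_{37,8} & D_{37,9} & D_{37,10} & D_{37,11} \\ D_{38,8} & D_{38,9} & D_{38,10} & D_{38,11} \\ D_{39,8} & D_{39,9} & D_{39,10} & D_{39,11} \end{bmatrix}$$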

More generally, after completing the instructions at addresses 1 through 8, each of the 128 used NPUs 126 will have determined the maximum of the following 4x4 sub-matrix:
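Following the same indexing pattern, this sub-matrix is presumably

$$\begin{bmatrix} D_{r,n} & D_{r,n+1} & D_{r,n+2} & D_{r,n+3} \\ D_{r+1,n} & D_{r+1,n+1} & D_{r+1,n+2} & D_{r+1,n+3} \\ D_{r+2,n} & D_{r+2,n+1} & D_{r+2,n+2} & D_{r+2,n+3} \\ D_{r+3,n} & D_{r+3,n+1} & D_{r+3,n+2} & D_{r+3,n+3} \end{bmatrix}$$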

where r is the current weight RAM 124 column address value at the beginning of the loop body and n is the NPU 126 number.
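Putting the loop body together, the following is a compact software model of what the FIG. 28 program computes for a 1600-column by 512-word input (a sketch in plain Python, not NNU code; it reproduces the pooled values but not their placement within the sparse output columns, which is discussed below).

```python
# Plain-Python model (a sketch, not NNU code) of the pooling computed by the
# FIG. 28 program: for every group of 4 columns and every group of 4 words,
# take the maximum of the corresponding 4x4 sub-matrix.
def max_pool_4x4(matrix):
    """matrix[c][w]: c = weight RAM column (0-1599), w = word number (0-511)."""
    cols, words = len(matrix), len(matrix[0])
    pooled = []
    for c in range(0, cols, 4):            # one iteration of the loop body per group
        row = []
        for w in range(0, words, 4):       # handled by NPU number w (0, 4, 8, ...)
            row.append(max(matrix[c + i][w + j] for i in range(4) for j in range(4)))
        pooled.append(row)
    return pooled                          # 400 x 128 result for a 1600 x 512 input

example = [[(c * 512 + w) % 97 for w in range(512)] for c in range(1600)]
result = max_pool_4x4(example)
assert len(result) == 400 and len(result[0]) == 128
```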

The instruction at address 9 passes the accumulator 202 value 217 through the activation function unit 212. The pass-through function passes a word whose size (in bits) is equal to that of the words read from the weight RAM 124 (i.e., 16 bits in this example). Preferably, the user may specify the output format, e.g., how many of the output bits are fractional bits, as described in more detail below.

The instruction at address 10 writes the accumulator 202 value 217 to the column of the weight RAM 124 specified by the current value of the output column register, which was initialized by the instruction at address 0 and which is incremented each pass through the loop by virtue of an increment indicator in the instruction. More specifically, the instruction at address 10 writes a wide word (e.g., 16 bits) of the accumulator 202 to the weight RAM 124. Preferably, the instruction writes the 16 bits as specified by the output binary point 2916, as described in more detail below with respect to FIGS. 29A and 29B.

As described above, each column written to the weight RAM 124 by an iteration of the loop body contains holes that hold invalid values. That is, wide words 1 through 3, 5 through 7, 9 through 11, and so forth, up to wide words 509 through 511 of the result 133 are invalid, or unused. In one embodiment, the activation function unit 212 includes a multiplexer that enables packing of the results into adjacent words of a row buffer, such as the row buffer 1104 of FIG. 11, for writing back to the output weight RAM 124 column. Preferably, the activation function instruction specifies the number of words in each hole, and the number of words in the hole controls the multiplexer that packs the results. In one embodiment, the number of holes may be specified as a value from 2 to 6 in order to pack the output of pooled 3x3, 4x4, 5x5, 6x6 or 7x7 sub-matrices. Alternatively, the architectural program executing on the processor 100 reads the resulting sparse (i.e., containing holes) result columns from the weight RAM 124 and performs the packing function using other execution units 112, such as a media unit using architectural pack instructions, e.g., x86 Streaming SIMD Extensions (SSE) instructions. Advantageously, in a concurrent manner similar to that described above, and exploiting the hybrid nature of the NNU 121, the architectural program executing on the processor 100 can read the status register 127 to monitor the most recently written column of the weight RAM 124 (e.g., field 2602 of FIG. 26B) in order to read a resulting sparse result column, pack it, and write it back to the same column of the weight RAM 124, so that it is ready to be used as an input data matrix for a next layer of the neural network, such as a convolution layer or a classic neural network layer (i.e., a multiply-accumulate layer). Furthermore, although the embodiments described herein perform the pooling operation on 4x4 sub-matrices, the invention is not limited thereto; the NNU program of FIG. 28 may be modified to perform the pooling operation on sub-matrices of other sizes, such as 3x3, 5x5, 6x6 or 7x7.

As may also be observed, the number of result columns written to the weight RAM 124 is one-quarter the number of columns of the input data matrix. Finally, the data RAM 122 is not used in this example. Alternatively, however, the data RAM 122 may be used rather than the weight RAM 124 to perform the pooling operation.

In the embodiment of FIGS. 27 and 28, the pooling operation computes the maximum value of the sub-region. However, the program of FIG. 28 may be modified to compute the average value of the sub-region, e.g., by replacing the maxwacc instructions with sumwacc instructions (which sum the weight word with the accumulator 202 value 217) and changing the activation function instruction at address 9 to divide the accumulated result by the number of elements of each sub-region (preferably via reciprocal multiplication, as described below), which is sixteen in this example.
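As a brief illustration of that variant, the following plain-Python sketch (not NNU code) replaces the maximum with a sum of the sixteen elements and multiplies by the reciprocal of sixteen, mirroring the reciprocal multiplication mentioned above.

```python
# Sketch of the average-pooling variant: sum the sixteen elements of each 4x4
# sub-matrix (the sumwacc behavior) and multiply by the reciprocal 1/16 rather
# than dividing, mirroring the reciprocal multiplication described below.
def avg_pool_4x4(matrix):
    reciprocal = 1.0 / 16                     # user-specified reciprocal value
    return [[sum(matrix[c + i][w + j] for i in range(4) for j in range(4)) * reciprocal
             for w in range(0, len(matrix[0]), 4)]
            for c in range(0, len(matrix), 4)]
```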

As may be observed from the operation of the NNU 121 according to FIGS. 27 and 28, each execution of the program of FIG. 28 takes approximately 6,000 clock cycles to perform a pooling operation of the entire 512x1600 data matrix of FIG. 27, which is considerably fewer than the number of clock cycles required to perform a similar task by conventional methods.

Alternatively, rather than writing the results of the pooling operation back to the weight RAM 124, the architectural program may configure the NNU program to write the results back to columns of the data RAM 122, and the architectural program reads the results from the data RAM 122 as the NNU 121 writes them (e.g., using the most-recently-written data RAM 122 column 2606 address of FIG. 26B). This alternative may be advantageous in an embodiment in which the weight RAM 124 is single-ported and the data RAM 122 is dual-ported.

Fixed-Point Arithmetic with User-Supplied Binary Points, Full-Precision Fixed-Point Accumulation, User-Specified Reciprocal Value, Stochastic Rounding of the Accumulator Value, and Selectable Activation/Output Functions

Generally speaking, hardware units that perform arithmetic in digital computing devices may be divided into "integer" units and "floating-point" units, according to whether they perform arithmetic on integers or on floating-point numbers. A floating-point number has a magnitude (or mantissa) and an exponent, and typically a sign. The exponent is an indication of the location of the radix point (typically the binary point) relative to the magnitude. In contrast, an integer has no exponent, but only a magnitude, and typically a sign. A floating-point unit enables a programmer to work with numbers drawn from a very wide range of values, and the hardware takes care of adjusting the exponent values of the numbers as needed, without requiring the programmer to do so. For example, assume the two floating-point numbers 0.111 x 10^29 and 0.81 x 10^31 are multiplied. (A decimal, or base-10, example is used here, although floating-point units most commonly work with base-2 floating-point numbers.) The floating-point unit automatically takes care of multiplying the mantissas, adding the exponents, and then normalizing the result back to a value of 0.8991 x 10^59. For another example, assume the same two floating-point numbers are added. The floating-point unit automatically takes care of aligning the binary points of the mantissas before adding them, to generate a resulting sum with a value of 0.81111 x 10^31.

However, as is well known, such complex operations lead to an increase in the size of the floating-point unit, increased power consumption, more clock cycles per instruction, and/or lengthened cycle times. For this reason, many devices (e.g., embedded processors, microcontrollers, and relatively low-cost and/or low-power microprocessors) do not include a floating-point unit. As may be observed from the examples above, the complex structure of a floating-point unit includes logic that performs the exponent calculations associated with floating-point addition and multiplication/division (i.e., adders that add/subtract the exponents of the operands to produce the resulting exponent value of a floating-point multiplication/division, and subtracters that subtract the operand exponents to determine the binary point alignment shift amount for a floating-point addition), shifters that accomplish the binary point alignment of the mantissas for a floating-point addition, and shifters that normalize floating-point results. Additionally, the flow of operations typically also requires logic to perform rounding of floating-point results, logic to convert between integer and floating-point formats and between different floating-point formats (e.g., extended precision, double precision, single precision, half precision), leading-zero and leading-one detectors, and logic to deal with special floating-point numbers, such as denormal numbers, NaNs and infinity.

Furthermore, the correctness verification of a floating-point unit is greatly increased in complexity because of the increased number space over which the design must be verified, which can lengthen the product development cycle and time to market. Still further, as described above, floating-point arithmetic implies the storage and use of separate mantissa and exponent fields for each floating-point number involved in the computation, which can increase the amount of storage required and/or reduce precision given an equal amount of storage used to store integers. Many of these disadvantages are avoided by performing arithmetic operations in integer units.

Frequently, programmers need to write programs that process fractional numbers, i.e., numbers that are not whole numbers. Such programs may need to run on processors that do not have a floating-point unit, or, even if the processor has one, the integer instructions executed by the processor's integer units may be faster. To take advantage of the performance benefits of integer units, the programmer employs what is commonly known as fixed-point arithmetic on fixed-point numbers. Such programs include instructions that execute on integer units to process integers, or integer data. The software knows that the data is fractional, and it includes instructions that perform operations on the integer data to deal with the fact that the data is actually fractional, e.g., alignment shifts. Essentially, the fixed-point software manually performs some or all of the functionality that a floating-point unit would otherwise perform.

In the present disclosure, a "fixed-point" number (or value or operand or input or output) is a number whose bits of storage are understood to include bits that represent a fractional portion of the fixed-point number, referred to herein as "fractional bits." The bits of storage of the fixed-point number are contained in a memory or register, e.g., an 8-bit or 16-bit word in a memory or register. Furthermore, the bits of storage of the fixed-point number are all used to express a magnitude, and in some cases one of the bits is used to express a sign, but none of the storage bits of the fixed-point number are used to express an exponent of the number. Furthermore, the number of fractional bits, or binary point location, of the fixed-point number is specified in storage that is distinct from the storage bits of the fixed-point number, and the number of fractional bits, or binary point location, is indicated in a shared, or global, fashion for a set of fixed-point numbers to which the fixed-point number belongs, such as the set of input operands, accumulated values or output results of an array of processing units.
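As a small illustration of this convention, the following plain-Python sketch encodes values under a shared, separately stored fractional-bit count; the 16-bit word size and the particular fractional-bit value are assumptions for illustration only.

```python
# Sketch of the fixed-point convention described above: every stored bit of a
# word expresses magnitude (plus an optional sign bit), while the number of
# fractional bits is kept once, in separate shared storage, for the whole set
# of numbers (e.g., all input operands).
WORD_BITS = 16              # assumed word size
shared_fraction_bits = 5    # shared binary-point indicator for the whole set

def to_fixed(value, frac_bits=shared_fraction_bits):
    """Encode a real value as a signed 16-bit fixed-point integer."""
    scaled = round(value * (1 << frac_bits))
    lo, hi = -(1 << (WORD_BITS - 1)), (1 << (WORD_BITS - 1)) - 1
    return max(lo, min(hi, scaled))          # saturate to the word

def from_fixed(word, frac_bits=shared_fraction_bits):
    """Decode a signed fixed-point integer back to a real value."""
    return word / (1 << frac_bits)

x = to_fixed(3.15625)        # stored as the integer 101 (no exponent bits)
assert from_fixed(x) == 3.15625
```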

In the embodiments described herein, the arithmetic logic units are integer units, whereas the activation function units include floating-point arithmetic hardware assist, or acceleration. This enables the arithmetic logic unit portions to be smaller and faster, which facilitates having more arithmetic logic units in a given space on the chip. This implies more neurons per unit of chip space, which is particularly advantageous in a neural network unit.

Furthermore, in contrast to floating-point numbers, which require exponent storage bits for each number, the fixed-point numbers of the embodiments described herein express the number of storage bits that are fractional bits for an entire set of numbers with a single indicator; this indicator resides in a single, shared storage and applies broadly to all the numbers of the entire set, e.g., the set of inputs to a series of operations, the set of accumulated values of a series of operations, or the set of outputs. Preferably, the user of the NNU is enabled to specify the number of fractional storage bits for the set of numbers. Thus, it should be understood that, although in many contexts (e.g., common mathematics) the term "integer" refers to a signed whole number, i.e., a number without a fractional portion, in the present context the term "integer" may refer to a number that has a fractional portion. Furthermore, in the present context the term "integer" is intended to distinguish from floating-point numbers, for which a portion of the bits of their individual storage is used to express the exponent of the floating-point number. Similarly, integer arithmetic operations, such as the integer multiplies, adds or compares performed by integer units, assume that the operands do not have an exponent; therefore, the integer elements of the integer units, e.g., integer multipliers, integer adders, integer comparators, need not include logic to deal with exponents, e.g., they do not need to shift mantissas to align binary points for addition or comparison operations, and they do not need to add exponents for multiply operations.

Additionally, the embodiments described herein include a large hardware integer accumulator to accumulate a large series of integer operations (e.g., on the order of 1000 multiply-accumulate operations) without loss of precision. This enables the neural network unit to avoid dealing with floating-point numbers while at the same time retaining full precision in the accumulated values, without saturating them or producing inaccurate results due to overflow. Once the series of integer operations has accumulated a result into the full-precision accumulator, the fixed-point hardware assistance performs the necessary scaling and saturating operations to convert the full-precision accumulated value into an output value, using the user-specified indication of the number of fractional bits of the accumulated value and the desired number of fractional bits of the output value, as described in more detail below.

When the accumulated value must be compressed from its full-precision form for use as an input to an activation function or for being passed through, preferably the activation function unit can selectively perform random rounding on the accumulated value, as described in more detail below. Finally, the neural processing units may selectively accept instructions to apply different activation functions and/or to output the accumulated value in a variety of different forms, as dictated by the different requirements of a given layer of the neural network.

FIG. 29A is a block diagram illustrating an embodiment of the control register 127 of FIG. 1. The control register 127 may comprise a plurality of registers. As shown, the control register 127 includes the following fields: configuration 2902, signed data 2912, signed weight 2914, data binary point 2922, weight binary point 2924, ALU function 2926, rounding control 2932, activation function 2934, reciprocal 2942, shift amount 2944, output RAM 2952, output binary point 2954, and output command 2956. The control register 127 values may be written both by an MTNN instruction 1400 and by an instruction of an NNU program, such as a start instruction.

The configuration 2902 value specifies whether the neural network unit 121 is in a narrow configuration, a wide configuration, or a funnel configuration, as described above. The configuration 2902 also implies the size of the input words received from the data RAM 122 and the weight RAM 124. In the narrow and funnel configurations the input word size is narrow (e.g., 8 or 9 bits), whereas in the wide configuration the input word size is wide (e.g., 12 or 16 bits). The configuration 2902 also implies the size of the output result 133, which is the same as the input word size.

When the signed data value 2912 is true, it indicates that the data words received from the data RAM 122 are signed values; when false, it indicates they are unsigned values. When the signed weight value 2914 is true, it indicates that the weight words received from the weight RAM 124 are signed values; when false, it indicates they are unsigned values.

The data binary point 2922 value indicates the position of the binary point for the data words received from the data RAM 122. Preferably, the data binary point 2922 value indicates how many bit positions from the right the binary point is located. In other words, the data binary point 2922 indicates how many of the least significant bits of a data word are fractional bits, i.e., lie to the right of the binary point. Similarly, the weight binary point 2924 value indicates the position of the binary point for the weight words received from the weight RAM 124. Preferably, when the ALU function 2926 is a multiply-accumulate or output-accumulate, the neural processing unit 126 determines the number of bits to the right of the binary point of the value held in the accumulator 202 as the sum of the data binary point 2922 and the weight binary point 2924. Thus, for example, if the value of the data binary point 2922 is 5 and the value of the weight binary point 2924 is 3, the value in the accumulator 202 has 8 bits to the right of the binary point. When the ALU function 2926 is a sum/maximum of accumulator and data/weight word, or a pass-through of data/weight word, the neural processing unit 126 determines the number of bits to the right of the binary point of the value held in the accumulator 202 as the data/weight binary point 2922/2924, respectively. In another embodiment, described in more detail below with respect to FIG. 29B, a single accumulator binary point 2923 is specified, rather than individual data binary point 2922 and weight binary point 2924.
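
As a hedged illustration of the multiply-accumulate rule just described, the following C sketch (hypothetical names and word widths, not taken from the embodiments) shows that the implied accumulator binary point is simply the sum of the data and weight fractional-bit counts.

#include <stdint.h>

/* A data word with DATA_FRAC fractional bits multiplied by a weight word with
 * WEIGHT_FRAC fractional bits yields a product with DATA_FRAC + WEIGHT_FRAC
 * fractional bits; accumulating such products leaves the accumulator's implied
 * binary point at that same position, e.g. 5 + 3 = 8 bits to the right. */
enum { DATA_FRAC = 5, WEIGHT_FRAC = 3, ACC_FRAC = DATA_FRAC + WEIGHT_FRAC };

static int64_t mac(int64_t acc, int16_t data, int16_t weight) {
    return acc + (int32_t)data * (int32_t)weight;  /* product has ACC_FRAC fractional bits */
}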

The ALU function 2926 specifies the function performed by the ALU 204 of the neural processing unit 126. As described above, the ALU functions 2926 may include, but are not limited to: multiply the data word 209 by the weight word 203 and accumulate the product with the accumulator 202; sum the accumulator 202 and the weight word 203; sum the accumulator 202 and the data word 209; maximum of the accumulator 202 and the data word 209; maximum of the accumulator 202 and the weight word 203; output the accumulator 202; pass through the data word 209; pass through the weight word 203; output zero. In one embodiment, the ALU function 2926 is specified by an NNU initialize instruction and is used by the ALU 204 in response to an execute instruction (not shown). In one embodiment, the ALU function 2926 is specified by individual NNU instructions, such as the multiply-accumulate and maxwacc instructions described above.

The rounding control 2932 specifies the form of rounding used by the rounder 3004 (of FIG. 30). In one embodiment, the rounding modes that may be specified include, but are not limited to: no rounding, round to nearest, and random rounding. Preferably, the processor 100 includes a random bit source 3003 (see FIG. 30) that generates random bits 3005 that are sampled and used to perform the random rounding in order to reduce the likelihood of a rounding bias. In one embodiment, when the round bit is one and the sticky bit is zero, the neural processing unit 126 rounds up if the sampled random bit 3005 is true and does not round up if the sampled random bit 3005 is false. In one embodiment, the random bit source 3003 generates the random bits 3005 by sampling random electrical characteristics of the processor 100, such as thermal noise across semiconductor diodes or resistors, although the invention is not limited thereto.
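
A minimal C sketch of the random-rounding rule just described follows. It assumes the value being rounded is the non-negative (positive-form) value and that `shift` low-order bits are being dropped; the non-tie cases follow ordinary round-to-nearest, which is an assumption here, and the function name and interface are hypothetical.

#include <stdbool.h>
#include <stdint.h>

/* Round bit = most significant dropped bit; sticky = OR of the remaining dropped
 * bits.  On the tie case (round bit 1, sticky 0) the sampled random bit decides
 * whether to round up, which on average removes rounding bias. */
static uint64_t random_round(uint64_t value, unsigned shift, bool random_bit) {
    if (shift == 0)
        return value;
    bool round_bit  = (value >> (shift - 1)) & 1;
    bool sticky     = shift > 1 && (value & ((1ULL << (shift - 1)) - 1)) != 0;
    uint64_t result = value >> shift;
    if (round_bit && (sticky || random_bit))
        result += 1;  /* round up */
    return result;
}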

The activation function 2934 specifies the function applied to the accumulator 202 value 217 to generate the output 133 of the neural processing unit 126. As described herein, the activation functions 2934 include, but are not limited to: sigmoid; hyperbolic tangent; softplus; rectify; divide by a specified power of two; multiply by a user-specified reciprocal value to accomplish an effective division; pass through the full accumulator; and pass through the accumulator in a standard size, as described in more detail below. In one embodiment, the activation function is specified by an NNU activation function instruction. Alternatively, the activation function may be specified by an initialize instruction and applied in response to an output instruction, e.g., the activation function unit output instruction at address 4 of FIG. 4; in such an embodiment, the activation function instruction at address 3 of FIG. 4 is subsumed into the output instruction.

The reciprocal 2942 value specifies a value that is multiplied by the accumulator 202 value 217 to accomplish a division of the accumulator 202 value 217. That is, the user specifies the reciprocal 2942 value as the reciprocal of the divisor by which the division is actually desired. This is useful, for example, in conjunction with convolution or pooling operations as described herein. Preferably, the user specifies the reciprocal 2942 value in two parts, as described in more detail below with respect to FIG. 29C. In one embodiment, the control register 127 includes a field (not shown) that enables the user to select for the division one of a plurality of built-in divisor values whose sizes correspond to commonly used convolution kernel sizes, e.g., 9, 25, 36, or 49. In such an embodiment, the activation function unit 212 stores the reciprocals of the built-in divisors for multiplication by the accumulator 202 value 217.

The shift amount 2944 specifies the number of bits by which a shifter of the activation function unit 212 right-shifts the accumulator 202 value 217 to accomplish a division by a power of two. This is useful in conjunction with convolution kernels whose size is a power of two.

The output RAM 2952 value specifies which of the data RAM 122 and the weight RAM 124 receives the output results 133.

The output binary point 2954 value indicates the position of the binary point of the output results 133. Preferably, the output binary point 2954 value indicates how many bit positions from the right the binary point of the output result 133 is located. In other words, the output binary point 2954 indicates how many of the least significant bits of an output result 133 are fractional bits, i.e., lie to the right of the binary point. The activation function unit 212 performs rounding, compression, saturation, and size conversion based on the value of the output binary point 2954 (as well as, in most cases, on the values of the data binary point 2922, the weight binary point 2924, the activation function 2934, and/or the configuration 2902).

The output command 2956 controls the output results 133 in a number of respects. In one embodiment, the activation function unit 212 employs the notion of a standard size, which is twice the width (in bits) specified by the configuration 2902. Thus, for example, if the configuration 2902 implies that the input words received from the data RAM 122 and the weight RAM 124 are 8 bits, the standard size is 16 bits; for another example, if the configuration 2902 implies that the input words are 16 bits, the standard size is 32 bits. As described herein, the accumulator 202 is larger (e.g., the narrow accumulator 202B is 28 bits and the wide accumulator 202A is 41 bits) in order to preserve the full precision of the intermediate computations, e.g., of 1024 or 512 NNU multiply-accumulate instructions. Consequently, the accumulator 202 value 217 is larger (in bits) than the standard size, and for most values of the activation function 2934 (except pass-through of the full accumulator), the activation function unit 212 (e.g., the standard size compressor 3008 described below with respect to FIG. 30) compresses the accumulator 202 value 217 down to the standard size. A first predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result whose size is the same as the original input word size, i.e., half the standard size, and to output the internal result as the output result 133. A second predetermined value of the output command 2956 instructs the activation function unit 212 to perform the specified activation function 2934 to generate an internal result whose size is twice the original input word size, i.e., the standard size, and to output the lower half of the internal result as the output result 133; and a third predetermined value of the output command 2956 instructs the activation function unit 212 to output the upper half of the standard-size internal result as the output result 133. A fourth predetermined value of the output command 2956 instructs the activation function unit 212 to output the raw least significant word of the accumulator 202 as the output result 133; a fifth predetermined value instructs the activation function unit 212 to output the raw middle significant word of the accumulator 202 as the output result 133; and a sixth predetermined value instructs the activation function unit 212 to output the raw most significant word of the accumulator 202 (whose width is specified by the configuration 2902) as the output result 133, as described in more detail above with respect to FIGS. 8 through 10. As described above, outputting the full accumulator 202 size or the standard-size internal result advantageously enables other execution units 112 of the processor 100 to perform activation functions, such as the softmax activation function.

Although the fields described with respect to FIG. 29A (and FIGS. 29B and 29C) reside in the control register 127, the invention is not limited thereto, and one or more of the fields may reside in other parts of the neural network unit 121. Preferably, many of the fields may be included in the NNU instructions themselves and decoded by the sequencer 128 to generate micro-instructions 3416 (see FIG. 34) that control the ALUs 204 and/or the activation function units 212. Additionally, the fields may be included in micro-operations 3414 stored in the media registers 118 (see FIG. 34) that control the ALUs 204 and/or the activation function units 212. Such embodiments may reduce the use of an initialize NNU instruction, and in other embodiments the initialize NNU instruction may be eliminated.

As described above, an NNU instruction may specify that an ALU operation is performed on a memory operand (e.g., a word from the data RAM 122 and/or the weight RAM 124) or on a rotated operand (e.g., from the multiplexed registers 208/705). In one embodiment, an NNU instruction may also specify an operand that is the registered output of an activation function (e.g., the output of the register 3038 of FIG. 30). Additionally, as described above, an NNU instruction may specify that the current row address of the data RAM 122 or the weight RAM 124 be incremented. In one embodiment, the NNU instruction may specify an immediate signed integer difference that is added to the current row to accomplish incrementing or decrementing by an amount other than one.

FIG. 29B is a block diagram illustrating another embodiment of the control register 127 of FIG. 1. The control register 127 of FIG. 29B is similar to the control register 127 of FIG. 29A; however, the control register 127 of FIG. 29B includes an accumulator binary point 2923. The accumulator binary point 2923 indicates the position of the binary point of the accumulator 202. Preferably, the accumulator binary point 2923 value indicates how many bit positions from the right the binary point is located. In other words, the accumulator binary point 2923 indicates how many of the least significant bits of the accumulator 202 are fractional bits, i.e., lie to the right of the binary point. In this embodiment, the accumulator binary point 2923 is specified explicitly, rather than being determined implicitly as in the embodiment of FIG. 29A.

FIG. 29C is a block diagram illustrating an embodiment in which the reciprocal 2942 of FIG. 29A is stored in two parts. The first part 2962 is a shift value that indicates the number 2962 of suppressed leading zeros of the true reciprocal value that the user wants multiplied by the accumulator 202 value 217. The number of leading zeros is the number of consecutive zeros immediately to the right of the binary point. The second part 2964 is the leading-zero-suppressed reciprocal value, i.e., the true reciprocal value with all of the leading zeros removed. In one embodiment, the number of suppressed leading zeros 2962 is stored as 4 bits and the leading-zero-suppressed reciprocal value 2964 is stored as an 8-bit unsigned value.

For example, assume the user wants to multiply the accumulator 202 value 217 by the reciprocal of the value 49. The reciprocal of 49 expressed in binary with 13 fractional bits is 0.0000010100111, which has five leading zeros. Accordingly, the user fills the number of suppressed leading zeros 2962 with the value 5 and fills the leading-zero-suppressed reciprocal value 2964 with the value 10100111. After the reciprocal multiplier ("divider A") 3014 (see FIG. 30) multiplies the accumulator 202 value 217 by the leading-zero-suppressed reciprocal value 2964, it right-shifts the resulting product by the number of suppressed leading zeros 2962. Such an embodiment advantageously achieves high precision while representing the reciprocal 2942 value with a relatively small number of bits.
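
The 1/49 example above can be checked with a small C sketch. The alignment convention used here (shifting by the suppressed-zero count plus the eight stored bits) is an assumption made only for the purpose of the illustration, and the helper names are hypothetical.

#include <stdint.h>
#include <stdio.h>

/* The stored byte 0b10100111 with 5 suppressed leading zeros represents
 * 0.0000010100111 in binary, i.e. 167 * 2^-13, which approximates 1/49. */
static int64_t reciprocal_multiply(int64_t acc, uint8_t suppressed_recip,
                                   unsigned suppressed_zeros) {
    unsigned total_shift = suppressed_zeros + 8;   /* 5 + 8 = 13 fractional bits */
    return (acc * (int64_t)suppressed_recip) >> total_shift;
}

int main(void) {
    /* 4900 / 49 should be about 100; this sketch yields 99 due to truncation. */
    printf("%lld\n", (long long)reciprocal_multiply(4900, 0xA7, 5));
    return 0;
}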

FIG. 30 is a block diagram illustrating an embodiment of the activation function unit 212 of FIG. 2. The activation function unit 212 includes the control logic 127 of FIG. 1; a positive form converter (PFC) and output binary point aligner (OBPA) 3002 that receives the accumulator 202 value 217; a rounder 3004 that receives the accumulator 202 value 217 and an indication of the number of bits shifted out by the OBPA 3002; a random bit source 3003, as described above, that generates random bits 3005; a first multiplexer 3006 that receives the output of the PFC and OBPA 3002 and the output of the rounder 3004; a standard size compressor (CCS) and saturator 3008 that receives the output of the first multiplexer 3006; a bit selector and saturator 3012 that receives the output of the CCS and saturator 3008; a rectifier 3018 that receives the output of the CCS and saturator 3008; a reciprocal multiplier 3014 that receives the output of the CCS and saturator 3008; a right shifter 3016 that receives the output of the CCS and saturator 3008; a hyperbolic tangent (tanh) module 3022 that receives the output of the bit selector and saturator 3012; a sigmoid module 3024 that receives the output of the bit selector and saturator 3012; a softplus module 3026 that receives the output of the bit selector and saturator 3012; a second multiplexer 3032 that receives the outputs of the tanh module 3022, the sigmoid module 3024, the softplus module 3026, the rectifier 3018, the reciprocal multiplier 3014, and the right shifter 3016, as well as the standard-size pass-through output 3028 of the CCS and saturator 3008; a sign restorer 3034 that receives the output of the second multiplexer 3032; a size converter and saturator 3036 that receives the output of the sign restorer 3034; a third multiplexer 3037 that receives the output of the size converter and saturator 3036 and the accumulator output 217; and an output register 3038 that receives the output of the multiplexer 3037 and whose output is the result 133 of FIG. 1.

The positive form converter and output binary point aligner 3002 receives the accumulator 202 value 217. Preferably, as described above, the accumulator 202 value 217 is a full-precision value. That is, the accumulator 202 has enough storage bits to hold an accumulated value that is the sum, generated by the integer adder 244, of a series of products generated by the integer multiplier 242, without discarding any bits of the individual products of the multiplier 242 or of the sums of the adder, so that no precision is lost. Preferably, the accumulator 202 has at least enough bits to hold the maximum number of accumulated products that the neural network unit 121 may be programmed to generate. For example, referring to the program of FIG. 4, in the wide configuration the maximum number of accumulated products the neural network unit 121 may be programmed to generate is 512, and the accumulator 202 width is 41 bits. For another example, referring to the program of FIG. 20, in the narrow configuration the maximum number of accumulated products the neural network unit 121 may be programmed to generate is 1024, and the accumulator 202 width is 28 bits. Generally speaking, the full-precision accumulator 202 has at least Q bits, where Q is the sum of M and log2P, where M is the bit width of the integer product of the multiplier 242 (e.g., 16 bits for a narrow multiplier 242, or 32 bits for a wide multiplier 242) and P is the maximum permissible number of products that may be accumulated into the accumulator 202. Preferably, the maximum number of accumulated products is specified by a programming specification for the programmer of the neural network unit 121. In one embodiment, the sequencer 128 enforces a maximum count value of a multiply-accumulate NNU instruction (e.g., the instruction at address 2 of FIG. 4) of, for example, 511, on the assumption of one preceding multiply-accumulate instruction that loads a row of data/weight words 206/207 from the data/weight RAM 122/124 (e.g., the instruction at address 1 of FIG. 4).
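
The Q = M + log2(P) relation can be restated as a short helper for illustration; the function itself is an assumption of this sketch, not part of the embodiments.

#include <math.h>

/* Minimum full-precision accumulator width for accumulating up to max_products
 * products that are each product_bits wide.
 * Wide-configuration example: 32 + log2(512) = 32 + 9 = 41 bits. */
static int min_accumulator_bits(int product_bits, int max_products) {
    return product_bits + (int)ceil(log2((double)max_products));
}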

Advantageously, including an accumulator 202 with a bit width large enough to accumulate the maximum permitted number of full-precision values simplifies the design of the ALU 204 of the neural processing unit 126. In particular, it alleviates the need for logic to saturate the sums generated by the integer adder 244, which would otherwise overflow a smaller accumulator, and the need to keep track of the accumulator's binary point position in order to determine whether an overflow has occurred and whether a saturation is needed. To illustrate by example the problem with a design having a non-full-precision accumulator together with saturating logic to handle overflows of that accumulator, assume the following.

(1) The range of the data word values is between 0 and 1 and all of the storage bits are used to store fractional bits. The range of the weight word values is between -8 and +8 and all but three of the storage bits are used to store fractional bits. The range of the accumulated values for input to a hyperbolic tangent activation function is between -8 and +8 and all but three of the storage bits are used to store fractional bits.

(2) The bit width of the accumulator is non-full-precision (e.g., only the bit width of the products).

(3) The final accumulated value would be somewhere between -8 and +8 (e.g., +4.2), assuming the accumulator were full precision; however, the products before a "point A" in the series tend much more frequently to be positive, whereas the products after point A tend much more frequently to be negative.

In this situation, an inaccurate result (i.e., a result other than +4.2) might be obtained. This is because at some point before point A, when the accumulator should have reached a value greater than its saturated maximum of +8, e.g., +8.2, the additional 0.2 is lost. The accumulator may even remain at the saturated value for the remaining product accumulations, thereby losing even more positive value. Thus, the final value of the accumulator may be a smaller number than it would have been had the accumulator had a full-precision bit width (i.e., smaller than +4.2).

The positive form converter 3002 converts the accumulator 202 value 217 to a positive form when the value is negative, and generates an additional bit that indicates whether the original value was positive or negative, which is passed down the activation function unit 212 pipeline along with the value. Converting negative values to positive form simplifies subsequent operations of the activation function unit 212. For example, it ensures that only positive values enter the tanh module 3022 and the sigmoid module 3024, which simplifies the design of those modules. The rounder 3004 and the saturator 3008 are also simplified.

The output binary point aligner 3002 shifts right, or scales, the positive-form value so that it is aligned with the output binary point 2954 specified in the control register 127. Preferably, the output binary point aligner 3002 computes as the shift amount the difference between the number of fractional bits of the accumulator 202 value 217 (e.g., as specified by the accumulator binary point 2923, or as the sum of the data binary point 2922 and the weight binary point 2924) and the number of fractional bits of the output (e.g., as specified by the output binary point 2954). Thus, for example, if the accumulator 202 binary point 2923 is 8 (as in the embodiment above) and the output binary point 2954 is 3, the output binary point aligner 3002 right-shifts the positive-form value by 5 bits to generate a result provided to the multiplexer 3006 and to the rounder 3004.
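
As a one-line C sketch of the alignment computation just described (hypothetical names), the shift amount is the accumulator's fractional-bit count minus the output's fractional-bit count, e.g. 8 - 3 = 5.

#include <stdint.h>

/* Right-shift the positive-form accumulator value so its binary point matches
 * the output binary point; assumes acc_frac_bits >= out_frac_bits. */
static uint64_t align_to_output(uint64_t positive_value,
                                unsigned acc_frac_bits, unsigned out_frac_bits) {
    return positive_value >> (acc_frac_bits - out_frac_bits);
}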

The rounder 3004 rounds the accumulator 202 value 217. Preferably, the rounder 3004 generates a rounded version of the positive-form value generated by the positive form converter and output binary point aligner 3002 and provides the rounded version to the multiplexer 3006. The rounder 3004 rounds according to the rounding control 2932 described above, which, as described herein, may include random rounding using the random bits 3005. The multiplexer 3006 selects one of its inputs, namely either the positive-form value from the positive form converter and output binary point aligner 3002 or the rounded version of it from the rounder 3004, based on the rounding control 2932 (which, as described herein, may include random rounding), and provides the selected value to the standard size compressor and saturator 3008. Preferably, if the rounding control 2932 specifies no rounding, the multiplexer 3006 selects the output of the positive form converter and output binary point aligner 3002, and otherwise selects the output of the rounder 3004. Other embodiments are contemplated in which the activation function unit 212 performs additional rounding. For example, in one embodiment, the bit selector 3012 rounds based on lost low-order bits when it compresses the bits of the CCS and saturator 3008 output (described below). For another example, the product of the reciprocal multiplier 3014 (described below) is rounded. For yet another example, the size converter 3036 rounds when it converts to the proper output size (described below), which may involve losing low-order bits that are used in the rounding determination.

The standard size compressor 3008 compresses the multiplexer 3006 output value to a standard size. Thus, for example, if the neural processing unit 126 is in the narrow or funnel configuration 2902, the standard size compressor 3008 compresses the 28-bit multiplexer 3006 output value to 16 bits; whereas if the neural processing unit 126 is in the wide configuration 2902, the standard size compressor 3008 compresses the 41-bit multiplexer 3006 output value to 32 bits. However, before compressing to the standard size, if the pre-compressed value is larger than the maximum value expressible in the standard form, the saturator 3008 saturates the pre-compressed value to the maximum value expressible in the standard form. For example, if any of the bits of the pre-compressed value to the left of the most significant standard-form bit has a value of 1, the saturator 3008 saturates to the maximum value (e.g., to all 1s).
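
The saturate-then-compress behavior can be sketched in C as follows; the helper is hypothetical, and the value is assumed to be in positive form because the sign was split off earlier in the pipeline.

#include <stdint.h>

/* Clamp to the largest value representable in std_bits bits, then truncate.
 * E.g. a 41-bit wide-configuration value is reduced to the 32-bit standard size. */
static uint32_t compress_to_standard(uint64_t value, unsigned std_bits) {
    uint64_t max = (std_bits >= 64) ? UINT64_MAX : ((1ULL << std_bits) - 1);
    return (uint32_t)(value > max ? max : value);
}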

Preferably, the tanh module 3022, the sigmoid module 3024, and the softplus module 3026 comprise lookup tables, e.g., programmable logic arrays (PLA), read-only memories (ROM), combinational logic gates, and so forth. In one embodiment, in order to simplify and reduce the size of the modules 3022/3024/3026, the input value provided to them has the form 3.4, i.e., three whole bits and four fractional bits, i.e., the input value has four bits to the right of the binary point and three bits to the left of the binary point. These values are chosen because at the extremes of the 3.4-form input value range (-8, +8), the output values asymptotically approach their minimum/maximum values. However, the invention is not limited thereto, and other embodiments are contemplated that place the binary point at a different location, e.g., in a 4.3 form or a 2.5 form. The bit selector 3012 selects the bits of the CCS and saturator 3008 output that satisfy the 3.4-form criterion, which involves compression, i.e., some bits are lost, since the standard form has a larger number of bits. However, prior to selecting/compressing the CCS and saturator 3008 output value, if the pre-compressed value is greater than the maximum value expressible in the 3.4 form, the saturator 3012 saturates the pre-compressed value to the maximum value expressible in the 3.4 form. For example, if any of the bits of the pre-compressed value to the left of the most significant 3.4-form bit has a value of 1, the saturator 3012 saturates to the maximum value (e.g., to all 1s).

The tanh module 3022, the sigmoid module 3024, and the softplus module 3026 perform their respective activation functions (as described above) on the 3.4-form value output by the CCS and saturator 3008 to generate a result. Preferably, the results of the tanh module 3022 and the sigmoid module 3024 are 7-bit results in a 0.7 form, i.e., zero whole bits and seven fractional bits, i.e., the value has seven bits to the right of the binary point. Preferably, the result of the softplus module 3026 is a 7-bit result in a 3.4 form, i.e., in the same form as the input to the module 3026. Preferably, the outputs of the tanh module 3022, the sigmoid module 3024, and the softplus module 3026 are extended to the standard form (e.g., with leading zeros added as necessary) and aligned so as to have the binary point specified by the output binary point 2954 value.

The rectifier 3018 generates a rectified version of the output value of the CCS and saturator 3008. That is, if the output value of the CCS and saturator 3008 (whose sign, as described above, was moved down the pipeline) is negative, the rectifier 3018 outputs a value of zero; otherwise, the rectifier 3018 outputs its input value. Preferably, the output of the rectifier 3018 is in the standard form and has the binary point specified by the output binary point 2954 value.

The reciprocal multiplier 3014 multiplies the output of the CCS and saturator 3008 by the user-specified reciprocal value specified in the reciprocal value 2942 to generate its standard-size product, which is effectively the quotient of the output of the CCS and saturator 3008 and the divisor whose reciprocal is the reciprocal 2942 value. Preferably, the output of the reciprocal multiplier 3014 is in the standard form and has the binary point specified by the output binary point 2954 value.

The right shifter 3016 shifts the output of the CCS and saturator 3008 by the user-specified number of bits specified in the shift amount value 2944 to generate its standard-size quotient. Preferably, the output of the right shifter 3016 is in the standard form and has the binary point specified by the output binary point 2954 value.

The multiplexer 3032 selects the appropriate input as specified by the activation function 2934 value and provides the selection to the sign restorer 3034, which converts the positive-form output of the multiplexer 3032 to a negative form, e.g., to two's-complement form, if the original accumulator 202 value 217 was a negative value.

The size converter 3036 converts the output of the sign restorer 3034 to the proper size based on the value of the output command 2956, described above with respect to FIG. 29A. Preferably, the output of the sign restorer 3034 has a binary point specified by the output binary point 2954 value. Preferably, for the first predetermined value of the output command, the size converter 3036 discards the upper half of the bits of the sign restorer 3034 output. Furthermore, if the output of the sign restorer 3034 is positive and exceeds the maximum value expressible in the word size specified by the configuration 2902, or is negative and is less than the minimum value expressible in the word size, the saturator 3036 saturates its output to the respective maximum/minimum value expressible in the word size. For the second and third predetermined values, the size converter 3036 passes through the output of the sign restorer 3034.

The multiplexer 3037 selects either the size converter and saturator 3036 output or the accumulator 202 output 217, based on the output command 2956, for provision to the output register 3038. More specifically, for the first and second predetermined values of the output command 2956, the multiplexer 3037 selects the lower word (whose size is specified by the configuration 2902) of the output of the size converter and saturator 3036. For the third predetermined value, the multiplexer 3037 selects the upper word of the output of the size converter and saturator 3036. For the fourth predetermined value, the multiplexer 3037 selects the lower word of the raw accumulator 202 value 217; for the fifth predetermined value, the multiplexer 3037 selects the middle word of the raw accumulator 202 value 217; and for the sixth predetermined value, the multiplexer 3037 selects the upper word of the raw accumulator 202 value 217. As described above, preferably the activation function unit 212 pads the upper bits of the upper word of the raw accumulator 202 value 217 with zeros.

FIG. 31 is an example of operation of the activation function unit 212 of FIG. 30. As shown, the configuration 2902 is set to the narrow configuration of the neural processing units 126. Additionally, the signed data 2912 and signed weight 2914 values are true. Additionally, the data binary point 2922 value indicates that the binary point for the data RAM 122 words is located such that there are 7 bits to its right, and an example value of the first data word received by the neural processing unit 126 is shown as 0.1001110. Additionally, the weight binary point 2924 value indicates that the binary point for the weight RAM 124 words is located such that there are 3 bits to its right, and an example value of the first weight word received by the neural processing unit 126 is shown as 00001.010.

The 16-bit product of the first data and weight words (which is accumulated with the initial zero value of the accumulator 202) is shown as 000000.1100001100. Because the data binary point 2922 is 7 and the weight binary point 2924 is 3, the implied accumulator 202 binary point is located such that there are 10 bits to its right. In the case of the narrow configuration, as in this example, the accumulator 202 is 28 bits wide. In the example, after all of the ALU operations are performed (e.g., all 1024 multiply-accumulates of FIG. 20), the accumulator 202 value 217 is 000000000000000001.1101010100.
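
The product shown above can be reproduced with plain integer arithmetic. The raw encodings below are read directly from the example words (0.1001110 with 7 fractional bits and 00001.010 with 3 fractional bits); the sketch itself is only an illustration.

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t data   = 0x4E;  /* 0.1001110b = 78 with 7 fractional bits = 0.609375 */
    int16_t weight = 0x0A;  /* 00001.010b = 10 with 3 fractional bits = 1.25     */
    int32_t product = (int32_t)data * weight;   /* 780 with 10 fractional bits   */
    /* 780 = 0b1100001100, i.e. 000000.1100001100 = 0.76171875 = 0.609375 * 1.25 */
    printf("%d -> %.8f\n", product, product / 1024.0);
    return 0;
}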

The output binary point 2954 value indicates that the output binary point is located such that there are 7 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is scaled, rounded, and compressed to the standard-form value 000000001.1101011. In the example, the output binary point indicates 7 fractional bits and the accumulator 202 binary point position indicates 10 fractional bits. Therefore, the output binary point aligner 3002 computes a difference of 3 and scales the accumulator 202 value 217 by shifting it right 3 bits. This is indicated in FIG. 31 by the loss of the 3 least significant bits (binary 100) of the accumulator 202 value 217. Furthermore, in the example, the rounding control 2932 value indicates that random rounding is used, and in the example it is assumed that the sampled random bit 3005 is true. Consequently, as described above, the least significant bit is rounded up, because the round bit of the accumulator 202 value 217 (the most significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was a one and the sticky bit (the Boolean OR of the 2 least significant of the 3 bits shifted out by the scaling of the accumulator 202 value 217) was a zero.

In the example, the activation function 2934 indicates that a sigmoid function is to be used. Consequently, the bit selector 3012 selects the bits of the standard-form value such that the input to the sigmoid module 3024 has three whole bits and four fractional bits, as described above, i.e., the value 001.1101 as shown. The sigmoid module 3024 output value is put into the standard form, which is the value 000000000.1101110 as shown.

The output command 2956 of the example specifies the first predetermined value, i.e., output a word of the size indicated by the configuration 2902, which in this case is a narrow word (8 bits). Consequently, the size converter 3036 converts the standard-size sigmoid output value to an 8-bit quantity having an implied binary point located such that there are 7 bits to its right, yielding an output value of 01101110, as shown.

FIG. 32 is a second example of operation of the activation function unit 212 of FIG. 30. The example of FIG. 32 illustrates operation of the activation function unit 212 when the activation function 2934 indicates that the accumulator 202 value 217 is to be passed through in the standard size. As shown, the configuration 2902 is set to the narrow configuration of the neural processing units 126.

In the example, the accumulator 202 is 28 bits wide and the accumulator 202 binary point is located such that there are 10 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 10 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 10 according to another embodiment). In the example, after all of the ALU operations are performed, the accumulator 202 value 217 shown in FIG. 32 is 000001100000011011.1101111010.

In the example, the output binary point 2954 value indicates that the output binary point is located such that there are 4 bits to its right. Therefore, after passing through the output binary point aligner 3002 and the standard size compressor 3008, the accumulator 202 value 217 is saturated and compressed to the standard-form value 111111111111.1111 as shown, which is received by the multiplexer 3032 as the standard-size pass-through value 3028.

In the example, two output commands 2956 are shown. The first specifies the second predetermined value, i.e., output the lower word of the standard-form size. Since the size indicated by the configuration 2902 is a narrow word (8 bits), the standard size is 16 bits, and the size converter 3036 selects the lower 8 bits of the standard-size pass-through value 3028 to yield the 8-bit value 11111111, as shown. The second output command 2956 specifies the third predetermined value, i.e., output the upper word of the standard-form size. Consequently, the size converter 3036 selects the upper 8 bits of the standard-size pass-through value 3028 to yield the 8-bit value 11111111, as shown.

FIG. 33 is a third example of operation of the activation function unit 212 of FIG. 30. The example of FIG. 33 illustrates operation of the activation function unit 212 when the activation function 2934 indicates that the entire raw accumulator 202 value 217 is to be passed through. As shown, the configuration 2902 is set to the wide configuration of the neural processing units 126 (e.g., 16-bit input words).

In the example, the accumulator 202 is 41 bits wide and the accumulator 202 binary point is located such that there are 8 bits to its right (either because the sum of the data binary point 2922 and the weight binary point 2924 is 8 according to one embodiment, or because the accumulator binary point 2923 is explicitly specified as having a value of 8 according to another embodiment). In the example, after all of the ALU operations are performed, the accumulator 202 value 217 shown in FIG. 33 is 001000000000000000001100000011011.11011110.

In the example, three output commands 2956 are shown. The first specifies the fourth predetermined value, i.e., output the lower word of the raw accumulator 202 value; the second specifies the fifth predetermined value, i.e., output the middle word of the raw accumulator 202 value; and the third specifies the sixth predetermined value, i.e., output the upper word of the raw accumulator 202 value. Since the size indicated by the configuration 2902 is a wide word (16 bits), as shown in FIG. 33, in response to the first output command 2956 the multiplexer 3037 selects the 16-bit value 0001101111011110; in response to the second output command 2956 the multiplexer 3037 selects the 16-bit value 0000000000011000; and in response to the third output command 2956 the multiplexer 3037 selects the 16-bit value 0000000001000000.

As discussed above, advantageously the neural network unit 121 operates on integer data rather than floating-point data. This simplifies each neural processing unit 126, or at least its ALU 204 portion. For example, the ALU 204 need not include an adder that would be needed in a floating-point implementation to add the exponents of the multiplicands for the multiplier 242. Similarly, the ALU 204 need not include a shifter that would be needed in a floating-point implementation to align the binary points of the addends for the adder 234. As those skilled in the art will appreciate, floating-point units are generally very complex; these are thus only examples of simplifications to the ALU 204, and other simplifications are enjoyed by the instant integer embodiments with hardware fixed-point assistance that enables the user to specify the relevant binary points. The fact that the ALUs 204 are integer units advantageously results in a smaller (and faster) neural processing unit 126 than a floating-point embodiment, which further advantageously facilitates incorporating a large array of neural processing units 126 into the neural network unit 121. The activation function unit 212 portion deals with scaling and saturating the accumulator 202 value 217 based on the, preferably user-specified, number of fractional bits desired in the accumulated value and the number of fractional bits desired in the output value. Advantageously, any additional complexity and accompanying size increase, and energy and/or time consumption, in the fixed-point hardware assistance of the activation function units 212 may be amortized by sharing the activation function units 212 among the ALUs 204, since, as shown in the embodiment of FIG. 11, a sharing embodiment can reduce the number of activation function units 1112.

The embodiments described herein enjoy many of the advantages of reduced hardware complexity that come from using integer arithmetic units (compared with using floating-point arithmetic units), while still supporting arithmetic on fractional quantities, i.e., numbers with a binary point. The advantage of floating-point arithmetic is that it accommodates data arithmetic in which the individual values may fall anywhere within a very wide range (limited in practice only by the size of the exponent range, which can therefore be very large). That is, each floating-point number carries its own potentially unique exponent. However, the embodiments described here recognize and exploit the fact that certain applications have input data that is highly parallelized and falls within a relatively narrow range, such that all of the parallel values can share the same "exponent." Accordingly, these embodiments let the user specify the binary point location once for all of the input values and/or accumulated values. Similarly, by recognizing and exploiting the fact that the parallel outputs have a similar range, these embodiments let the user specify the binary point location once for all of the output values. Artificial neural networks are one example of such an application, although embodiments of the invention may also be employed to perform computations for other applications. By specifying the binary point location once for many inputs rather than for each individual input value, compared with floating-point arithmetic, embodiments of the invention use storage more efficiently (i.e., require less memory) and/or achieve greater precision for a similar amount of memory, since the bits that would serve as exponent bits in floating-point arithmetic can instead be used to provide more numerical precision.
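The shared-binary-point idea can be illustrated with a small sketch. The fractional-bit counts below are arbitrary assumptions chosen only for the example (they merely play the role of the data binary point 2912 and weight binary point 2914); the point is that every value in a set is stored as a plain integer and one binary-point specification applies to the whole set, so no per-value exponent alignment is needed during the multiply-accumulate.

```python
# A minimal sketch, not the hardware, of the shared-binary-point representation.
DATA_FRAC_BITS = 3      # analogous to a data binary point (assumed value)
WEIGHT_FRAC_BITS = 5    # analogous to a weight binary point (assumed value)
ACC_FRAC_BITS = DATA_FRAC_BITS + WEIGHT_FRAC_BITS   # binary point of the products/accumulator

def to_fixed(x, frac_bits):
    return round(x * (1 << frac_bits))   # real value -> integer with implied binary point

def to_real(i, frac_bits):
    return i / (1 << frac_bits)          # integer with implied binary point -> real value

data    = [to_fixed(v, DATA_FRAC_BITS)   for v in (0.5, -1.25, 2.0)]
weights = [to_fixed(v, WEIGHT_FRAC_BITS) for v in (0.75, 0.5, -0.125)]

# Pure integer multiply-accumulate; the binary point of every product is simply the
# sum of the data and weight binary points, so no exponent handling is required.
acc = sum(d * w for d, w in zip(data, weights))
print(to_real(acc, ACC_FRAC_BITS))       # -0.5 = 0.5*0.75 + (-1.25)*0.5 + 2.0*(-0.125)
```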

Furthermore, embodiments of the invention recognize that precision may be lost when accumulating a large series of integer operations (e.g., through overflow or loss of the less significant fractional bits), and provide a solution, primarily in the form of an accumulator that is large enough to avoid loss of precision.

Direct execution of neural network unit micro-operations

FIG. 34 is a block diagram illustrating the processor 100 of FIG. 1, and in particular the neural network unit 121, in partial detail. The neural network unit 121 includes pipeline stages 3401 of the neural processing units 126. The pipeline stages 3401, separated by stage registers, include combinational logic that accomplishes the operations of the neural processing units 126 described herein, such as Boolean logic gates, multiplexers, adders, multipliers, comparators, and so forth. The pipeline stages 3401 receive a micro-operation 3418 from a multiplexer 3402. The micro-operation 3418 flows down the pipeline stages 3401 and controls their combinational logic. The micro-operation 3418 is a collection of bits. Preferably, the micro-operation 3418 includes the bits of the data RAM 122 memory address 123, the bits of the weight RAM 124 memory address 125, the bits of the program memory 129 memory address 131, the mux-register 208/705 control signals 213/713, and the many fields of the control register 127 (e.g., of the control registers of FIGS. 29A through 29C). In one embodiment, the micro-operation 3418 comprises approximately 120 bits. The multiplexer 3402 receives micro-operations from three different sources and selects one of them as the micro-operation 3418 to provide to the pipeline stages 3401.
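Purely as an illustration of the kinds of fields the text says the roughly 120-bit micro-operation 3418 carries, the sketch below groups them into a record. The grouping, field names and widths are assumptions for readability, not the actual encoding used by the hardware.

```python
# A hypothetical grouping (not the actual encoding) of the micro-operation 3418 fields.
from dataclasses import dataclass

@dataclass
class MicroOp3418:
    data_ram_addr: int        # data RAM 122 address 123 bits
    weight_ram_addr: int      # weight RAM 124 address 125 bits
    program_mem_addr: int     # program memory 129 address 131 bits
    mux_reg_ctrl: int         # mux-register 208/705 control signals 213/713
    control_reg_fields: int   # control register fields (see FIGS. 29A-29C)

uop = MicroOp3418(data_ram_addr=3, weight_ram_addr=7, program_mem_addr=0,
                  mux_reg_ctrl=0b01, control_reg_fields=0)
```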

One source of micro-operations for the multiplexer 3402 is the sequencer 128 of FIG. 1. The sequencer 128 decodes the neural network unit instructions received from the program memory 129 and in response generates a micro-operation 3416 that it provides to a first input of the multiplexer 3402.

A second source of micro-operations for the multiplexer 3402 is a decoder 3404 that receives microinstructions 105 from the reservation stations 108 of FIG. 1, along with operands from the general purpose registers 116 and the media registers 118. Preferably, as described above, the microinstructions 105 are generated by the instruction translator 104 in response to translating MTNN instructions 1400 and MFNN instructions 1500. The microinstructions 105 may include an immediate field that specifies a particular function (specified by an MTNN instruction 1400 or an MFNN instruction 1500), such as starting or stopping execution of a program in the program memory 129, directly executing a micro-operation from the media registers 118, or reading/writing a memory of the neural network unit as described above. The decoder 3404 decodes the microinstructions 105 and in response generates a micro-operation 3412 that it provides to a second input of the multiplexer 3402. Preferably, for some functions 1432/1532 of an MTNN instruction 1400/MFNN instruction 1500, the decoder 3404 need not generate a micro-operation 3412 to send down the pipeline 3401, for example: writing the control register 127, starting execution of a program in the program memory 129, pausing execution of a program in the program memory 129, waiting for a program in the program memory 129 to complete execution, reading from the status register 127, and resetting the neural network unit 121.

A third source of micro-operations for the multiplexer 3402 is the media registers 118 themselves. Preferably, as described above with respect to FIG. 14, an MTNN instruction 1400 may specify a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided from the media registers 118 to a third input of the multiplexer 3402. Direct execution of a micro-operation 3414 provided by the architectural media registers 118 is useful for testing the neural network unit 121, such as built-in self test (BIST), and for debugging.

Preferably, the decoder 3404 generates a mode indicator 3422 that controls the selection made by the multiplexer 3402. When an MTNN instruction 1400 specifies a function that starts running a program from the program memory 129, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3416 from the sequencer 128, until an error occurs or until the decoder 3404 encounters an MTNN instruction 1400 that specifies a function that stops running the program from the program memory 129. When an MTNN instruction 1400 specifies a function that instructs the neural network unit 121 to directly execute a micro-operation 3414 provided from a media register 118, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3414 from the specified media register 118. Otherwise, the decoder 3404 generates a mode indicator 3422 value that causes the multiplexer 3402 to select the micro-operation 3412 from the decoder 3404.
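The three-way selection performed under control of the mode indicator 3422 can be modelled with a few lines of code. The sketch below is a behavioural illustration only, with assumed names; the real selection is combinational hardware in the multiplexer 3402, not software.

```python
# A minimal behavioural model of the multiplexer 3402 selection (names are assumptions).
FROM_SEQUENCER, FROM_MEDIA_REG, FROM_DECODER = range(3)

def select_uop(mode, uop_3416, uop_3414, uop_3412):
    """Pick which micro-operation flows down to the pipeline stages 3401."""
    if mode == FROM_SEQUENCER:   # a program in program memory 129 is running
        return uop_3416
    if mode == FROM_MEDIA_REG:   # MTNN function: directly execute a media-register micro-op
        return uop_3414
    return uop_3412              # otherwise: micro-op generated by the decoder 3404

# Example: while a program-memory program runs, the sequencer micro-operation is selected.
assert select_uop(FROM_SEQUENCER, "uop3416", "uop3414", "uop3412") == "uop3416"
```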

Variable rate neural network unit

In many situations, after running a program the neural network unit 121 sits idle, waiting for the processor 100 to take care of things it needs to do before running the next program. For example, assume a situation similar to that described with respect to FIGS. 3 through 6A, in which the neural network unit 121 runs a multiply-accumulate-activation function program (which may also be referred to as a feed-forward neural network layer program) two or more consecutive times. The processor 100 clearly takes significantly longer to write the 512KB of weight values into the weight RAM 124 for use by the next run of the neural network unit program than the neural network unit 121 takes to run the program. Stated differently, the neural network unit 121 runs the program in a short amount of time and then sits idle while the processor 100 writes the next set of weight values into the weight RAM 124 for the next program run. This situation is illustrated in FIG. 36A, described in more detail below. In such situations, the neural network unit 121 may be run at a lower clock rate to stretch out the time required to run the program, thereby spreading the energy consumed by running the program over a longer period of time, which tends to keep the neural network unit 121, and therefore the processor 100 as a whole, at a lower temperature. This situation is referred to as relaxed mode and is illustrated in FIG. 36B, described in more detail below.

FIG. 35 is a block diagram illustrating a processor 100 that includes a variable rate neural network unit 121. The processor 100 is similar in many respects to the processor 100 of FIG. 1, and like-numbered elements are similar. The processor 100 of FIG. 35 also includes clock generation logic 3502 coupled to the functional units of the processor 100, namely the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106, the reservation stations 108, the neural network unit 121, the other execution units 112, the memory subsystem 114, the general purpose registers 116, and the media registers 118. The clock generation logic 3502 includes a clock generator, such as a phase-locked loop (PLL), that generates a clock signal having a primary clock rate, or clock frequency. For example, the primary clock rate may be 1 GHz, 1.5 GHz, 2 GHz, and so forth. The clock rate indicates the number of cycles per second, e.g., the number of oscillations of the clock signal between a high state and a low state. Preferably, the clock signal has a balanced duty cycle, i.e., high for half the cycle and low for the other half; alternatively, the clock signal may have an unbalanced duty cycle in which it is in the high state longer than it is in the low state, or vice versa. Preferably, the PLL is configurable to generate the primary clock signal at multiple clock rates. Preferably, the processor 100 includes a power management module that automatically adjusts the primary clock rate based on various factors, including the dynamically sensed operating temperature of the processor 100, its utilization, and commands from system software (e.g., the operating system, the basic input/output system (BIOS)) that indicate the desired performance and/or power-saving targets. In one embodiment, the power management module comprises microcode of the processor 100.

The clock generation logic 3502 also includes a clock distribution network, or clock tree. The clock tree distributes the primary clock signal to the functional units of the processor 100, as shown in FIG. 35: clock signal 3506-1 to the instruction fetch unit 101, clock signal 3506-2 to the instruction cache 102, clock signal 3506-10 to the instruction translator 104, clock signal 3506-9 to the rename unit 106, clock signal 3506-8 to the reservation stations 108, clock signal 3506-7 to the neural network unit 121, clock signal 3506-4 to the other execution units 112, clock signal 3506-3 to the memory subsystem 114, clock signal 3506-5 to the general purpose registers 116, and clock signal 3506-6 to the media registers 118; these signals are referred to collectively as the clock signals 3506. The clock tree comprises nodes, or wires, that convey the primary clock signals 3506 to their respective functional units. Additionally, the clock generation logic 3502 preferably includes clock buffers that regenerate the primary clock signal where needed to provide a cleaner clock signal and/or to boost the voltage level of the primary clock signal, particularly for more distant nodes. Furthermore, each functional unit may include its own sub-clock tree that regenerates and/or boosts the voltage level of the respective primary clock signal 3506 it receives, as needed.

The neural network unit 121 includes clock reduction logic 3504 that receives a relax indicator 3512 and the primary clock signal 3506-7 and in response generates a secondary clock signal. The secondary clock signal has a clock rate that is either the same as the primary clock rate or, when in relaxed mode, reduced relative to the primary clock rate by an amount programmed into the relax indicator 3512, in order to reduce the rate at which thermal energy is generated. The clock reduction logic 3504 is similar in many respects to the clock generation logic 3502 in that it includes a clock distribution network, or clock tree, that distributes the secondary clock signal to various functional blocks of the neural network unit 121: clock signal 3508-1 to the array of neural processing units 126, clock signal 3508-2 to the sequencer 128, and clock signal 3508-3 to the interface logic 3514; these are referred to collectively as the secondary clock signals 3508. Preferably, the neural processing units 126 include a plurality of pipeline stages 3401, as shown in FIG. 34, which include pipeline staging registers that receive the secondary clock signal 3508-1 from the clock reduction logic 3504.

The neural network unit 121 also includes interface logic 3514 that receives the primary clock signal 3506-7 and the secondary clock signal 3508-3. The interface logic 3514 is coupled between the lower portions of the front end of the processor 100 (e.g., the reservation stations 108, the media registers 118, and the general purpose registers 116) and the various functional blocks of the neural network unit 121, namely the clock reduction logic 3504, the data RAM 122, the weight RAM 124, the program memory 129, and the sequencer 128. The interface logic 3514 includes a data RAM buffer 3522, a weight RAM buffer 3524, the decoder 3404 of FIG. 34, and the relax indicator 3512. The relax indicator 3512 holds a value that specifies how much more slowly, if at all, the array of neural processing units 126 will execute the neural network unit program instructions. Preferably, the relax indicator 3512 specifies a divisor value N by which the clock reduction logic 3504 divides the primary clock signal 3506-7 to generate the secondary clock signal 3508, such that the secondary clock rate is 1/N of the primary clock rate. Preferably, the value of N may be programmed to any one of a plurality of different predetermined values, which cause the clock reduction logic 3504 to generate corresponding secondary clock signals 3508 having different clock rates, each less than the primary clock rate.

In one embodiment, the clock reduction logic 3504 includes a clock divider circuit that divides the primary clock signal 3506-7 by the relax indicator 3512 value. In one embodiment, the clock reduction logic 3504 includes clock gates (e.g., AND gates) that gate the primary clock signal 3506-7 with an enable signal that is true only once every N cycles of the primary clock signal. For example, a circuit that includes a counter that counts up to N may be used to generate the enable signal: when the accompanying logic detects that the counter output matches N, the logic generates a true pulse on the secondary clock signal 3508 and resets the counter. Preferably, the relax indicator 3512 value is programmable by an architectural instruction, such as the MTNN instruction 1400 of FIG. 14. Preferably, the architectural program running on the processor 100 programs the relax value into the relax indicator 3512 just before the architectural program instructs the neural network unit 121 to start running the neural network unit program, as described in more detail below with respect to FIG. 37.
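The counter-based gating just described can be modelled behaviourally as follows. This is only a minimal sketch, assuming a simple divide-by-N counter; it is not a description of the actual gate-level circuit.

```python
# A minimal behavioural sketch of deriving the secondary clock enable from the primary clock.
def secondary_clock_enables(n, primary_cycles):
    """Yield True once every n primary clock cycles (True = a secondary clock pulse)."""
    count = 0
    for _ in range(primary_cycles):
        count += 1
        if count == n:
            count = 0          # counter matches N: pulse the secondary clock and reset
            yield True
        else:
            yield False

# With N = 4 (relaxed mode), only every fourth primary cycle advances the NPU pipeline.
pulses = list(secondary_clock_enables(4, 12))
assert pulses.count(True) == 3
```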

The weight RAM buffer 3524 is coupled between the weight RAM 124 and the media registers 118 to buffer data transfers between them. Preferably, the weight RAM buffer 3524 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the weight RAM buffer 3524 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the weight RAM buffer 3524 that receives data from the weight RAM 124 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the weight RAM 124 is single-ported, as described above with respect to FIG. 17, and is accessible in an arbitrated fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104 of FIG. 11. In an alternate embodiment, the weight RAM 124 is dual-ported, as described above with respect to FIG. 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the weight RAM buffer 3524 and by the neural processing units 126 or the row buffer 1104.

Similarly to the weight RAM buffer 3524, the data RAM buffer 3522 is coupled between the data RAM 122 and the media registers 118 to buffer data transfers between them. Preferably, the data RAM buffer 3522 is similar to one or more of the embodiments of the buffer 1704 of FIG. 17. Preferably, the portion of the data RAM buffer 3522 that receives data from the media registers 118 is clocked by the primary clock signal 3506-7 at the primary clock rate, and the portion of the data RAM buffer 3522 that receives data from the data RAM 122 is clocked by the secondary clock signal 3508-3 at the secondary clock rate, which may or may not be reduced relative to the primary clock rate depending on the value programmed into the relax indicator 3512, i.e., depending on whether the neural network unit 121 is operating in relaxed or normal mode. In one embodiment, the data RAM 122 is single-ported, as described above with respect to FIG. 17, and is accessible in an arbitrated fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104 of FIG. 11. In an alternate embodiment, the data RAM 122 is dual-ported, as described above with respect to FIG. 16, and each port is accessible in a concurrent fashion both by the media registers 118 via the data RAM buffer 3522 and by the neural processing units 126 or the row buffer 1104.

Preferably, regardless of whether the data RAM 122 and/or the weight RAM 124 are single-ported or dual-ported, the interface logic 3514 includes the data RAM buffer 3522 and the weight RAM buffer 3524 in order to synchronize the primary clock domain and the secondary clock domain. Preferably, each of the data RAM 122, the weight RAM 124 and the program memory 129 comprises a static RAM (SRAM) that includes a respective read enable signal, write enable signal and memory select enable signal.

As described above, the neural network unit 121 is an execution unit of the processor 100. An execution unit is a functional unit of a processor that executes the microinstructions into which architectural instructions are translated, or that executes the architectural instructions themselves, e.g., that executes the microinstructions 105 into which the architectural instructions 103 of FIG. 1 are translated, or the architectural instructions 103 themselves. An execution unit receives operands from general purpose registers of the processor, such as the general purpose registers 116 and the media registers 118. An execution unit, in response to executing a microinstruction or an architectural instruction, produces a result that is written to a general purpose register. The MTNN instruction 1400 and the MFNN instruction 1500 described with respect to FIGS. 14 and 15 are examples of architectural instructions 103. Microinstructions implement architectural instructions. More precisely, the collective execution by the execution unit of the one or more microinstructions into which an architectural instruction is translated performs the operation specified by the architectural instruction on the inputs specified by the architectural instruction to produce the result defined by the architectural instruction.

FIG. 36A is a timing diagram illustrating an example of the operation of the processor 100 with the neural network unit 121 operating in normal mode, i.e., at the primary clock rate. In the timing diagram, time progresses from left to right. The processor 100 runs the architectural program at the primary clock rate. More precisely, the front end of the processor 100 (e.g., the instruction fetch unit 101, the instruction cache 102, the instruction translator 104, the rename unit 106 and the reservation stations 108) fetches, decodes and issues architectural instructions to the neural network unit 121 and the other execution units 112 at the primary clock rate.

Initially, the architectural program executes an architectural instruction (e.g., an MTNN instruction 1400) that the processor 100 front end issues to the neural network unit 121 to instruct the neural network unit 121 to start running the neural network unit program in its program memory 129. Beforehand, the architectural program executed an architectural instruction to write into the relax indicator 3512 a value that specifies the primary clock rate, i.e., to place the neural network unit 121 in normal mode. More precisely, the value programmed into the relax indicator 3512 causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at the primary clock rate of the primary clock signal 3506. Preferably, in this case the clock buffers of the clock reduction logic 3504 simply boost the voltage level of the primary clock signal 3506. Also beforehand, the architectural program executed architectural instructions to write the data RAM 122 and the weight RAM 124 and to write the neural network unit program into the program memory 129. In response to the MTNN instruction 1400 that starts the neural network unit program, the neural network unit 121 starts running the neural network unit program at the primary clock rate, since the relax indicator 3512 was programmed with the primary rate value. After starting the neural network unit 121 running, the architectural program continues executing architectural instructions at the primary clock rate, including predominantly MTNN instructions 1400 that write and/or read the data RAM 122 and the weight RAM 124, in preparation for the next instance, or invocation, or run, of the neural network unit program.

In the example of FIG. 36A, the neural network unit 121 is able to complete its run of the neural network unit program in significantly less time (e.g., one-fourth of the time) than the architectural program takes to complete its writes/reads of the data RAM 122 and the weight RAM 124. For example, running at the primary clock rate, the neural network unit 121 may take approximately 1000 clock cycles to run the neural network unit program, whereas the architectural program takes approximately 4000 clock cycles. Consequently, the neural network unit 121 sits idle for the remainder of the time, which in this example is a significantly long time, e.g., approximately 3000 primary clock cycles. As shown in the example of FIG. 36A, this pattern repeats, possibly many times in a row, depending on the size and configuration of the neural network. Because the neural network unit 121 is a relatively large and transistor-dense functional unit of the processor 100, its operation generates a significant amount of thermal energy, particularly when running at the primary clock rate.

FIG. 36B is a timing diagram illustrating an example of the operation of the processor 100 with the neural network unit 121 operating in relaxed mode, i.e., at a clock rate lower than the primary clock rate. The timing diagram of FIG. 36B is similar in many respects to that of FIG. 36A, in which the processor 100 runs the architectural program at the primary clock rate. The example assumes that the architectural program and the neural network unit program of FIG. 36B are the same as those of FIG. 36A. However, before starting the neural network unit program, the architectural program executes an MTNN instruction 1400 that programs the relax indicator 3512 with a value that causes the clock reduction logic 3504 to generate the secondary clock signal 3508 at a secondary clock rate that is less than the primary clock rate. That is, the architectural program places the neural network unit 121 in the relaxed mode of FIG. 36B rather than the normal mode of FIG. 36A. Consequently, the neural processing units 126 execute the neural network unit program at the secondary clock rate, which in relaxed mode is less than the primary clock rate. The example assumes that the relax indicator 3512 is programmed with a value that specifies the secondary clock rate as one-fourth of the primary clock rate. Consequently, the neural network unit 121 takes four times longer to run the neural network unit program in relaxed mode than in normal mode, as may be seen by comparing FIGS. 36A and 36B, so that the length of time the neural network unit 121 sits idle is significantly shortened. Consequently, the duration over which the neural network unit 121 consumes energy running the neural network unit program in FIG. 36B is approximately four times as long as when the neural network unit 121 runs the program in normal mode in FIG. 36A. Accordingly, the neural network unit 121 in FIG. 36B generates thermal energy while running the neural network unit program at approximately one-fourth the rate per unit time of FIG. 36A, with the advantages described herein.
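In rough terms, and assuming the energy consumed per program run is about the same in both modes (the same operations are performed, only more slowly), the example figures above imply:

```latex
% Illustrative arithmetic only, using the example figures above.
t_{\text{relaxed}} = 4\,t_{\text{normal}}, \qquad
P_{\text{relaxed}} \approx \frac{E_{\text{run}}}{t_{\text{relaxed}}}
                   = \frac{E_{\text{run}}}{4\,t_{\text{normal}}}
                   \approx \tfrac{1}{4}\,P_{\text{normal}}
```

That is, stretching the run over four times the duration cuts the heat generated per unit time to roughly one quarter.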

FIG. 37 is a flowchart illustrating the operation of the processor 100 of FIG. 35. The flowchart describes operation similar in many respects to the operation described above with respect to FIGS. 35, 36A and 36B. Flow begins at block 3702.

At block 3702, the processor 100 executes MTNN instructions 1400 to write the weights into the weight RAM 124 and to write the data into the data RAM 122. Flow proceeds to block 3704.

At block 3704, the processor 100 executes an MTNN instruction 1400 to program the relax indicator 3512 with a value that specifies a clock rate lower than the primary clock rate, i.e., to place the neural network unit 121 in relaxed mode. Flow proceeds to block 3706.

At block 3706, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to start running the neural network unit program, in a manner similar to that shown in FIG. 36B. Flow proceeds to block 3708.

At block 3708, the neural network unit 121 begins running the neural network unit program. In parallel, the processor 100 executes MTNN instructions 1400 to write new weights into the weight RAM 124 (and possibly new data into the data RAM 122), and/or executes MFNN instructions 1500 to read results from the data RAM 122 (and possibly from the weight RAM 124). Flow proceeds to block 3712.

At block 3712, the processor 100 executes an MFNN instruction 1500 (e.g., one that reads the status register 127) to detect that the neural network unit 121 has finished running its program. Assuming the architectural program chose a good value for the relax indicator 3512, the time the neural network unit 121 takes to run the neural network unit program will be about the same as the time the processor 100 takes to execute the portion of the architectural program that accesses the weight RAM 124 and/or the data RAM 122, as shown in FIG. 36B. Flow proceeds to block 3714.

At block 3714, the processor 100 executes an MTNN instruction 1400 to program the relax indicator 3512 with a value that specifies the primary clock rate, i.e., to place the neural network unit 121 in normal mode. Flow proceeds to block 3716.

At block 3716, the processor 100 executes an MTNN instruction 1400 that instructs the neural network unit 121 to start running the neural network unit program, in a manner similar to that shown in FIG. 36A. Flow proceeds to block 3718.

At block 3718, the neural network unit 121 begins running the neural network unit program in normal mode. Flow ends at block 3718.
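The sequence of FIG. 37, as seen from the architectural program's side, can be condensed into the following sketch. All of the function names and the divisor values are hypothetical stand-ins (a real architectural program would issue MTNN 1400 / MFNN 1500 instructions); the sketch only mirrors the ordering of blocks 3702 through 3718.

```python
# A condensed, hypothetical model of the architectural-program flow of FIG. 37.
nnu = {"relax_divisor": 1, "program_done": False}   # stand-in for NNU-visible state

def mtnn_program_relax_indicator(n):   # blocks 3704 / 3714 (assumed wrapper, not a real intrinsic)
    nnu["relax_divisor"] = n

def mtnn_start_nnu_program():          # blocks 3706 / 3716 (assumed wrapper)
    nnu["program_done"] = False

def nnu_finishes():                    # stands in for the NNU completing its run
    nnu["program_done"] = True

# Block 3702: write weights/data (elided).  Block 3704: enter relaxed mode (divide by 4).
mtnn_program_relax_indicator(4)
mtnn_start_nnu_program()               # block 3706
# Block 3708: architectural program writes next weights / reads results here (elided).
nnu_finishes()
assert nnu["program_done"]             # block 3712: MFNN poll of the status register 127
mtnn_program_relax_indicator(1)        # block 3714: back to normal mode
mtnn_start_nnu_program()               # blocks 3716 / 3718
```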

As described above, running the neural network unit program in relaxed mode, rather than in normal mode (i.e., at the primary clock rate of the processor), spreads out the time over which the program runs, which can avoid high temperatures. More specifically, when the neural network unit runs the program in relaxed mode it generates thermal energy at a lower rate, and that heat can be removed gracefully via the neural network unit (e.g., the semiconductor devices, the metal layers and the underlying substrate), the surrounding package and the cooling solution (e.g., heat sink, fan); consequently, the devices in the neural network unit (e.g., transistors, capacitors, wires) are more likely to operate at lower temperatures. Viewed as a whole, operating in relaxed mode also helps to reduce the device temperatures in other portions of the processor die. Lower operating temperatures, particularly with respect to the junction temperatures of the devices, reduce the generation of leakage current. Furthermore, because the amount of current drawn per unit time is lower, inductive noise and IR drop noise are also reduced. Additionally, the lower temperatures have a positive effect on the negative-bias temperature instability (NBTI) and positive-bias temperature instability (PBTI) of the MOSFETs of the processor, which can increase the reliability and/or lifetime of the devices and of the processor portion. The lower temperatures also mitigate Joule heating and electromigration effects in the metal layers of the processor.

Communication mechanism between architectural program and non-architectural program regarding neural network unit shared resources

As described above, in the examples of FIGS. 24 through 28 and FIGS. 35 through 37, the resources of the data RAM 122 and the weight RAM 124 are shared. The neural processing units 126 and the front end of the processor 100 share the data RAM 122 and the weight RAM 124. More precisely, both the neural processing units 126 and the front end of the processor 100, e.g., the media registers 118, read from and write to the data RAM 122 and the weight RAM 124. Stated differently, the architectural program running on the processor 100 and the neural network unit program running on the neural network unit 121 share the data RAM 122 and the weight RAM 124, and in some situations, as described above, flow control is required between the architectural program and the neural network unit program. The resources of the program memory 129 are also shared to a degree, since the architectural program writes it and the sequencer 128 reads it. The embodiments described herein provide a high-performance solution for controlling the flow of access to the shared resources between the architectural program and the neural network unit program.

In the embodiments described herein, a neural network unit program is also referred to as a non-architectural program, a neural network unit instruction is also referred to as a non-architectural instruction, and the neural network unit instruction set (also referred to above as the neural processing unit instruction set) is also referred to as the non-architectural instruction set. The non-architectural instruction set is distinct from the architectural instruction set. In embodiments in which the processor 100 includes an instruction translator 104 that translates architectural instructions into microinstructions, the non-architectural instruction set is also distinct from the microinstruction set.

FIG. 38 is a block diagram illustrating the sequencer 128 of the neural network unit 121 in more detail. The sequencer 128 provides a memory address to the program memory 129 to select the non-architectural instruction that is provided to the sequencer 128, as described above. As shown in FIG. 38, the memory address is held in a program counter 3802 of the sequencer 128. The sequencer 128 generally increments sequentially through the addresses of the program memory 129 unless it encounters a non-architectural control instruction, such as a loop or branch instruction, in which case the sequencer 128 updates the program counter 3802 to the target address of the control instruction, i.e., to the address of the non-architectural instruction at the target of the control instruction. Consequently, the address 131 held in the program counter 3802 specifies the address in the program memory 129 of the non-architectural instruction of the non-architectural program currently being fetched for execution by the neural processing units 126. The value of the program counter 3802 may be obtained by the architectural program via the neural network unit program counter field 3912 of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to make decisions about where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.

The sequencer 128 also includes a loop counter 3804 that operates in conjunction with a non-architectural loop instruction, such as the loop-to-1 instruction at address 10 of FIG. 26A and the loop-to-1 instruction at address 11 of FIG. 28. In the examples of FIGS. 26A and 28, the loop counter 3804 is loaded with the value specified by the non-architectural initialize instruction at address 0, e.g., with the value 400. Each time the sequencer 128 encounters the loop instruction and jumps to the target instruction (e.g., the multiply-accumulate instruction at address 1 of FIG. 26A or the maxwacc instruction at address 1 of FIG. 28), the sequencer 128 decrements the loop counter 3804. Once the loop counter 3804 reaches zero, the sequencer 128 proceeds to the next sequential non-architectural instruction. In an alternate embodiment, when a loop instruction is first encountered the loop counter 3804 is loaded with a loop count value specified in the loop instruction itself, which obviates the need to initialize the loop counter 3804 with a non-architectural initialize instruction. Thus, the value of the loop counter 3804 indicates how many more times the loop body of the non-architectural program remains to be executed. The value of the loop counter 3804 may be obtained by the architectural program via the loop count field 3914 of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to make decisions about where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program. In one embodiment, the sequencer includes three additional loop counters to accommodate nested loops in the non-architectural program, and the values of these three loop counters are also readable via the status register 127. A bit in the loop instruction indicates which of the four loop counters is to be used by the current loop instruction.

The sequencer 128 also includes an iteration counter 3806. The iteration counter 3806 operates in conjunction with non-architectural instructions such as the multiply-accumulate instruction at address 2 of FIGS. 4, 9, 20 and 26A, and the maxwacc instruction at address 2 of FIG. 28, which will be referred to hereafter as "execute" instructions. In the examples above, the execute instructions specify iteration counts of 511, 511, 1023, 2 and 3, respectively. When the sequencer 128 encounters an execute instruction that specifies a non-zero iteration count, the sequencer 128 loads the iteration counter 3806 with the specified value. Additionally, the sequencer 128 generates an appropriate micro-operation 3418 to control the logic in the pipeline stages 3401 of the neural processing units 126 of FIG. 34 for execution, and decrements the iteration counter 3806. If the iteration counter 3806 is greater than zero, the sequencer 128 again generates an appropriate micro-operation 3418 to control the logic in the neural processing units 126 and decrements the iteration counter 3806. The sequencer 128 continues operating in this fashion until the iteration counter 3806 reaches zero. Thus, the value of the iteration counter 3806 indicates how many more of the operations specified in the non-architectural execute instruction (e.g., multiply-accumulate, take-maximum, or sum of the accumulated value and a data/weight word) remain to be performed. The value of the iteration counter 3806 may be obtained by the architectural program via the iteration count field 3916 of the status register 127, as described below with respect to FIG. 39. This enables the architectural program to make decisions about where to read/write data with respect to the data RAM 122 and/or the weight RAM 124 based on the progress of the non-architectural program.
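The interplay of the program counter 3802, loop counter 3804 and iteration counter 3806 can be pictured with a toy model. The encoding of the "program" below is invented purely for illustration; the real sequencer issues a micro-operation 3418 per iteration rather than executing Python.

```python
# A toy behavioural model, with assumed instruction encodings, of the sequencer counters.
def run_sequencer(program, init_loop_count):
    pc, loop_counter = 0, init_loop_count           # INITIALIZE loads the loop counter 3804
    while pc < len(program):
        op, arg = program[pc]
        if op == "execute":                         # e.g. multiply-accumulate, maxwacc
            for _ in range(arg):                    # iteration counter 3806 counts down
                pass                                # one micro-operation 3418 per iteration
            pc += 1
        elif op == "loop_to":                       # non-architectural loop instruction
            loop_counter -= 1
            pc = arg if loop_counter > 0 else pc + 1
        else:
            pc += 1

# Loop body executed 400 times, each pass performing 511 operations (cf. FIG. 26A).
run_sequencer([("execute", 511), ("loop_to", 0)], init_loop_count=400)
```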

FIG. 39 is a block diagram illustrating certain fields of the control and status register 127 of the neural network unit 121. The fields include the address 2602 of the weight RAM row most recently written by the neural processing units 126 executing the non-architectural program, the address 2604 of the weight RAM row most recently read by the neural processing units 126 executing the non-architectural program, the address 2606 of the data RAM row most recently written by the neural processing units 126 executing the non-architectural program, and the address 2608 of the data RAM row most recently read by the neural processing units 126 executing the non-architectural program, as described above with respect to FIG. 26B. Additionally, the fields include a neural network unit program counter 3912 field, a loop count 3914 field, and an iteration count 3916 field. As described above, the architectural program may read the data in the status register 127 into the media registers 118 and/or the general purpose registers 116, e.g., via MFNN instructions 1500 that read the values of the neural network unit program counter 3912, loop count 3914 and iteration count 3916 fields. The value of the program counter field 3912 reflects the value of the program counter 3802 of FIG. 38. The value of the loop count field 3914 reflects the value of the loop counter 3804. The value of the iteration count field 3916 reflects the value of the iteration counter 3806. In one embodiment, the sequencer 128 updates the values of the program counter field 3912, the loop count field 3914 and the iteration count field 3916 each time it needs to adjust the program counter 3802, the loop counter 3804 or the iteration counter 3806, so that the field values are current when the architectural program reads them. In another embodiment, when the neural network unit 121 executes an architectural instruction that reads the status register 127, the neural network unit 121 simply obtains the values of the program counter 3802, the loop counter 3804 and the iteration counter 3806 and provides them back in response to the architectural instruction (e.g., into a media register 118 or a general purpose register 116).

It may thus be observed that the values of the fields of the status register 127 of FIG. 39 may be characterized as information about the progress of the non-architectural instructions as they are executed by the neural network unit. Certain specific aspects of the progress of the non-architectural program, such as the program counter 3802 value, the loop counter 3804 value, the iteration counter 3806 value, the fields 2602/2604 holding the most recently read/written weight RAM 124 address 125, and the fields 2606/2608 holding the most recently read/written data RAM 122 address 123, have been described in previous sections. The architectural program executing on the processor 100 may read the non-architectural program progress values of FIG. 39 from the status register 127 and use that information to make decisions, e.g., by means of architectural instructions such as compare and branch instructions. For example, the architectural program decides which rows of the data RAM 122 and/or the weight RAM 124 to read/write data/weights to and from, in order to control the flow of data into and out of the data RAM 122 or the weight RAM 124, particularly for large data sets and/or for the overlapped execution of different non-architectural instructions. Examples of such decision-making by the architectural program are described in the sections before and after this one.

For example, as described above with respect to FIG. 26A, the architectural program configures the non-architectural program to write the results of the convolutions back to rows of the data RAM 122 above the convolution kernel 2402 (e.g., above row 8), and the architectural program reads the results from the data RAM 122 as the neural network unit 121 writes them, using the address of the most recently written data RAM 122 row 2606.

For another example, as described above with respect to FIG. 26B, the architectural program uses the information from the status register 127 fields of FIG. 38 to determine the progress of the non-architectural program in performing the convolution of the data array 2404 of FIG. 24 in 5 chunks of 512 x 1600. The architectural program writes the first 512 x 1600 chunk of the 2560 x 1600 data array into the weight RAM 124 and starts the non-architectural program, whose loop count is 1600 and whose initialized weight RAM 124 output row is 0. While the neural network unit 121 executes the non-architectural program, the architectural program reads the status register 127 to determine the most recently written row 2602 of the weight RAM 124, so that the architectural program can read the valid convolution results written by the non-architectural program and, once it has read them, overwrite them with the next 512 x 1600 chunk; in this way, when the neural network unit 121 completes the run of the non-architectural program on the first 512 x 1600 chunk, the processor 100 can immediately update the non-architectural program as needed and start it again to process the next 512 x 1600 chunk.
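The double-buffering pattern just described can be sketched as follows. The three callables are hypothetical stand-ins for MFNN/MTNN-based accessors; the sketch only shows the polling-and-overwrite shape, not the real instruction sequence.

```python
# A schematic sketch of polling the most-recently-written weight RAM row so that
# results can be harvested and consumed rows overwritten with the next chunk.
CHUNK_ROWS = 1600
NUM_CHUNKS = 5

def process_chunks(read_status_most_recent_row, read_result_row, write_chunk_rows):
    """The three callables stand in for MFNN/MTNN-based accessors (hypothetical)."""
    for chunk in range(NUM_CHUNKS):
        harvested = 0
        while harvested < CHUNK_ROWS:
            done_row = read_status_most_recent_row()   # poll status register 127 field 2602
            while harvested <= done_row:
                read_result_row(harvested)             # read valid convolution results
                harvested += 1
        if chunk + 1 < NUM_CHUNKS:
            write_chunk_rows(chunk + 1)                # overwrite consumed rows with next chunk

# Trivial stand-ins just to show the call shape; a real program would issue MFNN/MTNN.
process_chunks(lambda: CHUNK_ROWS - 1, lambda row: None, lambda chunk: None)
```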

For another example, assume the architectural program has the neural network unit 121 perform a series of classic neural network multiply-accumulate-activation function operations in which the weights are stored in the weight RAM 124 and the results are written back to the data RAM 122. In that case, once the non-architectural program has read a row of the weight RAM 124, it will not read it again. So, once the current weights have been read/used by the non-architectural program, the architectural program may begin overwriting the weights in the weight RAM 124 with new weights for the next instance of the non-architectural program (e.g., for the next neural network layer). In that case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where the new set of weights may be written into the weight RAM 124.

For another example, assume the architectural program knows that the non-architectural program includes an execute instruction with a large iteration count, such as the non-architectural multiply-accumulate instruction at address 2 of FIG. 20. In that case, the architectural program needs to know the iteration count 3916 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that the architectural program can decide which of two or more courses of action to take next. For example, if the time to completion is long, the architectural program may relinquish control to another architectural program, such as the operating system. Similarly, assume the architectural program knows that the non-architectural program includes a loop body with a relatively large loop count, such as the non-architectural program of FIG. 28. In that case, the architectural program needs to know the loop count 3914 in order to know approximately how many more clock cycles it will take to complete the non-architectural instruction, so that it can decide which of two or more courses of action to take next.

In another example, assume the architectural program has the neural network unit 121 perform a pooling operation similar to those described with respect to FIGS. 27 and 28, in which the data to be pooled is stored in the weight RAM 124 and the results are written back to the weight RAM 124. However, unlike the examples of FIGS. 27 and 28, assume the results of this example are written back to the top 400 rows of the weight RAM 124, e.g., rows 1600 through 1999. In this case, once the non-architectural program has finished reading the four rows of weight RAM 124 data that it pools, it does not read them again. Therefore, once the current four rows of data have been read/used by the non-architectural program, the architectural program can begin overwriting the weight RAM 124 data with new data (e.g., the weights for the next instance of the non-architectural program, for example a non-architectural program that performs typical multiply-accumulate-activation-function computations on the acquired data). In this case, the architectural program reads the status register 127 to obtain the address of the most recently read weight RAM row 2604 in order to decide where to write the new set of weights into the weight RAM 124.

Recurrent neural network acceleration

A traditional feed-forward neural network has no memory that stores previous inputs to the network. Feed-forward neural networks are generally used to perform tasks in which the multiple inputs presented to the network over time are independent of one another, as are the multiple outputs. In contrast, recurrent neural networks (RNNs) are generally helpful in performing tasks in which the order of the inputs presented to the network over the course of the task is significant. (The order is commonly referred to as time steps.) Consequently, an RNN includes a notional memory, or internal state, that holds information produced by the computations the network performed in response to earlier inputs in the sequence, and the network's output depends on this internal state as well as on the input of the next time step. Speech recognition, language modeling, text generation, language translation, image description generation and some forms of handwriting recognition are examples of tasks that RNNs can perform well.

Three well-known examples of recurrent neural networks are the Elman RNN, the Jordan RNN and the long short-term memory (LSTM) network. An Elman RNN includes context nodes that remember the state of the RNN's hidden layer at the current time step, which is provided as an input to the hidden layer at the next time step. A Jordan RNN is similar to an Elman RNN, except that its context nodes remember the state of the RNN's output layer rather than of the hidden layer. An LSTM network includes an LSTM layer made up of LSTM cells. Each LSTM cell has a current state and a current output for the current time step, and a new state and a new output for a new, or subsequent, time step. An LSTM cell includes an input gate, an output gate and a forget gate; the forget gate enables the cell to lose the state it has remembered. These three kinds of recurrent neural network are described in more detail in the sections that follow.

As described herein, for an RNN such as an Elman or Jordan RNN, each execution instance of the neural network unit uses a time step: it takes a set of input layer node values and performs the computations necessary to propagate them through the RNN, producing the output layer node values as well as the hidden layer and context layer node values. Thus, the input layer node values are associated with the time step in which the hidden, output and context layer node values are computed, and the hidden, output and context layer node values are associated with the time step in which they are produced. The input layer node values are sampled values of the system being modeled by the RNN, such as images, speech samples, or snapshots of financial market data. For an LSTM network, each execution instance of the neural network unit uses a time step: it takes a set of memory cell input values and performs the computations necessary to produce the memory cell output values (as well as the cell state and the input gate, forget gate and output gate values), which may also be understood as propagating the cell input values through the cells of the LSTM layer. Thus, the cell input values are associated with the time step in which the cell state and the input gate, forget gate and output gate values are computed, and the cell state and the input gate, forget gate and output gate values are associated with the time step in which they are produced.
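
The per-time-step LSTM computation described above can be summarized by the following rough Python sketch. These are the conventional LSTM equations; the specific cell formulation and signal names used in the later figures of this disclosure may differ:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    # W, U, b hold per-gate parameters: input (i), forget (f), output (o), candidate (g).
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x + U['g'] @ h_prev + b['g'])   # candidate cell state
    c = f * c_prev + i * g                               # new cell state
    h = o * np.tanh(c)                                   # new cell output
    return h, c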

The context layer node values, also referred to as state nodes, are state values of the neural network: they are based on the input layer node values associated with previous time steps, not only on the input layer node values of the current time step. The computations the neural network unit performs for a time step (e.g., the hidden layer node value computations of an Elman or Jordan RNN) are a function of the context layer node values produced in the previous time step. Therefore, the network state value (the context node values) at the beginning of a time step influences the output layer node values produced during that time step. Moreover, the network state value at the end of the time step is affected by both the input node values of that time step and the network state value at its beginning. Similarly, for an LSTM cell, the cell state value is based on the memory cell input values of previous time steps, not only on the memory cell input value of the current time step. Because the computations the neural network unit performs for a time step (e.g., the computation of the next cell state) are a function of the cell state values produced in the previous time step, the network state value (the cell state values) at the beginning of the time step influences the cell output values produced during that time step, and the network state value at the end of the time step is affected by the cell input values of that time step and the previous network state value.

FIG. 40 is a block diagram illustrating an example of an Elman RNN. The Elman RNN of FIG. 40 includes input layer nodes, or neurons, denoted D0, D1 through Dn, referred to collectively as the input layer nodes D and individually, generically, as an input layer node D; hidden layer nodes/neurons denoted Z0, Z1 through Zn, referred to collectively as the hidden layer nodes Z and individually as a hidden layer node Z; output layer nodes/neurons denoted Y0, Y1 through Yn, referred to collectively as the output layer nodes Y and individually as an output layer node Y; and context layer nodes/neurons denoted C0, C1 through Cn, referred to collectively as the context layer nodes C and individually as a context layer node C. In the example Elman RNN of FIG. 40, each hidden layer node Z has an input connected to the output of every input layer node D and an input connected to the output of every context layer node C; each output layer node Y has an input connected to the output of every hidden layer node Z; and each context layer node C has an input connected to the output of its corresponding hidden layer node Z.

In many respects, an Elman RNN operates similarly to a traditional feed-forward artificial neural network. That is, for a given node, each of the node's input connections has an associated weight; the value the node receives on an input connection is multiplied by the associated weight to produce a product; the node adds the products associated with all of its input connections to produce a sum (a bias term may also be included in the sum); and generally an activation function is performed on the sum to produce the node's output value, sometimes referred to as the node's activation. For a traditional feed-forward network, the data always flows in the direction from the input layer to the output layer. That is, the input layer provides values to the hidden layer (there are typically multiple hidden layers), the hidden layer produces output values that it provides to the output layer, and the output layer produces outputs that may then be consumed.

However, unlike a traditional feed-forward network, the Elman RNN also includes feedback connections, namely the connections from the hidden layer nodes Z to the context layer nodes C in FIG. 40. The Elman RNN operates as follows: when the input layer nodes D provide input values to the hidden layer nodes Z at a new time step, the context nodes C provide to the hidden layer Z the output values the hidden layer nodes Z produced in response to the previous input, i.e., at the previous time step. In this sense, the context nodes C of an Elman RNN are a memory based on the input values of previous time steps. FIGS. 41 and 42 illustrate an embodiment of the operation of the neural network unit 121 as it performs the computations associated with the Elman RNN of FIG. 40.
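
A compact sketch of one Elman time step as described above (numpy notation; the matrices W_dz, W_cz, W_zy are generic fully connected weight matrices, and whether an activation is applied at the hidden layer is configuration dependent, as noted below with respect to FIG. 42):

import numpy as np

def elman_step(d, c_prev, W_dz, W_cz, W_zy, act=np.tanh):
    # d: input node D values for this time step; c_prev: context node C values
    # (the hidden node Z values produced at the previous time step).
    z = W_dz @ d + W_cz @ c_prev   # hidden node Z values (in FIG. 41's example W_cz is all ones,
                                   # and FIG. 42's program applies no activation here)
    y = act(W_zy @ z)              # output node Y values
    c_next = z                     # context C values fed forward to the next time step
    return z, y, c_next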

For purposes of the present disclosure, an Elman RNN is a recurrent neural network that includes at least an input node layer, a hidden node layer, an output node layer and a context node layer. For a given time step, the context node layer stores the results that the hidden node layer produced in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function or the results of an accumulation operation performed by the hidden node layer without an activation function being performed.

FIG. 41 is a block diagram illustrating an example of the layout of the data in the data RAM 122 and the weight RAM 124 of the neural network unit 121 as the neural network unit 121 performs the computations associated with the Elman RNN of FIG. 40. The example of FIG. 41 assumes that the Elman RNN of FIG. 40 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C, and 512 output nodes Y. It is also assumed that the Elman RNN is fully connected, i.e., all 512 input nodes D are connected to every hidden node Z as inputs, all 512 context nodes C are connected to every hidden node Z as inputs, and all 512 hidden nodes Z are connected to every output node Y as inputs. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one, so there is no need to store these unity weight values.

As shown, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z. More specifically, as shown, row 0 holds the weights associated with the input connections from input node D0 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D0 and hidden node Z0, word 1 holds the weight associated with the connection between input node D0 and hidden node Z1, word 2 holds the weight associated with the connection between input node D0 and hidden node Z2, and so forth through word 511, which holds the weight associated with the connection between input node D0 and hidden node Z511. Row 1 holds the weights associated with the input connections from input node D1 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D1 and hidden node Z0, word 1 holds the weight associated with the connection between input node D1 and hidden node Z1, word 2 holds the weight associated with the connection between input node D1 and hidden node Z2, and so forth through word 511, which holds the weight associated with the connection between input node D1 and hidden node Z511. This continues through row 511, which holds the weights associated with the input connections from input node D511 to the hidden nodes Z; that is, word 0 holds the weight associated with the connection between input node D511 and hidden node Z0, word 1 holds the weight associated with the connection between input node D511 and hidden node Z1, word 2 holds the weight associated with the connection between input node D511 and hidden node Z2, and so forth through word 511, which holds the weight associated with the connection between input node D511 and hidden node Z511. This layout and use is similar to the embodiments described above with respect to FIGS. 4 through 6A.
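
The indexing implied by this layout (together with the analogous hidden-to-output layout described in the next paragraph) can be sketched as follows; the weight_ram array model and the names w_dz and w_zy are assumptions made for illustration, not names used in the disclosure:

# weight_ram[row][word]: 512 words per row in the wide configuration.
# w_dz[i][j] is the weight of the connection from input node D_i to hidden node Z_j,
# and w_zy[i][j] is the weight of the connection from hidden node Z_i to output node Y_j.
def load_weights(weight_ram, w_dz, w_zy):
    for i in range(512):
        for j in range(512):
            weight_ram[i][j] = w_dz[i][j]          # rows 0-511: D-to-Z weights
            weight_ram[512 + i][j] = w_zy[i][j]    # rows 512-1023: Z-to-Y weights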

As shown, the next 512 rows of the weight RAM 124 (rows 512 through 1023) similarly hold the weights associated with the connections between the hidden nodes Z and the output nodes Y.

The data RAM 122 holds the Elman RNN node values for a series of time steps. More specifically, the data RAM 122 holds the node values for a given time step in a group of three rows. As shown, in an embodiment with a 64-row data RAM 122, the data RAM 122 can hold the node values for 20 different time steps. In the example of FIG. 41, rows 0 through 2 hold the node values for time step 0, rows 3 through 5 hold the node values for time step 1, and so forth through rows 57 through 59, which hold the node values for time step 19. The first row of each group holds the input node D values of the time step. The second row of each group holds the hidden node Z values of the time step. The third row of each group holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0 and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1 and Y1, whose computations are performed by neural processing unit 1; and so forth through column 511, which holds the node values associated with nodes D511, Z511 and Y511, whose computations are performed by neural processing unit 511, as described in more detail below with respect to FIG. 42.
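
The row addressing implied by this three-rows-per-time-step layout can be summarized as follows (a small Python sketch; the function names are purely illustrative):

# Row addressing for the layout of FIG. 41 (three rows per time step, 20 time steps):
def d_row(t): return 3 * t        # input node D values of time step t   (rows 0, 3, ..., 57)
def z_row(t): return 3 * t + 1    # hidden node Z values of time step t  (rows 1, 4, ..., 58)
def y_row(t): return 3 * t + 2    # output node Y values of time step t  (rows 2, 5, ..., 59)
# Word j of each row belongs to neural processing unit j, i.e. to nodes Dj, Zj and Yj.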

As indicated in FIG. 41, for a given time step, the hidden node Z values in the second row of each three-row group are the context node C values of the next time step. That is, the node Z values that a neural processing unit 126 computes and writes during a time step become the node C values that the same neural processing unit 126 uses (together with the input node D values of the next time step) to compute the node Z values during the next time step. The initial values of the context nodes C (the node C values used at time step 0 to compute the node Z values of row 1) are assumed to be zero. This is described in more detail below in the sections describing the non-architectural program of FIG. 42.

Preferably, the input node D values (the values of rows 0, 3, and so forth through row 57 in the example of FIG. 41) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of FIG. 42. Conversely, the hidden/output node Z/Y values (the values of rows 1 and 2, rows 4 and 5, and so forth through rows 58 and 59 in the example of FIG. 41) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121 and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of FIG. 41 assumes the architectural program performs the following steps: (1) for 20 different time steps, populate the data RAM 122 with the input node D values (rows 0, 3, and so forth through row 57); (2) start the non-architectural program of FIG. 42; (3) detect that the non-architectural program has completed; (4) read the output node Y values out of the data RAM 122 (rows 2, 5, and so forth through row 59); and (5) repeat steps (1) through (4) as many times as needed to complete the task, e.g., the computations required to recognize the utterance of a mobile phone user.
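
The five steps just listed can be sketched as a simple driver loop; the helper names below (write_data_ram_row, read_data_ram_row, start_nnu_program, wait_until_nnu_done and so on) are hypothetical wrappers for the MTNN instruction 1400 / MFNN instruction 1500 based accesses:

def run_elman_task(samples, num_steps=20):
    while not task_complete():
        for t in range(num_steps):                         # step (1): input node D values
            write_data_ram_row(3 * t, next_input_vector(samples))
        start_nnu_program(FIG42_PROGRAM)                   # step (2)
        wait_until_nnu_done()                              # step (3)
        outputs = [read_data_ram_row(3 * t + 2)            # step (4): output node Y values
                   for t in range(num_steps)]
        consume_outputs(outputs)                           # step (5): repeat until done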

In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populate the data RAM 122 with the input node D values (e.g., row 0); (2) start the non-architectural program (a modified version of the non-architectural program of FIG. 42 that does not loop and accesses only a single group of three data RAM 122 rows); (3) detect that the non-architectural program has completed; (4) read the output node Y values out of the data RAM 122 (e.g., row 2); and (5) repeat steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is superior may depend on the manner in which the input values of the RNN are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 20 time steps) and then performing the computations, the first approach may be preferable since it is likely more computational-resource efficient and/or higher performance; whereas if the task only tolerates sampling at a single time step, the second approach may be required.

A third embodiment is similar to the second approach but, rather than using a single group of three data RAM 122 rows as in the second approach, the non-architectural program uses multiple groups of three rows, i.e., a different group of three rows for each time step, which in that respect is similar to the first approach. In this third embodiment, the architectural program preferably includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 row in the instruction at address 1 to point to the next group of three rows.

FIG. 42 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 that is executed by the neural network unit 121 and uses the data and weights according to the arrangement of FIG. 41 to accomplish the Elman RNN. Several of the instructions in the non-architectural program of FIG. 42 (and of FIGS. 45, 48, 51, 54 and 57) have been described in detail above (e.g., the multiply-accumulate (MULT-ACCUM), LOOP and INITIALIZE instructions), and the following paragraphs assume they are consistent with those descriptions unless otherwise noted.

The example program of FIG. 42 includes 13 non-architectural instructions at addresses 0 through 12, respectively. The instruction at address 0 (INITIALIZE NPU, LOOPCNT=20) clears the accumulator 202 and initializes the loop counter 3804 to a value of 20 so that the loop body (the instructions at addresses 4 through 11) is executed 20 times. Preferably, the initialize instruction also puts the neural network unit 121 into a wide configuration so that the neural network unit 121 is configured as 512 neural processing units 126. As described below, the 512 neural processing units 126 operate as the 512 corresponding hidden layer nodes Z during the execution of the instructions at addresses 1 through 3 and addresses 7 through 11, and operate as the 512 corresponding output layer nodes Y during the execution of the instructions at addresses 4 through 6.

The instructions at addresses 1 through 3 are outside the program loop and are executed only once. They compute the initial values of the hidden layer nodes Z and write them to row 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4 through 6 to compute the output layer nodes Y of the first time step (time step 0). Additionally, the hidden layer node Z values computed by the instructions at addresses 1 through 3 and written to row 1 of the data RAM 122 become the context layer node C values used by the first execution of the instructions at addresses 7 and 8 to compute the hidden layer node Z values of the second time step (time step 1).

During the execution of the instructions at addresses 1 and 2, each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 input node D values of row 0 of the data RAM 122 by the weights of the word corresponding to the neural processing unit 126 in rows 0 through 511 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. During the execution of the instruction at address 3, the values of the 512 accumulators 202 of the 512 neural processing units are passed through and written to row 1 of the data RAM 122. That is, the output instruction at address 3 writes the accumulator 202 value of each of the 512 neural processing units to row 1 of the data RAM 122, these values being the initial hidden layer Z values, and then clears the accumulator 202.

The operations performed by the instructions at addresses 1 and 2 of the non-architectural program of FIG. 42 are similar to those performed by the instructions at addresses 1 and 2 of the non-architectural program of FIG. 4. More specifically, the instruction at address 1 (MULT_ACCUM DR ROW 0) instructs each of the 512 neural processing units 126 to read its respective word of row 0 of the data RAM 122 into its multiplexed register (mux-reg) 208, to read its respective word of row 0 of the weight RAM 124 into its mux-reg 705, to multiply the data word by the weight word to produce a product, and to add the product to the accumulator 202. The instruction at address 2 (MULT-ACCUM ROTATE, WR ROW+1, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its mux-reg 208 (using the 512-word rotater formed by the collective operation of the 512 mux-regs 208 of the neural network unit 121, i.e., the registers into which the instruction at address 1 just read the data RAM 122 row), to read its respective word of the next row of the weight RAM 124 into its mux-reg 705, to multiply the data word by the weight word to produce a product, and to add the product to the accumulator 202, performing these operations 511 times.
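
Behaviorally, the address-1/address-2 pair makes each neural processing unit accumulate 512 products, pairing a rotating copy of data RAM row 0 with successive weight RAM rows. A rough software model of that mechanism follows; the rotation direction and indexing are illustrative and are not a statement of the exact hardware wiring:

import numpy as np

def mult_accum_with_rotator(data_row, weight_rows, acc):
    # data_row: the 512 words of data RAM 122 row 0 loaded into the mux-regs 208 (address 1)
    # weight_rows: weight RAM 124 rows 0 through 511; acc: the 512 accumulators 202
    mux_reg = np.array(data_row, dtype=float)
    acc = acc + mux_reg * np.asarray(weight_rows[0], dtype=float)   # address 1: first product
    for r in range(1, 512):                                         # address 2, repeated 511 times
        mux_reg = np.roll(mux_reg, 1)   # each NPU takes its neighbor's word (direction illustrative)
        acc = acc + mux_reg * np.asarray(weight_rows[r], dtype=float)
    return acc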

In addition, the single non-architectural output instruction at address 3 of FIG. 42 (OUTPUT PASSTHRU, DR OUT ROW 1, CLR ACC) merges the operation of an activation function instruction with the write-output instructions at addresses 3 and 4 of FIG. 4 (although the program of FIG. 42 passes the accumulator 202 value through, whereas the program of FIG. 4 performs an activation function on the accumulator 202 value). That is, in the program of FIG. 42, the activation function, if any, performed on the accumulator 202 value is specified in the output instruction (and is also specified in the output instructions at addresses 6 and 11), rather than in a distinct non-architectural activation function instruction as in the program of FIG. 4. An alternate embodiment of the non-architectural program of FIG. 4 (and of FIGS. 20, 26A and 28), in which the operation of the activation function instruction and the write-output instruction (e.g., addresses 3 and 4 of FIG. 4) are merged into a single non-architectural output instruction as shown in FIG. 42, is also within the scope of the present invention. The example of FIG. 42 assumes that the nodes of the hidden layer (Z) do not perform an activation function on the accumulator values. However, embodiments in which the hidden layer (Z) performs an activation function on the accumulator values are also within the scope of the present invention; such embodiments may perform the operation using the instructions at addresses 3 and 11, e.g., sigmoid, hyperbolic tangent or rectify functions.

Whereas the instructions at addresses 1 through 3 are executed only once, the instructions at addresses 4 through 11 are inside the program loop and are executed a number of times specified by the loop count (e.g., 20). The first nineteen executions of the instructions at addresses 7 through 11 compute the hidden layer node Z values and write them to the data RAM 122 for use by the second through twentieth executions of the instructions at addresses 4 through 6 to compute the output layer nodes Y of the remaining time steps (time steps 1 through 19). (The last/twentieth execution of the instructions at addresses 7 through 11 computes the hidden layer node Z values and writes them to row 61 of the data RAM 122, but those values are not used.)

During the first execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW+1, WR ROW 512 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), corresponding to time step 0, each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 hidden node Z values of row 1 of the data RAM 122 (which were produced and written by the single execution of the instructions at addresses 1 through 3) by the weights of the word corresponding to the neural processing unit 126 in rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126. During the first execution of the instruction at address 6 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW+1, CLR ACC), an activation function (e.g., sigmoid, hyperbolic tangent, rectify) is performed on the 512 accumulated values to compute the output layer node Y values, and the results are written to row 2 of the data RAM 122.

During the second execution of the instructions at addresses 4 and 5 (corresponding to time step 1), each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 hidden node Z values of row 4 of the data RAM 122 (which were produced and written by the first execution of the instructions at addresses 7 through 11) by the weights of the word corresponding to the neural processing unit 126 in rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and during the second execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 5 of the data RAM 122. During the third execution of the instructions at addresses 4 and 5 (corresponding to time step 2), each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 hidden node Z values of row 7 of the data RAM 122 (which were produced and written by the second execution of the instructions at addresses 7 through 11) by the weights of the word corresponding to the neural processing unit 126 in rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and during the third execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 8 of the data RAM 122. This continues until, during the twentieth execution of the instructions at addresses 4 and 5 (corresponding to time step 19), each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 hidden node Z values of row 58 of the data RAM 122 (which were produced and written by the nineteenth execution of the instructions at addresses 7 through 11) by the weights of the word corresponding to the neural processing unit 126 in rows 512 through 1023 of the weight RAM 124, to produce 512 products that are accumulated into the accumulator 202 of the corresponding neural processing unit 126; and during the twentieth execution of the instruction at address 6, an activation function is performed on the 512 accumulated values to compute the output layer node Y values, which are written to row 59 of the data RAM 122.

During the first execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of row 1 of the data RAM 122, which were produced by the single execution of the instructions at addresses 1 through 3. More specifically, the instruction at address 7 (ADD_D_ACC DR ROW+0) instructs each of the 512 neural processing units 126 to read its respective word of the current row of the data RAM 122 (row 1 during the first execution) into its mux-reg 208 and to add that word to the accumulator 202. The instruction at address 8 (ADD_D_ACC ROTATE, COUNT=511) instructs each of the 512 neural processing units 126 to rotate the word from the adjacent neural processing unit 126 into its mux-reg 208 (using the 512-word rotater formed by the collective operation of the 512 mux-regs 208 of the neural network unit 121, i.e., the registers into which the instruction at address 7 just read the data RAM 122 row) and to add that word to the accumulator 202, performing these operations 511 times.

During the second execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of row 4 of the data RAM 122, which were produced and written by the first execution of the instructions at addresses 9 through 11; during the third execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of row 7 of the data RAM 122, which were produced and written by the second execution of the instructions at addresses 9 through 11; and so forth until, during the twentieth execution of the instructions at addresses 7 and 8, each of the 512 neural processing units 126 accumulates into its accumulator 202 the 512 context node C values of row 58 of the data RAM 122, which were produced and written by the nineteenth execution of the instructions at addresses 9 through 11.

As noted above, the example of FIG. 42 assumes that the weights associated with the connections from the context nodes C to the hidden layer nodes Z have a value of one. However, in an alternate embodiment, those connections of the Elman RNN have non-zero weight values that are placed into the weight RAM 124 (e.g., rows 1024 through 1535) before the program of FIG. 42 runs; in that case the program instruction at address 7 is MULT-ACCUM DR ROW+0, WR ROW 1024, and the program instruction at address 8 is MULT-ACCUM ROTATE, WR ROW+1, COUNT=511. Preferably, the instruction at address 8 does not access the weight RAM 124 but instead rotates the values that the instruction at address 7 read from the weight RAM 124 into the mux-regs 705. Not accessing the weight RAM 124 during the 511 clock cycles in which the address-8 instruction executes leaves more bandwidth available for the architectural program to access the weight RAM 124.

During the first execution of the instructions at addresses 9 and 10 (MULT-ACCUM DR ROW+2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW+1, COUNT=511), corresponding to time step 1, each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 input node D values of row 3 of the data RAM 122 by the weights of the word corresponding to the neural processing unit 126 in rows 0 through 511 of the weight RAM 124 to produce 512 products, which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden layer node Z values; and during the first execution of the instruction at address 11 (OUTPUT PASSTHRU, DR OUT ROW+2, CLR ACC), the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 4 of the data RAM 122, and the accumulator 202 is cleared. During the second execution of the instructions at addresses 9 and 10 (corresponding to time step 2), each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 input node D values of row 6 of the data RAM 122 by the weights of the word corresponding to the neural processing unit 126 in rows 0 through 511 of the weight RAM 124 to produce 512 products, which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden layer node Z values; and during the second execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 7 of the data RAM 122, and the accumulator 202 is cleared. This continues until, during the nineteenth execution of the instructions at addresses 9 and 10 (corresponding to time step 19), each of the 512 neural processing units 126 performs 512 multiplications, multiplying the 512 input node D values of row 57 of the data RAM 122 by the weights of the word corresponding to the neural processing unit 126 in rows 0 through 511 of the weight RAM 124 to produce 512 products, which, together with the accumulation of the 512 context node C values performed by the instructions at addresses 7 and 8, are accumulated into the accumulator 202 of the corresponding neural processing unit 126 to compute the hidden layer node Z values; and during the nineteenth execution of the instruction at address 11, the 512 accumulator 202 values of the 512 neural processing units 126 are passed through and written to row 58 of the data RAM 122, and the accumulator 202 is cleared. As noted above, the hidden layer node Z values produced and written during the twentieth execution of the instructions at addresses 9 and 10 are not used.

The instruction at address 12 (LOOP 4) decrements the loop counter 3804 and loops back to the instruction at address 4 if the new loop counter 3804 value is greater than zero.

FIG. 43 is a block diagram illustrating an example of a Jordan RNN. The Jordan RNN of FIG. 43 is similar to the Elman RNN of FIG. 40, having input layer nodes/neurons D, hidden layer nodes/neurons Z, output layer nodes/neurons Y and context layer nodes/neurons C. However, in the Jordan RNN of FIG. 43, the context layer nodes C have their input connections fed back from the outputs of their corresponding output layer nodes Y, rather than from the outputs of the hidden layer nodes Z as in the Elman RNN of FIG. 40.

For purposes of the present disclosure, a Jordan RNN is a recurrent neural network that includes at least an input node layer, a hidden node layer, an output node layer and a context node layer. At the beginning of a given time step, the context node layer stores the results that the output node layer produced in the previous time step and fed back to the context node layer. The results fed back to the context layer may be the results of an activation function or the results of an accumulation operation performed by the output node layer without an activation function being performed.

FIG. 44 is a block diagram illustrating an example of the layout of the data in the data RAM 122 and the weight RAM 124 of the neural network unit 121 as the neural network unit 121 performs the computations associated with the Jordan RNN of FIG. 43. The example of FIG. 44 assumes that the Jordan RNN of FIG. 43 has 512 input nodes D, 512 hidden nodes Z, 512 context nodes C, and 512 output nodes Y. It is also assumed that the Jordan RNN is fully connected, i.e., all 512 input nodes D are connected to every hidden node Z as inputs, all 512 context nodes C are connected to every hidden node Z as inputs, and all 512 hidden nodes Z are connected to every output node Y as inputs. Although the example Jordan RNN of FIG. 44 applies an activation function to the accumulator 202 values to produce the output layer node Y values, the example assumes that the accumulator 202 values before the activation function is applied, rather than the actual output layer node Y values, are passed to the context layer nodes C. In addition, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. Finally, the example assumes that the weights associated with the connections from the context nodes C to the hidden nodes Z all have a value of one, so there is no need to store these unity weight values.
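
A sketch of one Jordan time step under the assumptions just stated (numpy-style notation; the matrix names and the default activation are illustrative only):

import numpy as np

def jordan_step(d, c_prev, W_dz, W_cz, W_zy, act=np.tanh):
    z = W_dz @ d + W_cz @ c_prev   # hidden node Z values (W_cz is all ones in FIG. 44's example)
    y_acc = W_zy @ z               # output-layer accumulator 202 values
    y = act(y_acc)                 # output node Y values after the activation function
    c_next = y_acc                 # context C values for the next time step: the pre-activation
                                   # accumulator values, per the assumption stated above
    return z, y, c_next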

As in the example of FIG. 41, and as shown, the lower 512 rows of the weight RAM 124 (rows 0 through 511) hold the weight values associated with the connections between the input nodes D and the hidden nodes Z, and the next 512 rows of the weight RAM 124 (rows 512 through 1023) hold the weight values associated with the connections between the hidden nodes Z and the output nodes Y.

The data RAM 122 holds the Jordan RNN node values for a series of time steps, similarly to the example of FIG. 41; however, in the example of FIG. 44 the node values for a given time step are held in a group of four rows. As shown, in an embodiment with a 64-row data RAM 122, the data RAM 122 can hold the node values needed for 15 different time steps. In the example of FIG. 44, rows 0 through 3 hold the node values for time step 0, rows 4 through 7 hold the node values for time step 1, and so forth through rows 56 through 59, which hold the node values for time step 14. The first row of each four-row group holds the input node D values of the time step. The second row of each group holds the hidden node Z values of the time step. The third row of each group holds the context node C values of the time step. The fourth row of each group holds the output node Y values of the time step. As shown, each column of the data RAM 122 holds the node values for its corresponding neuron, or neural processing unit 126. That is, column 0 holds the node values associated with nodes D0, Z0, C0 and Y0, whose computations are performed by neural processing unit 0; column 1 holds the node values associated with nodes D1, Z1, C1 and Y1, whose computations are performed by neural processing unit 1; and so forth through column 511, which holds the node values associated with nodes D511, Z511, C511 and Y511, whose computations are performed by neural processing unit 511, as described in more detail below with respect to FIG. 45.

The context node C values of a given time step in FIG. 44 are produced during that time step and serve as inputs for the next time step. That is, the node C values that a neural processing unit 126 computes and writes during a time step become the node C values that the same neural processing unit 126 uses (together with the input node D values of the next time step) to compute the node Z values during the next time step. The initial values of the context nodes C (the node C values used at time step 0 to compute the node Z values of row 1) are assumed to be zero. This is described in more detail below in the sections describing the non-architectural program of FIG. 45.

As described above with respect to FIG. 41, preferably the input node D values (the values of rows 0, 4, and so forth through row 56 in the example of FIG. 44) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of FIG. 45. Conversely, the hidden node Z/context node C/output node Y values (the values of rows 1/2/3, rows 5/6/7, and so forth through rows 57/58/59, respectively, in the example of FIG. 44) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121 and are read/used by the architectural program running on the processor 100 via MFNN instructions 1500. The example of FIG. 44 assumes the architectural program performs the following steps: (1) for 15 different time steps, populate the data RAM 122 with the input node D values (rows 0, 4, and so forth through row 56); (2) start the non-architectural program of FIG. 45; (3) detect that the non-architectural program has completed; (4) read the output node Y values out of the data RAM 122 (rows 3, 7, and so forth through row 59); and (5) repeat steps (1) through (4) as many times as needed to complete the task, e.g., the computations required to recognize the utterance of a mobile phone user.

In an alternative approach, the architectural program performs the following steps: (1) for a single time step, populate the data RAM 122 with the input node D values (e.g., row 0); (2) start the non-architectural program (a modified version of the non-architectural program of FIG. 45 that does not loop and accesses only a single group of four data RAM 122 rows); (3) detect that the non-architectural program has completed; (4) read the output node Y values out of the data RAM 122 (e.g., row 3); and (5) repeat steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is superior may depend on the manner in which the input values of the RNN are sampled. For example, if the task tolerates sampling the inputs over multiple time steps (e.g., on the order of 15 time steps) and then performing the computations, the first approach may be preferable since it is likely more computational-resource efficient and/or higher performance; whereas if the task only tolerates sampling at a single time step, the second approach may be required.

A third embodiment is similar to the second approach just described, except that, rather than using a single set of four data RAM 122 columns as in the second approach, the non-architectural program uses multiple sets of four columns, i.e., a different set of four columns for each time step, similar in this respect to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 column in the instruction at address 1 to point to the next set of four columns.

Figure 45 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the arrangement of Figure 44 to accomplish a Jordan recurrent neural network. The non-architectural program of Figure 45 is similar to the non-architectural program of Figure 42; the differences between the two are described in the relevant sections of this document.

The example program of Figure 45 includes 14 non-architectural instructions located at addresses 0 through 13. The instruction at address 0 is an initialization instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 15 in order to perform the loop body (the instructions at addresses 4 through 12) 15 times. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration, such that it is configured as 512 neural processing units 126. As described herein, during execution of the instructions at addresses 1 through 3 and addresses 8 through 12, the 512 neural processing units 126 correspond to and operate as the 512 hidden layer nodes Z, and during execution of the instructions at addresses 4, 5 and 7, they correspond to and operate as the 512 output layer nodes Y.

The instructions at addresses 1 through 5 and address 7 are the same as the instructions at addresses 1 through 6 of Figure 42 and perform the same functions. The instructions at addresses 1 through 3 compute the initial values of the hidden layer nodes Z and write them to column 1 of the data RAM 122 for use by the first execution of the instructions at addresses 4, 5 and 7, which compute the output layer nodes Y of the first time step (time step 0).

During the first execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the values of the output layer nodes Y) are passed through and written to column 2 of the data RAM 122; these are the content layer node C values produced in the first time step (time step 0) and used in the second time step (time step 1). During the second execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the values of the output layer nodes Y) are passed through and written to column 6 of the data RAM 122; these are the content layer node C values produced in the second time step (time step 1) and used in the third time step (time step 2). And so on, until during the fifteenth execution of the output instruction at address 6, the 512 accumulator 202 values accumulated by the instructions at addresses 4 and 5 (which are subsequently used by the output instruction at address 7 to compute and write the values of the output layer nodes Y) are passed through and written to column 58 of the data RAM 122; these are the content layer node C values produced in the fifteenth time step (time step 14), which are read by the instruction at address 8 but are not used.

The instructions at addresses 8 through 12 are substantially the same as the instructions at addresses 7 through 11 of Figure 42 and perform the same functions, with a single difference. The difference is that the instruction at address 8 of Figure 45 (ADD_D_ACC DR ROW+1) increments the data RAM 122 column by one, whereas the instruction at address 7 of Figure 42 (ADD_D_ACC DR ROW+0) increments the data RAM 122 column by zero. This difference is due to the different layout of the data in the data RAM 122; in particular, the four-column-per-group layout of Figure 44 includes a separate column for the content layer node C values (e.g., columns 2, 6, 10, etc.), whereas the three-column-per-group layout of Figure 41 has no such separate column and instead the content layer node C values share a column with the hidden layer node Z values (e.g., columns 1, 4, 7, etc.). The fifteen executions of the instructions at addresses 8 through 12 compute the values of the hidden layer nodes Z and write them to the data RAM 122 (to columns 5, 9, 13 and so on through column 57) for use by the second through fifteenth executions of the instructions at addresses 4, 5 and 7 in computing the output layer nodes Y of the second through fifteenth time steps (time steps 1 through 14). (The last/fifteenth execution of the instructions at addresses 8 through 12 computes the values of the hidden layer nodes Z and writes them to column 61 of the data RAM 122, but those values are not used.)

The loop instruction at address 13 decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 4.

In an alternative embodiment, the Jordan recurrent neural network is designed such that the content nodes C hold the activated values of the output nodes Y, i.e., the accumulated values after the activation function has been applied. In that embodiment, since the value of an output node Y is the same as the value of the corresponding content node C, the non-architectural instruction at address 6 is not included in the non-architectural program. This reduces the number of columns used in the data RAM 122. More precisely, none of the columns of Figure 44 that hold the content node C values (e.g., columns 2, 6, 59) is present in this embodiment. Furthermore, each time step of this embodiment requires only three columns of the data RAM 122, so 20 time steps can be accommodated rather than 15, and the addresses of the instructions of the non-architectural program of Figure 45 are adjusted accordingly.

Long Short-Term Memory Cells

The use of long short-term memory (LSTM) cells in recurrent neural networks is a concept well known in the art. See, for example, Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, Neural Computation, November 15, 1997, Vol. 9, No. 8, Pages 1735-1780; Learning to Forget: Continual Prediction with LSTM, Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins, Neural Computation, October 2000, Vol. 12, No. 10, Pages 2451-2471; both of which are available from MIT Press Journals. LSTM cells may be constructed in a variety of forms. The LSTM cell 4600 of Figure 46 described below is modeled after the LSTM cell described in the tutorial entitled LSTM Networks for Sentiment Analysis at http://deeplearning.net/tutorial/lstm.html, a copy of which was downloaded on October 19, 2015 (referred to hereafter as the "LSTM tutorial") and provided in an Information Disclosure Statement filed with the U.S. application of this case. The LSTM cell 4600 is used to illustrate generally the ability of the embodiments of the neural network unit 121 described herein to efficiently perform computations associated with long short-term memory. It should be noted that these embodiments of the neural network unit 121, including the embodiment of Figure 49, can efficiently perform computations associated with LSTM cells other than the LSTM cell described with respect to Figure 46.

Preferably, the neural network unit 121 may be used to perform computations for a recurrent neural network that includes an LSTM cell layer connected to other layers. For example, in the LSTM tutorial, the network includes a mean pooling layer that receives the outputs (H) of the LSTM cells of the LSTM layer, and a logistic regression layer that receives the output of the mean pooling layer.

Figure 46 is a block diagram illustrating an embodiment of an LSTM cell 4600.

As shown, the LSTM cell 4600 includes a memory cell input (X), a memory cell output (H), an input gate (I), an output gate (O), a forget gate (F), a memory cell state (C) and a candidate memory cell state (C'). The input gate (I) gates the passage of the memory cell input (X) to the memory cell state (C), and the output gate (O) gates the passage of the memory cell state (C) to the memory cell output (H). The memory cell state (C) of a time step is fed back as the candidate memory cell state (C'). The forget gate (F) gates the candidate memory cell state (C'), which is fed back and becomes the memory cell state (C) of the next time step.

The embodiment of Figure 46 computes the various values described above using the following equations:

(1) I = SIGMOID(Wi*X + Ui*H + Bi)

(2) F = SIGMOID(Wf*X + Uf*H + Bf)

(3) C' = TANH(Wc*X + Uc*H + Bc)

(4) C = I*C' + F*C

(5) O = SIGMOID(Wo*X + Uo*H + Bo)

(6) H = O*TANH(C)

Wi and Ui are the weight values associated with the input gate (I), and Bi is the bias value associated with the input gate (I). Wf and Uf are the weight values associated with the forget gate (F), and Bf is the bias value associated with the forget gate (F). Wo and Uo are the weight values associated with the output gate (O), and Bo is the bias value associated with the output gate (O). As shown above, equations (1), (2) and (5) compute the input gate (I), the forget gate (F) and the output gate (O), respectively. Equation (3) computes the candidate memory cell state (C'), and equation (4) computes the new memory cell state (C) using as an input the current memory cell state (C), i.e., the memory cell state (C) of the current time step. Equation (6) computes the memory cell output (H). However, the invention is not limited to these equations; embodiments of LSTM cells that compute the input gate, forget gate, output gate, candidate memory cell state, memory cell state and memory cell output in other ways are also contemplated by the invention.

For purposes of the present disclosure, an LSTM cell includes a memory cell input, a memory cell output, a memory cell state, a candidate memory cell state, an input gate, an output gate and a forget gate. For each time step, the input gate, output gate, forget gate and candidate memory cell state are functions of the memory cell input of the current time step, the memory cell output of the previous time step and associated weights. The memory cell state of the time step is a function of the memory cell state of the previous time step, the candidate memory cell state, the input gate and the forget gate. In this sense, the memory cell state is fed back for use in computing the memory cell state of the next time step. The memory cell output of the time step is a function of the memory cell state computed for the time step and the output gate. An LSTM neural network is a neural network that includes a layer of LSTM cells.
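As a point of reference for the hardware mapping described below, the following is a minimal software sketch of equations (1) through (6), written here with NumPy. The operations are element-wise to mirror the one-weight-per-cell arrangement of Figure 47 (a general LSTM layer would use matrix-vector products here), and the lstm_step name, dictionary keys and array shapes are illustrative assumptions rather than anything defined by the embodiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(X, H_prev, C_prev, W, U, B):
    """One time step of the LSTM cell 4600, per equations (1)-(6).

    W, U and B are dictionaries holding the per-gate weight and bias
    vectors (keys 'i', 'f', 'c', 'o'); all operations are element-wise
    across the cells of the layer.
    """
    I = sigmoid(W['i'] * X + U['i'] * H_prev + B['i'])        # (1) input gate
    F = sigmoid(W['f'] * X + U['f'] * H_prev + B['f'])        # (2) forget gate
    C_cand = np.tanh(W['c'] * X + U['c'] * H_prev + B['c'])   # (3) candidate state C'
    C = I * C_cand + F * C_prev                               # (4) new cell state
    O = sigmoid(W['o'] * X + U['o'] * H_prev + B['o'])        # (5) output gate
    H = O * np.tanh(C)                                        # (6) cell output
    return H, C
```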

Figure 47 is a block diagram illustrating an example of the layout of data within the data RAM 122 and weight RAM 124 of the neural network unit 121 as it performs computations associated with a layer of 128 LSTM cells 4600 of the LSTM neural network of Figure 46. In the example of Figure 47, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration; however, only the values produced by 128 of the neural processing units 126 (e.g., neural processing units 0 through 127) are used, because the LSTM layer of this example has only 128 LSTM cells 4600.

As shown, the weight RAM 124 holds the weight, bias and intermediate values for the corresponding neural processing units 0 through 127 of the neural network unit 121; more specifically, rows 0 through 127 of the weight RAM 124 hold those values for neural processing units 0 through 127, respectively. Each of columns 0 through 14 holds 128 of the following values, corresponding to equations (1) through (6) above, for provision to neural processing units 0 through 127: Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, C', TANH(C), C, Wo, Uo, Bo. Preferably, the weight and bias values Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo, Bo (in columns 0 through 8 and columns 12 through 14) are written/populated into the weight RAM 124 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the intermediate values C', TANH(C) and C (in columns 9 through 11) are written/populated into the weight RAM 124 and read/used by the non-architectural program running on the neural network unit 121, as described in more detail below.

As shown, the data RAM 122 holds the memory cell input (X), output (H), input gate (I), forget gate (F) and output gate (O) values for a sequence of time steps. More specifically, a quintet of five columns holds the X, H, I, F and O values for a given time step. In the case of a data RAM 122 with 64 columns, as shown, the data RAM 122 can hold the cell values for 12 different time steps. In the example of Figure 47, columns 0 through 4 hold the cell values for time step 0, columns 5 through 9 hold the cell values for time step 1, and so on, through columns 55 through 59, which hold the cell values for time step 11. The first column of a quintet holds the X values of the time step; the second column holds the H values; the third column holds the I values; the fourth column holds the F values; and the fifth column holds the O values. As shown, each row of the data RAM 122 holds values for use by its corresponding neuron, or neural processing unit 126. That is, row 0 holds the values associated with LSTM cell 0, whose computations are performed by neural processing unit 0; row 1 holds the values associated with LSTM cell 1, whose computations are performed by neural processing unit 1; and so on, through row 127, which holds the values associated with LSTM cell 127, whose computations are performed by neural processing unit 127, as described below with respect to Figure 48.
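A small helper, under the assumption that time step t occupies the quintet of columns 5*t through 5*t+4 as just described, makes the column arithmetic explicit; the function name is illustrative only.

```python
def lstm_quintet_columns(t):
    """Data RAM 122 columns holding the values of time step t in Figure 47."""
    base = 5 * t
    return {"X": base,      # memory cell input, written by the architectural program
            "H": base + 1,  # memory cell output
            "I": base + 2,  # input gate value
            "F": base + 3,  # forget gate value
            "O": base + 4}  # output gate value

# For example, time step 11 (the last of the 12 steps that fit in 64 columns)
# uses columns 55 through 59.
assert lstm_quintet_columns(11) == {"X": 55, "H": 56, "I": 57, "F": 58, "O": 59}
```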

Preferably, the X values (in columns 0, 5, 10 and so on through column 55) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 48. Preferably, the I, F and O values (in columns 2/3/4, 7/8/9, 12/13/14 and so on through columns 57/58/59) are written/populated into the data RAM 122 by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values (in columns 1, 6, 11 and so on through column 56) are written/populated into the data RAM 122 and read/used by the non-architectural program running on the neural network unit 121, and are also read by the architectural program running on the processor 100 via MFNN instructions 1500.

The example of Figure 47 assumes the architectural program performs the following steps: (1) populates the data RAM 122 with the values of input X for 12 different time steps (columns 0, 5 and so on through column 55); (2) starts the non-architectural program of Figure 48; (3) detects whether the non-architectural program has finished; (4) reads the values of output H out of the data RAM 122 (columns 1, 6 and so on through column 59); and (5) repeats steps (1) through (4) as many times as needed to complete a task, for example the computations required to recognize the utterance of a mobile phone user.

In an alternative approach, the architectural program performs the following steps: (1) populates the data RAM 122 with the values of input X for a single time step (e.g., column 0); (2) starts the non-architectural program (a modified version of the non-architectural program of Figure 48 that does not loop and accesses only a single quintet of data RAM 122 columns); (3) detects whether the non-architectural program has finished; (4) reads the values of output H out of the data RAM 122 (e.g., column 1); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is better may depend upon the manner in which the input X values of the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 12 time steps) and then performing the computations, the first approach may be preferable since it is likely more computational-resource efficient and/or higher performance; whereas if the task only tolerates sampling at a single time step, the second approach may be required.

A third embodiment is similar to the second approach just described, except that, rather than using a single quintet of data RAM 122 columns as in the second approach, the non-architectural program uses multiple quintets of columns, i.e., a different quintet for each time step, similar in this respect to the first approach. In this third embodiment, preferably the architectural program includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 column in the instruction at address 0 to point to the next quintet.

Figure 48 is a table illustrating a program stored in the program memory 129 of the neural network unit 121 and executed by the neural network unit 121, which uses data and weights according to the arrangement of Figure 47 to accomplish the computations associated with the LSTM cell layer. The example program of Figure 48 includes 24 non-architectural instructions located at addresses 0 through 23. The instruction at address 0 (INITIALIZE NPU, CLRACC, LOOPCNT=12, DR IN ROW=-1, DR OUT ROW=2) clears the accumulator 202 and initializes the loop counter 3804 to a value of 12 in order to perform the loop body (the instructions at addresses 1 through 22) 12 times. The initialization instruction also initializes the data RAM 122 column to be read to a value of -1, which the first execution of the instruction at address 1 increments to zero. The initialization instruction also initializes the data RAM 122 column to be written (e.g., register 2606 of Figures 26 and 39) to column 2. Preferably, the initialization instruction also places the neural network unit 121 in a wide configuration, such that the neural network unit 121 is configured with 512 neural processing units 126. As described in the following paragraphs, during execution of the instructions at addresses 0 through 23, 128 of the 512 neural processing units 126 correspond to and operate as the 128 LSTM cells 4600.

During the first execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 (i.e., neural processing units 0 through 127) computes the input gate (I) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of column 2 of the data RAM 122; during the second execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of column 7 of the data RAM 122; and so on, until during the twelfth execution of the instructions at addresses 1 through 4, each of the 128 neural processing units 126 computes the I value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the I value to the corresponding word of column 57 of the data RAM 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 1 reads the next column after the current data RAM 122 column (column 0 on the first execution, column 5 on the second execution, and so on through column 55 on the twelfth execution), which holds the memory cell input (X) values associated with the current time step, reads column 0 of the weight RAM 124, which holds the Wi values, and multiplies them to produce a first product accumulated into the accumulator 202, which was just cleared by the initialization instruction at address 0 or by the instruction at address 22. Next, the multiply-accumulate instruction at address 2 reads the next data RAM 122 column (column 1 on the first execution, column 6 on the second execution, and so on through column 56 on the twelfth execution), which holds the memory cell output (H) values associated with the current time step, reads column 1 of the weight RAM 124, which holds the Ui values, and multiplies them to produce a second product accumulated into the accumulator 202. The H values associated with the current time step, read from the data RAM 122 by the instruction at address 2 (and by the instructions at addresses 6, 10 and 18), were produced in the previous time step and written to the data RAM 122 by the output instruction at address 22; on the first execution, however, the instruction at address 2 reads column 1 of the data RAM 122, which holds an initial value as the H value. Preferably, the architectural program writes the initial H values to column 1 of the data RAM 122 (e.g., using MTNN instructions 1400) before starting the non-architectural program of Figure 48; however, the invention is not limited in this respect, and other embodiments in which the non-architectural program includes an initialization instruction that writes the initial H values to column 1 of the data RAM 122 are also within the scope of the invention. In one embodiment, the initial H value is zero. Next, the add-weight-word-to-accumulator instruction at address 3 (ADD_W_ACC WR ROW 2) reads column 2 of the weight RAM 124, which holds the Bi values, and adds it to the accumulator 202.
Finally, the output instruction at address 4 (OUTPUT SIGMOID, DR OUT ROW+0, CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the result to the current output column of the data RAM 122 (column 2 on the first execution, column 7 on the second execution, and so on through column 57 on the twelfth execution), and clears the accumulator 202.

During the first execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the forget gate (F) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of column 3 of the data RAM 122; during the second execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of column 8 of the data RAM 122; and so on, until during the twelfth execution of the instructions at addresses 5 through 8, each of the 128 neural processing units 126 computes the F value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the F value to the corresponding word of column 58 of the data RAM 122, as shown in Figure 47. The instructions at addresses 5 through 8 compute the F value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 5 through 7 read the Wf, Uf and Bf values from columns 3, 4 and 5, respectively, of the weight RAM 124 to perform their multiply and/or add operations.

During the twelve executions of the instructions at addresses 9 through 12, each of the 128 neural processing units 126 computes the candidate memory cell state (C') value for the corresponding time step of its corresponding LSTM cell 4600 and writes the C' value to the corresponding word of column 9 of the weight RAM 124. The instructions at addresses 9 through 12 compute the C' value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 9 through 11 read the Wc, Uc and Bc values from columns 6, 7 and 8, respectively, of the weight RAM 124 to perform their multiply and/or add operations. Additionally, the output instruction at address 12 performs a hyperbolic tangent activation function rather than a sigmoid activation function (as the output instruction at address 4 does).

More specifically, the multiply-accumulate instruction at address 9 reads the current data RAM 122 column (column 0 on the first execution, column 5 on the second execution, and so on through column 55 on the twelfth execution), which holds the memory cell input (X) values associated with the current time step, reads column 6 of the weight RAM 124, which holds the Wc values, and multiplies them to produce a first product accumulated into the accumulator 202, which was just cleared by the instruction at address 8. Next, the multiply-accumulate instruction at address 10 reads the next data RAM 122 column (column 1 on the first execution, column 6 on the second execution, and so on through column 56 on the twelfth execution), which holds the memory cell output (H) values associated with the current time step, reads column 7 of the weight RAM 124, which holds the Uc values, and multiplies them to produce a second product accumulated into the accumulator 202. Next, the add-weight-word-to-accumulator instruction at address 11 reads column 8 of the weight RAM 124, which holds the Bc values, and adds it to the accumulator 202. Finally, the output instruction at address 12 (OUTPUT TANH, WR OUT ROW 9, CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value, writes the result to column 9 of the weight RAM 124, and clears the accumulator 202.

During the twelve executions of the instructions at addresses 13 through 16, each of the 128 neural processing units 126 computes the new memory cell state (C) value for the corresponding time step of its corresponding LSTM cell 4600 and writes the new C value to the corresponding word of column 11 of the weight RAM 124; each neural processing unit 126 also computes tanh(C) and writes it to the corresponding word of column 10 of the weight RAM 124. More specifically, the multiply-accumulate instruction at address 13 reads the next column after the current data RAM 122 column (column 2 on the first execution, column 7 on the second execution, and so on through column 57 on the twelfth execution), which holds the input gate (I) values associated with the current time step, reads column 9 of the weight RAM 124, which holds the candidate memory cell state (C') values (just written by the instruction at address 12), and multiplies them to produce a first product accumulated into the accumulator 202, which was just cleared by the instruction at address 12. Next, the multiply-accumulate instruction at address 14 reads the next data RAM 122 column (column 3 on the first execution, column 8 on the second execution, and so on through column 58 on the twelfth execution), which holds the forget gate (F) values associated with the current time step, reads column 11 of the weight RAM 124, which holds the current memory cell state (C) values computed in the previous time step (written by the most recent execution of the instruction at address 15), and multiplies them to produce a second product added to the accumulator 202. Next, the output instruction at address 15 (OUTPUT PASSTHRU, WR OUT ROW 11) passes through the accumulator 202 value and writes it to column 11 of the weight RAM 124. It should be understood that the C value read from column 11 of the weight RAM 124 by the instruction at address 14 is the C value produced and written by the most recent execution of the instructions at addresses 13 through 15. The output instruction at address 15 does not clear the accumulator 202, so that its value can be used by the instruction at address 16. Finally, the output instruction at address 16 (OUTPUT TANH, WR OUT ROW 10, CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value and writes the result to column 10 of the weight RAM 124 for use by the instruction at address 21 in computing the memory cell output (H) value. The instruction at address 16 clears the accumulator 202.

During the first execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the output gate (O) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of column 4 of the data RAM 122; during the second execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the O value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of column 9 of the data RAM 122; and so on, until during the twelfth execution of the instructions at addresses 17 through 20, each of the 128 neural processing units 126 computes the O value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the O value to the corresponding word of column 59 of the data RAM 122, as shown in Figure 47. The instructions at addresses 17 through 20 compute the O value in a manner similar to the instructions at addresses 1 through 4 described above, except that the instructions at addresses 17 through 19 read the Wo, Uo and Bo values from columns 12, 13 and 14, respectively, of the weight RAM 124 to perform their multiply and/or add operations.

During the first execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the memory cell output (H) value for the first time step (time step 0) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of column 6 of the data RAM 122; during the second execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the H value for the second time step (time step 1) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of column 11 of the data RAM 122; and so on, until during the twelfth execution of the instructions at addresses 21 through 22, each of the 128 neural processing units 126 computes the H value for the twelfth time step (time step 11) of its corresponding LSTM cell 4600 and writes the H value to the corresponding word of column 61 of the data RAM 122, as shown in Figure 47.

More specifically, the multiply-accumulate instruction at address 21 reads the third column after the current data RAM 122 column (column 4 on the first execution, column 9 on the second execution, and so on through column 59 on the twelfth execution), which holds the output gate (O) values associated with the current time step, reads column 10 of the weight RAM 124, which holds the tanh(C) values (written by the instruction at address 16), and multiplies them to produce a product accumulated into the accumulator 202, which was just cleared by the instruction at address 20. Then the output instruction at address 22 passes through the accumulator 202 value, writes it to the second next output column of the data RAM 122 (column 6 on the first execution, column 11 on the second execution, and so on through column 61 on the twelfth execution), and clears the accumulator 202. It should be understood that the H values written by the instruction at address 22 to a data RAM 122 column (column 6 on the first execution, column 11 on the second execution, and so on through column 61 on the twelfth execution) are the H values consumed/read by the subsequent executions of the instructions at addresses 2, 6, 10 and 18. However, the H values written to column 61 on the twelfth execution are not consumed/read by executions of the instructions at addresses 2, 6, 10 and 18; rather, in a preferred embodiment, they are consumed/read by the architectural program.

The instruction at address 23 (LOOP 1) decrements the loop counter 3804 and, if the new loop counter 3804 value is greater than zero, loops back to the instruction at address 1.

Figure 49 is a block diagram illustrating an embodiment of the neural network unit 121 with output buffer masking and feedback capability within groups of neural processing units. Figure 49 shows a single neural processing unit group 4901 of four neural processing units 126. Although Figure 49 shows only a single neural processing unit group 4901, it should be understood that each neural processing unit 126 of the neural network unit 121 is included in a neural processing unit group 4901, so that there are N/J neural processing unit groups 4901, where N is the number of neural processing units 126 (e.g., 512 in a wide configuration, 1024 in a narrow configuration) and J is the number of neural processing units 126 in a single group 4901 (e.g., four in the embodiment of Figure 49). Figure 49 refers to the four neural processing units 126 of the neural processing unit group 4901 as neural processing unit 0, neural processing unit 1, neural processing unit 2 and neural processing unit 3.

Each neural processing unit in the embodiment of Figure 49 is similar to the neural processing unit 126 of Figure 7 described above, and like-numbered elements are similar. However, the multiplexed register (mux-reg) 208 is modified to include four additional inputs 4905, the mux-reg 705 is modified to include four additional inputs 4907, the select input 213 is modified to select among the original inputs 211 and 207 as well as the additional inputs 4905 for provision on output 209, and the select input 713 is modified to select among the original inputs 711 and 206 as well as the additional inputs 4907 for provision on output 203.

As shown, the column buffer 1104 of Figure 11 is the output buffer 1104 of Figure 49. More specifically, words 0, 1, 2 and 3 of the output buffer 1104 shown receive the corresponding outputs of the four activation function units 212 associated with neural processing units 0, 1, 2 and 3. The portion of the output buffer 1104 comprising the N words corresponding to a neural processing unit group 4901 is referred to as an output buffer word group; in the embodiment of Figure 49, N is four. These four words of the output buffer 1104 are fed back to the mux-regs 208 and 705, and are received as the four additional inputs 4905 by the mux-reg 208 and as the four additional inputs 4907 by the mux-reg 705. The feeding back of an output buffer word group to its corresponding neural processing unit group 4901 enables an arithmetic instruction of the non-architectural program to select as one or two of its inputs words of the output buffer 1104 associated with the neural processing unit group 4901 (i.e., words of the output buffer word group); for examples, see the non-architectural program of Figure 51 below, e.g., the instructions at addresses 4, 8, 11, 12 and 15. That is, the output buffer 1104 word specified in the non-architectural instruction determines the value generated on the select input 213/713. This capability effectively enables the output buffer 1104 to serve as a kind of scratch pad memory, which may enable the non-architectural program to reduce the number of writes to, and subsequent reads from, the data RAM 122 and/or the weight RAM 124, for example for values produced and used only intermediately during a computation. Preferably, the output buffer 1104, or column buffer 1104, comprises a one-dimensional array of registers that stores 1024 narrow words or 512 wide words. Preferably, the output buffer 1104 can be read in a single clock cycle and written in a single clock cycle. Unlike the data RAM 122 and the weight RAM 124, which are accessible both by architectural programs and by non-architectural programs, the output buffer 1104 is not accessible by architectural programs, but only by non-architectural programs.

The output buffer 1104 is modified to receive a mask input 4903. Preferably, the mask input 4903 includes four bits corresponding to the four words of the output buffer 1104 that are associated with the four neural processing units 126 of the neural processing unit group 4901. Preferably, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is true, the word of the output buffer 1104 retains its current value; otherwise, the word of the output buffer 1104 is updated with the output of the activation function unit 212. That is, if the mask input 4903 bit corresponding to a word of the output buffer 1104 is false, the output of the activation function unit 212 is written to the word of the output buffer 1104. In this way, an output instruction of the non-architectural program can selectively write the outputs of the activation function units 212 to some words of the output buffer 1104 and leave the current values of the other words of the output buffer 1104 unchanged; for examples, see the instructions of the non-architectural program of Figure 51 below, e.g., the instructions at addresses 6, 10, 13 and 14. That is, the output buffer 1104 words specified in the non-architectural program determine the value generated on the mask input 4903.
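The per-word masking behavior just described can be summarized by the following sketch; it models one output buffer word group of a neural processing unit group 4901 and is an illustrative rendering of the semantics, not a description of the actual circuit.

```python
def update_output_buffer_group(buffer_words, afu_outputs, mask_bits):
    """Apply the mask input 4903 semantics to one output buffer word group.

    buffer_words : current contents of the 4 output buffer 1104 words
    afu_outputs  : outputs of the 4 activation function units 212
    mask_bits    : 4 mask bits; True means "keep the current word value"
    """
    return [cur if masked else new
            for cur, new, masked in zip(buffer_words, afu_outputs, mask_bits)]

# Example: only words 0 and 3 are updated; words 1 and 2 keep their values.
assert update_output_buffer_group([10, 20, 30, 40], [1, 2, 3, 4],
                                  [False, True, True, False]) == [1, 20, 30, 4]
```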

For simplicity of illustration, Figure 49 does not show the inputs 1811 of the mux-regs 208/705 (shown in Figures 18, 19 and 23). However, embodiments that support both dynamically configurable neural processing units 126 and the feedback/masking of the output buffer 1104 are also within the scope of the invention. Preferably, in such embodiments, the output buffer word groups are correspondingly dynamically configurable.

It should be understood that although the number of neural processing units 126 in a neural processing unit group 4901 of this embodiment is four, the invention is not limited in this respect; embodiments with more or fewer neural processing units 126 per group also fall within the scope of the invention. Furthermore, in an embodiment with shared activation function units 1112, as shown in Figure 52, the number of neural processing units 126 in a neural processing unit group 4901 and the number of neural processing units 126 in an activation function unit 212 group interact. The masking and feedback capability of the output buffer 1104 within a neural processing unit group is particularly helpful for improving the efficiency of the computations associated with the LSTM cells 4600, as described below with respect to Figures 50 and 51.

Figure 50 is a block diagram illustrating an example of the layout of data within the data RAM 122, the weight RAM 124 and the output buffer 1104 of the neural network unit 121 of Figure 49 as it performs computations associated with a layer of 128 LSTM cells 4600 of Figure 46. In the example of Figure 50, the neural network unit 121 is configured as 512 neural processing units 126, or neurons, e.g., in a wide configuration. As in the examples of Figures 47 and 48, the LSTM layer in the examples of Figures 50 and 51 has only 128 LSTM cells 4600. In the example of Figure 50, however, the values produced by all 512 neural processing units 126 (e.g., neural processing units 0 through 511) are used. When the non-architectural program of Figure 51 executes, each neural processing unit group 4901 collectively operates as one LSTM cell 4600.

As shown, the data RAM 122 holds the memory cell input (X) and output (H) values for a sequence of time steps. More specifically, for a given time step, a pair of two columns holds the X values and the H values, respectively. In the case of a data RAM 122 with 64 columns, as shown, the data RAM 122 can hold the cell values for 31 different time steps. In the example of Figure 50, columns 2 and 3 hold the values for time step 0, columns 4 and 5 hold the values for time step 1, and so on, through columns 62 and 63, which hold the values for time step 30. The first column of a pair holds the X values of the time step, and the second column holds the H values of the time step. As shown, each group of four rows of the data RAM 122 corresponding to a neural processing unit group 4901 holds the values for its corresponding LSTM cell 4600. That is, rows 0 through 3 hold the values associated with LSTM cell 0, whose computations are performed by neural processing units 0-3, i.e., by neural processing unit group 0; rows 4 through 7 hold the values associated with LSTM cell 1, whose computations are performed by neural processing units 4-7, i.e., by neural processing unit group 1; and so on, through rows 508 through 511, which hold the values associated with LSTM cell 127, whose computations are performed by neural processing units 508-511, i.e., by neural processing unit group 127, as described below with respect to Figure 51. As shown, column 1 is not used, and column 0 holds the initial memory cell output (H) values, which in a preferred embodiment are populated with zero values by the architectural program; however, the invention is not limited in this respect, and populating the initial memory cell output (H) values of column 0 by non-architectural program instructions is also within the scope of the invention.
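Under the layout just described (a pair of columns per time step starting at column 2, and four data RAM rows per LSTM cell), the mapping from a time step and a cell index to data RAM locations can be sketched as follows; the function names are illustrative assumptions only.

```python
def fig50_data_ram_columns(t):
    """Columns of the pair holding time step t in the Figure 50 layout."""
    x_col = 2 + 2 * t      # columns 2, 4, ..., 62 hold X
    h_col = x_col + 1      # columns 3, 5, ..., 63 hold H
    return x_col, h_col

def fig50_data_ram_rows(cell):
    """The four data RAM 122 rows associated with LSTM cell `cell` (0..127)."""
    return range(4 * cell, 4 * cell + 4)   # e.g. cell 127 -> rows 508..511

assert fig50_data_ram_columns(30) == (62, 63)
assert list(fig50_data_ram_rows(127)) == [508, 509, 510, 511]
```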

Preferably, the X values (in columns 2, 4, 6 and so on through column 62) are written/populated into the data RAM 122 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the neural network unit 121, such as the non-architectural program of Figure 51. Preferably, the H values (in columns 3, 5, 7 and so on through column 63) are written/populated into the data RAM 122 and read/used by the non-architectural program running on the neural network unit 121, as described below. Preferably, the H values are also read by the architectural program running on the processor 100 via MFNN instructions 1500. It should be noted that the non-architectural program of Figure 51 assumes that within each group of four rows corresponding to a neural processing unit group 4901 (e.g., rows 0-3, rows 4-7, rows 8-11 and so on through rows 508-511), the four X values of a given column are populated with the same value (e.g., by the architectural program). Similarly, within each group of four rows corresponding to a neural processing unit group 4901, the non-architectural program of Figure 51 computes and writes the same value to the four H values of a given column.

As shown in the figure, the weight RAM 124 holds the weight, bias and cell state (C) values needed by the NPUs of the NNU 121. Within each group of four columns corresponding to an NPU group 4901 (e.g., columns 0-3, columns 4-7, columns 8-11, and so on through columns 508-511): (1) a column whose index divided by 4 has a remainder of 3 holds the Wc, Uc, Bc and C values in its rows 0, 1, 2 and 6, respectively; (2) a column whose index divided by 4 has a remainder of 2 holds the Wo, Uo and Bo values in its rows 3, 4 and 5, respectively; (3) a column whose index divided by 4 has a remainder of 1 holds the Wf, Uf and Bf values in its rows 3, 4 and 5, respectively; and (4) a column whose index divided by 4 has a remainder of 0 holds the Wi, Ui and Bi values in its rows 3, 4 and 5, respectively. Preferably, these weight and bias values, namely Wi, Ui, Bi, Wf, Uf, Bf, Wc, Uc, Bc, Wo, Uo and Bo (in rows 0 through 5), are written/populated into the weight RAM 124 by the architectural program running on the processor 100 via MTNN instructions 1400 and are read/used by the non-architectural program running on the NNU 121, such as the non-architectural program of Figure 51. Preferably, the intermediate C values are written/populated into the weight RAM 124 and read/used by the non-architectural program running on the NNU 121, as described below.

The example of Figure 50 assumes that the architectural program performs the following steps: (1) for 31 different time steps, populates the data RAM 122 with the input X values (rows 2, 4, and so on through row 62); (2) starts the non-architectural program of Figure 51; (3) detects that the non-architectural program has completed; (4) reads the output H values out of the data RAM 122 (rows 3, 5, and so on through row 63); and (5) repeats steps (1) through (4) as many times as needed to complete a task, e.g., the computations needed to perform recognition of an utterance of the user of a mobile phone.
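For illustration only, the driver loop just described might be sketched in C as follows; the helper functions mtnn_write_row, mfnn_read_row, nnu_start_program and nnu_is_done are assumed stand-ins for the MTNN 1400 / MFNN 1500 instruction sequences and the program start/status mechanism, and are not functions named by this disclosure.

#include <stdint.h>
#include <stdbool.h>

#define TIME_STEPS 31
#define ROW_WORDS  512                                  /* words per data RAM 122 row */

void mtnn_write_row(int row, const uint16_t *words);    /* assumed helper */
void mfnn_read_row(int row, uint16_t *words);           /* assumed helper */
void nnu_start_program(void);                           /* assumed helper */
bool nnu_is_done(void);                                 /* assumed helper */

void lstm_layer_batch(const uint16_t x[TIME_STEPS][ROW_WORDS],
                      uint16_t h[TIME_STEPS][ROW_WORDS])
{
    for (int t = 0; t < TIME_STEPS; t++)    /* step (1): X into rows 2, 4, ..., 62 */
        mtnn_write_row(2 + 2 * t, x[t]);
    nnu_start_program();                    /* step (2) */
    while (!nnu_is_done())                  /* step (3) */
        ;
    for (int t = 0; t < TIME_STEPS; t++)    /* step (4): H from rows 3, 5, ..., 63 */
        mfnn_read_row(3 + 2 * t, h[t]);
}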

In another approach, the architectural program performs the following steps: (1) for a single time step, populates the data RAM 122 with the input X values (e.g., row 2); (2) starts the non-architectural program (a modified version of the Figure 51 non-architectural program that does not loop and accesses only a single pair of data RAM 122 rows); (3) detects that the non-architectural program has completed; (4) reads the output H values out of the data RAM 122 (e.g., row 3); and (5) repeats steps (1) through (4) as many times as needed to complete the task. Which of the two approaches is superior may depend upon the manner in which the input X values to the LSTM layer are sampled. For example, if the task tolerates sampling the input over multiple time steps (e.g., on the order of 31 time steps) and then performing the computations, the first approach is preferable, since it is likely more efficient in its use of computational resources and/or higher performing; whereas, if the task can only tolerate sampling at a single time step, the second approach is required.

A third embodiment is similar to the second approach, except that, rather than using a single pair of data RAM 122 rows as the second approach does, the non-architectural program uses multiple pairs of rows, i.e., a different pair of rows for each time step, in which respect it is similar to the first approach. Preferably, the architectural program of this third embodiment includes a step before step (2) in which it updates the non-architectural program before starting it, e.g., by updating the data RAM 122 row within the instruction at address 1 to point to the next pair of rows.

As shown in the figure, for NPUs 0 through 511 of the NNU 121, after execution of the instructions at the various addresses of the non-architectural program of Figure 51, the output buffer 1104 holds intermediate values of the cell output (H), the candidate cell state (C'), the input gate (I), the forget gate (F), the output gate (O), the cell state (C) and tanh(C). Within each output buffer word group (e.g., each group of four output buffer 1104 words corresponding to an NPU group 4901, such as words 0-3, 4-7, 8-11, and so on through 508-511), the word whose index divided by 4 has a remainder of 3 is denoted OUTBUF[3], the word whose index divided by 4 has a remainder of 2 is denoted OUTBUF[2], the word whose index divided by 4 has a remainder of 1 is denoted OUTBUF[1], and the word whose index divided by 4 has a remainder of 0 is denoted OUTBUF[0].

As shown in the figure, after the instruction at address 2 of the non-architectural program of Figure 51 is executed, for each NPU group 4901 all four words of the output buffer 1104 are written with the initial cell output (H) value of the corresponding LSTM cell 4600. After the instruction at address 6 is executed, for each NPU group 4901 the OUTBUF[3] word of the output buffer 1104 is written with the candidate cell state (C') value of the corresponding LSTM cell 4600, while the other three words of the output buffer 1104 retain their previous values. After the instruction at address 10 is executed, for each NPU group 4901 the OUTBUF[0] word is written with the input gate (I) value, the OUTBUF[1] word is written with the forget gate (F) value, and the OUTBUF[2] word is written with the output gate (O) value of the corresponding LSTM cell 4600, while the OUTBUF[3] word retains its previous value. After the instruction at address 13 is executed, for each NPU group 4901 the OUTBUF[3] word is written with the new cell state (C) value of the corresponding LSTM cell 4600 (the contents of the output buffer 1104, including the C value in slot 3, are also written to row 6 of the weight RAM 124, as described in more detail below with respect to Figure 51), while the other three words of the output buffer 1104 retain their previous values. After the instruction at address 14 is executed, for each NPU group 4901 the OUTBUF[3] word is written with the tanh(C) value of the corresponding LSTM cell 4600, while the other three words retain their previous values. After the instruction at address 16 is executed, for each NPU group 4901 all four words of the output buffer 1104 are written with the new cell output (H) value of the corresponding LSTM cell 4600. The flow from address 6 through address 16 (i.e., excluding the execution of address 2, since address 2 is not part of the program loop) is repeated another thirty times as the loop instruction at address 17 loops back to address 3.

Figure 51 is a table illustrating a program for storage in the program memory 129 of the NNU 121, which is executed by the NNU 121 of Figure 49 and uses data and weights according to the arrangement of Figure 50 to accomplish computations associated with an LSTM cell layer. The example program of Figure 51 includes 18 non-architectural instructions at addresses 0 through 17, respectively. The instruction at address 0 is an initialize instruction that clears the accumulator 202 and initializes the loop counter 3804 to a value of 31 so that the loop body (the instructions at addresses 3 through 17) is executed 31 times. The initialize instruction also initializes the data RAM 122 row to be written (e.g., register 2606 of Figures 26/39) to a value of 1, which is increased to 3 upon the first execution of the instruction at address 16. Preferably, the initialize instruction also places the NNU 121 in a wide configuration, such that the NNU 121 is configured as 512 NPUs 126. As described below, during execution of the instructions at addresses 0 through 17, the 128 NPU groups 4901 formed from the 512 NPUs 126 operate as 128 corresponding LSTM cells 4600.

The instructions at addresses 1 and 2 are not part of the program loop and are executed only once. They generate the initial cell output (H) value (e.g., zero) and write it to all words of the output buffer 1104. The instruction at address 1 reads the initial H values from row 0 of the data RAM 122 and places them into the accumulator 202, which was cleared by the instruction at address 0. The instruction at address 2 (OUTPUT PASSTHRU, NOP, CLR ACC) passes the accumulator 202 value through to the output buffer 1104, as shown in Figure 50. The "NOP" designation in the output instruction at address 2 (and in the other output instructions of Figure 51) indicates that the value being output is written only to the output buffer 1104, not to memory, i.e., not to the data RAM 122 nor to the weight RAM 124. The instruction at address 2 also clears the accumulator 202.

The instructions at addresses 3 through 17 are inside the loop body and are executed a number of times equal to the loop count value (e.g., 31).

Each execution of the instructions at addresses 3 through 6 computes the candidate cell state (C') value of the current time step (i.e., the hyperbolic tangent of the accumulated weighted sum) and writes it to word OUTBUF[3], which will be used by the instruction at address 11. More specifically, the multiply-accumulate instruction at address 3 reads the cell input (X) value associated with the time step from the current read row of the data RAM 122 (e.g., row 2, 4, 6 and so on through row 62), reads the Wc value from row 0 of the weight RAM 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 2.

The multiply-accumulate instruction at address 4 (MULT-ACCUM OUTBUF[0], WR ROW 1) reads the H value from word OUTBUF[0] (all four NPUs 126 of the NPU group 4901 do so), reads the Uc value from row 1 of the weight RAM 124, and multiplies them to generate a second product that is added to the accumulator 202.

The add-weight-word-to-accumulator instruction at address 5 (ADD_W_ACC WR ROW 2) reads the Bc value from row 2 of the weight RAM 124 and adds it to the accumulator 202.

The output instruction at address 6 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value and writes the result only to word OUTBUF[3] (that is, only the NPU 126 of the NPU group 4901 whose index divided by 4 has a remainder of 3 writes its result), and the accumulator 202 is cleared. That is, the output instruction at address 6 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] (as the instruction nomenclature MASK[0:2] denotes) so that they retain their current values, as shown in Figure 50. Additionally, the output instruction at address 6 does not write to memory (as the instruction nomenclature NOP denotes).
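For reference, in conventional LSTM notation (the time-step subscripts t and t-1 are editorial shorthand for the quantities named above and do not appear in the figures), the instructions at addresses 3 through 6 therefore compute:

C'_t = \tanh(W_c X_t + U_c H_{t-1} + B_c)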

Each execution of the instructions at addresses 7 through 10 computes the input gate (I), forget gate (F) and output gate (O) values of the current time step and writes them to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively, where they will be used by the instructions at addresses 11, 12 and 15. More specifically, the multiply-accumulate instruction at address 7 reads the cell input (X) value associated with the time step from the current read row of the data RAM 122 (e.g., row 2, 4, 6 and so on through row 62), reads the Wi, Wf and Wo values from row 3 of the weight RAM 124, and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 6. More specifically, within an NPU group 4901, the NPU 126 whose index divided by 4 has a remainder of 0 computes the product of X and Wi, the NPU 126 whose index divided by 4 has a remainder of 1 computes the product of X and Wf, and the NPU 126 whose index divided by 4 has a remainder of 2 computes the product of X and Wo.

The multiply-accumulate instruction at address 8 reads the H value from word OUTBUF[0] (all four NPUs 126 of the NPU group 4901 do so), reads the Ui, Uf and Uo values from row 4 of the weight RAM 124, and multiplies them to generate a second product that is added to the accumulator 202. More specifically, within an NPU group 4901, the NPU 126 whose index divided by 4 has a remainder of 0 computes the product of H and Ui, the NPU 126 whose index divided by 4 has a remainder of 1 computes the product of H and Uf, and the NPU 126 whose index divided by 4 has a remainder of 2 computes the product of H and Uo.

The add-weight-word-to-accumulator instruction at address 9 (ADD_W_ACC WR ROW 5) reads the Bi, Bf and Bo values from row 5 of the weight RAM 124 and adds them to the accumulator 202. More specifically, within an NPU group 4901, the NPU 126 whose index divided by 4 has a remainder of 0 performs the addition of the Bi value, the NPU 126 whose index divided by 4 has a remainder of 1 performs the addition of the Bf value, and the NPU 126 whose index divided by 4 has a remainder of 2 performs the addition of the Bo value.

The output instruction at address 10 (OUTPUT SIGMOID, NOP, MASK[3], CLR ACC) performs a sigmoid activation function on the accumulator 202 value, writes the computed I, F and O values to words OUTBUF[0], OUTBUF[1] and OUTBUF[2], respectively, and clears the accumulator 202, without writing to memory. That is, the output instruction at address 10 masks word OUTBUF[3] (as the instruction nomenclature MASK[3] denotes) so that it retains its current value (which is C'), as shown in Figure 50.
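Likewise, in the same editorial notation, the instructions at addresses 7 through 10 compute the three gate values, where \sigma denotes the sigmoid function:

I_t = \sigma(W_i X_t + U_i H_{t-1} + B_i)
F_t = \sigma(W_f X_t + U_f H_{t-1} + B_f)
O_t = \sigma(W_o X_t + U_o H_{t-1} + B_o)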

Each execution of the instructions at addresses 11 through 13 computes the new cell state (C) value generated by the current time step and writes it to row 6 of the weight RAM 124 for use in the next time step (i.e., for use by the instruction at address 12 on the next execution of the loop); more specifically, the value is written to the word of row 6 that, within the four columns corresponding to the NPU group 4901, has an index whose remainder when divided by 4 is 3. Additionally, each execution of the instruction at address 14 writes the tanh(C) value to OUTBUF[3] for use by the instruction at address 15.

More specifically, the multiply-accumulate instruction at address 11 (MULT-ACCUM OUTBUF[0], OUTBUF[3]) reads the input gate (I) value from word OUTBUF[0], reads the candidate cell state (C') value from word OUTBUF[3], and multiplies them to generate a first product that is added to the accumulator 202, which was cleared by the instruction at address 10. More specifically, each of the four NPUs 126 of the NPU group 4901 computes the first product of the I value and the C' value.

The multiply-accumulate instruction at address 12 (MULT-ACCUM OUTBUF[1], WR ROW 6) instructs the NPUs 126 to read the forget gate (F) value from word OUTBUF[1], to read their corresponding word from row 6 of the weight RAM 124, and to multiply them to generate a second product that is added to the first product generated in the accumulator 202 by the instruction at address 11. More specifically, for the NPU 126 of the NPU group 4901 whose index divided by 4 has a remainder of 3, the word read from row 6 is the current cell state (C) value computed in the previous time step, so that the sum of the first product and the second product is the new cell state (C). However, for the other three NPUs 126 of the NPU group 4901, the words read from row 6 are don't-care values, since the accumulated values they produce will not be used, i.e., they will not be placed into the output buffer 1104 by the instructions at addresses 13 and 14 and will be cleared by the instruction at address 14. That is, only the new cell state (C) value produced by the NPU 126 of the NPU group 4901 whose index divided by 4 has a remainder of 3 will be used, namely by the instructions at addresses 13 and 14. For the second through thirty-first executions of the instruction at address 12, the C value read from row 6 of the weight RAM 124 is the value written by the instruction at address 13 during the previous execution of the loop body. However, for the first execution of the instruction at address 12, the C values of row 6 are initial values, written either by the architectural program before it starts the non-architectural program of Figure 51 or by a modified version of the non-architectural program.

The output instruction at address 13 (OUTPUT PASSTHRU, WR ROW 6, MASK[0:2]) passes the accumulator 202 value, i.e., the computed C value, only to word OUTBUF[3] (that is, only the NPU 126 of the NPU group 4901 whose index divided by 4 has a remainder of 3 writes its computed C value to the output buffer 1104), and row 6 of the weight RAM 124 is written with the updated contents of the output buffer 1104, as shown in Figure 50. That is, the output instruction at address 13 masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] so that they retain their current values (the I, F and O values). As described above, only the C value in the word of row 6 whose column index, within each group of four columns corresponding to an NPU group 4901, has a remainder of 3 when divided by 4 is used, namely by the instruction at address 12; thus, the non-architectural program does not care about the values in row 6 of the weight RAM 124 in columns 0-2, columns 4-6, and so on through columns 508-510, as shown in Figure 50 (these are the I, F and O values).

The output instruction at address 14 (OUTPUT TANH, NOP, MASK[0:2], CLR ACC) performs a hyperbolic tangent activation function on the accumulator 202 value, writes the computed tanh(C) value to word OUTBUF[3], and clears the accumulator 202, without writing to memory. The output instruction at address 14, like the output instruction at address 13, masks words OUTBUF[0], OUTBUF[1] and OUTBUF[2] so that they retain their previous values, as shown in Figure 50.
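In the same editorial notation, the instructions at addresses 11 through 14 therefore compute the new cell state and its hyperbolic tangent:

C_t = I_t C'_t + F_t C_{t-1}, followed by \tanh(C_t)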

Each execution of the instructions at addresses 15 and 16 computes the cell output (H) value generated by the current time step and writes it to the row of the data RAM 122 two rows past the current output row, from which it will be read by the architectural program and used for the next time step (i.e., used by the instructions at addresses 3 and 7 during the next execution of the loop). More specifically, the multiply-accumulate instruction at address 15 reads the output gate (O) value from word OUTBUF[2], reads the tanh(C) value from word OUTBUF[3], and multiplies them to generate a product that is added to the accumulator 202, which was cleared by the instruction at address 14. More specifically, each of the four NPUs 126 of the NPU group 4901 computes the product of the O value and tanh(C).

The output instruction at address 16 passes the accumulator 202 value through and writes the computed H values to row 3 on its first execution, to row 5 on its second execution, and so on through row 63 on its thirty-first execution, as shown in Figure 50; these values are subsequently used by the instructions at addresses 4 and 8. Additionally, as shown in Figure 50, the computed H values are placed into the output buffer 1104 for subsequent use by the instructions at addresses 4 and 8. The output instruction at address 16 also clears the accumulator 202. In one embodiment, the LSTM cell 4600 is designed such that the output instruction at address 16 (and/or the output instruction at address 22 of Figure 48) has an activation function, e.g., sigmoid or hyperbolic tangent, rather than passing the accumulator 202 value through.
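Finally, in the same editorial notation, the instructions at addresses 15 and 16 complete the time step by computing the cell output:

H_t = O_t \tanh(C_t)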

The loop instruction at address 17 decrements the loop counter 3804 and loops back to the instruction at address 3 if the new loop counter 3804 value is greater than zero.

It may thus be observed that, because of the feedback and masking capability of the output buffer 1104 in the embodiment of the NNU 121 of Figure 49, the number of instructions within the loop body of the non-architectural program of Figure 51 is approximately 34% smaller than that of the non-architectural program of Figure 48. Additionally, because of the feedback and masking capability of the output buffer 1104 in the embodiment of the NNU 121 of Figure 49, the memory arrangement in the data RAM 122 used by the non-architectural program of Figure 51 accommodates approximately three times as many time steps as that of Figure 48. These improvements may be helpful for certain architectural program applications that employ the NNU 121 to perform LSTM cell layer computations, particularly applications in which the number of LSTM cells 4600 in the LSTM cell layer is less than or equal to 128.

The embodiments of Figures 47 through 51 assume that the weight and bias values remain the same across the time steps. However, the invention is not limited in this respect, and other embodiments in which the weight and bias values vary with the time step are also within the scope of the invention; in such embodiments, rather than being populated with a single set of weight and bias values as shown in Figures 47 through 50, the weight RAM 124 is populated with a different set of weight and bias values for each time step, and the weight RAM 124 addresses of the non-architectural programs of Figures 48 through 51 are adjusted accordingly.

Generally speaking, in the embodiments of Figures 47 through 51 described above, the weight, bias and intermediate values (e.g., the C and C' values) are stored in the weight RAM 124, while the input and output values (e.g., the X and H values) are stored in the data RAM 122. This is advantageous for embodiments in which the data RAM 122 is dual-ported and the weight RAM 124 is single-ported, since there is more traffic from the non-architectural program and the architectural program to the data RAM 122. However, because the weight RAM 124 is larger, in another embodiment of the invention the memories to which the non-architectural and architectural programs write the values are swapped (i.e., the data RAM 122 and the weight RAM 124 are interchanged). That is, the W, U, B, C', tanh(C) and C values are stored in the data RAM 122 and the X, H, I, F and O values are stored in the weight RAM 124 (a modified embodiment of Figure 47); and the W, U, B and C values are stored in the data RAM 122 and the X and H values are stored in the weight RAM 124 (a modified embodiment of Figure 50). Because the weight RAM 124 is larger, these embodiments can process more time steps within a batch. For applications of architectural programs that employ the NNU 121 to perform computations, this may be advantageous for certain applications that benefit from the larger number of time steps and for which the single-ported memory (e.g., the weight RAM 124) provides sufficient bandwidth.

Figure 52 is a block diagram illustrating an embodiment of the NNU 121 in which the NPU groups have output buffer masking and feedback capability and share activation function units 1112. The NNU 121 of Figure 52 is similar to the NNU 121 of Figure 47, and elements with the same reference numerals are similar. However, the four activation function units 212 of Figure 49 are replaced in this embodiment by a single shared activation function unit 1112 that receives the four outputs 217 from the four accumulators 202 and generates four outputs to words OUTBUF[0], OUTBUF[1], OUTBUF[2] and OUTBUF[3]. The NNU 121 of Figure 52 operates in a manner similar to the embodiments described above with respect to Figures 49 through 51, and operates the shared activation function unit 1112 in a manner similar to the embodiments described above with respect to Figures 11 through 13.

Figure 53 is a block diagram illustrating another embodiment of the arrangement of data within the data RAM 122, the weight RAM 124 and the output buffer 1104 of the NNU 121 of Figure 49 when the NNU 121 performs computations associated with a layer of 128 LSTM cells 4600 of Figure 46. The example of Figure 53 is similar to that of Figure 50. However, in Figure 53, the Wi, Wf and Wo values are in row 0 (rather than in row 3 as in Figure 50); the Ui, Uf and Uo values are in row 1 (rather than in row 4 as in Figure 50); the Bi, Bf and Bo values are in row 2 (rather than in row 5 as in Figure 50); and the C values are in row 3 (rather than in row 6 as in Figure 50). Additionally, the contents of the output buffer 1104 of Figure 53 are similar to those of Figure 50; however, because of the differences between the non-architectural programs of Figures 54 and 51, the contents of the third row (i.e., the I, F, O and C' values) appear in the output buffer 1104 after the instruction at address 7 is executed (rather than the instruction at address 10 as in Figure 50); the contents of the fourth row (i.e., the I, F, O and C values) appear in the output buffer 1104 after the instruction at address 10 is executed (rather than the instruction at address 13 as in Figure 50); the contents of the fifth row (i.e., the I, F, O and tanh(C) values) appear in the output buffer 1104 after the instruction at address 11 is executed (rather than the instruction at address 14 as in Figure 50); and the contents of the sixth row (i.e., the H values) appear in the output buffer 1104 after the instruction at address 13 is executed (rather than the instruction at address 16 as in Figure 50), as described below.

Figure 54 is a table illustrating a program for storage in the program memory 129 of the NNU 121, which is executed by the NNU 121 of Figure 49 and uses data and weights according to the arrangement of Figure 53 to accomplish computations associated with an LSTM cell layer. The example program of Figure 54 is similar to the program of Figure 51. More specifically, the instructions at addresses 0 through 5 are the same in Figures 54 and 51; the instructions at addresses 7 and 8 of Figure 54 are the same as the instructions at addresses 10 and 11 of Figure 51; and the instructions at addresses 10 through 14 of Figure 54 are the same as the instructions at addresses 13 through 17 of Figure 51.

However, the instruction at address 6 of Figure 54 does not clear the accumulator 202 (whereas the instruction at address 6 of Figure 51 does). Additionally, the instructions at addresses 7 through 9 of Figure 51 are not present in the non-architectural program of Figure 54. Finally, the instruction at address 9 of Figure 54 is the same as the instruction at address 12 of Figure 51, except that the instruction at address 9 of Figure 54 reads row 3 of the weight RAM 124 whereas the instruction at address 12 of Figure 51 reads row 6 of the weight RAM 124.

As a result of the differences between the non-architectural program of Figure 54 and that of Figure 51, the arrangement of Figure 53 uses three fewer rows of the weight RAM 124, and the program loop includes three fewer instructions. The loop body size of the non-architectural program of Figure 54 is essentially half the loop body size of the non-architectural program of Figure 48, and approximately 80% of the loop body size of the non-architectural program of Figure 51.

Figure 55 is a block diagram illustrating portions of an NPU 126 according to another embodiment of the invention. More specifically, for a single NPU 126 of the NPUs 126 of Figure 49, the multiplexed register 208 with its associated inputs 207, 211 and 4905 and the multiplexed register 705 with its associated inputs 206, 711 and 4907 are shown. In addition to the inputs of Figure 49, the multiplexed register 208 and the multiplexed register 705 of the NPU 126 each receive an index-within-group (index_within_group) input 5599. The index-within-group input 5599 indicates the index of the particular NPU 126 within its NPU group 4901. Thus, for example, in an embodiment in which each NPU group 4901 has four NPUs 126, within each NPU group 4901 one of the NPUs 126 receives a value of zero on its index-within-group input 5599, one of the NPUs 126 receives a value of one, one of the NPUs 126 receives a value of two, and one of the NPUs 126 receives a value of three. In other words, the index_within_group input 5599 value received by an NPU 126 is its index within the NNU 121 modulo J, where J is the number of NPUs 126 in an NPU group 4901. Thus, for example, NPU 73 receives a value of one on its index_within_group input 5599, NPU 353 receives a value of three on its index_within_group input 5599, and NPU 6 receives a value of two on its index_within_group input 5599.
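For illustration only (the function name and types below are editorial and not part of the disclosure), the mapping just described is simply the remainder of an integer division:

/* Illustrative only: the index_within_group input 5599 value of an NPU is its
 * index within the NNU modulo J, the number of NPUs per NPU group 4901. */
static unsigned index_within_group(unsigned npu_index, unsigned J)
{
    return npu_index % J;   /* e.g., with J = 4: NPU 73 -> 1, NPU 6 -> 2 */
}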

Additionally, when the control input 213 specifies a predetermined value, referred to here as "SELF," the multiplexed register 208 selects the output buffer 1104 output 4905 corresponding to the value of the index-within-group input 5599. Thus, when a non-architectural instruction specifies receiving data from the output buffer 1104 with the SELF value (denoted OUTBUF[SELF] in the instructions at addresses 2 and 7 of Figure 57), the multiplexed register 208 of each NPU 126 receives its corresponding word from the output buffer 1104. Thus, for example, when the NNU 121 executes the non-architectural instructions at addresses 2 and 7 of Figure 57, the multiplexed register 208 of NPU 73 selects the second (index 1) of its four inputs 4905 to receive word 73 from the output buffer 1104, the multiplexed register 208 of NPU 353 selects the fourth (index 3) of its four inputs 4905 to receive word 353 from the output buffer 1104, and the multiplexed register 208 of NPU 6 selects the third (index 2) of its four inputs 4905 to receive word 6 from the output buffer 1104. Although not used in the non-architectural program of Figure 57, a non-architectural instruction may also specify receiving data from the output buffer 1104 with the SELF value (OUTBUF[SELF]) so that the control input 713 specifies the predetermined value, causing the multiplexed register 705 of each NPU 126 to receive its corresponding word from the output buffer 1104.

Figure 56 is a block diagram illustrating an example of the arrangement of data within the data RAM 122 and the weight RAM 124 of the NNU 121 when the NNU performs the computations associated with the Jordan recurrent neural network of Figure 43 and employs the embodiment of Figure 55. The weight arrangement within the weight RAM 124 is the same as in the example of Figure 44. The arrangement of the values within the data RAM 122 is similar to the example of Figure 44, except that in this example each time step has a corresponding pair of two rows that hold the input layer node D values and the output layer node Y values, rather than a set of four rows as in the example of Figure 44. That is, in this example the hidden layer Z values and the content layer C values are not written to the data RAM 122. Instead, the output buffer 1104 serves as a scratchpad of sorts for the hidden layer Z values and the content layer C values, as described with respect to the non-architectural program of Figure 57. The OUTBUF[SELF] output buffer 1104 feedback feature described above enables the non-architectural program to operate faster (by replacing two writes to and two reads from the data RAM 122 with two writes to and two reads from the output buffer 1104) and reduces the data RAM 122 space used by each time step, so that the data RAM 122 of this embodiment holds data for approximately twice as many time steps as the embodiment of Figures 44 and 45, namely 32 time steps, as shown in the figure.

Figure 57 is a table illustrating a program for storage in the program memory 129 of the NNU 121, which is executed by the NNU 121 and uses data and weights according to the arrangement of Figure 56 to accomplish a Jordan recurrent neural network. The non-architectural program of Figure 57 is similar to the non-architectural program of Figure 45, with the differences described below.

The example program of Figure 57 includes 12 non-architectural instructions at addresses 0 through 11, respectively. The initialize instruction at address 0 clears the accumulator 202 and initializes the loop counter 3804 to a value of 32 so that the loop body (the instructions at addresses 2 through 11) is executed 32 times. The output instruction at address 1 places the zero value of the accumulator 202 (which was cleared by the instruction at address 0) into the output buffer 1104. It may be observed that the 512 NPUs 126 correspond to and operate as the 512 hidden layer nodes Z during execution of the instructions at addresses 2 through 6, and correspond to and operate as the 512 output layer nodes Y during execution of the instructions at addresses 7 through 10. That is, the 32 executions of the instructions at addresses 2 through 6 compute the hidden layer node Z values for the 32 corresponding time steps and place them into the output buffer 1104 for use by the corresponding 32 executions of the instructions at addresses 7 through 9, which compute the output layer node Y values for the 32 corresponding time steps and write them to the data RAM 122, and for use by the corresponding 32 executions of the instruction at address 10, which places the content layer node C values for the 32 corresponding time steps into the output buffer 1104. (The content layer node C values of the 32nd time step placed into the output buffer 1104 are not used.)

During the first execution of the instructions at addresses 2 and 3 (ADD_D_ACC OUTBUF[SELF] and ADD_D_ACC ROTATE, COUNT=511), each of the 512 NPUs 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 0 through 1. During the second execution of the instructions at addresses 2 and 3, each of the 512 NPUs 126 accumulates into its accumulator 202 the 512 content node C values of the output buffer 1104, which were generated and written by the execution of the instructions at addresses 7 through 8 and 10. More specifically, the instruction at address 2 instructs the multiplexed register 208 of each NPU 126 to select its corresponding output buffer 1104 word, as described above, and add it to the accumulator 202; the instruction at address 3 instructs the NPUs 126 to rotate the content node C values within the 512-word rotator formed by the collective operation of the connected multiplexed registers 208 of the 512 NPUs, which enables each NPU 126 to accumulate the 512 content node C values into its accumulator 202. The instruction at address 3 does not clear the accumulator 202, so that the instructions at addresses 4 and 5 can add the input layer node D values (multiplied by their corresponding weights) to the content layer node C values accumulated by the instructions at addresses 2 and 3.

During each execution of the instructions at addresses 4 and 5 (MULT-ACCUM DR ROW +2, WR ROW 0 and MULT-ACCUM ROTATE, WR ROW +1, COUNT=511), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 input node D values of the data RAM 122 row associated with the current time step (e.g., row 0 for time step 0, row 2 for time step 1, and so on through row 62 for time step 31) by the weights of the column of the weight RAM 124 corresponding to the NPU 126 from rows 0 through 511, to generate 512 products that, together with the accumulation of the 512 content node C values performed by the instructions at addresses 2 and 3, are accumulated into the accumulator 202 of the corresponding NPU 126 to compute the hidden node Z layer values.

During each execution of the instruction at address 6 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 accumulator 202 values of the 512 NPUs 126 are passed through and written to their corresponding words of the output buffer 1104, and the accumulator 202 is cleared.

During the execution of the instructions at addresses 7 and 8 (MULT-ACCUM OUTBUF[SELF], WR ROW 512 and MULT-ACCUM ROTATE, WR ROW +1, COUNT=511), each of the 512 NPUs 126 performs 512 multiply operations, multiplying the 512 hidden node Z values of the output buffer 1104 (which were generated and written by the corresponding execution of the instructions at addresses 2 through 6) by the weights of the column of the weight RAM 124 corresponding to the NPU 126 from rows 512 through 1023, to generate 512 products that are accumulated into the accumulator 202 of the corresponding NPU 126.

During each execution of the instruction at address 9 (OUTPUT ACTIVATION FUNCTION, DR OUT ROW +2), an activation function (e.g., hyperbolic tangent, sigmoid, rectify) is performed on the 512 accumulated values to compute the output node Y values, which are written to the row of the data RAM 122 associated with the current time step (e.g., row 1 for time step 0, row 3 for time step 1, and so on through row 63 for time step 31). The instruction at address 9 does not clear the accumulator 202.

During each execution of the instruction at address 10 (OUTPUT PASSTHRU, NOP, CLR ACC), the 512 values accumulated by the instructions at addresses 7 and 8 are placed into the output buffer 1104 for use by the next execution of the instructions at addresses 2 and 3, and the accumulator 202 is cleared.
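As an editorial summary only (the symbols W, U, the subscripts and the index notation below are introduced here for illustration and do not appear in the figures), one pass of the loop body therefore computes, for each of the 512 nodes i at time step t:

Z_i[t] = \sum_{k=0}^{511} C_k[t-1] + \sum_{j=0}^{511} W_{j,i} D_j[t]    (addresses 2 through 6)
A_i[t] = \sum_{j=0}^{511} U_{j,i} Z_j[t]                                (addresses 7 and 8)
Y_i[t] = AF(A_i[t]),    C_i[t] = A_i[t]                                 (addresses 9 and 10)

where W denotes the weights in rows 0 through 511 of the weight RAM 124, U denotes the weights in rows 512 through 1023, and AF denotes the activation function performed by the instruction at address 9.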

The loop instruction at address 11 decrements the loop counter 3804 and loops back to the instruction at address 2 if the new loop counter 3804 value is greater than zero.

As described in the section corresponding to Figure 44, in the example in which the non-architectural program of Figure 57 performs a Jordan recurrent neural network, although an activation function is applied to the accumulator 202 values to generate the output layer node Y values, the example assumes that the accumulator 202 values are passed to the content layer nodes C before the activation function is applied, rather than the true output layer node Y values being passed. However, for a Jordan recurrent neural network in which the activation function is applied to the accumulator 202 value to generate the content layer node C values, the instruction at address 10 would be removed from the non-architectural program of Figure 57. In the embodiments described herein, the Elman or Jordan recurrent neural network has a single hidden node layer (e.g., Figures 40 and 42); however, it should be understood that embodiments of the processor 100 and the NNU 121 can efficiently perform the computations associated with recurrent neural networks having multiple hidden layers in a manner similar to that described herein.

As described above in the section corresponding to Figure 2, each NPU 126 operates as a neuron within an artificial neural network, and all of the NPUs 126 of the NNU 121 effectively compute, in a massively parallel fashion, the neuron output values for a layer of the network. The parallelism of the NNU, in particular the rotator formed collectively by the NPU multiplexed registers, is not an approach that follows intuitively from the conventional manner of computing neuron layer outputs. More specifically, conventional approaches typically involve computations associated with a single neuron, or a very small subset of neurons (e.g., performing the multiplies and adds with parallel arithmetic units), and then proceed to the computations associated with the next neuron of the same layer, and so forth in a serial fashion until the computations have been completed for all of the neurons in the layer. By contrast, in the present invention, during each clock cycle all of the NPUs 126 (neurons) of the NNU 121 perform in parallel a small set of the computations (e.g., a single multiply and accumulate) needed to generate all of the neuron outputs. After approximately M clock cycles, where M is the number of nodes connected into the current layer, the NNU 121 has computed the outputs of all of the neurons. For many artificial neural network configurations, because of the large number of NPUs 126, the NNU 121 is able to compute the neuron output values for all of the neurons of the entire layer at the end of the M clock cycles. As described herein, this computation is efficient for all kinds of artificial neural network computations, including but not limited to feed-forward and recurrent neural networks such as Elman, Jordan and LSTM networks. Finally, although in the embodiments herein the NNU 121 is configured as 512 NPUs 126 (e.g., in a wide-word configuration) to perform the recurrent neural network computations, the invention is not limited in this respect; embodiments in which the NNU 121 is configured as 1024 NPUs 126 (e.g., in a narrow-word configuration) to perform the recurrent neural network computations, and NNUs 121 having numbers of NPUs 126 other than 512 and 1024 as described above, are also within the scope of the invention.
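For illustration only, the rotator-based layer computation just described can be restated behaviorally in C as follows; the array names and layout are editorial, the direction of rotation and the correspondence between weight RAM rows and rotation steps are shown schematically (the disclosed weight arrangements order the weights so that each neuron accumulates its own dot product), and the sketch models behavior only, not the hardware.

#define N 512   /* number of NPUs; also the layer width in this sketch */

void nnu_layer(const float data_row[N], const float weight_rows[N][N],
               float accumulator[N])
{
    float mux_reg[N];
    for (int i = 0; i < N; i++) {            /* load one data RAM row */
        mux_reg[i] = data_row[i];
        accumulator[i] = 0.0f;
    }
    for (int clk = 0; clk < N; clk++) {      /* one weight RAM row per clock */
        for (int i = 0; i < N; i++)          /* all N NPUs operate in parallel */
            accumulator[i] += mux_reg[i] * weight_rows[clk][i];
        float carry = mux_reg[0];            /* rotate the data words by one */
        for (int i = 0; i < N - 1; i++)
            mux_reg[i] = mux_reg[i + 1];
        mux_reg[N - 1] = carry;
    }
}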

The foregoing are merely preferred embodiments of the invention and should not be taken to limit the scope of implementation of the invention; simple equivalent changes and modifications made according to the claims and the description of the invention remain within the scope covered by the patent. For example, software can perform the functions, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described herein. This can be accomplished through general programming languages (e.g., C, C++), hardware description languages (HDL) including Verilog HDL, VHDL and so forth, or other available programs. Such software can be disposed in any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk, optical disc (e.g., CD-ROM, DVD-ROM and the like), network, wire line, wireless or other communications medium. Embodiments of the apparatus and methods described herein may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed to hardware through the fabrication of integrated circuits. Additionally, the apparatus and methods described herein may be embodied as a combination of hardware and software. Thus, any of the embodiments described herein are not intended to limit the scope of the invention. Additionally, the invention may be applied to a microprocessor device of a general-purpose computer. Finally, those of ordinary skill in the art can, on the basis of the concepts and embodiments disclosed herein, design and adjust different structures to accomplish the same purposes without exceeding the scope of the invention.

Claims (21)

1. A neural network unit, comprising:
a first memory that holds elements of a data matrix;
a second memory that holds elements of a convolution kernel; and
an array of neural processing units (NPU) coupled to the first memory and the second memory, wherein each NPU comprises:
a multiplexed register having an output, wherein the multiplexed register receives a corresponding element from a row of the first memory and receives the multiplexed register output of an adjacent NPU;
a register having an output, wherein the register receives a corresponding element from a row of the second memory;
an accumulator having an output; and
an arithmetic unit that receives the outputs of the register, the multiplexed register and the accumulator, and performs a multiply-accumulate operation on them;
wherein, for each sub-matrix of a plurality of sub-matrices of the data matrix, each arithmetic unit selectively receives the element from the first memory or the element from the multiplexed register output of the adjacent NPU, and performs a series of the multiply-accumulate operations to accumulate into the accumulator a result of a convolution of the sub-matrix with the convolution kernel.

2. The neural network unit of claim 1, wherein the neural network unit writes the convolution results of the plurality of sub-matrices to the first and the second memories.

3. The neural network unit of claim 1, wherein the neural network unit is comprised within a processor that includes architectural registers, and the processor executes architectural instructions of an instruction set of the processor to write the data matrix from the architectural registers to the first memory, to write the convolution kernel from the architectural registers to the second memory, and to read the convolution results from the first and second memories into the architectural registers.

4. The neural network unit of claim 1, wherein the NPU array comprises N NPUs, the convolution kernel is a K x K matrix, each sub-matrix is a K x K matrix, the data matrix is a J x N matrix, and each of J rows of the first memory holds the N elements of a different one of the J rows of the data matrix.

5. The neural network unit of claim 4, wherein the plurality of sub-matrices is approximately (J x N)/(K x K) sub-matrices.

6. The neural network unit of claim 4, wherein each of K x K rows of the second memory holds, in row-major order, N instances of a different one of the K x K elements of the convolution kernel.

7. The neural network unit of claim 6, wherein the neural network unit performs the following steps:
(a) the neural network unit initializes a first row address to point to the first row of the data matrix in the first memory;
(b) the neural processing unit clears the accumulator to zero;
(c) the neural network unit initializes a second row address to point to the first of the K x K rows in the second memory;
(d) for each of K iterations:
for one instance:
the multiplexed register receives from the first memory the row of data matrix elements pointed to by the first row address for provision to the arithmetic unit, and the first row address is incremented;
the register receives from the second memory the row of convolution kernel elements pointed to by the second row address for provision to the arithmetic unit, and the second row address is incremented; and
the arithmetic unit performs the multiply-accumulate operation; and
for K-1 instances:
the multiplexed register receives the multiplexed register output of the adjacent NPU for provision to the arithmetic unit;
the register receives from the second memory the row of convolution kernel elements pointed to by the second row address for provision to the arithmetic unit, and the neural network unit increments the second row address; and
the arithmetic unit performs the multiply-accumulate operation.

8. The neural network unit of claim 7, wherein the neural network unit performs the following steps:
(e) the neural network unit writes the convolution result to a row of the first or the second memory; and
(f) the first row address is decremented by K-2.

9. The neural network unit of claim 8, wherein the neural network unit iterates steps (b) through (f) approximately J times.

10. The neural network unit of claim 8, wherein, in step (e), the neural processing unit also performs a division operation on the convolution result, and the results of the division operation, rather than the convolution result, are written to the row of the first or the second memory.

11. A method for operating a neural network unit having an array of neural processing units (NPU), each NPU including a multiplexed register, a register, an accumulator and an arithmetic unit, wherein the multiplexed register has an output and receives a corresponding element from a row of a first memory and receives the multiplexed register output of an adjacent NPU, the register has an output and receives a corresponding element from a row of a second memory, the accumulator has an output, and the arithmetic unit receives the outputs of the register, the multiplexed register and the accumulator and performs a multiply-accumulate operation on them, the method comprising:
holding, in the first memory, elements of a data matrix;
holding, in the second memory, elements of a convolution kernel; and
for each sub-matrix of a plurality of sub-matrices of the data matrix:
selectively receiving, by each arithmetic unit, the element from the first memory or the element from the multiplexed register output of the adjacent NPU; and
performing a series of the multiply-accumulate operations to accumulate into the accumulator a result of a convolution of the sub-matrix with the convolution kernel.

12. The method of claim 11, further comprising:
writing the convolution results of the plurality of sub-matrices to the first and the second memories.

13. The method of claim 11, wherein the neural network unit is comprised within a processor that includes architectural registers, the method further comprising:
executing, by the processor, architectural instructions of an instruction set of the processor to write the data matrix from the architectural registers to the first memory, to write the convolution kernel from the architectural registers to the second memory, and to read the convolution results from the first and second memories into the architectural registers.

14. The method of claim 11, wherein the NPU array comprises N NPUs, the convolution kernel is a K x K matrix, each sub-matrix is a K x K matrix, the data matrix is a J x N matrix, and each of J rows of the first memory holds the N elements of a different one of the J rows of the data matrix.

15. The method of claim 14, wherein the plurality of sub-matrices is approximately (J x N)/(K x K) sub-matrices.

16. The method of claim 14, wherein each of K x K rows of the second memory holds, in row-major order, N instances of a different one of the K x K elements of the convolution kernel.

17. The method of claim 16, further comprising:
(a) initializing a first row address to point to the first row of the data matrix in the first memory;
(b) clearing the accumulator to zero;
(c) initializing a second row address to point to the first of the K x K rows in the second memory;
(d) for each of K iterations:
for one instance:
receiving, by the multiplexed register, from the first memory the row of data matrix elements pointed to by the first row address for provision to the arithmetic unit, and incrementing the first row address;
receiving, by the register, from the second memory the row of convolution kernel elements pointed to by the second row address for provision to the arithmetic unit, and incrementing the second row address; and
performing, by the arithmetic unit, the multiply-accumulate operation; and
for K-1 instances:
receiving, by the multiplexed register, the multiplexed register output of the adjacent NPU for provision to the arithmetic unit;
receiving, by the register, from the second memory the row of convolution kernel elements pointed to by the second row address for provision to the arithmetic unit, and incrementing the second row address; and
performing, by the arithmetic unit, the multiply-accumulate operation.

18. The method of claim 17, further comprising:
(e) writing the convolution result to a row of the first or the second memory; and
(f) decrementing the first row address by K-2.

19. The method of claim 18, further comprising:
iterating steps (b) through (f) J times.

20. The method of claim 18, wherein step (e) further comprises performing a division operation on the convolution result, and writing the results of the division operation, rather than the convolution result, to the row of the first or the second memory.

21. A computer program product encoded in at least one non-transitory computer-usable medium for use with a computing device, the computer program product comprising:
computer-usable program code embodied in said medium, for specifying a neural network unit, the computer-usable program code comprising:
first program code for specifying a first memory that holds elements of a data matrix;
second program code for specifying a second memory that holds elements of a convolution kernel; and
third program code for specifying an array of neural processing units (NPU) coupled to the first memory and the second memory, wherein each NPU comprises:
a multiplexed register having an output, wherein the multiplexed register receives a corresponding element from a row of the first memory and receives the multiplexed register output of an adjacent NPU;
a register having an output, wherein the register receives a corresponding element from a row of the second memory;
an accumulator having an output; and
an arithmetic unit that receives the outputs of the register, the multiplexed register and the accumulator, and performs a multiply-accumulate operation on them;
wherein, for each sub-matrix of a plurality of sub-matrices of the data matrix, each arithmetic unit selectively receives the element from the first memory or the element from the multiplexed register output of the adjacent NPU, and performs a series of the multiply-accumulate operations to accumulate into the accumulator a result of a convolution of the sub-matrix with the convolution kernel.
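The claims above recite the convolution sequence step by step. The following is a minimal software model written to mirror steps (a) through (f) of claims 7, 8, 17 and 18: each of K iterations loads one fresh data row, then rotates it K-1 times through the collective mux-regs, with the kernel memory supplying one row of replicated kernel elements per multiply-accumulate. The names (convolve, data_mem, kernel_mem, row_addr) are illustrative assumptions, not identifiers from the patent, and the model ignores hardware details such as word width, saturation, and the optional division of claims 10 and 20.

```python
def convolve(data_mem, kernel_mem, K):
    """Sketch of the claimed sequence.  data_mem holds J rows of N data elements;
    kernel_mem holds K*K rows, row r holding N copies of kernel element
    (r // K, r % K) in row-major order.  Returns one list of N accumulator
    values per outer pass."""
    J = len(data_mem)
    N = len(data_mem[0])
    results = []
    row_addr = 0                                  # (a) point at the first data row
    for _ in range(J):                            # roughly J passes (claims 9 / 19)
        acc = [0.0] * N                           # (b) clear every accumulator
        k_addr = 0                                # (c) first kernel row
        for _ in range(K):                        # (d) K iterations
            # one instance: load a fresh data row and advance the data row address
            mux_reg = list(data_mem[row_addr % J])
            row_addr += 1
            for i in range(N):
                acc[i] += kernel_mem[k_addr][i] * mux_reg[i]
            k_addr += 1
            # K-1 instances: rotate the row one position instead of reloading it
            for _ in range(K - 1):
                mux_reg = mux_reg[1:] + mux_reg[:1]
                for i in range(N):
                    acc[i] += kernel_mem[k_addr][i] * mux_reg[i]
                k_addr += 1
        results.append(acc)                       # (e) result row written back
        row_addr -= (K - 2)                       # (f) step back as recited in claims 8 / 18
    return results
```

With this rotation and kernel layout, accumulator i at the end of a pass holds the sum over the K x K window whose top-left element is at column i of the first data row read in that pass, which is the per-sub-matrix convolution the claims describe; the address arithmetic in (f) is transcribed directly from the claim text.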
CN201610864610.5A 2015-10-08 2016-09-29 Neural network unit with neural memory and array of neural processing units that collectively shift a row of data received from the neural memory Active CN106503797B (en)

Applications Claiming Priority (48)

Application Number Priority Date Filing Date Title
US201562239254P 2015-10-08 2015-10-08
US62/239,254 2015-10-08
US201562262104P 2015-12-02 2015-12-02
US62/262,104 2015-12-02
US201662299191P 2016-02-24 2016-02-24
US62/299,191 2016-02-24
US15/090,722 US10671564B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs convolutions using collective shift register among array of neural processing units
US15/090,807 US10380481B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs concurrent LSTM cell calculations
US15/090,665 US10474627B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
US15/090,712 US10366050B2 (en) 2015-10-08 2016-04-05 Multi-operation neural network unit
US15/090,701 US10474628B2 (en) 2015-10-08 2016-04-05 Processor with variable rate execution unit
US15/090,712 2016-04-05
US15/090,794 2016-04-05
US15/090,708 US10346350B2 (en) 2015-10-08 2016-04-05 Direct execution by an execution unit of a micro-operation loaded into an architectural register file by an architectural instruction of a processor
US15/090,691 2016-04-05
US15/090,665 2016-04-05
US15/090,798 US10585848B2 (en) 2015-10-08 2016-04-05 Processor with hybrid coprocessor/execution unit neural network unit
US15/090,666 2016-04-05
US15/090,796 US10228911B2 (en) 2015-10-08 2016-04-05 Apparatus employing user-specified binary point fixed point arithmetic
US15/090,796 2016-04-05
US15/090,798 2016-04-05
US15/090,672 US10353860B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural processing units dynamically configurable to process multiple data sizes
US15/090,814 2016-04-05
US15/090,669 US10275394B2 (en) 2015-10-08 2016-04-05 Processor with architectural neural network execution unit
US15/090,823 2016-04-05
US15/090,666 US10275393B2 (en) 2015-10-08 2016-04-05 Tri-configuration neural network unit
US15/090,669 2016-04-05
US15/090,708 2016-04-05
US15/090,801 2016-04-05
US15/090,678 US10509765B2 (en) 2015-10-08 2016-04-05 Neural processing unit that selectively writes back to neural memory either activation function output or accumulator value
US15/090,727 2016-04-05
US15/090,678 2016-04-05
US15/090,727 US10776690B2 (en) 2015-10-08 2016-04-05 Neural network unit with plurality of selectable output functions
US15/090,701 2016-04-05
US15/090,794 US10353862B2 (en) 2015-10-08 2016-04-05 Neural network unit that performs stochastic rounding
US15/090,705 US10353861B2 (en) 2015-10-08 2016-04-05 Mechanism for communication between architectural program running on processor and non-architectural program running on execution unit of the processor regarding shared resource
US15/090,696 US10380064B2 (en) 2015-10-08 2016-04-05 Neural network unit employing user-supplied reciprocal for normalizing an accumulated value
US15/090,829 2016-04-05
US15/090,691 US10387366B2 (en) 2015-10-08 2016-04-05 Neural network unit with shared activation function units
US15/090,814 US10552370B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback for performing recurrent neural network computations
US15/090,672 2016-04-05
US15/090,823 US10409767B2 (en) 2015-10-08 2016-04-05 Neural network unit with neural memory and array of neural processing units and sequencer that collectively shift row of data received from neural memory
US15/090,829 US10346351B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback and masking capability with processing unit groups that operate as recurrent neural network LSTM cells
US15/090,801 US10282348B2 (en) 2015-10-08 2016-04-05 Neural network unit with output buffer feedback and masking capability
US15/090,807 2016-04-05
US15/090,705 2016-04-05
US15/090,722 2016-04-05
US15/090,696 2016-04-05

Publications (2)

Publication Number Publication Date
CN106503797A true CN106503797A (en) 2017-03-15
CN106503797B CN106503797B (en) 2019-03-15

Family

ID=57866772

Family Applications (15)

Application Number Title Priority Date Filing Date
CN201610866451.2A Active CN106447036B (en) 2015-10-08 2016-09-29 Neural network unit that performs random rounding
CN201610864272.5A Active CN106447035B (en) 2015-10-08 2016-09-29 Processor with Variable Rate Execution Unit
CN201610864607.3A Active CN106445468B (en) 2015-10-08 2016-09-29 Direct execution of execution units of micro-operations using processor architectural instructions to load architectural buffer files
CN201610863911.6A Active CN106485321B (en) 2015-10-08 2016-09-29 Processor with Architecture Neural Network Execution Unit
CN201610866027.8A Active CN106485319B (en) 2015-10-08 2016-09-29 Neural network units with neural processing units dynamically configurable to execute on multiple data sizes
CN201610864608.8A Active CN106485323B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback to perform temporal recurrent neural network computations
CN201610863682.8A Active CN106485318B (en) 2015-10-08 2016-09-29 Processor with Hybrid Coprocessor/Execution Unit Neural Network Unit
CN201610864450.4A Active CN106355246B (en) 2015-10-08 2016-09-29 Three Configuration Neural Network Units
CN201610864054.1A Active CN106528047B (en) 2015-10-08 2016-09-29 A processor, a neural network unit and its operation method
CN201610864609.2A Active CN106503796B (en) 2015-10-08 2016-09-29 Multi-Operation Neural Network Unit
CN201610866453.1A Active CN106484362B (en) 2015-10-08 2016-09-29 Device for specifying two-dimensional fixed-point arithmetic operation by user
CN201610864610.5A Active CN106503797B (en) 2015-10-08 2016-09-29 Neural network unit and collective with neural memory will arrange the neural pe array shifted received from the data of neural memory
CN201610866454.6A Active CN106447037B (en) 2015-10-08 2016-09-29 Neural network unit with multiple selectable outputs
CN201610864446.8A Active CN106485322B (en) 2015-10-08 2016-09-29 Neural network units that perform computations on long short-term memory cells simultaneously
CN201610864055.6A Active CN106485315B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback and masking

Family Applications Before (11)

Application Number Title Priority Date Filing Date
CN201610866451.2A Active CN106447036B (en) 2015-10-08 2016-09-29 Neural network unit that performs random rounding
CN201610864272.5A Active CN106447035B (en) 2015-10-08 2016-09-29 Processor with Variable Rate Execution Unit
CN201610864607.3A Active CN106445468B (en) 2015-10-08 2016-09-29 Direct execution of execution units of micro-operations using processor architectural instructions to load architectural buffer files
CN201610863911.6A Active CN106485321B (en) 2015-10-08 2016-09-29 Processor with Architecture Neural Network Execution Unit
CN201610866027.8A Active CN106485319B (en) 2015-10-08 2016-09-29 Neural network units with neural processing units dynamically configurable to execute on multiple data sizes
CN201610864608.8A Active CN106485323B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback to perform temporal recurrent neural network computations
CN201610863682.8A Active CN106485318B (en) 2015-10-08 2016-09-29 Processor with Hybrid Coprocessor/Execution Unit Neural Network Unit
CN201610864450.4A Active CN106355246B (en) 2015-10-08 2016-09-29 Three Configuration Neural Network Units
CN201610864054.1A Active CN106528047B (en) 2015-10-08 2016-09-29 A processor, a neural network unit and its operation method
CN201610864609.2A Active CN106503796B (en) 2015-10-08 2016-09-29 Multi-Operation Neural Network Unit
CN201610866453.1A Active CN106484362B (en) 2015-10-08 2016-09-29 Device for specifying two-dimensional fixed-point arithmetic operation by user

Family Applications After (3)

Application Number Title Priority Date Filing Date
CN201610866454.6A Active CN106447037B (en) 2015-10-08 2016-09-29 Neural network unit with multiple selectable outputs
CN201610864446.8A Active CN106485322B (en) 2015-10-08 2016-09-29 Neural network units that perform computations on long short-term memory cells simultaneously
CN201610864055.6A Active CN106485315B (en) 2015-10-08 2016-09-29 Neural network unit with output buffer feedback and masking

Country Status (1)

Country Link
CN (15) CN106447036B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416434A (en) * 2018-02-07 2018-08-17 复旦大学 Circuit structure for acceleration of convolutional layers and fully connected layers of neural networks
CN110717588A (en) * 2019-10-15 2020-01-21 百度在线网络技术(北京)有限公司 Apparatus and method for convolution operation
CN110998570A (en) * 2017-08-18 2020-04-10 微软技术许可有限责任公司 Hardware node having matrix vector unit with block floating point processing
CN111160542A (en) * 2017-12-14 2020-05-15 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN112966729A (en) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN115600062A (en) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司(Cn) Convolution processing method, circuit, electronic device and computer readable storage medium

Families Citing this family (89)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11226840B2 (en) 2015-10-08 2022-01-18 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US11221872B2 (en) 2015-10-08 2022-01-11 Shanghai Zhaoxin Semiconductor Co., Ltd. Neural network unit that interrupts processing core upon condition
US10409767B2 (en) * 2015-10-08 2019-09-10 Via Alliance Semiconductors Co., Ltd. Neural network unit with neural memory and array of neural processing units and sequencer that collectively shift row of data received from neural memory
JP6556768B2 (en) * 2017-01-25 2019-08-07 株式会社東芝 Multiply-accumulator, network unit and network device
IT201700008949A1 (en) * 2017-01-27 2018-07-27 St Microelectronics Srl OPERATING PROCEDURE FOR NEURAL NETWORKS, NETWORK, EQUIPMENT AND CORRESPONDENT COMPUTER PRODUCT
US11663450B2 (en) * 2017-02-28 2023-05-30 Microsoft Technology Licensing, Llc Neural network processing with chained instructions
US10896367B2 (en) * 2017-03-07 2021-01-19 Google Llc Depth concatenation using a matrix computation unit
CN107633298B (en) * 2017-03-10 2021-02-05 南京风兴科技有限公司 Hardware architecture of recurrent neural network accelerator based on model compression
WO2018174935A1 (en) * 2017-03-20 2018-09-27 Intel Corporation Systems, methods, and apparatus for matrix operations
CN108629405B (en) * 2017-03-22 2020-09-18 杭州海康威视数字技术股份有限公司 Method and device for improving calculation efficiency of convolutional neural network
CN107423816B (en) * 2017-03-24 2021-10-12 中国科学院计算技术研究所 Multi-calculation-precision neural network processing method and system
CN110462637B (en) * 2017-03-24 2022-07-19 华为技术有限公司 Neural network data processing device and method
US11238334B2 (en) 2017-04-04 2022-02-01 Hailo Technologies Ltd. System and method of input alignment for efficient vector operations in an artificial neural network
US10387298B2 (en) 2017-04-04 2019-08-20 Hailo Technologies Ltd Artificial neural network incorporating emphasis and focus techniques
US11544545B2 (en) 2017-04-04 2023-01-03 Hailo Technologies Ltd. Structured activation based sparsity in an artificial neural network
US11551028B2 (en) 2017-04-04 2023-01-10 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network
US11615297B2 (en) 2017-04-04 2023-03-28 Hailo Technologies Ltd. Structured weight based sparsity in an artificial neural network compiler
WO2018184222A1 (en) * 2017-04-07 2018-10-11 Intel Corporation Methods and systems using improved training and learning for deep neural networks
CN108564169B (en) * 2017-04-11 2020-07-14 上海兆芯集成电路有限公司 Hardware processing unit, neural network unit, and computer usable medium
US10795836B2 (en) * 2017-04-17 2020-10-06 Microsoft Technology Licensing, Llc Data processing performance enhancement for neural networks using a virtualized data iterator
CN108734281B (en) * 2017-04-21 2024-08-02 上海寒武纪信息科技有限公司 Processing device, processing method, chip and electronic device
CN107679621B (en) 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
US11531540B2 (en) 2017-04-19 2022-12-20 Cambricon (Xi'an) Semiconductor Co., Ltd. Processing apparatus and processing method with dynamically configurable operation bit width
CN107679620B (en) * 2017-04-19 2020-05-26 赛灵思公司 Artificial neural network processing device
CN107704922B (en) * 2017-04-19 2020-12-08 赛灵思公司 Artificial neural network processing device
CN108733408A (en) * 2017-04-21 2018-11-02 上海寒武纪信息科技有限公司 Counting device and method of counting
CN108734288B (en) * 2017-04-21 2021-01-29 上海寒武纪信息科技有限公司 Operation method and device
EP3699826A1 (en) * 2017-04-20 2020-08-26 Shanghai Cambricon Information Technology Co., Ltd Operation device and related products
US11093822B2 (en) * 2017-04-28 2021-08-17 Intel Corporation Variable precision and mix type representation of multiple layers in a network
CN108805275B (en) * 2017-06-16 2021-01-22 上海兆芯集成电路有限公司 Programmable device, method of operation thereof, and computer usable medium
CN110443360B (en) * 2017-06-16 2021-08-06 上海兆芯集成电路有限公司 method for operating the processor
CN108804139B (en) * 2017-06-16 2020-10-20 上海兆芯集成电路有限公司 Programmable device, method of operation, and computer usable medium
CN107832082B (en) * 2017-07-20 2020-08-04 上海寒武纪信息科技有限公司 Device and method for executing artificial neural network forward operation
CN109902804B (en) * 2017-08-31 2020-12-18 安徽寒武纪信息科技有限公司 A kind of pooling operation method and device
US20190102197A1 (en) * 2017-10-02 2019-04-04 Samsung Electronics Co., Ltd. System and method for merging divide and multiply-subtract operations
US11222256B2 (en) * 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator
GB2568230B (en) * 2017-10-20 2020-06-03 Graphcore Ltd Processing in neural networks
CN109726809B (en) * 2017-10-30 2020-12-08 赛灵思公司 Hardware implementation circuit of deep learning softmax classifier and control method thereof
GB2568081B (en) * 2017-11-03 2022-01-19 Imagination Tech Ltd End-to-end data format selection for hardware implementation of deep neural network
CN109961137B (en) * 2017-12-14 2020-10-09 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN108021537B (en) * 2018-01-05 2022-09-16 南京大学 Softmax function calculation method based on hardware platform
CN108304925B (en) * 2018-01-08 2020-11-03 中国科学院计算技术研究所 A pooled computing device and method
KR102637735B1 (en) 2018-01-09 2024-02-19 삼성전자주식회사 Neural network processing unit including approximate multiplier and system on chip including the same
CN110045960B (en) * 2018-01-16 2022-02-18 腾讯科技(深圳)有限公司 Chip-based instruction set processing method and device and storage medium
CN108416431B (en) * 2018-01-19 2021-06-01 上海兆芯集成电路有限公司 Neural Network Microprocessor and Macro-instruction Processing Method
CN108288091B (en) * 2018-01-19 2020-09-11 上海兆芯集成电路有限公司 Microprocessor for performing Booth multiplication
CN108304265B (en) * 2018-01-23 2022-02-01 腾讯科技(深圳)有限公司 Memory management method, device and storage medium
CN110163362B (en) * 2018-02-13 2020-12-11 上海寒武纪信息科技有限公司 A computing device and method
CN110222833B (en) * 2018-03-01 2023-12-19 华为技术有限公司 Data processing circuit for neural network
CN108171328B (en) * 2018-03-02 2020-12-29 中国科学院计算技术研究所 A neural network processor and a convolution operation method performed by the same
US10621489B2 (en) 2018-03-30 2020-04-14 International Business Machines Corporation Massively parallel neural inference computing elements
CN108510065A (en) * 2018-03-30 2018-09-07 中国科学院计算技术研究所 Computing device and computational methods applied to long Memory Neural Networks in short-term
CN108829610B (en) * 2018-04-02 2020-08-04 浙江大华技术股份有限公司 Memory management method and device in neural network forward computing process
US10971200B2 (en) * 2018-07-17 2021-04-06 Macronix International Co., Ltd. Semiconductor circuit and operating method for the same
TWI724503B (en) * 2018-08-22 2021-04-11 國立清華大學 Neural network method, system, and computer program product with inference-time bitwidth flexibility
US10956814B2 (en) * 2018-08-27 2021-03-23 Silicon Storage Technology, Inc. Configurable analog neural memory system for deep learning neural network
CN110865792B (en) * 2018-08-28 2021-03-19 中科寒武纪科技股份有限公司 Data preprocessing method and device, computer equipment and storage medium
GB2577132B (en) * 2018-09-17 2021-05-26 Apical Ltd Arithmetic logic unit, data processing system, method and module
US11243743B2 (en) * 2018-10-18 2022-02-08 Facebook, Inc. Optimization of neural networks using hardware calculation efficiency and adjustment factors
CN109376853B (en) * 2018-10-26 2021-09-24 电子科技大学 Echo-state neural network output axonal circuits
CN109272109B (en) * 2018-10-30 2020-07-17 北京地平线机器人技术研发有限公司 Instruction scheduling method and device of neural network model
JP6528893B1 (en) * 2018-11-07 2019-06-12 富士通株式会社 Learning program, learning method, information processing apparatus
CN109739556B (en) * 2018-12-13 2021-03-26 北京空间飞行器总体设计部 General deep learning processor based on multi-parallel cache interaction and calculation
CN109670158B (en) * 2018-12-27 2023-09-29 北京及客科技有限公司 Method and device for generating text content according to information data
CN109711367B (en) * 2018-12-29 2020-03-06 中科寒武纪科技股份有限公司 Operation method, device and related product
CN112013506B (en) * 2019-05-31 2022-02-25 青岛海尔空调电子有限公司 Method and device for communication detection, air conditioner
CN110489077B (en) * 2019-07-23 2021-12-31 瑞芯微电子股份有限公司 Floating point multiplication circuit and method of neural network accelerator
US11934824B2 (en) 2019-09-05 2024-03-19 Micron Technology, Inc. Methods for performing processing-in-memory operations, and related memory devices and systems
KR102815725B1 (en) * 2019-10-28 2025-06-04 삼성전자주식회사 Memory device, memory system and autonomous driving apparatus
CN112906876B (en) * 2019-11-19 2025-06-17 阿里巴巴集团控股有限公司 A circuit for implementing an activation function and a processor including the circuit
KR102783993B1 (en) * 2019-12-09 2025-03-21 삼성전자주식회사 Neural network device and operating method for the same
CN111124991A (en) * 2019-12-27 2020-05-08 中国电子科技集团公司第四十七研究所 Reconfigurable microprocessor system and method based on interconnection of processing units
CN113255902A (en) * 2020-02-11 2021-08-13 华为技术有限公司 Neural network circuit, system and method for controlling data flow
US11663455B2 (en) * 2020-02-12 2023-05-30 Ememory Technology Inc. Resistive random-access memory cell and associated cell array structure
CN111666077B (en) * 2020-04-13 2022-02-25 北京百度网讯科技有限公司 Operator processing method and device, electronic equipment and storage medium
WO2021223528A1 (en) * 2020-05-04 2021-11-11 神盾股份有限公司 Processing device and method for executing convolutional neural network operation
CN111610963B (en) * 2020-06-24 2021-08-17 上海西井信息科技有限公司 Chip structure and multiply-add calculation engine thereof
US12248367B2 (en) 2020-09-29 2025-03-11 Hailo Technologies Ltd. Software defined redundant allocation safety mechanism in an artificial neural network processor
US11811421B2 (en) 2020-09-29 2023-11-07 Hailo Technologies Ltd. Weights safety mechanism in an artificial neural network processor
US11221929B1 (en) 2020-09-29 2022-01-11 Hailo Technologies Ltd. Data stream fault detection mechanism in an artificial neural network processor
US11874900B2 (en) 2020-09-29 2024-01-16 Hailo Technologies Ltd. Cluster interlayer safety mechanism in an artificial neural network processor
US11237894B1 (en) 2020-09-29 2022-02-01 Hailo Technologies Ltd. Layer control unit instruction addressing safety mechanism in an artificial neural network processor
US11263077B1 (en) 2020-09-29 2022-03-01 Hailo Technologies Ltd. Neural network intermediate results safety mechanism in an artificial neural network processor
CN114297576A (en) * 2021-11-16 2022-04-08 平头哥(上海)半导体技术有限公司 Weighted average calculation method and weighted average calculation device
CN114492776A (en) * 2022-01-14 2022-05-13 哲库科技(上海)有限公司 Data processing method and device and storage medium
CN114781627B (en) * 2022-03-24 2025-07-11 湖南国科微电子股份有限公司 A neural network computing terminal and a neural network computing method
CN118277133B (en) * 2024-06-03 2024-07-30 浪潮电子信息产业股份有限公司 Model operation optimization methods, products, equipment and media
CN118838207B (en) * 2024-06-25 2025-01-03 无锡天任电子有限公司 Control system of self-service beverage machine based on the Internet of Things
CN118798286B (en) * 2024-09-11 2025-02-11 中电科申泰信息科技有限公司 A convolution hardware acceleration method and hardware acceleration circuit

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716230A (en) * 2004-06-30 2006-01-04 富士通株式会社 Computing device and computing device control method
CN102708665A (en) * 2012-06-04 2012-10-03 深圳市励创微电子有限公司 Broadband code signal detection circuit and wireless remote signal decoding circuit thereof
CN103677739A (en) * 2013-11-28 2014-03-26 中国航天科技集团公司第九研究院第七七一研究所 Configurable multiply accumulation cell and multiply accumulation array consisting of same

Family Cites Families (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4876660A (en) * 1987-03-20 1989-10-24 Bipolar Integrated Technology, Inc. Fixed-point multiplier-accumulator architecture
US5047973A (en) * 1989-04-26 1991-09-10 Texas Instruments Incorporated High speed numerical processor for performing a plurality of numeric functions
GB9206126D0 (en) * 1992-03-20 1992-05-06 Maxys Circuit Technology Limit Parallel vector processor architecture
US5517667A (en) * 1993-06-14 1996-05-14 Motorola, Inc. Neural network that does not require repetitive training
US5583964A (en) * 1994-05-02 1996-12-10 Motorola, Inc. Computer utilizing neural network and method of using same
CN1128375A (en) * 1995-06-27 1996-08-07 电子科技大学 Parallel computer programmable through communication system and its method
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
RU2131145C1 (en) * 1998-06-16 1999-05-27 Закрытое акционерное общество Научно-технический центр "Модуль" Neural processor, device for calculation of saturation functions, calculating unit and adder
GB9902115D0 (en) * 1999-02-01 1999-03-24 Axeon Limited Neural networks
US6651204B1 (en) * 2000-06-01 2003-11-18 Advantest Corp. Modular architecture for memory testing on event based test system
CN1468470A (en) * 2000-08-09 2004-01-14 思凯比特兹公司 Frequency translator using a cordic phase rotator
AU2003238633A1 (en) * 2002-07-10 2004-02-02 Koninklijke Philips Electronics N.V. Electronic circuit with array of programmable logic cells
TWI220042B (en) * 2002-08-22 2004-08-01 Ip First Llc Non-temporal memory reference control mechanism
US7139785B2 (en) * 2003-02-11 2006-11-21 Ip-First, Llc Apparatus and method for reducing sequential bit correlation in a random number generator
GB2402764B (en) * 2003-06-13 2006-02-22 Advanced Risc Mach Ltd Instruction encoding within a data processing apparatus having multiple instruction sets
MY138544A (en) * 2003-06-26 2009-06-30 Neuramatix Sdn Bhd Neural networks with learning and expression capability
AT413895B (en) * 2003-09-08 2006-07-15 On Demand Informationstechnolo DIGITAL SIGNAL PROCESSING DEVICE
US7401179B2 (en) * 2005-01-21 2008-07-15 Infineon Technologies Ag Integrated circuit including a memory having low initial latency
US8699810B2 (en) * 2006-06-26 2014-04-15 Qualcomm Incorporated Efficient fixed-point approximations of forward and inverse discrete cosine transforms
US7543013B2 (en) * 2006-08-18 2009-06-02 Qualcomm Incorporated Multi-stage floating-point accumulator
US9223751B2 (en) * 2006-09-22 2015-12-29 Intel Corporation Performing rounding operations responsive to an instruction
CN101178644B (en) * 2006-11-10 2012-01-25 上海海尔集成电路有限公司 Microprocessor structure based on sophisticated instruction set computer architecture
US20080140753A1 (en) * 2006-12-08 2008-06-12 Vinodh Gopal Multiplier
JP2009042898A (en) * 2007-08-07 2009-02-26 Seiko Epson Corp Parallel computing device and parallel computing method
US20090160863A1 (en) * 2007-12-21 2009-06-25 Michael Frank Unified Processor Architecture For Processing General and Graphics Workload
CN101482924B (en) * 2008-01-08 2012-01-04 华晶科技股份有限公司 Automatic identifying and correcting method for business card display angle
JP4513865B2 (en) * 2008-01-25 2010-07-28 セイコーエプソン株式会社 Parallel computing device and parallel computing method
CN101246200B (en) * 2008-03-10 2010-08-04 湖南大学 A Neural Network-Based Simulated PCB Intelligent Testing System
JP5481793B2 (en) * 2008-03-21 2014-04-23 富士通株式会社 Arithmetic processing device and method of controlling the same
US8635437B2 (en) * 2009-02-12 2014-01-21 Via Technologies, Inc. Pipelined microprocessor with fast conditional branch instructions based on static exception state
US8533437B2 (en) * 2009-06-01 2013-09-10 Via Technologies, Inc. Guaranteed prefetch instruction
CN101944012B (en) * 2009-08-07 2014-04-23 威盛电子股份有限公司 Instruction processing method and its applicable superscalar pipeline microprocessor
US8879632B2 (en) * 2010-02-18 2014-11-04 Qualcomm Incorporated Fixed point implementation for geometric motion partitioning
CN101795344B (en) * 2010-03-02 2013-03-27 北京大学 Digital hologram compression method and system, decoding method and system, and transmission method and system
CN102163139B (en) * 2010-04-27 2014-04-02 威盛电子股份有限公司 Microprocessor fused load arithmetic/logic operations and jump macros
US8726130B2 (en) * 2010-06-01 2014-05-13 Greenliant Llc Dynamic buffer management in a NAND memory controller to minimize age related performance degradation due to error correction
CN101882238B (en) * 2010-07-15 2012-02-22 长安大学 Wavelet Neural Network Processor Based on SOPC
CN101916177B (en) * 2010-07-26 2012-06-27 清华大学 Configurable multi-precision fixed point multiplying and adding device
CN201927073U (en) * 2010-11-25 2011-08-10 福建师范大学 Programmable hardware BP (back propagation) neuron processor
US8880851B2 (en) * 2011-04-07 2014-11-04 Via Technologies, Inc. Microprocessor that performs X86 ISA and arm ISA machine language program instructions by hardware translation into microinstructions executed by common execution pipeline
US9092729B2 (en) * 2011-08-11 2015-07-28 Greenray Industries, Inc. Trim effect compensation using an artificial neural network
DE102011081197A1 (en) * 2011-08-18 2013-02-21 Siemens Aktiengesellschaft Method for the computer-aided modeling of a technical system
KR20130090147A (en) * 2012-02-03 2013-08-13 안병익 Neural network computing apparatus and system, and method thereof
US9082078B2 (en) * 2012-07-27 2015-07-14 The Intellisis Corporation Neural processing engine and architecture using the same
US9477926B2 (en) * 2012-11-20 2016-10-25 Qualcomm Incorporated Piecewise linear neuron modeling
CN103019656B (en) * 2012-12-04 2016-04-27 中国科学院半导体研究所 The multistage parallel single instruction multiple data array processing system of dynamic reconstruct
US20140279772A1 (en) * 2013-03-13 2014-09-18 Baker Hughes Incorporated Neuronal networks for controlling downhole processes
JP6094356B2 (en) * 2013-04-22 2017-03-15 富士通株式会社 Arithmetic processing unit
CN103236997A (en) * 2013-05-03 2013-08-07 福建京奥通信技术有限公司 Long term evolution-interference cancellation system (LTE-ICS) and method
CN104809498B (en) * 2014-01-24 2018-02-13 清华大学 A kind of class brain coprocessor based on Neuromorphic circuit

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1716230A (en) * 2004-06-30 2006-01-04 富士通株式会社 Computing device and computing device control method
CN102708665A (en) * 2012-06-04 2012-10-03 深圳市励创微电子有限公司 Broadband code signal detection circuit and wireless remote signal decoding circuit thereof
CN103677739A (en) * 2013-11-28 2014-03-26 中国航天科技集团公司第九研究院第七七一研究所 Configurable multiply accumulation cell and multiply accumulation array consisting of same

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ROBERT W. MEANS et al.: "EXTENSIBLE LINEAR FLOATING POINT SIMD NEUROCOMPUTER ARRAY PROCESSOR", IJCNN-91-SEATTLE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110998570A (en) * 2017-08-18 2020-04-10 微软技术许可有限责任公司 Hardware node having matrix vector unit with block floating point processing
CN110998570B (en) * 2017-08-18 2023-09-29 微软技术许可有限责任公司 Hardware node with matrix vector unit with block floating point processing
CN111160542A (en) * 2017-12-14 2020-05-15 中科寒武纪科技股份有限公司 Integrated circuit chip device and related product
CN111160542B (en) * 2017-12-14 2023-08-29 中科寒武纪科技股份有限公司 Integrated circuit chip device and related products
CN108416434A (en) * 2018-02-07 2018-08-17 复旦大学 The circuit structure accelerated with full articulamentum for the convolutional layer of neural network
CN108416434B (en) * 2018-02-07 2021-06-04 复旦大学 Circuit structure for acceleration of convolutional layers and fully connected layers of neural networks
CN110717588A (en) * 2019-10-15 2020-01-21 百度在线网络技术(北京)有限公司 Apparatus and method for convolution operation
CN110717588B (en) * 2019-10-15 2022-05-03 阿波罗智能技术(北京)有限公司 Apparatus and method for convolution operation
US11556614B2 (en) 2019-10-15 2023-01-17 Apollo Intelligent Driving Technology (Beijing) Co., Ltd. Apparatus and method for convolution operation
CN112966729A (en) * 2021-02-26 2021-06-15 成都商汤科技有限公司 Data processing method and device, computer equipment and storage medium
CN115600062A (en) * 2022-12-14 2023-01-13 深圳思谋信息科技有限公司(Cn) Convolution processing method, circuit, electronic device and computer readable storage medium
CN115600062B (en) * 2022-12-14 2023-04-07 深圳思谋信息科技有限公司 Convolution processing method, circuit, electronic device and computer readable storage medium

Also Published As

Publication number Publication date
CN106355246B (en) 2019-02-15
CN106447035B (en) 2019-02-26
CN106485318B (en) 2019-08-30
CN106485318A (en) 2017-03-08
CN106484362B (en) 2020-06-12
CN106485315B (en) 2019-06-04
CN106485315A (en) 2017-03-08
CN106528047B (en) 2019-04-09
CN106445468A (en) 2017-02-22
CN106485322A (en) 2017-03-08
CN106447036A (en) 2017-02-22
CN106484362A (en) 2017-03-08
CN106447037B (en) 2019-02-12
CN106447036B (en) 2019-03-15
CN106447035A (en) 2017-02-22
CN106528047A (en) 2017-03-22
CN106503797B (en) 2019-03-15
CN106355246A (en) 2017-01-25
CN106447037A (en) 2017-02-22
CN106485322B (en) 2019-02-26
CN106485323B (en) 2019-02-26
CN106445468B (en) 2019-03-15
CN106485321B (en) 2019-02-12
CN106485321A (en) 2017-03-08
CN106485319B (en) 2019-02-12
CN106503796B (en) 2019-02-12
CN106485319A (en) 2017-03-08
CN106485323A (en) 2017-03-08
CN106503796A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
CN107844830B (en) Neural network unit with data size and weight size hybrid computing capability
CN106485318B (en) Processor with Hybrid Coprocessor/Execution Unit Neural Network Unit
CN106599992A (en) Neural network unit using processing unit group as recursive neural network for short and long term memory cells for operation
CN108133263B (en) Neural network unit
CN108133264B (en) Neural network unit performing efficient 3-dimensional convolution
CN108133262B (en) Neural network unit with memory layout for performing efficient 3D convolution
CN111680789B (en) neural network unit
CN108268945A (en) The neural network unit of circulator with array-width sectional
EP3153997A2 (en) Neural network unit with output buffer feedback and masking capability

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee after: Shanghai Zhaoxin Semiconductor Co.,Ltd.

Address before: Room 301, 2537 Jinke Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201203

Patentee before: VIA ALLIANCE SEMICONDUCTOR Co.,Ltd.