JP2024118195A

JP2024118195A - Neural network circuit and neural network operation method

Info

Publication number: JP2024118195A
Application number: JP2023024486A
Authority: JP
Inventors: 賢治渡邊; Kenji Watanabe
Original assignee: Leap Mind Inc
Current assignee: Leap Mind Inc
Priority date: 2023-02-20
Filing date: 2023-02-20
Publication date: 2024-08-30

Abstract

To provide a high-performance neural network circuit that can be incorporated into embedded devices such as IoT devices, and a neural network operation method. SOLUTION: A neural network circuit includes: a neural network operation core having a convolution operation circuit that performs a convolution operation based on a first instruction command; an instruction providing unit having an instruction memory that stores the first instruction command and providing the first instruction command to the convolution operation circuit; and a DMA controller transferring data between the neural network operation core and an external memory. SELECTED DRAWING: Figure 20

Description

The present invention relates to a neural network circuit and a neural network operation method.

In recent years, convolutional neural networks (CNNs) have been used as models for image recognition and similar tasks. A convolutional neural network has a multi-layer structure with convolution layers and pooling layers, and requires a large number of operations such as convolution operations. Various techniques have been devised to speed up the operations of convolutional neural networks (e.g., Patent Document 1).

Patent Document 1: JP 2018-077829 A

Meanwhile, there is demand for implementing image recognition and the like using convolutional neural networks in embedded devices such as IoT devices. It is difficult to incorporate large-scale dedicated circuits such as those described in Patent Document 1 into embedded devices. In addition, in embedded devices with limited hardware resources such as CPU and memory, it is difficult to achieve sufficient convolutional neural network performance using software alone.

In view of the above, an object of the present invention is to provide a high-performance neural network circuit and neural network operation method that can be incorporated into embedded devices such as IoT devices.

In order to solve the above problems, the present invention proposes the following means.
A neural network circuit according to a first aspect of the present invention comprises: a neural network operation core having a convolution operation circuit that performs a convolution operation based on a first instruction command; an instruction providing unit that has an instruction memory storing the first instruction command and that provides the first instruction command to the convolution operation circuit; and a DMA controller that transfers data between the neural network operation core and an external memory.

A neural network operation method according to a second aspect of the present invention uses a convolution operation circuit that performs a convolution operation based on a first instruction command, and an instruction providing unit having an instruction memory that stores the first instruction command. The method comprises a first step of storing the first instruction command in the instruction memory, and a second step of providing the first instruction command stored in the instruction memory to the convolution operation circuit after the first step is completed.

The neural network circuit and neural network operation method of the present invention can be incorporated into embedded devices such as IoT devices and offer high performance.

FIG. 1 is a diagram showing a convolutional neural network.
FIG. 2 is a diagram explaining the convolution operation performed by a convolution layer.
FIG. 3 is a diagram explaining the expansion of data in the convolution operation.
FIG. 4 is a diagram showing the overall configuration of a neural network circuit according to a first embodiment.
FIG. 5 is a diagram showing the overall configuration of an NN operation core.
FIG. 6 is a timing chart showing an operation example of the NN operation core.
FIG. 7 is a timing chart showing another operation example of the NN operation core.
FIG. 8 is a diagram showing an NN operation multi-core.
FIG. 9 is an internal block diagram of the DMAC of the neural network circuit.
FIG. 10 is a state transition diagram of the control circuit of the DMAC.
FIG. 11 is an internal block diagram of the convolution operation circuit of the neural network circuit.
FIG. 12 is an internal block diagram of the multiplier of the convolution operation circuit.
FIG. 13 is an internal block diagram of a multiply-accumulate unit of the multiplier.
FIG. 14 is an internal block diagram of the accumulator circuit of the convolution operation circuit.
FIG. 15 is an internal block diagram of an accumulator unit of the accumulator circuit.
FIG. 16 is an internal block diagram of the instruction decompressor of the convolution operation circuit.
FIG. 17 is an internal block diagram of the vector operation circuit and the quantization circuit of the quantization operation circuit.
FIG. 18 is a block diagram of an operation unit.
FIG. 19 is an internal block diagram of a vector quantization unit of the quantization circuit.
FIG. 20 is a block diagram of the instruction supply unit of the neural network circuit.
FIG. 21 is a diagram showing a memory map of the instruction memory of the instruction supply unit.
FIG. 22 is a control flowchart of the neural network circuit.

(First Embodiment)
A first embodiment of the present invention will be described with reference to FIGS. 1 to 22.
FIG. 1 is a diagram showing a convolutional neural network 200 (hereinafter referred to as "CNN 200"). The neural network circuit 100 according to the first embodiment (hereinafter referred to as "NN circuit 100") executes at least part of the operations of the trained CNN 200 used at inference time.

[CNN 200]
The CNN 200 is a multi-layer network including a convolution layer 210 that performs a convolution operation, a quantization operation layer 220 that performs a quantization operation, and an output layer 230. In at least part of the CNN 200, the convolution layers 210 and the quantization operation layers 220 are connected alternately. The CNN 200 is a model widely used for image recognition and video recognition. The CNN 200 may further include layers with other functions, such as a fully connected layer.

FIG. 2 is a diagram explaining the convolution operation performed by the convolution layer 210.
The convolution layer 210 performs a convolution operation on input data a using a weight w. The convolution layer 210 performs a multiply-accumulate operation that takes the input data a and the weight w as inputs.

The input data a to the convolution layer 210 (also called activation data or a feature map) is multidimensional data such as image data. In this embodiment, the input data a is a three-dimensional tensor consisting of elements (x, y, c). The convolution layer 210 of the CNN 200 performs the convolution operation on low-bit input data a. In this embodiment, the elements of the input data a are 2-bit unsigned integers (0, 1, 2, 3). The elements of the input data a may instead be, for example, 4-bit or 8-bit unsigned integers.

If the data input to the CNN 200 differs in format from the input data a to the convolution layer 210, for example because it is 32-bit floating-point data, the CNN 200 may further include an input layer that performs type conversion and quantization before the convolution layer 210.

The weight w of the convolution layer 210 (also called a filter or kernel) is multidimensional data whose elements are learnable parameters. In this embodiment, the weight w is a four-dimensional tensor consisting of elements (i, j, c, d). The weight w comprises d three-dimensional tensors (hereinafter referred to as "weights wo"), each consisting of elements (i, j, c). The weight w in the trained CNN 200 is trained data. The convolution layer 210 of the CNN 200 performs the convolution operation using a low-bit weight w. In this embodiment, the elements of the weight w are 1-bit signed integers (0, 1), where the value "0" represents +1 and the value "1" represents -1.

The convolution layer 210 performs the convolution operation shown in Equation 1 and outputs output data f. In Equation 1, s denotes the stride. The region indicated by the dotted line in FIG. 2 is one of the regions ao of the input data a to which the weight wo is applied (hereinafter referred to as the "application region ao"). The elements of the application region ao are represented by (x+i, y+j, c).
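Equation 1 itself is not reproduced in this text. As a minimal sketch, assuming it has the usual form f(x, y, d) = Σ_i Σ_j Σ_c a(s·x+i, s·y+j, c) · w(i, j, c, d) with no padding, the convolution over 2-bit input data and 1-bit encoded weights could be written as follows. The function names `decode_weight` and `conv2d` are hypothetical and do not appear in the source.

```python
def decode_weight(bit):
    """Map the 1-bit weight encoding to its value: 0 -> +1, 1 -> -1."""
    return 1 - 2 * bit


def conv2d(a, w_bits, stride=1):
    """Direct convolution of a[x][y][c] with encoded weights w_bits[i][j][c][d].

    Assumption: no padding, and output index (x, y) reads the application
    region that starts at (stride * x, stride * y).
    """
    X, Y, C = len(a), len(a[0]), len(a[0][0])
    I, J, D = len(w_bits), len(w_bits[0]), len(w_bits[0][0][0])
    out_x = (X - I) // stride + 1
    out_y = (Y - J) // stride + 1
    f = [[[0] * D for _ in range(out_y)] for _ in range(out_x)]
    for x in range(out_x):
        for y in range(out_y):
            for d in range(D):
                acc = 0  # multiply-accumulate over the application region
                for i in range(I):
                    for j in range(J):
                        for c in range(C):
                            weight = decode_weight(w_bits[i][j][c][d])
                            acc += a[stride * x + i][stride * y + j][c] * weight
                f[x][y][d] = acc
    return f
```

For a 2×2 input with one channel and a single 2×2 filter, the result is a single output element: the sum of the four input values with signs given by the weight bits.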

The quantization operation layer 220 performs quantization and related operations on the output of the convolution operation from the convolution layer 210. The quantization operation layer 220 has a pooling layer 221, a batch normalization layer 222, an activation function layer 223, and a quantization layer 224.

The pooling layer 221 compresses the output data f of the convolution operation from the convolution layer 210 by applying an operation such as average pooling (Equation 2) or MAX pooling (Equation 3). In Equations 2 and 3, u denotes the input tensor, v denotes the output tensor, and T denotes the size of the pooling region. In Equation 3, max is a function that outputs the maximum value of u over the combinations of i and j contained in T.
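Equations 2 and 3 are not reproduced in this text. The sketch below illustrates the two pooling operations under the assumption of non-overlapping T×T windows, so that output element (x, y, c) is computed from u(T·x+i, T·y+j, c) for 0 ≤ i, j < T. The function name `pool` and its `mode` parameter are hypothetical.

```python
def pool(u, T, mode="max"):
    """Pool u[x][y][c] over non-overlapping T x T windows (an assumption).

    mode="max" takes the maximum of each window; any other value
    takes the arithmetic mean of the T * T elements.
    """
    X, Y, C = len(u), len(u[0]), len(u[0][0])
    v = []
    for x in range(X // T):
        row = []
        for y in range(Y // T):
            cell = []
            for c in range(C):
                window = [u[T * x + i][T * y + j][c]
                          for i in range(T) for j in range(T)]
                if mode == "max":
                    cell.append(max(window))
                else:
                    cell.append(sum(window) / (T * T))
            row.append(cell)
        v.append(row)
    return v
```

Either mode reduces each spatial dimension by a factor of T, which is the compression of the output data f described above.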

The batch normalization layer 222 normalizes the data distribution of the output data of the quantization operation layer 220 or the pooling layer 221, for example by an operation such as that shown in Equation 4. In Equation 4, u denotes the input tensor, v denotes the output tensor, α denotes the scale, and β denotes the bias. In the trained CNN 200, α and β are trained constant vectors.

The activation function layer 223 applies an activation function such as ReLU (Equation 5) to the output of the quantization operation layer 220, the pooling layer 221, or the batch normalization layer 222. In Equation 5, u is the input tensor and v is the output tensor. In Equation 5, max is a function that outputs the largest of its arguments.

The quantization layer 224 quantizes the output of the pooling layer 221 or the activation function layer 223 based on quantization parameters, for example as shown in Equation 6. The quantization shown in Equation 6 reduces the input tensor u to 2 bits. In Equation 6, q(c) is a vector of quantization parameters. In the trained CNN 200, q(c) is a trained constant vector. The inequality sign "≦" in Equation 6 may be replaced with "<".
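Equations 4 to 6 are not reproduced in this text, so the following is only a sketch of the element-wise steps under stated assumptions: Equation 4 is taken to be the per-channel affine transform v = α(c)·u + β(c), Equation 5 to be v = max(0, u), and Equation 6 to compare u against three ascending thresholds held in q(c), producing a 2-bit value from 0 to 3. The three-threshold layout and all function names are assumptions, not details confirmed by the source.

```python
def batch_norm(u, alpha, beta):
    """Assumed form of Equation 4: scale by alpha, then add bias beta."""
    return u * alpha + beta


def relu(u):
    """Equation 5: output the larger of 0 and u."""
    return max(0, u)


def quantize_2bit(u, thresholds):
    """Assumed form of Equation 6: bucket u by three ascending thresholds.

    Uses "<=" comparisons; the text notes that "<" is also permissible.
    """
    th0, th1, th2 = thresholds
    if u <= th0:
        return 0
    if u <= th1:
        return 1
    if u <= th2:
        return 2
    return 3
```

With α = 0.5, β = -1.0 and thresholds (1.0, 3.0, 6.0), an input of 10 becomes 4.0 after normalization and ReLU and is quantized to 2, while an input of -4 is clipped to 0 by ReLU and quantized to 0.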

The output layer 230 outputs the result of the CNN 200 using an identity function, a softmax function, or the like. The layer preceding the output layer 230 may be either a convolution layer 210 or a quantization operation layer 220.

In the CNN 200, the quantized output data of the quantization layer 224 is input to the convolution layer 210, so the load of the convolution operation in the convolution layer 210 is smaller than in convolutional neural networks that do not perform quantization.

[Division of the convolution operation]
The NN circuit 100 divides the input data of the convolution operation (Equation 1) of the convolution layer 210 into partial tensors and operates on them. The method of division into partial tensors and the number of divisions are not particularly limited. A partial tensor is formed, for example, by dividing the input data a(x+i, y+j, c) into a(x+i, y+j, co). The NN circuit 100 can also perform the convolution operation (Equation 1) of the convolution layer 210 without dividing the input data.

When the input data of the convolution operation is divided, the variable c in Equation 1 is divided into blocks of size Bc, as shown in Equation 7, and the variable d in Equation 1 is divided into blocks of size Bd, as shown in Equation 8. In Equation 7, co is an offset and ci is an index from 0 to (Bc-1). In Equation 8, do is an offset and di is an index from 0 to (Bd-1). The sizes Bc and Bd may be the same.

The input data a(x+i, y+j, c) in Equation 1 is divided along the c axis by the size Bc and is represented as the divided input data a(x+i, y+j, co). In the following description, the divided input data a is also referred to as "divided input data a".

The weight w(i, j, c, d) in Equation 1 is divided along the c axis by the size Bc and along the d axis by the size Bd, and is represented as the divided weight w(i, j, co, do). In the following description, the divided weight w is also referred to as "divided weight w".

The output data f(x, y, do) divided by the size Bd is obtained by Equation 9. The final output data f(x, y, d) can be calculated by combining the divided output data f(x, y, do).

[Expansion of data for the convolution operation]
The NN circuit 100 performs the convolution operation by expanding the input data a and the weight w used in the convolution operation of the convolution layer 210.

FIG. 3 is a diagram explaining the expansion of data in the convolution operation.
The divided input data a(x+i, y+j, co) is expanded into vector data having Bc elements. The elements of the divided input data a are indexed by ci (0≦ci<Bc). In the following description, the divided input data a expanded into vector data for each i and j is also referred to as "input vector A". The input vector A has as its elements the divided input data a(x+i, y+j, co×Bc) through a(x+i, y+j, co×Bc+(Bc-1)).

The divided weight w(i, j, co, do) is expanded into matrix data having Bc×Bd elements. The elements of the divided weight w expanded into matrix data are indexed by ci and di (0≦di<Bd). In the following description, the divided weight w expanded into matrix data for each i and j is also referred to as "weight matrix W". The weight matrix W has as its elements the divided weights w(i, j, co×Bc, do×Bd) through w(i, j, co×Bc+(Bc-1), do×Bd+(Bd-1)).

Multiplying the input vector A by the weight matrix W yields vector data. The output data f(x, y, do) is obtained by shaping the vector data calculated for each i, j, and co into a three-dimensional tensor. Expanding the data in this way allows the convolution operation of the convolution layer 210 to be carried out as multiplications of vector data by matrix data.
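The division and expansion described above can be illustrated for a single output position (x, y). The sketch below assumes the index relations c = co×Bc + ci and d = do×Bd + di implied by the element ranges of the input vector A and the weight matrix W, assumes that the channel counts are multiples of Bc and Bd, and uses already-decoded weight values (+1 or -1). The function names are hypothetical.

```python
def vec_mat(A, W):
    """Multiply a Bc-element vector by a Bc x Bd matrix, giving Bd elements."""
    return [sum(A[ci] * W[ci][di] for ci in range(len(A)))
            for di in range(len(W[0]))]


def conv_point_blocked(patch, w, Bc, Bd):
    """Compute f(d) at one (x, y) from patch[i][j][c] and w[i][j][c][d].

    For each (i, j, co, do), an input vector A and a weight matrix W are
    sliced out, multiplied, and accumulated into output block do.
    """
    I, J = len(w), len(w[0])
    C, D = len(w[0][0]), len(w[0][0][0])
    f = [0] * D
    for do in range(D // Bd):
        for co in range(C // Bc):
            for i in range(I):
                for j in range(J):
                    A = patch[i][j][co * Bc:(co + 1) * Bc]
                    W = [row[do * Bd:(do + 1) * Bd]
                         for row in w[i][j][co * Bc:(co + 1) * Bc]]
                    partial = vec_mat(A, W)
                    for di in range(Bd):
                        f[do * Bd + di] += partial[di]
    return f


def conv_point_direct(patch, w):
    """Reference result without division, summing over all i, j, c."""
    I, J = len(w), len(w[0])
    C, D = len(w[0][0]), len(w[0][0][0])
    return [sum(patch[i][j][c] * w[i][j][c][d]
                for i in range(I) for j in range(J) for c in range(C))
            for d in range(D)]
```

Both functions return the same values for any block sizes that divide the channel counts, which reflects the statement that the divided outputs can be combined into the final output data f(x, y, d).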

[NN circuit 100]
FIG. 4 is a diagram showing the overall configuration of the NN circuit 100 according to this embodiment.
The NN circuit 100 comprises a DMA controller 3 (hereinafter also referred to as "DMAC 3"), a controller 6, an instruction supply unit 7, and at least one neural network operation core 10 (hereinafter also referred to as "NN operation core 10").

The NN circuit 100 can be equipped with multiple NN operation cores 10. The NN circuit 100 illustrated in FIG. 4 can be equipped with up to four NN operation cores 10. The multiple NN operation cores 10 constitute a "neural network operation multi-core 10M" (hereinafter also referred to as "NN operation multi-core 10M") that cooperatively executes at least part of the operations of the CNN 200. In this embodiment, the multiple NN operation cores 10 are connected in a daisy chain. The number of NN operation cores 10 that can be mounted in the NN circuit 100 may be five or more.

The DMAC 3 is connected to the external bus EB and transfers data between an external memory 120, such as a DRAM, and the instruction supply unit 7. The DMAC 3 also transfers data between the external memory 120 and the NN operation cores 10. The DMAC 3 transfers data read from the external memory 120 to any one of the multiple NN operation cores 10. The DMAC 3 may also be capable of transferring the same data read from the external memory 120 to multiple NN operation cores 10, or of broadcasting it.

The controller 6 is connected to the external bus EB and operates as a slave of the external host CPU 110. The controller 6 has a bus bridge 60 and a register 61.

The bus bridge 60 relays bus accesses from the external bus EB to the internal bus IB. The bus bridge 60 also relays write requests and read requests from the external host CPU 110 to the register 61.

The register 61 has parameter registers and status registers. The parameter registers control the operation of the NN circuit 100. The status registers indicate the state of the NN circuit 100 and include, for each module, a pointer to its instruction sequence, the number of instructions, and the like. The status registers may also include a semaphore S. The external host CPU 110 can access the register 61 via the bus bridge 60 of the controller 6.

The controller 6 is connected to each block of the NN circuit 100 (the DMAC 3, the instruction supply unit 7, and the NN operation cores 10) via the internal bus IB. The external host CPU 110 can access each block of the NN circuit 100 via the controller 6. For example, the external host CPU 110 can issue instructions to the NN operation cores 10 via the controller 6. Each block can also update the status registers of the controller 6 (which may include the semaphore S) via the internal bus IB. The status registers may instead be configured to be updated via dedicated wiring connected to each block.

The instruction supply unit (ISU) 7 stores instruction commands for the operation blocks (the DMAC 3, and the convolution operation circuit 4 and quantization operation circuit 5 of each NN operation core 10). The instruction supply unit 7 also transfers each stored instruction command to the corresponding operation block of the NN circuit 100.

[NN circuit 100: operation modes]
The NN circuit 100 has two operation modes: a preload mode and a normal mode. In this embodiment, the NN circuit 100 operates by switching exclusively between the preload mode and the normal mode. The NN circuit 100 may have other operation modes. The operation mode of the NN circuit 100 may be switchable by rewriting an "operation mode setting register" included in the register 61.

When the operation mode is the preload mode (first mode), the DMAC 3 transfers the instruction commands for the operation blocks (the DMAC 3, and the convolution operation circuit 4 and quantization operation circuit 5 of each NN operation core 10) stored in the external memory 120 to the instruction supply unit 7. At this time, the DMAC 3 may also transfer the weight w stored in the external memory 120 to a weight memory 41, and may transfer the quantization parameter q stored in the external memory 120 to a quantization parameter memory 51.

When the operation mode is the normal mode (second mode), the instruction supply unit 7 supplies the stored instruction commands to the corresponding operation blocks (the DMAC 3, and the convolution operation circuit 4 and quantization operation circuit 5 of each NN operation core 10). Each operation block executes the instruction commands supplied to it.
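The two exclusive modes can be modeled in software as a small state machine. The following is a toy sketch, not the circuit's actual interface: the class name, method names, and the dictionary standing in for the external memory 120 are all hypothetical, and the mode attribute stands in for the operation mode setting register.

```python
PRELOAD, NORMAL = "preload", "normal"


class InstructionSupplyUnit:
    """Toy model of the instruction supply unit 7 with two exclusive modes."""

    def __init__(self):
        self.mode = PRELOAD
        self.instruction_memory = {}  # operation block name -> list of commands

    def preload(self, external_memory):
        """Stand-in for the DMAC 3 transfer from the external memory 120."""
        if self.mode != PRELOAD:
            raise RuntimeError("commands can be stored only in preload mode")
        for block, commands in external_memory.items():
            self.instruction_memory.setdefault(block, []).extend(commands)

    def supply(self, block):
        """Return the stored commands for one operation block."""
        if self.mode != NORMAL:
            raise RuntimeError("commands are supplied only in normal mode")
        return list(self.instruction_memory.get(block, []))
```

In this model, all commands are stored first and the mode is then switched to `NORMAL` before any command is supplied, mirroring the ordering of the first and second steps of the operation method.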

[NN operation core 10]
FIG. 5 is a diagram showing the overall configuration of the NN operation core 10.
The NN operation core 10 comprises a first memory 1, a second memory 2, a convolution operation circuit 4, and a quantization operation circuit 5. The NN operation core 10 is characterized in that the convolution operation circuit 4 and the quantization operation circuit 5 form a loop via the first memory 1 and the second memory 2.

The first memory 1 is a rewritable memory, such as a volatile memory composed of SRAM (Static RAM). Data is written to and read from the first memory 1 via the DMAC 3 or the internal bus IB. The external host CPU 110 can input data to and output data from the NN operation core 10 by writing to and reading from the first memory 1.

The first memory 1 is connected to the input port of the convolution operation circuit 4, and the convolution operation circuit 4 can read data from the first memory 1. The first memory 1 is also loop-connected (C1) to the output port of the quantization operation circuit 5, and the quantization operation circuit 5 can write data to the first memory 1. In addition, the first memory 1 can exchange data with another NN operation core 10 over an inter-core connection (C2), and the other NN operation core 10 connected by the inter-core connection (C2) can write data to the first memory 1. In this embodiment, a daisy chain connection is used as an example of the inter-core connection (C2).

The second memory 2 is a rewritable memory, such as a volatile memory composed of SRAM (Static RAM). Data is written to and read from the second memory 2 via the DMAC 3 or the internal bus IB. The external host CPU 110 can input data to and output data from the NN operation core 10 by writing to and reading from the second memory 2.

The second memory 2 is connected to the input port of the quantization operation circuit 5, and the quantization operation circuit 5 can read data from the second memory 2. The second memory 2 is also connected to the output port of the convolution operation circuit 4, and the convolution operation circuit 4 can write data to the second memory 2.

The convolution operation circuit 4 performs the convolution operation of the convolution layer 210 of the trained CNN 200. The convolution operation circuit 4 reads the input data a stored in the first memory 1 and performs the convolution operation on it. The convolution operation circuit 4 writes the output data f of the convolution operation (hereinafter also referred to as "convolution operation output data") to the second memory 2.

The quantization operation circuit 5 performs at least part of the quantization operation of the quantization operation layer 220 of the trained CNN 200. The quantization operation circuit 5 reads the output data f of the convolution operation stored in the second memory 2 and performs the quantization operation on it (an operation that includes at least quantization from among pooling, batch normalization, activation function, and quantization).

The quantization operation circuit 5 writes the output data of the quantization operation (hereinafter also referred to as "quantization operation output data") to the loop-connected (C1) first memory 1. The quantization operation circuit 5 can also transfer data to another NN operation core 10 via the inter-core connection (C2), and can output the quantization operation output data to that NN operation core 10.

ＮＮ演算コア１０は、第一メモリ１や第二メモリ２等を有するため、ＤＲＡＭなどの外部メモリからのＤＭＡＣ３によるデータ転送において、重複するデータのデータ転送の回数を低減できる。これにより、メモリアクセスにより発生する消費電力または処理負荷を大幅に低減することができる。 The NN calculation core 10 has a first memory 1, a second memory 2, etc., so that the number of times that duplicate data is transferred can be reduced when data is transferred by the DMAC 3 from an external memory such as a DRAM. This makes it possible to significantly reduce the power consumption or processing load caused by memory access.

［ＮＮ演算コア１０の動作例１］
図６は、ＮＮ演算コア１０の動作例を示すタイミングチャートである。
ＤＭＡＣ３は、レイヤ１の入力データａを第一メモリ１に格納する。ＤＭＡＣ３は、畳み込み演算回路４が行う畳み込み演算の順序にあわせて、レイヤ１の入力データａを分割して第一メモリ１に転送してもよい。 [Operation Example 1 of the NN calculation core 10]
FIG. 6 is a timing chart showing an example of the operation of the NN calculation core 10.
The DMAC 3 stores the input data a of layer 1 in the first memory 1. The DMAC 3 may divide the input data a of layer 1 and transfer it to the first memory 1 in the order in which the convolution operation circuit 4 performs the convolution operation.

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ１の入力データａを読み出す。畳み込み演算回路４は、レイヤ１の入力データａに対して図１に示すレイヤ１の畳み込み演算を行う。レイヤ１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads out the input data a of layer 1 stored in the first memory 1. The convolution operation circuit 4 performs the convolution operation of layer 1 shown in FIG. 1 on the input data a of layer 1. The output data f of the convolution operation of layer 1 is stored in the second memory 2.

量子化演算回路５は、第二メモリ２に格納されたレイヤ１の出力データｆを読み出す。量子化演算回路５は、レイヤ１の出力データｆに対してレイヤ２の量子化演算を行う。レイヤ２の量子化演算の出力データは、第一メモリ１に格納される。 The quantization calculation circuit 5 reads the output data f of layer 1 stored in the second memory 2. The quantization calculation circuit 5 performs a quantization calculation of layer 2 on the output data f of layer 1. The output data of the quantization calculation of layer 2 is stored in the first memory 1.

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２の量子化演算の出力データを読み出す。畳み込み演算回路４は、レイヤ２の量子化演算の出力データを入力データａとしてレイヤ３の畳み込み演算を行う。レイヤ３の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data of the quantization operation of layer 2 stored in the first memory 1. The convolution operation circuit 4 performs the convolution operation of layer 3 using the output data of the quantization operation of layer 2 as input data a. The output data f of the convolution operation of layer 3 is stored in the second memory 2.

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍ－２（Ｍは自然数）の量子化演算の出力データを読み出す。畳み込み演算回路４は、レイヤ２Ｍ－２の量子化演算の出力データを入力データａとしてレイヤ２Ｍ－１の畳み込み演算を行う。レイヤ２Ｍ－１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M-2 (M is a natural number) stored in the first memory 1. The convolution operation circuit 4 performs the convolution operation of layer 2M-1 using the output data of the quantization operation of layer 2M-2 as input data a. The output data f of the convolution operation of layer 2M-1 is stored in the second memory 2.

量子化演算回路５は、第二メモリ２に格納されたレイヤ２Ｍ－１の出力データｆを読み出す。量子化演算回路５は、２Ｍ－１レイヤの出力データｆに対してレイヤ２Ｍの量子化演算を行う。レイヤ２Ｍの量子化演算の出力データは、第一メモリ１に格納される。 The quantization calculation circuit 5 reads the output data f of layer 2M-1 stored in the second memory 2. The quantization calculation circuit 5 performs the quantization calculation of layer 2M on the output data f of the 2M-1 layer. The output data of the quantization calculation of layer 2M is stored in the first memory 1.

畳み込み演算回路４は、第一メモリ１に格納されたレイヤ２Ｍの量子化演算の出力データを読み出す。畳み込み演算回路４は、レイヤ２Ｍの量子化演算の出力データを入力データａとしてレイヤ２Ｍ＋１の畳み込み演算を行う。レイヤ２Ｍ＋１の畳み込み演算の出力データｆは、第二メモリ２に格納される。 The convolution operation circuit 4 reads the output data of the quantization operation of layer 2M stored in the first memory 1. The convolution operation circuit 4 performs a convolution operation of layer 2M+1 using the output data of the quantization operation of layer 2M as input data a. The output data f of the convolution operation of layer 2M+1 is stored in the second memory 2.

畳み込み演算回路４と量子化演算回路５とが交互に演算を行い、図１に示すＣＮＮ２００の演算を進めていく。ＮＮ演算コア１０は、畳み込み演算回路４が時分割によりレイヤ２Ｍ－１とレイヤ２Ｍ＋１の畳み込み演算を実施する。また、ＮＮ演算コア１０は、量子化演算回路５が時分割によりレイヤ２Ｍ－２とレイヤ２Ｍの量子化演算を実施する。そのため、ＮＮ演算コア１０は、レイヤごとに別々の畳み込み演算回路４と量子化演算回路５を実装する場合と比較して、回路規模が著しく小さい。 The convolution calculation circuit 4 and the quantization calculation circuit 5 alternately perform calculations to advance the calculations of the CNN 200 shown in FIG. 1. In the NN calculation core 10, the convolution calculation circuit 4 performs the convolution calculations of layers 2M-1 and 2M+1 by time division. In the NN calculation core 10, the quantization calculation circuit 5 performs the quantization calculations of layers 2M-2 and 2M by time division. Therefore, the circuit scale of the NN calculation core 10 is significantly smaller than when a separate convolution calculation circuit 4 and quantization calculation circuit 5 are implemented for each layer.

ＮＮ演算コア１０は、複数のレイヤの多層構造であるＣＮＮ２００の演算を、ループ状に形成された回路により演算する。ＮＮ演算コア１０は、ループ状の回路構成により、ハードウェア資源を効率的に利用できる。なお、ＮＮ演算コア１０は、ループ状に回路を形成するために、各レイヤで変化する畳み込み演算回路４や量子化演算回路５におけるパラメータは適宜更新される。 The NN calculation core 10 performs calculations for the CNN 200, which is a multi-layer structure of multiple layers, using a circuit formed in a loop. The NN calculation core 10 can efficiently use hardware resources due to the loop circuit configuration. Note that, since the NN calculation core 10 forms a circuit in a loop, the parameters in the convolution calculation circuit 4 and the quantization calculation circuit 5, which change in each layer, are updated appropriately.
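
The alternating use of the two memories in Operation Example 1 can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the circuit itself: the function name `run_loop`, the `layers` list, and the stand-in callables for the convolution and quantization operations are hypothetical names introduced only for this example.

```python
def run_loop(input_a, layers):
    """Alternate convolution and quantization over two buffers.

    `layers` is a list of (kind, fn) pairs, where kind is "conv" or "quant".
    Odd-numbered layers are convolutions and even-numbered layers are
    quantizations, as in the timing chart. All names are hypothetical.
    """
    first_memory = input_a   # holds input data a
    second_memory = None     # holds convolution operation output data f
    trace = []
    for index, (kind, fn) in enumerate(layers, start=1):
        if kind == "conv":
            # Convolution circuit: reads the first memory, writes the second.
            second_memory = fn(first_memory)
            trace.append((index, "conv", "first->second"))
        else:
            # Quantization circuit: reads the second memory and writes the
            # first memory over the loop connection.
            first_memory = fn(second_memory)
            trace.append((index, "quant", "second->first"))
    return first_memory, second_memory, trace
```

Because the same two callables' slots are reused for every layer, only the per-layer parameters change between iterations, which mirrors the statement that the parameters of the two circuits are updated for each layer.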

ＣＮＮ２００の演算にＮＮ演算コア１０により実施できない演算が含まれる場合、ＮＮ演算コア１０は外部ホストＣＰＵ１１０などの外部演算デバイスに中間データを転送する。外部演算デバイスが中間データに対して演算を行った後、外部演算デバイスによる演算結果は第一メモリ１や第二メモリ２に入力される。ＮＮ演算コア１０は、外部演算デバイスによる演算結果に対する演算を再開する。 If the calculations of the CNN 200 include a calculation that the NN calculation core 10 cannot perform, the NN calculation core 10 transfers intermediate data to an external calculation device such as the external host CPU 110. After the external calculation device performs the calculation on the intermediate data, the result of that calculation is input to the first memory 1 or the second memory 2. The NN calculation core 10 then resumes its calculation on the result produced by the external calculation device.

［ＮＮ演算コア１０の動作例２］
図７は、ＮＮ演算コア１０の他の動作例を示すタイミングチャートである。
ＮＮ演算コア１０は、入力データａを部分テンソルに分割して、時分割により部分テンソルに対する演算を行ってもよい。部分テンソルへの分割方法や分割数は特に限定されない。 [Operation Example 2 of the NN calculation core 10]
FIG. 7 is a timing chart showing another example of the operation of the NN calculation core 10.
The NN calculation core 10 may divide the input data a into partial tensors and perform operations on the partial tensors by time division. Neither the method of division into partial tensors nor the number of divisions is particularly limited.

図７は、入力データａを二つの部分テンソルに分解した場合の動作例を示している。分解された部分テンソルを、「第一部分テンソルａ₁」、「第二部分テンソルａ₂」とする。例えば、レイヤ２Ｍ－１の畳み込み演算は、第一部分テンソルａ₁に対応する畳み込み演算（図７において、「レイヤ２Ｍ－１（ａ₁）」と表記）と、第二部分テンソルａ₂に対応する畳み込み演算（図７において、「レイヤ２Ｍ－１（ａ₂）」と表記）と、に分解される。 FIG. 7 shows an example of the operation when the input data a is decomposed into two partial tensors. The decomposed partial tensors are called the "first partial tensor a₁" and the "second partial tensor a₂". For example, the convolution operation of layer 2M-1 is decomposed into a convolution operation corresponding to the first partial tensor a₁ (denoted "layer 2M-1 (a₁)" in FIG. 7) and a convolution operation corresponding to the second partial tensor a₂ (denoted "layer 2M-1 (a₂)" in FIG. 7).

第一部分テンソルａ₁に対応する畳み込み演算および量子化演算と、第二部分テンソルａ₂に対応する畳み込み演算および量子化演算とは、図７に示すように、独立して実施することができる。 The convolution and quantization operations corresponding to the first partial tensor a₁ and the convolution and quantization operations corresponding to the second partial tensor a₂ can be performed independently, as shown in FIG. 7.

畳み込み演算回路４は、第一部分テンソルａ₁に対応するレイヤ２Ｍ－１の畳み込み演算（図７において、レイヤ２Ｍ－１（ａ₁）で示す演算）を行う。その後、畳み込み演算回路４は、第二部分テンソルａ₂に対応するレイヤ２Ｍ－１の畳み込み演算（図７において、レイヤ２Ｍ－１（ａ₂）で示す演算）を行う。また、量子化演算回路５は、第一部分テンソルａ₁に対応するレイヤ２Ｍの量子化演算（図７において、レイヤ２Ｍ（ａ₁）で示す演算）を行う。このように、ＮＮ演算コア１０は、第二部分テンソルａ₂に対応するレイヤ２Ｍ－１の畳み込み演算と、第一部分テンソルａ₁に対応するレイヤ２Ｍの量子化演算と、を並列に実施できる。 The convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the first partial tensor a₁ (the operation shown as layer 2M-1 (a₁) in FIG. 7). After that, the convolution operation circuit 4 performs the convolution operation of layer 2M-1 corresponding to the second partial tensor a₂ (the operation shown as layer 2M-1 (a₂) in FIG. 7). In addition, the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the first partial tensor a₁ (the operation shown as layer 2M (a₁) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a₂ and the quantization operation of layer 2M corresponding to the first partial tensor a₁ in parallel.

次に、畳み込み演算回路４は、第一部分テンソルａ₁に対応するレイヤ２Ｍ＋１の畳み込み演算（図７において、レイヤ２Ｍ＋１（ａ₁）で示す演算）を行う。また、量子化演算回路５は、第二部分テンソルａ₂に対応するレイヤ２Ｍの量子化演算（図７において、レイヤ２Ｍ（ａ₂）で示す演算）を行う。このように、ＮＮ演算コア１０は、第一部分テンソルａ₁に対応するレイヤ２Ｍ＋１の畳み込み演算と、第二部分テンソルａ₂に対応するレイヤ２Ｍの量子化演算と、を並列に実施できる。 Next, the convolution operation circuit 4 performs the convolution operation of layer 2M+1 corresponding to the first partial tensor a₁ (the operation shown as layer 2M+1 (a₁) in FIG. 7). In addition, the quantization operation circuit 5 performs the quantization operation of layer 2M corresponding to the second partial tensor a₂ (the operation shown as layer 2M (a₂) in FIG. 7). In this way, the NN calculation core 10 can perform the convolution operation of layer 2M+1 corresponding to the first partial tensor a₁ and the quantization operation of layer 2M corresponding to the second partial tensor a₂ in parallel.

第一部分テンソルａ₁に対応する畳み込み演算および量子化演算と、第二部分テンソルａ₂に対応する畳み込み演算および量子化演算とは、独立して実施することができる。そのため、ＮＮ演算コア１０は、例えば、第一部分テンソルａ₁に対応するレイヤ２Ｍ－１の畳み込み演算と、第二部分テンソルａ₂に対応するレイヤ２Ｍ＋２の量子化演算と、を並列に実施してもよい。すなわち、ＮＮ演算コア１０が並列で演算する畳み込み演算と量子化演算は、連続するレイヤの演算に限定されない。 The convolution and quantization operations corresponding to the first partial tensor a₁ and the convolution and quantization operations corresponding to the second partial tensor a₂ can be performed independently. Therefore, the NN calculation core 10 may, for example, perform the convolution operation of layer 2M-1 corresponding to the first partial tensor a₁ and the quantization operation of layer 2M+2 corresponding to the second partial tensor a₂ in parallel. In other words, the convolution operation and the quantization operation that the NN calculation core 10 performs in parallel are not limited to operations of consecutive layers.

入力データａを部分テンソルに分割することで、ＮＮ演算コア１０は畳み込み演算回路４と量子化演算回路５とを並列して動作させることができる。その結果、畳み込み演算回路４と量子化演算回路５が待機する時間が削減され、ＮＮ演算コア１０の演算処理効率が向上する。図７に示す動作例において分割数は２であったが、分割数が２より大きい場合も同様に、ＮＮ演算コア１０は畳み込み演算回路４と量子化演算回路５とを並列して動作させることができる。 By dividing the input data a into partial tensors, the NN processing core 10 can operate the convolution processing circuit 4 and the quantization processing circuit 5 in parallel. As a result, the waiting time of the convolution processing circuit 4 and the quantization processing circuit 5 is reduced, and the processing efficiency of the NN processing core 10 is improved. In the operation example shown in FIG. 7, the number of divisions is 2, but even if the number of divisions is greater than 2, the NN processing core 10 can similarly operate the convolution processing circuit 4 and the quantization processing circuit 5 in parallel.
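
The overlap described for FIG. 7 can be reproduced with a small scheduling sketch. It assumes, purely for illustration, that every operation takes one time slot and that an operation on layer L for a given partial tensor depends only on the operation on layer L-1 for the same partial tensor; the function name `schedule` and the (layer, part) tuples are hypothetical.

```python
def schedule(conv_ops, quant_ops):
    """Assign unit-length time slots to operations on two circuits.

    Each op is a (layer, part) tuple. The convolution circuit and the
    quantization circuit each process their own list in order, and an op
    starts only after the op on the previous layer for the same part has
    finished in an earlier slot. Returns a dict mapping op -> slot.
    """
    queues = {"conv": list(conv_ops), "quant": list(quant_ops)}
    all_ops = set(conv_ops) | set(quant_ops)
    done = {}
    slot = 0
    while any(queues.values()):
        started = []
        for unit, queue in queues.items():
            if not queue:
                continue
            layer, part = queue[0]
            dependency = (layer - 1, part)
            # Ready if there is no such dependency or it already finished.
            if dependency not in all_ops or dependency in done:
                started.append((unit, queue[0]))
        if not started:
            raise ValueError("no operation can start: unsatisfied dependency")
        for unit, op in started:
            queues[unit].pop(0)
            done[op] = slot
        slot += 1
    return done
```

With M = 1 (layers 1, 2, and 3) and two partial tensors, `schedule([(1, 1), (1, 2), (3, 1), (3, 2)], [(2, 1), (2, 2)])` places the six operations in four slots rather than six, because the convolution for a₂ runs alongside the quantization for a₁ and vice versa.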

例えば、入力データａが「第一部分テンソルａ₁」、「第二部分テンソルａ₂」および「第三部分テンソルａ₃」に分割される場合、ＮＮ演算コア１０は、第二部分テンソルａ₂に対応するレイヤ２Ｍ－１の畳み込み演算と、第三部分テンソルａ₃に対応するレイヤ２Ｍの量子化演算と、を並列に実施してもよい。演算の順序は、第一メモリ１および第二メモリ２における入力データａの格納状況によって適宜変更される。 For example, when the input data a is divided into a "first partial tensor a₁", a "second partial tensor a₂", and a "third partial tensor a₃", the NN calculation core 10 may perform the convolution operation of layer 2M-1 corresponding to the second partial tensor a₂ and the quantization operation of layer 2M corresponding to the third partial tensor a₃ in parallel. The order of the operations is changed as appropriate depending on how the input data a is stored in the first memory 1 and the second memory 2.

なお、部分テンソルに対する演算方法としては、同一レイヤにおける部分テンソルの演算を畳み込み演算回路４または量子化演算回路５で行った後に次のレイヤにおける部分テンソルの演算を行う例（方法１）を示した。例えば、図７に示すように、畳み込み演算回路４において、第一部分テンソルａ₁および第二部分テンソルａ₂に対応するレイヤ２Ｍ－１の畳み込み演算（図７において、レイヤ２Ｍ－１（ａ₁）およびレイヤ２Ｍ－１（ａ₂）で示す演算）を行った後に、第一部分テンソルａ₁および第二部分テンソルａ₂に対応するレイヤ２Ｍ＋１の畳み込み演算（図７において、レイヤ２Ｍ＋１（ａ₁）およびレイヤ２Ｍ＋１（ａ₂）で示す演算）を実施している。 The method shown so far for operating on partial tensors (method 1) performs the operations on the partial tensors of one layer in the convolution operation circuit 4 or the quantization operation circuit 5 and then performs the operations on the partial tensors of the next layer. For example, as shown in FIG. 7, the convolution operation circuit 4 performs the convolution operations of layer 2M-1 corresponding to the first partial tensor a₁ and the second partial tensor a₂ (the operations shown as layer 2M-1 (a₁) and layer 2M-1 (a₂) in FIG. 7), and then performs the convolution operations of layer 2M+1 corresponding to the first partial tensor a₁ and the second partial tensor a₂ (the operations shown as layer 2M+1 (a₁) and layer 2M+1 (a₂) in FIG. 7).

しかしながら、部分テンソルに対する演算方法はこれに限られない。部分テンソルに対する演算方法は、複数レイヤにおける一部の部分テンソルの演算をした後に残部の部分テンソルの演算を実施する方法でもよい（方法２）。例えば、畳み込み演算回路４において、第一部分テンソルａ₁に対応するレイヤ２Ｍ－１および第一部分テンソルａ₁に対応するレイヤ２Ｍ＋１の畳み込み演算を行った後に、第二部分テンソルａ₂に対応するレイヤ２Ｍ－１および第二部分テンソルａ₂に対応するレイヤ２Ｍ＋１の畳み込み演算を実施してもよい。 However, the method of operating on the partial tensors is not limited to this. The method may instead perform the operations on some of the partial tensors across multiple layers and then perform the operations on the remaining partial tensors (method 2). For example, the convolution operation circuit 4 may perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the first partial tensor a₁, and then perform the convolution operations of layer 2M-1 and layer 2M+1 corresponding to the second partial tensor a₂.

また、部分テンソルに対する演算方法は、方法１と方法２とを組み合わせて部分テンソルを演算する方法でもよい。ただし、方法２を用いる場合は、部分テンソルの演算順序に関する依存関係に従って演算を実施する必要がある。 The method of computing partial tensors may also be a combination of methods 1 and 2. However, when using method 2, the computation must be performed according to the dependencies regarding the computation order of the partial tensors.
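
The difference between method 1 and method 2 is only the order in which the convolution operation circuit 4 visits the (layer, partial tensor) pairs. The following sketch lists that order for both methods; the function name `conv_order` and its arguments are hypothetical, and the sketch does not model the dependency on the quantization operation of the intermediate layer.

```python
def conv_order(layers, parts, method):
    """Return the order of (layer, part) convolution operations.

    method 1: finish every partial tensor of a layer, then move on to the
              next layer.
    method 2: finish every layer for one partial tensor, then move on to
              the next partial tensor.
    """
    if method == 1:
        return [(layer, part) for layer in layers for part in parts]
    if method == 2:
        return [(layer, part) for part in parts for layer in layers]
    raise ValueError("method must be 1 or 2")
```

Under method 2, the convolution of layer 2M+1 for a₁ directly follows the convolution of layer 2M-1 for a₁ in this list, so the quantization of layer 2M for a₁ has to complete in between. This is the ordering dependency that the text says must be respected.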

［ＮＮ演算マルチコア１０Ｍ］
図８は、ＮＮ演算マルチコア１０Ｍを示す図である。
図８に例示するＮＮ演算マルチコア１０Ｍは、デイジーチェーン接続された二つのＮＮ演算コア１０を備える。二つのＮＮ演算コア１０を区別する場合、二つのＮＮ演算コア１０を、「第一ＮＮ演算コア１０Ａ」と、「第二ＮＮ演算コア１０Ｂ」という。なお、図８において、第一メモリ１は「Ａ」、畳み込み演算回路４は「Ｃ」、第二メモリ２は「Ｆ」、量子化演算回路５は「Ｑ」として略記されている。 [NN calculation multi-core 10M]
FIG. 8 is a diagram showing the NN calculation multi-core 10M.
The NN calculation multi-core 10M illustrated in FIG. 8 includes two daisy-chained NN calculation cores 10. To distinguish between them, the two NN calculation cores 10 are referred to as the "first NN calculation core 10A" and the "second NN calculation core 10B." In FIG. 8, the first memory 1 is abbreviated as "A," the convolution operation circuit 4 as "C," the second memory 2 as "F," and the quantization operation circuit 5 as "Q."

具体的には、第一ＮＮ演算コア１０Ａの量子化演算回路５と、第二ＮＮ演算コア１０Ｂの第一メモリ１とがデイジーチェーン接続（Ｃ２）されている。第一ＮＮ演算コア１０Ａの量子化演算回路５は、ループ接続（Ｃ１）された第一ＮＮ演算コア１０Ａの第一メモリ１または／およびデイジーチェーン接続（Ｃ２）された第二ＮＮ演算コア１０Ｂの第一メモリ１に量子化演算出力データを書き込むことができる。 Specifically, the quantization calculation circuit 5 of the first NN calculation core 10A and the first memory 1 of the second NN calculation core 10B are daisy-chain connected (C2). The quantization calculation circuit 5 of the first NN calculation core 10A can write the quantization calculation output data to the first memory 1 of the first NN calculation core 10A, which is loop-connected (C1), and/or the first memory 1 of the second NN calculation core 10B, which is daisy-chain connected (C2).

具体的には、第二ＮＮ演算コア１０Ｂの量子化演算回路５と、第一ＮＮ演算コア１０Ａの第一メモリ１とがデイジーチェーン接続（Ｃ２）されている。第二ＮＮ演算コア１０Ｂの量子化演算回路５は、ループ接続（Ｃ１）された第二ＮＮ演算コア１０Ｂの第一メモリ１または／およびデイジーチェーン接続（Ｃ２）された第一ＮＮ演算コア１０Ａの第一メモリ１に量子化演算出力データを書き込むことができる。 Specifically, the quantization calculation circuit 5 of the second NN calculation core 10B and the first memory 1 of the first NN calculation core 10A are daisy-chain connected (C2). The quantization calculation circuit 5 of the second NN calculation core 10B can write the quantization calculation output data to the first memory 1 of the second NN calculation core 10B, which is loop-connected (C1), and/or the first memory 1 of the first NN calculation core 10A, which is daisy-chain connected (C2).

ＮＮ演算マルチコア１０Ｍが三つ以上のＮＮ演算コア１０を備える場合も同様に、複数のＮＮ演算コア１０はデイジーチェーン接続される。最終段のＮＮ演算コア１０以外のＮＮ演算コア１０の量子化演算回路５は、後段のＮＮ演算コア１０Ｂの第一メモリ１とデイジーチェーン接続（Ｃ２）される。最終段のＮＮ演算コア１０の量子化演算回路５は、最初段のＮＮ演算コア１０の第一メモリ１とデイジーチェーン接続（Ｃ２）されている。複数のＮＮ演算コア１０はデイジーチェーンループ（数珠繋ぎ）状に形成されていることを特徴とする。 Similarly, when the NN calculation multi-core 10M has three or more NN calculation cores 10, the multiple NN calculation cores 10 are connected in a daisy chain. The quantization calculation circuit 5 of the NN calculation cores 10 other than the final stage NN calculation core 10 is daisy chain connected (C2) to the first memory 1 of the subsequent stage NN calculation core 10B. The quantization calculation circuit 5 of the final stage NN calculation core 10 is daisy chain connected (C2) to the first memory 1 of the first stage NN calculation core 10. The multiple NN calculation cores 10 are characterized by being formed in a daisy chain loop (linked together).
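
The daisy chain loop can be summarized by the index of the first memory that each quantization operation circuit is able to write to. The sketch below assumes the cores are numbered 0 to N-1 in chain order; the function name `write_targets` is hypothetical.

```python
def write_targets(core_index, num_cores):
    """Return the cores whose first memory this core's quantization
    circuit can write to, as (loop_target, chain_target)."""
    if not 0 <= core_index < num_cores:
        raise ValueError("core_index out of range")
    loop_target = core_index                     # loop connection (C1)
    chain_target = (core_index + 1) % num_cores  # daisy-chain connection (C2)
    return loop_target, chain_target
```

The modulo operation expresses that the final-stage core feeds the first-stage core, which closes the chain into a loop.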

一つのＮＮ演算コア１０において、第一メモリ（Ａ）１と畳み込み演算回路（Ｃ）４と第二メモリ（Ｆ）２と量子化演算回路（Ｑ）５とは、ループ状に接続されている。一方、ＮＮ演算マルチコア１０Ｍにおいては、第一メモリ（Ａ）１と畳み込み演算回路（Ｃ）４と第二メモリ（Ｆ）２と量子化演算回路（Ｑ）５とは、第一メモリ（Ａ）１と畳み込み演算回路（Ｃ）４と第二メモリ（Ｆ）２と量子化演算回路（Ｑ）５とが同じ順番で繰り返し配列するように、デイジーチェーンループ（数珠繋ぎ）状に接続されている。 In one NN calculation core 10, the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are connected in a loop. On the other hand, in the NN calculation multi-core 10M, the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are connected in a daisy chain loop (linked together) so that the first memory (A) 1, the convolution calculation circuit (C) 4, the second memory (F) 2, and the quantization calculation circuit (Q) 5 are repeatedly arranged in the same order.

ＮＮ演算マルチコア１０Ｍを構成する複数のＮＮ演算コア１０は、同一のハードウェア構成でなくてもよい。例えば、第一ＮＮ演算コア１０Ａの第一メモリ１の容量・構成は、第二ＮＮ演算コア１０Ｂの第一メモリ１の容量・構成と異なっていてもよい。例えば、第一ＮＮ演算コア１０Ａの量子化演算回路５の構成は、第二ＮＮ演算コア１０Ｂの量子化演算回路５の構成と異なっていてもよい。 The multiple NN calculation cores 10 constituting the NN calculation multi-core 10M do not need to have the same hardware configuration. For example, the capacity and configuration of the first memory 1 of the first NN calculation core 10A may be different from the capacity and configuration of the first memory 1 of the second NN calculation core 10B. For example, the configuration of the quantization calculation circuit 5 of the first NN calculation core 10A may be different from the configuration of the quantization calculation circuit 5 of the second NN calculation core 10B.

次に、ＮＮ回路１００の各構成に関して詳しく説明する。 Next, we will explain each component of the NN circuit 100 in detail.

［ＤＭＡＣ３］
図９は、ＤＭＡＣ３の内部ブロック図である。
ＤＭＡＣ３は、データ転送回路３１と、ステートコントローラ３２と、プリロード命令ジュネレータ３５と、を有する。ＤＭＡＣ３は、データ転送回路３１に対する専用のステートコントローラ３２を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずにＤＭＡデータ転送を実施できる。 [DMAC3]
FIG. 9 is an internal block diagram of the DMAC3.
The DMAC 3 has a data transfer circuit 31, a state controller 32, and a preload command generator 35. The DMAC 3 has a state controller 32 dedicated to the data transfer circuit 31, and when a command is input, the DMAC 3 can perform DMA data transfer without requiring an external controller.

データ転送回路３１は、外部バスＥＢに接続されており、ＤＲＡＭなどの外部メモリ１２０とＮＮ演算コア１０との間でデータを転送する。データ転送回路３１のＤＭＡチャンネル数は限定されない。例えば、第一ＮＮ演算コア１０Ａと第二ＮＮ演算コア１０Ｂのそれぞれに専用のＤＭＡチャンネルを有していてもよい。 The data transfer circuit 31 is connected to the external bus EB, and transfers data between an external memory 120 such as a DRAM and the NN calculation core 10. The number of DMA channels of the data transfer circuit 31 is not limited. For example, the first NN calculation core 10A and the second NN calculation core 10B may each have a dedicated DMA channel.

ステートコントローラ３２は、データ転送回路３１のステートを制御する。また、ステートコントローラ３２は、内部バスＩＢを介してコントローラ６と接続されている。ステートコントローラ３２は、命令キュー３３と制御回路３４とを有する。 The state controller 32 controls the state of the data transfer circuit 31. The state controller 32 is also connected to the controller 6 via the internal bus IB. The state controller 32 has an instruction queue 33 and a control circuit 34.

命令キュー３３は、ＤＭＡＣ３用の命令コマンドＣ３が格納されるキューであり、例えばＦＩＦＯメモリで構成される。命令キュー３３には、命令供給ユニット７経由または内部バスＩＢ経由で１つ以上の命令コマンドＣ３が書き込まれる。 The instruction queue 33 is a queue in which instruction commands C3 for DMAC3 are stored, and is configured, for example, as a FIFO memory. One or more instruction commands C3 are written to the instruction queue 33 via the instruction supply unit 7 or the internal bus IB.

制御回路３４は、命令コマンドＣ３をデコードし、命令コマンドＣ３に基づいて順次データ転送回路３１を制御するステートマシンである。制御回路３４は、論理回路により実装されていてもよいし、ソフトウェアによって制御されるＣＰＵによって実装されていてもよい。 The control circuit 34 is a state machine that decodes the instruction command C3 and sequentially controls the data transfer circuit 31 based on the instruction command C3. The control circuit 34 may be implemented by a logic circuit or a CPU controlled by software.

図１０は、制御回路３４のステート遷移図である。
制御回路３４は、命令キュー３３に命令コマンドＣ３が入力されると（Ｎｏｔｅｍｐｔｙ）、アイドルステートＳＴ１からデコードステートＳＴ２に遷移する。 FIG. 10 is a state transition diagram of the control circuit 34.
When the instruction command C3 is input to the instruction queue 33 (not empty), the control circuit 34 transitions from the idle state ST1 to the decode state ST2.

制御回路３４は、デコードステートＳＴ２において、命令キュー３３から出力される命令コマンドＣ３をデコードする。また、制御回路３４は、コントローラ６のレジスタ６１に格納されたセマフォＳを読み出し、命令コマンドＣ３において指示されたデータ転送回路３１の動作を実行可能であるかを判定する。実行不能である場合（Ｎｏｔｒｅａｄｙ）、制御回路３４は実行可能となるまで待つ（Ｗａｉｔ）。実行可能である場合（ｒｅａｄｙ）、制御回路３４はデコードステートＳＴ２から実行ステートＳＴ３に遷移する。 In the decode state ST2, the control circuit 34 decodes the instruction command C3 output from the instruction queue 33. The control circuit 34 also reads the semaphore S stored in the register 61 of the controller 6 and determines whether the operation of the data transfer circuit 31 instructed in the instruction command C3 can be executed. If it is not executable (Not ready), the control circuit 34 waits until it becomes executable (Wait). If it is executable (Ready), the control circuit 34 transitions from the decode state ST2 to the execution state ST3.

制御回路３４は、実行ステートＳＴ３において、データ転送回路３１を制御して、データ転送回路３１に命令コマンドＣ３において指示された動作を実施させる。制御回路３４は、データ転送回路３１の動作が終わると、命令キュー３３から実行を終えた命令コマンドＣ３を取り除くとともに、コントローラ６のレジスタ６１に格納されたセマフォＳを更新する。制御回路３４は、命令キュー３３に命令がある場合（Ｎｏｔｅｍｐｔｙ）、実行ステートＳＴ３からデコードステートＳＴ２に遷移する。制御回路３４は、命令キュー３３に命令がない場合（ｅｍｐｔｙ）、実行ステートＳＴ３からアイドルステートＳＴ１に遷移する。 In the execution state ST3, the control circuit 34 controls the data transfer circuit 31 to cause the data transfer circuit 31 to perform the operation instructed in the instruction command C3. When the operation of the data transfer circuit 31 is completed, the control circuit 34 removes the executed instruction command C3 from the instruction queue 33 and updates the semaphore S stored in the register 61 of the controller 6. If there is an instruction in the instruction queue 33 (Not empty), the control circuit 34 transitions from the execution state ST3 to the decode state ST2. If there is no instruction in the instruction queue 33 (Empty), the control circuit 34 transitions from the execution state ST3 to the idle state ST1.
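
The three-state behavior of the control circuit 34 can be sketched as a transition function. This is a software illustration only: the `step` function, the `is_ready` callable standing in for the semaphore check, and the `execute` callable standing in for the data transfer and semaphore update are all assumptions made for the example.

```python
from collections import deque
from enum import Enum


class State(Enum):
    IDLE = "ST1"
    DECODE = "ST2"
    EXECUTE = "ST3"


def step(state, queue, is_ready, execute):
    """Perform one transition of the control circuit and return the new state.

    queue    : deque of pending instruction commands (the instruction queue)
    is_ready : callable(command) -> bool, stand-in for reading the semaphore
    execute  : callable(command), stand-in for running the transfer and
               updating the semaphore afterwards
    """
    if state is State.IDLE:
        # Leave idle as soon as the queue is not empty.
        return State.DECODE if queue else State.IDLE
    if state is State.DECODE:
        command = queue[0]
        # Stay in the decode state (wait) until the command is executable.
        return State.EXECUTE if is_ready(command) else State.DECODE
    # EXECUTE: run the command at the head, then remove it from the queue.
    execute(queue[0])
    queue.popleft()
    return State.DECODE if queue else State.IDLE
```

Calling `step` repeatedly with a `deque` of commands walks through the idle, decode, and execute states in the order described above, and the circuit returns to the idle state once the queue is empty.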

プリロード命令ジュネレータ３５は、コントローラ６のレジスタ６１に含まれる「プリロードモードレジスタ」に設定されたデータ転送に関する設定値（転送元アドレス、転送先アドレス、データ転送量）に基づいて、設定されたデータ転送をデータ転送回路３１に実行させる命令コマンドＣ３を生成して命令キュー３３に入力する。なお、プリロード命令ジュネレータ３５は、ＤＭＡＣ３ではなくコントローラ６に配置されていてもよい。 The preload command generator 35 generates an instruction command C3 that causes the data transfer circuit 31 to execute the data transfer specified by the setting values (transfer source address, transfer destination address, and data transfer amount) set in the "preload mode register" included in the register 61 of the controller 6, and inputs the instruction command C3 to the instruction queue 33. Note that the preload command generator 35 may be located in the controller 6 instead of the DMAC 3.

［ＤＭＡＣ３：プリロードモードにおける動作］
ＮＮ回路１００の動作モードがプリロードモードであるとき、データ転送回路３１は、レジスタ６１に含まれるプリロードモードの設定に関するパラメータレジスタ（プリロードモードレジスタ）に基づいて、外部メモリ１２０から命令供給ユニット７に対するＤＭＡ転送（以降、「プリロード」ともいう）を実施する。ステートコントローラ３２は、プリロード命令ジュネレータ３５が供給した命令コマンドＣ３に基づいて動作する。 [DMAC3: Operation in Preload Mode]
When the operation mode of the NN circuit 100 is the preload mode, the data transfer circuit 31 performs a DMA transfer (hereinafter also referred to as "preload") from the external memory 120 to the instruction supply unit 7 based on the parameter register for the preload mode settings (preload mode register) included in the register 61. The state controller 32 operates based on the instruction command C3 supplied by the preload command generator 35.

［ＤＭＡＣ３：ノーマルモードにおける動作］
ＮＮ回路１００の動作モードがノーマルモードであるとき、命令供給ユニット７は、命令キュー３３に命令コマンドＣ３を供給する。ステートコントローラ３２は、命令供給ユニット７が供給した命令コマンドＣ３に基づいて動作する。 [DMAC3: Operation in normal mode]
When the operation mode of the NN circuit 100 is the normal mode, the instruction supply unit 7 supplies an instruction command C3 to the instruction queue 33. The state controller 32 operates based on the instruction command C3 supplied by the instruction supply unit 7.

［畳み込み演算回路４］
図１１は、畳み込み演算回路４の内部ブロック図である。
畳み込み演算回路４は、重みメモリ４１と、乗算器４２と、アキュムレータ回路４３と、ステートコントローラ４４と、を有する。畳み込み演算回路４は、乗算器４２およびアキュムレータ回路４３に対する専用のステートコントローラ４４を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずに畳み込み演算を実施できる。 [Convolution operation circuit 4]
FIG. 11 is an internal block diagram of the convolution circuit 4.
The convolution operation circuit 4 has a weight memory 41, a multiplier 42, an accumulator circuit 43, and a state controller 44. The convolution operation circuit 4 has a state controller 44 dedicated to the multiplier 42 and the accumulator circuit 43, and when an instruction command is input, the convolution operation can be performed without requiring an external controller.

重みメモリ４１は、畳み込み演算に用いる重みｗが格納されるメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。ＤＭＡＣ３は、ＤＭＡ転送により、畳み込み演算に必要な重みｗを重みメモリ４１に書き込む。 The weight memory 41 is a memory in which the weight w used in the convolution calculation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM). The DMAC3 writes the weight w required for the convolution calculation to the weight memory 41 by DMA transfer.

図１２は、乗算器４２の内部ブロック図である。
乗算器４２は、入力ベクトルＡと重みマトリクスＷとを乗算する。入力ベクトルＡは、上述したように、分割入力データａ（ｘ＋ｉ、ｙ＋ｊ、ｃｏ）がｉ、ｊごとに展開されたＢｃ個の要素を持つベクトルデータである。また、重みマトリクスＷは、分割重みｗ（ｉ，ｊ，ｃｏ、ｄｏ）がｉ、ｊごとに展開されたＢｃ×Ｂｄ個の要素を持つマトリクスデータである。乗算器４２は、Ｂｃ×Ｂｄ個の積和演算ユニット４７を有し、入力ベクトルＡと重みマトリクスＷとを乗算を並列して実施できる。 FIG. 12 is an internal block diagram of the multiplier 42.
The multiplier 42 multiplies the input vector A by the weight matrix W. As described above, the input vector A is vector data having Bc elements in which the divided input data a(x+i, y+j, co) is expanded for each i and j. The weight matrix W is matrix data having Bc×Bd elements in which the divided weights w(i, j, co, do) are expanded for each i and j. The multiplier 42 has Bc×Bd product-sum operation units 47 and can multiply the input vector A by the weight matrix W in parallel.

乗算器４２は、乗算に必要な入力ベクトルＡと重みマトリクスＷを、第一メモリ１および重みメモリ４１から読み出して乗算を実施する。乗算器４２は、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を出力する。 The multiplier 42 reads the input vector A and weight matrix W required for the multiplication from the first memory 1 and the weight memory 41, and performs the multiplication. The multiplier 42 outputs Bd product-sum operation results O(di).
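
As a reference for what the multiplier 42 computes, the following sketch evaluates the Bd product-sum results sequentially in software. It assumes the element formats that the description gives for the product-sum operation unit 47 (2-bit unsigned inputs, and 1-bit weights where 0 stands for +1 and 1 stands for -1). The function name `multiply` and the list-of-lists layout `W[ci][di]` are hypothetical, and the hardware performs these operations in parallel rather than in loops.

```python
def multiply(A, W):
    """Compute O(di) = sum over ci of A(ci) * w(ci, di).

    A : list of Bc integers in the range 0..3
    W : Bc x Bd list of weight bits, where bit 0 means +1 and bit 1 means -1
    Returns a list of Bd integers.
    """
    if any(not 0 <= a <= 3 for a in A):
        raise ValueError("input elements must be 2-bit unsigned integers")
    Bc = len(A)
    Bd = len(W[0])
    O = []
    for di in range(Bd):
        S = 0
        for ci in range(Bc):
            # A weight bit of 1 subtracts the input; a bit of 0 adds it.
            S = S - A[ci] if W[ci][di] else S + A[ci]
        O.append(S)
    return O
```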

図１３は、積和演算ユニット４７の内部ブロック図である。
積和演算ユニット４７は、入力ベクトルＡの要素Ａ（ｃｉ）と、重みマトリクスＷの要素Ｗ（ｃｉ，ｄｉ）との乗算を実施する。また、積和演算ユニット４７は、乗算結果と他の積和演算ユニット４７の乗算結果Ｓ（ｃｉ，ｄｉ）と加算する。積和演算ユニット４７は、加算結果Ｓ（ｃｉ＋１，ｄｉ）を出力する。要素Ａ（ｃｉ）は、２ビットの符号なし整数（０，１，２，３）である。要素Ｗ（ｃｉ，ｄｉ）は、１ビットの符号付整数（０，１）であり、値「０」は＋１を表し、値「１」は－１を表す。 FIG. 13 is an internal block diagram of the product-sum operation unit 47.
The product-sum operation unit 47 multiplies an element A(ci) of the input vector A by an element W(ci, di) of the weight matrix W. The product-sum operation unit 47 also adds the multiplication result to the multiplication result S(ci, di) of another product-sum operation unit 47, and outputs the addition result S(ci+1, di). The element A(ci) is a 2-bit unsigned integer (0, 1, 2, 3). The element W(ci, di) is a 1-bit signed integer (0, 1), where the value "0" represents +1 and the value "1" represents -1.

積和演算ユニット４７は、反転器（インバータ）４７ａと、セレクタ４７ｂと、加算器４７ｃと、を有する。積和演算ユニット４７は、乗算器を用いず、反転器４７ａおよびセレクタ４７ｂのみを用いて乗算を行う。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「０」の場合、要素Ａ（ｃｉ）の入力を選択する。セレクタ４７ｂは、要素Ｗ（ｃｉ，ｄｉ）が「１」の場合、要素Ａ（ｃｉ）を反転器により反転させた補数を選択する。要素Ｗ（ｃｉ，ｄｉ）は、加算器４７ｃのＣａｒｒｙ－ｉｎにも入力される。加算器４７ｃは、要素Ｗ（ｃｉ，ｄｉ）が「０」のとき、Ｓ（ｃｉ，ｄｉ）に要素Ａ（ｃｉ）を加算した値を出力する。加算器４７ｃは、Ｗ（ｃｉ，ｄｉ）が「１」のとき、Ｓ（ｃｉ，ｄｉ）から要素Ａ（ｃｉ）を減算した値を出力する。 The product-sum operation unit 47 has an inverter 47a, a selector 47b, and an adder 47c. The product-sum operation unit 47 performs the multiplication using only the inverter 47a and the selector 47b, without using a multiplier. When the element W(ci, di) is "0", the selector 47b selects the element A(ci) as input. When the element W(ci, di) is "1", the selector 47b selects the complement of the element A(ci) produced by the inverter 47a. The element W(ci, di) is also input to the carry-in of the adder 47c. When the element W(ci, di) is "0", the adder 47c outputs the value obtained by adding the element A(ci) to S(ci, di). When the element W(ci, di) is "1", the adder 47c outputs the value obtained by subtracting the element A(ci) from S(ci, di).
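
The reason the inverter, selector, and carry-in together yield a subtraction is the two's-complement identity S + (~A) + 1 = S - A. The bit-level sketch below illustrates this. The 16-bit adder width is an assumption chosen for the example (the description does not state the width of the adder 47c), and the function name `product_sum_unit` is hypothetical.

```python
def product_sum_unit(S, A, W, width=16):
    """Model one product-sum operation unit at the bit level.

    S     : running sum S(ci, di) from the previous unit (signed integer)
    A     : input element A(ci), an unsigned integer in 0..3
    W     : weight bit W(ci, di); 0 means +1 and 1 means -1
    width : assumed adder width in bits (not specified in the description)
    Returns S(ci+1, di) as a signed integer.
    """
    mask = (1 << width) - 1
    inverted = ~A & mask               # inverter: bitwise complement of A
    selected = inverted if W else A    # selector: choose A or its complement
    total = (S + selected + W) & mask  # adder with W on the carry-in
    # Interpret the adder output as a two's-complement signed value.
    return total - (1 << width) if total & (1 << (width - 1)) else total
```

When W is 1, the selector passes the complement of A and the carry-in adds 1, so the adder output equals S minus A; when W is 0, the carry-in is 0 and the output is S plus A.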

図１４は、アキュムレータ回路４３の内部ブロック図である。
アキュムレータ回路４３は、乗算器４２の積和演算結果Ｏ（ｄｉ）を第二メモリ２にアキュムレートする。アキュムレータ回路４３は、Ｂｄ個のアキュムレータユニット４８を有し、Ｂｄ個の積和演算結果Ｏ（ｄｉ）を並列して第二メモリ２にアキュムレートできる。 FIG. 14 is an internal block diagram of the accumulator circuit 43.
The accumulator circuit 43 accumulates the product-sum operation results O(di) of the multiplier 42 in the second memory 2. The accumulator circuit 43 has Bd accumulator units 48 and can accumulate the Bd product-sum operation results O(di) in parallel in the second memory 2.

図１５は、アキュムレータユニット４８の内部ブロック図である。
アキュムレータユニット４８は、加算器４８ａと、マスク部４８ｂとを有している。加算器４８ａは、積和演算結果Ｏの要素Ｏ（ｄｉ）と、第二メモリ２に格納された式１に示す畳み込み演算の途中経過である部分和と、を加算する。加算結果は、要素あたり１６ビットである。加算結果は、要素あたり１６ビットに限定されず、例えば要素あたり１５ビットや１７ビットであってもよい。 FIG. 15 is an internal block diagram of the accumulator unit 48.
The accumulator unit 48 has an adder 48a and a mask unit 48b. The adder 48a adds an element O(di) of the product-sum operation result O to the partial sum stored in the second memory 2, which is an intermediate result of the convolution operation shown in Equation 1. The addition result is 16 bits per element. The addition result is not limited to 16 bits per element and may be, for example, 15 bits or 17 bits per element.

加算器４８ａは、加算結果を第二メモリ２の同一アドレスに書き込む。マスク部４８ｂは、初期化信号ｃｌｅａｒがアサートされた場合に、第二メモリ２からの出力をマスクし、要素Ｏ（ｄｉ）に対する加算対象をゼロにする。初期化信号ｃｌｅａｒは、第二メモリ２に途中経過の部分和が格納されていない場合にアサートされる。 The adder 48a writes the result of the addition to the same address in the second memory 2. When the initialization signal clear is asserted, the masking unit 48b masks the output from the second memory 2 and sets the addition target for element O(di) to zero. The initialization signal clear is asserted when no intermediate partial sums are stored in the second memory 2.
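
The read-modify-write behavior of one accumulator unit 48, including the effect of the clear signal, can be sketched as follows. The second memory is modeled as a plain Python list, the function name `accumulate` is hypothetical, and the 16-bit width of the stored result is omitted for brevity.

```python
def accumulate(second_memory, address, O_di, clear):
    """Accumulate one product-sum result into the second memory.

    second_memory : list standing in for the second memory
    address       : index of the partial sum to update
    O_di          : product-sum result O(di) from the multiplier
    clear         : True when no partial sum has been stored yet
    Returns the value written back.
    """
    # Mask unit: when clear is asserted, the stored value is replaced by zero.
    partial = 0 if clear else second_memory[address]
    # Adder: the sum is written back to the same address.
    second_memory[address] = partial + O_di
    return second_memory[address]
```

Asserting `clear` on the first accumulation means the memory does not have to be zeroed beforehand, since whatever value it held is ignored.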

乗算器４２およびアキュムレータ回路４３による畳み込み演算が完了すると、第二メモリ２に、出力データｆ（ｘ，ｙ，ｄｏ）が格納される。 When the convolution operation by the multiplier 42 and the accumulator circuit 43 is completed, the output data f(x, y, do) is stored in the second memory 2.

ステートコントローラ４４は、乗算器４２およびアキュムレータ回路４３のステートを制御する。また、ステートコントローラ４４は、内部バスＩＢを介してコントローラ６と接続されている。ステートコントローラ４４は、命令キュー４５と制御回路４６とを有する。 The state controller 44 controls the states of the multiplier 42 and the accumulator circuit 43. The state controller 44 is also connected to the controller 6 via the internal bus IB. The state controller 44 has an instruction queue 45 and a control circuit 46.

命令キュー４５は、畳み込み演算回路４用の命令コマンドＣ４が格納されるキューであり、例えばＦＩＦＯメモリで構成される。命令キュー４５には、命令供給ユニット７経由または内部バスＩＢ経由で命令コマンドＣ４が書き込まれる。 The instruction queue 45 is a queue in which the instruction command C4 for the convolution operation circuit 4 is stored, and is configured, for example, as a FIFO memory. The instruction command C4 is written to the instruction queue 45 via the instruction supply unit 7 or the internal bus IB.

制御回路４６は、命令コマンドＣ４をデコードし、命令コマンドＣ４に基づいて乗算器４２およびアキュムレータ回路４３を制御するステートマシンである。制御回路４６は、ＤＭＡＣ３のステートコントローラ３２の制御回路３４と同様の構成である。 The control circuit 46 is a state machine that decodes the instruction command C4 and controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4. The control circuit 46 has a configuration similar to that of the control circuit 34 of the state controller 32 of the DMAC3.

［畳み込み演算回路４：プリロードモードにおける動作］
ＮＮ回路１００の動作モードがプリロードモードであるとき、ステートコントローラ４４は動作しない。プリロードするデータに重みｗが含まれる場合、ＤＭＡＣ３は外部メモリ１２０から読み出した重みｗを重みメモリ４１に書き込む。 [Convolution Operation Circuit 4: Operation in Preload Mode]
When the operation mode of the NN circuit 100 is the preload mode, the state controller 44 does not operate. When the weight w is included in the data to be preloaded, the DMAC 3 writes the weight w read from the external memory 120 into the weight memory 41.

［畳み込み演算回路４：ノーマルモードにおける動作］
ＮＮ回路１００の動作モードがノーマルモードであるとき、命令供給ユニット７は、命令キュー４５に命令コマンドＣ４を供給する。ステートコントローラ４４は、命令キュー４５に格納された命令コマンドＣ４に基づいて、乗算器４２およびアキュムレータ回路４３を制御する。 [Convolution Operation Circuit 4: Operation in Normal Mode]
When the operation mode of the NN circuit 100 is the normal mode, the instruction supply unit 7 supplies an instruction command C4 to the instruction queue 45. The state controller 44 controls the multiplier 42 and the accumulator circuit 43 based on the instruction command C4 stored in the instruction queue 45.

［量子化演算回路５］
図１６は、量子化演算回路５の内部ブロック図である。
量子化演算回路５は、量子化パラメータメモリ５１と、ベクトル演算回路５２と、量子化回路５３と、ステートコントローラ５４と、を有する。量子化演算回路５は、ベクトル演算回路５２および量子化回路５３に対する専用のステートコントローラ５４を有しており、命令コマンドが入力されると、外部のコントローラを必要とせずに量子化演算を実施できる。 [Quantization operation circuit 5]
FIG. 16 is an internal block diagram of the quantization calculation circuit 5.
The quantization calculation circuit 5 has a quantization parameter memory 51, a vector calculation circuit 52, a quantization circuit 53, and a state controller 54. Because the quantization calculation circuit 5 has a state controller 54 dedicated to the vector calculation circuit 52 and the quantization circuit 53, it can perform the quantization operation without an external controller once an instruction command is input.

量子化パラメータメモリ５１は、量子化演算に用いる量子化パラメータｑが格納されるメモリであり、例えばＳＲＡＭ（ＳｔａｔｉｃＲＡＭ）などで構成された揮発性のメモリ等の書き換え可能なメモリである。ＤＭＡＣ３は、ＤＭＡ転送により、量子化演算に必要な量子化パラメータｑを量子化パラメータメモリ５１に書き込む。 The quantization parameter memory 51 is a memory in which the quantization parameter q used in the quantization operation is stored, and is a rewritable memory such as a volatile memory composed of, for example, an SRAM (Static RAM). The DMAC3 writes the quantization parameter q required for the quantization operation to the quantization parameter memory 51 by DMA transfer.

図１７は、ベクトル演算回路５２と量子化回路５３の内部ブロック図である。
ベクトル演算回路５２は、第二メモリ２に格納された出力データｆ（ｘ，ｙ，ｄｏ）に対して演算を行う。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７を有し、出力データｆ（ｘ，ｙ，ｄｏ）に対して並列にＳＩＭＤ演算を行う。 FIG. 17 is an internal block diagram of the vector calculation circuit 52 and the quantization circuit 53.
The vector operation circuit 52 performs an operation on the output data f(x, y, do) stored in the second memory 2. The vector operation circuit 52 has Bd operation units 57, and performs SIMD operations in parallel on the output data f(x, y, do).

図１８は、演算ユニット５７のブロック図である。
演算ユニット５７は、例えば、ＡＬＵ５７ａと、第一セレクタ５７ｂと、第二セレクタ５７ｃと、レジスタ５７ｄと、シフタ５７ｅと、を有する。演算ユニット５７は、公知の汎用ＳＩＭＤ演算回路が有する他の演算器等をさらに有してもよい。 FIG. 18 is a block diagram of the arithmetic unit 57.
The arithmetic unit 57 includes, for example, an ALU 57a, a first selector 57b, a second selector 57c, a register 57d, and a shifter 57e. The arithmetic unit 57 may further include other arithmetic units included in a known general-purpose SIMD arithmetic circuit.

ベクトル演算回路５２は、演算ユニット５７が有する演算器等を組み合わせることで、出力データｆ（ｘ，ｙ，ｄｏ）に対して、量子化演算層２２０におけるプーリング層２２１や、ＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎ層２２２や、活性化関数層２２３の演算のうち少なくとも一つの演算を行う。 The vector calculation circuit 52 performs at least one of the calculations of the pooling layer 221, the batch normalization layer 222, and the activation function layer 223 in the quantization calculation layer 220 on the output data f(x, y, do) by combining the calculation units and the like contained in the calculation unit 57.

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより加算できる。演算ユニット５７は、ＡＬＵ５７ａによる加算結果をレジスタ５７ｄに格納できる。演算ユニット５７は、第一セレクタ５７ｂの選択によりレジスタ５７ｄに格納されたデータに代えて「０」をＡＬＵ５７ａに入力することで加算結果を初期化できる。例えばプーリング領域が２×２である場合、シフタ５７ｅはＡＬＵ５７ａの出力を２ｂｉｔ右シフトすることで加算結果の平均値を出力できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式２に示す平均プーリングの演算を実施できる。 The arithmetic unit 57 can add the data stored in the register 57d and the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a. The arithmetic unit 57 can store the addition result by the ALU 57a in the register 57d. The arithmetic unit 57 can initialize the addition result by inputting "0" to the ALU 57a instead of the data stored in the register 57d by the selection of the first selector 57b. For example, if the pooling area is 2x2, the shifter 57e can output the average value of the addition results by shifting the output of the ALU 57a 2 bits to the right. The vector calculation circuit 52 can perform the average pooling calculation shown in Equation 2 by repeating the above calculations by the Bd arithmetic units 57.
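
The accumulate-then-shift sequence described above can be illustrated with a small Python sketch. It assumes a 2x2 pooling region and that the shifter performs an arithmetic right shift (Python's `>>` on integers behaves this way, rounding toward negative infinity); the function name `average_pool_2x2` is hypothetical.

```python
def average_pool_2x2(window):
    """Average four elements by accumulation and a 2-bit right shift (sketch)."""
    if len(window) != 4:
        raise ValueError("a 2x2 pooling region has exactly four elements")
    register = 0                    # first selector 57b feeds "0" to initialize
    for f_di in window:
        register = register + f_di  # ALU 57a adds; the sum returns to register 57d
    return register >> 2            # shifter 57e: a 2-bit right shift divides by 4


print(average_pool_2x2([4, 8, 12, 16]))  # prints 10
```

Because the division is a shift, a sum that is not a multiple of four is truncated rather than rounded to nearest in this model.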

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより比較できる。
演算ユニット５７は、ＡＬＵ５７ａによる比較結果に応じて第二セレクタ５７ｃを制御して、レジスタ５７ｄに格納されたデータと要素ｆ（ｄｉ）の大きい方を選択できる。演算ユニット５７は、第一セレクタ５７ｂの選択により要素ｆ（ｄｉ）の取りうる値の最小値をＡＬＵ５７ａに入力することで比較対象を最小値に初期化できる。本実施形態において要素ｆ（ｄｉ）は１６ｂｉｔ符号付き整数であるので、要素ｆ（ｄｉ）の取りうる値の最小値は「０ｘ８０００」である。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式３のＭＡＸプーリングの演算を実施できる。なお、ＭＡＸプーリングの演算ではシフタ５７ｅは第二セレクタ５７ｃの出力をシフトしない。 The arithmetic unit 57 can compare the data stored in the register 57d with the element f(di) of the output data f(x, y, do) read from the second memory 2 by the ALU 57a.
The arithmetic unit 57 controls the second selector 57c according to the comparison result from the ALU 57a, and can thereby select the larger of the data stored in the register 57d and the element f(di). The arithmetic unit 57 can initialize the comparison target by inputting the smallest value that the element f(di) can take to the ALU 57a through the selection of the first selector 57b. In this embodiment, the element f(di) is a 16-bit signed integer, so the smallest value it can take is "0x8000" (that is, -32768 in two's complement). The vector calculation circuit 52 can perform the MAX pooling calculation of Equation 3 by repeating the above calculations in the Bd arithmetic units 57. Note that in the MAX pooling calculation, the shifter 57e does not shift the output of the second selector 57c.
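
As a hedged illustration of the compare-and-select loop, the following Python sketch initializes the running maximum to the 16-bit bit pattern 0x8000 interpreted as a signed value. The function name `max_pool` and the use of a Python list for the pooling region are assumptions made for the example.

```python
# 0x8000 read as a 16-bit two's-complement integer is -32768.
INT16_MIN = int.from_bytes(b"\x80\x00", byteorder="big", signed=True)


def max_pool(window):
    """Running maximum over a pooling region (behavioral sketch)."""
    register = INT16_MIN          # first selector 57b supplies the minimum value
    for f_di in window:
        # ALU 57a compares; second selector 57c keeps the larger operand.
        if f_di > register:
            register = f_di
    return register               # no shift is applied for MAX pooling


print(max_pool([-5, -9, -1, -7]))  # prints -1
```

Starting from the smallest representable value ensures that any element of the region, including a negative one, replaces the initial value on the first comparison or ties with it.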

演算ユニット５７は、レジスタ５７ｄに格納されたデータと第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）とをＡＬＵ５７ａにより減算できる。シフタ５７ｅはＡＬＵ５７ａの出力を左シフト（すなわち乗算）もしくは右シフト（すなわち除算）できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式４のＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎの演算を実施できる。 The arithmetic unit 57 can subtract data stored in the register 57d and an element f(di) of the output data f(x, y, do) read from the second memory 2 using the ALU 57a. The shifter 57e can shift the output of the ALU 57a to the left (i.e., multiplication) or to the right (i.e., division). The vector arithmetic circuit 52 can perform the Batch Normalization calculation of Equation 4 by repeating the above calculations by the Bd arithmetic units 57.

演算ユニット５７は、第二メモリ２から読み出した出力データｆ（ｘ，ｙ，ｄｏ）の要素ｆ（ｄｉ）と第一セレクタ５７ｂにより選択された「０」とをＡＬＵ５７ａにより比較できる。演算ユニット５７は、ＡＬＵ５７ａによる比較結果に応じて要素ｆ（ｄｉ）と予めレジスタ５７ｄに格納された定数値「０」のいずれかを選択して出力できる。ベクトル演算回路５２は、Ｂｄ個の演算ユニット５７による上記の演算等を繰り返すことで、式５のＲｅＬＵ演算を実施できる。 The arithmetic unit 57 can compare the element f(di) of the output data f(x, y, do) read from the second memory 2 with the "0" selected by the first selector 57b using the ALU 57a. The arithmetic unit 57 can select and output either the element f(di) or the constant value "0" previously stored in the register 57d, depending on the comparison result by the ALU 57a. The vector calculation circuit 52 can perform the ReLU calculation of Equation 5 by repeating the above calculations by the Bd arithmetic units 57.
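
The ReLU selection can be sketched as follows. The list comprehension stands in for the Bd arithmetic units 57 working in parallel; the function names are hypothetical and the sketch only models the selection logic, not the circuit timing.

```python
def relu(f_di):
    """Select f(di) when it is greater than zero, otherwise the constant 0."""
    # ALU 57a compares f(di) with "0"; the comparison result picks the output.
    return f_di if f_di > 0 else 0


def relu_vector(elements):
    """Apply the same selection to every element, as the parallel units would."""
    return [relu(f_di) for f_di in elements]


print(relu_vector([-3, 0, 5]))  # prints [0, 0, 5]
```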

ベクトル演算回路５２は、平均プーリング、ＭＡＸプーリング、ＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎ、活性化関数の演算およびこれらの演算の組み合わせを実施できる。ベクトル演算回路５２は、汎用ＳＩＭＤ演算を実施できるため、量子化演算層２２０における演算に必要な他の演算を実施してもよい。また、ベクトル演算回路５２は、量子化演算層２２０における演算以外の演算を実施してもよい。 The vector operation circuit 52 can perform average pooling, MAX pooling, batch normalization, activation function operations, and combinations of these operations. Since the vector operation circuit 52 can perform general-purpose SIMD operations, it may also perform other operations necessary for the operations in the quantization operation layer 220. In addition, the vector operation circuit 52 may also perform operations other than those in the quantization operation layer 220.

なお、量子化演算回路５は、ベクトル演算回路５２を有してなくてもよい。量子化演算回路５がベクトル演算回路５２を有していない場合、出力データｆ（ｘ，ｙ，ｄｏ）は量子化回路５３に入力される。 The quantization calculation circuit 5 does not have to have a vector calculation circuit 52. If the quantization calculation circuit 5 does not have a vector calculation circuit 52, the output data f(x, y, do) is input to the quantization circuit 53.

量子化回路５３は、ベクトル演算回路５２の出力データに対して、量子化を行う。量子化回路５３は、図１７に示すように、Ｂｄ個の量子化ユニット５８を有し、ベクトル演算回路５２の出力データに対して並列に演算を行う。 The quantization circuit 53 performs quantization on the output data of the vector calculation circuit 52. As shown in FIG. 17, the quantization circuit 53 has Bd quantization units 58, and performs calculations in parallel on the output data of the vector calculation circuit 52.

図１９は、量子化ユニット５８の内部ブロック図である。
量子化ユニット５８は、ベクトル演算回路５２の出力データの要素ｉｎ（ｄｉ）に対して量子化を行う。量子化ユニット５８は、比較器５８ａと、エンコーダ５８ｂと、を有する。量子化ユニット５８はベクトル演算回路５２の出力データ（１６ビット／要素）に対して、量子化演算層２２０における量子化層２２４の演算（式６）を行う。量子化ユニット５８は、量子化パラメータメモリ５１から必要な量子化パラメータｑ（ｔｈ０，ｔｈ１，ｔｈ２）を読み出し、比較器５８ａにより入力ｉｎ（ｄｉ）と量子化パラメータｑとの比較を行う。量子化ユニット５８は、比較器５８ａによる比較結果をエンコーダ５８ｂにより２ビット／要素に量子化する。式４におけるα(c)とβ(c)は、変数ｃごとに異なるパラメータであるため、α(c)とβ(c)を反映する量子化パラメータｑ（ｔｈ０，ｔｈ１，ｔｈ２）はｉｎ（ｄｉ）ごとに異なるパラメータである。 FIG. 19 is an internal block diagram of the quantization unit 58.
The quantization unit 58 quantizes the element in(di) of the output data of the vector operation circuit 52. The quantization unit 58 has a comparator 58a and an encoder 58b. The quantization unit 58 performs the operation (Equation 6) of the quantization layer 224 in the quantization operation layer 220 on the output data (16 bits/element) of the vector operation circuit 52. The quantization unit 58 reads out the necessary quantization parameters q(th0, th1, th2) from the quantization parameter memory 51, and compares the input in(di) with the quantization parameter q by the comparator 58a. The quantization unit 58 quantizes the comparison result by the comparator 58a to 2 bits/element by the encoder 58b. Since α(c) and β(c) in Equation 4 are parameters that differ for each variable c, the quantization parameters q(th0, th1, th2) that reflect α(c) and β(c) are parameters that differ for each in(di).

量子化ユニット５８は、入力ｉｎ（ｄｉ）を３つの閾値ｔｈ０，ｔｈ１，ｔｈ２と比較することにより、入力ｉｎ（ｄｉ）を４領域（例えば、ｉｎ≦ｔｈ０，ｔｈ０＜ｉｎ≦ｔｈ１，ｔｈ１＜ｉｎ≦ｔｈ２，ｔｈ２＜ｉｎ）に分類し、分類結果を２ビットにエンコードして出力する。量子化ユニット５８は、量子化パラメータｑ（ｔｈ０，ｔｈ１，ｔｈ２）の設定により、量子化と併せてＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎや活性化関数の演算を行うこともできる。 The quantization unit 58 compares the input in(di) with three thresholds th0, th1, and th2 to classify it into one of four regions (e.g., in≦th0, th0<in≦th1, th1<in≦th2, th2<in), and encodes the classification result into two bits for output. By setting the quantization parameters q(th0, th1, th2) appropriately, the quantization unit 58 can also perform batch normalization and activation function calculations together with quantization.
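
The three-threshold classification can be sketched in Python as below. The text only states that the result is encoded into two bits; mapping the four regions to the codes 0 to 3 in ascending order is an assumption of this example, as is the function name `quantize_2bit`. The sketch also assumes th0 ≦ th1 ≦ th2.

```python
def quantize_2bit(in_di, th0, th1, th2):
    """Classify in(di) into four regions and return an assumed 2-bit code."""
    # Comparator 58a: one comparison per threshold.
    comparisons = (in_di > th0, in_di > th1, in_di > th2)
    # Encoder 58b (assumed mapping): the number of exceeded thresholds, 0..3.
    return sum(comparisons)


# Thresholds 0, 10, 20 split the input range into four regions.
print([quantize_2bit(x, 0, 10, 20) for x in (-5, 0, 1, 10, 11, 20, 21)])
# prints [0, 0, 1, 1, 2, 2, 3]
```

Inputs at or below th0 and inputs above th2 map to the lowest and highest codes respectively, which is the saturating behavior that lets the same comparison also act as an activation function.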

量子化ユニット５８は、閾値ｔｈ０を式４のβ(c)、閾値の差（ｔｈ１―ｔｈ０）および（ｔｈ２―ｔｈ１）を式４のα(c)として設定して量子化を行うことで、式４に示すＢａｔｃｈＮｏｒｍａｌｉｚａｔｉｏｎの演算を量子化と併せて実施できる。（ｔｈ１―ｔｈ０）および（ｔｈ２―ｔｈ１）を大きくすることでα(ｃ)を小さくできる。（ｔｈ１―ｔｈ０）および（ｔｈ２―ｔｈ１）を小さくすることで、α(c)を大きくできる。 The quantization unit 58 sets the threshold th0 as β(c) in Equation 4, and the threshold differences (th1-th0) and (th2-th1) as α(c) in Equation 4, and performs quantization, thereby enabling the batch normalization calculation shown in Equation 4 to be performed in conjunction with quantization. By increasing (th1-th0) and (th2-th1), α(c) can be reduced. By decreasing (th1-th0) and (th2-th1), α(c) can be increased.

量子化ユニット５８は、入力ｉｎ（ｄｉ）の量子化と併せて活性化関数を実施できる。例えば、量子化ユニット５８は、ｉｎ（ｄｉ）≦ｔｈ０およびｔｈ２＜ｉｎ（ｄｉ）となる領域では出力値を飽和させる。量子化ユニット５８は、出力が非線形とするように量子化パラメータｑを設定することで活性化関数の演算を量子化と併せて実施できる。 The quantization unit 58 can apply an activation function together with the quantization of the input in(di). For example, the quantization unit 58 saturates the output value in the regions where in(di)≦th0 or th2<in(di). By setting the quantization parameter q so that the output is nonlinear, the quantization unit 58 can perform the activation function calculation together with quantization.

ステートコントローラ５４は、ベクトル演算回路５２および量子化回路５３のステートを制御する。また、ステートコントローラ５４は、内部バスＩＢを介してコントローラ６と接続されている。ステートコントローラ５４は、命令キュー５５と制御回路５６とを有する。 The state controller 54 controls the states of the vector operation circuit 52 and the quantization circuit 53. The state controller 54 is also connected to the controller 6 via the internal bus IB. The state controller 54 has an instruction queue 55 and a control circuit 56.

命令キュー５５は、量子化演算回路５用の命令コマンドＣ５が格納されるキューであり、例えばＦＩＦＯメモリで構成される。命令キュー５５には、命令供給ユニット７経由または内部バスＩＢ経由で命令コマンドＣ５が書き込まれる。 The instruction queue 55 is a queue in which the instruction command C5 for the quantization calculation circuit 5 is stored, and is configured, for example, as a FIFO memory. The instruction command C5 is written to the instruction queue 55 via the instruction supply unit 7 or the internal bus IB.

制御回路５６は、命令コマンドＣ５をデコードし、命令コマンドＣ５に基づいてベクトル演算回路５２および量子化回路５３を制御するステートマシンである。制御回路５６は、ＤＭＡＣ３のステートコントローラ３２の制御回路３４と同様の構成である。 The control circuit 56 is a state machine that decodes the instruction command C5 and controls the vector calculation circuit 52 and the quantization circuit 53 based on the instruction command C5. The control circuit 56 has a configuration similar to that of the control circuit 34 of the state controller 32 of the DMAC3.

［量子化演算回路５：プリロードモードにおける動作］
ＮＮ回路１００の動作モードがプリロードモードであるとき、ステートコントローラ５４は動作しない。プリロードするデータに量子パラメータｑが含まれる場合、ＤＭＡＣ３は外部メモリ１２０から読み出した量子パラメータｑを量子化パラメータメモリ５１に書き込む。 [Quantization Calculation Circuit 5: Operation in Preload Mode]
When the operation mode of the NN circuit 100 is the preload mode, the state controller 54 does not operate. When the quantization parameter q is included in the data to be preloaded, the DMAC 3 writes the quantization parameter q read from the external memory 120 into the quantization parameter memory 51.

［量子化演算回路５：ノーマルモードにおける動作］
ＮＮ回路１００の動作モードがノーマルモードであるとき、命令供給ユニット７は、命令キュー５５に命令コマンドＣ５を供給する。ステートコントローラ５４は、命令キュー５５に格納された命令コマンドＣ５に基づいて、ベクトル演算回路５２および量子化回路５３を制御する。 [Quantization Calculation Circuit 5: Operation in Normal Mode]
When the operation mode of the NN circuit 100 is the normal mode, the instruction supply unit 7 supplies an instruction command C5 to the instruction queue 55. The state controller 54 controls the vector operation circuit 52 and the quantization circuit 53 based on the instruction command C5 stored in the instruction queue 55.

量子化演算回路５は、Ｂｄ個の要素を持つ量子化演算出力データを第一メモリ１に書き込む。なお、ＢｄとＢｃの好適な関係を式１０に示す。式１０においてｎは整数である。 The quantization calculation circuit 5 writes quantization calculation output data having Bd elements to the first memory 1. The preferred relationship between Bd and Bc is shown in Equation 10. In Equation 10, n is an integer.

［コントローラ６］
コントローラ６は、外部ホストＣＰＵ１１０から転送される命令コマンドを、内部バスＩＢを介して、ＤＭＡＣ３、畳み込み演算回路４および量子化演算回路５が有する命令キューに転送する。 [Controller 6]
The controller 6 transfers the instruction command transferred from the external host CPU 110 to the instruction queues of the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5 via the internal bus IB.

コントローラ６は、外部バスＥＢに接続されており、外部ホストＣＰＵ１１０のスレーブとして動作する。コントローラ６は、パラメータレジスタや状態レジスタを含むレジスタ６１を有している。パラメータレジスタは、ＮＮ回路１００の動作を制御するレジスタである。状態レジスタは、セマフォＳを含むＮＮ回路１００の状態を示すレジスタである。 The controller 6 is connected to the external bus EB and operates as a slave to the external host CPU 110. The controller 6 has registers 61 including a parameter register and a status register. The parameter register is a register that controls the operation of the NN circuit 100. The status register is a register that indicates the status of the NN circuit 100, including the semaphore S.

レジスタ６１は、プリロードモードの設定に関するパラメータレジスタ（以降、「プリロードモードレジスタ」ともいう）を有する。プリロードモードレジスタは、プリロード起動レジスタと、プリロードの転送元である外部メモリ１２０の先頭アドレスと、プリロードの転送先のアドレスと、プリロードするデータ転送量と、を含む。プリロードモードレジスタは、データ転送に関する設定値（転送元アドレスと転送先アドレスとデータ転送量との組み合わせ）を複数回分保持できる。プリロードモードレジスタが設定されることで、ＮＮ回路１００は動作モードをプリロードモードに変更して、プリロードモードの動作を開始する。プリロードモードの動作が完了すると、ＮＮ回路１００は動作モードをノーマルモードに変更して、ノーマルモードの動作を開始する。ＮＮ回路１００の動作モードは、プリロードモードの動作が完了してもノーマルモードに自動的に変更されず、レジスタ６１に含まれる動作モード設定レジスタが明示的に書き換えられることによって、プリロードモードからノーマルモードに変更されてもよい。 The register 61 has a parameter register for setting the preload mode (hereinafter also referred to as the "preload mode register"). The preload mode register includes a preload start register, the top address in the external memory 120 that is the preload transfer source, the address of the preload transfer destination, and the amount of data to be preloaded. The preload mode register can hold setting values for multiple data transfers (each a combination of source address, destination address, and data transfer amount). When the preload mode register is set, the NN circuit 100 changes the operation mode to the preload mode and starts the preload mode operation. When the preload mode operation is completed, the NN circuit 100 changes the operation mode to the normal mode and starts the normal mode operation. Alternatively, the operation mode of the NN circuit 100 need not change automatically to the normal mode when the preload mode operation is completed; it may instead be changed from the preload mode to the normal mode by explicitly rewriting an operation mode setting register included in the register 61.
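
One way to picture the preload mode register is as a list of transfer settings that are processed in order before the mode changes. The Python sketch below is a simplified model under several assumptions: the class and field names are hypothetical, the external memory 120 is a `bytes` object, all on-chip destinations are flattened into a single `bytearray`, and the automatic change to the normal mode is modeled by the return value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreloadTransfer:
    """One setting held by the preload mode register (hypothetical layout)."""
    source_address: int        # top address in the external memory
    destination_address: int   # address of the transfer destination
    size: int                  # amount of data to transfer, in bytes


def run_preload(transfers, external_memory, on_chip_memory):
    """Copy every configured block, then report the next operation mode."""
    for t in transfers:
        data = external_memory[t.source_address:t.source_address + t.size]
        if len(data) != t.size:
            raise ValueError("source range exceeds the external memory")
        end = t.destination_address + t.size
        if end > len(on_chip_memory):
            raise ValueError("destination range exceeds the on-chip memory")
        on_chip_memory[t.destination_address:end] = data
    return "normal"  # automatic mode change after the preload completes


external = bytes(range(16))
on_chip = bytearray(8)
mode = run_preload(
    [PreloadTransfer(4, 0, 2), PreloadTransfer(10, 6, 2)], external, on_chip
)
print(mode, list(on_chip))  # prints normal [4, 5, 0, 0, 0, 0, 10, 11]
```

In the variant where the mode is changed only by an explicit register write, the return value would stay at the preload mode until the host rewrites the operation mode setting register.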

レジスタ６１は、命令供給ユニット７による命令供給設定に関するパラメータレジスタ（以降、「命令供給レジスタ」ともいう）を有する。 Register 61 has a parameter register (hereinafter also referred to as the "instruction supply register") related to the instruction supply settings by the instruction supply unit 7.

［命令供給ユニット７］
図２０は、命令供給ユニット７のブロック図である。
命令供給ユニット（ISU : Instruction Supply Unit）７は、命令メモリ７１と、命令供給回路７２と、を有する。命令メモリ７１は、ＤＭＡＣ３と畳み込み演算回路４と量子化演算回路５の命令コマンドを格納する。命令供給回路７２は、命令メモリ７１に格納した命令コマンドをＤＭＡＣ３と畳み込み演算回路４と量子化演算回路５とに供給する。 [Instruction supply unit 7]
FIG. 20 is a block diagram of the instruction supply unit 7.
The instruction supply unit (ISU) 7 has an instruction memory 71 and an instruction supply circuit 72. The instruction memory 71 stores instruction commands for the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5. The instruction supply circuit 72 supplies the instruction commands stored in the instruction memory 71 to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5.

［命令供給ユニット７：プリロードモードにおける動作］
ＮＮ回路１００の動作モードがプリロードモードであるとき、命令供給回路７２は動作しない。ＤＭＡＣ３のデータ転送回路３１は、命令コマンドを命令メモリ７１に書き込む。 [Instruction supply unit 7: operation in preload mode]
When the operation mode of the NN circuit 100 is the preload mode, the instruction supply circuit 72 does not operate. The data transfer circuit 31 of the DMAC 3 writes an instruction command to the instruction memory 71.

［命令供給ユニット７：ノーマルモードにおける動作］
ＮＮ回路１００の動作モードがノーマルモードであるとき、命令供給回路７２は、命令メモリ７１に格納された命令コマンドをＤＭＡＣ３と畳み込み演算回路４と量子化演算回路５とに供給する。命令供給回路７２は、命令キュー３３，４５，５５のフルフラグなどの状態フラグに基づいて、命令キュー３３，４５，５５に対して対応する命令コマンドを随時供給する。 [Instruction supply unit 7: Operation in normal mode]
When the operation mode of the NN circuit 100 is the normal mode, the instruction supply circuit 72 supplies the instruction commands stored in the instruction memory 71 to the DMAC 3, the convolution operation circuit 4, and the quantization operation circuit 5. The instruction supply circuit 72 supplies the corresponding instruction commands to the instruction queues 33, 45, and 55 as needed, based on the status flags such as the full flags of the instruction queues 33, 45, and 55.
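
The flag-based supply can be modeled roughly as follows. This Python sketch assumes a fixed queue depth and that at most one instruction command is supplied to each queue per pass; neither detail is stated in the text, and the class and function names are hypothetical.

```python
from collections import deque


class InstructionQueue:
    """FIFO instruction queue with a full flag (hypothetical model)."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()

    @property
    def full(self):
        return len(self.fifo) >= self.depth


def supply_once(pending, queues):
    """Push at most one pending command into every queue that is not full."""
    supplied = []
    for name, commands in pending.items():
        queue = queues[name]
        if commands and not queue.full:
            command = commands.popleft()
            queue.fifo.append(command)
            supplied.append(command)
    return supplied


pending = {
    "C3": deque(["dma0", "dma1"]),   # commands for the DMAC 3
    "C4": deque(["conv0"]),          # commands for the convolution circuit 4
    "C5": deque(["quant0"]),         # commands for the quantization circuit 5
}
queues = {name: InstructionQueue(depth=1) for name in pending}
print(supply_once(pending, queues))  # prints ['dma0', 'conv0', 'quant0']
print(supply_once(pending, queues))  # prints [] because every queue is full
queues["C3"].fifo.popleft()          # the DMAC consumes its command
print(supply_once(pending, queues))  # prints ['dma1']
```

Checking the full flag before each push means a stalled unit never blocks supply to the other units in this model.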

図２１は、命令メモリ７１のメモリマップを示す図である。
命令供給レジスタには、ＤＭＡＣ３に供給する命令コマンドＣ３が格納されたメモリ領域の先頭アドレスと、畳み込み演算回路４に供給する命令コマンドＣ４が格納されたメモリ領域の先頭アドレスと、量子化演算回路５に供給する命令コマンドＣ５が格納されたメモリ領域の先頭アドレスと、が格納される。命令供給回路７２は、命令供給レジスタに格納された先頭アドレスに基づいて命令コマンドの供給を実行する。なお、ＮＮ演算マルチコア１０Ｍに２以上のＮＮ演算コア１０が含まれる場合は、ＮＮ演算コア１０の畳み込み演算回路４および量子化演算回路５ごとに先頭アドレスが設定される。 FIG. 21 is a diagram showing a memory map of the instruction memory 71.
The instruction supply register stores the top address of a memory area in which an instruction command C3 to be supplied to the DMAC3 is stored, the top address of a memory area in which an instruction command C4 to be supplied to the convolution operation circuit 4 is stored, and the top address of a memory area in which an instruction command C5 to be supplied to the quantization operation circuit 5 is stored. The instruction supply circuit 72 executes the supply of an instruction command based on the top address stored in the instruction supply register. Note that, when the NN operation multi-core 10M includes two or more NN operation cores 10, a top address is set for each of the convolution operation circuit 4 and the quantization operation circuit 5 of the NN operation core 10.

例えば、図２１に示すメモリマップのように命令コマンドが命令メモリ７１に格納されている場合、先頭アドレスがアドレスＡ１とアドレスＡ２とアドレスＡ３から始まるメモリ領域には、第一アプリケーションが使用する命令コマンドが格納されている。また、先頭アドレスがアドレスＡ４とアドレスＡ５とアドレスＡ６から始まるメモリ領域には、第一アプリケーションとは異なる第二アプリケーションが使用する命令コマンドが格納されている。第一アプリケーションを実行するとき、命令供給レジスタの先頭アドレスには、アドレスＡ１とアドレスＡ２とアドレスＡ３とが設定される。実行するアプリケーションを第一アプリケーションから第二アプリケーションに変更するとき、命令供給レジスタの先頭アドレスには、アドレスＡ４とアドレスＡ５とアドレスＡ６とが設定される。 For example, when instruction commands are stored in the instruction memory 71 as shown in the memory map of FIG. 21, the memory areas starting at addresses A1, A2, and A3 store the instruction commands used by a first application. The memory areas starting at addresses A4, A5, and A6 store the instruction commands used by a second application different from the first application. When the first application is executed, addresses A1, A2, and A3 are set as the top addresses in the instruction supply register. When the application to be executed is changed from the first application to the second application, addresses A4, A5, and A6 are set as the top addresses in the instruction supply register.
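
Switching applications by rewriting only the top addresses can be sketched as below. The numeric values given to A1 through A6, the dictionary layout, and the function name `select_application` are all hypothetical; the text defines only the symbolic addresses.

```python
# Hypothetical numeric values for the symbolic addresses A1..A6.
MEMORY_MAP = {
    "first": {"C3": 0x000, "C4": 0x100, "C5": 0x200},   # A1, A2, A3
    "second": {"C3": 0x300, "C4": 0x400, "C5": 0x500},  # A4, A5, A6
}


def select_application(instruction_supply_register, application):
    """Point the instruction supply register at one application's commands."""
    instruction_supply_register.update(MEMORY_MAP[application])
    return instruction_supply_register


register = {}
select_application(register, "first")    # run the first application
select_application(register, "second")   # switch without reloading commands
print(hex(register["C4"]))               # prints 0x400
```

Because both command sets already reside in the instruction memory 71, changing applications in this model requires no further transfer from the external memory.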

［ＮＮ回路１００の動作］
次に、ＮＮ回路１００の動作（ニューラルネットワーク演算方法）を、図２２に示すＮＮ回路１００の制御フローチャートに沿って説明する。ＮＮ回路１００は、初期化処理（ステップＳ１００）を実施した後、プリロードモードレジスタの設定に基づいてステップＳ１１０を実行する。 [Operation of NN circuit 100]
Next, the operation of the NN circuit 100 (neural network operation method) will be described with reference to the control flowchart of the NN circuit 100 shown in FIG. 22. After performing an initialization process (step S100), the NN circuit 100 executes step S110 based on the setting of the preload mode register.

＜ステップＳ１１０：プリロードモード（第一モード）＞
ＮＮ回路１００のＤＭＡＣ３は、ステップＳ１１０において、プリロードモードレジスタの設定に基づいてプリロードを実施する（第一工程）。具体的には、ＤＭＡＣ３のデータ転送回路３１は、プリロードモードレジスタの設定に基づいて、外部メモリ１２０から命令供給ユニット７に対する命令コマンドのＤＭＡ転送（プリロード）を実施する。プリロードするデータに重みｗが含まれる場合、データ転送回路３１は重みメモリ４１に対する重みｗのＤＭＡ転送（プリロード）を実施する。プリロードするデータに量子パラメータｑが含まれる場合、データ転送回路３１は量子化パラメータメモリ５１に対する量子パラメータｑのＤＭＡ転送（プリロード）を実施する。プリロードモードレジスタに設定されたプリロードが完了すると、ＮＮ回路１００は動作モードをノーマルモードに変更してステップＳ１２０を実行する。 <Step S110: Preload Mode (First Mode)>
In step S110, the DMAC 3 of the NN circuit 100 performs preloading based on the setting of the preload mode register (first step). Specifically, the data transfer circuit 31 of the DMAC 3 performs DMA transfer (preloading) of the instruction commands from the external memory 120 to the instruction supply unit 7 based on the setting of the preload mode register. If the data to be preloaded includes a weight w, the data transfer circuit 31 performs DMA transfer (preloading) of the weight w to the weight memory 41. If the data to be preloaded includes a quantization parameter q, the data transfer circuit 31 performs DMA transfer (preloading) of the quantization parameter q to the quantization parameter memory 51. When the preloading set in the preload mode register is completed, the NN circuit 100 changes the operation mode to the normal mode and executes step S120.

なお、ＣＮＮ２００に関する演算が大規模であると命令供給ユニット７にデータ転送する命令コマンドが多くなり、プリロードにおけるデータ転送期間や命令メモリ７１に必要なメモリ容量が増加する。この場合、ＮＮ回路１００のＤＭＡＣ３は、プリロードモードにおいて、命令コマンドを分割して命令メモリ７１に転送してもよい。例えば、命令メモリ７１が全ての命令コマンドを格納するのに十分なメモリ容量を有している場合、ＮＮ回路１００のＤＭＡＣ３は、全ての命令コマンドを命令メモリ７１に供給する。そうでない場合、ＮＮ回路１００のＤＭＡＣ３は、すべての命令コマンドの転送を行うのではなく、命令メモリ７１が格納可能な一部の命令コマンドのみを命令メモリ７１に供給してもよい。 When the calculations related to the CNN 200 are large-scale, the number of instruction commands to be transferred to the instruction supply unit 7 increases, and the data transfer period in preload and the memory capacity required for the instruction memory 71 increase. In this case, the DMAC3 of the NN circuit 100 may divide the instruction commands and transfer them to the instruction memory 71 in preload mode. For example, if the instruction memory 71 has sufficient memory capacity to store all the instruction commands, the DMAC3 of the NN circuit 100 supplies all the instruction commands to the instruction memory 71. If not, the DMAC3 of the NN circuit 100 may supply only a portion of the instruction commands that the instruction memory 71 can store to the instruction memory 71, rather than transferring all the instruction commands.
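
Dividing the instruction commands so that each portion fits in the instruction memory 71 can be illustrated with a short Python sketch. The capacity is expressed as a number of commands, which is a simplification, and the function name `split_commands` is hypothetical.

```python
def split_commands(commands, capacity):
    """Split a command list into portions that each fit the instruction memory."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return [commands[i:i + capacity] for i in range(0, len(commands), capacity)]


print(split_commands(["c0", "c1", "c2", "c3", "c4"], 2))
# prints [['c0', 'c1'], ['c2', 'c3'], ['c4']]
print(split_commands(["c0", "c1", "c2"], 8))
# prints [['c0', 'c1', 'c2']] because everything fits in one preload
```

Each portion would be transferred in its own preload, with the portion's commands executed before the next one is loaded.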

＜ステップＳ１２０：ノーマルモード（第二モード）＞
ＮＮ回路１００の命令供給ユニット７は、ステップＳ１２０において、命令供給レジスタの設定に基づいて命令コマンドの供給を実施する（第二工程）。具体的には、命令供給回路７２は、命令メモリ７１に格納された命令コマンドＣ３をＤＭＡＣ３の命令キュー３３に供給し、命令コマンドＣ４を畳み込み演算回路４の命令キュー４５に供給し、命令コマンドＣ５を量子化演算回路５の命令キュー５５に供給する。ＤＭＡＣ３と畳み込み演算回路４と量子化演算回路５とは、供給された命令コマンドに基づいて動作を開始する。命令メモリ７１に格納された命令コマンドが全てまたは一部が実行されたとき、ＮＮ回路１００は次にステップＳ１３０を実行する。 <Step S120: Normal Mode (Second Mode)>
In step S120, the instruction supply unit 7 of the NN circuit 100 supplies an instruction command based on the setting of the instruction supply register (second step). Specifically, the instruction supply circuit 72 supplies the instruction command C3 stored in the instruction memory 71 to the instruction queue 33 of the DMAC3, supplies the instruction command C4 to the instruction queue 45 of the convolution operation circuit 4, and supplies the instruction command C5 to the instruction queue 55 of the quantization operation circuit 5. The DMAC3, the convolution operation circuit 4, and the quantization operation circuit 5 start operating based on the supplied instruction commands. When all or part of the instruction commands stored in the instruction memory 71 have been executed, the NN circuit 100 next executes step S130.

＜ステップＳ１３０＞
ＮＮ回路１００は、ステップＳ１３０において、図２２に示す制御フローを終了するかを判定する。制御フローを終了しない場合、ＮＮ回路１００は、再度ステップＳ１２０を実行する。なお、ＮＮ回路１００は、ステップＳ１１０から制御フローを再開してもよい。制御を終了する場合、ＮＮ回路１００は、次にステップＳ１４０を実行して制御フローを終了する。 <Step S130>
In step S130, the NN circuit 100 determines whether to end the control flow shown in FIG. 22. If the control flow is not to be ended, the NN circuit 100 executes step S120 again. The NN circuit 100 may instead resume the control flow from step S110. If the control flow is to be ended, the NN circuit 100 next executes step S140 to end the control flow.
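
The overall flow from step S100 to step S140 can be summarized as a loop. In the Python sketch below the callbacks `preload`, `run_normal`, `should_finish`, and `needs_preload` are hypothetical stand-ins for the circuit's behavior, and the returned trace exists only to make the step order visible.

```python
def run_control_flow(preload, run_normal, should_finish,
                     needs_preload=lambda: False):
    """Walk through steps S100 to S140 and return the order of steps taken."""
    trace = ["S100"]              # initialization process
    trace.append("S110")          # preload mode (first mode)
    preload()
    while True:
        trace.append("S120")      # normal mode (second mode)
        run_normal()
        trace.append("S130")      # decide whether to end the control flow
        if should_finish():
            trace.append("S140")  # end of the control flow
            return trace
        if needs_preload():       # optionally resume from step S110
            trace.append("S110")
            preload()


answers = iter([False, True])     # continue once, then finish
print(run_control_flow(lambda: None, lambda: None, lambda: next(answers)))
# prints ['S100', 'S110', 'S120', 'S130', 'S120', 'S130', 'S140']
```

Returning to step S110 corresponds to the case where only part of the instruction commands fit in the instruction memory 71 and the next portion must be preloaded.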

本実施形態に係るニューラルネットワーク回路１００によれば、ＩｏＴ機器などの組み込み機器に組み込み可能なＮＮ回路１００を高性能に動作させることができる。動作モードがプリロードモードであるとき、外部メモリ１２０から命令コマンドが読み出される。動作モードがノーマルモードであるとき、外部メモリ１２０とＮＮ演算コア１０との間で、命令コマンド以外のデータ（入力データａ、重みｗ、量子化パラメータｑ、中間データ）のＤＭＡ転送が実行される。そのため、外部メモリ１２０からの命令コマンドの読み出しと、命令コマンド以外のデータのＤＭＡ転送とが、同時に発生しない。そのため、ＮＮ回路１００は、命令コマンドの読み出しと、命令コマンド以外のデータのＤＭＡ転送とが同時に発生することによる外部バスＥＢの帯域圧迫を防げる。 According to the neural network circuit 100 of this embodiment, the NN circuit 100, which can be incorporated into embedded devices such as IoT devices, can be operated with high performance. When the operation mode is the preload mode, the instruction commands are read from the external memory 120. When the operation mode is the normal mode, DMA transfer of data other than the instruction commands (input data a, weight w, quantization parameter q, intermediate data) is performed between the external memory 120 and the NN operation core 10. Therefore, the reading of instruction commands from the external memory 120 and the DMA transfer of data other than instruction commands do not occur simultaneously. Consequently, the NN circuit 100 can prevent the bandwidth of the external bus EB from being squeezed by simultaneous instruction command reads and DMA transfers of data other than instruction commands.

プリロードモードにおける命令メモリ７１への命令コマンドの書き込みと、ノーマルモードにおける命令メモリ７１からの命令コマンドの読み出しとは、同時に発生しない。そのため、命令メモリ７１は、２ポートＲＡＭで実装する必要はない。命令メモリ７１は、回路規模が２ポートＲＡＭよりも小さいシングルポートＲＡＭで実装できる。 The writing of instruction commands to the instruction memory 71 in the preload mode and the reading of instruction commands from the instruction memory 71 in the normal mode do not occur simultaneously. Therefore, the instruction memory 71 does not need to be implemented as a two-port RAM. The instruction memory 71 can be implemented as a single-port RAM, which has a smaller circuit scale than a two-port RAM.

プリロードするデータに重みｗが含まれる場合、動作モードがプリロードモードであるとき、外部メモリ１２０から重みｗが読み出される。そのため、動作モードがノーマルモードであるとき、外部メモリ１２０からの重みｗの読み出しを省略できる。その結果、動作モードがノーマルモードであるときにおける外部バスＥＢの帯域をさらに削減できる。また、重みメモリ４１も回路規模が２ポートＲＡＭよりも小さいシングルポートＲＡＭで実装できる。 If the data to be preloaded includes a weight w, the weight w is read from the external memory 120 when the operating mode is the preload mode. Therefore, when the operating mode is the normal mode, reading of the weight w from the external memory 120 can be omitted. As a result, the bandwidth of the external bus EB can be further reduced when the operating mode is the normal mode. In addition, the weight memory 41 can also be implemented using a single-port RAM, which has a smaller circuit size than a two-port RAM.

プリロードするデータに量子パラメータｑが含まれる場合、動作モードがプリロードモードであるとき、外部メモリ１２０から量子パラメータｑが読み出される。そのため、動作モードがノーマルモードであるとき、外部メモリ１２０からの量子パラメータｑの読み出しを省略できる。その結果、動作モードがノーマルモードであるときにおける外部バスＥＢの帯域をさらに削減できる。また、量子化パラメータメモリ５１も回路規模が２ポートＲＡＭよりも小さいシングルポートＲＡＭで実装できる。 If the data to be preloaded includes a quantization parameter q, the quantization parameter q is read from the external memory 120 when the operation mode is the preload mode. Therefore, when the operation mode is the normal mode, reading of the quantization parameter q from the external memory 120 can be omitted. As a result, the bandwidth used on the external bus EB in the normal mode can be further reduced. In addition, the quantization parameter memory 51 can also be implemented as a single-port RAM, which has a smaller circuit scale than a two-port RAM.

以上、本発明の第一実施形態について図面を参照して詳述したが、具体的な構成はこの実施形態に限られるものではなく、本発明の要旨を逸脱しない範囲の設計変更等も含まれる。また、上述の実施形態および変形例において示した構成要素は適宜に組み合わせて構成することが可能である。 The first embodiment of the present invention has been described above in detail with reference to the drawings, but the specific configuration is not limited to this embodiment, and design modifications and the like that do not deviate from the gist of the present invention are also included. In addition, the components shown in the above-mentioned embodiment and modified examples can be appropriately combined to form a configuration.

（変形例１）
上記実施形態の命令供給ユニット７は、一つの命令メモリ７１を有していたが、命令メモリ７１の態様はこれに限定されない。命令供給ユニット７は、複数の命令メモリを７１を有していていもよい。命令供給ユニット７は、例えば、命令コマンドＣ３を格納する第一命令メモリと、命令コマンドＣ４を格納する第二命令メモリと、命令コマンドＣ５を格納する第三命令メモリと、を有してもよい。 (Variation 1)
Although the instruction supply unit 7 in the above embodiment has one instruction memory 71, the form of the instruction memory 71 is not limited to this. The instruction supply unit 7 may have multiple instruction memories 71. The instruction supply unit 7 may have, for example, a first instruction memory that stores an instruction command C3, a second instruction memory that stores an instruction command C4, and a third instruction memory that stores an instruction command C5.

（変形例２）
上記実施形態において、第一メモリ１と第二メモリ２は別のメモリであったが、第一メモリ１と第二メモリ２の態様はこれに限定されない。第一メモリ１と第二メモリ２は、例えば、同一メモリにおける第一メモリ領域と第二メモリ領域であってもよい。 (Variation 2)
In the above embodiment, the first memory 1 and the second memory 2 are separate memories, but the aspects of the first memory 1 and the second memory 2 are not limited to this. The first memory 1 and the second memory 2 may be, for example, a first memory area and a second memory area in the same memory.

(Variation 3)
For example, the data input to the NN circuit 100 described in the above embodiment is not limited to a single format and may consist of still images, moving images, audio, text, numerical values, or any combination of these. The data input to the NN circuit 100 is not limited to measurement results from physical quantity measuring instruments that may be mounted on the edge device in which the NN circuit 100 is provided, such as optical sensors, thermometers, Global Positioning System (GPS) measuring instruments, angular velocity measuring instruments, and anemometers. Different kinds of information may also be combined, such as peripheral information received from peripheral devices via wired or wireless communication (for example, base station information, information on vehicles and ships, weather information, and congestion information), financial information, and personal information.

(Variation 4)
The edge device in which the NN circuit 100 is provided is assumed to be a battery-powered device or the like, such as a communication device (for example, a mobile phone), a smart device (for example, a personal computer), a digital camera, a game device, or a mobile device such as a robot product, but is not limited to these. Effects not found in prior examples can also be obtained by using the NN circuit 100 in products that have strict limits on the peak power that can be supplied, for example over Power over Ethernet (PoE), or that have strong requirements for reduced heat generation or long operating time. For example, applying the NN circuit 100 to in-vehicle cameras mounted on vehicles or ships, or to surveillance cameras installed in public facilities or on roads, not only enables long-duration shooting but also contributes to weight reduction and higher durability. Similar effects can be obtained by applying the NN circuit 100 to display devices such as televisions and displays, medical equipment such as medical cameras and surgical robots, and work robots used at manufacturing sites and construction sites.

(Variation 5)
Part or all of the NN circuit 100 may be realized using one or more processors. For example, the NN circuit 100 may realize part or all of the input layer or the output layer by software processing on a processor. The part of the input layer or output layer realized by software processing is, for example, data normalization or conversion. This makes it possible to support a variety of input and output formats. The software executed by the processor may be configured to be rewritable using communication means or external media.
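The kind of software-side normalization and conversion mentioned in this variation might look like the following minimal sketch. The function names, the mean and standard deviation, and the scale and zero point are hypothetical values chosen for illustration. The source does not specify any particular normalization or number format.

```python
def normalize(pixels, mean, std):
    """Map raw sensor values to zero-mean, unit-scale floats."""
    if std <= 0:
        raise ValueError("std must be positive")
    return [(p - mean) / std for p in pixels]


def to_uint8(values, scale, zero_point):
    """Convert floats to 8-bit codes, clamping to the range 0..255."""
    codes = []
    for v in values:
        code = round(v / scale) + zero_point
        codes.append(max(0, min(255, code)))
    return codes


raw = [0, 64, 128, 255]  # hypothetical 8-bit sensor readings
normalized = normalize(raw, mean=128.0, std=64.0)
codes = to_uint8(normalized, scale=0.0625, zero_point=128)
```

Keeping this step in software means a new input format only requires changing the parameters or the conversion function, which fits the statement that the software may be rewritable.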

(Variation 6)
The NN circuit 100 may realize part of the processing of the CNN 200 in combination with a Graphics Processing Unit (GPU) or the like on the cloud. By performing further processing on the cloud in addition to the processing performed on the edge device in which the NN circuit 100 is provided, or by performing processing on the edge device in addition to processing on the cloud, the NN circuit 100 can realize more complex processing with fewer resources. With such a configuration, the NN circuit 100 can reduce the amount of communication between the edge device and the cloud by distributing the processing.

The effects described in this specification are merely explanatory or illustrative and are not limiting. In other words, the technology according to the present disclosure may achieve, in addition to or in place of the above effects, other effects that are apparent to a person skilled in the art from the description in this specification.

The present invention can be applied to neural network calculations.

200 Convolutional neural network
100 Neural network circuit (NN circuit)
10 Neural network calculation core (NN calculation core)
10A First neural network calculation core (first NN calculation core)
10B Second neural network calculation core (second NN calculation core)
10M Neural network calculation multi-core (NN calculation multi-core)
1 First memory
2 Second memory
3 DMA controller (DMAC)
4 Convolution operation circuit
42 Multiplier
43 Accumulator circuit
5 Quantization operation circuit
52 Vector operation circuit
53 Quantization circuit
6 Controller
61 Register
7 Instruction supply unit
71 Instruction memory
72 Instruction supply circuit
C3 Instruction command (third instruction command)
C4 Instruction command (first instruction command)
C5 Instruction command (second instruction command)

Claims

1. A neural network circuit comprising:
a neural network operation core having a convolution operation circuit that executes a convolution operation based on a first instruction command;
an instruction providing unit that has an instruction memory storing the first instruction command and that provides the first instruction command to the convolution operation circuit; and
a DMA controller that transfers data between the neural network operation core and an external memory.

2. The neural network circuit of claim 1, wherein, when an operation mode of the neural network circuit is a first mode, the DMA controller transfers the first instruction command from the external memory to the instruction memory.

3. The neural network circuit of claim 2, wherein, when the operation mode of the neural network circuit is a second mode, the instruction providing unit supplies the first instruction command stored in the instruction memory to the convolution operation circuit.

4. The neural network circuit of claim 2, wherein:
the neural network operation core further includes a quantization operation circuit that performs a quantization operation on convolution operation output data of the convolution operation circuit based on a second instruction command;
the instruction memory stores the second instruction command; and
when the operation mode of the neural network circuit is the first mode, the DMA controller transfers the second instruction command from the external memory to the instruction memory.

5. The neural network circuit of claim 4, wherein, when the operation mode of the neural network circuit is a second mode, the instruction providing unit supplies the second instruction command stored in the instruction memory to the quantization operation circuit.

6. The neural network circuit of claim 2, wherein:
the DMA controller transfers the data based on a third instruction command;
the instruction memory stores the third instruction command; and
when the operation mode of the neural network circuit is the first mode, the DMA controller transfers the third instruction command from the external memory to the instruction memory.

7. The neural network circuit of claim 6, wherein, when the operation mode of the neural network circuit is a second mode, the instruction providing unit supplies the third instruction command stored in the instruction memory to the DMA controller.

8. The neural network circuit of claim 1, wherein the instruction memory is implemented as a single-port RAM.

9. The neural network circuit of claim 1, wherein the instruction memory stores the first instruction command to be used for a first application and the first instruction command to be used for a second application different from the first application.

10. The neural network circuit of claim 1, comprising a plurality of the neural network operation cores, wherein the plurality of neural network operation cores are connected to one another such that data can be input and output between them.

11. A neural network operation method using a convolution operation circuit that performs a convolution operation based on a first instruction command, and an instruction providing unit having an instruction memory that stores the first instruction command, the method comprising:
a first step of storing the first instruction command in the instruction memory; and
a second step of providing the first instruction command stored in the instruction memory to the convolution operation circuit after the first step is completed.