JPH05101031A - Neural network device

Info

Publication number: JPH05101031A
Application number: JP3261907A
Authority: JP
Inventors: Takeshi Nagabori; 剛長堀; Masanori Mizoguchi; 正典溝口
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1991-10-09
Filing date: 1991-10-09
Publication date: 1993-04-23

Abstract

PURPOSE: To accelerate the propagation operation and save memory capacity when local coupling is used, by inserting delay elements into a ring register bus and applying a mapping that concentrates the matrix elements with value 0 in specific columns. CONSTITUTION: Delay elements (variable-delay first-in/first-out memories) 50-52, 61, and 62 are inserted into ring register buses 1 and 2. For example, when the product of a transposed matrix and a vector is computed, the ring registers and the internal registers of delay elements 50-52 are used as accumulators. In this way, a mapping in which the matrix elements with value 0 are concentrated in specific columns can be performed. This reduces the number of operations required for the computation and saves capacity in memory devices 41-43, which hold the coupling loads. Furthermore, input data can be set on one of ring register buses 1 and 2 while a sum-of-products operation is being performed on the other. The time required for setting the input data can therefore be ignored.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a neural network device used for image recognition, speech recognition, and the like.

[0002]

[Prior Art] The ring register bus is known as a parallel architecture for performing product-sum operations, such as those in neural network backpropagation, at high speed, and is described in IEICE Technical Report NC89-76. The configuration and operation of a conventional ring register bus type neural network device are briefly described below.

[0003] FIG. 10 is a block diagram showing the configuration of a conventional ring register bus type neural network device. As shown in FIG. 10, the device consists of a plurality of ring registers 11 to 15, arithmetic units (PEs) 31 to 33, and storage devices 41 to 43. Ring registers 11 to 15 are registers with a transfer function; each is connected to its neighbors on both sides so that together they form ring register bus 1, which collects data from and supplies data to the arithmetic units. Arithmetic units 31 to 33, each containing a multiplier and an adder, are connected to some of ring registers 11 to 15. Storage devices 41 to 43 are connected to arithmetic units 31 to 33, respectively. Basically, one arithmetic unit is assigned to each neuron, and multiple layers are realized by time-division sequential execution in software. A ring register to which no arithmetic unit is connected can be replaced by a first-in first-out (FIFO) memory.

[0004] The operation of this architecture is described with reference to the drawings. FIG. 11 is a timing chart showing the operation when computing the matrix-vector product WX, and FIG. 12 is a timing chart showing the operation when computing the transposed-matrix-vector product W^T δ. Here the matrix W is a 4 × 3 matrix, and the vectors X and δ have four and three elements, respectively. The matrix W is mapped onto storage devices 41 to 43, which are connected to arithmetic units 31 to 33, so that each diagonal element of W comes first.

[0005] When calculating WX, as shown in FIG. 11, the elements of X are placed in ring registers 11 to 14 and move along ring register bus 1. A register inside processor i is used as accumulator acc_i. At time T₁, PE_i fetches X_i from ring registers 11 to 13 and w_{i,i} from storage devices 41 to 43, multiplies them, and adds the product to accumulator acc_i. Immediately afterwards, the input data X_i rotates counterclockwise by one ring register. At the next time T₂, PE_i fetches X_{i+1} from ring registers 11 to 13 and w_{i,i+1} from storage devices 41 to 43, multiplies them, and adds the product to the accumulator. Proceeding in the same way, after T₄, when the input data X_i has made one circuit of ring register bus 1, the product sum Σ w_ij X_j is obtained in acc_i. In general, for the product-sum operation of an m × n matrix, using n arithmetic units yields the product sum after T_m. When the number of arithmetic units is less than the number of rows, the input data X_i is rotated around ring register bus 1 two or more times. If the number of arithmetic units is only 1/k of the number of rows, k rotations suffice, so the product sum is obtained after T_{k·m}.
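The rotation scheme described above can be sketched as a short Python simulation. This is a minimal sketch under stated assumptions, not the circuit in FIG. 10: indices are 0-based, W is taken with one row per arithmetic unit (3 rows × 4 columns, so that WX is defined for a four-element X), PE i is assumed to sit at ring register i, and the function name `ring_matvec` is illustrative only.

```python
from collections import deque

def ring_matvec(W, x):
    """Simulate y = W x on a ring register bus with one PE per row of W."""
    rows, cols = len(W), len(x)
    ring = deque(x)                 # ring[i] is the value held in ring register i
    acc = [0] * rows                # the accumulators live inside the PEs
    for t in range(cols):           # one full circuit of the ring
        for i in range(rows):       # in hardware all PEs work in parallel
            j = (i + t) % cols      # memory i is read starting at its diagonal element
            acc[i] += W[i][j] * ring[i]
        ring.rotate(-1)             # shift every value to the neighbouring register
    return acc
```

After t shifts, ring register i holds X_{i+t}, which is why memory i can be read sequentially starting from w_{i,i}.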

[0006] When calculating W^T δ, the registers inside ring registers 11 to 14 are used as accumulators acc₁ to acc₄. As shown in FIG. 12, the accumulators are placed in ring registers 11 to 14 and moved along ring register bus 1. In this case as well, after T₄, when one circuit of ring register bus 1 has been completed for the input data δ_j, the product sum Σ w_ij δ_j is obtained in acc_i. As in the WX calculation, for the product-sum operation of an m × n matrix, using n arithmetic units yields the product sum after T_m, and when the number of arithmetic units is only 1/k of the number of rows, the product sum is obtained after T_km, following k rotations. A feature of this network device is that each element of W^T is stored in the same location as for the WX calculation.
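A companion sketch for the transposed product, under the same assumptions as before (0-based indices, one row of W per arithmetic unit, PE i at ring register i; `ring_matvec_transposed` is a hypothetical name). Here the partial sums circulate on the ring while each δ_i stays in its PE, and the weights are read from memory in exactly the same order as for WX.

```python
from collections import deque

def ring_matvec_transposed(W, delta):
    """Simulate z = W^T delta with the accumulators circulating on the ring."""
    rows, cols = len(W), len(W[0])
    ring = deque([0] * cols)        # the ring registers act as the accumulators
    for t in range(cols):
        for i in range(rows):
            j = (i + t) % cols      # same memory read order as in the W x case
            ring[i] += W[i][j] * delta[i]
        ring.rotate(-1)             # pass each partial sum to the next register
    return list(ring)               # after a full circuit, slot j holds output j
```

Because the read order of each memory is unchanged, no second copy of the weights is needed for the backward pass, which matches the feature noted above.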

[0007] Next, a second conventional example of the ring register type neural network device is described. As in the conventional example above, the network device is built from a plurality of ring registers, arithmetic units, and storage devices. Even if the matrix W is mapped onto the storage devices connected to the arithmetic units so that each diagonal element of the transposed matrix W^T comes first, both the matrix-vector product WX and the transposed-matrix-vector product W^T δ can be computed, as described in IEICE Technical Report NM88-134. However, the actual computation time when applied to backpropagation becomes longer. In backpropagation, five cycles of product-sum operations are performed per learning iteration, one of which operates on the transposed matrix. With the mapping shown in FIG. 10, matrix calculations that place the accumulators in the ring registers account for one cycle, whereas with the mapping in which each diagonal element of W^T comes first, they account for four cycles. An operation that places the accumulator in a ring register takes about three times as long per step as one that places the accumulator in the arithmetic unit, so a mapping that minimizes the number of operations with the accumulator in the ring register is advantageous.
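The comparison between the two mappings can be made concrete with a rough cost estimate. This sketch assumes that all five cycles contain the same number of steps and takes the "about three times" figure as exactly 3; the function and variable names are illustrative.

```python
def relative_cost(pe_cycles, ring_cycles, ring_penalty=3.0):
    """Cost of one learning iteration, in units of one accumulator-in-PE cycle."""
    return pe_cycles + ring_penalty * ring_cycles

# Mapping of FIG. 10: four cycles accumulate in the PEs, one in the ring registers.
cost_w_diagonal = relative_cost(pe_cycles=4, ring_cycles=1)

# Diagonal of W^T first: one cycle accumulates in the PEs, four in the ring registers.
cost_wt_diagonal = relative_cost(pe_cycles=1, ring_cycles=4)
```

Under these assumptions the costs are 7 and 13 units, so the second mapping would be a little under twice as slow per learning iteration.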

[0008] When images or similar data are handled as input, the network becomes a two-dimensional neural network that processes two-dimensional input data. When a ring register type neural network device is applied to a two-dimensional neural network, the input/output vectors and the connection weights are expanded into one dimension before use.

[0009]

[Problems to Be Solved by the Invention] For neural networks that handle images and similar data, it is known that a sufficient effect can be obtained not only with full connection, in which all neurons of adjacent layers are connected, but also with local connection, in which only neighboring neurons are connected. Using a locally connected network greatly reduces the amount of computation in the product-sum operation.

[0010] However, the conventional ring register type neural network device has the drawback that the computation time is not shortened much even when a locally connected network, which requires fewer product-sum operations, is used in place of a fully connected network.

[0011] FIG. 5 is a schematic diagram of a locally connected neural network that takes one-dimensional data as input. FIG. 5 shows the connections between input layer 5a and intermediate layer 6a; input layer 5a has N neurons and intermediate layer 6a has L neurons. Each neuron of intermediate layer 6a is connected to only n neurons of input layer 5a. This set of n neurons is called a local region. The overlap between adjacent local regions is also an important parameter describing the local connection, and is denoted v in FIG. 5.
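The connectivity just described can be written down as a 0/1 mask. The following is a minimal sketch that assumes the local regions tile the input layer exactly, so that N = n + (L − 1)(n − v); that relation is not stated in the source, although it is consistent with the first embodiment (N = 7, L = 3, n = 3, v = 1). The name `local_mask_1d` is hypothetical.

```python
def local_mask_1d(L, n, v):
    """Return (N, mask) where mask[i][j] is 1 if hidden neuron i sees input j."""
    stride = n - v                      # shift between adjacent local regions
    N = n + (L - 1) * stride            # assumed: regions tile the input exactly
    mask = [[1 if i * stride <= j < i * stride + n else 0 for j in range(N)]
            for i in range(L)]
    return N, mask

N, mask = local_mask_1d(L=3, n=3, v=1)
for row in mask:
    print(row)
```

Each row of the mask is a band of n ones shifted by n − v from the row above, which is the banded sparse structure of the connection weight matrix.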

[0012] FIG. 7 is a schematic diagram of the connection weight matrix W of the one-dimensional locally connected neural network shown in FIG. 5; this matrix is sparse. FIG. 8 is a mapping diagram showing how this matrix is mapped onto the storage devices of the arithmetic units. Here N is the number of neurons in input layer 5a, L is the number of neurons in intermediate layer 6a, and n is the size of the local region. In FIGS. 7 and 8, the hatched areas contain matrix elements with finite values representing connection weights, while the other areas contain no connections, that is, their matrix elements are always 0. In the ring register type neural network device, the product-sum operation is performed in parallel for each column of the mapped matrix. When the number of arithmetic units equals the number of rows of W, the minimum number of steps required for the product-sum operation is the quotient of the total number of nonzero matrix elements and the number of rows. With the mapping shown in FIG. 8, however, some columns mix matrix elements that are always 0 with matrix elements that have finite values, so the number of steps required for the product-sum operation greatly exceeds that quotient. In addition, the number of connection weight matrix elements mapped onto the storage devices greatly exceeds the number of actual connections, so the storage devices require a very large capacity.
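The gap between the actual and the minimum step count can be checked on a small case. This sketch assumes the conventional diagonal-first mapping with PE i at ring register i, and assumes that a mapped column containing only zeros can be skipped; both function names are illustrative.

```python
def steps_conventional(mask):
    """Count ring steps whose mapped column holds at least one nonzero weight."""
    rows, cols = len(mask), len(mask[0])
    return sum(
        any(mask[i][(i + t) % cols] for i in range(rows))
        for t in range(cols)
    )

def steps_minimum(mask):
    """Nonzero element count divided by the number of rows, rounded up."""
    nonzero = sum(map(sum, mask))
    return -(-nonzero // len(mask))

# 3 hidden neurons, 7 inputs, local region 3, overlap 1
mask = [[1, 1, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 1, 0, 0],
        [0, 0, 0, 0, 1, 1, 1]]
```

For this mask the conventional mapping needs 5 steps against a minimum of 3, and each of the 5 mapped columns stores 3 entries, so 15 memory slots are used for 9 actual weights.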

[0013] FIG. 6 is a schematic diagram of a neural network that takes two-dimensional data as input. In FIG. 6, input layer 5b has M × N neurons, intermediate layer 6b has K × L neurons, and each local region has m × n neurons. The overlap is u in the x-axis direction and v in the y-axis direction.

[0014] When the ring register architecture is applied to a two-dimensional neural network such as that of FIG. 6, the input/output vectors and connection weights are expanded into one dimension before use. FIG. 9 is a schematic diagram of the connection weight matrix W of a two-dimensional locally connected network after one-dimensional expansion. As FIG. 9 shows, this matrix resembles the matrix W of the one-dimensional locally connected neural network. That is, it is a sparse matrix whose elements are N × L matrices representing the connections in the x-axis direction of FIG. 6, arranged in the form of an M × K matrix representing the connections in the y-axis direction. In the two-dimensional case the zero-valued matrix elements are even more scattered than in the one-dimensional locally connected network, so the number of steps required for the product-sum operation exceeds the quotient of the total number of nonzero matrix elements and the number of rows by an even wider margin.

[0015] Furthermore, in the ring register type neural network device, input data is normally supplied from one end of the ring register bus while being rotated along the bus one step at a time. Setting the input data therefore takes a great deal of time.

[0016] An object of the present invention is to provide a neural network device that can execute the backpropagation operation of a locally connected neural network at very high speed and with a small storage capacity.

[0017]

[Means for Solving the Problems] To achieve the above object, in a neural network device comprising a ring register bus formed by connecting a plurality of ring registers having a transfer function in a ring, a plurality of arithmetic units connected, at least one each, to some of the ring registers, and a plurality of storage devices connected to the respective arithmetic units, the ring register bus includes the ring registers connected to the arithmetic units and delay elements inserted between those ring registers.

[0018] To achieve the above object, when the network to be computed is one-dimensionally locally connected, with N neurons in an arbitrary first layer, L neurons in the second layer that follows the first layer, n neurons in the local region of the first layer connected to one neuron of the second layer, and v neurons shared by adjacent local regions, the neural network device is configured with the delay of each delay element set to (n − v − 1)T. This is the product of the value obtained by subtracting 1 from the difference between the local region size n and the overlap v between adjacent local regions, and the time T required for data on the ring register bus to rotate one step along the bus.

[0019] To achieve the above object, when the network to be computed is two-dimensionally locally connected, with N neurons in the x-axis direction and M neurons in the y-axis direction in an arbitrary first layer, L neurons in the x-axis direction and K neurons in the y-axis direction in the second layer that follows the first layer, n neurons in the x-axis direction and m neurons in the y-axis direction in the local region of the first layer connected to one neuron of the second layer, and v neurons in the x-axis direction and u neurons in the y-axis direction shared by adjacent local regions, the neural network device is configured as follows. Ring registers to which arithmetic units are connected are joined, L at a time, alternately with (L − 1) first delay elements to form K ring register groups, and second delay elements are inserted between the ring register groups. The delay of each first delay element is set to (n − v − 1)T, the product of the value obtained by subtracting 1 from the difference between the x-axis size n of the local region and the x-axis overlap v between adjacent local regions, and the time T required for data on the ring register bus to rotate one step along the bus. The delay of each second delay element is set to {N(m − u − 1) + (n − v − 1)}T, the sum of two terms: the product of T and N times the value obtained by subtracting 1 from the difference between the y-axis size m of the local region and the y-axis overlap u between adjacent local regions, and the delay of the first delay elements.
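The two delay settings can be expressed as small helper functions, in units of the one-step rotation time T. This is a sketch only: the function names are hypothetical, and the two-dimensional argument values in the usage lines are illustrative numbers, not figures taken from the source.

```python
def first_delay(n, v):
    """Delay of a first (x-direction) delay element, in units of T."""
    return n - v - 1

def second_delay(N, n, v, m, u):
    """Delay of a second (between-group) delay element, in units of T."""
    return N * (m - u - 1) + first_delay(n, v)

# One-dimensional case with local region 3 and overlap 1: a delay of 1T
d1 = first_delay(n=3, v=1)

# Illustrative two-dimensional values (assumed for this example only)
d2 = second_delay(N=9, n=5, v=3, m=5, u=3)
```

In the one-dimensional case, n − v − 1 delay slots plus the ring register itself put adjacent arithmetic units n − v positions apart on the bus, which equals the shift between adjacent local regions.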

[0020] To achieve the above object, in a neural network device comprising a ring register bus formed by connecting a plurality of ring registers having a transfer function in a ring, a plurality of arithmetic units connected, at least one each, to some of the ring registers, and a plurality of storage devices connected to the respective arithmetic units, a plurality of such ring register buses are provided.

[0021] To achieve the above object, the data held in the ring registers can be transferred, at any rotation position on the ring register bus and for a whole ring register bus at once, to the respective ring registers of at least one other ring register bus.

[0022]

[Operation] Inserting delay elements into the ring register bus makes possible a mapping that gathers the zero-valued matrix elements into specific columns. This reduces the number of operations required for the computation. It also saves capacity in the storage devices that hold the connection weights.

[0023] While a product-sum operation is being performed on one ring register bus, input data can be set using another ring register bus. The time required to set the input data can therefore be ignored.

[0024]

[Embodiments] Embodiments of the present invention are described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a first embodiment of the present invention. This embodiment targets the product-sum operation of a locally connected network with 7 neurons in the input layer, 3 neurons in the intermediate layer, and a local region of size 3.

[0025] Ring register bus 1 and ring register bus 2 are formed by alternately arranging ring registers 11 to 13 with variable-delay first-in first-out (FIFO) memories 51 and 52, and ring registers 21 to 23 with variable-delay FIFO memories 61 and 62, respectively. Together with variable-delay FIFO memory 50, which also serves to supply input data, they form a double ring register bus. The delay of variable-delay FIFO memories 51, 52, 61, and 62 is set to 1T. This is the product of (n − v − 1), the value obtained by subtracting 1 from the difference between the local region size n and the overlap v between adjacent local regions, which here equals 1, and the time T required for data on ring register buses 1 and 2 to rotate one step along the bus. Arithmetic units 31 to 33 are connected to ring registers 11 to 13 on ring register bus 1, respectively, and storage devices 41 to 43, which store the connection weights w_ij, are connected to arithmetic units 31 to 33. Data rotating on ring register bus 2 can be transferred to ring register bus 1 all at once at any rotation position.

[0026] The operation when performing the following calculation is described with reference to FIGS. 3 and 4.

[0027]

[Equation 1]

[0028] FIG. 3 is a timing chart showing the operation when computing the matrix-vector product WX, and FIG. 4 is a timing chart showing the operation when computing the transposed-matrix-vector product W^T δ.

[0029] When calculating WX, as shown in FIG. 3, the elements of X move along ring register bus 1. Registers inside arithmetic units 31 to 33 are used as accumulators acc₁ to acc₃. At time T₁, arithmetic unit 31 fetches x₁ from ring register 11 and w₁₁ from storage device 41, multiplies them, and adds the product to accumulator acc₁. Similarly, arithmetic unit 32 fetches x₃ from ring register 12 and w₂₃ from storage device 42, multiplies them, and adds the product to accumulator acc₂, and arithmetic unit 33 fetches x₅ from ring register 13 and w₃₅ from storage device 43, multiplies them, and adds the product to accumulator acc₃. Immediately afterwards, the ring rotates counterclockwise by one ring register. At the next time T₂, arithmetic unit 31 fetches x₂ from ring register 11 and w₁₂ from storage device 41, multiplies them, and adds the product to accumulator acc₁. Proceeding in the same way, after T₃ the product sum Σ w_ij x_j is obtained in acc_i.
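The effect of the delay elements on this computation can be illustrated with a small simulation. It is a sketch under assumptions rather than a model of FIG. 1: the bus is represented as a flat sequence of N slots (ring registers and FIFO slots in bus order), arithmetic unit i is assumed to tap slot i·(n − v), indices are 0-based, and `delayed_ring_matvec` is a hypothetical name.

```python
from collections import deque

def delayed_ring_matvec(weights, x, n, v):
    """Simulate W x for a 1-D locally connected layer on a ring with delay elements.

    weights[i] holds only the n nonzero weights of hidden neuron i, in order.
    """
    stride = n - v                  # (n - v - 1) delay slots plus one ring register
    ring = deque(x)                 # ring registers and FIFO slots, in bus order
    acc = [0] * len(weights)
    for t in range(n):              # n steps instead of a full circuit
        for i, w_i in enumerate(weights):
            acc[i] += w_i[t] * ring[i * stride]   # PE i taps slot i * stride
        ring.rotate(-1)             # every value advances by one slot
    return acc
```

With N = 7, n = 3, and v = 1, the three arithmetic units see x₁, x₃, and x₅ at the first step, as in the text, and finish after three steps while storing only three weights each.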

[0030] When calculating W^T δ, the registers inside ring registers 11 to 14 and variable-delay first-in first-out memories 51 to 54 are used as accumulators acc₁ to acc₇. As shown in FIG. 4, accumulators acc₁ to acc₄ are moved along ring register bus 1. In this case as well, after T₃ the product sum Σ w_ij δ_j is obtained in acc_i.
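The transposed product on the delayed bus can be sketched in the same way. The assumptions match the forward case: the bus is modeled as N slots that all serve as accumulators, arithmetic unit i taps slot i·(n − v), indices are 0-based, and the function name is illustrative. The final `rotate(n)` is a modeling convenience that realigns slot j with output j; it is not a step described in the source.

```python
from collections import deque

def delayed_ring_matvec_transposed(weights, delta, n, v, N):
    """Simulate W^T delta with partial sums circulating through all N bus slots.

    weights[i] holds only the n nonzero weights of hidden neuron i, in order.
    """
    stride = n - v
    ring = deque([0] * N)           # every bus slot serves as an accumulator
    for t in range(n):
        for i, w_i in enumerate(weights):
            ring[i * stride] += w_i[t] * delta[i]
        ring.rotate(-1)
    ring.rotate(n)                  # undo the n shifts so slot j holds output j
    return list(ring)
```

As in the conventional scheme, the weights are read in the same order as for WX, so the compact storage of n weights per arithmetic unit serves both directions.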

[0031] While a product-sum operation is being performed on ring register bus 1, input data is loaded from first-in first-out memory 50 onto ring register bus 2 by rotating it along ring register bus 2. When the product-sum operation is complete and the next one is to begin, the input data on ring register bus 2 is transferred to ring register bus 1 all at once.

[0032] Next, a second embodiment is described. FIG. 2 is a block diagram showing the configuration of the second embodiment of the present invention. For two-dimensional local connection, a ring register bus corresponding to the connections in the x-axis direction of FIG. 6 is formed (without closing it into a ring), and M of these, where M is the number of dimensions of the input signal vector in the y-axis direction, are connected in cascade.

[0033] The delays of the first-in first-out memories are set as follows. For the first first-in first-out memories 511, 512, …, 5M(R−1) and 611, 612, …, 6M(R−1), the delay is set to (n − v − 1)T, the product of the value obtained by subtracting 1 from the difference between the x-axis size n of the local region and the x-axis overlap v between adjacent local regions, and the time T required for data on ring register bus 1 to rotate one step along the bus. For the second first-in first-out memories 51, 52, …, 5(M−1) and 61, 62, …, 6(M−1), the delay is set to {N(m − u − 1) + (n − v − 1)}T. This is the sum of two terms: the product of T and N times the value obtained by subtracting 1 from the difference between the y-axis size m of the local region and the y-axis overlap u between adjacent local regions, and the delay of the first first-in first-out memories 511, 512, …, 5M(R−1) and 611, 612, …, 6M(R−1).

[0034] The second embodiment was applied to alphanumeric character recognition on 9 × 11 pixel images. In this neural network, the input layer and the intermediate layer are both two-dimensional, with 9 × 11 and 3 × 4 neurons, respectively. The input layer and the intermediate layer are locally connected, with a local region of size 5 × 5. The intermediate layer and the output layer are fully connected, and the output layer has 5 neurons. The arithmetic units are Texas Instruments TMS320C30 32-bit floating-point DSPs. Each DSP has one adder and one multiplier and can execute a product-sum operation in one machine cycle. The internal RAM capacity is 1 k words × 2. The machine cycle and clock cycle are 60 ns and 30 ns, respectively. A conventional ring register type neural network device built with the same DSP has a learning speed of 15.4 MCUPS, whereas the second embodiment achieves 28.9 MCUPS, an increase of about 1.9 times.
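The figures quoted above can be checked with a few lines of arithmetic. The speedup uses only the two MCUPS values from the text; the connection counts additionally assume that every intermediate-layer neuron has a complete 5 × 5 local region, and the variable names are illustrative.

```python
conventional_mcups = 15.4
second_embodiment_mcups = 28.9
speedup = second_embodiment_mcups / conventional_mcups      # about 1.88

hidden = 3 * 4                                              # intermediate-layer neurons
local_connections = hidden * 5 * 5 + hidden * 5             # local input-hidden + full hidden-output
full_connections = hidden * 9 * 11 + hidden * 5             # if the input-hidden layer were fully connected

print(f"speedup: {speedup:.2f}x")
print(f"connections: {local_connections} local vs {full_connections} full")
```

Under these assumptions the locally connected network has 360 connections against 1248 for a fully connected one.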

[0035] The neural network device of the present invention is particularly effective for two-dimensional networks with little overlap between local regions and many intermediate-layer neurons, where a speedup of 10 times or more and a reduction in storage capacity are possible. The present invention is therefore particularly effective not only for ordinary three-layer neural networks but also for multistage networks such as the neocognitron, described in Biological Cybernetics, Vol. 36 (1980), pp. 193 to 202, when the overlap between local regions is small.

【００３６】[0036]

【発明の効果】以上述べてきたように、本発明によれ
ば、局所結合ニューラルネットワークのバックプロパゲ
ーション演算を極めて高速で、かつ、少ない記憶装置の
記憶容量で実行可能なニューラルネットワーク装置を提
供することが可能となり、極めて有効である。As described above, the present invention makes it possible to provide a neural network device that can execute the backpropagation operations of a locally connected neural network at extremely high speed and with a small storage capacity, which is extremely effective.

[Brief description of drawings]

【図１】第１の実施例の構成を示すブロック図である。FIG. 1 is a block diagram showing a configuration of a first embodiment.

【図２】第２の実施例の構成を示すブロック図である。FIG. 2 is a block diagram showing a configuration of a second exemplary embodiment.

【図３】第１の実施例において、行列とベクトルの積Ｗ
Ｘの演算の際の動作を示すタイミング図である。FIG. 3 is a timing chart showing the operation when computing the product WX of a matrix and a vector in the first embodiment.

【図４】第１の実施例において、転置行列とベクトルの
積Ｗ^Tδの演算の際の動作を示すタイミング図である。FIG. 4 is a timing chart showing the operation when computing the product W^T δ of a transposed matrix and a vector in the first embodiment.

【図５】１次元局所結合ネットワークの模式図である。FIG. 5 is a schematic diagram of a one-dimensional locally connected network.

【図６】２次元局所結合ネットワークの模式図である。FIG. 6 is a schematic diagram of a two-dimensional locally connected network.

【図７】１次元局所結合ニューラルネットワークの結合
荷重行列Ｗの模式図である。FIG. 7 is a schematic diagram of a connection weight matrix W of a one-dimensional locally connected neural network.

【図８】１次元局所結合ニューラルネットワークの結合
荷重行列Ｗの、記憶装置へのマッピング状態を示す、マ
ッピング図である。FIG. 8 is a mapping diagram showing a mapping state of a connection weight matrix W of a one-dimensional locally connected neural network onto a storage device.

【図９】１次元展開された２次元局所結合ニューラルネ
ットワークの結合荷重行列Ｗの模式図である。FIG. 9 is a schematic diagram of the connection weight matrix W of a two-dimensional locally connected neural network expanded into one dimension.

【図１０】従来例の構成を示すブロック図である。FIG. 10 is a block diagram showing a configuration of a conventional example.

【図１１】従来例において、行列とベクトルの積ＷＸの
演算の際の動作を示すタイミング図である。FIG. 11 is a timing chart showing an operation when a product WX of a matrix and a vector is calculated in a conventional example.

【図１２】従来例において、転置行列とベクトルの積Ｗ
^Tδの演算の際の動作を示すタイミング図である。FIG. 12 is a timing chart showing the operation when computing the product W^T δ of a transposed matrix and a vector in the conventional example.

[Explanation of symbols]

１，２リングレジスタバス５ａ，５ｂ入力層６ａ，６ｂ中間層１１〜１５，１１１〜１１１ＭＲ，２１〜２３，２１１
〜２ＭＲリングレジスタ３１〜３３，３１１〜３ＭＲ演算装置４１〜４３，４１１〜４ＭＲ記憶装置５０，５１〜５（Ｍ−１），６１〜６（Ｍ−１），５１
１〜５Ｍ（Ｒ−１），６１１〜６Ｍ（Ｒ−１）可変遅
延ファースト・イン・ファースト・アウト・メモリ
1, 2 Ring register bus
5a, 5b Input layer
6a, 6b Intermediate layer
11-15, 111-111MR, 21-23, 211-2MR Ring register
31-33, 311-3MR Arithmetic unit
41-43, 411-4MR Storage device
50, 51-5(M-1), 61-6(M-1), 511-5M(R-1), 611-6M(R-1) Variable-delay first-in first-out memory

Claims

[Claims]

1. A neural network device comprising: a ring register bus configured by connecting, in a ring, a plurality of ring registers each having a transfer function; a plurality of arithmetic units each connected to at least one of the ring registers; and a plurality of storage devices connected to the respective arithmetic units, wherein a delay element is inserted on the ring register bus between one ring register connected to an arithmetic unit and another ring register connected to an arithmetic unit.

2. The neural network device according to claim 1, wherein the network to be computed is one-dimensionally locally connected, and, where N is the number of neurons in an arbitrary first layer, L is the number of neurons in the second layer following the first layer, n is the number of neurons in the local region of the first layer connected to one neuron of the second layer, and v is the number of neurons overlapping between adjacent local regions, the delay amount of the delay element is the product of the value obtained by subtracting 1 from the difference between the local region size n and the overlap v between adjacent local regions, and the time T required for data placed on the ring register bus to rotate one step on the ring register bus, namely (n−v−1)T.
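As an illustration of the delay amount recited in claim 2, the following Python sketch evaluates (n−v−1)T for a few values. The function names and the sample values are hypothetical and chosen only for illustration; T is treated as an arbitrary time unit, since the claim does not tie it to a particular clock.

```python
def delay_steps_1d(n, v):
    """Delay in rotation steps for one-dimensional local connection: n - v - 1.

    n: size of the local region; v: overlap between adjacent local regions.
    The range check is an added safeguard, not part of the claim.
    """
    if not 0 <= v < n:
        raise ValueError("overlap v must satisfy 0 <= v < n")
    return n - v - 1


def delay_time(steps, T):
    """Delay amount as a time: number of steps times the one-step rotation time T."""
    return steps * T


# Hypothetical sample values for illustration only.
for n, v in [(5, 3), (5, 0), (3, 2)]:
    steps = delay_steps_1d(n, v)
    print(f"n={n}, v={v}: {steps} step(s), delay = {delay_time(steps, 1.0)} T")
```

With n = 5 and v = 3 the delay is one step; with no overlap (v = 0) it grows to n − 1 = 4 steps; and when adjacent regions differ by a single neuron (v = n − 1) the delay is zero.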

3. The neural network device according to claim 1, wherein the network to be computed is two-dimensionally locally connected, and, where N is the number of neurons in the x-axis direction and M is the number of neurons in the y-axis direction of an arbitrary first layer, L is the number of neurons in the x-axis direction and K is the number of neurons in the y-axis direction of the second layer following the first layer, n is the number of neurons in the x-axis direction and m is the number of neurons in the y-axis direction in the local region of the first layer connected to one neuron of the second layer, and v is the number of neurons in the x-axis direction and u is the number of neurons in the y-axis direction overlapping between adjacent local regions, L ring registers connected to the arithmetic units and (L−1) first delay elements are connected alternately to form each of K ring register groups, and second delay elements are inserted between the ring register groups, the delay amount of the first delay element is the product of the value obtained by subtracting 1 from the difference between the size n of the local region in the x-axis direction and the overlap v in the x-axis direction between adjacent local regions, and the time T required for data placed on the ring register bus to rotate one step on the ring register bus, namely (n−v−1)T, and the delay amount of the second delay element is the sum of the product of N, the value obtained by subtracting 1 from the difference between the size m of the local region in the y-axis direction and the overlap u in the y-axis direction between adjacent local regions, and the time T, and the delay amount of the first delay element, namely {N(m−u−1)+(n−v−1)}T.
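The two delay amounts recited in claim 3 can be sketched as follows, reading the first delay as (n minus v minus 1) steps and the second delay as N times (m minus u minus 1) steps plus the first delay. This is a minimal Python illustration of the formulas only; the function names and sample values are hypothetical, and the sketch does not model the ring register bus itself.

```python
def first_delay_steps(n, v):
    """First delay element, in rotation steps: n - v - 1.

    n, v: local region size and overlap in the x-axis direction.
    """
    if not 0 <= v < n:
        raise ValueError("overlap v must satisfy 0 <= v < n")
    return n - v - 1


def second_delay_steps(N, n, v, m, u):
    """Second delay element, in rotation steps: N * (m - u - 1) + (n - v - 1).

    N: first-layer neurons in the x-axis direction.
    m, u: local region size and overlap in the y-axis direction.
    """
    if not 0 <= u < m:
        raise ValueError("overlap u must satisfy 0 <= u < m")
    return N * (m - u - 1) + first_delay_steps(n, v)


# Hypothetical sample values for illustration only.
N, n, v, m, u = 9, 5, 3, 5, 3
print("first delay :", first_delay_steps(n, v), "x T")
print("second delay:", second_delay_steps(N, n, v, m, u), "x T")
```

For these sample values the first delay is 1 step and the second delay is 9 × 1 + 1 = 10 steps; multiplying by the one-step rotation time T gives the delay amounts in time.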

4. A neural network device comprising: a ring register bus configured by connecting, in a ring, a plurality of ring registers each having a transfer function; a plurality of arithmetic units each connected to at least one of the ring registers; and a plurality of storage devices connected to the respective arithmetic units, wherein the neural network device has a plurality of the ring register buses.

5. The neural network device according to claim 4, wherein, at any rotation position on a ring register bus, the data held in the ring registers can be transferred collectively, for each ring register bus, to the respective ring registers of at least one other ring register bus.