JPH03105584A

JPH03105584A - Parallel data processing system

Info

Publication number: JPH03105584A
Application number: JP1243972A
Authority: JP
Inventors: Hideki Kato; 英樹加藤; Hideki Yoshizawa; 英樹吉沢; Hiromoto Ichiki; 宏基市來; Kazuo Asakawa; 浅川　和雄
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1989-09-20
Filing date: 1989-09-20
Publication date: 1991-05-02
Anticipated expiration: 2013-11-18
Also published as: JP2825133B2

Abstract

(57)【要約】本公報は電子出願前の出願データであるた
め要約のデータは記録されません。(57) [Summary] This bulletin contains application data before electronic filing, so abstract data is not recorded.

Description

【発明の詳細な説明】〔概　　　要〕複数個のデータ処理ユニットを同期的に用いてデータを
処理する並列データ処理方式に関し、リングシストリッ
クアレイ方式や共通バス結合型Ｓ　Ｉ　Ｍ　Ｄ　（Ｓｉ
ｎｇｌｅ　Ｉｎｓｔｒｕｃｔｉｏｎ　Ｍｕｌｔｉ　Ｄａ
ｔａ　）結合方式と同程度なハードウェア構成で、デー
タ転送によるオーバヘッドを減少せしめ、特に、長方形
行列とベクトルとの積を求めるような処理に対しても、
本来の並列度を最大限利用できるようにして良好な台数
効果を得ることにより、行列演算あるいはニューロコン
ピュータ演算を行うことをことを目的とし、各々少なくとも一つの入力を持つ複数個のデータ処理ユ
ニットと、各々第１の入力及び出力を持ちかつ各々デー
タ保持及びデータ転送を行う複数個のトレイであって、
前記トレイの全部又はその一部が各々前記データ処理ユ
ニットの第１の入力に接続された第２の出力を有するも
のと、前記接続するトレイの第１の入力及び出力が接続
されて成るシフト手段とを具備し、前記シフト手段上の
データ転送と、前記トレイと前記データ処理ユニット間
のデータ転送と、前記データ処理ユニットによるデータ
処理とを同期して行うことにより、行列演算あるいはニ
ューロコンピュータ演算を行うように構成する。[Detailed Description of the Invention] [Summary] Regarding parallel data processing methods that process data using a plurality of data processing units synchronously, ring systolic array methods and common bus coupled SIMD (Si
ngle Instruction Multi Da
ta) With the same hardware configuration as the combination method, it reduces the overhead due to data transfer, and is especially suitable for processing such as calculating the product of a rectangular matrix and a vector.
The purpose is to perform matrix operations or neurocomputer operations by making maximum use of the inherent degree of parallelism and obtaining a favorable number effect. , a plurality of trays each having a first input and an output and each holding and transmitting data,
Shifting means comprising: all or part of said trays each having a second output connected to a first input of said data processing unit; and said first input and output of said connected trays being connected. By synchronously performing data transfer on the shifting means, data transfer between the tray and the data processing unit, and data processing by the data processing unit, matrix calculations or neurocomputer calculations are performed. Configure it to do so.

[Industrial application field]

本発明は並列データ処理方式に係り、更に詳しくは、複
数個のデータ処理ユニットを同期的に用いてデータを処
理する並列データ処理方式に関する。The present invention relates to a parallel data processing system, and more particularly to a parallel data processing system that processes data using a plurality of data processing units synchronously.

近年、電子計算機或いはデジタル信号処理装置等のシス
テムにおいて、データ処理の適用分野の拡大に伴い、処
理されるデータの量が膨大になり、特に画像処理或いは
音声処理等の分野では高速なデータ処理を行う必要があ
り、そのため、複数個のデータ処理ユニットを同期的に
用いてデータを処理するデータ処理の並列性の利用が重
要となる。In recent years, with the expansion of the fields of application of data processing in systems such as electronic computers and digital signal processing devices, the amount of data being processed has become enormous.Especially in fields such as image processing and audio processing, high-speed data processing is required. Therefore, it is important to utilize parallelism in data processing, in which multiple data processing units are used synchronously to process data.

−ｉに、複数の処理ユニットを用いた処理において重要
な概念に台数効果がある。これは用意されたデータ処理
ユニットの台数に比例したデータ処理速度の向上が得ら
れることを意味するが、並列処理方式においては良好な
台数効果を得ることが非常に重要となる。-i has a number-of-units effect, which is an important concept in processing using a plurality of processing units. This means that the data processing speed can be improved in proportion to the number of data processing units prepared, but in parallel processing systems, it is very important to obtain a favorable number effect.

台数効果が悪化する主要な原因は、問題そのものの並列
度による限界を別にすれば、データ処理に伴うデータ転
送に要する時間が本来のデータ処理に要する時間に加算
されてトータルとしての処理時間が引き延ばされること
にある。従って、台数効果の向上にはデータ伝送路の容
量をフルに活用することが有効であるが、これはなかな
か難しいしかし、処理が規則的な場合には、この規則性を利用し
て台数効果を上げることが可能となる。The main reason for the deterioration of the number-of-units effect is that, apart from the limitation due to the parallelism of the problem itself, the time required for data transfer associated with data processing is added to the time required for original data processing, which reduces the total processing time. It lies in being postponed. Therefore, it is effective to make full use of the capacity of the data transmission path to improve the number of devices, but this is difficult. However, if the processing is regular, this regularity can be used to improve the number of devices. It is possible to raise it.

データをシストリックアレイ、すなわち、巡回的にデー
タを流し、２つのデータがその流れにおいてそろったと
ころで演算を行うようにする。処理が規則的なことを利
用する並列処理がシストリンクアレイ方式であり、この
中でリングシストリンクアレイ方式と呼ばれる１次元の
シストリックアレイ方式は、複数個のデータ処理ユニッ
トを同期的に用いてシストリックなデータを処理する並
列データ処理方式であって実現が比較的容易である。The data is arranged in a systolic array, that is, data is flowed cyclically, and an operation is performed when two pieces of data are aligned in the flow. Parallel processing that takes advantage of the regularity of processing is the systolic array method. Among these, the one-dimensional systolic array method called the ring cyst link array method uses multiple data processing units synchronously. It is a parallel data processing method that processes systolic data and is relatively easy to implement.

規則性のある処理として、ベクトルの内積演算を基本と
した行列演算や、ニューラルネットの積和演算に非線形
関数を介して出力する並列処理がある。Examples of regular processing include matrix operations based on inner product operations of vectors and parallel processing that outputs output via nonlinear functions in neural network product-sum operations.

[Conventional technology]

第１１図（Ａ）は従来の共通バス結合型並列方式の原理
構成図である。同図において９１はプロセッサエレメン
ト、４はメモリ、９３は共通バス、９２は共通バスに接
続されるバス、９４は各プロセッサエレメントと、それ
に対応して接続されるメモリ４を接続する内部バスであ
る。この共通バス結合型並列方式においては、プロセッ
サエレメント（以下ＰＥと称す）間の通信が共通バス９
３を介して行われる。特定な時間区域には共通バス番こ
乗せるデータは１つであるため、共通バスによ？通信は
共通バス全体にわたって同期をとる必要がある。FIG. 11(A) is a diagram showing the principle configuration of a conventional common bus coupling type parallel system. In the figure, 91 is a processor element, 4 is a memory, 93 is a common bus, 92 is a bus connected to the common bus, and 94 is an internal bus that connects each processor element and the memory 4 connected to it. . In this common bus-coupled parallel system, communication between processor elements (hereinafter referred to as PEs) is carried out via a common bus.
3. Since there is only one common bus number for a specific time zone, it is difficult to use the common bus. Communication must be synchronized across the common bus.

第１１図（Ｂ）はこの共通バス結合型並列方式による行
列ベクトル積の動作フローチャートである。各ＰＥは他
のＰＥからのデータＸと内部レジスタのＹとをかけ、そ
の積をＹに足しこむ動作を行う。そのためフローチャー
トに示すように、ｉ番目のＰＥに関して、その内部にあ
るレジスタの内容、すなわち、Ｙ■の値をまず０にする
。そして以下をｎ回繰り返す。すなわち、共通バス９３
にＸ，を与えるとｉ番目のＰＥは共通バスに接続された
バス９２からの入力とメモリ４から内部バス９４を介し
て与えられる入力を掛け合わせ、そ′の積をＹｉに足し
込む。これを繰り返す。FIG. 11(B) is an operational flowchart of matrix-vector product using this common bus-coupled parallel method. Each PE multiplies data X from other PEs by Y in the internal register and adds the product to Y. Therefore, as shown in the flowchart, for the i-th PE, the contents of the internal register, that is, the value of Y■, are first set to 0. Then, repeat the following n times. That is, the common bus 93
When X is given to , the i-th PE multiplies the input from the bus 92 connected to the common bus by the input given from the memory 4 via the internal bus 94, and adds the product to Yi. Repeat this.

第１２図（Ａ）は従来のリングシストリック方式の原理
説明図である。同図において２０はプロセッサエレメン
ト（ＰＥ）である。各ＰＲは巡回バス２２によって接続
されている。また、２１は係数ＷｉＪを格納するメモリ
である。Ｗ，，，Ｗ，２・・・．Ｗ３３などは係数行列
の要素であり、一般にＷｉｊは行列のｉｊ成分である。FIG. 12(A) is a diagram explaining the principle of the conventional ring systolic system. In the figure, 20 is a processor element (PE). Each PR is connected by a circular bus 22. Further, 21 is a memory that stores the coefficient WiJ. W,,,W,2... W33, etc. are elements of a coefficient matrix, and Wij is generally an ij component of the matrix.

この係数行列Ｗと、ベクトルｘ＝　（ＸＩ　，Ｘ２　，
Ｘ３　）を掛ける動作をこのリングシストリック方式で
行う場合、次のようにして行われる。This coefficient matrix W and vector x= (XI , X2 ,
When the operation of multiplying by X3) is performed using this ring systolic method, it is performed as follows.

第１２図（Ｂ）はプロセッサエレメントの第ｉ番目の内
部構造である。同図において２３は乗算器、２４は加算
器、２５はアキュムレータ（ＡＣＣ）、２１は係数の要
素Ｗｉｊを格納するレジスタ群である。このレジスタ群
はいわゆるＦＩＦ○であって、係数行列の第ｉ行目に関
する係数としてＷｉＪ、すなわちｊ番目の列の要素が出
力されようとしている状態である。このＦＩＦ○は出力
された次のクロックでは巡回し、バス２２を介して後ろ
側からまた入力される。従って図に示すように、Ｗ■，
・・・，　Ｗｉ　Ｊ−１　はすでに巡回されて後側に格
納されている状態となっている。FIG. 12(B) shows the i-th internal structure of the processor element. In the figure, 23 is a multiplier, 24 is an adder, 25 is an accumulator (ACC), and 21 is a register group for storing coefficient elements Wij. This register group is a so-called FIF○, and WiJ, that is, the element in the j-th column is about to be output as the coefficient for the i-th row of the coefficient matrix. This FIF○ circulates at the next clock output and is input again from the rear side via the bus 22. Therefore, as shown in the figure, W■,
..., Wi J-1 has already been circulated and stored at the rear.

一方、ベクトルの各要素はバス２２を介して入力される
。現在、要素Ｘ，が入力されている状態である。すでに
アキュムレータ２５にはＷ，，ＸＸ，十・・・＋Ｗ＋　
Ｊ−Ｉ　ＸＸＪ−１の内積結果が格納されている。これ
が今アキュムレータ２５から出力され、加算器２４の一
方の入力に入力されている。On the other hand, each element of the vector is input via the bus 22. Currently, element X is being input. W,, XX, ten...+W+ are already in the accumulator 25.
The inner product result of J-I XXJ-1 is stored. This is now output from the accumulator 25 and input to one input of the adder 24.

外部からのｘｊとＦＩＦ○から出力されるＷｉｊの積が
乗算器２３によって乗算され、その結果が加算器２４の
他方の入力に入力され、現在のアキュムレータ２５の内
容とが加えられ、次のクロックで同じアキュムレータ２
５に加算される。この繰り返しによって、係数行列Ｗの
第ｉ行目の行ベクトルと外部から与えらるＸベクトルと
の内積演算がＷ行される，なお、スイッチ（Ｓｗｉｔｃ
ｈ）はデータＸｉをスルーに外部に出すか、あるいは内
部に取り込み、アキュムレータ２５にセットする場合と
の選択を行うためのものである。このようなＰＥで、行
列×ベクトルの積を行う場合、第１２図（Ａ）に示すよ
うに、ＰＥ−１はまず、ＷｌｌとＸを掛け、次のクロツ
ク周期に、Ｘ２が右側のＰＥ２から流れ込み、Ｗ＋２が
メモリ２１から出力されるので、Ｗ１２×Ｘ２が演算さ
れる。同様に次のクロックではｗ＋ｉとｘ３との積が実
行され、このことにより係数行列の第１列目とベクトル
鬼との？がＰＥ−１において可能となる。また、第２列
目とベクトルとの積はＰＥ−２において行われる。The product of xj from the outside and Wij output from FIF○ is multiplied by the multiplier 23, the result is input to the other input of the adder 24, and the current contents of the accumulator 25 are added to the next clock. Same accumulator 2 with
It is added to 5. By repeating this, the inner product calculation of the i-th row vector of the coefficient matrix W and the X vector given from the outside is performed in W rows.
h) is for selecting whether to output the data Xi to the outside or take it inside and set it in the accumulator 25. When performing a matrix×vector product in such a PE, as shown in FIG. 12(A), PE-1 first multiplies Wll and X, and in the next clock period, Since W+2 is output from the memory 21, W12×X2 is calculated. Similarly, in the next clock, the product of w+i and x3 is executed, which causes the first column of the coefficient matrix to be ? is possible in PE-1. Also, the product of the second column and the vector is performed in PE-2.

すなわち、Ｗ２２とＸ２を掛け、次のクロック周期に、
Ｗ２ｆｆとＸ３を掛け、次のクロック周期においてＷ２
１と巡回的にもどってきたＸ＋　との積を行うことにな
る。同様に、第３行目とベクトルとの積はＷ３３とＸ３
を掛け、Ｗ３１と巡回してくるＸ，とを掛け、Ｗ３■と
巡回して戻ってくるＸ２との積をとって内積演算を実行
することによって可能となる。従って、この動作におい
て、ＷｌｌとＸＩとの積、及びＷ２■とＸ２　、Ｗ３３
とｘ３との積は同時に行えることになる。しかし、図に
示すように、この同時性を実行するためには係数行列の
要素の並べ方にねじれが生じている。このようなリング
シストリックアレイ方式においては、各ＰＥ間のデータ
転送と、各ＰＢでのデータ処理を同期して実行すること
で、データ転送路を有効に利用でき、従って良好な台数
効果を得ることができる。That is, multiply W22 by X2, and in the next clock period,
Multiply W2ff by X3, and in the next clock period W2
1 and X+ returned cyclically will be multiplied. Similarly, the product of the third row and the vector is W33 and X3
This can be achieved by multiplying W31 by the circulating X, and multiplying W3 by the circulating and returning X2 to perform an inner product operation. Therefore, in this operation, the product of Wll and XI, and W2■ and X2, W33
The product of and x3 can be performed simultaneously. However, as shown in the figure, in order to achieve this simultaneity, the arrangement of the elements of the coefficient matrix is distorted. In such a ring systolic array method, by synchronizing data transfer between each PE and data processing at each PB, the data transfer path can be used effectively, thus achieving a good number of units. be able to.

第１２図（Ｃ）は、第ｌ２図（Ａ）のリングシストリッ
ク方式の構或を多段に組み合わせたのもであり、この構
或により、連続する行列とベクトルの積を行うことが可
能となる。このようなシストリックアレイ方式は処理が
規則的であるため、データ伝送路の容量をフルに活用す
ることが可能であり、従って台数効果の向上が計れる。FIG. 12(C) shows a multi-stage combination of the ring systolic structure shown in FIG. 12(A), and this structure makes it possible to perform the product of continuous matrices and vectors. Since such a systolic array method performs regular processing, it is possible to make full use of the capacity of the data transmission path, and therefore the number of devices can be improved.

[Problem to be solved by the invention]

第１１図（Ａ）のような従来の共通バス結合の並列方式
においては、プロセッシングエレメント、すなわちＰＥ
間の結合が共通バスによっているため、一時には１つの
データしか転送できない。また、共通バスによる結合は
共通バス全体にわたる同期をとらなければならない。従
って、従来の共通バス結合型並列方式においては良好な
台数効果を得られる処理の種類が少ないという問題が生
じ、さらに共通バスによる結合は、結合されるＰＥの個
数の増加とともに共通バスが長くなり、共通バス全体に
わたる同期をとるのが難しくなるという問題、そして、
大規模並列には適さないという問題が生じていた。また
、第１２図のような従来のリングシストリックアレイ方
式においては、各ＰＥ間のデータ転送とＰＥでのデータ
処理を同期して実行することにより、台数効果を得るこ
とができるが、この方弐では、各ＰＥ間でのデータ転送
と、各ＰＥ間でのデータ処理のタイ旦ングを合わせねば
ならない。また、この方式では、例えば長方形の行列と
ベクトルとの積を求める場合等のようにデータ処理ユニ
ットとデータ保持ユニットのそれぞれの最適な個数が等
しくない場合には、実際のデータ処理に係わらないＰＥ
が必要となり、すなわち、遊ぶＰＥが多くなり、そのた
め台数効果が悪化するという問題がある。言い換えれば
、効率よくとける問題と回路構成とが固く対応し、問題
の大きさが最適な値と異なると台数効果が悪化してしま
う。逆にいうと、良好な台数効果が得られる問題が特定
されてしまい、広範な処理に適用できず、柔軟性、或い
は汎用性に欠け、結果として、ある程度広い範囲の処理
に適用できる高速なデータ処理系を実現することが困難
となる。In the conventional common bus coupling parallel system as shown in FIG. 11(A), processing elements, namely PE
Since the connection between them is through a common bus, only one piece of data can be transferred at a time. Also, coupling via a common bus requires synchronization across the common bus. Therefore, in the conventional common bus coupling type parallel system, there is a problem that there are few types of processing that can obtain a good number of units effect, and furthermore, coupling by a common bus causes the common bus to become longer as the number of PEs to be coupled increases. , the difficulty of synchronizing across a common bus, and
The problem arose that it was not suitable for large-scale parallelism. In addition, in the conventional ring systolic array system as shown in Fig. 12, the effect of the number of units can be obtained by synchronously executing data transfer between each PE and data processing in the PE. Then, it is necessary to match the timing of data transfer between each PE and the timing of data processing between each PE. In addition, in this method, if the optimal numbers of data processing units and data holding units are not equal, such as when calculating the product of a rectangular matrix and a vector, PEs that are not involved in actual data processing
In other words, there is a problem that the number of PEs to play increases, and the number effect worsens. In other words, there is a strong correspondence between the problem of efficient solving and the circuit configuration, and if the size of the problem differs from the optimal value, the number effect will worsen. Conversely, the problem of obtaining a good number of units effect has been identified, and it cannot be applied to a wide range of processing, lacks flexibility or versatility, and as a result, high-speed data that can be applied to a somewhat wide range of processing has been identified. This makes it difficult to implement a processing system.

本発明は、リングシストリックアレイ方式や共通バス結
合型Ｓ　Ｉ　Ｍ　Ｄ　（Ｓｉｎｇｌｅ　Ｉｎｓｔｒｕｃ
ｔｉｏｎ　ＭｕｌｔｉＤａｔａ）結合方式と同程度なハ
ードウエア構成で、データ転送によるオーバヘッドを減
少せしめ、特に、長方形行列とベクトルとの積を求める
ような処理に対しても、本来の並列度を最大限利用でき
るようにして良好な台数効果を得ることにより、行列演
算あるいはニューロコンピュータ演算を行うことを目的
とする。The present invention is applicable to a ring systolic array method or a common bus-coupled type SIMD (Single Instrument
With a hardware configuration comparable to that of the combination method (MultiData), it reduces the overhead caused by data transfer, and allows maximum use of the original degree of parallelism, especially for processing such as calculating the product of a rectangular matrix and a vector. The purpose is to perform matrix calculations or neurocomputer calculations by obtaining a good number of units effect in this manner.

[Means to solve the problem]

第１図は本発明の原理説明図である。同図において１は
データ処理ユニット、２はデータの保持及び転送を行う
トレイ、３は各トレイの相互接続により構成されるシフ
トレジスタ、１１はデータ処理ユニットの第ｌの入力、
１２はデータ処理ユニットの第２の入力、２１はトレイ
の第１の入力、２２はトレイの第１の出力、２３はトレ
イ２の第２の出力である。FIG. 1 is a diagram explaining the principle of the present invention. In the figure, 1 is a data processing unit, 2 is a tray that holds and transfers data, 3 is a shift register formed by interconnecting each tray, 11 is the first input of the data processing unit,
12 is the second input of the data processing unit, 21 is the first input of the tray, 22 is the first output of the tray, and 23 is the second output of the tray 2.

データ処理ユニット１はデータの処理を行い、トレイ２
は転送の動作を行うものでシフトレジスタ３を構放して
、データの巡回シフトを行う。本発明では、ｍｘｎ行列
八と要素数のベクトルＸとの積を求める場合、行列八の
行数ｍが列数ｎより小さい場合であっても、或いはｍが
ｎより大きい場合であっても、ｍ個のデータ処理ユニッ
トとｎ個のトレイを用いてｎに比例する処理時間でその
積が実行可能となり、従って、良好な台数効果を得るこ
とができる。すなわち、第１図（Ａ）に示すように、そ
れぞれ２つの入力を持ち、その入力間の乗算機能とその
乗算結果の累積機能、すなわち内積演算を実行するｍ個
のデータ処理ユニット１と、ｎ個のトレイ２とからなる
構或において、ユニット内の累積レジスタをＹとした場
合に、データ処理ユニットは１１からの入力と１２から
の入力を掛け合わせ、積を累積Ｙに足し込み、その後、
シフトレジスタ３内の隣接するトレイ間でベクトルＸの
要素をシフトする。この動作をｎ回繰り返すことにより
、ｍ×ｎの行列Ａと、ｎ次元ベクトルとの乗算がｍ個の
データ処理ユニットを用いてｎに比例する処理時間で実
行可能となる。すなわち、本発明は、従来方式と異なり
、データ処理ユニット１とデータ保持機能を有するトレ
イ２とを分離することにより、それぞれｍとｎが異なっ
ている場合であっても、タイミングを合わせるための処
理を必要とせずに良好な台数効果を得ることが可能とな
る。さらに、本発明では、トレイ２間のデータ転送とデ
ータ処理ユニットｌによるデータ処理とを同時並行的に
行い、一般的にはデータ処理ユニットがデータ処理に有
する時間よりもデータ転送時間を短くすることが期待で
きるので、データ転送時間をデータ処理時間の影に隠す
ことで実質的にＯにし、そのことにより、処理時間を短
縮することが可能となっている。このことにより、行列
演算あるいはニューロコンピュータ演算を行う。Data processing unit 1 processes data and tray 2
performs a transfer operation, leaving the shift register 3 free and performing a cyclic shift of data. In the present invention, when calculating the product of an mxn matrix 8 and a vector The product can be performed using m data processing units and n trays with a processing time proportional to n, and therefore a good number effect can be obtained. That is, as shown in FIG. 1(A), m data processing units 1 each have two inputs and perform a multiplication function between the inputs and an accumulation function of the multiplication results, that is, an inner product operation, and n In a structure consisting of two trays 2, if the cumulative register in the unit is Y, the data processing unit multiplies the input from 11 and the input from 12, adds the product to the cumulative Y, and then,
Shift the elements of vector X between adjacent trays in shift register 3. By repeating this operation n times, the multiplication of the m×n matrix A by the n-dimensional vector can be performed using m data processing units in a processing time proportional to n. That is, unlike the conventional system, the present invention separates the data processing unit 1 and the tray 2 having a data holding function, so that even if m and n are different from each other, processing for synchronizing the timing can be performed. It is possible to obtain a good number of units without the need for Furthermore, in the present invention, the data transfer between the trays 2 and the data processing by the data processing unit 1 are performed simultaneously in parallel, so that the data transfer time is generally shorter than the time taken by the data processing unit for data processing. Therefore, by hiding the data transfer time behind the data processing time, it is essentially possible to reduce the processing time to zero. This allows matrix calculations or neurocomputer calculations to be performed.

[For production]

データ処理ユニットと、データ保持機能を有するトレイ
とを分離することにより、データ処理ユニットの個数ｍ
とトレイの個数ｎとが同一の場合も違っている場合も、
ｎＸｍの行列Ａと要素数ｎのベクトルＸとの積を、デー
タ転送と、データ処理の同時並列処理により行うことが
できる。By separating the data processing unit and the tray with data holding function, the number of data processing units m
Regardless of whether and the number of trays n are the same or different,
The product of the nXm matrix A and the vector X with n elements can be performed by data transfer and simultaneous parallel data processing.

〔Example〕

以下、本発明の実施例を図面を参照して説明する。 Embodiments of the present invention will be described below with reference to the drawings.

第１図（Ｂ）は第１図（Ａ）の本発明の原理構成図のシ
ステムの動作フローチャートである。第１図（Ａ）に示
されるように本発明ではデータ処理ユニット１とデータ
保持機能を有するトレイ２とを分離し、さらにトレイを
隣接間で接続し、巡回接続することによってシストリン
クなシステムを構成している。データ処理ユニットの数
をｎ、トレイの数をｍとした場合に、ｍｘｎの行列Ａと
要素数ｎのベクトルＸとの積を求める場合、第１図（Ｂ
）のフローチャートに示される動作となる。FIG. 1(B) is an operation flowchart of the system shown in FIG. 1(A) which is a diagram of the principle configuration of the present invention. As shown in FIG. 1(A), in the present invention, a data processing unit 1 and a tray 2 having a data holding function are separated, and the trays are connected between adjacent trays for circular connection, thereby creating a system linked to the system. It consists of When the number of data processing units is n and the number of trays is m, when calculating the product of mxn matrix A and vector
) The operation is shown in the flowchart.

Ｘｉ　をトレイ２のｉ番目にセットする。Ｙｓの値をＯ
にする。すなわちデータ処理ユニットのｉ番目のユニッ
トにおける累積レジスタの値を初期化する。ｉ番目の処
理ユニット１Ｎは１１＜からの入力と、１２１の入力を
掛け合わせて、積を累積器Ｙｓに足し込む。そしてシフ
トレジスタ３をシフトする。この内積とシフト動作をｎ
回繰り返す。Set Xi in the i-th position of tray 2. The value of Ys is O
Make it. That is, the value of the accumulation register in the i-th data processing unit is initialized. The i-th processing unit 1N multiplies the input from 11< by the input from 121, and adds the product to the accumulator Ys. Then, shift register 3 is shifted. This inner product and shift operation are n
Repeat times.

この処理において長方行列Ａとベクトル見との積が形成
される。この場合、トレイ間のデータ転送とデータ処理
ユニットにおけるデータ処理とは同時並行処理となる。In this process, the product of the rectangular matrix A and the vector term is formed. In this case, data transfer between trays and data processing in the data processing unit are simultaneous and parallel processes.

第１図（Ｃ）は本発明方式の動作概念図である。FIG. 1(C) is a conceptual diagram of the operation of the system of the present invention.

同図においてトレイ２内のデータＸ＋から×ｎはベクト
ルＸの要素でその個数はｎであるとする。In the figure, it is assumed that data X+ to xn in tray 2 are elements of vector X, and the number thereof is n.

またデータ処理ユニットはｍ個あり、その各々に累積器
がＹ＋　，Ｙｚ　，　　・・・，Ｙ，がある。ｍ×ｎの
長方行列の要素はＡ■からＡ．までのｍ×ｎ個存在する
。データ処理ユニットの１．には係数行列の第１行目で
あるＡｚ．　ＡＨ，　　・・・＋ＡＩｎが同期的に１２
１の入カバスから入力される。またデータ処理ユニット
１２はＡ２２，　Ａ２３，　　・・・Ａ　２　１がシス
トリック動作の各タイミングで順番に与えられる。また
、データ処理ユニット１ｌＩには？　ｆｆｉｌＷ＋　Ａ
ｌｌ　ｌｌ＋．１＋　’・・ｒ　Ａｌｌ　Ｉ１−１が同
期的に与えられる。Furthermore, there are m data processing units, each of which has an accumulator Y+, Yz, . . . , Y. The elements of the m×n rectangular matrix are from A■ to A. There are m×n pieces up to. 1 of the data processing unit. is the first row of the coefficient matrix, Az. AH, ...+AIn is 12 synchronously
It is input from the input bus No. 1. Further, the data processing unit 12 is sequentially given A22, A23, . . . A 2 1 at each timing of the systolic operation. Also, what about the data processing unit 1lI? ffilW+ A
ll ll+. 1+'...r All I1-1 is given synchronously.

第１図（Ｄ）は第１図（Ｃ）の動作のタイミングチャー
トである。時間Ｔ＋からＴｎの動作は第１図（Ｃ）のそ
れぞれの図と第１図（Ｄ）の時間Ｔ，，Ｔ２，　　・・
・，Ｔ，，とが対応している。時間タイミングＴ１にお
いては第１図（Ｃ）に示されるようにトレイの２１．２
２，　　・・・，２ｎにはＸ＋　，Ｘｚ　，Ｘ■，・・
・．Ｘいがあり、ユニッ｝１１，１２，　　・・・，ｌ
ｍにはそれぞれ係数行列の要素Ａｌｌ，　ＡＤＺ，　　
・・・六一が入力されている。従って、このタイミング
においてデータ処理ユニットはＡ１，とトレイ２１のデ
ータＸ１　との積を求め、データ処理ユニットに対応す
るトレイ２２にあるＸ２と、メモリから与えられるＡ２
２との積を求め、同様６こ、トレイ２ｍにおいてはＡ一
とＸ．の積を求める。このタイミングは第１図（Ｄ）の
Ｔ１のタイミングで行われている。すなわち積和を求め
る同期クロックにおいて、バス１ｈにはＸ１があり、バ
ス１２１にはＡ■があり、バス１１２にはＸ２、１２２
にはＡ２２、１１３にはＸ３、１２３にはＡ３３があり
、１１．にはＸＩＩ、１２，にはＡ　ａｌｌ１がのって
いる。従って、第１図（Ｃ）のＴ，タイムにおける図に
示すように内積演算が行われる。累積器Ｙの値はこの時
は０であるから内積結果はＯに掛けた値が加わることに
なる。積和演算が終わるとシフト動作に入る。すなわち
第１図（Ｄ）に示されるようにＴ１とＴ２との間がシフ
ト動作であり、隣接するトレイ間でデータのシフトが行
われる。すなわち左シフトがこの場合行われる。すると
第１図（Ｃ）のタイミングＴ２に移る。第１図ＣＤ）の
動作タイ旦ングでも同様にＴ２の積和の時間区域となる
。するとシフトされているからトレイ２１にはＸ２、ト
レイ２２にはＸ３、そしてトレイ２ｍにはｘ−１が格納
され、また、係数行列の要素もトレイ１．２．−・’，
ｍにはそれぞれＡＩ２，　Ａ２３，　Ａｍ　ｍｏ＋が入
力される。これは第ｌ図（Ｄ）のＴ２のタイξングにお
いてもバス上のデータがそれぞれ示されている。従って
、Ｔ２のタイξングにおいて、Ａ１２？Ｘ２との積をと
り、前の累積器Ｙとの和が求められる。従ってユニット
１．においてはＴＩにおいて求まったＡ　１　．とＸ，
との積とＴ２において求められるＡ　Ｉ　ＺとＸ２との
積との和が求められその結果が累積器に格納される。同
様にユニット１ｚにおいては前の結果であるＡ２２ＸＸ
２　＋Ａ２３ＸＸ３の結果が累積器に格納される。ユニ
ッｌ−１．に対しても同様である。そしてまたシフトし
、タイミングＴ３に移る。トレイ１にはＸ３、トレイ２
にはＸ４、トレイｍにはＸ．■２、トレイｎにはＸ２が
入り、第１図（Ｃ）のＴ３時間における図に示されるよ
うな内積演算が実行される。FIG. 1(D) is a timing chart of the operation of FIG. 1(C). The operations from time T+ to Tn are shown in the respective diagrams in FIG. 1(C) and times T,, T2, . . . in FIG. 1(D).
・,T,, correspond. At time timing T1, as shown in FIG. 1(C), 21.2 of the tray is
2, ..., 2n has X+ , Xz , X ■, ...
・．． There is a unit} 11, 12, ..., l
m has coefficient matrix elements All, ADZ,
...61 has been input. Therefore, at this timing, the data processing unit calculates the product of A1 and the data X1 on the tray 21, and calculates the product of X2 on the tray 22 corresponding to the data processing unit and A2 given from the memory.
Find the product of 2 and 6 in the same way, and for tray 2m, A1 and X. Find the product of This timing is performed at the timing T1 in FIG. 1(D). In other words, in the synchronous clock for calculating the sum of products, bus 1h has X1, bus 121 has A■, bus 112 has X2, 122
has A22, 113 has X3, 123 has A33, 11. has XII, and 12 has A all1. Therefore, the inner product calculation is performed as shown in the diagram at T and time in FIG. 1(C). Since the value of the accumulator Y is 0 at this time, the value multiplied by O is added to the inner product result. When the product-sum operation is completed, a shift operation begins. That is, as shown in FIG. 1(D), a shift operation occurs between T1 and T2, and data is shifted between adjacent trays. That is, a left shift is performed in this case. Then, the process moves to timing T2 in FIG. 1(C). Similarly, in the operation timing shown in FIG. 1 (CD), the time area is the sum of products of T2. Then, since they have been shifted, X2 is stored in tray 21, X3 is stored in tray 22, and x-1 is stored in tray 2m, and the elements of the coefficient matrix are also stored in trays 1, 2, . −・'、
AI2, A23, and Am mo+ are input to m, respectively. This is also shown in the timing T2 of FIG. 1(D) for each data on the bus. Therefore, in the timing of T2, A12? The product with X2 is taken and the sum with the previous accumulator Y is determined. Therefore, unit 1. In , A 1 . found at TI. and X,
The sum of the product of A I Z and X2 obtained at T2 is calculated, and the result is stored in the accumulator. Similarly, in unit 1z, the previous result A22XX
The result of 2+A23XX3 is stored in the accumulator. Unit l-1. The same applies to . Then, it shifts again and moves to timing T3. Tray 1 has X3, tray 2
X4 for tray m, X4 for tray m. (2) X2 enters tray n, and the inner product calculation as shown in the diagram at time T3 in FIG. 1(C) is executed.

第ｌ図（Ｄ）の動作タイミングの時間区域Ｔ３において
は、データ処理ユニットに入るべき入力の記号が示され
ている。このような演算が進み、時間区域Ｔ。まで行う
と、第ｌ図（Ｃ）の時間区域Ｔ。In the time period T3 of the operation timing of FIG. 1(D), the symbols of the inputs to be entered into the data processing unit are shown. As such calculations proceed, the time area T is reached. If the process is performed up to the time zone T in FIG. 1(C).

に示されるようにＡ　＋　ｎＸ　Ｘ　ｎは前の累積器と
の値に加えられると、トレイ２１においては、ＴＩで求
めたＡＩＩＸＸ＋　、Ｔ２におけるＡＩ２ＸＸ２　、Ｔ
３で求めたＡ　Ｉ３　Ｘ　Ｘ　３等の積の和が求まり、
’ｒｎ−＋までの内積結果が累算器Ｙに格納されている
ので、その結果にＡ，ｆｉＸ×ｎが加わって行列Ａの１
行目とベクトルＸとの内積が実行される。トレイ２にお
いては同様に、行列八の２行目の行ベクトルとベクトル
Ｘとの内積演算がｎクロック周期で行われ、同様にｍ行
目の行ベクトルと、ベクトルＸの内積がデータ処理ユニ
ット１、で実行される。従って、このような時系列で処
理を行うことによって、ｍ×ｎの長方行列とｎ次元ベク
トルとの乗算がｍ個のデータ処理ユニットを用いてｎに
比例する処理時間で実行可能となる。従って、良好な台
数効果を得ることが可能となる。ここで重要なことは、
データを処理するデータ処理ユニットと、データ保持機
能を有するトレイとを分離し、それぞれの個数を長方行
列の行と列に対応させ、それらの次元が異なっていても
、時系列動作が同期的に可能となっている点である。な
おｎがｍよりも小さい場合でもｍ個のトレイ２を用いる
ことで処理時間は延びるが、すなわちｍに比例するが、
台数効果的な処理が可能となる。As shown in , when A + nX
The sum of the products of A I3 X X 3 etc. found in step 3 is found,
Since the inner product result up to 'rn-+ is stored in the accumulator Y, A, fiX×n are added to the result and 1
The inner product of the row and the vector X is performed. Similarly, in tray 2, the inner product calculation of the row vector of the second row of matrix 8 and vector , is executed. Therefore, by performing processing in such a time series, multiplication of an m×n rectangular matrix by an n-dimensional vector can be performed using m data processing units in a processing time proportional to n. Therefore, it is possible to obtain a good effect on the number of units. The important thing here is that
The data processing unit that processes data and the tray that has the data holding function are separated, and the number of each corresponds to the rows and columns of a rectangular matrix, so that even if their dimensions are different, the time series operation is synchronous. This is possible. Note that even if n is smaller than m, the processing time will be extended by using m trays 2, that is, it will be proportional to m.
Effective processing of the number of units is possible.

第２図（Ａ）は第１図の構或の詳細ブロック図であり、
ｍ×ｎ　（ｎ≧ｍ≧１）の行列Ａと要素数ｎのベクトル
Ｘの積Ｖ（要素数ｍ）を求めるものである。同図におい
て、第１図で示したものと同一のものは同一の記号で示
してあり、１ａはデータ処理ユニット１の処理装置であ
り、例えばデジタルシグナルプロセッサで構成され、２
ａはトレイ２のデータ保持回路であり、例えばラッチ回
路で構成され、２ｂはトレイ２のデータ転送回路であり
、例えばバスドライバで構成され、２Ｃはトレイ２の制
御手段であり、例えば論理回路で構成され、４はデータ
処理ユニット１にデータを供給する手段の一部であると
同時にデータ処理ユニ・ノト１を制御する手段の一部で
ある記憶装置であり、例えばＲＡＭ　（ラングムアクセ
スメモリ）で構成され、５はデータ処理ユニット１とト
レイ２の同期動作を行う手段であり、５ａはクロツク発
生回路であり、例えば水晶発振回路で構成され、５ｂは
クロック分配回路であり、例えばバツファ回路から構成
される。FIG. 2(A) is a detailed block diagram of the structure of FIG. 1,
This is to find the product V (number of elements m) of a matrix A of m×n (n≧m≧1) and a vector X of n elements. In the figure, the same components as those shown in FIG.
2b is a data transfer circuit for tray 2, such as a bus driver, and 2C is a control means for tray 2, such as a logic circuit. 4 is a storage device which is part of the means for supplying data to the data processing unit 1 and at the same time is part of the means for controlling the data processing unit 1, such as RAM (Rangular Access Memory). 5 is a means for synchronizing the data processing unit 1 and the tray 2, 5a is a clock generation circuit, for example, a crystal oscillation circuit, and 5b is a clock distribution circuit, for example, from a buffer circuit. configured.

本実施例の動作は本発明の原理図で説明した動作とほぼ
同しである。The operation of this embodiment is almost the same as the operation explained in the principle diagram of the present invention.

第２図（Ｂ）は第２図（Ａ）の本発明のシステムの動作
フローチャートである。第２図（Ａ）に示されるように
本発明ではデータ処理ユニット１とデータ保持機能を有
するトレイ２とを分離し、さらにトレイを隣接間で接続
し、巡回接続することによってシストリックなシステム
を構成している。データ処理ユニットの数をｍ、トレイ
の数をｎとした場合に、ｍｘｎの行列Ａと要素数ｍのベ
クトルＸとの積を求める場合、第４図（Ｂ）のフローチ
ャートに示される動作となる。Ｘｉをトレイ２ｉにセッ
トする。ＹｉＯ値をＯにする。すなわちデータ処理ユニ
ットのｉ番目のユニットにおける累積レジスタの値を初
期化する。ｉ番目の処理ユニットをＩＩは１１，からの
入力と、１２ｉの入力を掛け合わせて、積を累算器Ｙｉ
に足し込む。そしてシフトレジスタ３をシフトする。こ
の内積とシフト動作をｎ回繰り返す。この処理において
長方行列Ａとベクトル蒐との積が形戒される。FIG. 2(B) is an operation flowchart of the system of the present invention shown in FIG. 2(A). As shown in FIG. 2(A), in the present invention, a data processing unit 1 and a tray 2 having a data holding function are separated, and the trays are connected between adjacent trays for circular connection, thereby creating a systolic system. It consists of When the number of data processing units is m and the number of trays is n, when calculating the product of mxn matrix A and vector . Set Xi on tray 2i. Set the YiO value to O. That is, the value of the accumulation register in the i-th data processing unit is initialized. The i-th processing unit II multiplies the input from 11, by the input from 12i, and the product is added to the accumulator Yi.
Add to. Then, shift register 3 is shifted. This inner product and shift operation is repeated n times. In this process, the product of the rectangular matrix A and the vector A is determined.

？の場合、トレイ間のデータ転送とデータ処理ユニット
におけるデータ処理とは同時並行処理となる。? In this case, data transfer between trays and data processing in the data processing unit are simultaneous and parallel processes.

第２図（Ｃ）は本発明方式の動作概念図である。FIG. 2(C) is a conceptual diagram of the operation of the system of the present invention.

同図においてトレイ２内のデータＸ＋からＸｆｌはベク
トルＸの要素でその個数はｎであるとする。In the figure, it is assumed that data X+ to Xfl in tray 2 are elements of vector X, and the number thereof is n.

またデータ処理ユニットはｍ個あり、その各々に累積器
がＹ＋　，Ｙ２　，　　・・・，Ｙ．がある。ｍ×ｎの
長方行列の要素はＡｌｌからＡ　ｌｌＩｎまでのｍ×ｎ
個存在する。データ処理ユニットの１１には係数行列の
第１行目であるＡ　Ｈ　Ｈ、Ａ１２，・・・＋Ａｌ１１
が同期的に１２＋の入カバスから入力される。またデー
タ処理ユニット１２はＡ２２％　Ａ２３，・・・Ａ　２
　１がシストリック動作の各タイミングで順番に与えら
れる。また、データ処理ユニット１、にはＡ，．，Ａ■
．，，・・・，　Ａｎ＋　＋ａ−ｌが同期的に与えられ
る。Furthermore, there are m data processing units, each of which has an accumulator Y+, Y2, . . . , Y. There is. The elements of the m×n rectangular matrix are m×n from All to AllIn.
Individuals exist. The data processing unit 11 has the first row of the coefficient matrix A H H, A12,...+Al11.
are input synchronously from the 12+ input buses. Moreover, the data processing unit 12 has A22% A23,...A2
1 is given in turn at each timing of the systolic operation. The data processing unit 1 also includes A, . ,A■
．． ,,..., An+ +a-l are given synchronously.

第２図（Ｄ）は第２図（Ｃ）の動作のタイミングチャー
トである。時間Ｔ１からＴ。の動作は第１図（Ｃ）のそ
れぞれの図と第１図ＣＤ）の時間？，　，　’ｒｚ　，
　　・・・，Ｔｎとが対応している。時間タイミングＴ
，においては，第２図（Ｃ）に示されるように、トレイ
２１，２２，　　・・・，２ｎにはＸＩ　，ｘ２，Ｘ．
，−　−　・，Ｘｆｌがアリ、ユニットｌｌ，１２，　
　・・・，Ｉｍにはそれぞれ係数行列の要素Ａ＋＋．Ａ
ｚｚ．Ａ■が入力されている。FIG. 2(D) is a timing chart of the operation of FIG. 2(C). From time T1 to T. The operation of each diagram in Figure 1 (C) and the time in Figure 1 CD)? , , 'rz ,
..., Tn correspond to each other. time timing T
, as shown in FIG. 2(C), the trays 21, 22, . . . , 2n have XI, x2, X.
, − − ・, Xfl is ant, unit ll, 12,
..., Im respectively have coefficient matrix elements A++. A
zz. A■ has been input.

従って、このタイ旦ングにおいてデータ処理ユニット１
１のＡＩＩとトレイ２１のデータＸＩとの積を求め、デ
ータ処理ユニット１２においてはトレイ２２にあるＸ２
と、メモリから与えられるＡ２２との積を求め、同様に
、トレイｍにおいては八〇とＸ１の積を求める。このタ
イξングは第２図（Ｄ）のＴＩのタイミングで行われて
いる。すなわち積和を求める同期クロックにおいて、バ
ス１１ＩにはＸ＋があり、バス１２．にはＡ．１があり
、バス１１２にはＸ２　、１　２ｚ　６：はＡ２２、１
　ｌｘ　ニはＸ３、１２：ｌにはＡ３３があり、１１１
にはｘｍ、１２１にはＡ．がのっている。従って、第２
図（Ｃ）のＴＩタイムにおける図に示すように内積演算
が行われる。累積器Ｙの値はこの時は０であるから内積
結果はＯに掛けた値が加わることになる。積和演算が終
わるとシフト動作に入る。すなわち第２図（Ｄ）の図に
示されるようにＴ，とＴ２との間がシフト動作であり、
トレイの隣接するトレイ間でデータのシフトが行われる
。すなわち左シフトがこの場合行われる。すると第２図
（Ｃ）のタイミングＴ２に移る。第２図（Ｄ）の動作タ
イミングでも同様にＴ２の積和の時間区域となる。Therefore, in this timing, the data processing unit 1
The product of AII of 1 and the data XI of tray 21 is calculated, and in the data processing unit 12,
and A22 given from memory, and similarly, for tray m, the product of 80 and X1 is found. This timing is performed at the timing of TI in FIG. 2(D). That is, in the synchronous clock for calculating the sum of products, bus 11I has X+, bus 12. A. 1, bus 112 has X2, 1 2z 6: has A22, 1
lx d has X3, 12:l has A33, 111
xm for 121, A. is on it. Therefore, the second
The inner product calculation is performed as shown in the diagram at the TI time in Figure (C). Since the value of the accumulator Y is 0 at this time, the value multiplied by O is added to the inner product result. When the product-sum operation is completed, a shift operation begins. That is, as shown in FIG. 2(D), the shift operation is between T and T2,
Data is shifted between adjacent trays. That is, a left shift is performed in this case. Then, the process moves to timing T2 in FIG. 2(C). Similarly, the operation timing shown in FIG. 2(D) corresponds to the time area of the sum of products of T2.

するとシフトされているからトレイ２１にはＸ２、トレ
イ２２にはＸ３、そしてトレイし２ｍにはＸ　．ｌが格
納され、また、係数行列の要素もトレイ２１，２２，−
，２ｍにはそれぞれＡ　＋　ｚ　，　Ａ　２　３　，　
Ａ．１１１４１が入力される。これは第２図ＣＤ）のＴ
２のタイミングにおいてもバス上のデータがそれぞれ示
されている。従って、Ｔ２のタイミングにおいて、ＡＩ
２とｘ２との積をとり、前の累積器Ｙとの和が求められ
る。従って、ユニット１ｌにおいてはＴＩにおいて求ま
ったＡ．とＸ１との積とＴｚにおいて求められるＡ　１
　２とＸ２との積との和が求められ、その結果が累積器
に格納される。同様に？ニット１２においては前の結果
であるＡ　２２　Ｘ　Ｘ　Ｚ十Ａ２３ＸＸ３の結果が累
積器に格納される。ユニット１ｆｆｉに対しても同様で
ある。そしてまたシフトし、タイξングＴ，に移る。ト
レイ２１にはＸ３、トレイ２２にはＸ４、トレイ２ｍに
はＸ■２、トレイ２ｎにはｘ２が入り、第２図（Ｃ）の
Ｔ２時間における図に示されるような内積演算が実行さ
れる。Then, since it has been shifted, tray 21 has X2, tray 22 has X3, and tray 2m has X. l is stored, and the elements of the coefficient matrix are also stored in trays 21, 22, -
, 2m respectively have A + z , A 2 3 ,
A. 11141 is input. This is T in Figure 2 (CD)
Also at timing 2, the data on the bus is shown. Therefore, at timing T2, AI
2 is multiplied by x2 and summed with the previous accumulator Y. Therefore, in unit 1l, A. A 1 found at the product of and X1 and Tz
The product of 2 and X2 is summed and the result is stored in an accumulator. Similarly? In unit 12, the previous result, A 22 X X Z A23XX3, is stored in the accumulator. The same applies to unit 1ffi. Then, it shifts again and moves to timing ξ. X3 is placed in tray 21, X4 is placed in tray 22, X2 is placed in tray 2m, and x2 is placed in tray 2n, and the inner product calculation as shown in the diagram at time T2 in FIG. 2(C) is executed. .

第２図（Ｄ）の動作タイミングにおいての時間区域Ｔ３
においては、データ処理ユニットに入るべき入力の記号
が示されている。このような演算が進み、時間区域Ｔ０
まで行うと第２図（Ｃ）の時間区域Ｔ．に示されるよう
にＡＩｆｌ××ｎは前の累積器との値に加えられると、
トレイ１においてはＴ，で求めたＡｔ＋ＸＸ＋　％　Ｔ
ｚにおけるＡ１２×Ｘｚ、Ｔｚで求めたＡ　Ｈ　３　Ｘ
　Ｘ　３等の積の和が求まり、Ｔ，，．．Ｉ　までの内
積結果が累積器Ｙに格納されているので、その結果にＡ
．ｎＸＸｆｌが加わって行列八のＩ行目とベクトルＸと
の内積が実行される。Time zone T3 in the operation timing of FIG. 2(D)
In , the symbols of the inputs to be entered into the data processing unit are shown. As this calculation progresses, the time area T0
If the process is performed up to the time zone T in FIG. 2(C). When AIfl××n is added to the value with the previous accumulator as shown in
In tray 1, At+XX+ % T, calculated by T.
A12×Xz at z, A H 3 X found at Tz
The sum of the products of X 3 etc. is found, T, . ．． Since the inner product results up to I are stored in the accumulator Y,
．． nXXfl is added and the inner product of the I-th row of the 8th matrix and the vector X is performed.

トレイ２においては同様に、行列Ａの２行目の行ベクト
ルとベクトルＸとの内積演算がｎクロック周期で行われ
、同様にｍ行目の行ベクトルと、ベクトルＸの内積がデ
ータ処理ユニット１１で実行される。従って、このよう
な時系列で処理を行うことによってｍ×ｎＯ長方行列と
ｎ次元ベクトルとの乗算がｍ個のデータ処理ユニットを
用いてｎに比例する処理時間で実行可能となる。従って
、良好な台数効果を得ることが可能となる。Similarly, in the tray 2, the inner product calculation of the second row vector of the matrix A and the vector is executed. Therefore, by performing processing in such a time series, multiplication of an m×nO rectangular matrix by an n-dimensional vector can be performed using m data processing units in a processing time proportional to n. Therefore, it is possible to obtain a good effect on the number of units.

第３図は、本発明の第２の実施例説明図である．ｍ×ｎ
の行列Ａと要素数ｎのベクトル１との積に対し、引き続
きｋＸｍの行列Ｂを左から掛ける場合の動作に対するシ
ストリック方式の構或図である。第３図（Ａ）において
第１図で示したものと同一のものは同一の記号で示して
ある。すなわち１ａはデータ処理ユニット１の処理装置
であり、例えばデジタルシグナルプロセッサである。２
ａはトレイ２のデータ保持回路であり、例えばラッチ回
路で構成され、２ｂはトレイ２のデータ転送回路であり
、例えばバスドライバで構或され、２Ｃはトレイ２の制
御手段であり、例えば論理回路で構成されている。４は
データ処理ユニット１にデータを供給する手段の一部で
あると同時にデータ処理ユニットｌを制御する手段の一
部でもある記憶装置であって、例えばＲＡＭ　（ランダ
ムアクセスメモリ）で構成されている。５はデータ処理
ユニット１とトレイ２の同期動作を行う手段であり、内
部の５ａは、クロック発生回路で、例えば、水晶発振回
路で構成され、５ｂはクロック分配回路であり、例えば
、バッファ回路から構成される。FIG. 3 is an explanatory diagram of a second embodiment of the present invention. m×n
1 is a diagram illustrating the structure of a systolic system for operations when the product of matrix A of 1 and vector 1 of n elements is subsequently multiplied by matrix B of kXm from the left. Components in FIG. 3(A) that are the same as those shown in FIG. 1 are indicated by the same symbols. That is, 1a is a processing device of the data processing unit 1, for example, a digital signal processor. 2
2b is a data transfer circuit for tray 2, such as a bus driver, and 2C is a control means for tray 2, such as a logic circuit. It consists of Reference numeral 4 denotes a storage device which is part of the means for supplying data to the data processing unit 1 and at the same time is part of the means for controlling the data processing unit 1, and is composed of, for example, a RAM (random access memory). . 5 is a means for synchronizing the data processing unit 1 and the tray 2, 5a inside is a clock generation circuit, for example, a crystal oscillation circuit, and 5b is a clock distribution circuit, for example, from a buffer circuit to configured.

６はシストリック的に戻るデータとトレイに入力する場
合のデータと外部データとの選択を行う選択回路で、７
はシストリックされるデータを途中からバイパスする選
択回路である。6 is a selection circuit that selects between systically returned data, data to be input into the tray, and external data;
is a selection circuit that bypasses systolic data from the middle.

本実施例は、中間結果Ａｘを求めるところまでは第１の
実施例と全く同一であり、各データ処理ユニット中にそ
の中間結果Ａｘの各要素が求まっている状態から（ａ）中間結果をトレイ２に書き込み、（ｂ）バイアス
の選択回路７をオンさせて、シフトレジスタの長さをｍ
に変更し、（Ｃ）以後は本発明の第１の実施例において、行列Ａを
行列Ｂに、そして、ｎをｍに、ｍをｋにそれぞれ変更す
ればまったく同じ動作となる。This embodiment is exactly the same as the first embodiment up to the point where the intermediate result Ax is obtained, and from a state in which each element of the intermediate result Ax has been obtained in each data processing unit, (a) the intermediate result is transferred to the tray. (b) Turn on the bias selection circuit 7 and set the length of the shift register to m.
(C) From then on, in the first embodiment of the present invention, if matrix A is changed to matrix B, n is changed to m, and m is changed to k, the operation will be exactly the same.

第３図（Ｂ）は第２の実施例の動作フローチャート、第
３図（Ｃ）は第２の実庵例の動作概要図、第３図（Ｄ）
は第２の実施例の動作タイムチャートである。FIG. 3(B) is an operation flowchart of the second embodiment, FIG. 3(C) is an operation overview diagram of the second practical example, and FIG. 3(D)
is an operation time chart of the second embodiment.

まず、ｍ×ｎの行列Ａと要素数ｎのベクトルＸとの積、
そして、ｋＸｍの行列Ｂを左から掛ける場合、第３図（
Ｂ）のフローチャートに示される動作となる。Ｘｉをト
レイ２１にセットする。ＹｉＯ値をＯにする。すなわち
データ処理ユニットのｉ番目のユニットにおける累積レ
ジスタの値を初期化する。ｉ番目の処理ユニット１ｌは
ｌｉｔからの入力と、１２１の入力を掛け合わせて、積
を累積器Ｙｉに足し込む。そしてシフトレジスタ３をシ
フトする。この内積とシフト動作をｎ＠繰り返す。この
処理において長方行列ＡとベクトルＸとの積が形威され
る。First, the product of m×n matrix A and vector X with n elements,
Then, when multiplying the matrix B of kXm from the left, Fig. 3 (
The operation is shown in the flowchart B). Set the Xi on the tray 21. Set the YiO value to O. That is, the value of the accumulation register in the i-th data processing unit is initialized. The i-th processing unit 1l multiplies the input from lit by the input of 121, and adds the product to the accumulator Yi. Then, shift register 3 is shifted. This inner product and shift operation is repeated n@. In this process, the product of rectangular matrix A and vector X is expressed.

次に、シフトレジスタの長さをｍに変更し、Ｙ１？トレ
イ２１に転送する。そして、Ｚ，（ｉ＝１，・・・，ｋ
）をＯにする。次にＢ行列を掛けるために、まず、ｉ番
目の処理ユニットｌｌとｌｉｔからの入力と１２ｔの入
力を掛け合わせて、積を累積器Ｚｌに足し込む。そして
、シフトレジスク３をシフトする二〇内積とシフト動作
をｋ回繰り返す。Next, change the length of the shift register to m and Y1? Transfer to tray 21. And Z, (i=1,...,k
) to O. Next, in order to multiply the B matrix, first, the inputs from the i-th processing units ll and lit are multiplied by the input of 12t, and the product is added to the accumulator Zl. Then, the 20 inner product and shift operation for shifting the shift register 3 are repeated k times.

第３図（Ｃ）は以上の動作概念図である。同図において
トレイ２内のデータＸ１からＸＩ，はベクトル真の要素
でその個数はまず、ｎであるとする。FIG. 3(C) is a conceptual diagram of the above operation. In the figure, it is assumed that data X1 to XI in tray 2 are true elements of a vector, and the number thereof is n.

またデータ処理ユニットは最初は、ｍ個が有効で、その
各々に累積器がＹ＋　，Ｙｚ　，　　・・・，Ｙ．があ
るとする。まず、ｍ×ｎの長方行列Ａの要素はＡ■から
Ａ■までのｍ×ｎ個存在する。データ処理ユニットのｈ
には孫数行列の第１行目であるＡ＋＋．　Ａｄｚ，　　
Ｈ　＋　＋，　ＡＩｎが同期的に１２、の入カバスから
入力される。またデータ処理ユニット１２はＡ　２　２
　，　　Ａ　２　３　，　　・・・＋　　Ａｌｌがシス
トリック動作の各タイξングで順番に与えられる。また
、データ処理ユニット１■にはＡ　，．，　Ａ．，■竃
．・・？．■１が同期的に与えられる。In addition, m data processing units are initially effective, and each of them has an accumulator Y+, Yz, . . . , Y. Suppose there is. First, there are m×n elements of an m×n rectangular matrix A, from A■ to A■. data processing unit h
is the first row of the grandchild matrix, A++. Adz,
H + +, AIn are synchronously input from the input bus 12. Moreover, the data processing unit 12 is A 2 2
, A 2 3 , . . . + All are given in turn at each timing of the systolic operation. The data processing unit 1■ also includes A, . , A. , ■Kado. ...? ．． ■1 is given synchronously.

第３図（Ｄ）は第３図（Ｃ）の動作のタイξングチャー
トである。時間ＴＩからＴ。の動作は第３図（Ｃ）のそ
れぞれの図と第３図（Ｄ）の時間Ｔ＋　，Ｔｚ　，　　
・・・，Ｔｎとが対応している。時間タイミングＴ１に
おいては、第３図（Ｃ）に示されるように、トレイの１
．２．・・・　ｎにはＸ＋，Ｘｚ，　　・・・＋Ｘｋ＋
　　・・・，Ｘｌｌがあり、ユニット１，２，・・・，
ｋ，・・・，ｍにはそれぞれ係数行列の要素Ａｌｌ，Ａ
２■，・・・，　Ａｈｈ．・・・，Ａ，，が入力されて
いる。従って、このタイミングにおいてデータ処理ユニ
ットは、トレイ１において、Ａ目とトレイ１のデークＸ
，との積を求め、データ処理ユニット２においてはトレ
イ２にあるＸ２と、メモリから与えられるＡ２■との積
を求め、同様に、トレイｋにおいてはＡｋｋとＸ，の積
を求め、トレイｍにおいて、Ａ．とＸｍの積を求める。FIG. 3(D) is a timing chart of the operation of FIG. 3(C). Time TI to T. The operations are shown in the respective diagrams in FIG. 3(C) and at times T+, Tz, and Tz in FIG. 3(D).
..., Tn correspond to each other. At time timing T1, as shown in FIG. 3(C), one of the trays
．． 2. ... n has X+, Xz, ...+Xk+
..., Xll, and units 1, 2, ...,
k, . . . , m are coefficient matrix elements All, A, respectively.
2■,..., Ahh. ..., A,, are input. Therefore, at this timing, the data processing unit processes the A-th disk and the disk X of tray 1.
, and in the data processing unit 2, the product of X2 in tray 2 and A2■ given from the memory is found.Similarly, in tray k, the product of Akk and X, is found, and in tray m In A. Find the product of and Xm.

このタイミングは第３図（Ｄ）のＴ１のタイミングで行
われている。すなわち積和を求める同期クロックにおい
て、バス１１＋にはｘｌ？あり、バス１２＋にはＡ　＋
　＋があり、バスｌｌｚにはＸ２、１２２にはＡｚｚ，
　１　１　ｋにはＸｈ，１２ｋにはＡｋｋがあり、１１
．にはＸｌｌｌ，ｌ２１ＩにはＡ１■がのっている。従
って、第３図（Ｃ）のＴ＋タイムにおける図に示すよう
に、内積演算が行われる。累積器Ｙの値はこの時は０で
あるから内積結果はＯに掛けた値が加わることになる。This timing is performed at the timing T1 in FIG. 3(D). In other words, in the synchronous clock that calculates the sum of products, xl? Yes, there is A+ for bus 12+.
+, bus llz has X2, 122 has Azz,
1 1 k has Xh, 12k has Akk, 11
．． has Xllll on it, and A1■ on l21I. Therefore, as shown in the diagram at T+time in FIG. 3(C), an inner product calculation is performed. Since the value of the accumulator Y is 0 at this time, the value multiplied by O is added to the inner product result.

積和演算が終わるとシフト動作に入る。すなわち第３図
（Ｄ）の図に示されるように、Ｔ＋　とＴ２との間がシ
フト動作であり、トレイの隣接するトレイ間でデータの
シフトが行われる。すなわち左シフトがこの場合行われ
る。すると第３図（Ｃ）のタイくングＴ２に移る。第３
図（Ｄ）の動作タイミングでも同様にＴ２の積和の時間
区域となる。するとシフトされているからトレイ１には
Ｘ２、トレイ２にはＸ３、トレイｋにはＸＩｌ＋１％そ
してトレイｍにはＸ１。１が格納され、また、係数行列
の要素もトレイ１，２，・・・，ｋ，・・・，ｍにはそ
れぞれＡＩ２＋　Ａ２３＋　　・・・Ａｋｋ＋重，・・
・ＡＩ１１１１４１が入力される。これは第３図（Ｄ）
のＴ２のタイξングにおいてもバス上のデータがそれぞ
れ示されている。従ってＴ２のタイミングにおいて、Ａ
，２とｘ２との積をとり、前の累積器Ｙとの和が求めら
れる。従ってトレイ１においてはＴ１において求まった
Ａ　１　１とＸ，との積とＴ２において求められるＡ１
２とｘ２との積との和が求められその結果が累積器に格
納される。同様にトレイ２においては前の結果であるＡ
２２ＸＸ２　＋Ａ２３ＸＸ３の結果が累積器に格納され
る。トレイｋやｍに対しても同様である。そしてまたシ
フトし、タイミングＴ３に移る。トレイ１にはＸ３、ト
レイ２にはＸ４、トレイｋにはＸｗ　ｋ４２　．．　　
｝レイｍにはＸｌｌｌｍｈ２、トレイｎにはＸ２が入り
、第３図（Ｃ）のＴ３時間における図に示されるような
内積演算が実行される。When the product-sum operation is completed, a shift operation begins. That is, as shown in FIG. 3(D), a shift operation is performed between T+ and T2, and data is shifted between adjacent trays. That is, a left shift is performed in this case. Then, the process moves to tying T2 in FIG. 3(C). Third
Similarly, the operation timing shown in FIG. 3D also corresponds to the time area of the sum of products of T2. Then, since they have been shifted, X2 is stored in tray 1, X3 is stored in tray 2, XIl+1% is stored in tray k, and X1.1 is stored in tray m, and the elements of the coefficient matrix are also stored in trays 1, 2, etc.・, k, ..., m each have AI2+ A23+ ...Akk+heavy, ...
- AI111141 is input. This is Figure 3 (D)
The data on the bus is also shown in the timing T2. Therefore, at the timing of T2, A
, 2 and x2, and the sum with the previous accumulator Y is determined. Therefore, for tray 1, the product of A 1 1 and X, found at T1, and A1 found at T2
The product of 2 and x2 is summed and the result is stored in the accumulator. Similarly, in tray 2, the previous result A
The result of 22XX2 +A23XX3 is stored in the accumulator. The same applies to trays k and m. Then, it shifts again and moves to timing T3. X3 for tray 1, X4 for tray 2, Xw k42 for tray k. ．．
}Xllmh2 is entered into ray m, and X2 is entered into tray n, and an inner product calculation as shown in the diagram at time T3 in FIG. 3(C) is executed.

このような演算が進み、時間区域Ｔｎまで行うと第３図
（Ｃ）の時間区域Ｔ．に示されるようにＡ　＋　ｎ　Ｘ
　Ｘ　（＠が前の累積器との値に加えられるとトレイ１
においてはＴ＋　で求めたＡ１１ＸＸ１−．Ｔｚにおけ
るＡ　１　ｚ　Ｘ　Ｘ　２　、Ｔ　ｋで求めたＡｌｋｘ
Ｘｋ等の積の和が求まり、Ｔｎ−＋までの内積結果が累
積器Ｙに格納されているので、その結果にＡ，．ｘ×ｎ
が加わって行列Ａの１行目とベクトルＸとの内積が実行
される。トレイ２においては同様に行列八の２行目の行
ベクトルとベクトルスとの内積演算がｎクロック周期で
行われ、同様にｋ行目の行ベクトルと、ベクトル見の内
積がデータ処理ユニット１ｋで実行される。When such calculations proceed until the time area Tn is reached, the time area T. shown in FIG. 3(C) is reached. A + n X as shown in
X (Tray 1 when @ is added to the value with the previous accumulator
In this case, A11XX1-. determined by T+. A 1 z X X 2 at Tz, Alkx determined at T k
The sum of products such as Xk is calculated, and the inner product results up to Tn-+ are stored in the accumulator Y, so A, . x×n
is added, and the inner product of the first row of matrix A and vector X is performed. Similarly, in tray 2, the inner product calculation of the row vector of the second row of matrix 8 and vectors is performed in n clock cycles, and similarly, the inner product of the row vector of the k-th row and the vector is calculated in data processing unit 1k. executed.

データ処理ユニットの有効数をｋ、トレイの有効数をｍ
とした場合に、ｋＸｍの行列Ｂと要素数ｍのベクトルｙ
との積を求める動作となる。Ｙｔをトレイ２のｈにセッ
トする。ＺｉＯ値をＯにする。すなわちデータ処理ユニ
ットのｉ番目のユニットにおける累積レジスタの値を初
期化する。The effective number of data processing units is k, and the effective number of trays is m.
In this case, matrix B of kXm and vector y of m elements
The operation is to find the product of . Set Yt on tray 2 h. Set the ZiO value to O. That is, the value of the accumulation register in the i-th data processing unit is initialized.

ｉ番目の処理ユニットｈは１ｈからの入力と、１２１の
入力を掛け合わせて、積を累積器Ｚｉに足し込む。そし
てシフトレジスタ３をシフトする。The i-th processing unit h multiplies the input from 1h by the input of 121, and adds the product to the accumulator Zi. Then, shift register 3 is shifted.

この内積とシフト動作をｍ＠繰り返す。この処理におい
て長方行列Ｂとベクトル！との積が形威される。This inner product and shift operation is repeated m@. In this process, rectangular matrix B and vector! The product of is expressed as

？３図（Ｃ）においてトレイ２内のデータＹ１からＹｍ
はベクトルｙの要素でその個数はｍであるとする。また
データ処理ユニットの有効数はｋ個あり、その各々に累
積器がＺｌ，Ｚ２，　　・・・Ｚｋがある。ｋＸｍの長
方行列旧の要素はＢ＋＋からＢｈまでのｋＸｍ個存在す
る。データ処理ユニットのｈには係数行列Ｂの第１行目
であるＢｌｌ、Ｂ＋ｚ，　　・・・＋ＢＩ＋ｍが同期的
に１２１　の入力バスから入力される。またデータ処理
ユニットｌｚはＢ２■，　　Ｂ２３，　　・・・，Ｂ２
１がシストリック動作の各タイミングで順番に与えられ
る。また、データ処理ユニッｌ−１ヶにはＢｋｋ．　Ｂ
ｋ　ｋ。１，・・・Ｂｋ　ｋ−１が同期的に与えられる
。? In Figure 3 (C), data Y1 to Ym in tray 2
is an element of vector y, and the number of elements is m. Further, the effective number of data processing units is k, and each of them has an accumulator Zl, Z2, . . . Zk. The old kXm rectangular matrix has kXm elements from B++ to Bh. The first row of the coefficient matrix B, Bll, B+z, . . . +BI+m, is synchronously input to the data processing unit h from the 121 input bus. Moreover, the data processing unit lz is B2■, B23, ..., B2
1 is given in turn at each timing of the systolic operation. In addition, the data processing unit l-1 has Bkk. B
k k. 1,...Bk k-1 are given synchronously.

第３図（Ｄ）は第３図（Ｃ’）の動作のタイ５ングチャ
ートでも同様の記号が使われている。時間Ｔ　ｎ　＋　
１からＴ　ｎ　＋　＠　＋　ｌの動作は第３図（Ｃ）の
それぞれの図と第３図（Ｄ）の時間とが対応している。Similar symbols are used in the timing chart of FIG. 3(D) and the operation of FIG. 3(C'). Time T n +
The operations from 1 to T n + @ + l correspond to the respective diagrams in FIG. 3(C) and the times in FIG. 3(D).

時間タイミングＴ。＋Ｉにおいては第３図（Ｃ）に示さ
れるように、トレイＩ，２，・・・，ｍにはＹ＋　，Ｙ
２　，　　・・・，Ｙ．が移され、ユニット１，？，・
・・，ｋにはそれぞれ係数行列Ｂの要素Ｂｚ，Ｂ２■，
・・・，Ｂｋｋが入力されている。次のタイミング’ｒ
ｎ＊ｚにおいてデータ処理ユニット１においてＢｌ＋と
トレイ■のデータＹ１との積を求め、データ処理ユニッ
ト２においてはトレイ２にあるＹ２と、メモリから与え
られるＢ２■との積を求め、同様にユニッＩ−ｋにおい
てはＢｋｋとＹｋの積を求める。このタイミングは第５
図（ｄ）のＴｎ。２のタイミングで行われている。すな
わち積和を求める同期クロツクにおいて、バス１１＋に
はＹ１があり、バス１２＋にはＢ目があり、バス１１２
にはＹ２、１２２にはＢ２２、１ｌ３にはＹ３，１２３
にはＢ３３があり、１１ｋにはＹｋ、１２ｋにはＢｋｋ
がのっている。従って、第３図（Ｃ）のＴ，ｌ。２にお
ける図に示すように内積演算が行われる。累積器Ｚの値
はこの時はＯであるから内積結果はＯに掛けた値が加わ
ることになる。積和演算が終わるとシフト動作に入る。Time timing T. In +I, as shown in FIG. 3(C), trays I, 2, ..., m have Y+, Y
2,...,Y. is moved and unit 1,? 、・
..., k are the elements Bz, B2■, of the coefficient matrix B, respectively.
..., Bkk are input. next timing'r
At n*z, the data processing unit 1 calculates the product of Bl+ and the data Y1 of the tray ■, and the data processing unit 2 calculates the product of Y2 in the tray 2 and B2■ given from the memory, and similarly the unit In I-k, the product of Bkk and Yk is calculated. This timing is the fifth
Tn in figure (d). This is done at the timing of 2. In other words, in the synchronous clock that calculates the sum of products, bus 11+ has Y1, bus 12+ has B, and bus 112
Y2 for 122, B22 for 1l3, Y3, 123 for 1l3
has B33, 11k has Yk, 12k has Bkk
is on it. Therefore, T, l in FIG. 3(C). The inner product operation is performed as shown in the diagram in FIG. Since the value of the accumulator Z is O at this time, the value multiplied by O is added to the inner product result. When the product-sum operation is completed, a shift operation begins.

すなわち第３図（Ｄ）の図に示されるようにＴｎ。２と
Ｔ。＋３との間がシフト動作であり、トレイの隣接する
トレイ間でデータのシフ？が行われる。すなわち左シフ
トがこの場合行われる。すると第３図（Ｃ）のタイξン
グＴ　ｎ　＋　３に移る。第３図（Ｄ）の動作タイミン
グでも同様にＴ　ｎ　＋　３の積和の時間区域となる。That is, Tn as shown in the diagram of FIG. 3(D). 2 and T. +3 is a shift operation, and data is shifted between adjacent trays. will be held. That is, a left shift is performed in this case. Then, the process moves to the tying T n + 3 in FIG. 3(C). Similarly, the operation timing shown in FIG. 3(D) corresponds to the time area of the sum of products of T n +3.

すると、シフトされているからトレイ１にはＹ２、トレ
イ２にはＹ３、そしてトレイｋにはＹ　ｋ　＋　＋が格
納され、また、係数行列Ｂの要素もトレイ１，２．・・
・．ｋにはそれぞれＢ１■，　Ｂ２３，　　・・・＋　
Ｂｋ　ｋ＋１が入力される。これは第３図（Ｄ）のＴ　
ｎ　＋　３のタイミングにおいてもバス上のデータがそ
れぞれ示されている。従ってＴ　ｎ　＋　３のタイミン
グにおいてＢＩ２とＹ２との積をとり、前の累積器Ｚと
の和が求められる。従って、ユニット１においては、Ｔ
７。２において求まったＢ．とＹ＋　との積とＴ　ｎ　
＊　３において求められるＢ１■とＹ２との積との和が
求められその結果が累積器Ｚに格納される。同様にユニ
ット２においては前の結果であるＢ２２ＸＹ２　＋Ｂｚ
３×Ｙ３の結果が累積器Ｚに格納される。トレイｋに対
しても同様である。そしてまたシフトし、タイミングＴ
　ｎ　＋　４に移る。Then, since they have been shifted, Y2 is stored in tray 1, Y3 is stored in tray 2, and Y k + + is stored in tray k, and the elements of coefficient matrix B are also stored in trays 1, 2, .・・・
・．． B1■, B23, ...+ for k, respectively
Bk k+1 is input. This is T in Figure 3 (D)
Data on the bus is also shown at timing n+3. Therefore, at the timing of T n + 3, BI2 is multiplied by Y2, and the sum with the previous accumulator Z is obtained. Therefore, in unit 1, T
7. B. found in 2. The product of Y+ and T n
* The sum of the product of B1 and Y2 obtained in step 3 is obtained and the result is stored in the accumulator Z. Similarly, in unit 2, the previous result B22XY2 +Bz
The result of 3×Y3 is stored in accumulator Z. The same applies to tray k. Then shift again, timing T
Move to n+4.

このような演算が進み、時間区域Ｔ。＊＠＋ｌ　まで行
うと第３図（Ｃ）の時間区域Ｔ　ｎ　＋　Ｉ１。直に示
されるようにＢ　ＩＩＩＩＸ　Ｙ．Ｎが前の累積器Ｚと
の値に加えられるとユニット１においてはＴ，，。２で
求めたＢＸ　Ｙ　＋　−．　Ｔ　ｎ　＋　２におけるＢ
　ＢＸ　Ｙ２　、Ｔｎ＋３で求めたＢ　，，Ｘ　Ｙ３等
の積の和が求まり、Ｔ７。１までの内積結果が累積器Ｚ
に格納されているので、その結果にＢＩｍ　Ｘ　Ｙ　ｙ
３が加わって行列Ｂの１行目とベクトルｙとの内積が実
行される。ユニット２においては同様に行列Ｂの２行目
の行ベクトルとベクトルｌとの内積演算が行われ、同様
にｋ行目の行ベクトルと、ベクトルｌの内積がデータ処
理ユニット１ｋで実行される。従って、このような時系
列で処理を行うことによってｋＸｍの長方行列Ｂに対し
てｍに比例する処理時間で実行可能となり、従って良好
な台数効果を得ることが可能となる。As such calculations proceed, the time area T is reached. *If you proceed up to @+l, you will get the time area T n + I1 in Figure 3(C). B IIIX Y. as shown directly. When N is added to the value of the previous accumulator Z, in unit 1 T, . BX Y + −. B at T n + 2
BX Y2 , the sum of the products of B , ,
Since it is stored in , the result is BIm X Y y
3 is added and the inner product of the first row of matrix B and vector y is performed. Similarly, in unit 2, an inner product operation is performed between the row vector of the second row of matrix B and vector l, and similarly, an inner product operation of the row vector of the k-th row and vector l is performed in data processing unit 1k. Therefore, by performing the processing in such a time series, it becomes possible to perform the processing on the kXm rectangular matrix B in a processing time proportional to m, and therefore it becomes possible to obtain a good effect on the number of units.

本実施例においてはシフトレジスタ３の長さを変更でき
ること、及び中間結果をトレイ２に書き込み、それを新
たなデータとして処理できることが重要である。シフト
レジスタ３の長さを変更できなければ、データをすべて
巡回するためにｎ単位時間が必要になってしまう。また
中間結果を新たなデータとして処理できることで小規模
なハードウエアでリングシストリックアレイ方式より広
い範囲の処理が実行可能となっている。さらに書き込み
に要する時間が短くて各一定であることも重要である。In this embodiment, it is important that the length of the shift register 3 can be changed and that intermediate results can be written to the tray 2 and processed as new data. If the length of the shift register 3 cannot be changed, it will take n units of time to cycle through all the data. Additionally, by being able to process intermediate results as new data, it is possible to perform a wider range of processing than with the ring systolic array method using small-scale hardware. Furthermore, it is important that the time required for writing is short and constant.

第４図は本発明の第３の実施例説明図である。FIG. 4 is an explanatory diagram of a third embodiment of the present invention.

このシステムではｍｘｎの長方行列Ａの転置行列ＡＴ、
すなわち（ｎＸｍ）の行列と要素数ｍのベクトルＸとの
積とを計算するものである。同図において第１図に示し
たもの同じものは同一の記号で示してある。In this system, the transposed matrix AT of the mxn rectangular matrix A,
That is, it calculates the product of an (nXm) matrix and a vector X having m elements. In the figure, the same parts as shown in FIG. 1 are indicated by the same symbols.

転置行列ＡＴとへクトルＸとの積を求める場合において
は行列Ａを構成する部分行ヘクトルを各データ処理ユニ
ット１に接続された記憶装置４中に格納し、演算途中に
生ずる部分和をトレイ中のデータ保持回路２ａ上に累積
しつつシフトレジスタ３上のデータを循環させる。When calculating the product of the transposed matrix AT and hector The data on the shift register 3 is circulated while being accumulated on the data holding circuit 2a.

第４図（Ａ）は第３の実施例の構或の詳細ブロック図で
あり、ｎＸｍ　（ｎ≧ｍ≧１）の行列ＡＴと要素数ｍの
ベクトルＸの積１（要素数ｎ）を求めるものである。同
図において、第１図で示したものと同一のものは同一の
記号で示してあり、ｌａはデータ処理ユニット１の処理
装置であり、例えばデジタルシグナルプロセッサで構成
され、２ａはトレイ２のデータ保持回路であり、例えば
ラッチ回路で構成され、２ｂはトレイ２のデータ転送回
路であり、例えばバスドライバで構成され、２Ｃはトレ
イ２の制御手段であり、例えば論理回路で構成され、４
はデータ処理ユニット１にデータを供給する手段の一部
であると同時にデータ処理ユニット１を制御する手段の
一部である記憶装置であり、例えばＲＡＭ　（ランダム
アクセスメモリ）で構成され、５はデータ処理ユニント
１とトレイ２の同期動作を行う手段であり、５ａはクロ
ック発生回路であり、例えば水晶発振回路で構成され、
５ｂはクロック分配回路であり、例えばバッファ回路か
ら構成される。FIG. 4(A) is a detailed block diagram of the structure of the third embodiment, and the product 1 (number of elements n) of the matrix AT of nXm (n≧m≧1) and the vector X of m elements is calculated. It is something. In the figure, the same components as those shown in FIG. 2b is a data transfer circuit for the tray 2, which is composed of, for example, a bus driver; 2C is a control means for the tray 2, which is composed of, for example, a logic circuit;
5 is a storage device which is part of the means for supplying data to the data processing unit 1 and at the same time is part of the means for controlling the data processing unit 1, and is composed of, for example, a RAM (random access memory); It is a means for synchronizing the processing unit 1 and the tray 2, and 5a is a clock generation circuit, which is composed of, for example, a crystal oscillation circuit.
5b is a clock distribution circuit, which is composed of, for example, a buffer circuit.

第４図（Ｂ）は第３の実施例の動作フローチャートであ
る。Ｘ＋をユニットＩｔ　　（ｉ＝１，　　・・ｍ）に
セソトする。モしてＹｉ（　ｉ　＝　１　，・・，ｎ）
の値をＯにする。各ユニット１１はＡ　Ｊ　（とＸカを
掛け合わせ、積をＹｊに足し込む動作をｉ＝１．・・・
．ｎに対して行ってシフトする。FIG. 4(B) is an operation flowchart of the third embodiment. Sesoto X+ into unit It (i=1, . . . m). Yi (i = 1,...,n)
Set the value to O. Each unit 11 multiplies A J ( and
．． Go to n and shift.

この動作をｊ＝１，・・・，ｍに対して繰り返す．転置
行列とベクトルの掛け算は、記憶装置４中に格納された
行列八の各部分行ベクトルをそのままにして計算可能と
なり、これは後述するニューラルネットの学習アルゴリ
ズムの１つであるバックプロバゲションの実行において
は極めて重要となる。またネットワークの量はオーダｎ
ですむこと。Repeat this operation for j=1,...,m. The multiplication of a transposed matrix and a vector can be calculated by leaving each partial row vector of the matrix 8 stored in the storage device 4 unchanged, and this is done using backpropagation, which is one of the neural network learning algorithms described later. This is extremely important in execution. Also, the amount of network is of order n
That's all.

リングネットワークである。またデータ転送時間が処理
時間の影に隠れて転送時間に対するオーバヘッドはない
ことになる。しかもＳＩＭＤ方式である。It is a ring network. Furthermore, since the data transfer time is hidden by the processing time, there is no overhead to the transfer time. Moreover, it is a SIMD method.

第４図（Ｃ）は第３の実施例の動作概要図である。ユニ
ットｈには、Ａ１１からＡ１Ｌ，ｌまでを順に与えてい
く。ユニット１２にはＡ２２からＡ２３．？・，Ａ２．
を与え、ｋ番目のユニットには記憶回路を介して、Ａｈ
ｈ，　　Ａｋ　ｋｌ　＋　　・・・．　　Ａｈ　ト＋を
順に与える。ｍ番目にはＡ■，　Ａｍ　ｍ＋＋　＋　　
・・Ａ．．１を順に与えていく。また、トレイ上を循環
するものはＹ１からＹｎである。FIG. 4(C) is a schematic diagram of the operation of the third embodiment. Unit h is given sequentially from A11 to A1L,l. Unit 12 has A22 to A23. ?・,A2.
is given, and the k-th unit is given Ah
h, Ak kl + .... Give Ah + in order. The mth is A■, Am m++ +
...A. ．． 1 in order. Further, those circulating on the tray are Y1 to Yn.

第４図（Ｄ）は第３の実施例の動作タイムチャートであ
る。時間区ＭＴ．からＴｎまでのバス上のデータが示さ
れ、これらは第６図（Ｃ）の時間区域Ｔ１からＴｎまで
の図にそれぞれ対応している。FIG. 4(D) is an operation time chart of the third embodiment. Time zone MT. The data on the bus from T1 to Tn are shown, which respectively correspond to the diagram from time period T1 to Tn in FIG. 6(C).

時間区域Ｔ＋においては、Ｙ１からＹ７まではすぺてＯ
である。モしてＡ．とＸ１との積がユニット１１で形威
され、それをＹ＋に足し込む。それと同時にＡ２２とＸ
２がＹ２に足し込まれ、Ａｋｋ×ＸｋがＹｋに足し込み
、Ａエ×Ｘ．がＹ．に足し込まれる。そしてシフト動作
に入るとタイミングＴ２になる。すなわちＹデータが循
環する。第１のユニットではＡ　Ｈ　２　Ｘ　Ｘ　１が
計算され、これがＹ２に足し込まれるが、そのＹ２はＴ
Ｉにおいて求まったＡ　２２　Ｘ　Ｘ　２の値が格納さ
れているのでこれに足し込まれる。そのため、Ａ２２Ｘ
Ｘ２　＋ＡＩ２ＸＸＩの結果がＹ２となる。同様にユニ
ット２においては、前のＹ３の結果にＡ　２３　Ｘ　Ｘ
　２が足し込まれる。In time area T+, all from Y1 to Y7 are O.
It is. Mo and A. The product of and X1 is calculated in unit 11 and added to Y+. At the same time, A22 and X
2 is added to Y2, Akk×Xk is added to Yk, A×X. is Y. It is added to. Then, when the shift operation starts, timing T2 occurs. In other words, the Y data circulates. In the first unit, A H 2 X X 1 is calculated and added to Y2, which is T
Since the value of A 22 X X 2 found in I is stored, it is added to this. Therefore, A22X
The result of X2 +AI2XXI is Y2. Similarly, in unit 2, A 23 X X is added to the previous result of Y3.
2 is added.

ｋ番目のユニットにおいてはＹ　ｋ＊　１にＡｋｋ。１
×Ｘ，が加えられる。また、ｍ番目のユニットにはＹ．
。１にＡＩｌｓ。Ｉｘｘ．が加えられことになる。Akk for Y k* 1 in the kth unit. 1
×X, is added. Also, the m-th unit has Y.
. AIls in 1. Ixx. will be added.

このようにＹデータを循環するとｍ番目の時間区域Ｔｎ
においては、例えば第１のユニット１、においては、そ
の前までに求まったＹｎにＡ　１　ｎＸ　Ｘ　１が加え
られる。またＹ＋にはＡ　２　１　Ｘ　Ｘ　２が加えら
れる。これを全体的に眺めてみると、例えば、ベクトル
鬼の第１の要素Ｘ＋には、Ｔ＋においてＡ■と積がとら
れ、Ａ　Ｉ　Ｉ　Ｘ　Ｘ　ｌが計算される。それはＹ１
に格納される。また、転置行列八〇の第１行目の第２番
目の要素Ａ２１ＸＸ２は実は最後のクロック周！ｔｌＩ
Ｔｎにおいて計算されている。これは同じＹ，に格納さ
れている形になっている。また、転置行列Ａ７の第１行
目の最後の要素であるＡ１１とＸ１との積は第４図（Ｃ
）のクロック周期Ｔ　ｎ　−　ｖ＊　＊　２のｍ番目の
ユニットで計算されている。すなわちＡ１とＸ１の積が
Ｙ＋に足し込むことによって得？れる。転置行列Ａ７の
第２行目においても同様であり、ＡＩ２とＸ１との積は
Ｔ２のクロックにおいては、ユニット１において計算さ
れている。また、Ａ２■×Ｘ２はクロック周期Ｔ１の第
２番目のユニットにおいて行われている。モしてＹ２が
再び循環して積の実行が行われるのは、時間区域Ｔｎ−
ＴＩ。２である。その時間区域以後は乗算が行われ、シ
フト動作が行われる。そして時間区域Ｔｆｌにおいては
Ｙ２に足し込まれる値は第３番目のユニットであり、Ｙ
２に足し込まれる値はＡ３■ＸＸ３である。従って、Ｔ
７において転置行列ＡＴの第２行目とベクトルＸの内積
が計算される。一般に第ｋ番目のユニットに関してはｋ
番目のトレイからのデータ線がｌｌｍであるから第４図
（Ｄ）に示されるように、１１ｋに示すところを追って
いけばよいことになる。すなわち、Ｔ１においてはＹ，
＋ＡＨＸＸｍ　、ＴｚにおいてはＹ　ｋ＊　ｒ　＋　Ａ
　ｈ　ｋ＋　＋×χ，、Ｔ３においてはＹｋ＋ｚ　＋Ａ
ｈ　ｋ＋２　Ｘｈが計算され、’ｒｎ−＋においてはＹ
ｋ−■＋Ａｌ　ｋ−ｚ　Ｘｋが計算され、時間区域Ｔｎ
においてはＹｋ−１＋Ａｋｋ−Ｉ　Ｘｋが計算されるこ
とになる。このことにより転置行列ＡＴとｍ次元のベク
トルＸの積が実行される。すなわち、転置行列Ａ７とベ
クトルＸとの積を求める場合においては、行列Ａを構成
する部分行ベクトルを各データ処理ユニットＩに接続さ
れた記憶装置４中に格納し、演算途中に生ずる部分和を
トレイ２中のデータ保持回路上に累積しつつシフトレジ
スタ上を循環させている。このような方法により行列Ａ
とベクトルＵとの積Ｘに継続して行列八の転置ＡＴとベ
クトル真の積を求める場合は、行列ＡとベクトルＵとの
積を求める時に用いた各データ処理ユニット１に接続さ
れた記憶装置４中に格納された行列八の各部分行ベクト
ルをそのまま用いて、すなわち転置行列八〇の部分行列
を各データ処理ユニット１に転送することなしに処理を
おこなしうことができ、従って転送に要する時間が節約
でき、さらに処理時間が短縮できることになる。When Y data is circulated in this way, the mth time area Tn
For example, in the first unit 1, A 1 nX X 1 is added to Yn determined up to that point. Further, A 2 1 X X 2 is added to Y+. Looking at this as a whole, for example, the first element X+ of the vector demon is multiplied with A■ at T+, and A I I X X l is calculated. That is Y1
is stored in Also, the second element A21XX2 in the first row of transposed matrix 80 is actually the last clock cycle! tlI
Calculated at Tn. This is stored in the same Y. In addition, the product of A11, which is the last element in the first row of the transposed matrix A7, and X1 is shown in Figure 4 (C
) is calculated in the mth unit of clock period T n − v* * 2. In other words, is the product of A1 and X1 obtained by adding it to Y+? It will be done. The same applies to the second row of the transposed matrix A7, and the product of AI2 and X1 is calculated in unit 1 at the clock of T2. Further, A2.times.X2 is performed in the second unit of clock period T1. Then, Y2 is circulated again and the product is performed in the time area Tn-
T.I. It is 2. After that time period, a multiplication is performed and a shift operation is performed. In the time area Tfl, the value added to Y2 is the third unit, Y
The value added to 2 is A3■XX3. Therefore, T
7, the inner product of the second row of the transposed matrix AT and the vector X is calculated. Generally, for the kth unit, k
Since the data line from the th tray is llm, it is sufficient to follow the line 11k as shown in FIG. 4(D). That is, at T1, Y,
+AHXXm, Y k* r + A at Tz
h k+ +×χ,, Yk+z +A at T3
h k+2 Xh is calculated, and in 'rn-+ Y
k−■+Al k−z Xk is calculated, and the time domain Tn
In this case, Yk-1+Akk-I Xk is calculated. As a result, the product of the transposed matrix AT and the m-dimensional vector X is executed. That is, when calculating the product of transposed matrix A7 and vector The data is accumulated on the data holding circuit in tray 2 and circulated on the shift register. By this method, matrix A
When calculating the product of the transpose AT of matrix 8 and the vector true following the product Processing can be performed using each sub-row vector of matrix 8 stored in matrix 8 as is, that is, without transferring the sub-matrix of transposed matrix 80 to each data processing unit 1. This saves time and further reduces processing time.

第４図（Ｅ）は第４図（Ｂ）の繰り返し部分を詳細に分
解して示したフローチャートである。FIG. 4(E) is a flowchart showing a detailed breakdown of the repeated portion of FIG. 4(B).

第５図は本発明の第４の実施例図である。本実施例は本
発明を利用したニューロコンピュータの構或図である。FIG. 5 is a diagram showing a fourth embodiment of the present invention. This embodiment is a diagram showing the structure of a neurocomputer using the present invention.

同図において第４図に示したものと同一のものは同一の
記号で示してある。同図において１ａはデータ処理ユニ
ット１の処理装置であり、例えばデジタルシグナルプロ
センサで構成される。２ａはトレイ２のデータ保持回路
であり、例えばラッチ回路で構成される。２ｂはトレイ
２のデータ転送回路であり、例えばバスドライバで構成
される。２ｃはトレイ２の制御手段であり、例えば論理
回路で構成される。４はデータ処理ユニット１にデータ
を供給する手段の一部であると同時にデータ処理ユニッ
ト１を制御する手段の一部でもある記憶装置である。例
えばＲＡＭで構成される。５ａはデータ処理ユニット１
とトレイ２の同期動作を行う手段であり、５ａはクロッ
ク発生回路、例えば水晶発振回路で構成される。５ｂは
クロンク分配回路であり、例えばパッファ回路で構成さ
れる。これに加えて１０１はシグモイド関数と称される
単調非減少連続関数及びその微分係数を計算するシグモ
イド関数ユニットであり、例えば多項式による近似式に
より実現される。ｌ０３は学習時の終了を判定する手段
であり、例えば通信手段により前記各処理ユニット１と
接続されたホストコンピュータと、各処理ユニット■が
計算した出力誤差を前記通信手段により前記ホストコン
ピュータに通知する手段と、一般に複数個の前記出力誤
差値を基に学習の終了を判定し、ニューロコンピュータ
の停止を行う手段とから構或される。なお１０２はニュ
ーロコンピュータの全体である。In this figure, the same parts as those shown in FIG. 4 are indicated by the same symbols. In the figure, 1a is a processing device of the data processing unit 1, which is composed of, for example, a digital signal processing sensor. Reference numeral 2a denotes a data holding circuit for the tray 2, which is composed of, for example, a latch circuit. 2b is a data transfer circuit for the tray 2, which is composed of, for example, a bus driver. Reference numeral 2c denotes a control means for the tray 2, which is composed of, for example, a logic circuit. A storage device 4 is part of the means for supplying data to the data processing unit 1 and is also part of the means for controlling the data processing unit 1. For example, it is composed of RAM. 5a is data processing unit 1
5a is a means for performing synchronized operation of the tray 2 and the tray 2, and 5a is constituted by a clock generation circuit, for example, a crystal oscillation circuit. 5b is a Cronk distribution circuit, which is composed of, for example, a buffer circuit. In addition, 101 is a sigmoid function unit that calculates a monotonically non-decreasing continuous function called a sigmoid function and its differential coefficient, and is realized by, for example, an approximate expression using a polynomial. l03 is a means for determining the end of learning, and for example, a host computer connected to each of the processing units 1 through a communication means and an output error calculated by each processing unit (2) is notified to the host computer through the communication means. and means for determining the end of learning based on the plurality of output error values and stopping the neurocomputer. Note that 102 is the entire neurocomputer.

第５図（Ｂ）は本発明のニューロコンピュータにおいて
処理の計算における基本素子であるニューロンモデルの
実施例図である。ニューロンモデルは入力Ｘ．，ＸＺ，
　　・・・，Ｘ１の各々にシナプス結合としての重み時
Ｗ．，Ｗ２，　　・・・，Ｗｎをそれぞれ掛け、その総
和を求め、これを内部値Ｕとする。このＵに非線形関数
ｆを施し、出力Ｙとする。ここで非線形関数ｆは図に示
すようなＳ型のシグモイド関数が一般に使われる。FIG. 5(B) is an example diagram of a neuron model which is a basic element in processing calculations in the neurocomputer of the present invention. The neuron model has input X. ,XZ,
. . , X1 has a weight W as a synaptic connection. , W2, . A nonlinear function f is applied to this U to obtain an output Y. Here, as the nonlinear function f, an S-type sigmoid function as shown in the figure is generally used.

第５図（Ｃ）は第５図（Ｄ）のニューロンモデルの複数
を用いて入力層、中間層、出力層の３Ｎ構造でニューロ
コンピュータを形成する階層型のニューラルネットワー
クの概念図である。第ＩＮの入力層は入力信号Ｉ＋，Ｉ
ｚ，　　・・・，　　ＩＮ＋１１を入力する。第２層の
中間層は各々のユニット、すなわち、各々のニューロン
モデルが第１層のすべてのニューロンモデルに接続され
、その結合枝がシナプス結合であって、重み値Ｗｉｊが
与えられている。第３層の出力層は同様に中間層の各ニ
ューロンモデルの全てに各々のユニットが接続されてい
る。その出力は外部に出される。このニューラルネット
においては学習時において入力層に与えられる入力パタ
ーンの信号に対応する教師信号と出力層との出力信号と
の誤差を求め、この差が非常に小さくなるように中間層
と出力層との間の重み及び第１層と第２層の間の重みを
定めるようにする，このアルゴリズムがパックプロバゲ
ーション法則、すなわち逆伝播学習則と呼ばれるもので
ある。逆伝播学習則によって定められた重み値を保存し
、例えばパターン認識等の連想処理を行う場合には、第
１層の入力にて認識するべきパターンからややずれた不
完全なパターンを与えると、出力層からそのパターンに
対応した出力信号が出力され、その信号は学習時に与え
たそのパターンに対応する教師信号と非常に似たような
信号が出てくる。教師信号との差が非常に小さければ、
その不完全なパターンを認識したことになる。FIG. 5(C) is a conceptual diagram of a hierarchical neural network that uses a plurality of the neuron models shown in FIG. 5(D) to form a neurocomputer with a 3N structure of an input layer, an intermediate layer, and an output layer. The input layer of the IN-th is the input signal I+, I
Input z, ..., IN+11. In the intermediate layer of the second layer, each unit, that is, each neuron model, is connected to all the neuron models of the first layer, and the connecting branches thereof are synaptic connections, and are given weight values Wij. Similarly, each unit of the output layer of the third layer is connected to all of the neuron models of the intermediate layer. The output is output to the outside. In this neural network, during learning, the error between the teacher signal corresponding to the input pattern signal given to the input layer and the output signal of the output layer is calculated, and the error between the intermediate layer and the output layer is determined so that this difference is very small. This algorithm, which determines the weight between the two layers and the weight between the first layer and the second layer, is called the pack propagation law, that is, the backpropagation learning rule. When storing the weight values determined by the backpropagation learning rule and performing associative processing such as pattern recognition, if an incomplete pattern slightly deviated from the pattern to be recognized is given as input to the first layer, An output signal corresponding to the pattern is output from the output layer, and this signal is very similar to the teacher signal corresponding to the pattern given during learning. If the difference from the teacher signal is very small,
You have recognized that imperfect pattern.

第５図（Ａ）のニューロコンピュータ１０２を用いてこ
のニューラルネットワークの動作を工学的に実現できる
。本実施例では第５図（Ｃ）に示すような３層のネット
ワーク構成を用いるが、以下の説明のようにこの層数は
本実施例の動作にはなんら本質的な影響を受けない。同
図においてＮ（１）は第１層のニューロン数である。ま
た通常、第工層、すなわち入力層の各ニューロンの出力
は入力と等しいものとするので、実質的な処理の必要１
よない。通常の動作、すなわちパターン認識を行う場合
の前向きの処理を第５図（Ｄ）に示す。The operation of this neural network can be realized engineeringly using the neurocomputer 102 shown in FIG. 5(A). In this embodiment, a three-layer network configuration as shown in FIG. 5(C) is used, but as explained below, this number of layers has no essential influence on the operation of this embodiment. In the figure, N(1) is the number of neurons in the first layer. In addition, since the output of each neuron in the first layer, that is, the input layer, is usually equal to the input, the actual processing is necessary.
Not good. FIG. 5(D) shows normal operation, that is, forward-looking processing when performing pattern recognition.

第５図（Ｄ）は第４の実施例の前向き処理フロ一チャー
トである。前向き処理では第５図（Ｃ）に示すネットワ
ークにおいて、各層間の結合枝上の重み係数は定まって
いるものとする。第５図（Ｃ）のネットワークを第５図
（Ａ）のニューロコンピュータで実現する場合、次の処
理が行われる。前向き動作の基本動作は第５図（Ｂ）の
ニューロンモデルにおいて、入力に重みを掛けその総和
をとったものをＵとし、そのＵに非線形関数を施す処理
となる。これを各層において行うことになる。そのため
、まず、ステップ７０において入力データ、すなわちＩ
１から■８．，までのデータをシフトレジスタ上にセッ
トする。そして層の数をＬで表すと、以下のすべての処
理を層分繰り返す。例えばＬが３であった場合には、３
回繰り返す。繰り返される層は１層分の前向き処理であ
る。FIG. 5(D) is a forward processing flowchart of the fourth embodiment. In the forward processing, it is assumed that in the network shown in FIG. 5(C), the weighting coefficients on the connection branches between each layer are fixed. When the network of FIG. 5(C) is realized by the neurocomputer of FIG. 5(A), the following processing is performed. The basic operation of the forward movement is a process of multiplying the inputs with weights and taking the summation as U, and applying a nonlinear function to that U in the neuron model shown in FIG. 5(B). This will be done for each layer. Therefore, first, in step 70, the input data, that is, I
1 to ■8. , and set the data up to , into the shift register. Then, when the number of layers is represented by L, all the following processes are repeated for each layer. For example, if L is 3, then 3
Repeat times. The repeated layer is one layer of forward processing.

そして、処理が終了する。そのＩＮ分の前向き処理が下
側に示されている。今、中間層に注目すると、ｌは２で
ある。ステップ７２において、シフトレジスタの長さを
Ｎ（／２−１）にする。すなわち、ｆ＝２であるからＮ
（１）、すなわち入力層の数にする。ステップ７３は中
間層におけるニューロンモデルの処理である。インデッ
クスのｊは１から入力層のユニット数Ｎ（１）まで変化
させる。Ｗｉｊ（ｆ）は入力層と中間層の間の結合にお
ける重み係数である。すなわちｆ＝２である。Ｙ，（ｆ
−１）は入力層のｊ番目のユニットからの出力である。Then, the process ends. The forward processing of the IN portion is shown at the bottom. Now, if we focus on the middle layer, l is 2. In step 72, the length of the shift register is set to N(/2-1). That is, since f=2, N
(1), that is, the number of input layers. Step 73 is processing of the neuron model in the intermediate layer. The index j is varied from 1 to the number of units in the input layer N(1). Wij(f) is a weighting factor in the connection between the input layer and the hidden layer. That is, f=2. Y, (f
-1) is the output from the j-th unit of the input layer.

ｉは中間層のｉ番目のユニットを意味する。ｉ番目のユ
ニットの状態Ｕｉ（２）は入力層の出力Ｙｊ、すなわち
ｊ番目のＹに重みＷｉ，をかけてその総和より計算され
る。ステップ７４に移って、その中間層のｉ番目の状態
Ｕｉ（２）は非線形関数、すなわちシグモイド関数に入
力され、その出力がＹｉ（２）となる。すなわちステッ
プ７３の内積計算は第５図（Ａ）のユニット内で行うが
、このシグモイド関数の計算は、１０１によって行われ
る。ステップ７５で例えば、中間層のｉ番目のユニット
の出力ｙｉ（２）はトレイのｉ番目に出力される。そし
て処理が終わる。以上の前向き処理を入力層、中間層、
出力層に対して行うことになる。このようにして各層の
前向き処理が終了する。すなわちニューロン単体のシミ
ュレーションに必要な処理は第５図（Ｂ）の式で示され
る演算で、その内容は重みと入力ベクトルとの内積演算
及びその演算結果に対するシグモイド関数値の計算であ
り、その関数値の計算はシグモイド関数ユニット１０１
により実現される。従って、ネットワーク中のある１層
の処理は第５図（Ｃ）に示すように、そのニューロン単
体の演算をその層内の全ニューロン分行うことである。i means the i-th unit of the intermediate layer. The state Ui(2) of the i-th unit is calculated from the sum of the output Yj of the input layer, that is, the j-th Y, multiplied by the weight Wi. Proceeding to step 74, the i-th state Ui(2) of the intermediate layer is input to a nonlinear function, ie, a sigmoid function, and its output becomes Yi(2). That is, the calculation of the inner product in step 73 is performed within the unit shown in FIG. In step 75, for example, the output yi(2) of the i-th unit of the intermediate layer is output to the i-th tray. Then the process ends. The above forward processing is performed in the input layer, middle layer,
This will be done for the output layer. In this way, the forward processing of each layer is completed. In other words, the processing required to simulate a single neuron is the calculation shown in the equation shown in Figure 5 (B), which consists of the inner product calculation of the weight and the input vector, and the calculation of the sigmoid function value for the result of the calculation. Value calculation is performed by sigmoid function unit 101
This is realized by Therefore, as shown in FIG. 5(C), the processing of one layer in the network is to perform calculations for a single neuron for all neurons in that layer.

従って内積演算は各ニューロンｉ番目とするの結合係数
ベクトルを並べた行列Ｗ　（Ｉ！．）　＝　（ＷＩＪ　
（ｆｆｉ）　：ｌと、その層への入力を並べたベクトル
ｘ　（ｊ２）　一（ＸＪ（Ｉ！．））の積のベクトルｔＪ　（ｆ）　＝　（ｕｉ　（ｆ）　）となり、これは
本発明の第３の実施例で説明した方法で実行可能となる
。またシグモイド関数演算は各シグモイド関数ユニツ｝
１０１が積ベクトルの各要素、ｕｔ（ｐ）を入力し、対
応する関数値Ｙ１（ｆ）＝ｆ　（Ｕｔ　（ｆ））を出力
することによってなされる。継続する層すなわち、第（
ｌ＋１）層が存在する場合は、その各関数値出力Ｙｉ（
／！）を各トレイに書き込み、第（ｆ＋１）層の処理に
おいてはこれを入力として以上の過程を繰り返す。Therefore, the inner product operation is a matrix W (I!.) = (WIJ
(ffi): vector tJ (f) = (ui (f)), which is the product of l and the vector x (j2) one (XJ (I!.)), which is a list of the inputs to that layer, and this is the result of the present invention. This can be executed by the method described in the third embodiment. Also, sigmoid function operations are performed using each sigmoid function unit}
101 is done by inputting each element of the product vector, ut(p), and outputting the corresponding function value Y1(f)=f (Ut(f)). Successive layers i.e.
l+1) layer exists, each function value output Yi(
/! ) is written in each tray, and in the processing of the (f+1)th layer, this is input and the above process is repeated.

次に第５図（Ａ）のニューロコンピュータを用いて学習
動作、すなわちバックブロパゲーションアルゴリズムを
実行する場合について説明する。Next, a case will be described in which a learning operation, that is, a backblopage algorithm is executed using the neurocomputer shown in FIG. 5(A).

第５図（Ｅ）は第４の実施例の学習処理フローチャート
である。ニューロコンピュータにおける学習とはネット
ワークが所望の入出力関係を満たすようになるまで各ニ
ューロンの重みを修正することである。学習方法は所望
の入力信号ベクトルと教師信号ベクトルとの対を複数個
、すなわち教師信号の集合分だけ用意し、その中から１
対を選び、その入力信号Ｉ，を学習対象ネットワークに
入力し、入力に対するネットワークの出力と正しい出力
信号、すなわちその入力信号に対応した教師信号○，と
を比較する。この差を誤差と称するが、その誤差、及び
この時の入出力信号の値を基？、各ニューロンの重みを
修正することになる。FIG. 5(E) is a learning process flowchart of the fourth embodiment. Learning in a neurocomputer involves modifying the weights of each neuron until the network satisfies the desired input-output relationship. The learning method is to prepare a plurality of pairs of a desired input signal vector and a teacher signal vector, that is, for a set of teacher signals, and select one from them.
A pair is selected, the input signal I, is input to the learning target network, and the output of the network for the input is compared with the correct output signal, that is, the teacher signal ○ corresponding to the input signal. This difference is called an error, but is it based on that error and the values of the input and output signals at this time? , which will modify the weight of each neuron.

この過程を教師信号の集合中の全要素にわたり学習が収
束するまで繰り返すものである。すなわち、入力パター
ンの数の分だけ、すべて重み値として分布的に記憶する
ことになる。この後ろ向き処理と呼ばれる重みの修正過
程において出力層で得られた誤差を途中で変形しながら
入力層に向け通常の信号の流れる向きとは逆方向に伝播
させる。これがバックプロパゲーションのアルゴリズム
である。This process is repeated until learning converges over all elements in the set of teacher signals. That is, the number of input patterns are all stored as weight values in a distributed manner. In this weight correction process called backward processing, the error obtained in the output layer is propagated toward the input layer in the opposite direction to the normal signal flow direction, while being modified along the way. This is the backpropagation algorithm.

まず前記誤差Ｄを以下のように再帰的に定義する。Ｄｉ
　　（Ｉ！．）は第１層のｉ番目のニューロンから逆向
きに伝播される誤差、Ｌはネットワークの層数である。First, the error D is defined recursively as follows. Di
(I!.) is the error propagated backward from the i-th neuron of the first layer, and L is the number of layers of the network.

Ｄｉ　（Ｌ）　一Ｖ　　（Ｕｉ　（Ｌ））（Ｙｉ　（Ｌ
）−Ｏｐｉ）　　　　　　　（最終層）（１）Ｄｉ　　
（４２−１）＝ｆ’　　（Ｕｉ　（４２−１））Σ，８
１，■ｊ，Ｗｊ　ｉ　（ｊ２）　Ｄｊ　（Ａ）（Ａ＝２
，　　・　・　・，Ｌ）　　　　（２）（ｉ＝１，　　
・　・　・，　　Ｎ　（ｆ）　）ここでｆ’　　（Ｕ）
はシグモイド関数ｆ　（Ｘ）のＸに対する微係数ｒ’　
　（ｘ）のＸ＝Ｕの時の値であり、例えばｆ　（Ｘ）　＝ｔａｎｈＸ　　　　　　　　　　　　（
３）ならば、ｆ　’　　（Ｘ）　＝ｄ　（ｔａｎｈＸ）　／ｄ　Ｘ＝
　１−ｔａｎｈ２Ｘ＝１−ｆ２　（Ｘ）　　　　　　　
　　（４）であるから、ｒ′　（Ｕｉ）＝１−ｆ２　（Ｕｉ）＝１−Ｙｉ”（５
）である。Di (L) 1V (Ui (L)) (Yi (L)
)-Opi) (Final layer) (1) Di
(42-1)=f' (Ui (42-1))Σ,8
1,■j,Wj i (j2) Dj (A) (A=2
, ・・ , L) (2) (i=1,
・・・, N (f)) where f' (U)
is the differential coefficient r' of the sigmoid function f (X) with respect to X
It is the value of (x) when X=U, for example, f (X) = tanhX (
3), then f' (X) = d (tanhX) /d X=
1-tanh2X=1-f2 (X)
(4), so r' (Ui)=1-f2 (Ui)=1-Yi''(5
).

このＤｉ（！：Ｙｉを基に、以下にように重みを更新す
る。基本的には次の式を用いる。ここでηは重みを更新
する刻み巾であり、小さければ学習安定に収束する収束
が遅くなり、大きすぎると収束ひなくなるという性質を
持ったバラメタである。Based on this Di(!:Yi), the weights are updated as follows. Basically, the following formula is used. Here, η is the increment width for updating the weights, and if it is small, the convergence will converge to stable learning. It is a variable parameter that has the property that it becomes slow, and if it is too large, it will not converge.

Ｗｉ　ｊ　（ｊ２）　（ｔ”’　　一Ｗｉ　ｊ　　（／
！）　（Ｌ）十ΔＷ　ｉ　ｊ　（４２）　ＬＬ’　　　
　　　　（６）ΔＷｉｊ（４１！）（Ｌゝ　一ηＤｉ（
ｅ）Ｙｊ（ｆｆｉ−１）　　　　　　（ｆ＝２，　　・
・・，Ｌ）（７）しかし、次に式も良く用いられている
。これは上式ノΔＷｉ　ｊ　（ｆｆ）　”　を１次にデ
ジタルローバスフィルタに通したことになっており、α
はその時定数を決めるパラメタである。Wi j (j2) (t”' １Wi j (/
! ) (L) 1ΔW i j (42) LL'
(6)ΔWij(41!)(Lゝ -ηDi(
e) Yj(ffi-1) (f=2, ・
..., L) (7) However, the following formula is also often used. This means that ΔWi j (ff)'' in the above equation is passed through a digital low-pass filter in the first order, and α
is the parameter that determines the time constant.

ΔＷｉ　ｊ　　（ｅ）　（Ｌ″１〉＝ηＤｉ（ｊ！）Ｙ
ｊ（１−１）＋αΔＷｉ　ｊ　（ｊ２）　”　　　　　
　（８）この後ろ向き処理の過程において必要となる演
算はベクトル間の演算、或いは行列とベクトルとの演算
であり、特にその中心となるのは各層のニューロンの重
みを要素とする重み行列Ｗの転置行列Ｗ７と前記誤差ベ
クトルＤＪ　（−ｅ）との乗算である。この誤差ベクト
ルは工層内に複数個のニューロンがある一般の場合、誤
差はベクトルとなる．第５図（Ｅ）の左のフローチャー
トを説明する。ΔWi j (e) (L″1〉=ηDi(j!)Y
j(1-1)+αΔWi j(j2)”
(8) The operations required in this backward processing process are operations between vectors, or operations between matrices and vectors, and in particular, the center is the transposition of the weight matrix W, whose elements are the weights of neurons in each layer. This is a multiplication of the matrix W7 and the error vector DJ (-e). This error vector is a vector in the general case where there are multiple neurons in the engineering layer. The flowchart on the left side of FIG. 5(E) will be explained.

１層分の前向きの処理と後向きの処理が行われる。まず
、入力データＩｐをシフトレジスタ上にセットし、１層
分の前向き処理をシステムで行う。Forward processing and backward processing for one layer are performed. First, input data Ip is set on a shift register, and the system performs forward processing for one layer.

これは各層で行われるため、この前向き処理を層の数分
だけ繰り返す。すると出力データＯｐが出力されるので
、これをシフトレジスタ上にセットする。そして、ステ
ップ７つから以下を出力層のユニット分だけ並列に実行
する。すなわち誤差Ｄ，（Ｌ）＝Ｙｉ　（Ｌ）　　Ｏｐ
　（ｉ）を計算し、この誤差をトレイのｉ番目にセット
する。そして出力層から入力層に向かって各層毎に後向
き処理を行う。This is done for each layer, so this forward processing is repeated for as many layers as possible. Then, output data Op is output, so this is set on the shift register. Then, from step 7, the following steps are executed in parallel for as many units as the output layer. That is, the error D, (L) = Yi (L) Op
(i) and set this error to the i-th tray. Then, backward processing is performed for each layer from the output layer to the input layer.

この後向き処理は第５図（Ｅ）の右上側に示されている
。第Ｌ番目の層に関して、この層の数はＮ（ｌ）である
からシフトレジスタ長をＮ　（ｆ）にする。そして以下
の動作をこの前の層のユニット数だけ並列に実行する。This backward processing is shown on the upper right side of FIG. 5(E). Regarding the Lth layer, since the number of layers is N(l), the shift register length is set to N (f). Then, the following operations are executed in parallel for the number of units in the previous layer.

すなわち、上記（２）式を、ステップ８３において実行
する。ここで重要なのは重みはｗＪｚ（ｚ）となってお
り、これは重み行列の転置行列ＷＴの要素になっている
。そしてステップ８４において、上記（６），　（７）
あるいは（８）式を計算し、重みの更新を行う。ステッ
プ８５で、求まった誤差Ｄ．（ｆ−１）をトレイのｉ番
目に出力する。これは次の誤差を計算するため、ステッ
プ８４の動作に必要となる。第５図（Ａ）の右下は第５
図（Ｅ）の左のフローチャート、すなわち前向き処理と
後向き処理の連続処理を学習が習得するまで繰り返すこ
とを意味するフローチャートである。また、このような
処理において重みの更新と学習を安定にするために重み
の修正量の平滑化等の処理があるが、これらはいずれも
行列のスカラ倍及び行列同士の加減算からなり、やはり
、本ニューロコンピュータにおいて行える。またシグモ
イド関数ユニット１０１はハードウエアで実現するもの
としているが、ソフトウエアで実現してもよい。また、
学習の終了の反転手段１０３はホストコンピュータ上の
ソフトウエアで実現してもよい。That is, the above equation (2) is executed in step 83. What is important here is the weight wJz(z), which is an element of the transposed matrix WT of the weight matrix. Then, in step 84, the above (6), (7)
Alternatively, equation (8) is calculated and the weights are updated. In step 85, the error D. Output (f-1) to the i-th tray. This is necessary for the operation of step 84 in order to calculate the next error. The lower right of Figure 5 (A) is the 5th
This is a flowchart on the left side of FIG. 3(E), which means that sequential processing of forward processing and backward processing is repeated until learning is mastered. In addition, in order to stabilize weight updating and learning in such processing, there is processing such as smoothing the amount of weight correction, but these all involve scalar multiplication of matrices and addition and subtraction between matrices, and as expected, This can be done on this neurocomputer. Further, although the sigmoid function unit 101 is assumed to be realized by hardware, it may be realized by software. Also,
The learning end reversal means 103 may be realized by software on the host computer.

以上のニューロコンピュータをさらに第５図（Ｆ）を用
いて説明する。第５図（Ｆ）はエラーバックプロパゲー
ションの学習を行う時の処理フロー図である。ここでは
、ベクトル表示を用いている。同図においてＸ　（／！
）は第ｌ層のニューロンベクトル、Ｗは同しく結合係数
、すなわち重み行列である。ｒはシグモイド関数ｅ、（
ｆ）は第乏層の出力側から逆向きに伝播してきた誤差ベ
クトル、ΔＷは重みの修正量である。入力信号が与えら
れると、まず、３層である場合には、入力層はないもの
とすれば、隠れ層の前向き処理を行う。The above neurocomputer will be further explained using FIG. 5(F). FIG. 5(F) is a processing flow diagram when learning error backpropagation. Vector representation is used here. In the same figure, X (/!
) is the neuron vector of the lth layer, and W is the coupling coefficient, that is, the weight matrix. r is the sigmoid function e, (
f) is the error vector propagated in the opposite direction from the output side of the depleted layer, and ΔW is the weight correction amount. When an input signal is given, first, in the case of three layers, forward processing of the hidden layer is performed, assuming that there is no input layer.

それがｕ＝Ｗｘ（ｆ）である。このＵに非線形関数を施
せば、次の層、すなわち（ｊ２＋１）ｌ’ｉＪの入力と
なる。これは出力層の入力であるから、その前向き処理
を行う。そして教師信号を入力し、後向き処理になる。That is u=Wx(f). If a nonlinear function is applied to this U, it becomes the input for the next layer, that is, (j2+1)l'iJ. Since this is the input of the output layer, forward processing is performed on it. A teacher signal is then input, and backward processing begins.

出力層においては教師信号と出力信号の誤差ｅをｒの微
分を掛けて後向き処理にする。また中間層等の間の誤差
は逆伝播してくる誤差信号に微分をかけた変数に重み行
列の転置行列Ｗ７をかけて求められる。誤差ベクトルの
各要素にシグモイドの微分をかけた値に前のＷＴの要素
を掛けてこれよりΔＷを求め、Ｗを更新すればよい。こ
のようにして、出力層の後向き処理、及び隠れ層の後向
き処理が行われる。前向き処理で行う演算は、重み行列
Ｗと入力ベクトルＸとの積、この結果ベクトルの各要素
のシグモイド関数の値の計算である。この計算は各ニュ
ーロンで並列に計算できる。また後向き処理でも仕事は
大きく分けて２あり、１つ目は教師信号と出力信号との
誤差を順次変形しながら、後から前へ逆向きに伝播する
こと、また２つ目はその誤差を基に重みを修正すること
である。この逆向きの計算では重み行列Ｗの転置行列ｗ
Ｔによる乗算が必要になる。転置行列ＷＴとベクトルの
積は前の実施例で述べている。すなわちバックブロバゲ
ーションの学習を実現する再の重要な点は重み行列の転
置行列ＷＴとベクトル乗算の効率な実現方法である。In the output layer, the error e between the teacher signal and the output signal is multiplied by the differential of r for backward processing. Furthermore, the error between the intermediate layers and the like is obtained by multiplying the variable obtained by differentiating the back-propagated error signal by the transposed matrix W7 of the weight matrix. ΔW may be obtained by multiplying the value obtained by multiplying each element of the error vector by the sigmoid differential by the element of the previous WT, and W may be updated. In this way, backward processing of the output layer and backward processing of the hidden layer are performed. The calculation performed in forward processing is the product of the weight matrix W and the input vector X, and the calculation of the value of the sigmoid function of each element of the resulting vector. This calculation can be performed in parallel in each neuron. In addition, there are two main tasks in backward processing: the first is to sequentially transform the error between the teacher signal and the output signal and propagate it backwards from the back to the front, and the second is to propagate the error based on the error. The solution is to modify the weights. In this inverse calculation, the transposed matrix w of the weight matrix W
Multiplication by T is required. The product of the transposed matrix WT and the vector is described in the previous embodiment. In other words, an important point in realizing backbroadcasting learning is an efficient method of realizing vector multiplication with the transpose matrix WT of the weight matrix.

さらに第５図（Ｇ）と（Ｈ）を用いて前向き積和計算、
及び後向き積和計算の実施例を説明する．前向き積和演
算は行列×ベクトルの計算で、特に行列は重み行列Ｗで
ある。本発明で、行列ベクトル積Ｌｌ＝ＷＸを計算する
場合、例えば、次の式・　・　・（９）に対して、重み行列の行とベクトルＸとの積が同時に行
われる。この処理方式を第７図（鎖を用いて？明する。Further, using Figure 5 (G) and (H), calculate the forward sum of products,
An example of backward product-sum calculation will be explained. The forward product-sum operation is a matrix×vector calculation, and in particular, the matrix is a weight matrix W. In the present invention, when calculating the matrix-vector product Ll=WX, for example, the rows of the weight matrix and the vector X are simultaneously multiplied for the following equation (9). This processing method is explained using Figure 7 (chains).

重み行列Ｗは長方行列である。例えば、３×４の行列で
ある。ベクトル真の各要素はトレイ上に入力される。Ｔ
１の時刻において、Ｘ，とＷ．、Ｘ２とＷ２■、Ｘ３と
Ｗ３３が各々のユニットで計算される。Ｔ２に移るとベ
クトルＸの各要素は上に巡回シフトする。Ｔ２において
Ｗ，２とＸ２との積がＵ＋に足される。したがってＵ１
はこの時刻にはＸ　ＩＸ　Ｗ　＋　＋　＋　Ｘ　ｚ　Ｘ
　Ｗ　＋。となる。また、第２のユニットではＷ２３と
Ｘ３が掛けられ、第３番目のユニットではＷ，，ＸＸ４
が掛けられる。Ｔ，において、Ｗ，３とＸ３が掛けられ
Ｕｌに足し込まれる。Ｗ２４とＸ４が掛けられ、Ｕ２に
加えられる．Ｗ３１とＸ１が掛けられＵ３に足し込まれ
る。この時Ｘ２は演算の対象からはずされている。Ｔ４
において、Ｗ，４とＸａ　、Ｗｚ＋とＸ　Ｉ，　Ｗ３■
とＸ２がそれぞれ同時に掛けられＵ＋　、Ｕｚ　、Ｕ３
にそれぞれ足し込まれる。この場合、Ｘ３は演算の対象
外となっている。この演算の対象外を考慮することによ
って長方行列とベクトルとの積が実行される。The weight matrix W is a rectangular matrix. For example, it is a 3×4 matrix. Each element of the vector truth is entered on the tray. T
At time 1, X, and W. , X2 and W2, and X3 and W33 are calculated in each unit. Moving to T2, each element of vector X is cyclically shifted upward. At T2, the product of W,2 and X2 is added to U+. Therefore U1
is X IX W + + + X z X at this time
W +. becomes. Also, in the second unit W23 and X3 are multiplied, and in the third unit W,,XX4
is multiplied. At T,, W,3 and X3 are multiplied and added to Ul. W24 and X4 are multiplied and added to U2. W31 and X1 are multiplied and added to U3. At this time, X2 is excluded from the calculation target. T4
In, W, 4 and Xa, Wz+ and X I, W3■
and X2 are multiplied simultaneously and U+, Uz, U3
are added to each. In this case, X3 is not subject to calculation. The product of a rectangular matrix and a vector is performed by taking into consideration what is not subject to this operation.

？の部分ベクトルＷｉ２はＰＥ−■のローカルメモリ上
にＷｉｉが先頭になるようにスキューされて格納されて
いる。Ｘｉはトレイにのってリング上を反時計回りに一
回転する。ＵｉはＰＥ−，内部のレジスタ上に累積され
る。? The partial vector Wi2 is stored in the local memory of PE-2 in a skewed manner so that Wii is at the beginning. Xi gets on the tray and rotates counterclockwise around the ring. Ui is accumulated on a register inside PE-.

左端の状態でＵｉ＝Ｏの状態からスタートする。It starts from the state of Ui=O in the leftmost state.

ＰＥ−ｚは自分の目の前にあるＸｊとＷｉｊと掛け合わ
せ、その結果をＵｉに加算する。同時にＸｊは隣のトレ
イに隣接される（リング上を反時計回りに循環する）。PE-z multiplies Xj and Wij in front of it, and adds the result to Ui. At the same time, Xj is adjacent to the next tray (circulating counterclockwise on the ring).

これを４回繰り返すと全てのＵｉが同時に求まる。By repeating this four times, all Ui's can be found at the same time.

Ｗｉｔがスキューされていること、χｉが全てトレイ中
にある状態からスタートすること、Ｕｉが全て同時に求
まる。Wit is skewed, χi starts from a state where all are in the tray, and Ui are all found at the same time.

第５図（Ｈ）は後向き積和計算の説明である。FIG. 5(H) is an explanation of backward product-sum calculation.

これは転置行列と行ベクトル積、ｅ＝Ｗ”ｖを計算する
時のタイミング図である。この場合、ベクトルＶは前の
層の誤差ベクトルに非線形関数の微分を掛けた要素から
なるベクトルである。ｅは求めらようとする次の層での
逆伝播用の誤差へクトルである。本発明で重要なことは
、転置行列Ｗｔであっても、前向き積和計算において利
用されるメモリ上のＷと同じ配置にしたままで演算でき
ることである。This is a timing diagram when calculating the transposed matrix and the row vector product, e=W''v. In this case, the vector V is a vector consisting of elements obtained by multiplying the error vector of the previous layer by the derivative of the nonlinear function. .e is the error hector for backpropagation in the next layer to be determined.What is important in the present invention is that even if the transposed matrix Wt is It is possible to perform calculations with the same arrangement as W.

すなわち本発明では求めるべきＣのベクトルの巡回シフ
トによってなされる。演算するべき転置行列Ｗ”とベク
トルＶとの式は００）式に従う。That is, in the present invention, this is done by cyclically shifting the vector of C to be determined. The formula of the transposed matrix W'' and vector V to be calculated follows the formula 00).

上の式において示されるように、行列Ｗは転置されしか
も、長方行列である。ｅ，はＷ．Ｘｖ，＋Ｗ２１ＸＶ２
　＋Ｗ３１ＸＶｌである。この演算を行うために、第５
図（Ｈ）において、時間区域Ｔ＋においては第１のユニ
ット（ＤＳＰ）において、Ｗ１１とｖ１の積が演算され
ている。これがＯであるｅ，に差し込まれる。そして、
巡回シフトするとＴ２に移るが、ｅｌはＴ２時刻におい
ては演算？対象になっていない。モしてＴ３になると、
３番目のユニットにおいて演算対象となっている。As shown in the above equation, matrix W is transposed and is a rectangular matrix. e, is W. Xv, +W21XV2
+W31XVl. To perform this operation, the fifth
In the diagram (H), in the time period T+, the product of W11 and v1 is calculated in the first unit (DSP). This is inserted into e, which is O. and,
Cyclic shift moves to T2, but is el an operation at time T2? Not targeted. When it becomes T3,
It is the object of calculation in the third unit.

すなわちＷ３Ｉにｖ３を掛けた値が前の値に足し込まれ
るため、Ｗ，，Ｘｖ．に足し込まれる。そのため時間区
域Ｔ３においては、ｅ，の結果はＷ１１×ｖ，＋Ｗ３■
ＸＶ，となる。モしてＴ４に移ると、ｅ１は巡回シフト
として、第２番目のユニットで演算対象となる。ここで
、ｅ，にはＷ２，ＸＶ２が加えられるため、００式の行
列の第１行目とベクトルＶとの内積演算が実行され、そ
の演算結果がｅ，に格納されることになる。That is, since the value obtained by multiplying W3I by v3 is added to the previous value, W,,Xv. It is added to. Therefore, in time area T3, the result of e is W11×v,+W3■
It becomes XV. When the process moves to T4, e1 becomes a calculation target in the second unit as a cyclic shift. Here, since W2 and XV2 are added to e, an inner product operation between the first row of the matrix of equation 00 and the vector V is executed, and the result of the operation is stored in e.

同様に第２行目とベクトルとの積はｅ２を追えばよい。Similarly, the product of the second row and the vector can be obtained by following e2.

Ｔ１時刻にはＷ２２ＸＶ２　、Ｔ２にはＷＩ２ＸＶＩ，
Ｔｌでは、ｅ２が遊びになり、Ｔ４でＷ３２ＸＶ．の積
が求まれ、各々の積の和として計算される。Ｗ７の第３
行目とベクトルＶとの積はｅ３を追えばよい。ＴＩにお
いてはＷ３３Ｘｖ３　、Ｔ２においてはそれにＷ２３Ｘ
Ｖ２が足し込まれ、Ｔ３において、更にＷ，．Ｘｖ，が
足し込まれる。Ｔ４はｅ４は遊びとなる。ＷＴの第４行
目とベクトル■との積はｅ４を追えばよい。Ｔ＋時刻で
はｅ４は遊びである。Ｔ２ではＷｚ４ＸＶｉ　、Ｔ３で
はＷ２４Ｘｖ２が足し込まれ、Ｔ４において更にＷ，４
Ｘｖ，が足し込まれて、計算ができる。このように本発
明では、Ｗの部分ベクトルＷｉ９は前と同様ＰＥ．のロ
ーカル目上にＷｉｉが先頭になるようにスキューされて
格納されている。前と入れ替わるのはｅｉとＶｉである
。つまり、ｅｉはトレイ上を反時計回りに循環しながら
累積され、ＶｉはＰＥ−１内部に常駐する。W22XV2 at time T1, WI2XVI at T2,
In Tl, e2 becomes idle, and in T4, W32XV. The product is calculated as the sum of each product. W7 3rd
The product of the row and the vector V can be obtained by following e3. W33Xv3 in TI, W23X in T2
V2 is added, and at T3, W, . Xv, is added. For T4, e4 is a play. The product of the fourth row of WT and the vector ■ can be obtained by following e4. At time T+, e4 is idle. Wz4XVi is added at T2, W24Xv2 is added at T3, and W,4 is added at T4.
Xv, is added and calculation can be done. Thus, in the present invention, the partial vector Wi9 of W is PE. are stored skewed so that Wii is at the top of the local hierarchy. It is ei and Vi that are replaced with the previous ones. That is, ei is accumulated while circulating counterclockwise on the tray, and Vi resides inside PE-1.

左端の状態でｅ　ｊ＝ｏからスタートする。ＰＥ−ｔは
ＶｉとＷｉｊとを掛け合わせ、その結果を自分の目の前
にあるｅｊに加え込む。同時にこの更新されたｅｊは隣
のトレイに転送される（リング上を反時計回りに循環す
る）。これを４回繰り返すと全てのｅｊが同時に求まる
。Start from e j=o in the leftmost state. PE-t multiplies Vi and Wij and adds the result to ej in front of him. At the same time, this updated ej is transferred to the adjacent tray (circulating counterclockwise on the ring). By repeating this four times, all ej can be found at the same time.

このように本発明のニューロコンピュータは層が何層で
あっても実現でき、学習アルゴリズムの自由度が高いと
いう柔軟性を持つばかりでなく、ＤＳＰの速度そのまま
を利用でき、しかもそのＤＳＰの演算においてオーバヘ
ッドがなく、高速性があり、しかもＤＳＰによるＳＩＭ
Ｄが実行できる。In this way, the neurocomputer of the present invention can be realized with any number of layers, and not only has the flexibility of having a high degree of freedom in the learning algorithm, but also can utilize the speed of the DSP as it is, and moreover, in the calculation of the DSP. No overhead, high speed, and DSP-based SIM
D can be executed.

第６図は本発明の第５の実施例説明図であり、アナログ
データによる行列の積を求めるものである。図中、第２
図で示したものと同一のものは同一の記号で示してあり
、１ｄはデータ処理ユニットｌの処理装置であり、例え
ばアナログ乗算器１ｅと積分器１ｆで構成され、２ｄは
トレイ２のデータ保持回路であり、例えばサンプル／ホ
ールド回路２ｒで構成され、２ｅはトレイ２のデータ転
送回路であり、例えばアナログスイッチ２ｇとバッファ
アンプ２ｈで構成され、６はトレイ２にデータを設定す
る手段であり、例えばアナログスイッチ６ｄで構成され
る。FIG. 6 is an explanatory diagram of a fifth embodiment of the present invention, which calculates the product of matrices using analog data. In the figure, the second
Components that are the same as those shown in the figure are indicated by the same symbols, and 1d is a processing device of the data processing unit l, which is composed of, for example, an analog multiplier 1e and an integrator 1f, and 2d is a data holding unit of tray 2. The circuit is composed of, for example, a sample/hold circuit 2r, 2e is a data transfer circuit for the tray 2, which is composed of, for example, an analog switch 2g and a buffer amplifier 2h, 6 is means for setting data in the tray 2, For example, it is composed of an analog switch 6d.

本実施例の動作は本発明の原理図（第１図）で説明した
動作と同じである。The operation of this embodiment is the same as that described in the principle diagram of the present invention (FIG. 1).

第７図は本発明の第６の実施例説明図であり、帯行列と
ベクトルとの乗算を示している。図中、第２図で示した
ものと同一のものは同一の記号で示してある。FIG. 7 is an explanatory diagram of a sixth embodiment of the present invention, showing the multiplication of a band matrix and a vector. In the figure, the same parts as those shown in FIG. 2 are indicated by the same symbols.

本実施例の動作を第７図（Ｂ）を参照しつつ説明する。The operation of this embodiment will be explained with reference to FIG. 7(B).

本発明では、ｍ×ｎ　（ｎ≧ｍ≧１）で巾ｋの帯行列Ａ
と要素数ｎのベクトルＸとの乗算結果（要素数ｍのベク
トル！）を求める場合において、第７図（Ａ）の如く、
各々２つの入力を持ち乗算機能と概乗算結果の累積機能
を有するｍ個のデータ処理ユニット１と、ｎ個のトレイ
２と、前記各データ処理ユニット１にせとぞくされた入
力データ供給手段とから成る構成に於いて、第７図（Ｂ
）に示す手順で、第７図（Ｃ）及び第７図（Ｄ）のよう
な時系列で処理をするようにしている。従って、巾ｋの
帯行列とベクトルとの乗算がｋに比例する処理時間で実
行できる。In the present invention, a band matrix A of m×n (n≧m≧1) and width k
When calculating the multiplication result of the vector X with the number of elements n (vector with the number of elements m!), as shown in Fig. 7(A),
m data processing units 1 each having two inputs and having a multiplication function and an approximate multiplication result accumulation function, n trays 2, and input data supply means for each of the data processing units 1; In the configuration consisting of
), the processing is performed in chronological order as shown in FIGS. 7(C) and 7(D). Therefore, multiplication of a band matrix of width k by a vector can be performed in a processing time proportional to k.

本実施例に於いて重要な事は、ベクトルＸを１回転させ
ない事、及びベクトル見をシフトレジスタ３上にセット
する際に、第１の実施例等と異なり、頂度帯が始まる位
置にずらしておくことである。すなわち、帯の開始位置
から処理を開始する場合は、ある方向にずらしながら積
和演算を行えばｋに比例する時間で処理が終了する。し
かし、図示しないが何らかの事情で帯の途中に配置した
状態から処理を開始する場合は、始めにベクトル見を一
端までずらせばよいことは明らかであり、その場合、シ
フトレジスタ３が双方向にシフト可能であることが意味
を持つのである。What is important in this embodiment is that the vector It is a good idea to keep it. That is, when starting the process from the start position of the band, if the product-sum calculation is performed while shifting in a certain direction, the process will be completed in a time proportional to k. However, if for some reason (not shown) you start processing from a state where the band is placed in the middle of the band, it is obvious that it is sufficient to first shift the vector view to one end, and in that case, the shift register 3 shifts in both directions. It is meaningful that it is possible.

即ち、例えば帯の中央から処理を開始する場合は、初め
に右にｋ／２（小数点以下切り捨て）だけずらし、以後
逆方向（この場合左）にずらしながら積和演算を行えば
、合計３７２ｋに比例する時間で処理が終了する。That is, for example, when starting processing from the center of the band, first shift it to the right by k/2 (round down to the decimal point), and then perform the product-sum operation while shifting it in the opposite direction (in this case, to the left), resulting in a total of 372k. The process will complete in a proportional amount of time.

もし、シフトレジスタ３が双方向にシフト可能でなけれ
ば、ベクトル見を１回転させねばならないため、帯行列
の巾ｋではなくその大きさｎｌこ比例する時間が必要に
なる。大規模な帯行列の於いては、この差は非常に大き
く、帯行列とベクトルとの乗算が帯行列の巾ｋに比例す
る処理時間で実行可能となることは本発明の方式の利点
である。If the shift register 3 is not capable of shifting in both directions, the vector must be rotated once, which requires a time proportional to the size nl of the banded matrix rather than the width k. For large-scale banded matrices, this difference is very large, and the advantage of the method of the present invention is that the multiplication of a banded matrix and a vector can be performed in a processing time proportional to the width k of the banded matrix. .

第８図はトレイの構造を具体的に示す。FIG. 8 specifically shows the structure of the tray.

トレイは基本的には単なる１語のラッチであるが、ＤＳ
Ｐからのアクセスと、隣のトレイへの転送を１サイクル
で実行できる（ポストシフト）。Tray is basically just a one word latch, but DS
Access from P and transfer to the adjacent tray can be executed in one cycle (post shift).

機能の切り替えは、アドレス線の下位ビットにより、デ
ータのアクセスと同時に行い、速度を向上させている。Function switching is performed simultaneously with data access using the lower bits of the address line, improving speed.

一つのトレイはゲートアレイで約１　２　０　０　Ｂａ
ｓｉｃセルの規模であり、■チップに２〜４個入れるこ
とも可能である。One tray is a gate array with approximately 1200 Ba
It is the scale of a SIC cell, and it is also possible to put 2 to 4 pieces in a chip.

また、トレイ中にワークレジスタを数ワード内蔵するこ
とも可能である。It is also possible to incorporate several words of work registers in the tray.

第９図は本発明の実施例を用いて、実際に構成されたニ
ューロコンピュータのブロック図である。FIG. 9 is a block diagram of a neurocomputer actually constructed using an embodiment of the present invention.

Ｓａｎｄｙの基本構或はＤＳＰの一次元トーラス（リン
グ）結合によるＳＩＭＤ型マルチプロセッサである。It is a SIMD type multiprocessor based on Sandy's basic structure or a one-dimensional torus (ring) combination of DSPs.

特徴的なのは、結合トボロジーや動作は１次元シストリ
ックアレイと類似しているにも関わらず、ＳＩＭＤとし
て動作する事である。A distinctive feature is that it operates as a SIMD, even though its coupling topology and operation are similar to a one-dimensional systolic array.

各ＤＳＰと双方向バスで接続されている“トレイ”は、
転送機能を有するラッチであり、相互にリング状に接続
され、全体でサイクリックシフトレジスタを構或してい
る。以後このシフトレジスタをリングと呼ぶ。The “tray” is connected to each DSP via a bidirectional bus.
These latches have a transfer function and are connected to each other in a ring shape, making up a cyclic shift register as a whole. Hereinafter, this shift register will be referred to as a ring.

各ＤＳＰは２Ｋ語の内部メモリと６４語の外付けＲＡＭ
を持ち、内部メモリは１サイクルで、外部メモリはｌ〜
２サイクルでアクセスできる。Each DSP has 2K words of internal memory and 64 words of external RAM
, the internal memory takes one cycle, and the external memory takes l~
It can be accessed in 2 cycles.

外付けＲＡＭは、プログラムやデータの初期ロード用に
、共通バスでホストコンピュータのＶＭＥＷバスに接続
される。外部入力もパンファメモリを介してホストコン
ピュータに接続されている。The external RAM is connected to the host computer's VMEW bus via a common bus for initial loading of programs and data. External inputs are also connected to the host computer via breadboard memory.

第１０図は本発明の実施例における学習時の時間空間チ
ャートであり、縦方向はプロセッサの数を示し、横方向
は時間を示す。■は入力層のプロセッサの数、Ｈは隠れ
層のプロセッサの数、■はプロセッサの積和演算の時間
に対応する。FIG. 10 is a time-space chart during learning in the embodiment of the present invention, where the vertical direction shows the number of processors and the horizontal direction shows time. 2 corresponds to the number of processors in the input layer, H corresponds to the number of processors in the hidden layer, and 2 corresponds to the time for the product-sum operation of the processors.

入力信号が隠れ層の前向き積和に要する時間は、入力層
のプロセッサの数１と１つのプロセッサの積和に対応す
る時間Ｉとの積に比例する。次に、シグモイドの計算が
行われる。出力層においても出力層の前向き積和（２Ｈ
ｒ）とシグモイドが行ねれる。ここで、出力層のプロセ
ッサの数が隠れ層のプロセッサの数より少ないので、リ
ングの大きさ自体も小さくなる。次ぎに教師信号入力と
受信し、誤差計算を行い、誤差のバック・プロパゲーシ
ョンを行う。なお、この誤差計算は出力層のシグモイド
における誤差計算も服務出力層の後向き積和を行い、出
力層の重み更新を勾配ベクトル計算とローパスフィルタ
を介して行う。そして、隠れ層のシグモイドによる誤差
計算を経て、隠れ層においては、後向き積和は行わず隠
れ層の重み更新のみを行う。The time required for the input signal to perform the forward product-sum of the hidden layer is proportional to the product of the number 1 of processors in the input layer and the time I corresponding to the product-sum of one processor. Next, a sigmoid calculation is performed. Also in the output layer, the output layer's forward sum of products (2H
r) and sigmoid can be performed. Here, since the number of processors in the output layer is smaller than the number of processors in the hidden layer, the size of the ring itself is also smaller. Next, it receives the teacher signal input, calculates the error, and performs back propagation of the error. Note that in this error calculation, the error calculation in the sigmoid of the output layer is also performed by backward product sum of the service output layer, and the weight update of the output layer is performed via gradient vector calculation and a low-pass filter. Then, after calculating the error using the sigmoid in the hidden layer, only the weights of the hidden layer are updated without performing the backward sum of products.

〔Effect of the invention〕

以上説明した様に、本発明によれば従来の方法より広い
範囲の処理に対して、データ処理に伴うデータ転送によ
るオーバヘッド無しにデータを並列に処理出来る効果を
奏し、データ処理ユニットの台数に比例した高速なデー
タ処理が実現出来ることにより、行列演算あるいはニュ
ーロコンピュータ演算を行うデータ処理装置の性能向上
に寄与するところが大きい。As explained above, according to the present invention, data can be processed in parallel for a wider range of processing than conventional methods without the overhead of data transfer associated with data processing, and it is proportional to the number of data processing units. The ability to realize high-speed data processing greatly contributes to improving the performance of data processing devices that perform matrix operations or neurocomputer operations.

[Brief explanation of drawings]

第１図（Ａ）は、本発明の原理構成図、第１図（Ｂ）は
、本発明の動作フローチャート、第１図（Ｃ）は、本発
明の動作概要図、第１図（Ｄ）は、本発明の動作タイム
チャート、第２図（Ａ）は、第１の実施例の構成図、第
２図（Ｂ）は、第１の実施例の動作フローチャート、第２図（Ｃ）は、第１の実施例の動作概要図、第２図（
Ｄ）は、第１の実施例の動作タイムチャート、第３図（Ａ）は、第２の実施例の構或図、第３図（Ｂ）
は、第２の実施例の動作フローチャート、第３図（Ｃ）は、第２の実施例の動作概要図、第３図（
Ｄ）は、第２の実施例の動作タイムチャート、第４図（Ａ）は、第３の実施例の構或図、第４図（Ｂ）
は、第３の実施例の動作フローチャート、第４図（Ｃ）は、第３の実施例の動作概要図、第４図（
Ｄ）は、第３の実施例の動作タイムチャート、第４図（Ｅ）は、第３の実施例の詳細動作フローチャー
ト、第５図（Ａ）は、第４の実施例の構成図、第５図（Ｂ）
は、第４の実施例のニューロンモデル、第５図（Ｃ）は、第４の実施例のネットワーク、第５図
（Ｄ）は、第４の実施例の前向き処理フローチャート、第５図（Ｅ）は、第４の実施例の学習処理フローチャー
ト、第５図（Ｆ）は、Ｓａｎｄｙでエラーバックプロパゲー
ション学習を行うときの処理フローチャート、第５図（
Ｇ）は、Ｓａｎｄｙで行列ベクトル積む＝Ｗｘを計算す
るときのタイムチャート、第５図（Ｈ）は、転置行列で
の行列ベクトル積ｅ＝Ｗ”ｖを計算するときのタイムチ
ャート、第６図（Ａ）は、第５の実施例の構成図、第６
図（Ｂ）は、第５の実施例の動作フローチャート、第６図（Ｃ）は、第５の実施例の動作概要図、第６図（
Ｄ）は、第５の実施例の動作タイムチャート、第７図（Ａ）は、第６の実施例の構或図、第７図（Ｂ）
は、第６の実施例の動作フローチャート、第７図（Ｃ）は、第６の実施例の動作概要図、第７図（
Ｄ）は、第６の実施例の動作タイムチャート、第８図は、トレイの構造を具体的に示す図、第９図は、
本発明の実施例を用いて実際に構成されたニューロコン
ピュータのブロック図、第１０図は、本発明の実施例に
おける学習時の時間空間チャート、第１１図（Ａ）は、共通バスＳＩＭＤ方式の原理構成図
、第１１図（Ｂ）は、共通バスＳＭＤ方式による行列ベク
トル積の動作フローチャート、第１２図（Ａ）及び第１
２図（Ｂ）は、リングシストリック方式による行列ベク
トル積の動作原理図、第１２（Ｃ）は、リングシストリック方式による行列ベ
クトル積の動作原理図である。１・・・データ処理ユニット、２・・・トレイ、３・・・シフトレジスタ、４・・・記憶装置、５・・・同期手段、６・・・データ設定手段、７・・・長さ変更手段、１ｌ・・・データ処理ユニソト１の入力、１２・・・デ
ータ処理ユニット１の第２の入力、２１・・・トレイ２
の第１の入力、２２・・・トレイニの第１の出力、２３・・・トレイ２の第２の出力、２４・・・トレイ２の第２の入力、８２８３８４８５９１９２９３１の第１の入力、ｌの第１の出力、１の第２の入力、１の第２の出力、・　ＰＥ９・　ＰＥ９・　ＰＥ９・　ＰＥ９・　ＰＥ，・ＰＥ９　１の入出力、・共通バス．FIG. 1(A) is a principle block diagram of the present invention, FIG. 1(B) is an operation flowchart of the present invention, FIG. 1(C) is a schematic diagram of the operation of the present invention, and FIG. 1(D) is an operation time chart of the present invention, FIG. 2(A) is a configuration diagram of the first embodiment, FIG. 2(B) is an operation flowchart of the first embodiment, and FIG. 2(C) is an operation flowchart of the first embodiment. , a schematic diagram of the operation of the first embodiment, FIG.
D) is an operation time chart of the first embodiment, FIG. 3(A) is a configuration diagram of the second embodiment, FIG. 3(B)
is an operation flowchart of the second embodiment, FIG. 3(C) is an operation overview diagram of the second embodiment, and FIG.
D) is an operation time chart of the second embodiment, FIG. 4(A) is a configuration diagram of the third embodiment, FIG. 4(B)
is an operation flowchart of the third embodiment, FIG. 4(C) is an operation overview diagram of the third embodiment, and FIG.
D) is an operation time chart of the third embodiment, FIG. 4E is a detailed operation flowchart of the third embodiment, and FIG. 5A is a configuration diagram of the fourth embodiment. Figure 5 (B)
is a neuron model of the fourth embodiment, FIG. 5(C) is a network of the fourth embodiment, FIG. 5(D) is a forward processing flowchart of the fourth embodiment, and FIG. ) is a learning process flowchart of the fourth embodiment, FIG. 5(F) is a process flowchart when error back propagation learning is performed on Sandy, and FIG.
G) is a time chart when calculating matrix-vector product = Wx in Sandy, Figure 5 (H) is a time chart when calculating matrix-vector product e = W''v in a transposed matrix, Figure 6 (A) is a configuration diagram of the fifth embodiment;
FIG. 6(B) is an operation flowchart of the fifth embodiment, FIG. 6(C) is a schematic diagram of the operation of the fifth embodiment, and FIG.
D) is an operation time chart of the fifth embodiment, FIG. 7(A) is a configuration diagram of the sixth embodiment, FIG. 7(B)
is an operation flowchart of the sixth embodiment, FIG. 7(C) is an operation overview diagram of the sixth embodiment, and FIG.
D) is an operation time chart of the sixth embodiment, FIG. 8 is a diagram specifically showing the structure of the tray, and FIG. 9 is:
FIG. 10 is a block diagram of a neurocomputer actually constructed using an embodiment of the present invention, and FIG. 11 (A) is a time-space chart during learning in the embodiment of the present invention. The principle configuration diagram, Figure 11 (B) is an operational flowchart of matrix-vector product using the common bus SMD method, Figure 12 (A), and Figure 1
FIG. 2(B) is a diagram of the principle of operation of matrix-vector product using the ring systolic method, and FIG. 12(C) is a diagram of the principle of operation of matrix-vector product using the ring systolic method. DESCRIPTION OF SYMBOLS 1... Data processing unit, 2... Tray, 3... Shift register, 4... Storage device, 5... Synchronization means, 6... Data setting means, 7... Length change means, 1l...input of data processing unit 1, 12...second input of data processing unit 1, 21...tray 2;
22...First output of tray 2, 23...Second output of tray 2, 24...Second input of tray 2, 82 83 84 85 91 92 93 1 First input, first output of l, second input of l, second output of l, ・PE9 ・PE9 ・PE9 ・PE9 ・PE, ・I/O of PE9 1, ・Common bus.

Claims

[Claims] 1) a plurality of data processing units (1) each having at least one input (11), each having a first input (21) and an output (22) and each having a data storage and a data processing unit; Multiple trays (2
), wherein all or part of said trays (2) are each connected to a first input (11) of said data processing unit (1).
a second output (23) connected to the tray; and shifting means (3) comprising a first input (21) and an output (22) of the connecting tray (2). , data transfer on said shifting means (3) and said tray (
Parallel data characterized in that matrix operations or neurocomputer operations are performed by synchronously performing data transfer between 2) and the data processing unit (1) and data processing by the data processing unit (1). Processing method. 2) The parallel data processing system according to claim 1, wherein the shift means (3) is a cyclic shift register. 3) The parallel data processing system according to claim 1 or 2, further comprising means for changing the length of the shift means (3). 4) The parallel data processing system according to claim 3, wherein the means for changing the length of the shift means (3) is input switching means. 5) The parallel data processing system according to claim 3, wherein the means for changing the length of the shift means (3) comprises external data supply means and input selection means. 6) said data processing unit (1) outputs a first output (21);
), said tray (2) having a second input (24) connected to said first output (21), means for writing data from said data processing unit (1) to said tray (2). A parallel data processing method according to any one of claims 1 to 5, characterized in that it has the following. 7) The data processing unit (1) and the tray (2)
7. The parallel data processing system according to claim 6, wherein the data transfer path between the two is a bus commonly used for input and output. 8) When further processing the data processing result, the processing result is transferred to the tray (2) using the writing means.
Parallel data processing method described in Section. 9) said tray (2) is provided with a third input (25) and an output (26) each interconnected, said shifting means (
9. The parallel data processing system according to claim 1, wherein 3) is a bidirectional shift register. 10) Parallel data according to claim 9, characterized in that the data transfer path between the trays (2) constituting the bidirectional shift register is a bus commonly used for input and output. Processing method. 11) The parallel data processing method according to claim 9 or 10, wherein data is transferred bidirectionally on the bidirectional shift register. 12) Shifting means for circulating each element of the vector, the tray means (2) having an internal function of holding and transferring each element, and a tray means (2) corresponding to each row of the matrix and having at least two inputs. data processing unit means (1) having a function of multiplying between and accumulating the multiplication results; and a storage means (4) present in each data processing unit and capable of sequentially reading out elements of each row of the matrix. and by separating the data processing unit means (1) and said tray means (2) for cyclically shifting data,
Each data processing unit means (1) multiplies the elements of the cyclically shifted vector by the corresponding matrix elements from the storage means (4), and accumulates the multiplication results to determine the number of rows and columns. A parallel data processing method characterized by performing matrix operations or neurocomputer operations by calculating the product of rectangular matrices and vectors with different values. 13) A parallel data processing system according to claim 12, characterized in that said tray means (2) comprises bypass means (7) for changing the length of the cyclic shift. 14) The length of the shift register in the tray means (2) is set to n, and a vector consisting of elements equal to the number of n is set in each of the trays, and each of the data processing unit means (1) corresponds to the vector. Receiving vector elements and matrix elements from the tray and storage means (4), multiplying and accumulating them, and then repeating the operation of cycling through the vector elements n times, the results are stored in the tray means (2).
), the shift length of the cyclic shift is changed from n to m, and the same operation is repeated m times to further multiply the product of the rectangular matrix and the vector by a different rectangular matrix. Parallel data processing method according to 12. 15) When calculating the product of a transposed matrix of a rectangular matrix and a vector, partial row vectors constituting the matrix are stored in the storage means (4) connected to each data processing unit means (1).
The partial sum generated during the calculation is accumulated on the data holding circuit in each tray of the tray means (2), and the product of the data on each tray and the data from the storage means (4) is calculated. 13. The parallel data processing system according to claim 12, wherein the product of the transposed matrix and the vector can be calculated by transferring the partial sum onto a tray and performing cyclic shifting. 16) In the neural network, when the elements of each row of the rectangular matrix are made to correspond to the weights of the connecting branches connecting to the neuron model, the data processing unit means (1)
By multiplying each of the input variables in each data holding circuit of the tray means (2) by the weight from the corresponding storage means (4) and repeating the cyclic shifting operation within the tray means (2), A processing unit (103) that calculates the sum of the products of the weights of connected branches connected to the neuron model and input variables to the connected branches, and then applies a nonlinear function to perform forward processing of the neural network. 13. The parallel data processing method according to claim 12, wherein the parallel data processing method enables the following. 17) The parallel data processing method according to claim 16, wherein the nonlinear function is a sigmoid function. 18) The parallel data processing method according to claim 12, wherein the neural network is a hierarchical neural network having at least three layers. 19) Backward processing of the backpropagation learning rule in a hierarchical neural network, which is a process in which the error between the output signal from the output layer and the teacher signal is propagated toward the input layer in the opposite direction to the normal signal flow direction. In the method described in claim 15, a vector whose elements are back-propagated error signals and a transposed matrix W^T of a weighting coefficient matrix W whose elements are weights in the forward processing are combined using the method according to claim 15, that is, the transposed matrix of the matrix and the vector. While cyclically shifting the partial sum in the middle of the calculation on the tray means (2),
The product of each element of the weighting coefficient matrix stored in the storage means (4) and each element of the error vector in the data processing unit means (1) is calculated and added to the partial sum, and the result is stored as the partial sum in the tray. Claim 12, characterized in that by leaving it on the means (2), it is possible to execute the backpropagation learning rule by executing the process of calculating the product of the transposed matrix×vector as a backward product-sum calculation. 19. Parallel data processing method according to 18. 20) When the data is analog, the processing device of the data processing unit means (1) is composed of an analog multiplier and an integrator, and the data holding circuit of each tray of the tray means (2) is configured to store samples. Consists of a hold circuit,
20. A parallel data processing system according to claim 1, wherein the data transfer circuit of the tray means (2) comprises an analog switch and a buffer amplifier. 21) When multiplying a band matrix A with a matrix of m×n and a width k by a vector x with the number of elements n, the element tray means (2) of the vector x is performed without rotating the vector x once by cyclic shift. Claims 1 to 20 characterized in that the starting point from which the band of the matrix starts can be arbitrarily specified when shifting within the matrix.
Parallel data processing method described. 22) The parallel data processing method according to claim 21, wherein the direction of the shift can be bidirectional. 23) By separating the data processing unit means (1) and the tray having a data holding function, data transfer between the tray means (2) and data processing by the data processing unit means (1) can be performed simultaneously. The data transfer time is hidden behind the data processing time by performing the data transfer in parallel and making the time required for data transfer between the tray means (2) shorter than the time that the data processing unit means (1) has for data processing. 23. The parallel data processing method according to claim 1, wherein the parallel data processing method is characterized in that: