JP2000285081A

JP2000285081A - Data communication method between nodes

Info

Publication number: JP2000285081A
Application number: JP11094191A
Authority: JP
Inventors: Norihiko Kuroishi; 範彦黒石; Nobuaki Miyagawa; 宣明宮川; Mitsumasa Koyanagi; 光正小柳
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1999-03-31
Filing date: 1999-03-31
Publication date: 2000-10-13

Abstract

(57) [Abstract]
[Problem] To eliminate unnecessary memory accesses and achieve high processing performance.
[Solution] Data is transferred around a ring between a master node 2 and a plurality of slave nodes 3 to 6, which are among a plurality of nodes 2 to 6 connected in a ring via a bus 1. The master node 2 instructs the slave nodes 3 to 6 to perform a computation. When the results of the slave nodes 3 to 6 are collected, each slave node performs a predetermined operation between the data output by the immediately preceding node and its own computed result, and outputs the resulting data to the next node. Each slave node can therefore perform a predetermined pipeline operation as it sends out data, which eliminates unnecessary memory accesses and achieves high processing performance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an inter-node data communication method, and more particularly to an inter-node data communication method in which data is transferred around a ring between a master node and a plurality of slave nodes among a plurality of nodes connected in a ring via a bus.

[0002]

[Prior Art] One way to increase the processing capability of a computer is to perform parallel processing using a plurality of processors. When a data flow that was originally single is divided into many flows for parallel processing, data must be exchanged between processors. Inter-processor communication networks are classified into tightly coupled and loosely coupled types. In general, a tightly coupled network shows little variation in communication cost (communication time) between processors and is advantageous for connecting a small number of processors. A loosely coupled network, by contrast, has a low communication cost between processors placed physically or logically close to each other, but a high communication cost between processors that are far apart.

[0003] For applications that require an enormous amount of computation, an architecture in which many processors are connected by a loosely coupled network is effective. One example is Japanese Patent Application Laid-Open No. 7-58762, "Data Transfer Method". In that architecture, a DMA control unit can transfer a reasonably large block of data at once, so the large volume of input data needed for large-scale computation and the result data after processing can be transferred efficiently. However, DMA-controlled transfers are limited to those between the main memory and the communication buffer within the same node, so the amount that can be transferred at once is limited by the capacity of the communication buffer memory. In the "Multiprocessor System" described in Japanese Patent Application Laid-Open No. 9-91262, the DMA control unit is directly involved in data transfer between the ring bus and the main memory, so a larger amount of data can be transferred at once. When parallel processing is performed with this architecture, the results computed by each distributed-processing processor element (hereinafter, slave node) are stored in the local memory connected to that slave node. Ultimately it may be necessary to gather these results in one place and combine the data into one (hereinafter, aggregation processing). Conventionally, aggregation was performed by first issuing a data transfer request to each slave node individually, collecting all the data in the local memory connected to one special control processor element (the master node) that performs the aggregation, and then performing the aggregation on that master node alone.

[0004]

[Problems to Be Solved by the Invention] In this system, collecting and aggregating the computation results requires a processing time that is the sum of the following:
(1) the time to read the local memory of each slave node;
(2) the data transfer time from each slave node to the master node;
(3) the time to write the results collected from the slave nodes to the local memory of the master node;
(4) the time to read the data written in (3) from the local memory of the master node for aggregation;
(5) the computation time on the master node;
(6) the time to write the computation result to the local memory of the master node.
However, if the aggregation is of a kind that can be pipelined, such as summing the results of the slave nodes, and if master-slave inter-processor communication is used, in which communication packets do not collide as long as they flow in one direction, then steps (3) and (4), which store intermediate results in local memory, are clearly wasted.

[0005] An object of the present invention is to provide an inter-processor communication scheme that eliminates unnecessary memory accesses and achieves high processing performance by having each slave node perform a predetermined pipeline operation when it sends out data.

[0006]

[Means for Solving the Problems] To achieve the above object, the invention according to claim 1 is an inter-node data communication method for transferring data around a ring between a master node and a plurality of slave nodes among a plurality of nodes connected in a ring via a bus, characterized in that the master node instructs the slave nodes to perform a computation, and, when the results of the slave nodes are collected, each slave node instructed to perform the computation performs a predetermined operation between the data output by the immediately preceding slave node and its own computed result, and outputs the resulting data to the next node.

[0007] That is, the present invention is an inter-node data communication method for transferring data around a ring between a master node and a plurality of slave nodes among a plurality of nodes connected in a ring via a bus.

[0008] The master node instructs the slave nodes to perform a computation. The master node may instruct all the nodes other than itself (that is, all the slave nodes) to perform the computation, or, as in claim 2, it may instruct the computation selectively to some of the plurality of slave nodes.

[0009] When the results of the slave nodes are collected, each slave node instructed to perform the computation performs a predetermined operation between the data output by the immediately preceding slave node and its own computed result, and outputs the resulting data to the next node.

[0010] Here, "a slave node instructed to perform the computation" means every slave node when the master node has instructed all the nodes other than itself, and means only the slave nodes that received the instruction when the master node has instructed the computation selectively.

[0011] A slave node that is instructed to perform the computation and whose immediately preceding node is the master node behaves as follows. If the master node supplies data for the instructed operation, the slave node performs the predetermined operation between that data and its own computed result and outputs the resulting data to the next node. If the master node supplies no such data, the slave node outputs its own computed result to the next node.
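The collection scheme of paragraphs [0009] to [0011] can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `ring_collect`, the use of `None` to model "no data supplied by the master", and the sample values are all hypothetical.

```python
from operator import add


def ring_collect(slave_results, op=add, initial=None):
    """Pass a value around the ring, combining it at each slave node.

    slave_results: each slave's own computed result, in ring order.
    op:            the predetermined operation (assumed binary).
    initial:       data supplied by the master; None models the case
                   where the master sends no data to the first slave.
    """
    carried = initial
    for own in slave_results:
        if carried is None:
            # First slave, no data from the master: forward own result.
            carried = own
        else:
            # Combine the predecessor's output with this node's result.
            carried = op(carried, own)
    return carried  # value that arrives back at the master


# Made-up slave results; both calls yield the same total.
print(ring_collect([1.5, 2.0, 4.0, 0.5]))               # 8.0
print(ring_collect([1.5, 2.0, 4.0, 0.5], initial=0.0))  # 8.0
```

In this sketch the master never stores per-slave intermediate results; only the final combined value comes back to it.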

[0012] As described above, in the present invention, when the results of the slave nodes are collected, each slave node instructed to perform the computation performs a predetermined operation between the data output by the immediately preceding slave node and its own computed result, and outputs the resulting data to the next node. The data that finally returns to the master node is therefore already the data ultimately sought. The master node does not need to write the results collected from the slave nodes to its local memory or read the written data back from local memory, so the time for these operations is saved. In other words, each slave node performs a predetermined pipeline operation as it sends out data, which eliminates unnecessary memory accesses and achieves high processing performance.

[0013] When, as mentioned above, the master node instructs the computation selectively to some of the slave nodes, a slave node that was not instructed outputs the data from the immediately preceding slave node to the next node without changing its value. To do so, the node may simply forward the data as-is, or it may add 0 to or subtract 0 from the value, or multiply or divide the value by 1.

[0014] The above is described in more detail with reference to FIG. 1. The following description is intended to make the above easier to understand, and the present invention is not limited to it.

[0015] FIG. 1 is a conceptual diagram of inter-processor communication according to the present invention. A plurality of nodes 2 to 6 (five in FIG. 1), each a pair of a processor and a local memory, are connected in a ring by a ring bus 1. The operation for inter-processor communication is described below.

[0016] First, master node 2 sends out a packet for collecting data while computing on it. This packet contains an instruction word, which tells each slave node what to do, and initial data. The instruction word includes a code indicating the operation, a node identification code designating the slave nodes to which the packet applies, and so on. Here it is assumed that the packet applies only to slave node 3 and slave node 5. The initial data is sent so that the slave node connected in the direction in which the packet is sent, that is, the slave node that performs the first operation, does not need any special handling.

[0017] Next, slave node 3 receives this packet and interprets the instruction word. From the instruction word, slave node 3 learns that it must read data from its own local memory, perform the operation between that data and the data sent from the immediately preceding node (master node 2), and send the result to the next node (slave node 4). Slave node 3 starts reading data from its local memory, operates on the data arriving every cycle from master node 2 and the data read every cycle from local memory, and outputs a result to slave node 4 every cycle. Before the result data, it outputs the instruction code received from master node 2, so that the structure of the output packet matches that of the input packet received from master node 2.

[0018] Slave node 4 receives the packet output by slave node 3, refers to the node identification code in the instruction code, learns that it is not a target, and forwards the packet unchanged to slave node 5.

[0019] Slave node 5 refers to the node identification code and performs the same operation as slave node 3.

[0020] Once the packet has passed through all the slave nodes in turn and finally returns to master node 2, the result data is stored in the local memory of master node 2. At this point the result data has already been aggregated.
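The walk-through in paragraphs [0016] to [0020] can be traced with a short Python sketch. It assumes addition as the operation and uses made-up local values; the function `circulate` and the dictionary-based packet are hypothetical stand-ins for the hardware packet, not the format defined in the patent.

```python
def circulate(ring_order, local_data, targets, initial):
    """Trace one collection packet around the ring.

    ring_order: slave node IDs in the order the packet visits them.
    local_data: node ID -> value held in that node's local memory.
    targets:    node IDs named in the packet's node identification code.
    initial:    initial data sent by the master node.
    """
    packet = {"targets": frozenset(targets), "data": initial}
    trace = []
    for node in ring_order:
        if node in packet["targets"]:
            # Selected node: combine incoming data with local data.
            packet = {**packet, "data": packet["data"] + local_data[node]}
        # Non-selected nodes forward the packet unchanged.
        trace.append((node, packet["data"]))
    return packet["data"], trace


# Slave nodes 3 to 6 as in FIG. 1; the local values are made up.
local = {3: 10.0, 4: 20.0, 5: 30.0, 6: 40.0}
result, trace = circulate([3, 4, 5, 6], local, targets={3, 5}, initial=0.0)
print(result)  # 40.0
print(trace)   # [(3, 10.0), (4, 10.0), (5, 40.0), (6, 40.0)]
```

The trace shows the data leaving each node: nodes 4 and 6 pass the value through, while nodes 3 and 5 add their own contribution.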

[0021] With the processing described above, data collection and aggregation are done with a single packet, in contrast to first gathering the results at the master node and then computing. This reduces the impact of the large communication latency that is a weakness of ring buses, reduces the number of memory accesses, and makes effective use of the arithmetic units of slave nodes that have finished their required computation.

[0022]

[Embodiments of the Invention] An embodiment of the present invention is described in detail below with reference to the drawings. The embodiment described here applies the invention to a 64-bit ring bus.

[0023] FIG. 4 shows a parallel processing system that carries out the inter-node data communication method of this embodiment. The system consists of a host computer 10, used for developing simulation programs and analyzing result data, and a parallel simulation engine 11. The parallel simulation engine is formed by connecting one master node 12 and a plurality of slave nodes 13.1 to 13.n-1 in a ring by the ring bus 1. Every node includes a processor element (PE) 14 and a local memory 15. The master node also includes an interface circuit 16 for connecting to the host computer.

[0024] FIG. 5 shows the internal configuration of a processor element that performs the inter-processor communication of the present invention. The processor element 17 comprises: a ring bus interface unit 18, which handles inter-processor communication over the ring bus; an integer arithmetic unit 19, which controls the entire processor element and performs 32-bit integer arithmetic; a floating-point arithmetic unit 20, which performs 64-bit double-precision floating-point arithmetic; a 2 KB instruction cache unit 21, which can supply an instruction every cycle to the integer arithmetic unit 19 and the floating-point arithmetic unit 20; an 8 KB data cache unit 22, which can respond quickly to data access requests from the integer arithmetic unit 19 or the floating-point arithmetic unit 20; and a memory interface unit 23, which provides the interface to the local memory outside the processor element. These functional units are interconnected by an internal data bus 26.

[0025] The floating-point arithmetic unit 20 has a pipelined adder/subtractor and a pipelined multiplier, and can execute IEEE double-precision addition, subtraction, multiplication and division in a pipelined manner every cycle with a latency of four cycles. The ring bus interface unit 18 and the floating-point arithmetic unit 20 are connected by a ring bus operation input bus 24 and a ring bus operation output bus 25. As a result, even while the internal data bus 26 is in use by another unit, the floating-point arithmetic unit 20 can receive 64-bit floating-point operand data from the ring bus interface unit 18 every cycle and output the results to the ring bus interface unit.

[0026] The processor element 17 also has many other signals that are not shown, such as command synchronization input/output signals that indicate when a ring bus communication packet is received. These enable communication with other processor elements over the ring bus and high-speed arithmetic processing.

[0027] FIG. 6 shows the data format of the packet used for inter-processor communication. A packet consists of a plurality of 64-bit words, which are transmitted over the 64-bit ring bus one word at a time from the head. The information in the packet falls into four broad types: an instruction word 31, an address word 32, dummy words 33, and data words 34. The instruction word 31 contains information that identifies the packet so that the receiving node can perform the prescribed operation. The address word 32 contains an address field 35 indicating the address at which the operand data is stored on the receiving side. A dummy word 33 is invalid data inserted to align the processing timing on the sending and receiving sides; it only delays the arrival of the first data word by the required number of cycles, and its contents are never read. A data word 34 is operand data for the operation on input and result data on output. The instruction word 31 is further divided, by bit position, into four 16-bit fields. Bits 63 to 48 are an instruction identification code field 36 (FRPA), which holds a value that lets the receiving node distinguish this packet from other packets. Bits 47 to 32 are an operation data length field 37 (len), which holds the number of data words that follow. Bits 31 to 16 are a source node ID field 38 (src), which holds the identification number of the node that sent the packet. Bits 15 to 0 are a destination node ID field 39 (dest), which holds the identification number of the node that should receive the packet. The destination node ID field 39 can designate a plurality of nodes as well as a single node.
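The instruction word layout in paragraph [0027] can be expressed as a short Python sketch of packing and unpacking the four 16-bit fields. The function names and the sample field values are hypothetical, and the text does not say how the dest field encodes multiple nodes, so the sketch treats every field as a plain 16-bit integer.

```python
def pack_instruction_word(opcode, length, src, dest):
    """Pack four 16-bit fields into one 64-bit instruction word."""
    for value in (opcode, length, src, dest):
        if not 0 <= value <= 0xFFFF:
            raise ValueError("each field must fit in 16 bits")
    return (opcode << 48) | (length << 32) | (src << 16) | dest


def unpack_instruction_word(word):
    """Split a 64-bit instruction word back into its four fields."""
    return {
        "opcode": (word >> 48) & 0xFFFF,  # bits 63-48: instruction identification code
        "len": (word >> 32) & 0xFFFF,     # bits 47-32: number of data words
        "src": (word >> 16) & 0xFFFF,     # bits 31-16: source node ID
        "dest": word & 0xFFFF,            # bits 15-0: destination node ID
    }


# Made-up field values, chosen only to make the layout visible in hex.
word = pack_instruction_word(opcode=0x1234, length=8, src=2, dest=5)
print(hex(word))                      # 0x1234000800020005
print(unpack_instruction_word(word))
```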

[0028] As shown in FIG. 7, adjacent nodes are connected not only by the ring bus input and output but also by command synchronization input/output control signals 40 to 41. A node asserts its command synchronization output signal at the moment it places a packet's instruction word on the ring bus, thereby informing the next node of the position of the instruction word (that is, the start of the packet). The operation of this embodiment is described next.

[0029] FIG. 2 shows a flowchart of a Monte Carlo simulator for a MOS transistor device, an application program to which this embodiment is applied. Step 80 performs pre-processing (initialization). Step 82 calculates the potential inside the transistor device (solving the Poisson equation). Step 84 calculates the motion of the particles. Step 86 determines whether the calculation has converged by checking whether the particle motion has become steady. If it has not converged, the process returns to step 82 and repeats the above; if it has converged, step 88 performs post-processing (saving the finally obtained data).

[0030] A Monte Carlo simulation must track a large number of particles to obtain high accuracy, and on a single-processor system the main bottleneck is the time taken by the particle motion calculation in step 84.

[0031] Processing the particle motion calculation in parallel on a multiprocessor system greatly shortens the computation time, and processing the potential calculation in parallel as well shortens it further.

[0032] In this embodiment, therefore, the computation is distributed over a plurality of slave nodes and the results are returned to the master node. FIG. 3 shows the flow of this processing: FIG. 3(A) shows the processing routine of the master node, and FIG. 3(B) shows the processing routine of a slave node.

[0033] First, the master node performs pre-processing in step 102 and distributes the particles among the slave nodes in step 104. Each slave node is assigned the same number of particles. Each slave node then sets up the particles assigned to it, as shown in step 124.

[0034] Next, in step 106, the master node instructs the slave nodes to calculate the potential. Each slave node calculates the potential in step 126, that is, it calculates the state of the electric field around each particle.

[0035] Next, in step 108, the master node instructs the slave nodes to calculate the motion of each particle. In step 128, each slave node calculates the motion of each particle it is responsible for.

[0036] The data ultimately sought is the potential distribution across each part of the transistor when a predetermined voltage is applied, and obtaining it requires knowing where each particle handled by each slave node is located.

[0037] In step 110, therefore, the master node collects the calculation results of the slave nodes. That is, it sends the packet data described above around the ring.

[0038] When a slave node receives the packet data from the immediately preceding node, it rewrites the contents of the packet data in step 130. From the particle motion calculation in step 128, the node knows where in the transistor its own particles are located, that is, how many particles exist at each location (the charge amount data for each location). Each data area of the data words 34 in the packet data (see FIG. 6) corresponds to one location in the transistor and records the number of particles (charge amount data) at that location.

[0039] Accordingly, in step 130, for each data area of the data words 34, the node replaces the recorded value with the sum of that value (that is, the total accumulated up to the immediately preceding node) and the particle value it computed itself, and outputs the result to the next node. When the packet data finally returns to the master node, each data area of the data words 34 therefore holds the total of the charge amount data computed by all the slave nodes. The master node thus does not need to write the charge amount data collected from each slave node into its local memory or read the written data back from local memory, and the time for these operations is saved.
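The per-location accumulation of paragraphs [0038] and [0039] amounts to an element-wise addition at each slave node. The Python sketch below illustrates this under simplifying assumptions: one data word per location, integer particle counts, and made-up counts for three slave nodes. The name `update_data_words` is hypothetical.

```python
def update_data_words(incoming, own_counts):
    """Add this node's per-location particle counts to the packet data."""
    if len(incoming) != len(own_counts):
        raise ValueError("one data word per location is expected")
    return [prev + own for prev, own in zip(incoming, own_counts)]


# Made-up counts for three slave nodes over four locations.
slave_counts = [
    [2, 0, 1, 3],
    [1, 1, 0, 2],
    [0, 4, 2, 1],
]
data_words = [0, 0, 0, 0]  # initial data from the master node
for counts in slave_counts:
    data_words = update_data_words(data_words, counts)
print(data_words)  # [3, 5, 3, 6]
```

The list that returns to the master already holds the totals for every location, so no per-slave copies need to be stored there.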

[0040] In step 112, the master node determines whether the calculation has converged, as described above. If it has not, the master node notifies each slave node accordingly (so that the determination in step 132 is negative). If it has converged, the master node performs the post-processing described above in step 114.

[0041] FIG. 8 shows the packet output timing from the master node. When the master node sends a packet, it inserts initial value data as the data words. The initial value is 0 when the receiving nodes perform addition or subtraction and 1 when they perform multiplication. In this way, a slave node that receives the packet directly from the master node and a slave node that receives it indirectly via another slave node can both perform the same processing.
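The choice of initial values in paragraph [0041] can be sketched in Python as follows. The operation names `"add"`, `"sub"` and `"mul"` and the function `initial_data_words` are hypothetical labels for illustration; the patent does not specify how the operation is encoded.

```python
def initial_data_words(operation, length):
    """Return the master node's initial data words for an operation.

    Addition and subtraction start from 0, multiplication from 1, so the
    first slave node can apply the same operation as every later node.
    """
    identity = {"add": 0.0, "sub": 0.0, "mul": 1.0}
    if operation not in identity:
        raise ValueError(f"unsupported operation: {operation}")
    return [identity[operation]] * length


# The first slave combines the initial data with its own (made-up) values.
own = [2.5, 4.0, 8.0]
added = [a + b for a, b in zip(initial_data_words("add", 3), own)]
scaled = [a * b for a, b in zip(initial_data_words("mul", 3), own)]
print(added)   # [2.5, 4.0, 8.0]
print(scaled)  # [2.5, 4.0, 8.0]
```

In both cases the first slave's output equals its own data, which is why it needs no special case.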

[0042] If the operation data length is about equal to the total number of nodes or greater, a packet that the master node is still transmitting may travel once around the ring bus and return to the master node. If the initial data were generated by another unit such as the integer arithmetic unit 19, or if a data sequence stored in local memory were used as the initial data, the internal data bus 26 would be occupied, and storing the results arriving on the ring bus input into local memory would have to wait until the internal data bus 26 is released. Meanwhile, result data arrives from the ring bus input every cycle, so storing all of it in the input buffer inside the ring bus interface unit 18 would require a large-capacity input buffer.

[0043] To avoid this, the initial data is generated solely within the ring bus interface unit 18.

[0044] FIG. 9 shows the packet reception timing at a slave node and the packet output timing after the operation. When the slave node confirms reception of a packet, it begins preparing a burst transfer between the local memory and the internal data bus in order to send one of the operands to the floating-point operation unit 20. During this time, dummy words are being input on the ring bus input. The number of dummy cycles is adjusted in advance so that, several cycles later, the other operand dataA1 arrives at the ring bus input at the same time that the first operand dataB1 is read from the local memory onto the internal data bus.
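
The alignment of the two operand streams can be illustrated with a small Python timing model. This is a sketch under stated assumptions, not the timing of FIG. 9: it assumes the instruction word arrives in cycle 0, that the local memory needs a fixed number of cycles (`memory_latency_cycles`, a hypothetical parameter) after that before dataB1 appears on the internal data bus, and that one word arrives on the ring bus per cycle.

```python
from typing import List


def dummy_words_needed(memory_latency_cycles: int) -> int:
    """Dummy words to place between the instruction word and dataA1.

    Assumed model: instruction word in cycle 0, dummy words in cycles
    1..D, dataA1 in cycle D + 1. dataB1 reaches the internal data bus
    in cycle `memory_latency_cycles`, so D = latency - 1 aligns them.
    """
    if memory_latency_cycles < 1:
        raise ValueError("latency must be at least one cycle")
    return memory_latency_cycles - 1


def packet_timeline(n_dummy: int, n_data: int) -> List[str]:
    """Word seen on the ring bus input in each cycle (index = cycle)."""
    return (["CMD"] + ["DUMMY"] * n_dummy
            + [f"dataA{i}" for i in range(1, n_data + 1)])


if __name__ == "__main__":
    latency = 4
    timeline = packet_timeline(dummy_words_needed(latency), n_data=3)
    print(timeline)                      # dataA1 sits at index == latency
```

The point of the model is only that the padding is fixed in advance from a known latency, so no handshake is needed at run time.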

[0045] The floating-point operation unit 20 receives the two operand streams dataAn and dataBn every cycle from the ring bus operation input bus 24 and the internal data bus 26, respectively, and, after the pipelined operation, outputs the result every cycle to the ring bus operation output bus 25. The ring bus interface unit 18 outputs the instruction word (FRPA) of the packet in advance so that the operation result dataCn can be sent from the ring bus output in the following cycle. As a result, subsequent nodes receive packets of the same format, and no special timing adjustment that depends on the position of a node is needed.

[0046] FIG. 10 shows the reception timing of the packet that finally returns to the master node. The received data is stored in the local memory via the internal data bus 26.
With this embodiment, the calculation of the space charge distribution is improved as follows.
Number of mesh divisions (in the above example, the number of sections of the transistor): 100 × 100 = 10,000
Number of slave nodes: 100
Under these conditions, the invention of JP-A-9-91262 requires the following processing time.
Communication (data transfer) time: (10,000 [pure data communication time] + 100 [packet arrival time + return time] + 10 [overhead]) × 100 [number of slave nodes] = 1,011,000 cycles
Operation time: (1 [various overhead] + 1 [pipelined load, add, store] × 10,000 [number of operations]) × 100 [number of slave nodes] = 1,000,100 cycles
Communication + operation time: 1,011,000 + 1,000,100 = 2,011,100 cycles
By contrast, this embodiment requires the following.
Communication + operation time: 10,000 [data communication and operation time] + 10 [overhead] × 100 [number of slave nodes] + 100 [packet return time] = 11,100 cycles
This makes a speedup of about 200 times possible. Furthermore, increasing the number of slave nodes (to improve processing speed) or increasing the number of mesh divisions (to improve calculation accuracy) makes the speedup even greater.
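
The cycle counts quoted above can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the constants are the ones given in the text. With these figures the ratio works out to roughly 181, which the text rounds to about 200 times.

```python
def prior_art_cycles(mesh_points: int, slaves: int,
                     arrival_and_return: int = 100,
                     comm_overhead: int = 10,
                     calc_overhead: int = 1) -> int:
    """Cycles when the master collects from each slave in turn and sums locally."""
    comm = (mesh_points + arrival_and_return + comm_overhead) * slaves
    calc = (calc_overhead + 1 * mesh_points) * slaves
    return comm + calc


def ring_pipeline_cycles(mesh_points: int, slaves: int,
                         per_slave_overhead: int = 10,
                         packet_return: int = 100) -> int:
    """Cycles when each slave adds its values as the packet passes through."""
    return mesh_points + per_slave_overhead * slaves + packet_return


if __name__ == "__main__":
    mesh = 100 * 100
    old = prior_art_cycles(mesh, slaves=100)
    new = ring_pipeline_cycles(mesh, slaves=100)
    print(old, new, round(old / new, 1))
```

The prior-art cost grows with the product of mesh points and slave count, while the ring-pipelined cost grows with their sum, which is why the advantage widens as either quantity increases.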

[0047] Depending on the application, 32-bit integer addition and subtraction alone may suffice for collecting and aggregating the data after each slave node has finished its computation. In that case, as shown in FIG. 11, in addition to the bus 24.2 that connects the ring bus operation input bus 24 to the floating-point operation unit 20, a bus 24.1 that connects it to the integer operation unit 19 is provided. Further, a multiplexer 42 selects either the bus 25.1, which outputs the operation result of the integer operation unit 19, or the bus 25.2, which outputs the operation result of the floating-point operation unit 20, and connects the selected bus to the ring bus operation output bus 25. Because integer data is 32 bits per word, a 64-bit ring bus and 64-bit internal and external data buses can input and output two words of data per cycle. To exploit this transfer capacity, a further 32-bit adder/subtractor is provided in addition to the 32-bit ALU inside the integer operation unit 19, so that two additions or subtractions can be performed simultaneously.
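
Handling two 32-bit integers in one 64-bit bus word can be modeled in Python as two independent lanes. This is an illustrative sketch only: the helper names are hypothetical, and wrapping each lane modulo 2^32 on overflow is an assumption, since the text does not specify overflow behavior.

```python
MASK32 = 0xFFFFFFFF


def pack2(hi: int, lo: int) -> int:
    """Pack two 32-bit values into one 64-bit bus word."""
    return ((hi & MASK32) << 32) | (lo & MASK32)


def unpack2(word: int) -> tuple:
    """Split a 64-bit bus word into its (high, low) 32-bit lanes."""
    return (word >> 32) & MASK32, word & MASK32


def add2x32(a: int, b: int) -> int:
    """Add two 64-bit words lane by lane, as two separate 32-bit adders would.

    No carry crosses from the low lane into the high lane; each lane
    wraps modulo 2**32 (assumed behavior).
    """
    a_hi, a_lo = unpack2(a)
    b_hi, b_lo = unpack2(b)
    return pack2(a_hi + b_hi, a_lo + b_lo)


if __name__ == "__main__":
    print(unpack2(add2x32(pack2(1, 2), pack2(10, 20))))
```

A plain 64-bit addition would let a carry out of the low word corrupt the high word, which is why the model uses two separate 32-bit additions per cycle.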

[0048] As described above, this embodiment addresses applications that use a multiprocessor system connected by a ring bus and that, after the arithmetic processing distributed across the nodes has finished, must gather the data at one node and apply a further pipelinable operation to it. Instead of performing this final operation centrally at one node, each node performs it in pipeline fashion as it sends out its data. This makes effective use of the arithmetic units inside the many processor elements, reduces the memory accesses that were concentrated on the master node, and improves processing efficiency.

[0049]

[Effect of the invention] As described above, in the present invention, when the operation results of the slave nodes are collected, each slave node instructed to perform an operation performs a predetermined operation between the data output by the immediately preceding slave node and the result it has computed itself, and outputs the data resulting from that operation to the next node. The data that finally returns to the master node is therefore the data ultimately sought, which shortens the processing time.

[Brief description of the drawings]

FIG. 1 is a conceptual diagram of the present invention.

FIG. 2 is the original flowchart (for a single processor) of the Monte Carlo device simulation software used in this embodiment.

FIG. 3 is a flowchart of the parallelized Monte Carlo device simulation software used in this embodiment.

FIG. 4 is a diagram showing an example of a multiprocessor system.

FIG. 5 is a diagram showing the internal configuration of a processor element.

FIG. 6 is a diagram showing the data format of a communication packet.

FIG. 7 is a diagram showing the signal lines used to synchronize transmission and reception of communication packets, and their connections.

FIG. 8 is a diagram showing the timing at which the master node outputs a packet.

FIG. 9 is a diagram showing the packet reception timing at a slave node and the packet output timing after the operation.

FIG. 10 is a diagram showing the reception timing of the packet that finally returns to the master node.

FIG. 11 is a diagram showing the internal configuration of a processor element having means for connecting a plurality of arithmetic units to the ring bus interface unit.

[Explanation of symbols]

1 Ring bus
2 Master node
3 to 6 Slave nodes
10 Host computer
11 Parallel simulation engine
12 Master node
13.1 to 13.n-1 Slave nodes
14 Processor element
15 Local memory
16 Host interface circuit

Continued from the front page
(72) Inventor: Nobuaki Miyagawa, c/o Fuji Xerox Co., Ltd., 2274 Hongo, Ebina-shi, Kanagawa
(72) Inventor: Mitsumasa Koyanagi, 1-22-5 Yurigaoka, Natori-shi, Miyagi
F-term (reference): 5B045 AA07 BB13 BB47 GG17; 5K031 AA02 BA02 CA05 DA02 DB02 DB10

Claims

[Claims]

1. An inter-node data communication method for transferring data in a ring between a master node and a plurality of slave nodes among a plurality of nodes connected in a ring via a bus, wherein the master node instructs the slave nodes to perform an operation, and, when the operation results of the slave nodes are collected, each slave node instructed to perform the operation performs a predetermined operation between the data output by the immediately preceding slave node and the result it has computed itself, and outputs the data resulting from that operation to the next node.

2. The inter-node data communication method according to claim 1, wherein the master node selectively instructs a plurality of the slave nodes to perform the operation, and a slave node that has not been instructed to perform the operation outputs the data output by the immediately preceding slave node to the next node without changing its value.