JP5358287B2

JP5358287B2 - Parallel computing device

Info

Publication number: JP5358287B2
Application number: JP2009121389A
Authority: JP
Inventors: 新次郎豊田; 宣明宮川
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2009-05-19
Filing date: 2009-05-19
Publication date: 2013-12-04
Anticipated expiration: 2029-05-19
Also published as: JP2010271799A

Abstract

PROBLEM TO BE SOLVED: To provide a parallel computing device that makes it easy to execute a structured program having multiple levels of nesting. SOLUTION: In the parallel computing device, each of a plurality of arithmetic processors performs parallel arithmetic processing using a plurality of sub-processors. The sub-processor (SPE) 102A includes: an ALU (Arithmetic and Logic Unit) 95A that performs arithmetic processing on input data based on control instructions; a G flag stack 11 that sequentially accumulates flag information based on the results of the arithmetic processing; and an SPE control unit 199A that causes the ALU 95A to perform the arithmetic processing based on combined flag information, which a combining unit 19 produces from the accumulated flag information. The sub-processor 102B includes an ALU 95B that performs arithmetic processing on input data based on the control instructions, and an SPE control unit 199B that causes the ALU 95B to perform the arithmetic processing based on the combined flag information produced by the combining unit 19. COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a parallel computing device.

In recent years, advances in semiconductor technology have dramatically improved the performance of general-purpose processors such as CPUs (Central Processing Units), for example to several Gflops per processor. Even so, fields such as scientific computing and image processing demand far greater performance: for example, several Tflops (tera floating-point operations per second) or several hundred GOPS (giga operations per second) or more. To meet these demands, parallel computing devices (parallel processors) that integrate multiple processors into a single LSI (Large Scale Integration) chip are being researched and developed. Some of these devices use a general-purpose CPU as the core and integrate several such cores in one LSI.

Examples of processors that address these performance requirements include the Cell Broadband Engine (hereinafter "Cell"; for image processing and scientific computing), jointly developed by Sony Computer Entertainment Inc., Toshiba Corporation, and IBM; IMAP (for image processing), developed by NEC Corporation; and Line Dancer (for image processing), developed by CONNEX (see Non-Patent Documents 1 to 3).

Scientific computing and image processing characteristically apply nearly identical processing to enormous amounts of data. Exploiting this characteristic, the processors above adopt a SIMD (Single Instruction, Multiple Data) architecture: a control scheme in which many processors are each given different data but all execute the same instruction.

The instruction is kept identical because the alternative, the MIMD (Multiple Instruction, Multiple Data) scheme, reads different instructions from instruction memory simultaneously and supplies them to the individual processors. That requires multiple instruction memories and decode circuits, which raises hardware cost, greatly complicates software development, and makes debugging of both software and hardware very difficult.

Next, structured programming is described. FIG. 40 is part of a flowchart of a program containing a branch. The program compares the contents of variables abc and def; if abc is larger, it adds the value of abc to variable x1, and otherwise it adds the value of def to variable x2. FIG. 41 expresses the flowchart of FIG. 40 in C. This style of description is called structured programming. FIG. 42 shows the code of FIG. 41 converted into assembly language, which is close to the computer's machine language. Here it is assumed that the C compiler has assigned variable abc to register R2, def to R3, x1 to R4, and x2 to R5. The point to note in FIG. 42 is that the conditional jump instruction BGT (jump if the comparison result is greater) is used to implement the flow of FIG. 40. The BR instruction, incidentally, always jumps.
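The lowering from a structured if/else to conditional jumps can be sketched with a toy interpreter. This is only an illustrative Python model, not the listing of FIG. 42: the tuple encoding, the label names `then` and `end`, the `NOP` placeholder, and the `run` function are assumptions made for the example. Only the register assignment (abc in R2, def in R3, x1 in R4, x2 in R5) and the roles of CMP, BGT, ADD, and BR come from the text.

```python
def run(program, regs):
    """Execute a list of (label, op, args) tuples on a register dict."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    pc = 0
    greater = False  # condition flag set by CMP
    while pc < len(program):
        _, op, args = program[pc]
        pc += 1
        if op == "CMP":
            greater = regs[args[0]] > regs[args[1]]
        elif op == "ADD":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "BGT":  # jump only if the last comparison was "greater"
            if greater:
                pc = labels[args[0]]
        elif op == "BR":   # unconditional jump
            pc = labels[args[0]]
        # any other op (here: NOP) does nothing
    return regs


# Hypothetical encoding of: if (abc > def) x1 += abc; else x2 += def;
PROGRAM = [
    (None, "CMP", ("R2", "R3")),
    (None, "BGT", ("then",)),
    (None, "ADD", ("R5", "R5", "R3")),    # else part: x2 += def
    (None, "BR", ("end",)),
    ("then", "ADD", ("R4", "R4", "R2")),  # then part: x1 += abc
    ("end", "NOP", ()),
]
```

Calling `run(PROGRAM, {"R2": 5, "R3": 3, "R4": 10, "R5": 20})` follows the then path, while swapping in `"R2": 1` follows the else path. The program counter takes exactly one of the two paths, which is the property that becomes a problem once several processors share one instruction stream.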

In a SIMD architecture, however, a program branch cannot be implemented with a conditional jump instruction when computing in parallel. Consider, for example, a SIMD computer consisting of eight processors. Each of the eight processors has its own registers R2 and R3, so the data stored in them differs from processor to processor. The result of comparing R2 with R3 therefore also varies by processor: some processors should jump and others should not, yet a SIMD machine cannot execute different instructions on different processors. As it stands, the flow of FIG. 40 cannot be realized. Note that this problem arises only when the jump condition differs between processors at run time. Control in which the jump condition is always the same on every processor, such as a loop with a predetermined iteration count, can be implemented on a SIMD architecture.
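The divergence problem can be made concrete with a small Python model of eight lanes. The function names and the sample register values below are assumptions chosen for illustration; the only point taken from the text is that each processor holds its own R2 and R3, so the comparison outcome can differ per processor.

```python
def lane_jump_decisions(r2_lanes, r3_lanes):
    """Per-lane outcome of CMP R2, R3 followed by BGT (True = lane wants to jump)."""
    return [r2 > r3 for r2, r3 in zip(r2_lanes, r3_lanes)]


def single_jump_is_valid(decisions):
    """A shared instruction stream can follow one jump only if all lanes agree."""
    return all(decisions) or not any(decisions)


# Hypothetical per-lane register contents for an eight-processor SIMD machine.
r2 = [5, 1, 7, 3, 9, 2, 8, 4]
r3 = [3, 3, 3, 3, 3, 3, 3, 3]
data_dependent = lane_jump_decisions(r2, r3)

# A loop bound known in advance gives every lane the same operands.
loop_counter = lane_jump_decisions([10] * 8, [4] * 8)
```

With these values `single_jump_is_valid(data_dependent)` is false because the lanes disagree, while `single_jump_is_valid(loop_counter)` is true, matching the distinction the paragraph draws between data-dependent branches and fixed-count loops.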

One way to avoid this drawback of SIMD architectures is an architecture that makes ordinary instructions conditional. Although it is not a SIMD design, the ARM processor illustrates the idea; Chapter A3 of ARM's "ARM Architecture Reference Manual" (ARM v6.pdf) describes it in detail. Almost all ARM instructions can be executed conditionally, so the code of FIG. 41 can be written, for example, as in FIG. 43. In FIG. 43, "AL" means the instruction is always executed, "HI" means it is executed if the comparison result was greater, and "LS" means it is executed if the result was less than or equal. Note that the instruction "ADD HI, R4, R4, R2" has no "S" suffix, so it does not change the condition set by the CMP instruction (see page A3-7 of the manual).

It is worth confirming what "not executing an instruction" means here. In a typical processor, an instruction is executed in the sequence instruction fetch (IF), instruction decode (DEC), operand fetch (OF), execution (EXE), and write-back of the result (WB). In today's fast processors this sequence is divided into five stages and pipelined, as in the timing chart of FIG. 44. The comparison result of a CMP instruction is not known until the end of that instruction's EXE stage, or its WB stage. By then the following ADD instruction is already in the pipeline, so it is too late to replace it with a NOP (no-operation) instruction based on the CMP result.

However, if the result of the following ADD instruction is not written to its destination, the effect is equivalent to executing nothing. (The exception is a case where the processor's internal state changes, for example during operand fetch; if that happens, it must be corrected afterwards.) In other words, if the result of the CMP instruction controls the WB stage of the following ADD instruction and suppresses the write to R4, the ADD instruction becomes equivalent to a NOP. SIMD processors such as CELL, IMAP, and Line Dancer implement conditional instructions based on this idea.
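The write-back gating idea can be sketched as follows. This is a simplified Python model under stated assumptions: the function names `predicated_add` and `simd_if_else` and the list-per-register representation are invented for the example, and `def_` stands in for the variable def because `def` is a Python keyword. Every lane computes the sum; only the write to the destination is suppressed where the flag is false.

```python
def predicated_add(dst, a, b, mask):
    """All lanes execute the ADD; write-back happens only where mask is True."""
    results = [x + y for x, y in zip(a, b)]  # EXE stage: computed unconditionally
    return [r if m else d for d, r, m in zip(dst, results, mask)]  # gated WB


def simd_if_else(abc, def_, x1, x2):
    """Per lane: if abc > def: x1 += abc, else: x2 += def, with no jumps."""
    hi = [a > d for a, d in zip(abc, def_)]  # flag set by the comparison
    ls = [not flag for flag in hi]           # complementary condition
    x1 = predicated_add(x1, x1, abc, hi)
    x2 = predicated_add(x2, x2, def_, ls)
    return x1, x2
```

For two lanes with `abc = [5, 1]`, `def_ = [3, 3]`, `x1 = [10, 10]`, and `x2 = [20, 20]`, the first lane updates only x1 and the second only x2, even though both lanes ran the same two ADD instructions.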

"CELL Programming Tutorial", "2.5 Conditional Branches in SIMD Operations", retrieved April 13, 2009, Internet <URL: http://www.fixstars.com>
Kyo, S.; Okazaki, S.; Arai, T., "An Integrated Memory Array Processor Architecture for Embedded Image Recognition System", Proceedings of the 32nd International Symposium on Computer Architecture (ISCA '05), 4-8 June 2005, pp. 134-145, §5.3, IEEE, 2005
Lazar Bivolarski, Bogdan Mitu, Anand Sheel, Gheorghe Stefan, Tom Thomson, Dan Tomescu, "The CA1024: A fully programmable system-on-chip for cost-effective HDTV media processing", CA1024 document, p. 9, retrieved April 13, 2009, <URL: http://www.hotchips.org/archives/hc18/2_Mon/HC18.S5/HC18.S5T2.pdf>

Techniques based on conventional SIMD architectures can handle a flow with a single level of branching, but have difficulty with nested structured programs that have two or more levels. FIG. 45 shows an example of a doubly nested program. In this example the condition flags are overwritten by the comparison instruction at (2), so the comparison result (condition flags) of (1) must be saved somewhere before the instruction at (2) executes and restored at the else statement at (3). On an ARM processor the condition flags can be saved by writing them to a register. FIG. 46 shows an example program.

The MRS instruction writes the condition flags to register R9, and the MSR instruction restores them from the register. CELL, IMAP, and Line Dancer, however, cannot save the condition code. The code of FIG. 45 must therefore be rewritten as non-nested code, as in FIG. 47 (note that, depending on the value of variable x1, the program of FIG. 47 does not behave the same as that of FIG. 45). In ordinary programs it is not unusual for nesting to reach three or four levels, in which case the rewriting becomes complicated and the program becomes harder to write. Conventional SIMD architectures therefore have difficulty supporting nested structured programming.
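Why the outer comparison result has to be saved can be shown with a deliberately small Python model. It does not reproduce FIG. 45 or FIG. 46, which are not included here; the function names, the operands a to d, and the single boolean `flag` are assumptions. The `saved` variable plays the role that register R9 plays for the MRS and MSR instructions.

```python
def outer_else_taken_without_save(a, b, c, d):
    """Single condition flag, no save: the inner compare clobbers the outer result."""
    flag = a > b      # (1) outer comparison
    flag = c > d      # (2) inner comparison overwrites the flag
    return not flag   # (3) outer else now tests the inner result


def outer_else_taken_with_save(a, b, c, d):
    """Same flow, with the outer flag saved before (2) and restored at (3)."""
    flag = a > b      # (1) outer comparison
    saved = flag      # save, modelling MRS into a register
    flag = c > d      # (2) inner comparison overwrites the flag
    flag = saved      # restore, modelling MSR from the register
    return not flag   # (3) outer else tests the outer result again
```

With `a > b` true and `c > d` false, the version without a save wrongly enters the outer else branch, while the version with save and restore does not.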

That is, conventional techniques have the problem that a parallel computing device cannot easily execute a structured program having multiple levels of nesting.

The present invention has been made in view of these circumstances, and its object is to provide a parallel computing device that can easily execute a structured program having multiple levels of nesting.

To solve the above problem, the invention described in claim 1 is a parallel computing device (for example, parallel computing device 1 in the embodiment) comprising: a plurality of arithmetic processors (for example, arithmetic processor PE102 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a specific sub-processor (for example, sub-processor SPE102A in the embodiment) and a sub-processor (for example, SPE102B to SPE102D in the embodiment). The specific sub-processor comprises: a first arithmetic unit (for example, ALU 95A in the embodiment) that performs first arithmetic processing on input data based on the control instruction; a first control information holding unit (for example, G flag stack 11 in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of the first arithmetic processing; a first combining unit (for example, combining unit 19 in the embodiment) that combines the flag information accumulated in the first control information holding unit; and a first control unit (for example, SPE control unit 199A in the embodiment) that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without accumulating that combined flag information in a stack structure. The sub-processor comprises: a second arithmetic unit (for example, ALUs 95B to 95D in the embodiment) that performs second arithmetic processing on input data based on the control instruction; and a second control unit (for example, SPE control units 199B to 199D in the embodiment) that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information produced by the first combining unit, without accumulating that combined flag information in a stack structure.
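The interplay of a flag stack and a combining unit can be sketched in Python as below. This is a hypothetical model, not the circuit of the embodiment: the class name `GFlagStack`, its methods, the `invert_top` operation used to enter an else branch, and above all the combining rule (logical AND of every stacked flag, with an empty stack meaning "execute") are assumptions, since this section does not state how the flags are combined. Each `if stack.combined()` stands for gating an instruction on the combined flag.

```python
class GFlagStack:
    """Hypothetical model of a stack of condition flags plus a combining unit."""

    def __init__(self):
        self._flags = []

    def push(self, flag):
        self._flags.append(bool(flag))

    def pop(self):
        return self._flags.pop()

    def invert_top(self):
        # Assumed mechanism for switching from a then branch to its else branch.
        self._flags[-1] = not self._flags[-1]

    def combined(self):
        # Assumed combining rule: AND of all flags; an empty stack enables execution.
        return all(self._flags)


def run_nested(a, b, c, d):
    """Model of: if a > b: (if c > d: inner-then else: inner-else) else: outer-else."""
    stack = GFlagStack()
    executed = []
    stack.push(a > b)        # outer if
    stack.push(c > d)        # inner if
    if stack.combined():
        executed.append("inner-then")
    stack.invert_top()       # inner else
    if stack.combined():
        executed.append("inner-else")
    stack.pop()              # end of inner if
    stack.invert_top()       # outer else
    if stack.combined():
        executed.append("outer-else")
    stack.pop()              # end of outer if
    return executed
```

Because the combined value is recomputed from the stack each time and never pushed back onto it, one such value could be handed to every sub-processor of an arithmetic processor, which is the sharing arrangement the paragraph above describes; under the assumed AND rule, deeper nesting only adds more pushes and pops.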

The invention described in claim 2 is a parallel computing device comprising: a plurality of arithmetic processors (for example, arithmetic processor PE202 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a specific sub-processor (for example, sub-processor SPE202A in the embodiment) and a sub-processor (for example, SPE202B in the embodiment). The specific sub-processor comprises: a first arithmetic unit (for example, ALU 95A in the embodiment) that performs first arithmetic processing on input data based on the control instruction; a first control information holding unit (for example, G flag stack 11 in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of the first arithmetic processing; a first combining unit (for example, combining unit 19 in the embodiment) that combines the flag information accumulated in the first control information holding unit; and a first control unit (for example, SPE control unit 299A in the embodiment) that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without accumulating that combined flag information in a stack structure. The sub-processor comprises: a second arithmetic unit (for example, ALU 95B in the embodiment) that performs second arithmetic processing on input data based on the control instruction; a selection unit (for example, execution selection unit 24B in the embodiment) that selects whether or not the combined flag information produced by the first combining unit is used to control the second arithmetic processing of the second arithmetic unit; and a second control unit (for example, SPE control unit 299B in the embodiment) that, when the selection unit has selected to use the combined flag information for that control, causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information, without accumulating that combined flag information in a stack structure.

The invention described in claim 3 is a parallel computing device comprising: a plurality of arithmetic processors (for example, arithmetic processor PE302 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a specific sub-processor (for example, sub-processor SPE302A in the embodiment) and a sub-processor (for example, SPE302B in the embodiment). The specific sub-processor comprises: a first arithmetic unit (for example, ALU 95A in the embodiment) that performs first arithmetic processing on input data based on the control instruction; a first control information holding unit (for example, G flag stack 11A in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of the first arithmetic processing; a first combining unit (for example, combining unit 19A of G flag stack 10A in the embodiment) that combines the flag information accumulated in the first control information holding unit; and a first control unit (for example, SPE control unit 399A in the embodiment) that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without accumulating that combined flag information in a stack structure. The sub-processor comprises: a second arithmetic unit (for example, ALU 95B in the embodiment) that performs second arithmetic processing on input data based on the control instruction; a second control information holding unit (for example, G flag stack 11B in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of arithmetic processing; a second combining unit (for example, combining unit 19B in the embodiment) that combines the flag information accumulated in the second control information holding unit; a selection unit (for example, execution selection unit 34B in the embodiment) that selects, without accumulating it in a stack structure, either the combined flag information produced by the first combining unit or the combined flag information produced by the second combining unit of its own sub-processor; and a second control unit (for example, SPE control unit 399B in the embodiment) that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information selected by the selection unit, without accumulating that combined flag information in a stack structure.

The invention described in claim 4 is a parallel computing device comprising: a plurality of arithmetic processors (for example, arithmetic processor PE402 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a plurality of sub-processors (for example, sub-processors SPE402A to SPE402D in the embodiment). Each of the sub-processors (taking sub-processor SPE402A in the embodiment as the example) comprises: an arithmetic unit (for example, ALU 95A in the embodiment) that performs arithmetic processing on input data based on the control instruction; a control information holding unit (for example, G flag stack 11A in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of the arithmetic processing; a combining unit (for example, combining unit 19A in the embodiment) that combines the flag information accumulated in the control information holding unit of its own sub-processor; a selection unit (for example, execution selection unit 44A in the embodiment) that selects one of the items of combined flag information produced by the combining units of the sub-processors, without accumulating that combined flag information in a stack structure; and a control unit (for example, SPE control unit 499A in the embodiment) that causes the arithmetic unit to perform arithmetic processing based on the selected combined flag information, without accumulating it in a stack structure. The arithmetic units of the plurality of sub-processors are able to perform mutually different arithmetic processing.

The invention described in claim 5 is a parallel computing device comprising: a plurality of arithmetic processors (for example, arithmetic processor PE502 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a specific sub-processor (for example, sub-processor SPE502A in the embodiment) and a sub-processor (for example, sub-processor SPE502B in the embodiment). The specific sub-processor comprises: a first arithmetic unit (for example, ALU 95A in the embodiment) that performs first arithmetic processing on input data based on the control instruction; a first control information holding unit (for example, G flag stack 11A in the embodiment) that has a stack structure and sequentially accumulates flag information based on the results of arithmetic processing; a first combining unit (for example, combining unit 19 in the embodiment) that combines the flag information accumulated in the first control information holding unit; and a first control unit (for example, SPE control unit 599A in the embodiment) that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without accumulating that combined flag information in a stack structure. The sub-processor comprises: a second arithmetic unit (for example, ALU 95B in the embodiment) that performs second arithmetic processing on input data based on the control instruction; a selection unit (for example, execution selection unit 55B in the embodiment) that selects either the flag information based on the result of the first arithmetic processing by the first arithmetic unit or the flag information based on the result of the second arithmetic processing by the second arithmetic unit of its own sub-processor; a second control information holding unit (for example, G flag stack 51B in the embodiment) that has a stack structure and sequentially accumulates the flag information selected by the selection unit; a second combining unit (for example, combining unit 59B in the embodiment) that combines the flag information accumulated in the second control information holding unit; and a second control unit (for example, SPE control unit 599B in the embodiment) that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information produced by the second combining unit, without accumulating that combined flag information in a stack structure.

The invention described in claim 6 is a parallel computing device comprising: a plurality of arithmetic processors (for example, arithmetic processor PE602 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, control signal generation unit (PE-I) 3 in the embodiment) that supplies a control instruction to each of the arithmetic processors. Each of the arithmetic processors comprises a plurality of sub-processors (for example, SPE602A to SPE602D in the embodiment). Each of the sub-processors (taking sub-processor SPE602A in the embodiment as the example) comprises: an arithmetic unit (for example, ALU 95A in the embodiment) that performs arithmetic processing on input data based on the control instruction; a selection unit (for example, execution selection unit 645A in the embodiment) that selects one of the items of flag information based on the results of arithmetic processing by the arithmetic units of the sub-processors; a control information holding unit (for example, G flag stack 51A in the embodiment) that has a stack structure and sequentially accumulates the flag information selected by the selection unit; a combining unit (for example, combining unit 59A in the embodiment) that combines the flag information accumulated in the control information holding unit; and a control unit (for example, SPE control unit 699A in the embodiment) that causes the arithmetic unit to perform arithmetic processing based on the combined flag information produced by the combining unit, without accumulating that combined flag information in a stack structure. The arithmetic units of the plurality of sub-processors are able to perform mutually different arithmetic processing.

請求項７に記載した発明は、並列して演算処理を行う複数の演算プロセッサー（例えば、実施の形態における演算プロセッサーＰＥ７０２）と、前記複数の演算プロセッサーのそれぞれに制御命令を供給する制御信号生成部（例えば、実施の形態における制御信号生成部（ＰＥ-Ｉ）３）と、を備え、前記複数の演算プロセッサーのそれぞれが、入力されたデータを前記制御命令に基づいて第１の演算処理する第１演算部（例えば、実施の形態におけるＡＬＵ９５Ａ）と、スタック構造であり、第１の演算処理された結果に基づいたフラグ情報が順次蓄積される第１制御情報保持部（例えば、実施の形態におけるＧフラグスタック１１Ａ）と、前記第１制御情報保持部に蓄積されたフラグ情報を合成する第１合成部（例えば、実施の形態における合成部１９）と、前記第１合成部が合成した合成フラグ情報をスタック構造に蓄積させることなく、前記合成フラグ情報に基づいて前記第１演算部に第１の演算処理させる第１制御部（例えば、実施の形態におけるＳＰＥ制御部７９９Ａ）と、を備える特定サブプロセッサー（例えば、実施の形態におけるサブプロセッサーＳＰＥ７０２Ａ）と、入力されたデータを前記制御命令に基づいて第２の演算処理する第２演算部（例えば、実施の形態におけるＡＬＵ９５Ｂ）と、前記第１合成部が合成した合成フラグ情報をスタック構造に蓄積させることなく、前記合成フラグ情報をそのまま出力するか反転して出力するかを選択して、選択したものをフラグ情報出力として出力する反転処理部（例えば、実施の形態における反転処理部７８Ｂ）と、前記反転処理部が出力するフラグ情報出力をスタック構造に蓄積させることなく、前記フラグ情報出力か、常に命令実行を可能にする制御情報のいずれかを選択する選択部（例えば、実施の形態における実行選択部７４Ｂ）と、前記選択部が選択した情報に基づいて前記第２演算部に第２の演算処理させる第２制御部（例えば、実施の形態におけるＳＰＥ制御部７９９Ｂ）と、を備えるサブプロセッサー（例えば、実施の形態におけるサブプロセッサーＳＰＥ７０２Ｂ）と、を備えることを特徴とする並列計算装置である。 The invention described in claim 7 is a parallel computing device comprising: a plurality of arithmetic processors (for example, the arithmetic processor PE 702 in the embodiment) that perform arithmetic processing in parallel; and a control signal generation unit (for example, the control signal generation unit (PE-I) 3 in the embodiment) that supplies a control command to each of the plurality of arithmetic processors, wherein each of the plurality of arithmetic processors includes a specific sub-processor (for example, the sub processor SPE 702A in the embodiment) and a sub-processor (for example, the sub processor SPE 702B in the embodiment). The specific sub-processor includes: a first arithmetic unit (for example, ALU 95A in the embodiment) that performs first arithmetic processing on input data based on the control command; a first control information holding unit (for example, the G flag stack 11A in the embodiment) that has a stack structure and in which flag information based on the result of the first arithmetic processing is sequentially accumulated; a first combining unit (for example, the combining unit 19 in the embodiment) that combines the flag information accumulated in the first control information holding unit; and a first control unit (for example, the SPE control unit 799A in the embodiment) that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information, without accumulating the combined flag information combined by the first combining unit in a stack structure. The sub-processor includes: a second arithmetic unit (for example, ALU 95B in the embodiment) that performs second arithmetic processing on input data based on the control command; an inversion processing unit (for example, the inversion processing unit 78B in the embodiment) that, without accumulating the combined flag information combined by the first combining unit in a stack structure, selects whether to output the combined flag information as it is or inverted, and outputs the selected one as a flag information output; a selection unit (for example, the execution selection unit 74B in the embodiment) that, without accumulating the flag information output from the inversion processing unit in a stack structure, selects either the flag information output or control information that always enables instruction execution; and a second control unit (for example, the SPE control unit 799B in the embodiment) that causes the second arithmetic unit to perform the second arithmetic processing based on the information selected by the selection unit.

請求項１から請求項７に記載した発明によれば、本発明の技術を使うことで、SIMD型にVLIW型を組み合わせた並列計算装置において、多重にネストした構造化プログラムをサポートするハードウェアを容易に実現できる。したがって、多数の演算素子（プロセッサー）を効率的に並列動作させられるので、科学技術計算や画像処理に必要とされる数Tflops又は数100GOPSの演算能力を持つ並列計算装置を容易に実現できる。 According to the inventions described in claims 1 to 7, by using the technology of the present invention, hardware that supports multiply nested structured programs can be easily realized in a parallel computing device that combines the SIMD type with the VLIW type. Because a large number of arithmetic elements (processors) can therefore be operated in parallel efficiently, a parallel computing device having the computing capacity of several Tflops or several hundred GOPS required for scientific computing and image processing can be easily realized.

本発明の第１実施形態を示す概略ブロック図である。 A schematic block diagram showing a first embodiment of the present invention.
本発明の実施形態による演算プロセッサー２における各サブプロセッサーが参照可能な記憶部を示す図である。 A diagram showing the storage units that each sub-processor in the arithmetic processor 2 can reference, according to an embodiment of the present invention.
本発明の実施形態による演算プロセッサー２の構成を示すブロック図である。 A block diagram showing the configuration of the arithmetic processor 2 according to an embodiment of the present invention.
本発明の実施形態による構造化プログラミング用に導入する６個の命令を示す。 Shows the six instructions introduced for structured programming according to an embodiment of the present invention.
本発明の実施形態によるフラグ処理部の構成例を示すブロック図である。 A block diagram showing a configuration example of the flag processing unit according to an embodiment of the present invention.
本発明の実施形態によるアキュムレータへの書き込み制御回路の構成例を示すブロック図である。 A block diagram showing a configuration example of the write control circuit for the accumulator according to an embodiment of the present invention.
本発明の実施形態によるプログラム例を示す。 Shows an example program according to an embodiment of the present invention.
第1実施形態のプログラム例の変数とレジスターとの対応を示す。 Shows the correspondence between variables and registers in the example program of the first embodiment.
第1実施形態のプログラム例の命令の動作を示す。 Shows the operation of the instructions in the example program of the first embodiment.
第1実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the first embodiment.
第1実施形態の演算プロセッサーにおける演算制御処理を行う構成を示すブロック図である。 A block diagram showing the configuration that performs arithmetic control processing in the arithmetic processor of the first embodiment.
第1実施形態のＧフラグ処理部とＳＰＥ制御部を示すブロック図である。 A block diagram showing the G flag processing unit and the SPE control unit of the first embodiment.
第１実施形態による高速化処理が行えるプログラムを示す。 Shows a program for which the speed-up processing of the first embodiment can be performed.
第１実施形態による並列演算処理のプログラムを示す。 Shows a program for parallel arithmetic processing according to the first embodiment.
第２実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the second embodiment.
第２実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the second embodiment.
第２実施形態のＧフラグ処理部とＳＰＥ制御部を示すブロック図である。 A block diagram showing the G flag processing unit and the SPE control unit of the second embodiment.
第２実施形態の並列計算装置１において、図１３のプログラムを実行するために４並列処理を行うVLIW型用に変換した例を示す。 Shows an example in which the program of FIG. 13 is converted for a VLIW type that performs four-way parallel processing, for execution on the parallel computing device 1 of the second embodiment.
第３実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the third embodiment.
第３実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the third embodiment.
第３実施形態のＧフラグ処理部とＳＰＥ制御部を示すブロック図である。 A block diagram showing the G flag processing unit and the SPE control unit of the third embodiment.
第３実施形態の並列計算装置１において、図１３のプログラムを実行するために４並列処理を行うVLIW型用に変換した例を示す。 Shows an example in which the program of FIG. 13 is converted for a VLIW type that performs four-way parallel processing, for execution on the parallel computing device 1 of the third embodiment.
第４実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the fourth embodiment.
第４実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the fourth embodiment.
第４実施形態のＧフラグ処理部とＳＰＥ制御部を示すブロック図である。 A block diagram showing the G flag processing unit and the SPE control unit of the fourth embodiment.
第４実施形態のセレクターの制御を示す図である。 A diagram showing the control of the selector of the fourth embodiment.
第４実施形態の並列計算装置１において、図１３のプログラムを実行するために４並列のVLIW型用に変換した例を示す。 Shows an example in which the program of FIG. 13 is converted for a four-way parallel VLIW type, for execution on the parallel computing device 1 of the fourth embodiment.
第５実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the fifth embodiment.
第５実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the fifth embodiment.
第５実施形態のＳＰＥの同期化回路を示すブロック図である。 A block diagram showing the SPE synchronization circuit of the fifth embodiment.
第５実施形態の図１３のプログラムを並列化した例を示す。 Shows an example in which the program of FIG. 13 is parallelized in the fifth embodiment.
第６実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the sixth embodiment.
第６実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the sixth embodiment.
第６実施形態のＳＰＥの同期化回路を示すブロック図である。 A block diagram showing the SPE synchronization circuit of the sixth embodiment.
第６実施形態の並列計算装置１において、図１３のプログラムを実行するために４並列のVLIW型用に変換した例を示す。 Shows an example in which the program of FIG. 13 is converted for a four-way parallel VLIW type, for execution on the parallel computing device 1 of the sixth embodiment.
第７実施形態の並列計算装置１の演算処理部の概略構成を示すブロック図である。 A block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1 of the seventh embodiment.
第７実施形態の構成において追加する命令を示す。 Shows the instructions added in the configuration of the seventh embodiment.
第７実施形態のＳＰＥの同期化回路を示すブロック図である。 A block diagram showing the SPE synchronization circuit of the seventh embodiment.
第７実施形態の並列計算装置１において、図１３のプログラムを実行するために４並列のVLIW型用に変換した例を示す。 Shows an example in which the program of FIG. 13 is converted for a four-way parallel VLIW type, for execution on the parallel computing device 1 of the seventh embodiment.
分岐の有るプログラムのフローチャートの一部である。 Part of a flowchart of a program that contains branches.
図４０のフローチャートをＣ言語で記述したものある。 The flowchart of FIG. 40 written in the C language.
図４１のコードを計算機の機械語に近いアセンブラ言語へ変換したものである。 The code of FIG. 41 converted into an assembler language close to the machine language of a computer.
図４１のコードを計算機の機械語に近いアセンブラ言語へ変換したものである。 The code of FIG. 41 converted into an assembler language close to the machine language of a computer.
従来技術によるプログラム例によるタイミングチャートを示す。 Shows a timing chart for an example program according to the prior art.
従来技術によるプログラム例を示す。 Shows an example program according to the prior art.
従来技術によるプログラム例を示す。 Shows an example program according to the prior art.
従来技術によるプログラム例を示す。 Shows an example program according to the prior art.

（第１実施形態）
図を参照し、並列計算装置の一実施形態について示す。
図１は、本発明の第１実施形態を示す概略ブロック図である。
この図に示される並列計算装置１は、演算処理部１００に含まれる複数のプロセッサーによって並列処理を行う。各実施形態の詳細な説明に先立ち、並列計算装置１の構成概要について説明する。
並列計算装置１は、演算処理部１００、ＩＯ−ＣＰＵ４、命令メモリ５、外部メモリ９を備える。
演算処理部１００は、１０８個の演算プロセッサー（ＰＥ）２−０〜２−１０７（まとめて「演算プロセッサー（ＰＥ）２」という。）、及びＰＥ２のそれぞれに制御命令を供給する制御信号生成部（ＰＥ−Ｉ）３が実装されている。 (First embodiment)
An embodiment of a parallel computing device will be described with reference to the drawings.
FIG. 1 is a schematic block diagram showing a first embodiment of the present invention.
The parallel computing device 1 shown in this figure performs parallel processing by a plurality of processors included in the arithmetic processing unit 100. Prior to detailed description of each embodiment, a configuration outline of the parallel computing device 1 will be described.
The parallel computing device 1 includes an arithmetic processing unit 100, an IO-CPU 4, an instruction memory 5, and an external memory 9.
The arithmetic processing unit 100 includes 108 arithmetic processors (PE) 2-0 to 2-107 (collectively referred to as “arithmetic processor (PE) 2”) and a control signal generation unit (PE-I) 3 that supplies a control command to each of the PEs 2.

演算プロセッサー２は、それぞれが４個のサブプロセッサー（ＳＰＥ）２Ａ〜２Ｄを有する。
ＳＰＥ２Ａ〜２Ｄは、それぞれが異なる命令を実行するVLIW（Very Long Instruction Word）型の構成を有している。それぞれのＰＥ２は、ＳＰＥが組み合わされた同じ構成である。また、全てのＰＥ２が有する１０８個のＳＰＥ２Ａは、SIMD(Single Instruction Multi Data)型で構成され、全てのＳＰＥ２Ａで同一の命令を実行する。また、ＳＰＥ２Ｂ、ＳＰＥ２Ｃ、ＳＰＥ２Ｄについても同様である。
それらのＳＰＥ２Ａ〜２Ｄは、構成の異なる２種類のＳＰＥの組み合わせで構成される。演算プロセッサー２の基本制御機能を有するＳＰＥ２Ａと、ＳＰＥ２Ａの制御を受けるＳＰＥ２Ｂ〜ＳＰＥ２Ｄの組み合わせを例にして説明する。 Each of the arithmetic processors 2 includes four sub processors (SPE) 2A to 2D.
The SPEs 2A to 2D form a VLIW (Very Long Instruction Word) type configuration in which each SPE executes a different instruction. Every PE2 has the same configuration of combined SPEs. The 108 SPEs 2A contained in all the PEs 2 are configured as a SIMD (Single Instruction Multi Data) type, so that the same instruction is executed in all the SPEs 2A. The same applies to SPE2B, SPE2C, and SPE2D.
These SPEs 2A to 2D are configured by a combination of two types of SPEs having different configurations. A description will be given by taking as an example a combination of SPE2A having the basic control function of the arithmetic processor 2 and SPE2B to SPE2D under the control of SPE2A.

制御信号生成部（ＰＥ−Ｉ）３は、演算プロセッサー２の命令の実行順序を制御する。
ＰＥ−Ｉ３は、演算プロセッサー２のプログラムにおけるループ処理やサブルーチンコールなどの条件分岐を必要とする処理の制御を行う。ＰＥ-Ｉ３及びＰＥ２の命令をアセンブラプログラムで記述すると、ＳＰＥ２ＡからＳＰＥ２Ｄ及びＰＥ-Ｉ３の命令の５命令を並列に実行するVLIW型の命令として記述される。
SIMD+VLIW型の並列計算装置１で実行されるプログラムコードは、計算開始前にＩＯ−ＣＰＵ４によって外部メモリ９から予め読み込まれ、ＰＥ-Ｉ３に付属する命令メモリ５に書き込まれる。その後、ＩＯ−ＣＰＵ４がＰＥ-Ｉ３に計算開始信号を送ると、ＰＥ-Ｉ３は命令メモリから自分自身で実行する命令と、ＳＰＥ２ＡからＳＰＥ２Ｄで実行すべき４個の命令とを読み出して計算を開始する。計算対象のデータはＩＯ−ＣＰＵ４によって外部から取り込まれ、ＰＥ２のデータ入力レジスターにそれぞれ分割して転送される。また、計算結果はＩＯ−ＣＰＵ４によって演算処理部１００から読み出され、外部機器又は外部メモリ９へ転送される。
このように、演算処理部１００における複数のＰＥ２は、並列して演算処理を行うことができる。 The control signal generator (PE-I) 3 controls the execution order of instructions of the arithmetic processor 2.
The PE-I 3 controls processing that requires conditional branching, such as loop processing and subroutine calls, in the program of the arithmetic processor 2. When the instructions of PE-I3 and PE2 are written in an assembler program, they are written as VLIW-type instructions in which five instructions, those of SPE2A to SPE2D and that of PE-I3, are executed in parallel.
The program code executed by the SIMD + VLIW type parallel computing device 1 is read in advance from the external memory 9 by the IO-CPU 4 before the calculation starts, and is written to the instruction memory 5 attached to the PE-I 3. When the IO-CPU 4 then sends a calculation start signal to the PE-I3, the PE-I3 reads from the instruction memory the instruction that it executes itself and the four instructions to be executed by SPE2A to SPE2D, and starts the calculation. The data to be processed is taken in from outside by the IO-CPU 4, divided, and transferred to the data input registers of the PEs 2. The calculation results are read from the arithmetic processing unit 100 by the IO-CPU 4 and transferred to an external device or the external memory 9.
Thus, the plurality of PEs 2 in the arithmetic processing unit 100 can perform arithmetic processing in parallel.
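
As a rough illustration of the SIMD + VLIW organization described above, the following Python sketch models one instruction word with a PE-I slot and four SPE slots, and broadcasts each SPE slot to the same-named SPE in all 108 PEs. The class name `VliwWord`, the function `issue`, and the mnemonics (`LOOP`, `ADD`, `SUB`, `NOP`, `MUL`) are hypothetical and are not taken from the patent; only the counts (108 PEs, four SPEs per PE, five instructions per word) come from the description.

```python
from dataclasses import dataclass

NUM_PE = 108                       # PEs 2-0 .. 2-107, per the description
SPE_NAMES = ("A", "B", "C", "D")   # the four SPEs in each PE


@dataclass(frozen=True)
class VliwWord:
    """One instruction word: a PE-I slot plus one slot per SPE group (assumed layout)."""
    pe_i: str
    spe: tuple  # four mnemonics, for SPE2A..SPE2D


def issue(word: VliwWord) -> dict:
    """Broadcast each SPE slot to the same-named SPE in every PE (SIMD across PEs)."""
    return {(pe, name): op
            for pe in range(NUM_PE)
            for name, op in zip(SPE_NAMES, word.spe)}


word = VliwWord(pe_i="LOOP", spe=("ADD", "SUB", "NOP", "MUL"))
dispatched = issue(word)
print(len(dispatched))                                   # 432 SPEs receive an instruction
print({dispatched[(pe, "A")] for pe in range(NUM_PE)})   # {'ADD'}: all SPE2A run the same op
```

The dictionary key `(pe, name)` makes the two axes explicit: different instructions across the four SPE names (the VLIW axis), the same instruction across the 108 PE indices (the SIMD axis).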

図を参照し演算プロセッサー２のプログラミングモデルを説明する。
図２は、本発明の実施形態における演算プロセッサー２における各サブプロセッサーが参照可能な記憶部を示す図である。この図に示されるＲ０からＲ１５は、各ＳＰＥが参照できる記憶領域を示す。 A programming model of the arithmetic processor 2 will be described with reference to the drawings.
FIG. 2 is a diagram illustrating a storage unit that can be referred to by each sub-processor in the arithmetic processor 2 according to the embodiment of the present invention. R0 to R15 shown in this figure indicate storage areas that can be referred to by each SPE.

各サブプロセッサーは、アキュムレータ（Ａｃｃ）方式で形成されている。つまり、ＡＬＵ（Arithmetic and Logic Unit）の入力の一方はＡｃｃに固定され、他方の入力だけ入力するデータの参照先を指定することできる。また、演算結果は通常、Ａｃｃに格納される。このように限定することで命令に必要なオペランドを少なくでき、機械語のビット数を減らすことができる。
４個のＳＰＥから共通に参照可能な１２個のレジスターR4〜R15がある。これらのレジスターは各ＳＰＥから読み書きできるが、２つ以上のＳＰＥからの書き込み処理が重なる場合は、ＳＰＥ２Ａ、ＳＰＥ２Ｂ、ＳＰＥ２Ｃ、ＳＰＥ２Ｄの順で優先的に処理される。ＳＰＥ２ＡのＡｃｃは、ＳＰＥ２Ａが参照するほかに、他のＳＰＥからはレジスターR0として参照することにより読み出すことができる。また、ＳＰＥ２ＡのＡｃｃは、ＳＰＥ２Ａは書き込むことができるが、ＳＰＥ２Ａ以外のＳＰＥからは書き込むことはできない。同様に、ＳＰＥ２Ｂ、ＳＰＥ２Ｃ、ＳＰＥ２ＤのＡｃｃは、それぞれレジスターR1、R2、R3として参照することにより読み出すことができるが、同じＳＰＥにないＡｃｃには書き込むことはできない。 Each sub-processor is formed by an accumulator (Acc) method. That is, one of the inputs of the ALU (Arithmetic and Logic Unit) is fixed to Acc, and the reference destination of the data to be input only for the other input can be designated. The calculation result is normally stored in Acc. By limiting in this way, the number of operands required for the instruction can be reduced, and the number of bits in the machine language can be reduced.
There are 12 registers R4 to R15 that can be commonly referenced from the 4 SPEs. These registers can be read from and written to by each SPE, but when writing processes from two or more SPEs overlap, they are processed preferentially in the order of SPE2A, SPE2B, SPE2C, and SPE2D. The Acc of the SPE2A can be read by referring to the register R0 from other SPEs in addition to the SPE2A. The Acc of SPE2A can be written by SPE2A, but cannot be written by SPEs other than SPE2A. Similarly, Acc of SPE2B, SPE2C, and SPE2D can be read by referring to registers R1, R2, and R3, respectively, but cannot be written to Acc that is not in the same SPE.
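
The register visibility rules above can be sketched as a small Python model. This is an illustration only: the class `PERegisters` and its method names are hypothetical, and the conflict rule is an assumption. The text says overlapping writes are handled with priority in the order SPE2A, SPE2B, SPE2C, SPE2D; the sketch assumes this means the highest-priority writer's value is the one retained.

```python
class PERegisters:
    """Register view shared by the four SPEs of one PE (illustrative model)."""

    SPE_ORDER = ("A", "B", "C", "D")          # write priority, highest first

    def __init__(self):
        self.acc = {name: 0 for name in self.SPE_ORDER}   # visible as R0..R3
        self.shared = {n: 0 for n in range(4, 16)}        # R4..R15

    def read(self, reg: int) -> int:
        """Any SPE may read R0..R15; R0..R3 alias the accumulators of SPE2A..SPE2D."""
        if 0 <= reg <= 3:
            return self.acc[self.SPE_ORDER[reg]]
        return self.shared[reg]

    def write_acc(self, spe: str, value: int) -> None:
        """An SPE can write only its own accumulator."""
        self.acc[spe] = value

    def commit_shared_writes(self, requests: dict) -> None:
        """Apply one cycle of writes to R4..R15.

        requests maps SPE name -> (register, value). On a conflict, this sketch
        ASSUMES the highest-priority SPE (A, then B, C, D) wins.
        """
        claimed = set()
        for spe in self.SPE_ORDER:
            if spe in requests:
                reg, value = requests[spe]
                if not 4 <= reg <= 15:
                    raise ValueError("only R4..R15 are writable through this path")
                if reg not in claimed:
                    self.shared[reg] = value
                    claimed.add(reg)


regs = PERegisters()
regs.write_acc("B", 7)                                   # SPE2B writes its own Acc
regs.commit_shared_writes({"C": (4, 30), "A": (4, 10), "D": (5, 99)})
print(regs.read(1), regs.read(4), regs.read(5))          # 7 10 99
```

Reading R1 returns the accumulator of SPE2B, and the simultaneous writes to R4 from SPE2A and SPE2C resolve in favor of SPE2A under the assumed rule.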

図３は、本発明の実施形態における演算プロセッサー２の構成を示すブロック図である。
演算プロセッサー２は、ＳＰＥ２Ａ〜２Ｄと、各ＳＰＥから参照されるレジスター９１を備える
ＳＰＥ２Ａは、Ａｃｃ９３Ａ、セレクター９４Ａ、ＡＬＵ９５Ａ、フラグレジスター９７Ａ、ＳＰＥ制御部９９Ａを備える。同様に、ＳＰＥ２Ｂは、Ａｃｃ９３Ｂ、セレクター９４Ｂ、ＡＬＵ９５Ｂ、フラグレジスター９７Ｂ、ＳＰＥ制御部９９Ｂを備える。ＳＰＥ２Ｃは、Ａｃｃ９３Ｃ、セレクター９４Ｃ、ＡＬＵ９５Ｃ、フラグレジスター９７Ｃ、制御部９９Ｃを備える。ＳＰＥ２Ｄは、Ａｃｃ９３Ｄ、セレクター９４Ｄ、ＡＬＵ９５Ｄ、フラグレジスター９７Ｄ、制御部９９Ｄを備える。
Ａｃｃ９３Ａ〜Ａｃｃ９３Ｄは、ＡＬＵ９５Ａ〜９５Ｄが参照するアキュムレータである。セレクター９４Ａ〜９４Ｄは、Ａｃｃ９３Ａ〜Ａｃｃ９３Ｄからの入力を選択する。ＡＬＵ９５Ａ〜９５Ｄは、各ＳＰＥにおいて演算を行うＡＬＵである。 FIG. 3 is a block diagram showing a configuration of the arithmetic processor 2 in the embodiment of the present invention.
The arithmetic processor 2 includes SPEs 2A to 2D and a register 91 referred to by each SPE. The SPE 2A includes an Acc 93A, a selector 94A, an ALU 95A, a flag register 97A, and an SPE control unit 99A. Similarly, the SPE 2B includes an Acc 93B, a selector 94B, an ALU 95B, a flag register 97B, and an SPE control unit 99B. The SPE2C includes an Acc 93C, a selector 94C, an ALU 95C, a flag register 97C, and a control unit 99C. The SPE2D includes an Acc 93D, a selector 94D, an ALU 95D, a flag register 97D, and a control unit 99D.
Acc93A to Acc93D are accumulators referred to by ALUs 95A to 95D. The selectors 94A to 94D select inputs from the Acc93A to Acc93D. ALUs 95A to 95D are ALUs that perform operations in each SPE.

まず、ＳＰＥ２Ａ〜２Ｄに共通する構成について示し、ＳＰＥ２Ａを代表して説明する。
ＳＰＥ２Ａにおいて、ＡＬＵ９５Ａの一方の入力は、Ａｃｃ９３Ａからのデータが供給される。ＡＬＵ９５Ａの他方の入力は、セレクター９４Ａにより、レジスター９１、Ａｃｃ９３ＡとＡｃｃ９３ＢとＡｃｃ９３ＣとＡｃｃ９３Ｄからのデータが選択され供給される。ＡＬＵ９５Ａによる演算結果は通常はＡｃｃ９３Ａに書き込まれるが、Ａｃｃ９３Ａのデータをレジスター９１へ転送する命令を使って、レジスターR4〜R15のいずれかを選択して書き込むことができる。但し、ＳＰＥ２Ａ以外の他のＳＰＥのＡｃｃ９３Ｂ〜Ａｃｃ９３Ｄに書き込むことはできない。
ＳＰＥ２Ｂ〜ＳＰＥ２Ｄにおいても、ＳＰＥ２Ａと同じ構成を有する。 First, a configuration common to the SPEs 2A to 2D is shown, and the SPE 2A will be described as a representative.
In SPE2A, one input of ALU 95A is supplied with data from Acc93A. The other input of the ALU 95A is selected and supplied by the selector 94A from the register 91, Acc93A, Acc93B, Acc93C, and Acc93D. The calculation result by the ALU 95A is normally written in the Acc 93A, but any one of the registers R4 to R15 can be selected and written using an instruction for transferring the data of the Acc 93A to the register 91. However, data cannot be written in Acc93B to Acc93D of other SPEs other than SPE2A.
SPE2B to SPE2D also have the same configuration as SPE2A.

また、フラグレジスター９７Ａは、ＡＬＵ９５Ａにおける演算処理の結果を示すフラグの値を記録し、保持する。フラグレジスター９７Ａが保持するフラグは、ＡＬＵ９５Ａが出力する４つのフラグ（Ｃ、Ｎ、Ｖ、Ｚ）がある。Ｃ（キャリー）フラグは、演算結果に桁上がりが生じたことを示す。Ｎ（ネガティブ）フラグは、演算処理により値が負となったことを示す。Ｖ（オーバーフロー）フラグは、演算処理により値がオーバーフローしたことを示す。Ｚ（ゼロ）フラグは、演算処理により値が０になったことを示す。 The flag register 97A records and holds a flag value indicating the result of the arithmetic processing in the ALU 95A. There are four flags (C, N, V, Z) output from the ALU 95A as the flags held by the flag register 97A. The C (carry) flag indicates that a carry has occurred in the operation result. The N (negative) flag indicates that the value has become negative by the arithmetic processing. The V (overflow) flag indicates that the value has overflowed due to the arithmetic processing. The Z (zero) flag indicates that the value has become 0 by the arithmetic processing.
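
For readers who want to see the four condition flags concretely, the following Python sketch derives C, N, V, and Z for an addition. It is not taken from the patent: the 16-bit word width and the two's-complement interpretation of the V flag are assumptions made for illustration, and `add_with_flags` is a hypothetical name.

```python
WIDTH = 16                      # assumed word width; not stated in this section
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)


def add_with_flags(a: int, b: int):
    """Add two WIDTH-bit values and derive the C, N, V, Z condition flags."""
    a &= MASK
    b &= MASK
    full = a + b
    result = full & MASK
    flags = {
        "C": int(full > MASK),                            # carry out of the top bit
        "N": int(bool(result & SIGN)),                    # result negative (two's complement)
        "V": int(bool(~(a ^ b) & (a ^ result) & SIGN)),   # signed overflow (assumed meaning)
        "Z": int(result == 0),                            # result is zero
    }
    return result, flags


print(add_with_flags(0x7FFF, 1))   # (32768, {'C': 0, 'N': 1, 'V': 1, 'Z': 0})
print(add_with_flags(0xFFFF, 1))   # (0, {'C': 1, 'N': 0, 'V': 0, 'Z': 1})
```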

制御部９９Ａは、フラグレジスター９７Ａなどに記録されているフラグの値や、ＰＥ-Ｉ３からの制御により、アキュムレータＡｃｃ９３Ａやレジスター９１（R4-R15）及びフラグレジスター９７Ａを制御する。
なお、フラグレジスター９７Ａ〜９７Ｄ及び制御部９９Ａ〜９９Ｄは、構成、機能の定義を代えることにより、いくつかの並列処理の方法を設定することができる。詳細については、以下に示す実施形態を参照する。 The control unit 99A controls the accumulator Acc93A, the register 91 (R4-R15), and the flag register 97A according to the value of the flag recorded in the flag register 97A or the like, or control from the PE-I3.
Note that the flag registers 97A to 97D and the control units 99A to 99D can set several parallel processing methods by changing the definitions of the configuration and function. For details, refer to the embodiment shown below.

続いて、並列計算装置１における多重ネスト（入れ子）を可能とする条件分岐処理を実現する構成例について説明する。
各実施形態に共通する基本構成として、各ＡＬＵにおける命令の実行制御を命令ごとに判断するのではなく、一つのフラグ（Ｇ)を設けて、その値が「１」ならば命令を実行し、「０」ならば実行しないという判定を行うこととする。このような構成にすることで命令ごとの条件判断フィールド（ビット）が不要になり、オブジェクトコードをコンパクトにできる。さらに、スタック構造を設けたＧフラグスタックによってこのフラグの値を保持することで、ＰＥ２の処理は、多重ネストを可能とする条件分岐処理が実現できる。
また、Ｇフラグスタック内に保持される全ての値の論理積を取った信号をＧフラグと呼ぶことにする。各ＳＰＥでは、Ｇフラグの値が「１」の場合に命令を実行し、「０」の場合には命令は実行しないように制御することが容易になる。また、ＰＥ２をリセット（初期化）した直後は、Ｇフラグスタック内の値は全て１とする。これにより、リセット直後の命令の実行は、Ｇフラグにより制限されることはない。 Next, a configuration example for realizing conditional branch processing that enables multiple nesting (nesting) in the parallel computing device 1 will be described.
As a basic configuration common to each embodiment, rather than determining instruction execution control in each ALU for each instruction, a single flag (G) is provided, and if the value is “1”, the instruction is executed. If “0”, it is determined not to execute. With this configuration, the condition determination field (bit) for each instruction becomes unnecessary, and the object code can be made compact. Further, by holding the value of this flag by the G flag stack provided with the stack structure, the processing of PE2 can realize a conditional branch process that enables multiple nesting.
A signal obtained by taking the logical product of all the values held in the G flag stack is called a G flag. In each SPE, it is easy to perform control so that an instruction is executed when the value of the G flag is “1” and is not executed when the value is “0”. Immediately after PE2 is reset (initialized), all values in the G flag stack are set to 1. Thereby, the execution of the instruction immediately after the reset is not restricted by the G flag.

図４は、本発明の実施形態における構造化プログラミング用に導入する６個の命令を示す。
これら６個の命令は、Ｇフラグの値に拘らず実行される。
「PSH」命令はオペランドにＣ、Ｎ、Ｖ、Ｚの各コンディションフラグの中から任意の数のコンディションフラグを選択し、条件判定の条件に指定できる。この命令はＧフラグスタックを１段下にプッシュし、最上段に新たな値を設定する。例えば、「PSH C, Z」命令とすると、Ｃ（キャリー）フラグとＺ（ゼロ）フラグの論理和を取って、それが「１」ならばＧフラグスタックの最上段の値を１にし、「０」ならば最上段の値を０にする。
「PSHI」命令は、Ｇフラグスタックを１段下にプッシュし、最上段に新たな値を設定する。この命令は、オペランド指定されたフラグの論理和を取った後で、それが「０」ならば、Ｇフラグスタックの最上段の値を１にし、「１」ならば最上段の値を０にする。これらの命令は「if 〜 then 文」によって示される処理に相当する。
「GINV」命令は、Ｇフラグスタックの最上段の値を反転するので、「else文」に相当する。
「POP」命令は、Ｇフラグスタックを１段上にポップ(シフト)し、最下層に１をセットする。これは「if文」の最後に相当する。
「POPI」命令は、「POP」命令と「GINV」命令を一つに纏めたものである。
「FLSH」命令は、Ｇフラグスタックに保持される値を全て１にする。 FIG. 4 shows six instructions introduced for structured programming in an embodiment of the present invention.
These six instructions are executed regardless of the value of the G flag.
The “PSH” instruction takes any number of the condition flags C, N, V, and Z as operands, which specify the condition to be tested. This instruction pushes the G flag stack down one level and sets a new value at the top level. For example, the “PSH C, Z” instruction takes the logical sum of the C (carry) flag and the Z (zero) flag; if the result is “1”, the value at the top of the G flag stack is set to 1, and if it is “0”, the top value is set to 0.
The “PSHI” instruction pushes the G flag stack down one level and sets a new value at the top level. This instruction takes the logical sum of the flags designated by the operands; if the result is “0”, the value at the top of the G flag stack is set to 1, and if it is “1”, the top value is set to 0. These instructions correspond to the processing expressed by an “if ... then” statement.
The “GINV” instruction inverts the value at the top of the G flag stack, so it corresponds to an “else” statement.
The “POP” instruction pops (shifts) the G flag stack up one level and sets the lowest level to 1. This corresponds to the end of an “if” statement.
The “POPI” instruction combines the “POP” instruction and the “GINV” instruction into one.
The “FLSH” instruction sets all the values held in the G flag stack to 1.
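
The behaviour of the six instructions can be sketched in Python as follows. This is a minimal behavioural model written for illustration, not the patent's implementation: the class `GFlagStack` and its method names are hypothetical, the condition flags are passed as a plain dictionary, and the depth of four levels follows the example given in the description.

```python
class GFlagStack:
    """G flag stack model; index 0 is the top level. Depth 4 follows the example in the text."""

    def __init__(self, depth: int = 4):
        self.levels = [1] * depth          # reset state: all ones, so execution is enabled

    @property
    def g(self) -> int:
        """The G flag: logical AND of every level."""
        return int(all(self.levels))

    def _push(self, value: int) -> None:
        self.levels = [value] + self.levels[:-1]   # the bottom level is discarded

    def psh(self, flags: dict, *names: str) -> None:
        """PSH: push the OR of the selected condition flags."""
        self._push(int(any(flags[n] for n in names)))

    def pshi(self, flags: dict, *names: str) -> None:
        """PSHI: push the inverted OR of the selected condition flags."""
        self._push(int(not any(flags[n] for n in names)))

    def ginv(self) -> None:
        """GINV: invert the top level (the 'else' branch)."""
        self.levels[0] ^= 1

    def pop(self) -> None:
        """POP: shift up one level and fill the bottom with 1 (end of 'if')."""
        self.levels = self.levels[1:] + [1]

    def popi(self) -> None:
        """POPI: POP followed by GINV."""
        self.pop()
        self.ginv()

    def flsh(self) -> None:
        """FLSH: set every level to 1."""
        self.levels = [1] * len(self.levels)


s = GFlagStack()
trace = []
s.psh({"C": 0, "N": 0, "V": 0, "Z": 1}, "Z")   # outer if (Z): condition true
trace.append(s.g)
s.psh({"C": 0, "N": 1, "V": 0, "Z": 0}, "C")   # inner if (C): condition false
trace.append(s.g)
s.ginv()                                        # inner else
trace.append(s.g)
s.pop()                                         # end of inner if
s.ginv()                                        # outer else
trace.append(s.g)
s.pop()                                         # end of outer if
trace.append(s.g)
print(trace)                                    # [1, 0, 1, 0, 1]
```

The trace shows the G flag enabling the outer "then" body, disabling the inner "then" body, enabling the inner "else" body, disabling the outer "else" body, and returning to 1 once both levels have been popped.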

図５は、本発明の実施形態におけるフラグ処理部の構成例を示すブロック図である。
図に示されるフラグ処理部１０は、Ｇフラグスタック１１、ＯＲ回路１２、ＡＮＤ回路１３、ＯＲ回路１４、ＯＲ回路１７、ＥＸＯＲ回路１８及び合成部（ＡＮＤ回路）１９が示されている。フラグ処理部１０は、ＰＥごとに少なくとも１個が設けられる。
Ｇフラグスタック１１は、フラグの値を記憶するスタック構造化された記憶部である。例としてスタックの階層を４層として示す。したがって、４層までのネスティングに対応可能である。同様の構成が全てのＰＥ２に必要である。Ｇフラグスタックは常に、図示されない並列計算装置１内部の基本クロックの立ち上がりで変化する。 FIG. 5 is a block diagram illustrating a configuration example of the flag processing unit in the embodiment of the present invention.
The flag processing unit 10 shown in the figure includes a G flag stack 11, an OR circuit 12, an AND circuit 13, an OR circuit 14, an OR circuit 17, an EXOR circuit 18, and a combining unit (AND circuit) 19. At least one flag processing unit 10 is provided for each PE.
The G flag stack 11 is a storage unit having a stack structure for storing flag values. As an example, the stack is shown with four levels, so nesting up to four levels deep can be supported. The same configuration is required in every PE2. The G flag stack always changes state at the rising edge of the basic clock (not shown) inside the parallel computing device 1.

図において、cnt_xxxとして示す信号はＰＥ-Ｉ３でＰＥ２の命令をデコードし、各ＰＥ２に含まれる同じ種類（例えばＳＰＥ２Ａ）の全てのＳＰＥに共通に与えられる制御信号である。上記の同じ種類の全てのＳＰＥは、ＳＩＭＤ構成により同じ命令による処理が並列処理されるものであり、その単位でＳＰＥ群を形成する。これらの制御信号は、ＳＰＥ群ごとに異なる。
一方、flag_xとして示す信号は、ＡＬＵ（例えば、図２におけるＡＬＵ９５Ａ）が出力したコンディションフラグをフラグレジスター（例えば、図２におけるフラグレジスター９７Ａ）で保持した値を出力する出力信号を示す。flag_xという信号は、個々のＳＰＥから出力される固有のコンディションフラグの値を示す信号である。したがって、全てのＳＰＥにそれぞれフラグレジスターを配置した構成では、ＰＥ２が１０８個在り、各ＰＥ２にＳＰＥが４個在るので、合計４３２本の異なる信号になる。
system_resetとして示す信号は、並列計算装置１のシステム全体をリセットする共通信号であり、この信号又はcnt_FLSH信号がアクティブになると、Ｇフラグスタックの値は全て１になる。
cnt_FLSH信号は、「FLSH」命令が発行されるとアクティブになる。 In the figure, a signal indicated as cnt_xxx is a control signal that is commonly given to all SPEs of the same type (for example, SPE2A) included in each PE2 by decoding the instruction of PE2 by PE-I3. All the SPEs of the same type are processed by the same instruction in parallel by the SIMD configuration, and an SPE group is formed in that unit. These control signals are different for each SPE group.
On the other hand, a signal indicated as flag_x indicates an output signal that outputs a value obtained by holding a condition flag output by an ALU (for example, ALU 95A in FIG. 2) in a flag register (for example, flag register 97A in FIG. 2). A signal flag_x is a signal indicating a value of a specific condition flag output from each SPE. Therefore, in the configuration in which flag registers are arranged in all SPEs, there are 108 PE2s, and there are 4 SPEs in each PE2, so that a total of 432 different signals are obtained.
The signal indicated as system_reset is a common signal that resets the entire system of the parallel computing device 1, and when this signal or the cnt_FLSH signal becomes active, the values of the G flag stack all become 1.
The cnt_FLSH signal becomes active when a “FLSH” instruction is issued.

「PSH」命令が発行されるとcnt_PSH信号がアクティブになり、Ｇフラグスタックがプシュされる。すなわち、スタックG0に保持された値がスタックG1へ、スタックG1に保持された値がスタックG2へ、スタックG2に保持された値がスタックG3へとシフトされる。スタックG3に保持された値は捨てられる。同時に「PSH」命令のオペランド指定に応じてcnt_C_en、cnt_N_en、cnt_V_en、cnt_Z_en信号がアクティブになり、キャリーフラグ（C）、ネガティブフラグ（N）、オーバーフローフラグ（V）、ゼロフラグ（Z）との論理和が取られて、その値がスタックG0に書き込まれる。また、「PSHI」命令は、前述の「PSH」命令のプッシュ動作と同様な動作をする。であるが、各コンディションフラグの論理和を取った後で反転されてからスタックG0に書き込まれる。 When the “PSH” instruction is issued, the cnt_PSH signal becomes active and the G flag stack is pushed. That is, the value held in the stack G0 is shifted to the stack G1, the value held in the stack G1 is shifted to the stack G2, and the value held in the stack G2 is shifted to the stack G3. The value held in the stack G3 is discarded. At the same time, the cnt_C_en, cnt_N_en, cnt_V_en, and cnt_Z_en signals become active according to the operand specification of the “PSH” instruction, and OR with the carry flag (C), negative flag (N), overflow flag (V), and zero flag (Z). Is taken and its value is written to stack G0. The “PSHI” instruction performs the same operation as the push operation of the “PSH” instruction described above. However, after taking the logical sum of the condition flags, they are inverted and then written to the stack G0.

「GINV」命令が発行されるとcnt_GINV信号がアクティブになり、スタックG0の値が反転される。
「POP」命令が発行されるとcnt_POP信号がアクティブになり、Ｇフラグスタックがポップされる。すなわち、スタックG1に保持された値がスタックG0へ、スタックG2に保持された値がスタックG1へ、スタックG3に保持された値がスタックG2へとシフトされる。また、スタックG3には１がセットされる。
「POPI」命令は、「POP」命令と「GINV」命令を組み合わせて一度に行う。すなわち、Ｇフラグスタックを１段ポップして、その後で最上段のスタックG0を反転する。
合成部１９は、スタックG0からスタックG3の全ての値の論理積を取った結果を示す信号が、命令の実行を制御する信号Global_Inst_en（Ｇフラグ）になる。この信号はＳＰＥごとに異なる。 When the “GINV” instruction is issued, the cnt_GINV signal becomes active, and the value of the stack G0 is inverted.
When the “POP” instruction is issued, the cnt_POP signal becomes active and the G flag stack is popped. That is, the value held in the stack G1 is shifted to the stack G0, the value held in the stack G2 is shifted to the stack G1, and the value held in the stack G3 is shifted to the stack G2. Further, 1 is set in the stack G3.
The “POPI” instruction is executed at once by combining the “POP” instruction and the “GINV” instruction. That is, the G flag stack is popped by one stage, and then the uppermost stack G0 is inverted.
The combining unit 19 takes the logical product of all the values of the stacks G0 to G3, and the resulting signal is Global_Inst_en (the G flag), which controls the execution of instructions. This signal is different for each SPE.
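
At the signal level, the flag processing unit can be approximated by a next-state function evaluated once per rising clock edge. The Python sketch below is an assumption-laden illustration rather than the circuit of FIG. 5: the signal names `cnt_PSH`, `cnt_GINV`, `cnt_POP`, `cnt_FLSH`, `cnt_C_en` to `cnt_Z_en`, and `system_reset` appear in the description, while `cnt_PSHI` and `cnt_POPI` are hypothetical names introduced here for the two remaining instructions, and the priority among simultaneously active signals is also assumed.

```python
def next_g_stack(stack, ctrl, flags):
    """Compute the G flag stack contents after the next rising clock edge.

    stack : [G0, G1, G2, G3]
    ctrl  : set of control signal names that are active in this cycle
    flags : condition flags of this SPE, e.g. {"C": 0, "N": 0, "V": 0, "Z": 1}
    """
    g0, g1, g2, g3 = stack
    if "system_reset" in ctrl or "cnt_FLSH" in ctrl:
        return [1, 1, 1, 1]
    if "cnt_PSH" in ctrl or "cnt_PSHI" in ctrl:
        # OR of the flags enabled by cnt_C_en / cnt_N_en / cnt_V_en / cnt_Z_en
        cond = int(any(flags[f] for f in "CNVZ" if f"cnt_{f}_en" in ctrl))
        if "cnt_PSHI" in ctrl:
            cond ^= 1                      # PSHI writes the inverted condition
        return [cond, g0, g1, g2]          # the old G3 is discarded
    if "cnt_POP" in ctrl or "cnt_POPI" in ctrl:
        top = g1 ^ 1 if "cnt_POPI" in ctrl else g1
        return [top, g2, g3, 1]            # G3 is refilled with 1
    if "cnt_GINV" in ctrl:
        return [g0 ^ 1, g1, g2, g3]
    return [g0, g1, g2, g3]


def global_inst_en(stack):
    """Combining unit 19: logical AND of G0..G3."""
    return int(all(stack))


# The same control signals reach two PEs whose Z flags differ.
ctrl = {"cnt_PSH", "cnt_Z_en"}
lane_flags = [{"C": 0, "N": 0, "V": 0, "Z": 1},
              {"C": 0, "N": 0, "V": 0, "Z": 0}]
stacks = [next_g_stack([1, 1, 1, 1], ctrl, f) for f in lane_flags]
print([global_inst_en(s) for s in stacks])   # [1, 0]
```

The usage lines illustrate why the control signals can be common to an SPE group while the resulting G flag differs per SPE: the only per-SPE input is the set of condition flags.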

ＰＥ２の制御において、「命令を実行しない」という動作を、「演算結果を書き込まない」ということで実現できる。そこで、ＰＥ-Ｉ３に在る命令デコーダ（図示しない）から供給されるＡｃｃの書き込み制御信号やレジスターR4〜R15の書き込み制御信号、或いはＣ、Ｎ、Ｖ、Ｚのコンディションフラグの書き込み制御信号に、Ｇフラグ（Global_Inst_en信号）との論理積をとることにより、命令の実行制御機構を実現する。 In the control of PE2, the operation of “not executing an instruction” can be realized by “not writing the operation result”. The instruction execution control mechanism is therefore realized by taking the logical product of the G flag (Global_Inst_en signal) with each of the Acc write control signal, the write control signals for the registers R4 to R15, and the write control signals for the C, N, V, and Z condition flags, all of which are supplied from an instruction decoder (not shown) in PE-I3.

図６は、本発明の実施形態におけるＡｃｃへの書き込み制御回路の構成例を示すブロック図である。
図には、Ａｃｃ制御部９２とＡｃｃ９３が示され、ALU_out信号は、ＡＬＵ（例えば、ＡＬＵ９５Ａ）が出力する信号であり、Acc_out信号は、Ａｃｃ９３がＡＬＵ（例えば、ＡＬＵ９５Ａ）に入力する信号である。また、Ａｃｃ制御部９２は、ＰＥ-Ｉ３に在る命令デコーダ（図示しない）からのＡｃｃ９３への書き込み制御信号cnt_Acc_wrと、Ｇフラグ（Global_Inst_en信号）との論理積をとってＡｃｃ９３のロードイネーブル信号としている。ロードイネーブル信号がアクティブになると、図示されない並列計算装置１内部の基本クロックの立ち上がりでＡｃｃ９３の状態が変化する。 FIG. 6 is a block diagram showing a configuration example of a write control circuit for Acc in the embodiment of the present invention.
The figure shows an Acc control unit 92 and Acc 93. The ALU_out signal is the signal output from an ALU (for example, ALU 95A), and the Acc_out signal is the signal that the Acc 93 inputs to the ALU (for example, ALU 95A). The Acc control unit 92 takes the logical product of the write control signal cnt_Acc_wr for the Acc 93, supplied from the instruction decoder (not shown) in the PE-I 3, and the G flag (Global_Inst_en signal), and uses the result as the load enable signal of the Acc 93. When the load enable signal is active, the state of Acc 93 changes at the rising edge of the basic clock (not shown) inside the parallel computing device 1.
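
The write gating described for the Acc can be expressed in a few lines of Python. This is a behavioural sketch under stated assumptions: the class `Accumulator` and the method `clock` are hypothetical names, and each call to `clock` stands for one rising edge of the basic clock.

```python
class Accumulator:
    """Acc 93 with the write gating of Acc control unit 92 (behavioural sketch)."""

    def __init__(self, value: int = 0):
        self.value = value

    def clock(self, alu_out: int, cnt_acc_wr: int, global_inst_en: int) -> int:
        """Model one rising clock edge and return Acc_out."""
        load_enable = cnt_acc_wr & global_inst_en   # AND taken by Acc control unit 92
        if load_enable:
            self.value = alu_out                    # the result is written
        return self.value                           # otherwise the old value is kept


acc = Accumulator(5)
print(acc.clock(alu_out=9, cnt_acc_wr=1, global_inst_en=0))   # 5: G flag is 0, write suppressed
print(acc.clock(alu_out=9, cnt_acc_wr=1, global_inst_en=1))   # 9: write performed
print(acc.clock(alu_out=3, cnt_acc_wr=0, global_inst_en=1))   # 9: instruction does not write Acc
```

The first call shows how "not executing" an instruction reduces to "not writing" its result: the ALU still produces 9, but the Acc keeps its previous value because the G flag is 0.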

図７、図８、図９を参照し、第1実施形態のプログラム例を示す。
図７は、２重にネスト（入れ子）したプログラムを、並列計算装置１のアセンブラプログラムで記述した例を示す。このプログラムでは、図７のコード中の各変数を、図８に示すように各レジスターヘ割り付けてあると仮定している。また、図４２に示した命令以外に図７で用いる命令については、図９にその動作が説明されている。図７中で”//”記号の後には、プログラムの動作を理解し易くする為のコメントをＣ言語的に示す。これらのコメントは、図４５に示されたコードに対応している。この様に、本発明の技術を用いることで構造化されたプログラムを、容易にアセンブラコードに変換できる。すなわち容易に機械語に変換できる。 A program example of the first embodiment will be described with reference to FIGS. 7, 8, and 9.
FIG. 7 shows an example in which a doubly nested program is written in the assembler language of the parallel computing device 1. This program assumes that each variable in the code of FIG. 7 is allocated to a register as shown in FIG. 8. The operations of the instructions used in FIG. 7, other than those shown in FIG. 42, are described in FIG. 9. In FIG. 7, the text after the symbol “//” is a comment in C-like notation that makes the operation of the program easier to understand. These comments correspond to the code shown in FIG. 45. In this way, a structured program can be easily converted into assembler code, and therefore into machine language, by using the technique of the present invention.

図１０は、並列計算装置１の演算処理部の概略構成を示すブロック図である。
ここでは、演算制御処理の説明に必要な主たる構成を示す。演算処理に関する基本構成は図３を参照する。
図に示される演算処理部１００は、並列して演算処理を行う複数の演算プロセッサー（ＰＥ）１０２と、複数のＰＥ１０２にＳＰＥ制御信号線を介して制御命令を供給する制御信号生成部（ＰＥ−Ｉ）３とを備える。 FIG. 10 is a block diagram illustrating a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Here, a main configuration necessary for explaining the arithmetic control process is shown. Refer to FIG. 3 for the basic configuration regarding the arithmetic processing.
The arithmetic processing unit 100 shown in the figure includes a plurality of arithmetic processors (PE) 102 that perform arithmetic processing in parallel, and a control signal generation unit (PE-I) 3 that supplies control commands to the plurality of PEs 102 via SPE control signal lines.

ＰＥ１０２のそれぞれが、サブプロセッサー１０２Ａ（ＳＰＥ１０２Ａ）と、サブプロセッサー１０２Ｂ〜１０２Ｄ（ＳＰＥ１０２Ｂ〜１０２Ｄ）を備える。
ＳＰＥ１０２Ａ〜ＳＰＥ１０２Ｄは、図３のＳＰＥ２Ａ〜２Ｄの演算処理の基本構成と同じ構成を有するほかに、それぞれ次の構成を有する。
ＳＰＥ１０２Ａは、Ｇフラグ処理部１０とＳＰＥ制御部１９９Ａを備え、ＳＰＥ１０２Ｂ〜１０２Ｄは、ＳＰＥ制御部１９９Ｂ〜１９９Ｄを備える。
また、ＳＰＥ１０２ＡにおけるＧフラグ処理部１０は、ＳＰＥ制御部１９９Ａ〜１９９ＤにＧフラグ信号を供給する。ＳＰＥ制御部１９９Ａ〜１９９Ｄは、供給されたＧフラグ信号に基づいて、それぞれのＳＰＥにおける演算制御を行う。 Each of the PEs 102 includes a sub processor 102A (SPE 102A) and sub processors 102B to 102D (SPEs 102B to 102D).
The SPEs 102A to 102D have the same basic configuration for arithmetic processing as the SPEs 2A to 2D in FIG. 3 and, in addition, each have the following configuration.
The SPE 102A includes a G flag processing unit 10 and an SPE control unit 199A, and the SPEs 102B to 102D include SPE control units 199B to 199D.
Further, the G flag processing unit 10 in the SPE 102A supplies the G flag signal to the SPE control units 199A to 199D. The SPE control units 199A to 199D perform calculation control in each SPE based on the supplied G flag signal.

図１１は、演算プロセッサーにおける演算制御処理を行う構成を示すブロック図である。
この図には、ＰＥ１０２におけるＳＰＥ１０２Ａ〜１０２Ｄの構成が示される。
ＳＰＥ１０２Ａでは、ＳＰＥ演算処理部１９０ＡとＧフラグ処理部１０の詳細構成、ＳＰＥ１０２Ｂでは、ＳＰＥ演算処理部１９０Ｂの詳細構成が示される。ＳＰＥ１０２ＣとＳＰＥ１０２Ｄでは、ＳＰＥ１０２Ｂと同様の構成を備えることから、記載を省略する。また、前述の図５と同じ構成には同じ数字の符号を附し、異なる構成について説明する。 FIG. 11 is a block diagram showing a configuration for performing arithmetic control processing in the arithmetic processor.
This figure shows the configuration of the SPEs 102A to 102D in the PE 102.
The SPE 102A shows the detailed configuration of the SPE calculation processing unit 190A and the G flag processing unit 10, and the SPE 102B shows the detailed configuration of the SPE calculation processing unit 190B. Since SPE102C and SPE102D have the same configuration as SPE102B, description thereof is omitted. Further, the same components as those in FIG. 5 described above are denoted by the same reference numerals, and different components will be described.

ＳＰＥ１０２ＡにおけるＳＰＥ演算処理部１９０Ａは、前述の図３に示したＡｃｃ９３Ａ、セレクター９４Ａ、ＡＬＵ９５Ａ、フラグレジスター９７Ａ、ＳＰＥ制御部１９９Ａを備える。
フラグレジスター９７Ａは、フラグレジスター９７Ａ−Ｃ、９７Ａ−Ｎ、９７Ａ−Ｖ、９７Ａ−Ｚを備え、それぞれが、ＡＬＵ９５Ａの演算結果に応じて変化するコンディションフラグＣ、Ｎ、Ｖ、Ｚの値を記録し、保持する。フラグレジスター９７Ａ−Ｃ、９７Ａ−Ｎ、９７Ａ−Ｖ、９７Ａ−Ｚは、記録された値の基づいてflag-C信号、flag-N信号、flag-V信号、flag-Z信号をそれぞれ出力する。 The SPE arithmetic processing unit 190A in the SPE 102A includes the Acc 93A, selector 94A, ALU 95A, and flag register 97A shown in FIG. 3 described above, and the SPE control unit 199A.
The flag register 97A includes flag registers 97A-C, 97A-N, 97A-V, and 97A-Z, each of which records and holds the value of the corresponding condition flag C, N, V, or Z, which changes according to the calculation result of the ALU 95A. The flag registers 97A-C, 97A-N, 97A-V, and 97A-Z output a flag-C signal, a flag-N signal, a flag-V signal, and a flag-Z signal, respectively, based on the recorded values.

ＳＰＥ１０２Ａにおける制御部１９９Ａは、Ａｃｃ制御部９２Ａとフラグ制御部９６Ａを備える。
Ａｃｃ制御部９２Ａは、図６に示したＡｃｃ制御部９２と同じ構成であるが、ＳＰＥ１０２Ａの構成であることを示すため符号に「A」を付している。Ａｃｃ制御部９２Ａは、ＰＥ-Ｉ３に在る命令デコーダ（図示しない）からのアキュムレータへの書き込み信号cnt_Acc_wr_Aと、Ｇフラグ（Global_Inst_en信号）との論理積をとってＡｃｃ９３Ａのロードイネーブル信号としている。
フラグ制御部９６Ａは、フラグレジスター９７Ａに記憶される各コンディションフラグの値の書き込みをＧフラグとＰＥ-Ｉ３からの制御信号に応じて制御する。Ｇフラグがアクティブであり、それぞれのフラグの状態の書き込みを行う指令がＰＥ-Ｉ３から出力されているときに、フラグレジスター９７Ａは書き込まれる。コンディションフラグの書き込みを行う指令は、cnt_C_wr_A、cnt_N_wr_A、cnt_V_wr_A、cnt_Z_wr_A信号がアクティブであるとき、それぞれキャリーフラグ（C）、ネガティブフラグ（N）、オーバーフローフラグ（V）、ゼロフラグ（Z）の値が書き込まれる。レジスター９１（R4-R15）も同様に、ＰＥ−Ｉ３から出力される書き込みを行う指令に、Ｇフラグの値との論理積が取られる。 The control unit 199A in the SPE 102A includes an Acc control unit 92A and a flag control unit 96A.
The Acc control unit 92A has the same configuration as that of the Acc control unit 92 shown in FIG. 6, but a reference numeral “A” is added to indicate that the configuration is the SPE 102A. The Acc control unit 92A obtains a logical product of the write signal cnt_Acc_wr_A from the instruction decoder (not shown) in the PE-I3 to the accumulator and the G flag (Global_Inst_en signal) to obtain a load enable signal for the Acc93A.
The flag control unit 96A controls the writing of each condition flag value stored in the flag register 97A according to the G flag and the control signals from the PE-I 3. The flag register 97A is written when the G flag is active and the PE-I 3 outputs a command to write the state of the corresponding flag. Specifically, the values of the carry flag (C), negative flag (N), overflow flag (V), and zero flag (Z) are written when the cnt_C_wr_A, cnt_N_wr_A, cnt_V_wr_A, and cnt_Z_wr_A signals, respectively, are active. Similarly, for the register 91 (R4-R15), the write command output from the PE-I 3 is ANDed with the value of the G flag.
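The per-flag write gating performed by the flag control unit 96A can be illustrated with the following sketch. It is a simplified Python model under stated assumptions: the function name `update_flags` and the dictionary representation of the flag register are hypothetical, and the write commands stand for cnt_C_wr_A, cnt_N_wr_A, cnt_V_wr_A, and cnt_Z_wr_A.

```python
def update_flags(flags, new_flags, write_cmds, g_flag):
    """Return the flag register contents after one clock.

    flags      -- current values, e.g. {"C": 0, "N": 0, "V": 0, "Z": 0}
    new_flags  -- values produced by the ALU for this instruction
    write_cmds -- per-flag write commands as 0/1
    g_flag     -- Global_Inst_en as 0/1
    """
    updated = dict(flags)
    for name in ("C", "N", "V", "Z"):
        # A flag is written only when its command AND the G flag are both 1.
        if write_cmds[name] & g_flag:
            updated[name] = new_flags[name]
    return updated


before = {"C": 0, "N": 0, "V": 0, "Z": 0}
alu = {"C": 1, "N": 1, "V": 0, "Z": 1}
cmds = {"C": 1, "N": 0, "V": 0, "Z": 1}  # write only C and Z
held = update_flags(before, alu, cmds, g_flag=0)
written = update_flags(before, alu, cmds, g_flag=1)
```

With the G flag inactive every flag keeps its old value. With it active, only the flags selected by the write commands change, so N stays at 0 here even though the ALU produced 1.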

同様にＳＰＥ１０２ＢにおけるＳＰＥ演算処理部１９０Ｂは、前述の図３に示したＡｃｃ９３Ｂ、セレクター９４Ｂ、ＡＬＵ９５Ｂ、ＳＰＥ制御部１９９Ｂを備える。ＳＰＥ演算処理部１９０Ｂは、図３に示したフラグレジスター９７Ｂを備えていない。
ＳＰＥ演算処理部１９０ＢにおけるＳＰＥ制御部１９９Ｂは、Ａｃｃ制御部９２Ｂを備える。Ａｃｃ制御部９２Ｂは、Ａｃｃ制御部９２Ａと同じ構成であり、ＰＥ-Ｉ３に在る命令デコーダ（図示しない）からのＡｃｃ９３Ｂへの書き込み信号cnt_Acc_wr_Bと、Ｇフラグ（Global_Inst_en信号）との論理積をとってＡｃｃ９３Ｂのロードイネーブル信号としている。 Similarly, the SPE arithmetic processing unit 190B in the SPE 102B includes the Acc 93B, selector 94B, and ALU 95B shown in FIG. 3 described above, and the SPE control unit 199B. The SPE arithmetic processing unit 190B does not include the flag register 97B illustrated in FIG. 3.
The SPE control unit 199B in the SPE arithmetic processing unit 190B includes an Acc control unit 92B. The Acc control unit 92B has the same configuration as the Acc control unit 92A. It takes the logical product of the write signal cnt_Acc_wr_B for the Acc 93B, supplied from the instruction decoder (not shown) in the PE-I 3, and the G flag (Global_Inst_en signal), and uses the result as the load enable signal for the Acc 93B.

また、ＳＰＥ１０２ＡにおけるＧフラグ処理部１０について、Ｇフラグ処理部とＳＰＥ制御部との関係を示し説明する。
図１２は、Ｇフラグ処理部とＳＰＥ制御部を示すブロック図である。
この図には、Ｇフラグ処理部１０と、各ＳＰＥが備えるＳＰＥ制御部１９９Ａ〜１９９Ｄが示される。前述の図１０に示したように、Ｇフラグ処理部１０は、出力するＧフラグ信号（Global_Inst_en）をＳＰＥ制御部１９９Ａ〜１９９Ｄに入力する。
図に示されるＧフラグ処理部１０は、図５に示したＧフラグ処理部１０と同じ構成を有する。 The G flag processing unit 10 in the SPE 102A will be described with reference to the relationship between the G flag processing unit and the SPE control unit.
FIG. 12 is a block diagram illustrating the G flag processing unit and the SPE control unit.
This figure shows the G flag processing unit 10 and the SPE control units 199A to 199D included in each SPE. As shown in FIG. 10 described above, the G flag processing unit 10 inputs the G flag signal (Global_Inst_en) to be output to the SPE control units 199A to 199D.
The G flag processing unit 10 shown in the figure has the same configuration as the G flag processing unit 10 shown in FIG. 5.

以上に示した構成により、ＧフラグスタックをＳＰＥ２Ａにだけ設け、ここから出力されるGlobal_Inst_en信号を、ＳＰＥ１０２ＡだけでなくＳＰＥ１０２Ｂ、ＳＰＥ１０２Ｃ、ＳＰＥ１０２Ｄの全ての実行制御に用いる。これにより、ＳＰＥ１０２Ａで条件判断を行って、その結果をＧフラグスタックに書き込むと同時に、他の全てのＳＰＥもその条件判断の結果にしたがって命令を実行する。 With the configuration described above, the G flag stack is provided only in the SPE 2A, and the Global_Inst_en signal output from the G flag stack is used not only for the SPE 102A but also for all execution controls of the SPE 102B, SPE 102C, and SPE 102D. As a result, the SPE 102A makes a condition determination and writes the result to the G flag stack. At the same time, all other SPEs execute instructions according to the result of the condition determination.
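The effect of sharing one Global_Inst_en signal among the four SPEs can be sketched as follows. This is an illustrative Python model only. The function name `clock_accumulators` is hypothetical, and the per-SPE write commands for SPE C and SPE D are assumed by analogy with cnt_Acc_wr_A and cnt_Acc_wr_B.

```python
SPES = ("A", "B", "C", "D")


def clock_accumulators(accs, alu_outs, write_cmds, global_inst_en):
    """Return the accumulator values of SPE A-D after one clock.

    Each SPE has its own write command, but all four share the
    single G flag produced in SPE A.
    """
    return {
        spe: (alu_outs[spe] if (write_cmds[spe] & global_inst_en) else accs[spe])
        for spe in SPES
    }


accs = {"A": 0, "B": 0, "C": 0, "D": 0}
alu_outs = {"A": 1, "B": 2, "C": 3, "D": 4}
write_cmds = {"A": 1, "B": 0, "C": 1, "D": 1}

blocked = clock_accumulators(accs, alu_outs, write_cmds, global_inst_en=0)
allowed = clock_accumulators(accs, alu_outs, write_cmds, global_inst_en=1)
```

When the shared flag is 0, no SPE commits a result. When it is 1, each SPE commits according to its own write command.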

図を参照し、第１実施形態に示す並列計算装置１によって処理が高速化されることを、プログラム例を用いて示す。
図１３は、高速化処理が行えるプログラムを示す。
この図に示されるプログラムは、前述の図７と同じであるが、後の説明を分かり易くする為に処理単位ごとに「＊印」をつけ分類する。
図１４は、第１実施形態による並列演算処理のプログラムを示す。
この図に示されるプログラムは、第１実施形態に示した構成を用いてプログラムを４並列に変換した例を示す。ＳＰＥ１０２Ａで条件判断を行い、ＳＰＥ１０２Ｂ等ではその結果に応じて命令実行が制御される。
先ず「＊１」を付した命令部分を説明する。ＳＰＥ１０２Ａで条件判断する（ステップ３）までの間に、ＳＰＥ１０２Ｂによって「ADD R7」命令まで実行する。ＳＰＥ１０２Ｂは、「PSHI C,Z」命令を実行した直後のステップ４の「MV R7」命令で、演算結果をレジスターR7に書き込む。 Referring to the drawing, it will be shown by using a program example that the processing speed is increased by the parallel computing device 1 shown in the first embodiment.
FIG. 13 shows a program that can perform high-speed processing.
The program shown in this figure is the same as that shown in FIG. 7 described above. However, in order to make the following explanation easy to understand, “*” is assigned to each processing unit and classified.
FIG. 14 shows a program for parallel arithmetic processing according to the first embodiment.
The program shown in this figure shows an example in which the program is converted into four parallels using the configuration shown in the first embodiment. The SPE 102A determines the condition, and the SPE 102B or the like controls instruction execution according to the result.
First, the instruction part marked with “* 1” will be described. Until the condition is judged by the SPE 102A (step 3), the SPE 102B executes up to the “ADD R7” instruction. The SPE 102B writes the operation result in the register R7 by the “MV R7” instruction in Step 4 immediately after executing the “PSHI C, Z” instruction.

「＊２」を付した命令の部分も同様に、予めＳＰＥ１０２Ｂで「CLR」命令でＡｃｃ−Ｂを「０」にしておき、ＳＰＥ１０２Ａで「CMP R6」命令の結果をＧフラグスタックにプッシュしたステップ６の直後にレジスターR7に「０」を書き込む。
「＊３」と「＊４」を付した命令の部分は、注意が必要である。ＳＰＥ１０２ＣとＳＰＥ１０２Ｄで予めデータを用意しておいて、命令実行条件が決定した直後に用意したデータを続けて書き込みたいが、ステップ３ではＳＰＥ１０２Ａで「PSHI C,Z」命令が実行される。
ＳＰＥ１０２Ａでステップ３の「PSHI C,Z」命令が実行された後では、条件判断が行われることからＳＰＥ１０２ＣとＳＰＥ１０２Ｄで命令が実行されるかどうか不明である。そこで、ＳＰＥ１０２Ａにおいてステップ３の「PSHI C,Z」命令が実行される前に、ＳＰＥ１０２ＣとＳＰＥ１０２Ｄでは、データを準備している。このように４つのＳＰＥで並列処理することで、図１３では２１クロックかかった処理が、図１４では１１クロックと約半分で終えることができる。 Similarly, for the instructions marked “* 2”, Acc-B is set to “0” in advance by the “CLR” instruction in the SPE 102B, and “0” is written to the register R7 immediately after step 6, in which the SPE 102A pushes the result of the “CMP R6” instruction onto the G flag stack.
Care must be taken with the instructions marked “* 3” and “* 4”. It would be desirable to prepare data in advance in the SPE 102C and the SPE 102D and to write the prepared data immediately after the instruction execution condition is determined; however, in step 3 the “PSHI C, Z” instruction is executed in the SPE 102A.
After the “PSHI C, Z” instruction in step 3 is executed in the SPE 102A, a condition determination is in effect, so it is unclear whether instructions will be executed in the SPE 102C and the SPE 102D. Therefore, the SPE 102C and the SPE 102D prepare their data before the “PSHI C, Z” instruction in step 3 is executed in the SPE 102A. By performing parallel processing with four SPEs in this way, processing that took 21 clocks in FIG. 13 can be completed in 11 clocks in FIG. 14, about half.
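The scheduling idea used above, preparing a value before the condition is known and committing it only once the G flag has been decided, can be sketched as follows. This is a loose Python illustration and not a transcription of FIG. 14. The function name `run_star1`, the `operand` parameter, and the addition used as the prepared work are hypothetical placeholders.

```python
def run_star1(r7, operand, condition_true):
    """Model one SPE preparing a result early and committing it under the G flag."""
    # Before the condition is known: work is done in the accumulator only.
    acc_b = r7 + operand  # placeholder for the work up to "ADD R7"

    # Step 3: SPE A pushes the condition result onto the G flag stack.
    g_flag = 1 if condition_true else 0

    # Step 4: "MV R7" commits the accumulator to R7 only when the G flag is 1.
    if g_flag:
        r7 = acc_b
    return r7


committed = run_star1(r7=10, operand=3, condition_true=True)
discarded = run_star1(r7=10, operand=3, condition_true=False)
```

Because the preparation touches only the accumulator, discarding it when the condition is false leaves the register unchanged.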

なお、図１４においてＳＰＥ１０２Ａ〜１０２Ｄで有効に使われていない部分を空白又は網掛けで示す。空白部分には任意の命令を配置することができるが、網掛け部分にはＳＰＥ１０２Ａの状態に同期して実行する命令（図１３のコード中には対象無し）か、図示する「NOP」命令を配置することができる。 In FIG. 14, portions that are not effectively used in the SPEs 102A to 102D are indicated by blanks or shades. Arbitrary instructions can be placed in the blank part, but the shaded part can be executed in synchronization with the state of the SPE 102A (there is no target in the code of FIG. 13) or the “NOP” instruction shown in the figure Can be arranged.

本実施形態によると、複数の演算プロセッサーＰＥ１０２のそれぞれが、サブプロセッサーＳＰＥ１０２ＡとサブプロセッサーＳＰＥ１０２Ｂ〜１０２Ｄによって形成される。
サブプロセッサーＳＰＥ１０２Ａにおいて、ＡＬＵ９５Ａは、入力されたデータを前記制御命令に基づいて演算処理する。Ｇフラグスタック１１は、スタック構造を有する記憶部であり、演算処理された結果に基づいたフラグ情報が順次蓄積される。合成部１９は、Ｇフラグスタック１１に蓄積されたフラグ情報を合成する。ＳＰＥ制御部１９９Ａは、合成部１９が合成した合成フラグ情報に基づいてＡｃｃ９３Ａやフラグレジスター９７Ａ及びレジスター９１への書き込みを制御する。
サブプロセッサーＳＰＥ１０２Ｂ〜１０２Ｄにおいて、ＡＬＵ９５Ｂ〜９５Ｄは、入力されたデータを制御命令に基づいて演算処理する。ＳＰＥ制御部１９９Ｂ〜１９９Ｄは、合成部１９が合成した合成フラグ情報に基づいてＡｃｃ９３Ｂ及びレジスター９１への書き込みを制御する。
これにより、SIMD型にVLIW型を組み合わせた並列計算装置において、多重にネストした構造化プログラムをサポートするハードウェアを容易に実現できる。 According to the present embodiment, each of the plurality of arithmetic processors PE102 is formed by the sub processor SPE 102A and the sub processors SPE 102B to 102D.
In the sub processor SPE102A, the ALU 95A performs arithmetic processing on the input data based on the control command. The G flag stack 11 is a storage unit having a stack structure, and flag information based on the result of arithmetic processing is sequentially accumulated. The combining unit 19 combines the flag information accumulated in the G flag stack 11. The SPE control unit 199A controls writing to the Acc 93A, the flag register 97A, and the register 91 based on the synthesis flag information synthesized by the synthesis unit 19.
In the sub processors SPE102B to 102D, the ALUs 95B to 95D perform arithmetic processing on the input data based on the control command. The SPE control units 199B to 199D control writing to the Acc 93B and the register 91 based on the synthesis flag information synthesized by the synthesis unit 19.
This makes it easy to implement hardware that supports multiply nested structured programs in a parallel computing device that combines the SIMD type with the VLIW type.

（第２実施形態）
次に本発明の第２実施形態について説明する。
図１５は、並列計算装置１の演算処理部の概略構成を示すブロック図である。
ここでは、演算制御処理の説明に必要な主たる構成を示す。演算処理については、図に示される演算処理部２００は、並列して演算処理を行う複数の演算プロセッサー（ＰＥ）２０２と、複数のＰＥ２０２にＳＰＥ制御信号線を介して制御命令を供給する制御信号生成部（ＰＥ−Ｉ）３と、ＰＥ−Ｉ３の制御を受けて各ＳＰＥを同期させる実行制御部２０Ｂ〜２０Ｄを備える。 (Second Embodiment)
Next, a second embodiment of the present invention will be described.
FIG. 15 is a block diagram illustrating a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Here, a main configuration necessary for explaining the arithmetic control process is shown. As for the arithmetic processing, the arithmetic processing unit 200 shown in the figure includes a plurality of arithmetic processors (PE) 202 that perform arithmetic processing in parallel, and a control signal that supplies a control command to the plurality of PEs 202 via an SPE control signal line. A generation unit (PE-I) 3 and execution control units 20B to 20D that synchronize the SPEs under the control of the PE-I3 are provided.

ＰＥ２０２のそれぞれが、サブプロセッサー（ＳＰＥ）２０２Ａと、サブプロセッサー（ＳＰＥ）２０２Ｂ〜２０２Ｄを備える。
ＳＰＥ２０２Ａは、Ｇフラグ処理部１０とＳＰＥ制御部２９９Ａを備える。
ＳＰＥ２０２Ｂ〜２０２Ｄは、それぞれＳＰＥ制御部２９９Ｂ〜２９９Ｄと実行選択部２４Ｂ〜２４Ｄを備える。
Ｇフラグ処理部１０は、ＳＰＥ制御部２９９Ａ〜２９９ＤにＧフラグ信号を供給する。
実行選択部２４Ｂ〜２４Ｄは、供給されたＧフラグ信号と実行許可信号に基づいて、ＳＰＥ制御部２９９Ｂ〜２９９Ｄにそれぞれ実行許可信号を出力する。
ＳＰＥ制御部２９９Ａ〜２９９Ｄは、供給された実行許可信号に基づいて、それぞれのＳＰＥの演算制御を行う。 Each of the PEs 202 includes a sub processor (SPE) 202A and sub processors (SPE) 202B to 202D.
The SPE 202A includes a G flag processing unit 10 and an SPE control unit 299A.
The SPEs 202B to 202D include SPE control units 299B to 299D and execution selection units 24B to 24D, respectively.
The G flag processing unit 10 supplies a G flag signal to the SPE control units 299A to 299D.
The execution selection units 24B to 24D output execution permission signals to the SPE control units 299B to 299D, respectively, based on the supplied G flag signal and execution permission signal.
The SPE control units 299A to 299D perform calculation control of each SPE based on the supplied execution permission signal.

図１６は、第２実施形態の構成において追加する命令を示す。
前述の図１５に示すように、Ｇフラグ処理部１０は、ＳＰＥ２０２Ａにだけ設ける。ＳＰＥ２０２Ａ以外のＳＰＥは、ＳＰＥ２０２Ａが出力するＧフラグの値に応じて命令実行が制御されるか、又はＳＰＥ２０２ＡのＧフラグに影響されず常に命令を実行するかを選択できるようにする。
この選択を行うために、この図に示される命令を追加する。これらの命令はＳＰＥ２０２Ｂ、ＳＰＥ２０２Ｃ、ＳＰＥ２０２Ｄで常に実行可能である。
例えば、ＳＰＥ２０２Ｂにおいて「SYNC」命令を実行すると、ＳＰＥ２０２Ｂは、それ以降はＳＰＥ２０２ＡのＧフラグの値を命令実行制御に使うようになり、「ASYNC」命令を実行すると、それ以降はＳＰＥ２０２ＡのＧフラグの値とは無関係に命令を実行するようになる。 FIG. 16 shows instructions to be added in the configuration of the second embodiment.
As shown in FIG. 15 described above, the G flag processing unit 10 is provided only in the SPE 202A. The SPEs other than the SPE 202A can select whether instruction execution is controlled according to the value of the G flag output from the SPE 202A, or whether the instruction is always executed regardless of the G flag of the SPE 202A.
To make this selection, the instructions shown in this figure are added. These instructions can always be executed by the SPE 202B, SPE 202C, and SPE 202D.
For example, when the “SYNC” instruction is executed in the SPE 202B, the SPE 202B thereafter uses the value of the G flag of the SPE 202A for instruction execution control. When the “ASYNC” instruction is executed, the SPE 202B thereafter executes instructions regardless of the value of the G flag of the SPE 202A.

また、ＳＰＥ１０２ＡにおけるＧフラグ処理部１０を参照し、Ｇフラグ処理部とＳＰＥ制御部の接続を示しつつ説明する。
図１７は、Ｇフラグ処理部とＳＰＥ制御部を示すブロック図である。
この図には、Ｇフラグ処理部１０と、各ＳＰＥが備えるＳＰＥ制御部２９９Ａ〜２９９Ｄ、実行選択部２４Ｂ〜２４Ｄ及び実行制御部２０Ｂ〜２０Ｄが示される。前述の図１０、１２、１５に示した構成と同じ構成には、同じ符号を附す。
Ｇフラグ処理部１０は、前述の図５に示した構成と同じであり、出力する信号をGlobal_Inst_en_Aとする。Global_Inst_en_Aは、ＳＰＥ２０２Ａの信号であることを明示する以外は、図５のGlobal_Inst_en信号と同じである。 Further, the G flag processing unit 10 in the SPE 102A will be described with reference to the connection between the G flag processing unit and the SPE control unit.
FIG. 17 is a block diagram illustrating the G flag processing unit and the SPE control unit.
This figure shows the G flag processing unit 10, as well as the SPE control units 299A to 299D, execution selection units 24B to 24D, and execution control units 20B to 20D included in the SPEs. The same components as those shown in FIGS. 10, 12, and 15 described above are denoted by the same reference numerals.
The G flag processing unit 10 has the same configuration as that shown in FIG. 5 described above, and its output signal is denoted Global_Inst_en_A. Global_Inst_en_A is the same as the Global_Inst_en signal in FIG. 5 except that its name makes explicit that it is a signal of the SPE 202A.

実行制御部２０Ｂは、図示しないＰＥ-Ｉ３からの制御信号によりＳＰＥ２０２Ｂの実行を制御する制御信号（en0_B）を出力する。
実行制御部２０Ｂからの出力en0_Bは、実行選択部２４Ｂに入り、ＳＰＥ制御部２９９Ｂを制御する制御信号Global_Inst_en_B信号を生成する。Global_Inst_en_B信号は、ＳＰＥ２０２Ｂにおいて、命令実行制御に使われる。並列計算装置１を初期化するsystem_reset信号がアクティブ(「１」)になると、フリップフロップ２１がセットされてen0_B信号が「１」になる。これにより、実行選択部２４ＢがＳＰＥ制御部２９９Ｂに入力するGlobal_Inst_en_B信号が常に「１」になるので、ＳＰＥ２０２Ｂでは、常に命令が実行される。
また、「SYNC」命令が発行されるとcnt_SYNC_B信号がアクティブになるが、cnt_ASYNC_B信号はノンアクティブ（「０」）のままなので、フリップフロップ２１の出力en0_Bが「０」になる。したがって、Global_Inst_en_Aの状態に応じてGlobal_Inst_en_ Bの状態が定まる。つまり、ＳＰＥ２０２ＢはＳＰＥ２０２ＡのＧフラグに応じて、その命令実行が制御される。 The execution control unit 20B outputs a control signal (en0_B) for controlling the execution of the SPE 202B by a control signal from the PE-I 3 (not shown).
The output en0_B from the execution control unit 20B enters the execution selection unit 24B and generates a control signal Global_Inst_en_B signal for controlling the SPE control unit 299B. The Global_Inst_en_B signal is used for instruction execution control in the SPE 202B. When the system_reset signal that initializes the parallel computing device 1 becomes active (“1”), the flip-flop 21 is set and the en0_B signal becomes “1”. As a result, the Global_Inst_en_B signal input to the SPE control unit 299B by the execution selection unit 24B is always “1”, so that the instruction is always executed in the SPE 202B.
When the “SYNC” instruction is issued, the cnt_SYNC_B signal becomes active, but the cnt_ASYNC_B signal remains inactive (“0”), so the output en0_B of the flip-flop 21 becomes “0”. Accordingly, the state of Global_Inst_en_B is determined according to the state of Global_Inst_en_A. That is, the instruction execution of the SPE 202B is controlled according to the G flag of the SPE 202A.

「ASYNC」命令が発行されるとcnt_ASYNC_B信号がアクティブになり、フリップフロップ２１の出力en0_Bは「１」になる。なお、フリップフロップ２１は、図示しない並列計算装置１内部の基本クロックの立ち上がりで変化する。
実行制御部２０Ｃ及び２０Ｄは、実行制御部２０Ｂと同じ構成であり、入力される信号がそれぞれＳＰＥ２０２Ｃ及びＳＰＥ２０２Ｄの制御信号である点が異なる。
実行選択部２４Ｃ及び２４Ｄは、それぞれ実行制御部２０Ｃ及び２０Ｄからの制御信号en0_C及びen0_Dによって制御され、出力に接続されるＳＰＥ制御部２９９Ｃ及び２９９Ｄを介してＳＰＥ２０２Ｃ及びＳＰＥ２０２Ｄの制御を行う。 When the “ASYNC” instruction is issued, the cnt_ASYNC_B signal becomes active, and the output en0_B of the flip-flop 21 becomes “1”. Note that the flip-flop 21 changes at the rising edge of the basic clock in the parallel computing device 1 (not shown).
The execution control units 20C and 20D have the same configuration as the execution control unit 20B, and are different in that input signals are control signals for the SPE 202C and SPE 202D, respectively.
The execution selection units 24C and 24D are controlled by control signals en0_C and en0_D from the execution control units 20C and 20D, respectively, and control the SPE 202C and SPE 202D via the SPE control units 299C and 299D connected to the outputs.
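The behavior of the execution control unit and execution selection unit described above can be sketched as follows. This is an illustrative Python model under stated assumptions. The class name `ExecutionControl` and function name `execution_select` are hypothetical. The OR gate is an assumption chosen because it matches the described behavior (en0 = 1 forces execution, en0 = 0 passes Global_Inst_en_A through). The priority of reset and ASYNC over SYNC is also an assumption.

```python
class ExecutionControl:
    """Flip-flop holding en0 for one of SPE B, C, or D."""

    def __init__(self):
        self.en0 = 1  # state after system_reset: always execute

    def rising_edge(self, system_reset=0, cnt_sync=0, cnt_async=0):
        if system_reset or cnt_async:
            self.en0 = 1  # ASYNC: ignore SPE A's G flag
        elif cnt_sync:
            self.en0 = 0  # SYNC: follow SPE A's G flag
        return self.en0


def execution_select(en0, global_inst_en_a):
    # Assumed OR gate producing Global_Inst_en_B.
    return en0 | global_inst_en_a


ctrl = ExecutionControl()
after_reset = execution_select(ctrl.en0, global_inst_en_a=0)

ctrl.rising_edge(cnt_sync=1)
synced = [execution_select(ctrl.en0, g) for g in (0, 1)]

ctrl.rising_edge(cnt_async=1)
after_async = execution_select(ctrl.en0, global_inst_en_a=0)
```

After reset and after ASYNC the SPE executes even when SPE A's G flag is 0. After SYNC its enable signal tracks the G flag exactly.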

以上に示した構成により、ＧフラグスタックをＳＰＥ２０２Ａにだけ設け、ＳＰＥ２０２Ａ以外のＳＰＥは、ＳＰＥ２０２ＡのＧフラグに応じて命令実行が制御されるか、又はＳＰＥ２０２ＡのＧフラグに影響されず常に命令を実行するかを選択できる。 With the configuration described above, the G flag stack is provided only in the SPE 202A, and the SPEs other than the SPE 202A are controlled to execute instructions according to the G flag of the SPE 202A, or always execute instructions regardless of the G flag of the SPE 202A. You can choose what to do.

第２の実施形態に示すように並列計算装置１によって処理が高速化されることを、プログラム例を用いて示す。
図１８は、本実施形態に示した並列計算装置１において、図１３のプログラムを実行するために、４並列処理を行うVLIW型用に変換した例を示す。
ＳＰＥ２０２Ａにおいて、条件判断等を行い、ＳＰＥ２０２Ｂ等ではその結果に同期して命令を実行する。
先ず、「＊１」を付した命令の部分であるが、ＳＰＥ２０２Ａで条件判断する（ステップ３）までの間にＳＰＥ２０２Ｂで「ADD R7」命令まで実行する。ＳＰＥ２０２Ａで「PSHI C,Z」命令を実行した直後に「MV R7」命令で結果をレジスターR7に書き込む。「＊２」を付した命令の部分も同様であり、ステップ６において、Ａｃｃ−Ｂに予め「０」を用意しておくことで、ＳＰＥ２０２Ａで「CMP R6」命令の結果をＧフラグスタックにプッシュした直後にレジスターR7を「０」を書き込む。同時に「＊３」を付した命令で示すようにＳＰＥ２０２Ｃでも予めデータを用意しておき、直ぐにレジスターR9への書き込みを行える。 As shown in the second embodiment, it is shown by using a program example that the processing speed is increased by the parallel computing device 1.
FIG. 18 shows an example in which the program of FIG. 13 is converted into a four-way parallel VLIW form for execution on the parallel computing device 1 shown in the present embodiment.
The SPE 202A performs condition determination and the like, and the SPE 202B and the like execute an instruction in synchronization with the result.
First, for the instructions marked “* 1”, the SPE 202B executes up to the “ADD R7” instruction before the SPE 202A judges the condition (step 3). Immediately after the SPE 202A executes the “PSHI C, Z” instruction, the “MV R7” instruction writes the result to the register R7. The same applies to the instructions marked “* 2”: with “0” prepared in advance in Acc-B, “0” is written to the register R7 immediately after the SPE 202A pushes the result of the “CMP R6” instruction onto the G flag stack in step 6. At the same time, as indicated by the instructions marked “* 3”, the SPE 202C also prepares data in advance so that it can write to the register R9 immediately.

ＳＰＥ２０２Ｄによって実行される「＊４」を付した命令部分については注意が必要である。これらの命令はＳＰＥ２０２Ａにおける「GINV」命令（ステップ７）の後で、実行するかしないかが決定される。つまり、ＳＰＥ２０２Ｃのように事前にデータを用意することができない。そこで、ＳＰＥ２０２Ｄは、ＳＰＥ２０２ＡのＧフラグの値とは無関係に「MVA R10」命令と「INC」命令を実行しておき、ＳＰＥ２０２Ａでの「GINV」命令と同時に「SYNC」命令を実行することで、「MV R10」命令の実行制御をＳＰＥ２０２Ａと同期させている。 Care should be taken with respect to the instruction part with “* 4” executed by the SPE 202D. These instructions are determined to be executed or not after the “GINV” instruction (step 7) in the SPE 202A. That is, data cannot be prepared in advance like the SPE 202C. Therefore, the SPE 202D executes the “MVA R10” instruction and the “INC” instruction regardless of the value of the G flag of the SPE 202A, and executes the “SYNC” instruction simultaneously with the “GINV” instruction in the SPE 202A. The execution control of the “MV R10” instruction is synchronized with the SPE 202A.

このように４つのＳＰＥで並列処理することで、図１３では２１クロックかかった処理が、本実施形態では１２クロックで行える。
なお、図１８においてＳＰＥ２０２Ｂ〜２０２Ｄで使われていない部分を空白又は網掛けで示した。空白部分には任意の命令をおくことができ、ＳＰＥ２０２Ａと同期する必要が無い命令を並列実行できる。一方、網掛けの部分には、ＳＰＥ２０２Ａと同期した命令か、「NOP」命令が配置できる。本実施形態では、第１実施形態に比べて１クロック余計に掛かっているが、空白部分に他の命令を配置することができる。したがって、実施形態１よりも実行効率を高めることができるため、演算処理全体では短時間で処理を終了することが可能となる。 By performing parallel processing with four SPEs in this way, processing that took 21 clocks in FIG. 13 can be performed with 12 clocks in this embodiment.
In FIG. 18, portions not used in the SPEs 202B to 202D are indicated by blanks or shaded areas. Arbitrary instructions can be placed in the blank portion, and instructions that do not need to be synchronized with the SPE 202A can be executed in parallel. On the other hand, an instruction synchronized with the SPE 202A or a “NOP” instruction can be arranged in the shaded portion. In this embodiment, one extra clock is required as compared with the first embodiment, but other instructions can be arranged in the blank portion. Therefore, since the execution efficiency can be improved as compared with the first embodiment, the processing can be completed in a short time in the entire arithmetic processing.
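The clock counts quoted for FIGS. 13, 14, and 18 can be compared directly. The short Python calculation below uses only the figures stated in the text; the constant names are illustrative.

```python
SEQUENTIAL_CLOCKS = 21         # FIG. 13, single instruction stream
FIRST_EMBODIMENT_CLOCKS = 11   # FIG. 14
SECOND_EMBODIMENT_CLOCKS = 12  # FIG. 18

speedup_first = SEQUENTIAL_CLOCKS / FIRST_EMBODIMENT_CLOCKS
speedup_second = SEQUENTIAL_CLOCKS / SECOND_EMBODIMENT_CLOCKS

print(f"first embodiment:  {speedup_first:.2f}x")   # 1.91x
print(f"second embodiment: {speedup_second:.2f}x")  # 1.75x
```

For this program alone the second embodiment is one clock slower. The advantage claimed for it comes from the blank slots, which can hold unrelated instructions that do not need to be synchronized with the SPE 202A.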

（第３実施形態）
次に本発明の第３実施形態について説明する。
図１９は、並列計算装置１の演算処理部の概略構成を示すブロック図である。
ここでは、演算制御処理の説明に必要な主たる構成を示す。演算処理については、図に示される演算処理部３００は、並列して演算処理を行う複数の演算プロセッサー（ＰＥ）３０２と、複数のＰＥ３０２にＳＰＥ制御信号線を介して制御命令を供給する制御信号生成部（ＰＥ−Ｉ）３と、ＰＥ−Ｉ３の制御を受けて各ＳＰＥを同期させる実行制御部３０Ｂ〜３０Ｄを備える。 (Third embodiment)
Next, a third embodiment of the present invention will be described.
FIG. 19 is a block diagram illustrating a schematic configuration of an arithmetic processing unit of the parallel computing device 1.
Here, the main configuration necessary for explaining the arithmetic control process is shown. The arithmetic processing unit 300 shown in the figure includes a plurality of arithmetic processors (PE) 302 that perform arithmetic processing in parallel, a control signal generation unit (PE-I) 3 that supplies control commands to the plurality of PEs 302 via SPE control signal lines, and execution control units 30B to 30D that synchronize the SPEs under the control of the PE-I 3.

ＰＥ３０２のそれぞれが、サブプロセッサー（ＳＰＥ）３０２Ａと、サブプロセッサー（ＳＰＥ）３０２Ｂ〜３０２Ｄを備える。
ＳＰＥ３０２Ａは、Ｇフラグ処理部１０ＡとＳＰＥ制御部３９９Ａを備える。
ＳＰＥ３０２Ｂは、Ｇフラグ処理部１０ＢとＳＰＥ制御部３９９Ｂと実行選択部３４Ｂを備える。ＳＰＥ３０２Ｂは、前述の図１１に示したＳＰＥ１０２Ａに相当する構成に加え、実行選択部３４Ｂを備える。Ｇフラグ処理部１０ＢとＳＰＥ制御部３９９Ｂは、それぞれＧフラグ処理部１０とＳＰＥ制御部１９９Ａに相当し、入出力信号が、ＳＰＥ３０２Ｂとしての接続に代わる。また、Ｇフラグ処理部１０ＢとＳＰＥ制御部３９９Ｂは、実行選択部３４Ｂを介して接続する。
また、ＳＰＥ３０２Ｃは、Ｇフラグ処理部１０ＣとＳＰＥ制御部３９９Ｃと実行選択部３４Ｃを備える。ＳＰＥ３０２Ｄは、Ｇフラグ処理部１０ＤとＳＰＥ制御部３９９Ｄと実行選択部３４Ｄを備える。ＳＰＥ３０２ＣとＳＰＥ３０２Ｄは、ＳＰＥ３０２Ｂと同様の構成を有する。 Each of the PEs 302 includes a sub processor (SPE) 302A and sub processors (SPE) 302B to 302D.
The SPE 302A includes a G flag processing unit 10A and an SPE control unit 399A.
The SPE 302B includes a G flag processing unit 10B, an SPE control unit 399B, and an execution selection unit 34B. That is, the SPE 302B includes the execution selection unit 34B in addition to a configuration corresponding to the SPE 102A shown in FIG. 11 described above. The G flag processing unit 10B and the SPE control unit 399B correspond to the G flag processing unit 10 and the SPE control unit 199A, respectively, with their input and output signals connected as those of the SPE 302B. The G flag processing unit 10B and the SPE control unit 399B are connected via the execution selection unit 34B.
The SPE 302C includes a G flag processing unit 10C, an SPE control unit 399C, and an execution selection unit 34C. The SPE 302D includes a G flag processing unit 10D, an SPE control unit 399D, and an execution selection unit 34D. SPE302C and SPE302D have the same configuration as SPE302B.

Ｇフラグ処理部１０Ａは、ＳＰＥ制御部３９９Ａと、実行選択部３４Ｂ〜３４Ｄを介してＳＰＥ制御部３９９Ｂ〜３９９ＤにＧフラグ信号を供給する。Ｇフラグ処理部１０Ｂは、実行選択部３４Ｂを介してＳＰＥ制御部３９９ＢにＧｂフラグ信号を供給する。Ｇフラグ処理部１０Ｃは、実行選択部３４Ｃを介してＳＰＥ制御部３９９ＣにＧｃフラグ信号を供給する。Ｇフラグ処理部１０Ｄは、実行選択部３４Ｄを介してＳＰＥ制御部３９９ＤにＧｄフラグ信号を供給する。
実行選択部３４Ｂは、供給されたＧフラグ信号とＧｂフラグ信号のいずれかを、ＳＰＥ制御部３９９Ｂの実行許可信号として出力する。実行選択部３４Ｃは、供給されたＧフラグ信号とＧｃフラグ信号のいずれかを、ＳＰＥ制御部３９９Ｃの実行許可信号として出力する。実行選択部３４Ｄは、供給されたＧフラグ信号とＧｄフラグ信号のいずれかを、ＳＰＥ制御部３９９Ｄの実行許可信号として出力する。
ＳＰＥ制御部３９９Ａ〜３９９Ｄは、供給された実行許可信号に基づいて、それぞれのＳＰＥの演算制御を行う。 The G flag processing unit 10A supplies the G flag signal to the SPE control unit 399A and, via the execution selection units 34B to 34D, to the SPE control units 399B to 399D. The G flag processing unit 10B supplies a Gb flag signal to the SPE control unit 399B via the execution selection unit 34B. The G flag processing unit 10C supplies a Gc flag signal to the SPE control unit 399C via the execution selection unit 34C. The G flag processing unit 10D supplies a Gd flag signal to the SPE control unit 399D via the execution selection unit 34D.
The execution selection unit 34B outputs either the supplied G flag signal or Gb flag signal as an execution permission signal of the SPE control unit 399B. The execution selection unit 34C outputs either the supplied G flag signal or Gc flag signal as an execution permission signal of the SPE control unit 399C. The execution selection unit 34D outputs either the supplied G flag signal or Gd flag signal as an execution permission signal for the SPE control unit 399D.
The SPE control units 399A to 399D perform arithmetic control of each SPE based on the supplied execution permission signal.

本実施形態では、図５に示すＧフラグスタックをＳＰＥ３０２Ａ、ＳＰＥ３０２Ｂ、ＳＰＥ３０２Ｃ及びＳＰＥ３０２Ｄにそれぞれ設けた構成である。各ＳＰＥは通常はそれぞれのＧ、Ｇｂ、Ｇｃ、Ｇｄフラグで個別に命令実行が制御されるが、必要に応じて特定のＳＰＥのＧフラグに応じて命令実行が制御されるようにする。ここでは例として特定のＳＰＥをＳＰＥ３０２Ａとする。 In this embodiment, the G flag stack shown in FIG. 5 is provided in each of the SPE 302A, SPE 302B, SPE 302C, and SPE 302D. Each SPE is normally individually controlled by the respective G, Gb, Gc, and Gd flags, but the instruction execution is controlled according to the G flag of a specific SPE as necessary. Here, a specific SPE is assumed to be SPE 302A as an example.
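The arrangement of the third embodiment, in which every SPE has its own flag but SPE B, C, and D may instead follow the G flag of the SPE 302A, can be sketched as follows. This is an illustrative Python model. The function name `execution_permissions` and the set-based representation of which SPEs are following SPE A are assumptions.

```python
def execution_permissions(own_flags, synced_to_a):
    """Return the execution permission signal seen by each SPE control unit.

    own_flags   -- {"A": G, "B": Gb, "C": Gc, "D": Gd}, each 0 or 1
    synced_to_a -- set of SPE names (from "B", "C", "D") following SPE A
    """
    g = own_flags["A"]
    return {
        spe: (g if spe in synced_to_a else flag)
        for spe, flag in own_flags.items()
    }


flags = {"A": 0, "B": 1, "C": 1, "D": 0}
independent = execution_permissions(flags, synced_to_a=set())
c_follows_a = execution_permissions(flags, synced_to_a={"C"})
```

In the second call SPE C is suppressed because SPE A's flag is 0, even though its own Gc flag is 1, while SPE B continues under its own Gb flag.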

実行選択部の制御を行うために、この図に示される命令を追加する。
図２０は、第３実施形態の構成において追加する命令を示す。
これらの命令はＳＰＥ３０２Ｂ、ＳＰＥ３０２Ｃ、ＳＰＥ３０２Ｄで実行可能である。
例えば、ＳＰＥ３０２Ｂにおいて「SYNC」命令を実行すると、ＳＰＥ３０２Ｂは、それ以降はＳＰＥ３０２ＡのＧフラグを命令実行制御に使うようになり、「ASYNC」命令を実行すると、それ以降はＳＰＥ３０２ＡのＧフラグから切り替えて、ＳＰＥ３０２Ｂが有するＧフラグ（Ｇｂ）の値を命令実行制御に使うようになる。ＳＰＥ３０２Ｃ、ＳＰＥ３０２Ｄについても、同様である。 In order to control the execution selection unit, an instruction shown in this figure is added.
FIG. 20 shows instructions to be added in the configuration of the third embodiment.
These instructions can be executed by the SPE 302B, SPE 302C, and SPE 302D.
For example, when the “SYNC” instruction is executed in the SPE 302B, the SPE 302B uses the G flag of the SPE 302A for instruction execution control thereafter. When the “ASYNC” instruction is executed, the SPE 302B switches from the G flag of the SPE 302A thereafter. The value of the G flag (Gb) of the SPE 302B is used for instruction execution control. The same applies to SPE302C and SPE302D.

The connection between the G flag processing units and the SPE control units is described next, with reference to the G flag processing unit 10 of the SPE 102A.
FIG. 21 is a block diagram illustrating the G flag processing units and the SPE control units.
This figure shows the G flag processing units 10A to 10D, the SPE control units 399A to 399D included in the respective SPEs, the execution selection units 34B to 34D, and the execution control units 30B to 30D. Components that are the same as those shown in FIGS. 10 and 12 are denoted by the same reference numerals.
The G flag processing units 10A to 10D have the same configuration as that shown in FIG. 5, and their output signals are Global_Inst_en_A (denoted “G”) to Global_Inst_en_D (denoted “Gd”). Global_Inst_en_A is the same as the Global_Inst_en signal in FIG. 5, except that the suffix makes explicit that it is the signal of the SPE 302A. The same applies to Global_Inst_en_B (Gb) to Global_Inst_en_D (Gd).

The execution control unit 30B outputs a control signal (sel_B) that controls execution in the SPE 302B, in response to a control signal from the PE-I 3.
The execution control unit 30B generates the sel_B signal, which controls the execution selection unit 34B. When the system_reset signal that initializes the parallel computing device 1 becomes active (“1”), the flip-flop 31B is set and the sel_B signal becomes “1”. The selector 34B therefore selects the Global_Inst_en_B signal, which becomes the Global_Inst_en_act_B (Gba) signal that controls the instructions of the SPE 302B. That is, immediately after a reset, instruction execution is controlled by the Gb flag of the SPE 302B.

When the “SYNC” instruction is issued, the cnt_SYNC_B signal becomes active; because the cnt_ASYNC_B signal remains inactive at that time, the sel_B signal becomes “0”. The selector 34B therefore selects the Global_Inst_en_A signal, which becomes the Global_Inst_en_act_B signal. That is, instruction execution in the SPE 302B is controlled according to the G flag of the SPE 302A.
When the “ASYNC” instruction is issued, the cnt_ASYNC_B signal becomes active and “1” is written to the flip-flop 31B. The SPE 302B therefore uses the value of the Gb flag for execution control. The flip-flop 31B changes state on the rising edge of the basic clock (not shown) inside the parallel computing device 1.
The execution control circuits 30C and 30D (synchronization circuits) of the SPE 302C and SPE 302D are the same as the execution control circuit 30B, with the following differences: the signals to the B input of the selector (not shown) are Global_Inst_en_C and Global_Inst_en_D, respectively, instead of Global_Inst_en_B; the signals to the S input of the selector are sel_C and sel_D, respectively, instead of sel_B; and the selector outputs are Global_Inst_en_act_C (Gca) and Global_Inst_en_act_D (Gda).
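The behavior of the flip-flop 31B and the selector 34B described above can be sketched as a small behavioral model. This is an illustrative sketch only, not the circuit of FIG. 21: the class name `ExecutionControl` and its methods are hypothetical, and giving “ASYNC” priority over “SYNC” when both are active is an assumption, since the text only covers the case where cnt_ASYNC_B is inactive.

```python
class ExecutionControl:
    """Behavioral model of execution control unit 30B plus selector 34B.

    Hypothetical names; the priority of ASYNC over SYNC is an assumption.
    """

    def __init__(self):
        # system_reset sets flip-flop 31B, so sel_B starts at 1.
        self.sel = 1

    def clock(self, cnt_sync=False, cnt_async=False, system_reset=False):
        """Update the flip-flop on one rising clock edge."""
        if system_reset or cnt_async:
            self.sel = 1  # use the SPE's own flag (Gb)
        elif cnt_sync:
            self.sel = 0  # follow the G flag of SPE 302A

    def select(self, g_a, g_own):
        """Selector: sel = 1 picks the own flag, sel = 0 picks SPE 302A's flag."""
        return g_own if self.sel else g_a


ctrl_b = ExecutionControl()
after_reset = ctrl_b.select(g_a=0, g_own=1)  # own flag Gb is used
ctrl_b.clock(cnt_sync=True)
after_sync = ctrl_b.select(g_a=0, g_own=1)   # G flag of SPE 302A is used
ctrl_b.clock(cnt_async=True)
after_async = ctrl_b.select(g_a=0, g_own=1)  # own flag Gb is used again
print(after_reset, after_sync, after_async)
```

In this sketch the selected value plays the role of Global_Inst_en_act_B; the SPE 302C and SPE 302D would each use their own instance with their own flag as `g_own`.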

A program example is used to show that the parallel computing device 1 of the third embodiment speeds up processing.
FIG. 22 shows an example in which the program of FIG. 13 is converted for a 4-way parallel VLIW configuration so that it can be executed on the parallel computing device 1.
The SPE 302A performs the condition judgments and the like, and the SPE 302B and the other SPEs execute instructions in synchronization with the results.
First, for the instructions marked “*1”, the SPE 302B executes up to the “ADD R7” instruction before the SPE 302A makes its condition judgment (step 3). Immediately after the SPE 302A executes the “PSHI C,Z” instruction, the “MV R7” instruction writes the result to register R7. The instructions marked “*2” work the same way: because 0 is prepared in Acc-B in advance, register R7 can be cleared immediately after the SPE 302A pushes the result of the “CMP R6” instruction onto the G flag stack 11A. At the same time, as the instructions marked “*3” show, the SPE 302C also prepares its data in advance and can write to register R9 right away.
The instructions marked “*4”, which are executed by the SPE 302D, need care. Whether these instructions are executed is decided only after the “GINV” instruction in the SPE 302A, so the data cannot be prepared in advance as it is for the SPE 302C. Therefore, the “MVA R10” and “INC” instructions are executed regardless of the G flag of the SPE 302A, and the “SYNC” instruction is executed at the same time as the “GINV” instruction in the SPE 302A, so that only the “MV R10” instruction is synchronized with the SPE 302A. With the four SPEs processing in parallel in this way, the processing that took 21 clocks in FIG. 13 finishes in 12 clocks.

In FIG. 22, the slots not used by the SPE 302B, SPE 302C, and SPE 302D are shown blank or shaded. Any instruction can be placed in a blank slot, so instructions that do not need to be synchronized with the SPE 302A can be executed in parallel. A shaded slot, on the other hand, can hold either an instruction synchronized with the SPE 302A or a “NOP” instruction.
In this example the result is the same as in the second embodiment described above. However, because each SPE can independently execute a nested program, instruction execution is more flexible, and the arithmetic processing as a whole can be completed in a shorter time than in the second embodiment.

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described.
FIG. 23 is a block diagram illustrating a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Only the main components needed to explain the arithmetic control processing are shown. The arithmetic processing unit 400 shown in the figure includes a plurality of arithmetic processors (PEs) 402 that perform arithmetic processing in parallel, a control signal generation unit (PE-I) 3 that supplies control instructions to the PEs 402 via SPE control signal lines, and execution control units 40A to 40D that synchronize the SPEs under the control of the PE-I 3.

Each of the PEs 402 includes a subprocessor (SPE) 402A and subprocessors (SPEs) 402B to 402D.
The SPE 402A includes a G flag processing unit 10A, an SPE control unit 499A, and an execution selection unit 44A. That is, the SPE 402A has the execution selection unit 44A in addition to a configuration corresponding to the SPE 102A shown in FIG. 11. The G flag processing unit 10A and the SPE control unit 499A correspond to the G flag processing unit 10 and the SPE control unit 199A, respectively, with their input and output signals connected as appropriate for the SPE 402A. The G flag processing unit 10A and the SPE control unit 499A are connected via the execution selection unit 44A.
The SPE 402B includes a G flag processing unit 10B, an SPE control unit 499B, and an execution selection unit 44B. That is, the SPE 402B has the execution selection unit 44B in addition to a configuration corresponding to the SPE 102A shown in FIG. 11. The G flag processing unit 10B and the SPE control unit 499B correspond to the G flag processing unit 10 and the SPE control unit 199A, respectively, with their input and output signals connected as appropriate for the SPE 402B. The G flag processing unit 10B and the SPE control unit 499B are connected via the execution selection unit 44B.
The SPE 402C includes a G flag processing unit 10C, an SPE control unit 499C, and an execution selection unit 44C. The SPE 402D includes a G flag processing unit 10D, an SPE control unit 499D, and an execution selection unit 44D. The SPE 402C and SPE 402D have the same configuration as the SPE 402B.

The G flag processing unit 10A supplies the G flag signal to the SPE control units 499A to 499D via the execution selection units 44A to 44D. The G flag processing unit 10B supplies the Gb flag signal to the SPE control units 499A to 499D via the execution selection units 44A to 44D. The G flag processing unit 10C supplies the Gc flag signal to the SPE control units 499A to 499D via the execution selection units 44A to 44D. The G flag processing unit 10D supplies the Gd flag signal to the SPE control units 499A to 499D via the execution selection units 44A to 44D.
Each of the execution selection units 44A to 44D selects one of the G, Gb, Gc, and Gd flag signals and outputs it as the execution permission signal for the corresponding SPE control unit 499A to 499D.
The SPE control units 499A to 499D perform arithmetic control of their respective SPEs based on the supplied execution permission signals.

In this embodiment, the G flag stack shown in FIG. 5 is provided in each of the SPE 402A, SPE 402B, SPE 402C, and SPE 402D. Instruction execution in each SPE is normally controlled individually by the value of that SPE's own flag (G, Gb, Gc, or Gd), but when necessary it can instead be controlled by the value of the G flag of a specific SPE.

FIG. 24 shows the instructions added in the configuration of the fourth embodiment.
The difference from FIG. 20 described above is that the “SYNC” instruction can take any one of A, B, C, or D as an operand. For example, once the SPE 402A executes the “SYNC B” instruction, it uses the value of the G flag of the SPE 402B for instruction execution control from then on. Once an SPE executes the “ASYNC” instruction, it refers to the G flag provided within that SPE 402 and uses its value for instruction execution control from then on.

FIG. 25 is a block diagram illustrating the G flag processing units and the SPE control units.
The connection between the G flag processing units and the SPE control units is described with reference to the G flag processing unit 10 of the SPE 102A.
This figure shows the G flag processing units 10A to 10D, the SPE control units 499A to 499D included in the respective SPEs, the execution selection units 44B to 44D, and the execution control units 40B to 40D. Components that are the same as those shown in FIGS. 10, 12, and 15 are denoted by the same reference numerals.
The G flag processing units 10A to 10D have the same configuration as that shown in FIG. 10, and their output signals are Global_Inst_en_A (denoted “Ga”) to Global_Inst_en_D (denoted “Gd”). Global_Inst_en_A is the same as the Global_Inst_en signal in FIG. 10, except that the suffix makes explicit that it is the signal of the SPE 402A. The same applies to Global_Inst_en_B (Gb) to Global_Inst_en_D (Gd).

The operation of this circuit is described for the SPE 402A. Global_Inst_en_A (Ga) is the same as the Global_Inst_en signal in FIG. 5, with _A appended to distinguish it from the signals of the other SPEs. The same applies to Global_Inst_en_B and the others.
When system_reset is asserted, or when the “ASYNC” instruction is issued and cnt_ASYNC_A becomes active, the two flip-flops 41A and 42A in the figure are reset, and the sel1_A and sel0_A signals both become “0”. The selector 44A therefore selects Global_Inst_en_A (Ga), which becomes the signal Global_Inst_en_act_A that controls the instructions of the SPE 402A. That is, immediately after a reset the SPE uses the value of its own G flag for execution control. When the “SYNC” instruction is issued, cnt_SYNC_A becomes active and the values of cnt_Gsel_1_A and cnt_Gsel_0_A are written to the flip-flops 41A and 42A. The G flag signal selected by these values becomes Global_Inst_en_act_A.

FIG. 26 is a diagram illustrating the control of the flip-flops 41A and 42A.
The flip-flops 41A and 42A are controlled by the settings of cnt_Gsel_1_A and cnt_Gsel_0_A.
As this figure shows, the values are determined by the operand of the “SYNC” instruction. The flip-flops 41A and 42A shown in FIG. 25 are those of the SPE 402A. The execution control circuits 40B to 40D of the SPE 402B, SPE 402C, and SPE 402D work the same way, except for the values that the flip-flops 41B to 41D and 42B to 42D, which hold sel1_B to sel1_D and sel0_B to sel0_D, output when system_reset or cnt_ASYNC_B to cnt_ASYNC_D becomes active: (0,1) for the SPE 402B, (1,0) for the SPE 402C, and (1,1) for the SPE 402D.
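The two-flip-flop selection of the fourth embodiment can be sketched in the same behavioral style. This is a hedged illustration, not the circuit of FIG. 25: the class `ExecutionControl4` is hypothetical, and the (sel1, sel0) encoding for the “SYNC” operands is an assumption inferred from the reset values stated in the text, namely (0,0) for A, (0,1) for B, (1,0) for C, and (1,1) for D. The actual table of FIG. 26 is not reproduced in the text.

```python
SPE_NAMES = ("A", "B", "C", "D")

# Assumed (sel1, sel0) encoding, inferred from the reset values in the text.
ENCODING = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}


class ExecutionControl4:
    """Behavioral model of one execution control unit (40A to 40D) and its selector."""

    def __init__(self, own):
        self.own = own  # which SPE this unit belongs to: "A".."D"
        self.sel1, self.sel0 = ENCODING[own]  # state after system_reset

    def clock(self, sync_operand=None, cnt_async=False, system_reset=False):
        """Update both flip-flops on one rising clock edge."""
        if system_reset or cnt_async:
            # Fall back to the SPE's own flag.
            self.sel1, self.sel0 = ENCODING[self.own]
        elif sync_operand is not None:
            # "SYNC X": load the select value for SPE X.
            self.sel1, self.sel0 = ENCODING[sync_operand]

    def select(self, flags):
        """flags maps "A".."D" to Global_Inst_en_A..Global_Inst_en_D."""
        index = 2 * self.sel1 + self.sel0
        return flags[SPE_NAMES[index]]


flags = {"A": 1, "B": 0, "C": 1, "D": 1}
ctrl_a = ExecutionControl4("A")
own_flag = ctrl_a.select(flags)       # SPE 402A uses its own flag after reset
ctrl_a.clock(sync_operand="B")        # "SYNC B" executed in SPE 402A
synced_flag = ctrl_a.select(flags)    # now follows the flag of SPE 402B
ctrl_a.clock(cnt_async=True)          # "ASYNC" executed in SPE 402A
restored_flag = ctrl_a.select(flags)  # back to its own flag
print(own_flag, synced_flag, restored_flag)
```

Unlike the third embodiment, any SPE can be chosen as the source of the flag here, which is what allows the condition judgment to be placed on any of the four SPEs.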

A program example is used to show that the fourth embodiment speeds up processing.
FIG. 27 shows an example in which the program of FIG. 13 is converted for a 4-way parallel VLIW configuration so that it can be executed on the parallel computing device 1.
The condition judgments can be performed by any SPE; here the SPE 402B performs them, and the other SPEs execute instructions in synchronization with the results. First, for the instructions marked “*1”, the SPE 402A executes up to the “ADD R7” instruction (step 3) while the SPE 402B is making its condition judgment, and immediately after the SPE 402B executes the “PSHI C,Z” instruction, the “MV R7” instruction writes the result to register R7. The instructions marked “*2” work the same way: because the SPE 402A sets Acc 93A to “0” in advance, register R7 can be cleared immediately after the SPE 402B pushes the result of the “CMP R6” instruction onto the G flag stack (step 6). At the same time, as the instructions marked “*3” show, the SPE 402C also prepares its data in advance and can write to register R9 right away.

The instructions marked “*4”, which are executed by the SPE 402D, need care. Whether these instructions are executed is decided only after the “GINV” instruction in the SPE 402B, so the data cannot be prepared in advance as it is for the SPE 402C. Therefore, the “MVA R10” and “INC” instructions are executed regardless of the value of the G flag of the SPE 402B, and the “SYNC B” instruction is executed at the same time as the “GINV” instruction in the SPE 402B, so that only the “MV R10” instruction is synchronized with the SPE 402B. With the four SPEs processing in parallel in this way, the processing that takes 21 clocks in FIG. 13 finishes in 12 clocks.

In FIG. 27, the slots not used by the SPE 402A, SPE 402C, and SPE 402D are shown blank or shaded. Any instruction can be placed in a blank slot, so instructions that do not need to be synchronized with the SPE 402B can be executed in parallel. A shaded slot, on the other hand, can hold either an instruction synchronized with the SPE 402B or a “NOP” instruction. In this example the processing speed is the same as in the third embodiment. However, because any SPE can be set as the master PE that performs the condition judgments, the SPEs can be used far more flexibly, and the arithmetic processing as a whole can be completed in a shorter time than in the third embodiment.

(Fifth embodiment)
A fifth embodiment of the present invention will be described.
Here, as an example, the SPE 502A is treated as a special SPE, and each of the other SPEs can be controlled as to whether or not it synchronizes with the SPE 502A.

FIG. 28 is a block diagram illustrating a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Only the main components needed to explain the arithmetic control processing are shown. The arithmetic processing unit 500 shown in the figure includes a plurality of arithmetic processors (PEs) 502 that perform arithmetic processing in parallel, a control signal generation unit (PE-I) 3 that supplies control instructions to the PEs 502 via SPE control signal lines, and execution control units 50SCB to 50SCD that synchronize the SPEs under the control of the PE-I 3.

Each of the PEs 502 includes a subprocessor (SPE) 502A and subprocessors (SPEs) 502B to 502D.
The SPE 502A includes a G flag source 50FSA, a G flag stack 50STA, and an SPE control unit 599A.
The SPE 502A has a configuration corresponding to the SPE 102A shown in FIG. 11, with the G flag processing unit 10 divided into two parts. The first part, the G flag source 50FSA, performs the front-stage processing: it combines the outputs of the selected flag registers into a single signal and outputs it. The second part, the G flag stack 50STA, performs the back-stage processing and corresponds to the G flag stack 11 and the composition part 19 described above. The G flag stack 50STA records the signal output by the preceding flag source 50FSA according to the conditions. The SPE control unit 599A corresponds to the SPE control unit 199A.
The SPE 502B includes a G flag source 50FSB, a G flag stack 50STB, an SPE control unit 599B, and an execution selection unit 55B. That is, the SPE 502B has the execution selection unit 55B in addition to a configuration corresponding to the SPE 502A described above. The G flag source 50FSB and the G flag stack 50STB are connected via the execution selection unit 55B.
The SPE 502C includes a G flag source 50FSC, a G flag stack 50STC, an SPE control unit 599C, and an execution selection unit 55C. The SPE 502D includes a G flag source 50FSD, a G flag stack 50STD, an SPE control unit 599D, and an execution selection unit 55D. The SPE 502C and SPE 502D have the same configuration as the SPE 502B.

The G flag source 50FSA supplies its G flag source signal to the G flag stacks 50STA to 50STD. The G flag source 50FSB supplies its G flag source signal to the G flag stack 50STB via the execution selection unit 55B. The G flag source 50FSC supplies its G flag source signal to the G flag stack 50STC via the execution selection unit 55C. The G flag source 50FSD supplies its G flag source signal to the G flag stack 50STD via the execution selection unit 55D.
Each of the execution selection units 55B to 55D selects either the G flag source signal or its own flag source signal (Gb to Gd), and the selected signal is accumulated in the corresponding G flag stack 50STB to 50STD. Each of the G flag stacks 50STB to 50STD outputs an execution control signal for the SPE control units 599A to 599D based on the accumulated G flag source signals.
The SPE control units 599A to 599D perform arithmetic control of their respective SPEs based on the supplied execution permission signals.

In this embodiment, the G flag stack shown in FIG. 5 is provided in each of the SPE 502A, SPE 502B, SPE 502C, and SPE 502D. Instruction execution in each SPE is normally controlled individually based on that SPE's own flag source signal (G, Gb, Gc, or Gd), but when necessary it can instead be controlled according to the G flag source of a specific SPE.

FIG. 29 shows the instructions added in the configuration of the fifth embodiment.
These instructions are valid only in the SPE 502B, SPE 502C, and SPE 502D, and can be executed only at the timing when the SPE 502A pushes onto its G flag stack with a “PSH” or “PSHI” instruction.

FIG. 30 is a block diagram showing the synchronization circuit of the SPE 502B.
This figure shows a schematic configuration of the SPE 502A and the configuration of the SPE 502B.
For the SPE 502A, the G flag source 50FSA, the flag stack 50STA, and the SPE control unit 599A are shown.
The G flag source 50FSA gates the condition flag values output from the flag register 97 (not shown) with the control signals designated by the PE-I 3, and outputs a G flag source signal (Gflag_org_A) that indicates the detected state of the condition flags being tested. The input signals are the same as in FIG. 5 described above.
The G flag source 50FSA supplies the G flag source signal (Gflag_org_A) to the G flag stack 50STA in the SPE 502A to control the SPE 502A. The G flag source 50FSA also supplies the G flag source signal (Gflag_org_A) to the SPE 502B and to the SPE 502C and SPE 502D (not shown). The SPE control unit 599A corresponds to the SPE control unit 199A.

For the SPE 502B, the G flag source 50FSB, the flag stack 50STB, the SPE control unit 599B, the execution selection unit 55B, and the execution control unit 50SCB are shown.
The G flag source 50FSB and the SPE control unit 599B have the same configuration as their counterparts in the SPE 502A described above.
The execution selection unit 55B selects between the input signals from the G flag source 50FSA and the G flag source 50FSB, and outputs the selected signal to the flag stack 50STB.
The execution control unit 50SCB controls which input signal the execution selection unit 55B selects, according to the “PSH” or “PSHI” instruction in the SPE 502B.
The flag stack 50STB sequentially accumulates the results selected by the execution selection unit 55B in the G flag stack 51B.
Although the SPE 502B is shown here, the SPE 502C and SPE 502D are connected in the same configuration.

This embodiment differs from the second, third, and fourth embodiments described above in that the G flag stacks 50STA to 50STD are provided in all of the SPEs and the values pushed onto them are themselves controlled. The G flag stack of the SPE 502A is basically the same as that in FIG. 5, except that the signal Gflag_org_A, obtained by ORing the operands, is brought out and supplied to the other SPEs.

When the “PSH” or “PSHI” instruction is executed in the SPE 502B, the value pushed onto the G flag stack is the same as in the case of FIG. 5. The SPE 502B can therefore execute a nested program asynchronously with the SPE 502A. When the “SYNC” or “SYNCI” instruction is issued in the SPE 502B, the cnt_SYNC_B or cnt_SYNCI_B signal becomes active, and Gflag_org_A, selected by sel1, is pushed onto the G flag stack. For the “SYNCI” and “PSHI” instructions, the value of the condition code selected by sel1 is inverted before it is written to the G flag stack. The operation when a “POP” or “FLSH” instruction is issued, and the operation when the system reset signal system_reset becomes active, are the same as in the case of FIG. 5.
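The choice of the value pushed onto an SPE's G flag stack can be summarized in a short sketch. It models only the selection and inversion described above, not the stack or the gating of the condition flags; the function name `pushed_value` and the use of 0 and 1 for the flag source signals are assumptions made for illustration.

```python
def pushed_value(instruction, gflag_org_own, gflag_org_a):
    """Return the value written to this SPE's G flag stack.

    gflag_org_own: this SPE's own G flag source signal (0 or 1)
    gflag_org_a:   Gflag_org_A supplied by SPE 502A (0 or 1)
    """
    if instruction == "PSH":
        return gflag_org_own        # own source, as is
    if instruction == "PSHI":
        return 1 - gflag_org_own    # own source, inverted
    if instruction == "SYNC":
        return gflag_org_a          # source of SPE 502A, as is
    if instruction == "SYNCI":
        return 1 - gflag_org_a      # source of SPE 502A, inverted
    raise ValueError(f"not a push-type instruction: {instruction}")


gflag_org_a = 1
# SPE 502A itself executes PSHI on its own source signal.
a_pushed = pushed_value("PSHI", gflag_org_own=gflag_org_a, gflag_org_a=gflag_org_a)
# An SPE executing SYNCI in the same cycle pushes the same value as SPE 502A.
b_pushed = pushed_value("SYNCI", gflag_org_own=0, gflag_org_a=gflag_org_a)
# An SPE executing SYNC in the same cycle pushes the opposite value.
d_pushed = pushed_value("SYNC", gflag_org_own=0, gflag_org_a=gflag_org_a)
print(a_pushed, b_pushed, d_pushed)
```

The pairing matters: when the SPE 502A uses “PSHI”, a follower that should take the same branch uses “SYNCI”, while a follower that should take the opposite branch uses “SYNC”.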

A program example is used to explain that the fifth embodiment speeds up processing.
FIG. 31 shows an example in which the program of FIG. 13 is parallelized so that it can be executed on the parallel computing device 1.
The SPE 502A makes two condition judgments and pushes each onto its G flag stack with the “PSHI” instruction. At the same time, the SPE 502B and SPE 502C push onto their G flag stacks with the “SYNCI” instruction both times. As a result, the G flags of the SPE 502B and SPE 502C hold the same value as the G flag of the SPE 502A, so instructions that are executed in the SPE 502A are also executed in the SPE 502B and SPE 502C, and instructions that are not executed in the SPE 502A are not executed in the SPE 502B and SPE 502C either.

What is distinctive is the operation of the SPE 502D. When the SPE 502A executes the first “PSHI C,Z” instruction, the SPE 502D uses the “SYNC” instruction to push onto its own G flag stack the inverse of the value that is pushed onto the G flag stack of the SPE 502A. The SPE 502D can therefore immediately execute the instructions marked “*5”, which correspond to the part from the “else” statement onward.
For this purpose, the data is prepared in advance with the “MVA R8” and “ADD R5” instructions, and the result is written to register R8 with the “MV R8” instruction immediately after the “SYNC” instruction. The “GINV” instruction immediately after the “MV R8” instruction needs care. The instructions marked “*4” in FIG. 13 are executed when the first condition judgment in the SPE 502A is true and the second is false. Because the SPE 502D pushed the opposite value from the SPE 502A onto its G flag stack at the first condition judgment, it cannot execute the instructions marked “*4” as things stand. The “GINV” instruction therefore inverts the value pushed first. Then, when the SPE 502A executes the second “PSHI C,Z” instruction, the SPE 502D pushes the opposite value from the SPE 502A with the “SYNC” instruction, which lets it take charge of the instructions marked “*4”. In this embodiment, all processing finishes in 10 clocks.

In SPE 502A, the “POP” instruction is repeated twice at the end, but the two can be replaced with a single “FLSH” instruction, as in SPE 502B and the others. The unused slots of SPE 502A, SPE 502B, and SPE 502C are left blank; any instruction can be placed there, so instructions that do not need to be synchronized with SPE 502A can be executed in parallel.

(Sixth embodiment)
A sixth embodiment of the present invention will be described.
FIG. 32 is a block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Only the main components needed to explain the arithmetic control processing are shown. The arithmetic processing unit 600 shown in the figure includes a plurality of arithmetic processors (PE) 602 that perform arithmetic processing in parallel, a control signal generation unit (PE-I) 3 that supplies control commands to the plurality of PEs 602 via SPE control signal lines, and execution control units 50SCA to 50SCD that synchronize the SPEs under the control of the PE-I 3.

Each of the PEs 602 includes a subprocessor (SPE) 602A and subprocessors (SPE) 602B to 602D.
The SPE 602A includes a G flag source 50FSA, a G flag stack 60STA, an SPE control unit 699A, and an execution selection unit 645A.
In addition to the configuration corresponding to the SPE 502A described above, the SPE 602A has a configuration that selects an input G flag source signal based on a control signal from the execution selection unit 645A. The G flag stack 60STA has the same configuration as the G flag stack 50STA. The SPE control unit 699A corresponds to the SPE control unit 199A.
The SPE 602B includes a G flag source 50FSB, a G flag stack 60STB, an SPE control unit 699B, and an execution selection unit 645B. The SPE 602C includes a G flag source 50FSC, a G flag stack 60STC, an SPE control unit 699C, and an execution selection unit 645C. The SPE 602D includes a G flag source 50FSD, a G flag stack 60STD, an SPE control unit 699D, and an execution selection unit 645D.
The SPEs 602B to 602D have the same configuration as the SPE 602A.

The G flag sources 50FSA to 50FSD supply G flag source signals to the G flag stacks 60STA to 60STD via the execution selection units 645A to 645D.
Each of the execution selection units 645A to 645D selects one of the G flag source signals (Ga flag source signal to Gd flag source signal) supplied from the G flag sources 50FSA to 50FSD and stores the value of the selected signal in its own G flag stack 60STA to 60STD. Each of the G flag stacks 60STA to 60STD outputs an execution control signal for the corresponding SPE control unit 699A to 699D based on the stored G flag source signals.
The SPE control units 699A to 699D control the arithmetic operation of their respective SPEs based on the supplied execution permission signal.

In this embodiment, the G flag stack shown in FIG. 5 is provided in each of SPE 602A, SPE 602B, SPE 602C, and SPE 602D. Instruction execution in each SPE is normally controlled individually, based on the value of its own Ga, Gb, Gc, or Gd flag source signal, but when necessary it can be controlled according to the G flag of any SPE.

FIG. 33 shows the instructions added in the configuration of the sixth embodiment.
FIG. 34 is a block diagram showing the SPE synchronization circuit of the sixth embodiment.
FIG. 34 shows the configuration of the SPE 602A as a representative example. The SPE 602A includes a G flag source 50FSA, a G flag stack 60STA, an SPE control unit 699A, and an execution selection unit 645A.
The execution selection unit 645A includes execution selection units 64A and 65A.
The execution selection unit 64A has the same configuration as the execution selection unit 44A shown in FIG. 25, and the execution selection unit 65A has the same configuration as the execution selection unit 55B shown in FIG. 30.
The execution control unit 60SCA combines the configuration of the execution control unit 50SCA, which corresponds to the execution control unit 50SCB shown in FIG. 30, with the configuration of the execution control unit 40A shown in FIG. 25.

Within the execution selection unit 645A, the execution selection unit 64A selects one of the input signals from the G flag sources 50FSA to 50FSD and outputs its value. The execution selection unit 65A selects either the flag source signal output from the execution selection unit 64A or the signal from the G flag source 50FSA and outputs its value to the flag stack 60STA.
The execution control unit 60SCA controls the input signal selection of the execution selection unit 645A in accordance with the “PSH” or “PSHI” instruction in the SPE 602A.
The flag stack 60STA sequentially accumulates the results selected by the execution selection unit 645A in the G flag stack 51A.
Although the SPE 602A has been described, the SPE 602B, SPE 602C, and SPE 602D are connected in the same configuration.

The “SYNC” and “SYNCI” instructions take one of A, B, C, or D as an operand, and the cnt_Gsel_1_A and cnt_Gsel_0_A signals are determined accordingly, as shown in FIG. 26 described above. In accordance with these instructions, the execution control unit 40A in the execution control unit 60SCA makes the SPE refer to the G flag of any other SPE as its execution control condition, thereby synchronizing the execution control of the SPE with that other SPE.
When the “PSH” or “PSHI” instruction is executed, the value pushed onto the G flag stack is the same as in the case of FIG. 5. The condition flag selected by the operands of these instructions is also supplied to the other SPEs as Gflag_org_A. Gflag_org_B, Gflag_org_C, and Gflag_org_D are the signals generated by SPE 602B, SPE 602C, and SPE 602D, respectively.
When a “SYNC” or “SYNCI” instruction is issued, the cnt_SYNC_A or cnt_SYNCI_A signal becomes active, the condition code selected by the execution selection unit 64A is selected by the execution selection unit 65A, and its value is pushed onto the G flag stack. For the “SYNCI” and “PSHI” instructions, the value of the selected condition code is inverted before being written to the G flag stack. The operation when the “POP”, “POPI”, “GINV”, or “FLSH” instruction is issued, and the operation when the system reset signal system_reset becomes active, are the same as in the case of FIG. 5.
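The selection of the value pushed onto the G flag stack can be summarized as a single function. The following is a minimal sketch, assuming that the two-stage selection (units 64A and 65A) can be modeled as a lookup followed by an optional inversion; the function name `pushed_value` and its parameters are hypothetical, and the encoding of the cnt_Gsel signals is not modeled.

```python
def pushed_value(op, own_flag, gflag_org, target=None):
    """Value one SPE pushes onto its G flag stack for a push-type instruction.

    op        -- "PSH", "PSHI", "SYNC" or "SYNCI"
    own_flag  -- condition flag selected by this SPE's own PSH/PSHI operands
    gflag_org -- dict mapping "A".."D" to each SPE's Gflag_org signal
    target    -- operand of SYNC/SYNCI ("A", "B", "C" or "D")
    """
    if op in ("SYNC", "SYNCI"):
        source = gflag_org[target]   # first stage picks one of the four SPE signals
    elif op in ("PSH", "PSHI"):
        source = own_flag            # second stage passes the local flag source
    else:
        raise ValueError(f"not a push-type instruction: {op}")
    # SYNCI and PSHI invert the selected value before it is written.
    return (not source) if op.endswith("I") else bool(source)
```

Under this model, an SPE that issues “SYNCI B” pushes the same value that SPE 602B pushes with “PSHI”, while “SYNC B” pushes the opposite value.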

That the sixth embodiment speeds up processing will be explained using a program example.
FIG. 35 shows an example in which the program of FIG. 13 has been converted for a 4-way parallel VLIW type so that it can be executed on the parallel computing device 1 of the sixth embodiment.
In this example, SPE 602B performs the condition judgment twice and pushes each result onto the G flag stack with the “PSHI” instruction.
At the same time, SPE 602A and SPE 602C push onto their G flag stacks with the “SYNCI B” instruction. The G flags of SPE 602A and SPE 602C therefore hold the same value as the G flag of SPE 602B, so an instruction executed in SPE 602B is also executed in SPE 602A and SPE 602C, and an instruction not executed in SPE 602B is not executed in SPE 602A or SPE 602C.

What is characteristic is the operation of SPE 602D. When the first “PSHI C,Z” instruction is executed in SPE 602B, SPE 602D uses the “SYNC B” instruction to push onto its own G flag stack the inverse of the value pushed onto the G flag stack of SPE 602B. SPE 602D can therefore immediately execute the instruction portion marked “*5”, which corresponds to the “else” clause and what follows it. For that purpose, the data is prepared in advance by the “MVA R8” and “ADD R5” instructions, and the result is written to register R8 by the “MV R8” instruction immediately after the “SYNC B” instruction. The “GINV” instruction immediately after the “MV R8” instruction needs attention. The instruction portion marked “*4” in FIG. 13 is executed when the first condition judgment in SPE 602B is true and the second is false. Because SPE 602D pushed the opposite value to SPE 602B onto its G flag stack at the first condition judgment, it cannot execute the “*4” portion as things stand. The “GINV” instruction therefore inverts the value that was pushed first. Then, when the second “PSHI C,Z” instruction is executed in SPE 602B, SPE 602D pushes the opposite value to SPE 602B with the “SYNC B” instruction and can thus take charge of the “*4” portion. In this embodiment, all processing finishes in 10 clocks.

The unused slots of SPE 602A, SPE 602B, and SPE 602C are left blank; any instruction can be placed there, so instructions that do not need to be synchronized with SPE 602B can be executed in parallel. In this program example the number of clocks required is the same as in the fifth embodiment, but because the SPE that performs the condition judgment can be chosen freely, the SPEs can be used more flexibly, and in general processing can be faster than in the fifth embodiment.

(Seventh embodiment)
Next, a seventh embodiment of the present invention will be described.
In this embodiment, the G flag stack shown in FIG. 5 is provided only in the SPE 702A. Each SPE other than the SPE 702A selects whether its instruction execution is controlled according to the G flag of the SPE 702A or the inverted G flag, or whether it always executes instructions regardless of the G flag of the SPE 702A.

FIG. 36 is a block diagram showing a schematic configuration of the arithmetic processing unit of the parallel computing device 1.
Only the main components needed to explain the arithmetic control processing are shown. The arithmetic processing unit 700 shown in the figure includes a plurality of arithmetic processors (PE) 702 that perform arithmetic processing in parallel, a control signal generation unit (PE-I) 3 that supplies control commands to the plurality of PEs 702 via SPE control signal lines, execution control units 70SCB to 70SCD that synchronize the SPEs under the control of the PE-I 3, and inversion control units 79B to 79D that invert the G flag signal under the control of the PE-I 3.

Each of the PEs 702 includes a subprocessor (SPE) 702A and subprocessors (SPE) 702B to 702D. The SPE 702A includes a G flag processing unit 10 and an SPE control unit 799A. The SPEs 702B to 702D include SPE control units 799B to 799D and process selection units 75B to 75D, respectively.
The process selection units 75B to 75D in the SPEs 702B to 702D include execution selection units 74B to 74D and inversion processing units 78B to 78D, respectively.
The G flag processing unit 10 supplies the G flag signal to the SPE control unit 799A and, via the inversion processing units 78B to 78D and the execution selection units 74B to 74D, to the SPE control units 799B to 799D.
The inversion processing units 78B to 78D invert the G flag supplied from the G flag processing unit 10 based on control signals from the inversion control units 79B to 79D. The execution selection units 74B to 74D output execution permission signals to the SPE control units 799B to 799D, respectively, based on the supplied G flag signal or inverted G flag signal and on the execution permission signals of the execution control units 70SCB to 70SCD.
The SPE control units 799A to 799D control the arithmetic operation of their respective SPEs based on the supplied execution permission signal.

FIG. 37 shows the instructions added in the configuration of the seventh embodiment.
These instructions can be executed only by SPE 702B, SPE 702C, and SPE 702D. For example, when SPE 702B executes the “SYNC” instruction, it begins to use the G flag of SPE 702A for instruction execution control; when it executes the “SYNCI” instruction, it begins to use the inverted value of the G flag of SPE 702A; and when it executes the “ASYNC” instruction, it executes instructions regardless of the value of the G flag of SPE 702A.

FIG. 38 is a block diagram showing the SPE synchronization circuit.
This example is for the SPE 702B; to avoid clutter, circuits identical to those in FIG. 5 are not shown.
The figure shows the G flag processing unit 10, the SPE control units 799A and 799B of the SPEs 702A and 702B, the execution selection unit 74B, the execution control unit 70SCB, the inversion control unit 79B, and the inversion processing unit 78B. Components identical to those shown in FIGS. 10, 12, and 15 described above are given the same reference numerals.
The G flag processing unit 10 has the same configuration as shown in FIG. 5 described above, and its output signal is called Global_Inst_en_A. Global_Inst_en_A is the same as the Global_Inst_en signal in FIG. 5, except that the name makes explicit that it is the signal of the SPE 702A.

The execution control unit 70SCB outputs a control signal (Async_B) that controls the execution of the SPE 702B in response to a control signal from the PE-I 3. The execution control unit 70SCB controls the execution selection unit 74B to generate the control signal Global_Inst_en_B. The inversion processing unit 78B inverts the value of the G flag; here it is shown as an EXOR circuit. The inversion control unit 79B outputs a control signal (Inv_B) that controls the execution of the SPE 702B in response to a control signal from the PE-I 3.

Global_Inst_en_B is the signal used for instruction execution control in the SPE 702B. When system_reset becomes active (“1”), the flip-flop 71B is set and the Async_B signal becomes “1”. Global_Inst_en_B is then always “1”, so the SPE 702B always executes instructions. When the “SYNC” instruction is issued, the cnt_SYNC_B signal becomes active, and both the Inv_B signal and the Async_B signal become “0”. Global_Inst_en_B therefore equals Global_Inst_en_A. That is, instruction execution in the SPE 702B is controlled according to the value of the G flag of the SPE 702A.

When the “SYNCI” instruction is issued, the cnt_SYNCI_B signal becomes active, the Inv_B signal becomes “1”, and the Async_B signal becomes “0”. Global_Inst_en_B therefore becomes the inverse of Global_Inst_en_A. That is, instruction execution in the SPE 702B is controlled according to the inverted value of the G flag of the SPE 702A. When the “ASYNC” instruction is issued, the cnt_ASYNC_B signal becomes active, the Inv_B signal becomes “0”, and the Async_B signal becomes “1”. The flip-flop 71B and the inversion control unit 79B change state on the rising edge of the basic clock inside the parallel computing device 1 (not shown).
The SPE 702C and SPE 702D have the same configuration as that described for the SPE 702B.
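The three synchronization modes of the SPE 702B can be checked with a short model of the Async_B and Inv_B signals. This is a sketch under stated assumptions: clocking is ignored (state changes take effect immediately), Inv_B is assumed to be “0” after reset because the text does not specify it, and the final stage is assumed to behave like an OR of Async_B with the conditionally inverted Global_Inst_en_A. The class name `SyncControlB` and its method names are hypothetical.

```python
class SyncControlB:
    """Toy model of the Async_B / Inv_B state for SPE 702B (clocking omitted)."""

    def __init__(self):
        self.reset()

    def reset(self):
        # system_reset sets the flip-flop: Async_B = 1, so SPE 702B always executes.
        # Assumption: Inv_B is cleared on reset; the text does not say.
        self.async_b, self.inv_b = 1, 0

    def sync(self):
        # "SYNC": follow SPE 702A's G flag.
        self.async_b, self.inv_b = 0, 0

    def synci(self):
        # "SYNCI": follow the inverse of SPE 702A's G flag.
        self.async_b, self.inv_b = 0, 1

    def asynchronous(self):
        # "ASYNC" (async is a reserved word in Python, hence the longer name).
        self.async_b, self.inv_b = 1, 0

    def global_inst_en_b(self, global_inst_en_a):
        # The EXOR stage conditionally inverts SPE 702A's flag;
        # Async_B forces the result to 1 (assumed OR stage).
        return self.async_b | (global_inst_en_a ^ self.inv_b)
```

For example, after `synci()` the call `global_inst_en_b(0)` returns 1 and `global_inst_en_b(1)` returns 0, which is the inverted behavior the text attributes to the “SYNCI” instruction.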

That the seventh embodiment speeds up processing will be shown using a program example.
FIG. 39 shows an example in which the program of FIG. 13 has been converted for a 4-way parallel VLIW type so that it can be executed on the parallel computing device 1 of the seventh embodiment.
The SPE 702A performs the condition judgments and the like, and the SPE 702B and the others execute instructions in synchronization with the results. First, for the instruction portion marked “*1”, the SPE 702B executes up to the “ADD R7” instruction before the SPE 702A makes its condition judgment, and writes the result to register R7 with the “MV R7” instruction immediately after the SPE 702A executes the “PSHI C,Z” instruction. The instruction portion marked “*2” works the same way: by setting Acc-A to “0” in advance with the “CLR” instruction, register R7 can be cleared immediately after the SPE 702A pushes the result of the “CMP R6” instruction onto the G flag stack.
Similarly, as shown by the instructions marked “*3”, the SPE 702C also prepares its data in advance so that register R9 can be written immediately.

In the instruction portion of the SPE 702D marked “*5”, the “MVA R8” and “ADD R5” instructions are first executed regardless of the state of the SPE 702A, and the “SYNCI” instruction is executed when the SPE 702A executes the “PSHI C,Z” instruction. Execution is then controlled by the inverted value of the G flag of the SPE 702A, so the “MV R8” instruction, which corresponds to the “else” clause and what follows it, can be executed immediately. The instruction portion marked “*4” is handled similarly: the “ASYNC” instruction is executed once so that the “MVA R10” and “INC” instructions run regardless of the value of the G flag of the SPE 702A, and the “SYNC” instruction is executed after the “GINV” instruction in the SPE 702A, so that only the “MV R10” instruction is synchronized with the SPE 702A. By processing in parallel on four SPEs in this way, the processing that took 21 clocks in FIG. 13 finishes in 10 clocks.

In FIG. 39, the unused slots of SPE 702B, SPE 702C, and SPE 702D are shown blank or shaded. Any instruction can be placed in a blank slot, so instructions that do not need to be synchronized with the SPE 702A can be executed in parallel. A shaded slot, on the other hand, can hold either an instruction synchronized with the SPE 702A or a “NOP” instruction. This embodiment completes the processing in two fewer clocks than the second embodiment, so the arithmetic processing as a whole is also likely to finish sooner than in the second embodiment.

According to the embodiments of the present invention described above, hardware that supports multiply nested structured programs can be realized easily in a parallel computing device that combines the SIMD type with the VLIW type. Because a large number of arithmetic elements (processors) can thus be operated efficiently in parallel, a computer with the several Tflops or several hundred GOPS of computing power required for scientific computing and image processing can be realized easily.

The present invention is not limited to the above embodiments and can be modified without departing from the spirit of the present invention.
For example, although the number of arithmetic processors built into one parallel processing device 1 of the present invention is 108, the present invention is not limited to this and is generally applicable to a computer that incorporates two or more arithmetic processors. Likewise, although the program nesting depth is four, the present invention is not limited to this and is generally effective for structures with two or more levels of nesting.
Although the number of subprocessors parallelized in the VLIW manner is four per arithmetic processor, the present invention is not limited to this and can be applied to a system having two or more subprocessors.
Four flags are used as the flag information that controls instruction execution: the carry (C) flag, the negative (N) flag, the overflow (V) flag, and the zero (Z) flag. The present invention is not limited to these; some of them, for example the V flag, can be omitted, or conversely a half-carry (H) flag can be added.
The embodiments of the present invention do not specify the number of bits of the accumulator (Acc) or the ALU; the present invention can be applied to a parallel computing device with any number of bits.

102 Arithmetic processor (PE)
102A Sub arithmetic processor (SPE, specific subprocessor)
102B, 102C, 102D Sub arithmetic processor (SPE, subprocessor)
11 G flag stack (first control information holding unit)
19 Combining unit (first combining unit)
95A ALU (first arithmetic unit)
199A SPE control unit (first control unit)
95B ALU (second arithmetic unit)

Claims

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises:
a specific subprocessor comprising:
a first arithmetic unit that performs first arithmetic processing on input data based on the control command;
a first control information holding unit that has a stack structure and sequentially stores flag information based on a result of the first arithmetic processing;
a first combining unit that combines the flag information stored in the first control information holding unit; and
a first control unit that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without storing that combined flag information in a stack structure; and
a subprocessor comprising:
a second arithmetic unit that performs second arithmetic processing on input data based on the control command; and
a second control unit that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information produced by the first combining unit, without storing that combined flag information in a stack structure.

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises:
a specific subprocessor comprising:
a first arithmetic unit that performs first arithmetic processing on input data based on the control command;
a first control information holding unit that has a stack structure and sequentially stores flag information based on a result of the first arithmetic processing;
a first combining unit that combines the flag information stored in the first control information holding unit; and
a first control unit that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information produced by the first combining unit, without storing that combined flag information in a stack structure; and
a subprocessor comprising:
a second arithmetic unit that performs second arithmetic processing on input data based on the control command;
a selection unit that selects whether or not to use the combined flag information produced by the first combining unit to control the second arithmetic processing of the second arithmetic unit; and
a second control unit that, when the selection unit selects to use the combined flag information produced by the first combining unit to control the second arithmetic processing of the second arithmetic unit, causes the second arithmetic unit to perform the second arithmetic processing based on that combined flag information, without storing it in a stack structure.

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises a specific sub-processor and a sub-processor,
the specific sub-processor comprising:
a first arithmetic unit that performs first arithmetic processing on input data based on the control command;
a first control information holding unit that has a stack structure and sequentially stores flag information based on a result of the first arithmetic processing;
a first combining unit that combines the flag information stored in the first control information holding unit; and
a first control unit that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information combined by the first combining unit, without storing that combined flag information in a stack structure,
the sub-processor comprising:
a second arithmetic unit that performs second arithmetic processing on input data based on the control command;
a second control information holding unit that has a stack structure and sequentially stores flag information based on a result of the arithmetic processing;
a second combining unit that combines the flag information stored in the second control information holding unit;
a selection unit that selects either the combined flag information combined by the first combining unit or the combined flag information combined by the second combining unit of the sub-processor, without storing it in a stack structure; and
a second control unit that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information selected by the selection unit, without storing that combined flag information in a stack structure.
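
As an illustrative aid only, and not as part of the claims, the flag handling recited above can be sketched in Python. The sketch assumes that a combining unit takes the logical AND of all flags held in its stack; this section does not state the combining rule. All class and function names are hypothetical.

```python
class FlagStack:
    """Control information holding unit: a stack of boolean flags."""

    def __init__(self):
        self._flags = []

    def push(self, flag):
        self._flags.append(bool(flag))

    def pop(self):
        return self._flags.pop()

    def combined(self):
        # ASSUMPTION: combine by logical AND; an empty stack enables execution.
        return all(self._flags)


def select_enable(first_combined, second_combined, use_first):
    """Selection unit: pick one combined flag. The result is used directly
    and is never pushed back onto any stack."""
    return first_combined if use_first else second_combined


def execute(enable, op, value):
    """Control unit: apply op only when enabled, else leave value unchanged."""
    return op(value) if enable else value


first = FlagStack()    # stack of the specific sub-processor
second = FlagStack()   # stack of the other sub-processor
first.push(True)       # outer condition holds
first.push(False)      # inner (nested) condition fails
second.push(True)

enable_a = select_enable(first.combined(), second.combined(), use_first=True)
enable_b = select_enable(first.combined(), second.combined(), use_first=False)
result_a = execute(enable_a, lambda x: x + 1, 10)  # disabled, stays 10
result_b = execute(enable_b, lambda x: x + 1, 10)  # enabled, becomes 11
```

In this sketch the second arithmetic unit can follow either its own nesting state or that of the specific sub-processor, depending only on the selection.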

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises a plurality of sub-processors,
each of the sub-processors comprising:
an arithmetic unit that performs arithmetic processing on input data based on the control command;
a control information holding unit that has a stack structure and sequentially stores flag information based on a result of the arithmetic processing;
a combining unit that combines the flag information stored in the control information holding unit of its own sub-processor;
a selection unit that selects the combined flag information combined by the combining unit of any one of the sub-processors, without storing it in a stack structure; and
a control unit that causes the arithmetic unit to perform arithmetic processing based on the combined flag information selected by the selection unit, without storing that combined flag information in a stack structure,
and wherein the arithmetic units included in the plurality of sub-processors are capable of performing mutually different arithmetic processing.

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises a specific sub-processor and a sub-processor,
the specific sub-processor comprising:
a first arithmetic unit that performs first arithmetic processing on input data based on the control command;
a first control information holding unit that has a stack structure and sequentially stores flag information based on a result of the arithmetic processing;
a first combining unit that combines the flag information stored in the first control information holding unit; and
a first control unit that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information combined by the first combining unit, without storing that combined flag information in a stack structure,
the sub-processor comprising:
a second arithmetic unit that performs second arithmetic processing on input data based on the control command;
a selection unit that selects either flag information based on a result of the first arithmetic processing by the first arithmetic unit or flag information based on a result of the second arithmetic processing by the second arithmetic unit of the sub-processor;
a second control information holding unit that has a stack structure and sequentially stores the flag information selected by the selection unit;
a second combining unit that combines the flag information stored in the second control information holding unit; and
a second control unit that causes the second arithmetic unit to perform the second arithmetic processing based on the combined flag information combined by the second combining unit, without storing that combined flag information in a stack structure.
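
As an illustrative aid only, and not as part of the claims, the following Python sketch shows the variant in which the selection is made before the stack: the selection unit chooses which raw flag is pushed, and the combined result then drives execution directly. The logical-AND combining rule and all names are assumptions made for the example.

```python
def select_flag(first_flag, second_flag, use_first):
    """Selection unit: choose the flag produced by the first arithmetic unit
    or the one produced by the sub-processor's own second arithmetic unit."""
    return first_flag if use_first else second_flag


def combine(stack):
    # ASSUMPTION: the combining unit ANDs every flag currently on the stack.
    return all(stack)


second_stack = []  # second control information holding unit

# Entering an outer conditional: push the flag computed by the first unit.
second_stack.append(select_flag(first_flag=True, second_flag=False, use_first=True))
# Entering a nested conditional: push the flag computed by the second unit.
second_stack.append(select_flag(first_flag=True, second_flag=False, use_first=False))

enabled_inner = combine(second_stack)  # inner body is disabled

second_stack.pop()                     # leave the nested conditional
enabled_outer = combine(second_stack)  # outer body is enabled again
```

The combined value is recomputed from the stack each time and is itself never pushed, which matches the "without storing ... in a stack structure" limitation as read here.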

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises a plurality of sub-processors,
each of the sub-processors comprising:
an arithmetic unit that performs arithmetic processing on input data based on the control command;
a selection unit that selects flag information based on a result of arithmetic processing by the arithmetic unit of any one of the sub-processors;
a control information holding unit that has a stack structure and sequentially stores the flag information selected by the selection unit;
a combining unit that combines the flag information stored in the control information holding unit; and
a control unit that causes the arithmetic unit to perform arithmetic processing based on the combined flag information combined by the combining unit, without storing that combined flag information in a stack structure,
and wherein the arithmetic units included in the plurality of sub-processors are capable of performing mutually different arithmetic processing.

A parallel computing device comprising:
a plurality of arithmetic processors that perform arithmetic processing in parallel; and
a control signal generator that supplies a control command to each of the plurality of arithmetic processors,
wherein each of the plurality of arithmetic processors comprises a specific sub-processor and a sub-processor,
the specific sub-processor comprising:
a first arithmetic unit that performs first arithmetic processing on input data based on the control command;
a first control information holding unit that has a stack structure and sequentially stores flag information based on a result of the first arithmetic processing;
a first combining unit that combines the flag information stored in the first control information holding unit; and
a first control unit that causes the first arithmetic unit to perform the first arithmetic processing based on the combined flag information combined by the first combining unit, without storing that combined flag information in a stack structure,
the sub-processor comprising:
a second arithmetic unit that performs second arithmetic processing on input data based on the control command;
an inversion processing unit that selects whether to output the combined flag information combined by the first combining unit as is or inverted, and outputs it as a flag information output without storing it in a stack structure;
a selection unit that selects either the flag information output from the inversion processing unit, without storing it in a stack structure, or control information that always enables instruction execution; and
a second control unit that causes the second arithmetic unit to perform the second arithmetic processing based on the information selected by the selection unit.
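
As an illustrative aid only, and not as part of the claims, the inversion processing unit and the selection unit recited above can be sketched in Python. The function names are hypothetical, and the use shown (one unit taking the "then" side while the other takes the "else" side of the same condition) is an assumed interpretation rather than something this section states.

```python
def inversion_unit(combined_flag, invert):
    """Pass the combined flag through unchanged, or output its negation."""
    return (not combined_flag) if invert else combined_flag


def selection_unit(flag_output, always_enable):
    """Choose the flag output, or control information that always enables."""
    return True if always_enable else flag_output


def second_control(enable, op, value):
    """Apply op only when enabled; otherwise leave value unchanged."""
    return op(value) if enable else value


combined_from_first = True  # combined flag from the first combining unit

# Specific sub-processor side: acts when the combined flag is set.
then_result = second_control(combined_from_first, lambda x: x * 2, 5)

# Sub-processor side with inversion: acts only when the flag is NOT set.
else_enable = selection_unit(inversion_unit(combined_from_first, invert=True),
                             always_enable=False)
else_result = second_control(else_enable, lambda x: x - 1, 5)

# Sub-processor side ignoring the flag entirely.
free_enable = selection_unit(inversion_unit(combined_from_first, invert=True),
                             always_enable=True)
free_result = second_control(free_enable, lambda x: x - 1, 5)
```

Neither the inverted flag nor the selected value is pushed onto a stack in this sketch; both are consumed directly by the control function.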