JPH10105412A

JPH10105412A - Object generation method for realizing efficient access to main storage

Info

Publication number: JPH10105412A
Application number: JP8258271A
Authority: JP
Inventors: Giichi Tanaka; 義一田中; Yuji Tsushima; 雄次對馬
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-09-30
Filing date: 1996-09-30
Publication date: 1998-04-24

Abstract

(57)【要約】【課題】プログラムにおいてループでアクセスするデ
ータ領域がキャッシュ容量を越えた場合に、ＤＲＡＭ、
ＳＤＲＡＭで作られた主記憶を効率的にアクセスするオ
ブジェクトを生成する。【解決手段】ループでアクセスするデータ領域がキャ
シュ容量を超え、ベクトル化できるデータ依存関係の場
合には、ベクトル命令と同様の処理を、ベクトルレジス
タを用いて複数のスカラ命令で実現するオブジェクト列
を生成する。【効果】ループ内の配列参照を平均的にアクセスする
方法に比べ、少数の配列を集中的にアクセスでき、同一
ＲＡＳアドレスで複数要素をアクセスする確率が高くな
り、データの効率的アクセスが実現される。 (57) [Summary] [PROBLEMS] When a data area accessed in a loop in a program exceeds a cache capacity, a DRAM,
An object for efficiently accessing the main memory created by the SDRAM is generated. When a data area accessed in a loop exceeds a cache capacity and has a data dependence that can be vectorized, an object sequence that implements the same processing as a vector instruction by a plurality of scalar instructions using a vector register is used. Generate. [Effect] Compared with the method of accessing array references in a loop on average, a small number of arrays can be accessed intensively, the probability of accessing a plurality of elements with the same RAS address is increased, and efficient data access is realized. You.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

【発明の属する技術分野】本発明は、命令レベル並列処
理を行うプログラムでのオブジェクト生成方法に関わ
り、特にプログラムがアクセスするデータが大きく、キ
ャッシュに入りきらない場合のループに好適なオブジェ
クト生成方法に関する。BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a method for generating objects in a program for performing instruction level parallel processing, and more particularly to a method for generating an object suitable for a loop in a case where data accessed by the program is large and cannot be stored in a cache. .

【０００２】[0002]

【従来の技術】スーパコンピュータの１つの方向とし
て、スカラプロセッサをノードプロセッサとする並列処
理方式が有望視されている。スカラプロセッサを用いた
並列スーパコンピュータが期待されるのは、半導体技術
の進歩によるクロック周波数の向上、複数の並列実行可
能な演算器を有効に生かすスーパスカラ方式等の命令レ
ベル並列処理の実現により、スカラプロセッサの処理性
能が飛躍的に向上していることがあげられる。2. Description of the Related Art As one direction of a supercomputer, a parallel processing system using a scalar processor as a node processor is considered promising. Parallel supercomputers using scalar processors are expected to be improved by improving the clock frequency due to advances in semiconductor technology, and by realizing instruction-level parallel processing such as the superscalar method that makes effective use of multiple arithmetic units that can execute in parallel. The processing performance of the processor is dramatically improved.

【０００３】しかしながら、その高いスカラプロセッサ
処理能力は、キャッシュメモリが有効に働くときのみ達
成される。大規模科学技術計算では、データ領域が大き
くデータの局所性が少ないという性質があるため、デー
タキャッシュが有効に働かないことが多い。その場合、
スカラプロセッサの性能は主記憶アクセスペナルティに
より大きく性能が低下する。中澤、中村、朴「超並列計
算機ＣＰ−ＰＡＣＳのアーキテクチャ」情報処理Ｖｏ
ｌ．３７，Ｎｏ．１，１９９６、ｐｐ．１８〜２８で
は、この主記憶アクセスペナルティを隠蔽を、主記憶の
多バンク化による主記憶アクセスのパイプライン化、多
数のレジスタ及び、レジスタへのプリロード機能により
実現した。However, its high scalar processor throughput is only achieved when the cache memory works effectively. In large-scale scientific and technical calculations, the data cache often does not work effectively because of the property that the data area is large and the locality of data is small. In that case,
The performance of the scalar processor is greatly reduced due to the main memory access penalty. Nakazawa, Nakamura, Park "Architecture of Massively Parallel Computer CP-PACS" Information Processing Vo
l. 37, no. 1, 1996; In Nos. 18 to 28, the concealment of the main memory access penalty was realized by a pipeline of the main memory access by using multiple banks of the main memory, a large number of registers, and a function of preloading the registers.

【０００４】[0004]

【発明が解決しようとする課題】上記の従来技術のみで
は、将来、スカラプロセッサのＣＰＵの性能が向上した
場合、演算に見合った主記憶バンド幅を確保できなくな
るという問題がある。例えば、周波数が１５０Ｍｈｚ，
１サイクルに発行できるロード命令は１、主記憶はＤＲ
ＡＭにより１６バンク構成である計算機を考える。ルー
プ内の配列を連続的にアクセスする場合、主記憶の素子
のサイクルタイムが１０６ｎｓ（１６／１５０）以下で
あれば、主記憶によるバンド幅が性能のネックにはなら
ないことになる。現状の素子のサイクル時間が１００ｎ
ｓであれば、ぎりぎりの主記憶バンド幅であるといえ
る。しかし、ＣＰＵ性能は年率５０％〜１００％の性能
向上が実現できているのに対し、ＤＲＡＭの性能向上は
１０％以下であることを考慮すると、従来のアプローチ
では限界があることがわかる。With only the above-mentioned prior art, there is a problem that if the performance of the CPU of the scalar processor is improved in the future, it becomes impossible to secure a main storage bandwidth suitable for the operation. For example, if the frequency is 150 MHz,
One load instruction can be issued in one cycle, and the main memory is DR
Consider a computer having a 16-bank configuration based on AM. When the array in the loop is continuously accessed, if the cycle time of the element of the main memory is 106 ns (16/150) or less, the bandwidth of the main memory does not become a performance bottleneck. Current device cycle time is 100n
If it is s, it can be said that it is the last main memory bandwidth. However, it can be seen that the conventional approach has a limit, considering that the CPU performance can be improved at an annual rate of 50% to 100%, while the DRAM performance improvement is 10% or less.

【０００５】また、ＤＲＡＭの性能を改善する方式とし
てページモードというものがある。ＤＲＡＭをアクセス
するには、アドレスの上位半分を指定するＲＡＳアクセ
ス、次にアドレスの下位半分を指定されるＣＡＳアクセ
スの２段階からなる。ページモードでは、同一ＲＡＳア
ドレスに対する参照に対しては、ＲＡＳアドレスのアク
セスをしないで、ＣＡＳアドレスのアクセスだけとなる
高速化モードである。また最近、ＳＤＲＡＭという高速
素子があらわれてきた。これは、同一ＲＡＳアドレスの
データに対しては、１サイクルピッチでデータをアクセ
スできるように改良した素子である。これらの性質を利
用するためには、アドレスの接近したデータは、ＲＡＳ
アドレスが切り替わらないように近接した時間内に処理
することが必要である。There is a page mode as a method for improving the performance of a DRAM. Accessing a DRAM consists of two stages: a RAS access specifying the upper half of the address and a CAS access specifying the lower half of the address. The page mode is a high-speed mode in which a reference to the same RAS address is not accessed by the RAS address, but only accessed by the CAS address. Recently, a high-speed device called an SDRAM has appeared. This is an element improved so that data of the same RAS address can be accessed at one cycle pitch. In order to take advantage of these properties, data whose addresses are close to each other must be
It is necessary to process within a close time so that the address is not switched.

【０００６】ここで、従来のスカラプロセッサによるル
ープ処理で、ループ内にｎ個の配列の連続アクセス参照
がある場合を考える。１つの配列参照Ａ（Ｉ），Ｉ＝
１，２．．に着目したとき、１回の配列参照Ａ（１）の
のち、次のループ繰り返しでＡ（２）を参照する。Ａ
（１），Ａ（２）は同一ＲＡＳアドレスのため、効率的
にアクセスできるはずであるが、１回のループ処理の間
に他の配列に関するｎ−１個のデータアクセスが発生す
るため、ＲＡＳアドレスが切り替わる可能性が非常に高
い。つまり、従来のループ処理方式ではＤＲＡＭのペー
ジモードやＳＤＲＡＭの性質を生かすことができない問
題がある。Here, consider a case in which there are n consecutive access references of an array in a loop in a loop process by a conventional scalar processor. One sequence reference A (I), I =
1,2. . , After one array reference A (1), A (2) is referenced in the next loop iteration. A
Since (1) and A (2) have the same RAS address, they should be able to be accessed efficiently. However, since n-1 data accesses for other arrays occur during one loop processing, RAS It is very likely that addresses will switch. That is, the conventional loop processing method has a problem that the page mode of the DRAM and the properties of the SDRAM cannot be utilized.

【０００７】本発明の目的は、ループでアクセスするデ
ータ領域がキャッシュ容量を超えた場合に、ＤＲＡＭ、
ＳＤＲＡＭで作られた主記憶を効率的にアクセスするオ
ブジェクト生成方法を提供することにある。[0007] An object of the present invention is to provide a DRAM, when a data area accessed in a loop exceeds a cache capacity.
It is an object of the present invention to provide an object generation method for efficiently accessing a main memory made of an SDRAM.

【０００８】[0008]

【課題を解決するための手段】上記の目的は、ループで
アクセスするデータ領域がキャシュ容量を超え、ベクト
ル化できるデータ依存関係の場合には、ベクトル命令と
同様の処理を、特願平８−２４９５９４号「ループ処理
の並列実行制御に適したレジスタ構成を有するプロセッ
サ」で発明されている多要素レジスタ（ベクトルレジス
タ）とその制御機構を用いて複数のスカラ命令で実現す
るコードを生成することによって達成される。The object of the present invention is to execute the same processing as the vector instruction when the data area accessed in the loop exceeds the cache capacity and has a data dependence that can be vectorized. No. 249,594 “Processor having register configuration suitable for parallel execution control of loop processing” by generating a code realized by a plurality of scalar instructions using a multi-element register (vector register) and its control mechanism invented. Achieved.

【０００９】何故なら、ベクトル処理では少数の配列を
集中的にアクセスするため、従来の方法に比べて、同一
ＲＡＳアドレスで複数要素をアクセスする確率が高くな
るためである。This is because in vector processing, since a small number of arrays are intensively accessed, the probability of accessing a plurality of elements with the same RAS address is higher than in the conventional method.

【００１０】[0010]

【発明の実施の形態】以下、本発明のコンパイラにおけ
る実施例を図を参照しつつ説明する。BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram showing an embodiment of a compiler according to the present invention.

【００１１】まづ、本発明の説明の前に、本発明のコン
パイラが前提とする特願平８−２４９５９４号で示した
アーキテクチャの命令とレジスタの関係を図１１を参照
しつつ簡単に説明する。命令はオペコード１５０と、オ
ペランドフィールド１５１からなる。オペランドは、書
き込みオペランド１つと、２つの読み込みオペランドか
らなり、各オペランドはレジスタ種別フラグ１５２と論
理レジスタ番号１５３からなる。レジスタ種別フラグ１
５２が０の時は、通常のスカラレジスタを表し、レジス
タ種別フラグ１５２が１の時は、ベクトルレジスタとな
る。Before describing the present invention, the relationship between instructions and registers of the architecture disclosed in Japanese Patent Application No. 8-249594, which is premised on the compiler of the present invention, will be briefly described with reference to FIG. . The instruction includes an operation code 150 and an operand field 151. The operand includes one write operand and two read operands. Each operand includes a register type flag 152 and a logical register number 153. Register type flag 1
When 52 is 0, it represents a normal scalar register, and when the register type flag 152 is 1, it is a vector register.

【００１２】以下、ベクトルレジスタ指定時の動作を説
明する。論理レジスタ番号ごとに、対応するライトポイ
ンタ群１５４と、対応するリードポインタ群１５５を持
つ。命令で指定される論理レジスタ番号と、実際の大容
量物理レジスタファイル１７０であるベクトルレジスタ
とのマッピングは、書込側レジスタの場合は、指定され
た論理レジスタ番号１５７に対応するライトポインタに
よって指される要素１５８をさし、読み込み側レジスタ
の場合は、指定された論理レジスタ番号１６０に対応す
るリードポインタによって指される要素１６１を指す。
レジスタの読みだしのあとでは、リードポインタが、及
びレジスタの書き込みの後には、ライトポインタが１５
９、１６２により自動更新され、次にアクセスするとき
には、自動的にとなりの要素のレジスタをアクセスする
ベクトルレジスタ的構成となっている。The operation when the vector register is specified will be described below. Each logical register number has a corresponding write pointer group 154 and a corresponding read pointer group 155. The mapping between the logical register number specified by the instruction and the vector register which is the actual large-capacity physical register file 170 is indicated by the write pointer corresponding to the specified logical register number 157 in the case of the write-side register. In the case of a read-side register, it indicates an element 161 indicated by a read pointer corresponding to a specified logical register number 160.
After reading the register, the read pointer is set, and after writing the register, the write pointer is set to 15
It is automatically updated by 9 and 162, and has the structure of a vector register that automatically accesses the register of the next element at the next access.

【００１３】なお、自動更新の際、レジスタファイルの
最後を示すポインタは、１６９で示すようにレジスタフ
ァイルの最初をさすようにラップアラウンドされる。ま
た、ポインタの初期値設定は、リード・ライトポインタ
の値は同一値とし、即値（またはレジスタ）代入命令で
行う。At the time of automatic updating, the pointer indicating the end of the register file is wrapped around so as to point to the beginning of the register file as indicated by 169. The initial value of the pointer is set by an immediate (or register) assignment instruction, with the read / write pointer values being the same value.

【００１４】図１にコンパイラ全体の構造を示す。図１
のソースプログラム１が、構文解析２によって中間語に
変換され、これを入力として、データがキャッシュまた
は主記憶にあるかを判定し最適なプログラム変換を行う
最適化部３が、中間語４に変換する。そして、コード生
成部５が、対象マシン向けのオブジェクトコード６に変
換する。本発明は３及び５に係わり、オブジェクトコー
ド６の実行効率を向上させるものである。FIG. 1 shows the structure of the entire compiler. FIG.
Is converted into an intermediate language by the syntax analysis 2, and using the input as an input, the optimizing unit 3 which determines whether the data is in the cache or the main memory and performs the optimal program conversion is converted into the intermediate language 4. I do. Then, the code generation unit 5 converts the object code 6 into the object code 6 for the target machine. The present invention relates to 3 and 5, and relates to improving the execution efficiency of the object code 6.

【００１５】図１の最適化部３のうちデータの効率的ア
クセスのための最適化に関わる部分の構造を図２に示
す。図２に入力するソースプログラム１として、図５の
ＦＯＲＴＲＡＮプログラムを例としてあげ説明する。図
５の３０、３１はループを示し、これらは２重ループを
構成している。このようなプログラムに対して図２は以
下のような処理を行い、中間語４に変換する。FIG. 2 shows the structure of a part of the optimizing unit 3 shown in FIG. 1 which is related to optimization for efficient data access. As the source program 1 input in FIG. 2, the FORTRAN program in FIG. 5 will be described as an example. Reference numerals 30 and 31 in FIG. 5 indicate loops, which constitute a double loop. FIG. 2 performs the following processing on such a program to convert it into an intermediate language 4.

【００１６】図２の処理は、ソースプログラム中の最内
側ループを順次処理する。判定１０は当該ループにユー
ザがプログラムに記述した主記憶ターゲットディレクテ
ィブが存在するか判定する。主記憶ターゲットディレク
ティブとは、データがキャッシュに存在する確率が少な
いときに、以下のようにユーザがソースプログラム上の
ループの直前に指定するものである。この場合、後で述
べるような、当該ループのベクトル命令への変換処理１
５へ進む。The processing of FIG. 2 sequentially processes the innermost loop in the source program. The judgment 10 judges whether or not the main storage target directive described by the user in the program exists in the loop. The main storage target directive is specified by the user immediately before the loop on the source program as described below when the probability that data exists in the cache is low. In this case, as described later, a conversion process 1 of the loop into a vector instruction
Go to 5.

【００１７】＊ＯＰＴＩＯＮＭＥＭＴＡＲＧＥＴＤＯ３１Ｉ＝１，Ｌ図５のソースプログラムでは、ＤＯ３１にこのような指
定がないので、対象ループのアクセスデータ量算出処理
１１へ進む。処理１１は、図５のソースプログラムに対
して、内側ループから順にループ内でアクセスするデー
タ量を算出する。すなわち、まず、最内側のループ３１
では、配列Ａ３２、３３とＢ３４がループインデクスＩ
を添字としているため、ループ長のＬ個の要素を必要と
し、配列Ｃ３５は、ループで不変な添字であるのでただ
１つの要素のみをアクセスする。* OPTION MEMTARGET DO 31 I = 1, L In the source program of FIG. 5, since there is no such designation in DO 31, the process proceeds to the access data amount calculation processing 11 of the target loop. The process 11 calculates the data amount to be accessed in the loop from the inner loop to the source program in FIG. That is, first, the innermost loop 31
Then, the arrays A32, 33 and B34 have the loop index I
Is a subscript, L elements of the loop length are required, and the array C35 accesses only one element because it is an invariant subscript in the loop.

【００１８】従って、全体では、２＊Ｌ＋１個の要素数
となり、倍精度データの場合、（２＊Ｌ＋１）＊８バイ
トになる。同様に、ループ３０では、新たに、ループイ
ンデクスＫが１からＭまで変化するため、配列Ａ３２、
３３はループ３１と同様であるが、配列Ｂ３４はＭ＊Ｌ
個、配列Ｃ３５はＭ個となる。従って、全体では、（Ｌ
＋Ｌ＊Ｍ＋Ｍ）＊８バイトのデータ量となる。Accordingly, the total number of elements is 2 * L + 1, and in the case of double precision data, it is (2 * L + 1) * 8 bytes. Similarly, in the loop 30, since the loop index K newly changes from 1 to M, the array A32,
33 is similar to loop 31, but array B34 is M * L
And the number of arrays C35 is M. Accordingly, (L
+ L * M + M) * 8 bytes of data.

【００１９】判定１２では処理１１で求められた対象ル
ープのアクセスデータ容量をもとに、次の３つの処理に
分岐する。At decision 12, the process branches to the following three processes based on the access data capacity of the target loop obtained at process 11.

【００２０】第１はアクセスデータ量がキャッシュ容量
より小さいことが判明した場合、キャッシュにデータが
存在するとの前提で通常のソフトウェアパイプライン処
理を行うため、中間語の変更はしない。First, when it is determined that the amount of access data is smaller than the cache capacity, normal software pipeline processing is performed on the assumption that data exists in the cache, so that the intermediate language is not changed.

【００２１】第２はアクセスデータ量がキャッシュ容量
より大きいことが判明した場合、当該ループがベクトル
化可能否かの判定１３を行い、ベクトル化可能な場合は
当該ループをベクトル中間語への変換処理１５を行う。
ベクトル化できない場合は、通常のソフトウェアパイプ
ライン処理を行うため、中間語の変更はしない。ベクト
ル化否かの判定１３は、ベクトル処理的データ処理がデ
ータ依存関係から可能かどうかをチェックするもので、
公知である田中・岩沢「ベクトル計算機のためのコンパ
イル技術」情報処理，Ｖｏｌ．３１，Ｎｏ．６，ｐｐ．
７３６−７４３、１９９０などに詳しく書かれている。Second, when it is determined that the access data amount is larger than the cache capacity, it is determined whether or not the loop can be vectorized. If the loop can be vectorized, the loop is converted into a vector intermediate language. Perform 15.
If vectorization cannot be performed, normal software pipeline processing is performed, and the intermediate language is not changed. The determination 13 as to whether or not to vectorize is to check whether vector processing-like data processing is possible from the data dependency.
Well-known Tanaka and Iwasawa, "Compilation Techniques for Vector Computers," Information Processing, Vol. 31, No. 6, pp.
736-743, 1990 and the like.

【００２２】第３はアクセスデータ量とキャッシュ容量
の関係が不明な場合である。図５のプログラムの場合、
アクセスデータ量は、（Ｌ＋Ｌ＊Ｍ＋Ｍ）＊８バイトで
あるが、Ｌ，Ｍ等が変数であるため、コンパイル時には
その大きさが不明であるため、この場合に該当する。こ
の時、当該ループがベクトル化可能否かの判定１４を行
い、ベクトル化可能な場合は、１６に示すような中間語
を生成する。すなわち、ループ３１を、アクセスデータ
量がキャッシュ容量より小さい場合の条件３６が成立す
る場合は、キャッシュにデータが存在するとの前提で通
常のソフトウェアパイプライン処理を行うため、もとも
とのループの処理を実行し、アクセスデータ量がキャッ
シュ容量より大きい場合は、文３７〜文４２に示すよう
なベクトル中間語へ変換する。Third, the relationship between the amount of access data and the cache capacity is unknown. In the case of the program in FIG.
The access data amount is (L + L * M + M) * 8 bytes, but this is applicable because L and M are variables and their sizes are unknown at the time of compilation. At this time, it is determined whether or not the loop can be vectorized. If the loop can be vectorized, an intermediate language as shown in 16 is generated. That is, when the condition 36 when the access data amount is smaller than the cache capacity is satisfied, the normal software pipeline processing is performed on the assumption that data exists in the cache. If the access data amount is larger than the cache capacity, the data is converted into a vector intermediate language as shown in sentences 37 to 42.

【００２３】ここでＶＬＥＮＧ３７は、ベクトル処理を
行う処理長を設定するベクトル中間語、ＶＬＤＶＴＥ
ＭＰ１，Ｂ（１：Ｌ，Ｋ）３８はベクトルデータＢ
（Ｉ，Ｋ）Ｉ＝１，２，．．．，Ｌをベクトルレジスタ
ＶＴＥＭＰ１にロードするベクトル中間語、ＶＥＭＤ
ＶＴＥＭＰ２，ＶＴＥＭＰ１，Ｃ（Ｋ，Ｊ）３９はベク
トルレジスタＶＴＥＭＰ１とスカラデータＣ（Ｋ，Ｊ）
をベクトル乗算して、結果をベクトルレジスタＶＴＥＭ
Ｐ２に格納するベクトル中間語、ＶＬＤＶＴＥＭＰ
３，Ａ（１：Ｌ，Ｊ）４０はベクトルデータＡ（Ｉ，
Ｊ）Ｉ＝１，２，．．．，ＬをベクトルレジスタＶＴＥ
ＭＰ３にロードするベクトル中間語、ＶＥＡＤＶＴＥＭ
Ｐ４，ＶＴＥＭＰ２，ＶＴＥＭＰ３４１はベクトルレ
ジスタＶＴＥＭＰ２とベクトルレジスタＶＴＥＭＰ３を
ベクトル加算して、結果をベクトルレジスタＶＴＥＭＰ
４に格納するベクトル中間語、そしてＶＳＴＤＶＴＥ
ＭＰ４，Ａ（１：Ｌ，Ｊ）４２はベクトルレジスタＶＴ
ＥＭＰ４上のデータを、データ領域Ａ（Ｉ，Ｊ）Ｉ＝
１，２，．．．，Ｌに格納するベクトル中間語である。Here, VLENG 37 is a vector intermediate language for setting a processing length for performing vector processing, VLD VTE
MP1, B (1: L, K) 38 are vector data B
(I, K) I = 1, 2,. . . , L into the vector register VTEMP1, VEMD, VEMD
VTEMP2, VTEMP1, C (K, J) 39 are a vector register VTEMP1 and scalar data C (K, J).
, And the result is stored in a vector register VTEM.
Vector intermediate language stored in P2, VLD VTEMP
3, A (1: L, J) 40 is vector data A (I,
J) I = 1, 2,. . . , L to the vector register VTE
VEADVTEM, a vector intermediate language to load into MP3
P4, VTEMP2, and VTEMP3 41 perform vector addition of the vector register VTEMP2 and the vector register VTEMP3, and store the result in the vector register VTEMP.
4, and the intermediate language stored in VST4 and VSTD VTE
MP4, A (1: L, J) 42 are vector registers VT
The data on EMP4 is stored in data area A (I, J) I =
1, 2,. . . , L stored in the vector intermediate language.

【００２４】次に図１のコード生成部３のうち本発明に
関するベクトル処理的コード生成方法を図３、図４に示
す。図３に、図５のベクトル命令３７〜４２が入力され
た場合の動作を説明する。これらのベクトル命令は、ま
ず、命令並べ替え処理２０によってハードウェアに最適
な順序で実行できるように図６のベクトル命令３７〜４
２のように並べ替える（ベクトル命令３９、４０が交換
されている）。命令の並べ替えの方法は、公知の文献マ
イク・ジョンソン「スーパースカラプロセッサ」日経Ｂ
Ｐ出版センター１９９４，ｐｐ．１９０ー１９３に書か
れているリストスケジューリング法を適用すればよい。
なお、ここではメモリ演算ユニットが２個、浮動小数点
演算ユニットが２個、同時実行可能であると仮定してい
る。Next, a vector processing code generation method according to the present invention in the code generation unit 3 of FIG. 1 is shown in FIGS. FIG. 3 illustrates the operation when the vector instructions 37 to 42 in FIG. 5 are input. The vector instructions 37 to 4 shown in FIG. 6 are first executed by the instruction rearranging process 20 so that the vector instructions can be executed in the order most suitable for the hardware.
2 (the vector instructions 39 and 40 are exchanged). The method of rearranging instructions is described in the well-known document Mike Johnson "Super Scalar Processor" Nikkei B
P Publishing Center 1994, pp. The list scheduling method described in 190-193 may be applied.
Here, it is assumed that two memory operation units and two floating point operation units can be executed simultaneously.

【００２５】判定２１は、ループ長が変数か定数かを判
定する。中間語３７よりループ長がＬで変数のため、処
理２２に進む。処理２２はレジスタ構成の選択を行う。
仮にレジスタが５１２要素であった場合、２５６要素の
ベクトルレジスタ２個か、１２８要素のベクトルレジス
タ４組か、６４要素のベクトルレジスタ８組か、３２要
素のベクトルレジスタ１６組かなどを選択する。この
際、後で述べるようにレジスタの要素長をロードレイテ
ンシと同程度のものを選ぶことが重要である。この結
果、ロードレイテンシが５０の場合、ベクトル命令３８
〜４２に対しては、必要なレジスタ数が４であることか
ら、６４要素のベクトルレジスタ８組のレジスタ構成を
選択することになる。In decision 21, it is decided whether the loop length is a variable or a constant. Since the loop length is L and a variable from the intermediate language 37, the process proceeds to processing 22. Process 22 selects a register configuration.
If the number of registers is 512, two vector registers of 256 elements, four sets of vector registers of 128 elements, eight sets of vector registers of 64 elements, sixteen sets of vector registers of 32 elements, and the like are selected. In this case, as described later, it is important to select a register element length that is substantially equal to the load latency. As a result, when the load latency is 50, the vector instruction 38
Since the required number of registers is 4 for .about.42, a register configuration of eight sets of 64-element vector registers is selected.

【００２６】もし、判定２１により、ループ長が下記の
例のように定数の場合は、処理２４でレジスタ容量が許
す限りレジスタ長を選択する。下記の例では、必要なレ
ジスタが４であった場合、１２８要素のベクトルレジス
タ４組の構成を選択する。If the loop length is determined to be a constant as shown in the following example, the register length is selected in process 24 as long as the register capacity permits. In the following example, when the number of necessary registers is 4, a configuration of four sets of 128-element vector registers is selected.

【００２７】ＤＯ３１Ｉ＝１，１０００処理２３は選択したレジスタの要素長に基づくストリッ
プマイニング処理を行う。図６のベクトル中間語３７〜
４２に対しては、レジスタの要素長を６４としたことか
ら、図６の５０〜６８のようにベクトル処理長が６４以
下になるようにプログラムを変換する。これにより２つ
のベクトル中間語列が生成される。すなわち、ベクトル
長がレジスタ長のベクトル中間語列５２〜６０と、ベク
トル長が端数のベクトル命令列６１〜６８である。図６
のベクトル命令列の場合、前者のベクトル命令列はルー
プ長６４、後者のベクトル中間語列はループ長６４未満
である。もともとの、ループ長が変数であっても、前者
のループ長は定数であることに注意を要する。また、も
ともとのループ長が定数の場合、後者のループのベクト
ル長も定数になることも明らかである。DO 31 I = 1,1000 Processing 23 performs strip mining processing based on the element length of the selected register. The vector intermediate language 37 of FIG.
For 42, since the element length of the register is 64, the program is converted so that the vector processing length is 64 or less as shown at 50 to 68 in FIG. As a result, two vector intermediate word strings are generated. That is, the vector length is a vector intermediate word sequence 52 to 60 having a register length, and the vector length is a vector instruction sequence 61 to 68 having a fractional length. FIG.
, The former vector instruction sequence has a loop length of 64, and the latter vector intermediate sequence has a loop length of less than 64. Note that even though the original loop length is a variable, the former loop length is a constant. It is also apparent that when the original loop length is a constant, the vector length of the latter loop is also a constant.

【００２８】図６のベクトル長がレジスタ長のベクトル
中間語列５５〜６０は、ベクトル長が定数なので、ベク
トル中間語に対応するベクトル命令のシミュレーション
によって実行状況を推定する。ベクトル中間語列５６〜
６０に対応するベクトル命令は、ハードウェアが与えら
れれば、実行状況をシミュレーションすることが可能で
あることは明らかである。ロードストア命令が同時実行
可能な演算器が２個、同時実行可能な浮動小数点演算器
が２個あるなら、その実行の様子は、図７のようにな
る。ここでロードレイテンシを５０、演算レイテンシを
３と仮定している。すなわち、ロード命令の結果を使う
命令は５０サイクル遅れて実行が開始でき、演算命令の
結果を使う命令は３サイクル遅れて実行を開始すること
ができる。Since the vector intermediate word strings 55 to 60 having a register length of the vector length in FIG. 6 have a constant vector length, the execution state is estimated by simulating the vector instruction corresponding to the vector intermediate word. Vector intermediate string 56 ~
Obviously, the vector instruction corresponding to 60 can simulate the execution status if hardware is provided. If there are two arithmetic units capable of simultaneously executing the load store instruction and two floating point arithmetic units capable of simultaneously executing the load store instruction, the state of the execution is as shown in FIG. Here, it is assumed that the load latency is 50 and the operation latency is 3. That is, an instruction using the result of the load instruction can start execution with a delay of 50 cycles, and an instruction using the result of the operation instruction can start execution with a delay of three cycles.

【００２９】処理２７では、同時に実行しているベクト
ル命令の組み合わせがかわる時間を求める。図７の例で
は、黒丸の点がその時間であり、区間ａ、ｂ，ｃ，ｄ，
ｅ，ｆの６つに分けられる。In step 27, the time at which the combination of simultaneously executed vector instructions is changed is obtained. In the example of FIG. 7, the black circle points indicate the time, and the sections a, b, c, d, and
e and f are divided into six.

【００３０】処理２８では、処理２７でもとめた変化点
をもとに、変化点の間の区間ごとにベクトル実行のスカ
ラ処理によるコードの実現のためのプログラム変換を行
う。図８に変換後のコードを示す。ＭＳＰＴＲ１３０〜
１３３は、各４つのベクトルレジスタＱ１，Ｑ２，Ｑ
３、Ｑ４の第０要素からの初期オフセットを示してお
り、各レジスタとも６４要素として使っていることを示
している。In process 28, based on the change points obtained in process 27, a program conversion for implementing a code by scalar processing of vector execution is performed for each section between the change points. FIG. 8 shows the converted code. MSPTR130 ~
133 denotes four vector registers Q1, Q2, Q
3, indicates the initial offset from the 0th element of Q4, and indicates that each register is used as 64 elements.

【００３１】区間ａは２つのベクトルロード命令が同時
に実行する区間で、スカラ中間語ＬＤＵ１３６，１３７
を５０回実行する中間語を生成することで同一の処理を
実現できる。ループ処理はスカラ中間語１３４、１３
５、１３８でループが実現できる。ここで、スカラ中間
語１３４は、カウントレジスタＣＴＲに処理ループ回数
を設定するものである。スカラ中間語１３８は、ＣＴＲ
の内容が０以上であれば、ラベルｌａｂｅｌ１に分岐
し、ＣＴＲの内容を１減じるものである。スカラ中間語
ＬＤＵ１３６，１３７は通常のアドレス更新付ロード命
令に対応する中間語である。処理１３４〜１３８を実行
すると、レジスタを参照するとポインタが１つ進む機能
により、ベクトルレジスタの第０要素から第４９要素
に、Ｂ（Ｉ１、Ｋ），Ｂ（Ｉ１＋１、Ｋ），．，Ｂ（Ｉ
１＋４９，Ｋ）のデータが、ベクトルレジスタの第１２
８要素から第１７７要素に、Ａ（Ｉ１、Ｊ），Ａ（Ｉ１
＋１、Ｊ），．，Ａ（Ｉ１＋４９，Ｊ）のデータが格納
される。Section a is a section in which two vector load instructions are simultaneously executed, and is a scalar intermediate language LDU 136, 137.
The same processing can be realized by generating an intermediate language that executes the process 50 times. Loop processing is scalar intermediate language 134,13
The loop can be realized at 5, 138. Here, the scalar intermediate language 134 sets the number of processing loops in the count register CTR. Scalar intermediate language 138 is CTR
If the content of CTR is 0 or more, the process branches to label label1 and decrements the content of CTR by one. The scalar intermediate language LDU 136, 137 is an intermediate language corresponding to a normal load instruction with address update. When the processes 134 to 138 are executed, the pointer is advanced by one when referring to the register, so that B (I1, K), B (I1 + 1, K),. , B (I
1 + 49, K) is stored in the twelfth vector register.
From the 8th element to the 177th element, A (I1, J), A (I1
+1, J),. , A (I1 + 49, J).

【００３２】区間ｂは２つのベクトルロード命令と１つ
のベクトル乗算命令が実行する区間で、スカラ中間語Ｌ
ＤＵ１４１，１４２、スカラ中間語ＦＭＵＬ１４３を３
回実行する中間語を生成することで同一の処理を実現で
きる。ここで、ＦＭＵＬはスカラ乗算命令に対応する中
間語である。この結果、ベクトルレジスタの第５０要素
から第５２要素に、Ｂ（Ｉ１＋５０、Ｋ），Ｂ（Ｉ１＋
５１、Ｋ），Ｂ（Ｉ１＋５２，Ｋ）のデータが、ベクト
ルレジスタの第１７８要素から第１８０要素に、Ａ（Ｉ
１＋５０、Ｊ），Ａ（Ｉ１＋５１、Ｊ），Ａ（Ｉ１＋５
２，Ｊ）のデータが格納され、ベクトルレジスタの第６
４要素から第６６要素に、べクトルレジスタの第０要素
から第２要素の内容とＣ（Ｋ，Ｊ）を乗算した結果が格
納される。The section b is a section in which two vector load instructions and one vector multiplication instruction are executed.
DU141, 142, 3 scalar intermediate FMUL143
The same processing can be realized by generating an intermediate language to be executed twice. Here, FMUL is an intermediate language corresponding to a scalar multiplication instruction. As a result, B (I1 + 50, K) and B (I1 +
51, K) and B (I1 + 52, K) are transferred from the 178th element to the 180th element of the vector register by A (I
1 + 50, J), A (I1 + 51, J), A (I1 + 5
2, J) is stored in the sixth register of the vector register.
The result obtained by multiplying the contents of the 0th element to the 2nd element of the vector register by C (K, J) is stored from the 4th element to the 66th element.

【００３３】区間ｃは２つのベクトルロード命令、１つ
のベクトル乗算命令と１つのベクトル加算命令が実行す
る区間で、スカラ中間語ＬＤＵ１４７，１４８、スカラ
中間語ＦＭＵＬ１４９、スカラ中間語ＦＡＤＤ１５０を
１１回実行する中間語を生成することで同一の処理を実
現できる。ここで、ＦＡＤＤはスカラ加算命令に対応す
る中間語である。この結果、ベクトルレジスタの第５３
要素から第６３要素に、Ｂ（Ｉ１＋５３、Ｋ），Ｂ（Ｉ
１＋５４、Ｋ），．．，Ｂ（Ｉ１＋６３，Ｋ）のデータ
が、ベクトルレジスタの第１８１要素から第１９１要素
に、Ａ（Ｉ１＋５３、Ｊ），Ａ（Ｉ１＋５４、
Ｊ），．．，Ａ（Ｉ１＋６３，Ｊ）のデータが格納さ
れ、ベクトルレジスタの第６７要素から第７７要素に、
べクトルレジスタの第３要素から第１３要素の内容とＣ
（Ｋ，Ｊ）を乗算した結果が格納され、ベクトルレジス
タの第１９２要素から第２０２要素に、べクトルレジス
タの第０要素から第１０要素の内容とべクトルレジスタ
の第１２８要素から第１３８要素の内容を乗算した結果
が格納される。Section c is a section in which two vector load instructions, one vector multiplication instruction and one vector addition instruction are executed, and scalar intermediate languages LDU147 and 148, scalar intermediate language FMUL149 and scalar intermediate language FADD150 are executed 11 times. The same processing can be realized by generating an intermediate language. Here, FADD is an intermediate language corresponding to a scalar addition instruction. As a result, the 53rd of the vector register
From the element to the 63rd element, B (I1 + 53, K), B (I
1 + 54, K),. . , B (I1 + 63, K) are transferred from the 181st element to the 191st element of the vector register by A (I1 + 53, J), A (I1 + 54,
J),. . , A (I1 + 63, J) are stored, and the 67th element to the 77th element of the vector register are
The contents of the third to thirteenth elements of the vector register and C
The result of multiplication by (K, J) is stored, and the contents of the 0th to 10th elements of the vector register and the contents of the 128th element to the 138th element of the vector register are stored from the 192nd element to the 202nd element of the vector register. The result of multiplying the contents is stored.

【００３４】区間ｄは１つのベクトル乗算命令とベクト
ル加算命令とベクトルストア命令が実行する区間で、ス
カラ中間語ＦＭＵＬ１５４、スカラ中間語ＦＡＤＤ１５
５、スカラ中間語ＳＤＵ１５６を５０回実行する中間語
を生成することで同一の処理を実現できる。ここで、Ｓ
ＤＵは通常のアドレス更新付スカラストア命令に対応す
る中間語である。この結果、ベクトルレジスタの第７８
要素から第１２７要素に、べクトルレジスタの第１４要
素から第６３要素の内容とＣ（Ｋ，Ｊ）を乗算した結果
が格納され、ベクトルレジスタの第２０３要素から第２
５２要素に、べクトルレジスタの第１１要素から第６０
要素の内容とべクトルレジスタの第１３９要素から第１
８８要素の内容を乗算した結果が格納、ベクトルレジス
タの第１９２要素から第２４１要素の値が、データ域Ａ
（Ｉ１，Ｊ），Ａ（Ｉ１＋１，Ｊ），．．，Ａ（Ｉ１＋
４９，Ｊ）に書き込まれる。The section d is a section in which one vector multiplication instruction, vector addition instruction, and vector store instruction are executed, and the scalar intermediate language FMUL154 and the scalar intermediate language FADD15 are executed.
5. The same processing can be realized by generating an intermediate language that executes the scalar intermediate language SDU 156 50 times. Where S
DU is an intermediate language corresponding to a normal scalar store instruction with address update. As a result, the 78th of the vector register
The result obtained by multiplying the contents of the 14th to 63rd elements of the vector register by C (K, J) is stored in the 127th element from the element, and the 2nd element is stored in the second element from the 203rd element of the vector register.
The 52nd element is the 11th element to the 60th element of the vector register.
Element contents and first to 139th elements of the vector register
The result of multiplying the contents of the 88 elements is stored, and the values of the 192nd element to the 241st element of the vector register are stored in the data area A
(I1, J), A (I1 + 1, J),. . , A (I1 +
49, J).

【００３５】区間ｅは１つのベクトル加算命令とベクト
ルストア命令が実行する区間で、スカラ中間語ＦＡＤＤ
１６０、スカラ中間語ＳＤＵ１６１を３回実行する中間
語を生成することで同一の処理を実現できる。この結
果、ベクトルレジスタの第２５３要素から第２５５要素
に、べクトルレジスタの第６１要素から第６３要素の内
容とべクトルレジスタの第１８９要素から第１９１要素
の内容を乗算した結果が格納、ベクトルレジスタの第２
４２要素から第２４４要素の値が、データ域Ａ（Ｉ１＋
５０，Ｊ），Ａ（Ｉ１＋５１，Ｊ），Ａ（Ｉ＋５２，
Ｊ）に書き込まれる。Section e is a section where one vector addition instruction and one vector store instruction are executed, and is a scalar intermediate language FADD.
160, the same processing can be realized by generating an intermediate language that executes the scalar intermediate language SDU 161 three times. As a result, the result obtained by multiplying the contents of the 61st to 63rd elements of the vector register and the contents of the 189th element to the 191st element of the vector register is stored in the 253rd element to the 255th element of the vector register. Second
The value of the 42nd element to the 244th element is the data area A (I1 +
50, J), A (I1 + 51, J), A (I + 52,
J).

【００３６】区間ｆは１つベクトルストア命令が実行す
る区間で、スカラ中間語ＳＤＵ１６５を１１回実行する
中間語を生成することで同一の処理を実現できる。この
結果、ベクトルレジスタの第２４５要素から第２５５要
素の値が、データ域Ａ（Ｉ１＋５３，Ｊ），Ａ（Ｉ１＋
５４，Ｊ），．．，Ａ（Ｉ1＋６３，Ｊ）に書き込まれ
る。The section f is a section where one vector store instruction is executed, and the same processing can be realized by generating an intermediate language that executes the scalar intermediate language SDU 165 11 times. As a result, the values of the 245th element to the 255th element of the vector register are stored in the data areas A (I1 + 53, J) and A (I1 +
54, J),. . , A (I1 + 63, J).

【００３７】次に、図６のベクトル長が端数のベクトル
中間語列６４〜６８は、ベクトル長が変数なので、処理
２６の対応するベクトル命令のチェインスロットへの配
置を行ってベクトル命令の実行をシミュレーションす
る。チェインスロットとは、図９に示すように、並列実
行可能なベクトル命令を集めてグループ化したものであ
る。ロードストア命令が同時実行可能な演算器が２個、
同時実行可能な浮動小数点演算器が２個あるなら、４個
のスロット１１０〜１１３が時間軸方向に命令が埋まる
まで複数個、時間軸方向に並んでいる。対応するベクト
ル命令は２６に示した条件でチェインスロットに配置さ
れる。ベクトル中間語列６４〜６８の場合、図９のよう
に配置される。時間スロット１（１１４）は、並列実行
可能なロード命令のみしかないため、当該チェインスロ
ットの実行時間はＩ２−Ｉ１＋１サイクルであり、時間
スロット２（１１５）の実行時間は３つの命令がフロー
依存関係にあるため、演算レイテンシを３とした場合、
Ｉ２−Ｉ１＋１＋２＊３サイクルとなる。Next, in the vector intermediate word strings 64 to 68 having a fractional vector length shown in FIG. 6, since the vector length is a variable, the corresponding vector instruction in the processing 26 is arranged in the chain slot to execute the vector instruction. Simulate. As shown in FIG. 9, a chain slot is obtained by collecting and grouping vector instructions that can be executed in parallel. Two arithmetic units that can execute load store instructions simultaneously,
If there are two floating point arithmetic units that can be executed simultaneously, a plurality of four slots 110 to 113 are arranged in the time axis direction until instructions are filled in the time axis direction. The corresponding vector instruction is placed in the chain slot under the condition shown at 26. In the case of the vector intermediate word strings 64-68, they are arranged as shown in FIG. Since the time slot 1 (114) has only load instructions that can be executed in parallel, the execution time of the chain slot is I2-I1 + 1 cycles, and the execution time of the time slot 2 (115) is that three instructions have a flow dependency. Therefore, if the operation latency is 3,
I2−I1 + 1 + 2 * 3 cycles.

【００３８】処理２７では、各時間スロット毎に、同時
に実行しているベクトル命令の組み合わせがかわる時間
を求める。図６のベクトル中間語列６４〜６８では、黒
丸の点がその時間であり、区間ａ、ｂ，ｃ，ｄ，ｅ，ｆ
の６つに分けられる。In step 27, the time at which the combination of simultaneously executing vector instructions is changed for each time slot is determined. In the vector intermediate word strings 64 to 68 in FIG. 6, the points of black circles indicate the time, and the sections a, b, c, d, e, and f
Is divided into six.

【００３９】処理２８では、処理２７でもとめた変化点
をもとに、変化点の間の区間ごとにベクトル実行のスカ
ラ処理によるコード生成のためのプログラム変換を行
う。ベクトル中間語列５６〜６０での場合と同様にこの
変換を実施すると、その結果、図１０のスカラ中間語に
よるベクトル処理が実現できる。In process 28, based on the change points obtained in process 27, program conversion for code generation by scalar processing of vector execution is performed for each section between the change points. If this conversion is performed in the same manner as in the case of the vector intermediate language strings 56 to 60, as a result, the vector processing by the scalar intermediate language of FIG. 10 can be realized.

【００４０】ここで、ベクトル長が変数の場合、ベクト
ル命令の並列実行状況をチェインスロットという概念で
推定する事の理由、及び、ベクトルロード命令とフロー
依存の関係にあるベクトル命令を別のチェインスロット
に配置する理由を述べる。ベクトル長が変数の場合、ベ
クトル中間語列６４〜６８に対する最適な並列実行状況
は、図９に推定するものとは異なっている。仮にループ
長が６４の場合で、ロードレイテンシが５０サイクル、
演算レイテンシが３の場合、以前に説明したように図７
のように実行するのが最適である。しかし、この場合
に、最適な実行状況を推定すると、並列実行状況の組み
合わせの数は、命令数、命令実行レイテンシ、命令出現
パターンとループ長とに依存して多項式的に多くなり実
質的にコード生成は不可能である。このため、ループ長
が大きいときに、主に重なっている命令をグループ化し
たチェインスロットという概念を用いて組み合わせの数
を減らす。Here, when the vector length is a variable, the reason for estimating the parallel execution status of the vector instruction by the concept of a chain slot, and the fact that a vector instruction having a flow dependency with a vector load instruction is separated into another chain slot. The reason for arranging is described below. When the vector length is a variable, the optimal parallel execution status for the vector intermediate word strings 64-68 is different from that estimated in FIG. If the loop length is 64, the load latency is 50 cycles,
When the operation latency is 3, as described above, FIG.
It is best to execute as follows. However, in this case, when the optimum execution situation is estimated, the number of combinations of the parallel execution situations becomes polynomially large depending on the number of instructions, the instruction execution latency, the instruction appearance pattern and the loop length. Generation is not possible. For this reason, when the loop length is large, the number of combinations is reduced using the concept of a chain slot in which mainly overlapping instructions are grouped.

【００４１】すなわち、一般に演算レイテンシは小さ
く、主記憶レイテンシは大きいので、演算命令とその結
果を使用する命令は同一チェインスロットかそれ以後
に、ロード命令とそれを使用する演算命令は別チェイン
スロットに配置することで実現する。さらに、レジスタ
構成の選択の際に、主記憶レイテンシに近いものを選ぶ
事により、ロード命令とそれを使う演算命令の実行を最
適に近づけることができる。ループ長が大きく、何回か
のストリップマイニングを行うのであれば、最終回のス
トリップマイニング以外は理想的になる。That is, since the operation latency is generally low and the main storage latency is high, the operation instruction and the instruction using the result are placed in the same chain slot or thereafter, and the load instruction and the operation instruction using the same are placed in separate chain slots. It is realized by placing. Further, when selecting a register configuration, by selecting a register configuration close to the main storage latency, the execution of a load instruction and an operation instruction using the same can be optimized. If the loop length is large and strip mining is performed several times, it is ideal except for the final strip mining.

【００４２】なお、説明の簡単化のために、図１０のコ
ードは、正確にはループ長が６以下の場合には正しく動
くコードとなっていない。例えば、実際のループ長が１
の時には、ＶＥＭＤ５８，ＶＥＡＤ５９，ＶＳＴＤ６０
のベクトル命令は同時には実行されず、区間の数は３と
なる。なお、正しく動作するコードとするためには、区
間ｂ，ｃ，ｅ，ｆのコード部分にループ長による処理を
スキップする分岐をいれる必要がある。For the sake of simplicity, the code shown in FIG. 10 does not operate correctly when the loop length is 6 or less. For example, if the actual loop length is 1
At the time of, VEMD58, VEAD59, VSTD60
Are not executed at the same time, and the number of sections is three. In order to make the code operate correctly, it is necessary to add a branch for skipping the processing based on the loop length in the code portions of the sections b, c, e, and f.

【００４３】[0043]

【発明の効果】本発明のオブジェクト生成方法により、
ループでアクセスするデータ領域がキャシュ容量を超
え、ベクトル化できるデータ依存関係の場合には、ベク
トル命令と同様の処理を、ベクトルレジスタを用いて複
数のスカラ命令で実現するコードが生成できる。このオ
ブジェクト列を実行すると、従来のループ内の配列参照
を平均的にアクセスする方法に比べて、少数の配列を集
中的にアクセスするため、同一ＲＡＳアドレスで複数要
素をアクセスする確率が高くなり、データの効率的アク
セスが実現される。図１２にその効果の例を示す。この
図は、ＬＵ分解ソースプログラムのカーネルループに対
する生成コードの、実行効率を示している。丸印は、デ
ータがキャッシュにあるとして生成したコードによる性
能、黒丸印は、データが主記憶にあるとして、従来のソ
フトウェアパイプラインによるスケジューリングによる
性能、三角印がデータが主記憶にあるとしてベクトル処
理と同様の処理をスカラ命令で実現したコードによる性
能を示す。キャッシュターゲットのコードは配列が小さ
くキャッシュに存在する間は、高い実行性能を示す。主
記憶レイテンシを考慮してスケジューリングした従来コ
ードでは、ＳＤＲＡＭのＲＡＳミスが発生して全般に性
能が高くない。データが主記憶にあるとしてベクトル処
理と同様の処理をスカラ命令で実現したコードでは、Ｒ
ＡＳミスを少なくすることができるため、配列の大きな
領域で効率的な実行となる。According to the object generating method of the present invention,
In the case where the data area accessed in the loop exceeds the cache capacity and has a data dependency that can be vectorized, a code that realizes the same processing as the vector instruction by a plurality of scalar instructions using a vector register can be generated. When this object sequence is executed, the probability of accessing a plurality of elements with the same RAS address increases, since a small number of arrays are intensively accessed compared to the conventional method of accessing array references in a loop on average. Efficient access to data is achieved. FIG. 12 shows an example of the effect. This figure shows the execution efficiency of the generated code for the kernel loop of the LU decomposition source program. The circles indicate the performance based on the code generated assuming that the data is in the cache. The black circles indicate the performance based on the scheduling by the conventional software pipeline assuming that the data is in the main memory. This shows the performance of the code that implements the same processing as in the scalar instruction. The cache target code shows high execution performance while the array is small and exists in the cache. In the conventional code scheduled in consideration of the main storage latency, the RAS miss of the SDRAM occurs and the performance is not generally high. In a code that realizes the same processing as vector processing by a scalar instruction assuming that data is in main memory, R
Since AS misses can be reduced, efficient execution can be achieved in a large area of the array.

【００４４】従って、本発明で記述した、アクセスデー
タ領域がキャシュ容量以下の場合は、従来のキャッシュ
をターゲットとしたループスケジューリングコード、ア
クセスデータ領域がキャシュ容量を越えた場合は、ベク
トル処理的アクセスをスカラ命令で実現するコードを生
成することにより、図中の黒太線のように常に効率的な
実行性能が実現できる。Therefore, when the access data area described in the present invention is smaller than the cache capacity, the conventional loop scheduling code targeting the cache is used. When the access data area exceeds the cache capacity, the vector processing access is performed. By generating the code realized by the scalar instruction, efficient execution performance can always be realized as shown by the thick black line in the figure.

[Brief description of the drawings]

【図１】コンパイラ全体の構成図。FIG. 1 is a configuration diagram of an entire compiler.

【図２】コンパイラにおけるデータの効率的アクセスに
関する最適化部。FIG. 2 shows an optimization unit for efficient access of data in a compiler.

【図３】コンパイラにおけるベクトル的処理コード生成
方法１。FIG. 3 shows a vector processing code generation method 1 in a compiler.

【図４】コンパイラにおけるベクトル的処理コード生成
方法２。FIG. 4 shows a vector processing code generation method 2 in a compiler.

【図５】ソースプログラムとそれに対するデータの効率
的アクセスに関する変換後の例。FIG. 5 is an example after conversion relating to efficient access of a source program and data to the source program.

【図６】ベクトル中間語に対してストリップマイニング
を適用した例。FIG. 6 is an example in which strip mining is applied to a vector intermediate language.

【図７】ベクトル処理長が定数の場合のベクトル命令の
シミュレーションの結果の説明図。FIG. 7 is an explanatory diagram of a result of a simulation of a vector instruction when a vector processing length is a constant.

【図８】ベクトル処理長が定数の場合のスカラ命令列に
よるベクトル処理実現のコード列例。FIG. 8 is an example of a code sequence for implementing vector processing by a scalar instruction sequence when the vector processing length is a constant.

【図９】ベクトル処理長が変数の場合のベクトル命令の
シミュレーションの結果の説明図。FIG. 9 is an explanatory diagram of a simulation result of a vector instruction when a vector processing length is a variable.

【図１０】ベクトル処理長が変数の場合のスカラ命令列
によるベクトル処理実現のコード列例。FIG. 10 is an example of a code sequence for implementing vector processing by a scalar instruction sequence when the vector processing length is a variable.

【図１１】前提とするアーキテクチャの命令とレジスタ
構成の説明図。FIG. 11 is an explanatory diagram of an instruction and a register configuration of a prerequisite architecture.

【図１２】効果の説明図。FIG. 12 is an explanatory diagram of an effect.

[Explanation of symbols]

１ソースプログラム２構文解析部３最適化部４中間語５コード生成部６オブジェクトコード１５０オペコード１７０物理レジスタファイル。 DESCRIPTION OF SYMBOLS 1 Source program 2 Syntax analysis part 3 Optimization part 4 Intermediate language 5 Code generation part 6 Object code 150 Opcode 170 Physical register file.

Claims

[Claims]

1. A method for compiling a source program into an object program, wherein a data area accessed by the loop part exceeds a cache capacity and has a data dependency that can be vectorized with respect to a loop part of the source program. In the object generation method, the same operation order as in the vector processing is realized by a scalar instruction.

2. A method for compiling a source program into an object program, wherein it is impossible to determine whether a data area accessed by the loop part exceeds a cache capacity for a loop part of the source program, and In the case of a data dependence relationship that can be vectorized, the code for calculating the size of the data access area and the size of the data access area and the cache capacity are compared with each other. An object generation method, comprising: generating a code realized by a scalar instruction and, when the cache capacity is smaller, a code of a scalar instruction in the same operation order as specified by the loop.

3. A compiler which reads a source program, performs syntax analysis, converts the intermediate program into an intermediate language, optimizes the intermediate language, and generates a code. For a loop portion, if the data area accessed by the loop portion exceeds the cache capacity and has a data dependency that can be vectorized, the intermediate portion of the loop portion is vectorized and converted into a vector intermediate language,
An object generation method, wherein the code generation unit converts the vector intermediate language into a scalar intermediate language in the same operation order as the execution of the vector intermediate language.

4. A compiler which reads a source program, parses the language, converts the language into an intermediate language, optimizes the intermediate language, and generates a code. If the data area accessed by the loop part cannot exceed the cache capacity and the loop part has a data dependency that can be vectorized, an intermediate part for calculating the size of the data access area is used. A word, an intermediate word for comparing the size of the data area with the cache capacity, and a vector intermediate word obtained by vectorizing the intermediate word of the loop part when the cache area is larger than the cache area. An intermediate language for executing the intermediate language of the loop is converted, and the code generator converts the intermediate language to a scalar intermediate language in the same operation order as the execution of the vector intermediate language. Object generation method which is characterized in that.

5. The object generating method according to claim 3, wherein at the time of converting the vector intermediate language into the scalar intermediate language, a simulation of an execution state of a vector instruction sequence corresponding to the vector intermediate language sequence is performed. Detecting a section where the combination of vector instruction sequences during parallel execution does not change, and converting a plurality of scalar intermediate words corresponding to the concurrently executed vector instructions into intermediate words that are repeatedly executed by loop processing for the number of cycles of the section. An object generation method characterized by the following.

6. The object generating method according to claim 4, wherein at the time of converting the vector intermediate language into the scalar intermediate language, a simulation of an execution status of a vector instruction sequence corresponding to the vector intermediate language sequence is performed. Detecting a section where the combination of vector instruction sequences during parallel execution does not change, and converting a plurality of scalar intermediate words corresponding to the concurrently executed vector instructions into intermediate words that are repeatedly executed by loop processing for the number of cycles of the section. An object generation method characterized by the following.

7. The object generating method according to claim 5, wherein in simulating the execution status of the vector instruction sequence corresponding to the vector intermediate word sequence, if the number of vector processes is a variable, the vector intermediate word sequence corresponds to the variable. An object generation method, wherein a vector load instruction and a vector instruction using a result stored in a vector register of the vector load instruction in a vector instruction sequence are scheduled so as not to be simultaneously executed.

8. The object generating method according to claim 6, wherein when simulating the execution status of the vector instruction sequence corresponding to the vector intermediate word sequence, if the number of vector processes is a variable, the vector intermediate word sequence corresponds to the variable. An object generation method, wherein a vector load instruction and a vector instruction using a result stored in a vector register of the vector load instruction in a vector instruction sequence are scheduled so as not to be simultaneously executed.

9. In the object generating method according to claim 3, when converting the vector intermediate language into the scalar intermediate language, if the vector processing length is a variable, the number of registers and the register length are the vector register configuration. Wherein the register length is set to a size close to the latency of the load vector instruction.

10. The object generating method according to claim 3,
When converting the vector intermediate language into the scalar intermediate language, if the vector processing length is a variable, in selecting the number of registers and the register length, which are the configuration of the vector register, set the register length to a value close to the latency of the load vector instruction. An object generation method characterized in that the object generation method is set to: