JP3653841B2

JP3653841B2 - Problem area division / allocation method

Info

Publication number: JP3653841B2
Application number: JP00895796A
Authority: JP
Inventors: Yoshiko Yasuda (保田淑子); Teruo Tanaka (田中輝雄)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-01-23
Filing date: 1996-01-23
Publication date: 2005-06-02
Anticipated expiration: 2016-01-23
Also published as: JPH09198358A

Description

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a parallel computer that transfers data between a plurality of processors via a network.
[0002]
[Prior art]
Parallel computers connect multiple processors through a network, and as they have developed, numerical simulations of natural phenomena are coming into practical use. One of the basic principles governing natural (physical) phenomena is proximity action. When a proximity action problem is solved numerically, the computation at each grid point uses only the values at neighboring grid points. On a parallel computer, the physical space (the problem area) is divided and each divided data area is assigned to a processor. In that case, data transfer occurs only between processors that hold adjacent areas.
[0003]
For example, suppose processors arranged in a two-dimensional grid are identified by coordinates PE[X, Y]. PE[X, Y] then carries out its computation using the data stored in PE[X+1, Y], PE[X−1, Y], PE[X, Y+1], and PE[X, Y−1]. The networks best suited to this adjacent transfer are the lattice (mesh) connection and the torus connection, in which the ends of the lattice are joined to each other. The lattice torus connection is the simplest network aimed at fast adjacent transfer. As a result, its diameter (the distance between the two most distant processors) is large. Transfer patterns other than adjacent transfer therefore cause contention for transfer paths, which lowers performance.
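The following minimal Python sketch makes the adjacent-transfer pattern and the network diameter concrete for a two-dimensional torus. The function names are hypothetical. Coordinates are 1-based, as in PE[X, Y]. The diameter formula assumes shortest-path routing with wraparound links in both dimensions.

```python
def torus_neighbors(x: int, y: int, width: int, height: int) -> list:
    """Return the four neighbors of PE[x, y] on a width x height 2D torus.

    Coordinates are 1-based as in PE[X, Y]; the wraparound links make the
    first and last PE of every row and column adjacent.
    """
    def wrap(v: int, size: int) -> int:
        # 1-based modular arithmetic: 0 -> size, size + 1 -> 1
        return (v - 1) % size + 1

    return [
        (wrap(x + 1, width), y),   # PE[X+1, Y]
        (wrap(x - 1, width), y),   # PE[X-1, Y]
        (x, wrap(y + 1, height)),  # PE[X, Y+1]
        (x, wrap(y - 1, height)),  # PE[X, Y-1]
    ]


def torus_diameter(width: int, height: int) -> int:
    """Hop distance between the two most distant PEs (shortest paths, wraparound allowed)."""
    return width // 2 + height // 2
```

On an 8 × 8 torus, every adjacent transfer takes one hop. The most distant pair of PEs, however, is 8 hops apart.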
[0004]
The hypercube network and the N-dimensional crossbar network have been proposed as alternatives. Both aim to reduce the network diameter and to deliver good performance even for transfer patterns other than adjacent transfer.
[0005]
The hypercube network factors $2^m$ processors into $m$ factors, $2 \times 2 \times \cdots \times 2$. It arranges the processors in an $m$-dimensional lattice space in which each factor is the number of lattice points along one side. Directly connecting these processors forms the data transfer paths.
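A hypercube link joins two processors whose lattice coordinates differ in exactly one of the m binary dimensions. The sketch below expresses this with bit operations. The helper names are hypothetical, and the sketch uses 0-based processor indices rather than the 1-based physical numbers used later in this document. The diameter of a hypercube of 2^m processors is m.

```python
def hypercube_neighbors(pe: int, m: int) -> list:
    """Directly linked neighbors of processor `pe` in a hypercube of 2**m processors.

    The 0-based index `pe` is read as an m-bit coordinate vector on the
    2 x 2 x ... x 2 lattice; flipping one bit moves one step along one dimension.
    """
    if not 0 <= pe < 2 ** m:
        raise ValueError("pe must be in range(2**m)")
    return [pe ^ (1 << d) for d in range(m)]


def hypercube_distance(a: int, b: int) -> int:
    """Shortest hop count: the number of coordinate bits in which a and b differ."""
    return bin(a ^ b).count("1")
```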
[0006]
The N-dimensional crossbar network is disclosed in JP-A-1-267763. It factors the number of processors as $n = n_1 \times n_2 \times n_3 \times \cdots \times n_m$ and arranges the processors in an $m$-dimensional lattice space in which each factor is the number of lattice points along one side. Each side is then connected through a partial network made of crossbar switches, which forms the data transfer paths.
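In the N-dimensional crossbar network, one crossbar lets a processor reach every processor that shares all but one of its lattice coordinates. The following sketch rests on three assumptions: 0-based indices, a mixed-radix coordinate order in which the first factor varies fastest, and hypothetical function names. It does not model the switches themselves.

```python
from math import prod


def to_coords(pe: int, dims: tuple) -> tuple:
    """0-based processor index -> lattice coordinates for n = n1 x n2 x ... x nm."""
    coords = []
    for size in dims:
        coords.append(pe % size)
        pe //= size
    return tuple(coords)


def one_hop_reachable(pe: int, dims: tuple) -> set:
    """Processors reachable through a single crossbar: those whose
    coordinates differ from `pe` in exactly one dimension."""
    src = to_coords(pe, dims)
    return {
        q
        for q in range(prod(dims))
        if sum(a != b for a, b in zip(src, to_coords(q, dims))) == 1
    }
```

With dims = (8, 8), each processor reaches 14 others in one crossbar hop (7 along each dimension). With dims = (2, 2, 2), each one-hop set has exactly three members, the same as the neighbors in a 3-dimensional hypercube.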
[0007]
Both the hypercube network and the N-dimensional crossbar network include the lattice torus connection. Adjacent transfers on them can therefore move data at peak performance without path contention. These networks also have more paths than the lattice torus connection, so transfer patterns other than adjacent transfer suffer less path contention than they would on a lattice torus connection.
[0008]
Conventionally, users writing parallel applications for proximity action problems have sought fast interprocessor communication by taking into account the number of processors and the configuration of the parallel computer. They have divided the problem area and assigned the pieces to the processors so as to get the highest data transfer performance, that is, so that proximity action equals adjacent transfer.
[0009]
FIG. 9 shows a conventional division/allocation method, in which 9-2 is the parallel computer and 9-1 is the target problem area. The parallel computer 9-2 has a two-dimensional configuration, and each processor has a processor number given as two-dimensional coordinates. Conventionally, as shown in FIG. 9, the user divides the target problem area 9-1 into the same dimensional configuration as the parallel computer 9-2 and assigns each divided area directly to the corresponding processor. If the parallel computer has an l × m two-dimensional configuration, the problem area is divided two-dimensionally into l × m pieces and assigned to the processors. If the parallel computer has a three-dimensional configuration with l × m × n processors, the problem area is divided three-dimensionally into l × m × n pieces, and the divided data areas are assigned directly to the corresponding processors. In short, the user divided the problem area with the number of processors and the configuration of the parallel computer in mind, allocated the divided data areas to the processors, and mapped proximity action directly onto adjacent transfer.
[0010]
[Problems to be solved by the invention]
The conventional division and allocation methods divide the problem area in the physical space so that it has the same dimensional configuration as the parallel computer, and then assign it to the processors. Consequently, even when the parallel computer's network has more transfer paths than the lattice torus connection, only the paths needed for adjacent transfer are used.
[0011]
An actual physical space, however, can have various numbers of dimensions: one, two, three, four, and so on. In general, the dimensional configuration of a parallel computer cannot be changed dynamically. When the dimensional configuration of the problem area differs from that of the parallel computer, the user's effort increases. In addition, depending on how the area is divided and how the divided data areas are allocated, transfers other than adjacent transfers occur, and the resulting path contention degrades communication performance.
[0012]
An object of the present invention is to provide a problem area division method, a method of allocating the divided areas to the processors, and a library implementing both. These methods are to achieve three things:
- make efficient use of a network that includes the lattice torus connection;
- let the user picture the parallel computer as having the same dimensional configuration as the target proximity action problem area;
- achieve communication performance equivalent to adjacent transfer, without path contention, even when data for the proximity action problem is transferred according to that picture.
[0013]
[Means for Solving the Problems]
To achieve the above object, the present invention applies to a parallel computer that has a network including the lattice torus connection and whose number of processors N can be expressed as the n-th power of a prime x (x ≥ 2). For such a computer, the invention provides:
(1) means for determining the number of dimensions m of the problem area, where m does not exceed n;
(2) means for selecting area division method candidates. Let the numbers of lattice points in the respective dimensions of the physical space be $n_1, n_2, n_3, \ldots, n_m$. The candidates are those that satisfy
[0014]
[Equation 3]
$$\log_x N \ge \log_x (n_1 \times n_2 \times n_3 \times \cdots \times n_m) \qquad \text{(Formula 1)}$$
where each of $n_1, n_2, n_3, \ldots, n_m$ is a power of $x$;
(3) means for determining, from among the candidates, the area division method to be used; and
(4) means for allocating each divided area to the processor whose physical number equals the one-dimensionally expressed number of that divided area.
With these means, the proximity action problem area can be divided freely, without matching its dimensions to the dimensional configuration of the parallel computer. Even when data is transferred between the divided areas, high-speed transfer is achieved without contention for transfer paths in the network.
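Because N and every n_i are powers of x, Formula 1 reduces to a condition on exponents. The following worked step is a restatement that assumes n_i = x^{k_i} with integer k_i ≥ 1, so that every dimension is actually divided. This assumption is consistent with the restriction m ≤ n in means (1).

$$
\log_x N = n, \qquad
\log_x\Bigl(\prod_{i=1}^{m} n_i\Bigr) = \sum_{i=1}^{m} \log_x x^{k_i} = \sum_{i=1}^{m} k_i ,
$$

$$
\text{Formula 1} \iff \sum_{i=1}^{m} k_i \le n, \qquad \text{hence} \qquad m \le \sum_{i=1}^{m} k_i \le n .
$$

Equality, $\sum_i k_i = n$, is the case in which the number of divided areas equals the number of processors N.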
[0015]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, a problem area division / allocation method according to the present invention will be described with reference to the drawings.
[0016]
The present invention is applicable to a parallel computer with two properties. First, it has a network that includes the lattice torus connection, such as an N-dimensional crossbar network, a hypercube network, a full crossbar network, or a multistage network. Second, its number of processors N can be expressed as the n-th power of a prime x (x ≥ 2). The network is assumed to have full-duplex paths and to use fixed routing. The processors have physical numbers 1 to N.
[0017]
FIG. 1 shows the overall configuration of a computer library provided with the problem area division / allocation method of the present invention. In the figure, 1-1 is a problem area dimension determining means, 1-2 is an area dividing method candidate selecting means, 1-3 is a dividing method determining means, and 1-4 is a divided area assigning means.
[0018]
The problem area dimension determining means 1-1 determines the number of dimensions of the problem area in the physical space. The area dividing method candidate selecting means 1-2 then lists candidate ways of dividing the problem area, based on the number of dimensions and the number of processors. The dividing method determining means 1-3 selects one division method from the candidates and divides the problem area accordingly. Finally, the divided area assigning means 1-4 assigns the divided areas to the processors.
[0019]
FIG. 2 shows the processing performed by each means. The process applies when the number N of processors to which the problem area is to be assigned can be expressed as the n-th power of a prime x (x ≥ 2) (2-1). The problem area dimension determining means 1-1 takes as input the number of dimensions m of the problem area in the physical space (2-2). The number of dimensions m must not exceed n (2-3). If m exceeds n, the input is repeated until the condition is satisfied.
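Steps 2-1 to 2-3 amount to finding the exponent n with N = x^n and rejecting any m outside 1..n. The minimal sketch below assumes that the caller supplies a prime x; primality is not checked. The function names are hypothetical, and an exception stands in for the re-input loop.

```python
def prime_power_exponent(N: int, x: int) -> int:
    """Return n such that N == x**n (step 2-1); raise if N is not a power of x."""
    if x < 2 or N < 1:
        raise ValueError("need x >= 2 and N >= 1")
    n = 0
    while N % x == 0:
        N //= x
        n += 1
    if N != 1:
        raise ValueError("N is not a power of x")
    return n


def check_dimension(m: int, N: int, x: int) -> int:
    """Accept the problem-area dimension m only if 1 <= m <= n (steps 2-2, 2-3)."""
    n = prime_power_exponent(N, x)
    if not 1 <= m <= n:
        # In the flowchart the user is asked to re-enter m instead.
        raise ValueError(f"m must be between 1 and {n}; please re-enter")
    return m
```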
[0020]
Next, the area dividing method candidate selecting means 1-2 looks at the numbers of lattice points in the respective dimensions of the problem area, $n_1, n_2, n_3, \ldots, n_m$. It finds the combinations that satisfy
[0021]
[Expression 4]
$$\log_x N \ge \log_x (n_1 \times n_2 \times n_3 \times \cdots \times n_m) \qquad \text{(Formula 1)}$$
where each of $n_1, n_2, n_3, \ldots, n_m$ is a power of $x$ (2-4).
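Step 2-4 can be carried out by enumerating exponent tuples. This sketch assumes, as in the examples in this document, that every n_i equals x^{k_i} with k_i ≥ 1. The function name and the `exact` flag are hypothetical additions for illustration; `exact` keeps only divisions whose number of areas equals N.

```python
from itertools import product


def division_candidates(N: int, x: int, m: int, exact: bool = False) -> list:
    """Enumerate (n1, ..., nm) satisfying Formula 1 with each n_i a power of x.

    Assumptions: N == x**n for some n, and each n_i == x**k_i with k_i >= 1
    (every dimension is actually divided). With exact=True only divisions
    whose number of areas equals N are kept.
    """
    n = 0
    while x ** (n + 1) <= N:  # n = log_x N when N is a power of x
        n += 1
    candidates = []
    for ks in product(range(1, n + 1), repeat=m):
        total = sum(ks)  # log_x(n1 * n2 * ... * nm)
        if total <= n and (not exact or total == n):
            candidates.append(tuple(x ** k for k in ks))
    return candidates
```

For N = 64, x = 2, m = 3, the function yields 20 tuples in total, of which 10 use all 64 processors. For N = 2⁹, the list includes (8, 8, 8) and (128, 2, 2).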
[0022]
The dividing method determining means 1-3 selects the most suitable combination from those found by the area dividing method candidate selecting means 1-2 (2-5) and adopts it as the division method for the problem. The problem area is then divided according to that method (2-6).
[0023]
Finally, the divided area assigning means 1-4 assigns each divided area to the processor whose physical number equals the one-dimensionally expressed number of that divided area (2-7).
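Step 2-7 relies on a one-dimensional numbering of the divided areas. The sketch below assumes the mixed-radix order used in the examples in this document: the first dimension varies fastest and all coordinates are 1-based. The helper names are hypothetical.

```python
from itertools import product


def area_number(coords: tuple, sizes: tuple) -> int:
    """1-based one-dimensional number of a divided area.

    coords = (X, Y, Z, ...) are 1-based; sizes = (n1, n2, n3, ...).
    Num = X + n1*(Y-1) + n1*n2*(Z-1) + ...
    """
    num, stride = 1, 1
    for c, size in zip(coords, sizes):
        if not 1 <= c <= size:
            raise ValueError("coordinate out of range")
        num += (c - 1) * stride
        stride *= size
    return num


def assign_areas(sizes: tuple) -> dict:
    """Map each divided area to the processor with the same physical number (step 2-7)."""
    return {
        coords: area_number(coords, sizes)
        for coords in product(*(range(1, s + 1) for s in sizes))
    }
```

The numbering is a bijection onto 1..n₁n₂⋯n_m. Assigning area Num to processor Num therefore never places two areas on the same processor.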
[0024]
FIG. 3 shows an example in which the problem area is divided three-dimensionally and the divided areas are assigned to 2⁹ processors.
[0025]
For a three-dimensional division, m = 3 is set (3-2). Because the number of processors N is 2⁹, the number of dimensions m of the area can take any of the nine values m = 1 to 9 (3-3). To determine the division method candidates, the combinations of (n₁, n₂, n₃) are found (3-4). They include (2⁷, 2¹, 2¹), (2⁶, 2¹, 2²), (2³, 2³, 2³), (2², 2³, 2⁴), and so on. In this example the problem area is treated as a cube, so the combination (8, 8, 8) is selected (3-5) and the area is divided into 2³ × 2³ × 2³ (3-6). Finally, each divided area is numbered Num = X + 8 × (Y − 1) + 64 × (Z − 1), where 1 ≤ X ≤ 8, 1 ≤ Y ≤ 8, and 1 ≤ Z ≤ 8, and is allocated to processor Num (3-7).
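The following arithmetic check of the 2⁹ example shows two things. The chosen division satisfies Formula 1 with equality, and the numbering Num covers exactly the processor numbers 1 to 512:

$$\log_2 2^9 = 9 = 3 + 3 + 3 = \log_2\left(2^3 \times 2^3 \times 2^3\right),$$

$$\mathrm{Num}_{\min} = 1 + 8 \cdot 0 + 64 \cdot 0 = 1, \qquad \mathrm{Num}_{\max} = 8 + 8 \cdot 7 + 64 \cdot 7 = 512 = 2^9 .$$

X − 1, Y − 1, and Z − 1 are the digits of the base-8 number Num − 1. Each of the 512 areas therefore receives a distinct Num, and the allocation is one-to-one.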
[0026]
As described above, the four means together achieve communication performance equivalent to adjacent transfer without path contention, even when the dimensional configuration of the problem area differs from that of the parallel computer. Each divided area is simply assigned to the processor with the same number, so the user need not be aware that the two dimensional configurations differ. In this embodiment, the means are provided as a library. Users writing their own programs may also divide the problem area and allocate the divided areas to the processors by following this procedure.
[0027]
Next, a specific application example of the present invention to a two-dimensional crossbar network will be described. Consider a case where a problem area is divided into three dimensions and the divided areas are allocated to a parallel computer composed of 64 (8 × 8) processors having a two-dimensional crossbar network.
[0028]
FIG. 4 shows the configuration of the parallel computer used in this embodiment. The components are:
- 4-1 to 4-64: processors (hereinafter PE);
- 4-65 to 4-72: X-direction crossbar switches (hereinafter X-XB);
- 4-73 to 4-80: Y-direction crossbar switches (hereinafter Y-XB);
- 4-81 to 4-144: relay switches (hereinafter EX) provided at the intersections of the X-XBs and the Y-XBs.
When the crossbar switches need not be distinguished, they are simply called XB. Each XB and EX is a full crossbar switch in which every input port is directly connected to all output ports. The XBs and EXs together are called a two-dimensional crossbar network.
[0029]
Each PE is assigned a PE number in advance. The PE number consists of the X and Y coordinates of one lattice point in the two-dimensional coordinate space, together with the physical number derived from those coordinates. In this example, each PE has the coordinates [X, Y] and the physical number X + 8 × (Y − 1). Fixed routing is used: data is first transferred in the X direction and then in the Y direction.
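The PE numbering and the fixed X-then-Y route can be sketched as a simplified model. The width of 8 matches this embodiment. The relay switches (EX) are omitted, and the tuple labels for the crossbars are hypothetical.

```python
def pe_number(x: int, y: int, width: int = 8) -> int:
    """Physical number X + 8*(Y-1) of PE[X, Y] (1-based coordinates)."""
    return x + width * (y - 1)


def pe_coords(num: int, width: int = 8) -> tuple:
    """Inverse of pe_number: physical number -> (X, Y)."""
    return (num - 1) % width + 1, (num - 1) // width + 1


def xy_route(src: int, dst: int, width: int = 8) -> list:
    """Crossbars traversed by the fixed X-then-Y route from PE src to PE dst.

    Simplified model: relay switches (EX) are not listed; ("X-XB", y) is the
    X-direction crossbar of row y and ("Y-XB", x) the Y-direction crossbar of column x.
    """
    xs, ys = pe_coords(src, width)
    xd, yd = pe_coords(dst, width)
    hops = []
    if xd != xs:
        hops.append(("X-XB", ys))  # first move along row ys to column xd
    if yd != ys:
        hops.append(("Y-XB", xd))  # then move along column xd to row yd
    return hops
```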
[0030]
First, the problem area dimension determining means 1-1 in FIG. 1 determines the number of dimensions of the problem area. In this embodiment, the number of PEs, 64, can be expressed as 2⁶, so the problem area may have any number of dimensions up to six. Here the problem area is set to three dimensions, and this value is given as input.
[0031]
Next, the area dividing method candidate selecting means 1-2 in FIG. 1 selects the candidate division methods that satisfy the conditions, and the dividing method determining means 1-3 determines the division method. Let the numbers of lattice points in the dimensions of the problem area be $n_1, n_2, n_3$, each a power of 2. The candidates must satisfy $\log_x 2^6 \ge \log_x(n_1 \times n_2 \times n_3)$ with $x = 2$. Even when restricted to the case where the number of divided areas equals the number of processors, there are 10 such division methods. The X × Y × Z dimensions are divided as:
- 16 × 2 × 2, 2 × 16 × 2, or 2 × 2 × 16;
- 8 × 4 × 2, 8 × 2 × 4, 4 × 8 × 2, 4 × 2 × 8, 2 × 8 × 4, or 2 × 4 × 8;
- 4 × 4 × 4.
Any of these may be chosen. Here, division of the problem area into 16 × 2 × 2 is selected.
[0032]
FIG. 5 shows how the problem area is divided. Because this problem is three-dimensional, the X, Y, and Z coordinates are substituted into X + 16 × (Y − 1) + 32 × (Z − 1). This gives each divided area a one-dimensional number, as shown at 5-1 in FIG. 5.
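For the 16 × 2 × 2 division, the following sketch computes each area's one-dimensional number and the coordinates of the PE that receives it on the 8 × 8 machine. It assumes the PE numbering X + 8 × (Y − 1) from FIG. 4. The function name is hypothetical.

```python
def area_to_pe(X: int, Y: int, Z: int, sizes: tuple = (16, 2, 2), width: int = 8) -> tuple:
    """Number a divided area of a 16 x 2 x 2 problem and locate its PE on the 8 x 8 machine.

    Returns (area number, (PE X, PE Y)); the PE has the same physical number.
    """
    n1, n2, _ = sizes
    num = X + n1 * (Y - 1) + n1 * n2 * (Z - 1)  # X + 16(Y-1) + 32(Z-1)
    return num, ((num - 1) % width + 1, (num - 1) // width + 1)
```

One line of 16 areas along X therefore occupies two consecutive PE rows. For instance, areas 1 to 8 sit on PE row 1 and areas 9 to 16 on PE row 2.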
[0033]
Finally, the divided area assigning means 1-4 in FIG. 1 assigns each problem area in FIG. 5 to the processor in FIG. 4 that has the same number. For example, problem area [1] is assigned to PE [1], and problem area [16] is assigned to PE [16].
[0034]
FIGS. 6, 7, and 8 show the data transfers of the proximity action problem under the above division/allocation method. The user transfers data while picturing the parallel computer as having the same dimensional configuration as the problem area. Because the problem area is three-dimensional, communication occurs in six directions in total.
[0035]
FIG. 6 shows the data transfer patterns in three directions for the problem area divided into 16 × 2 × 2. 6-1 is the proximity action in the +X direction, 6-2 in the +Y direction, and 6-3 in the +Z direction. Each end of a dimension is treated as adjacent to the opposite end.
[0036]
FIGS. 7 and 8 show the data transfers when this problem area is assigned to the processors by the means of the present invention. FIG. 7 covers the proximity action in the +X direction, and FIG. 8 covers the +Y direction. As the two figures show, the above division/allocation method lets data be transferred efficiently over paths of the two-dimensional crossbar network other than those used for adjacent transfer.
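The claim that any candidate division avoids path contention can be checked under a simplified model with three assumptions. First, each crossbar is non-blocking except when two transfers need the same output port. Second, routing is X then Y. Third, area k resides on PE k. The sketch below illustrates the claim under those assumptions only, with hypothetical names and + directions only. It does not model the actual hardware.

```python
from collections import Counter
from itertools import product


def area_number(coords, sizes):
    """1-based number X + n1*(Y-1) + n1*n2*(Z-1) + ... of a divided area."""
    num, stride = 1, 1
    for c, s in zip(coords, sizes):
        num += (c - 1) * stride
        stride *= s
    return num


def shift_pairs(sizes, dim):
    """(source, destination) area numbers for the periodic +1 shift along `dim`."""
    pairs = []
    for coords in product(*(range(1, s + 1) for s in sizes)):
        dest = list(coords)
        dest[dim] = coords[dim] % sizes[dim] + 1  # the end neighbors the opposite end
        pairs.append((area_number(coords, sizes), area_number(dest, sizes)))
    return pairs


def port_conflicts(pairs, width=8):
    """Count extra users of crossbar output ports under X-then-Y routing.

    Simplified model: a transfer uses output port xd of the X-XB in its source
    row, then output port yd of the Y-XB in column xd; two transfers on one
    output port are counted as contention. Area k sits on PE k.
    """
    used = Counter()
    for src, dst in pairs:
        ys = (src - 1) // width + 1
        xd, yd = (dst - 1) % width + 1, (dst - 1) // width + 1
        used[("X-XB", ys, xd)] += 1
        used[("Y-XB", xd, yd)] += 1
    return sum(count - 1 for count in used.values() if count > 1)
```

All three + directions of the 16 × 2 × 2 and 4 × 4 × 4 divisions report zero conflicts. By contrast, two transfers from row 1 that both target column 2 report one conflict.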
[0037]
The example described here divides the problem area into 16 × 2 × 2 before assigning it to the processors. Any other candidate obtained by the area dividing method candidate selecting means 1-2 likewise allows data to be transferred efficiently and at high speed over paths other than those used for adjacent transfer.
[0038]
Likewise, this example assigns a three-dimensional problem area to a two-dimensional crossbar network. An m-dimensional problem area can be assigned to an N-dimensional crossbar network in the same manner, with the same high-speed data transfer.
[0039]
The above description uses a two-dimensional crossbar network. The invention can instead be applied to other networks shown in FIG. 10: a hypercube network (10-1) or a multistage network (10-2) composed of many relay switches. It can also be applied to a full crossbar network.
[0040]
[Effect of the Invention]
According to the present invention, even when the dimensional configuration of the problem area is different from the dimensional configuration of the parallel computer, it is possible to obtain the same communication performance as when the dimensional configuration is the same.
[Brief description of the drawings]
FIG. 1 is a block diagram of a problem area division / allocation method of the present invention.
FIG. 2 is a flowchart of processing of a problem area division / allocation method according to the present invention.
FIG. 3 is a specific flowchart of the present invention.
FIG. 4 is a block diagram of a two-dimensional crossbar network to which the present invention is applied.
FIG. 5 is an explanatory diagram of region division according to the present invention.
FIG. 6 is an explanatory diagram of a data transfer pattern according to the present invention.
FIG. 7 is a block diagram of data transfer in the +X direction in the two-dimensional crossbar network of the present invention.
FIG. 8 is a block diagram of data transfer in the +Y direction in the two-dimensional crossbar network of the present invention.
FIG. 9 is an explanatory diagram showing a conventional problem area division / allocation method.
FIG. 10 is an explanatory diagram of a network to which the present invention is applicable.
[Explanation of symbols]
1-1: Problem area dimension determining means, 1-2: Area dividing method candidate selecting means, 1-3: Dividing method determining means, 1-4: Dividing area assigning means.

Claims

A problem area division/allocation method for a parallel computer that has a hypercube network or an N-dimensional crossbar network including the function of a lattice torus connection network, and whose number of processors N can be expressed as the n-th power of a prime x (x ≥ 2), the method comprising:
receiving and accepting an input that sets the number of dimensions m of the problem area, where m does not exceed n;
selecting area division method candidates that satisfy the following condition, where $n_1, n_2, n_3, \ldots, n_m$ are the numbers of lattice points in the respective dimensions of the physical space:
$$\log_x N \ge \log_x (n_1 \times n_2 \times n_3 \times \cdots \times n_m) \qquad \text{(Equation 1)}$$
where each of $n_1, n_2, n_3, \ldots, n_m$ is a power of $x$;
determining, from among the candidates, the area division method to be used; and
assigning each divided area to the processor having the same physical number as the one-dimensionally expressed number of that divided area.