JP2017162266A

JP2017162266A - Parallel processor, parallel processing method, and program

Info

Publication number: JP2017162266A
Application number: JP2016047000A
Authority: JP
Inventors: 佑基水野; Yuki Mizuno
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2016-03-10
Filing date: 2016-03-10
Publication date: 2017-09-14
Anticipated expiration: 2036-03-10
Also published as: JP6658123B2

Abstract

PROBLEM TO BE SOLVED: To execute, in parallel, the inter-process communication processing performed in a parallel program and the calculation processing.

SOLUTION: A parallel processing device (100) includes an inter-process communication unit (101) capable of controlling inter-process communication, including at least data transfer processing, executed among a plurality of processes in a parallel program, and a multithread control unit (102) that progresses the inter-process communication by a first thread, among a plurality of threads generated in the course of executing the parallel program, from when synchronization processing is executed in the first thread until the first thread is synchronized with one or more second threads, which are the other threads among the plurality of threads.

SELECTED DRAWING: Figure 4A

Description

The present invention relates to a parallel processing device and the like that execute, in parallel, inter-process communication performed among a plurality of processes and other arithmetic processing.

In the field of HPC (High Performance Computing), parallel programs are used that execute parallel processing on a plurality of computing nodes (arithmetic processing devices and the like). Such a parallel program executes a plurality of processes in parallel by, for example, assigning a separate process to each of the computing nodes (hereinafter sometimes referred to as "calculation nodes"). The processes execute the implemented processing while transmitting data to and receiving data from one another. In addition, when each calculation node includes an arithmetic device capable of processing a plurality of threads in parallel (for example, a multi-core CPU (Central Processing Unit)), each process can execute parallel processing by using a plurality of threads.

Techniques related to parallel processing using a plurality of threads are disclosed in the following patent documents. Patent Document 1 discloses a technique aimed at executing thread switching at high speed in a multithreaded processor. The device disclosed in Patent Document 1 shortens the time required for thread switching (in particular, the time required to fetch instructions) by preparing in advance the instructions for a thread that is in a dormant state. Patent Document 2 discloses a technique related to a barrier synchronization method aimed at improving the utilization rate of processors. In the method disclosed in Patent Document 2, when a predetermined number of processors have reached a barrier synchronization point, those processors are made to execute a job different from the job that performed the barrier synchronization processing until all processors reach the barrier synchronization point.

Techniques related to synchronization processing among a plurality of threads are also disclosed in the following patent documents. Patent Document 3 discloses a technique related to a synchronization processing method that is applicable both when parallelizing processing whose absolute amount is fixed and when parallelizing processing whose absolute amount is not fixed. Patent Document 4 discloses a technique that sets hierarchically structured groups for a plurality of threads processed in parallel on a plurality of processors and executes barrier synchronization processing for each group.

Patent Document 1: JP H10-283203 A
Patent Document 2: JP H11-312148 A
Patent Document 3: JP 2011-134145 A
Patent Document 4: JP 2006-259821 A

In a parallel program, the "communication thread method" and the "calculation thread method" are known as methods for managing the progress of inter-process communication. In the communication thread method, a communication thread that is independent of the calculation threads is dedicated to managing the progress of inter-process communication. In the calculation thread method, a calculation thread executed on an arithmetic processing unit (an arithmetic device or the like) itself manages the progress of inter-process communication. With the communication thread method, because the communication thread is independent, calculation processing and the progress of communication processing can be executed concurrently. On the other hand, the communication thread method has the following problems. First, the arithmetic resources (the processing capacity of the arithmetic processing units) available to the calculation threads are reduced by the amount consumed by the communication thread. Second, when the communication thread and a calculation thread share an arithmetic processing unit, thread switching occurs, which incurs overhead such as context switches. In HPC, it is in many cases desirable to allocate as many arithmetic resources as possible to calculation processing, so the calculation thread method is often used. Because the calculation thread method does not use a communication thread, neither the reduction in arithmetic resources nor the context switches between the communication thread and the calculation threads, which are the problems of the communication thread method, occur.

However, in the calculation thread method, when no processing related to inter-process communication is written partway through the calculation processing of the parallel program, processing related to data transfer between processes is not executed during that calculation. In this case, the timing at which data transfer processing between processes is executed may be delayed. As a result, the data transfer processing and the calculation processing in the plurality of processes may fail to be executed in parallel.

In contrast, the technique disclosed in Patent Document 1 is aimed at speeding up thread switching, and the technique disclosed in Patent Document 2 is aimed at improving the utilization rate of the arithmetic devices by switching threads. Because both techniques presuppose thread switching, they cannot completely eliminate overhead such as context switches. Moreover, neither technique can solve the above problem of the calculation thread method. The techniques disclosed in Patent Documents 3 and 4 both implement the synchronization processing among a plurality of threads itself, and likewise cannot solve the above problem of the calculation thread method.

The present invention has been made in view of the above circumstances. That is, one of the main objects of the present invention is to provide a parallel processing device and the like capable of executing inter-process communication processing and calculation processing in parallel by using arithmetic resources that are consumed in the course of the calculation processing of a parallel program.

To achieve the above object, a parallel processing device according to one aspect of the present invention includes: an inter-process communication unit capable of controlling inter-process communication that is executed among a plurality of processes in a parallel program and includes at least data transfer processing; and a multithread control unit that progresses the inter-process communication by a first thread, among a plurality of threads generated in the course of executing the parallel program, during the period from when synchronization processing is executed in the first thread until the first thread is synchronized with one or more second threads, which are the other threads among the plurality of threads.

A parallel processing method according to one aspect of the present invention includes: controlling, in accordance with execution of a parallel program, inter-process communication that is executed among a plurality of processes in the parallel program and includes at least data transfer processing; and progressing the inter-process communication by a first thread, among a plurality of threads generated in the course of executing the parallel program, during the period from when synchronization processing is executed in the first thread until the first thread is synchronized with one or more second threads, which are the other threads among the plurality of threads.
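The idea of letting a thread that has already reached a synchronization point progress communication until the remaining threads arrive can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: the class name `ProgressingBarrier`, the `progress` callback, and the spin-waiting loop are all hypothetical, and the callback is assumed to be safe to call from several waiting threads.

```python
import threading


class ProgressingBarrier:
    """Barrier whose waiting threads progress communication instead of idling."""

    def __init__(self, parties, progress):
        self._parties = parties
        self._progress = progress  # hypothetical callable that advances pending communication
        self._lock = threading.Lock()
        self._arrived = 0
        self._generation = 0

    def wait(self):
        with self._lock:
            generation = self._generation
            self._arrived += 1
            if self._arrived == self._parties:
                # Last thread to arrive: release the others and start a new phase.
                self._arrived = 0
                self._generation += 1
                return
        # A thread that arrived early keeps progressing communication
        # until the remaining threads have reached the barrier.
        while True:
            with self._lock:
                if self._generation != generation:
                    return
            self._progress()


def demo():
    """Run two threads; the late one arrives only after progress has run once."""
    calls = []
    progressed = threading.Event()

    def progress():
        calls.append(1)  # list.append is atomic in CPython
        progressed.set()

    barrier = ProgressingBarrier(2, progress)

    def early():  # reaches the barrier first
        barrier.wait()

    def late():  # still "computing" until progress has run at least once
        progressed.wait()
        barrier.wait()

    threads = [threading.Thread(target=early), threading.Thread(target=late)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)


print(demo() >= 1)  # the early thread progressed communication while waiting
```

No dedicated communication thread exists in this sketch; the only cycles spent on progress are those of a calculation thread that would otherwise wait at the barrier.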

The object is also achieved by a computer program that causes a computer to realize the parallel processing device and the parallel processing method having the above configurations, and by a computer-readable recording medium or the like storing that computer program.

According to the present invention, inter-process communication processing and calculation processing can be executed in parallel by using arithmetic resources that are consumed in the course of the calculation processing of a parallel program.

FIG. 1 is an explanatory diagram illustrating the course of inter-process communication processing by the butterfly method.
FIG. 2 is an explanatory diagram showing a specific example of a parallel processing program.
FIG. 3 is an explanatory diagram showing an example of a timeline chart when the parallel processing program illustrated in FIG. 2 is executed.
FIG. 4A is a block diagram illustrating the functional configuration of the parallel processing device according to the first embodiment of the present invention.
FIG. 4B is a block diagram illustrating the functional configuration of the parallel processing device according to the second embodiment of the present invention.
FIG. 5A is an explanatory diagram (pseudocode) illustrating an inter-process communication algorithm executed in the parallel processing device according to the second embodiment of the present invention.
FIG. 5B is a flowchart illustrating the processing of the inter-process communication algorithm executed in the parallel processing device according to the second embodiment of the present invention.
FIG. 6A is an explanatory diagram showing an example of data recorded in the communication processing recording unit according to the second embodiment of the present invention.
FIG. 6B is an explanatory diagram showing an example of data recorded in the synchronization-waiting thread recording unit according to the second embodiment of the present invention.
FIG. 7 is a flowchart illustrating the operation of the communication start processing unit according to the second embodiment of the present invention.
FIG. 8A is a flowchart illustrating an example of the operation of the communication processing management unit according to the second embodiment of the present invention.
FIG. 8B is a flowchart illustrating another example of the operation of the communication processing management unit according to the second embodiment of the present invention.
FIG. 9A is a flowchart illustrating an example of the operation of the barrier synchronization processing unit according to the second embodiment of the present invention.
FIG. 9B is a flowchart illustrating another example of the operation of the barrier synchronization processing unit according to the second embodiment of the present invention.
FIG. 10 is a flowchart illustrating the operation of the communication completion processing unit according to the second embodiment of the present invention.
FIG. 11 is an explanatory diagram showing an example of a timeline chart when a parallel processing program is executed in the parallel processing device according to the second embodiment of the present invention.
FIG. 12 is a diagram illustrating the configuration of a hardware device capable of realizing the components of the parallel processing device in each embodiment of the present invention.

Before the embodiments of the present invention are described, technical considerations relating to the present invention are explained in more detail.

In programming in the HPC field, an inter-process communication technology such as MPI (Message Passing Interface) is used for inter-process communication in parallel programs. In the following description, it is assumed as a specific example that inter-process communication is executed using MPI, but the present embodiment is not limited to this. Inter-process communication using MPI may hereinafter be referred to as "MPI communication".

MPI communication processing consists of data transfer request issuance processing, data transfer processing, and data transfer completion processing. The data transfer request issuance processing is processing in which a CPU core requests an IO (Input/Output) device to start communication. The data transfer processing is processing in which the IO device transfers the data. The data transfer completion processing is processing in which the CPU core confirms, or waits for, completion of the data transfer by the IO device. Here, a CPU core is, for example, an arithmetic processing unit capable of executing processing for one or more threads. One CPU may include one or more CPU cores. One CPU may be capable of processing, in parallel, as many threads as it has CPU cores, or more.

MPI communication methods can be classified into blocking communication and non-blocking communication. In blocking communication, the CPU core, for example, executes the data transfer request issuance processing, immediately follows it with the data transfer completion processing, and waits for the IO device to complete the data transfer processing. Only then does the CPU core execute arithmetic processing other than inter-process communication.

In non-blocking communication, the CPU core executes the data transfer request issuance processing and then executes calculation processing without executing the data transfer completion processing. The CPU core later executes the data transfer completion processing at an appropriate time. Because the calculation processing by the CPU core and the data transfer processing by the IO device are executed in parallel, the execution time of the program as a whole is shortened.
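The difference in ordering between blocking and non-blocking communication can be illustrated with a small sketch. It does not use MPI: a single worker thread from Python's `concurrent.futures` stands in for the IO device, and the function names `transfer`, `compute`, `blocking`, and `nonblocking` are assumptions made for this illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer(payload):
    """Stand-in for the data transfer carried out by the IO device."""
    return list(payload)


def compute(n):
    """Stand-in for calculation processing unrelated to the communication."""
    return sum(i * i for i in range(n))


def blocking(device, payload, n):
    request = device.submit(transfer, payload)  # issue the data transfer request
    received = request.result()                 # completion processing: wait at once
    return received, compute(n)                 # calculation starts only afterwards


def nonblocking(device, payload, n):
    request = device.submit(transfer, payload)  # issue the data transfer request
    result = compute(n)                         # calculate while the transfer runs
    received = request.result()                 # completion processing, later
    return received, result


with ThreadPoolExecutor(max_workers=1) as device:
    # Both orderings yield the same data; only the overlap differs.
    print(blocking(device, [1, 2, 3], 1000) == nonblocking(device, [1, 2, 3], 1000))
```

In the non-blocking ordering the transfer and the calculation can overlap, which is where the reduction in overall execution time comes from.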

MPI communication methods can also be classified, in terms of the number of communication partners, into one-to-one communication and collective communication. One-to-one communication is MPI communication in which data is sent or received between two processes. Collective communication is MPI communication in which a group of two or more processes (a set made up of a plurality of processes) performs data collection, distribution, exchange, reduction operations, and the like. Collective communication may be realized, for example, by combining one-to-one communications.

MPI provides, for example, a non-blocking collective communication function called MPI_Iallreduce. This function causes a group of processes to execute a reduction operation and returns the result to all of the processes. A reduction operation applies a specific operation to a data set consisting of a certain number of elements to obtain a result set consisting of fewer elements. Specifically, a reduction operation is, for example, an operation that obtains one piece of output data from a plurality of pieces of input data. Consider a concrete example in which a parallel program is executed using eight processes and each process holds one of the values "0" through "7". When a reduction operation that computes the sum is executed in this parallel program using the MPI_Iallreduce function, 28 (the sum of "0" through "7") is returned to all eight processes. A simple communication algorithm that can realize the MPI_Iallreduce function is for each process to exchange data with all of the other processes and then perform the reduction operation. The number of communications in this case is on the order of O(N^2), where N is the number of processes. Because the number of communications grows in proportion to the square of the number of processes, the processing load of inter-process communication increases as the number of processes increases.

In general, more efficient communication algorithms are used to speed up inter-process communication. One such algorithm is the butterfly method. The butterfly method divides communication into stages (rounds); in other words, communication processing is executed in a plurality of rounds. In each round, each process exchanges the intermediate result it has computed up to the previous round with another predetermined process.

FIG. 1 shows how the value held by process "0" changes when the butterfly method is used. Process "0" initially holds the value "0". In round 1, process "0" receives the value "4" from process "4". Process "0" combines the received value with the value "0" that it holds (by addition in this case) and obtains the value "4". In round 2, process "0" receives the value "8" from process "2" and combines it with the value "4" that it holds to obtain the value "12". Here, the value "8" received from process "2" is the intermediate result that process "2" has computed up to round 1. Finally, in round 3, process "0" receives the value "16" from process "1" and combines it with the value "12" that it holds to obtain the value "28". The value "16" is the intermediate result that process "1" has computed up to round 2. The value "28" finally obtained is the result of the MPI_Iallreduce function.
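The round-by-round exchange can be reproduced with a small single-process simulation. The sketch below is not taken from the patent. It assumes a partner order in which the rank distance halves every round (4, 2, 1 for eight processes), which matches the partners named for process "0", and the function name `butterfly_allreduce` and its `history` return value are illustrative. Since the sum of 0 through 7 is 28, every process ends with 28.

```python
def butterfly_allreduce(values):
    """Simulate a butterfly (sum) allreduce over len(values) processes.

    Assumes the process count is a power of two. Returns the final values
    and, per process, the history of values held after each round.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("process count must be a power of two")
    current = list(values)
    history = [[v] for v in current]
    distance = n // 2
    while distance >= 1:
        # Every process exchanges its intermediate result with the partner
        # whose rank differs in exactly one bit, then adds what it received.
        received = [current[rank ^ distance] for rank in range(n)]
        current = [own + got for own, got in zip(current, received)]
        for rank in range(n):
            history[rank].append(current[rank])
        distance //= 2
    return current, history


final, history = butterfly_allreduce(list(range(8)))
print(history[0])  # values held by process 0 per round: [0, 4, 12, 28]
print(history[1])  # values held by process 1 per round: [1, 6, 16, 28]
```

Each round reads only the values produced by the previous round, which is the ordering constraint that makes progress of the communication necessary.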

With the butterfly method, because each process exchanges intermediate results of the operation, the number of communications is reduced to the order of O(N × log(N)), where N is the number of processes. However, because intermediate results are exchanged, a round cannot be started until the preceding round has completed. That is, a communication algorithm such as the butterfly method reduces the number of communications by exchanging intermediate results, but doing so introduces an ordering among the communications. Each process therefore has to repeat the communication processing in sequence according to that ordering. Hereinafter, execution of communication processing according to the ordering may be referred to as "progress of communication processing" (or "progress of inter-process communication processing"). Communication processing is progressed using, for example, either the communication thread method or the calculation thread method.
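To make the two orders of growth concrete, the following sketch counts messages under two simplifying assumptions that are not stated in the source: in the simple algorithm every process sends one message to every other process, and in the butterfly method every process sends one message per round with the process count being a power of two. The function names are illustrative.

```python
def all_to_all_messages(n):
    """Messages sent when each of n processes sends its value to every other process."""
    return n * (n - 1)


def butterfly_messages(n):
    """Messages sent when each of n processes sends once per round (n a power of two)."""
    rounds = n.bit_length() - 1  # log2(n) for a power of two
    return n * rounds


for n in (8, 64, 1024):
    print(n, all_to_all_messages(n), butterfly_messages(n))
# 8 56 24
# 64 4032 384
# 1024 1047552 10240
```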

The progress of communication processing under the communication thread method is outlined below. Under the communication thread method, each process progresses communication processing using a communication thread that is independent of the calculation threads. A calculation thread executes an MPI communication start function, which is one of the MPI library functions written in the parallel program, and thereby requests the communication thread to perform the communication processing. The calculation thread then starts calculation processing other than communication processing. Concurrently with the calculation processing in the calculation thread, the communication thread and the IO device repeatedly execute communication processing: data transfer request issuance processing, data transfer processing, data transfer completion processing, issuance processing for the next data transfer request, and so on. After executing the calculation processing, the calculation thread executes the MPI communication completion function written in the parallel program and waits for the communication thread to finish the data transfer completion processing.
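The division of work under the communication thread method can be sketched with Python's `threading` module. This is only a structural illustration: the function name `run_with_communication_thread` is hypothetical, and each round is reduced to recording its number rather than performing a real transfer.

```python
import threading


def run_with_communication_thread(rounds, compute):
    """Run `compute` while a separate thread progresses `rounds` communication rounds."""
    completed = []

    def communication_thread():
        for round_no in range(1, rounds + 1):
            # Issue the request, let the transfer run, complete it,
            # then move on to the next round.
            completed.append(round_no)

    comm = threading.Thread(target=communication_thread)
    comm.start()        # corresponds to the MPI communication start function
    result = compute()  # the calculation thread continues with other work
    comm.join()         # corresponds to the MPI communication completion function
    return result, completed


result, completed = run_with_communication_thread(3, lambda: sum(range(10)))
print(result, completed)  # 45 [1, 2, 3]
```

All rounds are progressed without any involvement of the calculation thread, at the price of an extra thread that competes for CPU cores.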

Under the communication thread method, because the communication thread is independent of the calculation threads, calculation processing and the progress of communication processing are executed concurrently. On the other hand, the computing capacity of the CPU cores available to the calculation threads is reduced by the amount consumed by the communication thread. In addition, when the communication thread and a calculation thread share a CPU core, overhead such as context switches is incurred whenever the threads are switched. These problems could be mitigated by, for example, speeding up thread switching, but the effect is limited because the context switches themselves are not eliminated.

In HPC, it is desirable to allocate as much of the processing capacity of the CPU cores as possible to calculation processing, so the calculation thread method is widely used. The calculation thread method is outlined below. In the calculation thread method, the calculation thread executed on a CPU core itself progresses the communication processing. The calculation thread executes the MPI communication start function written in the parallel program. The calculation thread then executes the data transfer request issuance processing and starts calculation processing. The IO device that has received the request executes the data transfer processing in parallel with the calculation processing of the calculation thread. Partway through the calculation processing, the calculation thread executes, for example, an MPI function (an API defined in MPI) written in the parallel program and, in the course of doing so, executes the completion processing for the previously requested data transfer and the issuance processing for the next data transfer request. After the calculation processing, the calculation thread executes the MPI communication completion function written in the parallel program and waits for the data transfer completion processing to finish.

Because the calculation thread method does not use a communication thread, it avoids the problems that arise under the communication thread method. That is, the computing capacity of the CPU cores available to the calculation threads is not reduced by a communication thread. Furthermore, because there is no switching between calculation threads and a communication thread, the context switches that accompany such switching do not occur.

However, in the calculation thread method, an MPI function is executed partway through the calculation processing only if its execution is written there. That is, if the developer who wrote the parallel program did not write an MPI function call partway through the calculation processing, no MPI function is executed during that processing. In this case, neither the data transfer completion processing nor the issuance processing for the subsequent data transfer request is executed during the calculation processing. These are consequently delayed until, for example, the MPI communication completion function is executed, and the subsequent data transfer processing may not be executed in parallel with the calculation processing.
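The dependence of progress on MPI calls made during the calculation can be modelled with a toy state machine. The class `PendingCollective` and the function `rounds_left_for_wait` are hypothetical names, and the model assumes, for simplicity, that the outstanding transfer has always finished by the time `progress()` is invoked.

```python
class PendingCollective:
    """Toy model of a multi-round non-blocking collective operation."""

    def __init__(self, rounds):
        self.rounds = rounds
        self.issued = 1     # the start call issues the round 1 request
        self.completed = 0

    def progress(self):
        """Complete the outstanding round and issue the next one, if any.

        Under the calculation thread method this runs only inside an MPI call.
        """
        if self.completed < self.issued:
            self.completed += 1
            if self.issued < self.rounds:
                self.issued += 1


def rounds_left_for_wait(rounds, mpi_calls_during_compute):
    """Rounds whose transfer has not even been requested when the wait call starts."""
    request = PendingCollective(rounds)
    for _ in range(mpi_calls_during_compute):
        request.progress()
    return rounds - request.issued


print(rounds_left_for_wait(3, 0))  # 2: rounds 2 and 3 cannot overlap the calculation
print(rounds_left_for_wait(3, 2))  # 0: every round was requested during the calculation
```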

This phenomenon is described with reference to the specific examples shown in FIGS. 2 and 3. FIG. 2 is an example of the program code of a parallel program. The calculation processing in "para1()" and "para2()" is executed by multiple threads using one or more CPU cores of a calculation node; that is, the parallel program is assumed to be a multithreaded program. Barrier synchronization is executed among the threads when "para1()" completes and when "para2()" completes.

”ｐａｒａ１（）”、”ｐａｒａ２（）”における計算処理が実行される前でＭＰＩ＿Ｉａｌｌｒｅｄｕｃｅ関数によりプロセス間通信を開始し、計算処理が実行された後でＭＰＩ＿Ｗａｉｔ関数によりプロセス通信の完了を待ち合せる。係るプロセス間通信は、異なる計算ノードで実行されるプロセスの間で実行される。即ち、係る並列プログラムは、複数の計算ノードにおいて並列にプロセスが実行される、マルチプロセスプログラムであることを想定する。 Inter-process communication is started by the MPI_Iallreduce function before the calculation processing in “para1()” and “para2()” is executed, and its completion is awaited by the MPI_Wait function after the calculation processing has been executed. This inter-process communication takes place between processes running on different calculation nodes. That is, the parallel program is assumed to be a multi-process program in which processes are executed in parallel on a plurality of calculation nodes.

ここで、”ｐａｒａ１（）”、”ｐａｒａ２（）”の実装コードの途中には、ＭＰＩ関数の呼び出しが実装されていないことを想定する。即ち、”ｐａｒａ１（）”、”ｐａｒａ２（）”の途中には、ＭＰＩ関数の呼び出しが記述されていないことから、ＭＰＩ関数が実行されない。 Here, it is assumed that the MPI function call is not implemented in the middle of the implementation code of “para1 ()” and “para2 ()”. That is, since the MPI function call is not described in the middle of “para1 ()” and “para2 ()”, the MPI function is not executed.

図３は、図２のプログラムコードを、４個のＣＰＵコア（図３における”Ｃｏｒｅ０”乃至”Ｃｏｒｅ３”）を用いて実行した場合のタイムラインチャートである。”ｐａｒａ１（）”、”ｐａｒａ２（）”は、ＣＰＵコア４個を使用して、４スレッドで実行される。仮に、ＭＰＩ＿Ｉａｌｌｒｅｄｕｃｅ関数が３ラウンドで構成されていることを想定する。この場合、ＭＰＩ＿Ｉａｌｌｒｅｄｕｃｅ関数が実行されると、ラウンド１のデータ転送要求の発行処理が実行される。その後、”ｐａｒａ１（）”、”ｐａｒａ２（）”の計算処理が開始される。ラウンド１のデータ転送処理は、”ｐａｒａ１（）”、”ｐａｒａ２（）”の計算処理と並列に実行される。一方、”ｐａｒａ１（）”、”ｐａｒａ２（）”の計算処理の途中でＭＰＩ関数が実行されないことから、ＭＰＩ＿Ｗａｉｔ関数が実行されるまで、ラウンド１のデータ転送の完了処理が実行されない。これにより、ラウンド２のデータ転送要求の発行処理が実行されない。その結果、ラウンド２、ラウンド３のデータ転送処理が、”ｐａｒａ１（）”、”ｐａｒａ２（）”の計算処理の後まで遅延してしまい、計算処理と並列に実行されない。 FIG. 3 is a timeline chart when the program code of FIG. 2 is executed using four CPU cores (“Core 0” to “Core 3” in FIG. 3). “Para1 ()” and “para2 ()” are executed by four threads using four CPU cores. It is assumed that the MPI_Iallreduce function is composed of 3 rounds. In this case, when the MPI_Iallreduce function is executed, processing for issuing a round 1 data transfer request is executed. Thereafter, calculation processing of “para1 ()” and “para2 ()” is started. The data transfer process of round 1 is executed in parallel with the calculation process of “para1 ()” and “para2 ()”. On the other hand, since the MPI function is not executed during the calculation process of “para1 ()” and “para2 ()”, the round 1 data transfer completion process is not executed until the MPI_Wait function is executed. Thus, the round 2 data transfer request issuance process is not executed. As a result, the round 2 and round 3 data transfer processing is delayed until the calculation processing of “para1 ()” and “para2 ()”, and is not executed in parallel with the calculation processing.
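The delay described for FIG. 3 can be reproduced with a small timing model. The following is a minimal sketch in C, not code from the patent. It assumes that round 1 is issued at time 0, that every round's transfer takes the same time, and that a later round can only be issued at a "progress point" (a moment at which completion processing runs) that falls after the in-flight transfer has finished. The function name, parameters, and time units are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how many rounds have their data transfer issued before the
 * calculation ends. progress_at[] holds the (sorted) times at which
 * completion processing is attempted during the calculation. */
int rounds_started_during_compute(int n_rounds, double transfer_time,
                                  const double *progress_at, size_t n_progress,
                                  double compute_end)
{
    assert(n_rounds >= 1 && transfer_time >= 0.0);
    int started = 1;                        /* round 1 is issued at t = 0 */
    double in_flight_done = transfer_time;  /* when the current transfer ends */

    for (size_t k = 0; k < n_progress && started < n_rounds; k++) {
        double t = progress_at[k];
        if (t >= compute_end)
            break;                          /* calculation already finished */
        if (t >= in_flight_done) {          /* completion succeeds: issue next */
            started++;
            in_flight_done = t + transfer_time;
        }
    }
    return started;
}
```

With no progress points during the calculation (the FIG. 3 situation), only round 1 overlaps the calculation. With progress points at, say, two barrier times, all three rounds can be issued before the calculation ends.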

これに対して、本発明に係る以下の各実施形態における並列処理装置は、上記のような問題を解決可能な技術を提供する。以下の各実施形態における並列処理装置は、一例として、並列プログラムにおいてスレッドの同期処理が実行される際に、プロセス間通信を進行させることが可能である。スレッドの同期待ち時間にＣＰＵコアの演算資源を利用することで、通信処理と計算処理とを並列に実行し、プログラム全体の実行時間を短縮することが可能である。 On the other hand, the parallel processing device in each of the following embodiments according to the present invention provides a technique capable of solving the above-described problems. As an example, the parallel processing device in each of the following embodiments can advance interprocess communication when thread synchronization processing is executed in a parallel program. By using the computation resources of the CPU core for the thread synchronization waiting time, it is possible to execute the communication process and the calculation process in parallel to reduce the execution time of the entire program.

＜第１の実施形態＞ <First Embodiment>
以下、本発明の基本的な実施形態である第１の実施形態について図面を参照して説明する。 Hereinafter, a first embodiment, which is a basic embodiment of the present invention, will be described with reference to the drawings.

図４Ａに例示するように、本実施形態における並列処理装置１００は、プロセス間通信部１０１と、マルチスレッド制御部１０２と、を含む。並列処理装置１００を構成するこれらの構成要素の間は、適切な通信方法を用いて通信可能に接続されている。 As illustrated in FIG. 4A, the parallel processing device 100 in the present embodiment includes an inter-process communication unit 101 and a multi-thread control unit 102. These components constituting the parallel processing apparatus 100 are communicably connected using an appropriate communication method.

本実施形態における並列処理装置１００は、並列プログラムを実行可能な情報処理装置（計算ノード）であってもよい。並列処理装置１００は、例えば、複数のスレッドを並列処理可能な演算処理装置（例えば、マルチコアＣＰＵ等）と、メモリと、を少なくとも備える。並列処理装置１００は、複数の演算処理ユニットを備えてもよい。係る演算処理装置は、例えば、演算処理を実行可能な演算処理ユニット（例えば、ＣＰＵコア）を複数備えてもよい。係る演算処理ユニットは、例えば、１以上のスレッドを並列に実行可能であってもよい。なお、メモリには、並列プログラム及び当該並列プログラムの実行に必要なデータが保持されてもよい。 The parallel processing device 100 in the present embodiment may be an information processing device (calculation node) that can execute a parallel program. The parallel processing device 100 includes, for example, at least an arithmetic processing device (for example, a multi-core CPU) that can process a plurality of threads in parallel and a memory. The parallel processing apparatus 100 may include a plurality of arithmetic processing units. For example, the arithmetic processing device may include a plurality of arithmetic processing units (for example, CPU cores) capable of executing arithmetic processing. Such an arithmetic processing unit may be capable of executing one or more threads in parallel, for example. Note that the memory may hold a parallel program and data necessary for executing the parallel program.

並列プログラムは、１以上のスレッドにより並列に実行可能な処理を含むプロセスを１以上含んでもよい。具体的には、並列プログラムが実行された際、１以上のプロセスが生成される。プロセスが複数生成された場合、各プロセスは、例えば、演算処理装置のうちのいずれかに割り当てられてもよい。なお、並列プログラムの実行過程において生成されたプロセスは、一つの並列処理装置１００に含まれる複数の演算処理装置に割り当てられてもよく、複数の並列処理装置１００に割り当てられてもよい。それぞれのプロセスは、割り当てられた演算処理装置を用いて、並列プログラムに実装された処理を実行してもよい。 The parallel program may include one or more processes including processes that can be executed in parallel by one or more threads. Specifically, when a parallel program is executed, one or more processes are generated. When a plurality of processes are generated, each process may be assigned to any of the arithmetic processing devices, for example. It should be noted that the processes generated in the execution process of the parallel program may be assigned to a plurality of arithmetic processing devices included in one parallel processing device 100, or may be assigned to a plurality of parallel processing devices 100. Each process may execute the processing implemented in the parallel program using the assigned arithmetic processing unit.

並列プログラムは、その実行過程において（具体的には、各プロセスの実行過程において）、１以上のスレッドを生成してもよい。複数のスレッドが生成された場合、それらは、演算処理装置において並列に実行可能である。 The parallel program may generate one or more threads in the execution process (specifically, in the execution process of each process). When a plurality of threads are generated, they can be executed in parallel in the arithmetic processing unit.

プロセス間通信部１０１は、並列プログラムにおける複数のプロセスの間で実行される、データの転送処理を少なくとも含むプロセス間通信を制御する。具体的には、プロセス間通信部１０１は、例えば、ある演算処理装置（第１の演算処理装置と記載する）において実行されるプロセスと、他の演算処理装置（第２の演算処理装置と記載する）において実行されるプロセスとの間のデータの送受信等に関するプロセス間通信の進行を制御可能である。この場合、第１の演算処理装置と、第２の演算処理装置とは、同じ計算ノードに存在してもよく、異なる計算ノードに存在してもよい。 The inter-process communication unit 101 controls inter-process communication including at least data transfer processing executed between a plurality of processes in the parallel program. Specifically, the inter-process communication unit 101 includes, for example, a process executed in a certain arithmetic processing device (described as a first arithmetic processing device) and another arithmetic processing device (described as a second arithmetic processing device). It is possible to control the progress of inter-process communication related to data transmission / reception between processes executed in In this case, the first arithmetic processing device and the second arithmetic processing device may exist in the same calculation node or may exist in different calculation nodes.

プロセス間通信部１０１は、例えば、複数のプロセスの間における、データ転送要求の発行処理（データ転送の開始処理）、データ転送処理、データ転送の完了処理等を処理してもよい。プロセス間通信部１０１は、例えば、ある順序関係に従って、上記したようなプロセス間通信におけるデータ転送処理を進行させてもよい。 The inter-process communication unit 101 may process, for example, a data transfer request issuance process (data transfer start process), a data transfer process, a data transfer completion process, and the like among a plurality of processes. For example, the inter-process communication unit 101 may advance the data transfer process in the inter-process communication as described above according to a certain order relation.

上記説明したプロセス間通信部１０１は、ＭＰＩを用いてプロセス間通信処理を実行してもよい。また、係るプロセス間通信部１０１は、例えば、ＭＰＩを拡張することにより実現されてもよい。係るプロセス間通信部１０１は、ＭＰＩに限定されず、他の適切なプロセス間通信技術を使用あるいは拡張することで実現されてもよい。 The inter-process communication unit 101 described above may execute inter-process communication processing using MPI. Further, the inter-process communication unit 101 may be realized by extending MPI, for example. The inter-process communication unit 101 is not limited to MPI, and may be realized by using or extending other appropriate inter-process communication technology.

マルチスレッド制御部１０２は、並列処理装置１００において並列プログラムが実行された際、当該並列プログラムにより生成されるスレッドによるプロセス間通信に関する処理を制御する。例えば、マルチスレッド制御部１０２は、あるスレッド（第１のスレッドと記載する）において同期処理が実行されてから、当該第１のスレッドと、他のスレッド（第２のスレッドと記載する）とが同期されるまでに、当該第１のスレッドにより、プロセス間通信を進行する。マルチスレッド制御部１０２は、例えば、上記第１のスレッドにより、プロセス間通信部１０１を用いて（プロセス間通信部１０１が提供する機能を使用して）、プロセス間通信に関する処理を実行してもよい。 When a parallel program is executed in the parallel processing device 100, the multi-thread control unit 102 controls the processing related to inter-process communication that is performed by threads generated by the parallel program. For example, the multi-thread control unit 102 advances inter-process communication using a certain thread (referred to as the first thread) during the period from when synchronization processing is executed in the first thread until the first thread and the other threads (referred to as second threads) are synchronized. The multi-thread control unit 102 may, for example, have the first thread execute the processing related to inter-process communication by using the inter-process communication unit 101 (that is, the functions provided by the inter-process communication unit 101).

マルチスレッド制御部１０２は、例えば、第１のスレッドにおいて同期処理（例えば、バリア同期処理）が実行された際、他に実行されている全ての第２のスレッドが同期処理を実行済みか否か（同期待ち状態か否か）を確認してもよい。そして、他に同期処理を実行していない第２のスレッドが存在する場合、当該第２のスレッドが同期処理を実行するまで、第１のスレッドが、プロセス間通信部１０１の機能を用いてプロセス間通信処理を実行するよう、第１のスレッドを制御してもよい。 For example, when synchronization processing (for example, barrier synchronization processing) is executed in the first thread, the multi-thread control unit 102 may check whether all the other running second threads have already executed the synchronization processing (that is, whether they are in the synchronization waiting state). If there is a second thread that has not yet executed the synchronization processing, the multi-thread control unit 102 may control the first thread so that it executes inter-process communication processing, using the functions of the inter-process communication unit 101, until that second thread executes the synchronization processing.
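The control described above amounts to a wait loop that does useful communication work instead of idling. The following is a minimal single-threaded sketch in C under stated assumptions: the two callbacks (`all_arrived`, `progress`) and the `sim_state` simulation are hypothetical stand-ins for "have all second threads reached the barrier?" and "advance inter-process communication by one step", and are not interfaces defined in the patent.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*all_arrived_fn)(void *state); /* have all peers synchronized? */
typedef void (*progress_fn)(void *state);    /* advance communication once   */

/* Run by the first thread after it reaches the barrier. Returns how many
 * times communication was advanced while waiting for the other threads. */
int wait_with_progress(all_arrived_fn all_arrived, progress_fn progress,
                       void *state)
{
    int calls = 0;
    while (!all_arrived(state)) {
        progress(state);   /* use the wait time for inter-process communication */
        calls++;
    }
    return calls;
}

/* Simulation only: the peers "arrive" after a fixed number of polls. */
typedef struct { int polls; int peers_arrive_at; int comm_steps; } sim_state;

static bool sim_all_arrived(void *p)
{
    sim_state *s = p;
    return s->polls++ >= s->peers_arrive_at;
}

static void sim_progress(void *p)
{
    sim_state *s = p;
    s->comm_steps++;
}

int demo_progress_calls(int peers_arrive_at)
{
    assert(peers_arrive_at >= 0);
    sim_state s = { 0, peers_arrive_at, 0 };
    int calls = wait_with_progress(sim_all_arrived, sim_progress, &s);
    assert(calls == s.comm_steps);
    return calls;
}
```

In a real multithreaded runtime the predicate would read shared synchronization state rather than a poll count; the sketch only shows that the longer the first thread waits, the more communication steps it can drive.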

上記説明したマルチスレッド制御部１０２は、例えば、ＯｐｅｎＭＰを用いてスレッド間の並列処理を実行してもよい。また、係るマルチスレッド制御部１０２は、例えば、ＯｐｅｎＭＰを拡張することにより実現されてもよい。係るマルチスレッド制御部１０２は、ＯｐｅｎＭＰに限定されず、他の適切な並列処理技術を使用あるいは拡張することで実現されてもよい。 The multi-thread control unit 102 described above may execute parallel processing between threads using, for example, OpenMP. The multi-thread control unit 102 may be realized by extending OpenMP, for example. The multi-thread control unit 102 is not limited to OpenMP, and may be realized by using or extending other appropriate parallel processing technology.

上記のように構成された並列処理装置１００は、第１のスレッドが、並列に実行される第２のスレッドの同期を待つ間に、演算処理装置の演算資源を用いてプロセス間通信を進行することが可能である。これにより、並列プログラムの実行時間が短縮される。 The parallel processing device 100 configured as described above proceeds with inter-process communication using arithmetic resources of the arithmetic processing device while the first thread waits for synchronization of the second thread executed in parallel. It is possible. This shortens the execution time of the parallel program.

また、並列処理装置１００は、第１のスレッドが実行する計算処理の途中に、プロセス間通信処理が明示的には実装されていない場合であっても計算処理の途中でプロセス間通信処理を進行することができる。なぜならば、第１のスレッドが同期処理を実行した際に、当該第１のスレッドがプロセス間通信処理を実行するよう制御するからである。これにより、並列処理装置１００は、並列処理プログラムの計算処理の途中に明示的にプロセス間通信処理を記載することなく、プロセス間通信処理と、計算処理とを並列に実行可能である。並列処理装置１００は、プロセス間通信用の、独立した通信スレッドを生成することなく、プロセス間通信処理と、計算処理とを並列に実行可能である。以上より、本実施形態における並列処理装置１００によれば、並列プログラムにおける計算処理の過程で消費される演算資源を用いて、プロセス間通信処理と、計算処理とを並列に実行可能である。 Further, the parallel processing device 100 can advance inter-process communication processing in the middle of the calculation process even when inter-process communication processing is not explicitly implemented in the middle of the calculation process executed by the first thread. This is because, when the first thread executes the synchronization processing, the first thread is controlled so as to execute the inter-process communication processing. As a result, the parallel processing device 100 can execute inter-process communication processing and calculation processing in parallel without inter-process communication processing being explicitly written in the middle of the calculation process of the parallel program. The parallel processing device 100 can also do so without generating an independent communication thread dedicated to inter-process communication. Accordingly, the parallel processing device 100 of the present embodiment can execute inter-process communication processing and calculation processing in parallel using the computing resources that are consumed in the course of the calculation process of the parallel program.

＜第２の実施形態＞ <Second Embodiment>
上記第１の実施形態を基礎とする、本発明の第２の実施形態について図面を参照して詳細に説明する。 A second embodiment of the present invention based on the first embodiment will be described in detail with reference to the drawings.

［構成］ [Configuration]
図４Ｂは本実施形態における並列処理装置４００の機能的な構成を例示するブロック図である。 FIG. 4B is a block diagram illustrating a functional configuration of the parallel processing device 400 according to this embodiment.

本発明は、大別して、プロセス間通信部２００と、マルチスレッド制御部３００とから構成される。 The present invention broadly comprises an inter-process communication unit 200 and a multi-thread control unit 300.

プロセス間通信部２００は、通信開始処理部２０１と、通信完了処理部２０２と通信処理管理部２０３と、通信処理記録部２０４とを含む。 The inter-process communication unit 200 includes a communication start processing unit 201, a communication completion processing unit 202, a communication processing management unit 203, and a communication processing recording unit 204.

プロセス間通信部２００は、複数のプロセスの間のプロセス間通信に関する処理を実行する。通信開始処理部２０１は、プロセス間通信処理を開始する機能を提供する。通信完了処理部２０２は、プロセス間通信を終了する機能を提供する。通信処理管理部２０３は、プロセス間通信におけるデータ転送処理を制御する機能を提供する。通信処理記録部２０４は、プロセス間通信に関する情報を保持する。 The inter-process communication unit 200 executes processing related to inter-process communication between a plurality of processes. The communication start processing unit 201 provides a function for starting inter-process communication processing. The communication completion processing unit 202 provides a function for terminating the interprocess communication. The communication processing management unit 203 provides a function for controlling data transfer processing in inter-process communication. The communication processing recording unit 204 holds information related to interprocess communication.

マルチスレッド制御部３００はバリア同期処理部３０１と、同期待ちスレッド記録部３０２とを含む。バリア同期処理部３０１は、複数のスレッドの間の同期（バリア同期）処理を実行する機能を提供する。同期待ちスレッド記録部３０２は、同期待ち状態にあるスレッドに関する情報を保持する。同期待ちスレッド記録部３０２は、同期待ち状態にあるスレッドの数（あるいは、同期待ち状態にないスレッドの数）を表す情報を保持してもよい。 The multi-thread control unit 300 includes a barrier synchronization processing unit 301 and a synchronization waiting thread recording unit 302. The barrier synchronization processing unit 301 provides a function of executing synchronization (barrier synchronization) processing between a plurality of threads. The synchronization waiting thread recording unit 302 holds information regarding threads that are in a synchronization waiting state. The synchronization waiting thread recording unit 302 may hold information indicating the number of threads waiting for synchronization (or the number of threads not waiting for synchronization).

図５Ａは、通信処理管理部２０３が実行する通信処理を実現可能な通信アルゴリズムを表すプログラム（疑似コード）の一例である。以下、図５Ａに例示するプログラムにより表される通信アルゴリズムを、単に「通信アルゴリズム」と記載する。図５Ｂは、通信処理管理部２０３が実行する通信処理を表すフローチャートであり、通信アルゴリズムの処理を表す。 FIG. 5A is an example of a program (pseudo code) representing a communication algorithm capable of realizing the communication process executed by the communication process management unit 203. Hereinafter, the communication algorithm represented by the program illustrated in FIG. 5A is simply referred to as “communication algorithm”. FIG. 5B is a flowchart showing the communication process executed by the communication process management unit 203, and shows the process of the communication algorithm.

通信アルゴリズムにおいては、データ転送要求の発行処理とデータ転送の完了処理とが必要なラウンド数分繰り返される。図５Ａに示す具体例の場合、ラウンド数は”Ｎ”（Ｎは１以上の自然数）である。図５Ａに示す通信アルゴリズムにおいては、「ｓｔｅｐ１」乃至「ｓｔｅｐＮ」のラウンドにおいて、データ転送要求の発行処理及びデータ転送の完了処理の少なくとも一方の処理が実行される。 In the communication algorithm, the data transfer request issuance process and the data transfer completion process are repeated for the required number of rounds. In the specific example shown in FIG. 5A, the number of rounds is “N” (N is a natural number of 1 or more). In the communication algorithm shown in FIG. 5A, at least one of a data transfer request issuance process and a data transfer completion process is executed in the “step 1” to “step N” rounds.

データ転送の完了処理は、データ転送処理が完了したか否かの完了確認を実行し、データ転送が未完了の場合、完了を待ち合せずに完了処理を中断してよい。この場合、図５Ａにおけるｓｔｅｐ（「ｓｔｅｐ１」乃至「ｓｔｅｐＮ」）は、例えば、処理を再開する位置（ラウンド）を示す。通信アルゴリズムは、処理を中断した場合、処理を再開する際のｓｔｅｐ（ラウンド）を通信アルゴリズムの実行元（通信アルゴリズムの呼び出し元）に返却（提供）してもよい。通信アルゴリズムは、処理を完了した場合、データ転送処理が完了したことを表す情報を、通信アルゴリズムの実行元に返却（提供）してもよい。図５Ａに示す具体例ではデータ転送処理が完了したことを表す情報として「アルゴリズム完了」を表す情報が用いられる。通信アルゴリズムは、データ転送処理の完了を待ち合せないので、短い時間で実行可能である。 In the data transfer completion process, completion confirmation as to whether or not the data transfer process has been completed is executed. If the data transfer has not been completed, the completion process may be interrupted without waiting for completion. In this case, step (“step 1” to “step N”) in FIG. 5A indicates, for example, a position (round) at which the process is resumed. When the process is interrupted, the communication algorithm may return (provide) a step (round) at the time of resuming the process to the communication algorithm execution source (communication algorithm caller). When the communication algorithm is completed, information indicating that the data transfer process is completed may be returned (provided) to the execution source of the communication algorithm. In the specific example shown in FIG. 5A, information indicating “algorithm completion” is used as information indicating that the data transfer processing has been completed. Since the communication algorithm does not wait for the completion of the data transfer process, it can be executed in a short time.

図５Ｂに例示するフローチャートを参照して、通信アルゴリズムの動作について概要を説明する。以下の説明においては、一例として、通信処理管理部２０３が、通信アルゴリズムを実行することを想定する。 An outline of the operation of the communication algorithm will be described with reference to the flowchart illustrated in FIG. 5B. In the following description, as an example, it is assumed that the communication processing management unit 203 executes the communication algorithm.

通信処理管理部２０３が通信アルゴリズムを実行する際、データ転送処理のラウンド数を指定する。データ転送処理が開始される場合、例えば、ラウンド数として「１」が設定されてもよい。 When the communication processing management unit 203 executes the communication algorithm, the number of rounds of data transfer processing is designated. When the data transfer process is started, for example, “1” may be set as the number of rounds.

データ転送処理のラウンド数を確認した結果（ステップＳ５０１）、ラウンド数が「１」の場合、通信アルゴリズムは、１番目のラウンド（最初のラウンド）におけるデータ転送要求の発行処理を実行する（ステップＳ５０２）。ラウンド数に「２」が設定される（ステップＳ５０３）。通信アルゴリズムは、通信アルゴリズムの実行元に処理を再開する際のｓｔｅｐ（ラウンド数）を返却（提供）する（ステップＳ５１３）。 When the check of the round number of the data transfer process (step S501) shows that the round number is “1”, the communication algorithm executes the data transfer request issuance process for the first round (step S502). The round number is then set to “2” (step S503). The communication algorithm returns (provides) to its caller the step (round number) from which processing is to be resumed (step S513).

データ転送処理のラウンド数を確認した結果（ステップＳ５０１）、ラウンド数が「ｉ」（ｉは２以上Ｎ未満の整数）の場合、通信アルゴリズムは、「ｉ−１」番目のラウンドにおけるデータ転送の完了処理を実行する（ステップＳ５０４）。 As a result of confirming the number of rounds in the data transfer process (step S501), when the number of rounds is “i” (i is an integer not less than 2 and less than N), the communication algorithm is the data transfer in the “i−1” -th round. A completion process is executed (step S504).

「ｉ−１」番目のラウンドのデータ転送が完了していない場合（ステップＳ５０５においてＮＯ）、通信アルゴリズムは、処理を再開する際のｓｔｅｐ（ラウンド）を「ｉ」に設定する（ステップＳ５０６）。即ち、再度「ｉ」番目のラウンドから処理が再開されるよう、処理を再開する際のｓｔｅｐが設定される。この場合、通信アルゴリズムは、「ｉ−１」番目のラウンドのデータ転送の完了を待ち合せず、ステップＳ５１３に処理を進める。 If the data transfer of the “i−1” -th round has not been completed (NO in step S505), the communication algorithm sets step (round) when restarting the process to “i” (step S506). In other words, a step for restarting the process is set so that the process is restarted from the “i” -th round. In this case, the communication algorithm does not wait for completion of the data transfer of the “i−1” -th round, and proceeds to step S513.

「ｉ−１」番目のラウンドのデータ転送が完了した場合（ステップＳ５０５においてＹＥＳ）、通信アルゴリズムは、「ｉ」番目のラウンドのデータ転送要求を発行する（ステップＳ５０７）。通信アルゴリズムは、処理を再開する際のｓｔｅｐ（ラウンド）を「ｉ＋１」に設定する（ステップＳ５０８）。即ち、「ｉ＋１」番目のラウンド（次のラウンド）から処理が再開されるよう、処理を再開する際のｓｔｅｐが設定される。 When the data transfer for the “i−1” -th round is completed (YES in step S505), the communication algorithm issues a data transfer request for the “i” -th round (step S507). The communication algorithm sets “i + 1” as a step (round) when the process is resumed (step S508). That is, a step for resuming the process is set so that the process is resumed from the “i + 1” -th round (next round).

ステップＳ５０６又はステップＳ５０８の処理を実行した後、通信アルゴリズムは、通信アルゴリズムの実行元に処理を再開する際のｓｔｅｐ（ラウンド数）を返却（提供）する（ステップＳ５１３）。 After executing the process of step S506 or step S508, the communication algorithm returns (provides) a step (number of rounds) when the process is resumed to the execution source of the communication algorithm (step S513).

データ転送処理のラウンド数を確認した結果（ステップＳ５０１）、ラウンド数が「Ｎ」（最後のラウンド）の場合、通信アルゴリズムは、「Ｍ」番目のラウンドにおけるデータの転送の完了処理を実行する（ステップＳ５０９）。「Ｍ」番目のラウンドは、例えば、「Ｎ」番目のラウンドの、一つ前のラウンドを表す。 When the check of the round number of the data transfer process (step S501) shows that the round number is “N” (the last round), the communication algorithm executes the data transfer completion process for the “M”-th round (step S509). The “M”-th round is, for example, the round immediately preceding the “N”-th round.

「Ｍ」番目のラウンドのデータ転送が完了していない場合（ステップＳ５１０においてＮＯ）、通信アルゴリズムは、処理を再開する際のｓｔｅｐ（ラウンド）を「Ｎ」に設定する（ステップＳ５１１）。この場合、通信アルゴリズムは、「Ｍ」番目のラウンドのデータ転送の完了を待ち合せず、ステップＳ５１３に処理を進める。 If the data transfer of the “M” -th round has not been completed (NO in step S510), the communication algorithm sets “N” as the step (round) when the process is resumed (step S511). In this case, the communication algorithm does not wait for completion of the data transfer of the “M” -th round, and proceeds to step S513.

「Ｍ」番目のラウンドのデータ転送が完了した場合（ステップＳ５１０においてＹＥＳ）、通信アルゴリズムは、通信アルゴリズムの実行元に、データ転送処理の完了を表す情報（例えば、図５Ａにおける「アルゴリズム完了」）を提供する（ステップＳ５１２）。 When the data transfer of the “M”-th round is completed (YES in step S510), the communication algorithm provides its caller with information indicating the completion of the data transfer process (for example, “algorithm completion” in FIG. 5A) (step S512).
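The flow of steps S501 to S513 can be read as a resumable, non-blocking state machine. The following C sketch is one possible rendering under stated assumptions, not the patent's FIG. 5A code: `sim_comm`, `issue_transfer`, and `transfer_complete` are hypothetical stand-ins that simulate a transfer which becomes complete only after a fixed number of polls, `ALGORITHM_DONE` is a hypothetical encoding of "algorithm completion", and the sketch assumes N is at least 2.

```c
#include <assert.h>
#include <stdbool.h>

#define ALGORITHM_DONE 0   /* hypothetical encoding of "algorithm completion" */

typedef struct {
    int n_steps;       /* N in the text; this sketch assumes N >= 2       */
    int polls_needed;  /* simulated latency: failed polls per transfer    */
    int polls_left;    /* remaining failed polls for the in-flight round  */
    int issued;        /* number of data transfer requests issued so far  */
} sim_comm;

static void issue_transfer(sim_comm *c)
{
    c->issued++;
    c->polls_left = c->polls_needed;
}

static bool transfer_complete(sim_comm *c)
{
    if (c->polls_left > 0) { c->polls_left--; return false; }
    return true;
}

/* One invocation of the communication algorithm. Never blocks: returns the
 * step to resume from, or ALGORITHM_DONE. */
int comm_algorithm(sim_comm *c, int step)
{
    assert(c->n_steps >= 2 && step >= 1 && step <= c->n_steps);
    if (step == 1) {
        issue_transfer(c);          /* S502: issue round 1 request        */
        return 2;                   /* S503, S513                         */
    }
    if (!transfer_complete(c))      /* S504/S505 or S509/S510             */
        return step;                /* S506 or S511: retry the same step  */
    if (step == c->n_steps)
        return ALGORITHM_DONE;      /* S512                               */
    issue_transfer(c);              /* S507: issue round i request        */
    return step + 1;                /* S508, S513                         */
}

/* Drives the algorithm to completion and counts the invocations. */
int calls_until_done(int n_steps, int polls_needed)
{
    sim_comm c = { n_steps, polls_needed, 0, 0 };
    int step = 1, calls = 0;
    do {
        step = comm_algorithm(&c, step);
        calls++;
    } while (step != ALGORITHM_DONE);
    assert(c.issued == n_steps - 1);
    return calls;
}
```

Because an unfinished transfer simply returns the same step, a caller can invoke `comm_algorithm` whenever it has spare time and store the returned value as the next resume position.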

図６Ａは、通信処理記録部２０４に記録されるデータの一例を示す説明図である。 FIG. 6A is an explanatory diagram illustrating an example of data recorded in the communication processing recording unit 204.

通信ＩＤ６０１は、プロセス間通信を識別可能な識別子（ＩＤ：Ｉｄｅｎｔｉｆｉｅｒ）を示す。通信ＩＤ６０１には、例えば、プロセス間通信を特定可能な識別子を表すデータが設定されてもよい。 The communication ID 601 indicates an identifier (ID: Identifier) that can identify inter-process communication. In the communication ID 601, for example, data representing an identifier that can specify inter-process communication may be set.

通信アルゴリズム６０２は、プロセス間通信に適用される通信アルゴリズムを示す。プロセス間通信に適用される通信アルゴリズムとしては、１以上のラウンド（段階）で通信処理が実行される通信アルゴリズムが適宜選択されてよい。例えば、係る通信アルゴリズムは、バタフライ方式であってもよい。また、係る通信アルゴリズムは、各プロセスが他の全プロセスとデータを送受信する方式であってもよい。係る通信アルゴリズムは、例えば、ＭＰＩにおける集団通信関数ごとに、適切に選択されてもよい。通信アルゴリズム６０２には、例えば、通信アルゴリズムを特定可能なデータが設定されてもよい。通信アルゴリズム６０２には、プロセス間通信に適用される種々のアルゴリズムが設定され得る。 A communication algorithm 602 indicates a communication algorithm applied to inter-process communication. As a communication algorithm applied to the inter-process communication, a communication algorithm that performs communication processing in one or more rounds (stages) may be appropriately selected. For example, the communication algorithm may be a butterfly method. The communication algorithm may be a method in which each process transmits / receives data to / from all other processes. Such a communication algorithm may be appropriately selected for each collective communication function in MPI, for example. For example, data that can specify a communication algorithm may be set in the communication algorithm 602. In the communication algorithm 602, various algorithms applied to inter-process communication can be set.

再開位置６０３は、通信アルゴリズムを再開する際のｓｔｅｐ（ラウンド数）、又は、通信アルゴリズムの完了を示す。再開位置６０３には、例えば、ｓｔｅｐ（ラウンド数）を表すデータ、又は、通信アルゴリズムの完了を表すデータが設定されてもよい。再開位置６０３の初期値は、例えば、「ｓｔｅｐ１」に設定されてもよい。係る初期値の設定は、例えば、図５Ａに示す通信アルゴリズムにおいて、最初のデータ転送要求の発行処理を実行することを表す。再開位置６０３に、「アルゴリズム完了」が設定されている場合、例えば、通信アルゴリズムが完了（データ転送処理が完了）していることを表す。以下、通信処理記録部２０４に記録される通信ＩＤ６０１と、通信アルゴリズム６０２と、再開位置６０３との組合せを、単に「エントリ」と記載する場合がある。 The resume position 603 indicates a step (number of rounds) when the communication algorithm is resumed, or the completion of the communication algorithm. For example, data representing step (number of rounds) or data representing completion of the communication algorithm may be set in the resume position 603. The initial value of the resume position 603 may be set to “step 1”, for example. The setting of the initial value represents, for example, that the first data transfer request issuance process is executed in the communication algorithm shown in FIG. 5A. When “algorithm completion” is set at the resume position 603, for example, it indicates that the communication algorithm is completed (data transfer processing is completed). Hereinafter, the combination of the communication ID 601, the communication algorithm 602, and the resume position 603 recorded in the communication processing recording unit 204 may be simply referred to as “entry”.
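An entry of FIG. 6A can be pictured as a small record. The following C sketch is illustrative only: the type and function names, the integer encoding of the resume position (`RESUME_DONE`, `RESUME_STEP1`), and the two algorithm tags are assumptions chosen for the example and do not come from the patent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical encoding of the resume position 603. */
enum { RESUME_DONE = 0, RESUME_STEP1 = 1 };

/* Hypothetical tags for the algorithms mentioned in the text. */
typedef enum { ALG_BUTTERFLY, ALG_ALL_TO_ALL } comm_algorithm_kind;

typedef struct {
    int comm_id;                    /* communication ID 601        */
    comm_algorithm_kind algorithm;  /* communication algorithm 602 */
    int resume;                     /* resume position 603         */
} comm_entry;

/* A new entry starts at "step 1" (the first request has not been issued). */
comm_entry comm_entry_new(int comm_id, comm_algorithm_kind alg)
{
    comm_entry e = { comm_id, alg, RESUME_STEP1 };
    return e;
}

bool comm_entry_is_complete(const comm_entry *e)
{
    return e->resume == RESUME_DONE;
}

/* Number of entries whose communication algorithm has not completed. */
int count_pending(const comm_entry *entries, int n)
{
    assert(n >= 0);
    int pending = 0;
    for (int i = 0; i < n; i++)
        if (!comm_entry_is_complete(&entries[i]))
            pending++;
    return pending;
}

int demo_pending(void)
{
    comm_entry table[2] = { comm_entry_new(1, ALG_BUTTERFLY),
                            comm_entry_new(2, ALG_ALL_TO_ALL) };
    table[0].resume = RESUME_DONE;  /* first communication has finished */
    return count_pending(table, 2);
}
```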

図６Ｂは、同期待ちスレッド記録部３０２に記録されるデータの一例を示す説明図である。同期待ちスレッド記録部３０２には、ある並列処理を実行する１以上のスレッドの内、同期待ち状態にないスレッドの数を表すカウンタ６０４を保持する。例えば、１以上のスレッドにより並列処理が実行される際、当該並列処理を実行するスレッドの総数がカウンタ６０４に設定される。一つの具体例として、ＯＭＰを用いてスレッドの並列処理が実装される場合、”＃ｐｒａｇｍａ　ｏｍｐ　ｐａｒａｌｌｅｌ”等、並列処理が開始されるタイミングで、カウンタ６０４に、当該並列処理を実行するスレッドの総数が設定されてもよい。 FIG. 6B is an explanatory diagram illustrating an example of data recorded in the synchronization waiting thread recording unit 302. The synchronization waiting thread recording unit 302 holds a counter 604 that represents, among the one or more threads executing a given parallel process, the number of threads that are not in the synchronization waiting state. For example, when parallel processing is executed by one or more threads, the total number of threads executing that parallel processing is set in the counter 604. As one specific example, when the parallel processing of threads is implemented using OpenMP, the total number of threads executing the parallel processing may be set in the counter 604 at the point where the parallel processing starts, such as at “#pragma omp parallel”.

各スレッドは、同期処理を実行した際（即ち、当該スレッドが同期待ち状態になった際）、カウンタ６０４の値を漸減（デクリメント）させる。一つの具体例として、ＯＭＰを用いて並列処理が実装される場合、”＃ｐｒａｇｍａ　ｏｍｐ　ｂａｒｒｉｅｒ”等のバリア同期処理の実行開始時に、各スレッドはカウンタ６０４をデクリメントする。 Each thread decrements the value of the counter 604 when it executes the synchronization processing (that is, when the thread enters the synchronization waiting state). As one specific example, when the parallel processing is implemented using OpenMP, each thread decrements the counter 604 at the start of barrier synchronization processing such as “#pragma omp barrier”.

各スレッドは、カウンタ６０４が”０”になるまで、同期待ち状態となる。各スレッドは、例えば、カウンタ６０４が”０”になるまでループ処理で待ち続けてもよい。この場合、各スレッドは、カウンタ６０４が”０”になった際にループを抜け、カウンタ６０４の値を、再度並列処理を実行する総スレッド数で初期化してもよい。 Each thread remains in the synchronization waiting state until the counter 604 reaches “0”. Each thread may, for example, keep waiting in a loop until the counter 604 becomes “0”. In this case, each thread may exit the loop when the counter 604 reaches “0” and re-initialize the counter 604 to the total number of threads executing the parallel processing.
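The life cycle of the counter 604 (set to the thread count, decremented on arrival, re-initialized on release) can be sketched with C11 atomics. The sketch below is an assumption-laden variation rather than the scheme stated in the text: it adds a `generation` field, which the patent does not describe, so that only the last arriving thread resets the counter and waiting threads detect the release by a generation change. This avoids a case that a naive "every thread resets after seeing 0" implementation would need to handle, where a slow waiter might observe the already-reset value instead of 0. The demo drives all arrivals from one thread purely for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    int total;               /* threads executing the parallel processing   */
    atomic_int not_waiting;  /* counter 604: threads not yet waiting        */
    atomic_uint generation;  /* assumption, not in the source: bumped once
                                per completed barrier                       */
} sync_counter;

/* Corresponds to the start of the parallel processing. */
void sync_counter_init(sync_counter *c, int total_threads)
{
    assert(total_threads >= 1);
    c->total = total_threads;
    atomic_init(&c->not_waiting, total_threads);
    atomic_init(&c->generation, 0u);
}

/* Called when a thread enters the barrier. Returns the generation that the
 * caller should wait on. The last arriver re-initializes the counter. */
unsigned sync_arrive(sync_counter *c)
{
    unsigned gen = atomic_load(&c->generation);
    if (atomic_fetch_sub(&c->not_waiting, 1) == 1) {
        atomic_store(&c->not_waiting, c->total);  /* ready for next barrier */
        atomic_fetch_add(&c->generation, 1u);     /* release the waiters    */
    }
    return gen;
}

bool sync_released(sync_counter *c, unsigned gen)
{
    return atomic_load(&c->generation) != gen;
}

/* Single-threaded demo: counts arrivals needed before release is observed. */
int arrivals_until_release(int total)
{
    sync_counter c;
    sync_counter_init(&c, total);
    unsigned gen = atomic_load(&c.generation);
    int arrivals = 0;
    while (!sync_released(&c, gen)) {
        sync_arrive(&c);
        arrivals++;
    }
    assert(atomic_load(&c.not_waiting) == total);  /* counter was reset */
    return arrivals;
}
```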

上記のように構成された並列処理装置４００は、上記第１の実施形態における並列処理装置１００と同様の処理を実現可能であってもよい。本実施形態におけるプロセス間通信部２００は、第１の実施形態におけるプロセス間通信部１０１と同様の処理を実現可能であってもよい。本実施形態におけるマルチスレッド制御部３００は、第１の実施形態におけるマルチスレッド制御部１０２と同様の処理を実現可能であってもよい。 The parallel processing device 400 configured as described above may be able to realize the same processing as the parallel processing device 100 in the first embodiment. The inter-process communication unit 200 in the present embodiment may be able to realize the same processing as the inter-process communication unit 101 in the first embodiment. The multi-thread control unit 300 according to the present embodiment may be able to realize the same processing as the multi-thread control unit 102 according to the first embodiment.

［動作］ [Operation]
以下、本実施形態における並列処理装置４００の動作について説明する。図７は、通信開始処理部２０１の動作の一例を表したフローチャートである。図８Ａは、通信処理管理部２０３の動作の一例を表したフローチャートである。図９Ａは、バリア同期処理部３０１の動作の一例を表したフローチャートである。図１０は、通信完了処理部２０２の動作の一例を表したフローチャートである。 Hereinafter, the operation of the parallel processing device 400 in the present embodiment will be described. FIG. 7 is a flowchart showing an example of the operation of the communication start processing unit 201. FIG. 8A is a flowchart showing an example of the operation of the communication processing management unit 203. FIG. 9A is a flowchart illustrating an example of the operation of the barrier synchronization processing unit 301. FIG. 10 is a flowchart showing an example of the operation of the communication completion processing unit 202.

本実施形態における並列処理装置４００は、上記第１の実施形態と同様、並列プログラムを実行する。係る並列プログラムは、１以上のスレッドにより並列に実行可能な処理を含むプロセスを１以上含んでもよい。即ち、並列処理装置４００において実行される並列処理プログラムは、マルチプロセス・マルチスレッド処理が実装されたプログラムであってもよい。 The parallel processing device 400 in the present embodiment executes a parallel program as in the first embodiment. Such a parallel program may include one or more processes including processes that can be executed in parallel by one or more threads. That is, the parallel processing program executed in the parallel processing device 400 may be a program in which multi-process / multi-thread processing is implemented.

係る並列プログラムが実行された場合、例えば、複数のプロセスが生成され、それらが並行して実行されてもよい。この場合各プロセスは、同一の並列処理装置４００において実行されてもよく、複数の並列処理装置４００において並行して実行されてもよい。複数のプロセスは、必要に応じてプロセス間通信を実行することにより、並列プログラムとして実装された処理を実行する。この際、あるプロセスを実行する過程において、複数のスレッドが生成され、当該複数のスレッドが並列に実行されてもよい。複数のスレッドによる並列処理が開始する際、当該並列処理を実行するスレッドの総数が、同期待ちスレッド記録部３０２のカウンタ６０４に設定されてもよい。 When such a parallel program is executed, for example, a plurality of processes may be generated and executed in parallel. In this case, each process may be executed by the same parallel processing device 400 or may be executed in parallel by a plurality of parallel processing devices 400. The plurality of processes execute processing implemented as a parallel program by executing inter-process communication as necessary. At this time, in the process of executing a certain process, a plurality of threads may be generated and the plurality of threads may be executed in parallel. When parallel processing by a plurality of threads starts, the total number of threads that execute the parallel processing may be set in the counter 604 of the synchronization waiting thread recording unit 302.

以下、並列プログラムがプロセス間通信を開始した場合の動作について説明する。以下の説明においては、並列処理プログラムが、通信開始処理部２０１、通信完了処理部２０２及び通信処理管理部２０３等が提供する機能を用いて、各種処理を実行することを想定する。例えば、通信開始処理部２０１、通信完了処理部２０２及び通信処理管理部２０３がコンピュータ・プログラムとして実現される場合、並列処理プログラムは、係るコンピュータ・プログラム（例えば、ライブラリや実行ファイル等）を呼び出す（実行する）ことが可能である。 The operation when the parallel program starts inter-process communication will be described below. In the following description, it is assumed that the parallel program executes various processes using functions provided by the communication start processing unit 201, the communication completion processing unit 202, the communication processing management unit 203, and the like. For example, when the communication start processing unit 201, the communication completion processing unit 202, and the communication processing management unit 203 are realized as computer programs, the parallel program can call (execute) those computer programs (for example, a library or an executable file).

通信開始処理部２０１は、例えば、プロセス間通信に関する情報を、通信処理記録部２０４に記録する機能を提供してもよい。具体的には、通信開始処理部２０１は、例えば、あるプロセス間通信について、通信処理記録部２０４における通信ＩＤ６０１、通信アルゴリズム６０２及び再開位置６０３にそれぞれ適切な情報を設定してもよい。通信開始処理部２０１は、例えば、通信処理管理部２０３に、通信処理を実行するよう通知することで、プロセス間通信を開始する機能を提供してもよい。 For example, the communication start processing unit 201 may provide a function of recording information related to inter-process communication in the communication processing recording unit 204. Specifically, for example, the communication start processing unit 201 may set appropriate information for the communication ID 601, the communication algorithm 602, and the restart position 603 in the communication processing recording unit 204 for a certain inter-process communication. For example, the communication start processing unit 201 may provide a function of starting the inter-process communication by notifying the communication processing management unit 203 to execute the communication processing.

When the parallel program starts inter-process communication, the communication start processing unit 201 records information related to the inter-process communication in the communication processing recording unit 204 (step S701). In this case, for example, a thread generated in the parallel program may use the communication start processing unit 201 to record the information related to the inter-process communication in the communication processing recording unit 204. This processing can be executed exclusively between threads. In this case, the communication start processing unit 201 may set the initial value “step1” in the restart position 603 as the information indicating the restart position.

通信開始処理部２０１は、通信処理を開始する（ステップＳ７０２）。例えば、通信開始処理部２０１は、通信処理管理部２０３がプロセス間通信に関する処理（プロセス間通信処理）を開始するよう制御してもよい。 The communication start processing unit 201 starts communication processing (step S702). For example, the communication start processing unit 201 may control the communication processing management unit 203 to start processing related to inter-process communication (inter-process communication processing).
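Steps S701 and S702 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the embodiment: the names `CommEntry`, `CommRecord`, and `start_communication` are hypothetical, a Python lock stands in for the exclusive execution between threads, and the `progress` callback stands in for the communication processing management unit 203.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CommEntry:
    """One entry of the record (hypothetical layout)."""
    comm_id: int                            # communication ID 601
    algorithm: Callable[[object], object]   # communication algorithm 602
    restart_position: object = "step1"      # restart position 603, initial value


class CommRecord:
    """Stand-in for the communication processing recording unit 204."""

    def __init__(self):
        self._lock = threading.Lock()
        self.entries: Dict[int, CommEntry] = {}

    def register(self, comm_id, algorithm):
        # Step S701: record the entry exclusively between threads.
        with self._lock:
            entry = CommEntry(comm_id, algorithm)
            self.entries[comm_id] = entry
            return entry


def start_communication(record, comm_id, algorithm, progress):
    """Steps S701 and S702: record the communication, then start processing."""
    entry = record.register(comm_id, algorithm)
    progress(record)  # Step S702: notify the manager to begin processing
    return entry
```

In this sketch the caller supplies the progress function, so the same recording code can be paired with any manager implementation.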

As one specific example, in the case of the pseudo code shown in FIG. 2, the parallel program may execute the above-described processing for starting the inter-process communication when, for example, the MPI_Iallreduce function is executed. Also, when “#pragma omp parallel” (processing that starts parallel processing by a plurality of threads) is executed, the total number of threads that execute the parallel processing is set in the counter 604 of the synchronization waiting thread recording unit 302.

以下、プロセス間通信処理について具体的に説明する。並列処理プログラム（特には、並列プログラムで生成されたスレッド）は、以下に説明する通信処理管理部２０３が提供する機能を用いて、プロセス間通信処理を実行してもよい。以下説明するステップＳ８０１乃至ステップＳ８０５の処理は、スレッド間で排他的に実行され得る。 Hereinafter, the inter-process communication process will be specifically described. A parallel processing program (in particular, a thread generated by the parallel program) may execute inter-process communication processing using a function provided by the communication processing management unit 203 described below. The processes in steps S801 to S805 described below can be executed exclusively between threads.

The communication processing management unit 203 determines whether there is unprocessed communication processing (step S801). Specifically, the communication processing management unit 203 refers to the communication processing recording unit 204 and, when information (an entry) regarding inter-process communication is registered in the communication processing recording unit 204, may determine that there is unprocessed inter-process communication.

ステップＳ８０１における判定の結果、未処理のプロセス間通信が存在しない場合（ステップＳ８０１においてＮＯ）、通信処理管理部２０３は、処理を終了してもよい。 If the result of determination in step S801 is that there is no unprocessed interprocess communication (NO in step S801), the communication process management unit 203 may end the process.

If the result of the determination in step S801 is that there is unprocessed inter-process communication (YES in step S801), the communication processing management unit 203 acquires the entry (communication ID 601, communication algorithm 602, and restart position 603) set in the communication processing recording unit 204 (step S802).

通信処理管理部２０３は、ステップＳ８０２において取得した通信ＩＤ６０１に関する再開位置６０３を確認する。再開位置６０３に「アルゴリズム完了」を表すデータが設定されている場合（ステップＳ８０３においてＮＯ）、通信処理管理部２０３は、ステップＳ８０１から処理を続行する。 The communication processing management unit 203 confirms the restart position 603 related to the communication ID 601 acquired in step S802. When data representing “algorithm completion” is set at the resume position 603 (NO in step S803), the communication process management unit 203 continues the process from step S801.

When data other than “algorithm completion” is set in the restart position 603 (YES in step S803), the communication processing management unit 203 executes the communication algorithm for the inter-process communication identified by the communication ID 601 (step S804). Specifically, the communication processing management unit 203 checks the data set in the restart position 603 for each communication ID 601 set in the communication processing recording unit 204. If the data set in the restart position 603 is not “algorithm completion”, the communication processing management unit 203 executes the communication algorithm for the inter-process communication identified by the data set in the communication ID 601. At this time, the communication processing management unit 203 provides the communication algorithm with the data (representing the step) set in the restart position 603.

通信処理管理部２０３は、通信アルゴリズムから返却（提供）されたデータを用いて、再開位置６０３を更新する（ステップＳ８０５）。上記したように、通信アルゴリズムからは、プロセス間通信のｓｔｅｐ（ラウンド数）を表すデータが提供されてもよい。通信アルゴリズムからは、プロセス間通信が完了したことを表すデータ（例えば、「アルゴリズム完了」を表すデータ）が提供されてもよい。 The communication processing management unit 203 updates the resume position 603 using the data returned (provided) from the communication algorithm (step S805). As described above, data representing a step (number of rounds) of inter-process communication may be provided from the communication algorithm. From the communication algorithm, data indicating that the inter-process communication is completed (for example, data indicating “algorithm completion”) may be provided.
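The loop of steps S801 to S805 can be sketched as a single pass over the recorded entries. The sketch below rests on several assumptions that are not taken from the embodiment: the step is modeled as an integer round number, the marker `ALGORITHM_DONE` stands for “algorithm completion”, and `make_round_algorithm` is a hypothetical resumable algorithm that advances one round per call.

```python
ALGORITHM_DONE = "algorithm_done"  # hypothetical marker for "algorithm completion"


def make_round_algorithm(total_rounds, log):
    """Hypothetical resumable algorithm: each call performs one round."""
    def run(step):
        log.append(step)  # stands in for issuing or completing a data transfer
        return ALGORITHM_DONE if step >= total_rounds else step + 1
    return run


def progress_once(entries):
    """One pass over the record (steps S801 to S805).

    `entries` maps a communication ID to a two-element list
    [algorithm, restart_position]. Returns True if any entry advanced.
    """
    if not entries:                        # S801: no unprocessed communication
        return False
    advanced = False
    for entry in entries.values():         # S802: fetch each entry
        algorithm, restart = entry
        if restart == ALGORITHM_DONE:      # S803: this communication is complete
            continue
        entry[1] = algorithm(restart)      # S804: run; S805: store returned position
        advanced = True
    return advanced
```

Because the algorithm receives the restart position and returns the next one, any thread can pick up the communication where the previous caller left off.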

以下、並列プログラムが、スレッドのバリア同期処理を実行した場合の処理について説明する。並列処理プログラム（具体的には、並列プログラムにおいて生成されたスレッド）は、以下において説明するバリア同期処理部３０１の機能を用いて、スレッドのバリア同期処理を実行してもよい。例えば、バリア同期処理部３０１がコンピュータ・プログラムとして実現される場合、並列処理プログラムにおいて生成されてスレッドは、係るコンピュータ・プログラムを呼び出す（実行する）ことが可能である。 Hereinafter, processing when the parallel program executes thread barrier synchronization processing will be described. A parallel processing program (specifically, a thread generated in the parallel program) may execute thread barrier synchronization processing using the function of the barrier synchronization processing unit 301 described below. For example, when the barrier synchronization processing unit 301 is realized as a computer program, a thread generated in the parallel processing program can call (execute) the computer program.

バリア同期処理部３０１は、例えば、スレッドの状態を取得する機能、スレッドの状態を設定する機能、及び、スレッド状態を消去する機能を提供してもよい。バリア同期処理部３０１は、例えば、スレッドを同期する機能（スレッドの同期を待ち合わせる機能）を提供してもよい。バリア同期処理部３０１は、例えば、通信処理管理部２０３に、通信処理を実行させる機能を提供してもよい。 For example, the barrier synchronization processing unit 301 may provide a function of acquiring a thread state, a function of setting a thread state, and a function of deleting a thread state. For example, the barrier synchronization processing unit 301 may provide a function of synchronizing threads (function of waiting for thread synchronization). For example, the barrier synchronization processing unit 301 may provide a function for causing the communication processing management unit 203 to execute communication processing.

When the parallel program executes thread barrier synchronization processing, the barrier synchronization processing unit 301 records in the synchronization waiting thread recording unit 302 that the thread has entered the synchronization waiting state (step S901). For example, a thread may use the barrier synchronization processing unit 301 to decrement (decrease by 1) the value of the counter 604 recorded in the synchronization waiting thread recording unit 302. As one specific example, when thread parallel processing is implemented using OpenMP, the thread barrier synchronization processing may be executed when code such as “#pragma omp barrier” is executed.

The thread that has entered the synchronization waiting state acquires the state of the other threads (step S902). The thread may acquire the state of the other threads using the barrier synchronization processing unit 301. For example, the thread may refer to the value of the counter 604 recorded in the synchronization waiting thread recording unit 302 and determine the state of the other threads based on that value. This reference processing can be executed exclusively, for example. Hereinafter, a thread that has entered the synchronization waiting state may be referred to as a “synchronization waiting thread” (corresponding to the first thread).

The following describes the case where it is determined, from the state of the other threads acquired in step S902, that a thread that is not in the synchronization waiting state exists among the threads participating in synchronization for a certain parallel processing (NO in step S903). For example, when the value of the counter 604 recorded in the synchronization waiting thread recording unit 302 is not “0”, the synchronization waiting thread may determine that, for that parallel processing, a thread that is not in the synchronization waiting state remains.

ここで、同期に参加するスレッドは、例えば、ある並列処理において、自スレッドにおける処理が完了した後、他のスレッドによる処理の完了を待ち合わせるスレッドであってよい。例えば、１以上のスレッドが同期に参加する場合、それらのスレッドは、ある同期点（タイミング）において、他のスレッドによる処理の実行を待ち合せてよい。 Here, the thread participating in the synchronization may be, for example, a thread that waits for the completion of the process by another thread after the process in the own thread is completed in a certain parallel process. For example, when one or more threads participate in synchronization, they may wait for execution of processing by other threads at a certain synchronization point (timing).

In the case of NO in step S903, the synchronization waiting thread checks whether it is the thread that first entered the synchronization waiting state (step S904). For example, the thread may determine whether it is the first thread to wait for synchronization based on the value of the counter 604 recorded in the synchronization waiting thread recording unit 302. As an example, if the value of the counter 604 has not been decremented by any other thread, the synchronization waiting thread may determine that it is the thread that first waited for synchronization.

同期待ちスレッドが、最初に同期待ち状態になったスレッドである場合（ステップＳ９０４においてＹＥＳ）について以下に説明する。係る同期待ちスレッドは、他のスレッドが同期待ち状態になるまで、通信処理管理部２０３による通信処理（例えば、上記ステップＳ８０１乃至ステップＳ８０５）を繰り返し実行する（ステップＳ９０５）。係る同期待ちスレッドは、バリア同期処理部３０１を用いて、通信処理管理部２０３が上記通信処理を実行するよう制御してもよい。係る同期待ちスレッドは、通信処理管理部２０３による通信処理を、適宜設定可能な時間間隔毎に、繰り返して実行してもよい。 The case where the synchronization waiting thread is the thread that first entered the synchronization waiting state (YES in step S904) will be described below. The synchronization waiting thread repeatedly executes communication processing (for example, steps S801 to S805 described above) by the communication processing management unit 203 until another thread enters a synchronization waiting state (step S905). The synchronization waiting thread may control the communication processing management unit 203 to execute the communication processing using the barrier synchronization processing unit 301. The synchronization waiting thread may repeatedly execute the communication process by the communication process management unit 203 at time intervals that can be appropriately set.

同期待ちスレッドが、最初に同期待ち状態になったスレッドではない場合（ステップＳ９０４においてＮＯ）、同期待ちスレッドは、ステップＳ９０２から処理を続行し、他のスレッドが同期待ち状態になるまで待ち合せる。 If the synchronization waiting thread is not the thread that first entered the synchronization waiting state (NO in step S904), the synchronization waiting thread continues processing from step S902 and waits until another thread enters the synchronization waiting state.

同期待ちスレッドが、ある並列処理に関して、同期待ち状態ではない他のスレッドが存在しないことを確認した場合（ステップＳ９０３においてＹＥＳ）について説明する。この場合、係る同期待ちスレッドは、ある並列処理に関する同期待ち処理を終了する。そして係る同期待ちスレッドは、他のスレッドが、ある並列処理に関する同期待ち処理を終了することを待ち合せる（ステップＳ９０６）。 A case where the synchronization waiting thread confirms that there is no other thread that is not in the synchronization waiting state for a certain parallel processing (YES in step S903) will be described. In this case, the synchronization waiting thread ends the synchronization waiting process related to a certain parallel process. Then, the synchronization waiting thread waits for another thread to finish the synchronization waiting process related to a certain parallel process (step S906).

ステップＳ９０６における同期処理は、ある並列処理を実行する各スレッドが、他のスレッドが同期待ち状態になったことを確認するまで待ち合せる処理である。これにより、例えば、あるスレッドが同期待ち状態になる前に他のスレッドが処理を続行してしまい、同期待ちスレッド記録部３０２に記録されたカウンタ６０４を変更してしまうことを防止可能である。 The synchronization process in step S906 is a process in which each thread executing a certain parallel process waits until it is confirmed that another thread has entered a synchronization wait state. As a result, for example, it is possible to prevent another thread from continuing processing before a certain thread enters the synchronization waiting state and changing the counter 604 recorded in the synchronization waiting thread recording unit 302.

あるスレッドが最初に同期待ち状態になったスレッドである場合（ステップＳ９０７においてＹＥＳ）、係るスレッドは、同期待ちスレッド記録部３０２に記録された、同期に参加した全てのスレッドに関する状態を初期化する（ステップＳ９０８）。係るスレッドは、例えば、バリア同期処理部３０１を用いて、同期待ちスレッド記録部３０２に記録されたカウンタを、並列処理を実行する総スレッド数により再度初期化する処理を実行してもよい。 If a thread is a thread that has been in a synchronization waiting state first (YES in step S907), the thread initializes the state relating to all threads that have participated in synchronization recorded in the synchronization waiting thread recording unit 302. (Step S908). For example, such a thread may execute a process of reinitializing the counter recorded in the synchronization waiting thread recording unit 302 with the total number of threads executing parallel processing by using the barrier synchronization processing unit 301.

なお、図９Ａは、説明の便宜上、最初に同期待ち状態になったスレッドがステップＳ９０８を実行する態様を示しているが、本実施形態はこれには限定されない。ステップＳ９０８は、最初に同期待ち状態になったスレッド以外のいずれかのスレッドにより実行されてもよい。 For convenience of explanation, FIG. 9A shows a mode in which a thread that is initially in a synchronization waiting state executes step S908, but the present embodiment is not limited to this. Step S908 may be executed by any thread other than the thread that first entered the synchronization wait state.

ステップＳ９０７においてにおいてＮＯの場合、又は、ステップＳ９０８における処理が実行された場合、同期待ちスレッドは、再度他のスレッドを待ち合せてもよい（ステップＳ９０９）。これにより、例えば、同期待ちスレッド記録部３０２に記録されたカウンタ６０４が初期化される前に、あるスレッドが処理を続行してしまうことを防止可能である。 If NO in step S907, or if the process in step S908 is executed, the synchronization waiting thread may wait for another thread again (step S909). Thereby, for example, it is possible to prevent a thread from continuing processing before the counter 604 recorded in the synchronization waiting thread recording unit 302 is initialized.
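The barrier flow of steps S901 to S909 can be sketched as follows. This is an illustration under assumptions, not the embodiment's implementation: the class name `ProgressBarrier` is hypothetical, `threading.Barrier` objects stand in for the two additional waits of steps S906 and S909, and non-first threads simply yield while polling.

```python
import threading
import time


class ProgressBarrier:
    """Barrier whose first arriving thread advances communication while waiting."""

    def __init__(self, n_threads, progress):
        self.n = n_threads
        self.counter = n_threads                      # counter 604
        self.lock = threading.Lock()
        self.progress = progress                      # e.g. one pass of S801 to S805
        self._sync_done = threading.Barrier(n_threads)   # S906
        self._sync_reset = threading.Barrier(n_threads)  # S909

    def wait(self):
        with self.lock:                               # S901: record arrival
            self.counter -= 1
            # S904: first if no other thread has decremented the counter yet
            first = self.counter == self.n - 1
        while True:
            with self.lock:                           # S902, S903: check the others
                if self.counter == 0:
                    break
            if first:
                self.progress()                       # S905: advance communication
            else:
                time.sleep(0)                         # yield and poll again
        self._sync_done.wait()                        # S906: all saw counter == 0
        if first:                                     # S907
            with self.lock:
                self.counter = self.n                 # S908: reinitialize counter
        self._sync_reset.wait()                       # S909: reset is visible to all
        return first
```

The second wait keeps any thread from re-entering `wait()` and decrementing the counter before it has been reinitialized, which mirrors the purpose stated for step S909.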

以下、並列プログラムがプロセス間通信を完了する場合の動作について説明する。以下の説明においては、並列処理プログラムが、通信完了処理部２０２等が提供する機能を用いて、各種処理を実行することを想定する。例えば、通信完了処理部２０２がコンピュータ・プログラム（例えば、ライブラリや実行ファイル等）として実現される場合、並列処理プログラムは、係るコンピュータ・プログラムを呼び出す（実行する）ことが可能である。 The operation when the parallel program completes the interprocess communication will be described below. In the following description, it is assumed that the parallel processing program executes various processes using functions provided by the communication completion processing unit 202 and the like. For example, when the communication completion processing unit 202 is realized as a computer program (for example, a library or an execution file), the parallel processing program can call (execute) the computer program.

並列プログラムがプロセス間通信を完了する際、通信完了処理部２０２は、通信処理記録部２０４を参照し、当該並列プログラムに関するプロセス間通信の再開位置を表すデータ（図６Ａにおける再開位置６０３）を取得する（ステップＳ１００１）。 When the parallel program completes the inter-process communication, the communication completion processing unit 202 refers to the communication processing recording unit 204 and acquires data (resumption position 603 in FIG. 6A) indicating the resumption position of the inter-process communication related to the parallel program. (Step S1001).

通信完了処理部２０２は、プロセス間通信の再開位置を表すデータを用いて、プロセス間通信が完了したか否かを判定する。具体的には、通信完了処理部２０２は、プロセス間通信の再開位置を表すデータに「アルゴリズム完了」が設定されているか否かを判定する。（ステップＳ１００２）。 The communication completion processing unit 202 determines whether or not the interprocess communication has been completed, using data representing the restart position of the interprocess communication. Specifically, the communication completion processing unit 202 determines whether or not “algorithm completion” is set in the data indicating the restart position of the interprocess communication. (Step S1002).

上記したように、プロセス間通信（データ転送処理）が完了している場合、プロセス間通信の再開位置（例えば、図６の６０３）に「アルゴリズム完了」が設定される。プロセス間通信（データ転送処理）が完了していない場合、プロセス間通信の再開位置には、例えば、ｓｔｅｐ数（ラウンド数）が設定される。 As described above, when the inter-process communication (data transfer process) is completed, “algorithm complete” is set at the restart position of the inter-process communication (for example, 603 in FIG. 6). When inter-process communication (data transfer processing) has not been completed, for example, the number of steps (number of rounds) is set as the restart position of inter-process communication.

When the restart position of the inter-process communication is not “algorithm completion” (NO in step S1002), the inter-process communication of the parallel program has not been completed. In this case, the communication completion processing unit 202 executes the inter-process communication processing by the communication processing management unit 203 (step S1003) and continues processing from step S1001. That is, the communication completion processing unit 202 repeatedly executes the inter-process communication processing by the communication processing management unit 203 until “algorithm completion” is set in the restart position of the inter-process communication. The communication completion processing unit 202 may, for example, control the communication processing management unit 203 so that it executes the inter-process communication processing.

When the restart position of the inter-process communication is “algorithm completion” (YES in step S1002), the inter-process communication of the parallel program has been completed. In this case, the communication completion processing unit 202 deletes the entry for that inter-process communication recorded in the communication processing recording unit 204 (step S1004). The processing in step S1004 can be executed exclusively between threads.

As one specific example, the parallel program may execute the above-described processing for completing the inter-process communication when the MPI_Wait function is executed.
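The completion flow of steps S1001 to S1004 can be sketched as a loop that keeps advancing the communication until its restart position reports completion. The names `complete_communication` and `ALGORITHM_DONE`, the dictionary layout of the record, and the returned call count are assumptions made for illustration only.

```python
import threading

ALGORITHM_DONE = "algorithm_done"  # hypothetical marker for "algorithm completion"
_record_lock = threading.Lock()    # guards deletion from the record


def complete_communication(entries, comm_id, progress, lock=_record_lock):
    """Steps S1001 to S1004 for one communication.

    `entries` maps a communication ID to a dict holding "restart_position".
    `progress` advances the recorded communications by one pass.
    Returns how many times `progress` had to be called.
    """
    calls = 0
    while True:
        restart = entries[comm_id]["restart_position"]   # S1001
        if restart == ALGORITHM_DONE:                    # S1002: finished?
            break
        progress(entries)                                # S1003: advance
        calls += 1
    with lock:                                           # S1004: exclusive delete
        del entries[comm_id]
    return calls
```

If the communication already finished during earlier barrier waits, the loop exits immediately and only the deletion remains.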

[Effects]
For example, when a thread generated in a parallel program simply executes only synchronization processing, the thread may enter a synchronization waiting state in which it waits for the other threads to synchronize. In this case, the CPU core performs no arithmetic processing for the thread in the synchronization waiting state. That is, when a thread enters the synchronization waiting state, the computing resources of the CPU core are consumed without being used effectively.

In contrast, in the parallel processing device 400 configured as described above, a thread of the parallel program can execute inter-process communication after it executes the synchronization processing and while it waits for the other threads to synchronize (for example, steps S902 to S905 described above).

即ち、並列処理装置４００は、スレッドの同期待ちで消費されていたＣＰＵコアの演算資源を使用して、プロセス間通信処理を進行することが可能である。これにより、計算処理に割り当てられるＣＰＵコアの演算資源を減らすことなく、プロセス間通信処理を進行することが可能である。 That is, the parallel processing device 400 can proceed with the inter-process communication process by using the computation resource of the CPU core that has been consumed while waiting for thread synchronization. Thereby, it is possible to proceed with the inter-process communication process without reducing the calculation resources of the CPU core assigned to the calculation process.

In the parallel processing device 400 configured as described above, there is no need to generate an independent communication thread. In addition, in the parallel processing device 400, the user does not need to explicitly write processing related to the inter-process communication in the middle of the calculation processing in the parallel program, for example. This is because, when a thread enters the synchronization waiting state, the inter-process communication unit 200 and the multi-thread control unit 300 execute the processing that advances the inter-process communication. This makes it possible to execute the data transfer processing of the inter-process communication and the calculation processing in parallel.

上記のような効果について、説明図を参照して説明する。図１１は、例えば、図２に例示したプログラムを、本実施形態の並列処理装置４００において実行した場合の効果を説明する説明図である。 The above effects will be described with reference to the explanatory drawings. FIG. 11 is an explanatory diagram for explaining the effect when, for example, the program illustrated in FIG. 2 is executed in the parallel processing device 400 of this embodiment.

図３に例示するチャートと比較すると、図１１のチャートにおいては、”ｐａｒａ１（）”と、”ｐａｒａ２（）”との計算処理の後で実行されるスレッドのバリア同期処理の際、プロセス間通信処理が実行される。これにより、並列処理装置４００は、スレッドが同期待ち状態にある（スレッドの同期を待ち合せている）ＣＰＵコアの演算資源を利用して、以下のプロセス間通信処理を順次進行することができる。即ち、並列処理装置４００は、具体的には、ラウンド１のテータ転送の完了処理（Ｓ１１０１）、ラウンド２のデータ転送要求の発行処理（Ｓ１１０２）、ラウンド２のデータ転送の完了処理（Ｓ１１０３）、ラウンド３のデータ転送要求の発行処理（Ｓ１１０４）を順に実行可能である。これにより、ラウンド２及びラウンド３のデータ転送処理が、計算処理と並行して実行されることから、並列プログラム全体の実行時間が短縮される。 Compared to the chart illustrated in FIG. 3, in the chart of FIG. 11, the inter-process communication is performed during the barrier synchronization processing of the thread executed after the calculation processing of “para1 ()” and “para2 ()”. Processing is executed. As a result, the parallel processing device 400 can sequentially proceed with the following inter-process communication processing using the computation resources of the CPU core in which the threads are in a synchronization waiting state (waiting for thread synchronization). That is, the parallel processing device 400 specifically includes a round 1 data transfer completion process (S1101), a round 2 data transfer request issuance process (S1102), a round 2 data transfer completion process (S1103), Round 3 data transfer request issuance processing (S1104) can be executed in sequence. As a result, the data transfer process of round 2 and round 3 is executed in parallel with the calculation process, so that the execution time of the entire parallel program is shortened.

以上より、本実施形態によれば、並列プログラムにおける計算処理の過程で消費される演算資源を用いて、プロセス間通信処理と、計算処理とを並列に実行可能である。 As described above, according to the present embodiment, the inter-process communication process and the calculation process can be executed in parallel using the operation resources consumed in the process of the calculation process in the parallel program.

以下、上記各実施形態に関する変形例等について説明する。 Hereinafter, modifications and the like related to the above embodiments will be described.

上記説明した実施形態においては、ステップＳ９０４乃至ステップＳ９０５における処理は、最初に同期待ち状態になったスレッドにより実行される。これにより、同期待ち状態のスレッドが発生してからなるべく早いタイミングで、プロセス間通信の処理を進行させることができる。 In the above-described embodiment, the processes in steps S904 to S905 are executed by the thread that first enters the synchronization wait state. As a result, the inter-process communication process can proceed at the earliest possible timing after the generation of the synchronization waiting thread.

As one modification of the present embodiment, the communication processing in step S905 of FIG. 9A (including steps S801 to S805 of FIG. 8A) may be executed by an appropriate thread other than the thread that first entered the synchronization waiting state. For example, the parallel processing device 400 may hold data (a flag) or the like indicating whether a thread is executing the communication processing. In this case, for example, when a thread enters the synchronization waiting state, the thread checks the flag. If no other thread is executing the communication processing, the thread itself may execute the communication processing. Specifically, this modification can be realized as follows, as described with reference to FIGS. 8B and 9B.

In this modification, for example, information indicating whether the communication processing for a communication ID is being executed is set in the communication processing recording unit 204 in association with each communication ID (601 in FIG. 6A). Hereinafter, this information is referred to as the “in-process flag”. It is assumed, for example, that the in-process flag is accessed exclusively by the threads. That is, it is assumed that at any given time only one thread can refer to, update, or delete the in-process flag.

In step S701 of FIG. 7, the communication start processing unit 201 initializes the in-process flag for a given inter-process communication. Specifically, the communication start processing unit 201 may set, in the in-process flag, information indicating that the communication processing is not being executed (for example, “OFF”).

When a thread executes the synchronization processing, the thread uses, for example, the barrier synchronization processing unit 301 to execute steps S901 to S905 of FIG. 9B. At this time, the thread need not check whether it is the thread that first entered the synchronization waiting state (step S904 of FIG. 9A).

当該スレッドは、例えば、通信処理管理部２０３を用いて、図８ＢのステップＳ８０１を実行する。 The thread executes step S801 of FIG. 8B using, for example, the communication processing management unit 203.

When there is unprocessed inter-process communication processing (YES in step S801), the thread checks the in-process flag for that inter-process communication processing (step S806 of FIG. 8B). Specifically, the thread checks, in the communication processing recording unit 204, the in-process flag associated with the communication ID of the unprocessed inter-process communication.

係る処理中フラグがＯＦＦの場合、当該通信処理を実行している他のスレッドが存在しない。この場合（ステップＳ８０６においてＹＥＳ）、係るスレッドは、係る処理中フラグに、通信処理が実行されていることを表す情報（例えば「ＯＮ」等）を設定する（図８ＢのステップＳ８０７）。ここで、ステップＳ８０６における処理中フラグの確認及び更新は、排他的に（アトミックに）実行され得る。 When the in-process flag is OFF, there is no other thread executing the communication process. In this case (YES in step S806), the thread sets information (for example, “ON” or the like) indicating that the communication process is being executed in the in-process flag (step S807 in FIG. 8B). Here, the confirmation and update of the processing flag in step S806 can be executed exclusively (atomically).

ステップＳ８０７の後、係るスレッドは、ステップＳ８０２から処理を続行する。ステップＳ８０２乃至ステップＳ８０５の処理は上記説明した実施形態と同様としてよい。 After step S807, the thread continues processing from step S802. The processing from step S802 to step S805 may be the same as in the above-described embodiment.

In the case of NO in step S803 (that is, when the communication algorithm has been completed), or after the processing in step S805 has been executed, the thread sets “OFF” in the in-process flag (step S808).

In step S806, if the in-process flag is not OFF (that is, it is ON), another thread may be executing the communication processing. In this case, the thread continues processing from step S801.

上記のような処理により、図８Ｂに例示する通信処理は、ある特定のスレッド（例えば、最初に同期待ち状態になったスレッド）に限定されず、複数のスレッドのうちいずれかのスレッドにより実行され得る。 Through the processing as described above, the communication processing illustrated in FIG. 8B is not limited to a specific thread (for example, a thread that is first in a synchronization waiting state), and is executed by any one of a plurality of threads. obtain.
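The flag-based variant of steps S806 to S808 can be sketched as an atomic test-and-set on each entry. The sketch is illustrative only and assumes details the text leaves open: `FlaggedEntry`, `try_claim`, `release`, and `progress_with_flags` are hypothetical names, a per-entry lock provides the exclusive access, and the step is modeled as an integer.

```python
import threading

ALGORITHM_DONE = "algorithm_done"  # hypothetical marker for "algorithm completion"


class FlaggedEntry:
    """Record entry carrying an in-process flag (hypothetical layout)."""

    def __init__(self, algorithm):
        self.algorithm = algorithm
        self.restart_position = 1
        self.in_process = False            # "OFF", as initialized in S701
        self._flag_lock = threading.Lock()

    def try_claim(self):
        # S806 and S807 as one atomic check-and-set of the flag.
        with self._flag_lock:
            if self.in_process:
                return False               # another thread is working on it
            self.in_process = True         # "ON"
            return True

    def release(self):
        # S808: mark the entry as no longer being processed.
        with self._flag_lock:
            self.in_process = False


def progress_with_flags(entries):
    """Advance every entry this thread manages to claim; return the count."""
    advanced = 0
    for entry in list(entries.values()):
        if not entry.try_claim():
            continue                       # skip entries claimed elsewhere
        try:
            if entry.restart_position != ALGORITHM_DONE:          # S803
                entry.restart_position = entry.algorithm(
                    entry.restart_position)                       # S804, S805
                advanced += 1
        finally:
            entry.release()                                       # S808
    return advanced
```

Releasing the flag in a `finally` clause is a design choice of this sketch so that a failing algorithm does not leave the entry permanently claimed.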

In the above embodiment, the barrier synchronization processing unit 301 executes processing related to inter-process communication using the communication processing management unit 203 until synchronization is established (for example, until all threads participating in the synchronization have entered the synchronization waiting state). As a result, barrier synchronization may take somewhat longer than when the barrier synchronization processing unit 301 does not execute processing related to inter-process communication. When priority is given to the execution time of barrier synchronization, a reference time until synchronization is established may be set appropriately, and the barrier synchronization processing unit 301 may advance the inter-process communication processing within a range that does not exceed the reference time. The reference time may be, for example, the time elapsed since a thread first executed the synchronization processing. The reference time may also be set, for example, by recording the time required to establish previous synchronizations and the time required to advance the communication processing, so that the time required to establish synchronization is not exceeded. Alternatively, the reference time may be determined experimentally, for example, when testing the parallel program. The method for determining the reference time is not limited to the above, and other appropriate methods may be adopted.
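Advancing the communication only within a reference time can be sketched as a loop bounded by a deadline. The function name `progress_within_budget`, its parameters, and the use of a monotonic clock are assumptions for illustration; how the reference time itself is chosen is left to the methods described above.

```python
import time


def progress_within_budget(progress, all_arrived, budget_seconds,
                           clock=time.monotonic):
    """Call `progress` repeatedly until every thread has arrived or the
    reference time (budget) measured from this call has elapsed.

    Returns the number of times `progress` was called.
    """
    deadline = clock() + budget_seconds   # reference time from the first wait
    calls = 0
    while not all_arrived() and clock() < deadline:
        progress()                        # one pass of communication processing
        calls += 1
    return calls
```

Passing the clock as a parameter is a convenience of this sketch: it allows the budget logic to be exercised with a deterministic fake clock.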

In the description of the above embodiment, the barrier synchronization processing unit 301 advances the inter-process communication processing when a thread executes barrier synchronization processing. The embodiment is not limited to this and can also be applied to functions other than barrier synchronization in which threads wait for one another (for example, a lock acquisition function).
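As a rough illustration of the lock acquisition case, a thread that fails to obtain a lock could advance communication between attempts instead of blocking. The sketch below is an assumption-laden example, not the embodiment itself: `acquire_with_progress` and the `progress` callback are hypothetical names, and only the standard `threading.Lock.acquire(blocking=False)` call is relied on.

```python
import threading


def acquire_with_progress(lock, progress):
    """Acquire `lock`, calling `progress()` after each failed attempt.

    Returns the number of progress steps performed while waiting.
    `progress` is a hypothetical callback that advances pending
    inter-process communication by one step.
    """
    steps = 0
    while not lock.acquire(blocking=False):
        progress()  # use the waiting time to advance communication
        steps += 1
    return steps
```

The caller owns the lock when the function returns and releases it as usual with `lock.release()`.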

Each of the above embodiments has been described as an example in which the present invention is applied to a parallel processing device (100, 400). In each of the above embodiments, a parallel processing method can be carried out, for example, by operating the parallel processing device (100, 400). The way of carrying out the parallel processing method is not limited to this; it can also be carried out by another device (for example, an information processing device such as a computer) capable of executing the same operations or processing as the parallel processing device (100, 400).

<Configuration of hardware and software program (computer program)>
Hereinafter, a hardware configuration capable of realizing each of the above-described embodiments will be described.

In the following description, the parallel processing devices (100, 400) described in the above embodiments are collectively referred to simply as the “parallel processing device”. In addition, each component of these parallel processing devices may be referred to simply as a “component of the parallel processing device”.

The parallel processing device described in each of the above embodiments may be configured by one or more dedicated hardware devices. In that case, the components shown in the above figures (FIGS. 4A and 4B) may be realized using hardware in which some or all of them are integrated (such as an integrated circuit implementing processing logic, or a storage device).

When the parallel processing device is realized by dedicated hardware, the components of the parallel processing device may be realized by, for example, circuitry capable of providing the respective functions. Such circuitry includes, for example, an integrated circuit such as an SoC (System on a Chip), or a chipset realized using such an integrated circuit. In this case, the data held by the components of the parallel processing device may be stored in, for example, a RAM (Random Access Memory) area or a flash memory area integrated into the SoC, or in a storage device (such as a semiconductor storage device) connected to the SoC. Also in this case, a well-known communication network may be adopted as the communication line connecting the components of the parallel processing device. The communication line may also connect the components to one another peer-to-peer.

The parallel processing device described above may also be configured by general-purpose hardware as illustrated in FIG. 12 and various software programs (computer programs) executed by that hardware. In this case, the parallel processing device may be configured by a combination of an appropriate number of general-purpose hardware devices 1200 and software programs.

The arithmetic device 1201 in FIG. 12 is an arithmetic processing device such as a general-purpose CPU (Central Processing Unit) or a microprocessor. The arithmetic device 1201 may, for example, read various software programs stored in the nonvolatile storage device 1203 described later into the storage device 1202 and execute processing according to those software programs. For example, the functions of the components of the parallel processing device in each of the above embodiments may be realized using a software program executed by the arithmetic device 1201.

The arithmetic device 1201 may be, for example, a multi-core CPU including a plurality of CPU cores. The arithmetic device 1201 may also be a multi-threaded CPU in which one CPU core can execute a plurality of threads. The hardware device 1200 may include a plurality of arithmetic devices 1201.

The parallel program described in each of the above embodiments can execute processes in parallel using a plurality of arithmetic devices 1201. In this case, the parallel program may be executed by a plurality of arithmetic devices 1201 included in a single hardware device 1200. The parallel program may also be executed by one or more arithmetic devices 1201 included in each of a plurality of hardware devices 1200. The threads generated by the parallel program are executed in parallel using, for example, the CPU cores of the arithmetic device 1201.

The storage device 1202 is a memory device, such as a RAM, that can be referenced from the arithmetic device 1201, and stores software programs, various data, and the like. Note that the storage device 1202 may be a volatile memory device.

The nonvolatile storage device 1203 is a nonvolatile storage device such as a magnetic disk drive or a semiconductor storage device using flash memory. The nonvolatile storage device 1203 can store various software programs, data, and the like. For example, the parallel program may be stored in the nonvolatile storage device 1203.

The drive device 1204 is, for example, a device that handles reading data from and writing data to the recording medium 1205 described later.

The recording medium 1205 is any recording medium capable of recording data, such as an optical disk, a magneto-optical disk, or a semiconductor flash memory. The parallel program may be recorded on the recording medium 1205.

The input/output interface 1206 is a device that controls input and output to and from external devices.

The parallel processing device of the present invention described using the above embodiments as examples, or its components, may be realized, for example, by supplying the hardware device 1200 illustrated in FIG. 12 with a software program capable of realizing the functions described in the above embodiments.

More specifically, the present invention may be realized, for example, by the arithmetic device 1201 executing a software program supplied to the hardware device 1200. In this case, an operating system running on the hardware device 1200, or middleware such as database management software, network software, or a virtual environment platform, may execute part of each process.

In each of the above embodiments, each unit shown in the above figures can be realized as a software module, that is, a functional (processing) unit of a software program executed by the hardware described above. However, the division into software modules shown in these drawings is a configuration adopted for convenience of explanation, and various configurations can be assumed in an actual implementation.

When the components of the parallel processing device illustrated in FIGS. 4A and 4B are realized as software modules, these software modules may be stored, for example, in the nonvolatile storage device 1203. The arithmetic device 1201 may then read these software modules into the storage device 1202 when it executes the respective processes.

These software modules may also be configured so that they can pass various data to one another by an appropriate method such as shared memory or inter-process communication. With such a configuration, the software modules are connected so that they can communicate with one another.

Furthermore, the software program may be recorded on the recording medium 1205. In this case, the software program may be configured to be stored in the nonvolatile storage device 1203 through the drive device 1204 as appropriate, for example at the shipping stage or the operation stage of the parallel processing device.

In the above case, the various software programs may be supplied to the hardware device 1200 by installing them in the device using an appropriate jig at the manufacturing stage before shipment, at the maintenance stage after shipment, or the like. The various software programs may also be supplied by a procedure that is now common, such as downloading them from outside via a communication line such as the Internet.

In such a case, the present invention can be regarded as being constituted by the code that makes up the software program, or by a computer-readable recording medium on which that code is recorded. In this case, the recording medium is not limited to a medium independent of the hardware device 1200 and includes a recording medium on which a software program transmitted via a LAN, the Internet, or the like has been downloaded and stored or temporarily stored.

The components of the parallel processing device described above may also be configured by a virtualized environment in which the hardware device 1200 illustrated in FIG. 12 is virtualized, and various software programs (computer programs) executed in that virtualized environment. In this case, the components of the hardware device 1200 illustrated in FIG. 12 are provided as virtual devices in the virtualized environment. In this case as well, the present invention can be realized with the same configuration as when the hardware device 1200 illustrated in FIG. 12 is configured as a physical device.

The present invention has been described above as examples applied to the exemplary embodiments described above. However, the technical scope of the present invention is not limited to the scope described in those embodiments. It will be apparent to those skilled in the art that various changes or improvements can be made to the embodiments. In such cases, new embodiments incorporating those changes or improvements can also be included in the technical scope of the present invention. Embodiments that combine the above embodiments are also included in the technical scope of the present invention. Furthermore, embodiments that combine the above embodiments with new embodiments obtained by changing or improving them can also be included in the technical scope of the present invention. This is clear from the matters described in the claims.

DESCRIPTION OF SYMBOLS
100 Parallel processing device
101 Inter-process communication unit
102 Multi-thread control unit
200 Inter-process communication unit
201 Communication start processing unit
202 Communication completion processing unit
203 Communication processing management unit
204 Communication processing recording unit
300 Multi-thread control unit
301 Barrier synchronization processing unit
302 Synchronization waiting thread recording unit
1201 Arithmetic device
1202 Storage device
1203 Nonvolatile storage device
1204 Drive device
1205 Recording medium
1206 Input/output interface

Claims

A parallel processing device comprising:
an inter-process communication unit capable of controlling inter-process communication, including at least data transfer processing, that is executed between a plurality of processes in a parallel program; and
a multi-thread control unit that, among a plurality of threads generated in the course of executing the parallel program, causes a first thread to advance the inter-process communication during the period from when synchronization processing is executed in the first thread until the first thread and one or more second threads, which are the other threads among the plurality of threads, are synchronized.

The parallel processing device according to claim 1, wherein the multi-thread control unit:
determines, when synchronization processing is executed in the first thread and the first thread enters a synchronization waiting state, whether each of the one or more second threads is in a synchronization waiting state; and
when any one or more of the second threads is not in a synchronization waiting state, causes the first thread to execute processing related to the inter-process communication using a function provided by the inter-process communication unit.

The parallel processing device according to claim 2, wherein
the first thread is the thread that first enters a synchronization waiting state among the plurality of generated threads, and
the multi-thread control unit causes the first thread to execute processing that advances the inter-process communication, using a function provided by the inter-process communication unit, from when the first thread enters a synchronization waiting state until all of the second threads enter a synchronization waiting state.

The parallel processing device according to claim 2, wherein
the inter-process communication includes one or more stages of the data transfer processing that have an order relationship,
the inter-process communication unit can control the progress of the inter-process communication by advancing the stages of the data transfer processing according to the order relationship, and
the multi-thread control unit causes the first thread to execute processing that advances the inter-process communication using the function provided by the inter-process communication unit.

The parallel processing device according to claim 4, wherein
the inter-process communication includes, for each stage of the inter-process communication, at least one of completion processing that completes the data transfer of the preceding stage and processing that executes the data transfer of that stage, and
the inter-process communication unit:
confirms, in the completion processing, whether the data transfer processing of the preceding stage has completed;
when the data transfer processing of the preceding stage has completed, executes the data transfer processing of the current stage and sets the stage of the inter-process communication to the next stage; and
when the data transfer processing of the preceding stage has not completed, interrupts the completion processing without waiting for that data transfer processing to complete, and provides information indicating the stage of the inter-process communication whose data transfer processing has not completed.

The parallel processing device according to claim 5, wherein the inter-process communication unit:
includes a communication processing recording unit that holds information capable of identifying the inter-process communication and information representing the stage of the inter-process communication;
when the completion processing is interrupted without waiting for the data transfer to complete, sets, in the communication processing recording unit, information indicating the stage of the inter-process communication whose data transfer processing has not completed; and
when processing related to the inter-process communication is executed by the first thread, uses the information indicating the stage of the inter-process communication set in the communication processing recording unit to execute the processing that advances the inter-process communication from the stage represented by that information.

The parallel processing device according to claim 3, wherein the multi-thread control unit causes the first thread to execute the processing that advances the inter-process communication, using a function provided by the inter-process communication unit, within a range that does not exceed a reference time, during the period from when the first thread enters a synchronization waiting state until all of the second threads enter a synchronization waiting state.

A parallel processing device comprising:
one or more circuit elements including one or more arithmetic units; and
a storage device that provides a storage area that can be referenced from the circuit elements,
wherein the circuit elements are configured to be capable of executing:
a function of controlling inter-process communication, including at least data transfer processing, that is executed between a plurality of processes in a parallel program; and
a function of executing processing related to the inter-process communication during the period from when synchronization processing is executed in a first thread, among a plurality of threads generated in the course of executing the parallel program, until the first thread and one or more second threads, which are the other threads among the plurality of threads, are synchronized.

A parallel processing method comprising:
controlling, in response to execution of a parallel program, inter-process communication, including at least data transfer processing, that is executed between a plurality of processes in the parallel program; and
advancing the inter-process communication by a first thread, among a plurality of threads generated in the course of executing the parallel program, during the period from when synchronization processing is executed in the first thread until the first thread and one or more second threads, which are the other threads among the plurality of threads, are synchronized.

A computer program that causes a computer equipped with an arithmetic processing unit capable of processing a plurality of threads in parallel to execute:
processing of controlling, in response to execution of a parallel program, inter-process communication, including at least data transfer processing, that is executed between a plurality of processes in the parallel program; and
processing of advancing the inter-process communication by a first thread, among a plurality of threads generated in the course of executing the parallel program, during the period from when synchronization processing is executed in the first thread until the first thread and one or more second threads, which are the other threads among the plurality of threads, are synchronized.