JP5105359B2

JP5105359B2 - Central processing unit, selection circuit and selection method

Info

Publication number: JP5105359B2
Application number: JP2007324029A
Authority: JP
Inventors: 拓巳丸山; 正幸池田; 和彰村上
Original assignee: Kyushu University NUC; Fujitsu Ltd
Current assignee: Kyushu University NUC; Fujitsu Ltd
Priority date: 2007-12-14
Filing date: 2007-12-14
Publication date: 2012-12-26
Anticipated expiration: 2027-12-14
Also published as: JP2009146227A

Description

The present invention relates to a central processing unit, a selection circuit, and a selection method.

Techniques such as CMP (Chip-level MultiProcessor) and SMT (Simultaneous Multi-Thread) are conventionally known for improving multithread processing capability. Specifically, CMP and SMT execute a plurality of threads in parallel, spatially or temporally. Here, a program is composed of at least one thread (usually several), and a thread is a unit of execution of the program. For example, when a CPU supports multithreaded execution, processing is performed thread by thread.

CMP is a technique in which, for multithreaded execution, a plurality of cores are mounted on one CPU die (semiconductor body) and each core independently executes a thread. For example, as shown in (1) of FIG. 9, a plurality of cores are mounted on one CPU die and share a large-capacity L2 cache. With this configuration, each core independently executes a thread, so that a plurality of threads are executed in parallel, spatially or temporally.

SMT is a technique in which, for multithreaded execution, a single-core CPU is made to behave as a plurality of CPUs (in other words, one core behaves as a plurality of cores) so that a plurality of threads are executed simultaneously. For example, as shown in (2) of FIG. 9, one core is mounted on one CPU die. With this configuration, one core behaves as two cores and thereby executes two threads in parallel, spatially or temporally. Patent Document 1 discloses a technique for SMT that aims to reduce the load incurred when switching the processor environment (for example, when the CPU, having executed one thread, changes various data, such as the addresses of the registers holding the data to be used, in order to execute the other thread).

Also conventionally known, as a technique for improving single-thread performance (that is, for improving the processing speed of an individual thread when a plurality of threads are processed in parallel), is one in which a subset of the thread whose performance is to be improved (hereinafter, the "main thread") is executed ahead of it as a separate thread (hereinafter, the "preceding sub-thread"). Here, a thread is a set of instructions, and the main thread is the thread targeted for performance improvement. For example, in order to omit (or shorten) the fetching of necessary data from memory when the main thread executes, the preceding sub-thread is executed so that the data needed by the main thread is prefetched from memory, which improves single-thread performance.

As techniques of this kind for improving single-thread performance, Non-Patent Document 1 discloses SlipStream (hereinafter "SS"), and Non-Patent Document 2 discloses Dynamic Speculative Pre-computation (hereinafter "DSP").

SS and DSP are now described in more detail. In these techniques, the preceding sub-thread is selected by performing data dependence analysis between instructions after the instructions constituting the main thread have retired (that is, after the main thread has executed). In other words, after each instruction of the main thread is executed, these techniques determine whether the result of the instruction, or the data it uses, is used when other threads are executed. Having established the data dependences between instructions in this way, they detect, in the thread whose performance is to be improved (the main thread), the instructions that have data dependences, and select the detected instructions as the preceding sub-thread. For example, for a main thread consisting of instructions (1) to (8), these techniques detect the instructions that have a data dependence on instruction (1) (for example, instructions (6), (7), and (8)) and select instructions (1), (6), (7), and (8) as the preceding sub-thread.

To give a concrete example for SS: at instruction retirement, SS detects (1) instructions whose results are not referenced by subsequent instructions (un-referenced writes), (2) instructions that write the same value as before the update (non-modifying writes), and (3) branch instructions that are easy to predict. SS then traces the data dependences between instructions backward. In doing so, it removes from the main thread the instructions (1) to (3) above and the instructions that supply only the data needed to execute them. The remaining instructions are selected as the preceding sub-thread.

To give a concrete example for DSP: at instruction retirement, DSP detects (4) instructions that frequently cause cache misses (delinquent loads). DSP then traces the data dependences between instructions backward. In doing so, it selects the instruction (4) above and the instructions that supply the data needed to execute it as the preceding sub-thread.
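The backward trace used by DSP can be illustrated with a minimal sketch. It assumes a simplified retired-instruction record (`Instr`, with one destination register and a tuple of source registers) and a hypothetical helper `backward_slice`; neither name comes from the cited documents, and real implementations also track memory dependences, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical record of one retired instruction."""
    dest: Optional[str]      # register written, or None
    srcs: Tuple[str, ...]    # registers read


def backward_slice(retired: List[Instr], target: int) -> List[int]:
    """Return the indices of the target instruction and of the earlier
    instructions that produce the register values it depends on."""
    needed = set(retired[target].srcs)
    selected = [target]
    # Walk backward through the retired stream, following producers.
    for i in range(target - 1, -1, -1):
        ins = retired[i]
        if ins.dest is not None and ins.dest in needed:
            selected.append(i)
            needed.discard(ins.dest)   # this value is now accounted for
            needed.update(ins.srcs)    # but its own inputs are needed
    return sorted(selected)
```

Because the walk starts from an instruction that has already retired, the sketch makes the limitation discussed in this document visible: nothing can be selected until the main thread has run at least once.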

Patent Document 1: Japanese Patent Laid-Open No. 2005-284749
Non-Patent Document 1: K. Sundaramoorthy, Z. Purser, and E. Rotenberg, "Slipstream Processors: Improving both Performance and Fault Tolerance", in 9th ASPLOS, Nov. 2000, Fig. 1
Non-Patent Document 2: Jamison D. Collins, Dean M. Tullsen, Hong Wang, and John P. Shen, "Dynamic Speculative Precomputation", in 34th International Symposium on Microarchitecture, December 2001, Fig. 1

However, as explained below, the conventional techniques described above have the problem that a preceding sub-thread cannot be selected from a main thread that has never been executed.

Specifically, in the conventional techniques described above (SS and DSP), the preceding sub-thread is identified and selected by performing data dependence analysis between instructions when the instructions constituting the main thread retire. A preceding sub-thread therefore cannot be selected from a main thread that has never been executed. In other words, in the conventional techniques the preceding sub-thread is selected only after the main thread has been executed at least once.

The present invention has been made to solve the above problem of the conventional techniques, and its object is to provide a central processing unit, a selection circuit, and a selection method capable of selecting a preceding sub-thread from a main thread that has never been executed.

To solve the problem described above and achieve the object, the present invention is a central processing unit that executes a main thread composed of a plurality of instructions to be executed, and that selects from the main thread, as a preceding sub-thread, one or more instructions that prefetch information needed by a predetermined thread and executes them ahead of the predetermined thread. The central processing unit comprises selection means for selecting the preceding sub-thread by determining what each instruction constituting the main thread instructs the central processing unit to do, rather than by the relationships between instructions obtained by analyzing the execution result of the main thread.

Further, in the above invention, the selection means selects the preceding sub-thread by determining, as said content, the instruction type of each instruction constituting the thread.

Further, in the above invention, the selection means determines, as the instruction type, whether each instruction is a store instruction (which stores data held in a register into memory) or a floating-point arithmetic instruction, and selects the preceding sub-thread by selecting the instructions that are neither.

According to the present invention, a preceding sub-thread can be selected from a main thread that has never been executed.

Further, according to the present invention, the preceding sub-thread can be selected simply by checking instruction types.

Further, according to the present invention, the preceding sub-thread can be selected simply by excluding store instructions and floating-point instructions.

Embodiments of the central processing unit, selection circuit, and selection method according to the present invention are described in detail below with reference to the accompanying drawings. The description covers, in order, the main terms used in this embodiment, the outline and features of the preceding sub-thread identification circuit according to this embodiment, the configuration of the central processing unit, and the flow of processing, and ends with various modifications of this embodiment.

[Explanation of terms]
First, the main terms used in this embodiment are explained with reference to FIG. 9, which illustrates CMP and SMT. The "CPU (Central Processing Unit)" used in this embodiment (corresponding to the "central processing unit" recited in the claims) is one of the components constituting a computer; it controls the devices and computes and processes data. For example, a CPU executes a program stored in memory, receives data from an input device or a storage device, computes and processes it, and outputs the result to an output device or a storage device. A CPU consists of a CPU die (semiconductor body) on which a core is mounted. A core is the internal circuit at the heart of the CPU, in which units involved in arithmetic processing, such as arithmetic circuits, cache memory, and registers, are integrated.

The CPU used in this embodiment performs multithreaded execution; examples are CMP (Chip-Level MultiProcessor) and SMT (Simultaneous Multi-Thread). Here, a program is composed of at least one thread (usually several), and a thread is a unit of execution of the program. For example, when the CPU supports multithreaded execution, the CPU performs processing thread by thread.

CMP is a technique in which, for multithreaded execution, a plurality of cores are mounted on one CPU die (semiconductor body) and each core independently executes a thread. For example, as shown in (1) of FIG. 9, a plurality of cores are mounted on one CPU die and share a large-capacity L2 cache. With this configuration, each core independently executes a thread. SMT is a technique in which, for multithreaded execution, a one-core CPU is made to behave as two CPUs (in other words, one core behaves as a plurality of cores, for example two) so that a plurality of execution threads are executed simultaneously.

In SMT, as shown in (2) of FIG. 9, one core is mounted on one CPU die. With this configuration, one core behaves as two cores and thereby executes a plurality of execution threads simultaneously. CMP and SMT are not mutually exclusive; both can be applied at the same time.

Here, for example, even a program that would be divided into a plurality of threads and processed thread by thread on a CPU supporting multithreaded execution is processed as a single thread (as one unit of execution, without being divided into a plurality of threads) when the CPU does not support multithreaded execution. A thread is a set of instructions. The main thread is the thread targeted for performance improvement.

The instruction type used in this embodiment is the kind of operation that an instruction directs the CPU to perform. Examples include a "load instruction", which loads data from memory (for example, a cache or memory provided outside the CPU) into the register file; a "store instruction", which stores data held in the register file into memory; and a "floating-point arithmetic instruction", which operates on numbers represented by a "mantissa" (the sequence of digit values) and an "exponent" (the position of the radix point).

Multithreaded execution, as mentioned above, means executing a plurality of execution threads simultaneously. In CMP, for example, each of the cores mounted on one CPU die executes a thread, so that one CPU performs a plurality of processes simultaneously. In SMT, for example, the CPU's processing time is divided into very short units that are assigned to the threads in turn, so that one CPU (one core) appears to perform a plurality of processes simultaneously.

The CPU used in this embodiment is described as performing out-of-order execution.

Out-of-order execution executes instructions that have no dependences on one another one after another, regardless of the order in which they appear in the program. In other words, instructions are executed as soon as the data they need is available, regardless of the order in which they are written in the program. For example, if the data needed by an earlier instruction is not yet available but the data needed by a later instruction is, the later instruction is executed first. In-order execution, by contrast, executes instructions one after another in the order in which they appear in the program. Although the CPU used in this embodiment is described as performing out-of-order execution, it may instead perform in-order execution.
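The difference between the two issue policies can be sketched with a toy model. It assumes one instruction issues per cycle and that each instruction has a known cycle at which its operands become available (`ready_at`); both functions and the single-issue restriction are illustrative assumptions, not part of the embodiment.

```python
from typing import List, Tuple


def in_order(ready_at: List[int]) -> List[Tuple[int, int]]:
    """Issue strictly in program order; a stalled instruction blocks
    every later one. Returns (instruction index, issue cycle) pairs."""
    cycle = 0
    schedule = []
    for i, ready in enumerate(ready_at):
        cycle = max(cycle, ready)   # wait until operands are available
        schedule.append((i, cycle))
        cycle += 1
    return schedule


def out_of_order(ready_at: List[int]) -> List[Tuple[int, int]]:
    """Each cycle, issue the oldest instruction whose operands are
    available, even if earlier instructions are still waiting."""
    pending = set(range(len(ready_at)))
    cycle = 0
    schedule = []
    while pending:
        ready = [i for i in pending if ready_at[i] <= cycle]
        if ready:
            oldest = min(ready)
            schedule.append((oldest, cycle))
            pending.remove(oldest)
        cycle += 1
    return schedule
```

With `ready_at = [3, 0, 0]`, the in-order model issues nothing until cycle 3, whereas the out-of-order model issues instructions 1 and 2 in cycles 0 and 1 and instruction 0 in cycle 3.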

Prefetching is a function by which the CPU reads data into cache memory in advance.

[Outline and features of the preceding sub-thread identification circuit]
First, the outline and features of the preceding sub-thread identification circuit 11 (corresponding to the "selection circuit" recited in the claims) provided in the CPU 100 according to this embodiment are described with reference to FIG. 1, which illustrates the outline and features of the circuit.

As shown in the figure, the CPU 100 according to the first embodiment includes, as elements closely related to the present invention, Decode 10, the preceding sub-thread identification circuit 11, and Queue 12. Queue 12 stores instructions waiting to be processed.

With this configuration, the CPU 100 according to the first embodiment, in outline, executes a main thread composed of a plurality of instructions to be executed, selects from the main thread, as a preceding sub-thread (corresponding to the "preceding sub-thread" recited in the claims), one or more instructions that prefetch information needed by a predetermined thread, and executes them ahead of the predetermined thread. Its main feature is that a preceding sub-thread can be selected from a main thread that has never been executed.

Because this feature is realized mainly by the preceding sub-thread identification circuit 11, the following description focuses on that circuit and omits the execution of the main thread in the CPU 100 according to the first embodiment.

First, in the CPU 100 according to the first embodiment, as shown in (1) of FIG. 1, Decode 10 interprets what kind of instruction each instruction is and sends the instruction to the preceding sub-thread identification circuit 11. In the example shown in (1) of FIG. 1, Decode 10 interprets the content of each instruction and sends instructions (1) to (8), which constitute the main thread, to the preceding sub-thread identification circuit 11.

Next, in the CPU 100 according to the first embodiment, as shown in (2) of FIG. 1, the preceding sub-thread identification circuit 11 selects the preceding sub-thread by determining what each instruction constituting the main thread instructs the CPU 100 to do, rather than by the relationships between instructions obtained by analyzing the execution result of the main thread. In the example shown in (2) of FIG. 1, the circuit determines the instruction types and selects instructions (1), (6), (7), and (8), which are neither store instructions nor floating-point arithmetic instructions, as the preceding sub-thread.
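The type-based selection can be sketched in software as a simple filter. This is only an illustration of the rule stated above, not the circuit of the embodiment; the `InstrType` enumeration, the function name, and the concrete types assigned to instructions (1) to (8) are assumptions chosen so that instructions (2) to (5) are store or floating-point instructions.

```python
from enum import Enum, auto
from typing import List, Tuple


class InstrType(Enum):
    LOAD = auto()
    STORE = auto()
    FP_ARITH = auto()
    INT_ARITH = auto()
    BRANCH = auto()


# Types excluded from the preceding sub-thread, per the stated rule.
EXCLUDED = {InstrType.STORE, InstrType.FP_ARITH}


def select_preceding_subthread(
    main_thread: List[Tuple[int, InstrType]]
) -> List[Tuple[int, InstrType]]:
    """Keep every decoded instruction that is neither a store nor a
    floating-point arithmetic instruction, preserving program order."""
    return [(n, t) for n, t in main_thread if t not in EXCLUDED]


# Hypothetical main thread of eight decoded instructions.
main_thread = [
    (1, InstrType.LOAD),
    (2, InstrType.FP_ARITH),
    (3, InstrType.FP_ARITH),
    (4, InstrType.STORE),
    (5, InstrType.STORE),
    (6, InstrType.INT_ARITH),
    (7, InstrType.LOAD),
    (8, InstrType.BRANCH),
]
selected = [n for n, _ in select_preceding_subthread(main_thread)]
```

The filter looks only at each instruction's decoded type, so it needs no execution history and no dependence analysis, which is why it can be applied to a main thread that has never run.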

Then, as shown in (3) of FIG. 1, the preceding sub-thread identification circuit 11 sends only the preceding sub-thread to Queue 12.

Here, as described above, the preceding sub-thread identification circuit 11 runs in a configuration added for multithreaded execution and therefore operates independently of the main thread. After the instructions of the preceding sub-thread have been decoded, the circuit identifies, according to the decoded instruction types, the instructions needed for the preceding sub-thread and sends them to Queue 12.

Consequently, as stated above as its main feature, the CPU 100 according to the first embodiment can select a preceding sub-thread from a main thread that has never been executed.

That is, in the conventional approach, the CPU selects and executes the preceding sub-thread by performing data dependence analysis between instructions after the instructions constituting the main thread have retired (after the main thread has executed). In other words, in the conventional approach, instructions that the main thread has never executed cannot be executed in advance as a sub-thread, which limits the performance improvement of the main thread. In contrast, the CPU according to the first embodiment can select the preceding sub-thread independently of the execution of the main thread, regardless of whether the main thread has been executed.

[CPU configuration]
Next, the configuration of the CPU 100 according to the first embodiment is described with reference to FIGS. 2 to 5. FIG. 2 is a block diagram showing the configuration of the central processing unit (CMP). FIG. 3 shows an example of a source, a main thread sequence, and a preceding sub-thread sequence. FIG. 4 shows an example circuit diagram of the preceding sub-thread identification circuit 29. FIG. 5 illustrates the effect of the preceding sub-thread identification circuit. Unless otherwise noted, the following describes the configuration of the CPU 100 when the present invention is applied to CMP.

First, the connections among the components of the CPU 100 are briefly described. As shown in FIG. 2, the L2 cache 20 is connected to the I cache 21 and the D cache 23. The I cache 21 is connected to the L2 cache 20, the PC 26, and Decode 27. Queue 22 is connected to Decode 27 and the ROB 24. The D cache 23 is connected to the ROB 24 and the EX 28. The ROB 24 is connected to Queue 22, the RF 25, the D cache 23, and the EX 28. The RF 25 is connected to the ROB 24. The PC 26 is connected to the I cache 21. Decode 27 is connected to the I cache 21 and Queue 22. The EX 28 is connected to the ROB 24, the D cache 23, and the RF 25.

The L2 (level 2) cache is a cache memory provided inside the CPU (outside the cores) and stores frequently used instructions and data. For example, the L2 cache 20 stores instructions (1) to (8) shown in FIG. 3 and the data (operands) used in executing instructions.

The I (instruction) cache is a cache memory provided inside the CPU (inside the core) and stores instructions. For example, the I cache 21 stores instructions (1) to (8) shown in FIG. 3. Because the I cache 21 is provided inside the core, the PC 26 described later can access it faster than the L2 cache 20, which is located outside the core.

Queue 22 is used to hold items awaiting some processing; specifically, it stores instructions waiting to be processed by the EX 28 described later. For example, Queue 22 stores the instructions interpreted by Decode 27, described later, in their original execution order (that is, the order in which they would be executed in order, not the order in which they are executed out of order).

The D cache 23 is a cache memory provided inside the CPU (inside the core) and stores data (operands). Specifically, a load instruction transfers the contents of a specified address of the D cache 23 to the ROB 24 described later, and a store instruction transfers the contents of the ROB 24 or the RF 25, described later, to a specified address of the D cache 23.

The ROB 24 (Reorder Buffer) is a buffer provided inside a CPU (inside the core) that has an out-of-order execution function. Its entries are allocated to the instructions stored in the queue, and it stores the results obtained when those instructions are executed out of order. Together with the RF 25 described later, the ROB 24 supplies the source data (operands) needed to execute instructions to the EX 28 described later.

The RF 25 (Register File) is a storage element provided inside the CPU (inside the core). It is used for computation and for holding execution state, and stores data (operands), for example when the result of another instruction is to be used, as well as the results of instruction execution. Specifically, the results stored in the ROB 24 are transferred to the RF 25 at instruction commit, in an order corresponding to the original execution order, and the results held in the ROB 24 are then deleted. Together with the ROB 24, the RF 25 supplies the data (operands) needed to execute instructions to the EX 28 described later.
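The interplay between the ROB 24 and the RF 25 described above can be modeled with a minimal sketch. The class and method names are hypothetical, and the model assumes one destination register per instruction; it shows only that results may arrive out of order while the register file is updated in program order.

```python
from typing import Dict, List


class ReorderBuffer:
    """Toy model: results complete in any order, commit in program order."""

    def __init__(self) -> None:
        # Keyed by sequence number; dict insertion order is program order.
        self.entries: Dict[int, dict] = {}
        self.register_file: Dict[str, int] = {}

    def allocate(self, seq: int, dest: str) -> None:
        """Reserve a result entry when the instruction enters the queue."""
        self.entries[seq] = {"dest": dest, "result": None, "done": False}

    def complete(self, seq: int, result: int) -> None:
        """Record a result; may be called in any order."""
        entry = self.entries[seq]
        entry["result"] = result
        entry["done"] = True

    def commit(self) -> List[int]:
        """Move finished results to the register file, oldest first,
        stopping at the first instruction that has not finished."""
        committed = []
        while self.entries:
            seq, entry = next(iter(self.entries.items()))
            if not entry["done"]:
                break
            self.register_file[entry["dest"]] = entry["result"]
            del self.entries[seq]   # the ROB copy is deleted at commit
            committed.append(seq)
        return committed
```

If instruction 2 finishes before instruction 1, `commit()` returns nothing until instruction 1 has also finished, after which both are committed in order.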

The PC (instruction control unit) 26 reads (fetches) instructions from the I cache 21. For example, at instruction execution timing, the PC 26 checks whether the instruction to be executed is stored in the I cache 21 and, if so, reads it. If the instruction is not stored in the I cache 21, the PC 26 reads it from the L2 cache 20, stores it in the I cache 21, and then reads it from the I cache 21. If the instruction is not stored in the L2 cache 20 either, the PC 26 reads it from a memory provided outside the CPU (not shown), stores it in the L2 cache 20 and then in the I cache 21, and reads it from the I cache 21.

The Decode 27 interprets what kind of instruction each instruction is (instruction decoding). For example, when the PC 26 reads an instruction to be executed, the Decode 27 interprets it. The Decode 27 then determines whether the Queue 22 has a free entry. If it does, the Decode 27 sends the control signals needed to execute the instruction to the Queue 22 in the original execution order. If it does not, the Decode 27 keeps checking until an entry becomes free. The ROB 24 allocates an entry for storing the result when the instruction is stored in the Queue 22.

The EX 28 obtains the operands needed to execute an instruction from the ROB 24 or the RF 25. For example, for an instruction stored in the Queue 22, the EX 28 requests the necessary operands from the ROB 24 or the RF 25. If the instruction is an arithmetic instruction, the EX 28 performs the operation using the obtained operands and stores the result in the allocated ROB 24 entry. If the instruction is a load instruction, the EX 28 computes the address from the obtained operands and requests the data at that address from the D cache 23. If the D cache 23 holds the data, the EX 28 obtains it and stores it in the allocated ROB 24 entry. If not, the EX 28 requests the data from the L2 cache 20. If the L2 cache 20 holds the data, the EX 28 obtains it, stores it in the D cache 23, and stores it in the allocated ROB 24 entry. If the L2 cache 20 does not hold the data, the EX 28 waits until the data has been stored in the cache or buffer to which the request was issued, and then obtains the data and stores it in the allocated ROB 24 entry. As a more detailed example, when a required operand is the execution result of another instruction, the EX 28 obtains the operand from the ROB 24 once the other instruction has been executed and its result has been stored in the ROB 24. If the result has already been transferred from the ROB 24 to the RF 25, the EX 28 obtains it from the RF 25.
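The load path described above (D cache 23, then L2 cache 20, then external memory, with each level filled on the way back) can be illustrated with a minimal Python sketch. The class name `Level`, the dictionary-based storage, and the sample address are assumptions made for illustration only; the sketch ignores cache lines, capacity, and timing.

```python
class Level:
    """One storage level (D cache, L2 cache, or external memory). Hypothetical model."""

    def __init__(self, name, lower=None, data=None):
        self.name = name
        self.lower = lower              # next level to ask on a miss
        self.data = dict(data or {})    # address -> value

    def read(self, addr):
        """Return (value, name of the level that supplied it)."""
        if addr in self.data:
            return self.data[addr], self.name
        if self.lower is None:
            raise KeyError(addr)
        value, source = self.lower.read(addr)
        self.data[addr] = value         # fill this level on the way back
        return value, source


memory = Level("memory", data={0x100: 3.5})
l2 = Level("L2 cache 20", lower=memory)
dcache = Level("D cache 23", lower=l2)

first = dcache.read(0x100)    # misses in both caches, served by memory
second = dcache.read(0x100)   # now served by the D cache
```

After the first read, both cache levels hold the value, so a later read of the same address no longer reaches memory.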

Although this embodiment is described in terms of the above relationship among the EX 28, the L2 cache 20, the D cache 23, the ROB 24, and the RF 25, the present invention is not limited to it; the EX 28 (or the CPU 100) may obtain operands from any cache or buffer. For example, this embodiment describes a two-level cache hierarchy consisting of the L2 cache and the D cache, but an L3 cache may also be provided.

The EX 28 (Execution Unit) also selects an executable instruction from among the instructions stored in the queue and allocates processing capacity to it. For example, the EX 28 determines whether it has spare processing capacity. If not, it keeps checking until capacity becomes available. If so, it obtains, from the instructions waiting in the queue, an instruction that can be executed immediately, and allocates processing capacity to it. More specifically, an instruction that can be executed immediately is one whose required operands are stored in the ROB 24 or the RF 25.
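The selection rule above can be sketched as follows. This is an illustrative model only: `QueuedInstruction`, `select_ready`, and the set `available` (standing in for the operands currently held in the ROB 24 or the RF 25) are hypothetical names, not part of the described circuit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedInstruction:
    name: str
    sources: tuple  # names of the operands the instruction reads


def select_ready(queue, available, free_units):
    """Pick up to free_units queued instructions whose operands are all available."""
    if free_units <= 0:
        return []  # no spare processing capacity: nothing is issued
    ready = [ins for ins in queue if all(s in available for s in ins.sources)]
    return ready[:free_units]


queue = [
    QueuedInstruction("FR1 * FR2 -> FR3", ("FR1", "FR2")),  # FR1 not yet loaded
    QueuedInstruction("R3 + 8 -> R3", ("R3",)),
    QueuedInstruction("R4 - 1 -> R4", ("R4",)),
]
available = {"FR2", "R3", "R4"}

issued = select_ready(queue, available, free_units=1)
stalled = select_ready(queue, available, free_units=0)
```

Here the multiplication is skipped because FR1 is not yet available, so the younger addition is issued ahead of it, which is the out-of-order behavior the paragraph describes.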

The EX 28 also executes instructions (performs operations). For example, the EX 28 obtains from the ROB 24 or the RF 25 the operands needed by the instruction to which processing capacity has been allocated, and executes the instruction. The EX 28 then stores the resulting execution result in the ROB 24.

The ROB 24 stores execution results in the RF 25 in the order corresponding to the original execution order. For example, on a commit (that is, when it is determined that processing of the instruction may be completed), the ROB 24 transfers the execution results it holds to the RF 25 in the order corresponding to the original execution order. As a more detailed example, suppose the EX 28 stores in the ROB 24 the execution result of instruction 1 shown in FIG. 3, then that of instruction 3, then that of instruction 4, and finally that of instruction 2. On commit, the results are transferred from the ROB 24 to the RF 25 not in the order in which they were stored in the ROB 24 (instructions 1, 3, 4, 2) but in the original execution order (instructions 1, 2, 3, 4).
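The in-order commit described above can be illustrated with a small Python sketch. The class `ReorderBuffer` and its eager commit policy (retire whenever the oldest instruction has a result) are assumptions for illustration; the description only requires that results reach the RF 25 in the original order.

```python
class ReorderBuffer:
    """Toy reorder buffer: results arrive in any order, retire in program order."""

    def __init__(self, program_order):
        self.order = list(program_order)  # entries allocated in program order
        self.results = {}                 # completed but not yet committed
        self.register_file = []           # (instruction, value) in commit order

    def complete(self, ins, value):
        """Record the result of an instruction that finished executing."""
        self.results[ins] = value

    def commit(self):
        """Retire from the head while the oldest instruction has a result."""
        while self.order and self.order[0] in self.results:
            ins = self.order.pop(0)
            self.register_file.append((ins, self.results.pop(ins)))


rob = ReorderBuffer([1, 2, 3, 4])
for ins in (1, 3, 4, 2):                  # completion order from the example
    rob.complete(ins, f"result{ins}")
    rob.commit()

committed = [ins for ins, _ in rob.register_file]
```

Instructions 3 and 4 wait in the buffer until instruction 2 completes, so the register file receives the results in the order 1, 2, 3, 4.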

The CPU configuration described above is the configuration used to execute the main thread, and it functions as in the conventional technology. Next, the configuration by which the present invention selects and executes a preceding subthread is described. As shown in FIG. 2, the I cache 21 (m), Queue 22 (m), D cache 23 (m), ROB 24 (m), RF 25 (m), and EX 28 (m) are components added to implement the multi-thread execution function. In addition to this configuration, the CPU according to the first embodiment includes, between the Decode 27 (m) and the Queue 22 (m), a preceding subthread identification circuit 29 (corresponding to the "selection circuit" recited in the claims) for selecting the preceding subthread. The following description focuses on the parts that differ from the conventionally functioning configuration; similar parts are described briefly or omitted.

First, the connections among the components added for multi-thread execution, which are used when executing the preceding subthread according to the present invention, are described briefly. As shown in FIG. 2, the Decode 27 (m) is connected to the I cache 21 (m) and the preceding subthread identification circuit 29. The preceding subthread identification circuit 29 is connected to the Decode 27 (m) and the Queue 22 (m). The Queue 22 (m) is connected to the preceding subthread identification circuit 29 and the ROB 24 (m).

The I cache 21 (m) has the same function as the I cache 21. The PC 26 (m) has the same function as the PC 26 and performs the same processing, and the Decode 27 (m) has the same function as the Decode 27 and performs the same processing.

The D cache 23 (m) stores the operands needed to execute the preceding subthread identified by the preceding subthread identification circuit 29, described later.

The ROB 24 (m) is allocated to the preceding subthread that has been identified by the preceding subthread identification circuit 29, described later, and stored in the queue, and it stores the operands used by that preceding subthread. The ROB 24 (m) also stores the execution results obtained by executing the preceding subthread out of order.

The RF 25 (m) is a storage element provided inside the CPU that is used for computation and for holding execution state, and it stores data. Specifically, the execution results of the preceding subthread, executed by the EX 28 (m) described later and stored in the ROB 24 (m), are transferred from the ROB 24 (m) to the RF 25 (m) in the order corresponding to the original execution order.

The Queue 22 (m) stores the preceding subthread identified by the preceding subthread identification circuit 29.

The preceding subthread identification circuit 29 selects the preceding subthread by determining what each instruction of the main thread directs the central processing unit to do, rather than by using relationships among threads obtained by analyzing the execution result of the main thread. For example, the circuit determines the instruction type of each instruction and selects the preceding subthread accordingly. As a more detailed example, it determines whether each instruction is a store instruction, which stores data held in a register into memory, or a floating-point arithmetic instruction, and selects the preceding subthread by selecting the instructions that are neither.

To describe this identification further, the preceding subthread identification circuit 29 identifies the preceding subthread from the main thread without using dependencies between instructions. For example, it uses only the instruction type, not the data dependencies between instructions. When an instruction of the main thread whose type has been decoded by the Decode 27 (m) is sent toward the Queue 22 (m), the circuit determines whether the instruction matches (or does not match) an instruction type stored in advance, identifies the matching (or non-matching) instructions as the preceding subthread, and sends only the preceding subthread to the Queue 22 (m).

The preceding subthread identification circuit 29 is now described with a more detailed example, with reference to FIGS. 3 and 4. For example, as shown in FIG. 3, the source shown in (S) below is compiled and recognized as the main thread shown in (1) to (8) below.

In the source shown in FIG. 3 ((S) below), "i" is a variable; the source is executed while i is varied from 1 to 1000000, and processing then ends. In the main thread shown in FIG. 3 ((1) to (8) below), "R1" denotes the start address of array X, "R2" denotes the start address of array Y, "R3" denotes the variable "i" × 4 (bytes), "R4" denotes 1000000 - "i", and "FR" denotes the register file.

The content of each instruction of the main thread shown in FIG. 3 ((1) to (8) below) is as follows. Instruction (1) loads the data stored at the address in array X given by R1 plus R3 into FR1 (register file "1"). Instruction (2) is a floating-point arithmetic instruction that multiplies the value in FR1 by the value in FR2 and stores the result in FR3. Instruction (3) is a floating-point arithmetic instruction that adds the value in FR3 and the value in FR4 and stores the result in FR5. Instruction (4) is a floating-point arithmetic instruction that divides the value in FR5 by the value in FR6 and stores the result in FR7. Instruction (5) is a store instruction that stores the value in FR7 at the address in array Y given by R2 plus R3. Instruction (6) adds 8 to R3. Instruction (7) subtracts 1 from R4. Instruction (8) causes instructions (1) to (8) to be repeated while R4 is greater than 0.

(S) do i = 1, 1000000
      Y(i) = (X(i) * B + C) / D
    end do
(1) [R1 + R3] → FR1
(2) FR1 × FR2 → FR3
(3) FR3 + FR4 → FR5
(4) FR5 / FR6 → FR7
(5) FR7 → [R2 + R3]
(6) R3 + 8 → R3
(7) R4 - 1 → R4
(8) Branch Loop if R4 > 0
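To make the correspondence between the source (S) and instructions (1) to (8) concrete, the following Python sketch emulates the eight instructions one by one. Several details are assumptions for illustration and are not stated in the description: the memory is modeled as a dictionary, elements are taken to be 8 bytes wide (matching the increment in instruction (6)), array Y is placed directly after array X, FR2, FR4, and FR6 hold B, C, and D, and R4 starts at the element count so that the loop body runs once per element. The function name `run_main_thread` is hypothetical, and the input must contain at least one element.

```python
def run_main_thread(x, b, c, d, elem_size=8):
    """Emulate instructions (1)-(8) over a non-empty list x and return array Y."""
    n = len(x)
    r1, r2 = 0, n * elem_size            # assumed base addresses of X and Y
    mem = {r1 + i * elem_size: v for i, v in enumerate(x)}
    r3, r4 = 0, n                        # byte offset, remaining iterations
    fr2, fr4, fr6 = b, c, d              # assumed register assignment of B, C, D
    while True:
        fr1 = mem[r1 + r3]               # (1) [R1 + R3] -> FR1 (load)
        fr3 = fr1 * fr2                  # (2) FR1 x FR2 -> FR3
        fr5 = fr3 + fr4                  # (3) FR3 + FR4 -> FR5
        fr7 = fr5 / fr6                  # (4) FR5 / FR6 -> FR7
        mem[r2 + r3] = fr7               # (5) FR7 -> [R2 + R3] (store)
        r3 = r3 + elem_size              # (6) R3 + 8 -> R3
        r4 = r4 - 1                      # (7) R4 - 1 -> R4
        if not r4 > 0:                   # (8) branch back while R4 > 0
            break
    return [mem[r2 + i * elem_size] for i in range(n)]


y = run_main_thread([1.0, 2.0, 3.0], b=2.0, c=1.0, d=2.0)
```

Only instructions (1), (6), (7), and (8) influence which addresses are touched; instructions (2) to (5) only compute and write the values of Y.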

Here, the source is source code (a program) written in a high-level language. "High-level language" is a general term for programming languages whose syntax and concepts are close to natural language and easy for humans to understand. For example, in a high-level language, source code is written mainly as a combination of English words and symbols; a compiler or the like converts (compiles) it into machine language, and a processing device (for example, a CPU) executes the resulting machine code. Compilation is the conversion of the source code of a program written by a person in a programming language (for example, a high-level language) into machine language executable on a computer.

As shown in FIG. 4, when an instruction whose type has been decoded by the Decode 27 (m) arrives at the preceding subthread identification circuit 29 and is a floating-point arithmetic instruction (instructions (2) to (4)) or a memory store instruction (instruction (5)), the switch turns ON ((1) in FIG. 4) and the instruction is not sent to the Queue 22 (m) ((2) in FIG. 4). If the instruction is neither a floating-point arithmetic instruction nor a memory store instruction, it is identified as an instruction of the preceding subthread and the switch turns OFF ((3) in FIG. 4). The circuit then sends instructions (1) and (6) to (8) to the Queue 22 (m), where they are stored as instructions (a) to (d) of the preceding subthread, as shown in FIG. 4. The ROB 24 (m) allocates an entry for storing the result when each instruction is stored in the Queue 22 (m).
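The type-based filtering described above can be sketched in Python as follows. The enum `Kind`, the tuple layout, and the function name `select_preceding_subthread` are hypothetical; the description specifies a hardware switch, not software, so this is only a behavioral illustration.

```python
from enum import Enum, auto


class Kind(Enum):
    LOAD = auto()
    FP_ARITH = auto()
    STORE = auto()
    INT_ARITH = auto()
    BRANCH = auto()


# Instruction types that are blocked from the Queue 22 (m).
EXCLUDED = {Kind.FP_ARITH, Kind.STORE}

# Main thread of FIG. 3 as (number, type, text).
MAIN_THREAD = [
    (1, Kind.LOAD, "[R1 + R3] -> FR1"),
    (2, Kind.FP_ARITH, "FR1 x FR2 -> FR3"),
    (3, Kind.FP_ARITH, "FR3 + FR4 -> FR5"),
    (4, Kind.FP_ARITH, "FR5 / FR6 -> FR7"),
    (5, Kind.STORE, "FR7 -> [R2 + R3]"),
    (6, Kind.INT_ARITH, "R3 + 8 -> R3"),
    (7, Kind.INT_ARITH, "R4 - 1 -> R4"),
    (8, Kind.BRANCH, "Branch Loop if R4 > 0"),
]


def select_preceding_subthread(instructions):
    """Keep only instructions whose type is not excluded; no dependency analysis."""
    return [ins for ins in instructions if ins[1] not in EXCLUDED]


selected = select_preceding_subthread(MAIN_THREAD)
labels = dict(zip("abcd", (ins[0] for ins in selected)))
```

The decision for each instruction depends only on its own type, so the filter needs no information from a previous run of the main thread.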

The EX 28 (m) obtains the operands needed to execute an instruction from the ROB 24 (m) or the RF 25 (m). The EX 28 (m) also selects, from the preceding subthread selected by the preceding subthread identification circuit 29 and stored in the Queue 22 (m), an instruction that can be executed immediately, and allocates processing capacity to it. The EX 28 (m) then executes the instruction (performs the operation) and stores the result in the ROB 24 (m). The RF 25 (m) stores the execution results held in the ROB 24 (m) at instruction commit, in the order corresponding to the original execution order.

As described above, the preceding subthread identification circuit 29 operates within the configuration added for multi-thread execution and therefore operates independently of the main thread. After the instructions of the preceding subthread have been decoded, the circuit identifies, according to the decoded instruction type, the instructions needed for the preceding subthread and sends them on to the EX 28 and other units.

That is, a feature of the present invention is that the preceding subthread can be selected independently of the execution of the main thread, regardless of whether the main thread has been executed. As a result, even when the main thread is executed for the first time (that is, even if it has never been executed), the preceding subthread can be selected and executed ahead of it, which improves the performance of the main thread.

For example, by having the EX 28 (m) execute the preceding subthread before the first execution of the main thread, the operands needed by the main thread are prefetched in advance from memory into the L2 cache 20, the D cache 23, and the D cache 23 (m). When the main thread runs, the EX 28 can then skip fetching those operands from memory, which improves the performance of the main thread.
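The prefetch effect described above can be illustrated with an idealized Python sketch. The cache is modeled as an unbounded set of addresses shared by both threads, with no cache lines or evictions; the base address `0x1000`, the 8-byte stride, and the function name `count_misses` are assumptions for illustration.

```python
def count_misses(addresses, cache):
    """Count addresses absent from the cache, filling the cache as we go."""
    misses = 0
    for addr in addresses:
        if addr not in cache:
            misses += 1
            cache.add(addr)
    return misses


# Addresses touched by load instruction (1), [R1 + R3], over 100 iterations.
load_addresses = [0x1000 + 8 * i for i in range(100)]

# Main thread alone: every load misses.
cold = count_misses(load_addresses, set())

# Preceding subthread runs first and warms the shared cache.
shared_cache = set()
count_misses(load_addresses, shared_cache)
warm = count_misses(load_addresses, shared_cache)
```

In this idealized model the main thread sees no misses once the preceding subthread has touched the same addresses; a real cache with finite capacity would retain only part of that benefit.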

The features of the present invention are now described again in comparison with the conventional technique. As shown in (Conventional) of FIG. 5, the conventional technique executes the main thread once, then analyzes data dependencies (whether the result of each instruction, or the data it uses, is used when executing other instructions), and selects the preceding subthread by referring to the analysis result. Consequently, it cannot improve performance during the first execution of the main thread. In the example of FIG. 5, after the dependency analysis, the first round of the preceding subthread, corresponding to the second round of the main thread, is executed. The second round of the preceding subthread then runs using the results of its first round (in the example of FIG. 3, the values of R3 and R4, for instance), and only then does performance improve, for the third round of the main thread that follows.

That is, in the conventional technique, at (A) in FIG. 5 the executed preceding subthread loads the data stored at [R1 + R3] into FR1, which amounts to issuing a prefetch for the data stored at [R1 + R3]; this improves the performance of main-thread instruction (1) at (B) in FIG. 5. In other words, in the example shown in (Conventional) of FIG. 5, the conventional technique does not improve the performance of the first and second rounds of the main thread.

In contrast, as shown in (X) of FIG. 5, the central processing unit according to the first embodiment can select and execute the preceding subthread concurrently with the execution of the main thread. That is, as shown in (C) of FIG. 5, the preceding subthread loads the data stored at [R1 + R3] into FR1. Seen from the second round of the main thread, this is equivalent to a prefetch having been issued for the data stored at [R1 + R3]. As a result, the performance of main-thread instruction (1) shown in (D) of FIG. 5 improves, and so does the performance of the main thread as a whole. In other words, the central processing unit according to the first embodiment improves main-thread performance at an earlier point (in an earlier round of the main thread) than the conventional technique does.

Furthermore, as shown in (Y) of FIG. 5, the central processing unit according to the first embodiment can improve performance from the first round of the main thread by having the EX 28 (m) execute the preceding subthread before the first execution of the main thread. That is, as shown in (E) of FIG. 5, the EX 28 (m) executes the preceding subthread before the main thread runs, so the preceding subthread loads the data stored at [R1 + R3] into FR1 before the first round of the main thread is executed. Seen from the first round of the main thread, this is equivalent to a prefetch having been issued for the data stored at [R1 + R3]. As a result, the performance of main-thread instruction (1) shown in (F) of FIG. 5 improves, and so does the performance of the main thread as a whole. In other words, whereas the conventional technique cannot improve the performance of the first round of the main thread, the central processing unit according to the first embodiment improves performance from the first round. The timing at which the preceding subthread is selected and executed is not limited to the cases shown in (X) and (Y) of FIG. 5; it may be selected and executed at any timing.

The central processing unit according to the first embodiment also excludes memory store instructions, which prevents execution of the preceding subthread from corrupting memory. In other words, because the preceding subthread only reads data and never writes to memory, it cannot destroy data that is stored in memory for use by the main thread. This avoids situations in which executing the preceding subthread causes the main thread to malfunction.

Floating-point arithmetic instructions are rarely used to compute the addresses needed for memory accesses, and they generally take longer to execute than other instructions such as integer arithmetic instructions. By excluding floating-point arithmetic instructions, the central processing unit according to the first embodiment therefore shortens the time needed to execute the preceding subthread, reduces the processing load on the central processing unit, and efficiently improves the performance of the main thread.

As shown in FIG. 7, in a typical scientific computing program, instructions other than memory store instructions and floating-point arithmetic instructions account for less than 40% of all instructions. By selecting only those instructions, the instruction identification circuit for the preceding subthread can obtain a preceding subthread that is sufficiently short (in processing time) relative to the main thread. FIG. 7 is a diagram illustrating the distribution of instruction types.

[Example of processing by the preceding subthread identification circuit]
Next, an example of the processing performed by the preceding subthread identification circuit 29 in the first embodiment is described with reference to FIG. 6. FIG. 6 is a flowchart showing an example of the processing of the preceding subthread identification circuit 29.

同図に示すように、先行サブスレッド識別回路２９は、命令を受信すると（ステップＳ１０１肯定）、命令内容種別を判別する（ステップＳ１０２）。つまり、本スレッドの実行結果を解析して得られる命令各々間での関係ではなく、本スレッドを構成する命令各々が中央処理装置に指示する内容を判別する。そして、先行サブスレッド識別回路２９は、ストア命令または浮動小数点演算命令かどうかを判別する（ステップＳ１０３）。ここで、先行サブスレッド識別回路２９は、受信した命令が、ストア命令または浮動小数点演算命令でない場合には（ステップＳ１０３否定）、受信した命令を先行サブスレッドとして選択して、Ｑｕｅｕｅ２２に送信する（ステップＳ１０４）。一方、先行サブスレッド識別回路２９は、受信した命令が、ストア命令または浮動小数点演算命令である場合には（ステップＳ１０３肯定）、Ｑｕｅｕｅ２２には送信しない（ステップＳ１０５）。そして、先行サブスレッド識別回路２９による選択処理を終了する。 As shown in the figure, when the preceding subthread identification circuit 29 receives an instruction (Yes at step S101), it determines the instruction content type (step S102). That is, it determines not the relationship between instructions obtained by analyzing the execution result of this thread, but what each instruction constituting this thread directs the central processing unit to do. The preceding subthread identification circuit 29 then determines whether the received instruction is a store instruction or a floating-point arithmetic instruction (step S103). If the received instruction is neither a store instruction nor a floating-point arithmetic instruction (No at step S103), the circuit selects it as part of the preceding subthread and sends it to the Queue 22 (step S104). If, on the other hand, the received instruction is a store instruction or a floating-point arithmetic instruction (Yes at step S103), the circuit does not send it to the Queue 22 (step S105). The selection process by the preceding subthread identification circuit 29 then ends.
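The flow of steps S102 to S105 can be modeled in software as a simple filter. The following is a minimal sketch, not the circuit itself: the `InstrType` categories, the `identify` function, and the sample instruction strings are hypothetical names introduced only for illustration.

```python
from collections import deque
from enum import Enum, auto


class InstrType(Enum):
    # Hypothetical instruction categories used only for this illustration.
    LOAD = auto()
    STORE = auto()
    INT_ALU = auto()
    FLOAT = auto()
    BRANCH = auto()


# Instruction types excluded from the preceding subthread in the first embodiment.
EXCLUDED = {InstrType.STORE, InstrType.FLOAT}


def identify(decoded, queue):
    """Model steps S102 to S105 for one decoded instruction; return True if it was queued."""
    _text, instr_type = decoded        # S102: determine the instruction content type
    if instr_type in EXCLUDED:         # S103: store or floating-point instruction?
        return False                   # S105: do not send it to the queue
    queue.append(decoded)              # S104: select it and send it to the queue
    return True


# A made-up fragment of a main thread after decoding.
main_thread = [
    ("ld r1, [r2]", InstrType.LOAD),
    ("fadd f0, f1, f2", InstrType.FLOAT),
    ("st f0, [r3]", InstrType.STORE),
    ("add r2, r2, 8", InstrType.INT_ALU),
    ("bne r2, r4, loop", InstrType.BRANCH),
]

queue22 = deque()
for decoded in main_thread:
    identify(decoded, queue22)

subthread = [text for text, _ in queue22]
print(subthread)
```

The decision depends only on the type of the instruction being examined, so no execution history or dependency analysis of this thread is needed, which is the point the flowchart makes.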

［実施例１の効果］
上記したように、実施例１によれば、本スレッドの実行結果を解析して得られる命令各々間での関係ではなく、本スレッドを構成する命令各々が中央処理装置に指示する内容を判別して先行サブスレッドを選択するので、一度も実行していない本スレッドから、先行サブスレッドを選択することが可能である。 [Effects of the first embodiment]
As described above, according to the first embodiment, the preceding subthread is selected by determining what each instruction constituting this thread directs the central processing unit to do, rather than the relationship between instructions obtained by analyzing the execution result of this thread. The preceding subthread can therefore be selected even from a thread that has never been executed.

また、実施例１によれば、先行サブスレッド識別回路２９は、内容として、各々の命令種別を判別して先行サブスレッドを選択するので、命令種別を確認するだけで簡単に先行サブスレッドを選択することが可能である。 Further, according to the first embodiment, the preceding subthread identification circuit 29 determines the type of each instruction as the content and selects the preceding subthread accordingly, so the preceding subthread can be selected simply by checking the instruction type.

具体的には、従来の手法においては、本スレッドのリタイア命令のデータ依存解析結果に基づいて先行スレッド用命令を識別しているため、必要な回路規模が大きくなってしまう。このような従来の手法と比較して、本発明を適用することにより、先行サブスレッドを選択する際に必要な処理量や、処理に必要な回路の規模を小さくすることが可能である。この結果、例えば、本発明を簡単に実装することが可能である。 Specifically, the conventional method identifies instructions for the preceding thread based on the result of data dependency analysis of the retired instructions of this thread, which requires a large circuit. Compared with such a conventional method, applying the present invention reduces both the amount of processing required to select the preceding subthread and the scale of the circuit required for that processing. As a result, for example, the present invention can be implemented easily.

また、実施例１によれば、先行サブスレッド識別回路２９は、命令種別として、レジスタに格納されるデータをメモリに格納するストア命令または浮動小数点演算命令に該当するかを判別し、該当しない命令を選択することで先行サブスレッドを選択するので、ストア命令と浮動小数点命令とを除外するのみで、簡単に先行サブスレッドを選択することが可能である。 Further, according to the first embodiment, the preceding subthread identification circuit 29 determines whether each instruction is, by instruction type, a store instruction (which stores data held in a register into memory) or a floating-point arithmetic instruction, and selects the preceding subthread by selecting the instructions that are neither. The preceding subthread can thus be selected easily, merely by excluding store instructions and floating-point instructions.

さて、これまで本発明の実施例について説明したが、本発明は上述した実施例以外にも、種々の異なる形態にて実施されてよいものである。 Although the embodiments of the present invention have been described so far, the present invention may be implemented in various different forms other than the embodiments described above.

例えば、実施例１では、ＣＭＰに本発明を適用した場合におけるＣＰＵ１００の構成について説明したが、本発明はこれに限定されるものではなく、ＳＭＴに本発明を適用してもよい。例えば、図８に示すように、ＤｅｃｏｄｅとＱｕｅｕｅとの間に先行サブスレッド識別回路を設けることによって、本発明を実施してもよい。 For example, in the first embodiment, the configuration of the CPU 100 when the present invention is applied to CMP has been described. However, the present invention is not limited to this, and the present invention may be applied to SMT. For example, as shown in FIG. 8, the present invention may be implemented by providing a preceding subthread identification circuit between Decode and Queue.

さらに詳細に説明すると、実施例１では、図２に示したＣＰＵ（ＣＭＰ）の構成において、上段部が、本スレッドの処理を行い、下段部が、先行サブスレッドの処理を司っている。ここで、実施例１においては、本スレッドは、上段部において従来技術通りに実行される。一方、実施例１においては、下段部は、上段部と同様に命令フェッチ、デコードを行うが、デコード結果は先行サブスレッド識別回路に送出され、該当回路が、命令デコード結果に基づき先行実行に必要な命令のみを先行サブスレッドとして選択し、Ｑｕｅｕｅに送出する場合について説明した。 More specifically, in the first embodiment, in the configuration of the CPU (CMP) shown in FIG. 2, the upper part processes this thread and the lower part handles the processing of the preceding subthread. In the first embodiment, this thread is executed in the upper part as in the prior art. The lower part, on the other hand, fetches and decodes instructions in the same way as the upper part, but the decoding result is sent to the preceding subthread identification circuit, which, based on the instruction decoding result, selects only the instructions necessary for preceding execution as the preceding subthread and sends them to the Queue. This is the case that has been described.

本発明の実施は、このようなＣＭＰに本発明を適用した場合に限られるものではなく、図８に示すようなＳＭＴに本発明を適用してもよい。すなわち、本スレッドは、従来技術通りに実行される。その上で、先行サブスレッドは、本スレッドと同様に命令フェッチ、デコードされるが、デコード結果は先行サブスレッド識別回路に送出され、該当回路は命令デコード結果に基づき先行実行に必要な命令のみを選択し、Ｑｕｅｕｅに送出する。なお、図８において、ＰＣ２６ａとＲＦ２５ａとは、本スレッドを実行する構成部である。また、ＰＣ２６ｂとＲＦ２５ｂとは、マルチスレッド実行機能をするために付加されている構成であり、先行サブスレッドを選択して実行する処理に用いられる。 Implementation of the present invention is not limited to application to such a CMP; the present invention may also be applied to an SMT as shown in FIG. 8. That is, this thread is executed as in the prior art. The preceding subthread is fetched and decoded in the same way as this thread, but the decoding result is sent to the preceding subthread identification circuit, which selects only the instructions necessary for preceding execution based on the instruction decoding result and sends them to the Queue. In FIG. 8, the PC 26a and the RF 25a are components that execute this thread. The PC 26b and the RF 25b are components added to provide the multithread execution function, and are used in the processing of selecting and executing the preceding subthread.

また、実施例では、先行サブスレッドを選択する際に、命令種別を用いる場合について説明したが、本発明はこれに限定されるものではなく、例えば命令実行結果を格納するレジスタ番号を用いるようにしてもよい。 In the embodiment, the case where the instruction type is used to select the preceding subthread has been described. However, the present invention is not limited to this; for example, the number of the register that stores the instruction execution result may be used instead.

また、実施例では、命令種別として、ストア命令と浮動小数点演算命令とを除いて先行サブスレッドを選択する場合を説明したが、本発明はこれに限定されるものではなく、例えばさらに整数除算命令を除いて先行サブスレッドを選択するようにしてもよい。 In the embodiment, the case where the preceding subthread is selected by excluding store instructions and floating-point arithmetic instructions as instruction types has been described. However, the present invention is not limited to this; for example, the preceding subthread may be selected by additionally excluding integer division instructions.
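The two variations described above (selection by destination register number, and an extended set of excluded instruction types) can be sketched as a configurable predicate. This is only an illustration: the `Decoded` record, the `make_selector` helper, the type labels, and in particular the rule that registers 32 to 63 hold floating-point values are assumptions, since the text does not specify how a register number would be used.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Decoded:
    kind: str                       # hypothetical type label, e.g. "load", "store", "float", "int_div"
    dest_reg: Optional[int] = None  # number of the register that receives the result, if any


def make_selector(excluded_kinds: FrozenSet[str],
                  excluded_dest_regs: FrozenSet[int] = frozenset()) -> Callable[[Decoded], bool]:
    """Build a predicate that returns True for instructions kept in the preceding subthread."""
    def select(instr: Decoded) -> bool:
        if instr.kind in excluded_kinds:
            return False
        if instr.dest_reg is not None and instr.dest_reg in excluded_dest_regs:
            return False
        return True
    return select


# Policy of the first embodiment, and the variant that also drops integer division.
base = make_selector(frozenset({"store", "float"}))
no_div = make_selector(frozenset({"store", "float", "int_div"}))

# Assumed register-number policy: registers 32 to 63 are treated as floating-point registers.
by_reg = make_selector(frozenset({"store"}), frozenset(range(32, 64)))

div = Decoded("int_div", dest_reg=5)
fp_load = Decoded("load", dest_reg=40)
print(base(div), no_div(div), by_reg(fp_load))
```

Keeping the exclusion rule as data rather than fixed logic mirrors the statement that the selection criterion is not limited to the one used in the embodiment.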

上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報（例えば、図１〜図６、図８）については、特記する場合を除いて任意に変更することができる。 The processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and in the drawings (for example, FIGS. 1 to 6 and FIG. 8) may be changed arbitrarily unless otherwise specified.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的または物理的に分散・統合して構成することができる例えば、図２におけるＥＸ２８の機能において、「キューに記憶されている命令の中から実行できる命令を選択して、処理能力を割り当てる」機能と「本来の実行順番に対応する順番で、実行結果をＲＯＢ２４に記憶させる」機能とを分離してもよい。また、例えば、図２におけるＲＯＢ２４とＲＦ２５を統合してもよい。 Each component of each illustrated device is functionally conceptual and need not be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to that illustrated, and all or part of it may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, in the function of the EX 28 in FIG. 2, the function of "selecting an executable instruction from among the instructions stored in the queue and allocating processing capacity to it" may be separated from the function of "storing execution results in the ROB 24 in an order corresponding to the original execution order". Also, for example, the ROB 24 and the RF 25 in FIG. 2 may be integrated.

（付記１）実行対象となる複数の命令で構成される本スレッドを実行し、並びに、所定のスレッドに必要な情報をプリフェッチする一つまたは複数の命令を先行サブスレッドとして当該本スレッドから選択して当該所定のスレッドに先行して実行する中央処理装置であって、
前記本スレッドの実行結果を解析して得られる命令各々間での関係ではなく、前記本スレッドを構成する命令各々が前記中央処理装置に指示する内容を判別して前記先行サブスレッドを選択する選択手段を備えることを特徴とする中央処理装置。 (Supplementary note 1) A central processing unit that executes this thread, which is composed of a plurality of instructions to be executed, and that selects from this thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread,
the central processing unit comprising selection means for selecting the preceding subthread by determining what each instruction constituting this thread directs the central processing unit to do, rather than the relationship between instructions obtained by analyzing the execution result of this thread.

（付記２）前記選択手段は、前記内容として、前記スレッドを構成する各々の命令種別を判別して前記先行サブスレッドを選択することを特徴とする付記１に記載の中央処理装置。 (Supplementary note 2) The central processing unit according to supplementary note 1, wherein the selection means discriminates each instruction type constituting the thread and selects the preceding sub-thread as the content.

（付記３）前記選択手段は、前記命令種別として、レジスタに格納されるデータをメモリに格納するストア命令または浮動小数点演算命令に該当するかを判別し、該当しない命令を選択することで前記先行サブスレッドを選択することを特徴とする付記２に記載の中央処理装置。 (Supplementary note 3) The central processing unit according to supplementary note 2, wherein the selection means determines whether each instruction is, by instruction type, a store instruction that stores data held in a register into memory or a floating-point arithmetic instruction, and selects the preceding subthread by selecting the instructions that are neither.

（付記４）実行対象となる複数の命令で構成される本スレッドを実行し、並びに、前記所定のスレッドに必要な情報をプリフェッチする一つまたは複数の命令を先行サブスレッドとして当該本スレッドから選択して当該所定のスレッドに先行して実行する中央処理装置に設けられた選択回路であって、
前記本スレッドの実行結果を解析して得られる命令各々間での関係ではなく、前記本スレッドを構成する命令各々が前記中央処理装置に指示する内容を判別して前記先行サブスレッドを選択する選択ステップを備えることを特徴とする選択回路。 (Supplementary note 4) A selection circuit provided in a central processing unit that executes this thread, which is composed of a plurality of instructions to be executed, and that selects from this thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread,
the selection circuit comprising a selection step of selecting the preceding subthread by determining what each instruction constituting this thread directs the central processing unit to do, rather than the relationship between instructions obtained by analyzing the execution result of this thread.

（付記５）実行対象となる複数の命令で構成される本スレッドを実行し、並びに、前記所定のスレッドに必要な情報をプリフェッチする一つまたは複数の命令を先行サブスレッドとして当該本スレッドから選択して当該所定のスレッドに先行して実行する中央処理装置における選択方法であって、
前記本スレッドの実行結果を解析して得られる命令各々間での関係ではなく、前記本スレッドを構成する命令各々が前記中央処理装置に指示する内容を判別して前記先行サブスレッドを選択する選択ステップを含んだことを特徴とする選択方法。 (Supplementary note 5) A selection method in a central processing unit that executes this thread, which is composed of a plurality of instructions to be executed, and that selects from this thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread,
the selection method including a selection step of selecting the preceding subthread by determining what each instruction constituting this thread directs the central processing unit to do, rather than the relationship between instructions obtained by analyzing the execution result of this thread.

以上のように、本発明は、中央処理装置、選択回路および選択方法に有用であり、特に、実行対象となる複数の命令で構成される本スレッドを実行し、並びに、所定のスレッドに必要な情報をプリフェッチする一つまたは複数の命令を先行サブスレッドとして本スレッドから選択して所定のスレッドに先行して実行する中央処理装置であって、本スレッドの実行結果を解析して得られるスレッド各々間での関係ではなく、本スレッドを構成する命令各々が中央処理装置に指示する内容を判別して先行サブスレッドを選択する中央処理装置、選択回路および選択方法の実現に適する。 As described above, the present invention is useful for a central processing unit, a selection circuit, and a selection method. It is particularly suited to realizing a central processing unit, a selection circuit, and a selection method in which the central processing unit executes this thread, which is composed of a plurality of instructions to be executed, selects from this thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread, and executes the preceding subthread ahead of the predetermined thread, and in which the preceding subthread is selected by determining what each instruction constituting this thread directs the central processing unit to do, rather than the relationship obtained by analyzing the execution result of this thread.

先行サブスレッド識別回路の概要と特徴を説明するための図である。 A diagram for explaining the outline and features of the preceding subthread identification circuit.
中央処理装置（ＣＭＰ）の構成を示すブロック図である。 A block diagram showing the configuration of the central processing unit (CMP).
ソースおよび本スレッド列および先行サブスレッド列の一例を示す図である。 A diagram showing an example of a source, a sequence of this thread, and a preceding subthread sequence.
先行サブスレッド識別回路の回路図の一例を示す図である。 A diagram showing an example of a circuit diagram of the preceding subthread identification circuit.
先行サブスレッド識別回路による効果を説明するための図である。 A diagram for explaining the effect of the preceding subthread identification circuit.
先行サブスレッド識別回路の処理の一例を示すフローチャートである。 A flowchart showing an example of the processing of the preceding subthread identification circuit.
命令種別の分布について説明するための図である。 A diagram for explaining the distribution of instruction types.
中央処理装置（ＳＭＴ）の構成を示すブロック図である。 A block diagram showing the configuration of the central processing unit (SMT).
ＣＭＰおよびＳＭＴについて説明するための図である。 A diagram for explaining CMP and SMT.

Explanation of symbols

１０Ｄｅｃｏｄｅ
１１先行サブスレッド識別回路
１２Ｑｕｅｕｅ
２０Ｌ２キャッシュ
２１Ｉキャッシュ
２２Ｑｕｅｕｅ
２３Ｄキャッシュ
２４ＲＯＢ
２５ＲＦ
２６ＰＣ
２７Ｄｅｃｏｄｅ
２８ＥＸ
２９先行サブスレッド識別回路

10 Decode
11 Preceding subthread identification circuit
12 Queue
20 L2 cache
21 I-cache
22 Queue
23 D-cache
24 ROB
25 RF
26 PC
27 Decode
28 EX
29 Preceding subthread identification circuit

Claims

A central processing unit that executes a main thread composed of a plurality of instructions to be executed, and that selects from the main thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread, the central processing unit comprising:
a decoder that interprets instructions;
a queue that stores instructions to be executed; and
a selection circuit that receives a decoding result from the decoder, selects a preceding subthread from the received decoding result based on an instruction type, and stores the selected preceding subthread in the queue.

The central processing unit according to claim 1, wherein the selection circuit determines the instruction type of each instruction constituting the main thread and selects the preceding subthread.

The central processing unit according to claim 2, wherein the selection circuit determines whether each instruction is, by instruction type, a store instruction that stores data held in a register into memory or a floating-point arithmetic instruction, and selects the preceding subthread by selecting the instructions that are neither.

A selection circuit provided in a central processing unit that executes a main thread composed of a plurality of instructions to be executed, and that selects from the main thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread,
wherein the selection circuit receives a decoding result from a decoder that interprets instructions, selects a preceding subthread from the received decoding result based on an instruction type, and stores the selected preceding subthread in a queue that stores instructions to be executed.

A selection method in a central processing unit that executes a main thread composed of a plurality of instructions to be executed, and that selects from the main thread, as a preceding subthread, one or more instructions for prefetching information necessary for a predetermined thread and executes the preceding subthread ahead of the predetermined thread, the selection method comprising:
a receiving step of receiving a decoding result from a decoder that interprets instructions;
a selection step of selecting a preceding subthread from the received decoding result based on an instruction type; and
a storing step of storing the selected preceding subthread in a queue that stores instructions to be executed.