CN113157631B - Processor circuit and data processing method - Google Patents
- Publication number: CN113157631B (application CN202010073303.1A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- load
- data
- address
- processor circuit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/78—Architectures of general purpose stored program computers comprising a single central processing unit
Abstract
A processor circuit includes an instruction decode unit, an instruction detector, an address generator, and a data buffer. The instruction decode unit is used for decoding the load instruction to generate a decoding result. The instruction detector is coupled to the instruction decoding unit for detecting whether the load instruction is in a load use situation. The address generator is coupled to the instruction decoding unit for generating a first address required by the load instruction according to the decoding result. The data buffer is coupled to the instruction detector and the address generator, and is configured to store the first address generated by the address generator and store data required by the load instruction according to the first address when the instruction detector detects that the load instruction is in the load use situation.
Description
Technical Field
The present disclosure relates to data processing technology, and more particularly, to a processor circuit and a data processing method capable of reducing load-use stalls caused by load instructions.
Background
In order to reduce the time needed to access data or instructions in lower-speed memories, caching mechanisms are widely employed. With proper design, a cache mechanism can supply the required data or instruction within a few clock cycles, thereby greatly improving system performance. However, when a central processing unit (CPU) processes a load instruction and an add instruction sequentially, and the data required by the add instruction is the data to be read by the load instruction, it still takes a while to read that data from the local memory (or cache memory, such as a static random access memory (SRAM)), so the CPU must wait several clock cycles before it can execute the add instruction. That is, an existing CPU using a cache mechanism still suffers from load-use stalls.
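The stall described above can be illustrated with a toy timing model. The sketch below is not part of the patent; the function name and cycle numbers are assumptions chosen only to show why a dependent instruction must wait.

```python
# Hypothetical timing sketch of a load-use stall in a 5-stage pipeline
# (IF, ID, EX, MEM, WB). All cycle numbers are illustrative assumptions.
def load_use_stall_cycles(load_ex_cycle: int, mem_latency: int) -> int:
    """Return how many cycles a dependent add must stall.

    The load's data returns mem_latency cycles after its EX stage,
    while a back-to-back add wants that data at EX one cycle later.
    """
    data_ready_cycle = load_ex_cycle + mem_latency
    add_ex_cycle = load_ex_cycle + 1  # add issued immediately after the load
    return max(0, data_ready_cycle - add_ex_cycle)

# With a 2-cycle cache access, the dependent add waits 1 extra cycle.
print(load_use_stall_cycles(load_ex_cycle=3, mem_latency=2))  # prints 1
```

With a longer memory latency the gap grows accordingly, which is the waiting the present disclosure aims to hide.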
Disclosure of Invention
Accordingly, embodiments of the present disclosure provide a processor circuit and a data processing method that reduce pipeline stalls associated with load instructions, whether the load instructions hit or miss the cache.
Certain embodiments of the present disclosure include a processor circuit. The processor circuit includes an instruction decode unit, an instruction detector, an address generator, and a data buffer. The instruction decode unit is used for decoding the load instruction to generate a decoding result. The instruction detector is coupled to the instruction decoding unit for detecting whether the load instruction is in a load use situation. The address generator is coupled to the instruction decoding unit for generating a first address required by the load instruction according to the decoding result. The data buffer is coupled to the instruction detector and the address generator, and is configured to store the first address generated by the address generator and store data required by the load instruction according to the first address when the instruction detector detects that the load instruction is in the load use situation.
Certain embodiments of the present disclosure include a data processing method. The data processing method comprises the following steps: receiving a loading instruction and detecting whether the loading instruction is in a loading use situation; decoding the load instruction to produce a decoded result; generating a first address required by the load instruction according to the decoding result; storing the first address in a data buffer when the load instruction is detected to be in the load use situation; and storing the data required by the load instruction in the data buffer according to the first address.
Drawings
The various forms of the disclosure will be clearly understood from the following embodiments read with reference to the accompanying drawings. It should be noted that the various features of the drawings are not necessarily drawn to scale in accordance with standard practices in the art. In fact, the dimensions of some of the features may be arbitrarily expanded or reduced for clarity of discussion.
FIG. 1 is a functional block diagram of a processor circuit according to some embodiments of the present disclosure.
FIG. 2 is a schematic diagram of an embodiment of the processor circuit shown in FIG. 1.
FIG. 3 is a schematic diagram of an embodiment of an instruction detection operation related to the instruction detector shown in FIG. 2.
FIG. 4 is a schematic diagram of an embodiment of the data buffer shown in FIG. 2.
- FIG. 5 is a flow chart of an embodiment of a data processing method, performed by the processor circuit shown in FIG. 2, for processing a memory access instruction.
- FIG. 6 is a flow chart of another embodiment of a data processing method, performed by the processor circuit shown in FIG. 2, for processing a memory access instruction.
FIG. 7 is a schematic diagram of information stored in each of the memory spaces of FIG. 4 over a succession of clock cycles.
Fig. 8 is a flow chart of an embodiment of a data processing method according to the present disclosure.
Detailed Description
The following disclosure provides various embodiments or examples that can be used to implement various features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. It should be understood that these descriptions are merely exemplary and are not intended to limit the present disclosure. For example, if an element is referred to as being "connected to" or "coupled to" another element, it can be directly connected or coupled, or intervening elements may be present therebetween.
Further, the present disclosure may repeat reference numerals and/or letters in the various examples. Such reuse is for brevity and clarity purposes and does not itself represent a relationship between the various embodiments and/or configurations discussed. Furthermore, it should be appreciated that the embodiments of the present disclosure provide many applicable concepts that can be embodied in a wide variety of specific contexts. The examples discussed below are provided for illustrative purposes only and are not intended to limit the scope of the present disclosure.
By preparing the data required by a pending instruction early, the data processing scheme of the present disclosure may reduce/avoid load-use stalls caused by executing the pending instruction. For example, in the case of sequentially processing a load instruction and an add instruction, the data processing scheme of the present disclosure may prepare the data required by the add instruction (including the data required by the load instruction) early, so that the add instruction can be executed successfully without waiting for the execution result of the load instruction to return. Further description is provided below.
Fig. 1 is a functional block diagram of a processor circuit 100 according to some embodiments of the present disclosure. The processor circuit 100 may be used to reduce/avoid load-use stalls caused by executing one or more instructions in the instruction stream INS. The processor circuit 100 may include, but is not limited to, an instruction decode unit 122, an instruction detector 124, an address generator 136, and a data buffer 138. The instruction decoding unit 122 is configured to decode a plurality of instructions that are consecutive to each other in the instruction stream INS, and sequentially output decoding results of the plurality of instructions. For example, the instruction decoding unit 122 may decode the load instruction LWI in the instruction stream INS to generate the decoding result DR. For another example, the instruction decoding unit 122 may decode other instructions (such as store instructions or operation instructions) in the instruction stream INS to generate corresponding decoding results.
The instruction detector 124 is coupled to the instruction decoding unit 122 for detecting whether the instruction stream INS includes one or more load instructions in the load-use context, wherein a load instruction may cause a load-use stall when it is in the load-use context. For example, the instruction detector 124 may receive the instruction stream INS temporarily stored in the instruction decoding unit 122, and further detect the instruction stream INS. For another example, the instruction detector 124 may directly receive the instruction stream INS for detection without passing through the instruction decoding unit 122.
In this embodiment, a load instruction in the load-use context may be a load-use instruction, which may cause a load-use stall when a subsequent instruction is executed. For example, when the instruction detector 124 detects that the load instruction LWI is a load-use instruction, this means that if the processor circuit 100 uses the execution result of the load instruction LWI to execute an instruction following the load instruction LWI in the instruction stream INS, a load-use stall will be caused. In this embodiment, the instruction detector 124 may determine whether a load-use data hazard may occur when an instruction is executed using the execution result of the load instruction LWI. When it is determined that executing the instruction using the execution result of the load instruction LWI involves a load-use data hazard, the instruction detector 124 may detect the load instruction LWI as a load-use instruction.
In addition, instruction detector 124 may output an indication signal lu_instr, which may indicate whether the instruction currently processed by instruction decode unit 122 is a load instruction in the load use context. For example, in the case where the instruction currently processed by instruction decode unit 122 is a load instruction LWI, the indication signal lu_instr may indicate whether the load instruction LWI is a load-use instruction.
The address generator 136 is coupled to the instruction decoding unit 122, and is configured to generate an address related to each instruction according to a decoding result of the instruction. For example, the address generator 136 may generate the address addr as the address required by the load instruction LWI according to the decoding result DR of the load instruction LWI.
The data buffer 138 is coupled to the instruction detector 124 and the address generator 136, and is configured to, when the instruction detector 124 detects that a load instruction may cause a load-use stall, store the address required by the load instruction generated by the address generator 136, and store the data required by the load instruction according to that address. In this embodiment, when the instruction detector 124 detects that the load instruction LWI is a load-use instruction, the data buffer 138 may store the address addr generated by the address generator 136 and store the data lub_d, i.e. the data required by the load instruction LWI, according to the address addr.
For example, in the case where the data required by the load instruction LWI is not already stored in the data buffer 138, the data buffer 138 may send a read request RR to the memory 180 to read the data MD pointed to by the address addr in the memory 180 as the data required by the load instruction LWI. In some embodiments, memory 180 may be a local memory or cache memory of processor circuit 100. In some embodiments, memory 180 may be an external memory or a secondary memory external to processor circuit 100.
It is noted that, in the case where the processor circuit 100 needs to execute a pending instruction using the data required by the load instruction LWI, since the data buffer 138 can store the data required by the load instruction LWI (such as the data lub_d), the instruction decoding unit 122 can obtain the data lub_d from the data buffer 138 for the processor circuit 100 to execute the pending instruction without waiting for the memory 180 (such as a cache memory or an external memory) to return the execution result of the load instruction LWI, so that load-use stalls can be reduced/avoided.
For ease of illustration, the data processing scheme of the present disclosure is described below using processor circuits having a pipeline architecture. However, the disclosure is not limited thereto. Applications of the data processing scheme of the present disclosure to other circuit architectures that require the execution result of a previous instruction in order to execute a subsequent instruction all fall within the scope of the present disclosure.
FIG. 2 is a schematic diagram of an embodiment of the processor circuit 100 shown in FIG. 1. For ease of understanding, the processor circuit 200 may be implemented as a pipelined processor having a pipeline architecture, wherein the pipeline architecture may include five pipeline stages: an instruction fetch stage IF, an instruction decode stage ID, an execution stage EX, a memory access stage MEM, and a write back stage WB. However, this is not intended to limit the disclosure. In some embodiments, these five pipeline stages may be implemented by an instruction fetch stage, an instruction decode stage, an operand fetch stage, an execution stage, and a write back stage. In some embodiments, the processor circuit 200 may employ a pipeline architecture with more or fewer than five pipeline stages. All such design-related variations follow the spirit of the present disclosure.
In this embodiment, the processor circuit 200 may include the instruction decode unit 122, the instruction detector 124, the address generator 136, and the data buffer 138 shown in FIG. 1. The instruction decode unit 122 and the instruction detector 124 may be located at the same pipeline stage (such as instruction decode stage ID) while the address generator 136 and the data buffer 138 may be located at the same pipeline stage (such as execution stage EX) to reduce/avoid problems of pipeline stalls. The related description will be described later.
In addition, the processor circuit 200 may also include, but is not limited to, a plurality of pipeline registers 201-204, an instruction fetch unit 210, an execution unit 232, a memory 240, a register file (RF) 252, and a bus interface unit (BIU) 254. Since the pipeline register 201 is located between the instruction fetch stage IF and the instruction decode stage ID, it may be referred to as an instruction fetch/instruction decode register (IF/ID register). Similarly, the pipeline registers 202, 203, and 204 may be referred to as an instruction decode/execution register (ID/EX register), an execution/memory access register (EX/MEM register), and a memory access/write back register (MEM/WB register), respectively.
The instruction fetch unit 210 is located in the instruction fetch stage IF, and is configured to store the instruction stream INS, and store corresponding instructions in the instruction stream INS to the pipeline register 201 according to an address provided by a program counter (not shown in fig. 2).
The execution unit 232 is located at the execution stage EX, and is configured to execute an instruction according to the decoding result of the instruction provided by the pipeline register 202, and store the execution result of the instruction in the pipeline register 203, where the decoding result of the instruction may include the address and data required for executing the instruction. In this embodiment, the execution unit 232 may include, but is not limited to, an arithmetic logic unit (ALU) 233 and a multiply-accumulate unit (MAC) 234.
The memory 240 is located in the memory access stage MEM and may be implemented as the memory 180 shown in fig. 1. For example, the memory 240 may be implemented as a cache memory of the processor circuit 200. In this embodiment, the memory 240 is used to perform memory access operations according to instruction execution results provided by the pipeline registers 203. For example, in a write operation, the memory 240 may store data at a location pointed to by the address addr according to the instruction execution result. For another example, in the read operation, the memory 240 may output the data MD1 pointed to by the address addr according to the instruction execution result.
Both register file 252 and bus interface unit 254 may be in write-back stage WB. Register file 252 is used to store data from memory 240 that is buffered by pipeline registers 204. Bus interface unit 254 may serve as a data transfer interface between processor circuit 200 and external memory 260. In some embodiments, the register file 252 may be used to store data to be written to the external memory 260 or to store data MD2 read from the external memory 260.
Please refer to fig. 3 in conjunction with fig. 2. FIG. 3 is a schematic diagram of an embodiment of the instruction detection operation involved in the instruction detector 124 shown in FIG. 2. In this embodiment, the instruction detector 124 receives the instructions I0-I6 sequentially transmitted to the instruction decoding unit 122 in the instruction stream INS, and stores the instructions I0-I6 in a plurality of storage units of the instruction detector 124, respectively. For convenience of illustration, the instructions I0-I6 may include load instructions (load), add instructions (add), a subtract instruction (sub), and a shift left logical instruction (sll). The instruction detector 124 may perform a decoding operation on the instructions I0-I6 to detect whether a load instruction among them (such as instruction I0, instruction I2, or instruction I5) is a load-use instruction.
For example, before the data required by instruction I0 (the data pointed to by address [r8] in the memory 240) is loaded from the memory 240 into register r0 (located in the instruction decoding unit 122; not shown in FIG. 2), instruction I1 (the add instruction add) that follows instruction I0 enters the execution stage EX and needs to use the data of register r0. Thus, the instruction detector 124 may detect that instruction I0 is a load-use instruction in the load-use context. In addition, when instruction I0 enters the execution stage EX, the instruction detector 124 may output the indication signal lu_instr having a first signal level (such as a high logic level) indicating that instruction I0 is a load-use instruction.
Similarly, for instruction I5, before the data required by instruction I5 (the data pointed to by address r9 in the memory 240) is loaded from the memory 240 into register r2 (located in the instruction decoding unit 122; not shown in FIG. 2), instruction I6 (the shift left logical instruction sll) executed after instruction I5 enters the execution stage EX and needs to use the data of register r2. Thus, the instruction detector 124 may detect that instruction I5 is a load-use instruction.
For instruction I2, although the data required by instruction I4 executed after instruction I2 (the data stored in register r1 of the instruction decoding unit 122) comes from the data required by instruction I2 (the data pointed to by address [r9] in the memory 240), the data required by instruction I2 has been loaded from the memory 240 into register r1 before instruction I4 enters the execution stage EX. Therefore, the instruction detector 124 may detect that instruction I2 is not a load-use instruction and output the indication signal lu_instr having a second signal level (such as a low logic level).
It should be noted that the instruction types, sequences and numbers shown in fig. 3 are for convenience of description only and are not meant to be limiting of the present disclosure. Since one of ordinary skill in the art will appreciate the operation of the plurality of instructions I0-I6 shown in FIG. 3, further description of each instruction is omitted herein.
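The decision illustrated by instructions I0-I6 can be sketched as a simple register-tracking check. This is an illustrative model only; the function name, the `use_distance` parameter, and the tuple representation of instructions are assumptions, not the patent's implementation.

```python
# Illustrative sketch of the detector's decision. Each following instruction
# is modeled as (dest_register, source_registers).
def is_load_use(load_dest: str, later_instrs, use_distance: int = 1) -> bool:
    """A load is a load-use instruction if a consumer of its destination
    register appears within use_distance following instructions, before
    the register is overwritten."""
    for gap, (dest, srcs) in enumerate(later_instrs, start=1):
        if load_dest in srcs:
            return gap <= use_distance  # consumer too close: stall risk
        if dest == load_dest:
            return False                # value replaced before any use
    return False

# Like I0: a load of r0 followed immediately by an add reading r0.
print(is_load_use("r0", [("r3", {"r0", "r4"})]))                # True
# Like I2: the consumer of r1 arrives two instructions later.
print(is_load_use("r1", [("r5", {"r6"}), ("r7", {"r1"})]))      # False
```

The `use_distance` of 1 models the single-cycle gap between the instruction decode stage ID and the execution stage EX in the five-stage example; a deeper memory latency would widen it.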
Based on the instruction detection operation described above, the instruction detector 124 can detect whether the load instruction LWI is a load-use instruction in the instruction decode stage ID. For example, in the case where the load instruction LWI is implemented by instruction I0 or instruction I5, when the load instruction LWI enters the instruction decode stage ID, the instruction detector 124 may detect that the load instruction LWI is a load-use instruction and output the indication signal lu_instr having the first signal level. In addition, in the case where the load instruction LWI is implemented by instruction I2, when the load instruction LWI enters the instruction decode stage ID, the instruction detector 124 may detect that the load instruction LWI is not a load-use instruction and output the indication signal lu_instr having the second signal level.
After the instruction detector 124 detects that the load instruction LWI is a load-use instruction, the data required by the load instruction LWI may be provided by the data buffer 138 to the instruction decode unit 122 at the next pipeline stage (execution stage EX). Thus, in the event that the next instruction of the load instruction LWI requires immediate use of the data required by the load instruction LWI, the data required by the next instruction may be ready when the next instruction is in the instruction decode stage ID.
Please refer to fig. 4 in conjunction with fig. 2. FIG. 4 is a schematic diagram of an embodiment of the data buffer 138 shown in FIG. 2. The data buffer 138 may include, but is not limited to, a memory space 410 and a control circuit 420. The memory space 410 may employ flip-flops (not shown in FIG. 4) as memory cells to enable data access within a single clock cycle. In this embodiment, the memory space 410 may include N entries E(0)-E(N-1) corresponding to N index values idx(0)-idx(N-1), respectively, where N is a positive integer greater than 1. Each entry may include, but is not limited to, a valid bit field V, a lock bit field L, a tag field TG, and a data field DA. The valid bit field V may indicate whether information is stored in the entry; the contents of the N entries E(0)-E(N-1) in the valid bit field V may be represented as valid bits V(0)-V(N-1), respectively. The lock bit field L may indicate whether the entry is locked to protect the information stored in the entry from modification; the contents of the N entries E(0)-E(N-1) in the lock bit field L may be represented as lock bits L(0)-L(N-1), respectively. The tag field TG may be used to identify the data stored in the data field DA of the entry; for example, the tag field TG may indicate the address in the memory 240 (or the external memory 260) of the data stored in the entry. The contents of the N entries E(0)-E(N-1) in the tag field TG may be represented as tags TG(0)-TG(N-1), respectively, and the contents of the N entries E(0)-E(N-1) in the data field DA may be represented as data DA(0)-DA(N-1), respectively.
The control circuit 420 may include, but is not limited to, a comparison circuit 422, a buffer 423, a selection circuit 424, a logic circuit 426, and a controller 428. The comparison circuit 422 is used for comparing the address addr with the tags TG(0)-TG(N-1), respectively, to generate a hit signal lub_h. For example, when the address addr matches one of the tags TG(0)-TG(N-1), the hit signal lub_h may have a specific signal level (such as a high logic level); when the address addr does not match any one of the tags TG(0)-TG(N-1), the hit signal lub_h may have another specific signal level (such as a low logic level). In this embodiment, when the hit signal lub_h indicates that the address addr matches the tag TG(i) (i is a natural number smaller than N), the comparison circuit 422 may store the hit signal lub_h together with the valid bit V(i) and the lock bit L(i) corresponding to the tag TG(i) in the buffer 423.
The selection circuit 424 may output one of the data DA(0)-DA(N-1) according to the hit signal lub_h. For example, when the hit signal lub_h indicates that the address addr matches the tag TG(i) (i is a natural number smaller than N), the selection circuit 424 may output the data DA(i) corresponding to the tag TG(i) as the data lub_d.
The logic circuit 426 may output a valid signal lub_dv to indicate whether the data lub_d is valid/available. For example, in the case where the hit signal lub_h indicates that the address addr matches the tag TG(i) (i is a natural number smaller than N), the valid bit V(i) indicates that information is stored in the entry E(i), and the lock bit L(i) indicates that the entry E(i) is not locked, the logic circuit 426 may output the valid signal lub_dv having a specific signal level (such as a high logic level) indicating that the data lub_d is valid/available. When the valid signal lub_dv indicates that the data lub_d is valid/available, the instruction decoding unit 122 may retrieve the data required by the load instruction LWI (the data lub_d) from the data buffer 138, thereby reducing/avoiding load-use stalls.
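Putting the entry fields and the compare/select/logic path together, the lookup can be sketched behaviorally as follows. The class and function names and the tuple return value are assumptions for illustration; the patent describes hardware circuits, not software.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    valid: bool = False  # valid bit V: entry holds information
    lock: bool = False   # lock bit L: entry is protected while data is pending
    tag: int = 0         # tag TG: memory address identifying the cached data
    data: int = 0        # data DA: the buffered data word

def buffer_lookup(entries, addr):
    """Return (hit, data_valid, data) for an access to address addr."""
    for e in entries:
        if e.tag == addr:                        # compare circuit: tag match -> hit
            data_valid = e.valid and not e.lock  # logic circuit: V(i)=1 and L(i)=0
            return True, data_valid, e.data      # selection circuit forwards DA(i)
    return False, False, None                    # miss: no matching tag

entries = [Entry(valid=True, lock=False, tag=0x100, data=42)]
print(buffer_lookup(entries, 0x100))  # (True, True, 42)
print(buffer_lookup(entries, 0x200))  # (False, False, None)
```

The `data_valid` result plays the role of the valid signal lub_dv: only when it is true may the decode stage consume the forwarded data instead of waiting for memory.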
The controller 428 is configured to selectively access the memory space 410 according to the indication signal lu_instr. For example, when the indication signal lu_instr indicates that the load instruction LWI is a load-use instruction, the controller 428 may access an entry of the memory space 410 according to the address addr and update at least one of the valid bit field V, the lock bit field L, the tag field TG, and the data field DA of the entry. When the indication signal lu_instr indicates that the load instruction LWI is not a load-use instruction, the controller 428 may leave the information stored in the memory space 410 unchanged. That is, the memory space 410 may store only information related to load-use instructions.
Notably, in operation, the control circuit 420 may maintain consistency between the information stored in the memory space 410 and the information stored in the memory 240 (or the external memory 260). For example, when the processor circuit 200 is used to process a memory access instruction MAI, the instruction decoding unit 122 is used to decode the memory access instruction MAI to generate a decoding result DR'. The memory access instruction MAI may include, but is not limited to, a store instruction for writing data to a memory and a load instruction for reading data stored in a memory. In addition, the address generator 136 is configured to generate the address addr, which may be the address required by the memory access instruction MAI, according to the decoding result DR'. The control circuit 420 may check whether the address addr is already stored in the memory space 410. When the address addr is stored in the memory space 410, the control circuit 420 may update the data pointed to by the address addr in the memory space 410 to the data required by the memory access instruction MAI. Therefore, the data directly acquired from the data buffer 138 by the instruction decoding unit 122 matches the data stored in the memory (the memory 240 or the external memory 260).
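The consistency rule above can be sketched in a few lines. The helper name and the dictionary representation of an entry are assumptions for illustration only; they stand in for the control circuit 420 updating a matching entry on a store.

```python
# Illustrative sketch: on a store, if the store address is already tracked
# by the data buffer, update the buffered copy so that data later forwarded
# to the decode stage matches what memory holds.
def update_on_store(entries, addr, write_data):
    """entries: list of dicts with 'tag' and 'data' keys. Returns True on hit."""
    for e in entries:
        if e["tag"] == addr:
            e["data"] = write_data  # keep buffer coherent with memory
            return True
    return False  # miss: the store only goes to memory 240 / external memory 260

buf = [{"tag": 0x40, "data": 7}]
print(update_on_store(buf, 0x40, 9), buf[0]["data"])  # True 9
print(update_on_store(buf, 0x44, 1))                  # False
```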
FIG. 5 is a flow chart illustrating an embodiment of a data processing method, performed by the processor circuit 200 shown in FIG. 2, for processing the memory access instruction MAI. In this embodiment, the data buffer 138 included in the processor circuit 200 of FIG. 2 may employ the architecture of FIG. 4 to perform the associated operations. In addition, the memory access instruction MAI may be implemented by a store instruction.
Referring to fig. 2, 4 and 5 together, in step 502, execution stage EX may begin executing the store instruction. The store instruction is used to store write data in the memory 240, where the write data is stored in registers of the instruction decode unit 122. In step 504, the address generator 136 may generate the address addr, i.e. the address required by the store instruction, according to the decoding result of the store instruction. The address generator 136 may output the address addr to the data buffer 138.
In step 506, the comparison circuit 422 may compare the address addr with the tags TG(0)-TG(N-1) to check whether the address addr is already stored in the memory space 410. If the address addr is stored in the memory space 410 (e.g., the hit signal lub_h has a high logic level), step 508 is performed. If the address addr is not stored in the memory space 410 (e.g., the hit signal lub_h has a low logic level), step 512 is performed.
In step 508, the controller 428 may update the data field DA of the entry in the memory space 410 pointed to by the address addr to the write data. For example, when the address addr matches the tag TG(i), the controller 428 may update the data DA(i) of the entry E(i) to the write data.
In step 510, the controller 428 may update the access order of the N entries E(0)-E(N-1) according to a replacement policy. For example, the controller 428 may employ a least recently used (LRU) replacement policy, in which case the controller 428 may mark the last accessed entry E(i) as the most recently used entry. In some embodiments, the controller 428 may employ a not most recently used (NMRU) replacement policy, a random replacement policy, or other replacement policies. In some embodiments, where the controller 428 employs a random replacement policy, step 510 may be omitted.
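The LRU bookkeeping of step 510 can be sketched with a simple ordered list of entry indices. The list-based model and function name are assumptions; real hardware would typically track this with per-entry age bits.

```python
# Illustrative LRU bookkeeping: keep entry indices ordered from least to most
# recently used, and move an accessed entry to the most-recently-used end.
def touch_lru(order, i):
    """order: list of entry indices, LRU first. Marks entry i most recently used."""
    order.remove(i)
    order.append(i)
    return order

order = [2, 0, 1]    # entry 2 is the least recently used
touch_lru(order, 0)  # entry 0 was just accessed in step 508
print(order)         # [2, 1, 0] -> entry 2 would be replaced next
```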
In step 512, the execution stage EX may send a store request to the memory 240 via the pipeline register 203, wherein the store request includes the write data and the address addr generated by the address generator 136. If the memory 240 includes an address matching the address addr, the memory 240 may store the write data. If there is no address in the memory 240 that matches the address addr, the store request may be sent by the bus interface unit 254 to the external memory 260, so as to store the write data in the storage location in the external memory 260 pointed to by the address addr. In step 514, the store instruction is ended.
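The store-instruction flow of steps 504-514 can be summarized with the following hypothetical Python sketch (the dictionary-based entry layout and function names are illustrative assumptions): on a hit the buffered copy is updated in place, and on a miss the store request is forwarded to memory:

```python
def handle_store(buf, addr, write_data, memory):
    """Sketch of steps 506-512: compare the address with the tags; on a
    hit, update the data field DA; otherwise send the store request on."""
    for entry in buf:                      # step 506: tag comparison
        if entry["valid"] and entry["tag"] == addr:
            entry["data"] = write_data     # step 508: update data field DA
            return "hit"
    memory[addr] = write_data              # step 512: store request to memory
    return "miss"

# Entry E(0) already holds address 0x2000 / data 0xaa, as in FIG. 7.
buf = [{"valid": 1, "tag": 0x2000, "data": 0xAA}]
mem = {}
assert handle_store(buf, 0x2000, 0xDD, mem) == "hit"   # buffered copy updated
assert handle_store(buf, 0x5000, 0x11, mem) == "miss"  # forwarded to memory
```

This mirrors the behavior later described for clock cycle CC7, where the store instruction store1 hits entry E(0) and updates its data field.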
FIG. 6 is a flow chart illustrating another embodiment of a data processing method for processing a memory access instruction MAI, which is related to the processor circuit 200 shown in FIG. 2. In this embodiment, the data buffer 138 included in the processor circuit 200 of FIG. 2 may employ the architecture of FIG. 4 to perform the associated operations. In addition, the memory access instruction MAI may be implemented by a load instruction.
Referring to FIGS. 2, 4 and 6 together, at step 602, the execution stage EX may begin executing the load instruction. The load instruction is used to load read data into a register of the instruction decode unit 122. In step 604, the address generator 136 may generate the address addr, i.e., the address required by the load instruction, according to the decoding result of the load instruction. The address generator 136 may output the address addr to the data buffer 138.
In step 606, the comparison circuit 422 may compare the address addr with the tags TG(0)-TG(N-1) to check whether the address addr is already stored in the memory space 410. If the address addr is found to be stored in the memory space 410, step 608 is performed; otherwise, step 616 is performed.
In step 608, the controller 428 may check whether the entry pointed to by the address addr is locked. For example, in the case where the address addr matches the tag TG(i), the controller 428 may check whether the lock bit field L of the entry E(i) has a specific bit pattern to determine whether the entry E(i) is locked. If it is determined that the entry E(i) is not locked, step 610 is performed; otherwise, step 614 is performed. In this embodiment, when the bit value of the lock bit L(i) of entry E(i) is equal to 0, the controller 428 may determine that entry E(i) is not locked; when the bit value of the lock bit L(i) of entry E(i) is equal to 1, the controller 428 may determine that entry E(i) is locked.
In step 610, the controller 428 may update the access order of the N entries E(0)-E(N-1) according to the replacement policy. For example, the controller 428 may employ a least recently used replacement policy; thus, the controller 428 may set the last accessed entry E(i) as the most recently used entry. In some embodiments, the controller 428 may employ a not-most-recently-used replacement policy, a random replacement policy, or another replacement policy. In some embodiments where the controller 428 employs a random replacement policy, step 610 may be omitted.
In step 612, the selection circuit 424 may output the data DA(i) of the entry E(i) as the data lub_d, so that the data buffer 138 may return the data lub_d to the pipeline core, such as the instruction decode unit 122. In addition, the logic circuit 426 may output a valid signal lub_dv having a particular signal level (such as a high logic level) to indicate that the data lub_d is valid/available.
In step 614, the data buffer 138 may send a read request RR to the memory 240 to read the data pointed to by the address addr in the memory 240, wherein the read request includes the address addr.
In step 616, the controller 428 may determine whether the load instruction is a load-use instruction based on the indication signal lu_instr. If so, step 618 is performed; otherwise, step 614 is performed.
In step 618, the controller 428 may select one of the N entries E (0) -E (N-1) according to the replacement policy to store the address addr in the tag field TG of the entry. In this embodiment, the controller 428 may employ a least recently used replacement policy to store the address addr in the tag field TG of the least recently used entry E (i). In some embodiments, the controller 428 may also employ other replacement policies to store the address addr.
In step 620, the controller 428 may set the content of the valid bit field V of the entry E (i) to indicate that the entry E (i) has stored information. For example, the controller 428 may set the bit value of the valid bit V (i) to "1". In addition, because the data requested by the load instruction has not been stored in the data field DA of entry E (i), the controller 428 may set the lock bit field L of entry E (i) to a particular bit pattern to protect the information stored by entry E (i) from modification by instructions other than the load instruction. In this embodiment, controller 428 may set the bit value of lock bit L (i) to "1".
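As a hedged illustration of steps 618-620 (the dictionary fields are assumptions standing in for the tag field TG, valid bit field V, lock bit field L and data field DA), allocating an entry for a missing load-use address marks the entry valid and locked until the requested data arrives:

```python
def allocate_entry(entry, addr):
    # Steps 618-620: store the address in the tag field, mark the entry
    # as holding information, and lock it so that instructions other than
    # this load instruction cannot modify it before the data arrives.
    entry["tag"] = addr
    entry["valid"] = 1
    entry["lock"] = 1

e = {"tag": None, "valid": 0, "lock": 0, "data": None}
allocate_entry(e, 0x3000)   # as for load1 in clock cycle CC0
assert e == {"tag": 0x3000, "valid": 1, "lock": 1, "data": None}
```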
In step 622, the memory 240 may check whether the data required by the load instruction is stored in the memory 240 according to the read request. If it is checked that the data required by the load instruction is stored in the memory 240, step 624 is performed; otherwise, step 626 is performed. For example, if it is checked that the memory 240 includes an address matching the address addr, it may be determined that the data required by the load instruction is stored in the memory 240.
In step 624, the memory 240 may return the data MD1 pointed to by the address addr in the memory 240 to a pipeline core, such as the instruction decode unit 122. Data MD1 is the data required for the load instruction.
In step 626, the data buffer 138 may send a read request RR to the external memory 260 via the bus interface unit 254 to read the data MD2 pointed to by the address addr in the external memory 260, wherein the data MD2 is the data required by the load instruction.
In step 628, controller 428 may determine whether to update the information stored in memory space 410 based on indication signal lu_instr. If it is determined that the information stored in the storage space 410 needs to be updated, step 630 is performed; otherwise, step 640 is performed. For example, when the indication signal lu_instr has a particular signal level (such as a high logic level), the controller 428 may determine that the information stored in the memory space 410 needs to be updated.
In step 630, the controller 428 may update the data field DA of the entry E (i) to the data MD1 returned by the memory 240. In step 632, since both the address addr and the data required by the load instruction are stored to entry E (i), controller 428 may set the lock bit field L of entry E (i) to another particular bit pattern to allow the information stored by entry E (i) to be modified. In this embodiment, controller 428 may set the bit value of lock bit L (i) to 0.
In step 634, controller 428 may determine whether to update the information stored in memory space 410 based on indication signal lu_instr. If it is determined that the information stored in the storage space 410 needs to be updated, step 636 is executed; otherwise, step 640 is performed. For example, when the indication signal lu_instr has a particular signal level (such as a high logic level), the controller 428 may determine that the information stored in the memory space 410 needs to be updated.
In step 636, the controller 428 may update the data field DA of the entry E (i) to the data MD2 returned by the external memory 260. In step 638, because both the address addr and the data required by the load instruction are stored to entry E (i), controller 428 may set the lock bit field L of entry E (i) to another particular bit pattern to allow the information stored by entry E (i) to be modified. In this embodiment, controller 428 may set the bit value of lock bit L (i) to 0. In step 640, the load instruction is ended.
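Steps 630-632 (and, equivalently, steps 636-638 for data returned by the external memory 260) can be sketched as follows; the dictionary-based entry is an illustrative assumption, not the patented hardware:

```python
def fill_entry(entry, returned_data):
    # Steps 630-632 / 636-638: write the returned data into the data
    # field DA, then clear the lock bit so the entry may be modified again.
    entry["data"] = returned_data
    entry["lock"] = 0

# Entry allocated earlier for address 0x3000, still waiting for its data.
e = {"tag": 0x3000, "valid": 1, "lock": 1, "data": None}
fill_entry(e, 0xBB)         # data MD1 returned by the memory 240
assert e["lock"] == 0 and e["data"] == 0xBB
```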
To facilitate an understanding of the present disclosure, the data processing scheme provided by the present disclosure is described below in terms of an embodiment of operations performed by the data buffer in response to a plurality of consecutive instructions. FIG. 7 shows a schematic diagram of the information stored in the memory space 410 shown in FIG. 4 during a plurality of consecutive clock cycles CC0 to CC8. In this embodiment, the memory space 410 may include 4 entries E(0)-E(3) (i.e., N equals 4) to store the information needed to execute the instructions. In addition, the contents of the tag field TG and the data field DA of each entry are represented in hexadecimal (i.e., with the prefix 0x).
Referring to FIGS. 2, 4 and 7 together, when clock cycle CC0 starts, the entry E(0) corresponding to the index value idx(0) in the memory space 410 already stores the address 0x2000 and the data 0xaa. In addition, the valid bit field V and the lock bit field L of entry E(0) are set to "1" and "0", respectively.
In clock cycle CC0, the load instruction load1 enters the execution stage EX and the multiply instruction mul enters the instruction decode stage ID, where the execution of the multiply instruction mul does not use the data stored in load1's destination register. The address generator 136 generates the address 0x3000 (i.e., the address addr) required by the load instruction load1 according to the decoding result of the load instruction load1. In addition, the load instruction load1 is a load-use instruction. Thus, the data buffer 138 may receive the address 0x3000 and an indication signal lu_instr having a high logic level.
Since the address 0x3000 is not yet stored in the memory space 410, the comparing circuit 422 can generate the hit signal lub_h with a low logic level. The controller 428 may select the entry E (1) according to the replacement policy to store the address addr in the tag field TG of the entry E (1). In addition, the controller 428 may set the valid bit field V and the lock bit field L of entry E (1) to "1". In some embodiments, the operations involved in clock cycle CC0 may be implemented by the multiple steps 602, 604, 606, 616, 618, 620 shown in FIG. 6.
Next, at the beginning of clock cycle CC1 after clock cycle CC0, entry E (1) in memory space 410 has stored address 0x3000, and valid bit field V and lock bit field L of entry E (1) are both set to "1". In clock cycle CC1, load1 enters memory access stage MEM, multiply instruction mux enters execute stage EX, and load2 enters instruction decode stage ID. The controller 428 receives the data MD1 returned by the memory 240, which is the data 0xbb pointed to by the address 0x3000 in the memory 240. Thus, the controller 428 may set the data field DA of entry E (1) to data 0xbb and the lock bit field L of entry E (1) to "0". In some embodiments, the operations involved in clock cycle CC1 may be implemented by multiple steps 614, 622, 624, 628, 630, 632 shown in FIG. 6.
In clock cycle CC2 following clock cycle CC1, the multiply instruction mul enters the memory access stage MEM, the load instruction load2 enters the execution stage EX, and the add instruction add2 enters the instruction decode stage ID. The address generator 136 generates the address 0x3000 (i.e., the address addr) required by the load instruction load2 according to the decoding result of the load instruction load2. The load instruction load2 is a load-use instruction, wherein the execution of the add instruction add2 requires the use of the data stored in the destination register of the load instruction load2.
Since the address 0x3000 required by the load instruction load2 is already stored in entry E(1), the compare circuit 422 generates the hit signal lub_h with a high logic level. The logic circuit 426 may output the valid signal lub_dv with a high logic level. The selection circuit 424 may output the data 0xbb stored in entry E(1) as the data lub_d, such that the data buffer 138 may return the data 0xbb required by the load instruction load2 to the instruction decode unit 122. Thus, the data 0xbb required by the add instruction add2 is ready before the add instruction add2 enters the execution stage EX. In some embodiments, the operations involved in clock cycle CC2 may be implemented by steps 604, 606, 608, 610 and 612 shown in FIG. 6.
In clock cycle CC3 following clock cycle CC2, load instruction load2 enters memory access stage MEM, add instruction add2 enters execution stage EX, and another load instruction load3 enters instruction decode stage ID. Since data 0xbb required by add instruction add2 is ready, processor circuit 200 may successfully execute add instruction add2.
In clock cycle CC4 following clock cycle CC3, the add instruction add2 enters the memory access stage MEM, the load instruction load3 enters the execution stage EX, and another load instruction load4 enters the instruction decode stage ID. The address generator 136 generates the address 0x4000 required by the load instruction load3 according to the decoding result of the load instruction load3. In addition, the load instruction load3 is a load-use instruction. Thus, the data buffer 138 may receive the address 0x4000 and the indication signal lu_instr having a high logic level.
Since the address 0x4000 is not yet stored in the memory space 410, the comparison circuit 422 can generate the hit signal lub_h with a low logic level. The controller 428 may select the entry E (2) according to the replacement policy to store the address addr in the tag field TG of the entry E (2). In addition, the controller 428 may set the valid bit field V and the lock bit field L of entry E (2) to "1". In some embodiments, the operations involved in clock cycle CC4 may be implemented by multiple steps 602, 604, 606, 616, 618, 620 shown in FIG. 6.
In clock cycle CC5 following clock cycle CC4, the load instruction load3 enters the memory access stage MEM, the load instruction load4 enters the execution stage EX, and the logical shift left instruction sll enters the instruction decode stage ID. The address generator 136 generates the address 0x4000 required by the load instruction load4 according to the decoding result of the load instruction load4. In addition, the load instruction load4 is a load-use instruction. Thus, the data buffer 138 may receive the address 0x4000 and the indication signal lu_instr having a high logic level. In addition, the controller 428 may receive the data MD1 returned by the memory 240, which is the data 0xcc required by the load instruction load3.
Since the address 0x4000 required by the load instruction load4 is already stored in entry E(2), the compare circuit 422 generates the hit signal lub_h with a high logic level. It is noted that, before the data field DA of entry E(2) is set to the data 0xcc by the controller 428, the lock bit field L of entry E(2) is still "1", so the valid signal lub_dv still has a low logic level, indicating that the data requested by load4 is not ready. After the data field DA of entry E(2) is set to the data 0xcc, the controller 428 may set the lock bit field L of entry E(2) to "0" and provide the data 0xcc stored in entry E(2) to the instruction decode unit 122. By means of the lock bit field L, the processor circuit 200 ensures that the data loaded by the load instruction load4 during the execution stage EX is the data required by the load instruction load3. In some embodiments, the operations involved in clock cycle CC5 may be implemented by steps 604, 606, 608, 610 and 612 shown in FIG. 6.
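The lock-protected lookup just described (steps 606-614) can be illustrated with the following Python sketch; the entry layout and return values are assumptions chosen to mirror the hit signal lub_h and valid signal lub_dv, not the actual circuit interface:

```python
def lookup(buf, addr):
    # Steps 606-614: on a hit, the data is usable only if the entry is
    # unlocked; a locked hit means an earlier load is still fetching it.
    for entry in buf:
        if entry["valid"] and entry["tag"] == addr:
            if entry["lock"]:
                return ("hit-locked", None)   # lub_h high, lub_dv stays low
            return ("hit", entry["data"])     # lub_h high, lub_dv high
    return ("miss", None)                     # lub_h low

# Entry E(2) was allocated for load3's address 0x4000 but its data has
# not arrived yet, so load4 hits a locked entry and must wait.
buf = [{"valid": 1, "tag": 0x4000, "lock": 1, "data": None}]
assert lookup(buf, 0x4000) == ("hit-locked", None)
buf[0].update(data=0xCC, lock=0)              # load3's data 0xcc arrives
assert lookup(buf, 0x4000) == ("hit", 0xCC)   # load4 now gets the data
```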
In clock cycle CC6 following clock cycle CC5, load instruction load4 enters memory access stage MEM, logical shift left instruction sll enters execution stage EX, and store instruction store1 enters instruction decode stage ID.
In clock cycle CC7 following clock cycle CC6, the logical shift left instruction sll enters the memory access stage MEM, the store instruction store1 enters the execute stage EX, and the add instruction add4 enters the instruction decode stage ID. Store instruction store1 is used to store write data 0xdd in memory 240, where address generator 136 generates address 0x2000 required by store instruction store1 based on the decoded result of store instruction store 1. Since the address 0x2000 required for store1 is already stored in entry E (0), the compare circuit 422 may generate the hit signal lub_h with a high logic level. In addition, the controller 428 may update the data field DA of entry E (0) to write data 0xdd. In some embodiments, the operations involved in clock cycle CC7 may be implemented by multiple steps 502, 504, 506, 508, 510 shown in FIG. 5.
In clock cycle CC8, which follows clock cycle CC7, store instruction store1 enters memory access stage MEM and add instruction add4 enters execution stage EX. Memory 240 may store write data 0xdd in a storage location in memory 240 pointed to by address 0x 2000.
Since the details of the operation in each clock cycle shown in fig. 7 can be understood by those skilled in the art after reading the descriptions in the paragraphs related to fig. 1-6, further description is omitted here for brevity.
The data processing scheme provided by the present disclosure can be summarized in fig. 8. Fig. 8 is a flow chart of an embodiment of a data processing method according to the present disclosure. The data processing method 800 is described below based on the processor circuit 200 shown in fig. 2. However, it should be understood by those of ordinary skill in the art that the data processing method 800 may be used to control the processor circuit 100 shown in FIG. 1 without departing from the scope of the present disclosure. In addition, in some embodiments, other operations may be performed in the data processing method 800. In some embodiments, the data processing method 800 may be performed in a different order or with different operations.
In step 802, a load instruction is received, and whether the load instruction is in a load-use context is detected. For example, the instruction detector 124 may detect whether the load instruction LWI is a load-use instruction. In some embodiments, the instruction detector 124 may determine whether executing an instruction that uses the execution result of the load instruction LWI would incur a load-use data hazard, thereby detecting whether the load instruction LWI is a load-use instruction, wherein the instruction is executed after the load instruction LWI.
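A minimal sketch of the detection in step 802, assuming the destination and source register names are visible at decode time (the register identifiers below are hypothetical examples, not taken from the patent):

```python
def is_load_use(load_dest_reg, next_instr_src_regs):
    # Step 802 (illustrative): a load is in the load-use context when an
    # instruction executed after it reads the load's destination register,
    # which would incur a load-use data hazard.
    return load_dest_reg in next_instr_src_regs

# A load writing r5 followed by an add reading r5 → load-use hazard.
assert is_load_use("r5", {"r5", "r6"}) is True
# A multiply reading only r2 and r3 → no hazard (compare mul after load1).
assert is_load_use("r5", {"r2", "r3"}) is False
```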
In step 804, the load instruction is decoded to produce a decoded result. For example, instruction decode unit 122 may decode load instruction LWI to produce decoding result DR.
In step 806, the address required by the load instruction is generated based on the decoded result. For example, the address generator 136 may generate an address addr required by the load instruction LWI according to the decoding result DR.
In step 808, the address is stored in a data buffer when the load instruction is detected to be in the load-use context. For example, when the instruction detector 124 detects that the load instruction LWI is a load-use instruction, the data buffer 138 may store the address addr.
In step 810, the data required by the load instruction is stored in the data buffer according to the address. For example, the data buffer 138 may store data required by the load instruction LWI according to the address addr.
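Putting steps 802-810 together, the data processing method 800 can be sketched as the following illustrative model (the function names are assumptions, and the memory access is reduced to a callback):

```python
def process_load(is_load_use, addr, buf, fetch_from_memory):
    # Steps 802-810 (sketch): decoding and address generation are folded
    # into `addr`; only a load detected to be in the load-use context
    # stores its address and data in the data buffer, so a later dependent
    # instruction can read the data from the buffer instead of memory.
    if addr in buf:                       # a later load hits the buffer
        return buf[addr]
    data = fetch_from_memory(addr)
    if is_load_use:                       # steps 808-810
        buf[addr] = data
    return data

buf = {}
fetch = lambda a: {0x3000: 0xBB}.get(a)
assert process_load(True, 0x3000, buf, fetch) == 0xBB
assert 0x3000 in buf                      # address and data were buffered
# A second access is served from the buffer even if memory is unavailable.
assert process_load(True, 0x3000, buf, lambda a: None) == 0xBB
```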
Since a person having ordinary skill in the art will understand each operation of the data processing method 800 after reading the above description of the relevant paragraphs corresponding to fig. 1-7, further description is omitted here for brevity.
The foregoing description briefly sets forth features of certain embodiments of the present disclosure so that those skilled in the art to which the disclosure pertains may more fully understand the various forms of the disclosure. Those skilled in the art will appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments described herein. Those of ordinary skill in the art to which the present disclosure pertains will appreciate that such equivalent embodiments fall within the spirit and scope of the disclosure and that various changes, substitutions and alterations can be made hereto without departing from the spirit and scope of the disclosure.
[ Symbolic description ]
100. 200: Processor circuit
122: Instruction decoding unit
124: Instruction detector
136: Address generator
138: Data buffer
180: Memory device
201 to 204: Pipeline registers
232: Execution unit
233: Arithmetic logic unit
234: Multiplication accumulation unit
240: Memory device
252: Register file
254: Bus interface unit
260: External memory
410: Memory space
420: Control circuit
422: Comparison circuit
423: Buffer
424: Selection circuit
426: Logic circuit
428: Controller
502-514, 602-640, 802-810: Steps
800: Data processing method
IF: instruction fetch stage
ID: instruction decode stage
EX: execution phase
MEM: memory access phase
WB: write-back phase
Ibuf0 to ibuf: memory cell
E (0) to E (N-1): an inlet
V: significant bit field
L: lock location element field
TG: label domain
DA: data field
INS: instruction streaming
LWI: load instruction
MAI: memory access instructions
I0 to I6: instructions for
Lub_d, MD1, MD2: data
Lub_instr: indication signal
Lub_dv: effective signal
Lub_h: hit signal
Addr: address of
RR: read request
DR, DR': decoding result
CC 0-CC 8: clock period
Claims (9)
1. A processor circuit, comprising:
an instruction decoding unit for decoding a load instruction to generate a decoding result;
an instruction detector, coupled to the instruction decoding unit, for detecting whether the load instruction is in a load-use situation, wherein the instruction detector is configured to receive the load instruction and an instruction executed after the load instruction, and to judge whether a load-use data hazard occurs when the instruction is executed using an execution result of the load instruction; when it is judged that executing the instruction using the execution result of the load instruction incurs the load-use data hazard, the instruction detector detects that the load instruction is in the load-use situation;
An address generator, coupled to the instruction decoding unit, for generating a first address required by the load instruction according to the decoding result; and
The data buffer is coupled to the instruction detector and the address generator and is used for storing the first address generated by the address generator and storing data required by the load instruction according to the first address when the instruction detector detects that the load instruction is in the load use situation;
Wherein: when the processor circuit needs to execute a pending instruction using the data required by the load instruction, the instruction decode unit retrieves the data from the data buffer for the processor circuit to execute the pending instruction.
2. The processor circuit of claim 1 wherein the processor circuit has a pipeline architecture, the instruction decode unit and the instruction detector are located in an instruction decode stage of the pipeline architecture, and the address generator and the data buffer are located in an execution stage of the pipeline architecture.
3. The processor circuit of claim 1 wherein the data buffer further provides data required by the load instruction to the instruction decode unit.
4. The processor circuit of claim 1 further comprising:
the cache memory is coupled to the data buffer, wherein the data buffer is configured to send a read request to the cache memory to read the data pointed to by the first address in the cache memory as the data required by the load instruction.
5. The processor circuit of claim 1 further comprising:
The bus interface unit is coupled between the data buffer and the external memory, wherein the data buffer sends a read request to the external memory through the bus interface unit so as to read the data pointed by the first address in the external memory as the data required by the load instruction.
6. The processor circuit of claim 1 wherein when the processor circuit is to process a memory access instruction, the instruction decode unit is to decode the memory access instruction to generate another decoded result, the address generator is to generate a second address required by the memory access instruction according to the another decoded result, and the data buffer is to check whether the second address is already stored in the data buffer; when the second address is stored in the data buffer, the data buffer is used for updating the data pointed by the second address in the data buffer into the data required by the memory access instruction.
7. The processor circuit of claim 6 wherein the memory access instruction is another load instruction or a store instruction.
8. The processor circuit of claim 1 wherein the data buffer comprises:
a memory space comprising a plurality of entries, wherein the first address and data required by the load instruction are stored in one of the plurality of entries; and
The control circuit is coupled to the memory space, wherein after the first address is stored in the one entry and before the data required by the load instruction is stored in the one entry, the control circuit is configured to set the lock bit field of the one entry to a specific bit pattern to protect the information stored in the one entry from being modified by instructions other than the load instruction.
9. A data processing method, comprising:
receiving a load instruction, and detecting whether the load instruction is in a load-use situation, wherein an instruction detector is used to receive the load instruction and an instruction executed after the load instruction, and to judge whether a load-use data hazard occurs when the instruction is executed using an execution result of the load instruction; when it is judged that executing the instruction using the execution result of the load instruction incurs the load-use data hazard, the instruction detector detects that the load instruction is in the load-use situation;
Decoding the load instruction to produce a decoded result;
generating a first address required by the load instruction according to the decoding result;
storing the first address in a data buffer when the load instruction is detected to be in the load use situation; and
Storing data required by the load instruction in the data buffer according to the first address; when the processor circuit needs to execute a pending instruction using the data required by the load instruction, the instruction decode unit retrieves the data from the data buffer for the processor circuit to execute the pending instruction.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010073303.1A CN113157631B (en) | 2020-01-22 | 2020-01-22 | Processor circuit and data processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113157631A CN113157631A (en) | 2021-07-23 |
CN113157631B true CN113157631B (en) | 2024-06-21 |
Family
ID=76881887
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||