CN117083594B - Method and apparatus for desynchronizing execution in a vector processor - Google Patents
- Publication number
- Publication number: CN117083594B; Application number: CN202280017945.7A
- Authority
- CN
- China
- Prior art keywords
- vector
- register
- instruction
- access control
- memory access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3838—Dependency mechanisms, e.g. register scoreboarding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Complex Calculations (AREA)
Abstract
In one embodiment, the vector processor unit has preload registers for at least some of vector length, vector constant, vector address, and vector stride. Each preload register has an input and an output. All preload register inputs are coupled to receive new vector parameters. The output of each of the preload registers is coupled to a first input of a respective multiplexer, and the second inputs of all respective multiplexers are coupled to receive the new vector parameters.
Description
RELATED APPLICATIONS
This patent application claims priority from pending U.S. Provisional Patent Application Ser. No. 63/180,634, entitled "Method and Apparatus for Programmable Machine Learning and Inference," filed by the same inventor on April 27, 2021, which is incorporated herein by reference. This patent application claims priority from pending U.S. Provisional Patent Application Ser. No. 63/180,562, entitled "Method and Apparatus for Gather/Scatter Operations in a Vector Processor," filed by the same inventor on April 27, 2021, which is incorporated herein by reference. This patent application claims priority from pending U.S. Patent Application Ser. No. 17/669,995, entitled "Method and Apparatus for Gather/Scatter Operations in a Vector Processor," filed by the same inventor on February 11, 2022, which is incorporated herein by reference. This patent application claims priority from pending U.S. Provisional Patent Application Ser. No. 63/180,601, entitled "System of Multiple Stacks in a Processor Devoid of an Effective Address Generator," filed by the same inventor on April 27, 2021, which is incorporated herein by reference. This patent application claims priority from pending U.S. Patent Application Ser. No. 17/468,574, entitled "System of Multiple Stacks in a Processor Devoid of an Effective Address Generator," filed by the same inventor on September 7, 2021, which is incorporated herein by reference. This patent application claims priority from pending U.S. Patent Application Ser. No. 17/701,582, entitled "Method and Apparatus for Desynchronizing Execution in a Vector Processor," filed by the same inventor on March 22, 2022, which is incorporated herein by reference.
Technical Field
The present methods and apparatus relate to vector processors. More particularly, the present methods and apparatus relate to methods and apparatus for desynchronized execution in vector processors.
Background
To improve throughput, Vector Processing Units (VPUs) access vectors in memory and perform vector operations sequentially at a high rate. Because vector processors are built for very high speed, interrupting the vector pipeline for any reason (for example, to handle serial or scalar operations or housekeeping instructions) carries a high performance cost.
This presents a technical problem which requires technical solutions using technical means.
Disclosure of Invention
The vector processor unit is provided with preload registers for vector length, vector constant, vector address and vector stride, each having an input and an output. All preload register inputs are coupled to receive new vector parameters. The output of each of the preload registers is coupled to a first input of a respective multiplexer, and the second inputs of all respective multiplexers are coupled to receive the new vector parameters.
Drawings
The disclosed technology is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings. Items numbered identically are not necessarily identical.
The drawings illustrate various non-exclusive examples of the presently disclosed technology.
FIG. 1 shows a block diagram overview of a decoding unit according to an example, generally at 100.
FIG. 2 shows a block diagram overview of vector registers for addressing memory access control, generally at 200.
FIG. 3 shows a block diagram overview of a portion of a vector processor unit including memory access control preload registers, generally at 300.
FIG. 4 shows a flow chart illustrating the desynchronized execution of instructions and the synchronized execution of instructions, generally at 400.
FIG. 5 shows a flow chart illustrating asynchronous, desynchronized, and synchronized execution of instructions, generally at 500.
FIG. 6 shows a flow chart illustrating the execution of a vector instruction, generally at 600.
FIG. 7 shows a flow chart illustrating the execution of a desynchronized vector instruction other than a non-desynchronized instruction, generally at 700.
Detailed Description
A method and apparatus for desynchronized execution in a vector processor is disclosed.
Definitions and Notation
Various terms are used to describe the techniques disclosed herein. The Applicant acts as its own lexicographer and defines these terms as follows. Terms are referenced below at their first use.
"Concurrent" is the same as "parallel" and is defined as doing two things at least partially overlapping in time. This implies nothing about how the two relate to each other; they may be "synchronized" or "desynchronized".
"Synchronized" execution is execution in which pipeline control governs every aspect of an instruction's operation.
"Desynchronized" execution is execution in which essential components of an instruction's operation proceed independently of pipeline control. Thus, pipeline control may control the execution and completion of one or more instructions subsequent to an instruction undergoing desynchronized execution, before that desynchronized execution completes.
Note that instruction execution following a desynchronized instruction is deemed to modify critical processor state if it makes an unacceptable change to the result of a program executing on the processor. An unacceptable change is an end result of all processing of a given program that differs from the result obtained when all instructions execute serially, i.e., each instruction completes before the next begins. Critical processor state is state that must be maintained to avoid an unacceptable change. Acceptable changes may include, but are not limited to, the occurrence of a sequential fault or interrupt, and updates to program-visible registers occurring out of order with respect to the desynchronized instruction (but not out of order with respect to non-desynchronized instructions). Changes that would be deemed unacceptable are prevented from occurring by the process of re-synchronizing execution.
A "desynchronized instruction" is an instruction whose execution is not 100% under the control of pipeline control, i.e., an essential component of its operation is not under pipeline control; however, pipeline control may monitor its progress.
A "non-desynchronized instruction" is an instruction that executes without desynchronization.
"Re-synchronizing" execution causes instructions following a desynchronized instruction to cease execution until the desynchronized instruction is complete. This occurs when a subsequent instruction would modify critical processor state, particularly when that processor state would affect the result of the desynchronized instruction.
An "asynchronous" instruction/execution is an instruction that, as part of its execution, invokes an activity external to the processor that will complete in a time entirely uncontrolled by, and unpredictable to, the processor. Pipeline control cannot monitor its progress. Meanwhile, the processor may continue executing instructions.
"Asynchronous re-serialization" waits for asynchronous execution to complete and then allows subsequent instructions to execute. Generally, this is done to preserve the integrity of the program results.
Note that the difference between desynchronization and asynchrony is subtle. In desynchronized execution, the processor has full control of both instructions being executed, even though it allows the second instruction to modify processor state before the first (desynchronized) instruction has completed. In asynchronous execution, the processor has zero (no) control over when the activity external to the processor, invoked by the asynchronous instruction, will complete.
Note that the term desynchronized execution is used when non-vector instructions are allowed to execute after a vector instruction has begun but before it has completed. The vector instruction's execution is considered desynchronized with respect to the subsequent non-vector instructions that are allowed to execute.
However, the disclosed desynchronization methods are not limited to this case. Although, for clarity of explanation, the discussion generally concerns non-vector instructions executing while a desynchronized vector instruction executes, in alternative embodiments a second vector instruction may be allowed to execute in a desynchronized manner while a first desynchronized vector instruction is executing. In addition, long-running instructions other than vector instructions (i.e., instructions taking longer to complete execution than others) are candidates for desynchronized execution.
Note that the term asynchronous execution is used, for example, for external load memory (xload) and external store memory (xsave) instructions that request a processor external to the Vector Processing Unit (VPU) to coordinate data movement between the memory of the VPU and the external memory.
"Modify/change/copy/transfer register" refers to modifying/changing/copying/transferring a value or parameter stored within a register. That is, for example, copying a first register to a second register will be understood to copy the contents or parameters contained or held in the first register into the second register such that the second register now contains the values or parameters of the first register.
"Contention" refers to two or more processes, such as but not limited to executing instructions, attempting to alter or access the same entity, such as but not limited to memory or registers, where the alteration would introduce uncertainty into the processing results. For example, if two executing instructions both attempt to change a particular memory location, that is contention for a resource, i.e., for the same memory location. Contention may lead to different processing results depending on which instruction completes execution first. Desynchronization contention, for example, is contention between an executing desynchronized instruction and another instruction that would affect the processor output, yielding a different output depending on which instruction completes execution first. Asynchronous contention, likewise, is contention between an executing asynchronous instruction and another instruction that would affect the processor output, yielding a different output depending on which instruction completes execution first.
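The resource-overlap idea behind contention can be sketched in software. The following Python model is purely illustrative (the `Instr` structure and the register/memory labels are assumptions for the sketch, not part of the disclosed hardware): contention exists when the next instruction touches a resource an in-flight instruction also touches, with at least one of them writing it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    reads: frozenset   # registers / memory regions read
    writes: frozenset  # registers / memory regions written

def contends(in_flight: Instr, nxt: Instr) -> bool:
    """True if `nxt` races with a still-executing instruction.

    Contention exists when the next instruction writes something the
    in-flight instruction reads or writes, or reads something the
    in-flight instruction writes (the classic RAW/WAR/WAW hazards).
    """
    return bool(
        nxt.writes & (in_flight.reads | in_flight.writes)
        or nxt.reads & in_flight.writes
    )

# Hypothetical instructions for illustration:
vec_store = Instr(reads=frozenset({"V0"}), writes=frozenset({"mem:A"}))
scalar_add = Instr(reads=frozenset({"R1", "R2"}), writes=frozenset({"R3"}))
mem_write = Instr(reads=frozenset({"R3"}), writes=frozenset({"mem:A"}))

assert not contends(vec_store, scalar_add)  # disjoint resources: may run desynchronized
assert contends(vec_store, mem_write)       # same memory location: desynchronization contention
```

In this sketch, `scalar_add` may execute while the desynchronized `vec_store` is in flight, while `mem_write` would trigger re-synchronization because the final contents of `mem:A` would depend on completion order.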
"Vector parameters"/"new vector parameters" refers to information about a vector. In one example, this may be a plurality of signals. More specifically, it is the information the processor needs to access memory (e.g., to read and write vectors). "New" refers to the situation where the processor has already consumed the current vector parameters and a new vector operation is being queued or placed in the pipeline for future execution; that operation's vector parameters are called "new vector parameters" to distinguish them from the vector parameters currently in use by the executing vector instruction.
Detailed Description
In one example, a vector processor unit is provided having preload registers for vector length, vector constant, vector address, and vector stride. Each preload register has a respective input and a respective output. All preload register inputs are coupled to receive new vector parameters. The output of each of the preload registers is coupled to a first input of a respective multiplexer, and the second inputs of all respective multiplexers are coupled to receive the new vector parameters.
In one example, the present invention discloses mechanisms to determine when desynchronization and asynchronous execution can occur and to stop instruction execution if desynchronization and/or asynchronous execution must be completed (referred to as re-synchronization and asynchronous re-serialization, respectively), generally in order to maintain the integrity of program results. The disclosed method not only allows for desynchronized execution and asynchronous execution, but also limits the cases when re-synchronization or asynchronous re-serialization is to be performed, as re-synchronization and asynchronous re-serialization reduce program performance.
FIG. 1 shows a block diagram overview of a decoding unit, generally at 100. At 102 is instruction fetch control, which fetches instructions from a memory system. Although not germane to an understanding of decoding unit 100, the memory system may be, for example, a random access memory (RAM). Instruction fetch control 102 outputs information to instruction decode 104 via 103, and outputs execute/stop information to operating state control 106 and pipeline control 108 via 105. Instruction decode 104 outputs information via 107 to stall detection 112, result bypass detection 114, and resource allocation tracking 116. Pipeline control 108 outputs information to resource allocation tracking 116 via 117. Resource allocation tracking 116 outputs information via 119 to result bypass detection 114 and stall detection 112. Result bypass detection 114 outputs information to pipeline control 108 via 115. Stall detection 112 outputs information to pipeline control 108 via 113. Pipeline control 108 outputs information to, and receives information from, register unit 118, memory access control unit 120, scalar arithmetic logic unit (ALU) 122, vector arithmetic logic unit (ALU) 124, and branch unit 126 via 121. Branch unit 126 outputs information to instruction fetch control 102 via 125. Branch unit 126 outputs information to fault control 110 via 123. Vector ALU 124 outputs information to fault control 110 via 123. Scalar ALU 122 outputs information to fault control 110 via 123. Memory access control unit 120 outputs information to fault control 110 via 123. Register unit 118 outputs information to fault control 110 via 123. Fault control 110 outputs to pipeline control 108 via 109 and to operating state control 106 via 111. Branch unit 126 receives information output from scalar ALU 122 and from vector ALU 124 via 127.
To briefly summarize, it can be seen from FIG. 1 that pipeline control 108 communicates with, among other things, register unit 118, memory access control unit 120, scalar ALU 122, and vector ALU 124. Pipeline control 108 attempts to keep the processor in which decoding unit 100 resides running as fast as possible by avoiding serializing, in any scalar or vector ALU, work that can be done in parallel. In a simple sense, it acts as a traffic officer, directing traffic to improve throughput.
In processors capable of performing both scalar and vector operations, it is preferable to keep the vector ALUs operating at as high a rate as possible, since vector operations involve more processing than scalar operations, and thus essentially determine the overall processing rate.
FIG. 2 shows a block diagram overview of vector registers for addressing memory access control, generally at 200. At 201 is a new vector parameter; i.e., 201 represents the reception of a new vector parameter 201 to be loaded. The new vector parameter 201 is coupled to the input of vector length register 202, and the output of vector length register 202 is coupled to memory access control 220 via 203. The new vector parameter 201 is also coupled to the input of vector constant register 204, and the output of vector constant register 204 is coupled to memory access control 220 via 205. The new vector parameter 201 is also coupled to the input of vector address register 206, and the output of vector address register 206 is coupled to memory access control 220 via 207. The new vector parameter 201 is coupled to the input of vector stride register 208, and the output of vector stride register 208 is coupled to memory access control 220 via 209. Although vector length register 202, vector constant register 204, vector address register 206, and vector stride register 208 are shown, in some examples one or more of vector length register 202 and vector constant register 204 are not provided.
Memory access control 220 is a functional block, not a register. It receives as inputs the vector length supplied via 203 from vector length register 202, the vector constant supplied via 205 from vector constant register 204, the vector address supplied via 207 from vector address register 206, and the vector stride supplied via 209 from vector stride register 208. The combination of vector length register 202, vector constant register 204, vector address register 206, and vector stride register 208 may be referred to as vector control, and memory access control 220 may be referred to as a memory subsystem; i.e., the vector control controls the addressing of the memory subsystem. The memory subsystem may include RAM (not shown).
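As a rough software analogy (an assumption for illustration, not the patented circuitry), the vector address, stride, and length values determine which memory addresses one strided vector access touches:

```python
def element_addresses(vector_address: int, vector_stride: int,
                      vector_length: int) -> list[int]:
    """Addresses touched by one strided vector access: the i-th element
    lives at vector_address + i * vector_stride."""
    return [vector_address + i * vector_stride for i in range(vector_length)]

# A 4-element vector of 8-byte words starting at address 0x1000:
assert element_addresses(0x1000, 8, 4) == [0x1000, 0x1008, 0x1010, 0x1018]
```

This is why a complete set of vector-control register values must be stable for the full duration of a vector operation: changing the address, stride, or length mid-operation would change which memory locations the remaining elements touch.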
In understanding FIG. 3, described below, the reader will recognize that FIG. 2 is an example of a device whose vector memory control, as shown, does not support vector desynchronization, whereas FIG. 3 is an example of a device whose vector memory control, as shown, does support vector desynchronization.
FIG. 3 shows a block diagram overview of a portion of a vector processor unit including a memory access control preload register generally at 300.
At 301 are new vector parameters. The new vector parameters 301 are coupled to inputs of vector length preload register 302, and outputs of vector length preload register 302 are coupled to a first input of respective multiplexer 310 via 303. A second input of multiplexer 310 is coupled to the new vector parameters 301, i.e., bypassing vector length preload register 302. The output of multiplexer 310 is coupled via 311 to vector length register 322. The output of vector length register 322 is coupled to memory access control 320 via 323.
The new vector parameters 301 are coupled to inputs of vector constant preload register 304, and outputs of vector constant preload register 304 are coupled to a first input of respective multiplexer 312 via 305. A second input of multiplexer 312 is coupled to the new vector parameters 301, i.e., bypassing vector constant preload register 304. The output of multiplexer 312 is coupled via 313 to vector constant register 324. The output of vector constant register 324 is coupled to memory access control 320 via 325.
The new vector parameters 301 are coupled to inputs of vector address preload register 306, and outputs of vector address preload register 306 are coupled to a first input of respective multiplexer 314 via 307. A second input of multiplexer 314 is coupled to the new vector parameters 301, i.e., bypassing vector address preload register 306. The output of multiplexer 314 is coupled via 315 to vector address register 326. The output of vector address register 326 is coupled to memory access control 320 via 327.
The new vector parameters 301 are coupled to inputs of vector stride preload register 308, and outputs of vector stride preload register 308 are coupled to a first input of multiplexer 316 via 309. A second input of multiplexer 316 is coupled to the new vector parameters 301, i.e., bypassing vector stride preload register 308. The output of multiplexer 316 is coupled via 317 to vector stride register 328. The output of vector stride register 328 is coupled to memory access control 320 via 329.
Although vector length preload register 302, vector constant preload register 304, vector address preload register 306, vector stride preload register 308, vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 are shown with respective multiplexers 310, 312, 314, 316, in some examples one or more of vector length preload register 302, vector length register 322, vector constant preload register 304, and vector constant register 324, and their respective multiplexers, are not provided.
At 330 is multiplexer control. The output of multiplexer control 330 is coupled to respective control inputs of multiplexer 316, multiplexer 314, multiplexer 312, and multiplexer 310 via 331. That is, the control inputs of multiplexer 316, multiplexer 314, multiplexer 312, and multiplexer 310 are all controlled via link 331 output from multiplexer control 330. In one example, link 331 carries a single signal to all control inputs of multiplexer 316, multiplexer 314, multiplexer 312, and multiplexer 310, and in another example, link 331 carries a respective signal to each of the control inputs of multiplexer 316, multiplexer 314, multiplexer 312, and multiplexer 310, such that they are individually controllable.
Multiplexer control 330 determines whether memory access control register 350 is to be loaded with the new vector parameters 301 or from the corresponding outputs of memory access control preload register 340, as described below, and accordingly controls link 331 so that memory access control register 350 is updated at the correct point between two desynchronized vector arithmetic operations. The update is either from the preload registers (302, 304, 306, 308) to the registers (322, 324, 326, 328) or from the new vector parameters 301 to the registers (322, 324, 326, 328). As described below, multiplexer control 330 also controls the writing to each of the preload registers (302, 304, 306, 308) and registers (322, 324, 326, 328).
Vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 together comprise memory access control preload register 340. Each of vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 are individually considered memory access control preload registers.
Vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 together comprise memory access control register 350. Each of vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 is individually considered a memory access control register.
Memory access control 320 is a functional block, not a register. It takes as input the vector length, vector constant, vector address, and vector stride register values (provided by the respective memory access control registers 322, 324, 326, 328 via the respective links 323, 325, 327, 329). Registers 322, 324, 326, 328, together with the corresponding parameters they communicate via links 323, 325, 327, 329, may be referred to as vector control, and memory access control 320 may be referred to as a memory subsystem; i.e., the vector control controls the addressing of the memory subsystem. The memory subsystem may include RAM (not shown).
Multiplexer control 330 is considered to be in the non-preload position when the new vector parameters 301 pass through multiplexers 310, 312, 314, and 316, respectively, and then enter vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 via 311, 313, 315, and 317, respectively.
Multiplexer control 330 is considered to be in the preload position when multiplexers 310, 312, 314, and 316 receive inputs from vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308, respectively, via 303, 305, 307, and 309, respectively.
That is, in the non-preloaded position, memory access control register 350 receives parameters from new vector parameters 301. In the preload position, memory access control register 350 receives parameters from memory access control preload register 340.
In order not to obscure this example, the signals by which multiplexer control 330 controls writes to memory access control register 350 and memory access control preload register 340 are not shown. Through these write controls, multiplexer control 330 determines which registers receive the new vector parameters 301.
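The preload behavior described above can be summarized in a behavioral sketch. The register names below follow FIG. 3, but the class structure, method names, and parameter values are invented for illustration: new parameters are staged in the preload registers (340) while a desynchronized vector operation still uses the active registers (350), and the multiplexers commit them between operations.

```python
class MemAccessControlRegs:
    FIELDS = ("length", "constant", "address", "stride")

    def __init__(self):
        self.preload = dict.fromkeys(self.FIELDS)  # 340: preload registers
        self.active = dict.fromkeys(self.FIELDS)   # 350: registers feeding memory access control 320

    def write_preload(self, **new_params):
        # New vector parameters (301) staged while a desynchronized
        # vector operation is still using the active registers.
        self.preload.update(new_params)

    def mux_select(self, preload_position: bool, **new_params):
        # Multiplexer control 330: in the preload position, copy 340 -> 350;
        # in the non-preload position, load new parameters 301 directly.
        src = self.preload if preload_position else new_params
        for f in self.FIELDS:
            if src.get(f) is not None:
                self.active[f] = src[f]

regs = MemAccessControlRegs()
regs.write_preload(address=0x2000, stride=4, length=16)  # staged during a vector op
regs.mux_select(preload_position=True)                   # committed between desynchronized ops
assert regs.active["address"] == 0x2000 and regs.active["length"] == 16
```

The design point this models is that the executing vector instruction never sees a half-updated parameter set: the active registers change only at the commit point chosen by multiplexer control 330.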
In fig. 3, the multiplexer 310 is considered to be a first multiplexer.
In fig. 3, the multiplexer 312 is considered a second multiplexer.
In fig. 3, the multiplexer 314 is considered to be a third multiplexer.
In fig. 3, the multiplexer 316 is considered a fourth multiplexer.
FIG. 4 shows a flow chart illustrating the desynchronized execution of instructions and the synchronized execution of instructions, generally at 400. At 402, the next instruction is fetched for execution. The process proceeds via 403 to 404. At 404, it is determined whether the next instruction to be executed affects or depends on the result of any current desynchronized instruction in progress. When the next instruction to be executed affects or depends on the result of any current desynchronized instruction in progress, this is referred to as desynchronization contention (yes), and the process proceeds via 419 to 420. When the next instruction to be executed does not affect or depend on the result of any current desynchronized instruction in progress (no), the process proceeds via 405 to optional asynchronous execution 430. At 420, execution is re-synchronized by waiting for all desynchronized operations to complete before proceeding via 405 to optional asynchronous execution 430. When optional asynchronous execution 430 is not present, the process proceeds via 409 to 410.
At 410, it is determined whether the fetched instruction can be executed desynchronized. When the fetched instruction can be executed desynchronized (yes), the process proceeds via 411 to 412. At 412, desynchronized execution is initiated by allowing the processor to desynchronize the fetched instruction; i.e., completion of the fetched instruction occurs desynchronized relative to the processor's control, but the processor tracks the internal signal given when the operation is complete. The processor does not wait for this completion signal before proceeding via 415 to 402.
When the fetched instruction cannot be executed desynchronized (no), the process proceeds via 413 to 414. At 414, synchronized execution is initiated by having the processor execute the fetched instruction synchronously; i.e., the instruction appears to the program to complete before the process proceeds via 415 to 402. The processor may be pipelined or employ other overlapping execution techniques, but in a manner that makes instructions appear to the program to complete before proceeding to 402.
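The decision flow of FIG. 4 (omitting the optional asynchronous path 430) can be approximated by a software dispatch loop. Everything here is an illustrative assumption: the function names, the toy string-based instruction encoding, and the predicates are stand-ins for pipeline-control hardware, and the numbered comments map back to the flow-chart boxes.

```python
def dispatch_loop(program, can_desync, has_contention):
    in_flight = []  # desynchronized instructions still completing
    trace = []
    for instr in program:           # 402: fetch the next instruction
        # 404/420: re-synchronize if the next instruction affects or
        # depends on any in-flight desynchronized instruction.
        if any(has_contention(d, instr) for d in in_flight):
            trace.append(("resync", instr))
            in_flight.clear()       # wait for all desynchronized ops to finish
        # 410: choose desynchronized (412) or synchronized (414) execution.
        if can_desync(instr):
            in_flight.append(instr) # 412: start, do not wait for completion
            trace.append(("desync", instr))
        else:
            trace.append(("sync", instr))  # 414: complete before the next fetch
    return trace

# Toy program: a vector op, an independent scalar op, then a dependent op.
prog = ["vadd", "sadd", "use_v"]
trace = dispatch_loop(
    prog,
    can_desync=lambda i: i.startswith("v"),
    has_contention=lambda d, i: i == "use_v",
)
assert trace == [("desync", "vadd"), ("sync", "sadd"),
                 ("resync", "use_v"), ("sync", "use_v")]
```

The trace shows the benefit being claimed: the independent scalar instruction executes without waiting for the vector instruction, and re-synchronization is paid only by the instruction that actually contends.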
Some operations are allowed to occur out of order; others are not. Not everything can be out of order, or the overall integrity of the program (and thus its usefulness) would be compromised. To guard against instructions that could corrupt processor state, a process called re-synchronization (i.e., 420) is provided that stops further execution until the desynchronized operation has completed. Re-synchronization affects performance, and this disclosure details ways to eliminate some re-synchronizations, thereby speeding up program execution.
Knowing when desynchronized execution of one or more instructions (e.g., vector instructions) is in progress, such as at 412 in Fig. 4, allows multiplexer control 330 in Fig. 3 to update memory access control registers 350 at the correct point between two desynchronized vector arithmetic operations.
One vector instruction may be desynchronized from the executing instructions in the pipeline, allowing another instruction to execute. If a subsequent instruction contends for resources with the desynchronized instruction, the subsequent instruction must wait until the contention disappears; this is one example of desynchronization contention, as described with respect to 404. However, if a second vector instruction can execute without causing resource contention, the second vector instruction may also execute desynchronized.
Instructions eligible for desynchronized execution are typically those that run for a long period of time, because this allows subsequent instructions to complete their execution while the desynchronized instruction is still executing. The execution time of those subsequent instructions is thus effectively hidden, because they do not wait for the desynchronized instruction to complete.
Another way to view the examples disclosed herein is to ask which instructions can execute while a desynchronized instruction is executing.
Because vector instructions run for a long time and represent a significant amount of the work in a vector processor, it would be desirable to allow all non-vector instructions to execute while a desynchronized vector instruction executes. If this can be achieved, processing time is limited only by the execution of the vector instructions, since all other instructions execute while the desynchronized vector instruction is executing.
Vector instructions read operands from memory and write results to memory. Thus, instructions that do not access memory are candidates for execution while a desynchronized vector instruction is executing. These include all scalar arithmetic instructions whose operands come from, and whose results go to, the register set. Candidates also include instructions that access a different memory, or a different memory region, than the desynchronized vector instruction. This may include, but is not limited to, subroutine calls and returns, and pushing and popping parameters on the stack.
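The eligibility rule just described can be expressed as a small predicate. This is an illustrative sketch under stated assumptions: memory footprints are modeled as inclusive (first, last) address pairs, and `None` stands for a register-only scalar instruction; the function name and representation are not from the patent.

```python
def may_run_during_desync(instr_region, vector_region):
    """True if an instruction may execute while a desynchronized vector
    instruction streams over vector_region = (first, last) addresses.

    instr_region is None for register-only scalar instructions, or a
    (first, last) pair of addresses the instruction touches."""
    if instr_region is None:
        return True                       # no memory access at all
    lo_i, hi_i = instr_region
    lo_v, hi_v = vector_region
    return hi_i < lo_v or hi_v < lo_i     # disjoint memory regions
```

For a vector occupying mem[100..163], a scalar `add r7 r8` (no memory region) and a stack push at mem[500] both qualify, while an access to mem[150] does not.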
There is a class of instructions that can cause contention with desynchronized vector instructions: for example, instructions that set up subsequent vector operations (vector address in memory, vector length, and so on) and thereby modify resources of the desynchronized vector instruction currently being executed, adversely affecting it.
For performance reasons, it would be desirable if these contending instructions could also be executed in parallel with the desynchronized vector instructions.
If processing vectors represents a significant amount of the work in a vector processor, then instructions that set up those vectors are also very common, and it is a significant performance loss if execution must be resynchronized every time a new vector is set up.
Thus, there is a need for instructions that set the memory access control preload registers (e.g., at 340 of Fig. 3), which specify a vector address, a vector stride, a vector length, and a vector constant value, such that a currently executing desynchronized vector instruction is not adversely affected.
Vector length, vector constant, vector address, and vector stride are entities that may reside in registers, such as memory access control register 350 in fig. 3 (e.g., 322, 324, 326, and 328, respectively), and communicate with memory access control 320 via 323, 325, 327, and 329, respectively. Vector length preloading, vector constant preloading, vector address preloading, and vector stride preloading are entities that may reside in memory access control preload registers 340 (e.g., 302, 304, 306, and 308, respectively, in fig. 3) and communicate with multiplexers 310, 312, 314, and 316, respectively, via 303, 305, 307, and 309, respectively. Vector length, vector constants, vector addresses, and vector stride (collectively referred to as vector ports for ease of discussion) allow for addressing of vectors in memory access control 320 (referred to as memory for ease of discussion). Thus, the vector port addresses the memory to point to the vector.
For example, the vector length is the length of the vector in memory.
For example, the vector constant is a constant used when computing with a vector. If each element of vector A needs to be multiplied by 2, vector B (the multiplier) is a vector whose elements all have the value 2. The vector constant may be held in a register that specifies the value of every element of vector B, rather than requiring vector B to reside in memory.
The vector address is the address at which a vector is found. The vector stride is the value added to the vector address each time memory is accessed for a vector element. For example, if the vector is a row of a matrix, the stride may be 1; but if the vector is a column of a matrix having N elements in each row, the stride may be set to N. The vector address and vector stride together address the memory locations where a vector is read or written.
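How address, stride, and length cooperate can be sketched as follows. This is a minimal illustrative model: the function name and the row-major 4x4 matrix layout are assumptions for the example, with the parameters corresponding conceptually to registers 326 (address), 328 (stride), and 322 (length) of Fig. 3.

```python
def element_addresses(address, stride, length):
    """Memory addresses visited when streaming a vector: address,
    address + stride, address + 2*stride, and so on, length times."""
    return [address + i * stride for i in range(length)]

# Row of a 4x4 matrix stored row-major at base address 100: stride 1.
row = element_addresses(100, 1, 4)   # consecutive locations
# Column of the same matrix: stride N = 4.
col = element_addresses(100, 4, 4)   # every fourth location
```

A vector constant, by contrast, needs no addresses at all: a multiplier vector whose elements are all 2 is simply the register value 2 broadcast across the vector length.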
Detailed instruction execution examples
These detailed examples are illustrative of the technology disclosed herein as it is used to enhance the execution of vector processors.
An example of desynchronized execution is shown first. An example of asynchronous execution is shown next. Finally, an example shows the relation to co-pending application number 17/468,574, filed on September 7, 2021, which describes a parameter stack, a register stack, and a subroutine call stack separate from the local memory widely used by vector ALU 124.
In the following examples, these mnemonics have the following meanings:
Mov - move
RX - register, where X is the integer number of the register
SAS - set address and stride
Slen - set vector length
Sqrt - square root
VX - vector, where X is the integer number of the vector
Mem[Xs] - memory, where Xs is the memory address
Add - add
Div - divide
Etc - et cetera, meaning further instructions may follow
Log - logarithm
// - comment (not part of the execution code)
Xload - load data from an external source
Xsave - save data to an external destination
Store - store to local memory
Fetch - fetch from local memory
Xswait - pause instruction execution until the asynchronous xsave operation is complete
Push - push the referenced value onto the top of the stack
Call - transfer control to the specified instruction/routine
# - comment (not part of the execution code), an alternative syntax to //
Xlwait - pause instruction execution until the asynchronous xload operation is complete
Although block 124 indicates one or more vector ALUs in Fig. 1, the following examples consider the case where block 124 is a single vector ALU, referred to as vector ALU 124, so as not to confuse the reader. The techniques disclosed herein are not limited to this, and multiple ALUs are possible.
Desynchronization execution
========================
mov r0 100      // r0 gets 100
mov r1 1        // r1 gets 1
sas0 r0 r1      // set address and stride of vector 0 (v0): address 100, stride 1
sas1 r0 r1      // set address and stride of vector 1 (v1): address 100, stride 1
mov r2 64       // r2 gets 64
slen r2         // set the vector length to 64, so v0 and v1 occupy memory locations mem[100, 101, 102, ..., 163]
sqrt v0 v1      // v0 gets the square root of v1; since v0 and v1 have the same address, v1 is replaced by its square root
add r7 r8       // without desynchronization, this instruction must wait until the previous sqrt instruction completes
div r7 r9       // without desynchronization, this instruction must wait until the previous sqrt instruction completes
etc
While the sqrt instruction is executing, there is no reason the instructions following it cannot execute. This means pipeline control 108 needs to allow the sqrt instruction to execute desynchronized, so that pipeline control 108 may allow execution of subsequent instructions (add r7 r8 and div r7 r9 in the example above).
However, at some point, if the desynchronized operation is still in progress, pipeline control 108 may need to resynchronize it. For example, if vector ALU 124 supports only one vector operation at a time, resynchronization proceeds as follows:
sqrt v0 v1      // pipeline control desynchronizes this, i.e., allows the sqrt instruction to execute desynchronized
add r7 r8       // pipeline control allows it to execute
div r7 r9       // pipeline control allows it to execute
log v0 v1       // pipeline control must resynchronize: this cannot yet execute due to resource contention (on v0 and v1), i.e., log v0 v1 is trying to use v0 and v1, but it is not known whether sqrt v0 v1 (which uses v0 and v1) has completed, so resynchronization is required
In the example directly above, the original vector is square-rooted, and then, because no vector address changed, the result of the square root is operated on by a logarithm function. But if vector ALU 124 can perform only one vector operation at a time, the square root must complete before the logarithm can begin. If the square root has not completed (as monitored by resource allocation tracking 116), the desynchronized sqrt must be resynchronized with the execution of pipeline control 108. Resource allocation tracking 116 indicates to stall detection 112 that a resynchronization must occur, and stall detection 112 causes pipeline control 108 to stall execution of the logarithm instruction until resynchronization is complete and vector ALU 124 is available.
Resynchronization represents a loss of performance and, although sometimes necessary, is undesirable. Ideally, vector ALU 124 should remain as busy as possible, with utilization as close to 100% as possible, because processing vectors is the bulk of the work in a vector processor.
Consider the following example, which represents a number of common scenarios:
mov r0 100      // same as the example above, down through the sqrt
mov r1 1
sas0 r0 r1
sas1 r0 r1
mov r2 64
slen r2
sqrt v0 v1
mov r0 200      // set up a new vector operation, with operand and result vectors in mem[200, 201, 202, ..., 295]
mov r1 1
sas0 r0 r1
sas1 r0 r1
mov r2 96
slen r2
log v0 v1       // mem[200, 201, 202, ..., 295] gets the logarithm of mem[200, 201, 202, ..., 295]
In this case, the second occurrence of the sas0, sas1, and slen instructions changes the locations in memory where the operand and result vectors reside. But if the sqrt instruction is still executing desynchronized when they execute, they will adversely affect sqrt, because its vectors would unexpectedly have a changed address, stride, and length. So the second occurrence of sas0 must initiate a resynchronization, which is undesirable.
Fig. 3 shows an example of how resynchronization can be avoided.
By writing the operand and result parameters to memory access control preload registers 302, 306, and 308, rather than to memory access control registers 322, 326, and 328, the second occurrence of the sas0, sas1, and slen instructions may be allowed to execute while the desynchronized sqrt is in progress.
When a desynchronized operation is in progress, multiplexer control 330, which is controlled by pipeline control 108, recognizes an attempt to modify one of memory access control registers 350 and instead causes the corresponding memory access control preload register 340 to be written; i.e., multiplexer control 330 decides whether memory access control registers 350 or memory access control preload registers 340 are written. Thus, while the desynchronized operation is in progress, memory access control registers 350 are not affected by subsequent instructions, and the desynchronized operation is not adversely affected.
Pipeline control 108 also recognizes when the desynchronized operation completes; if any of memory access control preload registers 340 have been modified, their contents are moved by multiplexer control 330, under pipeline control 108, into the corresponding memory access control registers 350. Thus, all of the functionality required by the second occurrence of the sas0, sas1, and slen instructions is provided, without their having to resynchronize and lose performance. The vector logarithm instruction may now execute and, as a vector instruction, may execute desynchronized. If multiple vector instructions cannot execute in parallel, pipeline control 108 first resynchronizes the vector logarithm so that only one desynchronized vector instruction executes at a time.
The above allows the vector unit to remain nearly 100% busy (ignoring any startup inefficiency in a particular embodiment). Vector ALU 124 proceeds immediately from taking the square root of each element of one vector to taking the logarithm of each element of another vector, meeting the goal of keeping vector ALU 124 nearly 100% busy.
If sqrt completes before the second occurrence of the sas0, sas1, and slen instructions, no desynchronized operation is in progress. Pipeline control 108 recognizes this and allows memory access control registers 350 to be updated immediately with the new vector parameters 301 via multiplexer control 330, without having to use memory access control preload registers 340.
The second sas0 may update registers 306 and 308, instead of registers 326 and 328, because sqrt is executing desynchronized; but by the time the slen instruction executes, the desynchronized execution may have completed. In that case, when sqrt completes, multiplexer control 330 updates registers 326 and 328 from registers 306 and 308, and allows slen to write directly into register 322.
FIG. 3 illustrates a method by which desynchronized execution may continue while additional instructions are allowed to execute, even when those instructions have resource contention, because the arrangement of FIG. 3 resolves the contention. The particular example shown in FIG. 3 is illustrative and does not limit the scope.
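The preload mechanism of Fig. 3 can be modeled in a few lines. This is a behavioral sketch only: the class, field names, and the `desync_busy` flag are illustrative assumptions standing in for registers 350, preload registers 340, and the decision made by multiplexer control 330.

```python
class MemAccessControl:
    """Sketch of Fig. 3: active registers (350), preload registers
    (340), and the write-diversion decision of multiplexer control 330."""
    FIELDS = ("length", "constant", "address", "stride")

    def __init__(self):
        self.active = dict.fromkeys(self.FIELDS)  # 322/324/326/328
        self.preload = {}                         # 302/304/306/308
        self.desync_busy = False                  # desync vector op in flight

    def write(self, field, value):
        # While a desynchronized vector instruction still uses the active
        # registers, divert the write into the preload register instead.
        if self.desync_busy:
            self.preload[field] = value
        else:
            self.active[field] = value

    def desync_done(self):
        # On completion, move any preloaded values into the active
        # registers, ready for the next vector instruction.
        self.active.update(self.preload)
        self.preload.clear()
        self.desync_busy = False
```

With this model, a second `sas0`/`slen` issued during a desynchronized `sqrt` lands in the preload registers and takes effect only when `sqrt` completes, so no resynchronization is needed.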
Asynchronous execution
======================
Asynchronous execution is a form of desynchronized execution in which certain actions cannot be predicted or anticipated because they are beyond the control of the processor.
An example is the programmed loading or saving of local memory from or to an external memory or device. If a program instruction initiates a process by which an external process reads the local RAM and does something with the data, such as saving it to external memory, then pipeline control 108 (in Fig. 1) does not know when the external process will actually read the local memory. Similarly, if a program instruction initiates a process by which an external process loads new data into the local RAM, pipeline control 108 does not know when the data will actually be written and become available to the processor.
This example may be further elucidated by two instructions:
xload r1 r2 r3 -- r2 bytes of data starting at external memory address r3 are loaded into local memory starting at address r1. That is, the contents of external memory locations r3, r3+1, ..., r3+r2-1 are loaded into the respective local memory locations r1, r1+1, ..., r1+r2-1.
xsave r1 r2 r3 -- r2 bytes of data starting at local memory address r3 are saved to external memory starting at address r1. That is, the contents of local memory locations r3, r3+1, ..., r3+r2-1 are saved to the respective external memory locations r1, r1+1, ..., r1+r2-1.
Where r1, r2, and r3 are registers containing the values expected by the operation.
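The data movement of these two instructions can be sketched directly from the definitions above. This is an illustrative model, not hardware: memories are represented as Python dictionaries mapping address to byte value.

```python
def xload(local, r1, r2, external, r3):
    """Load r2 bytes from external[r3..r3+r2-1] into local[r1..r1+r2-1]."""
    for i in range(r2):
        local[r1 + i] = external[r3 + i]

def xsave(external, r1, r2, local, r3):
    """Save r2 bytes from local[r3..r3+r2-1] into external[r1..r1+r2-1]."""
    for i in range(r2):
        external[r1 + i] = local[r3 + i]
```

In hardware these transfers proceed asynchronously; the model captures only the addressing, i.e., which external locations map to which local locations.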
Because xload and xsave can take a significant amount of time, it would be preferable for pipeline control 108 to continue executing the instructions that follow an xload or xsave, as it does for desynchronized execution. This variation of desynchronized execution is referred to as asynchronous execution, because some of the activity of the xload and xsave instructions occurs asynchronously with respect to pipeline control 108.
Asynchronous execution allows faster program execution. However, when there is resource contention or a data dependency, the same kinds of problems as with resynchronization must be considered. While an asynchronous operation has not yet received the external indication that it is complete, resource allocation tracking 116 monitors for these problems and, when necessary, instructs stall detection 112 to have pipeline control 108 suspend instruction execution until the problem is resolved or the asynchronous operation completes. This differs from resynchronization in that an asynchronous operation may complete while desynchronized vector operations are still in progress; instructions that had to wait for the asynchronous operation to complete may then execute, even though the desynchronized vector operation has not been resynchronized.
Consider the xload instruction. Once it is issued by pipeline control 108, at some unpredictable point in the future the external process will write the data retrieved from the external memory or device into local memory. If the local memory does not have separate write ports for external and internal (processor-generated) writes, this is resource contention. Even with multiple write ports, a future instruction may require the new data being loaded by xload. This too is resource contention: the resource is the data being loaded from the external source, and the contention is the correct ordering between the load and the instructions following xload that use the data.
Consider the xsave instruction. Once it is issued by pipeline control 108, at some unpredictable point in the future the external process will read the data from local memory and save it to the external memory or device. If the local memory does not have separate read ports for external and internal (processor-generated) reads, this is resource contention. Even with multiple read ports, a future instruction may overwrite data that is still in the process of being saved by the xsave instruction. This too is resource contention: the resource is the data, and the contention is the correct ordering of the reads of the data before it is overwritten by new data.
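The contention check that resource allocation tracking 116 performs against in-flight transfers can be sketched as a simple address-range overlap test. This is an illustrative assumption about the mechanism, not the patent's circuit: half-open (lo, hi) pairs represent the local-memory ranges of unfinished xload/xsave operations.

```python
def contends(pending_ranges, addr, count=1):
    """True if an instruction touching local addresses
    [addr, addr + count) overlaps the local-memory range of an
    unfinished asynchronous xload or xsave.

    pending_ranges: list of half-open (lo, hi) address ranges."""
    return any(addr < hi and lo < addr + count
               for lo, hi in pending_ranges)
```

For an in-flight `xload` filling mem[100..163], a store to mem[500] passes through, while a fetch from mem[100] would be held until the transfer completes.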
Here is an exemplary instruction stream:
mov r0 100
mov r1 64
mov r2 0x12345678
xload r0 r1 r2   // load 64 bytes from external mem[0x12345678, ...] into local mem[100, 101, ..., 163]
add r7 r8        // these may execute while xload proceeds asynchronously
mul r7 r9
mov r9 500
store r9 r7      // write r7 into local mem[500]; resource contention with xload for the memory write port
In this example, xload executes, but the loading of new data into local memory proceeds asynchronously. Thus, the add and mul instructions may execute. But the store instruction requires data to be written to local memory. Because when xload writes to local memory is also unpredictable, store and xload might attempt simultaneous writes, which are not supported in a design with only one write port. Therefore, the store instruction must be suspended until xload has completed writing to local memory. Resource allocation tracking 116 monitors the asynchronous xload, detects the contention, and instructs stall detection 112 to stop pipeline control 108 from executing the store instruction until resource allocation tracking 116 determines that the contention is resolved.
In this example, xload is allowed to execute asynchronously, with some performance improvement, up until the store instruction. But a further improvement is possible, because the store instruction writes to a different memory location than xload. It would be desirable to allow the store instruction, and the instructions after it, to execute while the asynchronous xload is still in progress.
One mechanism for such an improvement is for the external process to request permission from the processor to write to local memory, and to buffer the write data until pipeline control 108 grants that permission. This may be entirely satisfactory if only a small amount of data is being loaded from external memory; but if a large amount of data is being returned from external memory and the permission from pipeline control 108 to write to local memory is delayed, the buffer may become unacceptably large. (If a very long-running vector instruction is executing desynchronized, pipeline control 108 cannot interrupt it, because it is desynchronized; it may take a long time to complete before the write port is free.)
Another mechanism, which solves this problem and eliminates the buffer, is for the external process to stop the vector processor clock, perform the write, and then restart the vector processor clock. It is as if the vector processor becomes temporarily unconscious; during this time of zero activity, the local RAM is written, and only then does the vector processor become conscious again. From the perspective of the vector processor, new data suddenly appears in local memory. This requires that the local memory be on a clock separate from the rest of the vector processor, one that is not turned off during this "unconscious" operation.
Such "unconscious" operation does not solve all problems. Consider the following instruction stream:
mov r0 100      // all instructions identical to those above
mov r1 64
mov r2 0x12345678
xload r0 r1 r2
add r7 r8
mul r7 r9
mov r9 500
Store r9 r7// the instruction is now allowed to execute
Etc// plus many more instructions
mov r9 100
fetch r7 r9     // get mem[100] and put it into r7; this is a data contention with xload!
In this example, the fetch instruction retrieves data from local memory that was previously loaded by xload. The fetch is not allowed to execute until xload has written the data to local memory.
Resource allocation tracking 116 monitors the local memory address range associated with this xload and initiates the process of suspending any instruction that reads or writes a memory address in this range. This is an automatic means of resolving contention. A programmed means may also, or instead, be provided. Programmers generally know whether they are prefetching data and when the data is used later in the program. Thus, a programmer may use an instruction such as xlwait (xload wait) to tell pipeline control 108 to wait until the outstanding asynchronous xload has completed before continuing to execute instructions. This may lead to a simpler design, by transferring to the programmer the responsibility for ensuring that race hazards are avoided.
Similar considerations relate to xsave instructions:
Pipeline control 108 may issue xsave for asynchronous execution and continue executing subsequent instructions until an instruction with memory read port contention is encountered.
Memory read port contention can be eliminated by allowing external logic to turn off the vector processor clock.
The resource allocation tracking 116 monitors the local memory address associated with this xsave and initiates the process for suspending any instruction that modifies a memory address in this range.
The xswait instruction may transfer responsibility to the programmer to indicate when instruction execution should be suspended until asynchronous operation is complete.
Xsave has an additional consideration regarding what it means for the operation to be complete. In the case of xload, the operation is not considered complete until all the data has been loaded into local memory. But for xsave, there are two points at which it can be considered complete:
All the data to be saved has been read out of local memory.
All the data to be saved has been read out of local memory, and the external memory/device has acknowledged receipt of the data.
The latter definition of completion allows the external memory/process to indicate not only that the data has been received (e.g., that xsave saved it to a legal location), but also the integrity of the received data (e.g., whether it arrived with good parity).
Typically, a program cares only about the former definition, i.e., that the data has been read out of internal memory, even though it may not yet have been received and acknowledged by the external memory/device. This is because the program only needs to know that it may now continue executing and may modify the saved data, since the original state of the data has been captured.
But sometimes a program may need to know that xsave is 100% complete in every respect and that the external write has been acknowledged. For example, the data may be so critical that, if it arrives at the receiving end with a parity error, the program may want to re-xsave the data until receipt of good data has been acknowledged.
To this end, there may be two variants of xswait, corresponding to the two definitions of xsave completion.
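The two completion definitions can be captured in a small status model. This is an illustrative sketch: the class, the field names, and the `require_ack` flag distinguishing the two xswait variants are assumptions introduced here, not the patent's terminology.

```python
class XsaveStatus:
    """The two xsave completion points described above: (1) all data
    read out of local memory; (2) additionally, receipt acknowledged
    by the external memory/device."""
    def __init__(self):
        self.read_out = False      # all data read out of local memory
        self.acknowledged = False  # external device confirmed good receipt

    def complete(self, require_ack=False):
        # require_ack=False models the first xswait variant (read-out
        # only); require_ack=True models the second (read-out plus ack).
        if require_ack:
            return self.read_out and self.acknowledged
        return self.read_out
```

A program that merely wants to reuse the saved buffer checks `complete()`, while one guarding against parity errors checks `complete(require_ack=True)` and re-issues the xsave if the acknowledgment reports bad data.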
Fig. 5 illustrates a flow chart showing asynchronous, desynchronized, and synchronized execution of instructions, generally at 500. At 502, the next instruction is fetched for execution, and the process proceeds via 503 to 504. At 504, it is determined whether the fetched instruction affects or depends on the results of any desynchronized instructions currently in progress, i.e., whether desynchronization contention exists. When it does (yes), the process proceeds via 519 to 520. At 520, execution is resynchronized by waiting for all desynchronized operations to complete before proceeding via 505 to 506. When it does not (no), the process proceeds via 505 to 506.
At 506, it is determined whether the fetched instruction affects or depends on the results of any asynchronous operations in progress (i.e., asynchronous contention). When it does (yes), the process proceeds via 521 to 522; otherwise (no), the process proceeds via 507 to 508. At 522, execution is synchronized by waiting for all asynchronous operations to complete before proceeding via 507 to 508. At 508, it is determined whether the next instruction can be executed asynchronously. When it can (yes), the process proceeds via 517 to 518; otherwise (no), it proceeds via 509 to 510. At 518, asynchronous execution is initiated by allowing the processor to execute the next instruction asynchronously.
At 510, it is determined whether the fetched instruction can be executed desynchronized. When it can (yes), the process proceeds via 511 to 512. At 512, desynchronized execution is initiated by allowing the processor to desynchronize execution of the fetched instruction, i.e., completion of the fetched instruction occurs desynchronized relative to the control of the processor, but the processor tracks the internal signal indicating that the operation is complete. The processor does not wait for that completion signal before proceeding via 515 to 502.
When the fetched instruction cannot be executed desynchronized (no), the process proceeds via 513 to 514. At 514, synchronized execution is initiated by having the processor execute the fetched instruction synchronously, i.e., the instruction appears to the program to complete before proceeding via 515 to 502. The processor may be pipelined or employ other overlapping execution techniques, but in a manner that makes instructions appear to the program to complete before proceeding to 502.
Fig. 6 illustrates a flow chart showing the execution of vector instructions, generally at 600. At 602, it is determined whether a first vector instruction is currently executing. When it is not (no), the process returns via 601 to 602. When it is (yes), the process proceeds via 603 to 604, where the memory access control for the first vector instruction is accessed using the parameters stored in the registers; the process then proceeds via 605 to 606.
At 606, it is determined whether the first vector instruction has completed execution. When the first vector instruction has completed execution (yes), then proceed via 601 to 602. When the first vector instruction has not completed execution (no), then proceed to 608 via 607.
At 608, it is determined whether a second vector instruction is waiting to be executed. When no second vector instruction is waiting (no), the process returns via 601 to 602. When a second vector instruction is waiting (yes), the process proceeds via 609 to 610, where new vector parameters for the second vector instruction are loaded into the memory access control preload registers; the process then proceeds via 611 to 612. At 612, it is determined whether the first vector instruction has completed execution. When it has not (no), the process returns via 611 to 612. When it has (yes), the process proceeds via 613 to 614. At 614, the multiplexer is switched to the preload position, copying the contents of the memory access control preload registers into the memory access control registers, and the process proceeds via 615 to 616. At 616, the multiplexer is switched to the non-preload position, and the process proceeds via 617 to 618. At 618, the second vector instruction is executed, now taking the role of the first vector instruction, and the process returns via 601 to 602.
When the multiplexer is in the non-preloaded position, it allows the setting of new vector parameters. For example, referring to FIG. 3, in the non-preloaded position, multiplexer control 330 allows new vector parameter 301 to enter multiplexers 310, 312, 314, and 316 and propagate to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328, respectively, via 311, 313, 315, and 317, respectively.
When the multiplexer is in the preload position, it allows new vector parameters to be set from the memory access control preload registers 340. For example, referring to Fig. 3, in the preload position, multiplexer control 330 allows new vector parameters 301 that have been loaded into vector length preload register 302, vector constant preload register 304, vector address preload register 306, and vector stride preload register 308 to enter multiplexers 310, 312, 314, and 316 via 303, 305, 307, and 309, respectively, and to propagate to vector length register 322, vector constant register 324, vector address register 326, and vector stride register 328 via 311, 313, 315, and 317, respectively.
Fig. 7 illustrates generally at 700 a flow chart showing the execution of desynchronized vector instructions in addition to non-desynchronized instructions. At 702, it is determined whether a desynchronized vector instruction is currently executing. If no desynchronized vector instruction is currently executing (no), then proceed via 703 to 714. At 714, new desynchronized vector instructions are allowed to execute in addition to non-desynchronized instructions, and then return to 702 via 701.
If a desynchronized vector instruction is currently executing (yes), then proceed to 704 via 705. At 704, memory access control for the vector instruction uses the parameters stored in the memory access control registers (e.g., at 350 of FIG. 3), and then proceed to 706 via 707. At 706, a determination is made as to whether an instruction is attempting to modify one or more memory access control registers (e.g., at 350 of FIG. 3). When no instruction is attempting to modify the memory access control registers (no), then proceed via 703 to 714.
When an instruction is attempting to modify the memory access control registers (yes), then proceed to 708 via 709. At 708, the corresponding one or more memory access control preload registers (e.g., at 340 of FIG. 3) are modified instead of the memory access control registers (e.g., at 350 of FIG. 3), and then proceed via 711 to 710. For example, in FIG. 3, vector length register 322 has a corresponding vector length preload register 302. Likewise, vector constant register 324 corresponds to vector constant preload register 304, vector address register 326 corresponds to vector address preload register 306, and vector stride register 328 corresponds to vector stride preload register 308.
At 710, new desynchronized vector instructions are not allowed to execute, but non-desynchronized instruction execution continues, and then proceed to 712 via 713.
At 712, it is determined whether all desynchronized vector instructions have completed. When not all desynchronized vector instructions have completed (no), then proceed to 704 via 715. When all desynchronized vector instructions have completed (yes), then proceed via 717 to 716.
At 716, any modified memory access control preload register parameters are moved into the memory access control registers, and then proceed via 719 to 718. Optionally, at 720, all memory access control preload register parameters are moved into the memory access control registers, regardless of whether they have been modified. For example, in FIG. 3, all memory access control preload register 340 parameters are moved into memory access control registers 350 using multiplexer control 330.
At 718, instructions that modify the memory access control registers no longer modify the memory access control preload registers, and then proceed via 703 to 714. That is, an instruction that would modify a memory access control register (e.g., at 350 of FIG. 3) may now do so directly, instead of modifying the corresponding memory access control preload register (e.g., at 340 of FIG. 3). After 718, proceed to 714 to allow new desynchronized vector instructions to execute in addition to non-desynchronized instructions.
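The FIG. 7 policy can be summarized as: while any desynchronized vector instruction is in flight, writes to the memory access control registers are redirected to the preload registers; once all in-flight instructions retire, the staged values are committed and direct writes resume. A hypothetical Python sketch of that policy, with all names illustrative and not taken from the patent:

```python
# Illustrative model of the FIG. 7 flow: redirect-then-commit handling of
# writes to memory access control parameters during desynchronized execution.

class DesyncController:
    def __init__(self):
        self.registers = {}   # memory access control registers (350)
        self.preload = {}     # memory access control preload registers (340)
        self.in_flight = 0    # desynchronized vector instructions executing

    def modify(self, name, value):
        if self.in_flight:
            self.preload[name] = value   # step 708: redirect the write
        else:
            self.registers[name] = value # step 718: direct writes allowed

    def issue_desync(self):
        self.in_flight += 1

    def retire_desync(self):
        self.in_flight -= 1
        if self.in_flight == 0:
            # step 716: move modified preload parameters into the
            # memory access control registers
            self.registers.update(self.preload)
            self.preload.clear()

ctl = DesyncController()
ctl.modify("vector_length", 64)
ctl.issue_desync()                   # a desynchronized instruction starts
ctl.modify("vector_length", 128)     # redirected to the preload register
assert ctl.registers["vector_length"] == 64   # old value still in use
ctl.retire_desync()                  # all desynchronized work completes
assert ctl.registers["vector_length"] == 128  # staged value committed
```

This captures why the in-flight instruction is never perturbed: its parameters stay in the active registers until step 716 commits the staged values.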
Relation to co-pending application Ser. No. 17/468,574, filed September 7, 2021.
These methods may be used together with co-pending application Ser. No. 17/468,574, filed September 7, 2021, which describes a parameter stack, a register stack, and a subroutine call stack that are separate from the local memory used by vector ALU 124.
Consider the following instruction sequence, similar to the earlier desynchronized-execution example:
mov r0 100
mov r1 1
sas0 r0 r1
sas1 r0 r1
mov r2 64
slen r2
sqrt v0 v1 // this may execute desynchronized
push r0 // fine, as long as the stack is not in the same memory the vector ALU uses!
push r1
push r2
call function_that_does_vector_log
Pushing and popping parameters on the stack, saving and restoring registers, and subroutine calls and returns are all very common operations, and it is undesirable for them to force resynchronization of desynchronized, or asynchronous, execution. Co-pending application Ser. No. 17/468,574, filed September 7, 2021, avoids such resynchronization and thus works synergistically with the techniques disclosed herein.
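The benefit of the separate stacks can be sketched as follows. In this hypothetical Python model (names assumed, not from the patent), a scalar access to the memory the vector ALU is using must first resynchronize, while an access to a separate stack memory proceeds immediately:

```python
# Illustrative model: stack operations in a separate memory do not force
# resynchronization with a desynchronized vector operation.

class Machine:
    def __init__(self):
        self.vector_busy = False  # a desynchronized vector op in flight
        self.syncs = 0            # count of forced resynchronizations
        self.stack = []           # separate stack memory (co-pending app.)

    def start_vector_op(self):
        self.vector_busy = True

    def access_vector_memory(self):
        # Scalar access to the vector ALU's memory must wait for the
        # desynchronized operation to finish (a resynchronization).
        if self.vector_busy:
            self.syncs += 1
            self.vector_busy = False

    def push(self, value):
        # Separate stack memory: no interaction with the vector ALU.
        self.stack.append(value)

m = Machine()
m.start_vector_op()                 # sqrt v0 v1 runs desynchronized
m.push(100); m.push(1); m.push(64)  # push r0/r1/r2: no resynchronization
assert m.syncs == 0
m.access_vector_memory()            # a load from vector memory would sync
assert m.syncs == 1
```

Because the push/pop and call/return traffic never touches the vector ALU's memory, the sqrt in the earlier sequence can continue running while the subroutine call proceeds.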
Thus, a method and apparatus for desynchronized execution in a vector processor have been described.
For purposes of discussion and understanding the embodiments, it is understood that various terms are used by those skilled in the art to describe techniques and methods. Furthermore, in this description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the examples. It will be apparent, however, to one skilled in the art that the examples may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the embodiments. These embodiments are described in sufficient detail to enable those of ordinary skill in the art to practice the embodiments, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical, and other changes may be made without departing from the scope of the embodiments.
As used in this specification, "one example" or "an example" or similar phrases mean that the described feature is included in at least one example. References to "an example" in this specification do not necessarily refer to the same example; however, neither are such examples mutually exclusive. "An example" does not imply that only a single example is present. For example, features, structures, or acts described in "one example" may be included in other examples as well, without limitation. Accordingly, the present invention may include various combinations and/or integrations of the examples described herein.
As used in this specification, "substantially" or "substantially equivalent" or similar phrases are used to indicate that the items are very close or similar. Since two physical entities may never be exactly equivalent, phrases such as "substantially equivalent" are used to indicate that they are equivalent for all practical purposes.
It should be understood that, in discussing any one or more examples of alternative methods or techniques, any and all possible combinations thereof may thereby be disclosed. For example, if five possible techniques A, B, C, D, and E are discussed, each of which may or may not be present with each of the others, there are 2^5 or 32 combinations, whose binary sequences range from not-A and not-B and not-C and not-D and not-E to A and B and C and D and E. Applicant hereby claims all such possible combinations. Applicant hereby states that the aforementioned combinations comply with the applicable EP (European Patent) standard. No preference is given to any combination.
Claims (7)
1. A vector processor unit comprising:
a plurality of memory access control preload registers, each memory access control preload register having an input and an output, all of the memory access control preload register inputs coupled to receive new vector parameters;
a plurality of multiplexers, each multiplexer having a first input, a second input, a switching input, and an output, each of the memory access control preload register outputs coupled to the first input of a respective multiplexer, each of the second inputs of the respective multiplexers coupled to receive the new vector parameters;
a multiplexer control, each of the multiplexers switching inputs in response to the multiplexer control;
a plurality of memory access control registers, each memory access control register having an input and an output, each of the memory access control register inputs coupled to the respective multiplexer output; and
a memory access control having a plurality of inputs, the plurality of memory access control register outputs coupled to the respective memory access control inputs.
2. The vector processor unit of claim 1 wherein said plurality of memory access control preload registers are selected from the group consisting of a vector length preload register, a vector constant preload register, a vector address preload register, and a vector stride preload register, and
wherein the plurality of memory access control registers are selected from the group consisting of a vector length register, a vector constant register, a vector address register, and a vector stride register.
3. The vector processor unit of claim 1, wherein:
the plurality of memory access control preload registers includes a vector length preload register, a vector constant preload register, a vector address preload register, and a vector stride preload register; and
the plurality of memory access control registers includes a vector length register, a vector constant register, a vector address register, and a vector stride register.
4. A method for desynchronizing execution, comprising:
(a) Determining whether a first vector instruction is currently executing;
(b) Returning to (a) when the first vector instruction is not currently executing;
(c) Accessing memory access control of the first vector instruction using vector parameters stored in registers when the first vector instruction is currently executing;
(d) Determining whether a second vector instruction is waiting to be executed;
(e) Returning to (a) when the second vector instruction does not wait for execution;
(f) When the second vector instruction is waiting to execute, then loading new vector parameters into a preload register for use with the second vector instruction;
(g) Determining whether the first vector instruction has completed execution;
(h) Returning to (g) when the first vector instruction has not completed execution;
(i) When the first vector instruction has completed execution, then switching a multiplexer to a preload position to copy the contents of the preload register into the register;
(j) Switching the multiplexer to a non-preloaded position, and
(k) Executing the second vector instruction, the second vector instruction now being treated as the first vector instruction, and returning to (a).
5. The method of claim 4, further comprising coupling the multiplexer, in the non-preloaded position, to receive new vector parameters.
6. A method for desynchronizing execution, comprising:
(a) Determining whether a desynchronization vector instruction is currently executing;
(b) Proceeding to (c) when the desynchronization vector instruction is not currently executing;
(c) In addition to allowing non-desynchronized instruction execution, new desynchronized vector instruction execution is allowed;
(d) Memory access control for accessing vector instructions using parameters stored in a memory access control register;
(e) Determining whether the instruction is attempting to modify one or more memory access control registers;
(f) When the instruction is not attempting to modify the one or more memory access control registers, then proceeding to (c);
(g) When the instruction is attempting to modify one or more memory access control registers, then modifying one or more corresponding memory access control preload registers;
(h) New desynchronized vector instruction execution is not allowed, but non-desynchronized instruction execution continues;
(i) Determining whether all desynchronized vector instructions have completed execution;
(j) Proceeding to (d) when all of said desynchronized vector instructions have not completed execution;
(k) Proceeding to (l) when all of said desynchronized vector instructions have completed execution;
(l) Moving any modified memory access control preload register parameters into the one or more corresponding memory access control registers, and
(m) Allowing instructions that modify memory access control register parameters to no longer modify the corresponding memory access control preload register parameters, and then proceeding to (c).
7. The method of claim 6, wherein any modified memory access control preload register parameters are moved into the corresponding memory access control registers by switching multiplexers at (l).
Applications Claiming Priority (13)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202163180634P | 2021-04-27 | 2021-04-27 | |
US202163180562P | 2021-04-27 | 2021-04-27 | |
US202163180601P | 2021-04-27 | 2021-04-27 | |
US63/180,562 | 2021-04-27 | ||
US63/180,634 | 2021-04-27 | ||
US63/180,601 | 2021-04-27 | ||
US17/468,574 US20220342668A1 (en) | 2021-04-27 | 2021-09-07 | System of Multiple Stacks in a Processor Devoid of an Effective Address Generator |
US17/468,574 | 2021-09-07 | ||
US17/669,995 US12175116B2 (en) | 2021-04-27 | 2022-02-11 | Method and apparatus for gather/scatter operations in a vector processor |
US17/669,995 | 2022-02-11 | ||
US17/701,582 US11782871B2 (en) | 2021-04-27 | 2022-03-22 | Method and apparatus for desynchronizing execution in a vector processor |
US17/701,582 | 2022-03-22 | ||
PCT/US2022/021525 WO2022231733A1 (en) | 2021-04-27 | 2022-03-23 | Method and apparatus for desynchronizing execution in a vector processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117083594A CN117083594A (en) | 2023-11-17 |
CN117083594B true CN117083594B (en) | 2024-11-29 |
Family
ID=88510061
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202280017945.7A Active CN117083594B (en) | 2021-04-27 | 2022-03-23 | Method and apparatus for desynchronizing execution in a vector processor |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117083594B (en) |
DE (1) | DE112022000535T5 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361182A (en) * | 2014-11-21 | 2015-02-18 | 中国人民解放军国防科学技术大学 | Microprocessor micro system structure parameter optimization method based on Petri network |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0728786A (en) * | 1993-07-15 | 1995-01-31 | Hitachi Ltd | Vector processor |
JP3400458B2 (en) * | 1995-03-06 | 2003-04-28 | 株式会社 日立製作所 | Information processing device |
US6212630B1 (en) * | 1997-12-10 | 2001-04-03 | Matsushita Electric Industrial Co., Ltd. | Microprocessor for overlapping stack frame allocation with saving of subroutine data into stack area |
EP3340037B1 (en) * | 2016-12-22 | 2019-08-28 | ARM Limited | A data processing apparatus and method for controlling vector memory accesses |
-
2022
- 2022-03-23 CN CN202280017945.7A patent/CN117083594B/en active Active
- 2022-03-23 DE DE112022000535.1T patent/DE112022000535T5/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104361182A (en) * | 2014-11-21 | 2015-02-18 | 中国人民解放军国防科学技术大学 | Microprocessor micro system structure parameter optimization method based on Petri network |
Also Published As
Publication number | Publication date |
---|---|
DE112022000535T5 (en) | 2023-11-16 |
CN117083594A (en) | 2023-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP2518616B2 (en) | Branching method | |
JP2786574B2 (en) | Method and apparatus for improving the performance of out-of-order load operations in a computer system | |
US6944850B2 (en) | Hop method for stepping parallel hardware threads | |
US5745721A (en) | Partitioned addressing apparatus for vector/scalar registers | |
US6671827B2 (en) | Journaling for parallel hardware threads in multithreaded processor | |
EP0644482B1 (en) | Dispatch of instructions to multiple execution units | |
EP1023659B1 (en) | Efficient processing of clustered branch instructions | |
US7529917B2 (en) | Method and apparatus for interrupt handling during loop processing in reconfigurable coarse grained array | |
WO1998013759A1 (en) | Data processor and data processing system | |
JPS58151655A (en) | Information processing device | |
JPH0766329B2 (en) | Information processing equipment | |
JP3439033B2 (en) | Interrupt control device and processor | |
US5544337A (en) | Vector processor having registers for control by vector resisters | |
US5623650A (en) | Method of processing a sequence of conditional vector IF statements | |
KR100986375B1 (en) | Fast conditional selection of operands | |
CN100361119C (en) | computing system | |
CN114610394B (en) | Instruction scheduling method, processing circuit and electronic equipment | |
US11782871B2 (en) | Method and apparatus for desynchronizing execution in a vector processor | |
JP7495030B2 (en) | Processors, processing methods, and related devices | |
CN117083594B (en) | Method and apparatus for desynchronizing execution in a vector processor | |
US20230236878A1 (en) | Efficiently launching tasks on a processor | |
JP2000353092A (en) | Information processor and register file switching method for the processor | |
EP1039376A1 (en) | Efficient sub-instruction emulation in a vliw processor | |
KR102379886B1 (en) | Vector instruction processing | |
EP0374598B1 (en) | Control store addressing from multiple sources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |