
CN120540708A - High-performance processors, processor clusters, and electronic devices - Google Patents

High-performance processors, processor clusters, and electronic devices

Info

Publication number
CN120540708A
CN120540708A (application CN202510572336.3A)
Authority
CN
China
Prior art keywords
processor
instruction
vector
scalar
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510572336.3A
Other languages
Chinese (zh)
Inventor
王东琳
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Silang Technology
Original Assignee
Shanghai Silang Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Silang Technology filed Critical Shanghai Silang Technology
Priority to CN202510572336.3A priority Critical patent/CN120540708A/en
Publication of CN120540708A publication Critical patent/CN120540708A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30072Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Complex Calculations (AREA)

Abstract


The present application provides a high-performance processor, a processor cluster, and an electronic device. The high-performance processor includes a scalar processor, a vector processor, and a local memory. The scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called for execution by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor is connected to the global memory. The scalar processor is used to obtain instructions and parameters from the global memory; after determining that the execution conditions are met, it calls the vector processor to execute a task based on the parameters. In the high-performance processor provided by the present application, after the scalar processor obtains the instructions and parameters from the global memory, it calls the vector processor to execute the task based on the parameters once the execution conditions are met, thereby using multiple heterogeneous cores in coordination to meet different computing needs.

Description

High performance processor, processor cluster, and electronic device
Technical Field
The present application relates to the field of computer technology, and in particular, to a high performance processor, a processor cluster, and an electronic device.
Background
In heterogeneous multi-core architectures, the processing cores may differ: they may have different architectures, clock frequencies, and power consumption. The goal of such designs is to let the processor better accommodate different kinds of tasks by combining different types of cores.
In a heterogeneous multi-core architecture, an important problem is how to perform data processing across the heterogeneous cores, and in particular how to use multiple heterogeneous cores in coordination to meet different computing demands.
Disclosure of Invention
To address one of the above-mentioned technical drawbacks, the present application provides a high performance processor, a processor cluster, and an electronic device.
In a first aspect of the present application, a high performance processor is provided, the high performance processor comprising a scalar processor, a vector processor, a local memory;
the scalar processor and the vector processor share a local memory, and the vector processor only accesses the local memory and is only called and executed by the scalar processor;
a connection is established between the scalar processor and the vector processor;
the scalar processor establishes a connection with the global memory;
and the scalar processor is configured to acquire the instruction and the parameters from the global memory, and to call the vector processor to execute a task based on the parameters after determining that the execution condition is met.
Optionally, the scalar processor is an out-of-order multiple issue scalar processor;
The vector processor is a very long instruction word (VLIW) vector processor.
Optionally, the high performance processor further comprises:
an instruction first-in first-out (FIFO) memory;
the depth of the instruction FIFO memory is 32.
Optionally, the high performance processor further comprises an instruction FIFO memory and a data FIFO memory;
the depth of both the instruction FIFO memory and the data FIFO memory is 32.
Optionally, the execution condition is that the vector processor does not perform task processing.
Optionally, the execution condition is that the instruction FIFO memory is not full;
a scalar processor for storing the memory addresses of the instruction and the parameters into the instruction FIFO memory;
and a vector processor for reading the instruction and the data from the instruction FIFO memory when not performing task processing, and executing the instruction based on the data.
Optionally, the execution condition is that neither the instruction FIFO memory nor the data FIFO memory is full;
a scalar processor for storing the memory address of the instruction in the instruction FIFO memory and the memory address of the parameter in the data FIFO memory;
And a vector processor for reading the instruction from the instruction FIFO memory when the task processing is not performed, reading the data from the data FIFO memory, and executing the instruction based on the data.
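The dual-FIFO calling protocol in the embodiments above (the scalar processor pushes instruction and parameter addresses only while neither FIFO is full; the vector processor pops a pair and executes when idle) can be sketched behaviorally. The following Python sketch is illustrative only: the names `Fifo`, `scalar_dispatch`, and `vector_step` are assumptions for exposition, not the disclosed implementation.

```python
from collections import deque

FIFO_DEPTH = 32  # depth stated in the embodiment

class Fifo:
    """Bounded FIFO: the scalar side pushes, the vector side pops."""
    def __init__(self, depth=FIFO_DEPTH):
        self.q = deque()
        self.depth = depth
    def full(self):
        return len(self.q) >= self.depth
    def push(self, item):
        assert not self.full()
        self.q.append(item)
    def pop(self):
        return self.q.popleft() if self.q else None

def scalar_dispatch(instr_fifo, data_fifo, instr_addr, param_addr):
    """Scalar side: issue only when neither FIFO is full (the execution condition)."""
    if instr_fifo.full() or data_fifo.full():
        return False              # execution condition not met; retry later
    instr_fifo.push(instr_addr)   # memory address of the instruction
    data_fifo.push(param_addr)    # memory address of the parameters
    return True

def vector_step(instr_fifo, data_fifo, busy):
    """Vector side: when idle, read one instruction/parameter pair and execute."""
    if busy:
        return None
    instr, params = instr_fifo.pop(), data_fifo.pop()
    if instr is None:
        return None
    return (instr, params)        # stand-in for "execute instruction on data"
```

In this sketch the scalar processor never blocks on a busy vector processor; backpressure is carried entirely by the FIFO-full condition, which matches the asynchronous coupling the disclosure describes.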
Optionally, the task includes an identification of a thread to which the task belongs.
In a second aspect of the application, there is provided a processor cluster comprising a plurality of the high performance processors of the first aspect.
In a third aspect of the application, an electronic device is provided, comprising a high performance processor according to the first aspect, or comprising a high performance processor according to the second aspect.
The application provides a high-performance processor, a processor cluster, and an electronic device. The high-performance processor comprises a scalar processor, a vector processor, and a local memory. The scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called for execution by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor is connected to the global memory. The scalar processor is configured to acquire instructions and parameters from the global memory and, after determining that the execution condition is met, to call the vector processor to execute a task based on the parameters. In the high-performance processor provided by the application, after the scalar processor acquires the instruction and the parameters from the global memory, it calls the vector processor to execute the task based on the parameters once the execution condition is met, so that multiple heterogeneous cores are used in coordination to meet different computing requirements.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a schematic diagram of a high performance processor according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another high performance processor according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the implementation process of a high performance processor, according to an embodiment of the present application, where the execution condition is that the vector processor is not performing task processing;
FIG. 4 is a schematic diagram of the implementation process of a high performance processor, according to an embodiment of the present application, where the execution condition is that the instruction FIFO memory is not full;
FIG. 5 is a schematic diagram of a scalar processor according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a synchronization unit of a scalar processor according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a vector processor according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a vector operation unit according to an embodiment of the present application;
FIG. 9 is a schematic diagram of another architecture of a vector processor according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, exemplary embodiments of the present application are described in detail below in conjunction with the accompanying drawings. It is apparent that the described embodiments are only some, and not all, embodiments of the present application. It should be noted that, where there is no conflict, the embodiments of the present application and the features of the embodiments may be combined with each other.
In practicing the present application, the inventors have found that in heterogeneous multi-core architectures the processing cores may differ: they may have different architectures, clock frequencies, and power consumption. The goal of such designs is to let the processor better accommodate different kinds of tasks by combining different types of cores. In a heterogeneous multi-core architecture, an important problem is how to perform data processing across the heterogeneous cores, and in particular how to use multiple heterogeneous cores in coordination to meet different computing demands.
In order to solve the above problems, embodiments of the present application provide a high-performance processor, a processor cluster, and an electronic device. The high-performance processor comprises a scalar processor, a vector processor, and a local memory. The scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called for execution by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor is connected to the global memory. The scalar processor is configured to acquire instructions and parameters from the global memory and, after determining that the execution condition is met, to call the vector processor to execute a task based on the parameters. In this way, multiple heterogeneous cores are used in coordination to meet different computing requirements.
The present embodiment provides a high performance processor that is a heterogeneous asynchronous processor consisting of a scalar processor and a vector processor.
Referring to fig. 1, the high performance processor includes a scalar processor, a vector processor, and a local memory.
The scalar processor and the vector processor share the local memory. The vector processor can only access the local memory and can only be called for execution by the scalar processor.
A connection is established between the scalar processor and the vector processor; for example, they may be connected by a dedicated instruction channel.
As shown in FIG. 2, the high performance processor may also include two registers, one corresponding to the scalar processor and the other corresponding to the vector processor. The vector processor can read and write its corresponding register, while the scalar processor can read and write both its own register and the register corresponding to the vector processor.
The scalar processor establishes a connection with global memory.
Scalar processor
Scalar processors are out-of-order multiple-issue scalar processors.
The scalar processor mainly performs work such as data transfer, task management, synchronization, and switching, and is responsible for calling the vector processor to perform acceleration operations. For example, it fetches instructions and parameters from the global memory, and after determining that the execution condition is met, calls the vector processor to execute the task based on the parameters.
Specifically, a scalar processor may be as shown in FIG. 5 and may include an instruction fetch unit, a register renaming unit, an operation reservation stack unit, a storage reservation stack unit, a scalar operation unit, a memory access unit, a program control unit, a synchronization unit, a pipeline control unit, a register file unit, and a special vector register file unit.
In addition, a scalar processor may include one or more other units, such as one or more other functional blocks, one or more instruction caches, one or more data stores, one or more special vector registers, one or more status flag registers, and so forth.
1. Instruction fetch unit
The instruction fetch unit is configured to fetch instructions and dispatch instructions.
Specifically, the instruction fetch unit is configured to generate an instruction fetch request address, output the instruction fetch request address to the instruction cache for instruction fetching, receive instructions from the instruction cache, and store the instructions into the data storage. In each cycle, it sequentially reads eligible instructions from the data storage, decodes the read instructions, performs the related checks, and then sequentially dispatches the checked instructions.
For example, the instruction fetch unit generates an instruction fetch request address, outputs it to the instruction cache to fetch instructions, receives the instructions from the instruction cache, and stores them in the data storage. It sequentially selects one or more eligible instructions for decoding and checking, then dispatches them in order: at most one program control unit instruction and one synchronization unit instruction per cycle, and one or more scalar operation unit instructions and one or more memory access unit instructions per cycle.
2. Register renaming unit
The register renaming unit is configured to receive the instructions dispatched by the instruction fetch unit and rename registers.
Specifically, the register renaming unit is configured to receive and store the dispatched instructions, rename the special vector registers, decode instruction conditions, and generate pipeline blocking signals. It receives write-back data from one or more of the scalar operation unit, the memory access unit, the program control unit, and the synchronization unit, and writes it back to the registers, special vector registers, condition registers, and status flag registers. It sends instructions to one or more of the operation reservation stack unit, the storage reservation stack unit, the program control unit, and the synchronization unit.
In other words, the register renaming unit is the unit in the scalar processor that receives dispatched instructions, renames registers and special vector registers, decodes instruction conditions, generates pipeline stall signals, and writes data returned by the execution units (e.g., the scalar operation unit, memory access unit, program control unit, and synchronization unit) back into the corresponding registers, special vector registers, condition registers, and status flag registers.
Scalar processor writebacks support out-of-order writebacks with high execution efficiency and distribute instructions to operation reservation stack units, store reservation stack units, program control units, or synchronization units.
The bandwidth of the register renaming unit may be 6, and multiple (e.g., 4) input instructions may be valid at the same time.
The condition registers may be plural in number and located in a register renaming unit.
Instructions of the scalar operation unit and the memory access unit support reading and writing the condition registers.
The instructions of the synchronization unit support the operation of reading the condition register.
Jump and function call instructions of the program control unit support the operation of reading the condition register.
When an instruction that uses a condition register enters, if there is an earlier unexecuted instruction on that condition register, the pipeline is blocked.
That is, the condition registers are not renamed; when a read-write dependency occurs, a dispatch block is triggered to wait. The condition register read-write rules are as follows:
● Reading a rule:
(1) All instructions of the scalar operation unit, the memory access unit, and the synchronization unit support conditional execution, which requires reading the value of the condition register.
(2) Scalar arithmetic units also support read condition register instruction operations.
(3) Jump and function call instructions of the program control unit support read condition register operations.
● Writing rules:
(1) Scalar arithmetic units support write condition register instructions.
(2) Scalar arithmetic unit logic class instructions and compare class instructions support the option of writing to a condition register.
When a previously issued instruction that writes a condition register has not yet executed, and an instruction that reads or writes the same condition register enters, the pipeline is blocked, a conditional-execution blocking signal is generated, and the pipeline waits for the write to the condition register to complete.
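Because the condition registers are not renamed, the blocking rule above reduces to a hazard check at dispatch time: an incoming instruction stalls if it reads or writes a condition register that an earlier, not-yet-executed instruction will write. The following Python sketch is a hypothetical illustration; the name `must_block` and the dictionary-based instruction encoding are assumptions, not part of the disclosed design.

```python
def must_block(pending_writes, instr):
    """Condition registers are not renamed: block dispatch when the incoming
    instruction reads or writes a condition register that an earlier,
    not-yet-executed instruction will write.

    pending_writes: set of condition-register indices with outstanding writes.
    instr: dict with optional 'reads_cond' / 'writes_cond' index lists.
    """
    touched = set(instr.get("reads_cond", [])) | set(instr.get("writes_cond", []))
    return bool(touched & pending_writes)
```

Once the outstanding write completes, its index leaves `pending_writes` and the stalled instruction can dispatch, mirroring the "wait for the write to complete" behavior described above.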
In addition, the register renaming unit includes one or more physical registers and one or more logical registers.
Each physical register is one of: a scalar physical register, a vector physical register, a condition register, or a flag register.
Each logical register is either a scalar logical register or a vector logical register.
For example, the register renaming unit includes one or more physical registers, such as a plurality of 512-bit wide special vector registers, a plurality of condition registers, and a status flag register.
Wherein the special vector registers are renamed, the condition registers, and the status flag registers are not renamed.
The number of logical registers is plural; for example, the logical registers include a plurality of scalar logical registers and a plurality of vector logical registers.
In addition, the mapping relationship of the logical registers and the physical registers is maintained by a register map. The mapping of vector logical registers and vector physical registers is maintained by a special vector register mapping table.
1) Register mapping table
Initially, all mapped physical registers of entries corresponding to all logical register indexes in the register mapping table are 0. When an instruction is executed or is interrupted, determining a logic register allocated for the relevant physical register, and updating the mapping of an entry corresponding to the logic register index allocated in the register mapping table to the identification of the relevant physical register.
For example, the register mapping table has a depth of 32 and a width of 6 bits, and maintains the mapping of all logical registers to all physical registers. Initially, the mappings in the register mapping table are invalid and the mapped physical registers of all entries are 0. When a physical register is allocated for use by a logical register, the entry corresponding to that logical register index in the register mapping table is changed to the ID of the physical register.
It should be noted that, when the instruction is actually executed, the register mapping table is updated, if the conditional instruction is not executed, the register mapping table is not updated, and in addition, when the jump is made, the register mapping table is not updated temporarily, but the interrupt return address must update the register mapping table to ensure that the interrupt can return normally.
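The commit rule for the register mapping table (update only on actual execution, skip on unexecuted conditional instructions and on jumps) can be sketched as follows. This Python sketch is illustrative only; `make_map`, `commit`, and the flat-list representation are invented for exposition, and the interrupt-return-address special case is omitted.

```python
DEPTH, INVALID = 32, 0   # depth 32; entry value 0 means "no mapping yet"

def make_map():
    """Register mapping table: all entries start invalid (0)."""
    return [INVALID] * DEPTH

def commit(reg_map, logical_idx, physical_id, executed=True, is_jump=False):
    """Update the mapping only when the instruction actually executes;
    conditional instructions that do not execute, and jumps, leave it
    unchanged (per the note above)."""
    if executed and not is_jump:
        reg_map[logical_idx] = physical_id
    return reg_map
```

The special vector register mapping table described next behaves the same way, just with a smaller depth and width.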
2) Special vector register mapping table
Initially, the mapping vector physical registers of all entries corresponding to the vector logical register indexes in the special vector register mapping table are all 0. When the instruction is executed, a vector logic register allocated for the relevant vector physical register is determined, and the mapping of the table item corresponding to the vector logic register index allocated in the special vector register mapping table is updated to the identification of the relevant vector physical register.
For example, the special vector register mapping table has a depth of 4 and a width of 3 bits, and maintains the mapping relationship between all vector logical registers and all vector physical registers. Initially, the mappings in the special vector register mapping table are invalid and the mapped vector physical registers of all entries are 0. When a vector physical register is allocated for use by a vector logical register, the entry corresponding to that vector logical register index is changed to the ID of the vector physical register.
It should be noted that the special vector register mapping table is updated only when the instruction is actually executed, and is not updated if the conditional instruction is not executed, and is also temporarily not updated when the jump is made.
3. Operation reservation stack unit
The operation reservation stack unit is the transmit queue of the scalar operation unit.
The operation reservation stack unit is used for receiving the instruction, dispatch and rename information from the register renaming unit and pushing the instruction, dispatch and rename information into the queue. The ready instruction is popped to the scalar arithmetic unit for execution.
The operation reservation stack unit is also used for decoding the input instruction and storing the instruction type information.
That is, the operation reservation stack unit is the transmit queue of the scalar operation unit. The operation reservation stack unit receives instructions and associated dispatch and rename information from the register renaming unit to push them into a queue and pops up ready instructions to the scalar operation unit for execution. The operation reservation stack unit decodes the input instruction and stores instruction type information.
In particular, the depth of the operation reservation stack unit can be flexibly adjusted, for example, the depth of the operation reservation stack unit is 8. A plurality of scalar arithmetic units share one arithmetic reservation stack unit.
The instruction transmission and reception rules of the operation reservation stack unit are as follows:
(1) The output of the register renaming unit enters the operation reservation stack unit.
(2) When any free scalar arithmetic unit exists, it fetches instructions and operands from the operation reservation stack unit for execution.
(3) Instructions are fetched from the operation reservation stack unit for execution in front-to-back order, selecting those that are executable and ready to issue.
(4) Whether an instruction can be issued is determined by whether the values of all of its source registers, special vector registers, condition registers, and status flag registers are ready.
(5) If multiple instructions can be sent, the oldest instruction is sent first according to the instruction order.
(6) If any scalar arithmetic unit is blocked, it cannot receive new instructions any more.
(7) If the previous instruction sent to any scalar operation unit was a division instruction, a new division instruction cannot be sent to that scalar operation unit until the division result is computed and the computation-complete (DivEn) signal is returned.
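Rules (2), (3), and (5) above amount to an oldest-first, ready-only selection from the reservation stack, one instruction per free scalar operation unit. The following Python sketch is a simplified illustration; `select_issue` and the dictionary entry format are assumptions, and the division-specific rule (7) is not modeled.

```python
def select_issue(rs_entries, free_units):
    """Pick ready instructions oldest-first from the operation reservation
    stack, assigning one instruction to each free scalar operation unit.

    rs_entries: reservation-stack entries held in program order (oldest first),
                each a dict with 'ready' (all sources available) and 'op'.
    free_units: list of idle scalar operation unit names.
    """
    issued = []
    for entry in rs_entries:          # front-to-back scan = oldest first
        if not free_units:
            break                     # rule (6): blocked/busy units take nothing
        if entry["ready"]:            # rule (4): all source values ready
            issued.append((free_units.pop(0), entry["op"]))
    return issued
```

Scanning in program order automatically satisfies rule (5): when several instructions are ready, the oldest is issued first.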
4. Storage reservation stack unit
The memory reservation stack unit is a transmit queue of the memory access unit.
The storage reservation stack unit is configured to receive instructions and register renaming information from the register renaming unit and push them into a queue.
The storage reservation stack unit is also configured to send a read request to the register renaming unit when an instruction's address register is ready, and to store the read address operand.
It is further configured to calculate and decode the address after the instruction acquires it, and to store the decoded information.
It is further configured to transmit an instruction to the memory access unit for execution, after checking, when the instruction's source registers are ready and address decoding is complete.
Specifically, the depth of the storage reservation stack unit can be flexibly adjusted; for example, the depth is 16. Multiple memory access units share one storage reservation stack unit, which serves as the issue queue of the memory access unit. The storage reservation stack unit receives instructions and register renaming information from the register renaming unit and pushes them into a queue. When an instruction's address register in the storage reservation stack unit is ready, a read request is sent to the register renaming unit and the read address operand is saved into the queue. After an instruction in the storage reservation stack unit acquires its address, the address can be calculated and decoded, and the generated decoding information is stored in the queue. When the source registers of an instruction (e.g., a write instruction) in the storage reservation stack unit are ready and address decoding is complete, the instruction may be transmitted to the memory access unit for execution; a series of checks, such as address type checks, address compare checks, and address forwarding checks, are performed before transmission.
The rules for storing and reserving the sending and receiving instructions of the stack unit are as follows:
(1) The output of the register renaming unit enters the memory reservation stack unit.
(2) When the source operand of the calculated address is ready, the memory access is calculated and stored in the memory reservation stack unit.
(3) Address-unrelated instructions may be issued out of order. The reordering rule is: read-after-read, write-after-read, and read-after-write may be issued out of order, while write-after-write order must be preserved (and such writes cannot be issued to different memory units at the same time); even address-unrelated write instructions must remain ordered with respect to each other.
(4) For address-related instructions, order is preserved in all cases: write-after-read, read-after-write, write-after-write, and read-after-read.
(5) Address-unrelated instructions located in the same memory space, including all instructions that have not yet been sent successfully (i.e., instructions in flight, including those at the memory unit stage and the memory unit output stage), may be issued out of order to the same memory unit, but not to two or more memory units.
(6) Memory access instructions located in the same memory space but unrelated by address can only be sent one at a time; they cannot be sent to two or more memory access units at the same time.
(7) The address-correlation judging principle is: addresses in different memory spaces are considered uncorrelated; within the same memory space, correlation is judged at the granularity of the data.
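The ordering rules (3) and (4) above can be condensed into a single predicate: a younger access may pass an older one only if the pair is address-unrelated and is not a write-after-write pair. The following Python sketch is a hypothetical simplification; `may_reorder` and its flags are invented names, and the same-memory-unit restrictions of rules (5) and (6) are not modeled.

```python
def may_reorder(older, younger, same_space, overlap):
    """Return True if `younger` may issue before `older` under rules (3)-(4).

    older/younger: dicts with 'kind' of 'read' or 'write'.
    same_space:    both accesses target the same memory space.
    overlap:       addresses overlap at data granularity (rule (7)).
    """
    if same_space and overlap:        # address-related: keep program order
        return False
    if older["kind"] == "write" and younger["kind"] == "write":
        return False                  # write-after-write is always ordered
    return True                       # RAR / WAR / RAW, address-unrelated
```

Per rule (7), `overlap` would only ever be true when `same_space` is true, since different memory spaces are considered uncorrelated by definition.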
5. Scalar arithmetic unit
In particular implementations, the scalar arithmetic unit may be one or more.
For example, a scalar processor includes two scalar arithmetic units, scalar arithmetic unit 0 and scalar arithmetic unit 1.
The scalar operation unit is configured to receive the instructions and data sent by the operation reservation stack unit, operate on the data based on the instructions, and write the operation results back to the register renaming unit.
The scalar operation unit is the computation unit of the scalar processor and can perform various fixed-point and floating-point operations, such as addition and subtraction, multiplication, division, logical operations, comparison operations, and shifts. It receives the instructions and data sent by the operation reservation stack unit, performs the operation, and writes the result back to the register file unit or the special vector register file unit of the register renaming unit.
The following examples of instructions are provided for illustrative purposes, and are not limited to the following instructions, nor are they limited to whether all instructions are included.
The instructions with the execution stage as one stage comprise fixed point addition and subtraction, logic class instructions, shift class instructions, fixed floating point comparison class instructions, read-write Flag instructions, fixed floating point maximum and minimum instructions, ABS instructions, bit reverse order instructions, selection instructions, special vector register distribution instructions, read special vector register instructions, byte reverse order instructions, merge instructions, immediate assignment instructions, firstOne instructions, CRC instructions, floating point classification instructions, floating point number fetch partial domain and Rounding instructions.
The instructions with three execution stages comprise a fixed-point multiplication instruction, a fixed-floating point conversion class instruction, a bit screening instruction, a Count instruction and a floating point addition and subtraction instruction.
The instructions supporting Bypass comprise a selection instruction, a fixed-point addition and subtraction instruction, a shift class instruction, an immediate assignment instruction, an ABS instruction, a logic class instruction, a comparison class instruction and a maximum and minimum instruction.
The execution cycle of the division instruction is not fixed and depends on the values of the divisor and the dividend. When execution completes, a DivEn signal is generated, indicating that the instruction has finished and its result is output to the register file. No new division instruction may be issued while a division is executing, but other scalar calculation unit instructions may still be issued. The division result shares an output port with the one-stage pipeline: when that port is not occupied by another scalar calculation unit instruction, the division result is output together with the DivEn flag. The DivEn flag is also sent to the operation reservation stack unit, after which a new Div instruction may be issued to the current scalar calculation unit.
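The variable-latency division behavior described above can be sketched as a rough cycle-level model; the latency formula and names below are illustrative assumptions, not the actual hardware:

```python
# Hypothetical cycle-level sketch: a divider with data-dependent latency
# shares the one-stage pipeline's output port and raises a DivEn flag in
# the cycle its result is written back.

class ScalarUnitModel:
    def __init__(self):
        self.div_busy = False
        self.div_cycles_left = 0
        self.div_result = None

    def issue_div(self, dividend: int, divisor: int) -> bool:
        if self.div_busy:            # no new division while one is in flight
            return False
        self.div_busy = True
        # Assumed latency model: latency depends on the operand values.
        self.div_cycles_left = max(1, dividend.bit_length() - divisor.bit_length() + 1)
        self.div_result = dividend // divisor
        return True

    def step(self, port_used_by_other_insn: bool):
        """Advance one cycle; returns (div_en, result) for this cycle."""
        if not self.div_busy:
            return (False, None)
        if self.div_cycles_left > 0:
            self.div_cycles_left -= 1
            return (False, None)
        # Division done: wait for the shared one-stage output port to be free.
        if port_used_by_other_insn:
            return (False, None)
        self.div_busy = False           # DivEn tells the reservation stack
        return (True, self.div_result)  # a new division may now be issued
```

Other (non-division) instructions would keep flowing through `step` in the busy cycles; only a second division is held off until DivEn is seen.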
6. Access unit
In particular implementations, the access unit may be one or more.
For example, the scalar processor includes two memory units, memory unit 0 and memory unit 1, respectively.
The memory access unit is used for receiving the instruction, data, and register information sent by the storage reservation stack unit, and for reading and writing data based on the instruction and the register information.
The memory access unit is a functional module that executes the memory-related instructions of the scalar processor. It receives instructions, data, and register-related information from the storage reservation stack unit, executes the instructions accordingly, and interacts with other units to read and write data. For read instructions and atomic write instructions, the data needs to be written back to the register renaming unit. Register-level read and write instructions cover granularities such as 8 bits, 16 bits, 32 bits, 64 bits, or other bit widths; vector read and write instructions cover granularities such as 128 bits, 256 bits, or 512 bits. The processing time differs for different instructions.
In addition, the access unit is responsible for providing relevant instruction quantity information needed by the FENCE, and the access unit and the storage reservation stack unit are interacted to complete data storage configuration.
7. Program control unit
In a specific implementation, the program control unit is only one.
And the program control unit is used for receiving the instruction and the data from the register renaming unit, processing the data based on the instruction and outputting a processing result.
The program control unit is a functional module that executes control program execution order related instructions of the scalar processor. The program control unit receives the instruction and the data from the register renaming unit, processes the data according to the instruction, and outputs the processing result to other modules of the scalar processor. The time periods for different instruction processing are different.
The program control unit is responsible for controlling the execution direction of the program (such as stopping, interrupting, jumping, and function calling), which involves executing the related instructions and reading/writing the associated configuration information. It is also responsible for instruction cache configuration and prefetch operations, the FENCE operation, reading/writing and controlling the counter, and reading/writing some other control information.
8. Synchronization unit
In a specific implementation, the synchronization unit is only one.
And the synchronization unit is used for synchronizing the scalar processor and the vector processor.
As shown in fig. 6, the synchronization unit establishes a communication connection with the pipeline control unit, the register renaming unit, the program control unit, and the vector processor.
Instructions for the synchronization unit come from the register renaming unit, and reading and writing of data of the synchronization unit are interacted with the register renaming unit.
And the synchronization unit is used for receiving the pause signal sent by the pipeline control unit and sending an execution stage pause signal generated when the synchronization unit is communicated with the vector processor to the pipeline control unit so as to generate an execution pause signal of the scalar processor.
And the synchronization unit is used for generating an instruction and transmitting the instruction to the program control unit.
That is, the synchronization unit synchronizes the scalar processor and the vector processor. It receives the instruction and data sent by the register renaming unit, reads data from the vector processor and writes it back to the register file, reads data from the register file unit or the special vector register file unit and sends it to the functional modules of the vector processor, and is responsible for starting the vector processor and querying its status, such as reading and writing the read FIFO (First Input First Output) in the vector program control unit of the vector processor, configuring the register file, reading or writing scalar registers, querying the state of the register file, reading the FIFO depth, and reading and starting the instruction counter of the vector processor. It also provides the program control unit with instruction information of the synchronization unit.
The synchronization unit interacts with the pipeline control unit, the register renaming unit, and the program control unit within the scalar processor, as well as with the external vector processor and the scalar processor and vector processor transmit queue module. Its instructions come from the register renaming unit, with which it also exchanges read and write data. It receives a blocking signal from the pipeline control unit and, through communication with the vector processor, generates its own execution-stage blocking signal, which is sent to the pipeline control unit to generate the ExeStall signal acting on the entire scalar processor. The synchronization unit generates the instruction to be executed in the next cycle and transmits it to the program control unit for the counter instruction of the program control unit. Its interaction with the vector processor includes, but is not limited to, configuring the register file with special vector registers or registers, reading and writing scalar registers, and querying the write state of the register file. Its interaction with the scalar processor and vector processor transmit queue module includes, but is not limited to, starting the vector processor, querying the vector processor state, reading and writing FIFO data in the instruction fetch unit of the vector processor, reading the FIFO depth, and reading and starting the vector processor instruction counter.
Therefore, the synchronization unit may have the following functions when it is specifically implemented (it should be noted that the following functions are only examples, and other functions may be also available, and the specific functions of the synchronization unit are not limited in this embodiment and the subsequent embodiments):
The start vector processor function is used for starting the vector processor and includes immediate and register starts; for example, the pipeline may stall until the start succeeds, or the success/failure result of the start may be written back to the destination register.
The function of querying the vector processor execution status (option B supported).
The read/write FIFO function operates on the FIFO located in the instruction fetch unit of the vector processor (for example, with a FIFO bit width of 32 bits); a read or write either waits until it succeeds, or the success/failure result of the FIFO access is written back to a register.
Write register file functions, including special vector register writes or register writes.
The read/write scalar register function includes reads and writes with an immediate index or a register index.
The query register file write-back status function either waits until all writes to the register file have finished, or returns to a register the result of whether the writes are complete.
When a dependency has not yet been resolved, the synchronization unit itself generates a stall signal and waits; this signal is sent to the pipeline control unit to generate a pipeline stall signal.
In addition, a FIFO (for example, with a depth of 32) can be added between the scalar processor and the vector processor for storing requests to start the vector processor, and the read/write FIFO of the vector processor can be moved into the scalar processor and vector processor transmit queue module. That module then implements the functions of starting the vector processor, querying the vector processor execution state, reading/writing the FIFO, reading the FIFO depth, and reading the instruction counter of the started vector processor. The condition for the vector processor to be started successfully is that its start FIFO is not full; when the execution state of the vector processor is queried, the condition for the vector processor to be considered stopped is that the vector processor has finished execution and its start FIFO is empty.
9. Pipeline control unit
And a pipeline control unit for generating a pipeline stall signal and/or generating start and stop signals for the scalar processor.
The pipeline control unit controls the pipeline of the scalar processor; it is connected to each unit in the scalar processor and is responsible for generating blocking signals for the pipeline, such as blocking in the normal working mode and blocking in the debugging mode.
The pipeline control unit also communicates with the synchronization unit to generate signals for starting and stopping the scalar processor.
In addition, the scalar processor can perform conditional execution decoding in practical application. For example, when performing conditional execution decoding, the scalar processor judges the execution condition from a preset bit field of the instruction; if the condition is satisfied, a valid instruction is output, and otherwise a null instruction is output. Here, a null instruction means a no-operation or invalid instruction.
If there is a read-write dependency on a condition register, a pipeline stall is triggered, and the read operation waits until the condition register write has finished executing. Condition register reads and writes have no bypass.
Taking 2 condition registers as an example (condition register 0 and condition register 1) and bits [29:28] as the preset bits: when performing conditional execution decoding, the scalar processor judges the execution condition of the input instruction based on bits [29:28] of the instruction-set encoding; if the condition is satisfied, a valid instruction is output, and otherwise a null instruction is output.
Wherein bits [29:28] are 00 for execution of condition register 0, bits [29:28] are 01 for execution of condition register 1, bits [29:28] are 10 for execution of.
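As an illustration only, the conditional-execution decode described above might be modeled as follows; the handling of encodings other than 00 and 01, and the NOP encoding, are assumptions of this sketch, since the text leaves them unspecified:

```python
# Sketch of the conditional-execution decode described above, assuming two
# condition registers and bits [29:28] of a 32-bit instruction word as the
# condition field. Encodings other than 00/01 are not specified in the text
# and are treated here as unconditional execution (an assumption).

NOP = 0  # stand-in encoding for the null (invalid) instruction

def cond_decode(insn: int, cond_regs: list) -> int:
    field = (insn >> 28) & 0b11
    if field == 0b00:            # execute if condition register 0 is set
        return insn if cond_regs[0] else NOP
    if field == 0b01:            # execute if condition register 1 is set
        return insn if cond_regs[1] else NOP
    return insn                  # assumed: other encodings execute unconditionally
```

If the condition is not met, the decoder emits the null instruction instead of the fetched one, so the downstream pipeline stages see a no-op.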
(II) vector processor
The vector processor is a very long instruction word (Very Long Instruction Word, VLIW) vector processor.
The vector processor may be non-uniformly clustered and can issue more than 20 instructions concurrently, supporting, for example, various vector/matrix operation acceleration instructions and loop acceleration methods.
In particular implementations, a vector processor may be as shown in FIG. 7, including a vector program control unit, a plurality of functional units, a register file, and scalar registers.
In addition, the vector processor comprises a private vector register of the vector interleaving unit and a private vector register of the vector access unit.
1. Vector program control unit
And the vector program control unit is used for fetching and transmitting instructions.
That is, the vector program control unit is configured to fetch an instruction, determine whether to execute the instruction, and transmit the instruction to the functional unit based on the determination result.
The vector program control unit is also used for controlling the jump of the instruction.
The vector program control unit has scalar computing power.
The vector program control unit interacts with scalar registers.
In a specific implementation, the vector program control unit is the fetch and instruction transmitting unit: it fetches instructions from a cache according to the PC value, transmits instructions to each functional unit according to a waiting value (configured by waiting instructions) after judging whether to execute them, controls instruction jumps, and has partial scalar computing capability.
In addition, the vector program control unit is also used for receiving starting commands sent by other operation processors and starting the vector processor. And returning an indication signal indicating whether the vector processor is finished to other operation processors.
Taking the other operation processor to be a scalar processor as an example, the vector program control unit receives a start command sent by the synchronization unit of the scalar processor, starts execution on the vector processor, and returns to the synchronization unit an indication signal indicating whether execution of the vector processor has finished.
2. Functional unit
And the functional unit is used for carrying out functional processing according to the instruction.
For example, the functional unit receives an instruction from the vector program control unit, processes data in accordance with the instruction, and outputs a processing result in accordance with an address specified in the instruction.
The functional units comprise one or more vector operation units, one or more vector interleaving units and one or more vector access units.
1) Vector operation unit
Any vector operation unit is used for carrying out vector operation according to the instruction.
As shown in FIG. 8, any vector operation unit includes a floating-point multiply-add operator unit, a floating-point multiply-accumulate operator unit, a floating-point operator unit, a tensor multiply subunit, and an intermediate result register.
Wherein the floating-point multiply-add operator subunit and the floating-point operator subunit share an issue slot, and the floating-point multiply-accumulate operator subunit and the tensor multiplication subunit share an issue slot. Thus, a maximum of 8 instructions for the vector operation units can be issued per cycle.
The floating-point multiply-add operator subunit is a functional unit that executes floating-point multiply-add related instructions, for example instructions for integer and floating-point vector multiply-add, multiplication, addition, and tensor computation.
Each vector operation unit has its own independent intermediate result register.
The intermediate result register is shared by 1 floating point multiply-add operator unit, 1 tensor multiply subunit and 1 floating point operator unit.
(1) The floating-point multiply-add operator subunit and the floating-point multiply-accumulate operator subunit can execute integer and floating-point vector multiplication, multiply-accumulate, and other operations. Supported types include, but are not limited to, int32, fp32, and fp64.
(2) The floating-point operator subunit can perform integer and floating-point vector arithmetic operations, such as comparisons, additions, subtractions, and bit operations. Supported types include, but are not limited to, int8, uint8, int16, uint16, int32, uint32, bool, fp16, bf16, fp32, tf32, and fp64.
(3) The tensor multiplication subunit can perform tensor multiplication, multiply-accumulate, and other operations. Supported types include, but are not limited to, int8, bf16, fp16, and tf32.
2) Vector interleaving unit
And any vector interleaving unit is used for interleaving and logically processing the data according to the instruction.
The vector interleaving unit is a control and data processing unit in the vector processor. It is responsible for interleaving data, supports logic and some fixed-point/floating-point calculation, and supports a number of customized instructions, including table lookup, horizontal (cross-lane) calculation, sparse matrix calculation, precision conversion, FIFO (First Input First Output) handling, and the like. It also executes instructions such as data broadcasting, extraction, and internal interleaving.
Each vector interleaving unit has a set of private vector registers, so the private vector registers of the vector interleaving units are in one-to-one correspondence with the vector interleaving units.
3) Vector access unit
Any vector access unit is used for carrying out multimode access memory, address calculation and scalar calculation according to the instruction.
The vector access unit is a memory access unit in the vector processor, and is mainly responsible for read/write instructions and various scalar computations.
Wherein read/write instructions support multiple access modes, such as row mode, column mode, discrete mode, extended mode, and accumulation mode.
Meanwhile, multiple parameter configurations are supported, and the maximum read/write data width can reach 1024 bits. Instructions such as address calculation and load/store are executed.
All vector access units share a set of private vector registers, so the private vector registers of the vector access units are shared by multiple vector access units.
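For illustration, plausible address-generation patterns for the row, column, and discrete access modes named above can be sketched as follows; the function names and parameters are assumptions of this sketch, since the hardware's actual parameterization is not given here:

```python
# Hypothetical address generation for three of the access modes named above.

def row_addresses(base: int, elem_size: int, count: int) -> list:
    # Row mode: consecutive elements in memory.
    return [base + i * elem_size for i in range(count)]

def column_addresses(base: int, row_stride: int, count: int) -> list:
    # Column mode: one element per row, stepping by the row stride.
    return [base + i * row_stride for i in range(count)]

def discrete_addresses(base: int, elem_size: int, indices: list) -> list:
    # Discrete (gather/scatter) mode: per-element indices, e.g. taken
    # from a vector register.
    return [base + idx * elem_size for idx in indices]
```

The extended and accumulation modes would follow the same shape with additional parameters, which the text does not detail.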
3. Register file
And the register file is used for returning data after receiving a read or write request, rearranging data before returning it where required, performing read/write interaction with the functional units, and providing the data used to configure the configuration registers of the vector program control unit.
The register file is a general vector register file and the main storage unit in the vector processor. It is responsible for receiving read/write requests and returning data; for some functions, the data is rearranged before being returned to the requesting module.
The register file performs read/write interaction with the functional units in the vector processor (such as the floating-point multiply-add operator subunit, the floating-point operator subunit, the floating-point multiply-accumulate operator subunit, and the tensor multiplication subunit), and also supports configuring the configuration registers of the instruction fetch unit with data in the register file.
The register file is also used for data writes from other operation processors, and receives queries sent by other operation processors about whether data has been written.
Taking other operation processors as scalar processors, for example, the synchronization unit of the scalar processor can write data into the register file, and the register file can also receive state information about whether the query data sent by the synchronization unit of the scalar processor is written.
The depth of the register file is configurable.
Fig. 9 shows a schematic diagram of a vector processor with functional units comprising 4 vector arithmetic units, 4 vector interleaving units, 4 vector access units.
The vector processor provided in this embodiment supports a VLIW (Very Long Instruction Word) instruction set; each very long instruction word may consist of one or more instructions, each of which corresponds to a functional unit.
In addition, a read FIFO unit and a write FIFO unit are provided between the vector processor and other arithmetic processors.
The vector program control unit and other operation processors perform read operation on the read FIFO unit and write operation on the write FIFO unit.
Other arithmetic processors perform read or write operations to vector registers.
Taking other operation processors as scalar processors, for example, a read FIFO and a write FIFO unit for transmitting data are arranged between the scalar processor and the vector processor, and the scalar processor and the vector program control unit can perform read operation or write operation on the read FIFO and the write FIFO.
While the synchronization unit of the scalar processor can read or write to the scalar registers of the vector processor.
Additionally, the high-performance processor may also include an instruction FIFO (First Input First Output) memory, or both an instruction FIFO memory and a data FIFO memory.
For example, the depth of the instruction FIFO memory is 32, and the depth of the data FIFO memory is 32 (i.e., each can hold 32 entries).
In particular implementations, the execution conditions may vary. For example, the execution condition may be that the vector processor is not performing task processing; or that the instruction FIFO memory is not full (e.g., the instruction FIFO memory has a depth of 32 and forms an asynchronous queue); or that neither the instruction FIFO memory nor the data FIFO memory is full (e.g., both have a depth of 32, the instruction FIFO memory forming an asynchronous queue of instructions and the data FIFO memory forming an asynchronous queue of parameters).
The specific implementation of the high-performance processor differs according to the execution condition, as described below:
● The execution condition is that the vector processor is not performing task processing
Taking an example where two tasks (task 0 and task1, respectively) need to be performed, referring to fig. 3 (the white-bottomed box in fig. 3 is performed by a scalar processor, and the gray-bottomed box is performed by a vector processor), the processing procedure of the high-performance processor for this execution condition is:
The scalar processor retrieves task0 and parameters of task0 from global memory.
The scalar processor determines whether the vector processor is performing task processing, and if the vector processor is not performing task processing, the scalar processor invokes the vector processor to execute task0 based on the task0 parameter.
The vector processor then executes task0 based on the task0 parameters, and at the same time, the scalar processor may again retrieve task1 and task1 parameters from global memory in preparation for task1 execution.
The scalar processor and the vector processor synchronize to confirm whether the vector processor has finished processing task0; if it has not, the scalar processor waits until the vector processor finishes task0.
The scalar processor again determines that the vector processor is not performing task processing, and the scalar processor invokes the vector processor to perform task1 based on the parameters of task1.
In this heterogeneous multi-core processing flow, the scalar processor needs to be synchronized with the vector processor: the scalar processor can only call the vector processor to process the next task after the vector processor has finished its previous task.
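The synchronous flow above can be sketched with two threads standing in for the scalar and vector processors; this is a behavioral illustration only, and the task names and event handshake are inventions of the sketch:

```python
# Behavioral sketch: the scalar side waits for the vector side to finish the
# previous task before dispatching the next one (the synchronous scheme).

import threading

done = threading.Event()
done.set()                       # vector processor starts idle
results = []

def vector_execute(task, params):
    results.append((task, params))
    done.set()                   # signal: task processing finished

def scalar_dispatch(tasks):
    for task, params in tasks:
        done.wait()              # synchronize: wait until vector side is idle
        done.clear()
        threading.Thread(target=vector_execute, args=(task, params)).start()
    done.wait()                  # wait for the final task to complete

scalar_dispatch([("task0", {"n": 1}), ("task1", {"n": 2})])
```

Because each dispatch blocks on the previous completion, task1 can never overtake task0, matching the synchronization requirement described above.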
● The execution condition being that the instruction FIFO memory is not full
For this execution condition, the scalar processor may call the vector processor to execute a task by storing the instruction and the memory address of the parameters in the instruction FIFO memory. The vector processor reads the instruction and parameters from the instruction FIFO memory when it is not performing task processing, and executes the instruction based on the parameters.
Taking the example where there are two tasks (task0 and task1) to be performed, see fig. 4 (the white-bottomed blocks in fig. 4 are performed by the scalar processor, and the gray-bottomed blocks by the vector processor); under this execution condition the high-performance processor operates on the basis of the asynchronous-queue instruction FIFO memory.
For example, the scalar processor obtains task0 and parameters of task0 from the global memory.
The scalar processor determines whether the instruction FIFO memory is full; if not, task0 and the parameters of task0 are packed into task packet 0 (e.g., the first addresses of task0 and of the task0 parameters are packed into task packet 0), and the task packet is stored in the instruction FIFO memory. When the vector processor is not performing task processing, it reads task packet 0 from the instruction FIFO memory and executes task0 based on the parameters of task0.
While the vector processor is executing a task, the scalar processor can retrieve task1 and the parameters of task1 from the global memory in preparation for the execution of task1. The scalar processor determines whether the instruction FIFO memory is full; if not, task1 and the parameters of task1 are packed into task packet 1, and the task packet is stored in the instruction FIFO memory. That is, the process by which the scalar processor reads a task and its parameters and groups them into a task packet is independent of whether the vector processor is executing a task: as long as the instruction FIFO memory is not full (i.e., the number of task packets stored in it is not 32), the scalar processor can repeatedly fetch tasks and parameters from the global memory, pack them into task packets, and store them in the instruction FIFO memory.
Conversely, when the vector processor starts a new task is irrelevant to the scalar processor: whenever the vector processor finishes processing a task (and is thus not performing task processing) and the instruction FIFO memory is not empty, it can obtain a task packet from the instruction FIFO memory and execute it. If the vector processor finishes a task but the instruction FIFO memory is empty, it stops executing tasks.
In this heterogeneous multi-core processing flow, the scalar processor and the vector processor do not need to be synchronized; task reading and issuing by the scalar processor are independent of task execution by the vector processor.
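The asynchronous-queue behavior above can be illustrated with a bounded queue standing in for the 32-entry instruction FIFO; this is a software analogy only, and the end-of-stream marker and names are inventions of the sketch:

```python
# Sketch of the asynchronous scheme: a 32-entry instruction FIFO decouples
# the scalar producer from the vector consumer. queue.Queue stands in for
# the hardware FIFO.

import queue
import threading

FIFO_DEPTH = 32
instr_fifo = queue.Queue(maxsize=FIFO_DEPTH)
executed = []

def scalar_side(tasks):
    for task, params in tasks:
        # Blocks only when the FIFO already holds 32 task packets.
        instr_fifo.put((task, params))
    instr_fifo.put(None)            # illustrative end-of-stream marker

def vector_side():
    while True:
        packet = instr_fifo.get()   # blocks while the FIFO is empty
        if packet is None:
            break
        task, params = packet
        executed.append(task)       # stand-in for executing the task

producer = threading.Thread(target=scalar_side,
                            args=([("task0", 0), ("task1", 1)],))
consumer = threading.Thread(target=vector_side)
producer.start(); consumer.start()
producer.join(); consumer.join()
```

The producer never waits for the consumer unless the queue is full, and the consumer never waits for the producer unless the queue is empty, which is exactly the decoupling the text describes.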
● The execution condition is that the instruction FIFO memory and the data FIFO memory are not full
For this execution condition, the scalar processor may call the vector processor to execute a task by storing the memory address of the instruction in the instruction FIFO memory and the memory address of the parameters in the data FIFO memory. The vector processor reads the instruction from the instruction FIFO memory and the parameters from the data FIFO memory when it is not performing task processing, and executes the instruction based on the parameters.
Taking the example where two tasks (task0 and task1) need to be executed, an asynchronous-queue instruction FIFO memory and an asynchronous-queue data FIFO memory are constructed for this execution condition, each with a depth of 32; that is, the instruction FIFO memory can hold 32 task packets at a time, and so can the data FIFO memory. The entries of the instruction FIFO memory and the data FIFO memory correspond position by position (i.e., if instruction 2 is stored in the second location of the instruction FIFO memory, the parameters of instruction 2 are also stored in the second location of the data FIFO memory), which ensures that the parameters the vector processor reads from the data FIFO memory belong to the instruction it reads from the instruction FIFO memory.
First, the scalar processor retrieves task0 and parameters of task0 from global memory.
The scalar processor determines whether both the instruction FIFO memory and the data FIFO memory are not full (since the two memories are filled in corresponding positions, if the instruction FIFO memory is not full the data FIFO memory is not full, and if the instruction FIFO memory is full the data FIFO memory is full). If both are not full, the scalar processor stores instruction 0 in the instruction FIFO memory and stores the memory address of the parameters of instruction 0 in the data FIFO memory (e.g., the scalar processor requests a space in local memory, stores the parameters of instruction 0 in that space, and stores the first address of the space in the data FIFO memory). When the vector processor is not performing task processing, it reads task0 from the instruction FIFO memory, reads the parameters of task0 from the data FIFO memory, and executes task0 based on those parameters.
While the vector processor is executing, the scalar processor can already retrieve task1 and the parameters of task1 from global memory in preparation for executing task1. The scalar processor again determines whether both the instruction FIFO memory and the data FIFO memory are not full; if neither is full, it stores instruction 1 in the instruction FIFO memory and stores the memory address of the parameters of instruction 1 in the data FIFO memory (for example, by applying for a space in the local memory, storing the parameters of instruction 1 in that space, and storing the first address of the space in the data FIFO memory). In other words, the scalar processor's reading of tasks and parameters, and its storing of them into the instruction FIFO memory and the data FIFO memory, are independent of whether the vector processor is executing a task: as long as the two FIFOs are not full (i.e., neither holds 32 task packets), the scalar processor can keep fetching instructions and parameters from global memory. After the scalar processor determines that the execution condition is met, it calls the vector processor to execute the task based on the parameters, while it continues to acquire tasks and parameters and store them into the instruction FIFO memory and the data FIFO memory, respectively.
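The scalar-processor side of this flow can be sketched as a producer loop. This is a hypothetical Python illustration, not the patent's implementation; `local_memory` here is a plain list standing in for the shared local memory, and an "address" is simply a list index.

```python
from collections import deque

DEPTH = 32
instr_fifo = deque()   # paired FIFOs, written in lockstep
data_fifo = deque()
local_memory = []      # stand-in for the shared local memory

def fifos_full():
    return len(instr_fifo) >= DEPTH  # full together, empty together

def scalar_dispatch(tasks):
    """tasks: iterable of (instruction, parameters) fetched from global memory."""
    for instruction, params in tasks:
        if fifos_full():
            break                       # real hardware would retry until a slot frees
        addr = len(local_memory)        # "apply for a space" on the local memory
        local_memory.append(params)     # store the parameters in that space
        instr_fifo.append(instruction)  # same slot in both FIFOs...
        data_fifo.append(addr)          # ...so the address pairs with its instruction
```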
Conversely, the vector processor's uptake of new tasks does not depend on the scalar processor: whenever the vector processor finishes a task (and is therefore not performing task processing) and the instruction FIFO memory and the data FIFO memory are not empty, it fetches the next task and its parameters from the two FIFOs, respectively, and executes the task based on the parameters. If the vector processor finishes a task while the instruction FIFO memory and the data FIFO memory are empty, it stops executing tasks.
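The vector-processor side is then a consumer loop that drains the queues whenever it is idle and stops when they are empty. Below is a minimal sketch under the same illustrative model; `execute` stands in for whatever vector computation a task performs and is not part of the patent.

```python
from collections import deque

def vector_run(instr_fifo, data_fifo, local_memory, execute):
    """Drain queued tasks; return results. Stops once both FIFOs are empty."""
    results = []
    while instr_fifo:                      # not empty => another task is ready
        instruction = instr_fifo.popleft()
        addr = data_fifo.popleft()         # same slot: this instruction's params
        params = local_memory[addr]
        results.append(execute(instruction, params))
    return results                         # FIFOs empty: the vector processor idles
```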
In this heterogeneous multi-core processing flow, the scalar processor and the vector processor need no synchronization: the scalar processor's reading and issuing of tasks is decoupled from the vector processor's execution of them. Moreover, when the vector processor takes a task from the instruction FIFO memory, it also takes the parameter address from the data FIFO memory and thereby obtains the list of actual arguments of the task, so that invoking the task fetches its actual arguments.
In addition, in a specific implementation, each time the scalar processor issues a task it includes in the task an identification of the thread to which the task belongs, which indicates which thread the task came from. Through this identification, the execution status of the tasks issued by each thread can be queried; synchronous invocation is thereby turned into asynchronous invocation, with the instruction FIFO memory acting as the asynchronous execution mechanism.
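The thread-identification scheme can be sketched as simple per-thread bookkeeping. This Python fragment is an illustrative assumption about one way such queries could work, not the patent's mechanism: the ID travels inside each task packet, and completion is queried by comparing issued and finished counts per thread.

```python
from collections import defaultdict

issued = defaultdict(int)    # tasks sent, keyed by thread ID
finished = defaultdict(int)  # tasks completed, keyed by thread ID

def issue(task_body, thread_id):
    issued[thread_id] += 1
    # The thread's identification is carried inside the task packet.
    return {"thread": thread_id, "body": task_body}

def complete(task):
    finished[task["thread"]] += 1

def thread_done(thread_id):
    # All of this thread's issued tasks have finished; no blocking wait
    # is needed, so issuing stays asynchronous.
    return finished[thread_id] == issued[thread_id]
```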
This embodiment provides a high-performance processor comprising a scalar processor, a vector processor, and a local memory. The scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called and executed by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor establishes a connection with a global memory. The scalar processor is configured to acquire instructions and parameters from the global memory and, after determining that the execution condition is met, to call the vector processor to execute the task based on the parameters. In the high-performance processor provided by this embodiment, once the scalar processor has acquired the instructions and parameters from the global memory and determined that the execution condition is met, it calls the vector processor to execute the task based on the parameters, so that multiple heterogeneous cores are used in coordination to meet different computing demands.
Based on the same inventive concept of high performance processors, the present embodiment provides a processor cluster including a plurality of high performance processors as shown in fig. 1 or fig. 2.
For example, the high performance processor includes a scalar processor, a vector processor, a local memory;
the scalar processor and the vector processor share a local memory, and the vector processor only accesses the local memory and is only called and executed by the scalar processor;
Establishing a connection between a scalar processor and a vector processor;
the scalar processor establishes a connection with the global memory;
And the scalar processor is used for acquiring the instruction and the parameter from the global memory, and calling the vector processor to execute the task based on the parameter after the scalar processor determines that the execution condition is met.
Wherein the scalar processor is an out-of-order multiple-issue scalar processor;
The vector processor is a very long instruction word (VLIW) vector processor.
Wherein the high performance processor further comprises:
an instruction first-in first-out FIFO memory;
The depth of the instruction FIFO memory is 32 bits.
The high-performance processor also comprises an instruction FIFO memory and a data FIFO memory;
the depth of both the instruction FIFO memory and the data FIFO memory is 32 bits.
The execution condition is that the vector processor does not perform task processing.
Wherein the execution condition is that the instruction FIFO memory is not full;
a scalar processor for storing the memory addresses of the instructions and parameters into an instruction FIFO memory;
And the vector processor is used for reading the instruction and the data from the instruction FIFO memory when the task processing is not performed and executing the instruction based on the data.
Wherein the execution condition is that the instruction FIFO memory and the data FIFO memory are both not full;
a scalar processor for storing the memory address of the instruction in the instruction FIFO memory and the memory address of the parameter in the data FIFO memory;
And a vector processor for reading the instruction from the instruction FIFO memory when the task processing is not performed, reading the data from the data FIFO memory, and executing the instruction based on the data.
The task comprises an identification of a thread to which the task belongs.
In each high-performance processor of the processor cluster provided by this embodiment, the scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called and executed by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor establishes a connection with a global memory. The scalar processor is configured to acquire instructions and parameters from the global memory and, after determining that the execution condition is met, to call the vector processor to execute the task based on the parameters, so that multiple heterogeneous cores are used in coordination to meet different computing demands.
Based on the same inventive concept of high performance processors, the present embodiment provides an electronic device including the high performance processors as shown in fig. 1 or fig. 2, or including one or more processor clusters, wherein each processor cluster includes a plurality of high performance processors as shown in fig. 1 or fig. 2.
For example, the high performance processor includes a scalar processor, a vector processor, a local memory;
the scalar processor and the vector processor share a local memory, and the vector processor only accesses the local memory and is only called and executed by the scalar processor;
Establishing a connection between a scalar processor and a vector processor;
the scalar processor establishes a connection with the global memory;
And the scalar processor is used for acquiring the instruction and the parameter from the global memory, and calling the vector processor to execute the task based on the parameter after the scalar processor determines that the execution condition is met.
Wherein the scalar processor is an out-of-order multiple-issue scalar processor;
The vector processor is a very long instruction word (VLIW) vector processor.
Wherein the high performance processor further comprises:
an instruction first-in first-out FIFO memory;
The depth of the instruction FIFO memory is 32 bits.
The high-performance processor also comprises an instruction FIFO memory and a data FIFO memory;
the depth of both the instruction FIFO memory and the data FIFO memory is 32 bits.
The execution condition is that the vector processor does not perform task processing.
Wherein the execution condition is that the instruction FIFO memory is not full;
a scalar processor for storing the memory addresses of the instructions and parameters into an instruction FIFO memory;
And the vector processor is used for reading the instruction and the data from the instruction FIFO memory when the task processing is not performed and executing the instruction based on the data.
Wherein the execution condition is that the instruction FIFO memory and the data FIFO memory are both not full;
a scalar processor for storing the memory address of the instruction in the instruction FIFO memory and the memory address of the parameter in the data FIFO memory;
And a vector processor for reading the instruction from the instruction FIFO memory when the task processing is not performed, reading the data from the data FIFO memory, and executing the instruction based on the data.
The task comprises an identification of a thread to which the task belongs.
In each high-performance processor of the electronic device provided by this embodiment, the scalar processor and the vector processor share the local memory; the vector processor only accesses the local memory and is only called and executed by the scalar processor. A connection is established between the scalar processor and the vector processor, and the scalar processor establishes a connection with a global memory. The scalar processor is configured to acquire instructions and parameters from the global memory and, after determining that the execution condition is met, to call the vector processor to execute the task based on the parameters, so that multiple heterogeneous cores are used in coordination to meet different computing demands.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein. The scheme in the embodiments of the present application can be implemented in various computer languages, for example the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A high-performance processor, characterized in that the high-performance processor comprises: a scalar processor, a vector processor, and a local memory; the scalar processor and the vector processor share the local memory, and the vector processor only accesses the local memory and is only called and executed by the scalar processor; a connection is established between the scalar processor and the vector processor; the scalar processor establishes a connection with a global memory; and the scalar processor is configured to obtain instructions and parameters from the global memory and, after determining that an execution condition is met, to call the vector processor to execute the task based on the parameters.

2. The high-performance processor according to claim 1, characterized in that the scalar processor is an out-of-order multi-issue scalar processor, and the vector processor is a very long instruction word vector processor.

3. The high-performance processor according to claim 1, characterized in that it further comprises an instruction first-in-first-out (FIFO) memory, wherein the depth of the instruction FIFO memory is 32 bits.

4. The high-performance processor according to claim 1, characterized in that it further comprises an instruction FIFO memory and a data FIFO memory, wherein the depth of both the instruction FIFO memory and the data FIFO memory is 32 bits.

5. The high-performance processor according to claim 1, characterized in that the execution condition is that the vector processor is not performing task processing.

6. The high-performance processor according to claim 3, characterized in that the execution condition is that the instruction FIFO memory is not full; the scalar processor is configured to store the storage addresses of the instructions and parameters into the instruction FIFO memory; and the vector processor is configured to read instructions and data from the instruction FIFO memory when not performing task processing, and to execute the instructions based on the data.

7. The high-performance processor according to claim 4, characterized in that the execution condition is that neither the instruction FIFO memory nor the data FIFO memory is full; the scalar processor is configured to store the storage address of the instruction into the instruction FIFO memory and the storage address of the parameters into the data FIFO memory; and the vector processor is configured, when not performing task processing, to read instructions from the instruction FIFO memory and data from the data FIFO memory, and to execute the instructions based on the data.

8. The high-performance processor according to claim 1, characterized in that the task includes an identification of the thread to which the task belongs.

9. A processor cluster, characterized in that it comprises a plurality of high-performance processors according to any one of claims 1 to 7.

10. An electronic device, characterized in that it comprises the high-performance processor according to any one of claims 1 to 7, or comprises one or more processor clusters according to claim 9.
CN202510572336.3A 2025-04-30 2025-04-30 High-performance processors, processor clusters, and electronic devices Pending CN120540708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510572336.3A CN120540708A (en) 2025-04-30 2025-04-30 High-performance processors, processor clusters, and electronic devices

Publications (1)

Publication Number Publication Date
CN120540708A true CN120540708A (en) 2025-08-26

Family

ID=96792189

Country Status (1)

Country Link
CN (1) CN120540708A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101371233A (en) * 2004-11-15 2009-02-18 辉达公司 Video processor with scalar components controlling vector components for video processing
CN115169541A (en) * 2022-08-17 2022-10-11 无锡江南计算技术研究所 Tensor, vector and scalar calculation acceleration and data scheduling system

Similar Documents

Publication Publication Date Title
US5560029A (en) Data processing system with synchronization coprocessor for multiple threads
US8345053B2 (en) Graphics processors with parallel scheduling and execution of threads
US6163839A (en) Non-stalling circular counterflow pipeline processor with reorder buffer
US20180004530A1 (en) Advanced processor architecture
US8453161B2 (en) Method and apparatus for efficient helper thread state initialization using inter-thread register copy
WO1990014629A2 (en) Parallel multithreaded data processing system
HK1246442A1 (en) Block-based architecture with parallel execution of successive blocks
CN113590197A (en) Configurable processor supporting variable-length vector processing and implementation method thereof
CN112214241A (en) Method and system for distributed instruction execution unit
CN116670644A (en) A method for interleaving processing on general-purpose computing cores
CN109564546A (en) Storage and load are tracked by bypassing load store unit
US6725365B1 (en) Branching in a computer system
CN118467041B (en) Instruction processing method and device for out-of-order multi-issue processor
CN116414464A (en) Method and apparatus for scheduling tasks, electronic device and computer readable medium
WO2021061626A1 (en) Instruction executing method and apparatus
US7565658B2 (en) Hidden job start preparation in an instruction-parallel processor system
US7496737B2 (en) High priority guard transfer for execution control of dependent guarded instructions
CN120653221A (en) Floating point continuous accumulation method and adder
US20230093393A1 (en) Processor, processing method, and related device
WO2026016845A1 (en) Processor, graphics card, computer device, and dependency release method
US7197630B1 (en) Method and system for changing the executable status of an operation following a branch misprediction without refetching the operation
CN120540705B (en) Scalar processor, high-performance processor, and electronic device
CN120540709B (en) Vector processor, high-performance processor, and electronic device
CN120540708A (en) High-performance processors, processor clusters, and electronic devices
CN120540706A (en) High-performance processing methods and electronic devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination