
WO2021013727A1 - Processor - Google Patents

Processor

Info

Publication number
WO2021013727A1
WO2021013727A1 (PCT application PCT/EP2020/070285)
Authority
WO
WIPO (PCT)
Prior art keywords
computing
data
stages
program code
ccpu
Prior art date
Application number
PCT/EP2020/070285
Other languages
French (fr)
Inventor
Benjamin Elias PROBST
Original Assignee
PLS Patent-, Lizenz- und Schutzrechte Verwertung GmbH
Priority date
Filing date
Publication date
Application filed by PLS Patent-, Lizenz- und Schutzrechte Verwertung GmbH
Publication of WO2021013727A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/0207: Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/173: Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F 15/17356: Indirect interconnection networks
    • G06F 15/17368: Indirect interconnection networks, non-hierarchical topologies
    • G06F 15/17375: One dimensional, e.g. linear array, ring
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/80: Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F 15/8007: Single instruction multiple data [SIMD] multiprocessors
    • G06F 15/8015: One dimensional arrays, e.g. rings, linear arrays, buses
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a processor architecture or a processor built with such a processor architecture according to the preamble of claim 1 and claim 12, and a method for data processing in a processor according to claim 13.
  • Processor architectures are constantly reaching their physical limits. These processor architectures are usually improved primarily by increasing the clock frequency of the processors, reducing the size of the structures and by the parallel use of several processor cores.
  • the performance of processor architectures can also be improved by architectural optimization of the flow control in areas such as pipelining and parallelism, prediction of memory access, caching, messaging systems, or by the divide-and-conquer approach.
  • Mainframe computer systems with a network of hundreds and thousands of individual processors are often used, for example the "Earth Simulator", "AlphaGo" or others, but also highly specialized computing systems like the "Connection Machine" CM-1, CM-2 and CM-5, the "MasPar" MP-1 and MP-2, the Berkeley Emulation Engine (BEE) or other solutions from Berkeley's RAMP (Research Accelerator for Multiple Processors), Cray computers like the T3E with its GigaRing architecture, etc. Also in image processing and deep learning architectures, massive parallelism is sometimes used, and on a smaller scale also in transputers and vector or field computers.
  • processors with general purpose architectures such as the Intel Xeon or Core-i series or processors from AMD, ARM or Nvidia, which are usually based on the Von Neumann architecture and Harvard architecture and on primarily sequential processing, have become increasingly popular.
  • This basic principle is not changed even by a ring bus introduced, for example, in the Cell or Sandy Bridge architecture, primarily in the sense of better scalability.
  • One problem with Von Neumann architectures is that all data, both words and instructions, must be loaded and the results saved immediately in order to be available again. Different levels of smaller caches form the memory of each CPU core and a larger RAM forms the shared memory of all cores. A bottleneck often occurs at the point between the RAM and the higher cache levels, especially when new data needs to be loaded into a higher cache or memory page changes occur.
  • Classical RISC architectures are defined, for example, by the control flow phases: instruction fetch, instruction decode, execute and memory/write-back.
  • each CPU core usually processes one thread in a process individually, or two threads share a common computing unit (e.g. in hyperthreading) and, depending on the process, also a common address space.
  • Parallel structures are only used, for example, when processing data fields in a field computer or vector computer, but the operand is the same for all data.
  • each core is assigned a similar task. For this purpose the GPU has many cores, which are kept rather simple, which receive instructions from a flatter memory hierarchy and which are used, so to speak, as huge field computers whose control flow is highly parallel.
  • control flow is parallel-dependent and therefore essentially the same for all data.
  • In a RISC architecture it is only possible to a limited extent to interlink two running threads and their control flow in such a way that an entire process with several threads and several branches, within which even several different instructions occur, is processed seamlessly from a program breakpoint onwards and without any overall loss of performance, regardless of how many branches still occur in the tree and are precalculated.
  • a virtual concurrency of threads that occurs in this process is actually a sequential execution within sequential processes, even if a process can now be processed by threads on several processor cores.
  • mutex functions or semaphores are required so that only one thread can access a data field at a time, thus ensuring that the result is unique.
  • This makes it more difficult to model more complex real operations and processes, which both correspond to natural operations and can be technically realized with known, simple procedures. In reality, several causes can often lead to one reaction, and one cause can also lead to several reactions.
  • such an architecture is not designed for such a bijective problem.
  • the known concepts CPU-RISC, GPU and neural computer are to be improved, especially with regard to applications in which many small sub-problems as well as problems of branched tree structures are to be solved. Especially downtimes and waiting times of the computing cores should be reduced with the structure according to the invention.
  • a corresponding method for advantageous operation and for the processing of program code in such an architecture should also be provided.
  • this processor architecture concerns a microprocessor or CPU architecture, which is preferably designed on chip, die or semiconductor level.
  • the present invention proposes a computer architecture which has at least one processor core and forms ring-shaped computing stages consisting of computing elements such as ALUs, FPUs, field computers or vector computers with information transfer in a closed loop.
  • the parallelism of processing at this level is no longer necessarily related to a structure of several parallel, fully-fledged cores, which means that the (in the prior art sometimes excessive) communication between such fully-fledged cores can preferably also be omitted.
  • each of the parallel-operating computing stages is arranged around the processor core in a ring structure, preferably directly connected to each other via data links, so that a, preferably direct, data exchange between the computing stages can take place.
  • each of the computing stages has a management unit which manages the data flow between the computing stages, the processor core and, optionally, also cache memories in slave mode under the central processor core.
  • An initial increase in performance can be achieved by, among other things, centralizing on a processor core, wherein a very simple but fully-fledged integrated control CPU can be designed as the heart of the processor core - also referred to as a CCPU ("control CPU") in the following.
  • This CCPU can, for example, be designed with a simple RISC instruction set and the highest possible clock rate.
  • the control CPU can be specially designed in such a way that it is not designed to increase the computing capacity, but is optimized for reaction speed, data throughput and/or energy efficiency.
  • One of the main activities of a CCPU designed according to the invention is to coordinate the forwarding of the continuously flowing data streams by filling instruction buffers, storing finished result sets, loading and setting operations by interrupts and system calls, and controlling the dynamic forwarding of the data flow of a computing stage, e.g. in static field-path calculations.
  • One activity of the CCPU is to distribute the program instructions of a program to be executed, for example similar to the application of a clock algorithm for paging, to place the instructions one after the other on the buffers and subsequently on stacks of the computing stages, and to manage, replace or delete them in real time, i.e. specifically clockwise or counterclockwise in the ring-shaped arrangement of these.
  • This ring-shaped arrangement can have a uniform direction for each defined computing ring, but rings with opposite directions can also be formed for each further organized ring, for example for multiple MISD in one processor.
  • the CCPU communicates with the other CCPUs via buses of known type, e.g. similar to classical RISC MIMD architectures.
  • hybrid technologies with one CCPU and several rings of different directions are also possible in special embodiments to create a full duplex ring and to enable duplex in general.
  • a further activity which in the processor architecture according to the invention can be performed by a correspondingly designed CCPU is starting and stopping processing operations - also in relation to flow control - and/or recording error messages (exceptions) from the arithmetic units, and, optionally, executing error programs, which is preferably executed independently of all currently running computing stages.
  • the entire computer architecture in accordance with the invention must therefore hardly ever stop or completely preempt pages.
  • the parallelism of the processes takes place in the individual computing stages of the surrounding computing elements and makes use of a concatenation of successive instructions and program sections via the functionalities used in each of the computing stages, in combination with multiple use and coordinated data flow connection of the same and use of the ALUs, FPUs, etc. contained in the computing stage or following it.
  • an input is transformed into an output - similar to the logic circuit in an FPGA chip.
  • the management of such an input-output of a process according to the invention does not have to be handled directly by an operating system, but takes place primarily on the hardware according to the invention, which allows e.g. significantly shorter access times.
  • An embodiment of a computer architecture in accordance with the invention can also be described as a decentralization of the computing core into a computing ring, wherein the most important execution hardware for the actual calculations is always explicitly multiple, directly and indirectly, parallel and distributed.
  • several instructions can be processed simultaneously in the computing units. Similar to a RISC-V approach, several RISC instructions from so-called "Standard Combination Instructions" or Standard Combination Execution (SCE) can be decoded by a microprogram per computing stage, which increases the possible representation complexity and recombinability for the next stage at the output for immediate operations.
  • a routing of the data in the ring can be interpreted in real time by the control CPU according to the invention - preferably designed as a real-time processor - with one or more own control CPU cores from an instruction-machine code.
  • Special operations from the CISC and/or RISC instruction set, as well as extensions of the latter by pre-programmable "standard combination instructions", can be commanded, for example, among others:
  • These instructions can either be 3-address instructions as known from RISC using active coordination, or extended 4-, 5- or 6-address instructions, e.g. in the form of {w1, w2, result, <physical target input address of the computing stage>, <register number of the target instruction>, <thread ID>} with passive coordination by the CCPU.
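As an illustration of such an extended address format, the following sketch models a 3- to 6-address instruction in Python; the field names (w1, w2, target_stage, target_register, thread_id) are assumptions chosen for readability and are not taken verbatim from the claims.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the extended 4/5/6-address instruction described above.
@dataclass
class ExtendedInstruction:
    opcode: str                              # RISC-style operation, e.g. "ADD"
    w1: int                                  # first operand
    w2: int                                  # second operand
    result: int                              # result register / slot
    target_stage: Optional[int] = None       # physical input address of the target computing stage
    target_register: Optional[int] = None    # register number of the target instruction
    thread_id: Optional[int] = None          # thread ID for passive coordination by the CCPU

    def address_count(self) -> int:
        """3-address form if no routing fields are set, up to 6 addresses otherwise."""
        extras = (self.target_stage, self.target_register, self.thread_id)
        return 3 + sum(1 for field in extras if field is not None)

# Example: a 5-address instruction routed to computing stage 2 of thread 7.
instr = ExtendedInstruction("ADD", w1=1, w2=2, result=3, target_stage=2, thread_id=7)
print(instr.address_count())   # -> 5
```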
  • an overall processor in accordance with the invention is thereby designed with a number N of arms or computing stages.
  • the present invention relates specifically to a microprocessor architecture of a digital computer, in particular formed on one, preferably only, semiconductor chip.
  • This has a plurality of, preferably similar, computing stages which are each formed with a computing unit having at least one computing stage decoder core which is formed for the mathematical or logical processing of data in accordance with an instruction set, in particular a generally extended RISC instruction set, and a management unit (MU) which is formed in order to provide the data and a program code for the computing unit assigned to it.
  • the computing stages are arranged and designed in such a way that data communication between the computing stages is established in a closed ring by means of the management units, in particular in a closed daisy chain ring with point-to-point connections between the management units.
  • point- to-multipoint connections can also be formed.
  • the microprocessor architecture also features a central processor core with a control CPU (CCPU), which is connected in a star configuration to each of the computing stages via their MU.
  • This is designed and configured in such a way that the program code and data associated therewith from an external memory are distributed via the MUs to the plurality of computing stages in such a way that, during the processing of the program code, successive, dependent calculations are distributed in each case to computing stages following one another in the ring, and the data to be processed are forwarded from one computing stage to the next in the ring during the course of their calculations in the closed ring.
  • the CCPU is thus designed to distribute the program code and data associated therewith from an external memory via the management units successively to the plurality of computing stages in such a way that, when the program code is processed, the data to be processed are forwarded from one computing stage in the ring to a next computing stage in the ring during the course of their processing in the closed ring.
  • a first instruction can be executed on a first of the computing stages and a subsequent second instruction on a second of the computing stages (or only in special cases - such as a single loop, etc. - on the same computing stage), wherein data of an intermediate result is passed on between the first and second computing stage.
  • the present invention distributes a program code comprising instructions which process input data into output data, wherein the instructions are distributed over the computing stages in such a way that the output data of a first instruction on a first of the computing stages is passed over the ring as input data for a second instruction on a second of the computing stages connected thereto in the ring.
  • Processing of the program code is carried out, so to speak, successively along the ring, wherein the data to be processed are each passed on successively along the ring between the computing stages - e.g. in the form of a data stream in direct byte stream format or in pages (page fields) - so that the sequence of steps of the program code is distributed over the computing stages with a passing on of the data to be processed along the closed ring from one computing stage to a next computing stage.
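A minimal behavioural sketch (software only, not the claimed hardware) can make this distribution scheme concrete: instruction i of a dependent chain runs on stage i mod N, and its output becomes the input of instruction i+1 on the following stage in the ring. Function and lambda names are illustrative assumptions.

```python
# Behavioural sketch of successive instructions flowing around a closed ring
# of N computing stages: each stage applies one instruction and hands its
# result over the ring link to the next stage.
def run_on_ring(instructions, initial_data, n_stages=4):
    data = initial_data
    for i, op in enumerate(instructions):
        stage = i % n_stages          # computing stage that executes instruction i
        data = op(data)               # the stage processes the data it received
        # 'data' is now forwarded over the ring to stage (i + 1) % n_stages
    return data

# Example: a chain of four dependent calculations flowing once around a 4-stage ring.
chain = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
print(run_on_ring(chain, 5))          # ((5 + 1) * 2 - 3) ** 2 = 81
```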
  • Wider buses than simple 1:N connections can be implemented between two computing stages in order to load pages over longer periods of time in a highly parallel manner, controlled by the MU, into a reserved L4 cache field of the next computing stage while individual byte stream data continues to flow, i.e. without blocking the system.
  • the MUs can each be designed with a Paged Memory Man agement Unit (PMMU) and an individual L4 cache memory, which are designed in such a way that these, preferably independently of the computing unit, in particular in the event of a thread change, temporarily swap out memory pages not required by the computing unit into the individual cache memory and/or reload required memory pages from the individual cache memory, which can, for example, consist of a register of the instructions to be executed next, as expected, via the data.
  • the CCPU can proactively preload the memory pages into the respective individual cache memory via the MUs, according to the distribution of the program flow, with a corresponding part of the program flow and/or the data in advance, in particular independently of the processing of the program flow by the computing unit and of the forwarding of the data between the computing stages.
  • the CCPU may have at least one level of local cache memory, preferably two or three or more levels of local cache memory, and the CCPU may be configured with a specialized processor core whose instruction set is designed to be optimized for fast management and transfer of data.
  • the CCPU may be configured without processor cores for directly processing the data and executing the program code itself.
  • the instruction set can be designed as a RISC instruction set for distributing program code and data to the computing units, taking into account branch predictions in the program flow.
  • one of the computing units is designed with at least one computing core of
  • the computing unit is further equipped with an execution engine (EE), which is designed to load the register sets of the computing cores from a local stack, and optionally with an interpreting engine (IE), which is designed to predecode complex standard combination instructions into atomic instructions of the computing core.
  • Each computing arm has at least one EE.
  • an Execution Engine (EE) and/or an Interpreting Engine (IE) can also be designed in each case in a special embodiment for each of the computing cores.
  • the EE and optionally also the IE are connected to each other and to the associated MU, preferably via a common internal bus of the respective computing stage. However, this internal bus is formed separated from the ring via the MU.
  • such an IE can be a part of a computing stage, especially as a detail of a processing level of the MU.
  • the IE interprets instructions from the MU configured via the ring and, optionally, also places decoded instructions on the arithmetic units using the EE.
  • the IE can also assume the task of reproducing compressed instructions - so-called SCE instructions - in machine code, e.g. in the form of microcode, through frequently used function chains stored in its pages.
  • the known function chains are usually only a few instructions long, and can be transmitted via a communication bus from the MU to the next MU in the ring and - e.g.
  • an IE can additionally or alternatively also be part of the CCPU, so that the CCPU (at least partially) already distributes frequently used instruction chains correctly itself, instead of passing the instruction chains through an MU to a following MU in the ring.
  • the processor architecture can preferably be designed and configured in such a way that during program execution an essentially continuous data flow of the user data to be processed takes place along the ring between the computing stages and, essentially independently thereof, a control flow from the CCPU to the computing stages.
  • the MU can be designed to forward the data flow and the control flow, wherein the MU, in particular with a microprogram code on a control logic or on an updatable FPGA chip, provides management and coordination of the data flow and the control flow via communication with the CCPU, each EE and IE, and the computing stage on the basis of the program code of a program to be executed.
  • the MU can be specially designed and configured to continuously mark in a local program cache those instructions of the program code for which the data required for their processing are entered in a local data cache, and to then make these marked instructions with the required data available to the computing unit for execution in a local stack.
  • the marked instructions can be loaded by one or more local EEs of the calculation unit or the MU of the computing stage into the local stack of an arithmetic unit designed for the respective marked instruction. Individual instructions, not entire threads, are especially marked. Depending on data availability and proven further continued correctness of the code, the individual instructions can be randomly mixed and matched in their execution across threads.
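The following sketch illustrates this marking step in plain Python: an instruction is flagged as executable as soon as all of its operands have arrived in the local data cache, regardless of which thread it belongs to. The dictionary layout and key names are assumptions made for illustration only.

```python
# Illustrative sketch of the MU continuously marking ready instructions:
# readiness depends only on operand availability, not on the thread.
def mark_ready(program_cache, data_cache):
    """Return the instructions whose operands are all present in the data cache."""
    ready = []
    for instr in program_cache:                  # instr: {'id', 'thread_id', 'operands', ...}
        if all(op in data_cache for op in instr["operands"]):
            instr["marked"] = True               # marked for hand-over to the local stack
            ready.append(instr)
    return ready

program_cache = [
    {"id": 0, "thread_id": 1, "operands": ["a", "b"]},
    {"id": 1, "thread_id": 2, "operands": ["c", "d"]},   # 'd' has not yet arrived over the ring
]
data_cache = {"a": 3, "b": 4, "c": 7}
print([i["id"] for i in mark_ready(program_cache, data_cache)])   # -> [0]
```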
  • the computing stage can be designed and configured in such a way that it has an instruction stack which is filled by the CCPU and processed by the computing unit under coordination of the MU, and an input data cache which is initially filled by the CCPU and during the execution of the program code via the ring by another computing stage and processed by the computing unit.
  • the MU can be designed with a search run, which provides the correct computing unit with those instructions in the instruction stack for which all necessary data are located in the data stack, in particular by means of the Execution Engine (EE).
  • An embodiment of the CCPU can be designed and configured in such a way that it interprets the instruction-machine code of the program code from the memory and the distribution of the program code to the computing stages takes place by means of conditional jumps to subsequent computing stages with multiplex branching; loops via jumps back to the computing stage containing the next instruction; branching of the data stream to several units or merging of start-related computing streams into a list concatenation.
  • the invention relates to a digital computer processor architecture, in particular on a single chip or die, which is designed with a central control CPU (CCPU) and a plurality of computing stages. All the computing stages are connected to the central CCPU in a star configuration and the computing stages are connected to each other in a closed data ring.
  • the computing stages have a management unit (MU) which is designed and configured for the administration and forwarding of program code and data between the computing stages and/or the CCPU.
  • the CCPU is designed and configured in such a way that it distributes a program code, in particular within a single thread, for execution from an external memory to the computing stages in such a way that a first partial calculation of the program code is distributed to a first of the computing stages and a subsequent second partial calculation is distributed to a second of the computing stages following the first in the ring, and wherein the MU is configured in such a way that result data of the first partial calculation are passed on as input data for the second partial calculation in the ring between the computing stages.
  • the invention also relates to a microchip or processor, e.g. as a die, semiconductor chip or microprocessor chip, formed with a microprocessor architecture described herein.
  • the invention also relates to a method for executing a program code, in particular parallel and branched, on a digital computer, in particular with a processor architecture as described herein.
  • the method at least comprises:
  • the transfer of data in the ring and/or between the CCPU and the computing stage, as well as the administration of the program code and the data in the computing stages can be carried out by a management unit (MU), in particular wherein the computing stage has for this purpose a paged memory management unit (PMMU) as an associated part of the MU supporting the processing sequence and a smaller local cache memory for the administration of data and program code.
  • Fig. 1 shows a first schematic representation of an exemplary embodiment of a processor architecture according to the invention
  • FIG. 2 shows a second, more detailed illustration of an exemplary embodiment of a processor architecture according to the invention
  • Fig. 3 shows a schematic representation of an exemplary construction of an embodiment of one of the computing stages in the ring structure according to the invention
  • Fig. 4 shows an exemplary embodiment of a management unit in the present invention
  • Fig. 5 shows an embodiment of the present invention in the form of a block diagram.
  • FIG. 1 schematically shows a possible example of an embodiment of a processor architecture 1 according to the invention.
  • This processor architecture 1 is specifically designed as a processor, in particular as a single IC (Integrated Circuit) component or as a single chip, preferably on a common semiconductor substrate.
  • the processor architecture 1 has a processor core 2 with a control CPU (CCPU) 2 and a ring of several computing stages 5 with computing units 3.
  • the number of the respective components shown here is a possible example, but not necessarily a limiting factor.
  • the computing stages 5a, 5b, 5c, 5d, 5e, 5f arranged in the ring can also be present in larger numbers, especially with a number of 2 to n computing stages 5, which are connected via data lines 40/45 to form a closed ring 7.
  • a transfer of data in ring 7 takes place in each case independently of the other computing stages directly from one computing stage 5 to the next, wherein a management unit 4a (MU) of computing stage 5a carries out the data transfer independently of the activity of computing unit 3a. It is preferable that neither CCPU 2 nor computing unit 3 is loaded during the execution of such a data transfer in the ring.
  • ring 7 is designed, depending on the embodiment - in the example shown, approximately clockwise - but always unidirectionally for data and directionally for instructions, for example in the form of a daisy chain of individual and preferably autonomous connections between the computing stages 5.
  • the CCPU 2 does not use ring 7 for communication with the computing stages 5, but is connected separately from the ring in each case directly to the computing stages 5, as illustrated by the connections 48a, 48b, 48c, 48d, 48e, 48f.
  • this direct, star-shaped connection of the CCPU to the computing stages is also made via the MU 4, wherein only the CCPU 2 is designed with an interface 91 to an external RAM main memory 9 and/or to external bus systems for the processor architecture 1.
  • the computing stages 5 have only relatively small, internal cache memory 47, preferably divided into separate program and data caches or stacks.
  • the structure of an embodiment of a processor architecture 1 in accordance with the invention especially comprises a microcontroller as control CPU 2 (CCPU) and several sets of independent computing stages 5 with highly parallel ALU, FPU, vector computer and field computer, each at least once in full or shortened version,
  • arranged together on a common chip or semiconductor die, or at least in a common electronic component, especially for instance as a single processor (with the exception of the RAM 9 shown here).
  • This is provided in particular in differentiation from systems which have such components as independent sys tems which are distributed and networked over several cabinets or even locations, or with respect to systems which have such components in the form of a plurality of dedicated chips on a common circuit board.
  • the phys ical arrangement or topology of the above-mentioned components on the chip area can also be designed in such a way that the CCPU 2 is arranged in a central chip area, and the computing units 3 are arranged around the CCPU 2 at the edge of the chip area.
  • a physical arrangement of the components can also be formed at least partially, which is similar, for example, to that shown in Fig. 1.
  • Fig. 2 shows a more detailed example of an embodiment of a processor architecture 1 according to the invention, in which several computing stages 5 are connected via direct data lines 40/45 between the respective computing stages 5 to form a closed ring 7 around a central control CPU 2 (CCPU).
  • the computing core 2 may also be multi-core at best, i.e. have several CCPU processor cores 20 working in parallel, which maintain the overall principle described here and share the work of the computing core 2 in order to further increase performance and to be able to serve all computing stages 5 in a timely manner, e.g. by further reducing the waiting times of several system calls.
  • An example of a CISC (Complex Instruction Set Computer) architecture can use collected instructions to quickly describe a complex sequence of operations to be performed, but the complex instructions also require more time to decode the program code. This, as well as many partly redundantly designed instructions, limits the efficiency.
  • coprocessors are also used for special tasks (e.g. for an out-of-order execution) as well as a distribution of tasks to one or more FPU (Floating Point Unit), ALU (arithmetic logic unit) or other specific logic units for instruction set extensions like SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), etc.
  • In a GPU (Graphics Processing Unit), however, even with CUDA (Compute Unified Device Architecture), Vulkan or OpenCL technology, the micro-components of the individual cores are not linked with each other and cannot optimize each other.
  • the load/store operations can largely be carried out within the ring 7 between the computing stages 5 with the aid of the MUs 4, as well as by a CCPU 2 (preferably largely autonomous) designed specifically for this purpose, while at the same time all computing stages 5 process the program code and the data largely independently of this and to a large extent parallel to each other.
  • the present invention can, for example, achieve improvements in one or more of the following problems.
  • a branching in the program flow creates, for example, a new sub-thread (here, in addition to this designation, also called tree thread or inner flow) with a starting position and fixed variables at this point in time to a code which is still fixed but branched.
  • a surjective sequence must be operated by the processor, but the number of processor cores is limited and thread changes should be avoided if possible, as such a change would take many CPU cycles and thus valuable time.
  • a new level of abstraction can be created. For example, even in a field computer, it has so far hardly been possible to variably calculate fine-granular complex tasks with many input variants to be processed without reaching the structural limits of the field computer architecture. With a processor architecture that is in accordance with the invention, especially such tasks can be solved advantageously.
  • embodiments according to the invention can, for example, be designed as described below.
  • all caches with their runtime variables can be cached locally independently in the computing stages 5, and the results can be kept there during operation.
  • An initial distribution and loading of data and program code from the RAM 9 into the computing stages can be carried out under control of the CCPU 2 via the MUs 4 of the computing stages 5, especially while the computing units 3 of the computing stages 5 are still busy processing other tasks.
  • As the program code is processed, the data are then shifted along the ring from one computing stage 5 to the next, wherein each of the computing stages 5 processes autonomously executable parts of the program code which are applicable to the data currently available in the computing stage.
  • a locally distributed, multiple stack for variables to be passed on can be formed in each case between the computing units 5 and a registration unit with a controlling microcontroller.
  • the registration unit is designed in such a way that it can set breakpoints for decision points and find its way back to a thread of the respective other decision over the entire system - for example via "labels" - while the other decision is processed in high parallel without necessarily discarding the results up to that point. In other words, a result tree memory is formed.
  • the central CCPU microcontroller 2 primarily only takes over interrupts as a function for the entire processor 1.
  • a simply constructed but very fast CPU core can provide the organization of threads, starts and storage, as well as branching of the processes, as well as recognize and solve jams and conflicts.
  • the CCPU 2 can be specially (and preferably only) designed for these tasks, e.g. it can have an instruction set designed for the above-mentioned organizational processes, in particular it can dispense with structures that are not required for such tasks and/or it can have special structures for high-performance processing of these outputs.
  • the CCPU 2 merely provides corresponding memory pages in the local caches, which can also take place, for example, during the preceding processing, until the accumulated amount of buffered data at the input has been completely linked with the correct instructions and processed.
  • an administration for virtual and locally con stantly reassignable memory pages can be formed, which automatically assigns thread branches and efficiently stores results bijectively with a bijectively defined code line and memory location (or several code lines and memory locations in an automated field), as well as loads required code pages for different code sections and/or splits them up among the computing units 5 according to the time required.
  • the administration is designed in such a way that the caches assign code pages to a processing line, and no longer to a specific physical processor core, as is usually the case.
  • results of a first computing stage 5a are always placed on top of a stack input of the next computing stage 5b following in the ring in a kind of "round robin" procedure in a uniform sequence.
  • In such a processor architecture 1 it is also possible to dispense with a uniform clock across the entire processor architecture 1.
  • Each of the computing stages 5 can process its input stack with data as fast as it is capable of doing so, i.e. for example, with an individual clock.
  • the instructions for processing the results and data are placed in parallel on an instruction stack for each of the computing stages 5.
  • the processor architecture in accordance with the invention can also implement appropriate transition structures, which are designed to work between components of the processor architecture inde pendently of the clock via buffers and signals.
  • the processor architecture 1 according to the invention can be designed in such a way that several distributed sequences of a thread are distributed over simple computing stages 5 and processed in a sequence according to a dynamic priority pattern which offers "out of order execution", in particular to model real-time processes and physical processes, especially according to an "action-reaction" principle.
  • an embodiment of a computer architecture 1 in accordance with the invention provides a bijective computer architecture which works on a main data stream for each processor 1 , but which can process a data flow several times in parallel by means of different instructions, preferably adaptively to the modelling of the problem in real time.
  • the computer architecture 1 is preferably not clock-dependent and provides a combination of properties of field and vector computers.
  • the computer architecture 1 according to the invention provides a new approach to how results are temporarily stored while maintaining a bijective overview for the system.
  • an embodiment according to the invention can also be described as a combination of an arbitrary or binary representable number of arbitrarily addressable computing stages 5, comparable to a kind of honeycomb field.
  • the first and the last column of the computing stages 5 are in turn linked via the output of the last one to the input of the first one, thus forming a ring 7.
  • Ring 7 is thus formed around the control processor CCPU 2 in the middle of the latter, wherein ring 7 and middle refer primarily to its interconnection arrangement and other geometric arrangements can also be provided in other embodiments, in particular those in which (e.g. within the scope of chip area or timing optimization) another physical arrangement is formed, which nevertheless implements a logical arrangement of the components in accordance with the invention.
  • the processor architecture 1 is also physically arranged on a semiconductor die in its basic structure similar to that shown in Fig. 1.
  • the computing stages 5 always result in a closed ring in their data flow and the control CPU CCPU 2 has a star-shaped connection to each of the computing stages 5 for their control and management, via which star-shaped con nection also an initial loading of program and code parts for execution from an external memory 9 on the computing units 5, as well as a retrieval of results or partial results of the program execution from the computing units 5 to the memory 9 takes place, which is controlled by the CCPU 2, preferably largely independent of a current processing of a current program code in the computing units 5 and/or of the flow of data along the ring.
  • a Management Unit 4 is designed as a preferably intelligent - i.e. programmable or configurable with its own instruction set - link between each of the computing stages 5, for example in the form of a configurable or program mable logic unit or as an FPGA.
  • a MU 4 is assigned to each computing stage 5, preferably arranged in the ring at its input for data.
  • the MU 4b of computing stage 5b provides a link between
  • the MU 4 is designed so that (a minimal software sketch of these responsibilities follows after this list)
  • the data to be processed is primarily transferred between the computing units 5 for successive processing along the ring,
  • program code to be executed by the respective computing unit 5 is loaded into a program buffer of computing stage 5,
  • initial data is loaded for the respective computing unit 5 into a data buffer of computing stage 5 or result data is read back to CCPU 2,
  • control data for the PMMU 46 and/or the processing unit 3 are transmitted, or status and error data from these components are transmitted to the CCPU 2.
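As a minimal sketch of these MU responsibilities, the following Python class models one MU per stage with a program buffer filled by the CCPU, a data buffer fed over the ring, and a status report back to the CCPU. Class and method names are illustrative assumptions, not the patent's terminology.

```python
# Software sketch of a Management Unit (MU): forwarding data along the ring,
# accepting program code from the CCPU and reporting status back to it.
class ManagementUnit:
    def __init__(self, stage_id):
        self.stage_id = stage_id
        self.program_buffer = []          # instructions placed here by the CCPU
        self.data_buffer = []             # pages / byte-stream data arriving over the ring
        self.status = "ready"

    def forward_data(self, next_mu, page):
        """Hand processed data over the ring link to the next computing stage."""
        next_mu.data_buffer.append(page)

    def load_program(self, instructions):
        """Program code distributed by the CCPU over the star connection."""
        self.program_buffer.extend(instructions)

    def report_to_ccpu(self):
        """Status and error information sent back to the CCPU."""
        return {"stage": self.stage_id, "status": self.status}

# Example: a 3-stage ring where stage 0 forwards a result page to stage 1.
ring = [ManagementUnit(i) for i in range(3)]
ring[0].forward_data(ring[1], {"thread": 7, "payload": [1, 2, 3]})
print(ring[1].data_buffer, ring[2].report_to_ccpu())
```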
  • the moving and/or loading of data is preferably carried out in the form of blocks, also called memory pages or pages.
  • the MU itself only works in one direction in the ring to process data.
  • the MU can transmit data to any other MU with a bypass function, and thus forward intermediate results stored at the input which are required for a merging at this stage of the calculation - but only without further processing of the data.
  • such a task of bypass data forwarding can also be performed by the CCPU, especially since the CCPU coordinates, approves and manages such operations anyway.
  • the number of computing stages 5 in circulation is shown here as an example with the four computing stages 5a,5b,..., but this number can be extended at will.
  • processor architectures 1 can be combined as Multiple Instruction Single Data (MISD) to form a Multi-MISD, optionally on a single chip, wherein data streams are managed or mirrored globally.
  • the program execution can be designed in such a way that each processor architecture 1 processes only one process.
  • time slice models can be used, but the entire processor architecture 1 and also the MU would have to switch the currently held pages.
  • a chip with several processor architectures 1 according to the invention can switch to a multi-process or multi-thread mode by dividing processes on the instruction stack that are to be actively operated into address spaces and forming subsets from the address spaces on the RAM as long as the MU supports this, whereby, for example, an ordinary multicore processor can also be emulated.
  • These are not the distributed threads with a decision tree to which special reference was made, but real, optionally linear or branched processes already known at present, which run analogously on the structure. Due to the assignment technique, the threads subordinate to processes can still be run, even if there are several processes.
  • An embodiment according to the invention of a process can have the following aspects, for example:
  • An embodiment of an MISD computer according to the invention can provide coordinated and complex pipelining over an extended hierarchy of a control processor 2.
  • the data throughput can thus be massively increased, since the program flow is not interrupted continuously, but only by a system call when required.
  • the calculated results are coordinated, but are only written back when they are needed. There is no obligation to write back, but only a small number of registers can immediately notify about a change of state. The results can remain in circulation until the data stream no longer needs to be manipulated by instructions and is terminated. A possible temporary write-back is only for safety reasons, but is not absolutely necessary.
  • the CCPU 2 does not use the intermediate results; only e.g. in case of a crash or the like can they be used to restore the system. Only the correct sequence and connection of the instructions, the loading and saving as well as I/O are the tasks of the CCPU 2. As a result of this outsourcing, the CCPU 2 and its internal logic circuits can be designed comparatively small. At this point in the design of the CCPU the focus can be on consistency rather than speed.
  • error codes can only form further reportable fixed branches which can be easily integrated into the regular process.
  • An embodiment of a processor architecture 1 according to the invention can thereby form a highly variable field computer in the broad sense, with the possibility of separating and coordinating threads on the same processor 1 into sub-threads, as well as merging them again. No longer are full-fledged processor cores required for a further thread. This saves space and valuable chip area.
  • a complete field computer can be simulated and variable handling with similar performance can be achieved.
  • problems of computation for quantum physics and for the research of quantum computers can be solved more efficiently.
  • bypassing and forwarding can be optionally implemented for each computing stage 5.
  • the instruction fetch phase amounts to a processing process of the CCPU and is omitted in computing stages 5.
  • There are effectively only 3 or 4 roughly distinguishable phases for the pipeline and all of them are distributed in the individual components and can be executed in a highly parallel manner. This distribution makes it possible to create effective, further subdivided pipelines, which increases the overall performance of these simplified components for the remaining 3 or 4 individual phases of the control and/or simplifies the design.
  • a plurality of smaller and simpler control units according to the invention with their arithmetic units are easier to design and optimize than a large, complex control unit, especially if the smaller control units can be used several times in the same form.
  • the entire processor architecture 1 can be designed with an external interface like a conventional processor.
  • the evaluation of the speed of the individual cores often becomes less relevant. Instead, a throughput of the computing stages 5 is evaluated.
  • the number of parallel running threads is effectively limited directly by the number of individual processors (ALU, FPU, logical units, field computers and vector computers), which can even increase the parallelism capability of an architecture 1 in accordance with the invention one hundred fold.
  • 32 threads are active simultaneously, wherein such a processor architecture can be implemented on a chip area that corresponds approximately to that of a classic single-core processor.
  • the efficiency and utilization of an embodiment of a computer architecture 1 according to the invention is anchored in its structure, especially with regard to calculation tasks with a high load and with a high number of calculation processes.
  • Fig. 3 shows an example of an embodiment of a computing stage 5 according to the invention with a computing unit 3 and a management unit 4 (MU) assigned to it.
  • an independent forwarding of data and instructions for the adaptation of the runtime characteristics is carried out, for example with an adaptation of the program code page distribution at the exit and re-entry of loops, which are distributed diffusely over the computing stages 5 and thus over the program succession.
  • the control CPU 2 is designed in such a way that it constantly prepares an overhead for the preparation of the computing path through all computing stages 5, so that during the actual execution of the calculations by the computing units 3, only the results are entered at still free positions or at pre pared reserved positions in the workflow of computing stage 5, which is prepared on the local cache, are forwarded and then further processed in the next compu ting stage 5 - without requiring a further overhead at runtime.
  • the data inputs for a program instruction which is processed by one of the arith metic units 31 , 32, 33, 34 of computing unit 3, are always directly available in one of the many local register sets and are loaded by a so-called PMMU 46, whose associated instructions are simultaneously fetched and decoded by an interpreting engine 37 (IE), in that they are placed by an execution engine 36 (EE) in the input instruction registers of the arithmetic units and processed by the various arithmetic units, wherein the IE obtains its instructions for the loaded data from stack 47 of computing stage 5.
  • the IE 37 predecodes a complex standard combination instruction. This IE 37 can also be integrated in the EE 36.
  • a computing stage 5 has a management unit 4 (MU), which takes over the administration and coordination from the microprogram code of a control, for example on its updatable FPGA chip, via communication with the control CPU 2, the EE 36, the IE 37, as well as the local cache 47 of computing stage 5 on the basis of the program code of the program to be executed by the user.
  • a computing stage 5 ac cording to the invention can have several dedicated instruction stacks, in particular each for different arithmetic units, for example separate stacks for different types of instructions, especially separately in each case for example for an FPU 33, ALU 34, ... of computing stage 5.
  • Each computing stage 5 preferably has several field computers (FR) 32, vector computers, logical units (LE) 31, ALU 34 and FPU 33 units (full and/or half value).
  • Computing stages 5 are also referred to as (MISD) field lines or arms of the processor architecture, especially for Multiple Instruction Single Data (MISD) calculations.
  • Their data input is shown in Fig. 1 as an example on the counterclockwise side, their data output in clockwise direction.
  • the ring of computing stages 5a built up in this way is designed so that the data outputs can always be transferred via a multiplexer in the MU 4 to one or more data inputs of a next computing stage 5b.
  • a bypass to the same computing stage 5 itself and/or to the respective subsequent computing stage 5 via a correspondingly designed MU 4 is also optionally possible.
  • the MU 4 is bidirectionally connected to a Paged Memory Management Unit 46 (PMMU).
  • a level 4 cache memory 47 (L4) connected to such a PMMU 46 can buffer pages and hold them until the next page change.
  • each of the MUs 4 has a respective individual PMMU 46 and a respective individual L4 cache 47.
  • a communication flow can also be established to control and configure the PMMU 46 from and/or via the MU 4.
  • Each computing unit 3 thus has a paging buffer 47, usually relatively small, e.g.
  • the paging buffer is designed as a highly parallel page buffer and is integrated in the computer architecture for direct 1 to 1 data flow, in particular for data with the size of a single (or in combination at best also multiples) of the units of a Multiple Instruction Single Data (MISD) computing stage.
  • the size of the paging buffer can be selected in particular depending on the number of existing computing stages.
  • each passing of N computing stages may require N data sets and thus the buffer for the number of computing stages may be formed in advance according to this number or a multiple thereof, so that the CCPU does not have to check the progress for a thread after each round, but only when, for example, less than half of the data still processable, or other paging directives, are still present in the buffer for a thread. If the CCPU is undersized, it will have to schedule the filling of data and instructions for this thread too often, which could be detrimental to performance. This can be used, for example, to standardize split data streams that are related to the start.
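Read as a sizing rule, the paragraph above suggests a buffer of roughly k·N data sets for N computing stages, with the CCPU only intervening when a thread's buffer has drained below, say, half. The following sketch encodes that rule; the factor k and the half-full threshold are assumptions used purely for illustration.

```python
# Sketch of the paging-buffer sizing rule: hold k passes worth of data sets
# (one per computing stage and pass) so the CCPU does not need to check
# progress after every round, only when the buffer runs low.
def buffer_capacity(n_stages: int, k: int = 2) -> int:
    return k * n_stages                    # data sets buffered per thread

def needs_refill(remaining: int, n_stages: int, k: int = 2) -> bool:
    return remaining < buffer_capacity(n_stages, k) // 2

print(buffer_capacity(8))                  # 16 data sets for an 8-stage ring
print(needs_refill(7, 8))                  # True: below half, CCPU schedules a refill
print(needs_refill(12, 8))                 # False: the ring keeps running undisturbed
```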
  • each MISD computing stage 5 has a fully associative instruction stack, which stack is filled from above via and with the help of the MU 4 by the control CPU 2, and from below is automatically extracted and processed by the MISD computing stage 5 when it is processed.
  • the width of the instruction words of the CCPU 2 can be independent of the width of the MISD level 5.
  • Each instruction architecture of the LE 31, FR 32, FPU 33, ALU 34, etc. is independent and can optionally be designed with multiple redundancy, e.g. depending on the probability of use based on the instruction frequency.
  • a search for the correct instruction or its memory page is carried out on the basis of the combination of the thread number and the instruction pointer in the program and, if necessary, a run-up priority or a queue on the instruction stack, wherein the page is made available via a so-called Program Memory Management Unit 46 (PMMU).
  • the compiler prepares the program code in such a way that only the possible instruction modulo of the combinations of the program flow of the ring are entered on each memory page for an instruction stack.
  • An instruction which is marked as ready in the instruction stack, is inserted by the Execution Engine 36 (EE) into the registers of the corresponding arithmetic unit.
  • the EE 36 selects a possible, ready instruction from the instruction stack, e.g. on the basis of the highest priority and therefore currently critical latency or according to the FIFO principle, and thus connects at least two linked valid data from the separate data cache reserved in and by the instruction with the instruction from the instruction stack.
  • the link from instruction and associated data is then inserted into the next free space of instruction execution in a selected arithmetic unit.
  • the EE 36 checks continuously and at the highest possible search rate for a possible insertion of instructions, at least as long as hardware capacities for instruction execution are available in the arithmetic unit.
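The selection step of the EE described here can be sketched as a small scheduler: pick the ready instruction with the highest priority (ties keep their FIFO arrival order), bind its operands from the local data cache and hand the pair to a free arithmetic unit. The names and the priority field are assumptions for illustration, not the patent's terminology.

```python
# Sketch of the Execution Engine's dispatch: priority-first selection with
# FIFO tie-breaking, operand binding, and insertion into a free arithmetic unit.
def ee_dispatch(ready_instructions, data_cache, free_units):
    if not ready_instructions or not free_units:
        return None                                   # nothing to issue right now
    # max() returns the first maximal element, so equal priorities stay in FIFO order
    instr = max(ready_instructions, key=lambda i: i.get("priority", 0))
    ready_instructions.remove(instr)
    operands = [data_cache[name] for name in instr["operands"]]
    unit = free_units.pop(0)                          # next free ALU/FPU slot
    return {"unit": unit, "instr": instr["id"], "operands": operands}

ready = [{"id": 4, "operands": ["x"], "priority": 1},
         {"id": 9, "operands": ["y"], "priority": 3}]
print(ee_dispatch(ready, {"x": 10, "y": 20}, ["ALU0", "FPU0"]))
# -> {'unit': 'ALU0', 'instr': 9, 'operands': [20]}
```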
  • the computing speed is therefore not distributed over a large, and thus limited and complex, processor construct, but only directly comprises a computing stage 5 and a decoupled communication bus between management unit 4 (MU) and control CPU 2 (CCPU), which thus gains speed.
  • Each instruction bijectively links, via a signature of the thread, the data stream and the assembler program counter (PC), its successor instruction and the last instruction with a jump decision.
  • a new, broad instruction format is provided, which includes one instruction and one variable, e.g. a 128-bit instruction format which has a 64-bit instruction and a variable that is a multiple of 8 bits (today 64-bit standard), wherein the extension of the computer's bandwidth is decoupled from the instruction bandwidth.
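To make the width split tangible, the sketch below packs such a wide instruction word as a 64-bit instruction part and a 64-bit variable part; the exact field boundary is an assumption for illustration, not a claimed encoding.

```python
# Sketch of a 128-bit wide instruction word: 64 instruction bits plus a
# 64-bit (multiple-of-8-bit) variable part, packed into one Python integer.
def pack_wide_instruction(instr_bits: int, variable_bits: int) -> int:
    assert 0 <= instr_bits < (1 << 64) and 0 <= variable_bits < (1 << 64)
    return (instr_bits << 64) | variable_bits

def unpack_wide_instruction(word: int):
    return word >> 64, word & ((1 << 64) - 1)

word = pack_wide_instruction(0x1234ABCD, 42)
print(unpack_wide_instruction(word))        # (305441741, 42)
```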
  • a marker of an independent instruction sequence is set to "used", and is reset to "ready" after leaving the computing stage. Only when a storable, output-obligatory result is to be provided after the internal pipeline (DEC, (DEC), EX, MEM/WB) is the state of computing stage 5 set to "out" with the marker.
  • a further state of computing stage 5 is the marker "none", i.e. the idle state, in which computing unit 3 is switched off in computing stage 5 for energy saving.
  • a marker of 2 bits is sufficient for the state of computing unit 3; in other embodiments, further states can be coded in a wider marker.
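The four markers named above ("used", "ready", "out" and the idle state "none") indeed fit into two bits; a minimal encoding sketch follows, with the concrete bit values chosen arbitrarily for illustration.

```python
# Two-bit marker encoding for the state of a computing unit, as a sketch.
from enum import IntEnum

class Marker(IntEnum):
    NONE = 0    # idle: computing unit switched off for energy saving
    READY = 1   # instruction sequence may be issued (again)
    USED = 2    # occupied by an independent instruction sequence
    OUT = 3     # storable, output-obligatory result pending after the internal pipeline

print(format(Marker.OUT.value, "02b"))   # '11' - two bits cover all four states
```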
  • a counter can be implemented, which is stored for each instruction and which can be used to determine when this instruction is calculated, i.e. when the marker "out" is received.
  • This can, for example, be in the form of a table in the memory, especially meant as a separate extension of the register table of the instructions and candidates for the MU. Since several instructions are executed simultaneously and even small pipelines for binary arithmetic steps are provided in the computing units 3, the current arithmetic state of an instruction is also noted on the loaded working page of the MU on its arithmetic unit for planning purposes. This makes it possible to register, for example, when a required computing unit 3 becomes free. The MU itself loads the required page with the PMMU from the L4 cache and works and notes the current calculation state on this page until the page is written back to L4 and updated when there is a change and also at regular intervals.
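The four markers and the per-instruction counter described above could, purely illustratively, be modelled as below; the enumerator names and the bookkeeping field are assumptions, only the 2-bit width and the state names come from the text.

#include <cstdint>

// The four states fit into a 2-bit marker; the encoding below is illustrative.
enum class StageMarker : uint8_t {
    None  = 0,  // idle state, computing unit switched off for energy saving
    Ready = 1,  // instruction sequence may be taken up again
    Used  = 2,  // instruction sequence currently occupies the stage
    Out   = 3   // storable, output-obligatory result after DEC/(DEC)/EX/MEM-WB
};

// Hypothetical per-instruction bookkeeping: the marker plus a counter that
// lets the MU estimate when the "out" marker is to be expected.
struct InstructionState {
    StageMarker marker;
    uint32_t    cyclesUntilResult;  // decremented as the pipeline advances
};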
  • a special feature of the process is that for complex Standard Combination Execution (SCE) instructions, a second decoding phase is initiated, which redirects the original instruction on the side of the instruction stack through a link to a page of the Paged Memory Management Unit 46 (PMMU) with the code page generated by the Interpreting Engine 37 (IE) and executes the first redirected instruction in the subsequent second decoding until the program jumps back to the instruction stack by the Management Unit 4 (MU).
  • the SCE can be designed as a part or extension of the IE. This means that, for example, frequent chains of instructions can be virtually designed externally as a single instruction, and this instruction can be set up internally or via the following chain of computing stages again and again without placing a load on the CCPU.
  • an exception can also be implemented with respect to re-execution in the same computing stage, which is otherwise not primarily provided for in the computer architecture of the invention, e.g. in that special computing stages are designed or preconfigured for special SCEs.
  • the last known instruction of a program is transmitted to the control CPU 2 (CCPU) together with the calculated output (entry, array or pages) for identification.
  • Each instruction additionally contains an assignable thread ID, for example as instruction formats with computing stage number, instruction number, thread number, etc. which assign a correct next location for the instruction data.
  • an instruction is stored at a register location of a computing stage, and exactly this location is first identified by the CCPU for the instruction in order to find a successor and insert the correct data.
  • two identical programs with two different data sets and thus different thread numbers, but the same instruction numbers and the same computing stage numbers can be present and processed on one computing stage.
  • the Management Unit 4 manages internally at least as many pages (which are loaded from the Paged Memory Management Unit 46 (PMMU)) as there are inputs on the computing units 3.
  • An embodiment of caches in the processor architecture according to the invention is also shown in Fig. 2 and Fig. 3 by way of example.
  • the caches are to be subdivided primarily into caches L1, L2, L3 of the CCPU 2 and caches 47 of the respective computing stages 5, wherein in the present invention the latter in particular have a special structure.
  • the control CPU 2 can be designed in an embodiment to a large extent analogous to one of the known computer architectures, which is specially supplemented with a termination procedure of time-stretchable real-time processors and several control buses extended outwards.
  • the page-mixed data and/or code caches 47 are arranged externally around the computing unit ring 3a, 3b, 3c, 3d or between this and the CCPU 2.
  • the latter arrangement can speed up the loading of pages for the CCPU 2 and one ring can then, for example, take up essentially the entire outer area of the chip area.
  • the data and instruction pages can be mixed in L4 caches in one embodiment, or in another embodiment as shown by way of examples in the figures the MUs are connected to an in-cache and an out-cache and the L4 cache only holds the instruction tables.
  • the L1 cache in the control CPU 2 is designed for instructions prior to their assignment. In this cache, instruction words are collected in the form of pages and inserted into the L4 cache of stage 5.
  • the L1 cache can only contain a few pages.
  • data and program code pages are stored separately from each other in the L1 cache, wherein from the perspective of the CCPU, the instructions to be inserted are regarded as data. This can make it easier to allocate pages at computing stage 5 and to divide them into program code pages and data pages at computing stage 5.
  • Such a page can also have a header in a mixed L2 cache, for example, which identifies the stored page as data or instructions.
  • This header can, for example, be in the form of at least two page bits.
  • page markers used by paging can be stored, such as "invalid", "referenced”, etc.
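A minimal sketch of such a page header, assuming (as suggested above) at least two page bits for the content type plus the usual paging markers, could look as follows in C++; all field names are hypothetical.

#include <cstdint>

// Illustrative header of a page in a mixed L2 cache.
struct PageHeader {
    uint8_t isCode     : 1;  // 1 = program code page, 0 = data page
    uint8_t isValid    : 1;  // cleared for "invalid" pages
    uint8_t referenced : 1;  // paging marker, e.g. for a clock-style algorithm
    uint8_t dirty      : 1;  // page must be written back before displacement
    uint8_t reserved   : 4;
};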
  • an operating system fills a part of the RAM main memory 9 with the first pages of a program to be executed before each program start with the help of the CCPU 2 and prepares the structures in the computing stages 5. In this way, other memory pages in the computing stages 5 can be displaced prematurely and/or a check can be made to ensure that all processes are consistent.
  • the CCPU 2 can be designed with its own set of coordination instructions, which is formed e.g. by the compiler using the task code. It is possible to define one's own complex Standard Combination Execution (SCE) beforehand via programming, by placing the instructions of the complex operation on an extra page and by calling them up in real time by an instruction in the program code.
  • the MU 4 with the PMMU 46 fetches a page with an SCE instruction sequence from the L4 cache 47.
  • the CCPU has a special control instruction.
  • the instruction format that allows the data to flow in the ring can, in many embodiments, be different from the format that the CCPU uses for its control measures for the ring and for its instructions.
  • a RISC-V instruction set can be extended by corresponding instructions.
  • each computing stage 5 can be directly connected via its MU 4 to more than just the next computing stage - in a special embodiment, in particular to each other and to itself - in particular to forward results, code pages or data pages.
  • One such embodiment is a so-called Fully Associative Network (FN).
  • a second no-operation (Nop) instruction with an adjustable holding time for each instruction can be formed in the instruction set for each arithmetic unit 5, in which this arithmetic unit 5 lets the input data stream through as an output data stream if the planning detects a standstill (stall) in the next computing unit 5 in the ring.
  • the fully associative network FN can be simplified, because a holding period of 1 allows the instruction to pass exactly in the next step.
  • the instructions for the arithmetic units 3 are not directly instructed by the CCPU 2, but each individual instruction is instructed by the corresponding MU 4, so that this MU 4 can intervene in the program flow based on signals from the CCPU 2.
  • a RISC instruction set can be extended, wherein this instruction set can be designed in particular to provide special instructions with which data can be loaded onto external structures, especially to the MUs 4 and/or their PMMU 46 or local cache 47.
  • With the extended instructions it is possible, for example, to set a page in a MU 4 and communicate with it, collect all the results of a dividing instruction, optionally with the creation of a new branched thread path and continued calculation of both threads, and return and path logging of the final results of a thread.
  • the instruction set of the CCPU 2, the MU 4 and the complex SCE can preferably be in one embodiment at least substantially the same or subordinate elements of a subset of the instruction set supported on the CCPU 2.
  • the instruction space for MUs and the CCPU can, if necessary, be handled separately.
  • a Multiple Instruction Single Data (MISD) instruction set of the CCPU 2 can provide at least one of the following instructions:
  • the original instructions of the machine code can be coded in RISC and an extension can be used to provide the most complex SCE instructions for sequences that are used more often.
  • the data transfer for controlling the computing stages 5 can be minimized.
  • the decoding of the instructions also has a higher performance potential anyway due to the outsourcing, which can be used for such complex instructions.
  • the primary consideration in the design can also be the computing throughput and not the reaction speed.
  • An instruction decoding in the IE 37 can also be designed with a pipeline.
  • the MU 4 will determine this in advance and the MU 4 will not load this instruction into computing unit 3, but another instruction which is already completely ready for execution.
  • the individual instructions are "ready for execution" (and not just entire threads with a large number of instructions). Through this consideration of individual instructions according to the invention, a thread or sub-thread of a process becomes, so to speak, only implicitly ready for execution. In many cases, this can be used to avoid so-called "busy waiting".
  • threads can be checked globally by the MU on instruction of the CCPU in combination with instructions still outstanding for this thread by the CCPU.
  • Loops in particular can be checked for their conditions. If a conditional event can no longer occur, the entire thread is automatically terminated with an error state or interruption state, which is then handled by the CCPU 2 - in the background, so to speak - while the corresponding computing stage 5 devotes itself to the next task without interruption.
  • control processor CCPU 2 can document the jumps and decision parts executed during the execution via an operating system of the computer architecture 1 after completed calculation of each page to be executed, wherein these branches can already be available, marked and prepared by the compiler, so that only the corresponding results of the branches are entered by the CCPU 2 at the appropriate places.
  • control processor CCPU 2 can store the distribution of the thread pages and provide a list of all open instructions in the instruction stacks of computing stage 5 and assign computing stage 5 to the thread pages, especially without disturbing the processing of the program code by computing stage 5. For example, when searching for a special instruction, this can be used to determine the responsible page block. If thread pages are forwarded externally between computing stages 5, the CCPU 2 is informed by the MU 4, and the CCPU 2 then updates this information in its local program flow administration.
  • Fig. 4 shows in detail an example of a schematic structure of an embodiment of a Management Unit 4 (MU) according to the invention.
  • a primary task of the MU 4 is to perform the data flow between the computing units 3, i.e. especially from input 40, from a preceding computing unit in the ring to a subsequent computing unit 3 in the ring via output 45.
  • the MU 4 separates the instructions of the actual program for each input of computing unit 3 and places them on the fully associative instruction stack for each computing unit 3.
  • the instruction stack can also be designed as a separate stack for each input.
  • the MU does not load this current instruction, but another instruction which is already ready for execution.
  • the MU 4 can, for example, fill an instruction and/or data stack for an arithmetic unit of computing unit 3.
  • a MU 4 can be specially designed with:
  • a cache-in 41 which receives incoming data, designed for caching and buffering byte stream data and pages according to strategies such as FIFO, LIFO or other strategies originally known from process scheduling, which are used here for the storage strategy,
  • a cache-out 43 designed to cache large pages that cannot be transferred in one go, and for stall treatment
  • a multiplexer 44 designed to provide outgoing data to the computing unit or computing stage level belonging to the MU, with an EE in which it assigns the instruction to the correct arithmetic unit (see the illustrative sketch below).
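For illustration only, the three components listed above can be reduced to the following C++ sketch; the container types, the FIFO policy shown for the in-cache and all names are assumptions and stand in for the hardware structures.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

using Page = std::vector<uint8_t>;

// Very reduced structural model of a Management Unit.
struct ManagementUnitModel {
    std::deque<Page> cacheIn;   // buffers incoming byte streams / pages (FIFO here)
    std::deque<Page> cacheOut;  // holds large pages and results during a stall

    // Multiplexer: hand the next buffered page to one of the arithmetic
    // units of the local computing stage (the selection is left to the EE).
    bool dispatchNext(std::vector<std::deque<Page>>& unitInputs, std::size_t unit) {
        if (cacheIn.empty() || unit >= unitInputs.size())
            return false;
        unitInputs[unit].push_back(std::move(cacheIn.front()));
        cacheIn.pop_front();
        return true;
    }
};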
  • Fig. 5 shows a block diagram of an embodiment of a process according to the invention or a method according to the invention.
  • a central control CPU reads in a program code and data from a memory.
  • the program code is distributed by the central control CPU via a star-shaped, direct connection of the central control CPU to several computing stages. These computing stages are connected to each other in a closed ring for data communication.
  • the program code is distributed to the computing stages in such a way that when the program code is executed, intermediate results are forwarded along the ring between the computing stages as result data of a first instruction of the program code and as input data of a second instruction of the program code.
  • the instructions or blocks of instructions are thus, for example, successively distributed along the ring, wherein in particular conditional branches and sequence variants of the program code are also taken into account in the selection and configuration of the distribution of blocks of instructions in order to be able to achieve a high degree of parallelism in the processing.
  • the final results of the execution of the program code are read in from the computing stages by the central control CPU and the final results are stored in the main memory or on a data carrier, e.g. RAM, or on a data storage medium such as an HDD or SSD.
  • a new instruction "multiple_fork(<Object_identifier>)" can be provided, which via a template feeds various inputs to any variable or object during its creation and synchronizes further sequence variants or allows own sequences.
  • Each argument "multiple_value(<Object_identifier>, args...)" in the following code creates a branching in the code.
  • a further argument can then be "multiple_out(<Object_identifier>, args...)", which outputs an array of pointers to the finished calculated objects.
  • Each of the "args" variables used creates a further sequence variant.
  • Such a method can be designed as an extension of the "p_thread" class for handling threads (also called activity carriers or light-weight process) as an execution strand or an execution sequence in the processing of a program as part of a process for flow control, but according to the invention, whole address spaces do not have to be copied or pseudo-copied, and only small sections of code are split when the compiled code is executed.
  • the "multiple_out" statement can still be read out as an array[] in order to continue or merge results of the individual strings in sequential form if required.
  • CCPU loads a program page into the L4 of the MU; after instruction by the CCPU, the MU immediately finds the correct page with the first instruction, especially according to the instruction format described here.
  • CCPU sends data for transfer: Thread ID, initial data for transfer to all computing stages.
  • MU loads first instruction from internal buffer and fills the instruction with required data variables from L4.
  • CCPU gives start release to the thread as soon as all preparations are done.
  • IE interprets instruction(s), checks free arithmetic units. In particular, an instruction page with standard combination execution instructions (SCE) is created, the first element is popped repeatedly until no more special instruction "fork" can be found and the resulting instruction page with the length n-1 is sent to the next computing unit, which again pops the first element and enters this instruction in the internal registry and writes it back to L4, continues to send the page etc. This results in a kind of self-configuration of the ring.
  • the computing unit outputs the result after a few internal pipeline steps and buffers it on the cache-out, e.g. in the format register 0x****: {Result, subsequent instruction number, subsequent computing stage, thread ID}. If there are multiple outputs, the MU can also report a message "fork" to the CCPU, e.g. with {Thread, new Child, Instruction Number}, so that the CCPU can keep track of all processes and relationships.
  • the MU of computing stage n+1 registers that its buffer is not empty.
  • the MU loads the data/the data field and searches for available instructions, e.g. in round robin principle over all pages from L4.
  • the MU finds a matching instruction and keeps the matching page in its internal memory.
  • With an instruction supplement X_MERGE {Out, Thread1, Thread2, RegisterIn1, RegisterIn2, subsequent computing stage}, data flows of the specified two threads can be reunited on Thread1 for extended operations of any kind, and this can be made known to the CCPU.
  • the MU waits until both data are available.
  • the MU automatically packs the new result into the correct fetch register of the page.
  • the compiler is guaranteed to have the correct instruction on a page in MU memory for this step.
  • Step 3 and following are repeated until the last instruction is reached and the MU finds a KILL instruction on the instruction page.
  • MU reports thread ID with message "done" to CCPU.
  • CCPU sends message "get” with the thread ID to all MUs with relevant output information in reverse order. This can be used to create a tree stack of the data which is known to the CCPU through the program code and its own planning of the process at the computing stages.
  • CCPU creates completed program classes for thread and branch overview for the user and stores them on the hard disk or RAM. Each instruction to put an intermediate result on the stack or a newly created or merged thread represents a label to be stored for the CCPU and is marked as such from the beginning.
  • the CCPU searches for the corresponding locations of instructions already during distribution and finds the corresponding storage locations of variables with the MUs to build a consistency tree.
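The cache-out format and the pairing step described in the walkthrough above can, purely illustratively, be modelled as follows; the field widths and the simple linear search are assumptions made only for this sketch.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical layout of one cache-out entry, following the format sketched
// above: {result, subsequent instruction number, subsequent computing stage, thread ID}.
struct ResultEntry {
    uint64_t result;
    uint32_t nextInstruction;
    uint16_t nextStage;
    uint16_t threadId;
};

// Simplified view of step 3 and following: the MU of stage n+1 notices a
// non-empty buffer, looks for an entry matching an instruction on its loaded
// pages and only fires once the needed operand has arrived.
std::optional<ResultEntry> takeIfPaired(std::deque<ResultEntry>& buffer,
                                        uint32_t wantedInstruction) {
    for (std::size_t i = 0; i < buffer.size(); ++i) {
        if (buffer[i].nextInstruction == wantedInstruction) {
            ResultEntry e = buffer[i];
            buffer.erase(buffer.begin() + static_cast<long>(i));
            return e;        // the second operand is handled analogously
        }
    }
    return std::nullopt;     // MU keeps waiting until both data are available
}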
  • new, so-called "interparallel sequences" of a thread can be run in several threads of a process, which all share an address space.
  • x = (-b ± √(b² - 4·a·c)) / (2·a)
  • complex equations can be processed in parallel in one thread and basic principles such as mathematical polynomial division (e.g. for estimating zeros in multidimensional systems) can deliver results faster and in a highly parallel fashion.
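As a small worked example of such an interparallel sequence, the quadratic formula shown above can be split into a shared sub-result and two sign variants; the C++ sketch below shows this sequentially and assumes a non-negative discriminant, whereas on the proposed architecture the two variants could run as parallel branches of the same thread on successive computing stages.

#include <array>
#include <cmath>

std::array<double, 2> solveQuadratic(double a, double b, double c) {
    const double discriminant = std::sqrt(b * b - 4.0 * a * c);  // shared sub-result
    const double denom = 2.0 * a;
    return { (-b + discriminant) / denom,    // sequence variant 1
             (-b - discriminant) / denom };  // sequence variant 2
}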
  • a so-called “split module” splits or separates the data and the instructions of the program code, especially in the L2 cache and/or in the L4 cache if there are mixed pages there.
  • the split module separates instructions and constants, and detects required memory locations for variables that must be inserted into the MUs. In order to avoid crashes as far as possible, dynamic memory allocation can be avoided during a run if possible.
  • the memory can be prepared in the form of dummy variables and the instruction words of the program code are assigned the corresponding, correct registration numbers of the data. If a is set off against b, it must be possible to find these variables via their location (see also the instruction format above).
  • the split module is preferably a software and/or hardware part of the CCPU.
  • the CCPU 2 assigns a thread tag on all pages or byte streams and variables involved in the caches of computing stages 5 in order to identify a data flow, and inserts the data pages in the computing stage stack separately from the program code stack (growing from bottom to top) into a logically separate data stack (growing from top to bottom) with the input variables of the program, and links the instructions to be executed with the already existing data on the data stack.
  • the least loaded computing stage 5 is preferably selected.
  • the CCPU 2 starts the program and leaves it to itself on the ring of computing stages. Since the CCPU 2 primarily only manages and uses instructions, it is generally faster than the switching requirements of the arithmetic units 3 and computing stages 5. To ensure that no results are lost, "dirty" pages are constantly written back to the CCPU 2 with low priority - or, especially e.g. in the case of exceptions, displacement of pages from the PMMU 46, expiration of a timer signal or an IO call to the CCPU 2, with a real-time priority.
  • the storing is preferably carried out automatically via the MU 4 with the stored tags.
  • the CCPU 2 is designed and configured in such a way that none of the involved stacks overflows, or memory pages are exchanged from the L4 stack 47 before an overflow, e.g. into a cache L1, L2, L3 of the CCPU 2 or into RAM 9.
  • the CCPU 2 is designed and configured in such a way that it prevents a thread from being prevented from executing due to lack of data and "starving". If, for example, no suitable data pair is found in an instruction ready for execution in a computing stage 5, the memory page is first swapped from the MU 4 to the instruction stack, i.e. a so-called paging is carried out, especially without such an explicit triggering by the CCPU.
  • each of the tags can have an instruction number or instruction field with a counter that records how long it has been stuck in relation to calculated calculation runs or FLOPs. This allows e.g. to determine an error tolerance and/or to find out if and where a deadlock occurs.
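A minimal sketch of such a stuck-counter, assuming a simple threshold test for flagging a suspected deadlock (the threshold and all names are assumptions):

#include <cstdint>

struct TagCounter {
    uint32_t instructionNumber;
    uint32_t stalledRuns = 0;   // incremented in each run in which no data pair was found

    bool suspectedDeadlock(uint32_t threshold) const {
        return stalledRuns >= threshold;
    }
};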
  • the clock frequency of a MU 4 subprocessor searching the cache for executable instructions can also be much higher than the CCPU 2 clock frequency, because smaller hardware structures can be optimized much easier - divide and conquer.
  • the division of labor in the processor architecture 1 according to the invention creates an extended complex pipeline.
  • the system according to the invention can especially be designed as an offer-demand system.
  • the solution according to the invention continuously links currently available data, and always processes possible and already existing combinations first and in parallel - especially if no dependencies or conflicts occur, also regardless of a sequence of instructions implied in the original program code.
  • the result can first be buffered in independently long and respectively optimized pipeline stages, and then entered into a transition cache between two computing stages 5 by a so-called Cache Manager of PMMU 46.
  • the next following computing stage 5 in the ring processes this result if all necessary data for the processing in this following computing stage are available.
  • the tape of this factory method is always automatically filled and empties itself into the warehouse where the data is waiting for the next processing step. It is possible to prevent pages from being deleted from the out-cache in order to find and process a data pair again - i.e. data that is linked together by the execution of an existing instruction.
  • control CPU 2 has already set a return instruction in all caches 47 on the ring if the result is to be written back and the program terminates legitimately. If an MU 4 sends the signal that a thread is completed, a system call is generated in the CCPU 2 that identifies the thread to be assigned and writes back all result pages, as well as releases all program code pages of the thread on all external mechanisms for overwriting and/or displacing. When this signal is received, the known expiration tag of the data to be stored is searched for in the external caches 47 and the control CPU 2 directs a redirection to its own L1 cache with one or more valid return addresses.

Abstract

The invention relates to a microprocessor architecture (1) of a digital computer which comprises a plurality of computing stages (5), each of which is designed with a computing unit having at least one computing core for the mathematical or logical processing of data according to an instruction set, and a management unit (4) (MU) which provides the data and a program code for the computing unit (3). The computing stages (5) are arranged and designed in such a way that data communication between the computing stages (5) takes place in a closed ring by means of the management units (4). The architecture also has a central processor core (6) with a control CPU (2) (CCPU), which is connected in a star configuration to each of the computing stages (5) via its MU (4). The CCPU is designed and configured to distribute the program code and associated data from an external memory (9) via the MUs (4) to the plurality of computing stages (5) in such a way that, during the processing of the program code, successive, dependent calculations are distributed in each case to successive computing stages in the ring, and the data to be processed are forwarded from one computing stage to the next in the ring during the course of their calculations in the closed ring.

Description

Processor
The invention relates to a processor architecture or a processor built with such a processor architecture according to the preamble of claim 1 and claim 12, and a method for data processing in a processor according to claim 13.
Requirements regarding the processing speed of programs and/or data by processors or digital computers, such as CPUs (Central Processing Units), DSPs (Digital Signal Processors), MCUs (Micro Controller Units), GPUs (Graphical Processing Units), TPUs (Tensor Processing Units), etc. are constantly increasing. Processor architectures are constantly reaching their physical limits. These processor architectures are usually improved primarily by increasing the clock frequency of the processors, reducing the size of the structures and parallel use of several processor cores. The performance of processor architectures can also be improved by architectural optimization of the flow control in areas such as pipelining and parallelism, prediction of memory access, caching, messaging systems, or by the divide-and-rule approach. Examples of such improved processor architectures can be found in US 4 644 466, US 5 289 577, US 7 454 411, US 4 972 338, US 5 907 867, US 6 115 814, US 2011/231 857, US 5 689 679, US 5 724 406, US 6 374 286, US 5 815 727, US 5 706 466, US 7 653 906, US 2004/111546, DE 37 07 585, etc.
Especially predictions, and the caching that depends on them, have brought an acceleration for many program runs. On the other hand, incorrect predictions made when caching with several cache levels for computing processes can quickly cancel out some of the advantages, since their elimination can be very time-consuming. In addition, previously unknown security problems have arisen with such architectures, which can hardly be solved in software, or only with great effort.
For the calculation and simulation of technical and especially chemical processes with many objects to be modelled (e.g. atoms and molecules, digital twins, finite elements, etc.) or for quantum mechanics with many different correct results in the area of complex numbers, it is advantageous to create a computer architecture which can execute many calculation variants in parallel (Multiple Instruction Single Data - MISD) and branched in the same thread on one calculation object. Mainframe computer systems with a network of hundreds and thousands of individual processors are often used, for example the "Earth Simulator", "AlphaGo" or others, but also highly specialized computing systems like the "Connection Machine" CM-1, CM-2 and CM-5, the "MasPar" MP-1 and MP-2, Berkeley Emulation Engine (BEE) or other solutions from Berkeley's RAMP (Research Accelerator for Multiple Processors), Cray computers like the T3E with its GigaRing architecture, etc. Also in image processing and deep learning architectures, massive parallelism is sometimes used, and on a smaller scale also in transputers and vector or field computers.
Not least due to scaling effects, however, processors with general purpose architectures such as the Intel Xeon or Core-i series or processors from AMD, ARM or Nvidia, which are usually based on the Von Neumann architecture and Harvard architecture and are based on primarily sequential processing, have become increasingly popular. This basic principle is not changed even by a ring bus introduced in the Cell or Sandy-Bridge architecture, for example, primarily in the sense of better scalability. One problem with Von Neumann architectures is that all data, both words and instructions, must be loaded and the results saved immediately in order to be available again. Different levels of smaller caches form the memory of each CPU core and a larger RAM forms the shared memory of all cores. A bottleneck often occurs at the point between the RAM and the higher cache levels, especially when new data needs to be loaded into a higher cache or memory page changes occur.
Classical RISC architectures are e.g. defined by the control flow process: instruc tion fetch, instruction decode, instruction decode/execute and memory/writeback. In parallel multiprocessor architectures, each CPU core usually processes one thread in a process individually, or two threads share a common computing unit (e.g. in hyperthreading) and, depending on the process, also a common address space. Parallel structures are only used, for example, when processing data fields in a field computer or vector computer, but the operand is the same for all data. Also in GPU processors, each core is assigned a similar task. The GPU has many cores for this purpose, which are kept rather simple, which receive instructions from a flatter memory hierarchy and which are used as huge field computers, so to speak, whose control flow is highly parallel. However, the control flow is parallel- dependent and therefore essentially the same for all data. In a RISC architecture it is only possible to a limited extent to interlink two running threads and their con trol flow in such a way that the further processing of an entire process with several threads and with several branches within which or even several different instruc tions are processed seamlessly from a program breakpoint onwards and without any loss of performance overall - regardless of how many branches still occur in the tree and are precalculated. A virtual concurrency of threads that occurs in this process is actually a sequential execution within sequential processes, even if a process can now be processed by threads on several processor cores. In order to be able to split data in a run-safe way, mutex functions or semaphores are required so that only one thread can access a data field at a time, thus ensuring that the result is unique. This makes it more difficult to model more complex real operations and processes, which both correspond to natural operations and can be techni cally realized with known, simple procedures. In reality, several causes can often lead to one reaction, and one cause can also lead to several reactions. However, such an architecture is not designed for such a bijective problem.
For example, in a RISC architecture - for example in the modelling of billions of molecules - many time-consuming memory accesses and branches are necessary to solve such a large task complexity sequentially according to the RISC process. On the one hand, the sequential processing is slow, and especially in the case of a wrongly assumed branching with such short calculation sequences and many branches, the process is slowed down even further. The cooperation between processors working in parallel cannot take place effectively, because the calculation of the influences between two molecules is always interrelated and the calculations influence each other.
Especially if only small code sections are calculated and the instructions cannot be synchronized, the interlacing of the processes in a multicore processor must constantly wait and be synchronized, e.g. by so-called abstraction labels. Especially if the manipulation of states only affects a very small object and is individual for many sub-objects, occurring patterns cannot be detected intelligently and evaluated for use to speed up the process. A similar problem can be found when using a GPU for such a task. While the GPU can perform many similar operations on a large data field in parallel, its specification does not support such small-scale branching, dependencies and interactions between the data and GPU cores in an efficient way.
Problem areas of known architectures include data/data stream, process and thread stream mapping, memory management, "out of order execution" with its buffer, due to non-real-time pipelining, and too flat branching calculations. In addition, communication and interference between tiny partial threads cannot take place effectively and causes constant calls to the operating system.
It is an object of the present invention to provide an improved processor with an improved processor architecture, in particular for a parallel processing of problems which are also manifoldly branched and/or have interacting partial calculations. The known concepts CPU-RISC, GPU and neural computer are to be improved, especially with regard to applications in which many small sub-problems as well as problems of branched tree structures are to be solved. Especially downtimes and waiting times of the computing cores should be reduced with the structure according to the invention. In addition to the architecture required for this purpose, a corresponding method for advantageous operation and for the processing of program code in such an architecture should also be provided.
It is a special object to provide a processor architecture which provides efficient parallel processing. In particular, this processor architecture concerns a microprocessor or CPU architecture, which is preferably designed on chip, die or semiconductor level.
According to the invention, these objects are solved by the features of the independent claims and/or by features of the dependent claims or these solutions are further developed.
A problem of current computer architectures is often a lack of flexibility, especially for the tasks described in areas such as simulations of complex real-world processes. The present invention proposes a computer architecture which has at least one processor core and forms ring-shaped computing stages consisting of computing elements such as ALUs, FPUs, field computers or vector computers with information transfer in a closed loop.
With regard to the processor core, the parallelism of processing at this level is no longer necessarily related to a structure of several parallel, fully-fledged cores, which means that preferably also a (in the prior-art partly excessive) communication between these fully-fledged cores can be omitted.
Several of the parallel-operating computing stages are arranged around the processor core in a ring structure, preferably directly connected to each other via data links, so that a, preferably direct, data exchange between the computing stages can take place. In the ring structure, each of the computing stages has a management unit which manages the data flow between the computing stages, the processor core and, optionally, also cache memories in slave mode under the central processor core.
An initial increase in performance can be achieved by, among other things, centralizing on a processor core, wherein a very simple but fully-fledged integrated control CPU can be designed as the heart of the processor core - also referred to as a CCPU ("control CPU") in the following. This CCPU can, for example, be designed with a simple RISC instruction set and the highest possible clock rate. The control CPU can be specially designed in such a way that it is not designed to increase the computing capacity, but is optimized for reaction speed, data throughput and/or energy efficiency.
One of the main activities of a CCPU designed according to the invention is to coordinate the forwarding of the continuously flowing data streams by filling in struction buffers, storing finished result sets, loading and setting operations by in terrupts and system calls, and controlling the dynamic forwarding of the data flow of a computing stage, e.g. in static field-path calculations. One activity of the CCPU is to distribute the program instructions of a program to be executed, for example similar to the application of a clock algorithm for a paging, to place the instructions one after the other on the buffers and subsequently on stacks of the computing stages and to manage, replace or delete them in real time, i.e. specifically clock wise or counterclockwise in the ring-shaped arrangement of these. This ring- shaped arrangement can have a uniform direction for each defined computing ring, but rings with opposite directions for each further organized ring can also be formed, for example in multiple MISD in a processor. In the case of multiple MISD, the CCPU communicates with the other CCPUs via buses of known type, e.g. similar to classical RISC MIMD architectures. However, hybrid technologies with one CCPU and several rings of different directions are also possible in special embodiments to create a full duplex ring and to enable duplex in general.
A further activity, which in the processor architecture according to the invention can be performed by a correspondingly designed CCPU, is starting and stopping processing operations - also in relation to flow control - and/or recording error messages (exceptions) from the arithmetic units, and, optionally, executing error programs, which is preferably executed independently of all currently running computing stages. The entire computer architecture in accordance with the invention must therefore hardly ever stop or completely preempt pages.
According to the invention, the parallelism of the processes takes place in the individual computing stages of the surrounding computing elements and makes use of a concatenation of successive instructions and program sections via the functionalities used in each of the computing stages, in combination with multiple use and coordinated data flow connection of the same and use of the ALUs, FPUs, etc. contained in the computing stage or following it.
In other words, an input is transformed into an output - similar to the logic circuit in an FPGA chip. The management of such an input-output of a process according to the invention (comparable to the UNIX principle realized on a higher abstraction level in pure software) does not have to be handled directly according to the invention by an operating system, but takes place primarily on the hardware according to the invention, which allows e.g. significantly shorter access times.
An embodiment of a computer architecture in accordance with the invention can also be described as a decentralization of the computing core into a computing ring, wherein the most important execution hardware for the actual calculations is always explicitly multiple, directly and indirectly, parallel and distributed. Thus, for each general computing stage with its computing elements, several instructions can be processed simultaneously in the computing units. Similar to a RISC-V ap proach, several RISC instructions from so-called "Standard Combination Instruc tions" or Standard Combination Execution - (SCE) can be decoded by a micropro gram per computing stage, which increases the possible representation complex ity and recombinability for the next stage at the output for immediate operations. Furthermore, the possible concatenation or, in the case of a stall bypassing to itself, a kind of variable pipeline, recombinable in real time and variable in length, is created.
A routing of the data in the ring can be interpreted in real time by the control CPU according to the invention - preferably designed as a real-time processor - with one or more own control CPU cores from an instruction-machine code. Special operations from the CISC and/or RISC data set as well as the extension of the latter by pre-programmable "standard combination instructions" can be commanded, for example, among others:
• Conditional jumps to subsequent computing units of a defined arithmetic level with multiplexed branching;
  • Loops via jumps back to the next computing stage containing the next instruction;
• Branching of the data stream to several units;
  • Merging of start-related computing streams to a list concatenation or multidimensional fields;
• Immediate operations on immediate output;
• etc.
These instructions can be either 3 address instructions known under RISC using active coordination, or extended as 4 or 5 or 6 address instructions, e.g. in the form of {w1, w2, result, physical target input address computing stage, <register number target instruction>, <thread ID>} with passive coordination of the CCPU.
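Purely as an illustration of the extended 6-address form named above, such an instruction could be laid out as in the following C++ sketch; the 16-bit field widths and field names are assumptions, only the six components come from the text.

#include <cstdint>

// Illustrative layout of the extended form
// {w1, w2, result, physical target input address of the computing stage,
//  <register number of the target instruction>, <thread ID>}.
struct ExtendedInstruction {
    uint16_t w1;           // first source operand / register
    uint16_t w2;           // second source operand / register
    uint16_t result;       // result register
    uint16_t targetInput;  // physical input address of the target computing stage
    uint16_t targetRegNo;  // register number of the target instruction
    uint16_t threadId;     // thread the instruction belongs to
};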
In an exemplary embodiment, an overall processor in accordance with the inven tion is thereby designed with a number N of arms or computing stages. The present invention relates specifically to a microprocessor architecture of a digital computer, in particular formed on one, preferably only, semiconductor chip. This has a plurality of, preferably similar, computing stages which are each formed with a computing unit having at least one computing stage decoder core which is formed for the mathematical or logical processing of data in accordance with an instruction set, in particular a generally extended RISC instruction set, and a man agement unit (MU) which is formed in order to provide the data and a program code for the computing unit assigned to it.
The computing stages are arranged and designed in such a way that data communication between the computing stages is established in a closed ring by means of the management units, in particular in a closed daisy chain ring with point-to-point connections between the management units. In special embodiments, point-to-multipoint connections can also be formed.
The microprocessor architecture also features a central processor core with a con trol CPU (CCPU), which is connected in a star configuration to each of the com puting stages via their MU. This is designed and configured in such a way that the program code and data associated therewith from an external memory are distrib uted via the MUs to the plurality of computing stages in such a way that, during the processing of the program code, successive, dependent calculations are dis tributed in each case to computing stages following one another in the ring, and the data to be processed are forwarded from one computing stage to a next in the ring during the course of their calculations in the closed ring. The CCPU is thus designed to distribute the program code and data associated therewith from an external memory via the management units successively to the plurality of com puting stages in such a way that, when the program code is processed, the data to be processed are forwarded from one computing stage in the ring to a next computing stage in the ring during the course of their processing in the closed ring.
From the program code of a single thread, a first instruction can be executed on a first of the computing stages and a subsequent second instruction on a second of the computing stages (or only in special cases - such as a single loop, etc. - on the same computing stage), wherein data of an intermediate result is passed on between the first and second computing stage. In other words, the present inven tion distributes a program code comprising instructions which process input data into output data, wherein the instructions are distributed over the computing stages in such a way that the output data of a first instruction on a first of the computing stages is passed over the ring as input data for a second instruction on a second of the computing stages connected thereto in the ring. For exceptions such as iterations, two decoded and thus several successively executed instructions can optionally also be permitted on the same computing stage by direct bypass. Pro cessing of the program code is carried out, so to speak, successively along the ring, wherein the data to be processed are each passed on successively along the ring between the computing stages - e.g. in the form of a data stream in direct byte stream format or in pages (page fields) - so that the sequence of steps of the program code is distributed over the computing stages with a passing on of the data to be processed along the closed ring from one computing stage to a next computing stage. In total, more buses than 1 :N connections can be implemented between two computing stages in order to load pages over longer periods of time in a highly parallel manner and controlled by the MU into a reserved L4 cache field of the next computing stage while individual byte stream data continues to flow, i.e. without blocking the system.
In one embodiment, the MUs can each be designed with a Paged Memory Man agement Unit (PMMU) and an individual L4 cache memory, which are designed in such a way that these, preferably independently of the computing unit, in particular in the event of a thread change, temporarily swap out memory pages not required by the computing unit into the individual cache memory and/or reload required memory pages from the individual cache memory, which can, for example, consist of a register of the instructions to be executed next, as expected, via the data.
In one embodiment, the CCPU can proactively preload the memory pages into the respective individual cache memory via the MUs, according to the distribution of the program flow, with a corresponding part of the program flow and/or the data in advance, in particular independently of the processing of the program flow by the computing unit and of the forwarding of the data between the computing stages. The CCPU may have at least one level of local cache memory, preferably two or three or more levels of local cache memory, and the CCPU may be configured with a specialized processor core whose instruction set is designed to be opti mized for fast management and transfer of data. Specifically, the CCPU may be configured without processor cores to directly process the data and execute the program code itself. In particular, the instruction set can be designed as a RISC instruction set for distributing program code and data to the computing units, taking into account branch predictions in the program flow.
Depending on the embodiment, one of the computing units is designed with at least one computing core of
• an Arithmetic Logical Unit (ALU),
• a Floating Point Unit (FPU),
• a field computer unit (FR) and/or
• a logical unit (LE).
The computing unit is further equipped with an execution engine (EE), which is designed to load the register sets of the computing cores from a local stack, and optionally with an interpreting engine (IE), which is designed to predecode com plex standard combination instructions into atomic instructions of the computing core. Each computing arm has at least one EE. In particular, an Execution Engine (EE) and/or an Interpreting Engine (IE) can also be designed in each case in a special embodiment for each of the computing cores. The EE and optionally also the IE are connected to each other and to the associated MU, preferably via a common internal bus of the respective computing stage. However, this internal bus is formed separated from the ring via the MU. In a preferred embodiment, such an IE can be a part of a computing stage, especially as a detail of a processing level of the MU. In the computing stage, the IE interprets instructions from the MU con figured via the ring and, optionally, also places decoded instructions on the arith metic units using the EE. The IE can also assume the task of reproducing com pressed instructions - so-called SCE instructions - in machine code, e.g. in the form of microcode, through frequently used function chains stored in its pages. The known function chains are usually only a few instructions long, and can be transmitted via a communication bus from the MU to the next MU in the ring and - e.g. with a chained list processed from behind - be placed on the next MU. In another embodiment, an IE can additionally or alternatively also be part of the CCPU, so that the CCPU (at least partially already) uses often used instruction chains correctly distributed, instead of passing the instruction chains through an MU to a following MU in the ring.
The processor architecture can preferably be designed and configured in such a way that during program execution an essentially continuous data flow of the user data to be processed takes place along the ring between the computing stages and, essentially independently thereof, a control flow from the CCPU to the com puting stages. Here, for example, the MU can be designed to forward the data flow and the control flow, wherein the MU, in particular with a microprogram code on a control logic or on an updatable FPGA chip, provides management and coordina tion of data flow and the control flow via communication with the CCPU, each IE and IE, and the computing stage on the basis of the program code of a program to be executed.
The MU can be specially designed and configured to continuously mark in a local program cache those instructions of the program code for which the data required for their processing are entered in a local data cache, and to then make these marked instructions with the required data available to the computing unit for exe cution in a local stack. In particular, the marked instructions can be loaded by one or more local EEs of the calculation unit or the MU of the computing stage into the local stack of an arithmetic unit designed for the respective marked instruction. Individual instructions, not entire threads, are especially marked. Depending on data availability and proven further continued correctness of the code, the individ ual instructions can be randomly mixed and matched in their execution across threads.
In one embodiment, the computing stage can be designed and configured in such a way that it has an instruction stack which is filled by the CCPU and processed by the computing unit under coordination of the MU, and an input data cache which is initially filled by the CCPU and during the execution of the program code via the ring by another computing stage and processed by the computing unit. The MU can be designed with a search run, which provides the correct computing unit with those instructions in the instruction stack for which instructions all necessary data are located in the data stack, in particular by means of the Execution Engine (EE).
An embodiment of the CCPU can be designed and configured in such a way that it interprets the instruction-machine code of the program code from the memory and the distribution of the program code to the computing stages takes place by means of conditional jumps to subsequent computing stages with multiplex branching; loops via jumps back to the computing stage containing the next in struction; branching of the data stream to several units or merging of start-related computing streams into a list concatenation.
In other words, the invention relates to a digital computer processor architecture, in particular on a single chip or die, which is designed with a central control CPU (CCPU) and a plurality of computing stages. All the computing stages are con nected to the central CCPU in a star configuration and the computing stages are connected to each other in a closed data ring. The computing stages have a man agement unit (MU) which is designed and configured for the administration and forwarding of program code and data between the computing stages and/or the CCPU. The CCPU is designed and configured in such a way that it distributes a program code, in particular within a single thread, for execution from an external memory to the computing stages in such a way that a first partial calculation of the program code is distributed to a first of the computing stages and a subsequent second partial calculation is distributed to a second of the computing stages fol lowing in the ring to the first, and wherein MU is configured in such a way that result data of the first partial calculation are passed on as input data for the second partial calculation in the ring between the computing stages.
The invention also relates to a microchip or processor, e.g. as a die, semiconductor chip, microprocessor chip, formed with a microprocessor architecture described herein. In the same way, the invention also relates to a method for executing a program code, in particular parallel and branched, on a digital computer, in particular with a processor architecture as described herein. The method at least comprises:
• Reading of the program code and initial data from an external memory by a control CPU.
• An analysis and distribution of parts of the program code and the initial data to several computing stages, which are connected among themselves to form a closed ring. The distribution takes place via a direct connection between the control CPU and each of the MUs of the computing stages. The program code is distributed in such a way that interdependent, successive calculations are distributed to computing stages which follow each other in the ring, and
• A parallel execution of the calculations in the computing stages in the closed ring, wherein the results calculated by one computing stage are transferred as intermediate results to a following computing stage (and so forth), which further processes these intermediate results with the part of the program code distributed to it. In particular, the execution is carried out independently of the CCPU.
  • A write back of final results and/or error messages of the execution of the program code from the computing stages and storing them into the external memory by the CCPU using the MU in slave mode.
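As a rough, purely illustrative control-flow sketch of the distribution step in the method above, the following C++ fragment assigns successive, dependent code blocks to successive stages of the ring; the simple modulo round-robin, the placeholder types and all names are assumptions, and on the real hardware the phases run concurrently rather than in a loop.

#include <cstddef>
#include <vector>

struct CodeBlock { /* one block of interdependent, successive instructions */ };
struct Stage     { std::vector<CodeBlock> assigned; /* computing stage with its MU */ };

void distributeAlongRing(const std::vector<CodeBlock>& program, std::vector<Stage>& ring) {
    if (ring.empty())
        return;
    // Successive, dependent blocks go to successive stages of the closed ring,
    // so that intermediate results can flow from one stage to the next.
    for (std::size_t i = 0; i < program.size(); ++i)
        ring[i % ring.size()].assigned.push_back(program[i]);
    // Execution then proceeds in parallel on the ring; the CCPU only reads
    // back final results and error messages and stores them externally.
}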
The transfer of data in the ring and/or between the CCPU and the computing stage, as well as the administration of the program code and the data in the computing stages can be carried out by a management unit (MU), in particular wherein the computing stage has for this purpose a paged memory management unit (PMMU) as an associated part of the MU supporting the processing sequence and a smaller local cache memory for the administration of data and program code.
The execution of the program code in the computing stages can be carried out by searching for executable program instructions and associated data in the local cache memory and making them available on a program and data stack for processing by an arithmetic unit of the computing stage, which is carried out independently of the calculations of the arithmetic unit by an execution engine (EE). The method according to the invention and device according to the invention are described in more detail below by means of concrete embodiment examples, which are shown schematically in the drawings merely by way of example, wherein further advantages of the invention are also discussed. The figures show in detail: Fig. 1 shows a first schematic representation of an exemplary embodiment of a processor architecture according to the invention;
Fig. 2 shows a second, more detailed illustration of an exemplary embodiment of a processor architecture according to the invention;
Fig. 3 shows a schematic representation of an exemplary construction of an embodiment of one of the computing stages in the ring structure according to the invention;
Fig. 4 shows an exemplary embodiment of a management unit in the present invention;
Fig. 5 shows an embodiment of the present invention in the form of a block diagram.
The depictions in the figures are for illustration purposes only and, unless explicitly stated, are not to be regarded as being exactly to scale. Identical or functionally similar features are, as far as practicable, consistently marked with the same reference numerals and, where appropriate, distinguished by a letter as an index. The diagrams shown in each case show the basic technical structure, which can be supplemented or modified by a person skilled in the art according to general principles. The terms essentially, substantially, or at least approximately express that a feature must preferably, but not necessarily, be 100% exact or exactly as literally described, but that even slight deviations are permissible - not only with respect to unavoidable practical inaccuracies and tolerances, but especially, for example, to the extent that the technical effect essential for the invention is retained as far as possible.
Fig. 1 schematically shows a possible example of an embodiment of a processor architecture 1 according to the invention. This processor architecture 1 is specifically designed as a processor, in particular as a single IC (Integrated Circuit) component or as a single chip, preferably on a common semiconductor substrate. In its basic structure, the processor architecture 1 has a processor core 2 with a control CPU (CCPU) 2 and a ring of several computing stages 5 with computing units 3.
The number of the respective components shown here is a possible example, but not necessarily a limiting factor. In particular, the computing stages 5a, 5b, 5c, 5d, 5e, 5f arranged in the ring can also be present in larger numbers, especially with a number of 2 to n computing stages 5, which are connected via data lines 40/45 to form a closed ring 7. A transfer of data in ring 7 takes place in each case independently of the other computing stages directly from one computing stage 5 to the next, wherein a management unit 4a (MU) of computing stage 5a carries out the data transfer independently of the activity of computing unit 3a. It is preferable that neither CCPU 2 nor computing unit 3 is loaded during the execution of such a data transfer in the ring. In accordance with the invention, ring 7 is designed unidirectionally for data and directionally within the ring - in the example shown, approximately clockwise, depending on the embodiment - but always unidirectionally for data and directionally for instructions, for example in the form of a daisy-chain of individual and preferably autonomous connections between the computing stages 5. The CCPU 2 does not use ring 7 for communication with the computing stages 5, but is connected separately from the ring in each case directly to the computing stages 5, as illustrated by the connections 48a, 48b, 48c, 48d, 48e, 48f. Preferably, this direct, star-shaped connection of the CCPU to the computing stages is also made via the MU 4, wherein only the CCPU 2 is designed with an interface 91 to an external RAM main memory 9 and/or to external bus systems for the processor architecture 1. The computing stages 5 have only relatively small, internal cache memories 47, preferably divided into separate program and data caches or stacks.
The structure of an embodiment of a processor architecture 1 in accordance with the invention especially comprises
- a microcontroller as control CPU 2 (CCPU),
- several sets of independent computing stages 5 with highly parallel ALU, FPU, vector computer and field computer units, each at least once in full or shortened version, as well as
- multiply designed multiplexers 4 as a link between the computing stages 5 among each other in a closed ring, in order to correctly feed the distributed sequences of a thread to the next computing stage 5, as well as a star-shaped, direct connection from each of the computing stages 5 to the control CPU 2.
These are preferably formed on a common chip or semiconductor die, or at least in a common electronic component, especially for instance as a single processor (with the exception of the RAM 9 shown here). This is provided in particular in differentiation from systems which have such components as independent systems which are distributed and networked over several cabinets or even locations, or with respect to systems which have such components in the form of a plurality of dedicated chips on a common circuit board. In a special embodiment, the physical arrangement or topology of the above-mentioned components on the chip area can also be designed in such a way that the CCPU 2 is arranged in a central chip area, and the computing units 3 are arranged around the CCPU 2 at the edge of the chip area. In particular, in addition to the logical arrangement, a physical arrangement of the components can also be formed at least partially, which is similar, for example, to that shown in Fig. 1.
Fig. 2 shows a more detailed example of an embodiment of a processor architecture 1 according to the invention, in which several computing stages 5 are connected via direct data lines 40/45 between the respective computing stages 5 to form a closed ring 7. For the administration of this ring 7 of computing stages 5 and for the provision of data and program code for the computing stages, a central control CPU 2 (CCPU) is designed.
In other embodiments, especially in larger rings with considerably more computing stages 5, the computing core 2 may also be multi-core where appropriate, i.e. have several CCPU processor cores 20 working in parallel, which maintain the overall principle described here and share the work of the computing core 2 in order to further increase performance and to be able to serve all computing stages 5 in a timely manner, e.g. by further reducing the waiting times of several system calls.
As a basis, in particular with regard to the subordinate low-level implementation units in the present invention, known basic designs can partly be used. For example, an ALU (arithmetic-logical unit) or other sub-components within the scope of the present invention can be designed on the basis of known IP cores as prefabricated functional blocks and/or, if necessary, adapted for their specific use within the scope of the chip design. Some of these basic designs on which the present invention can be based are briefly outlined again below in a non-exhaustive manner, especially also in relation to the present invention.
Known CPUs with RISC (Reduced Instruction Set Computer) architecture, for example, rely on loading, interpreting and executing each instruction and storing the result locally in a cache next to a register entry. A classic Von Neumann computer with RISC architecture, for example, has a concise instruction set with a simple decoding phase, wherein the program flow can be additionally optimized by "out of order execution", pipelining, predictions, caches and bypass. On the other hand, complex operations thus require many instructions and/or clocks. Since such a RISC computer core can only process one task at a time, a so-called "stall" can occur in large programs. Especially when wrong decisions are made in the predictions, when threads change, and because results can only be stored in memory one after the other, program parts occasionally have to wait for each other. Mutexes or semaphores are often used for this purpose, wherein much computing power and processing time of the computing cores is lost to the storage, organization and administration of data, as well as to calls of an operating system. Since a certain instruction only addresses a certain part of an arithmetic unit, there is also during program execution almost always a large number of currently unused transistors in the arithmetic units. Especially for an efficient processing of many small, entangled computational sequences, a RISC architecture is often too linear and not distributed enough.
More complex tasks that would require multiple steps in the RISC architecture are provided as a single instruction in an alternative CISC (Complex Instruction Set Computer) architecture. An example of a CISC architecture can use collected instructions to quickly describe a complex sequence of operations to be performed, but the complex instructions also require more time to decode the program code. This, as well as many partly redundantly designed instructions, limits the efficiency. In many cases, coprocessors are also used for special tasks (e.g. for an out of order execution) as well as a distribution of tasks to one or more FPU (Floating Point Unit), ALU (arithmetic logic unit) or other specific logic units for instruction set extensions like SSE (Streaming SIMD Extensions), AVX (Advanced Vector Extensions), etc. in the processors. Also widely used are, for example, special GPU (Graphics Processing Unit) architectures in which several highly parallel processors referring to the same main memory, with the same overall task, each process a different block of data. These GPUs can also be used for general program parts within the framework of CUDA (Compute Unified Device Architecture), Vulkan or OpenCL technology. In a GPU, however, the micro-components of the individual cores are not linked with each other and cannot optimize each other.
To increase the processing speed, technologies such as data buffers, pipelining, DNA microprograms, multicore processors, paging, prediction, multilevel caching, message systems for multiprocessors, real-time processors, processes, threads and hyperthreading are also used. Many of these or similar known approaches are in principle also applicable in the context of the present invention at a suitable place, but the present invention suggests a new, basic processor architecture. For example, in the present invention an ALU, FPU or even the CCPU can be designed according to - or similar to - known structures. For example, a CCPU 2 or a computing unit 3 can be designed in its basic concept at least partially as an implementation of a RISC or CISC architecture adapted according to the invention.
In other words, in an embodiment of a processor architecture 1 according to the invention, the load/store operations can largely be carried out within the ring 7 between the computing stages 5 with the aid of the MUs 4, as well as by a CCPU 2 (preferably largely autonomous) designed specifically for this purpose, while at the same time all computing stages 5 process the program code and the data largely independently of this and to a large extent parallel to each other. Especially in comparison to previously known approaches, the present invention can, for example, achieve improvements in one or more of the following problems. According to the invention, a branching in the program flow creates, for example, a new sub-thread (here, in addition to this designation, also called tree thread or inner flow) with a starting position and fixed variables at this point in time to a code which is still fixed but branched. Such a surjective sequence must be operated by the processor, but the number of processor cores is limited and thread changes should be avoided if possible, as such a change would take many CPU cycles and thus valuable time.
According to the invention, a new level of abstraction can be created. For example, even in a field computer, it has so far hardly been possible to variably calculate fine-granular complex tasks with many input variants to be processed without reaching the structural limits of the field computer architecture. With a processor architecture that is in accordance with the invention, especially such tasks can be solved advantageously. In this case, embodiments according to the invention can, for example, be designed as described below.
In one embodiment according to the invention, all caches with their runtime variables can be cached locally independently in the computing stages 5, and the results can be kept there during operation. An initial distribution and loading of data and program code from the RAM 9 into the computing stages can be carried out under control of the CCPU 2 via the MUs 4 of the computing stages 5, especially while the computing units 3 of the computing stages 5 are still busy processing other tasks. When the program code is processed, the data are then shifted along the ring from one computing stage 5 to the next, wherein each of the computing stages 5 processes autonomously executable parts of the program code which are applicable to the data currently available in the computing stage.
A locally distributed, multiple stack for variables to be passed on can be formed in each case between the computing units 5 and a registration unit with a controlling microcontroller. The registration unit is designed in such a way that it can set breakpoints for decision points and find its way back to a thread of the respective other decision over the entire system - for example via "labels" - while the other decision is processed in a highly parallel manner without necessarily discarding the results up to that point. In other words, a result tree memory is formed.
The central CCPU microcontroller 2 primarily only takes over interrupts as a function for the entire processor 1. As the central CCPU 2, a simply constructed but very fast CPU core can provide the organization of threads, starts and storage, as well as branching of the processes, and can recognize and solve jams and conflicts. Accordingly, the CCPU 2 can be specially (and preferably only) designed for these tasks, e.g. it can have an instruction set designed for the above-mentioned organizational processes, in particular it can dispense with structures that are not required for such tasks and/or it can have special structures for high-performance processing of these outputs.
In a further developed embodiment according to the invention, it is possible, for example, to parallelize several Multiple Instruction Single Data (MISD) main processes over it as a result, and thus to form a Multiple Instruction Multiple Data - Multiple Instruction Single Data (MIMD-MISD) system (according to the Flynn classification).
According to the invention, not only two decisions can be calculated continuously, but almost any number of decisions can be processed continuously, either time-dependent or time-independent. Although the time-dependent calculation is still bound to the number of parallel logical units (ALU) (which can be scaled as desired in a wide range), the overall system according to the invention is more performant than classical architectures for many tasks. If a sufficient number of parallel logical units is not available, time-independent processing can take place one after the other on the same logical unit. However, according to the invention, there is no need for a time-consuming thread change, but the CCPU 2 merely provides corresponding memory pages in the local caches, which can also take place, for example, during the preceding processing until the accumulated amount of buffered data at the input has been completely linked with the correct instructions and processed.
In one embodiment of the invention, an administration for virtual and locally constantly reassignable memory pages can be formed, which automatically assigns thread branches and efficiently stores results bijectively with a bijectively defined code line and memory location (or several code lines and memory locations in an automated field), as well as loads required code pages for different code sections and/or splits them up among the computing units 5 according to the time required. The administration is designed in such a way that the caches assign code pages to a processing line, and no longer to a specific physical processor core, as is usually the case.
According to the invention, in the program flow on a processor architecture according to the invention, results of a first computing stage 5a are always placed on top of a stack input of the next computing stage 5b following in the ring, in a kind of "round robin" procedure in a uniform sequence.
In this case, in an embodiment of the processor architecture 1 according to the invention, it is also possible to dispense with a uniform clock across the entire processor architecture 1. Each of the computing stages 5 can process its input stack with data as fast as it is capable of doing so, i.e. for example, with an individual clock. In addition to the data to be processed, the instructions for processing the results and data are placed in parallel on an instruction stack for each of the computing stages 5. Where necessary, the processor architecture in accordance with the invention can also implement appropriate transition structures, which are designed to work between components of the processor architecture independently of the clock via buffers and signals.
In particular, the processor architecture 1 according to the invention can be designed in such a way that several distributed sequences of a thread are distributed over simple computing stages 5 and processed in a sequence according to a dynamic priority pattern which offers "out of order execution", in particular to model real-time processes and physical processes, especially according to an "action-reaction" principle.
In summary, an embodiment of a computer architecture 1 in accordance with the invention provides a bijective computer architecture which works on a main data stream for each processor 1, but which can process a data flow several times in parallel by means of different instructions, preferably adaptively to the modelling of the problem in real time. The computer architecture 1 is preferably not clock-dependent and provides a combination of properties of field and vector computers. Specifically, the computer architecture 1 according to the invention provides a new approach to how results are temporarily stored while maintaining a bijective overview for the system.
In other words, an embodiment according to the invention can also be described as a combination of an arbitrary or binary representable number of arbitrarily addressable computing stages 5, comparable to a kind of honeycomb field. The first and the last column of the computing stages 5 are in turn linked via the output of the last one to the input of the first one, thus forming a ring 7. Ring 7 is thus formed around the control processor CCPU 2, with the latter in the middle, wherein ring 7 and middle refer primarily to the interconnection arrangement; other geometric arrangements can also be provided in other embodiments, in particular those in which (e.g. within the scope of chip area or timing optimization) another physical arrangement is formed, which nevertheless implements a logical arrangement of the components in accordance with the invention. In a preferred embodiment, the processor architecture 1 according to the invention is also physically arranged on a semiconductor die in its basic structure similar to that shown in Fig. 1. According to the invention, the computing stages 5 always result in a closed ring in their data flow, and the control CPU CCPU 2 has a star-shaped connection to each of the computing stages 5 for their control and management, via which star-shaped connection also an initial loading of program and code parts for execution from an external memory 9 onto the computing units 5, as well as a retrieval of results or partial results of the program execution from the computing units 5 to the memory 9, takes place, which is controlled by the CCPU 2, preferably largely independently of a current processing of a current program code in the computing units 5 and/or of the flow of data along the ring.
In particular, a Management Unit 4 (MU) is designed as a preferably intelligent - i.e. programmable or configurable with its own instruction set - link between each of the computing stages 5, for example in the form of a configurable or programmable logic unit or as an FPGA. Thus, a MU 4 is assigned to each computing stage 5, preferably arranged in the ring at its input for data. For example, the MU 4b of computing stage 5b provides a link between
- the preceding computing stage 5a in the ring (i.e. especially its MU 4a),
- the next computing stage 5c in the ring (especially its MU 4c),
- the CCPU 2, and
- the local cache 47 of computing stage 5b.
The MU 4 is designed so that
- along the ring, the data to be processed is primarily transferred between the computing units 5 for successive processing along the ring,
- between CCPU 2 and the local cache 47, program code to be executed by the respective computing unit 5 is loaded into a program buffer of computing stage 5,
- between CCPU 2 and the local cache 47, initial data is loaded by the respective computing unit 5 into a data buffer of computing stage 5 or result data is read back to CCPU 2,
- data and program code are transferred between computing unit 3 of computing stage 5 and its local cache 47, and
- control and configuration data from the CCPU 2 to the MU 4, the PMMU 46 and/or the processing unit 3 are transmitted, or status and error data from these components are transmitted to the CCPU 2.
The moving and/or loading of data is preferably carried out in the form of blocks, also called memory pages or pages. The MU itself only works in one direction in the ring to process data. For a merging of two thread parts, the MU can, with a bypass function, transmit data - and thus intermediate results stored at its input which are required for this merging at this stage of the calculation - to any other MU, but only without further processing of the data. In one embodiment, such a transmission runs especially to the subsequent computing stages n+1 to n+m, wherein r is the number of computing stages, m<=r/4, and a data connection runs once through the middle of the ring from n to n+r/2. In one embodiment, such a task of bypass data forwarding can also be performed by the CCPU, especially since the CCPU coordinates, approves and manages such operations anyway. The number of computing stages 5 in circulation is shown here as an example with the four computing stages 5a, 5b, ..., but this number can be extended at will.
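A minimal sketch of this bypass rule, under the assumption of zero-based stage indices counted modulo r, could look as follows; the function name and types are purely illustrative and are not part of the claimed architecture.

#include <cstddef>
#include <vector>

// Possible targets of a bypass from stage n in a closed ring of r computing stages:
// the following stages n+1 .. n+m with m <= r/4, plus one connection across the
// middle of the ring to stage n + r/2 (all indices modulo r).
std::vector<std::size_t> bypassTargets(std::size_t n, std::size_t r) {
    std::vector<std::size_t> targets;
    const std::size_t m = r / 4;                 // maximum forward distance in the ring
    for (std::size_t k = 1; k <= m; ++k)
        targets.push_back((n + k) % r);
    targets.push_back((n + r / 2) % r);          // single connection through the ring middle
    return targets;
}
// Example: with r = 8 stages, stage n = 1 may bypass to stages 2, 3 and 5.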
Several instances of processor architectures 1 according to the invention can be combined as Multiple Instruction Single Data (MISD) to form a Multi-MISD, optionally on a single chip, wherein data streams are managed or mirrored globally. For example, the program execution can be designed in such a way that each processor architecture 1 processes only one process. Furthermore, time slice models can be used, but the entire processor architecture 1 and also the MU would have to switch the currently held pages. A chip with several processor architectures 1 according to the invention can switch to a multi-process or multi-thread mode by dividing processes on the instruction stack that are to be actively operated into address spaces and forming subsets from the address spaces on the RAM as long as the MU supports this, whereby, for example, an ordinary multicore processor can also be emulated. These are not distributed threads with a decision tree, to which special reference was made, but real optional linear or branched processes already known at present, which run analogously on the structure. Due to the assignment technique, the processes can still be run for the threads subordinate to processes, even if there are several processes.
An embodiment of a process according to the invention can have the following aspects, for example:
1. An embodiment of an MISD computer according to the invention can provide coordinated and complex pipelining over an extended hierarchy of a control processor 2. The data throughput can thus be massively increased, since the program flow is not interrupted continuously, but only by a system call when required.
2. The calculated results are coordinated, but are only written back when they are needed. There is no obligation to write back, but only a small number of registers can immediately notify about a change of state. The results can remain in circulation until the data stream no longer needs to be manipulated by instructions and is terminated. A possible temporary write back is only for safety reasons, but is not absolutely necessary. The CCPU 2 does not use the intermediate results; only in the case of a crash or the like, for example, can they be used to restore the system. Only the correct sequence and connection of the instructions, the loading and saving, as well as I/O are the tasks of the CCPU 2. As a result of this outsourcing, the CCPU 2 and its internal logic circuits can be designed comparatively small. At this point in the design of the CCPU the focus can be on consistency rather than speed.
3. According to the invention, error codes can only form further reportable fixed branches which can be easily integrated into the regular process.
4. An embodiment of a processor architecture 1 according to the invention can thereby form a highly variable field computer in the broad sense, with the possibility of separating and coordinating threads on the same processor 1 into sub-threads, as well as merging them again. Full-fledged processor cores are no longer required for a further thread. This saves space and valuable chip area.
5. In one embodiment of a processor architecture 1 in accordance with the invention, a complete field computer can be simulated and variable handling with similar performance can be achieved. In accordance with the invention, e.g. problems of computation for quantum physics and for the research of quantum computers can be solved more efficiently.
6. According to the invention, "Out of Order Executions" can also be omitted, since only just loaded and valid instructions are processed, wherein the method for distributing the program code and data in the processor architecture 1 is based on this search property of loaded and valid instructions. Since a high throughput is assumed, the compiler still has the task of optimizing as much as possible. As long as executable data and instructions are still loaded, there is a delay, but no loss of processor throughput.
7. In an embodiment in which a refillable FIFO queue is set for each arithmetic unit and resources are organized within these instructions, bypassing and forwarding can be optionally implemented for each computing stage 5. Compared to hyperthreading, fewer extra modules are required on the chip with comparable effectiveness. Furthermore, the instruction fetch phase amounts to a processing process of the CCPU and is omitted in computing stages 5. There are effectively only 3 or 4 roughly distinguishable phases for the pipeline and all of them are distributed in the individual components and can be executed in a highly parallel manner. This distribution makes it possible to create effective, further subdivided pipelines, which increases the overall performance of these simplified components for the remaining 3 or 4 individual phases of the control and/or simplifies the design. A plurality of smaller and simpler control units according to the invention with their arithmetic units are easier to design and optimize than a large, complex control unit, especially if the smaller control units can be used several times in the same form.
8. According to the invention, there are - in contrast to field computers - no performance losses due to calculation blocks that are not completely full. According to the invention, a continuous filling up by the CCPU 2 takes place, in particular depending on the filling level of the instruction cache of the computing stages 5.
9. The entire processor architecture 1 can be designed with an external interface like a conventional processor. The evaluation of the speed of the individual cores often becomes less relevant. Instead, a throughput of the computing stages 5 is evaluated. Furthermore, the number of parallel running threads is effectively limited directly by the number of individual processors (ALU, FPU, logical units, field computers and vector computers), which can even increase the parallelism capability of an architecture 1 in accordance with the invention one hundred fold. For example, in the case of 4 computing stages 5 with 8 arithmetic units of each type, in the best case up to 32 threads are active simultaneously, wherein such a processor architecture can be implemented on a chip area that corresponds approximately to that of a classic single-core processor.
In summary, the efficiency and utilization of an embodiment of a computer architecture 1 according to the invention is anchored in its structure, especially with regard to calculation tasks with a high load and with a high number of calculation processes.
Fig. 3 shows an example of an embodiment of a computing stage 5 according to the invention with a computing unit 3 and a management unit 4 (MU) assigned to it.
According to the invention, an independent forwarding of data and instructions for the adaptation of the runtime characteristics is carried out, for example with an adaptation of the program code page distribution at the exit and re-entry of loops, which are distributed diffusely over the computing stages 5 and thus over the program succession.
The path via the caches L1, L2, L3 of the control CPU core 20, which is customary in the prior art, is circumvented with the ring structure according to the invention for a large number of the executed instruction sequences, so that a large number of load and save instructions can be omitted and/or distributed individually to the respective computing stages 5.
The control CPU 2 according to the invention is designed in such a way that it constantly prepares, as overhead, the computing path through all computing stages 5, so that during the actual execution of the calculations by the computing units 3 the results only have to be entered at still free positions or at prepared, reserved positions in the workflow of computing stage 5 - which workflow is prepared in the local cache - forwarded, and then further processed in the next computing stage 5, without requiring a further overhead at runtime.
The data inputs for a program instruction, which is processed by one of the arithmetic units 31, 32, 33, 34 of computing unit 3, are always directly available in one of the many local register sets and are loaded by a so-called PMMU 46, whose associated instructions are simultaneously fetched and decoded by an interpreting engine 37 (IE), in that they are placed by an execution engine 36 (EE) in the input instruction registers of the arithmetic units and processed by the various arithmetic units, wherein the IE obtains its instructions for the loaded data from stack 47 of computing stage 5. If required, the IE 37 predecodes a complex standard combination instruction. This IE 37 can also be integrated in the EE 36.
Furthermore, a computing stage 5 has a management unit 4 (MU), which takes over the administration and coordination from the microprogram code of a control, for example on its updatable FPGA chip, via communication with the control CPU 2, the EE 36, the IE 37, as well as the local cache 47 of computing stage 5 on the basis of the program code of the program to be executed by the user.
The structure of large caches between the CPU cores, which is otherwise common in known processors to increase performance, can therefore be largely omitted in a processor architecture 1 according to the invention, so that these caches (or their space on the die) can be used for computing stages 5. Due to the distribution over the multitude of computing stages 5 in the ring, the problem of the program code to be processed is effectively broken down and results in small program code parts which can be processed in correspondingly compact computing stages 5. Thus, these computing stages 5 can each be designed with relatively small and preferably very fast caches, which can increase the overall performance and throughput of an embodiment according to the invention. A computing stage 5 according to the invention can have several dedicated instruction stacks, in particular each for different arithmetic units, for example separate stacks for different types of instructions, especially separately in each case for example for an FPU 33, ALU 34, ... of computing stage 5.
Each computing stage 5 preferably has several field computers (FR) 32, vector computers, logical units (LE) 31, FPU 33 and ALU 34 units (full and/or half value). Computing stages 5 are also referred to as (MISD) field lines or arms of the processor architecture, especially for Multiple Instruction Single Data (MISD) calculations. Their data input is shown in Fig. 1 as an example on the counterclockwise side, their data output in clockwise direction. The ring of computing stages 5a built up in this way is designed so that the data outputs can always be transferred via a multiplexer in the MU 4 to one or more data inputs of a next computing stage 5b. In special embodiments according to the invention, a bypass to the same computing stage 5 itself and/or to the respective subsequent computing stage 5 via a correspondingly designed MU 4 is also optionally possible.
In the embodiment shown here, the MU 4 is bidirectionally connected to a Paged Memory Management Unit 46 (PMMU). A level 4 cache memory 47 (L4) connected to such a PMMU 46 can buffer pages and hold them until the next page change. As can be seen in Fig. 1, each of the MUs 4 has a respective individual PMMU 46 and a respective individual L4 cache 47. As the dotted line next to the arrows for the data flow between MU 4 and PMMU 46 indicates, a communication flow can also be established to control and configure the PMMU 46 from and/or via the MU 4. Each computing unit 3 thus has a paging buffer 47, usually relatively small, e.g. one or more pages, which can act as an output and/or input buffer; especially since each output of a computing unit in the ring structure according to the invention is also the input of the next computing unit, the correct assignment of the pages can be carried out virtually. The paging buffer is designed as a highly parallel page buffer and is integrated in the computer architecture for direct 1-to-1 data flow, in particular for data with the size of a single unit (or, in combination, possibly also multiples of the units) of a Multiple Instruction Single Data (MISD) computing stage. The size of the paging buffer can be selected in particular depending on the number of existing computing stages. For example, each passing of N computing stages may require N data sets, and thus the buffer may be formed in advance according to this number of computing stages or a multiple thereof, so that the CCPU does not have to check the progress for a thread after each round, but only when, for example, less than half of the still processable data - or whatever other paging directives specify - is still present in the buffer for a thread. If the CCPU is undersized, it will have to schedule the filling of data and instructions for this thread too often, which could be detrimental to performance. This can be used, for example, to standardize split data streams that are related to the start. In addition, each MISD computing stage 5 has a fully associative instruction stack, which is filled from above by the control CPU 2 via and with the help of the MU 4, and from below is automatically extracted and processed by the MISD computing stage 5 during processing.
The width of the instruction words of the CCPU 2 can be independent of the width of the MISD level 5. Each instruction architecture of the LE 31, FR 32, FPU 33, ALU 34, etc. is independent and can optionally be designed with multiple redundancy, e.g. depending on the probability of use based on the instruction frequency. A search for the correct instruction or its memory page is carried out on the basis of the combination of the thread number and the instruction pointer in the program and, if necessary, a run-up priority or a queue on the instruction stack, wherein the page is made available via a so-called Program Memory Management Unit 46 (PMMU). As a result, only instructions in the next process step can react that have the thread number and the instruction pointer +1 as their signature, so that several stored instructions can be loaded from a paging buffer of the cache in the MU into computing unit 3 and can be executed by it.
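The signature-based search described above can be illustrated by the following sketch; the field names and the linear search are assumptions made for this example only and do not prescribe an implementation.

#include <cstdint>
#include <optional>
#include <vector>

// An entry on the instruction stack of a computing stage, identified by the
// combination of thread number and instruction pointer described above.
struct InstructionEntry {
    std::uint32_t threadId;            // thread number of the signature
    std::uint32_t instructionPointer;  // position of the instruction in the program
    std::uint64_t opcode;              // instruction word; format left open here
};

// Only an instruction whose signature carries the same thread number and the
// instruction pointer incremented by one may react in the next process step.
std::optional<InstructionEntry> findSuccessor(const std::vector<InstructionEntry>& stack,
                                              std::uint32_t threadId,
                                              std::uint32_t currentIp) {
    for (const InstructionEntry& e : stack)
        if (e.threadId == threadId && e.instructionPointer == currentIp + 1)
            return e;                  // ready successor found on the stack
    return std::nullopt;               // nothing ready yet for this thread
}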
The compiler, with which the program code is generated, prepares the program code in such a way that only the possible instruction modulo of the combinations of the program flow of the ring are entered on each memory page for an instruction stack. An instruction, which is marked as ready in the instruction stack, is inserted by the Execution Engine 36 (EE) into the registers of the corresponding arithmetic unit. The EE 36 selects a possible, ready instruction from the instruction stack, e.g. on the basis of the highest priority and therefore currently critical latency or according to the FIFO principle, and thus connects at least two linked valid data from the separate data cache reserved in and by the instruction with the instruction from the instruction stack. The link from instruction and associated data is then inserted into the next free space of instruction execution in a selected arithmetic unit. The EE 36 checks continuously and at the highest possible search rate for a possible insertion of instructions, at least as long as hardware capacities for instruction execution are available in the arithmetic unit. The computing speed is not distributed over a large processor construct and is therefore limited and complex, but only directly comprises a computing stage 5 and a decoupled communication bus between management unit 4 (MU) and control CPU 2 (CCPU), which thus gains speed.
Each instruction bijectively links, via a signature of the thread, the data stream and the assembler program counter (PC), its successor instruction and the last instruction with a jump decision. For this purpose, a new, broad instruction format is provided, which includes one instruction and one variable, e.g. a 128 bit instruction format, which has a 64 bit instruction and a variable that is a multiple of 8 bit wide (today 64 bit standard), wherein the extension of the computer's bandwidth is decoupled from the instruction bandwidth.
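A possible packing of such a broad instruction word is sketched below; the concrete field split is only one example of the 128 bit format mentioned above and the member names are invented for this illustration.

#include <cstdint>

// 128 bit instruction word: a 64 bit instruction part (opcode plus signature of
// thread, data stream and program counter) and a 64 bit variable part, so that
// the operand width can be extended independently of the instruction width.
struct WideInstruction {
    std::uint64_t instruction;  // opcode and signature fields
    std::uint64_t variable;     // operand, a multiple of 8 bit (today typically 64 bit)
};
static_assert(sizeof(WideInstruction) == 16, "two 64 bit halves form the 128 bit word");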
During the insertion of an instruction (here also called ENT after the word "Entrance"), a marker of an independent instruction sequence is set to "used", and is reset to "ready" after leaving the computing stage. Only when a storable, output-obligatory result is to be provided after the internal pipeline (DEC,(DEC),EX,MEM/WB), the state of computing stage 5 is set to "out" with the marker. A further state of computing stage 5 is the marker "none", i.e. the idle state, in which computing unit 3 is switched off in computing stage 5 for energy saving. In the above example, a marker of 2 bits is sufficient for the state of computing unit 3; in other embodiments, further states can be coded in a wider marker.
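The four markers mentioned above could, for example, be encoded in two bits as follows; the numerical assignment is an assumption for this sketch only.

#include <cstdint>

// Two-bit state marker of a computing stage / its independent instruction sequence.
enum class StageMarker : std::uint8_t {
    Ready = 0,  // sequence may be inserted again after leaving the computing stage
    Used  = 1,  // set during insertion (ENT) of an independent instruction sequence
    Out   = 2,  // a storable, output-obligatory result is pending after the internal pipeline
    None  = 3   // idle state: the computing unit is switched off to save energy
};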
For regular processes a counter can be implemented, which is stored for each instruction and which can be used to determine when this instruction is calculated, i.e. when the marker "out" is received. This can, for example, be in the form of a table in the memory, especially meant as a separate extension of the register table of the instructions and candidates for the MU. Since several instructions are executed simultaneously and even small pipelines for binary arithmetic steps are provided in the computing units 3, the current arithmetic state of an instruction is also noted on the loaded working page of the MU on its arithmetic unit for planning purposes. This makes it possible to register, for example, when a required computing unit 3 becomes free. The MU itself loads the required page with the PMMU from the L4 cache and works and notes the current calculation state on this page until the page is written back to L4 and updated when there is a change and also at regular intervals.
A special feature of the process is that for complex Standard Combination Execution (SCE) instructions, a second decoding phase is initiated, which redirects the original instruction on the side of the instruction stack through a link to a page of the Paged Memory Management Unit 46 (PMMU) with the code page generated by the Interpreting Engine 37 (IE) and executes the first redirected instruction in the subsequent second decoding until the program jumps back to the instruction stack by the Management Unit 4 (MU). The SCE can be designed as a part or extension of the IE. This means that, for example, frequent chains of instructions can be virtually designed externally as a single instruction, and this instruction can be set up internally or via the following chain of computing stages again and again without placing a load on the CCPU. In particular for such SCE instructions, an exception can also be implemented with respect to re-execution in the same computing stage, which is otherwise not primarily provided for in the computer architecture of the invention, e.g. in that special computing stages are designed or preconfigured for special SCEs.
The last known instruction of a program is transmitted to the control CPU 2 (CCPU) together with the calculated output (entry, array or pages) for identification. Each instruction additionally contains an assignable thread ID, for example as instruction formats with computing stage number, instruction number, thread number, etc., which assign a correct next location for the instruction data. For example, an instruction is stored at a register location of a computing stage, and exactly this location is first identified by the CCPU for the instruction in order to find a successor and insert the correct data. Thus, for example, two identical programs with two different data sets and thus different thread numbers, but the same instruction numbers and the same computing stage numbers can be present and processed on one computing stage. The n+1 computing stage must still be able to assign the correct data to the instructions - although everything is the same except for the data - which can be achieved by means of the thread ID. The Management Unit 4 (MU) manages internally at least as many pages (which are loaded from the Paged Memory Management Unit 46 (PMMU)) as there are inputs on the computing units 3.
An embodiment of caches in the processor architecture according to the invention is also shown in Fig. 2 and Fig. 3 by way of example. The caches are to be subdivided primarily into caches L1, L2, L3 of the CCPU 2 and caches 47 of the respective computing stages 5, wherein in the present invention the latter in particular have a special structure.
The control CPU 2 (CCPU) can be designed in an embodiment to a large extent analogous to one of the known computer architectures, which is specially supplemented with a termination procedure of time-stretchable real-time processors and several control buses extended outwards.
The page-mixed data and/or code caches 47 are arranged externally around the computing unit ring 3a, 3b, 3c, 3d or between this and the CCPU 2. The latter arrangement can speed up the loading of pages for the CCPU 2 and one ring can then, for example, take up essentially the entire outer area of the chip area. The data and instruction pages can be mixed in L4 caches in one embodiment, or in another embodiment, as shown by way of example in the figures, the MUs are connected to an in-cache and an out-cache and the L4 cache only holds the instruction tables.
The L1 cache in the control CPU 2 is designed for instructions prior to their assignment. In this cache, instruction words are collected in the form of pages and inserted into the L4 cache of stage 5. The L1 cache can only contain a few pages. Preferably, data and program code pages are stored separately from each other in the L1 cache, wherein from the perspective of the CCPU, the instructions to be inserted are regarded as data. This can make it easier to allocate pages at computing stage 5 and to divide them into program code pages and data pages at computing stage 5.
Such a page can also have a header in a mixed L2 cache, for example, which identifies the stored page as data or instructions. This header can, for example, be in the form of at least two page bits. Furthermore, page markers used by paging can be stored, such as "invalid", "referenced", etc.
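Such a page header could, purely as an illustration, be laid out as in the following sketch; the exact bit assignment is an assumption made for this example.

#include <cstdint>

// Header of a page in a mixed cache: at least two page bits identify the stored
// page as data or instructions, further bits carry the usual paging markers.
struct PageHeader {
    std::uint8_t pageType   : 2;  // e.g. 0 = data page, 1 = instruction page, others reserved
    std::uint8_t invalid    : 1;  // paging marker: page content not (yet) valid
    std::uint8_t referenced : 1;  // paging marker: page was accessed since the last reset
};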
In an embodiment an operating system fills a part of the RAM main memory 9 with the first pages of a program to be executed before each program start with the help of the CCPU 2 and prepares the structures in the computing stages 5. In this way, other memory pages in the computing stages 5 can be displaced prematurely and/or a check can be made to ensure that all processes are consistent.
The CCPU 2 can be designed with its own set of coordination instructions, which is formed e.g. by the compiler using the task code. It is possible to define one's own complex Standard Combination Execution (SCE) beforehand via programming, by displaying the instructions of the complex operation on an extra page and by calling them up in real time by an instruction in the program code. In this process, the MU 4 with the PMMU 46 fetches a page with an SCE instruction sequence from the L4 cache 47. To allow such a separate loading, the CCPU has a special control instruction. The instruction format that allows the data to flow in the ring can in many embodiments be different from the format that the CCPU uses for its control measures for the ring and with which the CCPU itself is instructed. In an exemplary embodiment, a RISC-V instruction set can be extended by corresponding instructions.
In an embodiment, there can also be a bypass between each of the Multiple Instruction Single Data (MISD) computing stages 5, so that each computing stage 5 can be directly connected via its MU 4 to more than just the next computing stage - in a special embodiment, in particular to each other and to itself - in particular to forward results, code pages or data pages. One such embodiment is a so-called Fully Associative Network (FN).
In addition, according to the invention, a second no-operation (Nop) instruction with an adjustable holding time for each instruction can be formed in the instruction set for each arithmetic unit 5, in which this arithmetic unit 5 lets the input data stream through as an output data stream if the planning detects a standstill (stall) in the next computing unit 5 in the ring. In this way, the fully associative network FN can be simplified, because a holding period of 1 allows the instruction to pass exactly in the next step. According to the invention, the instructions for the arithmetic units 3 are not directly instructed by the CCPU 2, but each individual instruction is instructed by the corresponding MU 4, so that this MU 4 can intervene in the program flow based on signals from the CCPU 2.
For the instruction architecture of the CCPU 2, for example, a RISC instruction set can be extended, wherein this instruction set can be designed in particular to provide special instructions with which data can be loaded onto external structures, especially to the MUs 4 and/or their PMMU 46 or local cache 47. Among the extended instructions, it is possible, for example, to set a page in a MU 4 and communicate with it, collect all the results of a dividing instruction, optionally with the creation of a new branched thread path and continued calculation of both threads, and return and path logging of the final results of a thread. The instruction set of the CCPU 2, the MU 4 and the complex SCE can preferably be in one embodiment at least substantially the same or subordinate elements of a subset of the instruction set supported on the CCPU 2. The instruction space for MUs and the CCPU can, if necessary, be handled separately. For example, a Multiple Instruction Single Data (MISD) instruction set of the CCPU 2 can provide at least one of the following instructions (a schematic grouping is sketched after this list):
- Linking of threads to data and instructions, and loading and saving of thread pages by the linker, and signal checking of flags;
- Checking of the latency priorities of threads and their instructions in the MU 4;
- Automation of a "Thread Fork()" in the sense of separating data streams (e.g. as explained above);
- Support for analogous separation of simple data streams based on decisions and pre-calculation of true/false decisions from not yet determined truth values;
- Automation of an instruction forwarding by a multiplexer by coordination guidelines, which are set and prepared by the CCPU 2 in the MU 4;
- Instructions for an automated preemption of memory pages and threads as well as sub-threads, especially so that no thread starves and the fairness of other data streams opens the possibility to calculate -> e.g. a signal check for deadlock and livelock to the CCPU 2, as well as relative activity of the thread with error code handling;
- Control instructions for caches and page transitions and coordination between the different caches, especially with an option to force exceptional loading of pages into other caches and cache levels.
- Instructions for creating tree results for all branches of the program, which tree results are provided e.g. for tracing errors.
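Purely as an illustration, the instruction classes listed above can be grouped as in the following sketch; the mnemonics are invented for this example and do not correspond to a real instruction set.

// Schematic grouping of the extended MISD control instructions of the CCPU.
enum class CcpuControlOp {
    LinkThreadPages,       // link threads to data/instructions, load and save thread pages, check flags
    CheckLatencyPriority,  // check latency priorities of threads and their instructions in the MU
    ThreadFork,            // automate a "Thread Fork()" in the sense of separating data streams
    PrecomputeBranch,      // pre-calculate true/false decisions from not yet determined truth values
    ConfigureForwarding,   // set coordination guidelines for instruction forwarding in the MU multiplexer
    PreemptPages,          // automated preemption of memory pages, threads and sub-threads
    ControlCaches,         // control page transitions and coordination between the cache levels
    BuildResultTree        // create tree results for all branches, e.g. for tracing errors
};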
The original instructions of the machine code can be coded in RISC and an extension can be used to provide the most complex SCE instructions for sequences that are used more often. In an embodiment with such a provision of instructions with higher complexity, the data transfer for controlling the computing stages 5 can be minimized. The decoding of the instructions also has a higher performance potential anyway due to the outsourcing, which can be used for such complex instructions. In this context, the primary consideration in the design can also be the computing throughput and not the reaction speed. An instruction decoding in the IE 37 can also be designed with a pipeline. If the necessary results from other calculations are not yet entered for the processing of an instruction in computing unit 5, the MU 4 will determine this in advance and the MU 4 will not load this instruction into computing unit 3, but another instruction which is already completely ready for execution. In contrast to conventional computers whose flow control is handled at the thread level, according to the present invention, the individual instructions are "ready for execution" (and not just entire threads with a large number of instructions). Through this consideration of individual instructions according to the invention, a thread or sub-thread of a process becomes, so to speak, only implicitly ready for execution. In many cases, this can be used to avoid so-called "busy waiting".
According to the invention, threads can be checked globally by the MU on instruction of the CCPU in combination with instructions still outstanding for this thread by the CCPU. Loops in particular can be checked for their conditions. If a conditional event can no longer occur, the entire thread is automatically terminated with an error state or interruption state, which is then handled by the CCPU 2 - in the background, so to speak - while the corresponding computing stage 5 devotes itself to the next task without interruption.
For the unambiguous tracing of a branching sequence, the control processor CCPU 2 can document the jumps and decision parts executed during the execution via an operating system of the computer architecture 1 after completed calculation of each page to be executed, wherein these branches can already be available marked and prepared by the compiler, so that only the corresponding results of the branches are entered by the CCPU 2 at the appropriate places.
Furthermore, the control processor CCPU 2 can store the distribution of the thread pages and provide a list of all open instructions in the instruction stacks of computing stage 5 and assign computing stage 5 to the thread pages, especially without disturbing the processing of the program code by computing stage 5. For example, when searching for a special instruction, this can be used to determine the responsible page block. If thread pages are forwarded externally between computing stages 5, the CCPU 2 is informed by the MU 4, and the CCPU 2 then updates this information in its local program flow administration. Fig. 4 shows in detail an example of a schematic structure of an embodiment of a Management Unit 4 (MU) according to the invention.
A primary task of the MU 4 is to perform the data flow between the computing units 3, i.e. especially from input 40, from a preceding computing unit in the ring to a subsequent computing unit 3 in the ring via output 45. The MU 4 separates the instructions of the actual program for each input of computing unit 3 and places them on the fully associative instruction stack for each computing unit 3. In particular, the instruction stack can also be designed as a separate stack for each input. When distributing the instructions of the program code, the strategic aim is that a subsequent instruction is planned into the respective following computing unit 3 of the next computing stage 5, and when executing the program code, the data processed by the instructions is passed on in ring 7 between computing stages 5. If the necessary results of the previous calculations - on which data the current instruction is based - are not yet entered for an instruction, the MU does not load this current instruction, but another instruction which is already ready for execution. The MU 4 can, for example, fill an instruction and/or data stack for an arithmetic unit of computing unit 3.
In one embodiment, a MU 4 can be specially designed with the following components (a simplified structural sketch is given after this list):
• a cache-in 41 which receives incoming data, designed for caching and buffering byte stream data and pages according to strategies such as FIFO, LIFO or other strategies originally known from process strategy, which are used here for the storage strategy,
• a processing 42, designed for the interpretation (IE) and execution (EE) of the instructions
• a cache-out 43, designed to cache large pages that cannot be transferred in one go, and for stall treatment, and
• a multiplexer 44 (MUX) designed to provide outgoing data
o to the computing unit or computing stage level belonging to the MU with an EE in which it assigns the instruction to the correct arithmetic unit,
o to the ring to a next MU,
o to the control CPU, or
o to a PMMU.
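A simplified structural sketch of these four building blocks is given below; the member names and container choices are assumptions made for this illustration only.

#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Purely illustrative model of the MU building blocks 41-44 described above.
struct MuPage { std::vector<std::uint8_t> bytes; };

struct ManagementUnit {
    std::deque<MuPage> cacheIn;    // 41: buffers incoming byte streams and pages (FIFO, LIFO, ...)
    std::deque<MuPage> cacheOut;   // 43: holds large pages that cannot be transferred in one go
    enum class Target { OwnComputingStage, NextMuInRing, ControlCpu, Pmmu };  // 44: outputs of the MUX
    // 42: processing - interpretation (IE) and execution (EE) of the instructions.
    void process() { /* take a page from cacheIn, interpret it, schedule it onto an arithmetic unit */ }
    // 44: multiplexer - route an outgoing page to one of the possible destinations.
    void route(MuPage page, Target target) { (void)target; cacheOut.push_back(std::move(page)); }
};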
Fig. 5 shows a block diagram of an embodiment of a process or method according to the invention, specifically for executing a program code.
In block 50 a central control CPU reads in a program code and data from a memory.
In block 51 the program code is distributed by the central control CPU via a star-shaped, direct connection of the central control CPU to several computing stages. These computing stages are connected to each other in a closed ring for data communication. The program code is distributed to the computing stages in such a way that when the program code is executed, intermediate results are forwarded along the ring between the computing stages as result data of a first instruction of the program code and as input data of a second instruction of the program code. The instructions or blocks of instructions are thus, for example, successively distributed along the ring, wherein in particular conditional branches and sequence variants of the program code are also taken into account in the selection and configuration of the distribution of blocks of instructions in order to be able to achieve a high degree of parallelism in the processing.
In block 52 the program code is executed on the computing stages, wherein processed data is passed on between parts of the program code in the ring from one computing stage to the next.
In block 53, the final results of the execution of the program code are read in from the computing stages by the central control CPU and the final results are stored in the main memory or on a data carrier, e.g. RAM, or on a data storage medium such as an HDD or SSD.
In other words, based on an example of program code with reference to an embodiment of a processor architecture 1 according to the invention - for example in a compiler similar to C++ - a new instruction "multiple_fork (<Object_identifier>)" can be provided, which via a template feeds various inputs to any variable or object during its creation and synchronizes further sequence variants or allows own sequences. Each argument "multiple_value (<Object_identifier>, args...)" in the following code creates a branching in the code. A further argument can then be "multiple_out (<Object_identifier>, args...)", which outputs an array of pointers to the finished calculated objects. Each of the "args" variables used creates a further sequence variant. Such a method, according to the invention, can be designed as an extension of the "p_thread" class for handling threads (also called activity carriers or light-weight processes) as an execution strand or an execution sequence in the processing of a program as part of a process for flow control, but according to the invention, whole address spaces do not have to be copied or pseudo-copied, and only small sections of code are split when the compiled code is executed.
The following C++-like pseudo code can be used as an example for explanation:

int i=1, j=2, a=0;
/* Mark a as an element that can be synchronously thread-separated into individual, so-called strings;
   corresponds approximately to "a += multiple_value(a, i, j);", i.e. a branch i and a branch j;
   "\|<...>\|" addresses several identical objects or fundamental elements at this point */
a += \|i,j\|;

System.out.println("a=" + a|i,j,<...>|);
/* Output of the result branches of a with the variation of i and the variation of j;
   Result: "a=\|1,2\|"
   There are two solutions to the same problem. Transferred to a respective string, the results are logically separated */

System.out.println("a=" + a|i|);
/* Result: "a=\|1\|" */

System.out.println("a=" + a);
/* Result: "a=\|1\|"
   Result: "a=\|1,2\|" (automatic) */
The result in this example is both 1 and 2, which means, for example, that according to the invention different variants can be tried out for an existing context, which leads to a number of n new branches in the course of execution on the computer architecture. However, the executions of the variants are not independent threads, because the variation can be resolved within a single thread. Such a subdivision of a single thread using the "multiple_fork" statement can also be referred to as a "string thread".
The "multiple_out" statement can still be read out as an array[] in order to continue or merge results of the individual strings in sequential form if required. Similarly, in another embodiment, the input of branches can also be carried out via an array[], for example with "a+=\|[i,j]\|".
For example, information processing on two levels can be carried out as follows. An example of a possible functional process in an embodiment of a processor architecture according to the invention can be as follows:
1. Loading a program block from a hard disk or RAM into the L2 cache of the CCPU.
2. The CCPU loads the program page into the L4 cache of the MU; following the instruction of the CCPU, the MU immediately finds the correct page with the first instruction, in particular according to the instruction format described here.
3. The MUs load the L4 program page from the CCPU and report the status "loaded" to the CCPU.
4. The CCPU sends the data for transfer: thread ID and initial data are transferred to all computing stages.
5. The MU loads the first instruction from its internal buffer and fills the instruction with the required data variables from L4.
6. The CCPU gives the start release to the thread as soon as all preparations are done.
7. The IE interprets the instruction(s) and checks for free arithmetic units. In particular, an instruction page with standard combination execution instructions (SCE) is created; the first element is popped repeatedly until no further special instruction "fork" can be found, and the resulting instruction page with the length n-1 is sent to the next computing unit, which again pops the first element, enters this instruction in its internal registry, writes it back to L4, continues to send the page, and so on. This results in a kind of self-configuration of the ring.
8. EE sets instruction and input values to the correct computing unit.
9. The computing unit outputs the result after a few internal pipeline steps and buffers it on the cache-out, e.g. in the format register 0x****:{Result, subsequent instruction number, subsequent computing stage, thread ID}. If there are multiple outputs, the MU can also report a message "fork" to the CCPU, e.g. with {Thread, new Child, Instruction Number}, so that the CCPU can keep track of all processes and relationships.
10. The MU of computing stage n+1 registers that its buffer is not empty.
11. The MU loads the data/the data field and searches for available instructions, e.g. according to a round-robin principle over all pages from L4.
12. The MU finds a matching instruction and keeps the matching page in its internal memory. With an instruction supplement X_MERGE{Out, Thread1, Thread2, RegisterIn1, RegisterIn2, subsequent computing stage}, data flows of the specified two threads can be reunited on Thread1 for extended operations of any kind, and this can be made known to the CCPU (an illustrative sketch of these formats is given after this list). The MU waits until both data are available. The MU automatically packs the new result into the correct fetch register of the page. The compiler is guaranteed to have the correct instruction on a page in MU memory for this step.
13. Step 3 and following are repeated until the last instruction is reached and the MU finds a KILL instruction on the instruction page.
14. MU reports thread ID with message "done" to CCPU.
15. CCPU sends message "get" with the thread ID to all MUs with relevant output information in reverse order. This can be used to create a tree stack of the data which is known to the CCPU through the program code and its own planning of the process at the computing stages.
16. CCPU creates completed program classes for thread and branch overview for the user and stores them on the hard disk or RAM. Each instruction to put an intermediate result on the stack or a newly created or merged thread represents a label to be stored for the CCPU and is marked as such from the beginning. The CCPU searches for the corresponding locations of instructions already during distribution and finds the corresponding storage locations of variables with the MUs to build a consistency tree.
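The formats mentioned in steps 9 and 12 could, purely as an illustrative assumption about one possible layout, be modelled as follows; the field names and widths are not fixed by the description:

#include <cstdint>

// Illustrative C++ sketch of the cache-out entry from step 9 and the X_MERGE supplement from step 12.
struct ResultEntry {                 // cache-out entry: 0x****:{Result, subsequent instruction
    uint64_t result;                 //   number, subsequent computing stage, thread ID}
    uint32_t nextInstruction;
    uint16_t nextStage;
    uint32_t threadId;
};

struct XMergeInstr {                 // X_MERGE{Out, Thread1, Thread2, RegisterIn1, RegisterIn2,
    uint32_t out;                    //   subsequent computing stage}: reunites the data flows of
    uint32_t thread1, thread2;       //   the two specified threads on thread1
    uint32_t registerIn1, registerIn2;
    uint16_t nextStage;
};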
For example, according to the invention, new, so-called "interparallel sequences" of a thread can be run in several threads of a process, which all share an address space. For example, x = (-b ± √(b² - 4ac)) / (2a) can be processed in real time in parallel with subsequent decisions, or more complex equations can be processed in parallel in one thread, and basic principles such as mathematical polynomial division (e.g. for estimating zeros in multidimensional systems) can deliver results faster and in a highly parallel fashion.
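Purely as an illustrative example of such an interparallel sequence - written in plain C++ rather than the proposed string-thread syntax - the two ± branches of the formula can be computed as two sub-results of one sequence and merged afterwards:

#include <array>
#include <cmath>

// Both ± branches of x = (-b ± sqrt(b*b - 4*a*c)) / (2*a) computed as two branches of one
// sequence; in the architecture described here each branch could run as its own string and be
// merged afterwards (real roots assumed for simplicity).
std::array<double, 2> quadraticRoots(double a, double b, double c) {
    double d = std::sqrt(b * b - 4 * a * c);  // common intermediate result
    return { (-b + d) / (2 * a),              // branch "+"
             (-b - d) / (2 * a) };            // branch "-"
}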
An example of how this can be done is as follows.
1. A so-called "split module" splits or separates the data and the instructions of the program code, especially in the L2 cache and/or in the L4 cache if there are mixed pages there. On the one hand, if the user's program input provides a mixed situation at startup where the pages must first be created separately, the split module separates instructions and constants, and detects required memory locations for variables that must be inserted into the MUs. In order to avoid crashes as far as possible, dynamic memory allocation is avoided during a run if possible. The memory can be prepared in the form of dummy variables, and the instruction words of the program code are assigned the corresponding, correct registration numbers of the data. If a is to be combined with b in a calculation, it must be possible to find these variables via their location (see also the instruction format above). The split module is preferably a software and/or hardware part of the CCPU.
2. The CCPU 2 assigns a thread tag on all pages or byte streams and variables involved in the caches of computing stages 5 in order to identify a data flow, and inserts the data pages in the computing stage stack separately from the program code stack (growing from bottom to top) into a logically separate data stack (growing from top to bottom) with the input variables of the program, and links the instructions to be executed with the already existing data on the data stack. The least loaded computing stage 5 is preferably selected.
3. The CCPU 2 starts the program and leaves it to itself on the ring of computing stages. Since the CCPU 2 primarily only manages and uses instructions, it is generally faster than the switching requirements of the arithmetic units 3 and computing stages 5. To ensure that no results are lost, "dirty" pages are constantly written back to the CCPU 2 with low priority, and with real-time priority especially in the case of exceptions, displacement of pages from the PMMU 46, expiration of a timer signal or an IO call to the CCPU 2. The storing is preferably carried out automatically via the MU 4 with the stored tags.
4. During the time-independent instruction processing on each computing stage 5, the result of the respective calculation is placed in an out-cache area in a data stack in cache 47 of computing stage 5. If two valid tags (raw data for the next instruction) are found for the subsequent instruction in the instruction stack of the next computing stage on a RISC 3 address machine in its data cache, the subsequent instruction is passed to the next free and responsible computing unit 3 in the next computing stage 5 (see the matching sketch after this list). If it is a constant value, the compiler has already been instructed to use it.
The CCPU 2 is designed and configured in such a way that none of the involved stacks overflows, or memory pages are swapped out of the stack 47 L4 before an overflow, e.g. into a cache L1, L2, L3 of the CCPU 2 or into RAM 9. On the other hand, the CCPU 2 is designed and configured in such a way that it prevents a thread from being blocked in its execution due to lack of data and "starving". If, for example, no suitable data pair is found for an instruction ready for execution in a computing stage 5, the memory page is first swapped from the MU 4 to the instruction stack, i.e. a so-called paging is carried out, especially without an explicit triggering by the CCPU. Furthermore, each of the tags can have an instruction number or instruction field with a counter that records how long it has been stuck, measured in completed calculation runs or FLOPs. This allows e.g. to determine an error tolerance and/or to find out if and where a deadlock occurs. For example, the clock frequency of an MU 4 subprocessor searching the cache for executable instructions can also be much higher than the CCPU 2 clock frequency, because smaller hardware structures can be optimized much more easily - divide and conquer. The division of labor in the processor architecture 1 according to the invention creates an extended complex pipeline. The system according to the invention can especially be designed as an offer-demand system. The solution according to the invention continuously links currently available data, and always processes possible and already existing combinations first and in parallel - especially if no dependencies or conflicts occur, also regardless of a sequence of instructions implied in the original program code. This means that every instruction in a computing unit has its own optimized pipeline. The result can first be buffered in independently long and respectively optimized pipeline stages, and then entered into a transition cache between two computing stages 5 by a so-called cache manager of the PMMU 46.
The next following computing stage 5 in the ring processes this result if all necessary data for the processing in this following computing stage are available.
The conveyor belt of this factory-like method is always filled automatically and empties itself into the warehouse where the data waits for the next processing step. It is possible to prevent pages from being deleted from the out-cache in order to find and process a data pair again - i.e. data that is linked together by the execution of an existing instruction. This creates simple processing trees. When branching, both variants are simply processed and assigned different thread IDs and pages in the same step. For example, sub-threads with different thread IDs can be reported to the CCPU so that the CCPU can determine that they still belong to the original thread and run under a fork, but are still the same program code and therefore the same thread.
5. During processing, the control CPU 2 has already set a return instruction in all caches 47 on the ring if the result is to be written back and the program terminates legitimately. If an MU 4 sends the signal that a thread is completed, a system call is generated in the CCPU 2 that identifies the thread to be assigned and writes back all result pages, as well as releases all program code pages of the thread on all external mechanisms for overwriting and/or displacement. When this signal is received, the known expiration tag of the data to be stored is searched for in the external caches 47 and the control CPU 2 directs a redirection to its own L1 cache with one or more valid return addresses. On one side of the thread to be assigned, all child results are logged and the operating system decision paths are calculated, which is part of the CCPU 2. As a result, there are always n different results for branches, which are written by the CCPU 2 as a variable tree structure to the page in RAM 9 or immediately into a main data memory, especially via an external IO interface. Altogether it should be noted that first all pages are fetched from the ring of computing stages 5 into the CCPU 2, and computing stages 5 are released as fast and as far as possible to free computing capacities.
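The matching of instructions to already available data described in the list above can be outlined as follows; this is a minimal sketch under the assumption of simple integer tags, and all identifiers are chosen for illustration only, not taken from the disclosure:

#include <unordered_set>
#include <vector>

// Sketch of the "offer-demand" matching: the MU scans its instruction stack and collects every
// instruction whose tagged input data are already present in the data cache.
struct Instr {
    int id;
    std::vector<int> inputTags;   // tags of the operands this instruction is waiting for
    int nextStage;                // computing stage that receives the result
};

std::vector<Instr> readyInstructions(const std::vector<Instr>& instrStack,
                                     const std::unordered_set<int>& dataTags) {
    std::vector<Instr> ready;
    for (const Instr& ins : instrStack) {
        bool allPresent = true;
        for (int tag : ins.inputTags)
            if (dataTags.count(tag) == 0) { allPresent = false; break; }
        if (allPresent)
            ready.push_back(ins);  // can be handed to the next free arithmetic unit
    }
    return ready;
}

In the architecture described above this search is carried out by the MU 4 itself, e.g. in a round-robin fashion over the pages in its L4 cache, rather than by software on the CCPU 2.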

Claims

CLAIMS:
1. Microprocessor architecture (1) of a digital computer, in particular formed on a semiconductor chip, which comprises
a plurality of computing stages (5), each of which is formed with
o a computing unit (3) having at least one computing core (31, 32, 33, 34), which is designed for a mathematical or logical processing of data according to an instruction set, and
o a management unit (4) (MU), which is designed to provide the data and a program code for the computing unit (3),
wherein the computing stages (5) are arranged and designed in such a way that by means of the MUs (4) data communication between the computing stages (5) is established in a closed ring, in particular in a closed daisy chain ring with point-to-point connections between the MUs (4), and
a central processor core (6) having a control CPU (2) (CCPU), which is connected in a star configuration to each of the computing stages (5) via its MU (4), and
which is designed and configured in such a way as to distribute the program code and data associated therewith from an external memory (9) via the MUs (4) to the plurality of computing stages (5) in such a way that, during the processing of the program code, successive, dependent calculations are distributed to computing stages (5) which follow one another in the ring, and the data to be processed are forwarded in the program code sequence in the closed ring from a first of the computing stages (5) to a next of the computing stages (5) in the ring.
2. Microprocessor architecture (1) according to claim 1, characterized in that from the program code of a single thread, a first instruction is executed on a first of said computing stages (5) and a subsequent second instruction is executed on a second of said computing stages (5), wherein data of an intermediate result is passed between said first and second computing stages (5).
3. Microprocessor architecture (1) according to claim 1 or 2, characterized in that the MUs (4) are each equipped with a paged memory management unit (46) (PMMU) and an individual L4 cache memory (47), which are designed in such a way that they, preferably independently of the computing unit (3), in particular in the event of a thread change, temporarily swap out memory pages not required by the computing unit (3) into the individual cache memory (47) and/or reload required memory pages from the individual cache memory (47).
4. Microprocessor architecture (1) according to claim 3, characterized in that the CCPU (2) preloads the memory pages into the respective individual cache memory (47) via the MUs (4) according to the distribution of the flow of the program code with a corresponding part of the program code and/or the data, in particular independently of the processing of the program code sequence by the computing unit (3) and of the forwarding of the data between the computing stages (5).
5. Microprocessor architecture (1) according to at least one of claims 1 to 4, characterized in that the CCPU (2) has at least one level of local cache memory, preferably two or three or more levels of local cache memory in the CCPU (2), and
the CCPU (2) is formed with a specialized processor core (20) whose instruction set is designed to be optimized for fast management and transfer of data.
6. Microprocessor architecture (1) according to at least one of claims 1 to 5, characterized in that during the program code sequence a data flow of the data to be processed takes place along the ring between the computing stages (5), wherein,
preferably independently thereof, a control flow from the CCPU (2) to the computing stages (5) takes place,
wherein the MU (4) is designed to forward the data flow and the control flow,
wherein the MU (4) provides management and coordination of data flow and the control flow via communication with the CCPU (2), the EE (36), the IE (37), and the computing stage (5) on the basis of the program code of a program to be executed.
7. Microprocessor architecture (1) according to at least one of claims 1 to 6, characterized in that the computing unit (3) is designed with at least one computing core of
an arithmetic logical unit (34) (ALU),
a floating point unit (33) (FPU),
a field computer unit (32) (FR) and/or
a logical unit (31) (LE), and
an execution engine (36) (EE) adapted to load the register sets of the computing cores from a stack, and
an interpreting engine (37) (IE), designed to predecode complex standard combination instructions, which are preferably connected to one another and to the associated MU (4) via a common internal bus of the respective computing unit (3), in particular wherein the internal bus is designed separately from the ring via the MU (4), and an execution engine (36) (EE) and an interpreting engine (37) (IE) is designed in each case for each of the computing cores.
8. Microprocessor architecture (1) according to at least one of claims 1 to 7, characterized in that the MU (4) is designed with
a cache-in (41 ) which receives incoming data (40), adapted to buffer and cache byte stream data and memory pages from another MU (4) or the CCPU (2),
a processing unit (42) designed to interpret instructions, in particular by means of an interpreting engine (37) (IE), and to execute instructions, in particular by means of an execution engine (36) (EE), a cache-out (43) adapted to cache memory pages for transfer to another MU (4) or the CCPU (2), and
a multiplexer (44) (MUX) adapted to provide outgoing data (45)
o to the computing unit (3) associated with the MU (4),
o to the ring (11) to a next MU (4b),
o to the control CPU (2) or
o to a PMMU (46).
9. Microprocessor architecture (1) according to claim 8, characterized in that the MU (4) is designed to continuously mark in a local program cache (47) those instructions of the program code for which the data required for their processing are entered in a local data cache (47), and these marked instructions with the required data are then sent to the computing unit (3) for execution in a local stack of an arithmetic unit (31, 32, 33, 34) adapted for the respective instruction, in particular wherein the marked instructions are loaded from the local stack into the arithmetic unit (31, 32, 33, 34) by an EE (36) of the computing unit (3) for execution.
10. Microprocessor architecture (1) according to at least one of claims 1 to 9, characterized in that the computing stage (5) is designed and configured in such a way that it has an instruction stack which is filled by the CCPU (2) and processed by the computing unit (3), and an input data cache which is initially filled by the CCPU (2) and during the execution of the program code via the ring by another computing stage (5) and processed by the computing unit (3),
wherein the MU (4) is designed with a search run which provides those instructions in the instruction stack for which all the required data are located in the data stack to the computing unit (3), in particular to an execution engine (EE) of the computing unit which loads the instructions and the required data for execution into registers of a computing core.
11. Microprocessor architecture (1) according to at least one of claims 1 to 10, characterized in that the CCPU (2) is designed and configured in such a way that it interprets instruction machine code of the program code from the memory (9) and the distribution of the program code to the computing stages (5) is carried out on the basis of conditional jumps to subsequent computing stages with multiplex branching; loops via returns to the computing stage (5) containing the next instruction; branching of the data stream to several units or combination of start-related computing streams into a list concatenation.
12. Microchip or processor formed with a microprocessor architecture (1) according to at least one of claims 1 to 11.
13. Method for executing a program code, in particular parallel and branched program code, on a digital computer, in particular with a processor architecture (1) according to one of claims 1 to 11, with
a reading in of the program code and initial data from an external memory (9) by a control CPU (2),
a distributing of parts of the program code and the initial data to several computing stages (5), which are connected among themselves to form a closed ring (7), via a direct connection between the control CPU (2) and each of the computing stages (5),
wherein the distributing of the program code takes place in such a way that successive calculations which are dependent on one another are distributed to computing stages (5) which follow one another in the ring (7), and
wherein, during parallel execution of the calculations in the computing stages (5) in the closed ring (7), the results calculated by one computing stage (5a) are transferred as intermediate results to a following computing stage (5b), which further processes these results with the program code distributed to it, in particular independently of the control CPU (2), and a read-back of final results and/or error messages of the execution of the program code from the computing stages (5) and a storage of these in the external memory (9) by the control CPU (2).
14. Method according to claim 13, characterized in that the transfer of data in the ring (7) and/or between the control CPU (2) and the computing stage (5), as well as a management of the program code and the data in the computing stages (5) is carried out by a management unit (4) (MU), in particular wherein the computing stage (5) has for this purpose a program memory management unit (46) (PMMU) and a local cache memory (47) (L4) for the management of data and program code.
15. Method according to claim 14, characterized in that the execution of the program code in the computing stages (5) takes place with a search for combinations of executable program instructions and associated data in the local cache memory and a provision of these on a program and data stack for processing by an arithmetic unit (31, 32, 33, 34) of the computing stage (5), which is carried out independently of the calculations of the computing stages (5) by an execution engine (36) (EE) of the MU (4).
PCT/EP2020/070285 2019-07-19 2020-07-17 Processor WO2021013727A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP19187372.8 2019-07-19
EP19187372.8A EP3767481A1 (en) 2019-07-19 2019-07-19 Processor

Publications (1)

Publication Number Publication Date
WO2021013727A1 (en) 2021-01-28

Family

ID=67438302

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/070285 WO2021013727A1 (en) 2019-07-19 2020-07-17 Processor

Country Status (2)

Country Link
EP (1) EP3767481A1 (en)
WO (1) WO2021013727A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114328370A (en) * 2021-12-30 2022-04-12 苏州洪芯集成电路有限公司 A Vision Control Heterogeneous SoC Chip Architecture Based on RISC-V

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4644466A (en) 1983-12-05 1987-02-17 Nec Corporation Pipeline processor
DE3707585A1 (en) 1987-03-10 1987-10-15 Juergen G Kienhoefer Procedure for solving divide-and-conquer algorithms using a finite number of autonomous computers working in parallel in a network structure
US4972338A (en) 1985-06-13 1990-11-20 Intel Corporation Memory management for microprocessor system
US5289577A (en) 1992-06-04 1994-02-22 International Business Machines Incorporated Process-pipeline architecture for image/video processing
US5689679A (en) 1993-04-28 1997-11-18 Digital Equipment Corporation Memory system and method for selective multi-level caching using a cache level code
US5706466A (en) 1995-01-13 1998-01-06 Vlsi Technology, Inc. Von Neumann system with harvard processor and instruction buffer
US5724406A (en) 1994-03-22 1998-03-03 Ericsson Messaging Systems, Inc. Call processing system and method for providing a variety of messaging services
US5815727A (en) 1994-12-20 1998-09-29 Nec Corporation Parallel processor for executing plural thread program in parallel using virtual thread numbers
US5907867A (en) 1994-09-09 1999-05-25 Hitachi, Ltd. Translation lookaside buffer supporting multiple page sizes
US6115814A (en) 1997-11-14 2000-09-05 Compaq Computer Corporation Memory paging scheme for 8051 class microcontrollers
US6374286B1 (en) 1998-04-06 2002-04-16 Rockwell Collins, Inc. Real time processor capable of concurrently running multiple independent JAVA machines
DE10134981A1 (en) * 2001-07-16 2003-02-13 Frank Aatz Large parallel multi-processor system has a modular design with individual modules linked to a central array logic chip so that system capability is used more efficiently and effective system power is increased
US20030177273A1 (en) * 2002-03-14 2003-09-18 Hitachi, Ltd. Data communication method in shared memory multiprocessor system
US20040111546A1 (en) 2002-12-05 2004-06-10 International Business Machines Corporation Ring-topology based multiprocessor data access bus
US7454411B2 (en) 1999-09-28 2008-11-18 Universtiy Of Tennessee Research Foundation Parallel data processing architecture
US7653906B2 (en) 2002-10-23 2010-01-26 Intel Corporation Apparatus and method for reducing power consumption on simultaneous multi-threading systems
US20110231857A1 (en) 2010-03-19 2011-09-22 Vmware, Inc. Cache performance prediction and scheduling on commodity processors with shared caches


Also Published As

Publication number Publication date
EP3767481A1 (en) 2021-01-20

Similar Documents

Publication Publication Date Title
KR102600852B1 (en) Accelerate data flow signal processing applications on heterogeneous CPU/GPU systems
US7904905B2 (en) System and method for efficiently executing single program multiple data (SPMD) programs
EP0623875B1 (en) Multi-processor computer system having process-independent communication register addressing
Goodman et al. PIPE: a VLSI decoupled architecture
RU2427895C2 (en) Multiprocessor architecture optimised for flows
US6829697B1 (en) Multiple logical interfaces to a shared coprocessor resource
US20090327610A1 (en) Method and System for Conducting Intensive Multitask and Multiflow Calculation in Real-Time
KR20180015754A (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
US11875425B2 (en) Implementing heterogeneous wavefronts on a graphics processing unit (GPU)
Hu et al. A closer look at GPGPU
LaSalle et al. Mpi for big data: New tricks for an old dog
WO2021013727A1 (en) Processor
US20230367604A1 (en) Method of interleaved processing on a general-purpose computing core
CN117131910A (en) A convolution accelerator based on the expansion of the RISC-V instruction set architecture and a method for accelerating convolution operations
Végh How to Extend Single-Processor Approach to Explicitly Many-Processor Approach
Goossens et al. Deterministic OpenMP and the LBP parallelizing manycore processor
US20130061028A1 (en) Method and system for multi-mode instruction-level streaming
Abdolrashidi Improving Data-Dependent Parallelism in GPUs Through Programmer-Transparent Architectural Support
Garg Adding Preemptive Scheduling Capabilities to Accelerators
CN119902873A (en) A SIMT stack and thread scheduling optimization method, device, medium and GPGPU architecture for GPGPU branch instructions
Theobald Definition of the EARTH model
CN119512623A (en) Vector processor, operation method of vector processor and electronic device
Tarakji Design and investigation of scheduling mechanisms on accelerator-based heterogeneous computing systems
Kim et al. GPU Design, Programming, and Trends
Schuttenberg Linux Kernel Support For Micro-Heterogeneous Computing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 20761743
Country of ref document: EP
Kind code of ref document: A1
NENP Non-entry into the national phase
Ref country code: DE
122 Ep: pct application non-entry in european phase
Ref document number: 20761743
Country of ref document: EP
Kind code of ref document: A1